An empirical evaluation of easily implemented, nonparametric methods for generating synthetic datasets
Abstract
"When intense redaction is needed to protect the confidentiality of data subjects' identities and sensitive attributes, statistical agencies can use synthetic data approaches. To create synthetic data, the agency replaces identifying or sensitive values with draws from statistical models estimated from the confidential data. Many agencies are reluctant to implement this idea because (i) the quality of the generated data depends strongly on the quality of the underlying models, and (ii) developing effective synthesis models can be a labor-intensive and difficult task. Recently, there have been suggestions that agencies use nonparametric methods from the machine learning literature to generate synthetic data. These methods can estimate non-linear relationships that might otherwise be missed and can be run with minimal tuning, thus considerably reducing burdens on the agency. Four synthesizers based on machine learning algorithms - classification and regression trees, bagging, random forests, and support vector machines - are evaluated in terms of their potential to preserve analytical validity while reducing disclosure risks. The evaluation is based on a repeated sampling simulation with a subset of the 2002 Uganda census public use sample data. The simulation suggests that synthesizers based on regression trees can result in synthetic datasets that provide reliable estimates and low disclosure risks, and that these synthesizers can be implemented easily by statistical agencies." (Author's abstract, IAB-Doku)
Cite article
Drechsler, J. & Reiter, J. (2011): An empirical evaluation of easily implemented, nonparametric methods for generating synthetic datasets. In: Computational Statistics and Data Analysis, Vol. 55, No. 12, p. 3232-3243. DOI:10.1016/j.csda.2011.06.006
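Illustration
The synthesis idea summarized in the abstract (replace sensitive values with draws from a model estimated on the confidential data) can be illustrated with a regression-tree synthesizer. The following is a minimal sketch under stated assumptions, not the authors' implementation: the function names (grow_tree, find_leaf, synthesize), the min_leaf parameter, and the toy data are hypothetical, predictors are assumed numeric, and each synthetic value is simply drawn at random from the confidential values that fall in the same leaf as the record.

```python
import random
from statistics import fmean


def sse(values):
    """Sum of squared deviations from the mean (0.0 for an empty list)."""
    if not values:
        return 0.0
    m = fmean(values)
    return sum((v - m) ** 2 for v in values)


def grow_tree(rows, y, min_leaf=5):
    """Grow a small regression tree on numeric predictors.

    rows: list of predictor tuples; y: list of confidential values.
    Leaves keep the observed y values as "donors" for later sampling.
    (Hypothetical helper; assumes min_leaf >= 1.)
    """
    best = None
    for j in range(len(rows[0])):
        for cut in sorted({r[j] for r in rows}):
            left = [i for i, r in enumerate(rows) if r[j] <= cut]
            right = [i for i, r in enumerate(rows) if r[j] > cut]
            if len(left) < min_leaf or len(right) < min_leaf:
                continue
            cost = sse([y[i] for i in left]) + sse([y[i] for i in right])
            if best is None or cost < best[0]:
                best = (cost, j, cut, left, right)
    # Stop when no admissible split exists or no split reduces the error.
    if best is None or best[0] >= sse(y):
        return {"donors": list(y)}
    _, j, cut, left, right = best
    return {
        "var": j,
        "cut": cut,
        "left": grow_tree([rows[i] for i in left], [y[i] for i in left], min_leaf),
        "right": grow_tree([rows[i] for i in right], [y[i] for i in right], min_leaf),
    }


def find_leaf(tree, row):
    """Follow the splits until the leaf containing this record is reached."""
    while "donors" not in tree:
        tree = tree["left"] if row[tree["var"]] <= tree["cut"] else tree["right"]
    return tree


def synthesize(rows, y, min_leaf=5, seed=None):
    """Replace every confidential value with a random draw from its leaf."""
    rng = random.Random(seed)
    tree = grow_tree(rows, y, min_leaf)
    return [rng.choice(find_leaf(tree, r)["donors"]) for r in rows]


if __name__ == "__main__":
    # Toy data (made up): one predictor and one sensitive numeric attribute.
    predictors = [(age,) for age in range(20, 60, 2)]
    incomes = [1000 + 50 * age + random.Random(age).randint(-200, 200)
               for (age,) in predictors]
    print(synthesize(predictors, incomes, min_leaf=4, seed=0))
```

Sampling donors within a leaf keeps the synthetic values consistent with the relationships the tree has found, while min_leaf controls how many real records share each pool of donors. Smaller leaves follow the data more closely but leave each released value closer to a specific record, which is the trade-off between analytical validity and disclosure risk that the abstract refers to.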