An empirical evaluation of easily implemented, nonparametric methods for generating synthetic datasets

Abstract

"When intense redaction is needed to protect data subjects' confidentiality, statistical agencies can release synthetic data, in which identifying or sensitive values are replaced with draws from statistical models estimated from the confidential data. Specifying accurate synthesis models can be a difficult and labor intensive task with standard parametric approaches. We describe and empirically evaluate four easy-to-implement, nonparametric synthesizers based on machine learning algorithms - classification and regression trees, bagging, random forests, and support vector machines - on their potential to preserve analytical validity and reduce disclosure risks. The results suggest that synthesizers based on regression trees can provide high utility with low disclosure risks." (Text excerpt, IAB-Doku)
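
The abstract does not spell out how a tree-based synthesizer works, so the following is only a minimal sketch under stated assumptions, not the authors' implementation. It assumes one common reading of the idea: grow a regression tree for the sensitive variable on the non-sensitive predictors, then replace each record's sensitive value with a random draw from the confidential values in the same leaf. The function names (`grow_tree`, `leaf_values`, `synthesize`), the `min_leaf` parameter, and the toy age/income data are all hypothetical.

```python
import random
from statistics import fmean


def sse(values):
    """Sum of squared deviations from the mean (0.0 for an empty list)."""
    if not values:
        return 0.0
    mean = fmean(values)
    return sum((v - mean) ** 2 for v in values)


def grow_tree(rows, targets, min_leaf=5):
    """Grow a small regression tree (assumed design, not from the paper).

    Each leaf stores the confidential target values that fall into it.
    """
    best = None
    for j in range(len(rows[0])):
        for threshold in sorted({r[j] for r in rows}):
            left = [t for r, t in zip(rows, targets) if r[j] <= threshold]
            right = [t for r, t in zip(rows, targets) if r[j] > threshold]
            if len(left) < min_leaf or len(right) < min_leaf:
                continue
            score = sse(left) + sse(right)
            if best is None or score < best[0]:
                best = (score, j, threshold)
    # Stop when no admissible split reduces the within-node variation.
    if best is None or best[0] >= sse(targets):
        return {"leaf": list(targets)}
    _, j, threshold = best
    left_idx = [i for i, r in enumerate(rows) if r[j] <= threshold]
    right_idx = [i for i, r in enumerate(rows) if r[j] > threshold]
    return {
        "feature": j,
        "threshold": threshold,
        "left": grow_tree([rows[i] for i in left_idx],
                          [targets[i] for i in left_idx], min_leaf),
        "right": grow_tree([rows[i] for i in right_idx],
                           [targets[i] for i in right_idx], min_leaf),
    }


def leaf_values(tree, row):
    """Follow the splits for one record and return its leaf's stored values."""
    while "leaf" not in tree:
        go_left = row[tree["feature"]] <= tree["threshold"]
        tree = tree["left"] if go_left else tree["right"]
    return tree["leaf"]


def synthesize(rows, targets, min_leaf=5, seed=0):
    """Replace each target with a random draw from its leaf's values."""
    rng = random.Random(seed)
    tree = grow_tree(rows, targets, min_leaf)
    return [rng.choice(leaf_values(tree, r)) for r in rows]


# Toy usage with made-up data: age predicts a noisy income.
demo_rng = random.Random(1)
ages = [[demo_rng.randint(20, 65)] for _ in range(200)]
incomes = [1000 + 40 * a[0] + demo_rng.gauss(0, 100) for a in ages]
synthetic_incomes = synthesize(ages, incomes)
```

In this sketch every synthetic value is an observed value from another record in the same leaf, so larger `min_leaf` values widen the pool of possible donors at the cost of a coarser model. How the paper's synthesizers handle this trade-off is not stated in the abstract.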

Cite article

Drechsler, J. & Reiter, J. (2011): An empirical evaluation of easily implemented, nonparametric methods for generating synthetic datasets. In: European Commission (ed.) (2011): Proceedings of the Eurostat Conference on New Techniques and Technologies for Statistics (NTTS) 2011, Brussels, pp. 1-12.

Further information

Later published (possibly in a different version) in: Computational Statistics and Data Analysis, Vol. 55, No. 12 (2011), pp. 3232-3243