Sampling with synthesis
Abstract
"Many statistical agencies disseminate samples of census microdata, that is, data on individual records, to the public. Before releasing the microdata, agencies typically alter identifying or sensitive values to protect data subjects' confidentiality, for example by coarsening, perturbing, or swapping data. These standard disclosure limitation techniques distort relationships and distributional features in the original data, especially when applied with high intensity. Furthermore, it can be difficult for analysts of the masked public use data to adjust inferences for the effects of the disclosure limitation. Motivated by these shortcomings, we propose an approach to census microdata dissemination called sampling with synthesis. The basic idea is to replace the identifying or sensitive values in the census with multiple imputations, and release samples from these multiply-imputed populations. We demonstrate that sampling with synthesis can improve the quality of public use data relative to sampling followed by standard statistical disclosure limitation; simulation results showing this are available online as supplemental material. We derive methods for analyzing the multiple datasets generated by sampling with synthesis. We present algorithms for selecting which census values to synthesize based on considerations of disclosure risk and data utility. We illustrate sampling with synthesis on a population constructed with data from the U.S. Current Population Survey." (Author's abstract, IAB-Doku)
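The basic idea stated in the abstract (replace sensitive census values with multiple imputations, then release samples from the resulting populations) can be illustrated with a minimal sketch. This is not the authors' procedure: the function name `sampling_with_synthesis`, the toy variables `age` and `income`, the linear-regression synthesizer with fixed parameters, and the choice of an independent simple random sample from each synthetic population are all assumptions made for illustration. A proper multiple-imputation synthesizer would also draw the model parameters from their posterior distribution.

```python
import random
import statistics


def sampling_with_synthesis(population, m=5, n=100, seed=0):
    """Return m samples of size n, one from each of m synthetic populations.

    Hypothetical setup: `income` is treated as the sensitive variable and is
    replaced by draws from a simple linear model on `age`; `age` is released
    unchanged.
    """
    rng = random.Random(seed)
    ages = [r["age"] for r in population]
    incomes = [r["income"] for r in population]

    # Fit income ~ age by ordinary least squares on the full census.
    mean_a = statistics.fmean(ages)
    mean_y = statistics.fmean(incomes)
    sxx = sum((a - mean_a) ** 2 for a in ages)
    sxy = sum((a - mean_a) * (y - mean_y) for a, y in zip(ages, incomes))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_a
    resid_sd = statistics.pstdev(
        [y - (intercept + slope * a) for a, y in zip(ages, incomes)]
    )

    released = []
    for _ in range(m):
        # Step 1: build one synthetic population by imputing every income.
        synthetic_pop = [
            {"age": a, "income": intercept + slope * a + rng.gauss(0, resid_sd)}
            for a in ages
        ]
        # Step 2: release only a simple random sample of that population.
        released.append(rng.sample(synthetic_pop, n))
    return released


# Toy census of 1,000 records (made-up values, for illustration only).
toy_rng = random.Random(42)
census = [
    {"age": a, "income": 500.0 * a + toy_rng.gauss(0, 5000)}
    for a in (toy_rng.randint(18, 80) for _ in range(1000))
]
datasets = sampling_with_synthesis(census, m=5, n=100, seed=1)
print(len(datasets), [len(d) for d in datasets])
```

An analyst would receive the `m` released datasets and combine the estimates across them; the combining rules derived in the article are specific to this design and are not reproduced here.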
Cite article
Drechsler, J. & Reiter, J. (2010): Sampling with synthesis. A new approach for releasing public use census microdata. In: Journal of the American Statistical Association, Vol. 105, No. 492, pp. 1347-1357. DOI: 10.1198/jasa.2010.ap09480