Improved variance estimation for fully synthetic datasets
Abstract
"Fully synthetic datasets, i.e. datasets that only contain simulated values, arguably provide a very high level of data protection. Since all values are simulated, reidentification is almost impossible. This makes the approach especially attractive for the release of very sensitive data such as medical records. However, the established variance estimate for fully synthetic datasets has two major drawbacks. First, it can be positively biased, where the bias is a function of the sampling rate of the original data. Second, it can become negative. In this paper I illustrate the negative effects of these drawbacks on the estimation of the variance and propose an alternative variance estimate that shows less variability, is always unbiased, and can never be negative. This variance estimate is closely related to the variance estimate for partially synthetic datasets." (Author's abstract, IAB-Doku)
Cite article
Drechsler, J. (2011): Improved variance estimation for fully synthetic datasets. (Joint UNECE/Eurostat work session on statistical data confidentiality 2011. Working paper 18), New York, 13 p.