Comparing fully and partially synthetic data sets for statistical disclosure control in the German IAB Establishment Panel

Description

"In this paper we discuss the advantages and disadvantages of two approaches that provide disclosure control by generating synthetic data sets: The first, proposed by Rubin (1993), generates fully synthetic data sets while the second suggested by Little (1993) imputes values only for selected variables that bear a high risk of disclosure. Changing only some variables in general will lead to higher analytical validity. However, the disclosure risk will also increase for partially synthetic data sets since true values remain in the data. Thus, agencies willing to release synthetic data sets will have to decide, which of the two methods balances best the trade-off between data utility and disclosure risk for their data. We offer some guidelines to help making this decision. We apply the two methods to a set of variables from the 1997 wave of the German IAB Establishment Panel and evaluate their quality by comparing regression results from the original data with results we achieve for the same analyses run on the data set after the imputation procedures. The results are as expected: In both cases the analytical validity of the synthetic data is high with partially synthetic data sets outperforming fully synthetic data sets in terms of data utility. But this advantage comes at the price of a higher disclosure risk for the partially synthetic data." (Author's abstract, IAB-Doku)

Suggested citation

Drechsler, Jörg, Stefan Bender & Susanne Rässler (2007): Comparing fully and partially synthetic data sets for statistical disclosure control in the German IAB Establishment Panel. (United Nations, Economic Commission for Europe. Working paper 11), New York, 8 pp.

Availability

Free access

Further information

A later (possibly differing) version was published in: Transactions on Data Privacy, Vol. 1, No. 3 (2008), pp. 105-130