Generating Synthetic Data is Complicated: Know Your Data and Know Your Generator
Abstract
"In recent years, more and more synthetic data generators (SDGs) based on various modeling strategies have been implemented as Python libraries or R packages. With this proliferation of ready-made SDGs comes a widely held perception that generating synthetic data is easy. We show that generating synthetic data is a complicated process that requires one to understand both the original dataset as well as the synthetic data generator. We make two contributions to the literature in this topic area. First, we show that it is just as important to pre-process or clean the data as it is to tune the SDG in order to create synthetic data with high levels of utility. Second, we illustrate that it is critical to understand the methodological details of the SDG to be aware of potential pitfalls and to understand for which types of analysis tasks one can expect high levels of analytical validity." (Author's abstract, IAB-Doku, © Springer)
Cite article
Latner, J., Neunhoeffer, M. & Drechsler, J. (2024): Generating Synthetic Data is Complicated: Know Your Data and Know Your Generator. In: J. Domingo-Ferrer & M. Önen (Eds.) (2024): Privacy in Statistical Databases 2024, pp. 115-128, accepted on June 21, 2024. DOI:10.1007/978-3-031-69651-0_8