Synthetic data in statistics and computer science - a systematic evaluation and methodological improvements

Project duration: 15.11.2022 to 14.11.2025

Abstract

Providing access to sensitive data has become more difficult in recent years as privacy protection regulations have tightened. Synthetic data approaches have become more relevant in this context. The underlying idea is to replace the original data with synthetic values drawn from a model fitted to the original data. Different strategies for generating synthetic data have been developed independently in statistics and computer science. This project aims to compare these approaches systematically. Because the methodological concepts differ fundamentally between the two disciplines, we expect the comparison to yield insights that can help improve the approaches in both fields. The project also aims to develop new methodology for existing approaches, for example adjustments that address some of the known weaknesses of the deep-learning-based approaches from computer science.
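
The underlying idea of fitting a model to the original data and drawing synthetic values from it can be illustrated with a minimal sketch. The multivariate normal model, the function name synthesize_gaussian, and the simulated "original" data are assumptions chosen purely for illustration; they are not the models studied in the project, which may be considerably more flexible.

```python
import numpy as np


def synthesize_gaussian(original, n_synthetic, seed=0):
    """Fit a multivariate normal to `original` and draw synthetic rows from it.

    Hypothetical helper: a Gaussian is only a stand-in for whatever
    synthesis model would actually be used.
    """
    rng = np.random.default_rng(seed)
    mean = original.mean(axis=0)          # fitted mean vector
    cov = np.cov(original, rowvar=False)  # fitted covariance matrix
    return rng.multivariate_normal(mean, cov, size=n_synthetic)


# Simulated stand-in for a sensitive dataset with two correlated variables.
data_rng = np.random.default_rng(42)
original = data_rng.multivariate_normal(
    [50.0, 30.0], [[100.0, 40.0], [40.0, 64.0]], size=1000
)

# The synthetic rows are new draws from the fitted model, not copies of records.
synthetic = synthesize_gaussian(original, n_synthetic=1000)

print("original mean: ", original.mean(axis=0))
print("synthetic mean:", synthetic.mean(axis=0))
```

In this sketch the synthetic data reproduce only what the fitted model captures, here the means and covariances. Any structure the model misses is lost in the synthetic data, which is one reason the choice of synthesis model matters.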