Synthetic Tabular Data Generation: A Benchmark of Six GAN-Based Methods on Financial Datasets

Nadia Petrova; Samuel Adewale; Björn Lindström

doi:10.63646/

Published 2024-06-30

Nadia Petrova*

Laboratory for Financial Data Science, University of Zurich, Zurich, Switzerland, 8001
nadia.petrova@df.uzh.ch

Samuel Adewale

Department of Statistics, University of Cape Town, Cape Town, South Africa, 7701

Björn Lindström

School of Economics, Stockholm University, Stockholm, Sweden, 10691

Abstract

Synthetic data generation for tabular financial datasets poses challenges distinct from those of image or text synthesis: heterogeneous column types (continuous, categorical, temporal, binary), highly non-Gaussian marginal distributions characteristic of financial variables, complex conditional dependencies including temporal autocorrelation, and stringent privacy requirements driven by financial regulation. This paper presents a systematic benchmark of six GAN-based synthetic tabular data generation methods (CTGAN, TVAE, CopulaGAN, TableGAN, CTAB-GAN+, and REaLTabFormer) across three financial datasets: a retail credit application dataset, an institutional trade order dataset, and a customer transaction dataset. We evaluate fidelity, utility, and privacy using twelve metrics, including Wasserstein distance, train-on-synthetic-test-on-real (TSTR) accuracy, statistical feature similarity, and membership inference attack success rate. CTAB-GAN+ achieves the best overall balance of fidelity and utility, but no single method dominates across all metrics. We identify a systematic tradeoff between privacy protection and distributional fidelity that is more severe in financial data than that reported in benchmarks on general tabular datasets, and we discuss the implications for regulated financial data sharing.
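
Two of the metrics named above can be illustrated with a minimal, standard-library-only Python sketch. It is not the authors' evaluation code. The function names (`wasserstein_1d`, `tstr_accuracy`, `make_table`) are hypothetical, the nearest-centroid classifier is only a stand-in for whatever downstream model a benchmark would use, and the "real" and "synthetic" tables are randomly generated toy data, not the paper's datasets.

```python
import random
import statistics


def wasserstein_1d(real, synth):
    """Wasserstein-1 distance between two equal-sized 1-D samples.

    For equal-sized empirical samples this is the mean absolute
    difference between the sorted values.
    """
    if len(real) != len(synth):
        raise ValueError("this sketch assumes equal sample sizes")
    return statistics.fmean(abs(r - s) for r, s in zip(sorted(real), sorted(synth)))


def fit_centroids(rows, labels):
    """Per-class mean feature vector (stand-in for any classifier)."""
    sums, counts = {}, {}
    for x, y in zip(rows, labels):
        acc = sums.setdefault(y, [0.0] * len(x))
        for j, v in enumerate(x):
            acc[j] += v
        counts[y] = counts.get(y, 0) + 1
    return {y: [v / counts[y] for v in acc] for y, acc in sums.items()}


def predict(centroids, x):
    """Return the label whose centroid is closest to x."""
    return min(
        centroids,
        key=lambda y: sum((a - b) ** 2 for a, b in zip(x, centroids[y])),
    )


def tstr_accuracy(synth_rows, synth_labels, real_rows, real_labels):
    """Train on synthetic rows, then measure accuracy on real rows."""
    centroids = fit_centroids(synth_rows, synth_labels)
    hits = sum(predict(centroids, x) == y for x, y in zip(real_rows, real_labels))
    return hits / len(real_rows)


def make_table(rng, n, shift):
    """Toy two-column table with a binary label (illustrative only)."""
    rows, labels = [], []
    for _ in range(n):
        y = rng.randint(0, 1)
        rows.append([rng.gauss(3.0 * y + shift, 1.0), rng.gauss(0.0, 1.0)])
        labels.append(y)
    return rows, labels


rng = random.Random(0)
real_rows, real_labels = make_table(rng, 500, shift=0.0)
# The "synthetic" table is drawn from a slightly shifted distribution.
synth_rows, synth_labels = make_table(rng, 500, shift=0.2)

wd = wasserstein_1d([r[0] for r in real_rows], [r[0] for r in synth_rows])
acc = tstr_accuracy(synth_rows, synth_labels, real_rows, real_labels)
print(f"Wasserstein distance (column 0): {wd:.3f}")
print(f"TSTR accuracy: {acc:.3f}")
```

A lower Wasserstein distance indicates closer marginal fidelity for that column, while a TSTR accuracy close to the train-on-real baseline indicates that the synthetic data preserves the signal a downstream model needs.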

Keywords: synthetic data; tabular data generation; GAN; financial data; privacy; data augmentation; CTGAN; benchmark

This work is licensed under a Creative Commons Attribution 4.0 International License.

How to Cite

Petrova, N., Adewale, S., & Lindström, B. (2024). Synthetic Tabular Data Generation: A Benchmark of Six GAN-Based Methods on Financial Datasets. DATAMIND, 2(2), 1-4. https://doi.org/10.63646/
