Synthetic Tabular Data Generation: A Benchmark of Six GAN-Based Methods on Financial Datasets
Abstract
Synthetic data generation for tabular financial datasets presents a distinctive set of challenges relative to image or text synthesis: heterogeneous column types (continuous, categorical, temporal, binary), highly non-Gaussian marginal distributions characteristic of financial variables, complex conditional dependencies including temporal autocorrelations, and stringent privacy requirements driven by financial regulation. This paper presents a systematic benchmark of six synthetic tabular data generation methods drawn from the GAN-centred literature: the GAN-based CTGAN, CopulaGAN, TableGAN, and CTAB-GAN+, together with the VAE-based TVAE and the transformer-based REaLTabFormer. The benchmark covers three financial datasets: a retail credit application dataset, an institutional trade order dataset, and a customer transaction dataset. We evaluate fidelity, utility, and privacy across twelve metrics, including Wasserstein distance, train-on-synthetic-test-on-real (TSTR) accuracy, statistical feature similarity, and membership inference attack success rate. CTAB-GAN+ achieves the best overall balance of fidelity and utility, but no single method dominates across all metrics. We identify a systematic tradeoff between privacy protection and distributional fidelity that is more severe in financial data than that reported in benchmarks on general tabular datasets, and we discuss the implications for regulated financial data sharing.
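To make two of the metrics named above concrete, the following is a minimal, illustrative sketch of a per-column Wasserstein distance (a fidelity measure) and a TSTR accuracy (a utility measure). It is not the evaluation code used in this benchmark. The helper names (`wasserstein_1d`, `predict_1nn`, `tstr_accuracy`, `toy_rows`), the 1-nearest-neighbour classifier, and the toy Gaussian data are all assumptions chosen so the example runs with the Python standard library alone.

```python
import random


def wasserstein_1d(real, synth):
    """Empirical 1-Wasserstein distance between two equal-size 1-D samples.

    For equal sample sizes this is the mean absolute difference
    between the sorted values of the two samples.
    """
    if len(real) != len(synth):
        raise ValueError("this sketch assumes equal sample sizes")
    a, b = sorted(real), sorted(synth)
    return sum(abs(x - y) for x, y in zip(a, b)) / len(a)


def predict_1nn(train_X, train_y, x):
    """Label of the training row closest to x (squared Euclidean distance)."""
    best = min(
        range(len(train_X)),
        key=lambda i: sum((u - v) ** 2 for u, v in zip(train_X[i], x)),
    )
    return train_y[best]


def tstr_accuracy(synth_X, synth_y, real_X, real_y):
    """Train on synthetic rows, test on real rows (TSTR)."""
    hits = sum(
        predict_1nn(synth_X, synth_y, x) == y for x, y in zip(real_X, real_y)
    )
    return hits / len(real_X)


def toy_rows(rng, n, shift):
    """Hypothetical toy data: one feature, label 1 when the feature is positive."""
    X = [[rng.gauss(0.0, 1.0) + shift] for _ in range(n)]
    y = [int(row[0] > 0) for row in X]
    return X, y


rng = random.Random(0)
real_X, real_y = toy_rows(rng, 200, shift=0.0)
synth_X, synth_y = toy_rows(rng, 200, shift=0.1)  # stand-in for a slightly biased generator

w1 = wasserstein_1d([r[0] for r in real_X], [s[0] for s in synth_X])
acc = tstr_accuracy(synth_X, synth_y, real_X, real_y)
print(f"W1 = {w1:.3f}, TSTR accuracy = {acc:.3f}")
```

A lower Wasserstein distance indicates that a synthetic column's marginal is closer to the real one, while a TSTR accuracy close to the accuracy obtained by training on real data indicates that the synthetic data preserves the predictive signal. In practice a library routine such as `scipy.stats.wasserstein_distance` and a stronger downstream classifier would typically replace these hand-written helpers.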
