MetaSchema: A Schema Discovery and Harmonization Toolkit for Heterogeneous Research Databases

Yuxiang Liang; Tianhao Qin; Beibei Hu; Zhenyu Hou

doi:10.63646/datamind.2023.010203

Published 2023-06-30

Yuxiang Liang

School of Computer Science and Technology, Hangzhou Dianzi University, Hangzhou 310018, China

Tianhao Qin

School of Software, Shandong Normal University, Jinan 250358, China

Beibei Hu

School of Information Management, Heilongjiang University, Harbin 150080, China

Zhenyu Hou*

School of Mathematics and Computer Science, Yan'an University, Yan'an 716000, China
zhenyu.hou@yau.edu.cn

Abstract

Research databases are growing in both number and heterogeneity. Even within a single discipline, analysts routinely confront relational stores, document collections, graph databases, vector indexes, and lakehouse-style tables, each carrying its own conventions for field naming, type encoding, key declaration, and provenance recording. The result is a recurring bottleneck: a disproportionate share of any data analysis project is spent re-discovering schema structure that, in principle, has already been recorded by someone else. This article presents MetaSchema, a schema discovery and harmonization toolkit aimed at this bottleneck. MetaSchema is organised as a four-stage pipeline — automatic schema profiling, large-language-model-assisted field annotation, cross-source entity and field matching, and reviewer-in-the-loop version control — that transforms a collection of heterogeneous databases into a unified, queryable schema graph with a field dictionary, cross-source mapping tables, and a reproducible query interface. We describe the design decisions that make the toolkit practically deployable, including its hybrid matching layer, its structured human-review protocol, and its semantic-version log. An empirical evaluation on a benchmark of twelve heterogeneous databases, totalling 2,418 tables and 27,640 fields, shows that MetaSchema achieves a field-type recovery accuracy of 86.4%, a cross-source field matching F1-score of 0.821, and a 67% reduction in median reviewer time per 100 fields compared with a careful manual baseline. The toolkit scales close to linearly up to 5,000 tables and integrates with relational, graph, vector, and lakehouse storage layers. MetaSchema is released as open-source software together with the benchmark, the evaluation scripts, and a reproducible query API designed to support automated analysis, model evaluation, and downstream decision tools.
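For readers unfamiliar with the matching metric reported above, the following is a minimal sketch of how a cross-source field matching F1-score can be computed. It assumes that a matching is represented as a set of (source field, target field) pairs and is scored against a gold-standard set; the abstract does not state the exact scoring protocol, so this representation, the function name matching_f1, and the example field names are all hypothetical.

```python
from typing import Set, Tuple

# A correspondence between one field in source A and one field in source B.
# This pair-based representation is an assumption, not taken from the article.
FieldPair = Tuple[str, str]


def matching_f1(predicted: Set[FieldPair], gold: Set[FieldPair]) -> float:
    """Return the F1-score of predicted field correspondences against gold ones."""
    if not predicted or not gold:
        return 0.0
    true_positives = len(predicted & gold)
    precision = true_positives / len(predicted)
    recall = true_positives / len(gold)
    if precision + recall == 0:
        return 0.0
    # Harmonic mean of precision and recall.
    return 2 * precision * recall / (precision + recall)


# Hypothetical example: three gold correspondences, two recovered correctly.
gold = {
    ("patients.dob", "subject.birth_date"),
    ("patients.id", "subject.subject_id"),
    ("visits.date", "encounter.start"),
}
predicted = {
    ("patients.dob", "subject.birth_date"),
    ("patients.id", "subject.subject_id"),
    ("visits.date", "encounter.end"),
}
score = matching_f1(predicted, gold)
```

In this toy case precision and recall are both 2/3, so the score is also 2/3. The reported value of 0.821 would be obtained by applying the same kind of calculation to the benchmark's own gold mappings.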

Keywords: schema discovery; data harmonization; large language models; database mapping; reproducible research; data engineering

This work is licensed under a Creative Commons Attribution 4.0 International License.

How to Cite

Liang, Y., Qin, T., Hu, B., & Hou, Z. (2023). MetaSchema: A Schema Discovery and Harmonization Toolkit for Heterogeneous Research Databases. DATAMIND, 1(2), 16-32. https://doi.org/10.63646/datamind.2023.010203
