MetaSchema: A Schema Discovery and Harmonization Toolkit for Heterogeneous Research Databases
Main article
Abstract
Research databases are growing in both number and heterogeneity. Even within a single discipline, analysts routinely confront relational stores, document collections, graph databases, vector indexes, and lakehouse-style tables, each carrying its own conventions for field naming, type encoding, key declaration, and provenance recording. The result is a recurring bottleneck: a disproportionate share of the effort in a data analysis project goes to rediscovering schema structure that, in principle, someone else has already recorded. This article presents MetaSchema, a schema discovery and harmonization toolkit aimed at this bottleneck. MetaSchema is organised as a four-stage pipeline (automatic schema profiling, large-language-model-assisted field annotation, cross-source entity and field matching, and reviewer-in-the-loop version control) that transforms a collection of heterogeneous databases into a unified, queryable schema graph with a field dictionary, cross-source mapping tables, and a reproducible query interface. We describe the design decisions that make the toolkit practically deployable, including its hybrid matching layer, its structured human-review protocol, and its semantic-version log. An empirical evaluation on a benchmark of twelve heterogeneous databases, totalling 2,418 tables and 27,640 fields, shows that MetaSchema achieves a field-type recovery accuracy of 86.4%, a cross-source field matching F1 score of 0.821, and a 67% reduction in median reviewer time per 100 fields compared with a careful manual baseline. The toolkit scales close to linearly up to 5,000 tables and integrates with relational, graph, vector, and lakehouse storage layers. MetaSchema is released as open-source software together with the benchmark, the evaluation scripts, and a reproducible query API designed to support automated analysis, model evaluation, and downstream decision tools.
