DataLineageX: A Provenance Graph Database for End-to-End Data Science Workflows

Mingzhu Qian; Dengfeng Xu; Yunxiao Shao

doi:10.63646/datamind.2023.010302

Published 2023-09-30

Mingzhu Qian

School of Computer Science and Software Engineering, Tianjin University of Technology, Tianjin 300384, China

Dengfeng Xu

Department of Information Engineering, Hebei University of Engineering, Handan 056038, China

Yunxiao Shao*

School of Data Science and Artificial Intelligence, Wenzhou University, Wenzhou 325035, China
yunxiao.shao@wzu.edu.cn

Abstract

Data science workflows span heterogeneous artefacts—datasets, preprocessing code, trained models, hyperparameter configurations, evaluation results, and deployment environments—whose interdependencies are rarely captured in a machine-readable and queryable form. The absence of systematic data lineage infrastructure leads to unreproducible experiments, undetected model staleness, opaque audit trails, and costly debugging across distributed teams. This paper introduces DataLineageX, a provenance graph database designed to capture, store, and query the complete end-to-end lineage of data science workflows. DataLineageX models provenance as a directed acyclic graph (DAG) over eight typed node classes—Dataset, Code, Execution, Model, Parameter, Result, Experiment, and Audit—and twelve typed edge predicates representing causal and structural dependencies. An API instrumentation layer automatically harvests lineage events from Jupyter notebooks, MLflow tracking servers, Apache Airflow pipelines, and Git commit hooks without requiring manual annotation. The provenance graph is persisted in a property graph store (Neo4j-compatible) with traversal-optimized composite indexes. Experiments on 265 heterogeneous data science workflows demonstrate lineage completeness of 93.2%, experiment replay success of 89.8%, and median path-query latency of 14 ms at graph sizes of 10,000 nodes. DataLineageX is released as open-source software with a REST and GraphQL API, a Python SDK, and a browser-based visualization interface, providing researchers and practitioners with a reusable infrastructure for reproducible AI and automated data governance.

Keywords: data lineage; provenance graph; reproducibility; data science workflows; graph database; MLOps; audit trail; knowledge graph
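
As a rough illustration of the data model described in the abstract, the following Python sketch builds a small typed provenance DAG and answers an upstream-lineage query. The eight node classes are taken from the abstract. Everything else is an assumption made for illustration: the abstract does not name the twelve edge predicates, so `used`, `executed`, and `generated_by` are hypothetical, as are the `ProvenanceGraph` class and its methods, which are not DataLineageX's SDK. An in-memory dictionary stands in for the property graph store.

```python
from collections import defaultdict
from dataclasses import dataclass, field
from enum import Enum


class NodeType(Enum):
    """The eight node classes listed in the abstract."""
    DATASET = "Dataset"
    CODE = "Code"
    EXECUTION = "Execution"
    MODEL = "Model"
    PARAMETER = "Parameter"
    RESULT = "Result"
    EXPERIMENT = "Experiment"
    AUDIT = "Audit"


@dataclass
class ProvenanceGraph:
    """Hypothetical in-memory stand-in for a provenance graph store."""
    nodes: dict = field(default_factory=dict)  # node_id -> NodeType
    # child_id -> list of (predicate, parent_id) dependency edges
    parents: dict = field(default_factory=lambda: defaultdict(list))

    def add_node(self, node_id: str, node_type: NodeType) -> None:
        self.nodes[node_id] = node_type

    def add_edge(self, src: str, predicate: str, dst: str) -> None:
        """Record that `src` depends on `dst`; reject edges that break acyclicity."""
        if src not in self.nodes or dst not in self.nodes:
            raise KeyError("both endpoints must be registered nodes")
        # A cycle would form if dst already depends (transitively) on src.
        if src == dst or src in self.upstream(dst):
            raise ValueError(f"edge {src} -> {dst} would create a cycle")
        self.parents[src].append((predicate, dst))

    def upstream(self, node_id: str) -> set:
        """Return every node that `node_id` transitively depends on."""
        seen = set()
        stack = [node_id]
        while stack:
            current = stack.pop()
            for _, parent in self.parents.get(current, []):
                if parent not in seen:
                    seen.add(parent)
                    stack.append(parent)
        return seen


# Example: a model produced by one run that read a dataset and a script.
g = ProvenanceGraph()
g.add_node("raw.csv", NodeType.DATASET)
g.add_node("clean.py", NodeType.CODE)
g.add_node("run-1", NodeType.EXECUTION)
g.add_node("model-v1", NodeType.MODEL)
g.add_edge("run-1", "used", "raw.csv")           # hypothetical predicate
g.add_edge("run-1", "executed", "clean.py")      # hypothetical predicate
g.add_edge("model-v1", "generated_by", "run-1")  # hypothetical predicate

lineage = g.upstream("model-v1")
```

Here `lineage` contains the run, the dataset, and the script that `model-v1` depends on. Checking for cycles at insertion time keeps the graph a DAG, so upstream traversals always terminate.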

This work is licensed under a Creative Commons Attribution 4.0 International License.

How to Cite

Qian, M., Xu, D., & Shao, Y. (2023). DataLineageX: A Provenance Graph Database for End-to-End Data Science Workflows. DATAMIND, 1(3), 5-18. https://doi.org/10.63646/datamind.2023.010302
