DataLineageX: A Provenance Graph Database for End-to-End Data Science Workflows
Main article
Abstract
Data science workflows span heterogeneous artefacts—datasets, preprocessing code, trained models, hyperparameter configurations, evaluation results, and deployment environments—whose interdependencies are rarely captured in a machine-readable and queryable form. The absence of systematic data lineage infrastructure leads to unreproducible experiments, undetected model staleness, opaque audit trails, and costly debugging across distributed teams. This paper introduces DataLineageX, a provenance graph database designed to capture, store, and query the complete end-to-end lineage of data science workflows. DataLineageX models provenance as a directed acyclic graph (DAG) over eight typed node classes—Dataset, Code, Execution, Model, Parameter, Result, Experiment, and Audit—and twelve typed edge predicates representing causal and structural dependencies. An API instrumentation layer automatically harvests lineage events from Jupyter notebooks, MLflow tracking servers, Apache Airflow pipelines, and Git commit hooks without requiring manual annotation. The provenance graph is persisted in a property graph store (Neo4j-compatible) with traversal-optimized composite indexes. Experiments on 265 heterogeneous data science workflows demonstrate lineage completeness of 93.2%, experiment replay success of 89.8%, and median path-query latency of 14 ms at graph sizes of 10,000 nodes. DataLineageX is released as open-source software with a REST and GraphQL API, a Python SDK, and a browser-based visualization interface, providing researchers and practitioners with a reusable infrastructure for reproducible AI and automated data governance.
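
To make the graph model concrete, the following is a minimal in-memory sketch of a typed provenance DAG with an upstream lineage query. It is an illustration only, not the DataLineageX SDK or its Neo4j-backed store. The eight node classes are taken from the abstract. The `ProvenanceGraph` class, its method names, the artefact identifiers, and the three edge predicates (`USED`, `RAN`, `GENERATED_BY`) are hypothetical, since the abstract does not name the twelve predicates.

```python
from collections import defaultdict, deque
from enum import Enum


class NodeType(Enum):
    # The eight node classes named in the abstract.
    DATASET = "Dataset"
    CODE = "Code"
    EXECUTION = "Execution"
    MODEL = "Model"
    PARAMETER = "Parameter"
    RESULT = "Result"
    EXPERIMENT = "Experiment"
    AUDIT = "Audit"


class ProvenanceGraph:
    """Toy provenance DAG (hypothetical API); edges point from an artefact to what it depends on."""

    def __init__(self):
        self.node_types = {}                 # node id -> NodeType
        self.depends_on = defaultdict(list)  # node id -> [(predicate, upstream id)]

    def add_node(self, node_id, node_type):
        self.node_types[node_id] = node_type

    def add_edge(self, source, predicate, target):
        # Reject any edge that would close a cycle, so the graph stays a DAG.
        if source == target or source in self.upstream(target):
            raise ValueError(f"edge {source} -> {target} would create a cycle")
        self.depends_on[source].append((predicate, target))

    def upstream(self, node_id):
        """Return every node reachable from node_id (its full lineage), via breadth-first search."""
        seen, queue = set(), deque([node_id])
        while queue:
            current = queue.popleft()
            for _, parent in self.depends_on[current]:
                if parent not in seen:
                    seen.add(parent)
                    queue.append(parent)
        return seen


# Usage: a model produced by one execution that read a dataset and ran a script.
# Predicate names below are assumed for illustration.
graph = ProvenanceGraph()
graph.add_node("raw.csv", NodeType.DATASET)
graph.add_node("train.py", NodeType.CODE)
graph.add_node("run-1", NodeType.EXECUTION)
graph.add_node("model-v1", NodeType.MODEL)
graph.add_edge("run-1", "USED", "raw.csv")
graph.add_edge("run-1", "RAN", "train.py")
graph.add_edge("model-v1", "GENERATED_BY", "run-1")

lineage = graph.upstream("model-v1")
```

In this sketch, `lineage` contains the execution, the dataset, and the script that the model depends on. A persistent property graph store would typically answer the same question with an indexed variable-length path query rather than an in-process traversal.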
