DataMindFlow: An Open-Source Orchestration System for Reproducible Data Engineering Pipelines

Daniel R. Okafor; Mei-Ling Tan; Sofia Almeida; Rajiv Menon

doi:10.63646/datamind.2024.020306

Received 2024-04-18

Accepted 2024-08-22

Published 2024-09-30

Daniel R. Okafor

Data Systems Group, Department of Computer Science, Riverside Institute of Technology, Riverside, CA 92507, USA

Mei-Ling Tan*

School of Computing and Data Science, National Polytechnic University, Singapore 138632, Singapore
mei-ling.tan@npu.edu.sg

Sofia Almeida

Data Systems Group, Department of Computer Science, Riverside Institute of Technology, Riverside, CA 92507, USA

Rajiv Menon

Centre for Computational Discovery, Trans-Pacific Research Laboratory, Vancouver, BC V6T 1Z4, Canada

Abstract

Modern data engineering increasingly relies on multi-stage pipelines that ingest, clean, transform, and model heterogeneous data before any analysis can begin. Although general-purpose workflow schedulers have made these pipelines easier to express and operate, they were not designed to guarantee that a pipeline, once executed, can be reproduced exactly on a different machine, at a later date, or by a different person. The result is a persistent reproducibility gap: re-running the same pipeline frequently yields non-identical artifacts because code, data, parameters, and the software environment are not captured as a single coherent unit. This article presents DataMindFlow, an open-source orchestration system that treats reproducibility as a first-class engineering property rather than an afterthought. DataMindFlow compiles a declarative pipeline into a directed acyclic graph of tasks, derives a deterministic content-addressed cache key for every task from the hash of its code, its resolved inputs, its parameters, and a digest of its execution environment, and persists a complete lineage record in a relational metadata store. A list-scheduling executor based on the Heterogeneous Earliest-Finish-Time heuristic places tasks across local and distributed workers, while a memoisation layer skips any task whose cache key already exists, enabling correct incremental recomputation after partial edits. We describe the system architecture, the metadata schema, the cache-key derivation algorithm, and the public application programming interface, and we evaluate the implementation against five widely used baselines on synthetic graphs of up to five thousand tasks and three realistic pipelines drawn from genomics, natural-language processing, and tabular machine learning. 
DataMindFlow reduces per-task scheduling overhead by a factor of 2.1 to 29 relative to the baselines, reproduces up to 99.2 percent of artifacts bit-identically across three independent hosts, and sustains near-linear throughput scaling to thirty-two workers. We argue that deterministic content addressing, environment capture, and durable lineage should be standard components of data engineering infrastructure, and we release the system, its data dictionary, and all evaluation artifacts under a permissive licence.
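
The cache-key idea summarised above can be illustrated with a minimal sketch. It is not the DataMindFlow implementation or its public interface: the function names `task_cache_key` and `run_memoised`, the choice of SHA-256, the canonical JSON encoding, and the in-memory dictionary standing in for the metadata store are all assumptions made for illustration. The sketch only shows how a key derived from code, resolved inputs, parameters, and an environment digest allows a memoisation layer to skip a task that has already run.

```python
import hashlib
import json


def task_cache_key(code, input_digests, params, env_digest):
    """Hypothetical key derivation: SHA-256 over a canonical JSON record.

    code          -- task source text
    input_digests -- mapping of input name to the content digest of that input
    params        -- JSON-serialisable task parameters
    env_digest    -- digest describing the execution environment
    """
    record = {
        "code": hashlib.sha256(code.encode("utf-8")).hexdigest(),
        "inputs": input_digests,
        "params": params,
        "env": env_digest,
    }
    # Sorting keys and fixing separators makes the encoding independent of
    # dictionary insertion order, so equal content yields an equal key.
    canonical = json.dumps(record, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def run_memoised(cache, key, fn):
    """Run fn only if key is absent from cache; return (result, was_cached)."""
    if key in cache:
        return cache[key], True
    result = fn()
    cache[key] = result
    return result, False


if __name__ == "__main__":
    cache = {}  # assumption: stands in for a durable artifact/metadata store
    key = task_cache_key(
        "def clean(df): return df.dropna()",
        {"raw": "sha256:placeholder-input-digest"},
        {"threshold": 0.5},
        "sha256:placeholder-env-digest",
    )
    print(run_memoised(cache, key, lambda: "cleaned-artifact"))
    print(run_memoised(cache, key, lambda: "cleaned-artifact"))
```

Under these assumptions, changing any one of the four components (for example the environment digest) produces a different key, so the task is re-executed, while an unchanged task is served from the cache. This is the property that makes incremental recomputation after partial edits safe.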

Keywords: Data engineering; workflow orchestration; reproducibility; data lineage; content-addressed storage; pipeline scheduling; AI data infrastructure

This work is licensed under a Creative Commons Attribution 4.0 International License.

How to Cite

Okafor, D. R., Tan, M.-L., Almeida, S., & Menon, R. (2024). DataMindFlow: An Open-Source Orchestration System for Reproducible Data Engineering Pipelines. DATAMIND, 2(3), 73-93. https://doi.org/10.63646/datamind.2024.020306
