Main article

Daniel R. Okafor
Data Systems Group, Department of Computer Science, Riverside Institute of Technology, Riverside, CA 92507, USA
Mei-Ling Tan*
School of Computing and Data Science, National Polytechnic University, Singapore 138632, Singapore
mei-ling.tan@npu.edu.sg
Sofia Almeida
Data Systems Group, Department of Computer Science, Riverside Institute of Technology, Riverside, CA 92507, USA
Rajiv Menon
Centre for Computational Discovery, Trans-Pacific Research Laboratory, Vancouver, BC V6T 1Z4, Canada

DOI: https://doi.org/10.63646/datamind.2024.020306

Abstract

Modern data engineering increasingly relies on multi-stage pipelines that ingest, clean, transform, and model heterogeneous data before any analysis can begin. Although general-purpose workflow schedulers have made these pipelines easier to express and operate, they were not designed to guarantee that a pipeline, once executed, can be reproduced exactly on a different machine, at a later date, or by a different person. The result is a persistent reproducibility gap: re-running the same pipeline frequently yields non-identical artifacts because code, data, parameters, and the software environment are not captured as a single coherent unit. This article presents DataMindFlow, an open-source orchestration system that treats reproducibility as a first-class engineering property rather than an afterthought. DataMindFlow compiles a declarative pipeline into a directed acyclic graph of tasks, derives a deterministic content-addressed cache key for every task from the hash of its code, its resolved inputs, its parameters, and a digest of its execution environment, and persists a complete lineage record in a relational metadata store. A list-scheduling executor based on the Heterogeneous Earliest-Finish-Time heuristic places tasks across local and distributed workers, while a memoisation layer skips any task whose cache key already exists, enabling correct incremental recomputation after partial edits. We describe the system architecture, the metadata schema, the cache-key derivation algorithm, and the public application programming interface, and we evaluate the implementation against five widely used baselines on synthetic graphs of up to five thousand tasks and three realistic pipelines drawn from genomics, natural-language processing, and tabular machine learning. 
DataMindFlow reduces per-task scheduling overhead by a factor of 2.1 to 29 relative to the baselines, reproduces up to 99.2 percent of artifacts bit-identically across three independent hosts, and sustains near-linear throughput scaling up to thirty-two workers. We argue that deterministic content addressing, environment capture, and durable lineage should be standard components of data engineering infrastructure, and we release the system, its data dictionary, and all evaluation artifacts under a permissive licence.
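
The cache-key derivation and memoisation summarised above can be illustrated with a minimal sketch. It is not the DataMindFlow implementation: the function names `cache_key` and `run_memoised`, the use of SHA-256, the canonical JSON encoding, and the in-memory dictionary standing in for the artifact store are all assumptions made for illustration.

```python
import hashlib
import json
from typing import Any, Callable, Dict, Sequence, Tuple


def cache_key(code: str, input_digests: Sequence[str],
              params: Dict[str, Any], env_digest: str) -> str:
    """Derive a deterministic key from the four components named in the abstract.

    Hypothetical helper: SHA-256 and canonical JSON are illustrative choices.
    """
    payload = json.dumps(
        {
            "code": hashlib.sha256(code.encode("utf-8")).hexdigest(),
            "inputs": list(input_digests),  # order is kept: inputs are positional
            "params": params,
            "env": env_digest,
        },
        sort_keys=True,          # parameter order must not change the key
        separators=(",", ":"),   # fixed separators give a stable byte encoding
    )
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


def run_memoised(key: str, compute: Callable[[], Any],
                 store: Dict[str, Any]) -> Tuple[Any, bool]:
    """Return (result, cache_hit); run the task only if the key is unseen."""
    if key in store:
        return store[key], True
    result = compute()
    store[key] = result
    return result, False
```

Under these assumptions, editing a task's code, inputs, parameters, or environment digest changes its key and forces re-execution. If each input digest is itself the content hash of an upstream artifact, a downstream task whose inputs are unchanged keeps its key and is skipped, which is the incremental-recomputation behaviour the abstract describes.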

Article details

How to Cite

Okafor, D. R., Tan, M.-L., Almeida, S., & Menon, R. (2024). DataMindFlow: An Open-Source Orchestration System for Reproducible Data Engineering Pipelines. DATAMIND, 2(3), 73-93. https://doi.org/10.63646/datamind.2024.020306