Adaptive Multi-Modal Data Lakehouse Architecture for Low-Latency AI Feature Retrieval
Main article
Abstract
Modern artificial-intelligence applications increasingly depend on retrieving two kinds of signal at inference time: precomputed structured features keyed by an entity, and nearest-neighbour embeddings of unstructured content such as text, images, and audio. The data-lakehouse paradigm has unified the storage of these heterogeneous assets under open columnar formats, but lakehouse engines are optimised for high-throughput analytical scans rather than for the millisecond-scale point and similarity lookups that online inference requires. In practice, teams bridge this gap by copying features into a separate low-latency key-value store and embeddings into a dedicated vector database, producing duplicated storage, operational overhead, and train–serve skew. This paper formulates low-latency multi-modal feature retrieval from a lakehouse as a concrete systems-engineering problem and presents AdaLH, an adaptive serving architecture that keeps a single open-format source of truth while exposing a unified retrieval interface that joins a structured point lookup with an approximate-nearest-neighbour search in one request. AdaLH introduces three mechanisms: an access-aware tier controller that promotes hot entities and embeddings into an in-memory row plus graph index while demoting cold data to object storage; a cost-based retrieval planner that estimates per-tier latency and routes the structured and vector legs of a request independently before fusing them; and an incremental materialisation pipeline that preserves point-in-time consistency between offline training and online serving. We evaluate AdaLH against four baselines—a two-tier lakehouse-plus-cache stack, a lakehouse-only scan engine, a split vector-database design, and a monolithic in-memory store—on a workload that mixes a ninety-five-feature view with top-ten embedding search over a corpus of one hundred million vectors. AdaLH attains a p99 end-to-end latency of 8.7 ms, a 3.1× improvement over the split design and a 20.7× improvement over the lakehouse-only engine, while sustaining 410 thousand requests per second and maintaining recall@10 above 0.95. An ablation shows that the planner, the controller, and co-located fusion each contribute materially to tail latency, and a calibration study confirms that the planner predicts per-request latency with a mean absolute percentage error of 7.6 percent. All code, configurations, datasets, and a data dictionary are released openly.
