Vector Database Optimization for E-Commerce Search Logs: A Data-Driven Study of Latency, Recall, and Revenue Signals
Abstract
Approximate nearest-neighbor (ANN) retrieval over dense vector embeddings has become the standard architectural pattern for semantic product search on large-scale e-commerce platforms. Yet the literature lacks a systematic, data-driven comparison of how different vector index strategies interact with real search logs and thereby affect retrieval quality, serving latency, and downstream commercial signals. This paper introduces an open, reproducible database of 4.2 million search-session records drawn from a mid-sized Chinese e-commerce platform and uses it to benchmark five vector index configurations: exact flat search, two IVF-PQ variants, and two HNSW configurations. The database captures query embeddings, item embeddings, ranked result lists, click events, cart additions, purchase outcomes, revenue, and per-query serving latency. We design a controlled A/B simulation that successively layers bi-encoder query encoding, HNSW indexing, learning-to-rank (LTR) re-ranking, and position-bias correction, and we measure Recall@K, p50/p99 latency, click-through rate, conversion rate, and average revenue per session at each step. The full system achieves Recall@10 = 0.979 and a p50 latency of 6.2 ms, with a 22.1% uplift in revenue per session over the BM25-sorted IVF-PQ baseline. Ablation experiments identify the re-ranker and the query embedding step as the two largest individual contributors to revenue uplift. The database schema, data pipeline, and reproducibility protocols are described in full, and the anonymised dataset is released under a CC-BY 4.0 licence.
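To make the two headline retrieval metrics concrete, the following minimal sketch shows one common way to compute Recall@K and latency percentiles. It is an illustration under stated assumptions, not the evaluation code used in this study: the function names `recall_at_k` and `percentile` are hypothetical, the ground truth is assumed to be the top-K list returned by exact flat search, and percentiles use the nearest-rank convention, which may differ from the convention used in the reported figures.

```python
import math


def recall_at_k(approx_ids, exact_ids, k):
    """Mean fraction of the exact top-k item IDs that also appear in the
    approximate top-k list, averaged over queries.

    approx_ids, exact_ids: one ranked list of item IDs per query.
    Assumption: exact_ids comes from exact (flat) search and is non-empty
    for every query.
    """
    if k <= 0:
        raise ValueError("k must be positive")
    if not exact_ids or len(approx_ids) != len(exact_ids):
        raise ValueError("need one approximate and one exact list per query")
    total = 0.0
    for approx, exact in zip(approx_ids, exact_ids):
        truth = set(exact[:k])
        if not truth:
            raise ValueError("empty ground-truth list")
        total += len(truth & set(approx[:k])) / len(truth)
    return total / len(exact_ids)


def percentile(values, pct):
    """Nearest-rank percentile (pct in (0, 100]) of a non-empty sequence."""
    if not values or not 0 < pct <= 100:
        raise ValueError("need non-empty values and 0 < pct <= 100")
    ordered = sorted(values)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


if __name__ == "__main__":
    # Toy data: two queries, K = 3 (values are illustrative only).
    approx = [[1, 2, 3], [4, 5, 6]]
    exact = [[1, 2, 9], [4, 5, 6]]
    print("Recall@3:", recall_at_k(approx, exact, 3))

    latencies_ms = [5.1, 6.0, 6.4, 7.2, 30.5]
    print("p50:", percentile(latencies_ms, 50), "ms")
    print("p99:", percentile(latencies_ms, 99), "ms")
```

In this formulation a query contributes 1.0 only when the approximate index returns every item that exact search would have returned in the top K, so Recall@10 = 0.979 would mean that, on average, 97.9% of the exact top-10 items are recovered.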
