Main article

Peng Liu
School of Information Engineering, Zhongnan University of Economics and Law, Wuhan 430073, China
Jing Wang
Department of Computer Science and Technology, Jiangxi University of Finance and Economics, Nanchang 330013, China
Hao Xu*
College of Artificial Intelligence, Shenyang University of Technology, Shenyang 110870, China
haoxu@sut.edu.cn

DOI: https://doi.org/10.63646/datamind.2023.010105

Abstract

Approximate nearest-neighbor (ANN) retrieval over dense vector embeddings has become the standard architectural pattern for semantic product search in large-scale e-commerce platforms. Yet a systematic, data-driven comparison of how different vector index strategies interact with real search logs—affecting retrieval quality, serving latency, and downstream commercial signals—remains absent from the literature. This paper introduces an open, reproducible database of 4.2 million search-session records drawn from a mid-sized Chinese e-commerce platform and uses it to benchmark five vector index configurations: exact flat search, two IVF-PQ variants, and two HNSW configurations. The database captures query embeddings, item embeddings, ranked result lists, click events, cart additions, purchase outcomes, revenue, and per-query serving latency. We design a controlled A/B simulation that layers bi-encoder query encoding, HNSW indexing, learning-to-rank (LTR) re-ranking, and position-bias correction, measuring Recall@K, p50/p99 latency, click-through rate, conversion rate, and average revenue per session at each step. The full system achieves Recall@10 = 0.979 and p50 latency = 6.2 ms, corresponding to a 22.1% uplift in revenue per session compared with the BM25-sorted IVF-PQ baseline. Ablation experiments identify the re-ranker and query embedding steps as the two largest individual contributors to revenue uplift. The database schema, data pipeline, and reproducibility protocols are described in full, and the anonymised dataset is released under a CC-BY 4.0 licence.
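
The two retrieval-side metrics named in the abstract, Recall@K and p50/p99 latency, can be illustrated with a minimal sketch. This is not the authors' evaluation code. It assumes that ground truth comes from the exact flat search and that recall is averaged over queries. The function names `recall_at_k` and `percentile`, the nearest-rank percentile rule, and the toy IDs and latencies are all hypothetical choices made for illustration.

```python
import math
from typing import Sequence


def recall_at_k(approx_ids: Sequence[Sequence[int]],
                exact_ids: Sequence[Sequence[int]],
                k: int) -> float:
    """Mean fraction of the exact top-k items that the ANN index also returns in its top-k."""
    if k <= 0:
        raise ValueError("k must be positive")
    if not exact_ids or len(approx_ids) != len(exact_ids):
        raise ValueError("need one non-empty result list per query in both inputs")
    total = 0.0
    for approx, exact in zip(approx_ids, exact_ids):
        truth = set(exact[:k])  # ground truth from exact (flat) search
        if not truth:
            raise ValueError("each query needs at least one ground-truth item")
        total += len(truth.intersection(approx[:k])) / len(truth)
    return total / len(exact_ids)


def percentile(values: Sequence[float], q: float) -> float:
    """Nearest-rank percentile for q in (0, 100]; an assumed convention, others exist."""
    if not values or not 0 < q <= 100:
        raise ValueError("need non-empty values and 0 < q <= 100")
    ordered = sorted(values)
    rank = max(1, math.ceil(q / 100 * len(ordered)))
    return ordered[rank - 1]


# Toy inputs (made up for illustration, not drawn from the released dataset).
exact = [[1, 2, 3], [4, 5, 6]]       # top-3 item IDs from exact search, per query
approx = [[1, 3, 9], [4, 5, 6]]      # top-3 item IDs from an ANN index, per query
latencies_ms = [5.0, 6.0, 7.0, 8.0, 40.0]

recall = recall_at_k(approx, exact, k=3)   # (2/3 + 3/3) / 2
p50 = percentile(latencies_ms, 50)         # median latency under nearest-rank
p99 = percentile(latencies_ms, 99)         # tail latency under nearest-rank
```

Reporting p99 alongside p50 matters because a single slow query, such as the 40 ms outlier in the toy list, leaves the median unchanged but dominates the tail.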

Article details

How to Cite

Liu, P., Wang, J., & Xu, H. (2023). Vector Database Optimization for E-Commerce Search Logs: A Data-Driven Study of Latency, Recall, and Revenue Signals. DATAMIND, 1(1), 45-56. https://doi.org/10.63646/datamind.2023.010105