Vector Database Optimization for E-Commerce Search Logs: A Data-Driven Study of Latency, Recall, and Revenue Signals

Peng Liu; Jing Wang; Hao Xu

doi:10.63646/datamind.2023.010105

Published 2023-03-30

Peng Liu

School of Information Engineering, Zhongnan University of Economics and Law, Wuhan 430073, China

Jing Wang

Department of Computer Science and Technology, Jiangxi University of Finance and Economics, Nanchang 330013, China

Hao Xu*

College of Artificial Intelligence, Shenyang University of Technology, Shenyang 110870, China
haoxu@sut.edu.cn

Abstract

Approximate nearest-neighbor (ANN) retrieval over dense vector embeddings has become the standard architectural pattern for semantic product search in large-scale e-commerce platforms. Yet the literature still lacks a systematic, data-driven comparison of how different vector index strategies interact with real search logs and thereby affect retrieval quality, serving latency, and downstream commercial signals. This paper introduces an open, reproducible database of 4.2 million search-session records drawn from a mid-sized Chinese e-commerce platform and uses it to benchmark five vector index configurations: exact flat search, two IVF-PQ variants, and two HNSW configurations. The database captures query embeddings, item embeddings, ranked result lists, click events, cart additions, purchase outcomes, revenue, and per-query serving latency. We design a controlled A/B simulation that layers bi-encoder query encoding, HNSW indexing, learning-to-rank (LTR) re-ranking, and position-bias correction, and we measure Recall@K, p50/p99 latency, click-through rate, conversion rate, and average revenue per session at each step. The full system achieves Recall@10 = 0.979 and p50 latency = 6.2 ms, together with a 22.1% uplift in revenue per session relative to the BM25-sorted IVF-PQ baseline. Ablation experiments identify the re-ranker and query embedding steps as the two largest individual contributors to revenue uplift. The database schema, data pipeline, and reproducibility protocols are described in full, and the anonymised dataset is released under a CC-BY 4.0 licence.

Keywords: vector database; approximate nearest neighbor search; FAISS; HNSW; e-commerce search; learning to rank; click-through rate; revenue signal; reproducible research
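
For readers unfamiliar with the retrieval metrics named in the abstract, the following is a minimal sketch of how Recall@K and p50/p99 latency are commonly computed. It assumes Recall@K is measured against the result list of the exact flat search and that percentiles use the nearest-rank method; the abstract does not state the authors' exact definitions. The function names and the toy IDs and latencies are hypothetical and are not taken from the released dataset.

```python
import math


def recall_at_k(approx_ids, exact_ids, k):
    """Share of the exact top-k neighbours that also appear in the approximate top-k."""
    truth = set(exact_ids[:k])
    if not truth:
        return 0.0
    return len(truth.intersection(approx_ids[:k])) / len(truth)


def percentile(values, q):
    """Nearest-rank percentile of a non-empty list, with q in (0, 100]."""
    ordered = sorted(values)
    rank = max(1, math.ceil(q / 100 * len(ordered)))
    return ordered[rank - 1]


# Hypothetical toy data: item IDs returned for one query and per-query latencies in ms.
exact_top5 = [3, 7, 1, 9, 4]    # assumed ground truth from exact flat search
approx_top5 = [3, 1, 7, 8, 4]   # assumed output of an ANN index such as HNSW
latencies_ms = [5.1, 6.0, 6.4, 7.2, 48.0]

print("Recall@5:", recall_at_k(approx_top5, exact_top5, 5))
print("p50 latency (ms):", percentile(latencies_ms, 50))
print("p99 latency (ms):", percentile(latencies_ms, 99))
```

In this toy example the approximate index recovers four of the five exact neighbours, and the single slow query dominates the p99 figure while leaving the p50 figure unaffected, which is why tail and median latency are usually reported separately.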

This work is licensed under a Creative Commons Attribution 4.0 International License.

How to Cite

Liu, P., Wang, J., & Xu, H. (2023). Vector Database Optimization for E-Commerce Search Logs: A Data-Driven Study of Latency, Recall, and Revenue Signals. DATAMIND, 1(1), 45-56. https://doi.org/10.63646/datamind.2023.010105
