Zainab Al-Rashidi*
Department of Computing, Imperial College London, London, UK, SW7 2AZ
z.alrashidi@imperial.ac.uk
Pavel Novotný
Faculty of Informatics, Masaryk University, Brno, Czech Republic, 60200
Min-Ji Kim
School of Computer Science, KAIST, Daejeon, South Korea, 34141

Abstract

Code search — retrieving semantically relevant code given a natural language query — is a foundational capability for developer tooling, code review assistance, and automated programming. Recent work has advanced single-function code search substantially, but repository-level retrieval — finding relevant code across entire codebases that may contain millions of tokens — presents distinct challenges that single-function benchmarks do not capture: cross-file dependencies, project-specific idioms, and the need to retrieve code fragments at varying granularities. This paper evaluates six embedding strategies for repository-level code retrieval: TF-IDF with BM25 (lexical baseline), GraphCodeBERT (structural), CodeT5+ (generative), UniXcoder (multi-modal), Voyage Code 3 (proprietary dense), and a late-interaction architecture adapted from ColBERT (ColCode). We construct a new evaluation benchmark (RepoSearch-1K) consisting of 1,000 repository-search queries across five programming languages and eight domains, with relevance annotations from professional software engineers. Results show that late-interaction approaches substantially outperform single-vector dense retrieval on cross-file dependency queries, and that structural embeddings (GraphCodeBERT) retain an advantage over purely semantic approaches on queries involving abstract syntax tree relationships. We release RepoSearch-1K as a community resource.
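To make the contrast between late-interaction and single-vector dense retrieval concrete, the following is a minimal sketch of the MaxSim scoring operator from ColBERT-style architectures: every query token embedding is matched against its most similar document token embedding, and the per-token maxima are summed. This is an illustrative NumPy implementation under assumed inputs (pre-computed token embedding matrices), not the paper's ColCode code.

```python
import numpy as np

def maxsim_score(query_vecs: np.ndarray, doc_vecs: np.ndarray) -> float:
    """Late-interaction (MaxSim) relevance score.

    query_vecs: (num_query_tokens, dim) token embeddings for the query.
    doc_vecs:   (num_doc_tokens, dim) token embeddings for one document.

    Each query token is scored against its best-matching document token
    (cosine similarity); the per-token maxima are summed. A single-vector
    dense retriever would instead collapse each side to one pooled vector
    before comparison, losing token-level alignment.
    """
    # L2-normalize rows so dot products are cosine similarities.
    q = query_vecs / np.linalg.norm(query_vecs, axis=1, keepdims=True)
    d = doc_vecs / np.linalg.norm(doc_vecs, axis=1, keepdims=True)
    sim = q @ d.T  # (num_query_tokens, num_doc_tokens) similarity matrix
    return float(sim.max(axis=1).sum())
```

Because the score decomposes per query token, a multi-concept query (e.g. one mentioning both a caller and a callee in different files) can match different code regions with different tokens, which is one plausible reason late interaction helps on cross-file dependency queries.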

Article details

How to Cite

Al-Rashidi, Z., Novotný, P., & Kim, M.-J. (2026). Neural Code Search: Evaluating Embedding Strategies for Repository-Level Code Retrieval. DATAMIND, 3(3), 1-4. https://doi.org/10.63646/