Neural Code Search: Evaluating Embedding Strategies for Repository-Level Code Retrieval
Abstract
Code search — retrieving semantically relevant code given a natural language query — is a foundational capability for developer tooling, code review assistance, and automated programming. Recent work has advanced single-function code search substantially, but repository-level retrieval — finding relevant code across entire codebases that may contain millions of tokens — presents distinct challenges that single-function benchmarks do not capture: cross-file dependencies, project-specific idioms, and the need to retrieve code fragments at varying granularities. This paper evaluates six embedding strategies for repository-level code retrieval: TF-IDF with BM25 (lexical baseline), GraphCodeBERT (structural), CodeT5+ (generative), UniXcoder (multi-modal), Voyage Code 3 (proprietary dense), and a late-interaction architecture adapted from ColBERT (ColCode). We construct a new evaluation benchmark (RepoSearch-1K) consisting of 1,000 repository-search queries across five programming languages and eight domains, with relevance annotations from professional software engineers. Results show that late-interaction approaches substantially outperform single-vector dense retrieval on cross-file dependency queries, and that structural embeddings (GraphCodeBERT) retain an advantage over purely semantic approaches on queries involving abstract syntax tree relationships. We release RepoSearch-1K as a community resource.
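To make the late-interaction contrast concrete: single-vector dense retrieval pools each query and document into one embedding and compares them once, whereas a ColBERT-style late-interaction model (as ColCode adapts) keeps per-token embeddings and scores with "MaxSim" — each query token takes its best match among document tokens, and these maxima are summed. The sketch below is illustrative only; function names and dimensions are our own, not from the paper.

```python
# Hedged sketch of single-vector vs. late-interaction (MaxSim) scoring.
# All names here are illustrative assumptions, not the paper's implementation.
import numpy as np

def single_vector_score(q_vec: np.ndarray, d_vec: np.ndarray) -> float:
    """Cosine similarity between one pooled query vector and one pooled
    document vector -- the scoring used by single-vector dense retrieval."""
    return float(q_vec @ d_vec /
                 (np.linalg.norm(q_vec) * np.linalg.norm(d_vec)))

def late_interaction_score(q_toks: np.ndarray, d_toks: np.ndarray) -> float:
    """ColBERT-style MaxSim. q_toks has shape (nq, dim), d_toks (nd, dim),
    with rows L2-normalized. For each query token, take its maximum
    similarity over all document tokens, then sum over query tokens."""
    sim = q_toks @ d_toks.T            # (nq, nd) token-level similarities
    return float(sim.max(axis=1).sum())

# Toy usage with random L2-normalized "token embeddings".
rng = np.random.default_rng(0)
q = rng.normal(size=(4, 8))
q /= np.linalg.norm(q, axis=1, keepdims=True)
d = rng.normal(size=(16, 8))
d /= np.linalg.norm(d, axis=1, keepdims=True)
print(late_interaction_score(q, d))
```

Because every query token can align with a different document token, MaxSim can reward a document that scatters the query's concepts across distinct code fragments — a plausible mechanism for the cross-file advantage the abstract reports, where a pooled single vector would dilute those signals.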
