
Building a Lightweight Semantic Search Engine for arXiv

Search over scientific papers gets difficult once the query stops looking like a title fragment and starts looking like an intent: a method, a problem, or an application area. Many relevant papers are closely related in meaning but use different terminology, which makes scientific literature a useful setting for semantic retrieval. Conversely, papers that share the same terminology may not even come from the same discipline. For instance, "quantization" might refer to low-bit compression in machine learning, energy quantization in physics, quantization noise in signal processing, or quantization schemes in numerical methods.

In this article, we build and evaluate a retrieval system over a corpus of 2.8M+ arXiv papers using 200,000 synthetic intent-style queries, and compare sparse, dense, and hybrid retrieval across four query types with different levels of semantic difficulty.

The goal is to make this system feasible to run locally or on a small CPU instance with limited resources, as a lightweight retrieval layer for literature-oriented RAG and research assistance.

A number of websites already make scientific literature on arXiv searchable through semantic similarity. Most are designed for interactive human use rather than as retrieval backends for local or agentic workflows. It is also often hard to find public details on how retrieval is implemented or evaluated.

Dataset and Evaluation

Evaluating semantic retrieval over arXiv requires a dataset of query-document pairs covering a meaningful subset of the corpus. We started with version 281 of the continually updated Cornell arXiv dataset published on Kaggle, which contains 2.8M+ abstracts. We sampled 50k paper abstracts at random from the corpus and generated a total of 200k synthetic queries for them with Gemini 3.0 Flash: four queries per abstract, each with a different level of semantic specificity, as illustrated in the table below.

Note that this benchmark is synthetic and single-target by construction: each query is generated from one paper abstract and evaluated against that paper as the target. It therefore measures a useful form of intent-style known-item retrieval under paraphrase, but does not fully capture open-ended literature search or real user behavior.
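Because every query has exactly one target paper, Recall@k reduces to the fraction of queries whose target appears in the top k results. The following is a minimal sketch of that metric, not the evaluation code used for this article; the function name `recall_at_k` and the query and paper IDs are hypothetical.

```python
from typing import Dict, List


def recall_at_k(
    ranked_ids: Dict[str, List[str]],
    targets: Dict[str, str],
    k: int,
) -> float:
    """Fraction of queries whose single target paper appears in the top k."""
    if not targets:
        return 0.0
    hits = 0
    for query_id, target_id in targets.items():
        # A query with no retrieved results counts as a miss.
        top_k = ranked_ids.get(query_id, [])[:k]
        if target_id in top_k:
            hits += 1
    return hits / len(targets)


# Toy example with made-up query and paper IDs.
ranked = {
    "q1": ["p3", "p1", "p7"],
    "q2": ["p9", "p4", "p2"],
}
gold = {"q1": "p1", "q2": "p2"}

for k in (1, 2, 3):
    print(k, recall_at_k(ranked, gold, k))
```

With a single target per query, Recall@k can only grow as k grows, which is why the metric is reported at several cutoffs.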

| Query Type | Example |
| --- | --- |
| core_concept_query | "multi-head self-attention mechanism without RNN or CNN architecture" |
| problem_solution_query | "improving training speed and parallelization in sequence to sequence models" |
| exploratory_query | "attention based deep learning architectures for natural language processing 2017" |
| application_query | "WMT 2014 English to German translation BLEU state of the art transformer paper" |

To fine-tune the EmbeddingGemma model, the 200k queries were split into a 180k training set and 20k eval set. The split was performed at the paper level: 45k source papers were used for training-query generation and 5k held-out papers for evaluation.
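Splitting at the paper level means that all four queries of a paper land on the same side of the split, so no evaluation query is generated from a paper seen during training. A minimal sketch of such a split is shown here; the function name `split_by_paper`, the seed, and the toy paper IDs are assumptions for illustration and not the actual data pipeline.

```python
import random
from typing import List, Tuple

Query = Tuple[str, str]  # (paper_id, query_text)


def split_by_paper(
    queries: List[Query], eval_fraction: float = 0.1, seed: int = 0
) -> Tuple[List[Query], List[Query]]:
    """Split queries so that each paper's queries stay together."""
    # Sort first so the shuffle is reproducible for a given seed.
    paper_ids = sorted({paper_id for paper_id, _ in queries})
    rng = random.Random(seed)
    rng.shuffle(paper_ids)

    n_eval = round(len(paper_ids) * eval_fraction)
    eval_papers = set(paper_ids[:n_eval])

    train = [q for q in queries if q[0] not in eval_papers]
    held_out = [q for q in queries if q[0] in eval_papers]
    return train, held_out


# Toy data: 50 hypothetical papers with four query types each.
query_types = [
    "core_concept_query",
    "problem_solution_query",
    "exploratory_query",
    "application_query",
]
queries = [
    (f"paper{i}", f"{query_type} for paper{i}")
    for i in range(50)
    for query_type in query_types
]

train, held_out = split_by_paper(queries)
print(len(train), len(held_out))
```

At the scale described above, the same 90/10 split over 50k papers gives 45k training papers and 5k held-out papers.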

We used EmbeddingGemma as the base embedding model because it offers a compact 768-dimensional representation and a small enough parameter count to make fine-tuning and deployment practical on modest hardware.

Alongside the embedding model, we use a BM25 index and merge results with reciprocal rank fusion (RRF) for validation.
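Reciprocal rank fusion combines ranked lists using only rank positions, so BM25 scores and embedding similarities never need to be put on a common scale. The sketch below illustrates the idea; the constant `k = 60` is the value commonly used in the RRF literature and is an assumption here, since the article does not state which constant was used. The function name and document IDs are hypothetical.

```python
from collections import defaultdict
from typing import Dict, List


def reciprocal_rank_fusion(
    rankings: List[List[str]], k: int = 60, top_n: int = 10
) -> List[str]:
    """Fuse several ranked lists of document IDs into one ranking."""
    scores: Dict[str, float] = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            # Each list contributes 1 / (k + rank) for a document it contains.
            scores[doc_id] += 1.0 / (k + rank)
    # Highest fused score first; ties are broken by ID for determinism.
    ordered = sorted(scores.items(), key=lambda item: (-item[1], item[0]))
    return [doc_id for doc_id, _ in ordered[:top_n]]


# Toy example: two retrievers that partially agree.
bm25_ranking = ["a", "b", "c"]
dense_ranking = ["c", "a", "d"]
fused = reciprocal_rank_fusion([bm25_ranking, dense_ranking])
print(fused)
```

Documents that both retrievers rank highly rise to the top, while a document found by only one retriever is still kept with a lower fused score.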

Results

For the baseline results shown in Fig 1 below, we used a BM25 index of titles and abstracts, together with embeddings of the same text from a stock (not fine-tuned) EmbeddingGemma-300m. Because the core-concept query stays closest to the terminology and structure of the original abstract, it is expected that BM25 does quite well here and outperforms the dense model (see also Fig 2). Even without fine-tuning, the hybrid baseline already outperforms either sparse or dense retrieval in isolation, indicating that the two methods capture complementary relevance signals.

Fig 1. Recall@10 evolution across model iterations for sparse, dense, hybrid, and reranked retrieval across baseline and fine-tuning stages.

Fine-tuning with in-batch negatives

A first fine-tuning run was based on the Multiple Negatives Ranking Loss (MNRL) and used random negatives from the same batch. Fine-tuning with in-batch negatives improves both dense retrieval and the final hybrid system, with the largest gains appearing on the harder problem-solution and application-style queries.
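To make the role of in-batch negatives concrete, the following sketch computes an MNRL-style loss in plain Python: for each query, its own abstract is the positive and every other abstract in the batch acts as a negative, scored with a softmax cross-entropy over scaled cosine similarities. This is an illustration of the loss, not the training code used for this article. The scale of 20 mirrors the default in the sentence-transformers implementation and is an assumption about the setup.

```python
import math
from typing import List, Sequence


def _normalize(vector: Sequence[float]) -> List[float]:
    """Scale a vector to unit length (zero vectors are returned unchanged)."""
    norm = math.sqrt(sum(x * x for x in vector)) or 1.0
    return [x / norm for x in vector]


def mnrl_loss(
    query_embs: Sequence[Sequence[float]],
    doc_embs: Sequence[Sequence[float]],
    scale: float = 20.0,
) -> float:
    """Mean cross-entropy where doc_embs[i] is the positive for query_embs[i]."""
    queries = [_normalize(v) for v in query_embs]
    docs = [_normalize(v) for v in doc_embs]
    total = 0.0
    for i, query in enumerate(queries):
        # Scaled cosine similarity of this query against every doc in the batch.
        logits = [scale * sum(a * b for a, b in zip(query, doc)) for doc in docs]
        # Numerically stable log-sum-exp over the batch.
        max_logit = max(logits)
        log_z = max_logit + math.log(sum(math.exp(l - max_logit) for l in logits))
        # Negative log-probability assigned to the matching document.
        total += log_z - logits[i]
    return total / len(queries)


# Toy batch: two queries with two-dimensional embeddings.
batch_queries = [[1.0, 0.0], [0.0, 1.0]]
aligned_docs = [[1.0, 0.0], [0.0, 1.0]]
swapped_docs = [[0.0, 1.0], [1.0, 0.0]]

print(mnrl_loss(batch_queries, aligned_docs))
print(mnrl_loss(batch_queries, swapped_docs))
```

The loss is close to zero when each query is most similar to its own abstract and grows when another abstract in the batch scores higher, which is what pushes the fine-tuned model to separate papers within a batch.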

Fig 2. Recall@10 by query type across retrieval stages, showing that core-concept queries remain easiest while exploratory and application-style queries benefit most from fine-tuning.

Fine-tuning with hard negatives

Although in-batch negatives provided a boost in recall, they are selected at random: which negatives a query sees depends only on how the dataloader composes each batch. We improved on this by sampling negatives from similar abstracts in the BM25 index instead. To reduce the risk of including relevant abstracts as negatives, we exclude BM25 ranks 2-14 and sample uniformly from BM25 ranks 15-50 to select one negative example for each positive sample.
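The rank-window sampling can be sketched as follows. The function assumes that a BM25 ranking for the training example is already available as a list of paper IDs ordered best first; how that ranking is produced is outside the sketch. The function name `sample_hard_negative` and the toy IDs are hypothetical.

```python
import random
from typing import List, Optional


def sample_hard_negative(
    bm25_ranked_ids: List[str],
    positive_id: str,
    first_rank: int = 15,
    last_rank: int = 50,
    rng: Optional[random.Random] = None,
) -> Optional[str]:
    """Pick one negative uniformly from BM25 ranks first_rank..last_rank."""
    rng = rng or random.Random()
    # Ranks are 1-based and inclusive, so ranks 15-50 map to slice [14:50].
    window = bm25_ranked_ids[first_rank - 1 : last_rank]
    # Never return the positive itself, even if it falls inside the window.
    candidates = [doc_id for doc_id in window if doc_id != positive_id]
    if not candidates:
        return None  # Ranking too short to provide a hard negative.
    return rng.choice(candidates)


# Toy ranking of 100 made-up paper IDs, with the positive at rank 1.
ranking = [f"p{rank}" for rank in range(1, 101)]
negative = sample_hard_negative(ranking, positive_id="p1", rng=random.Random(0))
print(negative)
```

Skipping the top ranks trades some difficulty for safety: the closest BM25 neighbors are the most likely to be unlabeled relevant papers, while ranks 15-50 are still lexically similar enough to be informative negatives.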

This led to a hybrid Recall@10 that was 9 percentage points higher than the hybrid baseline and 5% higher than with random in-batch negatives. A stock reranker improved the in-batch-negative setup, but not the final hard-negative pipeline. Because it also adds substantial latency, we exclude it from the final deployment-oriented system.

Hard-negative sampling yields the best overall hybrid results as shown below, especially at Recall@10, even though it does not produce the strongest dense-only retriever in isolation.

| Pipeline stage | Recall@1 | Recall@10 | Recall@100 |
| --- | --- | --- | --- |
| Sparse (BM25) | 28.46% | 49.98% | 69.16% |
| Dense (Gemma) | 26.83% (+2.63pp) | 52.17% (+4.11pp) | 75.20% (+4.53pp) |
| Hybrid RRF | 36.05% (+7.18pp) | 63.80% (+9pp) | 82.16% (+4.51pp) |
| Reranked | 36.65% | 62.48% (+0.83pp) | 82.16% (+4.51pp) |

Latency and deployment

To reason about recall gains without the influence of other system components, the preceding sections used brute-force search on GPU over the 768-dimensional embeddings of the entire 2.8M-paper corpus. For use in an interactive RAG pipeline, however, low latency matters more than a small loss in recall.

To reduce retrieval latency for deployment, we use approximate nearest-neighbor search and build an HNSW index with FAISS. We picked efSearch=512 to closely match brute-force search in recall. Approximate speedup figures are included in the last column of the table below to show how much lookup time can be reduced while preserving nearly all of the original recall.

| Nearest-neighbor method | Recall@1 | Recall@10 | Recall@100 | Speedup (CPU) |
| --- | --- | --- | --- | --- |
| Brute force (exact search) | 36.65% | 62.48% | 82.16% | 1x |
| FAISS HNSWFlat (efSearch=64) | 33.78% | 61.69% | 81.29% | 1000x |
| FAISS HNSWFlat (efSearch=256) | 35.88% | 63.42% | 82.36% | 400x |
| FAISS HNSWFlat (efSearch=512) | 36.38% | 63.85% | 82.52% | 200x |

For the planned CPU deployment (8-core AMD x86, 32 GB RAM), we additionally used ONNX for model serving and obtained the following results for the end-to-end retriever:

Fig 3. End-to-end latency under increasing query load on an 8-core CPU system (no GPU used for inference), with a clear throughput knee beyond roughly 22 QPS.

| Stage latencies at 22 QPS | P50 | P90 | P99 |
| --- | --- | --- | --- |
| Sparse (BM25) | 10.9 ms | 12.2 ms | 13.3 ms |
| Dense (Gemma) | 28.3 ms | 30.3 ms | 31.2 ms |
| FAISS | 3.0 ms | 4.2 ms | 4.2 ms |
| End-to-end | 42.9 ms | 44.6 ms | 45.3 ms |

Given the figures above, the final system already meets the needs of our intended deployment: a lightweight local retrieval layer for literature-oriented RAG. On the 8-core CPU system, it sustains roughly 20 QPS with low tail latency, which is sufficient for an interactive single-user setup. On an ASUS GX10 (NVIDIA GB10), around 100 QPS can be achieved using batched inference.

If higher throughput on CPU were needed, the most promising next step would be to optimize embedding inference, since it accounts for the largest share of end-to-end latency in the current setup.