Magpie-RS — Semantic Search That Survives Real Workloads
A vector database in Rust, and what 22,771 R packages taught it.
Summary for Decision Makers
The Problem: Most vector databases work great in a notebook demo, then fall apart at scale. They blow up memory on large corpora, lose all progress when something crashes, and force teams to rewrite their indexing pipeline when moving from prototype to production.
The Solution: Magpie-RS is a Rust vector database built for production from day one. It uses streaming DiskANN with mmap-backed storage to keep memory bounded, persists every embedding to SQLite immediately so crashes lose only seconds of work, and exposes the same API for 50 documents as for 50 million.
The Proof: We indexed the entire CRAN — 22,771 R packages, ~1 million functions — into a single 2–4 GB index. Peak memory dropped from a theoretical 150 GB to 1–2 GB. The result is a recommendation engine that finds packages by purpose, not just keywords, with 41 coherent clusters at modularity 0.59.
The Bottom Line: If you have between 10,000 and 10 million chunks to embed and you want one tool that handles both ends without rewrites — and that doesn't lose work when something goes wrong — this is what we built.
Fast systems. Clean indexes.
Vector search libraries are easy to write and hard to ship. The distance between a notebook demo and a production index over millions of documents is paved with OOM kills, half-finished runs, and weekend embedding bills. Magpie-RS is the answer I built when I got tired of paying that toll.
The Problem with Most Vector Databases
Most embedding pipelines look great until the corpus crosses a few hundred thousand chunks. Then three things happen, in order:
- RAM blows up. HNSW-style indexes accumulate every embedding in a `HashMap` until `build()`, and the build call clones them. A 15M-passage corpus at 768-dim float32 is ~47 GB before you start; with the build-time copy you're looking at 100–150 GB peak.
- A crash throws away the embedding bill. An OOM kill, an accidental `^C`, or a node panic at 95% completion, and the next run starts from passage zero. On a single GPU, embedding 10k PDFs with `embeddinggemma` takes 60–90 minutes. The third time you eat that on a Friday evening, you start writing your own indexer.
- The "production" path bifurcates from the "tutorial" path. Streaming, batching, resumability, and metadata filtering are bolted on as an afterthought, with different APIs and different failure modes than the in-memory toy. Code that was clean in the demo gets ugly in production.
Magpie-RS is a Rust vector database that takes those three failure modes seriously from the start. It does the unglamorous work — bounded memory, atomic checkpoints, durable resume — so the API surface that handles 50 documents is the same as the one that handles 50 million.
What It Is, in Three Sentences
A library plus a CLI. The library is MagpieBuilder → MagpieSearcher: stream documents in, ship a queryable artifact out. The CLI wraps the library and adds a REST server, an MCP endpoint for Claude, an IPC server, and a WebAssembly build for the browser — five deployment shapes from a single Rust core.
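A minimal sketch of what that flow can look like. `MagpieBuilder`, `MagpieSearcher`, and `with_output_path()` appear in this article; `new()`, `add_document()`, `finish()`, `open()`, `search()`, and the hit fields are illustrative placeholders, not the published API:

```rust
// Sketch only: method names beyond the types named in the article are assumed.
use magpie_rs::{MagpieBuilder, MagpieSearcher};

fn main() -> Result<(), Box<dyn std::error::Error>> {
    // Build phase: every embedded passage flushes to SQLite before its
    // vector enters the graph, so RAM stays bounded by the active batch.
    let mut builder = MagpieBuilder::new()
        .with_output_path("cran-full.magpie")?;
    for path in ["mirror/ggplot2.tar.gz", "mirror/dplyr.tar.gz"] {
        builder.add_document(path)?; // hypothetical ingest call
    }
    builder.finish()?; // hypothetical: compacts the delta into the mmap base

    // Query phase: reopen the artefact, no rebuild required.
    let searcher = MagpieSearcher::open("cran-full.magpie")?; // hypothetical
    for hit in searcher.search("grammar of graphics", 10)? {
        println!("{:.3}  {}", hit.score, hit.doc_path); // hypothetical fields
    }
    Ok(())
}
```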
Under the hood:
- Two index backends. DiskANN (default, streaming, mmap-backed) for any corpus you'd actually call "large." HNSW (kept for compatibility) for small/in-process work.
- Hybrid retrieval out of the box. Vector cosine ranked alongside BM25, with reranking and metadata filters available without flipping feature flags (one way to fuse the two rankings is sketched after this list).
- A real chunker. AST-aware code chunking for Python, JavaScript, TypeScript, Rust, Java, C#, and R; markdown chunking that respects header hierarchy; PDF/DOCX/57-other-format extraction with OCR fallback. The parser pipeline isn't a research curiosity — it's what makes search relevance non-embarrassing on real documentation.
- SQLite-backed passages by default. Every passage flushes to a SQLite WAL the moment it's embedded. Builder RAM stays bounded; resume across crashes is implicit; concurrent readers are free.
- Optional Q&A and advanced RAG patterns. Ollama, OpenAI, Anthropic, Groq, Together. HyDE, Multi-Query, Corrective RAG — recipes you'd otherwise hand-roll, available behind a feature flag.
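The article doesn't specify how Magpie fuses the cosine and BM25 rankings internally. Reciprocal rank fusion (RRF) is one standard technique for merging them, sketched here as a self-contained illustration rather than as Magpie's implementation:

```rust
use std::collections::HashMap;

/// Reciprocal Rank Fusion: merge several rankings by summing 1/(k + rank).
/// k = 60 is the conventional damping constant from the RRF literature.
fn rrf(rankings: &[Vec<&str>], k: f64) -> Vec<(String, f64)> {
    let mut scores: HashMap<String, f64> = HashMap::new();
    for ranking in rankings {
        for (rank, id) in ranking.iter().enumerate() {
            *scores.entry(id.to_string()).or_insert(0.0) += 1.0 / (k + rank as f64 + 1.0);
        }
    }
    let mut merged: Vec<_> = scores.into_iter().collect();
    merged.sort_by(|a, b| b.1.partial_cmp(&a.1).unwrap());
    merged
}

fn main() {
    let cosine_top = vec!["doc_a", "doc_b", "doc_c"]; // vector ranking
    let bm25_top = vec!["doc_b", "doc_d", "doc_a"];   // keyword ranking
    for (id, score) in rrf(&[cosine_top, bm25_top], 60.0) {
        println!("{id}: {score:.4}"); // docs ranked by both lists rise to the top
    }
}
```

RRF needs only ranks, not comparable scores, which is why it's a common default when mixing bounded cosine similarities with BM25's unbounded scores.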
Application: Indexing the Entire CRAN
22,771 R packages. ~1M functions. One index.
CRAN — the Comprehensive R Archive Network — is the canonical R package repository. It's also a perfect Magpie test bed: ~22k packages, ~50 functions per package on average, structured metadata (DESCRIPTION, NAMESPACE), and a real "what is similar to what?" question that researchers and package maintainers actually ask.
The end-to-end recipe is two scripts: `fetch-all.sh` downloads every CRAN tarball into a local mirror; `build-index.sh` walks the mirror, chunks each package's documentation and source, and streams the result through the embedder into a single index file.
Provider matters more than parallelism here
| Embedding provider | Model | Throughput | Full-index time | Cost |
|---|---|---|---|---|
| Ollama (local CPU) | mxbai-embed-large (1024d) | ~3 pkg/s | 4–6 h | free, ~4 GB RAM |
| Ollama, more workers | mxbai-embed-large | ~6 pkg/s | 2–4 h | free, ~8 GB RAM |
| OpenAI API | text-embedding-3-small (1536d) | ~20 pkg/s | 1–2 h | ~$5–10 |
The index lands at ~2–4 GB on disk. From there:
# What's similar to ggplot2?
magpie cran similar cran-full ggplot2 --top-k 10
# Search across all of CRAN, semantically
magpie cran search cran-full "machine learning classification" \
--provider ollama --model mxbai-embed-large --top-k 20
# Cluster the whole corpus
magpie cran cluster cran-full --algorithm louvain --output clusters.json

That last command is the interesting one. Magpie builds a k-NN graph (k=50) from the HNSW index and runs Louvain community detection on it. On the full CRAN snapshot the result is 41 clusters at modularity 0.59, a solid score for a real-world graph of this size: strong community structure without forcing artificial boundaries.
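For context, the modularity that Louvain maximizes is the standard Newman–Girvan quantity:

$$
Q = \frac{1}{2m} \sum_{ij} \left[ A_{ij} - \frac{k_i k_j}{2m} \right] \delta(c_i, c_j)
$$

where $A_{ij}$ is the adjacency matrix of the k-NN graph, $k_i$ the degree of node $i$, $m$ the total number of edges, and $\delta(c_i, c_j)$ is 1 when nodes $i$ and $j$ share a community. A value of 0.59 means the clusters hold far more internal edges than a degree-matched random graph would; values above roughly 0.3 are conventionally read as meaningful community structure.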
The Clusters Tell a Coherent Story
- Cluster 1 — Statistical Modeling & ML (3,162 packages, cohesion 0.867). The expected core: regression, classification, GAMs, ensemble methods.
- Cluster 6 — Statistical Inference & Hypothesis Testing (2,472 packages, cohesion 0.852). Tests, CIs, bootstrap, survey methodology.
- Cluster 2 — R Infrastructure & Development Tools (2,018 packages, cohesion 0.828). The metaprogramming tier: package authoring, documentation generators, R-Markdown, build tooling.
The cohesion numbers — average intra-cluster cosine similarity — are what make this useful as a recommendation primitive. A cluster at 0.85+ is dense enough that "show me three packages similar to {x}" returns answers a working statistician would actually endorse, not just keyword-shaped neighbours.
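Cohesion as defined here, average intra-cluster cosine similarity, is straightforward to compute directly from the definition; a minimal sketch (Magpie's own implementation isn't shown in this article):

```rust
/// Average pairwise cosine similarity over one cluster's embeddings.
/// This is the "cohesion" number quoted for the clusters above.
fn cohesion(vectors: &[Vec<f32>]) -> f32 {
    let cos = |a: &[f32], b: &[f32]| -> f32 {
        let dot: f32 = a.iter().zip(b).map(|(x, y)| x * y).sum();
        let norm = |v: &[f32]| v.iter().map(|x| x * x).sum::<f32>().sqrt();
        dot / (norm(a) * norm(b))
    };
    let mut total = 0.0f32;
    let mut pairs = 0u32;
    for i in 0..vectors.len() {
        for j in (i + 1)..vectors.len() {
            total += cos(vectors[i].as_slice(), vectors[j].as_slice());
            pairs += 1;
        }
    }
    total / pairs as f32
}

fn main() {
    // Three toy embeddings pointing in similar directions -> high cohesion.
    let cluster = vec![vec![1.0, 0.0], vec![0.9, 0.1], vec![0.8, 0.2]];
    println!("cohesion = {:.3}", cohesion(&cluster));
}
```

At cluster sizes like 3,162 packages, exact enumeration is roughly five million pairs, so a production implementation would sample pairs or compare against the centroid instead.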
A taste of what's inside one cluster, picked from cluster-descriptions.md:
Cluster 1 includes `glmnetcr` (penalized regression), `pamm` (generalized additive models), `RRPP` (linear models), `semnova` (SEM), `BoltzMM` (Boltzmann machines), and `DNMF` (non-negative matrix factorization).
These don't share many keywords. They share purpose. That's the point.
The Engineering Story Behind "It Doesn't Crash"
The CRAN run is a happy path. The interesting work is the near-disaster path — what happens when something goes wrong at hour five of a six-hour build.
Magpie's design answer:
- `with_output_path()` opens a SQLite database immediately, and every embedded passage flushes to it before the vector goes into the graph. RAM stays bounded by the active batch.
- DiskANN is streaming end-to-end. Each batch goes into the delta layer; the delta auto-compacts to an mmap-backed base file every 50k vectors (configurable). No accumulate-then-build phase exists.
- Resume is implicit. Reopen the same path and `next_idx` is restored from SQLite, the on-disk graph is reattached, and document-level checkpointing skips files already processed in the previous run. `PRAGMA wal_checkpoint(TRUNCATE)` on save flushes the WAL before the SQLite file is copied to its final location, so the durable artefact contains every embedding the indexer thought it had committed.
- WAL mode plus the streaming graph mean another process can open the index read-only and search it while the indexer is still ingesting, which is useful when you want to start prototyping queries before the long run finishes.
The result: an OOM kill at hour five costs you the last batch (a few hundred passages), not the run. A `^C` is recoverable. A power failure is recoverable with one extra line of code (`with_fsync_on_flush(true)`).
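The WAL-flush-before-copy step mentioned above is plain SQLite mechanics, easy to show with the `rusqlite` crate. This illustrates the technique, not Magpie's internal code:

```rust
use rusqlite::Connection;

/// Checkpoint the WAL into the main database file, then copy it.
/// After TRUNCATE, a byte-for-byte copy of `db_path` contains every
/// committed row; without it, recent writes may live only in the -wal file.
fn snapshot(db_path: &str, dest: &str) -> Result<(), Box<dyn std::error::Error>> {
    let conn = Connection::open(db_path)?;
    // wal_checkpoint returns one row (busy, log, checkpointed); we ignore it.
    conn.query_row("PRAGMA wal_checkpoint(TRUNCATE)", [], |_row| Ok(()))?;
    std::fs::copy(db_path, dest)?;
    Ok(())
}

fn main() {
    if let Err(e) = snapshot("index.sqlite", "index.snapshot.sqlite") {
        eprintln!("snapshot failed: {e}");
    }
}
```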
For the 15M-passage workload this was originally designed for, peak RAM dropped from ~150 GB to ~1–2 GB. That's the kind of number that turns "needs a 256 GB box" into "runs on a developer laptop."
Traceable References: The Anti-Hallucination Layer
When LLMs are involved in retrieval, hallucination is the killer objection — especially in regulated industries. Decision-makers ask the same question every time: "How do I know the answer is grounded in our actual documents?"
Magpie-RS answers that question structurally. Every passage in the index carries precise source coordinates:
- Document path — which file the passage came from
- Page number — for PDFs and DOCX
- Line numbers — for code, markdown, and plain text
- Section/header path — for structured documents
- Byte offsets — for exact reconstruction
When the LLM cites a passage, the answer doesn't just say "according to our docs" — it says "according to SOP-2024-Q3.pdf, page 47, lines 12–18." Users can click through to the exact location in the original document and verify the claim themselves.
This is the difference between "an AI told us" and "an AI cited the contract clause we already approved." For pharma SOPs, legal contracts, financial filings, and engineering specifications, this trust layer is non-negotiable. Magpie-RS makes it the default, not an afterthought.
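As a data structure, those coordinates might look like the following. The five coordinate kinds are the ones listed above and the example values echo the SOP citation; the field names and everything else are illustrative, not Magpie's actual types:

```rust
/// Illustrative shape for the source coordinates carried by each passage.
#[derive(Debug, Clone)]
struct SourceRef {
    doc_path: String,          // which file the passage came from
    page: Option<u32>,         // PDFs and DOCX
    lines: Option<(u32, u32)>, // code, markdown, plain text
    section_path: Vec<String>, // header hierarchy for structured documents
    byte_range: (u64, u64),    // offsets for exact reconstruction
}

fn main() {
    let citation = SourceRef {
        doc_path: "SOP-2024-Q3.pdf".into(),
        page: Some(47),
        lines: Some((12, 18)),
        section_path: vec!["Deviations".into()], // hypothetical section
        byte_range: (88_210, 89_004),            // hypothetical offsets
    };
    // Render the citation string the way the article shows it.
    println!(
        "according to {}, page {}, lines {}–{}",
        citation.doc_path,
        citation.page.unwrap(),
        citation.lines.unwrap().0,
        citation.lines.unwrap().1
    );
}
```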
Use Cases
CRAN was the test bed. The same engine fits a number of real-world problems where the corpus is too large for in-memory toys but doesn't justify a hosted vendor:
Internal Documentation Search
Index your company's Confluence, SharePoint, internal wikis, and PDF playbooks into a single semantic search backend. AST-aware chunking handles code snippets in docs; markdown chunking respects header hierarchy. Deploy as a REST endpoint behind your VPN, an MCP server for engineers using Claude, or an IPC server for desktop integrations — all from the same artefact.
Code Search Across Monorepos
Magpie ships AST-aware chunkers for Python, JavaScript, TypeScript, Rust, Java, C#, and R. Index a 5 million-line monorepo, then query it with natural language: "find functions that retry HTTP calls with exponential backoff" returns the actual implementations, not just keyword matches. Hybrid retrieval (cosine + BM25) means exact identifier searches still work.
Regulated Document Retrieval (Pharma, Legal, Finance)
The streaming SQLite-backed indexer keeps documents on-premise — nothing leaves the machine. Combine with local Ollama embeddings for full air-gap operation. Useful for GMP-validated documentation, contract retrieval, regulatory filings, and SOPs where sending content to a hosted vector DB is a non-starter.
RAG Backends for Customer-Facing Products
Build a Q&A endpoint over your product documentation. Magpie's optional Q&A layer supports HyDE, Multi-Query, and Corrective RAG patterns out of the box, with provider switching between Ollama, OpenAI, Anthropic, Groq, and Together. The WASM build means you can ship the entire retrieval system to the browser for offline use.
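To make one of those patterns concrete: HyDE asks an LLM to draft a hypothetical answer first, then searches with the embedding of that draft, on the theory that answer-shaped text lands closer to the relevant passages than the raw question does. A schematic sketch with stubbed LLM and embedder calls; none of these function names are Magpie's API:

```rust
// Schematic HyDE flow. The stubs keep the sketch standalone; a real
// pipeline wires them to Ollama, OpenAI, etc.

fn llm_draft_answer(query: &str) -> String {
    // Stub: a real call asks the LLM to write a plausible answer passage.
    format!("A hypothetical passage answering: {query}")
}

fn embed(text: &str) -> Vec<f32> {
    // Stub: a real call hits an embedding model. This toy version makes
    // the ranking meaningless; it exists only so the flow executes.
    text.bytes().take(8).map(|b| b as f32 / 255.0).collect()
}

fn cosine(a: &[f32], b: &[f32]) -> f32 {
    let dot: f32 = a.iter().zip(b).map(|(x, y)| x * y).sum();
    let norm = |v: &[f32]| v.iter().map(|x| x * x).sum::<f32>().sqrt();
    dot / (norm(a) * norm(b) + 1e-9)
}

fn main() {
    let index = vec![
        ("docs/retries.md", embed("retry failed HTTP calls with backoff")),
        ("docs/auth.md", embed("token refresh and OAuth flows")),
    ];

    // HyDE's one trick: embed a hypothetical *answer*, not the raw question.
    let query = "how do we retry failed HTTP calls?";
    let hyde_vec = embed(&llm_draft_answer(query));

    let best = index
        .iter()
        .max_by(|a, b| {
            cosine(&hyde_vec, &a.1)
                .partial_cmp(&cosine(&hyde_vec, &b.1))
                .unwrap()
        })
        .unwrap();
    println!("top hit: {}", best.0);
}
```

Multi-Query and Corrective RAG are variations on the same skeleton: issue several reformulated searches, or re-check retrieved passages and retry when relevance is poor.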
Recommendation & Discovery Engines
The CRAN clustering experiment is a prototype for any "find similar items" problem: research papers, product catalogs, support tickets, scientific datasets. The Louvain community detection on the k-NN graph reveals natural groupings without manual taxonomy work — a recommendation system that updates as your catalog grows.
Long-Running Embedding Pipelines
If you've ever lost 8 hours of GPU time to an OOM kill at 95% completion, this is the problem Magpie was designed to solve. Atomic checkpoints mean a crash costs you the last batch, not the run. Resume is implicit — reopen the same path and continue from where you left off. Useful for any embedding-intensive workflow where compute time costs real money.
Edge & Offline Search
The WebAssembly build ships the entire vector database to the browser. Useful for documentation sites that need offline search, technical product manuals embedded in industrial control panels, and field-service apps that work without connectivity. The same code that runs on a server runs in a tab — no API rewrites.
When to Reach for It
Magpie-RS is the right fit if:
- You have between 10k and 10M+ chunks to embed and you want one tool that handles both ends without rewrites.
- You want semantic search over a code or documentation corpus with chunking that respects ASTs and headings, not just newlines.
- You need verifiable, traceable answers with exact source citations (page, line, byte offset) — not just plausible-sounding text.
- You need a deployable artefact, not a notebook — CLI, REST, MCP, IPC, or WASM, picked from a single binary.
- You care that the indexer doesn't lose work when something goes wrong, because something always goes wrong.
It is not the right fit if you need first-class hosted-vendor parity (Pinecone-style filtering DSL, server-side autoscaling) or if your corpus fits in 100 MB of RAM and you're happy with a HashMap. Use the right tool for the job.
Interested in Magpie-RS for Your Organization?
Magpie-RS is currently a private project, available through consulting engagements. If you have a corpus that needs production-grade semantic search — internal documentation, regulated documents, code repositories, or scientific datasets — let's talk about whether it's the right fit.
Engagements typically include integration into your existing infrastructure, custom chunking for your document formats, a deployment shape that matches your security requirements (on-prem, air-gapped, edge), and the traceable-reference layer that builds trust with your stakeholders.
Book a 30-min call with Simon