
Building a Production RAG Pipeline

From naive chunking to hybrid retrieval: the engineering decisions that separate toy demos from production systems.

Wesley Sum · 3 min read

Every RAG demo retrieves three chunks and calls it done. Production RAG is a different beast. Here’s what actually matters.

Why naive RAG fails

The typical demo:

  1. Split doc into 512-token chunks
  2. Embed + store in vector DB
  3. Retrieve top-3 by cosine similarity
  4. Stuff into prompt

This breaks in three ways: chunking ignores semantics, sparse signals are discarded, and there’s no quality signal on the retrieved context.

A better chunking strategy

from langchain.text_splitter import RecursiveCharacterTextSplitter

def semantic_chunk(text: str, chunk_size: int = 800, overlap: int = 150):
    """
    Prefer splitting on paragraph boundaries, then sentences, then words.
    Overlap ensures context isn't lost at chunk edges.
    """
    splitter = RecursiveCharacterTextSplitter(
        separators=["\n\n", "\n", ". ", " ", ""],
        chunk_size=chunk_size,
        chunk_overlap=overlap,
        length_function=len,
    )
    return splitter.split_text(text)

Chunk size matters

Smaller chunks → higher precision, lower recall. Larger chunks → higher recall, more noise. Start at 800 with 150 overlap and tune from eval data. Note that the splitter above measures size in characters (length_function=len), not tokens; if your budget is token-denominated, pass a token counter as the length function.
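
A minimal sketch of token-based sizing, assuming tiktoken is installed and that the cl100k_base encoding is close enough to your embedding model's tokenizer:

import tiktoken
from langchain.text_splitter import RecursiveCharacterTextSplitter

enc = tiktoken.get_encoding("cl100k_base")

def token_len(text: str) -> int:
    # Count tokens instead of characters
    return len(enc.encode(text))

splitter = RecursiveCharacterTextSplitter(
    separators=["\n\n", "\n", ". ", " ", ""],
    chunk_size=800,        # now measured in tokens
    chunk_overlap=150,
    length_function=token_len,
)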

Hybrid retrieval: dense + sparse

Vector search alone misses exact keyword matches. BM25 alone misses semantic similarity. Combine them:

from rank_bm25 import BM25Okapi

def hybrid_retrieve(
    query: str,
    dense_results: list[tuple[str, float]],  # (doc, score) from vector search
    bm25: BM25Okapi,
    corpus: list[str],
    alpha: float = 0.5,   # weight on the dense ranking; 0.5 = equal weight
    top_k: int = 5,
) -> list[str]:
    """Weighted Reciprocal Rank Fusion of dense + BM25 rankings."""
    # BM25 ranking (whitespace tokenization must match how the index was built)
    tokenized_query = query.lower().split()
    bm25_scores = bm25.get_scores(tokenized_query)
    bm25_ranked = sorted(enumerate(bm25_scores), key=lambda x: x[1], reverse=True)

    # RRF fusion: each list contributes weight / (k + rank) for every doc it ranks
    rrf_scores: dict[int, float] = {}
    k = 60  # RRF constant
    for rank, (idx, _) in enumerate(bm25_ranked, start=1):
        rrf_scores[idx] = rrf_scores.get(idx, 0.0) + (1 - alpha) / (k + rank)
    for rank, (doc, _) in enumerate(dense_results, start=1):
        # Map the dense hit back to its position in the corpus
        doc_idx = corpus.index(doc) if doc in corpus else -1
        if doc_idx >= 0:
            rrf_scores[doc_idx] = rrf_scores.get(doc_idx, 0.0) + alpha / (k + rank)

    ranked = sorted(rrf_scores.items(), key=lambda x: x[1], reverse=True)
    return [corpus[idx] for idx, _ in ranked[:top_k]]
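
Wiring it up means building the BM25 index over the same chunks the dense index holds. A sketch, where vector_store.search is a stand-in for whatever your vector DB's query call actually looks like:

corpus = semantic_chunk(raw_text)                         # chunks from earlier
bm25 = BM25Okapi([doc.lower().split() for doc in corpus])

dense_results = vector_store.search(query, k=20)          # hypothetical: returns (doc, score) pairs
candidates = hybrid_retrieve(query, dense_results, bm25, corpus, top_k=20)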

Reranking with a cross-encoder

After retrieval, re-rank with a cross-encoder — it’s expensive but worth it for the top-5:

from sentence_transformers import CrossEncoder

reranker = CrossEncoder('cross-encoder/ms-marco-MiniLM-L-6-v2')

def rerank(query: str, candidates: list[str], top_k: int = 3) -> list[str]:
    pairs = [(query, doc) for doc in candidates]
    scores = reranker.predict(pairs)
    ranked = sorted(zip(candidates, scores), key=lambda x: x[1], reverse=True)
    return [doc for doc, _ in ranked[:top_k]]

Warning

Cross-encoders are ~10–50× slower than bi-encoders. Only apply them to the top-20 candidates from initial retrieval.
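
Put together, the retrieval path is: hybrid retrieval for a wide top-20, then the cross-encoder narrows it to the final context. A sketch reusing the functions above (vector_store is again a stand-in for your dense index):

def retrieve_context(query: str, bm25: BM25Okapi, corpus: list[str]) -> list[str]:
    # Wide, cheap first pass
    dense_results = vector_store.search(query, k=20)
    candidates = hybrid_retrieve(query, dense_results, bm25, corpus, top_k=20)
    # Expensive, precise second pass on the 20 survivors only
    return rerank(query, candidates, top_k=5)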

Evaluating retrieval quality

Don’t ship without evals:

def context_precision(retrieved: list[str], relevant: set[str]) -> float:
    """What fraction of retrieved docs are actually relevant?"""
    hits = sum(1 for doc in retrieved if doc in relevant)
    return hits / len(retrieved) if retrieved else 0.0

def context_recall(retrieved: list[str], relevant: set[str]) -> float:
    """What fraction of relevant docs did we retrieve?"""
    hits = sum(1 for doc in relevant if doc in retrieved)
    return hits / len(relevant) if relevant else 0.0

Build a golden dataset of 50–100 (query, relevant_docs) pairs before you tune any hyperparameter.
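
A harness over that dataset is only a few lines. A sketch, where retrieve is whichever retrieval stack (naive, hybrid, hybrid + rerank) you are comparing:

from typing import Callable

def evaluate(
    golden: list[tuple[str, set[str]]],       # hand-labeled (query, relevant_docs) pairs
    retrieve: Callable[[str], list[str]],
) -> dict[str, float]:
    precisions, recalls = [], []
    for query, relevant in golden:
        retrieved = retrieve(query)
        precisions.append(context_precision(retrieved, relevant))
        recalls.append(context_recall(retrieved, relevant))
    return {
        "context_precision": sum(precisions) / len(precisions),
        "context_recall": sum(recalls) / len(recalls),
    }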

The production checklist

  • Semantic chunking with overlap
  • Hybrid retrieval (dense + BM25)
  • Cross-encoder reranking on top-20
  • Eval dataset with precision/recall metrics
  • Metadata filtering (date, source, category)
  • Guardrails: cite sources, handle “I don’t know”

That last point matters more than any retrieval trick. A system that confidently hallucinates is worse than one that says it doesn’t know.
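
One lightweight way to enforce that last guardrail is in the generation prompt: tag each retrieved chunk with an id the model must cite, and give it an explicit way out. A sketch, not a drop-in template:

GUARDRAIL_PROMPT = """Answer the question using only the context below.
Cite the id of every chunk you rely on, like [doc-3].
If the context does not contain the answer, reply exactly:
"I don't know based on the provided documents."

Context:
{context}

Question: {question}"""

def build_prompt(question: str, chunks: list[str]) -> str:
    # Tag each chunk with an id the model can cite back
    context = "\n\n".join(f"[doc-{i}] {chunk}" for i, chunk in enumerate(chunks))
    return GUARDRAIL_PROMPT.format(context=context, question=question)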