Every RAG demo retrieves three chunks and calls it done. Production RAG is a different beast. Here’s what actually matters.
Why naive RAG fails
The typical demo:
- Split doc into 512-token chunks
- Embed + store in vector DB
- Retrieve top-3 by cosine similarity
- Stuff into prompt
This breaks in three ways: chunking ignores semantics, sparse signals are discarded, and there’s no quality signal on the retrieved context.
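To make the baseline concrete, here is a minimal sketch of that demo retrieval step, assuming sentence-transformers for the embeddings; the model name and the `naive_retrieve` helper are illustrative, not part of any specific stack:

```python
from sentence_transformers import SentenceTransformer, util

# Illustrative bi-encoder; any embedding model + vector store behaves the same way here.
model = SentenceTransformer("all-MiniLM-L6-v2")

def naive_retrieve(query: str, chunks: list[str], top_k: int = 3) -> list[str]:
    """The demo approach: embed everything, take top-k by cosine similarity, no reranking."""
    chunk_emb = model.encode(chunks, convert_to_tensor=True)
    query_emb = model.encode(query, convert_to_tensor=True)
    scores = util.cos_sim(query_emb, chunk_emb)[0]   # cosine similarity to every chunk
    top = scores.argsort(descending=True)[:top_k]    # indices of the best-scoring chunks
    return [chunks[int(i)] for i in top]
```

Exact keyword matches, rare identifiers, and error strings get no special treatment here, which is one of the gaps the hybrid retrieval section below closes.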
A better chunking strategy
```python
from langchain.text_splitter import RecursiveCharacterTextSplitter

def semantic_chunk(text: str, chunk_size: int = 800, overlap: int = 150):
    """
    Prefer splitting on paragraph boundaries, then sentences, then words.
    Overlap ensures context isn't lost at chunk edges.
    """
    splitter = RecursiveCharacterTextSplitter(
        separators=["\n\n", "\n", ". ", " ", ""],
        chunk_size=chunk_size,
        chunk_overlap=overlap,
        length_function=len,
    )
    return splitter.split_text(text)
```

Smaller chunks → higher precision, lower recall. Larger chunks → higher recall, more noise. Start at 800 tokens with 150 overlap and tune from eval data.
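One nuance: with `length_function=len` the splitter measures characters, not tokens. If you want the 800/150 numbers to be token-accurate, a small tweak does it; this sketch uses tiktoken, and the encoding choice is an assumption you should match to your own embedding/LLM stack:

```python
import tiktoken
from langchain.text_splitter import RecursiveCharacterTextSplitter

# cl100k_base is an assumption; pick the encoding that matches your models.
enc = tiktoken.get_encoding("cl100k_base")

splitter = RecursiveCharacterTextSplitter(
    separators=["\n\n", "\n", ". ", " ", ""],
    chunk_size=800,    # now interpreted as tokens, not characters
    chunk_overlap=150,
    length_function=lambda text: len(enc.encode(text)),
)
```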
Hybrid retrieval: dense + sparse
Vector search alone misses exact keyword matches. BM25 alone misses semantic similarity. Combine them:
```python
from rank_bm25 import BM25Okapi

def hybrid_retrieve(
    query: str,
    dense_results: list[tuple[str, float]],  # (doc, score) from the vector store
    bm25: BM25Okapi,                         # built as BM25Okapi([d.lower().split() for d in corpus])
    corpus: list[str],
    top_k: int = 5,
) -> list[str]:
    """Reciprocal Rank Fusion of dense + BM25 rankings."""
    # BM25 ranking over the whole corpus
    tokenized_query = query.lower().split()
    bm25_scores = bm25.get_scores(tokenized_query)
    bm25_ranked = sorted(enumerate(bm25_scores), key=lambda x: x[1], reverse=True)

    # RRF fusion: each retriever contributes 1 / (k + rank) per document
    rrf_scores: dict[int, float] = {}
    k = 60  # RRF constant

    for rank, (idx, _) in enumerate(bm25_ranked):
        rrf_scores[idx] = rrf_scores.get(idx, 0) + 1 / (k + rank)

    for rank, (doc, _) in enumerate(dense_results):
        doc_idx = corpus.index(doc) if doc in corpus else -1
        if doc_idx >= 0:
            rrf_scores[doc_idx] = rrf_scores.get(doc_idx, 0) + 1 / (k + rank)

    ranked = sorted(rrf_scores.items(), key=lambda x: x[1], reverse=True)
    return [corpus[idx] for idx, _ in ranked[:top_k]]
```

Reranking with a cross-encoder
After retrieval, re-rank with a cross-encoder — it’s expensive but worth it for the top-5:
```python
from sentence_transformers import CrossEncoder

reranker = CrossEncoder('cross-encoder/ms-marco-MiniLM-L-6-v2')

def rerank(query: str, candidates: list[str], top_k: int = 3) -> list[str]:
    pairs = [(query, doc) for doc in candidates]
    scores = reranker.predict(pairs)
    ranked = sorted(zip(candidates, scores), key=lambda x: x[1], reverse=True)
    return [doc for doc, _ in ranked[:top_k]]
```

Cross-encoders are ~10–50× slower than bi-encoders. Only apply them to the top-20 candidates from initial retrieval.
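Putting the stages together looks roughly like this. It is a sketch of how the pieces might compose, not a specific library's API: `corpus` is your list of chunks, and `dense_search` stands in for whatever query you run against your vector store.

```python
# Assumed setup: `corpus` is the chunk list; dense_search(query, top_k) returns
# (doc, score) pairs from your vector store. Both are placeholders for your own stack.
bm25 = BM25Okapi([doc.lower().split() for doc in corpus])

def retrieve_context(query: str) -> list[str]:
    dense_results = dense_search(query, top_k=20)                 # stage 0: vector store
    candidates = hybrid_retrieve(query, dense_results,
                                 bm25, corpus, top_k=20)          # stage 1: RRF fusion
    return rerank(query, candidates, top_k=3)                     # stage 2: cross-encoder
```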
Evaluating retrieval quality
Don’t ship without evals:
```python
def context_precision(retrieved: list[str], relevant: set[str]) -> float:
    """What fraction of retrieved docs are actually relevant?"""
    hits = sum(1 for doc in retrieved if doc in relevant)
    return hits / len(retrieved) if retrieved else 0.0

def context_recall(retrieved: list[str], relevant: set[str]) -> float:
    """What fraction of relevant docs did we retrieve?"""
    hits = sum(1 for doc in relevant if doc in retrieved)
    return hits / len(relevant) if relevant else 0.0
```

Build a golden dataset of 50–100 (query, relevant_docs) pairs before you tune any hyperparameter.
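A minimal harness over that golden set might look like the following; the `golden` dataset and the `retrieve` callable (e.g. `retrieve_context` above) are assumptions standing in for your own data and pipeline:

```python
def run_eval(
    retrieve,                            # retrieval function: query -> list of docs
    golden: list[tuple[str, set[str]]],  # hand-curated (query, relevant_docs) pairs,
                                         # using the same doc representation retrieval returns
) -> tuple[float, float]:
    """Mean context precision and recall over the golden set."""
    precisions, recalls = [], []
    for query, relevant in golden:
        retrieved = retrieve(query)
        precisions.append(context_precision(retrieved, relevant))
        recalls.append(context_recall(retrieved, relevant))
    return sum(precisions) / len(precisions), sum(recalls) / len(recalls)
```

Re-run it after every chunking or retrieval change, so you trade precision against recall deliberately rather than by feel.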
The production checklist
- Semantic chunking with overlap
- Hybrid retrieval (dense + BM25)
- Cross-encoder reranking on top-20
- Eval dataset with precision/recall metrics
- Metadata filtering (date, source, category)
- Guardrails: cite sources, handle “I don’t know”
That last point matters more than any retrieval trick. A system that confidently hallucinates is worse than one that says it doesn’t know.
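In practice that guardrail is mostly prompt and output discipline. Here is a minimal sketch of the generation step; the prompt wording and `call_llm` are assumptions rather than any particular provider's API:

```python
ANSWER_PROMPT = """Answer the question using only the numbered context passages below.
Cite the passage numbers you used, like [1] or [2][3].
If the passages do not contain the answer, reply exactly: "I don't know based on the provided documents."

Context:
{context}

Question: {question}
"""

def answer(question: str, passages: list[str]) -> str:
    context = "\n".join(f"[{i + 1}] {p}" for i, p in enumerate(passages))
    # call_llm is a placeholder for your model client (OpenAI, Anthropic, local, ...).
    return call_llm(ANSWER_PROMPT.format(context=context, question=question))
```

Pair this with the eval harness above so you can track how often the system abstains versus how often it hallucinates.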