Every RAG demo retrieves three chunks and calls it done. Production RAG is a different beast. Here’s what actually matters.
Why naive RAG fails
The typical demo:
- Split doc into 512-token chunks
- Embed + store in vector DB
- Retrieve top-3 by cosine similarity
- Stuff into prompt
This breaks in three ways: chunking ignores semantics, sparse signals are discarded, and there’s no quality signal on the retrieved context.
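To make the baseline concrete, here is a minimal sketch of that demo retrieval step, assuming sentence-transformers for the embeddings; the model name and the `naive_retrieve` helper are illustrative, not part of any specific stack:

```python
from sentence_transformers import SentenceTransformer, util

# Illustrative bi-encoder; any embedding model + vector store behaves the same way here.
model = SentenceTransformer("all-MiniLM-L6-v2")

def naive_retrieve(query: str, chunks: list[str], top_k: int = 3) -> list[str]:
    """The demo approach: embed everything, take top-k by cosine similarity, no reranking."""
    chunk_emb = model.encode(chunks, convert_to_tensor=True)
    query_emb = model.encode(query, convert_to_tensor=True)
    scores = util.cos_sim(query_emb, chunk_emb)[0]   # cosine similarity to every chunk
    top = scores.argsort(descending=True)[:top_k]    # indices of the best-scoring chunks
    return [chunks[int(i)] for i in top]
```

Exact keyword matches, rare identifiers, and error strings get no special treatment here, which is one of the gaps the hybrid retrieval section below closes.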
A better chunking strategy
```python
from langchain.text_splitter import RecursiveCharacterTextSplitter

def semantic_chunk(text: str, chunk_size: int = 800, overlap: int = 150):
    """
    Prefer splitting on paragraph boundaries, then sentences, then words.
    Overlap ensures context isn't lost at chunk edges.
    """
    splitter = RecursiveCharacterTextSplitter(
        separators=["\n\n", "\n", ". ", " ", ""],
        chunk_size=chunk_size,
        chunk_overlap=overlap,
        length_function=len,
    )
    return splitter.split_text(text)
```

Smaller chunks → higher precision, lower recall. Larger chunks → higher recall, more noise. Start at 800 tokens with 150 overlap and tune from eval data.
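One nuance: with `length_function=len` the splitter measures characters, not tokens. If you want the 800/150 numbers to be token-accurate, a small tweak does it; this sketch uses tiktoken, and the encoding choice is an assumption you should match to your own embedding/LLM stack:

```python
import tiktoken
from langchain.text_splitter import RecursiveCharacterTextSplitter

# cl100k_base is an assumption; pick the encoding that matches your models.
enc = tiktoken.get_encoding("cl100k_base")

splitter = RecursiveCharacterTextSplitter(
    separators=["\n\n", "\n", ". ", " ", ""],
    chunk_size=800,    # now interpreted as tokens, not characters
    chunk_overlap=150,
    length_function=lambda text: len(enc.encode(text)),
)
```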
Hybrid retrieval: dense + sparse
Vector search alone misses exact keyword matches. BM25 alone misses semantic similarity. Combine them:
```python
from rank_bm25 import BM25Okapi

def hybrid_retrieve(
    query: str,
    dense_results: list[tuple[str, float]],  # (doc, score) from the vector store
    bm25: BM25Okapi,                         # built as BM25Okapi([d.lower().split() for d in corpus])
    corpus: list[str],
    top_k: int = 5,
) -> list[str]:
    """Reciprocal Rank Fusion of dense + BM25 rankings."""
    # BM25 ranking over the whole corpus
    tokenized_query = query.lower().split()
    bm25_scores = bm25.get_scores(tokenized_query)
    bm25_ranked = sorted(enumerate(bm25_scores), key=lambda x: x[1], reverse=True)

    # RRF fusion: each retriever contributes 1 / (k + rank) per document
    rrf_scores: dict[int, float] = {}
    k = 60  # RRF constant

    for rank, (idx, _) in enumerate(bm25_ranked):
        rrf_scores[idx] = rrf_scores.get(idx, 0) + 1 / (k + rank)

    for rank, (doc, _) in enumerate(dense_results):
        doc_idx = corpus.index(doc) if doc in corpus else -1
        if doc_idx >= 0:
            rrf_scores[doc_idx] = rrf_scores.get(doc_idx, 0) + 1 / (k + rank)

    ranked = sorted(rrf_scores.items(), key=lambda x: x[1], reverse=True)
    return [corpus[idx] for idx, _ in ranked[:top_k]]
```

Reranking with a cross-encoder
After retrieval, re-rank with a cross-encoder — it’s expensive but worth it for the top-5:
```python
from sentence_transformers import CrossEncoder

reranker = CrossEncoder('cross-encoder/ms-marco-MiniLM-L-6-v2')

def rerank(query: str, candidates: list[str], top_k: int = 3) -> list[str]:
    pairs = [(query, doc) for doc in candidates]
    scores = reranker.predict(pairs)
    ranked = sorted(zip(candidates, scores), key=lambda x: x[1], reverse=True)
    return [doc for doc, _ in ranked[:top_k]]
```

Cross-encoders are ~10–50× slower than bi-encoders. Only apply them to the top-20 candidates from initial retrieval.
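Putting the stages together looks roughly like this. It is a sketch of how the pieces might compose, not a specific library's API: `corpus` is your list of chunks, and `dense_search` stands in for whatever query you run against your vector store.

```python
# Assumed setup: `corpus` is the chunk list; dense_search(query, top_k) returns
# (doc, score) pairs from your vector store. Both are placeholders for your own stack.
bm25 = BM25Okapi([doc.lower().split() for doc in corpus])

def retrieve_context(query: str) -> list[str]:
    dense_results = dense_search(query, top_k=20)                 # stage 0: vector store
    candidates = hybrid_retrieve(query, dense_results,
                                 bm25, corpus, top_k=20)          # stage 1: RRF fusion
    return rerank(query, candidates, top_k=3)                     # stage 2: cross-encoder
```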
Evaluating retrieval quality
Don’t ship without evals:
```python
def context_precision(retrieved: list[str], relevant: set[str]) -> float:
    """What fraction of retrieved docs are actually relevant?"""
    hits = sum(1 for doc in retrieved if doc in relevant)
    return hits / len(retrieved) if retrieved else 0.0

def context_recall(retrieved: list[str], relevant: set[str]) -> float:
    """What fraction of relevant docs did we retrieve?"""
    hits = sum(1 for doc in relevant if doc in retrieved)
    return hits / len(relevant) if relevant else 0.0
```

Build a golden dataset of 50–100 (query, relevant_docs) pairs before you tune any hyperparameter.
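A minimal harness over that golden set might look like the following; the `golden` dataset and the `retrieve` callable (e.g. `retrieve_context` above) are assumptions standing in for your own data and pipeline:

```python
def run_eval(
    retrieve,                            # retrieval function: query -> list of docs
    golden: list[tuple[str, set[str]]],  # hand-curated (query, relevant_docs) pairs,
                                         # using the same doc representation retrieval returns
) -> tuple[float, float]:
    """Mean context precision and recall over the golden set."""
    precisions, recalls = [], []
    for query, relevant in golden:
        retrieved = retrieve(query)
        precisions.append(context_precision(retrieved, relevant))
        recalls.append(context_recall(retrieved, relevant))
    return sum(precisions) / len(precisions), sum(recalls) / len(recalls)
```

Re-run it after every chunking or retrieval change, so you trade precision against recall deliberately rather than by feel.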
The production checklist
- Semantic chunking with overlap
- Hybrid retrieval (dense + BM25)
- Cross-encoder reranking on top-20
- Eval dataset with precision/recall metrics
- Metadata filtering (date, source, category)
- Guardrails: cite sources, handle “I don’t know”
That last point matters more than any retrieval trick. A system that confidently hallucinates is worse than one that says it doesn’t know.
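In practice that guardrail is mostly prompt and output discipline. Here is a minimal sketch of the generation step; the prompt wording and `call_llm` are assumptions rather than any particular provider's API:

```python
ANSWER_PROMPT = """Answer the question using only the numbered context passages below.
Cite the passage numbers you used, like [1] or [2][3].
If the passages do not contain the answer, reply exactly: "I don't know based on the provided documents."

Context:
{context}

Question: {question}
"""

def answer(question: str, passages: list[str]) -> str:
    context = "\n".join(f"[{i + 1}] {p}" for i, p in enumerate(passages))
    # call_llm is a placeholder for your model client (OpenAI, Anthropic, local, ...).
    return call_llm(ANSWER_PROMPT.format(context=context, question=question))
```

Pair this with the eval harness above so you can track how often the system abstains versus how often it hallucinates.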