TLDR: RAG usually fails in retrieval, not generation. Chunking strategy, hybrid search with reciprocal rank fusion, reranking, and query transformation are where most pipelines go wrong. The RAG evaluation triad (faithfulness, answer relevance, context relevance), which RAGAS can score automatically, tells you whether retrieval is working before your users tell you it isn't.
I spent three weeks building a documentation chatbot that confidently gave users wrong answers. Not hallucination in the spooky sci-fi sense, just wrong facts, wrong version numbers, and a feature that had been deprecated six months earlier without the model knowing. The model itself was fine. My retrieval was broken.
That's the thing about RAG — Retrieval-Augmented Generation — that nobody tells you upfront: the generation step is almost never your problem. It's the retrieval. And retrieval is a frontend engineering problem as much as a backend one, because the decisions you make upstream (how you chunk, what metadata you attach, how you query) determine what your users experience on the other end.
RAG is the most practical pattern in applied AI right now. It solves the core LLM limitation — the model doesn't know your data and has a knowledge cutoff — without the cost and complexity of fine-tuning. Here's how to build it properly.
Fine-tuning vs RAG: The honest answer
Every team asks this question. Here's a simple frame:
| | RAG | Fine-tuning |
|---|---|---|
| Data freshness | Real-time | Stale at training time |
| Cost | Low — inference only | High — GPU compute |
| Iteration speed | Fast — update docs | Slow — re-train |
| Explainability | Can cite sources | Black box |
| Best for | Dynamic data, large corpora | Tone, style, domain vocabulary |
Fine-tune for how the model talks. RAG for what the model knows.
If you're building a chatbot over your product's documentation and that documentation changes every sprint — RAG, no question.
The pipeline, end to end
INDEXING (happens offline)
Documents → chunk → embed → store in vector DB
RETRIEVAL (happens per request)
User query → embed → vector search → top-k chunks
GENERATION (happens per request)
[system prompt] + [retrieved chunks] + [user query] → LLM → answer
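To make the three stages concrete, here is a toy sketch in plain Python. Everything in it is an assumption for illustration: `embed` is a bag-of-words counter standing in for a real embedding model, the "vector DB" is a list, and each document is treated as a single chunk.

```python
import math
import re
from collections import Counter


def embed(text: str) -> Counter:
    """Toy stand-in for an embedding model: a bag-of-words count vector."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def build_index(documents: list[str]) -> list[tuple[str, Counter]]:
    """INDEXING: here every document is already one chunk."""
    return [(doc, embed(doc)) for doc in documents]


def retrieve(index: list[tuple[str, Counter]], query: str, k: int = 2) -> list[str]:
    """RETRIEVAL: embed the query, rank chunks by similarity, keep the top k."""
    q = embed(query)
    ranked = sorted(index, key=lambda item: cosine(q, item[1]), reverse=True)
    return [text for text, _ in ranked[:k]]


def build_prompt(system: str, chunks: list[str], query: str) -> str:
    """GENERATION input: system prompt + retrieved chunks + user query."""
    context = "\n".join(f"- {c}" for c in chunks)
    return f"{system}\n\nContext:\n{context}\n\nQuestion: {query}"


index = build_index([
    "Rotate API keys from the dashboard settings page.",
    "Webhooks retry failed deliveries up to five times.",
    "Billing invoices are issued on the first of the month.",
])
top = retrieve(index, "how do webhooks retry", k=1)
prompt = build_prompt("Answer only from the context.", top, "how do webhooks retry")
```

The string in `prompt` is what would be sent to the LLM. A real pipeline swaps each function for a proper component, but the shape of the data flow stays the same.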
Simple on paper. Let's talk about where each step breaks.
Step 1: Indexing
Parsing — the boring part that kills you
Your documents aren't clean. They're PDFs with scanned tables, HTML with nav menus embedded in every page, Word docs with tracked changes, Confluence exports with garbled formatting. Parsing quality is the foundation everything else is built on.
I've spent entire mornings just on this step for a single document type. Watch out for: scanned PDFs (you need OCR, not just text extraction), tables (most parsers mangle them into unreadable row soup), and headers/footers that repeat on every page and pollute your embeddings with noise.
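For the repeated header and footer problem, one rough heuristic is to drop any line that shows up on most pages. The sketch below assumes you already have one text string per page; the function name and the 60% threshold are illustrative choices, and it will not catch footers that vary per page (such as page numbers).

```python
from collections import Counter


def strip_repeated_lines(pages: list[str], min_fraction: float = 0.6) -> list[str]:
    """Remove lines that recur on at least `min_fraction` of pages."""
    page_lines = [[ln.strip() for ln in page.splitlines()] for page in pages]
    # Count each distinct non-empty line once per page.
    counts = Counter(ln for lines in page_lines for ln in set(lines) if ln)
    n = len(pages)
    repeated = {ln for ln, c in counts.items() if c >= 2 and c / n >= min_fraction}
    return ["\n".join(ln for ln in lines if ln not in repeated) for lines in page_lines]


pages = [
    "ACME Corp Confidential\nInstall the CLI.\ndocs.example.com",
    "ACME Corp Confidential\nConfigure the token.\ndocs.example.com",
    "ACME Corp Confidential\nRun the tests.\ndocs.example.com",
]
cleaned = strip_repeated_lines(pages)
```

Run this before chunking so the boilerplate never reaches the embedding step.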
Chunking — where most pipelines fail
This is it. This is where I see teams get the most wrong.
Chunks too small: you lose the context that makes a passage make sense. A sentence about "the retry mechanism" without the surrounding paragraph explaining what is being retried is useless.
Chunks too large: they dilute relevance. You retrieve a 3,000-token block when the answer lives in 50 tokens of it, and the model drowns in irrelevant context trying to find it.
Fixed-size chunking (the baseline):
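# Newer LangChain releases ship the splitters in the langchain_text_splitters package.
from langchain_text_splitters import RecursiveCharacterTextSplitter

splitter = RecursiveCharacterTextSplitter(
    chunk_size=512,     # measured in characters by default, not tokens
    chunk_overlap=64,   # also characters
    separators=["\n\n", "\n", ". ", " "],
)
chunks = splitter.split_documents(docs)  # docs: a list of LangChain Document objects
The 64-character overlap matters more than you'd expect. Without it, a chunk boundary that splits a sentence in half destroys the meaning on both sides. Note that RecursiveCharacterTextSplitter counts characters by default; if you want sizes in tokens, build the splitter with RecursiveCharacterTextSplitter.from_tiktoken_encoder instead.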
Hierarchical / parent-child chunking (what I'd use in production):
Store parent chunks (full sections) and child chunks (individual paragraphs) separately. Retrieve by child — they're precise. Return the parent — it has full context.
Parent chunk: Full section on "Authentication"
  Child 1: "OAuth 2.0 setup"
  Child 2: "JWT validation"
  Child 3: "Session management"
Query matches Child 2 precisely. You inject the Parent into the prompt so the model has full context. This pattern alone cut my chatbot's hallucination rate significantly.
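A minimal sketch of the "retrieve by child, return the parent" step, assuming each child chunk carries the id of its parent section. The `Child` class and `expand_to_parents` are hypothetical names, not part of any library.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Child:
    id: str
    parent_id: str
    text: str


def expand_to_parents(child_hits: list[Child], parents: dict[str, str]) -> list[str]:
    """Map ranked child hits to their parent sections.

    Keeps the order of the first hit per parent and drops duplicates, so two
    children of the same section inject that section only once.
    """
    seen: set[str] = set()
    context: list[str] = []
    for child in child_hits:
        if child.parent_id not in seen:
            seen.add(child.parent_id)
            context.append(parents[child.parent_id])
    return context


parents = {
    "auth": "Authentication: OAuth 2.0 setup, JWT validation, session management.",
    "billing": "Billing: invoices and payment methods.",
}
hits = [
    Child("c2", "auth", "JWT validation"),
    Child("c3", "auth", "Session management"),
    Child("c9", "billing", "Invoices"),
]
context = expand_to_parents(hits, parents)
```

The deduplication matters: without it, several matching children from one section would spend your context budget on the same parent text repeatedly.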
Metadata — don't skip this
Every chunk needs metadata. Not for fun — for filtering.
{
  "text": "...",
  "metadata": {
    "source": "auth-docs.pdf",
    "page": 12,
    "section": "Authentication",
    "date": "2025-01-15",
    "doc_type": "technical_spec",
    "version": "v2.3"
  }
}
If a user asks about v2 of your API, you want to filter out v1 docs. If they're asking about a feature that launched after a certain date, you can filter by date. Metadata is what makes retrieval precise instead of just directionally correct.
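As a sketch of what that filtering looks like, here is a plain-Python predicate over the metadata shape shown above. In practice you would push the filter into the vector database query; `keep_chunk` and `major_version` are hypothetical helpers, and the version format ("v" followed by dotted numbers) is an assumption.

```python
from datetime import date
from typing import Optional


def major_version(version: str) -> int:
    """Extract the major number from a string like 'v2.3'."""
    return int(version.lstrip("v").split(".")[0])


def keep_chunk(metadata: dict, major: Optional[int] = None,
               since: Optional[date] = None) -> bool:
    """Return True if the chunk passes the version and date filters."""
    if major is not None and major_version(metadata["version"]) != major:
        return False
    if since is not None and date.fromisoformat(metadata["date"]) < since:
        return False
    return True


chunks = [
    {"text": "Legacy token flow", "metadata": {"version": "v1.9", "date": "2023-06-01"}},
    {"text": "Current token flow", "metadata": {"version": "v2.3", "date": "2025-01-15"}},
]
v2_only = [c for c in chunks if keep_chunk(c["metadata"], major=2)]
```

Parsing the major version, rather than matching on a string prefix, avoids "v2" accidentally matching "v20".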
Embedding model choice
Not all embedding models are equal, and the difference shows in production.
| Model | Dims | Quality | Cost |
|---|---|---|---|
| text-embedding-3-small | 1536 | Good | Low |
| text-embedding-3-large | 3072 | Better | Medium |
| Cohere embed-v3 | 1024 | Better (retrieval-tuned) | Medium |
| BGE-M3 (local) | 1024 | Excellent | Free |
Test on your actual data. I've seen text-embedding-3-small beat 3-large on domain-specific retrieval because the smaller model's representations happened to cluster better for that particular vocabulary. Never assume.
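"Test on your actual data" can be as simple as a small labeled query set and recall@k per model. The sketch below assumes you have already run each model and collected the ranked chunk ids per query; the query ids, chunk ids, and model names are made up for illustration.

```python
def recall_at_k(ranked_ids: list[str], relevant_ids: set[str], k: int) -> float:
    """Fraction of the relevant chunks that appear in the top k results."""
    if not relevant_ids:
        return 0.0
    return len(set(ranked_ids[:k]) & relevant_ids) / len(relevant_ids)


def mean_recall(runs: dict[str, list[str]], gold: dict[str, set[str]], k: int) -> float:
    """Average recall@k over every query in the labeled set."""
    return sum(recall_at_k(runs[q], gold[q], k) for q in gold) / len(gold)


# Hand-labeled relevant chunks per query (hypothetical data).
gold = {"q1": {"d1"}, "q2": {"d4", "d5"}}
# Ranked results from two candidate embedding models (hypothetical data).
model_a = {"q1": ["d1", "d2", "d3"], "q2": ["d4", "d9", "d5"]}
model_b = {"q1": ["d2", "d1", "d3"], "q2": ["d9", "d8", "d4"]}

score_a = mean_recall(model_a, gold, k=2)
score_b = mean_recall(model_b, gold, k=2)
```

With this toy data `score_a` is 0.75 and `score_b` is 0.5, so model A wins at k=2. The point is to compare models on your own queries rather than on a public leaderboard.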
Step 2: Retrieval
Vector search is not enough
Pure vector search works great for semantic queries. It completely falls apart for exact matches — product codes, function names, technical error strings. If a user asks about ERR_INVALID_TOKEN_0x43F, a semantic search will retrieve vaguely related content about authentication. You need exact term matching too.
Hybrid search — what I actually ship:
def hybrid_search(query: str, k: int = 10) -> list[Chunk]:
    # Over-fetch from both retrievers so the fusion step has enough candidates.
    vector_results = vector_db.search(embed(query), k=k * 2)
    bm25_results = bm25_index.search(query, k=k * 2)
    # Reciprocal Rank Fusion: combine both ranked lists, then keep the top k.
    return rrf_merge(vector_results, bm25_results)[:k]
Reciprocal Rank Fusion is the merge strategy that works. Each document earns 1 / (rrf_k + rank) from every list it appears in, so a document that appears in both ranked lists gets boosted:
def rrf_merge(list1, list2, rrf_k=60) -> list:
    # rrf_k is the RRF smoothing constant (60 is the usual default),
    # not the number of results to return.
    scores, docs = {}, {}
    for ranked in (list1, list2):
        for rank, doc in enumerate(ranked):
            docs[doc.id] = doc
            scores[doc.id] = scores.get(doc.id, 0) + 1 / (rrf_k + rank + 1)
    # Return the documents themselves, best fused score first.
    return [docs[doc_id] for doc_id in sorted(scores, key=scores.get, reverse=True)]
Reranking — the step teams skip
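After retrieval, you have 20-50 candidate chunks ranked by vector similarity or fused rank. Similarity is a decent proxy for relevance, but a coarse one: the query and the chunk are embedded separately and only compared afterwards. A cross-encoder reranker reads each (query, chunk) pair together and scores it directly: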
from sentence_transformers import CrossEncoder

reranker = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")

def rerank(query: str, chunks: list[Chunk], top_k: int = 5) -> list[Chunk]:
    pairs = [(query, chunk.text) for chunk in chunks]
    scores = reranker.predict(pairs)
    ranked = sorted(zip(chunks, scores), key=lambda x: x[1], reverse=True)
    return [chunk for chunk, _ in ranked[:top_k]]
The two-stage pattern: vector search for recall (get 20-50 candidates fast), reranker for precision (pick the best 5-10 from those). This is materially better than single-stage retrieval.
Query transformation — your query is usually bad
Users don't write retrieval-optimized queries. "How do I set this up?" is a terrible vector search query. "What does error code 403 mean?" will retrieve everything about 403s, not the specific context the user needs.
HyDE (Hypothetical Document Embeddings): Instead of embedding the question, use the LLM to generate a hypothetical perfect answer, then embed that for retrieval. You're searching for the answer space, not the question space.
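A minimal sketch of the HyDE flow, with the LLM and the embedding model passed in as plain callables so the example stays library-agnostic. `generate`, `embed`, and the prompt wording are all assumptions; plug in whatever clients you actually use.

```python
from typing import Callable, Sequence


def hyde_query_vector(
    question: str,
    generate: Callable[[str], str],
    embed: Callable[[str], Sequence[float]],
) -> Sequence[float]:
    """Embed a hypothetical answer instead of the raw question."""
    prompt = (
        "Write a short documentation passage that answers this question:\n"
        f"{question}"
    )
    hypothetical_answer = generate(prompt)
    # The hypothetical answer may be factually wrong. That is acceptable here:
    # it is only used to land near real answer passages in embedding space.
    return embed(hypothetical_answer)
```

The returned vector goes to the vector search in place of the question's own embedding. The generated passage itself should never be shown to the user or injected into the final prompt.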
Query expansion: Generate 3 different phrasings of the user's question and retrieve for all of them, then merge. Catches vocabulary mismatches between user language and documentation language.
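Query expansion can be sketched as "retrieve once per phrasing, then fuse the ranked lists". The example below uses reciprocal rank fusion over any number of lists; `rephrase` (an LLM call in practice) and `search` are injected callables, and both names are hypothetical.

```python
from typing import Callable


def rrf_fuse(ranked_lists: list[list[str]], rrf_k: int = 60) -> list[str]:
    """Fuse several ranked lists of document ids with reciprocal rank fusion."""
    scores: dict[str, float] = {}
    for ranked in ranked_lists:
        for rank, doc_id in enumerate(ranked):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (rrf_k + rank + 1)
    return sorted(scores, key=lambda doc_id: scores[doc_id], reverse=True)


def expanded_search(
    query: str,
    rephrase: Callable[[str], list[str]],
    search: Callable[[str], list[str]],
    top_k: int = 5,
) -> list[str]:
    """Search with the original query plus its rephrasings, then fuse."""
    variants = [query, *rephrase(query)]
    return rrf_fuse([search(v) for v in variants])[:top_k]


# Stub retriever and rephraser, for illustration only.
fake_results = {
    "reset password": ["d1", "d2"],
    "recover account access": ["d3", "d1"],
    "forgot credentials": ["d1", "d3"],
}
result = expanded_search(
    "reset password",
    rephrase=lambda q: ["recover account access", "forgot credentials"],
    search=lambda q: fake_results[q],
)
```

Here `d1` appears in all three lists and ranks first, `d3` appears in two, and `d2` in one, so `result` is `["d1", "d3", "d2"]`.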
Step 3: Generation
Structure your prompt so the model knows exactly what to do with the context:
System: You are a technical support assistant. Answer questions using ONLY
the provided documentation. If the answer isn't in the docs, say
"I don't have that information in the current documentation."
Always cite the document source for any claim you make.
Retrieved context:
<doc id="1" source="api-reference.md" relevance="0.92">
{{ chunk_1 }}
</doc>
<doc id="2" source="quickstart.md" relevance="0.87">
{{ chunk_2 }}
</doc>
User question: {{ query }}
Answer (cite [doc id] for each claim):
The explicit citation instruction does two things: it makes hallucination auditable (if there's no citation, be suspicious), and it forces the model to ground its answer in the retrieved docs.
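The audit itself can be automated with a rough check on the model's output. This sketch assumes citations look like `[1]` or `[doc 1]`, which is my reading of the prompt above rather than a guaranteed output format, and it splits sentences with a naive regex.

```python
import re

CITATION = re.compile(r"\[(?:doc\s*)?(\d+)\]")


def audit_citations(answer: str, provided_ids: set[str]) -> dict:
    """Flag citations to unknown docs and sentences that carry no citation."""
    cited = set(CITATION.findall(answer))
    sentences = [s for s in re.split(r"(?<=[.!?])\s+", answer.strip()) if s]
    return {
        "cited": cited,
        "unknown": cited - provided_ids,  # cited ids that were never retrieved
        "uncited_sentences": [s for s in sentences if not CITATION.search(s)],
    }


report = audit_citations(
    "Tokens expire after one hour [1]. Refresh them with the /refresh "
    "endpoint [doc 2]. Rate limits are unlimited.",
    provided_ids={"1", "2"},
)
```

In this example `report["uncited_sentences"]` contains only the last sentence, which is exactly the kind of claim worth a second look.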
The "lost in the middle" problem: Models attend more strongly to content at the beginning and end of long contexts. The most relevant chunk buried in the middle of 50k tokens of context will underperform. Put your highest-relevance chunks at positions 0 and -1 in the context block.
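One way to get the best chunks to the edges is to deal them out alternately to the front and the back of the context, so relevance falls off toward the middle. This is a sketch of that idea, assuming the input is already sorted from most to least relevant; the function name is hypothetical.

```python
def reorder_for_edges(chunks_by_relevance: list[str]) -> list[str]:
    """Place the most relevant chunks at the start and end of the context."""
    front: list[str] = []
    back: list[str] = []
    for i, chunk in enumerate(chunks_by_relevance):
        # Even ranks fill the front, odd ranks fill the back.
        (front if i % 2 == 0 else back).append(chunk)
    # Reverse the back half so the second-best chunk ends up last.
    return front + back[::-1]


ordered = reorder_for_edges(["A", "B", "C", "D", "E"])  # A is most relevant
```

For this input `ordered` is `["A", "C", "E", "D", "B"]`: the top two chunks sit at positions 0 and -1, and the weakest chunk sits in the middle.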
What breaks in production
Stale index. If your documentation updates and your index doesn't, users get confident wrong answers. Build incremental indexing — only re-embed documents that have actually changed:
def should_reindex(doc: Document, existing: IndexedDoc) -> bool:
    return doc.updated_at > existing.indexed_at or doc.hash != existing.hash
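The hash comparison needs a stable content hash. A sketch using SHA-256 over whitespace-normalized text, plus a planner that also spots deleted documents; `content_hash` and `plan_reindex` are hypothetical helpers, and normalizing whitespace is a design choice you may not want for code-heavy docs.

```python
import hashlib


def content_hash(text: str) -> str:
    """Hash the text after collapsing whitespace, so reflowed docs don't re-embed."""
    normalized = " ".join(text.split())
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()


def plan_reindex(current: dict[str, str], indexed_hashes: dict[str, str]) -> dict[str, list[str]]:
    """Compare current documents against stored hashes.

    Returns ids to (re-)embed and ids to delete from the index.
    """
    to_embed = [
        doc_id for doc_id, text in current.items()
        if indexed_hashes.get(doc_id) != content_hash(text)
    ]
    to_delete = [doc_id for doc_id in indexed_hashes if doc_id not in current]
    return {"embed": sorted(to_embed), "delete": sorted(to_delete)}
```

Deletions are easy to forget: a page removed from the docs keeps answering questions until its vectors are removed too.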
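Embedding version lock. Vectors produced by different embedding models live in different vector spaces, so when you upgrade your embedding model, every vector already in your index is incompatible with the new one. If the dimensions differ, queries fail outright; if they happen to match, nothing errors and retrieval quietly returns near-random results. If you upgrade the embedding model, you re-embed the entire corpus. Plan for it.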
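A cheap guard is to record which model built the index and refuse to query it with anything else. The manifest shape and names below are assumptions for illustration; most vector databases let you store this kind of information as collection-level metadata.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass(frozen=True)
class IndexManifest:
    embedding_model: str
    dimensions: int


class EmbeddingMismatch(RuntimeError):
    """Raised when a query vector does not match the model that built the index."""


def check_compatible(manifest: IndexManifest, query_model: str,
                     query_vector: Sequence[float]) -> None:
    """Fail loudly instead of returning meaningless neighbors."""
    if query_model != manifest.embedding_model:
        raise EmbeddingMismatch(
            f"index built with {manifest.embedding_model!r}, query uses {query_model!r}"
        )
    if len(query_vector) != manifest.dimensions:
        raise EmbeddingMismatch(
            f"expected {manifest.dimensions} dimensions, got {len(query_vector)}"
        )
```

Call `check_compatible` before every search, or once at startup, so a model upgrade surfaces as an explicit error rather than a slow decline in answer quality.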
Multi-tenancy. If you're serving multiple customers from one system, always filter by tenant at the retrieval step. A metadata filter is the minimum viable isolation:
results = vector_db.search(
    query_vector=embed(query),
    filter={"tenant_id": current_user.tenant_id},
    k=10,
)
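Relying on every call site to remember the filter is fragile. One option is a wrapper that injects the tenant filter itself, sketched here against a generic `search(query_vector=..., filter=..., k=...)` callable. That signature mirrors the snippet above and is an assumption, not a specific database's API.

```python
from typing import Callable, Optional


def tenant_scoped(search: Callable[..., list], tenant_id: str) -> Callable[..., list]:
    """Wrap a search function so every call is filtered to one tenant."""
    def scoped(query_vector, k: int = 10, filter: Optional[dict] = None) -> list:
        merged = dict(filter or {})
        # Set last so a caller-supplied tenant_id can never override it.
        merged["tenant_id"] = tenant_id
        return search(query_vector=query_vector, filter=merged, k=k)
    return scoped
```

Application code then only ever sees the scoped function, which makes a cross-tenant query something you have to go out of your way to write.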
Evaluation — don't skip this
The RAG triad is what I use to measure system quality:
| Dimension | What it checks |
|---|---|
| Faithfulness | Is the answer grounded in retrieved docs? |
| Answer relevance | Does the answer address the question? |
| Context relevance | Were the right docs retrieved? |
RAGAS automates this with LLM-as-judge scoring. Run it on a test set of 50-100 queries before shipping and add it to CI. The teams who skip evals don't know their RAG is quietly failing until users complain.
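Adding evals to CI mostly means turning per-query scores into a pass/fail gate. The sketch below assumes you already have a score between 0 and 1 per query for each dimension, from RAGAS or any other judge; the threshold values are placeholders to tune, not recommendations.

```python
from statistics import mean

# Placeholder thresholds (assumptions); calibrate them on your own test set.
THRESHOLDS = {"faithfulness": 0.8, "answer_relevance": 0.8, "context_relevance": 0.7}


def failing_dimensions(rows: list[dict[str, float]],
                       thresholds: dict[str, float]) -> dict[str, float]:
    """Return the dimensions whose average score falls below its threshold."""
    averages = {dim: mean(row[dim] for row in rows) for dim in thresholds}
    return {dim: avg for dim, avg in averages.items() if avg < thresholds[dim]}


# Hypothetical per-query scores from an eval run.
rows = [
    {"faithfulness": 0.90, "answer_relevance": 0.85, "context_relevance": 0.5},
    {"faithfulness": 0.95, "answer_relevance": 0.90, "context_relevance": 0.7},
]
failures = failing_dimensions(rows, THRESHOLDS)
```

With this toy data only `context_relevance` fails, which points at retrieval rather than generation. In CI you would exit non-zero whenever `failures` is not empty.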
RAG systems fail silently. A chatbot that returns wrong information with high confidence is worse than one that says "I don't know" — users trust the confident wrong answer. The difference between a trustworthy RAG system and a liability is almost always in the retrieval layer, not the model. Invest in chunking strategy, ship hybrid search, and measure with evals before you let users near it.