RAG (Retrieval‑Augmented Generation) retrieves evidence before generation and injects it into the model's context so answers can be grounded in cited sources. It offers freshness (the index updates incrementally), control (sources are traceable) and cost efficiency, complementing pure fine‑tuning. This engineering guide walks the end‑to‑end path from indexing to production.
1) When is RAG better than fine‑tuning?
- Rapidly changing knowledge: laws, prices, docs change often — update the index instead of retraining.
- Provenance required: citations and auditability are mandatory.
- Long‑tail Q&A: broad questions with sparse data; retrieval is more robust.
- Privacy isolation: keep private data in the index, not in model weights.
2) Before indexing: cleaning and chunking
Chunking sets your recall ceiling. Recommended pipeline: extract main text → strip styles/scripts → structure headings → chunk → de‑duplicate and de‑noise → add metadata. Common strategies:
- Fixed windows with overlap: e.g., 500–800 tokens with 50–100 tokens of overlap (a minimal sketch follows this list).
- Semantic paragraphs: cut by headings/paragraphs; helps re‑ranking.
- Multi‑view chunks: index both short and long windows to improve robustness.
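A minimal sketch of the fixed‑window strategy. The window/overlap values are the ranges suggested above, and token counts are approximated by whitespace‑split words; a real pipeline would count tokens with the embedding model's own tokenizer.

```python
# Fixed-window chunking with overlap. "Tokens" are approximated by words here.
def chunk_fixed(text: str, window: int = 600, overlap: int = 80) -> list[str]:
    words = text.split()
    chunks: list[str] = []
    step = window - overlap
    for start in range(0, len(words), step):
        piece = words[start:start + window]
        if piece:
            chunks.append(" ".join(piece))
        if start + window >= len(words):
            break
    return chunks

# Example: a long document becomes overlapping chunks ready for embedding.
doc = "heading one " * 1000
print(len(chunk_fixed(doc)))  # -> 4 chunks for this toy input
```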
3) Embeddings and indexing: sparse/dense/hybrid
- Dense vectors: map chunks to embedding space; great for semantic recall. Use FAISS, Milvus, PGVector or Weaviate.
- Sparse BM25: strong for exact terms and entities; complementary to dense.
- Hybrid retrieval: combine normalized BM25 and dense scores (see the fusion sketch after this list); excels in terminology‑heavy domains.
- Multi‑channel recall: query expansion, synonyms and spell‑correction.
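A sketch of weighted score fusion for hybrid retrieval. The score dictionaries and the `alpha` weight are assumptions: `bm25_scores` and `dense_scores` stand in for whatever your sparse and dense retrievers return, keyed by document ID.

```python
# Hybrid fusion: min-max normalize each channel, then take a weighted sum.
def normalize(scores: dict[str, float]) -> dict[str, float]:
    lo, hi = min(scores.values()), max(scores.values())
    span = (hi - lo) or 1.0
    return {doc_id: (s - lo) / span for doc_id, s in scores.items()}

def hybrid_merge(bm25_scores, dense_scores, alpha=0.5, k=10):
    b, d = normalize(bm25_scores), normalize(dense_scores)
    ids = set(b) | set(d)
    fused = {i: alpha * b.get(i, 0.0) + (1 - alpha) * d.get(i, 0.0) for i in ids}
    return sorted(fused.items(), key=lambda x: x[1], reverse=True)[:k]

# Raise alpha in terminology-heavy domains where exact term matches matter more.
print(hybrid_merge({"a": 12.0, "b": 7.5}, {"b": 0.82, "c": 0.77}, alpha=0.4))
```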
4) Re‑ranking and de‑noising
First‑stage recall is noisy. Use a cross‑encoder (e.g., bge‑reranker, Cohere Rerank) to re‑rank the top‑k candidates, then de‑noise (a sketch follows this list):
- Near‑duplicate removal with Jaccard/MinHash or cosine thresholds.
- “Similar but wrong” filter: compare gist summaries, drop off‑topic chunks.
- Metadata filters: source/time/tags to remove outdated or low‑trust content.
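A sketch combining cross‑encoder re‑ranking with Jaccard‑based near‑duplicate removal, assuming the `sentence-transformers` CrossEncoder interface; the model name and similarity threshold are illustrative.

```python
# Re-rank with a cross-encoder, then drop near-duplicate chunks.
from sentence_transformers import CrossEncoder

def jaccard(a: set, b: set) -> float:
    return len(a & b) / max(len(a | b), 1)

def rerank_and_dedup(query: str, chunks: list[str], top_k: int = 5,
                     sim_thresh: float = 0.8) -> list[str]:
    model = CrossEncoder("BAAI/bge-reranker-base")  # illustrative reranker; load once in practice
    scores = model.predict([(query, c) for c in chunks])
    ranked = [c for _, c in sorted(zip(scores, chunks), key=lambda x: x[0], reverse=True)]
    kept: list[str] = []
    for c in ranked:
        # Keep a chunk only if it is not a near-duplicate of something already kept.
        if all(jaccard(set(c.lower().split()), set(k.lower().split())) < sim_thresh for k in kept):
            kept.append(c)
        if len(kept) == top_k:
            break
    return kept
```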
5) Query rewriting and multi‑hop retrieval
Users often ask short or ambiguous questions. Let a small model expand the query with missing constraints (time/region/model), or generate several candidate rewrites. For complex tasks, retrieve in multiple hops: a first hop for background, a second hop to refine the question, then a final pass to collect the evidence.
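A query‑expansion sketch, assuming the OpenAI Python client as the rewriter; any small instruction‑tuned model can fill that role, and the model name is a placeholder.

```python
# Ask a small model for several rewrites that add likely missing constraints.
from openai import OpenAI

client = OpenAI()

def expand_query(question: str, n: int = 3) -> list[str]:
    prompt = (
        f"Rewrite the question below into {n} search queries. Add likely missing "
        f"constraints (time, region, product/model) when they are implied. "
        f"Return one query per line.\n\nQuestion: {question}"
    )
    resp = client.chat.completions.create(
        model="gpt-4o-mini",  # placeholder model name
        messages=[{"role": "user", "content": prompt}],
        temperature=0.3,
    )
    lines = [l.strip("-• ").strip() for l in resp.choices[0].message.content.splitlines()]
    return [l for l in lines if l][:n]

# Retrieve with the original query plus every rewrite, then merge the candidates.
```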
6) Feeding evidence to the model: prompting and hallucination control
- Structured context: list the “sources” first (links/filenames/time), then the “evidence”, then the instruction “answer only using the evidence; do not invent” (see the prompt sketch after this list).
- Citations: mark evidence as [1][2]… in the answer for audit.
- Refusal policy: if evidence is insufficient, answer “uncertain” with next‑step suggestions.
- Answer structure: conclusion → evidence → limits/uncertainty; avoid pretty but empty prose.
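A sketch of assembling that structured, citation‑friendly context. The evidence schema (dicts with `source`, `date`, `text` keys) is an assumption, not a fixed format.

```python
# Build a prompt that lists sources, numbered evidence, and a strict instruction.
def build_prompt(question: str, evidence: list[dict]) -> str:
    sources = "\n".join(
        f"[{i + 1}] {e['source']} ({e['date']})" for i, e in enumerate(evidence)
    )
    passages = "\n\n".join(
        f"[{i + 1}] {e['text']}" for i, e in enumerate(evidence)
    )
    return (
        "Answer the question using ONLY the evidence below. Cite evidence as [1][2]... "
        "If the evidence is insufficient, say you are uncertain and suggest what to look up next.\n\n"
        f"Sources:\n{sources}\n\nEvidence:\n{passages}\n\nQuestion: {question}\n"
        "Answer (conclusion first, then supporting evidence, then limits/uncertainty):"
    )
```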
7) RAG vs. fine‑tuning: the trade‑off
- RAG: freshness, provenance, long‑tail Q&A; fast to ship, low cost.
- Fine‑tuning: strong formatting/style/tool planning; for private small data, use LoRA/Adapters.
- Hybrid: RAG + light tuning or system prompts is often best value.
8) Evals and gates: don’t trust vibes
- Retrieval: Recall@k, MRR, nDCG; compare before/after query expansion (a metric sketch follows this list).
- Faithfulness: answer derived from evidence only; Citation Precision/Recall.
- Task success: actionable answers; human A/B with significance (p < 0.05).
- Safety: harmful rate/privacy/IP; failed cases become a red‑team suite.
- Gate example: Faithfulness ≥ 0.98, Citation Precision ≥ 0.95, and at least 99% of jailbreak test cases blocked.
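A minimal sketch of Recall@k, MRR and a gate check. The data schema (`results` mapping query ID to a ranked list of chunk IDs, `gold` mapping query ID to the set of relevant IDs) and the recall threshold in `GATES` are illustrative assumptions; the faithfulness and citation thresholds mirror the gate example above.

```python
# Retrieval metrics over a labeled eval set, plus a simple release gate.
def recall_at_k(results: dict, gold: dict, k: int = 10) -> float:
    hits = [len(set(r[:k]) & gold[q]) / max(len(gold[q]), 1) for q, r in results.items()]
    return sum(hits) / len(hits)

def mrr(results: dict, gold: dict) -> float:
    rrs = []
    for q, ranked in results.items():
        rr = 0.0
        for rank, doc_id in enumerate(ranked, start=1):
            if doc_id in gold[q]:
                rr = 1.0 / rank
                break
        rrs.append(rr)
    return sum(rrs) / len(rrs)

# Block the release when any metric misses its floor.
GATES = {"recall@10": 0.85, "faithfulness": 0.98, "citation_precision": 0.95}

def passes_gates(metrics: dict) -> bool:
    return all(metrics.get(name, 0.0) >= floor for name, floor in GATES.items())
```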
9) Engineering optimization checklist
- Caching: cache embeddings and re‑rank results to cut latency and cost (a cache sketch follows this list).
- Index updates: offline batch + online incremental; add soft‑delete and versions.
- Adaptive hybrid weights: adjust BM25/dense weights by terminology density.
- Diversity: drop highly similar chunks to improve coverage.
- Observability: log queries, scores, chunk IDs, final answers and citations for replay.
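A sketch of the embedding‑cache item: hash the normalized text and reuse stored vectors so repeated chunks and queries are never re‑embedded. `embed_fn` and the in‑memory store are assumptions; in production the store would typically be Redis, SQLite or the vector DB itself.

```python
# Content-addressed embedding cache keyed by a hash of the normalized text.
import hashlib

class EmbeddingCache:
    def __init__(self, embed_fn):
        self.embed_fn = embed_fn            # any callable: text -> vector
        self.store: dict[str, list[float]] = {}  # swap for Redis/SQLite in production

    def _key(self, text: str) -> str:
        return hashlib.sha256(text.strip().lower().encode("utf-8")).hexdigest()

    def get(self, text: str) -> list[float]:
        k = self._key(text)
        if k not in self.store:
            self.store[k] = self.embed_fn(text)  # embed only on cache miss
        return self.store[k]

# Usage: cache = EmbeddingCache(embed_fn=my_model.encode); vec = cache.get("some chunk")
```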
10) Common pitfalls
- Chunks too large/small → noise or broken context; use multi‑view + re‑rank.
- Recall/re‑rank mismatch → too few candidates; widen first‑stage to 50–200.
- Context stuffing → exceeds the model's effective attention; keep only the strongest evidence (see the budget‑trimming sketch after this list).
- Ignoring time → old content overrides new rules; filter by time/version.
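A simple guard against context stuffing: take the re‑ranked chunks in order and stop once a token budget is reached. Token counts are approximated by whitespace words; use your model's tokenizer in practice, and the budget value is a placeholder.

```python
# Keep only the strongest evidence that fits within the context budget.
def fit_to_budget(ranked_chunks: list[str], budget_tokens: int = 3000) -> list[str]:
    kept, used = [], 0
    for chunk in ranked_chunks:          # assumed already sorted by re-rank score
        cost = len(chunk.split())        # rough token estimate
        if used + cost > budget_tokens:
            break
        kept.append(chunk)
        used += cost
    return kept
```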
11) Shipping to production
- Offline eval passes gates (retrieval/faithfulness/safety).
- Canary at 1–5% of traffic with monitoring for latency, harmful‑content rate and citation accuracy.
- Ramp gradually; define rollback triggers; add incident cases to the test set.
Further reading
Prompt Engineering · Evals & Launch Gates · Transformer Explained