Teams typically face a three‑way choice: fine‑tune a large model, deploy a small model, or combine RAG with light tuning. A poor choice inflates the budget or degrades the user experience. This guide lays out the trade‑offs across cost, quality, control, safety, and delivery cadence, and provides a decision tree and an eval checklist.
1) Cost model: one‑off vs. ongoing
- Fine‑tuning (SFT/LoRA): one‑time training cost with periodic refreshes; inference cost depends on the base model. Pays off when the tuned behavior is reused at scale.
- Small models: low training/inference cost but unstable on complex chains; require stronger engineering guardrails.
- RAG + light tuning: index maintenance plus small parameter updates; often the lowest total cost (see the back‑of‑envelope sketch after this list).
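The sketch below compares the three options on amortized monthly cost. All prices, the amortization window, and the traffic figure are hypothetical placeholders, not benchmarks; substitute your own vendor quotes and volume estimates.

```python
# Back-of-envelope monthly cost comparison: amortized one-off cost plus
# ongoing per-request cost. All numbers below are hypothetical placeholders.

def monthly_cost(one_off_usd, amortize_months, per_request_usd, requests_per_month):
    """Amortized one-off cost plus ongoing inference/index cost per month."""
    return one_off_usd / amortize_months + per_request_usd * requests_per_month

REQUESTS = 500_000  # assumed monthly traffic

options = {
    # option: (one-off cost in USD, amortization window in months, cost per request)
    "fine-tune large (LoRA)": (8_000, 6, 0.004),
    "small model":            (1_000, 6, 0.0005),
    "RAG + light tuning":     (2_500, 6, 0.0015),
}

for name, (one_off, months, per_req) in options.items():
    print(f"{name:>24}: ${monthly_cost(one_off, months, per_req, REQUESTS):,.0f}/month")
```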
2) Quality and control
- Fine‑tuned LLMs: better general reasoning and cross‑domain transfer; consistent style, format, and tool use, but beware over‑fitting and data leakage.
- Small models: competitive in narrow domains, weaker on long chains or multi‑tool orchestration; require tight scope boundaries.
- RAG: provenance and citations help compliance; combined with light tuning it yields stable outputs.
3) Data needs and safety
- Volume: 10k–100k instruction pairs is a common range for SFT; for dialog tasks, cover counter‑examples and failure cases.
- Sensitive data: anonymize and tier it; prefer private training environments with access audits (see the sketch after this list).
- Copyright: ensure data licensing; RAG enables source traceability for generated outputs.
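A minimal anonymization sketch, assuming a regex‑based pass before data leaves the private environment. The patterns and placeholders are illustrative only; a real pipeline should use a vetted PII detector plus an access‑audited mapping for re‑identification.

```python
# Replace detected PII spans with typed placeholders before training/export.
# Illustrative regexes only -- not a complete or production-grade PII detector.
import re

PII_PATTERNS = {
    "EMAIL": re.compile(r"[\w.+-]+@[\w-]+\.[\w.]+"),
    "PHONE": re.compile(r"\+?\d[\d\s().-]{7,}\d"),
}

def anonymize(text: str) -> str:
    """Substitute each detected PII span with a typed placeholder token."""
    for label, pattern in PII_PATTERNS.items():
        text = pattern.sub(f"<{label}>", text)
    return text

print(anonymize("Contact jane.doe@example.com or +1 (555) 123-4567."))
```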
4) Evals and launch gates
- Human + automated evals: accuracy/completeness/safety/actionability; p < 0.05 for A/B significance.
- Jailbreak and hallucination suites: pass‑rate ≥ 99%, harmful rate < 0.05%.
- Canary rollout: 1–5% of traffic until stable; define auto‑rollback thresholds (the gates above are sketched in code after this list).
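A minimal sketch of the gates above as a single promotion check. The field names and the results dict are hypothetical; the thresholds mirror the text (pass rate ≥ 99%, harmful rate < 0.05%, p < 0.05 for A/B significance).

```python
# Apply the launch gates to aggregated eval results before promoting to canary.
# Field names are assumptions; thresholds follow the checklist above.

def passes_launch_gates(results: dict) -> bool:
    gates = [
        results["jailbreak_pass_rate"] >= 0.99,      # jailbreak suite pass rate
        results["hallucination_pass_rate"] >= 0.99,  # hallucination suite pass rate
        results["harmful_rate"] < 0.0005,            # harmful output rate < 0.05%
        results["ab_p_value"] < 0.05,                # A/B significance
    ]
    return all(gates)

candidate = {
    "jailbreak_pass_rate": 0.994,
    "hallucination_pass_rate": 0.991,
    "harmful_rate": 0.0003,
    "ab_p_value": 0.02,
}
print("promote to canary" if passes_launch_gates(candidate) else "hold and fix")
```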
5) Proven architecture patterns
- Marketing/Support: RAG + small model → evidence‑driven answers at low cost; route hard intents to a larger model (sketched after this list).
- Compliance templates: RAG + large model (or small model + re‑rank) → cite sources and state limitations before generation.
- Automation agents: large models plan and orchestrate tools; small models execute narrow steps (OCR, classification).
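A minimal sketch of the "route hard intents to a larger model" pattern from the Marketing/Support row. The intent labels and model clients are hypothetical stubs; in practice the classifier is often a lightweight model or the small model itself.

```python
# Route easy intents to the small model, escalate hard intents to the large one.
# HARD_INTENTS and both model clients are hypothetical placeholders.

HARD_INTENTS = {"refund_dispute", "legal_question", "multi_step_troubleshooting"}

def call_large_model(prompt: str) -> str:   # stub so the sketch runs end to end
    return f"[large model] {prompt[:40]}..."

def call_small_model(prompt: str) -> str:   # stub so the sketch runs end to end
    return f"[small model] {prompt[:40]}..."

def route(intent: str, question: str, evidence: list[str]) -> str:
    prompt = "\n".join(evidence) + f"\n\nQuestion: {question}"
    if intent in HARD_INTENTS:
        return call_large_model(prompt)
    return call_small_model(prompt)

print(route("legal_question", "Can we reuse customer logos?", ["Policy 4.2: ..."]))
```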
6) Decision tree (condensed)
- Need traceable evidence? Yes → pick RAG; No → continue.
- Strong format/style with scale reuse? Yes → fine‑tune; No → continue.
- Budget very tight and scope narrow? Yes → small model; otherwise → RAG + light tuning (see the code sketch below).
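The condensed tree translates directly into code. A minimal sketch; the three boolean inputs are the questions above, answered for your own use case.

```python
# Direct encoding of the condensed decision tree: evidence -> RAG,
# format/style at scale -> fine-tune, tight budget + narrow scope -> small model,
# otherwise RAG + light tuning.

def choose_approach(needs_traceable_evidence: bool,
                    strong_format_with_scale_reuse: bool,
                    tight_budget_and_narrow_scope: bool) -> str:
    if needs_traceable_evidence:
        return "RAG"
    if strong_format_with_scale_reuse:
        return "fine-tune"
    if tight_budget_and_narrow_scope:
        return "small model"
    return "RAG + light tuning"

print(choose_approach(False, False, True))  # -> "small model"
```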
7) Common pitfalls and fixes
- Fine‑tuning without eval gates → unstable at launch; require red‑teaming and gates.
- Small‑model overreach → scope creep collapses quality; add intent routing and pre‑checks.
- RAG context stuffing → dilutes attention; re‑rank and de‑noise, keeping only the strongest evidence (see the sketch after this list).
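A minimal sketch of the re‑rank‑and‑trim fix, assuming a naive keyword‑overlap score as a stand‑in for a real cross‑encoder re‑ranker; only the top‑scoring chunks are kept in the context.

```python
# Keep only the strongest evidence instead of stuffing every retrieved chunk.
# The overlap score is a naive placeholder for a proper re-ranking model.

def rerank(question: str, chunks: list[str], keep: int = 3) -> list[str]:
    q_terms = set(question.lower().split())

    def score(chunk: str) -> int:
        return len(q_terms & set(chunk.lower().split()))

    return sorted(chunks, key=score, reverse=True)[:keep]

chunks = ["Refund policy: 30 days ...", "Office hours ...", "Refund exceptions ..."]
print(rerank("What is the refund policy?", chunks, keep=2))
```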
8) Launch and operations
- Versioning: pin model, prompt, index, and policy versions together so any output can be replayed (see the manifest sketch after this list).
- Observability: dashboards for latency/refusals/harmful rate/top failed prompts.
- Feedback loop: incidents feed the test set, which drives policy/data/model updates.
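A minimal sketch of version pinning with replayability, assuming a single release manifest attached to every logged response; the version strings and log format are hypothetical.

```python
# Pin model/prompt/index/policy versions in one manifest and attach it to each
# logged response, so any output can be replayed against the same release.
import json
import datetime

RELEASE = {
    "model": "support-llm@2024-05-01-lora3",   # hypothetical version strings
    "prompt": "support-prompt@v12",
    "index": "kb-index@2024-04-28",
    "policy": "safety-policy@v7",
}

def log_response(request_id: str, answer: str) -> str:
    """Emit one replayable record: timestamp, request, answer, pinned release."""
    record = {
        "ts": datetime.datetime.now(datetime.timezone.utc).isoformat(),
        "request_id": request_id,
        "release": RELEASE,
        "answer": answer,
    }
    return json.dumps(record)

print(log_response("req-001", "You can request a refund within 30 days."))
```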
Further reading
RAG Complete Guide · Evals & Launch Gates · Prompt Engineering