Audit before spend
Week one: benchmark RAG against the use case, name the gap fine-tuning is actually closing, write down the cost of being wrong. Often the audit is the deliverable.
Fine-tuning is expensive, often unnecessary, and sometimes the only thing that works. We benchmark RAG and prompt optimization first, then name the method that fits the data, the budget, and the compliance posture. LoRA at rank 16 with DoRA is the 2026 default. We argue for or against everything else.
Production fine-tuning is a pipeline with an audit at the front and an eval gate at the back. Each stage has a published 2026 best practice. For every stage, we name the choice and the trade-off.
Sourcing (production logs, synthetic, human-curated), PII redaction (Presidio), synthetic generation (Distilabel, Nemotron), DEITA quality scoring, MinHash and SemDedup, labeling vendor, feedback capture.
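The MinHash step named above can be sketched in a few lines of standard-library Python. This is a minimal illustration, not the pipeline's actual implementation: the 5-character shingles, the 64 hash seeds, the 0.8 threshold, and all function names are assumptions chosen for the example. The all-pairs comparison is quadratic, so a real corpus would normally add LSH banding on top.

```python
import hashlib

def shingles(text, k=5):
    """Set of lowercase character k-grams after collapsing whitespace."""
    s = " ".join(text.lower().split())
    return {s[i:i + k] for i in range(max(1, len(s) - k + 1))}

def minhash_signature(text, num_perm=64):
    """For each seed, keep the smallest 64-bit hash over all shingles."""
    grams = shingles(text)
    signature = []
    for seed in range(num_perm):
        salt = seed.to_bytes(8, "little")  # a different salt per "permutation"
        signature.append(min(
            int.from_bytes(
                hashlib.blake2b(g.encode("utf-8"), digest_size=8, salt=salt).digest(),
                "big",
            )
            for g in grams
        ))
    return signature

def estimated_jaccard(sig_a, sig_b):
    """Fraction of matching positions approximates Jaccard similarity."""
    return sum(a == b for a, b in zip(sig_a, sig_b)) / len(sig_a)

def drop_near_duplicates(examples, threshold=0.8):
    """Keep an example only if it is not too similar to one already kept."""
    kept, kept_sigs = [], []
    for text in examples:
        sig = minhash_signature(text)
        if all(estimated_jaccard(sig, other) < threshold for other in kept_sigs):
            kept.append(text)
            kept_sigs.append(sig)
    return kept

# Hypothetical training prompts; the second differs only in case and spacing.
corpus = [
    "Reset my password please",
    "reset  my PASSWORD please",
    "What is the refund policy for annual plans?",
]
unique = drop_near_duplicates(corpus)
```

MinHash catches surface-level near-duplicates only. Paraphrases with different wording need an embedding-based pass, which is the role SemDedup plays in the list above.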
LoRA at rank 16 with DoRA is the 2026 default. QLoRA when VRAM is tight. Full SFT only when benchmarked. DPO for alignment, GRPO/RFT for reasoning, CPT for foreign vocabulary, distillation for small students.
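To make the rank-16 default concrete, the arithmetic below counts trainable parameters for one adapted weight matrix. It is a sketch under stated assumptions: the 4096 x 4096 projection is a hypothetical layer size, and the DoRA overhead is assumed to be one magnitude value per output feature.

```python
def lora_trainable_params(d_in, d_out, rank=16, use_dora=True):
    """Trainable parameters added to one d_out x d_in weight matrix.

    LoRA learns A (rank x d_in) and B (d_out x rank).
    DoRA is assumed here to add one magnitude scalar per output feature.
    """
    params = rank * (d_in + d_out)
    if use_dora:
        params += d_out
    return params

# Hypothetical 4096 x 4096 attention projection.
full = 4096 * 4096
r16 = lora_trainable_params(4096, 4096, rank=16)
r64 = lora_trainable_params(4096, 4096, rank=64)

print(f"full matrix:    {full:,}")
print(f"rank 16 + DoRA: {r16:,} ({r16 / full:.2%} of full)")
print(f"rank 64 + DoRA: {r64:,} ({r64 / full:.2%} of full)")
```

Under these assumptions the adapter trains under 1% of the matrix at rank 16, and quadrupling the rank roughly quadruples adapter capacity, which matters when the dataset is small.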
Golden set frozen before training. Standard benchmarks (MMLU-Pro, IFEval, MT-Bench, Arena-Hard) plus a domain set. LLM-as-judge with calibration. Safety eval (HarmBench, JailbreakBench). Eval-gated promotion.
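Eval-gated promotion can be expressed as a small, testable function. The sketch below is illustrative only: the metric names, the thresholds, and the assumption that every score is higher-is-better are hypothetical, and a real gate would read scores from the frozen golden-set run rather than from literals.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class GateResult:
    promote: bool
    reasons: tuple  # why promotion was blocked; empty when promoted

def eval_gate(base, candidate, primary, min_gain=0.02, max_regression=0.01):
    """Promote only if the primary metric improves and nothing else regresses.

    base, candidate: dicts of metric name -> score (higher is better, assumed).
    primary: the metric the fine-tune is supposed to move.
    """
    reasons = []
    missing = sorted(set(base) - set(candidate))
    if missing:
        reasons.append(f"candidate is missing metrics: {missing}")
    if primary in candidate:
        gain = candidate[primary] - base[primary]
        if gain < min_gain:
            reasons.append(f"{primary}: gain {gain:+.3f} is below {min_gain}")
    for name, base_score in base.items():
        if name == primary or name not in candidate:
            continue
        drop = base_score - candidate[name]
        if drop > max_regression:
            reasons.append(f"{name}: regressed by {drop:.3f}")
    return GateResult(promote=not reasons, reasons=tuple(reasons))

# Hypothetical scores for the base model and two candidate adapters.
base = {"domain_golden_set": 0.71, "ifeval": 0.80, "harmbench_refusal": 0.95}
good = {"domain_golden_set": 0.78, "ifeval": 0.80, "harmbench_refusal": 0.95}
unsafe = {"domain_golden_set": 0.78, "ifeval": 0.80, "harmbench_refusal": 0.85}

good_result = eval_gate(base, good, primary="domain_golden_set")
unsafe_result = eval_gate(base, unsafe, primary="domain_golden_set")
```

The second candidate improves the domain score by the same margin but is blocked because its safety score drops, which is the case a gate exists to catch.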
vLLM with multi-LoRA for hundreds of tenants on one base. Predibase or Together AI for serverless. Daily drift watch, weekly golden-set run, monthly red-team round.
Pick the angle that matches the question you came in with. Every spoke is full-depth, with engineering detail and named tools.
SFT, LoRA, QLoRA, DoRA, DPO, SimPO, ORPO, KTO, GRPO/RFT, distillation, model merging. Every named technique with when it earns the build.
02 · FINE-TUNING
Sourcing, PII redaction (Presidio), synthetic data (Distilabel, Nemotron), DEITA quality scoring, MinHash + SemDedup, labeling vendors, feedback loops.
03 · FINE-TUNING
OpenAI RFT, Anthropic on Bedrock, Vertex, Azure Foundry, Databricks Mosaic, Together, Predibase, NeMo Customizer, Modal, Lambda. Side-by-side with our take.
04 · FINE-TUNING
Under 1k examples to over 1M, single A10G to 128 B200. Indicative cost, recommended method, hardware tier.
05 · FINE-TUNING
Continued pretraining, SFT, preference optimization (DPO, SimPO, ORPO), reasoning distillation (R1 lineage), model merging (TIES, DARE). The full build pipeline.
06 · FINE-TUNING
EU AI Act (Article 25 substantial-modification trap), GDPR, HIPAA, FedRAMP, Colorado AI Act, India DPDP, China GenAI Measures. Region-by-region for tuned LLMs.
Answer up to five questions. The wizard recommends a single method, or sends you back to RAG or prompt optimization where those are the right answer. The first question filters out the majority of would-be fine-tunes.
If RAG can solve it, stay with RAG. If a verifier exists, GRPO. If preferences exist, DPO. Otherwise LoRA at rank 16 with DoRA.
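The four rules in that sentence read directly as an ordered decision function. The sketch below encodes only those rules. The function and parameter names are made up for illustration, and the wizard's other questions and its prompt-optimization branch are not modelled.

```python
def recommend_method(rag_solves_it, has_verifier, has_preference_pairs):
    """Apply the rules in order; the first one that matches wins."""
    if rag_solves_it:
        return "RAG"
    if has_verifier:
        return "GRPO"
    if has_preference_pairs:
        return "DPO"
    return "LoRA rank 16 with DoRA"

# A task with an automatic checker (for example unit tests) but no RAG fit.
choice = recommend_method(rag_solves_it=False, has_verifier=True,
                          has_preference_pairs=True)
```

Order matters: a use case that RAG can solve never reaches the fine-tuning branches, even when a verifier or preference data is available.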
The shape every production fine-tune follows in 2026. The data-pipeline spoke goes layer by layer.
From corpus to signed adapter. Each stage is named, each tool is named, each gate is named.
Production logs + synthetic + human
Presidio PII, DEITA, SemDedup
PEFT LoRA / DoRA on FSDP
Golden set, IFEval, HarmBench
Sigstore + vLLM multi-LoRA
These are the failure modes we see on incoming audits. None of them are subtle. All of them ship to production.
Facts get stale, hallucinations stay. Knowledge belongs in RAG. Behaviour belongs in weights.
Shipping a fine-tune without a frozen golden set is the most common way to ship a worse model than the base.
Running SFT on an already instruction-tuned model with raw chat data can destroy its alignment. Either start from the base model, or use a very low learning rate and DPO/KTO on top of the Instruct model.
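One way to see why DPO is gentler on an Instruct model is that its loss is defined relative to a frozen reference, here assumed to be the Instruct model itself. The sketch below computes the standard DPO loss for a single preference pair from summed log-probabilities. The beta value of 0.1 and the example numbers are assumptions for illustration.

```python
import math

def dpo_loss(policy_chosen_logp, policy_rejected_logp,
             ref_chosen_logp, ref_rejected_logp, beta=0.1):
    """DPO loss for one (chosen, rejected) pair.

    Each argument is the summed log-probability of a full response.
    The margin measures how much more the policy prefers the chosen
    response than the frozen reference does.
    """
    margin = ((policy_chosen_logp - ref_chosen_logp)
              - (policy_rejected_logp - ref_rejected_logp))
    z = beta * margin
    # -log(sigmoid(z)), written in a numerically stable softplus form.
    return math.log1p(math.exp(-abs(z))) + max(-z, 0.0)

# Policy identical to the reference: no preference learned yet.
at_start = dpo_loss(-10.0, -12.0, -10.0, -12.0)
# Policy has shifted probability toward the chosen response.
after_update = dpo_loss(-9.0, -13.0, -10.0, -12.0)
```

When the policy equals the reference the loss is log 2, and it falls only as the policy moves its preference relative to that reference, so the starting model's behaviour stays the anchor throughout training.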
Under 1k examples with rank 64 and 10 epochs is a memorization recipe. Drop the rank, add dropout, and train for fewer epochs.
No thumbs, no edit deltas, no production traces means no second-round DPO data. The first fine-tune is then the only fine-tune.
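As an illustration of how captured feedback becomes second-round training data, the sketch below turns edit deltas into preference pairs: the user's edited text is treated as the chosen response and the model's original output as the rejected one. The event schema is hypothetical, and the prompt/chosen/rejected keys follow a common convention for preference datasets rather than any specific tool's requirement.

```python
def preference_pairs(events):
    """Build (prompt, chosen, rejected) records from logged edit deltas.

    events: dicts with "prompt", "response" and an optional
    "edited_response" (hypothetical schema). Events without a real
    edit are skipped because they carry no pairwise signal.
    """
    pairs = []
    for event in events:
        edited = event.get("edited_response")
        if edited and edited.strip() != event["response"].strip():
            pairs.append({
                "prompt": event["prompt"],
                "chosen": edited,
                "rejected": event["response"],
            })
    return pairs

# Hypothetical production log entries.
log = [
    {"prompt": "Summarise the ticket.",
     "response": "The user is angry.",
     "edited_response": "The user reports a failed refund and asks for a callback."},
    {"prompt": "Draft a reply.",
     "response": "Thanks, we are on it.",
     "edited_response": "Thanks, we are on it. "},  # whitespace-only edit
    {"prompt": "Classify the intent.",
     "response": "billing"},  # never edited
]
pairs = preference_pairs(log)
```

Thumbs-up and thumbs-down signals without an edit are unpaired, so they suit methods that accept binary labels, such as KTO, rather than pairwise DPO.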
MMLU gains do not equal CSAT gains. Tie eval to the business KPI before training, not after.