★ Fine-tuning · Direct LLM · benchmark-first · Production grade

FINE-TUNING · POST-TRAINING

Production fine-tuning. Twelve methods, one honest decision tree.

Fine-tuning is expensive, often unnecessary, and sometimes the only thing that works. We benchmark RAG and prompt optimization first, then name the method that fits the data, the budget, and the compliance posture. LoRA at rank 16 with DoRA is the 2026 default. We argue for or against everything else.

PyTorch · PEFT · TRL · vLLM · Cohere · OpenAI

Start a conversation → · LLM services →

Cycle

Audit, then sprint or program

Stack

PyTorch · PEFT · TRL · vLLM

Output

Adapter or model + eval suite + feedback loop

Default

Argue for RAG before any training spend

[THE STACK WE BUILD]

Five stages. Every stage gated.

Production fine-tuning is a pipeline with an audit at the front and an eval gate at the back. Each stage has a published 2026 best practice. We name the choice and the trade for every one.

Audit before spend

Week one: benchmark RAG against the use case, name the gap fine-tuning is actually closing, write down the cost of being wrong. Often the audit is the deliverable.

Data pipeline

Sourcing (production logs, synthetic, human-curated), PII redaction (Presidio), synthetic generation (Distilabel, Nemotron), DEITA quality scoring, MinHash and SemDedup, labeling vendor, feedback capture.
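The MinHash step in that pipeline can be illustrated with a short standard-library sketch. It is not the production deduplicator: the shingle size, signature length, 0.8 threshold, and every function name below are illustrative assumptions, and a real corpus would use LSH bucketing rather than the pairwise loop shown here.

```python
import hashlib
import re

NUM_HASHES = 64   # signature length (illustrative assumption)
SHINGLE_SIZE = 3  # words per shingle (illustrative assumption)


def shingles(text: str, size: int = SHINGLE_SIZE) -> set[str]:
    """Lowercased word n-grams used as the record's feature set."""
    words = re.findall(r"\w+", text.lower())
    if len(words) < size:
        return {" ".join(words)} if words else set()
    return {" ".join(words[i:i + size]) for i in range(len(words) - size + 1)}


def minhash_signature(text: str, num_hashes: int = NUM_HASHES) -> list[int]:
    """For each seeded hash function, keep the minimum hash over all shingles."""
    feats = shingles(text)
    if not feats:
        return []
    signature = []
    for seed in range(num_hashes):
        salt = seed.to_bytes(8, "big")  # a different salt acts as a different hash function
        signature.append(min(
            int.from_bytes(
                hashlib.blake2b(s.encode("utf-8"), digest_size=8, salt=salt).digest(),
                "big",
            )
            for s in feats
        ))
    return signature


def estimated_jaccard(sig_a: list[int], sig_b: list[int]) -> float:
    """Fraction of matching signature slots approximates Jaccard similarity."""
    if not sig_a or not sig_b:
        return 0.0
    return sum(a == b for a, b in zip(sig_a, sig_b)) / len(sig_a)


def dedup(records: list[str], threshold: float = 0.8) -> list[str]:
    """Greedy O(n^2) filter: drop a record if it nearly matches one already kept."""
    kept: list[tuple[str, list[int]]] = []
    for text in records:
        sig = minhash_signature(text)
        if all(estimated_jaccard(sig, other) < threshold for _, other in kept):
            kept.append((text, sig))
    return [text for text, _ in kept]


logs = [
    "Reset your password via the settings page.",
    "reset your password via the settings page",   # same text, different casing
    "Refunds are processed within five business days.",
]
unique = dedup(logs)  # keeps the first and third record
```

MinHash only catches lexical near-duplicates; paraphrases with different wording are the case an embedding-based pass such as SemDedup is meant to cover.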

Method choice

LoRA at rank 16 with DoRA is the 2026 default. QLoRA when VRAM is tight. Full SFT only when benchmarked. DPO for alignment, GRPO/RFT for reasoning, CPT for foreign vocabulary, distillation for small students.
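As a rough sanity check on what rank 16 means in practice, the arithmetic below counts adapter parameters. It assumes Llama-2-7B-style shapes (hidden size 4096, MLP width 11008, 32 layers) and adapters on all seven linear projections per layer. Neither assumption comes from this page, and the DoRA term assumes one magnitude value per output feature.

```python
RANK = 16
BYTES_PER_PARAM = 2  # adapter stored in fp16/bf16 (assumption)

# Assumed Llama-2-7B-style shapes; swap in the real config for another base model.
HIDDEN = 4096
INTERMEDIATE = 11008
NUM_LAYERS = 32

# (in_features, out_features) for each adapted linear projection in one layer.
LAYER_PROJECTIONS = {
    "q_proj": (HIDDEN, HIDDEN),
    "k_proj": (HIDDEN, HIDDEN),
    "v_proj": (HIDDEN, HIDDEN),
    "o_proj": (HIDDEN, HIDDEN),
    "gate_proj": (HIDDEN, INTERMEDIATE),
    "up_proj": (HIDDEN, INTERMEDIATE),
    "down_proj": (INTERMEDIATE, HIDDEN),
}


def lora_params(d_in: int, d_out: int, rank: int = RANK) -> int:
    """LoRA adds A (rank x d_in) and B (d_out x rank) next to each frozen weight."""
    return rank * (d_in + d_out)


def dora_magnitude_params(d_out: int) -> int:
    """DoRA adds a learned magnitude vector; one value per output feature is assumed."""
    return d_out


lora_total = NUM_LAYERS * sum(lora_params(i, o) for i, o in LAYER_PROJECTIONS.values())
dora_total = NUM_LAYERS * sum(dora_magnitude_params(o) for _, o in LAYER_PROJECTIONS.values())
adapter_mb = (lora_total + dora_total) * BYTES_PER_PARAM / 1e6

print(f"LoRA params: {lora_total:,}")
print(f"DoRA magnitude params: {dora_total:,}")
print(f"Adapter size: {adapter_mb:.1f} MB")
```

Under these assumptions the adapter comes to roughly 41 million trainable parameters and a little over 80 MB, which is why a single base model can host many tenant adapters.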

Evaluation

Golden set frozen before training. Standard benchmarks (MMLU-Pro, IFEval, MT-Bench, Arena-Hard) plus a domain set. LLM-as-judge with calibration. Safety eval (HarmBench, JailbreakBench). Eval-gated promotion.
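Eval-gated promotion can be reduced to a small, auditable function. The sketch below is one possible shape, not a description of any specific tool: the `EvalReport` fields, the threshold defaults, and the idea of comparing a candidate against the base model on the same frozen sets are all illustrative assumptions.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class EvalReport:
    golden_set_score: float             # task metric on the frozen golden set, higher is better
    benchmark_scores: dict[str, float]  # e.g. {"IFEval": 0.80}, higher is better
    harmful_rate: float                 # share of safety prompts answered harmfully, lower is better


def promotion_gate(
    base: EvalReport,
    candidate: EvalReport,
    min_golden_gain: float = 0.0,
    max_benchmark_drop: float = 0.02,
    max_harmful_rate_increase: float = 0.0,
) -> tuple[bool, list[str]]:
    """Return (promote, reasons). Any listed reason blocks promotion."""
    reasons: list[str] = []

    # The fine-tune must beat the base model on the domain golden set.
    if candidate.golden_set_score - base.golden_set_score <= min_golden_gain:
        reasons.append("no gain on golden set")

    # General benchmarks may not regress by more than the allowed margin.
    for name, base_score in base.benchmark_scores.items():
        cand_score = candidate.benchmark_scores.get(name)
        if cand_score is None:
            reasons.append(f"{name} missing from candidate report")
        elif base_score - cand_score > max_benchmark_drop:
            reasons.append(f"{name} regressed")

    # Safety may not get worse.
    if candidate.harmful_rate - base.harmful_rate > max_harmful_rate_increase:
        reasons.append("safety regression")

    return (not reasons, reasons)


base = EvalReport(0.71, {"IFEval": 0.80}, 0.02)
good = EvalReport(0.78, {"IFEval": 0.79}, 0.02)
bad = EvalReport(0.78, {"IFEval": 0.70}, 0.05)

promote_good, _ = promotion_gate(base, good)
promote_bad, bad_reasons = promotion_gate(base, bad)
```

Returning the reasons alongside the verdict keeps the gate explainable when a run is blocked.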

Serving + drift

vLLM with multi-LoRA for hundreds of tenants on one base. Predibase or Together AI for serverless. Daily drift watch, weekly golden-set run, monthly red-team round.

[GO DEEP]

Six specialty topics.

Pick the angle that matches the question you came in with. Every spoke is full-depth, with engineering detail and named tools.

01 · FINE-TUNING

Methods

SFT, LoRA, QLoRA, DoRA, DPO, SimPO, ORPO, KTO, GRPO/RFT, distillation, model merging. Every named technique, and when it earns the build.

Data pipeline

Sourcing, PII redaction (Presidio), synthetic data (Distilabel, Nemotron), DEITA quality scoring, MinHash + SemDedup, labeling vendors, feedback loops.

Platforms

OpenAI RFT, Anthropic on Bedrock, Vertex, Azure Foundry, Databricks Mosaic, Together, Predibase, NeMo Customizer, Modal, Lambda. Side-by-side with our take.

By data + compute scale

From under 1k examples to over 1M, from a single A10G to 128 B200s. Indicative cost, recommended method, hardware tier.

Custom model build

Continued pretraining, SFT, preference optimization (DPO, SimPO, ORPO), reasoning distillation (R1 lineage), model merging (TIES, DARE). The full build pipeline.

Compliance

EU AI Act (Article 25 substantial-modification trap), GDPR, HIPAA, FedRAMP, Colorado AI Act, India DPDP, China GenAI Measures. Region-by-region for tuned LLMs.

[FINE-TUNE OR NOT]

The honest decision tree.

Answer up to five questions. The wizard recommends a single method, or sends you back to RAG or prompt optimization where those are the right answer. The first question filters out the majority of would-be fine-tunes.

Pick by the shape of the problem, not the hype.

If RAG can solve it, stay with RAG. If a verifier exists, GRPO. If preferences exist, DPO. Otherwise LoRA at rank 16 with DoRA.

Question 1

Does the model need to reference information that changes (policies, prices, catalogues, docs)?
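The four-branch rule of thumb above can be written down directly. This is a sketch of that rule only: the wizard's remaining questions are not shown on this page, so the function name and its three boolean inputs are assumptions made for illustration.

```python
def recommend_method(rag_can_solve: bool, has_verifier: bool, has_preferences: bool) -> str:
    """Apply the rule of thumb in order; the first matching branch wins."""
    if rag_can_solve:
        # Changing facts (policies, prices, catalogues, docs) stay in retrieval.
        return "RAG"
    if has_verifier:
        # A programmatic check of correctness enables reinforcement fine-tuning.
        return "GRPO"
    if has_preferences:
        # Pairs of better/worse answers enable preference optimization.
        return "DPO"
    return "LoRA (rank 16) with DoRA"


# Example: no retrieval fix, no verifier, but ranked answer pairs exist.
choice = recommend_method(rag_can_solve=False, has_verifier=False, has_preferences=True)
```

The ordering matters: a problem that RAG can solve is sent back to RAG even when a verifier or preference data is also available.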

[THE PIPELINE]

Data in, adapter out.

The shape every production fine-tune follows in 2026. The data-pipeline spoke goes layer by layer.

What we run on every fine-tuning job.

From corpus to signed adapter. Each stage is named, each tool is named, each gate is named.

Raw data

Production logs + synthetic + human

Clean + dedup

Presidio PII, DEITA, SemDedup

Train

PEFT LoRA / DoRA on FSDP

Eval gate

Golden set, IFEval, HarmBench

Sign + ship

Sigstore + vLLM multi-LoRA

[ANTI-PATTERNS]

How enterprise fine-tunes go wrong.

These are the failure modes we see on incoming audits. None of them are subtle. All of them ship to production.

Fine-tuning to teach knowledge

Facts get stale, hallucinations stay. Knowledge belongs in RAG. Behaviour belongs in weights.

Skipping the eval gate

Shipping a fine-tune without a frozen golden set is the most common way to ship a worse model than the base.

Tuning the Instruct, not the base

SFT on an already instruction-tuned model with raw chat data can erode its alignment. Either start from the base model, or use a very low learning rate and DPO/KTO on top of the Instruct model.

Overfit on small data

Under 1k examples with rank 64 and 10 epochs is a memorization recipe. Drop the rank, add dropout, and train for fewer epochs.

No feedback capture

No thumbs, no edit deltas, no production traces means no second-round DPO data. The first fine-tune is then the only fine-tune.
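To show why captured feedback matters, the sketch below turns hypothetical feedback events into training rows. The `FeedbackEvent` schema is invented for illustration; the output keys follow the prompt/chosen/rejected and prompt/completion/label layouts that preference trainers such as TRL's DPO and KTO trainers are commonly fed, which should be checked against the library version in use.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class FeedbackEvent:
    """One production trace with optional user feedback (hypothetical schema)."""
    prompt: str
    model_output: str
    thumbs_up: Optional[bool] = None     # None means the user gave no rating
    edited_output: Optional[str] = None  # the user's corrected version, if any


def to_preference_pairs(events: list[FeedbackEvent]) -> list[dict[str, str]]:
    """Edit deltas become paired data: the user's edit is preferred over the original."""
    pairs = []
    for e in events:
        if e.edited_output and e.edited_output.strip() != e.model_output.strip():
            pairs.append({
                "prompt": e.prompt,
                "chosen": e.edited_output,
                "rejected": e.model_output,
            })
    return pairs


def to_unpaired_labels(events: list[FeedbackEvent]) -> list[dict[str, object]]:
    """Thumbs alone give unpaired good/bad labels, the shape KTO-style methods use."""
    return [
        {"prompt": e.prompt, "completion": e.model_output, "label": e.thumbs_up}
        for e in events
        if e.thumbs_up is not None
    ]


events = [
    FeedbackEvent("Summarise the ticket.", "Customer is angry.",
                  edited_output="Customer reports a failed refund and asks for escalation."),
    FeedbackEvent("Draft a reply.", "Thanks for reaching out!", thumbs_up=True),
    FeedbackEvent("Classify the intent.", "billing"),  # no feedback captured
]
pairs = to_preference_pairs(events)
labels = to_unpaired_labels(events)
```

The third event illustrates the anti-pattern: a trace with no rating and no edit contributes nothing to the next training round.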

Optimising for benchmarks, not outcomes

MMLU gains do not equal CSAT gains. Tie eval to the business KPI before training, not after.

[WHAT YOU GET]

What's live at handoff.

1 audit

RAG vs fine-tune decision documented

1 adapter

LoRA / DoRA shipped, eval-gated

Continuous

Feedback loop feeding the next round

Compliance

Article 25 + DPIA + model card written

[COMMON QUESTIONS]

What buyers ask before they sign.

Should we fine-tune or stick with RAG?: Almost always RAG for facts that change, fine-tune for behaviour that does not. If your answer changes weekly (prices, policies, inventory), use RAG. If you need consistent voice, schema, terminology, or a small fast model that beats a generalist on a narrow task, fine-tune. The strongest 2026 production systems combine both.
Which method should we start with?: LoRA at rank 16 with DoRA. Under $500 per run for a 7B fine-tune, 50 to 200 MB adapter, hot-swappable per tenant. We benchmark anything bigger (full SFT, multi-stage CPT+SFT+DPO) against LoRA before considering it.
What is GRPO / RFT and why is it 2025's breakout?: Group Relative Policy Optimization is the algorithm DeepSeek used to train R1; OpenAI's Reinforcement Fine-Tuning API offers the same style of verifier-graded training as a managed service. Sample N attempts per prompt, score each against a verifier, normalize rewards within the group, and update the policy to favour answers that score above the group average. Works for any task with a verifiable reward: math, code, tool use, structured extraction.
Do we need our own GPUs?: Almost never. For LoRA and QLoRA, Modal, Lambda 1-Click Clusters, and Together AI cover the 1 to 128 GPU range with per-second billing. For managed options: OpenAI RFT ($100/hr), Together AI per-token, Predibase serverless, Bedrock managed. On-prem makes sense at sustained ~2 GPU-years of usage or when data residency forces it.
What does an enterprise fine-tuning engagement look like?: Week one is a benchmark audit (RAG vs few-shot vs LoRA). Weeks two to five build the data pipeline, train the first LoRA, eval-gate it. Weeks six to eight ship the adapter, wire feedback capture, document compliance posture (EU AI Act Article 25 assessment, GDPR DPIA, model card). Beyond that, ongoing partnership for second-round DPO/KTO, drift watch, and re-evaluation.
Will fine-tuning trigger the EU AI Act?: Possibly. The Commission's July 2025 Guidelines on GPAI Providers treat a downstream modifier as the provider of a new GPAI model when the modification uses more than one-third of the original model's training compute. Most enterprise fine-tunes do not hit that threshold, but Article 25 separately flips you to provider status for a high-risk system if you put your name or trademark on it, substantially modify it, or change its intended purpose so that it becomes high-risk. We run the substantial-modification assessment in week one. See /llm/fine-tuning/compliance/ for the full picture.
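To make the GRPO answer above concrete, the sketch below covers only the scoring and group-normalization step. The exact-match verifier is a toy stand-in, the use of population standard deviation is a choice made for this sketch, and the policy update itself (clipped probability ratios, KL penalty) is deliberately left out.

```python
import statistics


def exact_match_verifier(attempt: str, expected: str) -> float:
    """Toy verifier (hypothetical): 1.0 for an exact answer match, else 0.0."""
    return 1.0 if attempt.strip() == expected.strip() else 0.0


def group_relative_advantages(rewards: list[float], eps: float = 1e-6) -> list[float]:
    """Normalize rewards within one prompt's group: (reward - mean) / (std + eps)."""
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)  # population std; an assumption of this sketch
    return [(r - mean) / (std + eps) for r in rewards]


# N = 4 sampled attempts for a single prompt whose verified answer is "42".
attempts = ["42", "41", "42", "7"]
rewards = [exact_match_verifier(a, "42") for a in attempts]
advantages = group_relative_advantages(rewards)
# Correct attempts get positive advantages, incorrect ones negative.
```

When every attempt in a group earns the same reward, all advantages are zero and the prompt contributes no learning signal, which is one reason verifier design and prompt difficulty matter for this method.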

FINE-TUNING · KENSINK LABS

Bring the use case. We will name the method.

Senior engineers, eval suite at handoff, full source ownership. We benchmark the cheap method first and only deploy the expensive one when the numbers force it.

Start a conversation → · All fine-tuning topics