AI & Machine Learning26 May 2026·14 min read

LLM Fine-tuning in 2025: LoRA, QLoRA, and When to Actually Do It

Fine-tuning vs prompting vs RAG, LoRA and QLoRA explained, building eval pipelines, OpenAI fine-tune API, Hugging Face + PEFT setup, and common mistakes to avoid.

LLM Fine-tuningLoRAQLoRAHugging FaceOpenAIMachine LearningAIPEFT

When Off-the-Shelf Models Aren't Enough

GPT-4o, Claude 3.5, and Gemini are remarkable general-purpose models. They're trained on the internet — which means they're excellent at common patterns and poor at your company's specific domain.

Ask a base model to classify support tickets in your proprietary taxonomy, extract entities from your industry-specific documents, or write in your brand's exact voice — and you'll spend months engineering prompts to compensate for the gap between "what the model knows" and "what your business needs."

Fine-tuning closes that gap by updating the model's weights on your data. The result is a model that behaves like it was trained for your exact use case — because it was.

Fine-tuning vs Prompting vs RAG

These three techniques are complementary, not mutually exclusive. Choosing the right one (or combination) depends on what's actually wrong:

ProblemBest approach
Model doesn't know your factsRAG (add context at query time)
Model doesn't follow your formatFine-tuning
Model doesn't match your tone/styleFine-tuning
Model makes domain-specific errorsFine-tuning + RAG
Model needs recent informationRAG
Long prompts are too slow/expensiveFine-tuning (shorter prompts needed)

RAG is cheaper and faster to iterate. Fine-tuning is more powerful but requires training data and compute. Many production systems use both: a fine-tuned base for style and format, RAG for factual grounding.

The Two Main Approaches

Full Fine-tuning

Update all model weights on your dataset. Gives maximum customisation but requires significant GPU compute and risks catastrophic forgetting (the model forgets general capabilities as it learns yours).

Full fine-tuning is rarely the right choice for most product teams in 2025. The cost and complexity don't justify the benefits unless you're operating at very large scale with very specialised requirements.

LoRA / QLoRA (Low-Rank Adaptation)

LoRA adds small trainable matrices alongside the frozen original weights. Instead of updating billions of parameters, you update millions. QLoRA additionally quantises the base model to 4-bit precision, reducing GPU memory requirements dramatically.

This is the practical choice for most teams:

Train on a single A100 (or even a consumer GPU for small models)
No catastrophic forgetting — base weights are frozen
LoRA adapters are tiny files (10-100MB) that can be swapped in/out
Merge back into the base model for inference with zero overhead

What You Need Before You Start

Quality Training Data

Fine-tuning amplifies patterns in your data. If your data has errors, inconsistencies, or bias — the model learns those too. Before anything else:

100-1000 examples: is enough for style/format fine-tuning. Classification tasks need more.
Format: typically `{"prompt": "...", "completion": "..."}` for instruction fine-tuning
Diversity matters more than volume: 500 high-quality, varied examples outperform 5,000 near-duplicate ones
Remove duplicates, correct errors, ensure consistent formatting

Evaluation Set

Reserve 10-20% of your data for evaluation. Never train on your eval set. Define metrics before training — what does "better" actually mean for your task? Accuracy, F1, BLEU, human preference? Without a clear eval, you're flying blind.

Compute

Model sizeMinimum GPURecommended
7B parameters (LoRA)1× RTX 3090 (24GB)1× A100 40GB
13B parameters (QLoRA)1× A100 40GB2× A100 40GB
70B parameters (QLoRA)4× A100 80GB8× H100
OpenAI fine-tune APINo GPU neededManaged, pay-per-token

The OpenAI Fine-tuning API

If you're fine-tuning GPT-3.5-turbo or GPT-4o-mini, OpenAI's API handles infrastructure entirely. You upload a JSONL file, trigger a job, and get a model ID back:

# Upload training data
openai files upload --purpose fine-tune training_data.jsonl

# Start fine-tuning job
openai fine_tuning.jobs create \
  --training-file file-abc123 \
  --model gpt-4o-mini

# Monitor progress
openai fine_tuning.jobs list

Cost: ~$8 per 1M training tokens for GPT-4o-mini. A 1,000-example dataset of typical length costs $5-20 to train. Inference on fine-tuned models costs 3-4× more than the base model — factor this into your production economics.

Open Source Fine-tuning with Hugging Face

For Llama 3, Mistral, Qwen, or Gemma models, the Hugging Face ecosystem is the standard:

from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import LoraConfig, get_peft_model
from trl import SFTTrainer

model = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-3-8B-Instruct")

lora_config = LoraConfig(
  r=16,  # rank — higher = more capacity, more compute
  lora_alpha=32,
  target_modules=["q_proj", "v_proj"],
  lora_dropout=0.05,
  task_type="CAUSAL_LM"
)

model = get_peft_model(model, lora_config)

trainer = SFTTrainer(
  model=model,
  train_dataset=dataset,
  dataset_text_field="text",
  max_seq_length=2048,
)

trainer.train()

Building the Eval Pipeline First

Most teams start fine-tuning before building an eval pipeline. This is backwards. Before you train a single epoch:

Define your success metrics precisely
Write automated eval scripts that can score any model checkpoint
Run your eval on the base model to establish a baseline
Set a target score that justifies the training cost

Without an eval pipeline, you don't know if fine-tuning helped. With one, you can compare checkpoints objectively and stop training when you've hit your target.

Common Mistakes

Too little data: 50 examples fine-tunes a model to memorise, not generalise
No eval set: You can't measure what you don't measure
Overfitting: Training loss goes down, eval loss goes up — stop training earlier
Wrong base model: Fine-tuning a 7B model when a 70B base would be more appropriate for the task
Skipping data cleaning: Garbage in, garbage out — especially with fine-tuning
Fine-tuning when prompting would suffice: If a well-crafted system prompt gets you 90% of the way, the remaining 10% rarely justifies fine-tuning costs

When We Recommend Fine-tuning

For client projects, we recommend fine-tuning when:

The task has a specific output format the base model consistently gets wrong
Domain vocabulary and terminology are highly specialised (legal, medical, finance)
Prompt engineering alone requires >2,000 tokens of context per call (cost problem)
Response quality needs to be consistent across thousands of calls without human review

We've built fine-tuning pipelines for classification, entity extraction, and structured data generation use cases. The tooling is mature, the results are measurable, and the economics work at scale.

Thinking about fine-tuning for your product? Let's talk through the use case — we'll tell you honestly whether fine-tuning is the right call or whether there's a cheaper path to the same outcome.

BH

The Beyond Horizon Team

Engineering-led digital studio based in India. We build production-grade web apps, mobile apps, AI systems, and SaaS platforms — and write about what we learn along the way.

Have a project in mind?

We build fast, production-grade web, mobile, and AI applications.

Get a Free Consultation