AI & Machine Learning26 May 2026·14 min read

LLM Fine-tuning in 2025: LoRA, QLoRA, and When to Actually Do It

Fine-tuning vs prompting vs RAG, LoRA and QLoRA explained, building eval pipelines, OpenAI fine-tune API, Hugging Face + PEFT setup, and common mistakes to avoid.

LLM Fine-tuningLoRAQLoRAHugging FaceOpenAIMachine LearningAIPEFT

When Off-the-Shelf Models Aren't Enough

GPT-4o, Claude 3.5, and Gemini are remarkable general-purpose models. They're trained on the internet — which means they're excellent at common patterns and poor at your company's specific domain.

Ask a base model to classify support tickets in your proprietary taxonomy, extract entities from your industry-specific documents, or write in your brand's exact voice — and you'll spend months engineering prompts to compensate for the gap between "what the model knows" and "what your business needs."

Fine-tuning closes that gap by updating the model's weights on your data. The result is a model that behaves like it was trained for your exact use case — because it was.

Fine-tuning vs Prompting vs RAG

These three techniques are complementary, not mutually exclusive. Choosing the right one (or combination) depends on what's actually wrong:

Problem	Best approach
Model doesn't know your facts	RAG (add context at query time)
Model doesn't follow your format	Fine-tuning
Model doesn't match your tone/style	Fine-tuning
Model makes domain-specific errors	Fine-tuning + RAG
Model needs recent information	RAG
Long prompts are too slow/expensive	Fine-tuning (shorter prompts needed)

RAG is cheaper and faster to iterate. Fine-tuning is more powerful but requires training data and compute. Many production systems use both: a fine-tuned base for style and format, RAG for factual grounding.

The Two Main Approaches

Full Fine-tuning

Update all model weights on your dataset. Gives maximum customisation but requires significant GPU compute and risks catastrophic forgetting (the model forgets general capabilities as it learns yours).

Full fine-tuning is rarely the right choice for most product teams in 2025. The cost and complexity don't justify the benefits unless you're operating at very large scale with very specialised requirements.

LoRA / QLoRA (Low-Rank Adaptation)

LoRA adds small trainable matrices alongside the frozen original weights. Instead of updating billions of parameters, you update millions. QLoRA additionally quantises the base model to 4-bit precision, reducing GPU memory requirements dramatically.

This is the practical choice for most teams:

Train on a single A100 (or even a consumer GPU for small models)

No catastrophic forgetting — base weights are frozen

LoRA adapters are tiny files (10-100MB) that can be swapped in/out

Merge back into the base model for inference with zero overhead

What You Need Before You Start

Quality Training Data

Fine-tuning amplifies patterns in your data. If your data has errors, inconsistencies, or bias — the model learns those too. Before anything else:

100-1000 examples: is enough for style/format fine-tuning. Classification tasks need more.

Format: typically `{"prompt": "...", "completion": "..."}` for instruction fine-tuning

Diversity matters more than volume: 500 high-quality, varied examples outperform 5,000 near-duplicate ones

Remove duplicates, correct errors, ensure consistent formatting

Evaluation Set

Reserve 10-20% of your data for evaluation. Never train on your eval set. Define metrics before training — what does "better" actually mean for your task? Accuracy, F1, BLEU, human preference? Without a clear eval, you're flying blind.

Compute

Model size	Minimum GPU	Recommended
7B parameters (LoRA)	1× RTX 3090 (24GB)	1× A100 40GB
13B parameters (QLoRA)	1× A100 40GB	2× A100 40GB
70B parameters (QLoRA)	4× A100 80GB	8× H100
OpenAI fine-tune API	No GPU needed	Managed, pay-per-token

The OpenAI Fine-tuning API

If you're fine-tuning GPT-3.5-turbo or GPT-4o-mini, OpenAI's API handles infrastructure entirely. You upload a JSONL file, trigger a job, and get a model ID back:

# Upload training data
openai files upload --purpose fine-tune training_data.jsonl

# Start fine-tuning job
openai fine_tuning.jobs create \
  --training-file file-abc123 \
  --model gpt-4o-mini

# Monitor progress
openai fine_tuning.jobs list

Cost: ~$8 per 1M training tokens for GPT-4o-mini. A 1,000-example dataset of typical length costs $5-20 to train. Inference on fine-tuned models costs 3-4× more than the base model — factor this into your production economics.

Open Source Fine-tuning with Hugging Face

For Llama 3, Mistral, Qwen, or Gemma models, the Hugging Face ecosystem is the standard:

from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import LoraConfig, get_peft_model
from trl import SFTTrainer

model = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-3-8B-Instruct")

lora_config = LoraConfig(
  r=16,  # rank — higher = more capacity, more compute
  lora_alpha=32,
  target_modules=["q_proj", "v_proj"],
  lora_dropout=0.05,
  task_type="CAUSAL_LM"
)

model = get_peft_model(model, lora_config)

trainer = SFTTrainer(
  model=model,
  train_dataset=dataset,
  dataset_text_field="text",
  max_seq_length=2048,
)

trainer.train()

Building the Eval Pipeline First

Most teams start fine-tuning before building an eval pipeline. This is backwards. Before you train a single epoch:

Define your success metrics precisely

Write automated eval scripts that can score any model checkpoint

Run your eval on the base model to establish a baseline

Set a target score that justifies the training cost

Without an eval pipeline, you don't know if fine-tuning helped. With one, you can compare checkpoints objectively and stop training when you've hit your target.

Common Mistakes

Too little data: 50 examples fine-tunes a model to memorise, not generalise

No eval set: You can't measure what you don't measure

Overfitting: Training loss goes down, eval loss goes up — stop training earlier

Wrong base model: Fine-tuning a 7B model when a 70B base would be more appropriate for the task

Skipping data cleaning: Garbage in, garbage out — especially with fine-tuning

Fine-tuning when prompting would suffice: If a well-crafted system prompt gets you 90% of the way, the remaining 10% rarely justifies fine-tuning costs

When We Recommend Fine-tuning

For client projects, we recommend fine-tuning when:

The task has a specific output format the base model consistently gets wrong

Domain vocabulary and terminology are highly specialised (legal, medical, finance)

Prompt engineering alone requires >2,000 tokens of context per call (cost problem)

Response quality needs to be consistent across thousands of calls without human review

We've built fine-tuning pipelines for classification, entity extraction, and structured data generation use cases. The tooling is mature, the results are measurable, and the economics work at scale.

Thinking about fine-tuning for your product? Let's talk through the use case — we'll tell you honestly whether fine-tuning is the right call or whether there's a cheaper path to the same outcome.

The Beyond Horizon Team

Engineering-led digital studio based in India. We build production-grade web apps, mobile apps, AI systems, and SaaS platforms — and write about what we learn along the way.

About Us →Our Work →

Keep Reading

All Articles →

AI & Machine Learning

Building AI Agents for Production: A Practical Guide

How AI Agents work, when to use them, tool use with Claude and GPT-4o, multi-step planning, human-in-the-loop design, and the stack we use to ship production agents.

13 min readRead →

AI & Machine Learning

Model Context Protocol (MCP): The Standard for Connecting AI to Your Data

What MCP is, how it differs from direct tool use, building your first MCP server in TypeScript, security best practices, and the growing ecosystem around it.

12 min readRead →

AI & Machine Learning

AI Agents for Business: Real Costs and What Makes Them Production-Ready

What AI agents actually cost to build, which use cases work, what makes them production-ready vs a demo, and when to hire an AI agent development team vs build in-house.

11 min readRead →

Have a Project in Mind?

We build fast, SEO-ready web and mobile applications.