LLM Fine-tuning in 2025: LoRA, QLoRA, and When to Actually Do It
Fine-tuning vs prompting vs RAG, LoRA and QLoRA explained, building eval pipelines, OpenAI fine-tune API, Hugging Face + PEFT setup, and common mistakes to avoid.
When Off-the-Shelf Models Aren't Enough
GPT-4o, Claude 3.5, and Gemini are remarkable general-purpose models. They're trained on the internet — which means they're excellent at common patterns and poor at your company's specific domain.
Ask a base model to classify support tickets in your proprietary taxonomy, extract entities from your industry-specific documents, or write in your brand's exact voice — and you'll spend months engineering prompts to compensate for the gap between "what the model knows" and "what your business needs."
Fine-tuning closes that gap by updating the model's weights on your data. The result is a model that behaves like it was trained for your exact use case — because it was.
Fine-tuning vs Prompting vs RAG
These three techniques are complementary, not mutually exclusive. Choosing the right one (or combination) depends on what's actually wrong:
| Problem | Best approach |
| Model doesn't know your facts | RAG (add context at query time) |
| Model doesn't follow your format | Fine-tuning |
| Model doesn't match your tone/style | Fine-tuning |
| Model makes domain-specific errors | Fine-tuning + RAG |
| Model needs recent information | RAG |
| Long prompts are too slow/expensive | Fine-tuning (shorter prompts needed) |
RAG is cheaper and faster to iterate. Fine-tuning is more powerful but requires training data and compute. Many production systems use both: a fine-tuned base for style and format, RAG for factual grounding.
The Two Main Approaches
Full Fine-tuning
Update all model weights on your dataset. Gives maximum customisation but requires significant GPU compute and risks catastrophic forgetting (the model forgets general capabilities as it learns yours).
Full fine-tuning is rarely the right choice for most product teams in 2025. The cost and complexity don't justify the benefits unless you're operating at very large scale with very specialised requirements.
LoRA / QLoRA (Low-Rank Adaptation)
LoRA adds small trainable matrices alongside the frozen original weights. Instead of updating billions of parameters, you update millions. QLoRA additionally quantises the base model to 4-bit precision, reducing GPU memory requirements dramatically.
This is the practical choice for most teams:
What You Need Before You Start
Quality Training Data
Fine-tuning amplifies patterns in your data. If your data has errors, inconsistencies, or bias — the model learns those too. Before anything else:
Evaluation Set
Reserve 10-20% of your data for evaluation. Never train on your eval set. Define metrics before training — what does "better" actually mean for your task? Accuracy, F1, BLEU, human preference? Without a clear eval, you're flying blind.
Compute
| Model size | Minimum GPU | Recommended |
| 7B parameters (LoRA) | 1× RTX 3090 (24GB) | 1× A100 40GB |
| 13B parameters (QLoRA) | 1× A100 40GB | 2× A100 40GB |
| 70B parameters (QLoRA) | 4× A100 80GB | 8× H100 |
| OpenAI fine-tune API | No GPU needed | Managed, pay-per-token |
The OpenAI Fine-tuning API
If you're fine-tuning GPT-3.5-turbo or GPT-4o-mini, OpenAI's API handles infrastructure entirely. You upload a JSONL file, trigger a job, and get a model ID back:
# Upload training data
openai files upload --purpose fine-tune training_data.jsonl
# Start fine-tuning job
openai fine_tuning.jobs create \
--training-file file-abc123 \
--model gpt-4o-mini
# Monitor progress
openai fine_tuning.jobs listCost: ~$8 per 1M training tokens for GPT-4o-mini. A 1,000-example dataset of typical length costs $5-20 to train. Inference on fine-tuned models costs 3-4× more than the base model — factor this into your production economics.
Open Source Fine-tuning with Hugging Face
For Llama 3, Mistral, Qwen, or Gemma models, the Hugging Face ecosystem is the standard:
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import LoraConfig, get_peft_model
from trl import SFTTrainer
model = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-3-8B-Instruct")
lora_config = LoraConfig(
r=16, # rank — higher = more capacity, more compute
lora_alpha=32,
target_modules=["q_proj", "v_proj"],
lora_dropout=0.05,
task_type="CAUSAL_LM"
)
model = get_peft_model(model, lora_config)
trainer = SFTTrainer(
model=model,
train_dataset=dataset,
dataset_text_field="text",
max_seq_length=2048,
)
trainer.train()Building the Eval Pipeline First
Most teams start fine-tuning before building an eval pipeline. This is backwards. Before you train a single epoch:
Without an eval pipeline, you don't know if fine-tuning helped. With one, you can compare checkpoints objectively and stop training when you've hit your target.
Common Mistakes
When We Recommend Fine-tuning
For client projects, we recommend fine-tuning when:
We've built fine-tuning pipelines for classification, entity extraction, and structured data generation use cases. The tooling is mature, the results are measurable, and the economics work at scale.
Thinking about fine-tuning for your product? Let's talk through the use case — we'll tell you honestly whether fine-tuning is the right call or whether there's a cheaper path to the same outcome.
The Beyond Horizon Team
Engineering-led digital studio based in India. We build production-grade web apps, mobile apps, AI systems, and SaaS platforms — and write about what we learn along the way.
Keep Reading
All Articles →Building AI Agents for Production: A Practical Guide
How AI Agents work, when to use them, tool use with Claude and GPT-4o, multi-step planning, human-in-the-loop design, and the stack we use to ship production agents.
Model Context Protocol (MCP): The Standard for Connecting AI to Your Data
What MCP is, how it differs from direct tool use, building your first MCP server in TypeScript, security best practices, and the growing ecosystem around it.
Next.js vs React: Choosing the Right Framework for Your 2025 Web Project
A practical comparison of Next.js and plain React for web development projects. Learn when to choose each and why most production apps benefit from Next.js.
Have a project in mind?
We build fast, production-grade web, mobile, and AI applications.
Get a Free Consultation→