LLM Fine-Tuning in 2026: LoRA, QLoRA, and When to Actually Do It
Fine-tuning LLMs gets cargo-culted constantly. Teams spend weeks on fine-tuning when better prompts would have solved the problem in an afternoon, or they dismiss fine-tuning when it's genuinely the right tool. This post gives you a decision framework and a practical implementation guide.
The Decision Framework First
Before touching any training code, answer these questions:
Is your problem about knowledge? → Use RAG
Is your problem about behavior/format/style? → Fine-tuning might help
Is your problem about reasoning quality? → Use a bigger model or better prompting
Can you solve your problem with prompt engineering alone? → Do that first
Fine-tuning wins when:
- You need a consistent output format that resists prompt injection
- The domain has specialized vocabulary or jargon (legal, medical, code in obscure languages)
- Latency requirements rule out long system prompts
- Deploying a smaller model with specialist capability beats a large generalist
Fine-tuning loses when:
- Your dataset has fewer than ~500 examples (overfitting risk is high)
- You need current knowledge (fine-tuning doesn’t update facts)
- Your task changes frequently (retraining cadence is expensive)
- A 3-shot prompt already achieves your quality threshold
The LoRA Architecture
LoRA (Low-Rank Adaptation) is now the standard approach. Understanding it makes debugging much easier.
How LoRA Works
Instead of updating all of W's parameters, LoRA freezes W and adds two small trainable matrices:
# Original: y = Wx + b            (W frozen)
# LoRA:     y = Wx + b + (α/r) * (B @ A) @ x
# Where:
#   W is (d_model × d_model), e.g., (4096 × 4096) ≈ 16.8M params
#   A is (r × d_model),       e.g., (16 × 4096)   = 65K params
#   B is (d_model × r),       e.g., (4096 × 16)   = 65K params, zero-initialized
#   r = rank (typically 4–64), controls adapter capacity
#   α = scaling factor (typically r or 2*r)
LoRA adapter at rank 16 on Llama 3 8B: ~40M trainable params vs 8B total.
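To make the shapes concrete, here is what a LoRA-wrapped linear layer looks like in plain PyTorch. This is a teaching sketch, not how PEFT implements it internally; note that B starts at zero, so the adapter is a no-op at step 0:
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """A frozen linear layer plus a trainable low-rank update (sketch)."""
    def __init__(self, base: nn.Linear, r: int = 16, alpha: int = 32):
        super().__init__()
        self.base = base
        self.base.requires_grad_(False)  # freeze W (and bias)
        self.A = nn.Parameter(torch.randn(r, base.in_features) * 0.01)  # (r × d_in)
        self.B = nn.Parameter(torch.zeros(base.out_features, r))  # (d_out × r), zero-init
        self.scaling = alpha / r

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # y = Wx + b + (α/r) · B(Ax)
        return self.base(x) + self.scaling * (x @ self.A.T @ self.B.T)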
Which Layers to Target?
# Empirical best practice (2026)
target_modules = [
"q_proj", "k_proj", "v_proj", "o_proj", # attention
"gate_proj", "up_proj", "down_proj" # MLP (add for harder tasks)
]
# For code/format-heavy tasks, also add:
# "lm_head" — output projection
# "embed_tokens" — input embeddings (use modules_to_save, not LoRA)
Practical QLoRA Implementation
QLoRA = quantize base model to 4-bit + train LoRA adapters in 16-bit. Fits a 70B model on a single A100 80GB.
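The arithmetic behind that claim, as a rough sketch (illustrative numbers; ignores activations and KV cache, which add several more GB):
# Back-of-the-envelope VRAM estimate for QLoRA
params_b = 70  # base model parameters, in billions
weights_gb = params_b * 0.5  # NF4 ≈ 0.5 bytes/param -> ~35 GB
adapter_m = 200  # trainable LoRA params, in millions (depends on rank/targets)
# bf16 adapter (2 B) + 32-bit AdamW master/m/v states (12 B) ≈ 14 bytes/param
optimizer_gb = adapter_m * 1e6 * 14 / 1e9  # ~2.8 GB
print(f"~{weights_gb + optimizer_gb:.0f} GB before activations")  # ~38 GB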
import torch
from transformers import (
AutoModelForCausalLM,
AutoTokenizer,
BitsAndBytesConfig,
TrainingArguments
)
from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training
from trl import SFTTrainer
from datasets import Dataset
# 1. Load quantized base model
bnb_config = BitsAndBytesConfig(
load_in_4bit=True,
bnb_4bit_quant_type="nf4", # Normal Float 4 — best for LLMs
bnb_4bit_compute_dtype=torch.bfloat16,
bnb_4bit_use_double_quant=True, # Nested quantization saves ~0.4GB
)
model_id = "meta-llama/Llama-3.1-8B-Instruct"
model = AutoModelForCausalLM.from_pretrained(
model_id,
quantization_config=bnb_config,
device_map="auto",
torch_dtype=torch.bfloat16,
    attn_implementation="flash_attention_2"  # faster attention; requires the flash-attn package
)
tokenizer = AutoTokenizer.from_pretrained(model_id)
tokenizer.pad_token = tokenizer.eos_token
tokenizer.padding_side = "right"
# 2. Prepare for QLoRA (cast LayerNorm to fp32)
model = prepare_model_for_kbit_training(model)
# 3. LoRA config
lora_config = LoraConfig(
r=16,
lora_alpha=32,
target_modules=[
"q_proj", "k_proj", "v_proj", "o_proj",
"gate_proj", "up_proj", "down_proj"
],
lora_dropout=0.05,
bias="none",
task_type="CAUSAL_LM",
# RSLoRA: better gradient scaling than classic LoRA
use_rslora=True,
)
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()
# trainable params: 41,943,040 || all params: 8,072,212,480 || trainable%: 0.52%
# 4. Dataset formatting
def format_instruction(sample):
return f"""<|begin_of_text|><|start_header_id|>system<|end_header_id|>
{sample['system']}<|eot_id|><|start_header_id|>user<|end_header_id|>
{sample['instruction']}<|eot_id|><|start_header_id|>assistant<|end_header_id|>
{sample['output']}<|eot_id|>"""
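# The hand-rolled template above must match the model's chat format exactly;
# a less brittle alternative (assumes transformers >= 4.34) is to let the
# tokenizer build it:
def format_instruction_via_template(sample):
    messages = [
        {"role": "system", "content": sample["system"]},
        {"role": "user", "content": sample["instruction"]},
        {"role": "assistant", "content": sample["output"]},
    ]
    return tokenizer.apply_chat_template(messages, tokenize=False)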
# 5. Training
training_args = TrainingArguments(
output_dir="./checkpoints",
num_train_epochs=3,
per_device_train_batch_size=4,
gradient_accumulation_steps=4, # effective batch = 16
gradient_checkpointing=True, # save VRAM at cost of speed
learning_rate=2e-4,
lr_scheduler_type="cosine",
warmup_ratio=0.03,
bf16=True,
logging_steps=10,
save_strategy="epoch",
    eval_strategy="epoch",  # named evaluation_strategy in older transformers
load_best_model_at_end=True,
# Optimizer
optim="paged_adamw_32bit",
# Weights & Biases
report_to="wandb",
    run_name=f"qlora-{model_id.split('/')[-1]}",
)
trainer = SFTTrainer(
    model=model,
    args=training_args,
    train_dataset=train_dataset,  # datasets.Dataset with system/instruction/output columns
    eval_dataset=eval_dataset,
    tokenizer=tokenizer,
    formatting_func=format_instruction,
    max_seq_length=2048,
    packing=True,  # pack multiple short examples per sequence (2-3x throughput)
)
# Note: newer trl releases move max_seq_length/packing onto SFTConfig and
# rename tokenizer= to processing_class=; adjust for your installed version.
trainer.train()
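When training finishes, persist just the adapter; it is a few hundred megabytes instead of the full model, and the serving section below loads it from this path:
# Save the LoRA adapter weights and tokenizer (not the full base model)
trainer.save_model("./checkpoints/final")
tokenizer.save_pretrained("./checkpoints/final")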
Data Quality: The Real Bottleneck
Training code is a solved problem. Data quality isn’t.
Minimum Viable Dataset
# Quality checklist for each example.
# has_pii, is_duplicate, response_follows_format, and seen_hashes are
# project-specific helpers/state you must supply.
def validate_example(example: dict) -> bool:
    checks = [
        len(example["instruction"]) > 20,  # Not trivially short
        len(example["output"]) > 50,  # Substantive response
        not has_pii(example["output"]),  # No sensitive data
        not is_duplicate(example, seen_hashes),  # Deduplication
        response_follows_format(example["output"]),  # Matches target format
    ]
    return all(checks)
# Deduplication by MinHash
from datasketch import MinHash, MinHashLSH
def get_minhash(text: str, num_perm: int = 128) -> MinHash:
m = MinHash(num_perm=num_perm)
for word in text.lower().split():
m.update(word.encode("utf8"))
return m
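MinHashLSH (imported above) is what makes dedup scale beyond pairwise comparison. A sketch, assuming all_outputs is your list of output strings:
# Keep only examples with no near-duplicate (Jaccard > 0.8) already indexed
lsh = MinHashLSH(threshold=0.8, num_perm=128)
unique_outputs = []
for i, text in enumerate(all_outputs):
    mh = get_minhash(text)
    if not lsh.query(mh):  # query returns matching keys; empty list means unique
        lsh.insert(f"ex-{i}", mh)
        unique_outputs.append(text)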
Synthetic Data Generation
For most tasks, GPT-4.1 or Claude 3.7 can generate training data:
import json

async def generate_training_example(
    task_description: str,
    few_shot_examples: list[dict],
    llm_client,  # AsyncOpenAI-style client
) -> dict:
prompt = f"""Generate a training example for this task: {task_description}
Examples of good training data:
{json.dumps(few_shot_examples[:3], indent=2)}
Generate a NEW, diverse example in the same JSON format.
Vary the style, length, and content. Do not copy the examples."""
response = await llm_client.chat.completions.create(
model="gpt-4.1",
messages=[{"role": "user", "content": prompt}],
response_format={"type": "json_object"},
temperature=0.9 # High temp for diversity
)
return json.loads(response.choices[0].message.content)
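To build a corpus, fan the generator out with asyncio and filter through the validator from earlier. A sketch that assumes llm_client is an AsyncOpenAI-style client; add rate limiting for real runs:
import asyncio

async def build_dataset(task: str, seeds: list[dict], n: int = 500) -> list[dict]:
    # Generate n candidates concurrently, then keep only those that validate
    coros = [generate_training_example(task, seeds, llm_client) for _ in range(n)]
    candidates = await asyncio.gather(*coros, return_exceptions=True)
    return [c for c in candidates if isinstance(c, dict) and validate_example(c)]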
Warning: Model-generated data inherits the teacher model’s biases and errors. Always human-review a sample (10-20%) before training.
Evaluation: Don’t Skip This
Fine-tuning without evaluation is flying blind.
import torch
from rouge_score import rouge_scorer
import bert_score
class FineTuneEvaluator:
def __init__(self, model, tokenizer, eval_dataset):
self.model = model
self.tokenizer = tokenizer
self.eval_dataset = eval_dataset
def generate_predictions(self, max_new_tokens=512) -> list[str]:
predictions = []
for example in self.eval_dataset:
inputs = self.tokenizer(
example["prompt"],
return_tensors="pt"
).to(self.model.device)
with torch.no_grad():
outputs = self.model.generate(
**inputs,
max_new_tokens=max_new_tokens,
temperature=0.1,
do_sample=True
)
pred = self.tokenizer.decode(
outputs[0][inputs["input_ids"].shape[1]:],
skip_special_tokens=True
)
predictions.append(pred)
return predictions
def evaluate(self) -> dict:
predictions = self.generate_predictions()
references = [ex["output"] for ex in self.eval_dataset]
# ROUGE scores
scorer = rouge_scorer.RougeScorer(["rougeL"], use_stemmer=True)
rouge_scores = [
scorer.score(ref, pred)["rougeL"].fmeasure
for ref, pred in zip(references, predictions)
]
# BERTScore (semantic similarity)
P, R, F1 = bert_score.score(predictions, references, lang="en")
        # Task-specific: format compliance
        # (self.check_format is a task-specific validator you must implement)
        format_compliance = sum(
            1 for pred in predictions
            if self.check_format(pred)
        ) / len(predictions)
return {
"rouge_l": sum(rouge_scores) / len(rouge_scores),
"bert_score_f1": F1.mean().item(),
"format_compliance": format_compliance,
}
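Running it is two lines; a usage sketch assuming eval_dataset rows carry prompt and output fields:
evaluator = FineTuneEvaluator(model, tokenizer, eval_dataset)
metrics = evaluator.evaluate()
print(metrics)  # {'rouge_l': ..., 'bert_score_f1': ..., 'format_compliance': ...}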
Cost Estimates (2026 Cloud Pricing)
| Setup | GPU | Training Time (8B, 1K examples) | Cost |
|---|---|---|---|
| QLoRA 4-bit | 1× A10G (24GB) | ~2h | ~$3 |
| QLoRA 4-bit | 1× A100 (40GB) | ~45min | ~$2 |
| Full LoRA bf16 | 4× A100 (80GB) | ~1h | ~$12 |
| Full fine-tune | 8× H100 | ~3h | ~$50 |
For most use cases, QLoRA on a single A100 is optimal. Use vast.ai or Lambda Labs to avoid reserved instance commitments for one-off training runs.
Serving Fine-Tuned Models
# Merge LoRA into base model for deployment
from peft import PeftModel
base_model = AutoModelForCausalLM.from_pretrained(
model_id,
torch_dtype=torch.bfloat16,
device_map="auto"
)
model = PeftModel.from_pretrained(base_model, "./checkpoints/final")
merged_model = model.merge_and_unload()
# Save merged model
merged_model.save_pretrained("./merged-model")
tokenizer.save_pretrained("./merged-model")
# Deploy with vLLM
# vllm serve ./merged-model --tensor-parallel-size 1 --max-model-len 8192
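vLLM exposes an OpenAI-compatible API (port 8000 by default), so smoke-testing the merged model takes a few lines:
from openai import OpenAI

# Any string works as the API key for a local vLLM server
client = OpenAI(base_url="http://localhost:8000/v1", api_key="local")
resp = client.chat.completions.create(
    model="./merged-model",  # matches the path passed to `vllm serve`
    messages=[{"role": "user", "content": "Quick sanity check of the fine-tuned behavior"}],
)
print(resp.choices[0].message.content)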
Conclusion
Fine-tuning is a precision tool, not a default path. The teams that use it well are those who first exhaust prompt engineering and RAG, then reach for fine-tuning with a clear quality metric, clean data, and a defined evaluation suite.
QLoRA democratized fine-tuning to the point where the barrier is no longer compute—it’s data quality and evaluation rigor. Invest in those, and a fine-tuned 8B model can outperform GPT-4 on your specific task at 1/100th the cost per token.
