LLM Fine-Tuning in 2026: LoRA, QLoRA, and When to Actually Do It

Fine-tuning LLMs gets cargo-culted constantly. Teams spend weeks on fine-tuning when better prompts would have solved it in an afternoon—or they dismiss fine-tuning when it’s genuinely the right tool. This post gives you a decision framework and practical implementation guide.

Photo by Igor Omilaev on Unsplash

The Decision Framework First

Before touching any training code, answer these questions:

Is your problem about knowledge? → Use RAG
Is your problem about behavior/format/style? → Fine-tuning might help
Is your problem about reasoning quality? → Use a bigger model or better prompting
Is your problem solvable with prompt engineering alone? → Do that first

Fine-tuning wins when:

  • Consistent output format that resists prompt injection
  • Domain-specific vocabulary or jargon (legal, medical, code in obscure languages)
  • Latency requirements prevent long system prompts
  • Deploying a smaller model with specialist capability beats a large generalist

Fine-tuning loses when:

  • Your dataset is < 500 examples (overfit risk is high)
  • You need current knowledge (fine-tuning doesn’t update facts)
  • Your task changes frequently (retraining cadence is expensive)
  • A 3-shot prompt already achieves your quality threshold

The LoRA Architecture

LoRA (Low-Rank Adaptation) is now the standard approach to parameter-efficient fine-tuning. Understanding how it works makes debugging much easier.

How LoRA Works

Instead of updating the full weight matrix W, LoRA freezes it and trains two small low-rank matrices whose product approximates the update:

# Original: y = Wx + b                  (W stays frozen)
# LoRA:     y = Wx + b + (α/r) * (B @ A) @ x

# Where:
#   W is (d_model × d_model), e.g., (4096 × 4096) ≈ 16.8M params
#   A is (r × d_model),       e.g., (16 × 4096)   ≈ 65K params
#   B is (d_model × r),       e.g., (4096 × 16)   ≈ 65K params
#   r = rank (4–64), controls adapter capacity
#   α = scaling factor (typically r or 2r); B is initialized to zero,
#       so training starts exactly at the base model

LoRA adapter at rank 16 on Llama 3 8B: ~40M trainable params vs 8B total.
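
To make the shapes concrete, here is a minimal LoRA linear layer in PyTorch. This is a sketch for intuition, not the PEFT implementation; class and variable names are mine, and dimensions follow the comment block above.

import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """A frozen nn.Linear plus a trainable low-rank update."""
    def __init__(self, base: nn.Linear, r: int = 16, alpha: int = 32):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad = False                  # freeze W and b
        self.A = nn.Parameter(torch.randn(r, base.in_features) * 0.01)
        self.B = nn.Parameter(torch.zeros(base.out_features, r))  # zero init
        self.scale = alpha / r

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # y = Wx + b + (alpha/r) * B(Ax)
        return self.base(x) + self.scale * (x @ self.A.T @ self.B.T)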

Which Layers to Target?

# Empirical best practice (2026)
target_modules = [
    "q_proj", "k_proj", "v_proj", "o_proj",  # attention
    "gate_proj", "up_proj", "down_proj"        # MLP (add for harder tasks)
]

# For code/format-heavy tasks, also add:
# "lm_head"  — output projection
# "embed_tokens"  — input embeddings (use modules_to_save, not LoRA)

Practical QLoRA Implementation

QLoRA = quantize base model to 4-bit + train LoRA adapters in 16-bit. Fits a 70B model on a single A100 80GB.

import torch
from transformers import (
    AutoModelForCausalLM,
    AutoTokenizer,
    BitsAndBytesConfig,
    TrainingArguments
)
from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training
from trl import SFTTrainer
from datasets import Dataset

# 1. Load quantized base model
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",       # Normal Float 4 — best for LLMs
    bnb_4bit_compute_dtype=torch.bfloat16,
    bnb_4bit_use_double_quant=True,  # Nested quantization saves ~0.4GB
)

model_id = "meta-llama/Llama-3.1-8B-Instruct"
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    quantization_config=bnb_config,
    device_map="auto",
    torch_dtype=torch.bfloat16,
    attn_implementation="flash_attention_2"  # FlashAttention-2: faster, lower-memory attention
)
tokenizer = AutoTokenizer.from_pretrained(model_id)
tokenizer.pad_token = tokenizer.eos_token
tokenizer.padding_side = "right"

# 2. Prepare for QLoRA (cast LayerNorm to fp32)
model = prepare_model_for_kbit_training(model)

# 3. LoRA config
lora_config = LoraConfig(
    r=16,
    lora_alpha=32,
    target_modules=[
        "q_proj", "k_proj", "v_proj", "o_proj",
        "gate_proj", "up_proj", "down_proj"
    ],
    lora_dropout=0.05,
    bias="none",
    task_type="CAUSAL_LM",
    # RSLoRA: better gradient scaling than classic LoRA
    use_rslora=True,
)
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()
# trainable params: 41,943,040 || all params: 8,072,212,480 || trainable%: 0.52%

# 4. Dataset formatting
def format_instruction(sample):
    # Llama 3.1 chat format (note the blank line after each header token)
    return f"""<|begin_of_text|><|start_header_id|>system<|end_header_id|>

{sample['system']}<|eot_id|><|start_header_id|>user<|end_header_id|>

{sample['instruction']}<|eot_id|><|start_header_id|>assistant<|end_header_id|>

{sample['output']}<|eot_id|>"""

# 5. Training
training_args = TrainingArguments(
    output_dir="./checkpoints",
    num_train_epochs=3,
    per_device_train_batch_size=4,
    gradient_accumulation_steps=4,  # effective batch = 16
    gradient_checkpointing=True,    # save VRAM at cost of speed
    learning_rate=2e-4,
    lr_scheduler_type="cosine",
    warmup_ratio=0.03,
    bf16=True,
    logging_steps=10,
    save_strategy="epoch",
    evaluation_strategy="epoch",
    load_best_model_at_end=True,
    # Optimizer
    optim="paged_adamw_32bit",
    # Weights & Biases
    report_to="wandb",
    run_name=f"qlora-{model_id.split('/')[-1]}",
)

trainer = SFTTrainer(
    model=model,
    args=training_args,
    train_dataset=train_dataset,  # datasets.Dataset with system/instruction/output fields
    eval_dataset=eval_dataset,    # held-out split, used for epoch-end evaluation
    tokenizer=tokenizer,
    formatting_func=format_instruction,
    max_seq_length=2048,
    packing=True,  # Pack multiple short examples per sequence (2-3x throughput)
)

trainer.train()
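
One refinement worth knowing about: by default SFTTrainer computes loss over the whole sequence, prompt included. TRL's DataCollatorForCompletionOnlyLM masks everything before the assistant header so the loss covers only the response. It is incompatible with packing=True, so you trade throughput for a cleaner training signal. A sketch, with the response template matching the Llama 3.1 format above:

from trl import DataCollatorForCompletionOnlyLM

collator = DataCollatorForCompletionOnlyLM(
    response_template="<|start_header_id|>assistant<|end_header_id|>",
    tokenizer=tokenizer,
)

trainer = SFTTrainer(
    model=model,
    args=training_args,
    train_dataset=train_dataset,
    tokenizer=tokenizer,
    formatting_func=format_instruction,
    data_collator=collator,
    max_seq_length=2048,
    packing=False,  # required: completion-only masking doesn't support packing
)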

Data Quality: The Real Bottleneck

Training code is a solved problem. Data quality isn’t.

Minimum Viable Dataset

# Quality checklist for each example.
# has_pii, is_duplicate, and response_follows_format are project-specific
# helpers you supply; seen_hashes holds signatures of already-accepted examples.
def validate_example(example: dict) -> bool:
    checks = [
        len(example["instruction"]) > 20,           # Not trivially short
        len(example["output"]) > 50,                # Substantive response
        not has_pii(example["output"]),             # No sensitive data
        not is_duplicate(example, seen_hashes),     # Deduplication
        response_follows_format(example["output"]), # Matches target format
    ]
    return all(checks)

# Deduplication by MinHash
from datasketch import MinHash, MinHashLSH

def get_minhash(text: str, num_perm: int = 128) -> MinHash:
    m = MinHash(num_perm=num_perm)
    for word in text.lower().split():
        m.update(word.encode("utf8"))
    return m
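
The MinHashLSH import above is what turns these signatures into a near-duplicate filter. A sketch of how the pieces fit together (the 0.8 similarity threshold is an assumption; tune it on a labeled sample of your corpus):

def deduplicate(examples: list[dict], threshold: float = 0.8) -> list[dict]:
    lsh = MinHashLSH(threshold=threshold, num_perm=128)
    unique = []
    for i, ex in enumerate(examples):
        m = get_minhash(ex["instruction"] + " " + ex["output"])
        if not lsh.query(m):      # returns keys of near-duplicates seen so far
            lsh.insert(str(i), m)
            unique.append(ex)
    return unique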

Synthetic Data Generation

For most tasks, GPT-4.1 or Claude 3.7 can generate training data:

import json

async def generate_training_example(
    task_description: str,
    few_shot_examples: list[dict],
    llm_client  # an AsyncOpenAI-style client
) -> dict:
    prompt = f"""Generate a training example for this task: {task_description}

Examples of good training data:
{json.dumps(few_shot_examples[:3], indent=2)}

Generate a NEW, diverse example in the same JSON format.
Vary the style, length, and content. Do not copy the examples."""

    response = await llm_client.chat.completions.create(
        model="gpt-4.1",
        messages=[{"role": "user", "content": prompt}],
        response_format={"type": "json_object"},
        temperature=0.9  # High temp for diversity
    )
    return json.loads(response.choices[0].message.content)
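
To build a full dataset, run it in batches. A sketch; the batch size and total count are placeholders, sized to your provider's rate limits:

import asyncio

async def build_dataset(task: str, seeds: list[dict], client, n: int = 500) -> list[dict]:
    examples = []
    for _ in range(0, n, 20):  # 20 concurrent requests per batch
        batch = await asyncio.gather(*(
            generate_training_example(task, seeds, client) for _ in range(20)
        ))
        examples.extend(batch)
    return examples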

Warning: Model-generated data inherits the teacher model’s biases and errors. Always human-review a sample (10-20%) before training.

Evaluation: Don’t Skip This

Fine-tuning without evaluation is flying blind.

import torch
from rouge_score import rouge_scorer
import bert_score

class FineTuneEvaluator:
    def __init__(self, model, tokenizer, eval_dataset):
        self.model = model
        self.tokenizer = tokenizer
        self.eval_dataset = eval_dataset

    def generate_predictions(self, max_new_tokens=512) -> list[str]:
        predictions = []
        for example in self.eval_dataset:
            inputs = self.tokenizer(
                example["prompt"],
                return_tensors="pt"
            ).to(self.model.device)

            with torch.no_grad():
                outputs = self.model.generate(
                    **inputs,
                    max_new_tokens=max_new_tokens,
                    temperature=0.1,
                    do_sample=True
                )

            pred = self.tokenizer.decode(
                outputs[0][inputs["input_ids"].shape[1]:],
                skip_special_tokens=True
            )
            predictions.append(pred)
        return predictions

    def evaluate(self) -> dict:
        predictions = self.generate_predictions()
        references = [ex["output"] for ex in self.eval_dataset]

        # ROUGE scores
        scorer = rouge_scorer.RougeScorer(["rougeL"], use_stemmer=True)
        rouge_scores = [
            scorer.score(ref, pred)["rougeL"].fmeasure
            for ref, pred in zip(references, predictions)
        ]

        # BERTScore (semantic similarity)
        P, R, F1 = bert_score.score(predictions, references, lang="en")

        # Task-specific: format compliance. check_format is your own
        # validator (e.g., "output parses as JSON"), defined on this class.
        format_compliance = sum(
            1 for pred in predictions
            if self.check_format(pred)
        ) / len(predictions)

        return {
            "rouge_l": sum(rouge_scores) / len(rouge_scores),
            "bert_score_f1": F1.mean().item(),
            "format_compliance": format_compliance,
        }
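
Run the evaluator on the base model before training to get a baseline, then again on the adapted model:

evaluator = FineTuneEvaluator(model, tokenizer, eval_dataset)
metrics = evaluator.evaluate()
print(metrics)  # {"rouge_l": ..., "bert_score_f1": ..., "format_compliance": ...}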

Cost Estimates (2026 Cloud Pricing)

Setup            GPU               Training Time (8B, 1K examples)   Cost
QLoRA 4-bit      1× A10G (24GB)    ~2h                               ~$3
QLoRA 4-bit      1× A100 (40GB)    ~45min                            ~$2
Full LoRA bf16   4× A100 (80GB)    ~1h                               ~$12
Full fine-tune   8× H100           ~3h                               ~$50

For most use cases, QLoRA on a single A100 is optimal. Use vast.ai or Lambda Labs to avoid reserved instance commitments for one-off training runs.

Serving Fine-Tuned Models

# Merge LoRA into base model for deployment
from peft import PeftModel

base_model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,
    device_map="auto"
)
model = PeftModel.from_pretrained(base_model, "./checkpoints/final")
merged_model = model.merge_and_unload()  # folds the scaled BA update into W: no adapter overhead at inference

# Save merged model
merged_model.save_pretrained("./merged-model")
tokenizer.save_pretrained("./merged-model")

# Deploy with vLLM
# vllm serve ./merged-model --tensor-parallel-size 1 --max-model-len 8192
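
Merging is the simplest path. If you need several task-specific adapters behind one endpoint, recent vLLM versions can also serve LoRA adapters on top of a shared base model. A sketch; the adapter name is a placeholder:

# Alternative: serve the unmerged adapter on top of the shared base
# vllm serve meta-llama/Llama-3.1-8B-Instruct \
#     --enable-lora \
#     --lora-modules my-task=./checkpoints/final \
#     --max-model-len 8192
# Requests then select the adapter by model name: {"model": "my-task", ...}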

Conclusion

Fine-tuning is a precision tool, not a default path. The teams that use it well are those who first exhaust prompt engineering and RAG, then reach for fine-tuning with a clear quality metric, clean data, and a defined evaluation suite.

QLoRA democratized fine-tuning to the point where the barrier is no longer compute—it’s data quality and evaluation rigor. Invest in those, and a fine-tuned 8B model can outperform GPT-4 on your specific task at 1/100th the cost per token.
