LLM Fine-Tuning and RAG Optimization: A Practical Guide for 2026
The landscape of LLM customization has matured significantly. Whether you’re fine-tuning models for specific domains or building Retrieval-Augmented Generation (RAG) systems, understanding the tradeoffs and best practices is essential.
When to Fine-Tune vs. Use RAG
Choose Fine-Tuning When:
- You need consistent style, tone, or format
- Domain-specific terminology must be deeply understood
- Latency is critical (no retrieval step)
- Your use case is well-defined and stable
Choose RAG When:
- Information changes frequently
- You need to cite sources
- Transparency and explainability matter
- You have limited training data but rich documentation
The Hybrid Approach
In 2026, most production systems use both: fine-tuned models enhanced with RAG for up-to-date information.
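A minimal sketch of that hybrid pattern, assuming a fine-tuned chat model behind a generate() call and a retriever that returns document snippets (both are placeholders here): retrieval supplies fresh, citable facts, while the fine-tuned weights handle domain style and terminology.

async def hybrid_answer(query: str, fine_tuned_llm, retriever) -> str:
    # Fresh, citable knowledge comes from retrieval
    docs = await retriever.retrieve(query)
    context = "\n\n".join(doc.content for doc in docs)
    # Domain tone, format, and terminology come from the fine-tuned weights
    prompt = f"Context:\n{context}\n\nQuestion: {query}"
    return await fine_tuned_llm.generate(prompt)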
Fine-Tuning Techniques
LoRA (Low-Rank Adaptation)
The standard approach for efficient fine-tuning:
import torch
from peft import LoraConfig, get_peft_model
from transformers import AutoModelForCausalLM, AutoTokenizer

# Load base model
model = AutoModelForCausalLM.from_pretrained(
    "mistralai/Mistral-7B-v0.3",
    torch_dtype=torch.bfloat16,
    device_map="auto"
)

# Configure LoRA
lora_config = LoraConfig(
    r=16,              # Rank of the low-rank update matrices
    lora_alpha=32,     # Scaling factor
    target_modules=[
        "q_proj", "k_proj", "v_proj", "o_proj",
        "gate_proj", "up_proj", "down_proj"
    ],
    lora_dropout=0.05,
    bias="none",
    task_type="CAUSAL_LM"
)

# Apply LoRA
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()
# Prints the trainable parameter count -- typically well under 1% of the ~7B total
QLoRA for Consumer GPUs
QLoRA loads the base model in 4-bit, cutting memory enough to fine-tune mid-sized models on consumer cards and 70B-class models on a single high-memory GPU:
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",              # NormalFloat4 quantization
    bnb_4bit_compute_dtype=torch.bfloat16,
    bnb_4bit_use_double_quant=True          # Also quantize the quantization constants
)

model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Meta-Llama-3-70B",
    quantization_config=bnb_config,
    device_map="auto"
)
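The 4-bit model still needs adapters attached before training. A typical QLoRA recipe prepares the quantized model for k-bit training and then applies the same lora_config defined earlier:

from peft import prepare_model_for_kbit_training, get_peft_model

# Recommended preparation for k-bit training (casts norms to fp32, enables input grads)
model = prepare_model_for_kbit_training(model)

# Only the LoRA adapter parameters receive gradients
model = get_peft_model(model, lora_config)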
Dataset Preparation
Quality data matters more than quantity:
from datasets import Dataset
from transformers import AutoTokenizer

# Tokenizer for the base model; apply_chat_template requires it to define a chat template
tokenizer = AutoTokenizer.from_pretrained("mistralai/Mistral-7B-v0.3")

def prepare_chat_dataset(conversations: list) -> Dataset:
    """Format conversations for instruction fine-tuning"""
    formatted = []
    for conv in conversations:
        messages = []
        for turn in conv["turns"]:
            messages.append({
                "role": turn["role"],
                "content": turn["content"]
            })
        # Apply chat template
        text = tokenizer.apply_chat_template(
            messages,
            tokenize=False,
            add_generation_prompt=False
        )
        formatted.append({"text": text})
    return Dataset.from_list(formatted)

# Training with proper formatting
from trl import SFTTrainer, SFTConfig

trainer = SFTTrainer(
    model=model,
    train_dataset=dataset,  # e.g. dataset = prepare_chat_dataset(raw_conversations)
    args=SFTConfig(
        output_dir="./output",
        dataset_text_field="text",
        per_device_train_batch_size=4,
        gradient_accumulation_steps=4,
        num_train_epochs=3,
        learning_rate=2e-4,
        warmup_ratio=0.03,
        logging_steps=10,
        save_strategy="epoch"
    ),
    peft_config=lora_config
)
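Training and saving then follow the standard Trainer flow. Because a PEFT config is attached, saving writes out only the LoRA adapter weights rather than the full model (the output path is illustrative):

trainer.train()
trainer.save_model("./output/final-adapter")  # saves just the adapter, not the base weights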
RAG Architecture Optimization
Chunking Strategies
The foundation of good retrieval:
from langchain.text_splitter import RecursiveCharacterTextSplitter

# Recursive character chunking with overlap, splitting on natural boundaries first
def smart_chunk(documents: list, chunk_size: int = 512) -> list:
    splitter = RecursiveCharacterTextSplitter(
        chunk_size=chunk_size,
        chunk_overlap=50,
        separators=["\n\n", "\n", ". ", " ", ""],
        length_function=len
    )
    chunks = []
    for doc in documents:
        doc_chunks = splitter.split_text(doc.content)
        for i, chunk in enumerate(doc_chunks):
            chunks.append({
                "content": chunk,
                "metadata": {
                    **doc.metadata,
                    "chunk_index": i,
                    "total_chunks": len(doc_chunks)
                }
            })
    return chunks
Hybrid Search
Combine dense and sparse retrieval:
from qdrant_client import QdrantClient, models

client = QdrantClient("localhost", port=6333)

# Create collection with both dense and sparse named vectors
client.create_collection(
    collection_name="documents",
    vectors_config={
        "dense": models.VectorParams(
            size=1024,
            distance=models.Distance.COSINE
        )
    },
    sparse_vectors_config={
        "sparse": models.SparseVectorParams()
    }
)

# Hybrid search query
def hybrid_search(query: str, top_k: int = 10):
    dense_vector = embed_dense(query)    # assumed helper returning a 1024-dim embedding
    sparse_vector = embed_sparse(query)  # BM25 or SPLADE, as a models.SparseVector
    return client.query_points(
        collection_name="documents",
        prefetch=[
            models.Prefetch(query=dense_vector, using="dense", limit=top_k * 2),
            models.Prefetch(query=sparse_vector, using="sparse", limit=top_k * 2),
        ],
        query=models.FusionQuery(fusion=models.Fusion.RRF),  # Reciprocal Rank Fusion
        limit=top_k
    )
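For completeness, indexing the chunks produced earlier with both vector types looks roughly like this; embed_dense and embed_sparse are the same assumed helpers as above, with embed_sparse returning a models.SparseVector:

def index_chunks(chunks: list):
    # Each point carries both named vectors plus the chunk text and metadata as payload
    client.upsert(
        collection_name="documents",
        points=[
            models.PointStruct(
                id=i,
                vector={
                    "dense": embed_dense(chunk["content"]),
                    "sparse": embed_sparse(chunk["content"])
                },
                payload={"content": chunk["content"], **chunk["metadata"]}
            )
            for i, chunk in enumerate(chunks)
        ]
    )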
Reranking
Improve precision with cross-encoders:
from sentence_transformers import CrossEncoder

reranker = CrossEncoder('cross-encoder/ms-marco-MiniLM-L-12-v2')

def rerank_results(query: str, documents: list, top_k: int = 5) -> list:
    """Rerank retrieved documents using a cross-encoder"""
    pairs = [(query, doc["content"]) for doc in documents]
    scores = reranker.predict(pairs)
    ranked = sorted(
        zip(documents, scores),
        key=lambda x: x[1],
        reverse=True
    )
    return [doc for doc, score in ranked[:top_k]]
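Putting the two stages together, a typical pattern is to over-retrieve with hybrid search and keep only the cross-encoder's top picks. This sketch assumes the hybrid_search function above and payloads that store the chunk text under "content":

def retrieve_and_rerank(query: str, top_k: int = 5) -> list:
    # Over-retrieve, then let the cross-encoder pick the best few
    candidates = hybrid_search(query, top_k=top_k * 4)
    documents = [{"content": p.payload["content"]} for p in candidates.points]
    return rerank_results(query, documents, top_k=top_k)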
Advanced RAG Patterns
Query Transformation
Improve retrieval with query rewriting:
import json

async def transform_query(original_query: str, llm) -> list[str]:
    """Generate multiple search queries for better coverage"""
    prompt = f"""Generate 3 different search queries to find information for:
"{original_query}"
Output as a JSON array of strings."""
    response = await llm.generate(prompt)
    queries = json.loads(response)
    return [original_query] + queries

# Multi-query retrieval with deduplication
async def multi_query_retrieve(query: str, retriever, llm) -> list:
    queries = await transform_query(query, llm)
    all_docs = []
    seen_ids = set()
    for q in queries:
        docs = await retriever.retrieve(q)
        for doc in docs:
            if doc.id not in seen_ids:
                all_docs.append(doc)
                seen_ids.add(doc.id)
    return all_docs
Self-RAG
Let the model decide when to retrieve:
class SelfRAG:
    def __init__(self, llm, retriever):
        self.llm = llm
        self.retriever = retriever

    async def generate(self, query: str) -> str:
        # First, assess whether retrieval is needed
        needs_retrieval = await self.assess_retrieval_need(query)
        if needs_retrieval:
            docs = await self.retriever.retrieve(query)
            context = self.format_context(docs)
            # Generate with retrieved context
            response = await self.llm.generate(
                f"Context:\n{context}\n\nQuestion: {query}"
            )
            # Verify the response is grounded in the retrieved documents
            if await self.verify_grounding(response, docs):
                return response
            else:
                # Retry with different docs or a different approach
                return await self.fallback_generate(query)
        else:
            return await self.llm.generate(query)
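The helper methods (assess_retrieval_need, format_context, verify_grounding, fallback_generate) are left abstract above. As a minimal sketch of the first one, added as a method of the same class, the model itself can be asked whether external knowledge is needed (the prompt wording is illustrative):

    async def assess_retrieval_need(self, query: str) -> bool:
        """Ask the model whether the question needs external documents (sketch)."""
        prompt = (
            "Does answering the following question require consulting external "
            f"documents? Answer only YES or NO.\n\nQuestion: {query}"
        )
        answer = await self.llm.generate(prompt)
        return answer.strip().upper().startswith("YES")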
Agentic RAG
Multi-step reasoning with tool use:
from langchain.agents import create_react_agent

# Tool objects are assumed to be defined elsewhere (see the retriever example below)
tools = [
    retriever_tool,
    calculator_tool,
    web_search_tool,
    code_executor_tool
]

# Note: LangChain's ReAct agent expects a prompt template that also contains the
# {tools}, {tool_names}, and {agent_scratchpad} placeholders; the instructions below
# would form the preamble of that template.
agent = create_react_agent(
    llm=llm,
    tools=tools,
    prompt="""You are a helpful assistant with access to tools.
For complex questions:
1. Break down the question into sub-questions
2. Use the retriever for domain knowledge
3. Use web search for current information
4. Use the calculator for numerical computations
Always cite your sources."""
)
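The tool objects above are assumed to exist. For the retrieval tool, LangChain's create_retriever_tool helper is one common way to wrap a retriever, assuming retriever is a LangChain BaseRetriever (the name and description below are illustrative):

from langchain.tools.retriever import create_retriever_tool

retriever_tool = create_retriever_tool(
    retriever,
    name="domain_knowledge",
    description="Search the internal document index for domain-specific information."
)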
Evaluation and Monitoring
RAG Evaluation Metrics
from ragas import evaluate
from ragas.metrics import (
    faithfulness,
    answer_relevancy,
    context_precision,
    context_recall
)

# Evaluate the RAG pipeline
results = evaluate(
    dataset=eval_dataset,
    metrics=[
        faithfulness,        # Is the answer grounded in the retrieved context?
        answer_relevancy,    # Is the answer relevant to the question?
        context_precision,   # Are the retrieved docs relevant?
        context_recall       # Were all the needed docs retrieved?
    ]
)

print(f"Faithfulness: {results['faithfulness']:.3f}")
print(f"Relevancy: {results['answer_relevancy']:.3f}")
Production Monitoring
Track key metrics in production:
import prometheus_client as prom

retrieval_latency = prom.Histogram(
    'rag_retrieval_latency_seconds',
    'Time spent in retrieval',
    buckets=[0.1, 0.25, 0.5, 1.0, 2.5]
)

retrieval_count = prom.Counter(
    'rag_retrieval_total',
    'Total retrieval operations',
    ['status']  # success, empty, error
)

@retrieval_latency.time()
async def monitored_retrieve(query: str):
    try:
        docs = await retriever.retrieve(query)
        status = "success" if docs else "empty"
        retrieval_count.labels(status=status).inc()
        return docs
    except Exception:
        retrieval_count.labels(status="error").inc()
        raise
Best Practices Summary
- Start with RAG - It’s faster to iterate and debug
- Fine-tune for style - When you need consistent output format
- Use hybrid search - Dense + sparse beats either alone
- Rerank everything - Cross-encoders significantly improve precision
- Monitor faithfulness - Hallucinations are your enemy
- Chunk thoughtfully - Bad chunking ruins good embeddings
Building a RAG system or fine-tuning a model? I’d love to hear about your approach and challenges.
