LLM Evaluation in Production: Beyond Vibes and Spot Checks
Tags: LLM, AI Engineering, Evaluation, MLOps, Production AI
Shipping LLM-powered features is now table stakes. The hard part isn’t getting a prototype working — it’s knowing whether your model is performing well, catching regressions before users do, and maintaining quality as you iterate on prompts and models.
Most teams start with vibes: manually read some outputs, decide they look good, ship. This works until it doesn’t — which usually means a degraded user experience that took days to discover.
This post covers how to build an evaluation pipeline that actually catches problems.
The Evaluation Gap
Traditional software has clear correctness criteria. A function either returns the right value or it doesn’t. You write a test, it passes or fails, done.
LLMs don’t work like that. A response can be:
- Correct — answers the question accurately
- Plausible but wrong — sounds confident, is actually incorrect
- Correct but unhelpful — technically accurate, misses what the user needed
- Format-correct but content-poor — follows instructions, lacks substance
Measuring quality here requires judgment, which is why manual review feels like the only option. But judgment can be systematized.
The Evaluation Stack
A mature LLM evaluation pipeline has four layers:
1. Deterministic Checks (Fast, Cheap, Reliable)
These are programmatic assertions that don’t require an AI judge:
def check_response(response: str, expects_code: bool = False) -> bool:
    # company_name and contains_pii are application-specific; define them elsewhere
    checks = [
        # Format checks
        lambda r: len(r) >= 50,                          # Not too short
        lambda r: len(r) <= 2000,                        # Not too long
        lambda r: not r.startswith("I cannot"),          # No refusal
        lambda r: "```" in r if expects_code else True,  # Code block present
        # Content checks
        lambda r: "error" not in r.lower(),              # No error bleed-through
        lambda r: company_name in r,                     # Brand compliance
        # Safety checks
        lambda r: not contains_pii(r),                   # No PII in output
    ]
    return all(check(response) for check in checks)
These run in microseconds and should gate every response. They catch obvious failures: empty responses, format violations, safety violations.
2. Reference-Based Evaluation
For tasks with a known-good answer (e.g., classification, extraction, structured data generation), compare against a golden dataset:
# Example: entity extraction evaluation
golden = [
    {"input": "Email John at john@example.com", "expected": {"emails": ["john@example.com"]}},
    {"input": "Call Sarah on 555-1234", "expected": {"phones": ["555-1234"]}},
]

results = []
for item in golden:
    actual = extract_entities(item["input"])
    results.append({
        "precision": compute_precision(actual, item["expected"]),
        "recall": compute_recall(actual, item["expected"]),
    })
Maintain a golden dataset of 50–200 representative inputs. Run it on every prompt change. Track precision/recall/F1 over time.
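The helpers `compute_precision` and `compute_recall` are left undefined above. For entity extraction, a minimal set-based sketch, treating each (type, value) pair as one entity, could look like this:

```python
def _flatten(entities: dict) -> set:
    """Flatten {"emails": [...], "phones": [...]} into a set of (type, value) pairs."""
    return {(etype, value) for etype, values in entities.items() for value in values}

def compute_precision(actual: dict, expected: dict) -> float:
    """Fraction of predicted entities that are correct."""
    predicted = _flatten(actual)
    if not predicted:
        return 0.0
    return len(predicted & _flatten(expected)) / len(predicted)

def compute_recall(actual: dict, expected: dict) -> float:
    """Fraction of expected entities that were found."""
    gold = _flatten(expected)
    if not gold:
        return 1.0  # nothing to find counts as full recall
    return len(_flatten(actual) & gold) / len(gold)
```

Exact string matching is the simplest choice; for fuzzier tasks you may want normalized or partial matching, but start strict and loosen deliberately.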
3. LLM-as-Judge
For open-ended quality (summarization, tone, helpfulness), use a second LLM to evaluate the first:
JUDGE_PROMPT = """
You are evaluating the quality of an AI assistant's response.

User query: {query}
Assistant response: {response}

Rate the response on these dimensions (1-5):
- Accuracy: Does it answer correctly?
- Completeness: Does it cover what was asked?
- Conciseness: Is it appropriately brief?
- Helpfulness: Would a real user find this useful?

Respond with JSON only, e.g. {{"accuracy": 4, "completeness": 5, "conciseness": 3, "helpfulness": 4}}
"""

def judge_response(query: str, response: str) -> dict:
    result = claude.complete(JUDGE_PROMPT.format(query=query, response=response))
    return json.loads(result)
Important caveats:
- LLM judges have biases (prefer their own responses, prefer longer outputs)
- Use a different model family as judge than the model you’re evaluating
- Calibrate your judge against human ratings on a representative sample
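Calibration can be as simple as checking how well judge scores track human ratings on the same sample. A pure-Python sketch, where the 0.7 threshold and the sample scores are illustrative assumptions, not a standard:

```python
def pearson(xs: list[float], ys: list[float]) -> float:
    """Pearson correlation between two equal-length score lists."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs) ** 0.5
    vy = sum((y - my) ** 2 for y in ys) ** 0.5
    return cov / (vx * vy)

# Judge vs. human ratings on the same sampled responses (1-5 scale)
judge_scores = [4, 5, 2, 3, 4, 1, 5, 3]
human_scores = [4, 4, 2, 3, 5, 2, 5, 3]

r = pearson(judge_scores, human_scores)
if r < 0.7:  # threshold is a judgment call; tune to your risk tolerance
    print(f"Judge poorly calibrated (r={r:.2f}); revise the judge prompt")
```

A judge that correlates weakly with your human raters is not safe to use even for trend detection; fix the judge prompt (or model) before trusting its scores.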
4. Human Evaluation (Sparse but Essential)
You can’t automate everything. Set up a lightweight human review loop:
- A/B preference testing — show two responses, human picks better one
- Annotation on failures — when automated evals flag a response, human reviews to calibrate
- Monthly quality audits — sample 50-100 recent responses, rate manually
This doesn’t need to be a massive operation. 2-3 hours per week of structured human review will calibrate your automated evals and catch systematic issues.
Building the Pipeline
Here’s the architecture for a production eval pipeline:
[Production Logs] → [Sampling Layer] → [Eval Runner]
                                             ↓
                              ┌──────────────────────────────┐
                              │  Deterministic Checks        │
                              │  Reference Evals             │
                              │  LLM Judge (async)           │
                              └──────────────┬───────────────┘
                                             ↓
                                  [Results DB] → [Dashboard]
                                             ↓
                                  [Alerting on Regression]
Sampling Strategy
You can’t eval every request — it’s too expensive. Sample intelligently:
- 100% of new/changed prompt variants (before full rollout)
- 5-10% of production traffic (continuous quality monitoring)
- 100% of flagged responses (user feedback, low-confidence responses)
- Stratified sample — ensure coverage across user types, query categories
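The rules above can be sketched as a single sampling decision per request. The field names (`prompt_version_is_canary`, `user_flagged`, `confidence`) are illustrative; map them to your own logging schema:

```python
import random

def should_sample(request: dict, baseline_rate: float = 0.05) -> bool:
    """Decide whether a production request enters the eval pipeline."""
    # 100% of requests on new/changed prompt variants
    if request.get("prompt_version_is_canary"):
        return True
    # 100% of flagged or low-confidence responses
    if request.get("user_flagged") or request.get("confidence", 1.0) < 0.5:
        return True
    # 5-10% of remaining traffic for continuous monitoring
    return random.random() < baseline_rate
```

Stratification can be layered on top by keeping separate `baseline_rate`s per user segment or query category, so low-volume segments still get coverage.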
The Regression Gate
The most valuable part of the pipeline is the regression gate in CI/CD:
# .github/workflows/eval.yml
- name: Run LLM Eval Suite
run: python eval/run_suite.py --dataset golden_v3.jsonl
- name: Check Regression Threshold
run: |
python eval/check_regression.py \
--baseline ./eval/baselines/main.json \
--current ./eval/results/latest.json \
--threshold 0.05 # Allow max 5% regression on any metric
Merging a prompt change requires passing the eval suite. This is the single most impactful thing you can do to maintain quality over time.
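The `check_regression.py` script isn't shown in the workflow; a minimal version might compare per-metric scores between the baseline and current result files. This sketch simplifies the CLI (positional args instead of the flags above), and the metric names in the comment are assumptions:

```python
import json
import sys

def compare_metrics(baseline: dict, current: dict, threshold: float) -> list[str]:
    """Return a message for every metric that dropped more than `threshold`."""
    failures = []
    for metric, base_score in baseline.items():
        cur_score = current.get(metric, 0.0)  # missing metric counts as total regression
        if base_score - cur_score > threshold:
            failures.append(f"{metric}: {base_score:.3f} -> {cur_score:.3f}")
    return failures

if __name__ == "__main__" and len(sys.argv) == 4:
    with open(sys.argv[1]) as f:
        baseline = json.load(f)  # e.g. {"task_success": 0.91, "format_compliance": 0.99}
    with open(sys.argv[2]) as f:
        current = json.load(f)
    failures = compare_metrics(baseline, current, float(sys.argv[3]))
    if failures:
        print("Regressions detected:\n" + "\n".join(failures))
        sys.exit(1)  # non-zero exit fails the CI step
```

Note the asymmetry: improvements never fail the gate, only drops do, so the baseline file should be refreshed from `main` after each merge.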
Metrics That Actually Matter
Don’t track 20 metrics. Track 3-5 that map directly to user value:
| Metric | Measures | How |
|---|---|---|
| Task Success Rate | Did it do the thing? | Reference eval or LLM judge |
| Format Compliance | Did it follow instructions? | Deterministic |
| Factual Accuracy | Is it correct? | RAG grounding check or human |
| Refusal Rate | Is it too conservative? | Keyword detection |
| Latency (P95) | Is it fast enough? | Infrastructure metrics |
Establish baselines. Track weekly. Alert on drops > 5%.
Tools Worth Knowing
Evaluation Frameworks
LangSmith — If you’re in the LangChain ecosystem, LangSmith provides tracing + evaluation with minimal setup. Good default choice.
Braintrust — Purpose-built eval platform. Has a strong LLM-as-judge framework with calibration tooling.
Phoenix (Arize) — Open source, self-hostable. Good for RAG tracing + retrieval quality evaluation.
Ragas — Specialized for RAG evaluation. If you have a retrieval-augmented system, Ragas metrics (faithfulness, answer relevance, context precision) are well-calibrated.
DIY with Pytest — For many teams, a well-organized pytest suite with golden datasets is sufficient and has zero vendor dependency.
Observability
LLM calls need tracing just like API calls:
- OpenTelemetry GenAI semantic conventions — now stable for LLM traces
- Langfuse — open source LLM observability, self-hostable
- Helicone — lightweight proxy-based observability
Common Mistakes
1. Only evaluating the happy path Your golden dataset should include hard cases, edge cases, and adversarial inputs — not just typical queries.
2. Evaluating offline but not online Offline evals on golden datasets miss distribution shift. You need production sampling to catch “this works great on our test data but real users ask differently.”
3. Treating LLM judge scores as ground truth They’re noisy. Use them for trends and regression detection, not absolute quality measurement.
4. Not versioning your prompts Every eval needs to be tied to a specific prompt version. Use semantic versioning for prompts just as you would for code.
5. Evaluating too infrequently Run evals on every PR that touches prompts. Don’t let changes accumulate before you measure.
Getting Started
If you have no eval infrastructure today, here’s the minimal starting point:
- Write 30-50 golden examples — real inputs with verified expected outputs or quality ratings
- Add deterministic checks — length, format, safety keywords
- Set up LangSmith or Langfuse for tracing
- Write a 10-example regression test in pytest that runs in CI
- Sample 5% of production responses and spot-check weekly
You can build from there. The key is that evaluation becomes a habit, not an afterthought.
Resources
- HELM — Holistic Evaluation of Language Models (Stanford)
- Building LLM Evals at Scale (Shreya Shankar)
- Ragas Documentation
- OpenTelemetry GenAI Semantic Conventions
