Inference Optimization: Making LLMs Fast and Cheap Enough for Production



Everyone talks about training large language models. The unsexy truth is that inference is where the money goes. A model that costs $10M to train might cost $50M/year to serve at scale. Inference optimization is the difference between a viable product and a money pit.

This post covers the techniques that matter most in 2026 — not the research paper versions, but the practical implementations that engineers are shipping in production today.

Image: GPU server infrastructure (photo by Lars Kienle on Unsplash)


The Inference Cost Stack

Before optimizing, know what you’re actually paying for:

User Request
    ↓
Tokenization (~0ms)
    ↓
Prefill Phase: Process input tokens in parallel
    - Cost: O(n²) attention, expensive for long prompts
    ↓
Decode Phase: Generate output tokens one-by-one
    - Cost: O(n) per token, memory bandwidth bound
    ↓
Response

The key insight: prefill is compute-bound, decode is memory-bandwidth bound. Different optimizations target different phases.
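A back-of-the-envelope sketch makes the split concrete. The hardware figures below (3,000 GB/s of memory bandwidth, 1,000 TFLOPS of peak compute) are round assumed numbers for a hypothetical accelerator, and the estimate ignores attention FLOPs and KV cache reads, so treat it as a lower bound rather than a benchmark.

```python
def decode_step_ms(param_count: float, bytes_per_weight: float, hbm_gb_per_s: float) -> float:
    """Lower bound for one decode step at batch size 1: every weight is read once."""
    weight_bytes = param_count * bytes_per_weight
    return weight_bytes / (hbm_gb_per_s * 1e9) * 1e3


def prefill_ms(param_count: float, prompt_tokens: int, peak_tflops: float) -> float:
    """Lower bound for prefill: roughly 2 FLOPs per parameter per prompt token."""
    flops = 2 * param_count * prompt_tokens
    return flops / (peak_tflops * 1e12) * 1e3


# Assumed: 70B parameters in BF16 on the hypothetical accelerator above.
per_token_decode = decode_step_ms(70e9, 2, 3000)      # about 46.7 ms per output token
prompt_prefill = prefill_ms(70e9, 2000, 1000)         # about 280 ms for 2,000 tokens
per_token_prefill = prompt_prefill / 2000             # about 0.14 ms per prompt token
print(per_token_decode, prompt_prefill, per_token_prefill)
```

Under these assumptions a prompt token costs a small fraction of what an output token costs, which is why decode-side optimizations (quantization, batching, speculation) dominate the rest of this post.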


Quantization: The First Lever

Quantization reduces model weight precision. Most models train in BF16 (2 bytes/weight). You can serve in INT8, INT4, or even lower.
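To show what "reducing precision" means mechanically, here is a minimal sketch of symmetric absmax INT8 quantization on a plain list of floats. It is the simplest possible scheme, not what bitsandbytes, GPTQ, or AWQ actually do (those use per-group scales, outlier handling, or calibration data); the function names are made up for illustration.

```python
def quantize_int8(weights):
    """Map floats to integers in [-127, 127] using one shared scale."""
    scale = max(abs(w) for w in weights) / 127.0
    if scale == 0.0:
        scale = 1.0  # all-zero tensor: any scale works
    q = [max(-127, min(127, round(w / scale))) for w in weights]
    return q, scale


def dequantize(q, scale):
    """Recover approximate floats from the stored integers."""
    return [v * scale for v in q]


weights = [0.5, -1.27, 0.01, 1.0]
q, scale = quantize_int8(weights)
restored = dequantize(q, scale)
max_error = max(abs(a - b) for a, b in zip(weights, restored))
print(q, scale, max_error)
```

The rounding error per weight is bounded by half the scale, which is why a single large outlier (it inflates the scale) hurts quality and why production schemes treat outliers separately.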

INT8 (W8A16)

  • Weights stored in INT8, activations stay in FP16
  • ~2x memory reduction, <1% quality degradation on most tasks
  • Widely supported via bitsandbytes (the LLM.int8() method) and GPTQ

from transformers import AutoModelForCausalLM, BitsAndBytesConfig

bnb_config = BitsAndBytesConfig(
    load_in_8bit=True,
    llm_int8_threshold=6.0,  # Outlier activations above this magnitude stay in FP16
    llm_int8_skip_modules=["lm_head"],  # Keep output layer precise
)

model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-4-Scout-17B",
    quantization_config=bnb_config,
    device_map="auto",  # Spread layers across the available GPUs
)

INT4 (AWQ / GPTQ)

  • ~4x memory reduction vs BF16
  • 1–3% quality degradation depending on task
  • AWQ (Activation-Aware Weight Quantization) is generally better than GPTQ on reasoning tasks
  • Enables running 70B models on consumer-grade 4x24GB GPU setups
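The memory claims above follow from simple arithmetic. This sketch counts weight bytes only; it deliberately ignores quantization scales and zero-points, the KV cache, and activation memory, so real footprints are somewhat higher.

```python
BYTES_PER_WEIGHT = {"bf16": 2.0, "int8": 1.0, "fp8": 1.0, "int4": 0.5}


def weight_gb(param_count: float, dtype: str) -> float:
    """Weight-only footprint in GB (1 GB = 1e9 bytes)."""
    return param_count * BYTES_PER_WEIGHT[dtype] / 1e9


for dtype in ("bf16", "int8", "int4"):
    print(dtype, weight_gb(70e9, dtype))  # 140.0, 70.0, 35.0

# A 4x24 GB setup has 96 GB in total, so INT4 weights leave room for the KV cache.
headroom_gb = 4 * 24 - weight_gb(70e9, "int4")
print(headroom_gb)  # 61.0
```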

FP8 (NVIDIA H100/H200)

  • Supported natively on Hopper+ GPUs
  • ~2x over BF16, near-zero quality loss
  • vLLM 0.5+ and TensorRT-LLM both support FP8 seamlessly

# vLLM with FP8
python -m vllm.entrypoints.openai.api_server \
  --model meta-llama/Llama-4-Maverick-17B \
  --quantization fp8 \
  --max-model-len 128000

Practical guidance: Use FP8 on H100s, AWQ INT4 on older hardware, INT8 when quality can’t budge.


Speculative Decoding

One of the biggest throughput improvements of the last two years. The idea: use a small draft model to speculatively generate multiple tokens, then verify them all at once with the large model.

Normal decode:     [large_model] → token 1 → [large_model] → token 2 → ...
                   N large model calls for N tokens

Speculative:       [small_model] → tokens 1-4 (draft)
                   [large_model] → verify all 4 in one forward pass
                   Accept 3, reject 1; the large model's own token replaces the reject
                   Net: 4 tokens for the price of ~1.3 large model calls
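The draft-then-verify loop can be sketched with toy "models" that are just functions from a token list to the next token. This is the greedy variant only: real systems verify all draft positions in one batched forward pass and use rejection sampling when the temperature is above zero. All names here are hypothetical.

```python
from typing import Callable, List

NextToken = Callable[[List[int]], int]


def speculative_step(target: NextToken, draft: NextToken,
                     context: List[int], k: int) -> List[int]:
    """Return the tokens emitted by one draft-and-verify step (greedy decoding)."""
    # 1. The draft model proposes k tokens autoregressively.
    proposal: List[int] = []
    for _ in range(k):
        proposal.append(draft(context + proposal))

    # 2. The target checks each position (a loop here, one forward pass in practice).
    emitted: List[int] = []
    for tok in proposal:
        expected = target(context + emitted)
        if tok != expected:
            emitted.append(expected)  # first mismatch: keep the target's token, stop
            return emitted
        emitted.append(tok)

    # 3. Every draft token matched, so the target contributes one bonus token.
    emitted.append(target(context + emitted))
    return emitted


def toy_target(tokens: List[int]) -> int:
    return (tokens[-1] + 1) % 10


def toy_draft(tokens: List[int]) -> int:
    # Agrees with the target except right after token 3.
    return 0 if tokens[-1] == 3 else (tokens[-1] + 1) % 10


print(speculative_step(toy_target, toy_draft, [0], k=4))  # [1, 2, 3, 4]
```

Three draft tokens are accepted, the fourth is rejected and replaced, and the output is identical to what the target would have produced on its own.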

When It Works

Speculative decoding works best when:

  • The draft model agrees with the large model often (high acceptance rate)
  • You’re generating long outputs (>50 tokens)
  • You have spare compute for the draft model

Typical acceptance rates: 70–85% for code generation, 60–75% for chat, lower for creative tasks.
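To turn an acceptance rate into an expected speedup, a common simplification is to assume each draft token is accepted independently with probability alpha. Under that assumption (real acceptance is correlated across positions), the expected number of tokens emitted per verification step with k draft tokens is (1 - alpha^(k+1)) / (1 - alpha).

```python
def expected_tokens_per_step(alpha: float, k: int) -> float:
    """Expected tokens per large-model pass, assuming i.i.d. acceptance."""
    if alpha >= 1.0:
        return float(k + 1)  # every draft token accepted, plus the bonus token
    return (1.0 - alpha ** (k + 1)) / (1.0 - alpha)


for alpha in (0.6, 0.7, 0.8):
    print(alpha, round(expected_tokens_per_step(alpha, k=5), 2))
```

With five draft tokens this gives roughly 2.4 tokens per step at 60% acceptance and 3.7 at 80%, before subtracting the cost of running the draft model.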

Implementation with vLLM

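from vllm import LLM, SamplingParams

# Target model plus a smaller draft model that shares its tokenizer.
# These keyword arguments match older vLLM releases; newer releases take a
# speculative_config dict instead, so check the docs for your version.
llm = LLM(
    model="meta-llama/Llama-4-Maverick-70B",
    speculative_model="meta-llama/Llama-4-Scout-8B",
    num_speculative_tokens=5,                   # Draft tokens proposed per step
    speculative_draft_tensor_parallel_size=1,   # Keep the draft model on one GPU
)

prompts = ["Explain speculative decoding in one paragraph."]
params = SamplingParams(temperature=0.7, max_tokens=512)
outputs = llm.generate(prompts, params)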

Self-Speculative (Medusa / EAGLE)

No separate draft model required. Extra “heads” on the base model draft multiple tokens simultaneously. Lower memory overhead but slightly lower speedup than separate draft models.

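EAGLE-2 (2024) reports 3–4x speedups on code tasks. Because the base model still verifies every drafted token, the output distribution is unchanged in principle, which is what makes the result impressive.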


KV Cache Optimization

Every token attended to in the decode phase requires its key-value tensors to be loaded from memory. For long contexts, this dominates latency and memory.
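The size of the cache is easy to estimate: two tensors (key and value) per layer, per KV head, per token. The configuration below (80 layers, 8 KV heads via grouped-query attention, head dimension 128, FP16 cache) is an assumed 70B-class shape chosen for illustration, not a figure from any specific model card.

```python
def kv_bytes_per_token(layers: int, kv_heads: int, head_dim: int,
                       bytes_per_value: float = 2.0) -> float:
    """Bytes of KV cache one token occupies (the factor 2 is key plus value)."""
    return 2 * layers * kv_heads * head_dim * bytes_per_value


per_token = kv_bytes_per_token(layers=80, kv_heads=8, head_dim=128)
print(per_token)                      # 327680.0 bytes, i.e. 320 KiB per token
print(per_token * 128_000 / 1e9)      # about 41.9 GB for one 128k-token request
```

Under these assumptions a single full-length request needs more cache than INT4 weights for the whole model, and every decode step has to read all of it.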

Paged Attention

vLLM’s core innovation (now industry standard). Instead of pre-allocating a contiguous KV cache block per request (causing fragmentation and waste), paged attention uses fixed-size “pages” managed like virtual memory.

Result: 2–4x better GPU memory utilization, enabling much higher batch sizes.
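The bookkeeping behind paging can be sketched as a toy allocator: each request owns a block table that maps its logical positions to physical blocks, and a new block is claimed only when the previous one fills up. This is an illustration of the idea, not vLLM's actual data structures; the class and method names are invented.

```python
class PagedKVAllocator:
    """Toy block-table allocator for a paged KV cache."""

    def __init__(self, num_blocks: int, block_size: int):
        self.block_size = block_size
        self.free = list(range(num_blocks))   # physical block ids not in use
        self.block_tables = {}                # request id -> list of block ids
        self.lengths = {}                     # request id -> tokens stored

    def append_token(self, req_id: str) -> None:
        n = self.lengths.get(req_id, 0)
        if n % self.block_size == 0:          # first token, or last block is full
            if not self.free:
                raise MemoryError("no free KV blocks; preempt or queue the request")
            self.block_tables.setdefault(req_id, []).append(self.free.pop())
        self.lengths[req_id] = n + 1

    def release(self, req_id: str) -> None:
        """Return all of a finished request's blocks to the free pool."""
        self.free.extend(self.block_tables.pop(req_id, []))
        self.lengths.pop(req_id, None)


alloc = PagedKVAllocator(num_blocks=4, block_size=16)
for _ in range(17):
    alloc.append_token("req-a")               # 17 tokens need 2 blocks
alloc.append_token("req-b")                   # 1 token needs 1 block
print(alloc.block_tables, len(alloc.free))
```

Waste is limited to the unfilled tail of each request's last block, instead of a full maximum-length reservation per request.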

KV Cache Compression

For very long contexts, you can compress old KV cache entries:

  • StreamingLLM: Keep only initial tokens + recent tokens (attention sink phenomenon)
  • SnapKV / PyramidKV: Adaptive pruning based on attention scores
  • Quantized KV Cache (FP8/INT8): Store cache in lower precision

# vLLM KV cache quantization
from vllm import LLM

llm = LLM(
    model="Qwen/Qwen3-72B",
    kv_cache_dtype="fp8_e5m2",  # Halves KV cache memory vs. FP16
    max_model_len=128000,
)
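The StreamingLLM policy from the list above is simple enough to sketch directly: keep a handful of initial "sink" positions plus a sliding window of recent ones, and evict everything in between. The sink count and window size below are illustrative defaults, not tuned values, and the function name is hypothetical.

```python
from typing import List, Sequence


def streaming_keep(positions: Sequence[int], n_sink: int = 4, window: int = 8) -> List[int]:
    """Return the cache positions retained under a sink-plus-recent policy."""
    if len(positions) <= n_sink + window:
        return list(positions)                       # nothing to evict yet
    return list(positions[:n_sink]) + list(positions[-window:])


print(streaming_keep(range(20)))
# [0, 1, 2, 3, 12, 13, 14, 15, 16, 17, 18, 19]
```

The cache size stays constant no matter how long generation runs, at the price of the model no longer seeing the evicted middle of the context.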

Prefix Caching

If many requests share the same system prompt or few-shot examples, cache the KV states for that prefix. Subsequent requests skip re-computing the prefix entirely.

# vLLM enables prefix caching by default in recent versions.
# SGLang's RadixAttention (a radix-tree prefix cache) is also on by default;
# pass disable_radix_cache=True to turn it off.
from sglang import Runtime

rt = Runtime(
    model_path="meta-llama/Llama-4-Scout-17B",
)

In production RAG systems where the same large context is prepended to every query, prefix caching can reduce prefill compute by 40–60%.
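One way to picture the lookup is a toy cache keyed by chained block hashes, so that a block only matches if everything before it matched too. This is a sketch of the general idea; the block size, class name, and hashing scheme are assumptions for illustration, not the internals of vLLM or SGLang.

```python
import hashlib
from typing import List


class PrefixCache:
    """Toy prefix cache: remembers which token-block prefixes were seen before."""

    def __init__(self, block_size: int = 4):
        self.block_size = block_size
        self.cached = set()

    def _block_keys(self, tokens: List[int]) -> List[str]:
        keys, running = [], hashlib.sha256()
        full = len(tokens) - len(tokens) % self.block_size   # only whole blocks are cached
        for start in range(0, full, self.block_size):
            block = tokens[start:start + self.block_size]
            running.update(repr(block).encode())             # hash depends on the whole prefix
            keys.append(running.copy().hexdigest())
        return keys

    def lookup_and_insert(self, tokens: List[int]) -> int:
        """Return how many leading tokens can skip prefill, then cache this prompt."""
        keys = self._block_keys(tokens)
        hits = 0
        for key in keys:
            if key not in self.cached:
                break
            hits += 1
        self.cached.update(keys)
        return hits * self.block_size


cache = PrefixCache(block_size=4)
system_prompt = list(range(8))                                   # shared 8-token prefix
print(cache.lookup_and_insert(system_prompt + [100, 101, 102, 103]))  # 0 (cold cache)
print(cache.lookup_and_insert(system_prompt + [200, 201, 202, 203]))  # 8 (prefix reused)
```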


Continuous Batching

The naive approach: wait for N requests, batch them, and process them together. The problem: requests finish at different times, so finished slots sit idle until the longest request in the batch completes.

Continuous batching (also called iteration-level scheduling): at each decode step, the scheduler can inject new requests and eject completed ones. GPU utilization goes from ~40% to 80%+.

All major serving frameworks implement this now (vLLM, SGLang, TGI). If you’re building a custom serving layer, don’t reinvent it — use one of these.
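A small simulation shows where the utilization gain comes from. It counts decode steps only, treats each request as a number of output tokens, and ignores prefill and memory limits, so the percentages are illustrative rather than predictive. The function names are hypothetical.

```python
from collections import deque
from typing import List


def static_batching_steps(lengths: List[int], slots: int) -> int:
    """Each batch runs until its longest request finishes."""
    return sum(max(lengths[i:i + slots]) for i in range(0, len(lengths), slots))


def continuous_batching_steps(lengths: List[int], slots: int) -> int:
    """Free slots are refilled from the queue before every decode step."""
    queue, active, steps = deque(lengths), [], 0
    while queue or active:
        while queue and len(active) < slots:
            active.append(queue.popleft())        # admit waiting requests
        active = [remaining - 1 for remaining in active]   # one token per request
        active = [remaining for remaining in active if remaining > 0]
        steps += 1
    return steps


output_lengths = [10, 2, 2, 2, 10, 2, 2, 2]
total_tokens = sum(output_lengths)
for name, fn in (("static", static_batching_steps), ("continuous", continuous_batching_steps)):
    steps = fn(output_lengths, 4)
    print(name, steps, f"{total_tokens / (steps * 4):.0%} slot utilization")
```

On this workload the static scheduler needs 20 steps and the continuous one needs 12, because short requests no longer hold a slot hostage while a long one finishes.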


Tensor Parallelism vs Pipeline Parallelism

When a model doesn’t fit on one GPU:

Tensor Parallelism (TP)

  • Split individual weight matrices across GPUs
  • High NVLink bandwidth required between GPUs
  • Communication adds latency at every layer (all GPUs synchronize after each one)
  • Best for same-node multi-GPU (A100/H100 NVLink)

Pipeline Parallelism (PP)

  • Split layers across GPUs (GPU 0 runs layers 0-15, GPU 1 runs layers 16-31, …)
  • Lower bandwidth requirements — only activations cross boundaries
  • Introduces “bubble” overhead
  • Best for multi-node setups

# vLLM tensor + pipeline parallel
from vllm import LLM

llm = LLM(
    model="deepseek-ai/DeepSeek-V3",
    tensor_parallel_size=4,   # 4 GPUs within a node
    pipeline_parallel_size=2, # 2 nodes
)
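The pipeline "bubble" can be quantified. For a simple schedule where m micro-batches flow through p stages one after another (the GPipe-style schedule, assumed here; other schedules behave differently), the idle fraction is (p - 1) / (m + p - 1).

```python
def pipeline_bubble_fraction(stages: int, micro_batches: int) -> float:
    """Fraction of stage-time spent idle in a simple fill-and-drain pipeline."""
    return (stages - 1) / (micro_batches + stages - 1)


for m in (1, 4, 16):
    print(m, round(pipeline_bubble_fraction(stages=2, micro_batches=m), 3))
```

With two stages the bubble is 50% for a single micro-batch and under 6% with sixteen, which is why pipeline parallelism needs enough concurrent requests to pay off.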

Disaggregated Prefill/Decode

Emerging in 2025, now production-ready in 2026: run prefill and decode on separate GPU pools.

Why? Prefill is compute-intensive (wants big GPUs). Decode is memory-bandwidth intensive (wants many GPUs or HBM-dense GPUs).

Requests → Prefill Farm (H100 SXM, 8x per node) → KV Transfer → Decode Farm (H200, 4x per node)

This lets you optimize each phase independently and handle traffic spikes more gracefully. Mooncake (developed by Moonshot AI) is the most mature open-source implementation.
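The price of disaggregation is moving the prefilled KV cache between pools. A rough estimate, assuming 327,680 bytes of KV cache per token (an illustrative 70B-class figure with grouped-query attention and an FP16 cache) and a 50 GB/s interconnect (also an assumption), looks like this:

```python
def kv_transfer_ms(prompt_tokens: int, kv_bytes_per_token: float, link_gb_per_s: float) -> float:
    """Time to ship one request's KV cache from the prefill pool to the decode pool."""
    return prompt_tokens * kv_bytes_per_token / (link_gb_per_s * 1e9) * 1e3


print(kv_transfer_ms(4000, 327_680, 50))   # about 26 ms for a 4,000-token prompt
```

Under these assumptions the transfer is small next to the prefill time for the same prompt, but it grows linearly with prompt length and competes with other traffic on the link, so it belongs in any capacity plan.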


The Numbers: What You Can Expect

For a 70B parameter model serving chat traffic:

Configuration                    Throughput (tokens/s)   Memory   Relative cost
BF16, no optimization            800                     140 GB   1.0x
INT8 + continuous batching       1,800                   70 GB    0.45x
FP8 + speculative decoding       3,200                   70 GB    0.25x
FP8 + spec dec + prefix cache    4,500+                  70 GB    0.18x

Real numbers vary significantly by request pattern, output length, and hardware. But the direction is consistent: stack these optimizations and costs drop 5–6x.
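The relative-cost column is close to what you get by assuming the same hardware at the same hourly price, so that cost per token scales inversely with throughput. That assumption is mine, not something the table states, and it ignores the option of using fewer GPUs once memory halves.

```python
BASELINE_TOKENS_PER_S = 800

throughput = {
    "INT8 + continuous batching": 1800,
    "FP8 + speculative decoding": 3200,
    "FP8 + spec dec + prefix cache": 4500,
}

relative_cost = {
    name: round(BASELINE_TOKENS_PER_S / tps, 2) for name, tps in throughput.items()
}
print(relative_cost)
```

This yields 0.44x, 0.25x, and 0.18x, which lines up with the table to within rounding.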


Practical Stack for 2026

For most teams:

Serving framework: vLLM (most mature ecosystem) or SGLang (better for structured outputs/constrained generation)

Quantization: FP8 if on H100+, AWQ INT4 if on older hardware

Speculative decoding: Enable if average output length exceeds 100 tokens and a suitable draft model is available

KV cache: Enable prefix caching by default, FP8 KV cache if memory-constrained

Monitoring: Track TPOT (time per output token), TTFT (time to first token), and GPU MFU (model FLOPs utilization)
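Both latency metrics fall out of per-token timestamps, which most streaming clients can record. A minimal sketch, assuming timestamps in seconds from a monotonic clock and a hypothetical helper name:

```python
from typing import List, Tuple


def latency_metrics(request_sent: float, token_times: List[float]) -> Tuple[float, float]:
    """Return (TTFT, TPOT) in seconds for one streamed response."""
    ttft = token_times[0] - request_sent
    if len(token_times) > 1:
        # Average gap between consecutive tokens after the first one.
        tpot = (token_times[-1] - token_times[0]) / (len(token_times) - 1)
    else:
        tpot = 0.0
    return ttft, tpot


ttft, tpot = latency_metrics(0.0, [0.25, 0.30, 0.35, 0.40, 0.45])
print(f"TTFT={ttft * 1e3:.0f} ms, TPOT={tpot * 1e3:.0f} ms")
```

TTFT mostly reflects queueing plus prefill, while TPOT reflects decode, so tracking them separately tells you which phase an optimization actually moved.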


Conclusion

Inference optimization is now core engineering, not ML research. The gap between an unoptimized serving setup and a well-tuned one is 5–10x in cost and latency. For any team running LLMs at meaningful scale, this work pays back faster than almost anything else you can do.

Start with quantization (easiest win), add continuous batching (vLLM/SGLang give you this for free), then layer in prefix caching and speculative decoding based on your traffic patterns.


What’s your current inference stack? I’m particularly curious about teams running models larger than 70B in production — what’s working?
