Serverless GPU in 2026: Deploying AI Models Without Managing Infrastructure
Running AI models requires GPUs. GPUs are expensive. Reserved GPU instances cost thousands of dollars per month — even when idle. For most teams, the math doesn’t work.
Serverless GPU flips the model: pay only for the seconds your model is actually running. Scale to zero when idle. Handle traffic spikes automatically. No CUDA driver nightmares, no cluster management.
In 2026, serverless GPU platforms have matured enough to handle everything from real-time inference to overnight fine-tuning jobs — at a fraction of the cost of dedicated GPU instances.
The Serverless GPU Landscape
Why Traditional GPU Deployment Fails Most Teams
Traditional approach:
- AWS p3.2xlarge (1× V100): $3.06/hour
- Running 24/7: ~$2,200/month
- Average utilization: 20–40%
- Effective cost: $5,500–$11,000/month equivalent
Serverless GPU:
- Pay per second of GPU time
- Scale to zero when idle
- $0.80–$2.50/hour for H100/A100
- Effectively 100% paid utilization (you only pay while requests are running)
For most AI applications — chatbots, image generation, embedding APIs — traffic is bursty. Serverless matches cost to actual usage.
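To see where the break-even lies, here is a quick back-of-the-envelope sketch in Python. The hourly rates are the illustrative figures from the lists above, not quotes from any provider:

# Rough cost model: reserved vs. per-second serverless billing
RESERVED_RATE = 3.06      # $/hour for an always-on GPU instance (illustrative)
SERVERLESS_RATE = 2.50    # $/hour, billed only for seconds the GPU is busy (illustrative)
HOURS_PER_MONTH = 730

def monthly_costs(utilization: float) -> tuple[float, float]:
    """Return (reserved, serverless) monthly cost for a given utilization in [0, 1]."""
    reserved = RESERVED_RATE * HOURS_PER_MONTH                    # paid 24/7, busy or not
    serverless = SERVERLESS_RATE * HOURS_PER_MONTH * utilization  # paid only while running
    return reserved, serverless

for u in (0.1, 0.3, 0.8):
    reserved, serverless = monthly_costs(u)
    print(f"{u:.0%} utilization: reserved ~${reserved:,.0f}/mo, serverless ~${serverless:,.0f}/mo")

At bursty, low-utilization workloads the serverless column is a small fraction of the reserved one; as sustained utilization climbs, the gap narrows, which is the trade-off revisited in "When NOT to Use Serverless GPU" below.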
Platform Comparison 2026
| Platform | Strengths | GPU Types | Cold Start | Best For |
|---|---|---|---|---|
| Modal | Best DX, Python-native, fast builds | H100, A100, T4 | 5–15s | Custom inference, fine-tuning |
| Replicate | Model marketplace, easy API | A100, H100 | 10–30s | Deploying open models fast |
| RunPod Serverless | Cheapest, flexible | Wide range | 15–45s | Cost-sensitive workloads |
| Beam | Simple Python, good for scripts | A10G, A100 | 10–20s | Batch jobs, scheduled tasks |
| Baseten | MLOps features, monitoring | H100, A100 | 5–10s | Production ML serving |
| Hugging Face Inference Endpoints | HF ecosystem, easy deploy | T4–A100 | Warm: instant | HF model serving |
Modal: The Developer-Favorite Platform
Modal is the most developer-friendly serverless GPU platform in 2026. You write regular Python; Modal handles everything else.
Getting Started
pip install modal
modal token new # Authenticate with modal.com
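Before deploying a full model, it is worth a quick smoke test that your account and GPU access work. A minimal sketch (the app name, the T4 choice, and the nvidia-smi check are illustrative assumptions):

# hello_gpu.py: verify that a function runs on a cloud GPU
import subprocess
import modal

app = modal.App("hello-gpu")

@app.function(gpu="T4")  # any small, cheap GPU is fine for a smoke test
def check_gpu() -> str:
    # Report the GPU visible inside the container
    return subprocess.run(["nvidia-smi"], capture_output=True, text=True).stdout

@app.local_entrypoint()
def main():
    print(check_gpu.remote())

Run it with modal run hello_gpu.py; the nvidia-smi output should show the requested GPU.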
Deploy an LLM Inference Endpoint
# inference.py
import modal

app = modal.App("llm-inference")

# Define the container image
image = (
    modal.Image.debian_slim(python_version="3.11")
    .pip_install(
        "transformers",
        "torch",
        "accelerate",
        "bitsandbytes",
        "fastapi",
        "uvicorn",
    )
)

# Model storage volume (download once, reuse)
volume = modal.Volume.from_name("model-weights", create_if_missing=True)

@app.cls(
    image=image,
    gpu=modal.gpu.H100(),         # Request H100 GPU
    memory=32768,                 # 32GB RAM
    volumes={"/models": volume},  # Persistent model storage
    container_idle_timeout=300,   # Keep warm 5 min after last request
    allow_concurrent_inputs=4,    # Handle 4 concurrent requests per container
)
class LLMInference:
    @modal.enter()
    def load_model(self):
        """Called once when container starts."""
        from transformers import AutoModelForCausalLM, AutoTokenizer
        import torch

        model_name = "meta-llama/Llama-3.3-70B-Instruct"
        cache_dir = "/models/llama-3.3-70b"

        print("Loading tokenizer...")
        self.tokenizer = AutoTokenizer.from_pretrained(
            model_name, cache_dir=cache_dir
        )

        print("Loading model...")
        self.model = AutoModelForCausalLM.from_pretrained(
            model_name,
            cache_dir=cache_dir,
            torch_dtype=torch.bfloat16,
            device_map="auto",
            load_in_4bit=True,  # 4-bit quantization: 4× memory reduction
        )
        print("Model loaded!")

    @modal.method()
    def generate(self, prompt: str, max_tokens: int = 512) -> str:
        """Generate text from prompt."""
        inputs = self.tokenizer(
            prompt,
            return_tensors="pt",
        ).to(self.model.device)

        outputs = self.model.generate(
            **inputs,
            max_new_tokens=max_tokens,
            temperature=0.7,
            do_sample=True,
            pad_token_id=self.tokenizer.eos_token_id,
        )

        # Decode only the new tokens
        new_tokens = outputs[0][inputs["input_ids"].shape[1]:]
        return self.tokenizer.decode(new_tokens, skip_special_tokens=True)

@app.local_entrypoint()
def main():
    inference = LLMInference()
    result = inference.generate.remote(
        "Explain the CAP theorem in simple terms:"
    )
    print(result)
# Deploy
modal deploy inference.py

# Run the local entrypoint (code executes in the cloud)
modal run inference.py
Web Endpoint (Production API)
from fastapi import FastAPI
from pydantic import BaseModel

web_app = FastAPI()

class GenerateRequest(BaseModel):
    prompt: str
    max_tokens: int = 512
    temperature: float = 0.7

@app.function(
    image=image,
    volumes={"/models": volume},
    container_idle_timeout=600,
)
@modal.asgi_app()
def inference_api():
    """Expose as HTTPS API endpoint."""
    # The API container only routes requests; generation runs on the
    # GPU-backed LLMInference class defined above, via .remote().
    model_instance = LLMInference()

    @web_app.post("/generate")
    async def generate(request: GenerateRequest):
        result = model_instance.generate.remote(
            request.prompt,
            request.max_tokens,
        )
        return {"text": result, "model": "llama-3.3-70b"}

    @web_app.get("/health")
    async def health():
        return {"status": "ok"}

    return web_app
modal deploy inference.py
# → Deployed to: https://your-org--inference-api.modal.run
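Once deployed, the endpoint is plain HTTPS, so any client works. A minimal sketch with requests (the URL is the placeholder printed above, not a live deployment):

import requests

resp = requests.post(
    "https://your-org--inference-api.modal.run/generate",  # placeholder from the deploy output
    json={"prompt": "Explain the CAP theorem in simple terms:", "max_tokens": 256},
    timeout=120,
)
resp.raise_for_status()
print(resp.json()["text"])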
Fine-Tuning on Serverless GPU
One of the most powerful use cases: run overnight fine-tuning jobs without maintaining GPU clusters.
# finetune.py
import modal

app = modal.App("llm-finetune")

image = (
    modal.Image.debian_slim(python_version="3.11")
    .pip_install(
        "transformers", "torch", "accelerate",
        "peft", "trl", "datasets", "bitsandbytes",
        "wandb",
    )
)

volume = modal.Volume.from_name("training-runs", create_if_missing=True)

@app.function(
    image=image,
    gpu=modal.gpu.H100(count=2),  # Multi-GPU training
    memory=65536,
    volumes={"/runs": volume},
    timeout=86400,  # 24-hour timeout for long training runs
    secrets=[
        modal.Secret.from_name("huggingface-secret"),
        modal.Secret.from_name("wandb-secret"),
    ],
)
def finetune(
    base_model: str = "meta-llama/Llama-3.1-8B",
    dataset_name: str = "your-org/your-dataset",
    output_dir: str = "/runs/llama-8b-finetuned",
    num_epochs: int = 3,
):
    """Fine-tune with LoRA + QLoRA."""
    from transformers import AutoModelForCausalLM, AutoTokenizer, TrainingArguments
    from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training
    from trl import SFTTrainer
    from datasets import load_dataset
    import torch

    # Load base model with 4-bit quantization
    model = AutoModelForCausalLM.from_pretrained(
        base_model,
        load_in_4bit=True,
        torch_dtype=torch.bfloat16,
        device_map="auto",
    )
    model = prepare_model_for_kbit_training(model)

    # LoRA config — train only a small fraction of the parameters
    lora_config = LoraConfig(
        r=16,  # Rank
        lora_alpha=32,
        target_modules=["q_proj", "v_proj", "k_proj", "o_proj"],
        lora_dropout=0.1,
        bias="none",
        task_type="CAUSAL_LM",
    )
    model = get_peft_model(model, lora_config)
    model.print_trainable_parameters()
    # trainable params: 6,553,600 || all params: 8,036,925,440 || trainable%: 0.08%

    # Load dataset
    tokenizer = AutoTokenizer.from_pretrained(base_model)
    dataset = load_dataset(dataset_name)

    # Training
    trainer = SFTTrainer(
        model=model,
        tokenizer=tokenizer,
        train_dataset=dataset["train"],
        eval_dataset=dataset.get("validation"),
        args=TrainingArguments(
            output_dir=output_dir,
            num_train_epochs=num_epochs,
            per_device_train_batch_size=4,
            gradient_accumulation_steps=4,
            warmup_ratio=0.05,
            learning_rate=2e-4,
            bf16=True,
            logging_steps=10,
            evaluation_strategy="steps",
            eval_steps=100,
            save_strategy="steps",
            save_steps=200,
            report_to="wandb",
        ),
    )
    trainer.train()
    trainer.save_model(output_dir)
    print(f"Training complete! Model saved to {output_dir}")

@app.local_entrypoint()
def main():
    finetune.remote(
        base_model="meta-llama/Llama-3.1-8B",
        dataset_name="your-org/custom-instructions",
        num_epochs=3,
    )
    print("Fine-tuning job submitted! Check wandb for progress.")
modal run finetune.py
# → Job running on 2× H100 in the cloud
# → Cost: ~$7.80/hour (2× H100 at ~$3.90/hr) × ~4 hours ≈ $31 for the job
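Once the job finishes, the LoRA adapter saved under /runs can be loaded on top of the base model for inference, for example from another Modal function that mounts the same training-runs volume. A hedged sketch using peft (paths mirror the output_dir above; merging is optional):

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import PeftModel

base = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-3.1-8B",
    torch_dtype=torch.bfloat16,
    device_map="auto",
)
model = PeftModel.from_pretrained(base, "/runs/llama-8b-finetuned")  # load the LoRA adapter
model = model.merge_and_unload()  # optional: fold the adapter into the base weights

tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-3.1-8B")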
Image Generation: Stable Diffusion on Serverless
# image_gen.py
import modal
import io
from pathlib import Path

app = modal.App("stable-diffusion")

image = (
    modal.Image.debian_slim(python_version="3.11")
    .pip_install("diffusers", "torch", "accelerate", "Pillow", "xformers")
)

model_volume = modal.Volume.from_name("sd-models", create_if_missing=True)

@app.cls(
    image=image,
    gpu="A10G",                  # A10G: good for image gen, cheaper than H100
    volumes={"/models": model_volume},
    container_idle_timeout=120,  # Keep warm 2 min
)
class StableDiffusion:
    @modal.enter()
    def load_pipeline(self):
        from diffusers import StableDiffusionXLPipeline
        import torch

        self.pipe = StableDiffusionXLPipeline.from_pretrained(
            "stabilityai/stable-diffusion-xl-base-1.0",
            torch_dtype=torch.float16,
            cache_dir="/models/sdxl",
        ).to("cuda")
        self.pipe.enable_xformers_memory_efficient_attention()

    @modal.method()
    def generate(
        self,
        prompt: str,
        negative_prompt: str = "blurry, low quality, distorted",
        steps: int = 30,
        guidance_scale: float = 7.5,
    ) -> bytes:
        image = self.pipe(
            prompt=prompt,
            negative_prompt=negative_prompt,
            num_inference_steps=steps,
            guidance_scale=guidance_scale,
            width=1024,
            height=1024,
        ).images[0]

        buffer = io.BytesIO()
        image.save(buffer, format="PNG")
        return buffer.getvalue()

@app.local_entrypoint()
def main():
    sd = StableDiffusion()
    image_bytes = sd.generate.remote(
        "A futuristic server room with quantum computers, cinematic lighting"
    )
    Path("output.png").write_bytes(image_bytes)
    print("Image saved to output.png")
Cost Optimization Strategies
1. Quantization: Smaller Models, Same Quality
# 4-bit quantization: ~4× memory reduction, ~5% quality loss
import torch
from transformers import BitsAndBytesConfig

quantization_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_compute_dtype=torch.bfloat16,
    bnb_4bit_use_double_quant=True,
    bnb_4bit_quant_type="nf4",
)

# Llama 70B in bf16: ~140 GB of weights alone (multiple 80 GB A100s)
# With 4-bit quantization (~35 GB of weights): fits on 1× H100 (80 GB)
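Wiring the config into a model load is one line; a minimal sketch (the model id is illustrative):

from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-3.3-70B-Instruct",      # illustrative model id
    quantization_config=quantization_config,  # the BitsAndBytesConfig defined above
    device_map="auto",
)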
2. Batching for Throughput
@app.cls(
    gpu="H100",
    allow_concurrent_inputs=32,  # Accept up to 32 concurrent requests so they can be batched
)
class BatchedInference:
    # Dynamic batching: Modal groups individual calls into one batch of up to 32,
    # waiting at most 100 ms to fill it (modal.batched takes the place of modal.method here).
    @modal.batched(max_batch_size=32, wait_ms=100)
    def generate_batch(self, prompts: list[str]) -> list[str]:
        """Process up to 32 prompts in one GPU forward pass."""
        # Batched inference is 10–20× more efficient than one-by-one.
        # Assumes self.tokenizer / self.model were loaded in a @modal.enter() hook.
        inputs = self.tokenizer(prompts, padding=True, return_tensors="pt").to(self.model.device)
        outputs = self.model.generate(**inputs, max_new_tokens=256)
        return self.tokenizer.batch_decode(outputs, skip_special_tokens=True)
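Callers still send one prompt at a time; the platform assembles the batch behind the scenes. A short usage sketch against the class above (assuming the usual .remote() call path):

engine = BatchedInference()
# Each call passes a single prompt; concurrent calls get grouped into one forward pass
reply = engine.generate_batch.remote("Summarize the CAP theorem in one sentence.")
print(reply)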
3. Spot/Preemptible Instances
RunPod Serverless and Modal both support spot pricing — 50–80% cheaper, with automatic retry on preemption. Ideal for batch jobs, fine-tuning.
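Preemption means a job can be killed mid-run, so the main code-level requirement is that work is retried and idempotent. A hedged sketch of a retry-friendly batch function on Modal (the function body and retry count are placeholders; spot placement itself is handled by the platform and plan, not by this code):

@app.function(
    gpu="A100",
    timeout=4 * 60 * 60,  # generous ceiling for a long batch job
    retries=3,            # resubmit automatically if a worker is interrupted
)
def nightly_embedding_job(shard_id: int):
    # Keep each shard idempotent (e.g. write to a shard-specific output path)
    # so a retry after preemption cannot corrupt or duplicate results.
    ...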
4. Right-Sizing GPU Selection
# For different model sizes:
gpu_map = {
    "7B model": modal.gpu.T4(),           # $0.59/hr — embedding, small gen
    "13B model": modal.gpu.A10G(),        # $1.10/hr — balanced
    "70B model": modal.gpu.A100(),        # $3.00/hr — large inference
    "70B + fine-tune": modal.gpu.H100(),  # $3.90/hr — max performance
}
Real-World Cost Comparison
Scenario: 1M inference requests/month, 70B LLM, avg 1 second/request
| Approach | Monthly Cost |
|---|---|
| Reserved A100 instance (AWS p4d.24xlarge) | $7,200/month |
| Serverless GPU (Modal A100) | ~$830/month |
| Serverless GPU with batching | ~$200/month |
| Serverless + quantization + batching | ~$80/month |
The combination of serverless, quantization, and batching can reduce GPU costs by 90× compared to naive dedicated deployment.
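The ~$830 serverless figure follows directly from the request volume; a quick sanity check of the arithmetic (the A100 rate is the illustrative one from the right-sizing list above):

requests_per_month = 1_000_000
seconds_per_request = 1.0
a100_rate_per_hour = 3.00  # illustrative rate

gpu_hours = requests_per_month * seconds_per_request / 3600  # ~278 GPU-hours
monthly_cost = gpu_hours * a100_rate_per_hour                # ~$833
print(f"{gpu_hours:.0f} GPU-hours ≈ ${monthly_cost:,.0f}/month")

Batching shrinks the GPU-seconds spent per request, and quantization lets a smaller, cheaper GPU serve the model, which is where the ~$200 and ~$80 rows come from.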
When NOT to Use Serverless GPU
Serverless GPU isn’t always the right choice:
- High sustained load (>80% utilization): Reserved instances become cheaper
- Sub-100ms latency requirements: Cold starts (5–30s) may not meet SLAs
- Complex multi-GPU training with NCCL: Some platforms limit multi-node comms
- Strict data residency: Ensure platform compliance with your requirements
For latency-critical production serving, consider keeping one warm instance running permanently (eliminates cold starts) while using serverless for burst capacity.
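On Modal, the "one warm instance" pattern can be expressed by asking for a minimum number of containers to stay up. A hedged sketch using the keep_warm option (the exact parameter name and availability vary across platforms and versions; image and the function body are assumed from earlier examples):

@app.function(
    image=image,
    gpu=modal.gpu.A100(),
    keep_warm=1,                 # always keep one container, and its loaded model, ready
    container_idle_timeout=600,  # extra burst containers still scale back down
)
def serve(prompt: str) -> str:
    ...  # low-latency serving path; burst traffic spills over to additional containers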
Summary
Serverless GPU in 2026 has made AI deployment accessible to every team. The key takeaways:
- Modal has the best developer experience — Python-native, fast iteration, excellent docs
- Combine quantization + batching for 10–20× cost efficiency
- Persistent volumes solve the model download problem — download once, reuse forever
- Cold starts are manageable — use container_idle_timeout to keep models warm during active periods
- Fine-tuning jobs are ideal for serverless — run, then scale to zero
The era of paying for idle GPUs is over. Start serverless, scale to dedicated when your economics demand it.
Tags: Serverless GPU, Modal, AI Inference, LLM Deployment, Fine-tuning, MLOps, LoRA, Stable Diffusion
