Serverless GPU in 2026: Deploying AI Models Without Managing Infrastructure
Running AI models requires GPUs. GPUs are expensive. Reserved GPU instances cost thousands of dollars per month — even when idle. For most teams, the math doesn’t work.
Serverless GPU flips the model: pay only for the seconds your model is actually running. Scale to zero when idle. Handle traffic spikes automatically. No CUDA driver nightmares, no cluster management.
In 2026, serverless GPU platforms have matured enough to handle everything from real-time inference to overnight fine-tuning jobs — at a fraction of the cost of dedicated GPU instances.
The Serverless GPU Landscape
Why Traditional GPU Deployment Fails Most Teams
Traditional approach:
- AWS p3.2xlarge (1× V100): $3.06/hour
- Running 24/7: ~$2,200/month
- Average utilization: 20–40%
- Effective cost: $5,500–$11,000/month equivalent
Serverless GPU:
- Pay per second of GPU time
- Scale to zero when idle
- $0.80–$2.50/hour for H100/A100
- Effectively 100% paid utilization (you only pay while requests are running)
For most AI applications — chatbots, image generation, embedding APIs — traffic is bursty. Serverless matches cost to actual usage.
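To see where the break-even lies, here is a quick back-of-the-envelope sketch in Python. The hourly rates are the illustrative figures from the lists above, not quotes from any provider:

# Rough cost model: reserved vs. per-second serverless billing
RESERVED_RATE = 3.06      # $/hour for an always-on GPU instance (illustrative)
SERVERLESS_RATE = 2.50    # $/hour, billed only for seconds the GPU is busy (illustrative)
HOURS_PER_MONTH = 730

def monthly_costs(utilization: float) -> tuple[float, float]:
    """Return (reserved, serverless) monthly cost for a given utilization in [0, 1]."""
    reserved = RESERVED_RATE * HOURS_PER_MONTH                    # paid 24/7, busy or not
    serverless = SERVERLESS_RATE * HOURS_PER_MONTH * utilization  # paid only while running
    return reserved, serverless

for u in (0.1, 0.3, 0.8):
    reserved, serverless = monthly_costs(u)
    print(f"{u:.0%} utilization: reserved ~${reserved:,.0f}/mo, serverless ~${serverless:,.0f}/mo")

At bursty, low-utilization workloads the serverless column is a small fraction of the reserved one; as sustained utilization climbs, the gap narrows, which is the trade-off revisited in "When NOT to Use Serverless GPU" below.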
Platform Comparison 2026
| Platform | Strengths | GPU Types | Cold Start | Best For |
|---|---|---|---|---|
| Modal | Best DX, Python-native, fast builds | H100, A100, T4 | 5–15s | Custom inference, fine-tuning |
| Replicate | Model marketplace, easy API | A100, H100 | 10–30s | Deploying open models fast |
| RunPod Serverless | Cheapest, flexible | Wide range | 15–45s | Cost-sensitive workloads |
| Beam | Simple Python, good for scripts | A10G, A100 | 10–20s | Batch jobs, scheduled tasks |
| Baseten | MLOps features, monitoring | H100, A100 | 5–10s | Production ML serving |
| Hugging Face Inference Endpoints | HF ecosystem, easy deploy | T4–A100 | Warm: instant | HF model serving |
Modal: The Developer-Favorite Platform
Modal is the most developer-friendly serverless GPU platform in 2026. You write regular Python; Modal handles everything else.
Getting Started
pip install modal
modal token new # Authenticate with modal.com
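Before deploying a full model, it is worth a quick smoke test that your account and GPU access work. A minimal sketch (the app name, the T4 choice, and the nvidia-smi check are illustrative assumptions):

# hello_gpu.py: verify that a function runs on a cloud GPU
import subprocess
import modal

app = modal.App("hello-gpu")

@app.function(gpu="T4")  # any small, cheap GPU is fine for a smoke test
def check_gpu() -> str:
    # Report the GPU visible inside the container
    return subprocess.run(["nvidia-smi"], capture_output=True, text=True).stdout

@app.local_entrypoint()
def main():
    print(check_gpu.remote())

Run it with modal run hello_gpu.py; the nvidia-smi output should show the requested GPU.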
Deploy an LLM Inference Endpoint
# inference.py
import modal

app = modal.App("llm-inference")

# Define the container image
image = (
    modal.Image.debian_slim(python_version="3.11")
    .pip_install(
        "transformers",
        "torch",
        "accelerate",
        "bitsandbytes",
        "fastapi",
        "uvicorn",
    )
)

# Model storage volume (download once, reuse)
volume = modal.Volume.from_name("model-weights", create_if_missing=True)

@app.cls(
    image=image,
    gpu=modal.gpu.H100(),         # Request H100 GPU
    memory=32768,                 # 32GB RAM
    volumes={"/models": volume},  # Persistent model storage
    container_idle_timeout=300,   # Keep warm 5 min after last request
    allow_concurrent_inputs=4,    # Handle 4 concurrent requests per container
)
class LLMInference:
    @modal.enter()
    def load_model(self):
        """Called once when container starts."""
        from transformers import AutoModelForCausalLM, AutoTokenizer
        import torch

        model_name = "meta-llama/Llama-3.3-70B-Instruct"
        cache_dir = "/models/llama-3.3-70b"

        print("Loading tokenizer...")
        self.tokenizer = AutoTokenizer.from_pretrained(
            model_name, cache_dir=cache_dir
        )

        print("Loading model...")
        self.model = AutoModelForCausalLM.from_pretrained(
            model_name,
            cache_dir=cache_dir,
            torch_dtype=torch.bfloat16,
            device_map="auto",
            load_in_4bit=True,  # 4-bit quantization: 4× memory reduction
        )
        print("Model loaded!")

    @modal.method()
    def generate(self, prompt: str, max_tokens: int = 512) -> str:
        """Generate text from prompt."""
        inputs = self.tokenizer(
            prompt,
            return_tensors="pt",
        ).to(self.model.device)

        outputs = self.model.generate(
            **inputs,
            max_new_tokens=max_tokens,
            temperature=0.7,
            do_sample=True,
            pad_token_id=self.tokenizer.eos_token_id,
        )

        # Decode only the new tokens
        new_tokens = outputs[0][inputs["input_ids"].shape[1]:]
        return self.tokenizer.decode(new_tokens, skip_special_tokens=True)

@app.local_entrypoint()
def main():
    inference = LLMInference()
    result = inference.generate.remote(
        "Explain the CAP theorem in simple terms:"
    )
    print(result)
# Deploy
modal deploy inference.py

# Run the local entrypoint (code executes in the cloud)
modal run inference.py
Web Endpoint (Production API)
from fastapi import FastAPI
from pydantic import BaseModel

web_app = FastAPI()

class GenerateRequest(BaseModel):
    prompt: str
    max_tokens: int = 512
    temperature: float = 0.7

@app.function(
    image=image,
    volumes={"/models": volume},
    container_idle_timeout=600,
)
@modal.asgi_app()
def inference_api():
    """Expose as HTTPS API endpoint."""
    # The API container only routes requests; generation runs on the
    # GPU-backed LLMInference class defined above, via .remote().
    model_instance = LLMInference()

    @web_app.post("/generate")
    async def generate(request: GenerateRequest):
        result = model_instance.generate.remote(
            request.prompt,
            request.max_tokens,
        )
        return {"text": result, "model": "llama-3.3-70b"}

    @web_app.get("/health")
    async def health():
        return {"status": "ok"}

    return web_app
modal deploy inference.py
# → Deployed to: https://your-org--inference-api.modal.run
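Once deployed, the endpoint is plain HTTPS, so any client works. A minimal sketch with requests (the URL is the placeholder printed above, not a live deployment):

import requests

resp = requests.post(
    "https://your-org--inference-api.modal.run/generate",  # placeholder from the deploy output
    json={"prompt": "Explain the CAP theorem in simple terms:", "max_tokens": 256},
    timeout=120,
)
resp.raise_for_status()
print(resp.json()["text"])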
Fine-Tuning on Serverless GPU
One of the most powerful use cases: run overnight fine-tuning jobs without maintaining GPU clusters.
# finetune.py
import modal

app = modal.App("llm-finetune")

image = (
    modal.Image.debian_slim(python_version="3.11")
    .pip_install(
        "transformers", "torch", "accelerate",
        "peft", "trl", "datasets", "bitsandbytes",
        "wandb",
    )
)

volume = modal.Volume.from_name("training-runs", create_if_missing=True)

@app.function(
    image=image,
    gpu=modal.gpu.H100(count=2),  # Multi-GPU training
    memory=65536,
    volumes={"/runs": volume},
    timeout=86400,  # 24-hour timeout for long training runs
    secrets=[
        modal.Secret.from_name("huggingface-secret"),
        modal.Secret.from_name("wandb-secret"),
    ],
)
def finetune(
    base_model: str = "meta-llama/Llama-3.1-8B",
    dataset_name: str = "your-org/your-dataset",
    output_dir: str = "/runs/llama-8b-finetuned",
    num_epochs: int = 3,
):
    """Fine-tune with LoRA + QLoRA."""
    from transformers import AutoModelForCausalLM, AutoTokenizer, TrainingArguments
    from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training
    from trl import SFTTrainer
    from datasets import load_dataset
    import torch

    # Load base model with 4-bit quantization
    model = AutoModelForCausalLM.from_pretrained(
        base_model,
        load_in_4bit=True,
        torch_dtype=torch.bfloat16,
        device_map="auto",
    )
    model = prepare_model_for_kbit_training(model)

    # LoRA config — train only a small fraction of the parameters
    lora_config = LoraConfig(
        r=16,  # Rank
        lora_alpha=32,
        target_modules=["q_proj", "v_proj", "k_proj", "o_proj"],
        lora_dropout=0.1,
        bias="none",
        task_type="CAUSAL_LM",
    )
    model = get_peft_model(model, lora_config)
    model.print_trainable_parameters()
    # trainable params: 6,553,600 || all params: 8,036,925,440 || trainable%: 0.08%

    # Load dataset
    tokenizer = AutoTokenizer.from_pretrained(base_model)
    dataset = load_dataset(dataset_name)

    # Training
    trainer = SFTTrainer(
        model=model,
        tokenizer=tokenizer,
        train_dataset=dataset["train"],
        eval_dataset=dataset.get("validation"),
        args=TrainingArguments(
            output_dir=output_dir,
            num_train_epochs=num_epochs,
            per_device_train_batch_size=4,
            gradient_accumulation_steps=4,
            warmup_ratio=0.05,
            learning_rate=2e-4,
            bf16=True,
            logging_steps=10,
            evaluation_strategy="steps",
            eval_steps=100,
            save_strategy="steps",
            save_steps=200,
            report_to="wandb",
        ),
    )
    trainer.train()
    trainer.save_model(output_dir)
    print(f"Training complete! Model saved to {output_dir}")

@app.local_entrypoint()
def main():
    finetune.remote(
        base_model="meta-llama/Llama-3.1-8B",
        dataset_name="your-org/custom-instructions",
        num_epochs=3,
    )
    print("Fine-tuning job submitted! Check wandb for progress.")
modal run finetune.py
# → Job running on 2× H100 in the cloud
# → Cost: ~$7.80/hour (2× H100 at ~$3.90/hr) × ~4 hours ≈ $31 for the job
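Once the job finishes, the LoRA adapter saved under /runs can be loaded on top of the base model for inference, for example from another Modal function that mounts the same training-runs volume. A hedged sketch using peft (paths mirror the output_dir above; merging is optional):

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import PeftModel

base = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-3.1-8B",
    torch_dtype=torch.bfloat16,
    device_map="auto",
)
model = PeftModel.from_pretrained(base, "/runs/llama-8b-finetuned")  # load the LoRA adapter
model = model.merge_and_unload()  # optional: fold the adapter into the base weights

tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-3.1-8B")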
Image Generation: Stable Diffusion on Serverless
# image_gen.py
import modal
import io
from pathlib import Path

app = modal.App("stable-diffusion")

image = (
    modal.Image.debian_slim(python_version="3.11")
    .pip_install("diffusers", "torch", "accelerate", "Pillow", "xformers")
)

model_volume = modal.Volume.from_name("sd-models", create_if_missing=True)

@app.cls(
    image=image,
    gpu="A10G",                  # A10G: good for image gen, cheaper than H100
    volumes={"/models": model_volume},
    container_idle_timeout=120,  # Keep warm 2 min
)
class StableDiffusion:
    @modal.enter()
    def load_pipeline(self):
        from diffusers import StableDiffusionXLPipeline
        import torch

        self.pipe = StableDiffusionXLPipeline.from_pretrained(
            "stabilityai/stable-diffusion-xl-base-1.0",
            torch_dtype=torch.float16,
            cache_dir="/models/sdxl",
        ).to("cuda")
        self.pipe.enable_xformers_memory_efficient_attention()

    @modal.method()
    def generate(
        self,
        prompt: str,
        negative_prompt: str = "blurry, low quality, distorted",
        steps: int = 30,
        guidance_scale: float = 7.5,
    ) -> bytes:
        image = self.pipe(
            prompt=prompt,
            negative_prompt=negative_prompt,
            num_inference_steps=steps,
            guidance_scale=guidance_scale,
            width=1024,
            height=1024,
        ).images[0]

        buffer = io.BytesIO()
        image.save(buffer, format="PNG")
        return buffer.getvalue()

@app.local_entrypoint()
def main():
    sd = StableDiffusion()
    image_bytes = sd.generate.remote(
        "A futuristic server room with quantum computers, cinematic lighting"
    )
    Path("output.png").write_bytes(image_bytes)
    print("Image saved to output.png")
Cost Optimization Strategies
1. Quantization: Smaller Models, Same Quality
# 4-bit quantization: ~4× memory reduction, ~5% quality loss
import torch
from transformers import BitsAndBytesConfig

quantization_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_compute_dtype=torch.bfloat16,
    bnb_4bit_use_double_quant=True,
    bnb_4bit_quant_type="nf4",
)

# Llama 70B in bf16: ~140 GB of weights alone (multiple 80 GB A100s)
# With 4-bit quantization (~35 GB of weights): fits on 1× H100 (80 GB)
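Wiring the config into a model load is one line; a minimal sketch (the model id is illustrative):

from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-3.3-70B-Instruct",      # illustrative model id
    quantization_config=quantization_config,  # the BitsAndBytesConfig defined above
    device_map="auto",
)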
2. Batching for Throughput
@app.cls(
    gpu="H100",
    allow_concurrent_inputs=32,  # Accept up to 32 concurrent requests so they can be batched
)
class BatchedInference:
    # Dynamic batching: Modal groups individual calls into one batch of up to 32,
    # waiting at most 100 ms to fill it (modal.batched takes the place of modal.method here).
    @modal.batched(max_batch_size=32, wait_ms=100)
    def generate_batch(self, prompts: list[str]) -> list[str]:
        """Process up to 32 prompts in one GPU forward pass."""
        # Batched inference is 10–20× more efficient than one-by-one.
        # Assumes self.tokenizer / self.model were loaded in a @modal.enter() hook.
        inputs = self.tokenizer(prompts, padding=True, return_tensors="pt").to(self.model.device)
        outputs = self.model.generate(**inputs, max_new_tokens=256)
        return self.tokenizer.batch_decode(outputs, skip_special_tokens=True)
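Callers still send one prompt at a time; the platform assembles the batch behind the scenes. A short usage sketch against the class above (assuming the usual .remote() call path):

engine = BatchedInference()
# Each call passes a single prompt; concurrent calls get grouped into one forward pass
reply = engine.generate_batch.remote("Summarize the CAP theorem in one sentence.")
print(reply)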
3. Spot/Preemptible Instances
RunPod Serverless and Modal both support spot pricing — 50–80% cheaper, with automatic retry on preemption. Ideal for batch jobs, fine-tuning.
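Preemption means a job can be killed mid-run, so the main code-level requirement is that work is retried and idempotent. A hedged sketch of a retry-friendly batch function on Modal (the function body and retry count are placeholders; spot placement itself is handled by the platform and plan, not by this code):

@app.function(
    gpu="A100",
    timeout=4 * 60 * 60,  # generous ceiling for a long batch job
    retries=3,            # resubmit automatically if a worker is interrupted
)
def nightly_embedding_job(shard_id: int):
    # Keep each shard idempotent (e.g. write to a shard-specific output path)
    # so a retry after preemption cannot corrupt or duplicate results.
    ...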
4. Right-Sizing GPU Selection
# For different model sizes:
gpu_map = {
    "7B model": modal.gpu.T4(),           # $0.59/hr — embedding, small gen
    "13B model": modal.gpu.A10G(),        # $1.10/hr — balanced
    "70B model": modal.gpu.A100(),        # $3.00/hr — large inference
    "70B + fine-tune": modal.gpu.H100(),  # $3.90/hr — max performance
}
Real-World Cost Comparison
Scenario: 1M inference requests/month, 70B LLM, avg 1 second/request
| Approach | Monthly Cost |
|---|---|
| Reserved A100 instance (AWS p4d.24xlarge) | $7,200/month |
| Serverless GPU (Modal A100) | ~$830/month |
| Serverless GPU with batching | ~$200/month |
| Serverless + quantization + batching | ~$80/month |
The combination of serverless, quantization, and batching can reduce GPU costs by 90× compared to naive dedicated deployment.
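The ~$830 serverless figure follows directly from the request volume; a quick sanity check of the arithmetic (the A100 rate is the illustrative one from the right-sizing list above):

requests_per_month = 1_000_000
seconds_per_request = 1.0
a100_rate_per_hour = 3.00  # illustrative rate

gpu_hours = requests_per_month * seconds_per_request / 3600  # ~278 GPU-hours
monthly_cost = gpu_hours * a100_rate_per_hour                # ~$833
print(f"{gpu_hours:.0f} GPU-hours ≈ ${monthly_cost:,.0f}/month")

Batching shrinks the GPU-seconds spent per request, and quantization lets a smaller, cheaper GPU serve the model, which is where the ~$200 and ~$80 rows come from.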
When NOT to Use Serverless GPU
Serverless GPU isn’t always the right choice:
- High sustained load (>80% utilization): Reserved instances become cheaper
- Sub-100ms latency requirements: Cold starts (5–30s) may not meet SLAs
- Complex multi-GPU training with NCCL: Some platforms limit multi-node comms
- Strict data residency: Ensure platform compliance with your requirements
For latency-critical production serving, consider keeping one warm instance running permanently (eliminates cold starts) while using serverless for burst capacity.
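On Modal, the "one warm instance" pattern can be expressed by asking for a minimum number of containers to stay up. A hedged sketch using the keep_warm option (the exact parameter name and availability vary across platforms and versions; image and the function body are assumed from earlier examples):

@app.function(
    image=image,
    gpu=modal.gpu.A100(),
    keep_warm=1,                 # always keep one container, and its loaded model, ready
    container_idle_timeout=600,  # extra burst containers still scale back down
)
def serve(prompt: str) -> str:
    ...  # low-latency serving path; burst traffic spills over to additional containers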
Summary
Serverless GPU in 2026 has made AI deployment accessible to every team. The key takeaways:
- Modal has the best developer experience — Python-native, fast iteration, excellent docs
- Combine quantization + batching for 10–20× cost efficiency
- Persistent volumes solve the model download problem — download once, reuse forever
- Cold starts are manageable — use container_idle_timeout to keep models warm during active periods
- Fine-tuning jobs are ideal for serverless — run, then scale to zero
The era of paying for idle GPUs is over. Start serverless, scale to dedicated when your economics demand it.
Tags: Serverless GPU, Modal, AI Inference, LLM Deployment, Fine-tuning, MLOps, LoRA, Stable Diffusion
