RunPod Serverless: Deploy Any AI Model as an API in 30 Minutes
You don't need a GPU server humming in the background 24/7 to serve an AI model. RunPod Serverless gives you a queue-backed GPU endpoint that scales to zero when idle and spins up on demand. Here's how it actually works.
I've deployed a handful of AI models in production over the past year — Whisper for transcription, Flux for image generation, a few fine-tuned LLMs. The common theme: most of them get used in bursts. Someone hits the endpoint, jobs run, then it's quiet for a while. Paying for an always-on GPU for that pattern is wasteful. RunPod Serverless is the answer to that problem.
What RunPod Serverless actually is
Serverless on RunPod is a queue-based GPU endpoint. You write a Python handler function, package it in a Docker image, and RunPod handles the rest: worker lifecycle, job queuing, scaling, and billing by the second. You pay only for actual compute time — when no jobs are running, cost is zero.
Jobs flow through three endpoints:
- /run — async. Submits the job, returns a job ID immediately. You poll /status/&lt;id&gt; for results. Results are retained for 30 minutes.
- /runsync — waits for the result inline (up to 90 seconds by default, configurable up to 300 seconds). Use this for fast models or impatient callers. Results retained for 1 minute.
- /stream — for handlers that yield tokens progressively (LLMs, etc.).
There's also a load-balanced endpoint type for direct HTTP routing without a queue, which is better for low-latency real-time inference and custom REST APIs.
The handler pattern
This is the most important thing to understand: load your model outside the handler function. Workers are initialized once and then handle many jobs. If you load the model inside the handler, you pay for model loading on every single request.
import runpod
from transformers import pipeline

# Model loads ONCE when the worker starts
# Stays in memory across all subsequent jobs
transcriber = pipeline(
    "automatic-speech-recognition",
    model="openai/whisper-large-v3",
    device="cuda"
)

def handler(job):
    job_input = job["input"]
    audio_url = job_input["audio_url"]
    result = transcriber(audio_url)
    return {"transcription": result["text"]}

runpod.serverless.start({"handler": handler})
The handler function receives a job dict. job["input"] is whatever
you sent in the request body. Return a dict and it becomes the response. That's the whole interface.
You can also write async handlers with async def handler(job), or streaming handlers that
yield results incrementally. For streaming, clients consume via /stream.
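A minimal streaming sketch looks like this. The generate_tokens helper is a stand-in for your model's token generator, and return_aggregate_stream is, if I remember the SDK flag right, what also exposes the aggregated output via /status:

import runpod

def generate_tokens(prompt):
    # Stand-in for your model's streaming generation loop
    for token in ["streamed", " ", "tokens", " ", "for: ", prompt]:
        yield token

def handler(job):
    prompt = job["input"]["prompt"]
    # Each yielded dict becomes one chunk consumed via /stream
    for token in generate_tokens(prompt):
        yield {"token": token}

runpod.serverless.start({
    "handler": handler,
    # Also make the full aggregated output available via /status (SDK option, as I recall)
    "return_aggregate_stream": True,
})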
The Dockerfile pattern
Your handler runs inside a Docker container on RunPod's GPU fleet. The base image gives you CUDA and the RunPod SDK:
FROM runpod/base:0.4.0-cuda11.8.0
COPY requirements.txt /requirements.txt
RUN pip install -r /requirements.txt
# Optional: bake the model into the image so cold starts are faster
# RUN python -c "from transformers import pipeline; pipeline('automatic-speech-recognition', model='openai/whisper-large-v3')"
COPY handler.py /handler.py
CMD ["python", "-u", "/handler.py"]
Build for RunPod's Linux/x86_64 fleet: docker build --platform linux/amd64 -t myimage .
Push to Docker Hub (or use RunPod's GitHub integration for automatic builds on push).
Cold starts: the main thing to optimize
Cold start = the time between "job submitted" and "GPU is actually doing work". It includes: spinning up a worker container, pulling the image, loading the model into VRAM. For large models, this can be several minutes if you're not careful.
Three levers to pull:
1. FlashBoot (enabled by default for GPU endpoints)
FlashBoot retains worker state after a job finishes instead of fully tearing down the worker. When the next job comes in, the worker resumes from the cached state rather than reinitializing. You get this automatically — it's on by default.
2. Bake the model into the Docker image
Run the model download as part of your Dockerfile RUN step. The model is stored in the image
layer. No download at startup. This is the most reliable approach for private or custom models.
Downside: larger image, slower first pull (though images are cached on RunPod hosts).
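One way to do the bake: a tiny download script you call from a RUN step. This is a sketch assuming huggingface_hub and a handler that reads from the default HF cache; swap in your own model ID.

# download_model.py, invoked at build time: RUN python /download_model.py
from huggingface_hub import snapshot_download

# Pull the weights into the image layer (default HF cache location).
# For gated or private repos, pass token="hf_..." at build time.
snapshot_download("openai/whisper-large-v3")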
3. Active workers
Set at least 1 active worker in the endpoint configuration. Active workers are always-on — they eliminate cold starts entirely. You pay the active worker rate (about 20-30% cheaper than the flex rate) continuously, but every request is instant. Good if you have steady baseline traffic.
The active worker formula: active workers = (requests/min × avg_duration_seconds) / 60.
6 requests/min at 30 seconds each = 3 active workers.
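Or as code, since I end up recomputing this whenever traffic changes (a trivial sketch of the same formula):

import math

def active_workers_needed(requests_per_min: float, avg_duration_s: float) -> int:
    # Work arriving per minute, in worker-seconds, spread over a 60-second window
    return math.ceil(requests_per_min * avg_duration_s / 60)

print(active_workers_needed(6, 30))  # 3, matching the example above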
If initialization itself needs more time than the default allows (large models often do), raise the limit with RUNPOD_INIT_TIMEOUT=800 (seconds) in your container environment.
GPU pricing
Pricing is per second of execution. Two rates: flex (spot-like, worker spins down when idle) and active (reserved, always-on).
| GPU | VRAM | Flex $/s | Active $/s | Active $/hr |
|---|---|---|---|---|
| A4000 / A4500 / RTX 4000 | 16 GB | $0.00016 | $0.00011 | ~$0.40 |
| L4 / A5000 / 3090 | 24 GB | $0.00019 | $0.00013 | ~$0.47 |
| 4090 PRO | 24 GB | $0.00031 | $0.00021 | ~$0.76 |
| L40 / L40S / 6000 Ada PRO | 48 GB | $0.00053 | $0.00037 | ~$1.33 |
| A6000 / A40 | 48 GB | $0.00034 | $0.00024 | ~$0.86 |
| A100 | 80 GB | $0.00076 | $0.00060 | ~$2.16 |
| H100 PRO | 80 GB | $0.00116 | $0.00093 | ~$3.35 |
| H200 PRO | 141 GB | $0.00155 | $0.00124 | ~$4.46 |
| B200 | 180 GB | $0.00240 | $0.00190 | ~$6.84 |
VRAM sizing rule of thumb: ~2 GB per billion parameters at 16-bit precision (FP16/BF16). A 7B model needs ~14 GB — fits on a 16 GB A4000. A 70B model at 4-bit quantization (~35 GB) needs at least a 48 GB card. Match your GPU tier to your model before you deploy.
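The same rule as a back-of-envelope function. This covers weights only; leave headroom for the KV cache and activations:

def weights_vram_gb(params_billion: float, bytes_per_param: float = 2.0) -> float:
    # 2.0 bytes/param for FP16/BF16, 1.0 for INT8, 0.5 for 4-bit quantization
    return params_billion * bytes_per_param

print(weights_vram_gb(7))        # ~14 GB -> 16 GB card
print(weights_vram_gb(70, 0.5))  # ~35 GB -> 48 GB card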
Public endpoints you can use today
Don't need to deploy your own? RunPod's public endpoint catalog has pre-deployed models. No infrastructure required — just an API key and credits.
curl -X POST \
"https://api.runpod.ai/v2/black-forest-labs-flux-1-schnell/runsync" \
-H "Authorization: Bearer $RUNPOD_API_KEY" \
-H "Content-Type: application/json" \
-d '{
"input": {
"prompt": "a cinematic photo of a rainy Tokyo street at night",
"width": 1024,
"height": 1024,
"num_inference_steps": 4
}
}'
Notable public endpoints:
- Flux Dev (black-forest-labs-flux-1-dev) — $0.02/megapixel, high quality
- Flux Schnell (black-forest-labs-flux-1-schnell) — $0.0024/megapixel, fast
- Whisper V3 Large — $0.05/1K chars, best-in-class transcription
- Qwen3 32B AWQ (qwen3-32b-awq) — $10/1M tokens, OpenAI-compatible API
- WAN 2.6 — image-to-video with audio support
The text endpoints (Qwen3, GPT OSS 120B) are OpenAI-compatible — you can point Cursor, Cline, or the Vercel AI SDK at them with just a base URL change.
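Here's a rough sketch with the official openai Python client. The base URL shape and model name are assumptions on my part; check the endpoint's page for the exact values:

from openai import OpenAI

# Assumed URL shape for RunPod's OpenAI-compatible endpoints; confirm on the endpoint page
client = OpenAI(
    base_url="https://api.runpod.ai/v2/qwen3-32b-awq/openai/v1",
    api_key="<RUNPOD_API_KEY>",
)

resp = client.chat.completions.create(
    model="qwen3-32b-awq",  # model name is an assumption; use whatever the endpoint lists
    messages=[{"role": "user", "content": "Summarize RunPod Serverless in one sentence."}],
)
print(resp.choices[0].message.content)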
When to use /run vs /runsync
Use /runsync when your model runs in under 90 seconds and the caller can wait inline.
Image generation, transcription of short clips, text inference — these are good candidates.
Use /run when jobs take longer (video generation, large document processing) or when
you want to decouple submission from result retrieval. The caller gets a job ID back immediately
and polls /status/<job_id> until it sees COMPLETED.
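The async round trip, sketched with requests. ENDPOINT_ID, the API key, and the input payload are placeholders for your own endpoint:

import time
import requests

BASE = "https://api.runpod.ai/v2/<ENDPOINT_ID>"
HEADERS = {"Authorization": "Bearer <RUNPOD_API_KEY>"}

# Submit the job; /run returns an ID immediately
job = requests.post(f"{BASE}/run",
                    json={"input": {"audio_url": "https://example.com/clip.mp3"}},
                    headers=HEADERS).json()
job_id = job["id"]

# Poll /status/<job_id> until the job reaches a terminal state
while True:
    status = requests.get(f"{BASE}/status/{job_id}", headers=HEADERS).json()
    if status["status"] in ("COMPLETED", "FAILED", "CANCELLED"):
        break
    time.sleep(2)

print(status.get("output"))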
Payload limits: /run accepts up to 10 MB, /runsync up to 20 MB.
For larger inputs (high-res images, audio), upload to S3 first and pass a URL.
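If you go the URL route, a presigned S3 URL keeps the object private while still letting the worker fetch it. A boto3 sketch; the bucket and key names are placeholders:

import boto3

s3 = boto3.client("s3")

# Upload the oversized input, then hand the worker a short-lived URL instead of raw bytes
s3.upload_file("input.wav", "my-bucket", "jobs/input.wav")
audio_url = s3.generate_presigned_url(
    "get_object",
    Params={"Bucket": "my-bucket", "Key": "jobs/input.wav"},
    ExpiresIn=3600,  # valid for one hour
)
# Pass audio_url inside {"input": {...}} instead of embedding the file in the payload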
Local testing before you deploy
# Test with inline JSON input
python handler.py --test_input '{"input": {"prompt": "test prompt"}}'
# Or put your test input in test_input.json and just run:
python handler.py
# Spin up a local API server that mimics the real endpoint
python handler.py --rp_serve_api
# Then hit it: curl -X POST http://localhost:8000/runsync -d '{"input": {"prompt": "test"}}'
The local server is invaluable. It runs the handler synchronously so you can iterate on your code without deploying anything.
Input validation
The SDK includes a validator so you don't have to write your own field checking:
from runpod.serverless.utils.rp_validator import validate

schema = {
    "prompt": {"type": str, "required": True},
    "steps": {
        "type": int,
        "required": False,
        "default": 20,
        "constraints": lambda x: 1 <= x <= 100
    }
}

def handler(job):
    validated = validate(job["input"], schema)
    if "errors" in validated:
        return {"error": validated["errors"]}
    inp = validated["validated_input"]
    # proceed with inp["prompt"] and inp["steps"]
    ...
Auto-scaling
Two scaling modes:
- Queue delay (default) — adds workers when jobs wait longer than 4 seconds. Good general default.
- Request count — more aggressive. Formula: ceil((inQueue + inProgress) / scalerValue). Set scalerValue=1 for maximum responsiveness. Recommended for LLM workloads where every second of queue time is felt.
The workflow: Pods → Serverless
I develop on a RunPod Pod (full GPU with SSH and Jupyter), then deploy the same Docker image to Serverless. A simple environment variable switch handles the difference:
import os
import runpod

# Load model here (same for both modes)
model = load_my_model()

def handler(job):
    return model.predict(job["input"]["prompt"])

if os.environ.get("MODE") == "serverless":
    runpod.serverless.start({"handler": handler})
else:
    # Pod mode: run a simple HTTP server or Jupyter
    print("Running in pod mode — use Jupyter or interactive shell")
This is the practical workflow. Pods for development and debugging. Serverless for production serving. Same image, different entry point behavior.