RunPod Serverless: Deploy Any AI Model as an API in 30 Minutes
You don't need a GPU server humming in the background 24/7 to serve an AI model. RunPod Serverless gives you a queue-backed GPU endpoint that scales to zero when idle and spins up on demand. Here's how it actually works.
I've deployed a handful of AI models in production over the past year — Whisper for transcription, Flux for image generation, a few fine-tuned LLMs. The common theme: most of them get used in bursts. Someone hits the endpoint, jobs run, then it's quiet for a while. Paying for an always-on GPU for that pattern is wasteful. RunPod Serverless is the answer to that problem.
What RunPod Serverless actually is
Serverless on RunPod is a queue-based GPU endpoint. You write a Python handler function, package it in a Docker image, and RunPod handles the rest: worker lifecycle, job queuing, scaling, and billing by the second. You pay only for actual compute time — when no jobs are running, cost is zero.
Jobs flow through three endpoints:
- /run — async. Submits the job, returns a job ID immediately. You poll /status/&lt;id&gt; for results. Results are retained for 30 minutes.
- /runsync — waits for the result inline (up to 90 seconds by default, configurable up to 300 seconds). Use this for fast models or impatient callers. Results retained for 1 minute.
- /stream — for handlers that yield tokens progressively (LLMs, etc.).
There's also a load-balanced endpoint type for direct HTTP routing without a queue, which is better for low-latency real-time inference and custom REST APIs.
The handler pattern
This is the most important thing to understand: load your model outside the handler function. Workers are initialized once and then handle many jobs. If you load the model inside the handler, you pay for model loading on every single request.
import runpod
from transformers import pipeline

# Model loads ONCE when the worker starts
# Stays in memory across all subsequent jobs
transcriber = pipeline(
    "automatic-speech-recognition",
    model="openai/whisper-large-v3",
    device="cuda"
)

def handler(job):
    job_input = job["input"]
    audio_url = job_input["audio_url"]
    result = transcriber(audio_url)
    return {"transcription": result["text"]}

runpod.serverless.start({"handler": handler})
The handler function receives a job dict. job["input"] is whatever
you sent in the request body. Return a dict and it becomes the response. That's the whole interface.
You can also write async handlers with async def handler(job), or streaming handlers that
yield results incrementally. For streaming, clients consume via /stream.
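A minimal streaming sketch looks like this. The generate_tokens helper is a stand-in for your model's token generator, and return_aggregate_stream is, if I remember the SDK flag right, what also exposes the aggregated output via /status:

import runpod

def generate_tokens(prompt):
    # Stand-in for your model's streaming generation loop
    for token in ["streamed", " ", "tokens", " ", "for: ", prompt]:
        yield token

def handler(job):
    prompt = job["input"]["prompt"]
    # Each yielded dict becomes one chunk consumed via /stream
    for token in generate_tokens(prompt):
        yield {"token": token}

runpod.serverless.start({
    "handler": handler,
    # Also make the full aggregated output available via /status (SDK option, as I recall)
    "return_aggregate_stream": True,
})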
The Dockerfile pattern
Your handler runs inside a Docker container on RunPod's GPU fleet. The base image gives you CUDA and the RunPod SDK:
FROM runpod/base:0.4.0-cuda11.8.0
COPY requirements.txt /requirements.txt
RUN pip install -r /requirements.txt
# Optional: bake the model into the image so cold starts are faster
# RUN python -c "from transformers import pipeline; pipeline('automatic-speech-recognition', model='openai/whisper-large-v3')"
COPY handler.py /handler.py
CMD ["python", "-u", "/handler.py"]
Build for RunPod's Linux/x86_64 fleet: docker build --platform linux/amd64 -t myimage .
Push to Docker Hub (or use RunPod's GitHub integration for automatic builds on push).
Cold starts: the main thing to optimize
Cold start = the time between "job submitted" and "GPU is actually doing work". It includes: spinning up a worker container, pulling the image, loading the model into VRAM. For large models, this can be several minutes if you're not careful.
Three levers to pull:
1. FlashBoot (enabled by default for GPU endpoints)
FlashBoot retains worker state after a job finishes instead of fully tearing down the worker. When the next job comes in, the worker resumes from the cached state rather than reinitializing. You get this automatically — it's on by default.
2. Bake the model into the Docker image
Run the model download as part of your Dockerfile RUN step. The model is stored in the image
layer. No download at startup. This is the most reliable approach for private or custom models.
Downside: larger image, slower first pull (though images are cached on RunPod hosts).
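One way to do the bake: a tiny download script you call from a RUN step. This is a sketch assuming huggingface_hub and a handler that reads from the default HF cache; swap in your own model ID.

# download_model.py, invoked at build time: RUN python /download_model.py
from huggingface_hub import snapshot_download

# Pull the weights into the image layer (default HF cache location).
# For gated or private repos, pass token="hf_..." at build time.
snapshot_download("openai/whisper-large-v3")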
3. Active workers
Set at least 1 active worker in the endpoint configuration. Active workers are always-on — they eliminate cold starts entirely. You pay the active worker rate (about 20-30% cheaper than the flex rate) continuously, but every request is instant. Good if you have steady baseline traffic.
The active worker formula: active workers = (requests/min × avg_duration_seconds) / 60.
6 requests/min at 30 seconds each = 3 active workers.
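Or as code, since I end up recomputing this whenever traffic changes (a trivial sketch of the same formula):

import math

def active_workers_needed(requests_per_min: float, avg_duration_s: float) -> int:
    # Work arriving per minute, in worker-seconds, spread over a 60-second window
    return math.ceil(requests_per_min * avg_duration_s / 60)

print(active_workers_needed(6, 30))  # 3, matching the example above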
If initialization itself needs more time than the default allows (large models often do), raise the limit with RUNPOD_INIT_TIMEOUT=800 (seconds) in your container environment.
GPU pricing
Pricing is per second of execution. Two rates: flex (spot-like, worker spins down when idle) and active (reserved, always-on).
| GPU | VRAM | Flex $/s | Active $/s | Active $/hr |
|---|---|---|---|---|
| A4000 / A4500 / RTX 4000 | 16 GB | $0.00016 | $0.00011 | ~$0.40 |
| L4 / A5000 / 3090 | 24 GB | $0.00019 | $0.00013 | ~$0.47 |
| 4090 PRO | 24 GB | $0.00031 | $0.00021 | ~$0.76 |
| L40 / L40S / 6000 Ada PRO | 48 GB | $0.00053 | $0.00037 | ~$1.33 |
| A6000 / A40 | 48 GB | $0.00034 | $0.00024 | ~$0.86 |
| A100 | 80 GB | $0.00076 | $0.00060 | ~$2.16 |
| H100 PRO | 80 GB | $0.00116 | $0.00093 | ~$3.35 |
| H200 PRO | 141 GB | $0.00155 | $0.00124 | ~$4.46 |
| B200 | 180 GB | $0.00240 | $0.00190 | ~$6.84 |
VRAM sizing rule of thumb: ~2 GB per billion parameters at 16-bit precision (FP16/BF16). A 7B model needs ~14 GB — fits on a 16 GB A4000. A 70B model at 4-bit quantization (~35 GB) needs at least a 48 GB card. Match your GPU tier to your model before you deploy.
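The same rule as a back-of-envelope function. This covers weights only; leave headroom for the KV cache and activations:

def weights_vram_gb(params_billion: float, bytes_per_param: float = 2.0) -> float:
    # 2.0 bytes/param for FP16/BF16, 1.0 for INT8, 0.5 for 4-bit quantization
    return params_billion * bytes_per_param

print(weights_vram_gb(7))        # ~14 GB -> 16 GB card
print(weights_vram_gb(70, 0.5))  # ~35 GB -> 48 GB card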
Public endpoints you can use today
Don't need to deploy your own? RunPod's public endpoint catalog has pre-deployed models. No infrastructure required — just an API key and credits.
curl -X POST \
"https://api.runpod.ai/v2/black-forest-labs-flux-1-schnell/runsync" \
-H "Authorization: Bearer $RUNPOD_API_KEY" \
-H "Content-Type: application/json" \
-d '{
"input": {
"prompt": "a cinematic photo of a rainy Tokyo street at night",
"width": 1024,
"height": 1024,
"num_inference_steps": 4
}
}'
Notable public endpoints:
- Flux Dev (black-forest-labs-flux-1-dev) — $0.02/megapixel, high quality
- Flux Schnell (black-forest-labs-flux-1-schnell) — $0.0024/megapixel, fast
- Whisper V3 Large — $0.05/1K chars, best-in-class transcription
- Qwen3 32B AWQ (qwen3-32b-awq) — $10/1M tokens, OpenAI-compatible API
- WAN 2.6 — image-to-video with audio support
The text endpoints (Qwen3, GPT OSS 120B) are OpenAI-compatible — you can point Cursor, Cline, or the Vercel AI SDK at them with just a base URL change.
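Here's a rough sketch with the official openai Python client. The base URL shape and model name are assumptions on my part; check the endpoint's page for the exact values:

from openai import OpenAI

# Assumed URL shape for RunPod's OpenAI-compatible endpoints; confirm on the endpoint page
client = OpenAI(
    base_url="https://api.runpod.ai/v2/qwen3-32b-awq/openai/v1",
    api_key="<RUNPOD_API_KEY>",
)

resp = client.chat.completions.create(
    model="qwen3-32b-awq",  # model name is an assumption; use whatever the endpoint lists
    messages=[{"role": "user", "content": "Summarize RunPod Serverless in one sentence."}],
)
print(resp.choices[0].message.content)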
When to use /run vs /runsync
Use /runsync when your model runs in under 90 seconds and the caller can wait inline.
Image generation, transcription of short clips, text inference — these are good candidates.
Use /run when jobs take longer (video generation, large document processing) or when
you want to decouple submission from result retrieval. The caller gets a job ID back immediately
and polls /status/<job_id> until it sees COMPLETED.
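The async round trip, sketched with requests. ENDPOINT_ID, the API key, and the input payload are placeholders for your own endpoint:

import time
import requests

BASE = "https://api.runpod.ai/v2/<ENDPOINT_ID>"
HEADERS = {"Authorization": "Bearer <RUNPOD_API_KEY>"}

# Submit the job; /run returns an ID immediately
job = requests.post(f"{BASE}/run",
                    json={"input": {"audio_url": "https://example.com/clip.mp3"}},
                    headers=HEADERS).json()
job_id = job["id"]

# Poll /status/<job_id> until the job reaches a terminal state
while True:
    status = requests.get(f"{BASE}/status/{job_id}", headers=HEADERS).json()
    if status["status"] in ("COMPLETED", "FAILED", "CANCELLED"):
        break
    time.sleep(2)

print(status.get("output"))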
Payload limits: /run accepts up to 10 MB, /runsync up to 20 MB.
For larger inputs (high-res images, audio), upload to S3 first and pass a URL.
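If you go the URL route, a presigned S3 URL keeps the object private while still letting the worker fetch it. A boto3 sketch; the bucket and key names are placeholders:

import boto3

s3 = boto3.client("s3")

# Upload the oversized input, then hand the worker a short-lived URL instead of raw bytes
s3.upload_file("input.wav", "my-bucket", "jobs/input.wav")
audio_url = s3.generate_presigned_url(
    "get_object",
    Params={"Bucket": "my-bucket", "Key": "jobs/input.wav"},
    ExpiresIn=3600,  # valid for one hour
)
# Pass audio_url inside {"input": {...}} instead of embedding the file in the payload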
Local testing before you deploy
# Test with inline JSON input
python handler.py --test_input '{"input": {"prompt": "test prompt"}}'
# Or put your test input in test_input.json and just run:
python handler.py
# Spin up a local API server that mimics the real endpoint
python handler.py --rp_serve_api
# Then hit it: curl -X POST http://localhost:8000/runsync -d '{"input": {"prompt": "test"}}'
The local server is invaluable. It runs the handler synchronously so you can iterate on your code without deploying anything.
Input validation
The SDK includes a validator so you don't have to write your own field checking:
from runpod.serverless.utils.rp_validator import validate

schema = {
    "prompt": {"type": str, "required": True},
    "steps": {
        "type": int,
        "required": False,
        "default": 20,
        "constraints": lambda x: 1 <= x <= 100
    }
}

def handler(job):
    validated = validate(job["input"], schema)
    if "errors" in validated:
        return {"error": validated["errors"]}
    inp = validated["validated_input"]
    # proceed with inp["prompt"] and inp["steps"]
    ...
Auto-scaling
Two scaling modes:
- Queue delay (default) — adds workers when jobs wait longer than 4 seconds. Good general default.
- Request count — more aggressive. Formula: ceil((inQueue + inProgress) / scalerValue). Set scalerValue=1 for maximum responsiveness. Recommended for LLM workloads where every second of queue time is felt.
The workflow: Pods → Serverless
I develop on a RunPod Pod (full GPU with SSH and Jupyter), then deploy the same Docker image to Serverless. A simple environment variable switch handles the difference:
import os
import runpod

# Load model here (same for both modes)
model = load_my_model()

def handler(job):
    return model.predict(job["input"]["prompt"])

if os.environ.get("MODE") == "serverless":
    runpod.serverless.start({"handler": handler})
else:
    # Pod mode: run a simple HTTP server or Jupyter
    print("Running in pod mode — use Jupyter or interactive shell")
This is the practical workflow. Pods for development and debugging. Serverless for production serving. Same image, different entry point behavior.