Modal vs Replicate: GPU cloud pricing compared
Modal and Replicate are the two serverless GPU clouds most often compared for inference workloads. Both bill per second. Modal is Python-native (decorator-style deployment, broader compute primitives) and competitive on raw GPU rates. Replicate is inference-first, built around the Cog container format and a model marketplace, and charges a premium on raw GPU time.
Side-by-side
| Dimension | Modal | Replicate |
|---|---|---|
| Cheapest H100 per-hour equiv | $3.95/hr | $5.04/hr |
| Cheapest A100 80GB per-hour equiv | $2.78/hr | $5.04/hr |
| Pricing model | Per-second GPU; separate CPU and cold-start charges | Per-second across GPU and setup time |
| Deployment surface | Python decorator, broader compute primitives | Cog container format, model marketplace |
| Best for | Inference, batch, and training as one platform | Inference-only of open-source or custom models |
Choose Modal if you want a Python-native deployment surface that doubles as a batch and training platform, and the lowest serverless H100 / A100 per-second rates in this comparison.
Choose Replicate if you want zero-infrastructure inference of open-source models with the Cog deployment format and a public model marketplace, and you accept the per-second rate premium in exchange for operational simplicity.
Worked example
Acme MLOps Co. (illustrative example, not a real company) needs an 8-GPU H100 cluster for 30 days at 18 hours per day: 8 × 30 × 18 = 4,320 GPU-hours. At Modal's published H100 rate ($3.95/GPU-hr, the hourly equivalent of its per-second price) that is roughly $17,064; at Replicate's published H100 rate ($5.04/GPU-hr, likewise an hourly equivalent), roughly $21,773. Both figures cover raw GPU compute only, before storage, egress, and MLOps overhead.
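The arithmetic above can be reproduced with a short Python sketch. It assumes the two hourly-equivalent H100 rates quoted in this comparison and nothing else; the names `RATES`, `gpu_hours`, and `compute_cost` are hypothetical helpers for illustration and are not part of either provider's SDK. Storage, egress, CPU, and cold-start charges are deliberately left out.

```python
from decimal import Decimal

# Hourly-equivalent H100 rates quoted in this article (USD per GPU-hour).
# These are inputs to the sketch, not values fetched from either provider.
RATES = {
    "Modal": Decimal("3.95"),
    "Replicate": Decimal("5.04"),
}


def gpu_hours(gpus: int, days: int, hours_per_day: int) -> int:
    """Total GPU-hours for a fixed-size cluster on a daily schedule."""
    return gpus * days * hours_per_day


def compute_cost(rate_per_gpu_hour: Decimal, hours: int) -> Decimal:
    """Raw GPU compute cost only (no storage, egress, or other charges)."""
    return rate_per_gpu_hour * hours


# Scenario from the worked example: 8 GPUs, 30 days, 18 hours per day.
hours = gpu_hours(gpus=8, days=30, hours_per_day=18)
costs = {name: compute_cost(rate, hours) for name, rate in RATES.items()}
difference = costs["Replicate"] - costs["Modal"]

print(f"GPU-hours: {hours:,}")
for name, cost in costs.items():
    print(f"{name}: ${cost:,.2f}")
print(f"Difference: ${difference:,.2f}")
```

Using `Decimal` rather than floats keeps the dollar figures exact. Swapping in the A100 80GB rates from the table, or a different schedule, only requires changing the inputs.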
Last verified June 2026.