# Production ML Inference at Scale
Building reliable inference is not only about model quality; it is about availability, latency, and cost.
## Core design
- Online feature store for low-latency reads
- Canary rollout for model versions
- Graceful fallback to previous model
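The canary and fallback points above can be sketched as a single routing function. Everything here is illustrative: the `predict_v1`/`predict_v2` names, the 5% canary split, and the returned payloads are assumptions, not details from the text.

```python
import random

CANARY_FRACTION = 0.05  # fraction of traffic routed to the new model version

def predict_v1(features):
    # Stable (previous) model version.
    return {"model": "v1", "score": 0.70}

def predict_v2(features):
    # Candidate model version under canary rollout.
    return {"model": "v2", "score": 0.72}

def predict(features):
    """Route a small slice of traffic to the canary; fall back to the
    previous model version if the canary raises."""
    if random.random() < CANARY_FRACTION:
        try:
            return predict_v2(features)
        except Exception:
            # Graceful fallback: serve the stable model instead of erroring.
            return predict_v1(features)
    return predict_v1(features)
```

In practice the canary fraction would come from a config service rather than a constant, so it can be ramped up or zeroed out without a deploy.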
## Latency budget

| Stage | p95 budget (ms) |
| --- | --- |
| Gateway | 8 |
| Feature Fetch | 24 |
| Model | 30 |
| Post-Processing | 10 |
| **Total** | **72** |
## Key equation

$$ \text{p95 total latency} \approx \sum_i \text{p95}(\text{stage}_i) $$

Summing per-stage p95s is a conservative estimate: the true end-to-end p95 is usually lower, since it is unlikely that every stage hits its worst case on the same request.
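A minimal sketch of checking this budget against measured traces. Only the 8/24/30/10 ms budget figures come from the document; the sample latencies and stage names below are illustrative.

```python
import math

def p95(samples):
    """Empirical 95th percentile using the nearest-rank method."""
    ordered = sorted(samples)
    rank = math.ceil(0.95 * len(ordered))
    return ordered[rank - 1]

# Illustrative per-stage latency samples (ms); in production these would
# come from tracing, e.g. span durations per stage.
stage_samples = {
    "gateway": [5, 6, 7, 8, 8],
    "feature_fetch": [18, 20, 22, 24, 24],
    "model": [25, 27, 28, 30, 30],
    "post_processing": [7, 8, 9, 10, 10],
}

# Sum of per-stage p95s: the conservative total from the equation above.
estimated_total = sum(p95(s) for s in stage_samples.values())
print(estimated_total)  # 72, matching the 8 + 24 + 30 + 10 ms budget
```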
Enforce strict SLO alerts, and trigger automatic rollback when the error-budget burn rate exceeds its threshold.
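A hedged sketch of the burn-rate check that would drive such an auto-rollback. The 99.9% SLO and the 14.4x fast-burn threshold are common illustrative choices, not values given in the text.

```python
SLO_TARGET = 0.999               # assumed 99.9% success objective
ERROR_BUDGET = 1.0 - SLO_TARGET  # 0.1% of requests may fail

def burn_rate(errors, requests):
    """How fast the error budget is being consumed: 1.0 means the budget
    lasts exactly the SLO period; values above 1.0 mean faster burn."""
    if requests == 0:
        return 0.0
    return (errors / requests) / ERROR_BUDGET

def should_rollback(errors, requests, threshold=14.4):
    # 14.4x over a short window is a commonly used "fast burn" threshold.
    return burn_rate(errors, requests) >= threshold

# Example: 2% errors in the current window burns budget 20x too fast.
print(should_rollback(errors=20, requests=1000))  # True
```

In a real deployment this check runs in the alerting system over two windows (a short one for responsiveness, a long one to suppress flapping), and the rollback flips traffic back to the previous model version described above.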