Anthropic’s Claude API went down again yesterday. That makes at least 326 incidents since January 2025 — roughly one every 1.3 days.
I use Claude across my entire workflow — coding, agents, automations. When the API returns 500s, it’s not “the chatbot is down.” It’s a production dependency failing.
And I get it. Scaling GPU inference at this level of demand is brutally hard. But here’s the thing:
Google solved this decades ago
The concept is called an error budget, and it came out of Google’s Site Reliability Engineering practice.
The math is simple:
- Set an SLO (e.g., 99.9% uptime)
- Error budget = 0.1% = ~43 minutes of downtime per month
- Budget remaining? Ship features. Release new models.
- Budget spent? Full stop. Fix reliability first.
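The arithmetic above fits in a few lines of Python. This is a minimal sketch, assuming a 30-day month; the function names and the sample downtime numbers are illustrative, not part of any real SRE tooling:

```python
# Error budget math for an availability SLO, assuming a 30-day month.
MINUTES_PER_MONTH = 30 * 24 * 60  # 43,200 minutes

def error_budget_minutes(slo: float) -> float:
    """Monthly downtime budget implied by an availability SLO (e.g. 0.999)."""
    return MINUTES_PER_MONTH * (1 - slo)

def can_ship(slo: float, downtime_minutes_so_far: float) -> bool:
    """Budget remaining? Ship. Budget spent? Freeze releases, fix reliability."""
    return downtime_minutes_so_far < error_budget_minutes(slo)

budget = error_budget_minutes(0.999)
print(f"99.9% SLO -> {budget:.1f} minutes/month of downtime budget")
print("Ship?", can_ship(0.999, downtime_minutes_so_far=12.0))  # under budget
print("Ship?", can_ship(0.999, downtime_minutes_so_far=55.0))  # budget blown
```

The point of writing it down like this: the ship/freeze decision stops being a judgment call in a heated release meeting and becomes a number everyone agreed to in advance.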
The beauty of this framework is that it makes the trade-off between velocity and reliability explicit. It’s not “move fast and break things” vs. “freeze everything.” It’s a quantified negotiation.
Now look at what’s happening
Anthropic’s 2025-2026 release cadence: Opus 4.1, Sonnet 4.5, Haiku 4.5, Opus 4.5, Opus 4.6, Sonnet 4.6. A new model every few months. Meanwhile, Google’s own SRE data tells us that 70% of outages are caused by changes.
Each new model release is a change. Each change is a risk. And the risks are compounding.
Reliability is safety tier zero
Anthropic brands itself as “safety-first, stable, deliberate.” And I believe they genuinely care about AI safety. But there’s a foundational layer beneath all safety concerns: does the system actually work when you need it to?
You can have the most aligned, most carefully evaluated model in the world. If it returns 500 errors when my production pipeline calls it at 3 AM, none of that matters.
Reliability is safety — tier zero.
AI APIs are infrastructure now
This isn’t a chat product anymore. Claude, GPT, Gemini — they’re becoming infrastructure. They sit in CI pipelines, in customer-facing agents, in automated decision systems.
Infrastructure providers don’t get to have 326 incidents a year. AWS would never survive that. Google Cloud would never survive that. The expectations should be the same.
What should change
Maybe — just maybe — the next quarter should go to making the last six models work reliably under load. Instead of releasing model number seven.
The error budget framework exists. It’s battle-tested. It scales. And it gives teams permission to say: “We’ve burned our budget. The next release waits until reliability catches up.”
That’s not slow. That’s disciplined.
I spent three years at Amazon building systems that reconciled $250B+ annually. One incident per 1.3 days wouldn’t survive a single operational review. The bar for AI infrastructure should be the same.