One of the foundational promises of cloud computing is elastic, near-infinite scalability. You pay for what you use, and when demand spikes, the cloud scales with you. It was supposed to be the end of capacity planning, the death of “we need to buy more servers” conversations. So why is Anthropic—one of the best-funded AI companies in the world, running on the infrastructure of Amazon and Google—throttling its users during peak hours?
In March 2026, Anthropic confirmed they had been actively reducing Claude’s session limits during weekday peak hours, causing developers to burn through their usage quotas significantly faster than they expected. [1] This is a story worth understanding in detail, because it says something important about the real state of AI infrastructure—and the gap between cloud marketing and cloud reality.
Where Claude Actually Lives
Before we can understand why Claude is being throttled, we need to understand where Claude actually runs.
Claude is not hosted on Anthropic’s own hardware. Anthropic is an AI research company, not a data center operator. They rely on two primary cloud hyperscalers: Amazon Web Services (AWS) and Google Cloud Platform (GCP).
AWS: The Primary Partner
Amazon has invested a cumulative $8 billion in Anthropic, making it Anthropic’s largest financial backer. [2] As part of the deal, Anthropic designated AWS as its primary cloud and training partner, agreeing to train and deploy its flagship Claude models on Amazon’s custom Trainium and Inferentia AI chips.
The centerpiece of this relationship is Project Rainier, a massive AI supercomputing cluster dedicated entirely to Anthropic’s needs. By October 2025, Project Rainier had deployed nearly 500,000 Trainium2 chips across multiple US data centers, with a target of scaling beyond one million chips. [3] The infrastructure is built on “UltraServers”—each containing 64 Trainium2 chips interconnected via high-speed NeuronLink cables—grouped into a massive “UltraCluster” spanning multiple sites. At peak capacity, Project Rainier is reportedly 70% larger than any previous AWS AI computing platform, and provides more than five times the compute Anthropic used for earlier Claude models.
Despite these enormous numbers, the cluster is still finite and shared. Training new model versions and running inference for millions of active users compete for the same physical hardware.
Google Cloud: The Strategic Second Partner
Google has invested at least $2 billion in Anthropic (with reports of commitments growing toward $3 billion), and the partnership goes beyond capital. [4] Anthropic has reserved up to one million next-generation TPUs (Tensor Processing Units) from Google for training and inference of future Claude models. Claude is available to enterprise customers through Google’s Vertex AI platform. Anthropic has also partnered with Google and Broadcom to co-design custom TPU chips tailored specifically for Claude workloads.
Azure and the Broader Picture
Claude is also accessible through Microsoft Azure’s AI Foundry service, making it available across all three major hyperscalers. [5] Despite the breadth of this multi-cloud footprint, the combined capacity of AWS, Google, and Azure still has real, physical upper bounds.
What Actually Happened: The March 2026 Throttling
In late March 2026, Anthropic publicly confirmed that it had been adjusting Claude’s session limits during peak usage hours. The specific changes:
- Peak hours defined: Monday–Friday, 5:00 AM to 11:00 AM Pacific Time (8:00 AM–2:00 PM Eastern; 12:00 PM–6:00 PM UTC while US daylight saving time is in effect).
- How it works: Session limits are tied to token consumption, not clock time. During peak hours, each token costs more against your session budget—so a “five-hour session” can be exhausted in far less than five real hours if you’re doing heavy work.
- Who is affected: Anthropic acknowledged that roughly 7% of users—particularly those on Pro and Max subscriptions doing token-intensive work like coding or running agents—now hit limits they wouldn’t have previously encountered. [6]
- Weekly limits unchanged: The total weekly usage quota was not reduced; only how quickly you burn through it during peak hours changed.
- API users unaffected: Developers paying per-token via the API are not subject to these session caps.
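The mechanics can be sketched as token-weighted budget accounting: during the published peak window, each token debits the session budget at a higher rate. Anthropic has not published the actual weighting, so the multiplier below is purely illustrative.

```python
from datetime import datetime, time
from zoneinfo import ZoneInfo

# Illustrative peak multiplier -- Anthropic has not published the real value.
PEAK_MULTIPLIER = 2.0
PEAK_START, PEAK_END = time(5, 0), time(11, 0)  # 5:00-11:00 AM Pacific
PACIFIC = ZoneInfo("America/Los_Angeles")

def is_peak(now: datetime) -> bool:
    """True on weekdays between 5:00 and 11:00 AM Pacific."""
    local = now.astimezone(PACIFIC)
    return local.weekday() < 5 and PEAK_START <= local.time() < PEAK_END

def debit_session(budget: float, tokens_used: int, now: datetime) -> float:
    """Charge tokens against the session budget, weighted during peak hours."""
    weight = PEAK_MULTIPLIER if is_peak(now) else 1.0
    return budget - tokens_used * weight

# The same 10,000-token job costs twice as much budget at 9 AM Pacific
# as at 9 PM Pacific under this illustrative weighting.
peak = datetime(2026, 3, 25, 9, 0, tzinfo=PACIFIC)       # Wednesday morning
off_peak = datetime(2026, 3, 25, 21, 0, tzinfo=PACIFIC)  # Wednesday evening
print(debit_session(100_000, 10_000, peak))      # 80000.0
print(debit_session(100_000, 10_000, off_peak))  # 90000.0
```

This is why a "five-hour session" evaporates faster during peak hours: the clock isn't moving faster, the debit per token is larger.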
Anthropic’s statement on the changes included a telling detail:
“We’ve landed a lot of efficiency wins to offset this, but ~7% of users will hit session limits they wouldn’t have had before, particularly in Pro tiers. If you run token-intensive background jobs, shifting them to off-peak hours will stretch your session limits further.”
Adding to the confusion, the throttling announcement landed at the same time as a separate (and more serious) technical incident: a bug in Claude Code’s prompt caching system that caused token consumption to run 10–20x higher than expected. [7] When a cache miss occurred due to the bug, the model had to reprocess entire conversation histories from scratch, massively inflating token usage. Some users reported a single prompt consuming 37% of their entire session limit. Anthropic acknowledged the bug publicly and called it a top priority, but the combination of intentional throttling, a technical bug, and the end of a two-week promotional period that had doubled off-peak limits made the timing particularly painful for developers. [8]
Is It Cost Savings or Real Infrastructure Limits?
This is the core question. The cynical read: Anthropic is throttling to cut costs. The naive read: they’re genuinely out of capacity. The reality is more nuanced—and more interesting.
The Cost Math Is Real
As documented in the AI Benefits - But at What Cost? article, running Claude is extraordinarily expensive. Anthropic’s CFO stated in a March 2026 legal filing that the company had spent “over $10 billion” on inference and training combined while generating only $5 billion in cumulative lifetime revenue. [9] An analysis of the Max subscription tier found that a single heavy user can generate $163 in actual compute costs while paying $100–$200/month. [10]
The peak hours that Anthropic throttles (5 AM–11 AM Pacific) are exactly when North American and European business hours overlap—the most expensive time to run inference because demand is highest and capacity is most constrained. Cloud GPU providers can and do charge premium rates during high-demand periods. By reducing the rate at which subscriptions consume capacity during these hours, Anthropic directly reduces its highest-cost operational window.
So yes: throttling is partly a cost-management strategy.
But the Infrastructure Limits Are Also Real
Here is where the “infinite cloud scale” narrative breaks down.
The promise of cloud elasticity was built for traditional workloads—web servers, databases, application backends—where you can spin up additional virtual machines in seconds to handle traffic spikes. Those workloads run on commodity CPUs and memory. You want more? The cloud provisions more.
AI inference is fundamentally different. Running a large language model like Claude requires specialized accelerators: NVIDIA GPUs (primarily H100s) or custom silicon like AWS Trainium2 and Google TPUs. These chips are:
- Physically scarce. NVIDIA produces a finite number of H100s per year, constrained by advanced chip packaging and high-bandwidth memory supply chains. Even with hundreds of thousands of chips in AWS’s Project Rainier, the global supply of frontier AI hardware is not elastic on short timescales. [11]
- Expensive to acquire. An H100 GPU cost $25,000–$30,000 in 2025. Building out a new data center wing takes months to years, not minutes.
- Power-constrained. Modern AI data centers require enormous amounts of power and specialized cooling infrastructure. The largest hyperscalers are running into limits on available grid power at their existing sites, forcing them to develop new campuses from scratch. [12]
Google’s own cloud blog acknowledged the reality directly in a post titled “The Infinite Capacity Myth: How AI Is Breaking the Old Cloud Rules.” [13] The “scale instantly on demand” model that defined a decade of cloud computing simply does not apply to GPU-dependent AI inference at scale.
The Honest Answer: Both
Anthropic is throttling because the economics of AI subscriptions don’t work, and because their hosting infrastructure has real physical limits during peak demand periods. These aren’t competing explanations—they reinforce each other.
If Anthropic had unlimited cheap capacity, they could absorb the cost overruns of heavy users and simply provision more GPUs to meet demand. They can’t do either. So they throttle.
The throttling is a symptom of the same underlying problem: AI compute is expensive, scarce, and not nearly as elastic as the cloud marketing suggests.
What This Tells Us About “Cloud Scale” for AI
The traditional cloud model worked because adding more VMs to handle web traffic is cheap and nearly instantaneous. The marginal cost of serving one more web request approaches zero at scale.
AI inference has no such economy of scale. Each token Claude generates consumes a roughly fixed amount of GPU compute, so more users mean proportionally more GPU time and proportionally more cost. And, critically, once the GPUs are already busy, more demand means more queuing and degraded service.
This is why peak-hour throttling makes sense from a systems perspective. During business hours, Anthropic’s GPU clusters are running at or near capacity. Allowing heavy users to continue burning tokens at their normal rate would either:
- Degrade service for all users (slower responses, higher error rates), or
- Require Anthropic to provision additional GPU capacity that sits idle during off-peak hours—an enormous fixed cost that their subscription pricing cannot justify.
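A toy queueing model shows why running near capacity is so punishing. In a textbook M/M/1 queue, the average wait scales as utilization divided by spare capacity, so latency explodes as a cluster approaches full load. This is a classic simplification for intuition, not a model of Anthropic's actual serving stack.

```python
def mm1_wait(utilization: float, service_time: float = 1.0) -> float:
    """Average queueing delay for an M/M/1 queue:
    W_q = rho / (1 - rho) * service_time. Diverges as utilization -> 1."""
    if not 0 <= utilization < 1:
        raise ValueError("utilization must be in [0, 1)")
    return utilization / (1 - utilization) * service_time

# Queueing delay per request, in multiples of one request's service time:
for rho in (0.50, 0.80, 0.90, 0.95, 0.99):
    print(f"{rho:.0%} busy -> {mm1_wait(rho):6.1f}x service time in queue")
```

At 50% utilization a request waits about one service time in queue; at 99% it waits about ninety-nine. Shaving even a few percentage points off peak demand buys a disproportionate amount of headroom, which is exactly what throttling heavy users accomplishes.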
Shifting heavy users to off-peak hours is the classic solution from the history of shared computing: time-sharing. The 1960s and 1970s mainframe era solved exactly this problem by making compute cheaper at night. Anthropic has just rediscovered it at scale.
What It Means for Developers
If you’re a developer building with Claude, a few practical implications:
The limits are real and will likely get tighter. Anthropic is explicitly managing capacity, which means your session limits are not guaranteed to stay where they are today. As demand grows faster than infrastructure build-out, expect further adjustments.
Schedule batch and background work carefully. Any token-intensive job that doesn’t need to run during peak hours—background indexing, large refactoring passes, code generation pipelines—should be shifted to off-peak windows (nights, weekends). This is Anthropic’s explicit recommendation, and it’s good practice regardless.
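One way to follow that advice mechanically is to check whether the current time falls inside the published peak window and defer batch jobs until it closes. A minimal sketch; the window boundaries are the ones Anthropic announced, but the deferral policy is entirely up to you:

```python
from datetime import datetime, time
from zoneinfo import ZoneInfo

PACIFIC = ZoneInfo("America/Los_Angeles")
PEAK_END = time(11, 0)  # peak window ends 11:00 AM Pacific on weekdays

def seconds_until_off_peak(now: datetime) -> float:
    """0 if we're already off-peak; otherwise seconds until 11 AM Pacific."""
    local = now.astimezone(PACIFIC)
    in_peak = local.weekday() < 5 and time(5, 0) <= local.time() < PEAK_END
    if not in_peak:
        return 0.0
    end = local.replace(hour=PEAK_END.hour, minute=0, second=0, microsecond=0)
    return (end - local).total_seconds()

# Before launching a token-heavy batch job, sleep for
# seconds_until_off_peak(datetime.now(tz=PACIFIC)) seconds.
```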
Understand the difference between subscriptions and API access. Subscription plans are subject to session-based throttling. API access is not—you pay per token, and you can use as many tokens as you’re willing to pay for, subject to rate limits on requests per minute. For production workloads, the API may provide more predictable behavior than a subscription plan.
Watch your token costs closely. The Claude Code prompt cache bug illustrates that token consumption can spike without warning due to implementation issues. Instrument your usage, monitor costs, and set alerts before a bug or unexpected workload multiplies your spend.
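A simple guardrail is to track cumulative token spend and alert when a single call consumes an outsized share of your budget. The thresholds here are illustrative assumptions to tune against your own baseline; the token counts would come from wherever you log usage, such as the API response's usage metadata. A single prompt eating 37% of a session, as in the cache bug, would trip this check immediately.

```python
class TokenBudgetMonitor:
    """Track cumulative token spend and flag anomalous bursts.

    Thresholds are illustrative; tune them to your workload's baseline.
    """
    def __init__(self, session_budget: int, alert_fraction: float = 0.25):
        self.session_budget = session_budget
        self.alert_fraction = alert_fraction  # alert if one call eats this much
        self.used = 0

    def record(self, prompt_tokens: int, completion_tokens: int) -> list[str]:
        """Log one call's token usage; return any triggered alerts."""
        call_total = prompt_tokens + completion_tokens
        self.used += call_total
        alerts = []
        if call_total > self.alert_fraction * self.session_budget:
            alerts.append(f"single call used {call_total:,} tokens "
                          f"(> {self.alert_fraction:.0%} of session budget)")
        if self.used > self.session_budget:
            alerts.append("session budget exhausted")
        return alerts

monitor = TokenBudgetMonitor(session_budget=200_000)
monitor.record(1_000, 500)       # normal call: no alerts
monitor.record(70_000, 5_000)    # cache-miss-style blowup: triggers an alert
```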
The Bigger Picture
Anthropic throttling Claude is not a surprise if you’ve been paying attention to the economics. What it illustrates is a collision between two things that were never really compatible:
- The promise of cloud computing—unlimited elastic scale, frictionless on-demand resources.
- The reality of frontier AI inference—specialized hardware, fixed physical supply, enormous cost per token.
AWS’s Project Rainier, with nearly half a million Trainium2 chips deployed and a target of more than one million, is genuinely impressive infrastructure. Google’s TPU partnership represents a serious long-term bet on AI capacity. The hyperscalers are spending hundreds of billions of dollars to build the compute capacity the AI industry needs. [14]
But “hundreds of billions in capital expenditure” is not the same as “infinite.” The chips take time to manufacture. The data centers take time to build. The power infrastructure takes years to develop. And in the meantime, Anthropic has millions of users on fixed-price subscriptions that were priced assuming a level of AI efficiency that hasn’t yet materialized.
Throttling is what happens when the product you promised to deliver at scale is more expensive to deliver than the revenue you’re collecting to do it—on hardware that can’t be instantaneously expanded to meet demand.
The cloud is not infinitely scalable for AI. Not yet. Perhaps not ever, in the way the original cloud promise implied. And Anthropic is letting their users experience that reality directly, one throttled session at a time.
References
- Anthropic confirms it’s been ‘adjusting’ Claude usage limits - PCWorld
- Amazon doubles down on Anthropic: $8B bet to power the future of AI - TechFundingNews
- AWS activates Project Rainier cluster of nearly 500,000 Trainium2 chips - Data Center Dynamics
- Anthropic expands partnership with Google and Broadcom for multiple gigawatts of compute - Anthropic
- Enterprise deployment overview (third-party integrations) - Claude Code Docs
- Anthropic Reduces Claude Session Limits During Peak Hours For Free, Pro and Max Users - gHacks
- Anthropic admits Claude Code quotas running out too fast - The Register
- [BUG] Claude Max plan session limits exhausted abnormally fast - GitHub anthropics/claude-code #38335
- Anthropic CFO declaration, legal filing (March 9, 2026) - CourtListener
- Claude subscription limits analysis - she-llac.com
- The real economics of AI compute: power, wafers, and the illusion of infinite scale - Yole Group
- Amazon’s Project Rainier Sets New Standard for AI Supercomputing at Scale - Data Center Frontier
- The Prompt: Breaking down the infinite capacity myth - Google Cloud Blog
- Hyperscalers spending nearly $700 billion on AI infrastructure in 2026 - Yahoo Finance
- Anthropic tweaks Claude usage limits to manage capacity - The Register
- Anthropic throttles Claude subscriptions to meet capacity - InfoWorld
- Claude Peak Hours 2026: Why Your Weekly Limit Drains Faster on Weekday Mornings - TokenCalculator
- Anthropic finally explains why Claude usage limits feel tighter for some users - PiunikaWeb
- Inside Anthropic’s Multi-Cloud AI Factory: How AWS Trainium and Google TPUs Shape Its Next Phase - Data Center Frontier
- Anthropic Tightens Claude User Limit at Peak Hours as Demand Strains Capacity - Tekedia
Did you find this helpful? Let me know on BlueSky or subscribe to my newsletter for more content like this!