# Understanding Claude API Rate Limits
Hey developers! If you're building scalable apps with the Claude API, you've probably bumped into rate limits at some point. They keep the service fair and stable, but they can throttle your high-volume workloads. In this guide, we'll tackle them head-on with production-grade strategies: prompt caching, batching, retries, queuing, and optimization. By the end, you'll have code snippets in Python and TypeScript to keep your apps humming.
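Claude's limits vary by model and usage tier, and Anthropic revises them over time, so treat the numbers below as illustrative and check the Limits page in your Anthropic Console for your exact values. For Claude 3.5 Sonnet on Tier 1: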
| Limit | Value |
|-------|-------|
| RPM (Requests/Min) | 50 |
| ITPM (Input Tokens/Min) | 20,000 |
| OTPM (Output Tokens/Min) | 10,000 |
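Higher tiers unlock substantially more. One catch: Anthropic describes its limiter as a token bucket, so capacity refills continuously up to your maximum instead of resetting on the minute. A short burst can still trigger a 429 even when your per-minute average is under the limit.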
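To see which of the three limits bites first, a quick back-of-the-envelope calculation helps. The sketch below plugs in the Tier 1 figures from the table; `max_sustainable_rpm` is a hypothetical helper, and the average token counts are made-up workload numbers you would replace with your own measurements.

```python
def max_sustainable_rpm(rpm, itpm, otpm, avg_input_tokens, avg_output_tokens):
    """Return (requests_per_minute, binding_limit) for a steady workload."""
    candidates = {
        "RPM": float(rpm),
        "ITPM": itpm / avg_input_tokens,
        "OTPM": otpm / avg_output_tokens,
    }
    # The smallest candidate is the limit you will hit first.
    binding = min(candidates, key=candidates.get)
    return candidates[binding], binding


# Tier 1 figures from the table; the token averages are invented for illustration.
rate, limit = max_sustainable_rpm(
    rpm=50, itpm=20_000, otpm=10_000,
    avg_input_tokens=1_000, avg_output_tokens=300,
)
print(f"{rate:.1f} requests/min, bound by {limit}")
```

With these example numbers the input-token budget, not the request count, is the real ceiling, which is why the caching and prompt-trimming steps later in this guide matter.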
## Step 1: Implement Exponential Backoff and Retries
The first line of defense? Smart retries. When you hit a 429 (rate limit), don't hammer the API—back off exponentially.
Use libraries like `tenacity` in Python or `p-retry` in Node.js/TS.
### Python Example (Anthropic SDK)
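```python
import anthropic
from tenacity import retry, retry_if_exception_type, stop_after_attempt, wait_exponential

# Reads ANTHROPIC_API_KEY from the environment.
# max_retries=0 turns off the SDK's built-in retries so tenacity is the only retry layer.
client = anthropic.Anthropic(max_retries=0)


@retry(
    stop=stop_after_attempt(5),  # 5 attempts in total, i.e. up to 4 retries
    wait=wait_exponential(multiplier=4, max=60),  # roughly 4s, 8s, 16s, 32s
    retry=retry_if_exception_type(anthropic.RateLimitError),
    reraise=True,  # surface the original error once attempts are exhausted
)
def call_claude(prompt):
    return client.messages.create(
        model="claude-3-5-sonnet-20240620",
        max_tokens=1024,
        messages=[{"role": "user", "content": prompt}],
    )


response = call_claude("Explain quantum computing simply.")
print(response.content[0].text)
```
This makes up to 5 attempts (4 retries), waiting roughly 4s → 8s → 16s → 32s, and retries only on `RateLimitError`. The SDK also retries some failures on its own by default, which is why the client above sets `max_retries=0`; otherwise the two retry layers multiply.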
### TypeScript Example (@anthropic-ai/sdk)
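```typescript
import Anthropic from '@anthropic-ai/sdk';
import pRetry, { AbortError } from 'p-retry';

// maxRetries: 0 turns off the SDK's built-in retries so p-retry is the only retry layer.
const client = new Anthropic({ apiKey: process.env.ANTHROPIC_API_KEY, maxRetries: 0 });

async function callClaude(prompt: string) {
  return pRetry(
    async () => {
      try {
        return await client.messages.create({
          model: 'claude-3-5-sonnet-20240620',
          max_tokens: 1024,
          messages: [{ role: 'user', content: prompt }],
        });
      } catch (err) {
        // Retry only on 429s; AbortError tells p-retry to give up immediately.
        if (err instanceof Anthropic.RateLimitError) throw err;
        throw new AbortError(err instanceof Error ? err : String(err));
      }
    },
    {
      retries: 5, // up to 5 retries after the first attempt
      minTimeout: 4000, // first wait is 4s
      factor: 2, // then 8s, 16s, ...
    }
  );
}

const response = await callClaude('Explain quantum computing simply.');
const first = response.content[0];
// Content blocks are a union type, so narrow to a text block before reading .text
if (first.type === 'text') console.log(first.text);
```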
Pro tip: Log retry attempts for monitoring.
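If you'd like to see what that logging could look like without pulling in a retry library, here's a minimal standard-library sketch. `retry_with_logging`, its parameters, and the `flaky` stand-in are all hypothetical names for illustration; with the real client you would pass something like `lambda e: isinstance(e, anthropic.RateLimitError)` as `is_retryable`. It also adds full jitter, a common tweak (not something the snippets above do) that keeps many clients from retrying in lockstep.

```python
import logging
import random
import time

logger = logging.getLogger("claude.retry")


def retry_with_logging(fn, is_retryable, max_attempts=5, base=4.0, cap=60.0, sleep=time.sleep):
    """Call fn(); on retryable errors, log, wait with full jitter, and try again."""
    for attempt in range(1, max_attempts + 1):
        try:
            return fn()
        except Exception as exc:
            if attempt == max_attempts or not is_retryable(exc):
                raise
            # Exponential ceiling (4s, 8s, 16s, ...) capped at `cap`, then full jitter.
            ceiling = min(cap, base * 2 ** (attempt - 1))
            delay = random.uniform(0, ceiling)
            logger.warning("attempt %d failed (%r); retrying in %.1fs", attempt, exc, delay)
            sleep(delay)


# Demo with a stand-in function that fails twice before succeeding.
calls = {"n": 0}


def flaky():
    calls["n"] += 1
    if calls["n"] < 3:
        raise TimeoutError("simulated 429")
    return "ok"


result = retry_with_logging(
    flaky,
    is_retryable=lambda e: isinstance(e, TimeoutError),
    sleep=lambda seconds: None,  # skip real waiting in the demo
)
print(result, calls["n"])
```

Injecting `sleep` keeps the helper easy to test; in production you would leave the default `time.sleep` in place and ship the warnings to your log pipeline.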
## Step 2: Leverage Prompt Caching for Repeated Prefixes
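Anthropic's prompt caching (GA as of late 2024) is a game-changer. Cache common prompt prefixes (e.g., system instructions) to cut latency (Anthropic cites up to 85% on long prompts) and input costs (up to 90%, because cache reads are billed at a fraction of the normal input price).
The default cache lifetime is 5 minutes, refreshed every time the cached prefix is read; a 1-hour lifetime is available at a higher write price. Cache writes cost a bit more than regular input tokens, so caching only pays off when a prefix is actually reused.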
### Python: Enable Caching
```python
response = client.messages.create(
    model="claude-3-5-sonnet-20240620",
    max_tokens=1024,
    messages=[{
        "role": "user",
        # cache_control goes on a content block; marking the last block caches the whole prompt
        "content": [{"type": "text", "text": "Your prompt", "cache_control": {"type": "ephemeral"}}],
    }],
)
```
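For prefixes, put the shared instructions in the top-level `system` parameter (the Messages API has no `system` role inside `messages`) and mark that block:
```python
response = client.messages.create(
    model="claude-3-5-sonnet-20240620",
    max_tokens=1024,
    system=[
        {"type": "text", "text": "You are a helpful assistant.", "cache_control": {"type": "ephemeral"}}
    ],
    messages=[{"role": "user", "content": "Answer this: What is caching?"}],
)
```
Subsequent calls with an identical prefix read from the cache. Keep in mind that prefixes shorter than the model's minimum cacheable length (1,024 tokens for Claude 3.5 Sonnet) are not cached at all, so the one-line system prompt here is only a placeholder for your real, longer instructions.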
### TypeScript
```typescript
const response = await client.messages.create({
  model: 'claude-3-5-sonnet-20240620',
  max_tokens: 1024,
  // System prompts go in the top-level `system` field, not in `messages`
  system: [
    {
      type: 'text',
      text: 'You are a helpful assistant.',
      cache_control: { type: 'ephemeral' },
    },
  ],
  messages: [{ role: 'user', content: 'Answer this: What is caching?' }],
});
```
Ideal for chat apps or RAG with fixed instructions. Monitor `usage.cache_creation_input_tokens` vs. `cache_read_input_tokens`.
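To turn those two counters into something you can chart, here's a rough sketch. `cache_report` is a hypothetical helper, and the multipliers (cache writes at 1.25x and cache reads at 0.1x the base input price for the 5-minute cache) are assumptions based on Anthropic's published pricing at the time of writing, so double-check them. It takes a plain dict; with the Python SDK you could likely get one from `response.usage.model_dump()`.

```python
def cache_report(usage, write_multiplier=1.25, read_multiplier=0.10):
    """Summarise cache effectiveness from a response's usage counters."""
    uncached = usage.get("input_tokens", 0)
    written = usage.get("cache_creation_input_tokens", 0)
    read = usage.get("cache_read_input_tokens", 0)
    total = uncached + written + read
    # Cost in "base input token" units, relative to sending everything uncached.
    billed = uncached + written * write_multiplier + read * read_multiplier
    return {
        "hit_ratio": read / total if total else 0.0,
        "relative_input_cost": billed / total if total else 0.0,
    }


# Example counters for a request that mostly hit the cache (numbers are invented).
report = cache_report({
    "input_tokens": 50,
    "cache_creation_input_tokens": 0,
    "cache_read_input_tokens": 1950,
})
print(report)
```

A `relative_input_cost` above 1.0 means you are paying the write premium without getting enough reads back, which usually points to a prefix that changes between calls.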
## Step 3: Batch Requests for High-Volume Workloads
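Claude's Message Batches API processes up to 100k requests per batch asynchronously. Results are promised within 24 hours, and many batches finish much sooner. Perfect for non-real-time tasks like data processing.
Costs 50% less, and batches are rate-limited separately from regular Messages requests, so bulk jobs don't compete with your real-time traffic.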
### Python: Create Batch
```python
import anthropic

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment

# Requests are passed inline as a list; there is no JSONL upload step.
batch = client.messages.batches.create(
    requests=[
        {
            "custom_id": "task-1",
            "params": {
                "model": "claude-3-haiku-20240307",
                "max_tokens": 1024,
                "messages": [{"role": "user", "content": "Task 1"}],
            },
        }
    ]
)

# Poll status: processing_status becomes "ended" once every request has finished
status = client.messages.batches.retrieve(batch.id)
print(status.processing_status)
```
Once the batch has ended, stream the results:
```python
for entry in client.messages.batches.results(batch.id):
    if entry.result.type == "succeeded":
        print(entry.custom_id, entry.result.message.content[0].text)
    else:
        print(entry.custom_id, entry.result.type)  # errored, canceled, or expired
```
### TypeScript
```typescript
// Requests are passed inline; there is no file upload step
const batch = await client.messages.batches.create({
  requests: [
    {
      custom_id: 'task-1',
      params: {
        model: 'claude-3-haiku-20240307',
        max_tokens: 1024,
        messages: [{ role: 'user', content: 'Task 1' }],
      },
    },
  ],
});
// Retrieve: processing_status becomes 'ended' once every request has finished
const status = await client.messages.batches.retrieve(batch.id);
console.log(status.processing_status);
```
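Once `processing_status` is `ended`, fetch the results with `client.messages.batches.results(batch.id)` and match them back to your inputs by `custom_id` (results are not guaranteed to come back in request order). Use for bulk analysis, email generation, etc.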
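When your job is bigger than one batch, you'll need to split it. Here's a small sketch of that step: `build_batches` is a hypothetical helper that turns a list of prompts into request dicts (each with a `custom_id` and a `params` object, the shape the Message Batches API expects) and chunks them to the 100k-per-batch ceiling mentioned above. The `task-{i}` ID scheme is just an example.

```python
def build_batches(prompts, model, max_tokens=1024, batch_size=100_000):
    """Yield lists of request dicts, each list small enough for one batch."""
    batch = []
    for i, prompt in enumerate(prompts):
        batch.append({
            "custom_id": f"task-{i}",  # unique ID used to match results to inputs
            "params": {
                "model": model,
                "max_tokens": max_tokens,
                "messages": [{"role": "user", "content": prompt}],
            },
        })
        if len(batch) == batch_size:
            yield batch
            batch = []
    if batch:  # emit the final, partially filled batch
        yield batch


# Tiny batch_size so the chunking is visible in a demo.
batches = list(build_batches(
    ["Task A", "Task B", "Task C"],
    model="claude-3-haiku-20240307",
    batch_size=2,
))
print([len(b) for b in batches])
```

Each yielded list could then be passed as the `requests` argument when creating a batch.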
## Step 4: Client-Side Queuing and Throttling
For real-time apps, queue requests and throttle to stay under limits.
Python: `ratelimit` for synchronous code, or `aiolimiter` for asyncio.
```python
from ratelimit import limits, sleep_and_retry

ONE_MINUTE = 60
RPM = 50


@sleep_and_retry  # sleep until the window frees up instead of raising
@limits(calls=RPM, period=ONE_MINUTE)
def throttled_call(prompt):
    return client.messages.create(
        model="claude-3-5-sonnet-20240620",
        max_tokens=1024,
        messages=[{"role": "user", "content": prompt}],
    )
```
Queue with `asyncio.Queue` for concurrency.
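Here's one way that queue could look, as a minimal sketch rather than a drop-in solution. The names `worker`, `run_all`, and `fake_call` are hypothetical; `fake_call` stands in for a real async API call so the example runs offline. Workers share a lock and a "next allowed start" timestamp so that, together, they never start requests faster than the RPM budget allows.

```python
import asyncio
import time


async def worker(queue, results, call, min_interval, state):
    while True:
        index, prompt = await queue.get()
        try:
            # Space out request starts so all workers together respect the RPM budget.
            async with state["lock"]:
                wait = state["next_start"] - time.monotonic()
                if wait > 0:
                    await asyncio.sleep(wait)
                state["next_start"] = time.monotonic() + min_interval
            results[index] = await call(prompt)
        except Exception as exc:  # keep the worker alive; record the failure instead
            results[index] = exc
        finally:
            queue.task_done()


async def run_all(prompts, call, rpm=50, concurrency=4):
    queue = asyncio.Queue()
    for item in enumerate(prompts):
        queue.put_nowait(item)
    results = [None] * len(prompts)
    state = {"lock": asyncio.Lock(), "next_start": 0.0}
    workers = [
        asyncio.create_task(worker(queue, results, call, 60 / rpm, state))
        for _ in range(concurrency)
    ]
    await queue.join()  # wait until every queued prompt has been handled
    for task in workers:
        task.cancel()
    await asyncio.gather(*workers, return_exceptions=True)
    return results


async def fake_call(prompt):  # stand-in for a real async client call
    await asyncio.sleep(0.01)
    return prompt.upper()


# A high rpm keeps the demo fast; use your real limit in practice.
results = asyncio.run(run_all(["a", "b", "c"], fake_call, rpm=6000))
print(results)
```

Requests can still overlap in flight (up to `concurrency`), which is usually what you want: the limit is on how often calls start, not on how many are running.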
TS: `bottleneck` lib.
```typescript
import Bottleneck from 'bottleneck';

const limiter = new Bottleneck({
  minTime: 60_000 / 50, // 1.2s between calls for 50 RPM
  maxConcurrent: 1,
});

const throttledCall = limiter.wrap(async (prompt: string) => {
  return client.messages.create({ /* ... */ });
});
```
Combine with Redis for distributed queuing in production.
## Step 5: Optimize Prompts to Reduce Token Usage
Smaller prompts = fewer TPM hits.
- Use concise system prompts.
- XML tags for structure (Claude loves them).
- Truncate history in chats.
- Prefer Haiku for simple tasks.
Example: instead of a long, chatty prompt, use something like:
```
<system>You are concise.</system>
<query>What is X?</query>
```
Monitor `usage` in responses to iterate.
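For the "truncate history" tip, here's a rough sketch of trimming a chat to a token budget. Both helpers are hypothetical, and `estimate_tokens` uses a crude ~4 characters per token rule of thumb, which is an assumption and not Claude's real tokenizer; swap in actual token counts from the API if you need accuracy.

```python
def estimate_tokens(text):
    """Very rough heuristic (~4 characters per token); not a real tokenizer."""
    return max(1, len(text) // 4)


def truncate_history(messages, budget):
    """Keep the most recent messages whose estimated tokens fit in `budget`."""
    kept, used = [], 0
    for message in reversed(messages):  # walk from newest to oldest
        cost = estimate_tokens(message["content"])
        if used + cost > budget:
            break
        kept.append(message)
        used += cost
    kept.reverse()
    # Drop leading assistant turns so the conversation still starts with a user turn.
    while kept and kept[0]["role"] != "user":
        kept.pop(0)
    return kept


history = [
    {"role": "user", "content": "x" * 400},       # ~100 estimated tokens
    {"role": "assistant", "content": "y" * 400},  # ~100 estimated tokens
    {"role": "user", "content": "z" * 40},        # ~10 estimated tokens
]
trimmed = truncate_history(history, budget=120)
print([m["role"] for m in trimmed])
```

Dropping whole turns is the simplest policy; summarizing older turns into a single short message is a common alternative when early context still matters.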
## Step 6: Monitoring, Alerts, and Scaling
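- Track headers: `anthropic-ratelimit-*` (e.g. `anthropic-ratelimit-requests-remaining`, `anthropic-ratelimit-tokens-remaining`) for remaining limits, plus `retry-after` on 429s.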
- Use Datadog/New Relic for 429 spikes.
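- Tiers advance automatically as your spend and deposits grow; for limits beyond the top tier, contact Anthropic sales through the Console.
- Don't expect extra API keys to add capacity: limits apply per organization, so keys from the same org share one budget. Pooling only helps across separately provisioned organizations.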
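Python key pooling (only useful for keys from separate organizations, since keys within one organization share the same limits):
```python
from itertools import cycle
from typing import List

from anthropic import AsyncAnthropic


class KeyPool:
    def __init__(self, keys: List[str]):
        self.clients = [AsyncAnthropic(api_key=k) for k in keys]
        self._round_robin = cycle(self.clients)

    def next_client(self) -> AsyncAnthropic:
        # Round-robin: hand out clients in turn
        return next(self._round_robin)
```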
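For the header tracking in the list above, here's a small sketch that pulls the remaining budget out of a response's headers and decides whether to raise an alert. `remaining_capacity` and `should_alert` are hypothetical helpers, and the threshold is arbitrary. The header names follow Anthropic's documented `anthropic-ratelimit-*` scheme; with the Python SDK you can probably reach the raw headers through `client.messages.with_raw_response.create(...)`, but treat that wiring as an assumption to verify.

```python
def remaining_capacity(headers):
    """Extract the remaining request/token budget from response headers."""
    def parse(name, cast):
        value = headers.get(name)
        return cast(value) if value is not None else None

    return {
        "requests_remaining": parse("anthropic-ratelimit-requests-remaining", int),
        "tokens_remaining": parse("anthropic-ratelimit-tokens-remaining", int),
        "retry_after_s": parse("retry-after", float),
    }


def should_alert(capacity, min_requests=5):
    """Flag when the remaining request budget drops below a threshold."""
    remaining = capacity["requests_remaining"]
    return remaining is not None and remaining < min_requests


# Example headers (values invented). A plain dict is case-sensitive, so keys are lowercase here.
sample = {
    "anthropic-ratelimit-requests-remaining": "3",
    "anthropic-ratelimit-tokens-remaining": "18000",
}
capacity = remaining_capacity(sample)
print(capacity, should_alert(capacity))
```

Feeding `requests_remaining` into your metrics system as a gauge gives you an early warning before the 429s start.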
## Wrapping Up
With retries, caching, batching, throttling, and optimization, you'll conquer Claude API limits. Start with backoff + caching for quick wins, then layer on batching for bulk. Test under load with Locust or Artillery.
Building something cool? Share in comments! Check Anthropic docs for latest limits. Happy scaling! 🚀