Key Takeaways
- API throttling limits the number of requests a client can make to prevent server overload
- Burst allowance provides temporary flexibility above the base rate limit
- Delayed requests indicate your application needs optimization or caching
- Best practice: Stay at 80% or below your rate limit capacity
- Implement exponential backoff for handling 429 (Too Many Requests) responses
What Is API Throttling?
API throttling (also called rate limiting) is a technique used by API providers to control the number of requests a client can make within a specified time period. This prevents server overload, ensures fair usage, and maintains service quality for all users.
When you exceed the rate limit, the API typically returns a 429 Too Many Requests HTTP status code, and your requests are either delayed or rejected until the rate limit window resets.
Example: a client sending 100 req/min against a 60 req/min limit with a burst allowance of 10 has 40 requests over the limit, of which 30 are delayed or rejected.
How API Throttling Is Calculated
Over Limit = max(0, Requests - Rate Limit)
Delayed Requests = max(0, Over Limit - Burst Allowance)
Average Delay = 60 / Rate Limit seconds
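The formulas above can be applied directly. A minimal sketch (the function name is illustrative, not from any particular library):

```python
def throttling_stats(requests_per_min, rate_limit, burst_allowance):
    """Estimate throttling impact using the formulas above."""
    over_limit = max(0, requests_per_min - rate_limit)
    delayed = max(0, over_limit - burst_allowance)
    avg_delay = 60 / rate_limit  # seconds between permitted requests
    return over_limit, delayed, avg_delay

# 100 req/min against a 60 req/min limit with a burst of 10
print(throttling_stats(100, 60, 10))  # → (40, 30, 1.0)
```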
How to Handle API Throttling
Implement Request Queuing
Queue requests and process them at a controlled rate that stays within limits. This prevents sudden bursts that trigger throttling.
Use Exponential Backoff
When you receive a 429 error, wait progressively longer between retries: 1s, 2s, 4s, 8s, etc. This gives the rate limit time to reset.
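A minimal retry wrapper along these lines, assuming `send` is a callable that returns a response object with a `status_code` attribute (as in libraries like `requests`):

```python
import random
import time

def request_with_backoff(send, max_retries=5, base_delay=1.0):
    """Retry on 429, doubling the wait each attempt: 1s, 2s, 4s, ..."""
    for attempt in range(max_retries):
        response = send()
        if response.status_code != 429:
            return response
        # Small random jitter so many clients don't retry in lockstep
        wait = base_delay * (2 ** attempt) + random.uniform(0, 0.1)
        time.sleep(wait)
    raise RuntimeError("rate limit: retries exhausted")
```

The jitter term is a common refinement: without it, clients that were throttled together retry together and hit the limit again in sync.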
Implement Caching
Cache API responses to reduce the number of requests. Many responses don't change frequently and can be cached for minutes or hours.
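A small time-to-live (TTL) cache is often enough; this sketch (the class name is illustrative) returns a cached value when it is still fresh and only calls the API otherwise:

```python
import time

class ResponseCache:
    """Cache fetched values for a TTL to avoid repeat API calls."""
    def __init__(self, ttl_seconds=300):
        self.ttl = ttl_seconds
        self._store = {}  # key -> (timestamp, value)

    def get(self, key, fetch):
        """Return a fresh cached value, or call fetch() and cache it."""
        entry = self._store.get(key)
        if entry and time.monotonic() - entry[0] < self.ttl:
            return entry[1]
        value = fetch()
        self._store[key] = (time.monotonic(), value)
        return value
```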
Monitor Rate Limit Headers
Most APIs return headers like X-RateLimit-Remaining and X-RateLimit-Reset. Use these to proactively slow down before hitting limits.
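A sketch of that proactive check, assuming the common `X-RateLimit-Remaining` / `X-RateLimit-Reset` (Unix seconds) convention; exact header names and formats vary by API:

```python
import time

def respect_rate_limit(headers, floor=5):
    """Sleep until the window resets when remaining requests run low.

    Returns the number of seconds actually waited."""
    remaining = int(headers.get("X-RateLimit-Remaining", floor + 1))
    if remaining > floor:
        return 0.0  # plenty of headroom, carry on
    reset_at = float(headers.get("X-RateLimit-Reset", 0))
    wait = max(0.0, reset_at - time.time())
    time.sleep(wait)
    return wait
```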
Pro Tip: The 80% Rule
Design your application to use no more than 80% of your available rate limit under normal conditions. This provides headroom for traffic spikes and prevents unexpected throttling during peak usage.
Common API Rate Limits
Different APIs have vastly different rate limits. Here are some examples from popular services:
- Twitter API v2: 450 requests per 15-minute window (authenticated)
- GitHub API: 5,000 requests per hour (authenticated)
- Google Maps API: 50 requests per second
- Stripe API: 100 read requests per second
- OpenAI API: Varies by model and tier
- AWS API Gateway: Configurable, default 10,000 req/sec
Frequently Asked Questions
What is burst allowance?
Burst allowance is a temporary buffer that allows you to exceed the base rate limit for short periods. For example, if your rate limit is 60 req/min with a burst of 10, you can briefly make up to 70 requests before throttling kicks in. This accommodates legitimate traffic spikes without immediately rejecting requests.
What should I do when I receive a 429 error?
When you receive a 429 error: (1) Check the Retry-After header for how long to wait, (2) Implement exponential backoff - wait 1s, then 2s, then 4s between retries, (3) Queue the failed request for later retry, (4) Log the event for monitoring. Never immediately retry a 429 response, as this can worsen the situation.
What's the difference between rate limiting and throttling?
Rate limiting typically refers to hard caps that reject requests once exceeded. Throttling often refers to slowing down (delaying) requests rather than rejecting them outright. In practice, these terms are often used interchangeably, and many APIs use a combination of both approaches.
How can I get a higher rate limit?
Options include: (1) Upgrade to a paid tier - most APIs offer higher limits for paying customers, (2) Apply for elevated access - some APIs (like Twitter) have application processes for higher limits, (3) Use multiple API keys if allowed, (4) Contact the API provider directly to discuss enterprise or custom arrangements.
How does the token bucket algorithm work?
The token bucket algorithm is a common rate limiting technique. Imagine a bucket that fills with tokens at a constant rate. Each request consumes one token. If the bucket is empty, requests are delayed or rejected. The bucket size determines the burst capacity. This allows for smoother rate limiting compared to fixed windows.
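The token bucket described above fits in a few lines; a minimal sketch (names are illustrative, not from a specific library):

```python
import time

class TokenBucket:
    """Refill at `rate` tokens/sec up to `capacity`; one token per request."""
    def __init__(self, rate, capacity):
        self.rate = rate          # refill rate, tokens per second
        self.capacity = capacity  # bucket size = burst capacity
        self.tokens = capacity    # start full
        self.last = time.monotonic()

    def allow(self):
        """Consume one token if available; return False to throttle."""
        now = time.monotonic()
        elapsed = now - self.last
        self.tokens = min(self.capacity, self.tokens + elapsed * self.rate)
        self.last = now
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False
```

Note how the capacity directly implements the burst allowance: a full bucket lets a short spike of requests through immediately, after which requests are admitted only as fast as tokens refill.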