Utilization Summary
Your team can safely issue 48,000 requests per minute before hitting the buffer.
At a Glance
- Queue headroom: No headroom — expect queue growth
- Queue growth: Stable — inbound rate stays within safe budget
- Burst backlog: Burst stays within safe capacity
- Burst recovery: 0 seconds to recover
Token Bucket Blueprint
These parameters work well for implementing a token bucket or leaky bucket throttle in code.
- Capacity: 60,000 tokens (matches provider window)
- Refill rate: 800 tokens / second
- Worker allowance: 66.67 tokens / second per client
- Sleep interval: 15 ms
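As a rough sketch, assuming a single worker process, the blueprint translates into a blocking token-bucket throttle like the one below; the acquire helper and the use of time.monotonic are illustrative choices, not part of the calculator output.

import time

CAPACITY = 60_000        # tokens, matching the provider window above
REFILL_RATE = 800        # tokens per second
SLEEP_INTERVAL = 0.015   # 15 ms between availability checks

tokens = float(CAPACITY)
last_refill = time.monotonic()

def acquire():
    """Block until a token is available, then consume it."""
    global tokens, last_refill
    while True:
        now = time.monotonic()
        tokens = min(CAPACITY, tokens + (now - last_refill) * REFILL_RATE)
        last_refill = now
        if tokens >= 1:
            tokens -= 1
            return
        time.sleep(SLEEP_INTERVAL)

Each worker calls acquire() before issuing a request; in a multi-worker setup, each process would throttle against its per-worker allowance rather than the full refill rate.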
Scaling Guidance
Understand when to scale horizontally versus dialing back concurrency.
- Add workers when: utilization exceeds 85% and backlog forms faster than it drains.
- Throttle per worker: target 4,000 requests per minute or slower.
- Queue sizing: allow for at least 0 pending requests to absorb spikes.
- Backoff policy: exponential backoff starting at 15 ms keeps retries within limits.
Operational Checklist
- ✅ Log 429 responses with correlation IDs to tune buffers.
- ✅ Surface queue depth in dashboards; alert at 70% capacity.
- ✅ Stagger worker start-up to avoid synchronized bursts.
- ✅ Recalculate when API providers change limits or pricing tiers.
Need higher throughput? Ask the provider for regional limits or evaluate multi-account sharding.
Understanding Rate Limiting
Rate limiting is a critical technique for controlling the number of requests a client can make to an API within a specified time window. It protects your infrastructure from overload, prevents abuse, ensures fair resource allocation, and maintains service quality for all users.
Why Rate Limiting Matters
Infrastructure Protection: Without rate limits, a single client or malicious actor could overwhelm your servers with requests, causing degraded performance or complete outages for all users. Rate limiting acts as a circuit breaker that prevents cascading failures.
Cost Control: Cloud providers charge based on compute time, bandwidth, and API calls to downstream services. Uncontrolled request volumes can lead to unexpected bills running into thousands of dollars. Rate limiting caps your maximum exposure.
Fair Resource Allocation: In multi-tenant systems, rate limits ensure that no single customer monopolizes shared resources. A noisy neighbor shouldn't be able to slow down everyone else's experience.
DDoS Mitigation: While not a complete defense, rate limiting is your first line of protection against denial-of-service attacks. It forces attackers to distribute their requests across more IP addresses and time.
Compliance and SLA Management: Many third-party APIs impose strict rate limits. Your internal rate limiting must stay within those bounds to avoid service interruptions and maintain contractual obligations.
Rate Limiting Algorithms
Token Bucket Algorithm
The token bucket algorithm is the most flexible and widely-used approach. Imagine a bucket that holds tokens, with new tokens added at a fixed rate. Each request consumes one token. If the bucket is empty, requests must wait or be rejected.
How it works:
- Initialize a bucket with a maximum capacity (e.g., 1000 tokens)
- Add tokens at a constant refill rate (e.g., 100 tokens per second)
- When a request arrives, check if a token is available
- If yes, remove one token and allow the request
- If no, reject the request with a 429 status code
- Never exceed the bucket's maximum capacity
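A minimal in-memory sketch of those steps follows; the class and method names are illustrative, and a real deployment would add locking or a shared store.

import time

class TokenBucket:
    def __init__(self, capacity=1000, refill_rate=100):
        self.capacity = capacity          # maximum tokens the bucket holds
        self.refill_rate = refill_rate    # tokens added per second
        self.tokens = float(capacity)
        self.last_refill = time.monotonic()

    def allow(self):
        """Return True if the request may proceed, False if it should get a 429."""
        now = time.monotonic()
        # Refill based on elapsed time, never exceeding capacity
        self.tokens = min(self.capacity,
                          self.tokens + (now - self.last_refill) * self.refill_rate)
        self.last_refill = now
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False

A caller constructs one bucket per client and responds with 429 whenever allow() returns False.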
Advantages:
- Handles traffic bursts elegantly - you can consume the entire bucket instantly if needed
- Simple to reason about and implement
- Works well with distributed systems when backed by Redis or similar
- Allows "saving up" capacity during quiet periods
Use cases: API gateways, microservices communication, client SDKs
Leaky Bucket Algorithm
The leaky bucket enforces a strictly constant output rate, regardless of input spikes. Requests enter a queue (bucket) and are processed at a fixed rate. If the queue fills up, new requests are rejected.
How it works:
- Maintain a FIFO queue with maximum size
- Process requests from the queue at a constant rate
- When a request arrives, add it to the queue if space is available
- If the queue is full, reject the request immediately
- A background process continuously "drains" the queue
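A minimal single-process sketch of this queue-and-drain behavior; the names, queue size, and threading approach are illustrative rather than a production design.

import queue
import threading
import time

bucket = queue.Queue(maxsize=500)   # the "bucket": a bounded FIFO queue
DRAIN_RATE = 100                    # requests processed per second

def handle(request):
    pass                            # placeholder for the real downstream call

def submit(request):
    """Enqueue a request, or reject it immediately if the bucket is full."""
    try:
        bucket.put_nowait(request)
        return True
    except queue.Full:
        return False                # caller should respond with 429

def drain():
    """Background worker that leaks requests at a constant rate."""
    while True:
        handle(bucket.get())
        time.sleep(1 / DRAIN_RATE)

threading.Thread(target=drain, daemon=True).start()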
Advantages:
- Guarantees perfectly smooth output rate
- Protects downstream services from any spikes
- Good for systems that can't handle bursty traffic
Disadvantages:
- Adds latency as requests wait in the queue
- Requires more infrastructure (queue management)
- Less intuitive for developers to understand
Use cases: Traffic shaping, streaming data pipelines, telecom systems
Fixed Window Counter
The fixed window algorithm counts requests in fixed time windows (e.g., per minute) and rejects requests once the limit is reached.
How it works:
- Define a time window (e.g., 00:00-00:59, 01:00-01:59)
- Count requests within the current window
- Allow requests if count < limit
- Reset the counter when the window expires
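A minimal in-memory sketch; names are illustrative, and stale window counters are left unpruned for brevity.

import time

WINDOW = 60     # seconds per window
LIMIT = 1000    # requests allowed per window

counters = {}   # window index -> request count

def allow_request():
    window = int(time.time()) // WINDOW
    if counters.get(window, 0) >= LIMIT:
        return False    # reject with 429
    counters[window] = counters.get(window, 0) + 1
    return True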
Advantages:
- Extremely simple to implement
- Low memory footprint (just a counter and timestamp)
- Easy to explain to stakeholders
Disadvantages:
- Vulnerable to "boundary spike" attacks - a client can send limit requests at 00:59 and another limit at 01:00, effectively doubling throughput
- Doesn't account for request distribution within the window
Use cases: Simple APIs, prototyping, systems where boundary spikes aren't a concern
Sliding Window Log
Sliding window log maintains a log of request timestamps and counts requests in a sliding time window, providing more accurate rate limiting than fixed windows.
How it works:
- Store timestamps of all requests (or a recent subset)
- When a new request arrives, count requests in the past N seconds
- Remove timestamps older than the window
- Allow the request if count < limit
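A minimal single-process sketch using a deque of timestamps; names are illustrative, and a Redis sorted set plays the same role in the distributed implementation shown later.

import time
from collections import deque

WINDOW = 60     # seconds
LIMIT = 1000    # requests allowed per sliding window

request_log = deque()   # timestamps of accepted requests

def allow_request():
    now = time.time()
    # Drop timestamps that have slid out of the window
    while request_log and request_log[0] <= now - WINDOW:
        request_log.popleft()
    if len(request_log) >= LIMIT:
        return False
    request_log.append(now)
    return True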
Advantages:
- No boundary spike vulnerability
- Accurate request rate measurement
- Fair distribution of capacity
Disadvantages:
- Higher memory usage (stores timestamps)
- More expensive computation (filtering timestamps)
- Harder to implement in distributed systems
Use cases: High-security APIs, premium tiers, systems requiring precise fairness
Sliding Window Counter (Hybrid)
A hybrid approach that combines fixed window efficiency with sliding window accuracy. It uses weighted counters from the current and previous windows.
How it works:
- Maintain counters for current and previous windows
- Calculate the rate as: previous_window_count × overlap_percentage + current_window_count
- Allow the request if the calculated rate < limit
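A minimal in-memory sketch, where overlap_percentage is the fraction of the previous window still covered by the sliding window; names are illustrative.

import time

WINDOW = 60     # seconds
LIMIT = 1000    # requests allowed per window

counts = {}     # window index -> request count

def allow_request():
    now = time.time()
    current = int(now) // WINDOW
    previous = current - 1
    # Fraction of the previous window still covered by the sliding window
    overlap = 1 - (now % WINDOW) / WINDOW
    weighted = counts.get(previous, 0) * overlap + counts.get(current, 0)
    if weighted >= LIMIT:
        return False
    counts[current] = counts.get(current, 0) + 1
    return True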
Advantages:
- More accurate than fixed window
- More efficient than sliding window log
- Good balance of simplicity and fairness
Disadvantages:
- Slightly more complex to implement
- Still has minor boundary effects (though reduced)
Use cases: Production APIs, rate limiting middleware, modern API gateways
Implementing Rate Limiting
Choosing the Right Algorithm
- For public APIs: use token bucket for flexibility and burst handling
- For background workers: use leaky bucket for consistent throughput
- For simple use cases: start with fixed window for quick implementation
- For critical systems: consider sliding window log for maximum accuracy
Distributed Rate Limiting
When running multiple servers, you need a centralized state store:
Redis-based Implementation:
import redis
import time
import uuid

redis_client = redis.Redis(host='localhost', port=6379)

def is_rate_limited(user_id, limit=100, window=60):
    key = f"rate_limit:{user_id}"
    current = int(time.time())
    # Remove old entries outside the window
    redis_client.zremrangebyscore(key, 0, current - window)
    # Count requests in the current window
    request_count = redis_client.zcard(key)
    if request_count < limit:
        # Add the current request; a unique member avoids collapsing
        # multiple requests that arrive in the same second
        redis_client.zadd(key, {f"{current}:{uuid.uuid4()}": current})
        redis_client.expire(key, window)
        return False
    return True
Token Bucket with Redis:
def check_rate_limit_token_bucket(user_id, capacity=1000, refill_rate=100):
    key = f"token_bucket:{user_id}"
    now = time.time()
    # Get current state
    data = redis_client.hgetall(key)
    if not data:
        # Initialize a full bucket
        tokens = float(capacity)
        last_refill = now
    else:
        tokens = float(data[b'tokens'])
        last_refill = float(data[b'last_refill'])
    # Calculate tokens to add since the last refill
    elapsed = now - last_refill
    tokens = min(capacity, tokens + elapsed * refill_rate)
    if tokens < 1:
        return True  # Rate limited
    # Consume one token for this request
    tokens -= 1
    # Update state (a Lua script or MULTI/EXEC would make this
    # read-modify-write atomic under heavy concurrency)
    redis_client.hset(key, mapping={
        'tokens': tokens,
        'last_refill': now
    })
    redis_client.expire(key, 60)
    return False
Response Headers
Always include rate limit information in response headers:
X-RateLimit-Limit: 5000
X-RateLimit-Remaining: 4987
X-RateLimit-Reset: 1699564800
Retry-After: 13
This helps clients implement proper backoff strategies.
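On the client side, a minimal sketch of honoring these headers with the requests library; the URL is a placeholder, and Retry-After is assumed to be given in seconds rather than as an HTTP date.

import time
import requests

response = requests.get("https://api.example.com/endpoint")
remaining = int(response.headers.get("X-RateLimit-Remaining", 1))
reset_at = int(response.headers.get("X-RateLimit-Reset", 0))

if response.status_code == 429:
    # Prefer Retry-After when the server provides it
    time.sleep(int(response.headers.get("Retry-After", 1)))
elif remaining == 0:
    # Out of quota: wait until the window resets
    time.sleep(max(0, reset_at - time.time()))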
Rate Limit Tiers
Different user tiers should have different limits:
- Free tier: 1,000 requests/hour
- Basic tier: 10,000 requests/hour
- Pro tier: 100,000 requests/hour
- Enterprise: Custom limits negotiated
Consider implementing burst limits separately from sustained limits.
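One way to express tiers in code is a lookup table the limiter consults per user. The sketch below follows the sustained limits above, but the burst figures and the user.tier attribute are illustrative assumptions.

TIER_LIMITS = {
    "free":       {"sustained_per_hour": 1_000,   "burst_per_minute": 50},
    "basic":      {"sustained_per_hour": 10_000,  "burst_per_minute": 300},
    "pro":        {"sustained_per_hour": 100_000, "burst_per_minute": 2_000},
    "enterprise": {"sustained_per_hour": None,    "burst_per_minute": None},  # negotiated
}

def limits_for(user):
    # user.tier is assumed to hold one of the keys above
    return TIER_LIMITS.get(user.tier, TIER_LIMITS["free"])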
Best Practices
1. Implement Graceful Degradation
Don't just reject requests with 429 errors. Consider:
- Queuing non-critical requests
- Returning cached data with a staleness indicator
- Offering reduced functionality at lower rate limits
2. Use Hierarchical Rate Limiting
Apply limits at multiple levels:
- Global limit: Protect overall system capacity
- Per-IP limit: Prevent individual abuse
- Per-user limit: Ensure fair allocation
- Per-endpoint limit: Protect expensive operations
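A sketch of layering these checks by reusing the Redis-backed is_rate_limited function from earlier; the request attributes and the per-level limits are illustrative.

def allow(request):
    checks = [
        ("global",                      100_000),  # protect overall capacity
        (f"ip:{request.client_ip}",       1_000),  # prevent individual abuse
        (f"user:{request.user_id}",       5_000),  # fair per-user allocation
        (f"endpoint:{request.path}",     10_000),  # shield expensive operations
    ]
    for key, limit in checks:
        if is_rate_limited(key, limit=limit, window=60):
            return False   # respond with 429 and name the exceeded scope
    return True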
3. Monitor and Alert
Track these metrics:
- Requests rejected due to rate limits
- Time spent in queue (for leaky bucket)
- Token bucket fill levels
- Distribution of requests across time windows
Alert when:
- Rejection rate exceeds 5% for any user
- Global utilization consistently above 85%
- Specific endpoints seeing unusual traffic patterns
4. Document Clearly
Your API documentation must include:
- Exact rate limits for each tier
- Time window definitions
- Retry-After header guidance
- Recommended backoff strategies
- Contact information for limit increases
5. Implement Client-Side Rate Limiting
Don't rely solely on server enforcement. SDKs should:
- Track request counts locally
- Implement automatic backoff
- Respect Retry-After headers
- Queue requests intelligently
Common Pitfalls
Clock Skew in Distributed Systems
When multiple servers have different system times, rate limiting becomes inconsistent. Solutions:
- Use NTP synchronization
- Rely on Redis timestamps rather than application server clocks (see the sketch after this list)
- Implement sliding window algorithms that are more tolerant of small skew
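As one example, redis-py exposes the Redis server's clock through the TIME command, which can stand in for time.time() in the earlier snippets so every application server measures windows against the same clock.

def redis_now():
    # TIME returns (seconds, microseconds) from the Redis server's clock
    seconds, microseconds = redis_client.time()
    return seconds + microseconds / 1_000_000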
Thundering Herd Problem
When rate limit windows reset, all clients may rush to send requests simultaneously. Mitigations:
- Use sliding windows instead of fixed windows
- Implement jitter in client retry logic
- Stagger window reset times for different users
Insufficient Burst Capacity
If your token bucket capacity is too small, legitimate traffic spikes get rejected. Guidelines:
- Capacity should be at least 10x the per-second limit
- Monitor P99 request batch sizes
- Adjust based on real traffic patterns
Poor Error Messages
Generic "Too Many Requests" errors frustrate developers. Include:
- Which specific limit was exceeded (global, per-user, per-endpoint)
- Exactly when the limit resets
- Recommended retry timing
- Link to documentation
Not Accounting for Retry Storms
When clients automatically retry failed requests, you can enter a death spiral where retries consume all capacity. Solutions:
- Implement exponential backoff with jitter
- Add circuit breakers to client SDKs
- Return 503 instead of 429 when the system is actually overloaded
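A minimal sketch of exponential backoff with full jitter on the client side; send_request stands in for whatever call your SDK makes, and the base delay and cap are illustrative.

import random
import time

def backoff_delay(attempt, base=0.05, cap=30.0):
    """Full jitter: sleep a random amount up to an exponentially growing ceiling."""
    ceiling = min(cap, base * (2 ** attempt))
    return random.uniform(0, ceiling)

def call_with_backoff(send_request, max_attempts=6):
    for attempt in range(max_attempts):
        response = send_request()
        if response.status_code != 429:
            return response
        time.sleep(backoff_delay(attempt))
    return response   # give up after the final attempt

Randomizing the delay keeps a fleet of clients from retrying in lockstep, which is what turns a brief limit breach into a retry storm.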
Real-World Examples
Stripe API
Stripe uses a token bucket algorithm with:
- 100 requests per second in live mode
- Different limits for different endpoints
- Automatic retry with exponential backoff in their SDKs
- Clear documentation of rate limit headers
GitHub API
GitHub implements multiple tiers of rate limiting:
- 5,000 requests/hour for authenticated users
- 60 requests/hour for unauthenticated requests
- Separate limits for GraphQL (5,000 points/hour)
- Additional limits on specific operations (search: 30 requests/minute)
Twitter API
Twitter uses sliding window rate limiting:
- Different windows for different endpoints (15 minutes, 24 hours)
- Both user-level and app-level limits
- OAuth-based authentication for tracking
- Granular limits per endpoint (e.g., 180 timeline requests per 15 minutes)
Testing Your Rate Limits
Always test your rate limiting implementation:
# Burst test - send 1000 requests as fast as possible
for i in {1..1000}; do
curl -s -o /dev/null -w "%{http_code}\\n" https://api.example.com/endpoint
done | sort | uniq -c
# Sustained load test
wrk -t12 -c400 -d30s --latency https://api.example.com/endpoint
# Verify headers
curl -i https://api.example.com/endpoint | grep -i rate
Look for:
- Correct 429 responses when limit is exceeded
- Accurate rate limit headers
- Proper reset timing
- No boundary condition bugs
Conclusion
Rate limiting is not just about preventing abuse—it's about building resilient, scalable systems that provide predictable performance for all users. By choosing the right algorithm, implementing it correctly across distributed systems, and following best practices, you create an API that's both developer-friendly and operationally sound.
Remember: rate limiting is not a replacement for proper capacity planning, auto-scaling, or architectural optimizations. It's one layer in a defense-in-depth strategy for building production-ready APIs.
Frequently Asked Questions
Common questions about the Rate Limit Calculator
How much of a safety buffer should I leave below a provider's published rate limit?
Start with 10-20% below the published limit. This cushion absorbs clock drift between systems, network jitter, and uneven worker performance while leaving room for retries. Increase the buffer if you operate in multiple regions or cannot centrally coordinate concurrency.
ℹ️ Disclaimer
This tool is provided for informational and educational purposes only. All processing happens entirely in your browser - no data is sent to or stored on our servers. While we strive for accuracy, we make no warranties about the completeness or reliability of results. Use at your own discretion.