Rate Limiting — Algorithms and Implementation Patterns
Without rate limiting, a single misbehaving client can exhaust your API’s capacity and trigger cascading failures across your backend. Rate limiting puts a ceiling on how fast any one entity can make requests. The interesting part isn’t the concept — it’s choosing the right algorithm. Each one has different fairness properties, burst characteristics, and implementation complexity.
Why Rate Limiting Matters
Rate limits protect against:
- Accidental abuse: a client with a bug sending thousands of requests per second
- Intentional scraping: competitors extracting your data en masse
- Cost control: when your API calls paid upstream services (LLMs, SMS providers)
- Cascading failures: one slow service overwhelming a downstream dependency
Algorithm 1: Fixed Window Counter
The simplest approach: count requests in a fixed time window (e.g., per minute) and reset the counter at the boundary.
import redis
import time
r = redis.Redis()
def is_allowed(user_id: str, limit: int = 100) -> bool:
window = int(time.time() // 60) # current minute bucket
key = f"rate:{user_id}:{window}"
count = r.incr(key)
if count == 1:
r.expire(key, 60)
return count <= limit
The boundary problem: a burst attack straddling the window can send 2x the limit. With a 100/min limit, a client can send 100 requests at 12:00:59 and 100 more at 12:01:00 — 200 requests within two seconds.
Algorithm 2: Sliding Window Log
Track the timestamp of every request. Count how many timestamps fall within the last window_size seconds.
import redis
import time
r = redis.Redis()
def is_allowed(user_id: str, limit: int = 100, window_secs: int = 60) -> bool:
now = time.time()
key = f"rate:log:{user_id}"
pipe = r.pipeline()
pipe.zremrangebyscore(key, 0, now - window_secs) # evict old entries
pipe.zadd(key, {str(now): now})
pipe.zcard(key)
pipe.expire(key, window_secs)
_, _, count, _ = pipe.execute()
return count <= limit
Accurate and fair — no boundary burst. The trade-off: storing one timestamp per request is memory-intensive at scale. 1,000 users × 100 requests = 100,000 sorted set entries in Redis.
Algorithm 3: Sliding Window Counter (Hybrid)
A practical middle ground: combine two fixed-window counters with a weighted formula to approximate a true sliding window.
import redis
import time
r = redis.Redis()
def is_allowed(user_id: str, limit: int = 100, window_secs: int = 60) -> bool:
now = time.time()
current_window = int(now // window_secs)
prev_window = current_window - 1
elapsed_fraction = (now % window_secs) / window_secs
prev_key = f"rate:{user_id}:{prev_window}"
curr_key = f"rate:{user_id}:{current_window}"
prev_count = int(r.get(prev_key) or 0)
curr_count = r.incr(curr_key)
if curr_count == 1:
r.expire(curr_key, window_secs * 2)
# The previous window contributes its remaining (unused) fraction
weighted = prev_count * (1 - elapsed_fraction) + curr_count
return weighted <= limit
This approximation is accurate to within a few percent of the true sliding window, at a fraction of the memory cost.
Algorithm 4: Token Bucket
Each user has a bucket that refills with tokens at a fixed rate. Each request consumes one token. If the bucket is empty, the request is rejected.
- Allows bursting up to the bucket capacity
- Long-term average rate is governed by the refill rate
import redis
import time
r = redis.Redis()
def is_allowed(
user_id: str,
capacity: int = 20,
refill_rate: float = 10, # tokens per second
) -> bool:
key = f"rate:bucket:{user_id}"
now = time.time()
data = r.hgetall(key)
tokens = float(data.get(b"tokens", capacity))
last_refill = float(data.get(b"last_refill", now))
elapsed = now - last_refill
tokens = min(capacity, tokens + elapsed * refill_rate)
if tokens < 1:
return False
tokens -= 1
r.hset(key, mapping={"tokens": tokens, "last_refill": now})
r.expire(key, 3600)
return True
Token bucket is the right choice when you want to permit short bursts (a user firing 5 API calls in quick succession) while still enforcing a long-term average rate.
Algorithm 5: Leaky Bucket
Requests enter a queue (the bucket) and are processed at a fixed output rate. If the bucket is full, new requests are dropped. Unlike the token bucket, leaky bucket smooths output — it’s often used for traffic shaping rather than access control.
In practice, leaky bucket requires a FIFO queue with a rate-limited consumer, which is more complex to implement in stateless request handlers. It’s more common in network equipment and proxy layers than in application-level API rate limiting.
Choosing the Right Algorithm
| Algorithm | Burst handling | Memory | Accuracy | Best for |
|---|---|---|---|---|
| Fixed window | Double burst at boundary | Low | Low | Simple, non-critical limits |
| Sliding window log | No burst | High | Exact | Low-volume, strict limits |
| Sliding window counter | Small burst only | Low | ~95% | General-purpose API limits |
| Token bucket | Configurable burst | Low | Exact | User-facing APIs with burst allowance |
| Leaky bucket | No burst (smooth output) | Medium | Exact | Traffic shaping, queuing |
Rate Limiting with Nginx
For IP-based limiting without application code:
http {
# Shared memory zone: 10MB stores ~160,000 IPs
limit_req_zone $binary_remote_addr zone=api_limit:10m rate=10r/s;
server {
location /api/ {
limit_req zone=api_limit burst=20 nodelay;
limit_req_status 429;
proxy_pass http://backend;
}
}
}
rate=10r/s: token bucket refill rate (10 tokens per second)burst=20: bucket capacity — allows up to 20 queued requests before rejectingnodelay: burst requests are served immediately (not queued), up toburst; excess requests get 429
Returning the Right Response
When a request is rate limited, return 429 Too Many Requests with retry guidance:
HTTP/1.1 429 Too Many Requests
Content-Type: application/json
X-RateLimit-Limit: 100
X-RateLimit-Remaining: 0
X-RateLimit-Reset: 1748134800
Retry-After: 47
{
"error": {
"code": "RATE_LIMIT_EXCEEDED",
"message": "Too many requests. Try again in 47 seconds."
}
}
Always include Retry-After. Without it, well-intentioned clients with retry logic will immediately retry and make the problem worse.
Choosing Your Rate Limit Key
What you’re limiting on matters as much as the algorithm:
- IP address: protects against anonymous abuse but over-limits shared IPs (office NATs, Cloudflare exit nodes)
- API key / user ID: accurate per-client limiting; requires authentication upstream
- Endpoint + user: different limits for expensive endpoints (
/search,/export) vs cheap ones (/ping) - Tenant / organization: for multi-tenant SaaS, limit at the account level to prevent one customer from starving others
Combine keys for layered protection: a global IP limit to stop floods, a per-user limit for fair use, and a per-endpoint limit for your most expensive operations.
Conclusion
Rate limiting is most effective when you choose the algorithm that matches your traffic pattern. Token bucket is the most versatile for user-facing APIs — it permits legitimate bursts while enforcing a long-term ceiling. Back it with Redis for distributed enforcement across multiple app servers, and layer Nginx upstream for infrastructure-level protection. Always include Retry-After in your 429 responses so well-behaved clients know when to back off, and design your limit keys to target the right entity at each layer.