1. Why Rate Limiting Matters
Without rate limiting, a single misbehaving client can exhaust your server resources, degrade service for legitimate users, or run up your cloud bill. Rate limiting serves several purposes:
- DoS / abuse prevention: Limit automated scraping, credential stuffing, or intentional flooding.
- Fair usage: Prevent one tenant from starving others in a multi-tenant system.
- Cost control: API calls to downstream services (LLMs, payment processors) cost money — cap usage per customer tier.
- SLA enforcement: Protect backend services from receiving more load than they can handle.
2. Requirements
Functional Requirements
- Allow at most N requests per user per time window (e.g. 1000 req/min)
- Support multiple granularities: per user, per IP, per API key, per endpoint
- Return 429 Too Many Requests when limit exceeded
- Return rate limit headers on every response
- Support tiered limits (free: 100/min, pro: 1000/min, enterprise: 10000/min)
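As a rough illustration of the tiered-limit requirement, the per-tier quotas above can live in a small lookup table. The names `TIER_LIMITS_PER_MINUTE` and `limit_for_tier` are hypothetical, and falling back to the free tier for unknown tiers is an assumption rather than something the requirements specify.

```python
# Per-minute quotas taken from the requirement above.
TIER_LIMITS_PER_MINUTE = {"free": 100, "pro": 1000, "enterprise": 10000}


def limit_for_tier(tier, default_tier="free"):
    """Return the per-minute limit for a tier.

    Assumption: unknown tiers fall back to the most restrictive (free) limit.
    """
    return TIER_LIMITS_PER_MINUTE.get(tier, TIER_LIMITS_PER_MINUTE[default_tier])
```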
Non-Functional Requirements
- Add <2ms overhead to each request
- Work correctly across multiple API server instances (distributed)
- Highly available: a rate limiter outage should not take the API down with it; behavior on failure (fail open or fail closed) is configurable
- Handle 100,000 requests/second across the cluster
3. Algorithm Comparison
Five algorithms are commonly discussed. Understanding their trade-offs is the most important part of this design.
| Algorithm | Accuracy | Memory | Burst Handling | Complexity | Used By |
|---|---|---|---|---|---|
| Fixed Window Counter | Medium — edge bursts | Very low (1 int) | 2× burst at window edge | Trivial | Simple APIs |
| Sliding Window Log | Exact | High (1 ts per req) | No burst allowed | Medium | Accurate audit systems |
| Sliding Window Counter | Good (approximation) | Low (2 ints) | Smooth approximation | Low | Cloudflare |
| Token Bucket | Good | Low (tokens + ts) | Configurable burst cap | Medium | AWS, Stripe, most APIs |
| Leaky Bucket | Exact output rate | Low (queue) | No burst (queue-based) | Medium | Nginx (limit_req), traffic shaping, QoS |
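To make the "2× burst at window edge" entry concrete, here is a minimal in-memory sketch of a fixed window counter. The class name `FixedWindowCounter` is hypothetical, timestamps are passed in explicitly so the behavior is easy to reproduce, and old windows are never evicted (a real store would rely on TTLs).

```python
from collections import defaultdict


class FixedWindowCounter:
    """In-memory fixed window counter (illustrative only, not distributed)."""

    def __init__(self, limit, window_seconds):
        self.limit = limit
        self.window_seconds = window_seconds
        self.counts = defaultdict(int)  # (client, window index) -> request count

    def allow(self, client, now):
        # All requests whose timestamp falls in the same window share a counter.
        window = int(now // self.window_seconds)
        key = (client, window)
        if self.counts[key] >= self.limit:
            return False
        self.counts[key] += 1
        return True


limiter = FixedWindowCounter(limit=100, window_seconds=60)
# 100 requests at the very end of window 0, then 100 at the start of window 1.
late = sum(limiter.allow("alice", 59.5) for _ in range(100))
early = sum(limiter.allow("alice", 60.5) for _ in range(100))
burst_total = late + early  # all 200 are accepted within about one second
```

Both batches are accepted because each lands in a different window, which is the edge burst the other algorithms in the table are designed to reduce.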
Sliding Window Counter (Best Balance)
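Approximates the sliding window without storing individual timestamps. It keeps two counters, current_window and prev_window, and weights the previous window by the fraction of it that still overlaps the sliding window. The effective count is:
effective_count = current_window + prev_window × (1 − elapsed / window_size)
where elapsed is the time since the current window began.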
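A minimal Python sketch of this weighted estimate, assuming the previous window's count is scaled by the share of that window still covered by the sliding window. The function names and the `elapsed` parameter (seconds since the current window began) are illustrative choices, not part of any particular library.

```python
def sliding_window_estimate(prev_count, curr_count, elapsed, window_seconds):
    """Estimate requests in the trailing window from two fixed-window counters."""
    # Fraction of the previous window that still lies inside the sliding window.
    overlap = 1.0 - (elapsed / window_seconds)
    return prev_count * overlap + curr_count


def allow(prev_count, curr_count, elapsed, window_seconds, limit):
    """Allow the request only while the estimate is below the limit."""
    return sliding_window_estimate(prev_count, curr_count, elapsed, window_seconds) < limit


# 15 s into a 60 s window: 75% of the previous window still counts.
estimate = sliding_window_estimate(prev_count=80, curr_count=30, elapsed=15, window_seconds=60)
```

With these numbers the estimate is 80 × 0.75 + 30 = 90, so a limit of 100 would still admit the request.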
4. Redis-Based Implementation
Redis is the standard backing store for distributed rate limiting. The key operations must be atomic — use a Lua script to combine the check and increment in a single Redis round-trip.
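One possible shape for such a script is sketched below for a fixed window counter: INCR the key, set a TTL on first use, and compare against the limit. The Lua body and the wrapper name `is_allowed` are illustrative assumptions. The `client` argument is assumed to behave like a redis-py client, whose `eval(script, numkeys, *keys_and_args)` method runs a script server-side.

```python
# Lua runs atomically inside Redis, so the increment and the limit check
# cannot interleave with another client's request.
FIXED_WINDOW_LUA = """
local current = redis.call('INCR', KEYS[1])
if current == 1 then
  redis.call('EXPIRE', KEYS[1], ARGV[2])
end
if current > tonumber(ARGV[1]) then
  return 0
end
return 1
"""


def is_allowed(client, key, limit, window_seconds):
    """Return True if the request identified by `key` is within its limit.

    Assumption: `client` exposes eval(script, numkeys, *keys_and_args),
    as redis-py clients do.
    """
    result = client.eval(FIXED_WINDOW_LUA, 1, key, limit, window_seconds)
    return result == 1
```

In a real deployment the script would typically be loaded once and invoked by its hash rather than sent with every call.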
5. Architecture — Distributed Rate Limiter
API Request
     │
     ▼
┌─────────────────────────────────────────────────────┐
│  API Gateway / Middleware                           │
│                                                     │
│  1. Extract identifier (user_id / API key / IP)     │
│  2. Look up tier limit from config cache            │
│  3. Call Redis rate limit check (Lua script, <1ms)  │
│  4a. Allowed → add headers, forward to backend      │
│  4b. Rejected → return 429 with Retry-After         │
└─────────────────────────────────────────────────────┘
       │                      │
       ▼                      ▼
┌─────────────┐       ┌───────────────┐
│    Redis    │       │ Config Store  │
│   Cluster   │       │ (tier limits) │
│  (counters) │       │  Redis / DB   │
└─────────────┘       └───────────────┘
Rate Limit Key Naming:
Per user: rl:user:{user_id}:{window}
Per IP: rl:ip:{ip_addr}:{window}
Per endpoint: rl:ep:{user_id}:{endpoint}:{window}
Composite: rl:{user_id}:{endpoint}:{window}
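The key patterns above can be generated with a few small helpers. This sketch assumes `{window}` is the index of the current fixed window (Unix time divided by the window length, rounded down), which the text does not spell out. All function names are hypothetical.

```python
def window_index(now, window_seconds):
    """Index of the fixed window containing `now` (seconds since the epoch)."""
    return int(now // window_seconds)


def user_key(user_id, now, window_seconds=60):
    return f"rl:user:{user_id}:{window_index(now, window_seconds)}"


def ip_key(ip_addr, now, window_seconds=60):
    return f"rl:ip:{ip_addr}:{window_index(now, window_seconds)}"


def endpoint_key(user_id, endpoint, now, window_seconds=60):
    return f"rl:ep:{user_id}:{endpoint}:{window_index(now, window_seconds)}"
```

Because the window index is part of the key, each new window starts from a fresh counter and old keys can simply expire.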
6. Rate Limit Granularities
A production rate limiter enforces limits at multiple levels simultaneously. A request passes only if ALL applicable limits pass:
| Granularity | Key | Purpose | Example Limit |
|---|---|---|---|
| Global (service) | rl:global:{window} | Total service capacity cap | 1M req/min |
| Per IP | rl:ip:{ip}:{window} | Block anonymous abuse / DDoS | 100 req/min |
| Per API Key | rl:key:{key}:{window} | Tier enforcement | Free: 60, Pro: 1000 |
| Per User | rl:user:{id}:{window} | Authenticated user limit | 500 req/min |
| Per Endpoint | rl:ep:{id}:{ep}:{win} | Expensive endpoint protection | 10 req/min for /export |
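The "all limits must pass" rule can be sketched as follows, using a plain dictionary as a stand-in for the shared counter store. The function name `check_all` is hypothetical, and the choice to increment counters only when every limit passes is an assumption. The sketch is not atomic; in Redis the same logic would belong in a single Lua script.

```python
def check_all(counters, limits):
    """Allow a request only if every applicable limit has room.

    counters: dict mapping key -> current count (stand-in for the shared store).
    limits:   list of (key, max_requests) pairs that apply to this request.
    """
    # Reject if any single level is already at its limit.
    if any(counters.get(key, 0) >= max_requests for key, max_requests in limits):
        return False
    # Assumption: rejected requests are not counted against any level.
    for key, _ in limits:
        counters[key] = counters.get(key, 0) + 1
    return True


counters = {}
limits = [("rl:ip:203.0.113.7:0", 100), ("rl:ep:42:/export:0", 10)]
results = [check_all(counters, limits) for _ in range(11)]
```

Here the eleventh request is rejected by the per-endpoint limit even though the per-IP limit still has headroom.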
7. Response Headers Standard
Always include rate limit headers so clients can implement smart backoff and show users meaningful errors.
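A small sketch of how a middleware might assemble these headers, using the header names listed in the FAQ at the end of this guide (X-RateLimit-Limit, X-RateLimit-Remaining, X-RateLimit-Reset, Retry-After). The function name and its parameters are hypothetical, and rounding Retry-After up to whole seconds is an assumption.

```python
import math


def rate_limit_headers(limit, remaining, reset_epoch, now, rejected=False):
    """Build rate limit response headers as a dict of strings."""
    headers = {
        "X-RateLimit-Limit": str(limit),
        "X-RateLimit-Remaining": str(max(0, remaining)),
        "X-RateLimit-Reset": str(int(reset_epoch)),  # Unix timestamp of window reset
    }
    if rejected:
        # Sent with 429 responses: whole seconds until the window resets.
        headers["Retry-After"] = str(max(0, math.ceil(reset_epoch - now)))
    return headers


ok_headers = rate_limit_headers(100, 37, 1700000060, 1700000012.5)
rejected_headers = rate_limit_headers(100, 0, 1700000060, 1700000012.5, rejected=True)
```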
Fail Open vs Fail Closed
If Redis is unreachable, should you allow (fail open) or reject (fail closed) requests? For most APIs: fail open — better to allow extra requests than to bring down your service when Redis has a blip. For high-security endpoints (payments, auth): fail closed or use a local in-memory fallback counter. Make this a configurable policy per endpoint.
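A configurable per-endpoint policy could look roughly like this. The table `FAIL_OPEN_BY_ENDPOINT`, the endpoint paths, and the function name are all hypothetical. The built-in `ConnectionError` and `TimeoutError` stand in for whatever exceptions the Redis client in use actually raises, which should be passed via `store_errors`.

```python
# Hypothetical policy table: True = fail open, False = fail closed.
FAIL_OPEN_BY_ENDPOINT = {"/payments": False, "/login": False}


def check_with_fallback(check, endpoint, store_errors=(ConnectionError, TimeoutError)):
    """Run the rate limit check; apply the endpoint's policy if the store is down.

    check: zero-argument callable returning True (allow) or False (reject).
    """
    try:
        return check()
    except store_errors:
        # Assumption: endpoints without an explicit policy fail open.
        return FAIL_OPEN_BY_ENDPOINT.get(endpoint, True)
```

A normal rejection from the limiter is passed through unchanged; only a store failure triggers the policy lookup.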
8. Token Bucket Deep Dive
Token bucket is the most common algorithm in practice. Here is the Redis implementation:
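The refill-and-consume logic can be sketched in-process as below; this is a single-node Python illustration, not a Redis script. It keeps the same two pieces of state the comparison table mentions (token count and last-refill timestamp), which a Redis version would store per key and update atomically in Lua. The class name and parameters are hypothetical, and the bucket is assumed to start full.

```python
class TokenBucket:
    """Single-process token bucket: `rate` tokens/second, at most `capacity` tokens."""

    def __init__(self, rate, capacity, now=0.0):
        self.rate = rate
        self.capacity = capacity
        self.tokens = float(capacity)  # assumption: the bucket starts full
        self.last_refill = now

    def allow(self, now):
        # Refill lazily based on time elapsed since the last call.
        elapsed = max(0.0, now - self.last_refill)
        self.tokens = min(self.capacity, self.tokens + elapsed * self.rate)
        self.last_refill = now
        # Each request costs one token.
        if self.tokens >= 1.0:
            self.tokens -= 1.0
            return True
        return False


bucket = TokenBucket(rate=1, capacity=5)
burst = [bucket.allow(0.0) for _ in range(6)]  # five pass, the sixth is rejected
later = [bucket.allow(2.0) for _ in range(3)]  # two tokens refilled after 2 s
```

The burst is capped by `capacity`, while the long-run average is capped by `rate`.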
Frequently Asked Questions — Rate Limiter Design
Fixed window counter: count requests per fixed window (for example, one minute) and reset the count at the window boundary. Increment a counter for each request and reject once it exceeds the limit. This is fast (a single O(1) Redis INCR) but allows bursts at window edges: a user could make 100 requests in the last second of window 1 and 100 in the first second of window 2, which is 200 requests in 2 seconds against a stated limit of 100/minute.
Token bucket: tokens are added to a bucket at a fixed rate of R tokens per second, up to a maximum burst capacity B. Each request consumes one token. If the bucket is empty, the request is rejected. This allows bursts of up to B requests, then smoothly enforces rate R. Used by AWS API Gateway, Stripe, and most cloud API gateways. Best when you want to allow occasional bursts while enforcing an average rate.
Sliding window log: store the timestamp of each request in a sorted set. To check a new request, count the timestamps within the last 60 seconds. This is the most accurate approach, with no edge-burst problem. The downside is that memory usage is proportional to request count, since each request stores a timestamp. At 1000 requests/minute for a single user this is manageable; across millions of users, each with their own sorted set, the total memory footprint becomes significant.
Distributed rate limiting: use Redis as a shared counter store, so that all instances of your service check the same Redis key. A single Redis Lua script atomically increments the counter and returns whether the limit is exceeded. Lua scripts run atomically in Redis, which prevents race conditions between the check and increment steps. Redis Cluster shards keys across nodes for horizontal scaling.
Choice of data store: Redis is the standard choice, offering sub-millisecond latency, TTLs for automatic window expiry, atomic operations via Lua scripts, and Redis Cluster for horizontal scaling. Memcached is an alternative for simple counters. Avoid the application database for rate limiting, because DB round-trips add too much latency to every request.
Response headers: return the standard headers X-RateLimit-Limit (max requests per window), X-RateLimit-Remaining (requests left in the current window), X-RateLimit-Reset (Unix timestamp when the window resets), and Retry-After (seconds to wait, returned with 429 responses). These allow clients to implement backoff and display sensible error messages to end users.