How to Design a Rate Limiter — System Design Interview [2026]

Q: What is the simplest rate limiting algorithm?

Fixed window counter — count requests per minute window, reset at window boundary. Increment a counter for each request; reject when it exceeds the limit. Fast (O(1) Redis INCR) but allows bursts at window edges: a user could make 100 requests in the last second of window 1 and 100 in the first second of window 2 — 200 requests in 2 seconds against a stated 100/minute limit.
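The edge-burst problem is easy to reproduce with a minimal in-memory sketch. The class name `FixedWindowLimiter` and the explicit `now` argument are illustrative assumptions, not part of any library; a production version would keep the counter in Redis as described later in this guide.

```python
from collections import defaultdict


class FixedWindowLimiter:
    """In-memory fixed window counter (single process, illustration only)."""

    def __init__(self, limit: int, window_seconds: int):
        self.limit = limit
        self.window_seconds = window_seconds
        # (client_id, window_start) -> request count; old windows are never
        # evicted here, which a real store would handle with a TTL.
        self.counts = defaultdict(int)

    def allow(self, client_id: str, now: float) -> bool:
        window_start = int(now // self.window_seconds) * self.window_seconds
        key = (client_id, window_start)
        if self.counts[key] >= self.limit:
            return False
        self.counts[key] += 1
        return True


limiter = FixedWindowLimiter(limit=100, window_seconds=60)
# 100 requests at the very end of window 1 (t = 59.0 .. 59.099) ...
late = sum(limiter.allow("u1", 59.0 + i / 1000) for i in range(100))
# ... and 100 more at the very start of window 2 (t = 60.0 .. 60.099).
early = sum(limiter.allow("u1", 60.0 + i / 1000) for i in range(100))
print(late + early)  # 200 requests accepted within about one second
```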

Q: What is the token bucket algorithm?

Tokens are added to a bucket at a fixed rate R tokens per second, up to a maximum burst capacity B. Each request consumes one token. If the bucket is empty, the request is rejected. Allows bursts up to B requests, then smoothly enforces rate R. Used by AWS API Gateway, Stripe, and most cloud API gateways. Best when you want to allow occasional bursts while enforcing an average rate.
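As a rough single-process sketch of the refill logic, the class below refills lazily on each call instead of running a background timer. The name `TokenBucket` and the caller-supplied `now` are assumptions made for illustration.

```python
class TokenBucket:
    """In-memory token bucket with lazy refill (illustration only)."""

    def __init__(self, rate: float, capacity: float, now: float = 0.0):
        self.rate = rate          # R: tokens added per second
        self.capacity = capacity  # B: maximum burst size
        self.tokens = capacity    # the bucket starts full
        self.last_refill = now

    def allow(self, now: float, cost: float = 1.0) -> bool:
        # Add the tokens earned since the last call, capped at capacity.
        if now > self.last_refill:
            elapsed = now - self.last_refill
            self.tokens = min(self.capacity, self.tokens + elapsed * self.rate)
            self.last_refill = now
        if self.tokens >= cost:
            self.tokens -= cost
            return True
        return False


bucket = TokenBucket(rate=2.0, capacity=5.0)
burst = [bucket.allow(now=0.0) for _ in range(6)]  # burst of 5 passes, 6th fails
later = bucket.allow(now=0.5)  # 0.5 s at 2 tokens/s refills exactly one token
```

With `rate=2.0` and `capacity=5.0`, a client can spend five tokens at once and is then held to two requests per second on average.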

Q: What is sliding window log rate limiting?

Store the timestamp of each request in a per-user sorted set. To check a new request, drop timestamps older than the window (e.g. 60 seconds) and count what remains. This is the most accurate approach, with no edge-burst problem. Downside: memory usage is proportional to request count, because every request stores a timestamp. At 1000 requests/minute that is up to 1000 entries per active user, which is manageable for a modest user base but adds up quickly across millions of users.
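A minimal in-memory sketch of the same idea, using a `deque` per client as a stand-in for the Redis sorted set. The class name `SlidingWindowLog` is hypothetical, and the choice not to log rejected requests is an assumption; some implementations record them as well.

```python
from collections import defaultdict, deque


class SlidingWindowLog:
    """In-memory sliding window log (illustration only)."""

    def __init__(self, limit: int, window_seconds: float):
        self.limit = limit
        self.window_seconds = window_seconds
        self.logs = defaultdict(deque)  # client_id -> timestamps, oldest first

    def allow(self, client_id: str, now: float) -> bool:
        log = self.logs[client_id]
        # Drop timestamps that have fallen out of the trailing window.
        while log and log[0] <= now - self.window_seconds:
            log.popleft()
        if len(log) >= self.limit:
            return False
        log.append(now)  # only accepted requests are recorded
        return True


limiter = SlidingWindowLog(limit=3, window_seconds=60)
# Requests at t=0, 10, 59 fill the window; t=60 fits because t=0 has expired;
# t=61 is rejected; t=70 fits again once t=10 has expired.
results = [limiter.allow("u1", t) for t in (0, 10, 59, 60, 61, 70)]
```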

Q: How do you implement distributed rate limiting?

Use Redis as a shared counter store so that all instances of your service check the same key. A single Lua script atomically increments the counter and returns whether the limit is exceeded. Redis runs each script atomically, which prevents race conditions between the check and increment steps. Redis Cluster provides horizontal scaling by sharding keys across nodes.

Q: Where do you store rate limit state?

Redis is the standard choice: sub-millisecond latency, supports TTL for automatic window expiry, atomic operations via Lua scripts, and Redis Cluster for horizontal scaling. Memcached is an alternative for simple counters. Avoid the application database for rate limiting — DB round-trips add too much latency to every request.

Q: How do you handle rate limit response headers?

Return the conventional X-RateLimit-* headers: X-RateLimit-Limit (max requests per window), X-RateLimit-Remaining (requests left in the current window), and X-RateLimit-Reset (Unix timestamp when the window resets). With 429 responses, also return the standard HTTP Retry-After header (seconds to wait). These allow clients to implement backoff and display sensible error messages to end users.

1. Why Rate Limiting Matters

Without rate limiting, a single misbehaving client can exhaust your server resources, degrade service for legitimate users, or run up your cloud bill. Rate limiting serves several purposes:

DoS / abuse prevention: Limit automated scraping, credential stuffing, or intentional flooding.
Fair usage: Prevent one tenant from starving others in a multi-tenant system.
Cost control: API calls to downstream services (LLMs, payment processors) cost money — cap usage per customer tier.
SLA enforcement: Protect backend services from receiving more load than they can handle.

2. Requirements

Functional Requirements

Allow at most N requests per user per time window (e.g. 1000 req/min)
Support multiple granularities: per user, per IP, per API key, per endpoint
Return 429 Too Many Requests when limit exceeded
Return rate limit headers on every response
Support tiered limits (free: 100/min, pro: 1000/min, enterprise: 10000/min)

Non-Functional Requirements

Add <2ms overhead to each request
Work correctly across multiple API server instances (distributed)
Highly available: a rate limiter outage must not take the API down with it. Behavior on failure (fail open or fail closed) should be configurable.
Handle 100,000 requests/second across the cluster

3. Algorithm Comparison

Five algorithms are commonly discussed. Understanding their trade-offs is the most important part of this design.

Algorithm	Accuracy	Memory	Burst Handling	Complexity	Used By
Fixed Window Counter	Medium — edge bursts	Very low (1 int)	2× burst at window edge	Trivial	Simple APIs
Sliding Window Log	Exact	High (1 ts per req)	No burst allowed	Medium	Accurate audit systems
Sliding Window Counter	Good (approximation)	Low (2 ints)	Smooth approximation	Low	Cloudflare
Token Bucket	Good	Low (tokens + ts)	Configurable burst cap	Medium	AWS, Stripe, most APIs
Leaky Bucket	Exact output rate	Low (queue)	No burst (queue-based)	Medium	Nginx (limit_req), traffic shaping, QoS

Sliding Window Counter (Best Balance)

Approximates the sliding window without storing individual timestamps. Uses two counters: current_window and prev_window. The effective count is:

Formula: sliding window approximation

    # Position in current window (0.0 to 1.0)
    position = (current_time % window_size) / window_size

    # Weighted estimate of requests in the past window_size seconds
    effective_count = current_window + prev_window * (1 - position)

    # Example: window = 60s, current position = 70% through window
    # prev_window = 80 requests, current_window = 40 requests
    # effective_count = 40 + 80 * (1 - 0.70) = 40 + 24 = 64 requests
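The formula can be wrapped in a small stateful limiter. The sketch below is an in-memory illustration that assumes non-decreasing timestamps; the class name `SlidingWindowCounter` and its methods are hypothetical, and the example replays the numbers from the worked example above.

```python
class SlidingWindowCounter:
    """In-memory sliding window counter (illustration only)."""

    def __init__(self, limit: int, window_size: int):
        self.limit = limit
        self.window_size = window_size
        self.window_index = 0    # index of the current fixed window
        self.current_window = 0  # requests counted in the current window
        self.prev_window = 0     # requests counted in the previous window

    def _roll(self, now: float) -> None:
        index = int(now // self.window_size)
        if index == self.window_index:
            return
        if index == self.window_index + 1:
            # Moved to the next window: current becomes previous.
            self.prev_window, self.current_window = self.current_window, 0
        else:
            # Skipped at least one whole window: nothing recent remains.
            self.prev_window = self.current_window = 0
        self.window_index = index

    def estimate(self, now: float) -> float:
        self._roll(now)
        position = (now % self.window_size) / self.window_size
        return self.current_window + self.prev_window * (1 - position)

    def allow(self, now: float) -> bool:
        if self.estimate(now) >= self.limit:
            return False
        self.current_window += 1
        return True


limiter = SlidingWindowCounter(limit=100, window_size=60)
first = sum(limiter.allow(30.0) for _ in range(80))    # 80 requests in window 0
second = sum(limiter.allow(100.0) for _ in range(40))  # 40 requests in window 1
estimate = limiter.estimate(102.0)  # 70% through window 1: about 40 + 24 = 64
```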

4. Redis-Based Implementation

Redis is the standard backing store for distributed rate limiting. The key operations must be atomic — use a Lua script to combine the check and increment in a single Redis round-trip.

Lua script: fixed window counter (runs atomically in Redis)

    -- KEYS[1] = rate limit key (e.g. "rl:user:123:1748822400")
    -- ARGV[1] = max requests (limit)
    -- ARGV[2] = window TTL in seconds
    local current = redis.call("INCR", KEYS[1])
    if current == 1 then
        redis.call("EXPIRE", KEYS[1], ARGV[2])
    end
    if current > tonumber(ARGV[1]) then
        return 0 -- rate limited
    end
    return 1 -- allowed

    -- Key format: "rl:{identifier}:{window_start_timestamp}"
    -- Window start = floor(current_unix_time / window_seconds) * window_seconds
    -- This creates a new key each window and auto-expires the old one

Python: calling the rate limiter

    import redis
    import time

    r = redis.Redis(host='localhost', port=6379)

    RATE_LIMIT_SCRIPT = """
    local current = redis.call("INCR", KEYS[1])
    if current == 1 then
        redis.call("EXPIRE", KEYS[1], ARGV[2])
    end
    if current > tonumber(ARGV[1]) then
        return {0, current}
    end
    return {1, current}
    """

    script = r.register_script(RATE_LIMIT_SCRIPT)

    def check_rate_limit(user_id: str, limit: int = 100, window_seconds: int = 60):
        window_start = int(time.time() // window_seconds) * window_seconds
        key = f"rl:{user_id}:{window_start}"
        allowed, count = script(keys=[key], args=[limit, window_seconds])
        remaining = max(0, limit - count)
        reset_at = window_start + window_seconds
        return bool(allowed), remaining, reset_at

5. Architecture — Distributed Rate Limiter

  API Request
      │
      ▼
┌─────────────────────────────────────────────────────┐
│              API Gateway / Middleware                │
│                                                     │
│  1. Extract identifier (user_id / API key / IP)     │
│  2. Look up tier limit from config cache            │
│  3. Call Redis rate limit check (Lua script, <1ms)  │
│  4a. Allowed → add headers, forward to backend      │
│  4b. Rejected → return 429 with Retry-After         │
└─────────────────────────────────────────────────────┘
      │                           │
      ▼                           ▼
┌─────────────┐           ┌───────────────┐
│ Redis       │           │  Config Store │
│ Cluster     │           │  (tier limits)│
│ (counters)  │           │  Redis / DB   │
└─────────────┘           └───────────────┘

Rate Limit Key Naming:
  Per user:     rl:user:{user_id}:{window}
  Per IP:       rl:ip:{ip_addr}:{window}
  Per endpoint: rl:ep:{user_id}:{endpoint}:{window}
  Composite:    rl:{user_id}:{endpoint}:{window}
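A small helper can build keys in this scheme. This is a sketch under stated assumptions: the function names `window_start` and `rate_limit_key` are hypothetical, and the window suffix is the window's start timestamp, matching the key format used in the Lua script above.

```python
def window_start(now: float, window_seconds: int) -> int:
    """Start of the fixed window containing `now`, as a Unix timestamp."""
    return int(now // window_seconds) * window_seconds


def rate_limit_key(scope: str, *parts: str, now: float,
                   window_seconds: int = 60) -> str:
    """Build e.g. rl:user:{user_id}:{window} or rl:ep:{user_id}:{endpoint}:{window}."""
    suffix = str(window_start(now, window_seconds))
    return ":".join(["rl", scope, *parts, suffix])


user_key = rate_limit_key("user", "123", now=1748822417.3)
endpoint_key = rate_limit_key("ep", "123", "/export", now=125.0)
```

Identifiers that themselves contain colons, such as IPv6 addresses, still produce usable keys, but the parts can no longer be split back out unambiguously.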

6. Rate Limit Granularities

A production rate limiter enforces limits at multiple levels simultaneously. A request passes only if ALL applicable limits pass:

Granularity	Key	Purpose	Example Limit
Global (service)	rl:global:{window}	Total service capacity cap	1M req/min
Per IP	rl:ip:{ip}:{window}	Block anonymous abuse / DDoS	100 req/min
Per API Key	rl:key:{key}:{window}	Tier enforcement	Free: 100, Pro: 1000 req/min
Per User	rl:user:{id}:{window}	Authenticated user limit	500 req/min
Per Endpoint	rl:ep:{id}:{ep}:{win}	Expensive endpoint protection	10 req/min for /export
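The "all limits must pass" rule can be sketched with a plain dictionary standing in for the counter store. The function `check_all` and the `Rule` alias are hypothetical. The sketch checks every rule before charging any counter, so a request rejected at one level does not consume quota at the others.

```python
from typing import Dict, List, Optional, Tuple

Rule = Tuple[str, int]  # (counter key, max requests per window)


def check_all(counters: Dict[str, int],
              rules: List[Rule]) -> Tuple[bool, Optional[str]]:
    """Return (allowed, violated_key); count the request only if all rules pass."""
    # Phase 1: reject if any applicable counter is already at its limit.
    for key, limit in rules:
        if counters.get(key, 0) >= limit:
            return False, key
    # Phase 2: the request passed every rule, so charge it to each counter.
    for key, _ in rules:
        counters[key] = counters.get(key, 0) + 1
    return True, None


window = 1748822400
rules = [
    (f"rl:ip:203.0.113.7:{window}", 100),
    (f"rl:user:42:{window}", 500),
    (f"rl:ep:42:/export:{window}", 10),
]
counters: Dict[str, int] = {}
results = [check_all(counters, rules) for _ in range(11)]
# The 11th request is rejected by the /export rule and is not counted anywhere.
```

In Redis, both phases would need to run inside one Lua script to stay atomic. On Redis Cluster, all keys touched by a single script must hash to the same slot, which constrains the key layout.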

7. Response Headers Standard

Always include rate limit headers so clients can implement smart backoff and show users meaningful errors.

HTTP Response Headers

    HTTP/1.1 200 OK
    X-RateLimit-Limit: 1000
    X-RateLimit-Remaining: 847
    X-RateLimit-Reset: 1748823060
    X-RateLimit-Policy: 1000;w=60

    # When rate limited:
    HTTP/1.1 429 Too Many Requests
    X-RateLimit-Limit: 1000
    X-RateLimit-Remaining: 0
    X-RateLimit-Reset: 1748823060
    Retry-After: 42
    Content-Type: application/json

    {"error": "rate_limit_exceeded", "message": "Too many requests. Retry after 42 seconds."}
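One way to derive these headers from a limiter result is sketched below. The function name `rate_limit_headers` and its parameters are assumptions for illustration; `remaining` and `reset_at` are expected to come from a check like the one in section 4.

```python
import math
from typing import Dict


def rate_limit_headers(limit: int, remaining: int, reset_at: int,
                       now: float, allowed: bool) -> Dict[str, str]:
    """Build rate limit headers; Retry-After is added only for rejections."""
    headers = {
        "X-RateLimit-Limit": str(limit),
        "X-RateLimit-Remaining": str(max(0, remaining)),
        "X-RateLimit-Reset": str(reset_at),
    }
    if not allowed:
        # Whole seconds until the window resets, never negative.
        headers["Retry-After"] = str(max(0, math.ceil(reset_at - now)))
    return headers


ok = rate_limit_headers(1000, 847, 1748823060, now=1748823018.0, allowed=True)
limited = rate_limit_headers(1000, 0, 1748823060, now=1748823018.0, allowed=False)
```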

Fail Open vs Fail Closed

If Redis is unreachable, should you allow (fail open) or reject (fail closed) requests? For most APIs: fail open — better to allow extra requests than to bring down your service when Redis has a blip. For high-security endpoints (payments, auth): fail closed or use a local in-memory fallback counter. Make this a configurable policy per endpoint.
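A per-endpoint failure policy might look like the sketch below. The table `FAIL_OPEN_BY_ENDPOINT` and the function `check_with_policy` are hypothetical names. The default `errors` tuple uses Python's built-in exceptions; with redis-py you would likely pass `redis.exceptions.ConnectionError` and `redis.exceptions.TimeoutError` instead.

```python
from typing import Callable, Dict, Tuple, Type

# Hypothetical policy table: True = fail open, False = fail closed.
FAIL_OPEN_BY_ENDPOINT: Dict[str, bool] = {"/payments": False, "/login": False}


def check_with_policy(
    endpoint: str,
    check: Callable[[], bool],
    errors: Tuple[Type[BaseException], ...] = (ConnectionError, TimeoutError),
) -> bool:
    """Run the limiter check; if the store is unreachable, apply the policy."""
    try:
        return check()
    except errors:
        # Endpoints without an explicit entry fail open.
        return FAIL_OPEN_BY_ENDPOINT.get(endpoint, True)


def store_down() -> bool:
    raise ConnectionError("rate limit store unreachable")


search_allowed = check_with_policy("/search", store_down)      # fails open
payment_allowed = check_with_policy("/payments", store_down)   # fails closed
```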

8. Token Bucket Deep Dive

Token bucket is the most common algorithm in practice. Here is one way to implement it in Redis:

Lua: Token Bucket in Redis

    -- KEYS[1] = bucket key
    -- ARGV[1] = max tokens (burst capacity)
    -- ARGV[2] = refill rate (tokens per second)
    -- ARGV[3] = current time (Unix time in seconds, fractional part allowed)
    -- ARGV[4] = tokens requested (usually 1)
    local bucket = redis.call("HMGET", KEYS[1], "tokens", "last_refill")
    local max_tokens = tonumber(ARGV[1])
    local refill_rate = tonumber(ARGV[2])
    local now = tonumber(ARGV[3])
    local requested = tonumber(ARGV[4])

    -- A missing key means a new client: start with a full bucket
    local tokens = tonumber(bucket[1]) or max_tokens
    local last_refill = tonumber(bucket[2]) or now

    -- Refill tokens based on elapsed time (clamped in case caller clocks disagree)
    local elapsed = math.max(0, now - last_refill)
    local new_tokens = math.min(max_tokens, tokens + elapsed * refill_rate)

    local allowed = 0
    if new_tokens >= requested then
        new_tokens = new_tokens - requested
        allowed = 1
    end

    -- Persist state on both paths; HSET accepts multiple fields since Redis 4.0
    redis.call("HSET", KEYS[1], "tokens", new_tokens, "last_refill", now)
    -- After this long the bucket would be full again, so the key can expire
    redis.call("EXPIRE", KEYS[1], math.ceil(max_tokens / refill_rate) + 1)
    return {allowed, math.floor(new_tokens)}
