1. Why Rate Limiting Matters

Without rate limiting, a single misbehaving client can exhaust your server resources, degrade service for legitimate users, or run up your cloud bill. Rate limiting serves several purposes:

  • DoS / abuse prevention: Limit automated scraping, credential stuffing, or intentional flooding.
  • Fair usage: Prevent one tenant from starving others in a multi-tenant system.
  • Cost control: API calls to downstream services (LLMs, payment processors) cost money — cap usage per customer tier.
  • SLA enforcement: Protect backend services from receiving more load than they can handle.

2. Requirements

Functional Requirements

  • Allow at most N requests per user per time window (e.g. 1000 req/min)
  • Support multiple granularities: per user, per IP, per API key, per endpoint
  • Return 429 Too Many Requests when the limit is exceeded
  • Return rate limit headers on every response
  • Support tiered limits (free: 100/min, pro: 1000/min, enterprise: 10000/min)
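
The tiered limits above could be held in a small lookup table. This is a minimal sketch: the names TIER_LIMITS and limit_for are hypothetical, and falling back to the free tier for unknown tiers is an assumption rather than a stated requirement.

```python
# Hypothetical tier table; the numbers are the per-minute limits listed above.
TIER_LIMITS = {
    "free": 100,
    "pro": 1_000,
    "enterprise": 10_000,
}


def limit_for(tier: str, default_tier: str = "free") -> int:
    """Return the per-minute limit for a tier (assumed fallback: the default tier)."""
    return TIER_LIMITS.get(tier.lower(), TIER_LIMITS[default_tier])


pro_limit = limit_for("pro")          # 1000
unknown_limit = limit_for("trial")    # falls back to the free tier
```

In a real deployment this table would more likely live in the config store and be cached by the gateway.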

Non-Functional Requirements

  • Add less than 2 ms of overhead to each request
  • Work correctly across multiple API server instances (distributed)
  • Highly available: a rate limiter outage must not take the API down; behavior on failure (fail open or fail closed) is configurable
  • Handle 100,000 requests/second across the cluster

3. Algorithm Comparison

Five algorithms are commonly discussed. Understanding their trade-offs is the most important part of this design.

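Algorithm              | Accuracy             | Memory                     | Burst Handling          | Complexity | Used By
-----------------------+----------------------+----------------------------+-------------------------+------------+----------------------------------------
Fixed Window Counter   | Medium (edge bursts) | Very low (1 int)           | 2× burst at window edge | Trivial    | Simple APIs
Sliding Window Log     | Exact                | High (1 timestamp/request) | No burst allowed        | Medium     | Accurate audit systems
Sliding Window Counter | Good (approximation) | Low (2 ints)               | Smooth approximation    | Low        | Cloudflare
Token Bucket           | Good                 | Low (tokens + timestamp)   | Configurable burst cap  | Medium     | AWS, Stripe, most APIs
Leaky Bucket           | Exact output rate    | Low (queue)                | No burst (queue-based)  | Medium     | Traffic shaping, QoS, Nginx (limit_req)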
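
The "2× burst at window edge" entry for the fixed window counter can be seen by replaying timestamps through a counter. This is an illustrative in-memory sketch; the function name fixed_window_allowed and the synthetic timestamps are assumptions made for the demonstration.

```python
def fixed_window_allowed(timestamps, limit, window_seconds):
    """Replay request timestamps through a fixed window counter.

    Returns the timestamps that were allowed.
    """
    counts = {}   # window_start -> requests seen in that window
    allowed = []
    for ts in timestamps:
        window_start = int(ts // window_seconds) * window_seconds
        counts[window_start] = counts.get(window_start, 0) + 1
        if counts[window_start] <= limit:
            allowed.append(ts)
    return allowed


# 100 requests just before the minute boundary and 100 just after it.
burst = [59.0 + i * 0.009 for i in range(100)] + [60.0 + i * 0.009 for i in range(100)]
allowed = fixed_window_allowed(burst, limit=100, window_seconds=60)
span = max(allowed) - min(allowed)
print(len(allowed), "requests allowed within", round(span, 2), "seconds")
```

With a limit of 100 per minute, all 200 requests pass because they fall into two different windows, even though they arrive less than two seconds apart.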

Sliding Window Counter (Best Balance)

Approximates the sliding window without storing individual timestamps. Uses two counters: current_window and prev_window. The effective count is:

Formula: sliding window approximation

    # Position in current window (0.0 to 1.0)
    position = (current_time % window_size) / window_size

    # Weighted estimate of requests in the past window_size seconds
    effective_count = current_window + prev_window * (1 - position)

    # Example: window = 60s, current position = 70% through window
    # prev_window = 80 requests, current_window = 40 requests
    # effective_count = 40 + 80 * (1 - 0.70) = 40 + 24 = 64 requests
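
The formula can be wrapped in a small single-process limiter to see it in action. This is a sketch only: the class name SlidingWindowCounter and its methods are hypothetical, it tracks one key, and it is not thread-safe. A distributed version would keep the two counters in Redis.

```python
class SlidingWindowCounter:
    """In-memory sliding window counter for a single key (illustrative only)."""

    def __init__(self, limit: int, window_seconds: int):
        self.limit = limit
        self.window = window_seconds
        self.current_start = 0   # start time of the current fixed window
        self.current_count = 0
        self.prev_count = 0

    def _roll(self, now: float) -> None:
        start = int(now // self.window) * self.window
        if start == self.current_start:
            return
        # Exactly one window elapsed: current becomes previous. Otherwise both are stale.
        one_step = (start - self.current_start) == self.window
        self.prev_count = self.current_count if one_step else 0
        self.current_count = 0
        self.current_start = start

    def effective_count(self, now: float) -> float:
        self._roll(now)
        position = (now % self.window) / self.window
        return self.current_count + self.prev_count * (1 - position)

    def allow(self, now: float) -> bool:
        if self.effective_count(now) >= self.limit:
            return False
        self.current_count += 1
        return True


limiter = SlidingWindowCounter(limit=100, window_seconds=60)
for _ in range(80):
    limiter.allow(now=30.0)     # 80 requests in the previous window
for _ in range(40):
    limiter.allow(now=100.0)    # 40 requests in the current window
estimate = limiter.effective_count(now=102.0)   # 70% through the current window
```

The estimate comes out at approximately 64, matching the worked example (up to floating-point rounding).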

4. Redis-Based Implementation

Redis is the standard backing store for distributed rate limiting. The key operations must be atomic — use a Lua script to combine the check and increment in a single Redis round-trip.

Lua script: fixed window counter (runs atomically in Redis)

    -- KEYS[1] = rate limit key (e.g. "rl:user:123:1748822400")
    -- ARGV[1] = max requests (limit)
    -- ARGV[2] = window TTL in seconds
    local current = redis.call("INCR", KEYS[1])
    if current == 1 then
        redis.call("EXPIRE", KEYS[1], ARGV[2])
    end
    if current > tonumber(ARGV[1]) then
        return 0  -- rate limited
    end
    return 1  -- allowed

    -- Key format: "rl:{identifier}:{window_start_timestamp}"
    -- Window start = floor(current_unix_time / window_seconds) * window_seconds
    -- This creates a new key each window and auto-expires the old one

Python: calling the rate limiter

    import redis
    import time

    r = redis.Redis(host='localhost', port=6379)

    RATE_LIMIT_SCRIPT = """
    local current = redis.call("INCR", KEYS[1])
    if current == 1 then
        redis.call("EXPIRE", KEYS[1], ARGV[2])
    end
    if current > tonumber(ARGV[1]) then
        return {0, current}
    end
    return {1, current}
    """

    script = r.register_script(RATE_LIMIT_SCRIPT)

    def check_rate_limit(user_id: str, limit: int = 100, window_seconds: int = 60):
        window_start = int(time.time() // window_seconds) * window_seconds
        key = f"rl:{user_id}:{window_start}"
        allowed, count = script(keys=[key], args=[limit, window_seconds])
        remaining = max(0, limit - count)
        reset_at = window_start + window_seconds
        return bool(allowed), remaining, reset_at

5. Architecture — Distributed Rate Limiter

  API Request
      │
      ▼
┌─────────────────────────────────────────────────────┐
│              API Gateway / Middleware                │
│                                                     │
│  1. Extract identifier (user_id / API key / IP)     │
│  2. Look up tier limit from config cache            │
│  3. Call Redis rate limit check (Lua script, <1ms)  │
│  4a. Allowed → add headers, forward to backend      │
│  4b. Rejected → return 429 with Retry-After         │
└─────────────────────────────────────────────────────┘
      │                           │
      ▼                           ▼
┌─────────────┐           ┌───────────────┐
│ Redis       │           │  Config Store │
│ Cluster     │           │  (tier limits)│
│ (counters)  │           │  Redis / DB   │
└─────────────┘           └───────────────┘

Rate Limit Key Naming:
  Per user:     rl:user:{user_id}:{window}
  Per IP:       rl:ip:{ip_addr}:{window}
  Per endpoint: rl:ep:{user_id}:{endpoint}:{window}
  Composite:    rl:{user_id}:{endpoint}:{window}
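
The naming scheme above could be produced by one helper so that every caller builds keys the same way. This is a minimal sketch; window_start and rate_limit_key are hypothetical names. IPv6 addresses contain ":" themselves, so a real implementation would probably need to escape or hash that part.

```python
import time
from typing import Optional


def window_start(now: float, window_seconds: int) -> int:
    """Start of the fixed window containing `now`, as a Unix timestamp."""
    return int(now // window_seconds) * window_seconds


def rate_limit_key(scope: str, *parts: str,
                   now: Optional[float] = None,
                   window_seconds: int = 60) -> str:
    """Build 'rl:{scope}:{parts...}:{window}' following the naming scheme above."""
    if now is None:
        now = time.time()
    return ":".join(["rl", scope, *parts, str(window_start(now, window_seconds))])


user_key = rate_limit_key("user", "123", now=1748822430)
endpoint_key = rate_limit_key("ep", "123", "/export", now=1748822430)
```

For a timestamp of 1748822430 and a 60-second window, the user key is rl:user:123:1748822400.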

6. Rate Limit Granularities

A production rate limiter enforces limits at multiple levels simultaneously. A request passes only if ALL applicable limits pass:

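Granularity      | Key                                 | Purpose                       | Example Limit
-----------------+-------------------------------------+-------------------------------+--------------------------------------
Global (service) | rl:global:{window}                  | Total service capacity cap    | 1M req/min
Per IP           | rl:ip:{ip_addr}:{window}            | Block anonymous abuse / DDoS  | 100 req/min
Per API Key      | rl:key:{key}:{window}               | Tier enforcement              | Free: 100 req/min, Pro: 1000 req/min
Per User         | rl:user:{user_id}:{window}          | Authenticated user limit      | 500 req/min
Per Endpoint     | rl:ep:{user_id}:{endpoint}:{window} | Expensive endpoint protection | 10 req/min for /export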
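
The "all applicable limits must pass" rule can be sketched as a loop over (key, limit) pairs. The names check_all and fake_check are hypothetical, and the checker is passed in so the sketch runs without Redis. Because the loop stops at the first failure, counters checked earlier will already have been incremented. A stricter design might evaluate every limit inside one Lua script.

```python
from typing import Callable, Iterable, Optional, Tuple

Rule = Tuple[str, int]   # (rate limit key, limit), e.g. ("rl:user:42", 500)


def check_all(rules: Iterable[Rule],
              check: Callable[[str, int], bool]) -> Optional[str]:
    """Return None if every rule passes, else the key of the first rule that failed."""
    for key, limit in rules:
        if not check(key, limit):
            return key
    return None


# Stand-in for the Redis call: current counts per key (window suffix omitted).
counts = {"rl:ip:203.0.113.7": 12, "rl:user:42": 500, "rl:ep:42:/export": 3}


def fake_check(key: str, limit: int) -> bool:
    return counts.get(key, 0) < limit


rules = [("rl:ip:203.0.113.7", 100), ("rl:user:42", 500), ("rl:ep:42:/export", 10)]
blocked_by = check_all(rules, fake_check)
```

Here the per-user limit is already used up, so blocked_by identifies rl:user:42 and the request would be rejected even though the IP and endpoint limits still have room.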

7. Response Headers Standard

Always include rate limit headers so clients can implement smart backoff and show users meaningful errors.

HTTP Response Headers

    HTTP/1.1 200 OK
    X-RateLimit-Limit: 1000
    X-RateLimit-Remaining: 847
    X-RateLimit-Reset: 1748823060
    X-RateLimit-Policy: 1000;w=60

    # When rate limited:
    HTTP/1.1 429 Too Many Requests
    X-RateLimit-Limit: 1000
    X-RateLimit-Remaining: 0
    X-RateLimit-Reset: 1748823060
    Retry-After: 42
    Content-Type: application/json

    {"error": "rate_limit_exceeded", "message": "Too many requests. Retry after 42 seconds."}
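
A gateway could derive these headers from the limiter's result in one place. This is a minimal sketch; the function name rate_limit_headers and its parameters are assumptions. Retry-After is computed as the number of seconds until the window resets.

```python
import math
from typing import Dict


def rate_limit_headers(allowed: bool, limit: int, remaining: int,
                       reset_at: int, window_seconds: int, now: float) -> Dict[str, str]:
    """Build the rate limit headers shown above; adds Retry-After only on rejection."""
    headers = {
        "X-RateLimit-Limit": str(limit),
        "X-RateLimit-Remaining": str(max(0, remaining)),
        "X-RateLimit-Reset": str(reset_at),
        "X-RateLimit-Policy": f"{limit};w={window_seconds}",
    }
    if not allowed:
        # Seconds until the window resets, rounded up and never less than 1.
        headers["Retry-After"] = str(max(1, math.ceil(reset_at - now)))
    return headers


ok_headers = rate_limit_headers(True, 1000, 847, 1748823060, 60, now=1748823000)
limited_headers = rate_limit_headers(False, 1000, 0, 1748823060, 60, now=1748823018)
```

With the values from the example response, the rejected request gets Retry-After: 42.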

Fail Open vs Fail Closed

If Redis is unreachable, should you allow (fail open) or reject (fail closed) requests? For most APIs: fail open — better to allow extra requests than to bring down your service when Redis has a blip. For high-security endpoints (payments, auth): fail closed or use a local in-memory fallback counter. Make this a configurable policy per endpoint.
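
The per-endpoint policy could be expressed as a thin wrapper around the check. This is a sketch with hypothetical names. The built-in ConnectionError and TimeoutError stand in for whatever the client library raises: redis-py, for example, defines its own exception classes, which would need to be passed in via store_errors.

```python
from typing import Callable, Tuple, Type


def guarded_check(check: Callable[[], bool],
                  fail_open: bool = True,
                  store_errors: Tuple[Type[BaseException], ...] = (ConnectionError, TimeoutError),
                  ) -> bool:
    """Run a rate limit check; if the backing store errors, apply the configured policy."""
    try:
        return check()
    except store_errors:
        # Store unreachable: allow when failing open, reject when failing closed.
        return fail_open


def broken_store() -> bool:
    raise ConnectionError("redis unreachable")   # simulated outage


search_allowed = guarded_check(broken_store, fail_open=True)     # ordinary endpoint
payment_allowed = guarded_check(broken_store, fail_open=False)   # high-security endpoint
```

During the simulated outage the ordinary endpoint keeps serving traffic while the high-security endpoint rejects it.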

8. Token Bucket Deep Dive

Token bucket is one of the most widely used algorithms in practice. Here is one way to implement it in Redis:

Lua: Token Bucket in Redis

    -- KEYS[1] = bucket key
    -- ARGV[1] = max tokens (burst capacity)
    -- ARGV[2] = refill rate (tokens per second)
    -- ARGV[3] = current time in Unix seconds, fractional (e.g. 1748822400.123)
    -- ARGV[4] = tokens requested (usually 1)
    local bucket = redis.call("HMGET", KEYS[1], "tokens", "last_refill")
    local max_tokens = tonumber(ARGV[1])
    local refill_rate = tonumber(ARGV[2])
    local now = tonumber(ARGV[3])
    local requested = tonumber(ARGV[4])

    -- A missing key means a new bucket, which starts full
    local tokens = tonumber(bucket[1]) or max_tokens
    local last_refill = tonumber(bucket[2]) or now

    -- Refill tokens based on elapsed time (clamped in case client clocks disagree)
    local elapsed = math.max(0, now - last_refill)
    local new_tokens = math.min(max_tokens, tokens + elapsed * refill_rate)

    local allowed = 0
    if new_tokens >= requested then
        new_tokens = new_tokens - requested
        allowed = 1
    end

    -- HSET accepts multiple field/value pairs since Redis 4.0 (HMSET is deprecated)
    redis.call("HSET", KEYS[1], "tokens", new_tokens, "last_refill", now)
    -- Expire idle buckets once they would have refilled completely anyway
    redis.call("EXPIRE", KEYS[1], math.ceil(max_tokens / refill_rate) + 1)
    return {allowed, math.floor(new_tokens)}
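
The same refill-then-consume logic can be exercised without Redis. The class below is an illustrative single-process mirror of the script; the name TokenBucket and its interface are hypothetical, and it clamps negative elapsed time to zero as a precaution.

```python
class TokenBucket:
    """Single-process token bucket mirroring the refill-then-consume steps."""

    def __init__(self, max_tokens: float, refill_rate: float, now: float = 0.0):
        self.max_tokens = max_tokens      # burst capacity
        self.refill_rate = refill_rate    # tokens added per second
        self.tokens = max_tokens          # a new bucket starts full
        self.last_refill = now

    def take(self, now: float, requested: float = 1) -> bool:
        # Refill based on elapsed time, capped at the burst capacity.
        elapsed = max(0.0, now - self.last_refill)
        self.tokens = min(self.max_tokens, self.tokens + elapsed * self.refill_rate)
        self.last_refill = now
        if self.tokens >= requested:
            self.tokens -= requested
            return True
        return False


bucket = TokenBucket(max_tokens=5, refill_rate=1.0, now=0.0)
burst = [bucket.take(now=0.0) for _ in range(6)]   # five pass, the sixth is rejected
after_wait = bucket.take(now=2.0)                  # two seconds later: 2 tokens refilled
```

The burst of six requests at the same instant drains the five available tokens, and waiting two seconds at one token per second makes the next request succeed.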

Frequently Asked Questions — Rate Limiter Design