1. Why Naive Immediate Retry is Dangerous
The simplest retry — immediately try again on failure — is also the most dangerous. If a service is overloaded and 1,000 clients retry immediately after receiving a 503, the service receives 2,000 requests instead of 1,000, making the overload worse. Immediate retries can turn a brief hiccup into a cascading failure.
NAIVE IMMEDIATE RETRY (dangerous):
  T=0: Service overloaded, returns 503 to 1,000 clients
  T=0.01: All 1,000 clients retry immediately
    Service receives 2,000 requests → MORE overloaded
  T=0.02: All 1,000 clients retry again
    Service receives 3,000 requests → crash

EXPONENTIAL BACKOFF WITH JITTER (safe):
  T=0: Service returns 503 to 1,000 clients
  T=1-2s: Clients retry with random delay in [0, 2s]
    Only ~500 clients retry per second → service can recover
  T=4-8s: Remaining failures retry with delay in [0, 8s]
    ~125 per second → service almost fully recovered ✓
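The difference between the two timelines can be checked with a small simulation. This is a rough sketch, not a model of any real service: the 100 ms bucket size, the fixed seed, and the helper name `peak_retries_per_100ms` are assumptions made for illustration.

```python
import random
from collections import Counter

def peak_retries_per_100ms(delays_s):
    """Return the largest number of retries that land in any 100 ms window."""
    buckets = Counter(int(delay * 10) for delay in delays_s)
    return max(buckets.values())

rng = random.Random(42)  # fixed seed so the run is repeatable
clients = 1000

# Naive: every client retries 10 ms after the failure.
immediate = [0.01] * clients
# Jittered: every client picks a random delay in [0, 2s].
jittered = [rng.uniform(0, 2.0) for _ in range(clients)]

print("immediate peak per 100 ms:", peak_retries_per_100ms(immediate))
print("jittered peak per 100 ms: ", peak_retries_per_100ms(jittered))
```

With immediate retries all 1,000 clients land in the same window, while the jittered delays spread the same 1,000 retries across roughly twenty windows.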
2. Exponential Backoff
Exponential backoff doubles the delay after each failed attempt: delay = base_delay * 2^(attempt - 1), with attempts numbered from 1. With base_delay = 1 second:
- Attempt 1: wait 1s (2^0 = 1)
- Attempt 2: wait 2s (2^1 = 2)
- Attempt 3: wait 4s (2^2 = 4)
- Attempt 4: wait 8s (2^3 = 8)
- Attempt 5: wait 16s (2^4 = 16)
Always cap the maximum delay to prevent indefinitely long waits: delay = min(base_delay * 2^(attempt - 1), max_delay). A max_delay of 30–60 seconds is typical.
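The capped schedule can be written as a short function. This is a minimal sketch that assumes attempts are numbered from 1, as in the list above; the function name `backoff_delay` and the default values are illustrative choices.

```python
def backoff_delay(attempt: int, base_delay: float = 1.0, max_delay: float = 60.0) -> float:
    """Seconds to wait before retry number `attempt` (numbered from 1)."""
    if attempt < 1:
        raise ValueError("attempt is numbered from 1")
    # Double the delay for each attempt, but never exceed max_delay.
    return min(base_delay * 2 ** (attempt - 1), max_delay)

print([backoff_delay(n) for n in range(1, 9)])
# [1.0, 2.0, 4.0, 8.0, 16.0, 32.0, 60.0, 60.0]
```

From attempt 7 onwards the uncapped value (64s, 128s, ...) exceeds the 60s cap, so the delay stays flat.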
3. Adding Jitter
Jitter randomises the backoff delay to prevent synchronized retries from multiple clients. AWS's distributed systems team recommends these three jitter strategies: full jitter, equal jitter, and decorrelated jitter.
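The three strategies (full, equal, and decorrelated jitter) can be sketched as follows. The function names, the default base and cap, and the `rng` parameter are assumptions for illustration. The formulas follow the commonly cited definitions, with the exponential term capped before jitter is applied.

```python
import random

def capped_backoff(attempt, base_delay=1.0, max_delay=60.0):
    """Exponential backoff for attempt 1, 2, 3, ... capped at max_delay."""
    return min(max_delay, base_delay * 2 ** (attempt - 1))

def full_jitter(attempt, base_delay=1.0, max_delay=60.0, rng=random):
    """Pick a delay anywhere between 0 and the capped backoff."""
    return rng.uniform(0, capped_backoff(attempt, base_delay, max_delay))

def equal_jitter(attempt, base_delay=1.0, max_delay=60.0, rng=random):
    """Keep half of the capped backoff fixed and randomise the other half."""
    backoff = capped_backoff(attempt, base_delay, max_delay)
    return backoff / 2 + rng.uniform(0, backoff / 2)

def decorrelated_jitter(prev_delay, base_delay=1.0, max_delay=60.0, rng=random):
    """Derive the next delay from the previous one instead of the attempt number."""
    return min(max_delay, rng.uniform(base_delay, prev_delay * 3))

# Decorrelated jitter carries state between attempts; start from base_delay.
delay = 1.0
for _ in range(5):
    delay = decorrelated_jitter(delay)
    print(f"decorrelated delay: {delay:.2f}s")
```

Full jitter gives the widest spread, equal jitter guarantees a minimum wait of half the backoff, and decorrelated jitter needs the caller to remember the previous delay.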
4. What to Retry — Retryable vs Non-Retryable Errors
| Error Type | Retryable? | Examples | Notes |
|---|---|---|---|
| Network timeout | Yes | Connection reset, read timeout | Requires idempotency |
| 503 Service Unavailable | Yes | Service restarting, overloaded | Respect Retry-After header |
| 502 Bad Gateway | Yes | Upstream died mid-request | Transient, usually recovers |
| 429 Too Many Requests | Yes | Rate limited | Use Retry-After header delay |
| 500 Internal Server Error | Sometimes | Server bug vs temporary overload | Retry with low attempt count |
| 400 Bad Request | No | Malformed request | Retrying won't fix the request |
| 401 Unauthorized | No | Invalid API key | Fix credentials first |
| 403 Forbidden | No | Insufficient permissions | Not a transient error |
| 404 Not Found | No | Resource doesn't exist | Won't appear on retry |
| 422 Unprocessable Entity | No | Validation failure | Fix the request payload |
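The table can be turned into a small decision helper. This is a sketch under stated assumptions: the attempt limits, the set names, and both function names are hypothetical, and network timeouts are left out because they surface as exceptions rather than status codes.

```python
from datetime import datetime, timezone
from email.utils import parsedate_to_datetime
from typing import Optional

ALWAYS_RETRYABLE = {429, 502, 503}   # transient by nature
LOW_ATTEMPT_RETRYABLE = {500}        # may be a bug, so retry sparingly

def is_retryable(status: int, attempts_made: int,
                 max_attempts: int = 5, max_attempts_500: int = 2) -> bool:
    """Decide whether another attempt is allowed for this HTTP status."""
    if status in ALWAYS_RETRYABLE:
        return attempts_made < max_attempts
    if status in LOW_ATTEMPT_RETRYABLE:
        return attempts_made < max_attempts_500
    return False  # 400, 401, 403, 404, 422 and other statuses are not retried

def retry_after_seconds(value: Optional[str],
                        now: Optional[datetime] = None) -> Optional[float]:
    """Parse a Retry-After header given as delay seconds or as an HTTP date."""
    if value is None:
        return None
    value = value.strip()
    if value.isdigit():
        return float(value)
    try:
        when = parsedate_to_datetime(value)
    except (TypeError, ValueError):
        return None  # unparseable header: fall back to normal backoff
    if when.tzinfo is None:
        when = when.replace(tzinfo=timezone.utc)
    now = now or datetime.now(timezone.utc)
    return max(0.0, (when - now).total_seconds())

print(is_retryable(503, attempts_made=1))  # True
print(is_retryable(404, attempts_made=1))  # False
print(retry_after_seconds("120"))          # 120.0
```

When `retry_after_seconds` returns a value, a client would normally wait at least that long instead of using its own computed backoff.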
Idempotency is Required for Safe Retries
You can only safely retry an operation if it is idempotent — making the same request twice produces the same result. GET, HEAD, PUT, and DELETE are naturally idempotent. POST is not. For POST operations (creating resources, charging payments), use idempotency keys so the server can detect and deduplicate retried requests. Never retry non-idempotent POST operations without an idempotency key.
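Server-side deduplication with an idempotency key can be illustrated with an in-memory stand-in. The `PaymentServer` class and its `charge` method are hypothetical; a real service would persist keys in durable storage and expire them after some period.

```python
import uuid

class PaymentServer:
    """Hypothetical in-memory server that deduplicates by idempotency key."""

    def __init__(self):
        self._responses = {}  # idempotency key -> response already returned
        self.charges = []     # charges actually applied

    def charge(self, idempotency_key: str, amount_cents: int) -> dict:
        # A repeated key means a retried request: replay the stored response.
        if idempotency_key in self._responses:
            return self._responses[idempotency_key]
        self.charges.append(amount_cents)
        response = {"charge_id": len(self.charges), "amount_cents": amount_cents}
        self._responses[idempotency_key] = response
        return response

server = PaymentServer()
key = str(uuid.uuid4())          # generated once per logical operation
first = server.charge(key, 1999)
retry = server.charge(key, 1999)  # the retry reuses the same key

print(first == retry)        # True
print(len(server.charges))   # 1
```

The client must generate the key once and reuse it on every retry of the same operation; a fresh key per attempt would defeat the deduplication.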
5. Retry Pattern Implementation in PHP
6. Retry vs Circuit Breaker vs Timeout
| Pattern | Purpose | When to Use | Works Together? |
|---|---|---|---|
| Retry + Backoff | Handle transient failures | Brief, self-healing failures (network blip, brief overload) | Yes — first line of defence |
| Circuit Breaker | Stop calling a failing service | Sustained failures, service down for >30s | Yes — opens when retry budget exhausted |
| Timeout | Bound request duration | Always — set before configuring retries | Yes — triggers retry on timeout |
| Bulkhead | Limit concurrent requests | Prevent one slow service from consuming all threads | Yes — complements circuit breaker |
7. Libraries for Retry Logic
Do not implement retry logic from scratch in production. Use battle-tested libraries:
- Polly (.NET): Policy.Handle<HttpRequestException>().WaitAndRetry(3, r => TimeSpan.FromSeconds(Math.Pow(2, r))). Supports retry, circuit breaker, timeout, bulkhead, fallback, and hedging in a unified fluent API.
- Resilience4j (Java): Lightweight fault tolerance library for Java. Supports Retry, CircuitBreaker, RateLimiter, TimeLimiter, Bulkhead, and Cache as composable decorators.
- tenacity (Python): @retry(wait=wait_exponential(min=1, max=60), stop=stop_after_attempt(5)). Decorator-based retry with exponential backoff, jitter, and custom retry conditions.
- axios-retry (Node.js): Plug-in for the Axios HTTP client with configurable retry conditions and exponential backoff.
- AWS SDK: All AWS SDKs have built-in retry logic with full jitter by default. Configure with maxAttempts and retryMode: 'adaptive'.
Retry Amplification — The Hidden Danger
With 3 retries, every persistently failing request becomes 4 requests to the downstream service. If 50% of requests fail on every attempt, the downstream service receives 2.5× the expected load (0.5 × 1 + 0.5 × 4); if all requests fail, it receives 4×. Under sustained failure, retries amplify load and can prevent recovery. Always implement a retry budget (max N% of total requests can be retries) and use circuit breakers to stop retrying when a service is consistently failing. Combine with exponential backoff to give the service time to recover between attempts.
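How much retries amplify load depends on how failures are distributed. The sketch below works through the arithmetic under two explicit assumptions: either a fixed share of requests fails on every attempt, or each attempt fails independently with the same probability. The function name `load_multiplier` is illustrative.

```python
def load_multiplier(failure_rate: float, max_retries: int, persistent: bool = True) -> float:
    """Average number of downstream requests generated per original request."""
    if persistent:
        # Failing requests fail on every attempt and use all their retries.
        return (1 - failure_rate) * 1 + failure_rate * (1 + max_retries)
    # Independent failures: attempt k+1 happens only if the first k all failed.
    return sum(failure_rate ** k for k in range(max_retries + 1))

print(load_multiplier(0.5, 3))                    # 2.5
print(load_multiplier(1.0, 3))                    # 4.0
print(load_multiplier(0.5, 3, persistent=False))  # 1.875
```

The worst case of (1 + max_retries)× load is reached only when every request fails, which is exactly the situation in which the downstream service can least afford it.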
Frequently Asked Questions — Retry Pattern
Exponential backoff is a retry strategy where the delay between attempts grows exponentially: delay = base_delay * 2^(attempt - 1), with attempts numbered from 1. For example with base_delay=1s: attempt 1 waits 1s, attempt 2 waits 2s, attempt 3 waits 4s, attempt 4 waits 8s. This gives the failing service time to recover and reduces load compared to immediate retries. Most implementations cap the maximum delay (e.g. max 60s) to prevent indefinite waits.
Jitter adds randomness to retry delays to prevent synchronized retries. If 1,000 clients all experience the same failure at the same time and all retry with the same exponential backoff schedule, they will all retry simultaneously, causing a thundering herd against the recovering service. Jitter spreads retries over time. In the formulas below, backoff = min(max_delay, base_delay * 2^(attempt - 1)). Full jitter: delay = random(0, backoff). Decorrelated jitter (AWS recommendation): delay = min(max_delay, random(base_delay, prev_delay * 3)). Equal jitter: delay = backoff/2 + random(0, backoff/2).
Client errors (4xx) should generally not be retried because they indicate a problem with the request itself, not a transient server issue. 400 Bad Request (malformed request), 401 Unauthorized (invalid credentials), 403 Forbidden (no permission), 404 Not Found (resource does not exist), and 422 Unprocessable Entity should not be retried — retrying will not fix the underlying problem. Only transient errors (5xx server errors, 429 Too Many Requests with a Retry-After header, and network-level timeouts) should be retried.
A retry budget limits the total number of retries in a time window so that a cascade of retries cannot amplify load on a failing service. For example, a 10% retry budget means at most 10% of total requests can be retries at any given time. If you receive 1,000 requests/second and 500 are failing, a 10% budget allows only 100 retries/second. Without it, a policy of 3 retries per failure could add up to 1,500 retries/second, pushing the downstream service to 2.5× its normal load. A retry budget is typically implemented as a global rate limiter on retry operations.
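One way to implement such a budget is a sliding-window counter shared by all callers. This is a single-threaded sketch: the class name `RetryBudget`, the 10-second window, and the choice to count only first attempts as requests are assumptions, and a production version would need locking.

```python
import time
from collections import deque

class RetryBudget:
    """Allow retries only while they stay under a percentage of recent requests."""

    def __init__(self, max_retry_percent: int = 10, window_s: float = 10.0,
                 clock=time.monotonic):
        self.max_retry_percent = max_retry_percent
        self.window_s = window_s
        self.clock = clock
        self._requests = deque()  # timestamps of first attempts
        self._retries = deque()   # timestamps of permitted retries

    def _expire(self, now: float) -> None:
        # Drop timestamps that have fallen out of the sliding window.
        for queue in (self._requests, self._retries):
            while queue and now - queue[0] > self.window_s:
                queue.popleft()

    def record_request(self) -> None:
        now = self.clock()
        self._expire(now)
        self._requests.append(now)

    def try_acquire_retry(self) -> bool:
        now = self.clock()
        self._expire(now)
        # Integer arithmetic avoids floating-point edge cases at the limit.
        if len(self._retries) * 100 < self.max_retry_percent * len(self._requests):
            self._retries.append(now)
            return True
        return False

budget = RetryBudget(max_retry_percent=10)
for _ in range(1000):
    budget.record_request()
allowed = sum(budget.try_acquire_retry() for _ in range(500))
print(allowed)  # 100
```

Callers check `try_acquire_retry()` before each retry and fail fast when it returns False, so the 400 rejected retries in this run never reach the downstream service.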
Use retry for transient errors — brief network blips, temporary service unavailability, or rate limiting where the service will recover quickly. Use a circuit breaker when a service is consistently failing and retries would just amplify the load. The circuit breaker opens after a threshold of failures, immediately returning errors without making downstream calls, giving the failing service time to recover. In practice, you use both together: retries handle brief hiccups, and the circuit breaker handles sustained outages.
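The open, fail-fast, and recover behaviour of a circuit breaker can be sketched in a few lines. This is a simplified, single-threaded illustration: the class and exception names, the threshold of 5 consecutive failures, and the 30-second reset timeout are assumptions, and the libraries listed earlier provide more complete implementations.

```python
import time

class CircuitOpenError(Exception):
    """Raised when the breaker rejects a call without contacting the service."""

class CircuitBreaker:
    def __init__(self, failure_threshold: int = 5, reset_timeout_s: float = 30.0,
                 clock=time.monotonic):
        self.failure_threshold = failure_threshold
        self.reset_timeout_s = reset_timeout_s
        self.clock = clock
        self._failures = 0
        self._opened_at = None  # None means the circuit is closed

    def call(self, func, *args, **kwargs):
        if self._opened_at is not None:
            if self.clock() - self._opened_at < self.reset_timeout_s:
                raise CircuitOpenError("circuit open; skipping downstream call")
            # Reset timeout elapsed: let one trial call through (half-open).
        try:
            result = func(*args, **kwargs)
        except Exception:
            self._failures += 1
            # Open on reaching the threshold, or re-open if the trial call failed.
            if self._opened_at is not None or self._failures >= self.failure_threshold:
                self._opened_at = self.clock()
            raise
        # Any success closes the circuit and clears the failure count.
        self._failures = 0
        self._opened_at = None
        return result

breaker = CircuitBreaker(failure_threshold=3)
print(breaker.call(lambda: "ok"))  # ok
```

In combination with retries, the breaker usually wraps the whole retried operation, so that a `CircuitOpenError` ends the retry loop immediately instead of being retried.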
A timeout defines how long to wait for a single request to complete before giving up. A retry defines how many times to attempt an operation that has failed or timed out. They work together: set a per-request timeout (e.g. 2s) to avoid waiting forever, then retry up to N times with backoff on timeout or 5xx errors. Always set a timeout before configuring retries — without a timeout, a hung connection will wait indefinitely and your retry budget will be consumed by slow requests rather than failed ones.
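The combination of a per-request timeout, a bounded attempt count, and capped backoff with full jitter can be sketched as a single loop. The names `call_with_retries` and `TransientError` are hypothetical, and the sketch assumes the wrapped operation accepts a timeout in seconds and raises `TimeoutError` when it is exceeded.

```python
import random
import time

class TransientError(Exception):
    """Hypothetical marker for retryable failures such as a 503 response."""

def call_with_retries(operation, max_attempts=4, timeout_s=2.0,
                      base_delay=1.0, max_delay=30.0,
                      sleep=time.sleep, rng=random):
    """Call operation(timeout_s), retrying timeouts and transient errors."""
    for attempt in range(1, max_attempts + 1):
        try:
            return operation(timeout_s)
        except (TimeoutError, TransientError):
            if attempt == max_attempts:
                raise  # attempts exhausted: surface the last error
            # Capped exponential backoff with full jitter.
            backoff = min(max_delay, base_delay * 2 ** (attempt - 1))
            sleep(rng.uniform(0, backoff))

# Usage: an operation that fails twice, then succeeds.
outcomes = iter([TransientError("503"), TimeoutError(), "ok"])

def flaky_operation(timeout_s):
    outcome = next(outcomes)
    if isinstance(outcome, Exception):
        raise outcome
    return outcome

# A no-op sleep keeps the demonstration instant.
print(call_with_retries(flaky_operation, sleep=lambda seconds: None))  # ok
```

Any other exception type propagates immediately, which matches the rule that non-transient errors are not retried. The worst-case total wait is bounded by max_attempts × timeout_s plus the sum of the capped backoffs.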