How to Design a Notification System — System Design Interview [2026]

Q: How do you handle notification fan-out at scale?

Fan-out means sending one event to thousands or millions of recipients. Use a message queue (Kafka or SQS) as the fan-out backbone. The notification service publishes one event; worker pools subscribe and dispatch per-channel (push, SMS, email). For very large fan-outs (e.g. broadcasting to 10M users), pre-segment users into shards and process each shard in parallel with dedicated workers.

Q: How do you prevent duplicate notifications?

Use a deduplication key (idempotency key) stored in Redis with a TTL equal to your retry window (e.g. 24 hours). Before dispatching, check Redis for the key. If it exists, skip the send. The key is composed of event_id + user_id + channel, so the same event can still go out once on each channel. This prevents duplicates both from retries and from accidental double-publishes.

Q: How do you implement priority queues for notifications?

Use separate Kafka topics or SQS queues per priority tier: CRITICAL (OTP, fraud alerts), HIGH (order updates), NORMAL (social), LOW (marketing). Critical queues have dedicated workers and bypass rate limits. Low-priority queues share workers and respect per-user rate limits. This ensures OTPs always arrive in seconds even when the system is processing millions of marketing messages.

Q: How do you manage user notification preferences?

Store preferences in a user_preferences table keyed by (user_id, notification_type, channel). Cache the preferences in Redis (TTL 1 hour) since they are read on every notification dispatch but rarely updated. Provide a preference center UI. Before dispatching, load preferences and skip channels the user has opted out of. Respect Do Not Disturb hours using the user's stored timezone.

Q: What is the retry strategy for failed notifications?

Use exponential backoff with jitter: retry after roughly 1s, 5s, 30s, 5min, and 30min, for a maximum of 5 retries on transient failures (network timeout, 5xx, rate limit). Do not retry permanent failures (invalid token, unsubscribed email). Store retry state in the delivery_logs table. After max retries, mark as FAILED and trigger an alert if the failure rate exceeds 1%.

Q: How do you handle device token management for push notifications?

Device tokens expire and change when users reinstall apps. When FCM/APNs returns a "registration not found" or "invalid token" error, immediately delete the token from your devices table. When a user logs in on a new device, upsert the token (insert or update keyed by user_id + platform + token), so a second device on the same platform adds a row instead of overwriting the first. Support multiple devices per user by storing one row per device. Tokens should be refreshed in the app on every launch.

1. Requirements Clarification

Notification systems vary significantly based on use case. Drive this conversation with your interviewer to establish boundaries.

Functional Requirements

Support three channels: push notifications (iOS/Android), SMS, and email
Trigger notifications from internal services (order placed, OTP request) and scheduled campaigns (marketing blasts)
User preference management — opt-out per channel, per notification type, DND hours
Delivery tracking — sent, delivered, read receipts where available
Template management — dynamic content injection into reusable templates
Retry on failure with dead-letter handling

Non-Functional Requirements

Scale: 10 million notifications per day; marketing blasts to 50M users
Latency: Critical notifications (OTP) delivered in <5 seconds end-to-end
Reliability: At-least-once delivery for critical types; best-effort for marketing
Throughput: ~115 notifications/second baseline, bursts to 500K/second during blasts
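
The throughput figures above follow from simple division. A minimal sketch of that arithmetic, with hypothetical function names chosen for illustration:

```python
SECONDS_PER_DAY = 24 * 60 * 60  # 86,400


def average_rate(notifications_per_day: int) -> float:
    """Average notifications per second, spread evenly over a day."""
    return notifications_per_day / SECONDS_PER_DAY


def blast_duration_seconds(recipients: int, sends_per_second: int) -> float:
    """Time to drain a blast at a sustained send rate."""
    return recipients / sends_per_second


baseline = average_rate(10_000_000)
blast = blast_duration_seconds(50_000_000, 500_000)
print(f"baseline ~ {baseline:.1f}/s, blast ~ {blast:.0f} s")
```

The baseline comes out just under 116 per second, and a 50M blast at a sustained 500K/second drains in 100 seconds. Real traffic is not evenly spread, so treat the baseline as a floor when sizing workers.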

2. High-Level Architecture

  Triggering Services (Order, Auth, Marketing)
              │
              ▼
    ┌─────────────────────┐
    │  Notification Service│  ← validates, enriches, deduplicates
    │  (API + Fan-out)     │
    └──────────┬──────────┘
               │
    ┌──────────▼──────────────────────────────┐
    │           Message Queue (Kafka)          │
    │  topics: notif.critical  notif.high      │
    │          notif.normal    notif.low       │
    └──────┬──────────┬──────────┬────────────┘
           │          │          │
    ┌──────▼──┐ ┌─────▼───┐ ┌───▼────────┐
    │  Push   │ │   SMS   │ │   Email    │
    │ Workers │ │ Workers │ │  Workers   │
    │FCM/APNs │ │ Twilio  │ │ SendGrid   │
    └──────┬──┘ └─────┬───┘ └───┬────────┘
           └──────────┴─────────┘
                      │
               ┌──────▼──────┐
               │ Delivery Log │  (delivery_logs table + Redis)
               └─────────────┘

3. Notification Channels Deep Dive

Each channel has different providers, failure modes, and latency characteristics.

Channel	Provider	Delivery Latency	Cost	Limitations
Push (Android)	FCM (Firebase Cloud Messaging)	1–3 seconds	Free	Device must have app installed + internet
Push (iOS)	APNs (Apple Push Notification service)	1–3 seconds	Free	Requires Apple developer cert; strict payload size (4KB)
SMS	Twilio, AWS SNS, Vonage	2–10 seconds	$0.0079/msg (Twilio US)	Character limits; spam filters; regulatory compliance (TCPA)
Email	SendGrid, AWS SES, Mailgun	1–60 seconds	$0.0001/email (SES)	Spam filters; unsubscribe compliance (CAN-SPAM)

Push Notification Flow (FCM)

FCM is the primary channel for mobile apps. The push worker fetches the device token from the devices table and calls the FCM HTTP v1 API. FCM returns immediately with an acceptance; actual delivery to the device is asynchronous.

FCM HTTP v1 API: send push

    POST https://fcm.googleapis.com/v1/projects/{project_id}/messages:send

    {
      "message": {
        "token": "device_registration_token_here",
        "notification": {
          "title": "Your order has shipped!",
          "body": "Order #1234 is on its way. Track it here."
        },
        "data": {
          "order_id": "1234",
          "deep_link": "myapp://orders/1234"
        },
        "android": { "priority": "high" },
        "apns": { "headers": { "apns-priority": "10" } }
      }
    }

    // Response 200: message accepted by FCM
    // Response 404: device token no longer valid, delete it from the DB
    // Response 429: FCM rate limit, back off and retry
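
As a rough sketch of how a push worker might assemble that request body and react to the status codes listed above, the following builds the payload and maps a status to a next step. The function names and action strings are hypothetical, the HTTP call itself is left out, and treating 5xx as retryable follows the retry rules later in this guide.

```python
import json


def build_fcm_message(token: str, title: str, body: str, data: dict) -> dict:
    """Build the JSON body for POST .../messages:send (same shape as above)."""
    return {
        "message": {
            "token": token,
            "notification": {"title": title, "body": body},
            # FCM data payload values must be strings.
            "data": {key: str(value) for key, value in data.items()},
            "android": {"priority": "high"},
            "apns": {"headers": {"apns-priority": "10"}},
        }
    }


def action_for_status(status_code: int) -> str:
    """Map an HTTP status to the worker's next step (hypothetical action names)."""
    if status_code == 200:
        return "mark_sent"
    if status_code == 404:
        return "delete_token"
    if status_code == 429 or status_code >= 500:
        return "retry_with_backoff"
    return "mark_failed_permanent"


message = build_fcm_message(
    "device_registration_token_here",
    "Your order has shipped!",
    "Order #1234 is on its way. Track it here.",
    {"order_id": 1234, "deep_link": "myapp://orders/1234"},
)
request_body = json.dumps(message)
```

Keeping payload construction separate from status handling makes both easy to unit test without calling FCM.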

4. Fan-Out Architecture

Fan-out is the process of taking one notification event and dispatching it to potentially millions of recipients (e.g. a marketing blast to all users). Naive fan-out — looping through users in the API handler — will time out and block the system. The correct approach is asynchronous multi-stage fan-out.

Two-Stage Fan-Out

Stage 1 (Segment Fan-out): The campaign service publishes one event with a user segment ID. A fan-out worker reads the user segment table in batches of 1,000 users and publishes one notification job per batch to Kafka (50,000 jobs for a 50M-user segment).
Stage 2 (Channel Dispatch): Channel-specific workers (push, SMS, email) consume from Kafka, look up device tokens / phone numbers / email addresses, check user preferences, and call the external provider API.

Campaign: "Send promo to all 50M users"
    │
    ▼
Fan-out Worker
    reads users in batches of 1,000
    │
    ▼ (publishes 50,000 Kafka messages)
┌──────────────────────────────────────┐
│  topic: notif.low                    │
│  50,000 messages × 1,000 users       │
└──────┬───────────────────────────────┘
       │
       ▼ (100 parallel push workers)
  Push Workers call FCM
  → 500,000 FCM calls/second
  → Full blast completes in ~100 seconds
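
Stage 1 of the flow above can be sketched as a batching loop. This is an illustration only: `fan_out`, the job shape, and the `publish` callback are assumptions, and an in-memory list stands in for the Kafka producer.

```python
from itertools import islice
from typing import Callable, Iterable, Iterator, List

BATCH_SIZE = 1_000


def batched(user_ids: Iterable[int], size: int = BATCH_SIZE) -> Iterator[List[int]]:
    """Yield consecutive lists of at most `size` user IDs."""
    iterator = iter(user_ids)
    while batch := list(islice(iterator, size)):
        yield batch


def fan_out(campaign_id: str, user_ids: Iterable[int],
            publish: Callable[[dict], None]) -> int:
    """Publish one job per batch and return the number of jobs published."""
    jobs = 0
    for batch in batched(user_ids):
        publish({"campaign_id": campaign_id, "user_ids": batch})
        jobs += 1
    return jobs


# Usage: a plain list stands in for the Kafka topic.
queue: List[dict] = []
job_count = fan_out("promo-1", range(2_500), queue.append)
print(job_count, [len(job["user_ids"]) for job in queue])
# prints: 3 [1000, 1000, 500]
```

Because the segment is read lazily, the fan-out worker never holds more than one batch in memory, which matters when the segment is 50M rows.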

Interview Tip — Fan-out on Read vs Write

For social notifications (e.g. "X liked your post"), fan-out on write means pre-computing recipient lists at event time. For marketing, fan-out on read (query the segment at send time) is preferred so you always use the current user set. Most systems use a hybrid: write fan-out for small social graphs, read fan-out for large broadcasts.

5. Priority Queue Design

Not all notifications are equal. An OTP code that expires in 60 seconds must not be delayed behind a marketing email. Use separate Kafka topics per priority tier.

Priority Tier	Examples	Kafka Topic	Workers	Rate Limit
CRITICAL	OTP, fraud alert, password reset	notif.critical	Dedicated (always-on)	None
HIGH	Order shipped, payment failed	notif.high	Dedicated pool	No per-user limit
NORMAL	Social activity, reminders	notif.normal	Shared pool	50/user/day
LOW	Marketing, newsletters, promotions	notif.low	Shared pool (lower priority)	5/user/day
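
One way to read the table above is as a routing rule plus a per-user daily cap. The sketch below assumes the caps apply per user per tier and keeps counters in memory; a real system would likely hold them in a shared store such as Redis and reset them daily. The `Router` class is hypothetical.

```python
from collections import defaultdict
from typing import Optional

# Topic and per-user daily cap for each tier, taken from the table above.
TIERS = {
    "CRITICAL": {"topic": "notif.critical", "daily_limit": None},
    "HIGH": {"topic": "notif.high", "daily_limit": None},
    "NORMAL": {"topic": "notif.normal", "daily_limit": 50},
    "LOW": {"topic": "notif.low", "daily_limit": 5},
}


class Router:
    def __init__(self) -> None:
        # (user_id, tier) -> notifications routed so far today.
        self.sent_today = defaultdict(int)

    def route(self, user_id: int, tier: str) -> Optional[str]:
        """Return the topic to publish to, or None if the user's cap is reached."""
        config = TIERS[tier]
        limit = config["daily_limit"]
        if limit is not None and self.sent_today[(user_id, tier)] >= limit:
            return None
        self.sent_today[(user_id, tier)] += 1
        return config["topic"]


router = Router()
results = [router.route(user_id=1, tier="LOW") for _ in range(6)]
```

In this run the first five LOW notifications are routed to notif.low and the sixth is dropped, while CRITICAL traffic for the same user is unaffected.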

6. Retry Logic and Dead-Letter Queues

External providers (FCM, Twilio, SendGrid) fail transiently. A robust retry strategy ensures high delivery rates without hammering providers during outages.

Transient failures (5xx, timeout, rate limit): Retry with exponential backoff — 1s, 5s, 30s, 5min, 30min. Max 5 retries.
Permanent failures (invalid token, unsubscribed, invalid phone): Do not retry. Mark as FAILED_PERMANENT. Clean up the bad device/contact record.
Dead-letter queue (DLQ): After max retries, move the message to a DLQ topic. A separate DLQ worker inspects failures, alerts on-call engineers if failure rate exceeds 1%, and can replay messages after the provider recovers.

Watch Out — Retry Storms

If FCM has a 10-minute outage, millions of retries queue up and, with fixed delays, fire at the same moments once FCM recovers. Add jitter (±30% randomness) to each backoff delay to spread the retry load. Without jitter, this thundering herd can overload the provider or your own workers again right after recovery.
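
A minimal sketch of the retry schedule with ±30% jitter described in this section. The error-code strings and function names are hypothetical; the delays and retry cap are the ones listed above.

```python
import random

BASE_DELAYS = [1, 5, 30, 300, 1800]  # seconds: 1s, 5s, 30s, 5min, 30min
JITTER = 0.30
PERMANENT_ERRORS = {"invalid_token", "unsubscribed", "invalid_phone"}


def should_retry(error_code: str) -> bool:
    """Permanent failures are never retried."""
    return error_code not in PERMANENT_ERRORS


def next_delay(attempt: int, rng=random):
    """Delay in seconds before retry number `attempt` (1-based).

    Returns None once the retry budget is spent, which is the signal to
    move the message to the dead-letter queue.
    """
    if attempt > len(BASE_DELAYS):
        return None
    base = BASE_DELAYS[attempt - 1]
    return base * rng.uniform(1 - JITTER, 1 + JITTER)


for attempt in range(1, 7):
    print(attempt, next_delay(attempt))
```

Each worker draws its own random factor, so retries that failed at the same instant come back spread across a window instead of all at once.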

7. Deduplication

At-least-once delivery from Kafka means the same message can be processed twice — e.g. a worker crashes after calling FCM but before committing the Kafka offset. Deduplication prevents the user from receiving duplicate notifications.

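Redis: deduplication check (Python, redis-py)

    import redis

    r = redis.Redis()  # assumes a reachable Redis instance

    def send_once(event_id, user_id, channel, dispatch):
        # Dedup key: composed of event_id + user_id + channel
        dedup_key = f"notif:sent:{event_id}:{user_id}:{channel}"
        # Before dispatching: skip if this notification was already sent.
        if r.exists(dedup_key):
            return False
        dispatch()  # hypothetical callable that calls the provider; raises on failure
        # After successful dispatch: remember the send for 24h.
        # This window covers the retry period, so any retry within 24h
        # will see the key and be skipped.
        r.set(dedup_key, "1", ex=86400)
        return True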
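
The check-then-set flow above leaves a small window in which two workers can both see no key and both send. One possible variant, sketched here, claims the key atomically before dispatch using SET with NX and releases it if the send fails. This is a different trade-off, not the flow shown above: a crash after claiming but before sending drops that notification. The `client` is assumed to follow redis-py's `set(name, value, nx=..., ex=...)` and `delete(name)` signatures; `FakeRedis` is a hypothetical in-memory stand-in so the example runs without a server.

```python
def dedup_key(event_id: str, user_id: int, channel: str) -> str:
    return f"notif:sent:{event_id}:{user_id}:{channel}"


def claim_send(client, event_id: str, user_id: int, channel: str,
               ttl_seconds: int = 86_400) -> bool:
    """Return True if this worker won the right to send."""
    # With nx=True, redis-py returns True when the key was created and
    # None when it already existed.
    return bool(client.set(dedup_key(event_id, user_id, channel), "1",
                           nx=True, ex=ttl_seconds))


def release_claim(client, event_id: str, user_id: int, channel: str) -> None:
    """Call after a failed dispatch so a later retry can claim again."""
    client.delete(dedup_key(event_id, user_id, channel))


class FakeRedis:
    """In-memory stand-in for illustration; it ignores expiry."""

    def __init__(self):
        self.store = {}

    def set(self, name, value, nx=False, ex=None):
        if nx and name in self.store:
            return None
        self.store[name] = value
        return True

    def delete(self, name):
        return 1 if self.store.pop(name, None) is not None else 0


client = FakeRedis()
first = claim_send(client, "evt-1", 42, "push")   # this worker sends
second = claim_send(client, "evt-1", 42, "push")  # duplicate is skipped
```

Which variant fits depends on the notification type: a dropped marketing push is usually acceptable, while an OTP may be better served by the check-then-set flow and the occasional duplicate.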

8. Database Schema

The schema covers device registration, notification templates, and delivery tracking.

SQL: core notification tables (MySQL)

    CREATE TABLE devices (
      id          BIGINT UNSIGNED NOT NULL AUTO_INCREMENT,
      user_id     BIGINT UNSIGNED NOT NULL,
      platform    ENUM('ios','android','web') NOT NULL,
      token       VARCHAR(512) NOT NULL,
      app_version VARCHAR(20) NULL,
      created_at  DATETIME NOT NULL DEFAULT CURRENT_TIMESTAMP,
      updated_at  DATETIME NOT NULL DEFAULT CURRENT_TIMESTAMP ON UPDATE CURRENT_TIMESTAMP,
      PRIMARY KEY (id),
      -- Prefix index: assumes tokens differ within their first 64 characters.
      UNIQUE KEY uk_user_platform_token (user_id, platform, token(64)),
      KEY idx_user_id (user_id)
    );

    CREATE TABLE notification_templates (
      id         INT UNSIGNED NOT NULL AUTO_INCREMENT,
      type       VARCHAR(60) NOT NULL,  -- 'ORDER_SHIPPED', 'OTP', etc.
      channel    ENUM('push','sms','email') NOT NULL,
      title_tmpl VARCHAR(200) NULL,
      body_tmpl  TEXT NOT NULL,
      priority   ENUM('CRITICAL','HIGH','NORMAL','LOW') NOT NULL DEFAULT 'NORMAL',
      PRIMARY KEY (id),
      UNIQUE KEY uk_type_channel (type, channel)
    );

    CREATE TABLE delivery_logs (
      id              BIGINT UNSIGNED NOT NULL AUTO_INCREMENT,
      notification_id VARCHAR(64) NOT NULL,  -- idempotency key
      user_id         BIGINT UNSIGNED NOT NULL,
      channel         ENUM('push','sms','email') NOT NULL,
      status          ENUM('QUEUED','SENT','DELIVERED','FAILED','FAILED_PERMANENT','SKIPPED') NOT NULL,
      provider_msg_id VARCHAR(200) NULL,
      error_code      VARCHAR(60) NULL,
      created_at      DATETIME NOT NULL DEFAULT CURRENT_TIMESTAMP,
      updated_at      DATETIME NOT NULL DEFAULT CURRENT_TIMESTAMP ON UPDATE CURRENT_TIMESTAMP,
      -- MySQL requires the partitioning column in every unique key,
      -- so created_at is part of the primary key.
      PRIMARY KEY (id, created_at),
      KEY idx_notification_id (notification_id),
      KEY idx_user_status (user_id, status)
    )
    PARTITION BY RANGE (YEAR(created_at)) (
      PARTITION p2025   VALUES LESS THAN (2026),
      PARTITION p2026   VALUES LESS THAN (2027),
      PARTITION pfuture VALUES LESS THAN MAXVALUE
    );

9. User Preference Management

Users must be able to opt out of specific notification types and channels. Respecting these preferences is both a UX requirement and a legal requirement (CAN-SPAM, GDPR, TCPA).

Store preferences in a user_preferences table: (user_id, notif_type, channel, enabled, dnd_start, dnd_end, timezone)
Cache preferences in Redis on first lookup with a 1-hour TTL. Invalidate on user update.
Check preferences in the channel worker before dispatching. If enabled = false, log status as SKIPPED and stop.
For DND hours: convert the current UTC time to the user's timezone. If within DND window, delay the notification to DND end time using a scheduled re-queue.
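
The DND rule above can be sketched as two small functions: one that checks whether the current time falls inside the window, and one that computes when to re-queue. The function names are hypothetical. The example passes a fixed UTC-5 offset so it runs anywhere; in practice the stored timezone name would more likely be turned into a `zoneinfo.ZoneInfo` object and passed in the same way.

```python
from datetime import datetime, time, timedelta, timezone, tzinfo


def in_dnd(now_utc: datetime, tz: tzinfo, dnd_start: time, dnd_end: time) -> bool:
    """True if the user's local time falls inside the DND window."""
    local = now_utc.astimezone(tz).time()
    if dnd_start <= dnd_end:
        return dnd_start <= local < dnd_end
    # The window wraps past midnight, e.g. 22:00 to 07:00.
    return local >= dnd_start or local < dnd_end


def dnd_end_utc(now_utc: datetime, tz: tzinfo, dnd_end: time) -> datetime:
    """Next occurrence of dnd_end in the user's timezone, expressed in UTC."""
    local_now = now_utc.astimezone(tz)
    candidate = datetime.combine(local_now.date(), dnd_end, tzinfo=tz)
    if candidate <= local_now:
        candidate = datetime.combine(
            local_now.date() + timedelta(days=1), dnd_end, tzinfo=tz
        )
    return candidate.astimezone(timezone.utc)


user_tz = timezone(timedelta(hours=-5))                  # stand-in for the stored timezone
now = datetime(2026, 1, 15, 4, 30, tzinfo=timezone.utc)  # 23:30 local, the day before
quiet = in_dnd(now, user_tz, time(22, 0), time(7, 0))
resume_at = dnd_end_utc(now, user_tz, time(7, 0))
```

Here `quiet` is True and `resume_at` is 12:00 UTC on 15 January, which is 07:00 in the user's local time. Handling the wrap past midnight explicitly avoids the common bug where a 22:00 to 07:00 window never matches.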

Best Practice — Unsubscribe One-Click

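Per Google and Yahoo's 2024 bulk-sender requirements, bulk senders must support one-click unsubscribe (RFC 8058). Every email carries a List-Unsubscribe header with an HTTPS URL plus List-Unsubscribe-Post: List-Unsubscribe=One-Click, and the mailbox provider sends a POST to that URL when the user unsubscribes. Put a signed token in the URL: https://yourapp.com/unsubscribe?token=signed_jwt. The token encodes user_id + notif_type + expiry. On a valid request, update user_preferences without requiring login.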
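
To illustrate the signed-token idea without a third-party JWT library, the sketch below uses a plain HMAC-SHA256 signature from the standard library. It is a JWT-like stand-in, not a JWT, and the secret, field names, and function names are all assumptions for the example.

```python
import base64
import hashlib
import hmac
import json
import time

SECRET = b"replace-with-a-real-secret"  # hypothetical; load from a secret store


def _b64(data: bytes) -> str:
    return base64.urlsafe_b64encode(data).rstrip(b"=").decode()


def _unb64(text: str) -> bytes:
    return base64.urlsafe_b64decode(text + "=" * (-len(text) % 4))


def _sign(body: str) -> str:
    return _b64(hmac.new(SECRET, body.encode(), hashlib.sha256).digest())


def make_token(user_id: int, notif_type: str, ttl_seconds: int = 30 * 86_400,
               now: float = None) -> str:
    issued = time.time() if now is None else now
    payload = {"uid": user_id, "type": notif_type, "exp": int(issued + ttl_seconds)}
    body = _b64(json.dumps(payload, separators=(",", ":")).encode())
    return f"{body}.{_sign(body)}"


def verify_token(token: str, now: float = None):
    """Return the payload if the signature is valid and unexpired, else None."""
    body, separator, signature = token.partition(".")
    if not separator:
        return None
    # Constant-time comparison on bytes avoids timing leaks.
    if not hmac.compare_digest(signature.encode(), _sign(body).encode()):
        return None
    payload = json.loads(_unb64(body))
    current = time.time() if now is None else now
    return payload if payload["exp"] >= current else None


token = make_token(42, "marketing", now=1_000)
payload = verify_token(token, now=1_001)
```

The unsubscribe endpoint would call `verify_token` and, on a non-None payload, disable that notification type for `uid`. The signature is checked before the body is decoded, so tampered tokens are rejected without parsing attacker-controlled JSON.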

10. Scaling to Billions of Notifications

At marketing-blast scale (50M users in one campaign), the bottleneck shifts from your services to external provider rate limits.

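FCM: Per-project send quotas apply (exceeding them returns 429), so check the current FCM quota documentation before a large blast. Reuse HTTP/2 connections to the HTTP v1 endpoint. The legacy batch endpoint (up to 500 messages per request) has been deprecated; the Admin SDKs still accept up to 500 messages per call but send them as individual requests.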
Twilio: Default 1 message/second per long code. Use short codes (100 msg/sec) or toll-free numbers for blasts. Pre-register sender pool.
SendGrid: IP warm-up required for new IPs. Dedicated IPs for high volume. Monitor bounce/spam rates to protect sender reputation.
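Your infrastructure: Kafka partition count caps consumer parallelism. If each of 100 push workers sustains 500 sends per second, throughput is 100 × 500 = 50,000 pushes/second and a 50M blast completes in about 1,000 seconds (~17 minutes). Finishing in ~100 seconds, as the 500K/second burst target implies, takes roughly 10× that throughput.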
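
Provider limits such as the per-number SMS rates above are commonly enforced on the sending side with a token bucket. The following is a minimal single-process sketch, not a distributed limiter; the class name and the injectable clock are assumptions made so the example can be tested deterministically.

```python
import time


class TokenBucket:
    """Allow up to `rate_per_second` sends on average, with bursts up to `capacity`."""

    def __init__(self, rate_per_second: float, capacity: float, clock=time.monotonic):
        self.rate = rate_per_second
        self.capacity = capacity
        self.clock = clock
        self.tokens = capacity
        self.updated = clock()

    def try_acquire(self, count: float = 1) -> bool:
        """Take `count` tokens if available; return False if the caller must wait."""
        now = self.clock()
        # Refill in proportion to elapsed time, never above capacity.
        self.tokens = min(self.capacity, self.tokens + (now - self.updated) * self.rate)
        self.updated = now
        if self.tokens >= count:
            self.tokens -= count
            return True
        return False


# Usage with a fake clock: a 100 msg/sec limit, matching the short-code rate above.
fake_now = [0.0]
bucket = TokenBucket(rate_per_second=100, capacity=100, clock=lambda: fake_now[0])
burst = sum(bucket.try_acquire() for _ in range(120))       # 100 succeed
fake_now[0] = 0.5                                           # half a second later
after_wait = sum(bucket.try_acquire() for _ in range(60))   # 50 succeed
```

A worker that gets False back can sleep briefly or re-queue the message, which keeps the provider's own rate limiter from being the first line of defense.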
