1. Requirements Clarification
Notification systems vary significantly based on use case. Drive this conversation with your interviewer to establish boundaries.
Functional Requirements
- Support three channels: push notifications (iOS/Android), SMS, and email
- Trigger notifications from internal services (order placed, OTP request) and scheduled campaigns (marketing blasts)
- User preference management — opt-out per channel, per notification type, DND hours
- Delivery tracking — sent, delivered, read receipts where available
- Template management — dynamic content injection into reusable templates
- Retry on failure with dead-letter handling
Non-Functional Requirements
- Scale: 10 million notifications per day; marketing blasts to 50M users
- Latency: Critical notifications (OTP) delivered in <5 seconds end-to-end
- Reliability: At-least-once delivery for critical types; best-effort for marketing
- Throughput: ~115 notifications/second baseline (10M ÷ 86,400 s), bursts to ~50K/second during blasts
2. High-Level Architecture
Triggering Services (Order, Auth, Marketing)
│
▼
┌─────────────────────┐
│ Notification Service│ ← validates, enriches, deduplicates
│ (API + Fan-out) │
└──────────┬──────────┘
│
┌──────────▼──────────────────────────────┐
│ Message Queue (Kafka) │
│ topics: notif.critical notif.high │
│ notif.normal notif.low │
└──────┬──────────┬──────────┬────────────┘
│ │ │
┌──────▼──┐ ┌─────▼───┐ ┌───▼────────┐
│ Push │ │ SMS │ │ Email │
│ Workers │ │ Workers │ │ Workers │
│FCM/APNs │ │ Twilio │ │ SendGrid │
└──────┬──┘ └─────┬───┘ └───┬────────┘
└──────────┴─────────┘
│
┌──────▼──────┐
│ Delivery Log │ (delivery_logs table + Redis)
└─────────────┘
3. Notification Channels Deep Dive
Each channel has different providers, failure modes, and latency characteristics.
| Channel | Provider | Delivery Latency | Cost | Limitations |
|---|---|---|---|---|
| Push (Android) | FCM (Firebase Cloud Messaging) | 1–3 seconds | Free | Device must have app installed + internet |
| Push (iOS) | APNs (Apple Push Notification service) | 1–3 seconds | Free | Requires Apple developer cert; strict payload size (4KB) |
| SMS | Twilio, AWS SNS, Vonage | 2–10 seconds | $0.0079/msg (Twilio US) | Character limits; spam filters; regulatory compliance (TCPA) |
| Email | SendGrid, AWS SES, Mailgun | 1–60 seconds | $0.0001/email (SES) | Spam filters; unsubscribe compliance (CAN-SPAM) |
Push Notification Flow (FCM)
FCM is the primary channel for mobile apps. The push worker fetches the device token from the devices table and calls the FCM HTTP v1 API. FCM returns immediately with an acceptance; actual delivery to the device is asynchronous.
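As a minimal sketch of the request the push worker sends, the function below builds the URL, headers, and JSON body for one FCM HTTP v1 `messages:send` call without performing any network I/O. The function name `build_fcm_request` and its parameters are illustrative assumptions; the OAuth2 access token is assumed to have been obtained separately from a service account.

```python
import json

# FCM HTTP v1 send endpoint; the project ID is filled in per call.
FCM_SEND_URL = "https://fcm.googleapis.com/v1/projects/{project_id}/messages:send"


def build_fcm_request(project_id, device_token, title, body, access_token):
    """Return (url, headers, payload_bytes) for a single FCM HTTP v1 send.

    Hypothetical helper: the caller is expected to POST the payload and
    treat a 2xx response as acceptance, not as delivery to the device.
    """
    url = FCM_SEND_URL.format(project_id=project_id)
    headers = {
        "Authorization": f"Bearer {access_token}",
        "Content-Type": "application/json; charset=utf-8",
    }
    message = {
        "message": {
            "token": device_token,  # looked up from the devices table
            "notification": {"title": title, "body": body},
        }
    }
    return url, headers, json.dumps(message).encode("utf-8")
```

Because acceptance is not delivery, the worker would typically log the status as sent at this point and update it later if a delivery signal arrives.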
4. Fan-Out Architecture
Fan-out is the process of taking one notification event and dispatching it to potentially millions of recipients (e.g. a marketing blast to all users). Naive fan-out — looping through users in the API handler — will time out and block the system. The correct approach is asynchronous multi-stage fan-out.
Two-Stage Fan-Out
- Stage 1 (Segment Fan-out): The campaign service publishes one event with a user segment ID. A fan-out worker queries the user segment table in batches of 1,000 users and publishes one notification job per batch to Kafka (50M users ÷ 1,000 = 50,000 messages).
- Stage 2 (Channel Dispatch): Channel-specific workers (push, SMS, email) consume from Kafka, look up device tokens / phone numbers / email addresses, check user preferences, and call the external provider API.
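Stage 1 can be sketched as a generator that turns a stream of user IDs into batch jobs of 1,000 users each. This is an illustration under assumptions: `batch_jobs` and the job dictionary shape are hypothetical, and the actual publish to Kafka is left to the caller.

```python
from itertools import islice
from typing import Iterable, Iterator

BATCH_SIZE = 1_000  # batch size used in the two-stage design


def batch_jobs(campaign_id: str, user_ids: Iterable[int],
               batch_size: int = BATCH_SIZE) -> Iterator[dict]:
    """Yield one job per batch of user IDs, never holding the full segment in memory."""
    it = iter(user_ids)
    while True:
        batch = list(islice(it, batch_size))
        if not batch:
            return
        # Each job would become one Kafka message for the channel workers.
        yield {"campaign_id": campaign_id, "user_ids": batch}
```

For a 50M-user segment this yields 50,000 jobs, so the API handler only ever publishes the single campaign event and the heavy iteration happens in the worker.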
Campaign: "Send promo to all 50M users"
│
▼
Fan-out Worker
reads users in batches of 1,000
│
▼ (publishes 50,000 Kafka messages)
┌──────────────────────────────────────┐
│ topic: notif.marketing │
│ 50,000 messages × 1,000 users each │
└──────┬───────────────────────────────┘
│
▼ (100 parallel push workers)
Push Workers call FCM
→ ~50,000 FCM sends/second (100 workers × 500 sends each per second)
→ Full blast completes in ~17 minutes (50M ÷ 50,000/s)
Interview Tip — Fan-out on Read vs Write
For social notifications (e.g. "X liked your post"), fan-out on write means pre-computing recipient lists at event time. For marketing, fan-out on read (query the segment at send time) is preferred so you always use the current user set. Most systems use a hybrid: write fan-out for small social graphs, read fan-out for large broadcasts.
5. Priority Queue Design
Not all notifications are equal. An OTP code that expires in 60 seconds must not be delayed behind a marketing email. Use separate Kafka topics per priority tier.
| Priority Tier | Examples | Kafka Topic | Workers | Rate Limit |
|---|---|---|---|---|
| CRITICAL | OTP, fraud alert, password reset | notif.critical | Dedicated (always-on) | None |
| HIGH | Order shipped, payment failed | notif.high | Dedicated pool | None for individual |
| NORMAL | Social activity, reminders | notif.normal | Shared pool | 50/user/day |
| LOW | Marketing, newsletters, promotions | notif.low | Shared pool (lower priority) | 5/user/day |
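The table can be read as a routing function: map a notification type to a tier, apply the tier's per-user daily cap, and pick the topic. The sketch below assumes a hypothetical `TIER_BY_TYPE` mapping and a `sent_today` counter supplied by the caller; only the topics and caps come from the table.

```python
# Topic and per-user daily cap for each tier, taken from the table above.
TOPIC_BY_TIER = {
    "CRITICAL": "notif.critical",
    "HIGH": "notif.high",
    "NORMAL": "notif.normal",
    "LOW": "notif.low",
}
DAILY_CAP_BY_TIER = {"CRITICAL": None, "HIGH": None, "NORMAL": 50, "LOW": 5}

# Hypothetical mapping from notification type to tier.
TIER_BY_TYPE = {
    "otp": "CRITICAL",
    "fraud_alert": "CRITICAL",
    "order_shipped": "HIGH",
    "social_activity": "NORMAL",
    "marketing": "LOW",
}


def route(notif_type, sent_today):
    """Return the Kafka topic to publish to, or None if the user's daily cap is reached."""
    tier = TIER_BY_TYPE.get(notif_type, "NORMAL")  # unknown types default to NORMAL
    cap = DAILY_CAP_BY_TIER[tier]
    if cap is not None and sent_today >= cap:
        return None
    return TOPIC_BY_TIER[tier]
```

Critical and high tiers have no cap here, so an OTP is routed to `notif.critical` regardless of how many messages the user has already received that day.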
6. Retry Logic and Dead-Letter Queues
External providers (FCM, Twilio, SendGrid) fail transiently. A robust retry strategy ensures high delivery rates without hammering providers during outages.
- Transient failures (5xx, timeout, rate limit): Retry with exponential backoff — 1s, 5s, 30s, 5min, 30min. Max 5 retries.
- Permanent failures (invalid token, unsubscribed, invalid phone): Do not retry. Mark as FAILED_PERMANENT. Clean up the bad device/contact record.
- Dead-letter queue (DLQ): After max retries, move the message to a DLQ topic. A separate DLQ worker inspects failures, alerts on-call engineers if failure rate exceeds 1%, and can replay messages after the provider recovers.
Watch Out — Retry Storms
If FCM has a 10-minute outage, millions of retries will queue up and hit simultaneously when FCM recovers. Add jitter (±30% randomness) to backoff delays to spread the retry load. Without jitter, the thundering herd will immediately trigger another outage.
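The backoff schedule and jitter described above can be combined in a few lines. This is a sketch, not a definitive implementation: `retry_delay` is a hypothetical helper, and returning `None` is an assumed convention for "retries exhausted, send to the DLQ".

```python
import random

BASE_DELAYS_S = [1, 5, 30, 300, 1800]  # 1s, 5s, 30s, 5min, 30min
JITTER = 0.30                          # ±30% randomness


def retry_delay(attempt, rng=random):
    """Return the jittered delay in seconds before retry number `attempt` (1-based).

    Returns None once the schedule is exhausted, signalling that the
    message should be moved to the dead-letter queue.
    """
    if attempt < 1:
        raise ValueError("attempt is 1-based")
    if attempt > len(BASE_DELAYS_S):
        return None
    base = BASE_DELAYS_S[attempt - 1]
    # Spread retries uniformly across [0.7 * base, 1.3 * base].
    return base * rng.uniform(1 - JITTER, 1 + JITTER)
```

Passing a seeded `random.Random` instance as `rng` makes the delays reproducible in tests, while production code can rely on the default module-level generator.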
7. Deduplication
At-least-once delivery from Kafka means the same message can be processed twice — e.g. a worker crashes after calling FCM but before committing the Kafka offset. Deduplication prevents the user from receiving duplicate notifications.
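One common way to deduplicate is an idempotency key built from user_id, notification type, and event ID, claimed atomically before the send and expired after the retry window (for example 24 hours). The sketch below uses an in-memory stand-in so it runs without a server; with Redis the same claim would be a single `SET key 1 NX EX ttl` command. The class and function names are hypothetical.

```python
import time


class InMemoryDedupStore:
    """Stand-in for Redis: set_if_absent mimics SET key 1 NX EX ttl."""

    def __init__(self):
        self._expiry = {}  # key -> absolute expiry timestamp

    def set_if_absent(self, key, ttl_s, now=None):
        now = time.time() if now is None else now
        expires_at = self._expiry.get(key)
        if expires_at is not None and expires_at > now:
            return False  # key still live: someone already claimed it
        self._expiry[key] = now + ttl_s
        return True


def dedup_key(user_id, notif_type, event_id):
    return f"dedup:{user_id}:{notif_type}:{event_id}"


def should_send(store, user_id, notif_type, event_id, ttl_s=24 * 3600, now=None):
    """Return True only for the first claim of this key within the TTL."""
    return store.set_if_absent(dedup_key(user_id, notif_type, event_id), ttl_s, now)
```

Claiming the key before calling the provider trades a possible duplicate for a possible drop if the worker crashes between the claim and the send, so critical types may prefer to claim the key only after the provider accepts the message.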
8. Database Schema
The schema covers device registration, notification templates, and delivery tracking.
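As a rough illustration of those three areas, the sketch below creates a possible set of tables in an in-memory SQLite database. Every table and column name except `devices` and `delivery_logs` is an assumption made for this example, and a production system would likely use a different database and richer types.

```python
import sqlite3

# Hypothetical schema: one row per device, per template, and per delivery attempt.
SCHEMA = """
CREATE TABLE devices (
    id         INTEGER PRIMARY KEY,
    user_id    INTEGER NOT NULL,
    platform   TEXT    NOT NULL CHECK (platform IN ('ios', 'android')),
    token      TEXT    NOT NULL UNIQUE,
    updated_at TEXT    NOT NULL DEFAULT CURRENT_TIMESTAMP
);
CREATE INDEX idx_devices_user ON devices (user_id);

CREATE TABLE notification_templates (
    id      INTEGER PRIMARY KEY,
    name    TEXT NOT NULL UNIQUE,
    channel TEXT NOT NULL CHECK (channel IN ('push', 'sms', 'email')),
    body    TEXT NOT NULL
);

CREATE TABLE delivery_logs (
    id          INTEGER PRIMARY KEY,
    user_id     INTEGER NOT NULL,
    template_id INTEGER REFERENCES notification_templates (id),
    channel     TEXT    NOT NULL,
    status      TEXT    NOT NULL,
    retry_count INTEGER NOT NULL DEFAULT 0,
    created_at  TEXT    NOT NULL DEFAULT CURRENT_TIMESTAMP
);
CREATE INDEX idx_delivery_logs_user ON delivery_logs (user_id, created_at);
"""


def create_schema(conn):
    """Create the example tables on an open sqlite3 connection."""
    conn.executescript(SCHEMA)
```

Storing one row per device allows multiple devices per user, and the unique constraint on `token` keeps a reinstalled app from registering the same token twice.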
9. User Preference Management
Users must be able to opt out of specific notification types and channels. Respecting these preferences is both a UX requirement and a legal requirement (CAN-SPAM, GDPR, TCPA).
- Store preferences in a user_preferences table: (user_id, notif_type, channel, enabled, dnd_start, dnd_end, timezone)
- Cache preferences in Redis on first lookup with a 1-hour TTL. Invalidate on user update.
- Check preferences in the channel worker before dispatching. If enabled = false, log status as SKIPPED and stop.
- For DND hours: convert the current UTC time to the user's timezone. If within DND window, delay the notification to DND end time using a scheduled re-queue.
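The DND check can be sketched as a function that returns how long to delay a notification. It is an illustration under assumptions: `dnd_delay` is a hypothetical name, the window may wrap past midnight (for example 22:00 to 07:00), and the timezone is passed as a `tzinfo` object, which in practice could be built from the stored timezone name with `zoneinfo.ZoneInfo`.

```python
from datetime import datetime, time, timedelta, timezone, tzinfo


def dnd_delay(now_utc: datetime, dnd_start: time, dnd_end: time,
              user_tz: tzinfo) -> timedelta:
    """Return the delay until DND ends, or timedelta(0) if outside the window.

    now_utc must be timezone-aware; dnd_start and dnd_end are naive local times.
    """
    local = now_utc.astimezone(user_tz)
    t = local.time()
    if dnd_start == dnd_end:
        return timedelta(0)  # empty window: DND effectively disabled
    if dnd_start > dnd_end:  # window wraps past midnight
        in_window = t >= dnd_start or t < dnd_end
    else:
        in_window = dnd_start <= t < dnd_end
    if not in_window:
        return timedelta(0)
    # Next occurrence of dnd_end on the user's local clock.
    end_local = local.replace(hour=dnd_end.hour, minute=dnd_end.minute,
                              second=0, microsecond=0)
    if end_local <= local:
        end_local += timedelta(days=1)
    # Compare in UTC so the delay is an absolute duration.
    return end_local.astimezone(timezone.utc) - now_utc
```

A worker would re-queue the notification with this delay when it is non-zero, and dispatch immediately otherwise.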
Best Practice — Unsubscribe One-Click
Per Google/Yahoo's 2024 email requirements, bulk senders must support one-click List-Unsubscribe (RFC 8058). Implement a signed unsubscribe token in every email: https://yourapp.com/unsubscribe?token=signed_jwt. The token encodes user_id + notif_type + expiry. On click, update user_preferences without requiring login.
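A signed unsubscribe token could look like the sketch below. To stay within the standard library it uses a plain HMAC-SHA256 signature over a JSON payload rather than a standards-compliant JWT, so treat it as an illustration of the idea; the secret, function names, and token layout are all assumptions.

```python
import base64
import hashlib
import hmac
import json
import time

SECRET = b"replace-with-a-real-secret"  # hypothetical; load from a secret store


def _b64(data: bytes) -> str:
    return base64.urlsafe_b64encode(data).rstrip(b"=").decode("ascii")


def _unb64(text: str) -> bytes:
    return base64.urlsafe_b64decode(text + "=" * (-len(text) % 4))


def make_token(user_id, notif_type, ttl_s, now=None, secret=SECRET):
    """Encode user_id + notif_type + expiry and sign the payload."""
    now = int(time.time()) if now is None else int(now)
    claims = {"user_id": user_id, "notif_type": notif_type, "exp": now + ttl_s}
    payload = json.dumps(claims, separators=(",", ":"), sort_keys=True).encode()
    sig = hmac.new(secret, payload, hashlib.sha256).digest()
    return f"{_b64(payload)}.{_b64(sig)}"


def verify_token(token, now=None, secret=SECRET):
    """Return the claims if the signature is valid and unexpired, else None."""
    now = int(time.time()) if now is None else int(now)
    try:
        payload_b64, sig_b64 = token.split(".")
        payload, sig = _unb64(payload_b64), _unb64(sig_b64)
    except ValueError:
        return None  # malformed token
    expected = hmac.new(secret, payload, hashlib.sha256).digest()
    if not hmac.compare_digest(sig, expected):
        return None
    claims = json.loads(payload)
    return claims if claims["exp"] >= now else None
```

The unsubscribe handler would call `verify_token` and, on success, update user_preferences for the returned user_id and notif_type without requiring a login.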
10. Scaling to Billions of Notifications
At marketing-blast scale (50M users in one campaign), the bottleneck shifts from your services to external provider rate limits.
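- FCM: Sends are subject to per-project quotas, so check the current FCM documentation and request an increase before a large blast. Reuse HTTP/2 connections to the FCM HTTP v1 API and group sends in chunks of up to 500 messages, the limit the Firebase Admin SDK accepts per multi-send call.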
- Twilio: Default 1 message/second per long code. Use short codes (100 msg/sec) or toll-free numbers for blasts. Pre-register sender pool.
- SendGrid: IP warm-up required for new IPs. Dedicated IPs for high volume. Monitor bounce/spam rates to protect sender reputation.
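- Your infrastructure: Kafka partitioning = parallelism. If each of 100 push workers sends one 500-message batch per second, throughput is 100 × 500 = 50,000 pushes/second, and a 50M blast completes in 50,000,000 ÷ 50,000 = 1,000 seconds (~17 minutes).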
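The capacity arithmetic is easy to keep honest with a small helper. This sketch assumes each worker completes a fixed number of batches per second (one by default), which is a simplification; real throughput depends on provider latency and quotas. The function names are hypothetical.

```python
def throughput_per_s(workers, batch_size, batches_per_worker_per_s=1.0):
    """Aggregate sends per second across all workers."""
    return workers * batch_size * batches_per_worker_per_s


def blast_duration_s(recipients, workers, batch_size, batches_per_worker_per_s=1.0):
    """Seconds needed to send one message to every recipient."""
    return recipients / throughput_per_s(workers, batch_size, batches_per_worker_per_s)


# 100 workers x 500 messages per batch, one batch per worker per second.
seconds = blast_duration_s(50_000_000, workers=100, batch_size=500)
print(f"{seconds:.0f} s (~{seconds / 60:.0f} min)")
```

Doubling the worker count or the batches each worker completes per second halves the duration, which is the lever to pull once provider limits allow it.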
Frequently Asked Questions — Notification System Design
Fan-out means sending one event to thousands or millions of recipients. Use a message queue (Kafka or SQS) as the fan-out backbone. The notification service publishes one event; worker pools subscribe and dispatch per-channel (push, SMS, email). For very large fan-outs (e.g. broadcasting to 10M users), pre-segment users into shards and process each shard in parallel with dedicated workers.
Use a deduplication key (idempotency key) stored in Redis with a TTL equal to your retry window (e.g. 24 hours). Before dispatching, check Redis for the key. If it exists, skip the send. The key is composed of: user_id + notification_type + event_id. This prevents duplicates both from retries and from accidental double-publishes.
Use separate Kafka topics or SQS queues per priority tier: CRITICAL (OTP, fraud alerts), HIGH (order updates), NORMAL (social), LOW (marketing). Critical queues have dedicated workers and bypass rate limits. Low-priority queues share workers and respect per-user rate limits. This ensures OTPs always arrive in seconds even when the system is processing millions of marketing messages.
Store preferences in a user_preferences table keyed by (user_id, notification_type, channel). Cache the preferences in Redis (TTL 1 hour) since they are read on every notification dispatch but rarely updated. Provide a preference center UI. Before dispatching, load preferences and skip channels the user has opted out of. Respect Do Not Disturb hours using the user's stored timezone.
Use exponential backoff with jitter: retry at 1s, 5s, 30s, 5min, and 30min, for a maximum of 5 retries on transient failures (network timeouts, 5xx responses, rate limits). Do not retry permanent failures (invalid token, unsubscribed email). Store retry state in the delivery_logs table. After max retries, move the message to the dead-letter queue, mark it as FAILED, and trigger an alert if the failure rate exceeds 1%.
Device tokens expire and change when users reinstall apps. When FCM/APNs returns a "registration not found" or "invalid token" error, immediately delete the token from your devices table. When a user logs in on a new device, upsert the token (insert or update by user_id + platform). Support multiple devices per user by storing one row per device. Tokens should be refreshed in the app on every launch.