1. Requirements Clarification

Notification systems vary significantly based on use case. Drive this conversation with your interviewer to establish boundaries.

Functional Requirements

  • Support three channels: push notifications (iOS/Android), SMS, and email
  • Trigger notifications from internal services (order placed, OTP request) and scheduled campaigns (marketing blasts)
  • User preference management: opt-out per channel and per notification type, plus do-not-disturb (DND) hours
  • Delivery tracking — sent, delivered, read receipts where available
  • Template management — dynamic content injection into reusable templates
  • Retry on failure with dead-letter handling

Non-Functional Requirements

  • Scale: 10 million notifications per day at baseline, plus marketing blasts to up to 50M users per campaign
  • Latency: Critical notifications (OTP) delivered in <5 seconds end-to-end
  • Reliability: At-least-once delivery for critical types; best-effort for marketing
  • Throughput: ~115 notifications/second baseline (10M ÷ 86,400 s), bursts to ~50K/second during blasts (the rate the worker pool in section 10 is sized for)
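
The baseline figure is simple arithmetic on the daily volume. A quick check, assuming traffic is spread evenly across the day (real traffic is peakier, so treat this as a floor for capacity planning):

```python
SECONDS_PER_DAY = 24 * 60 * 60  # 86,400


def average_rate(per_day: int) -> float:
    """Average events per second for a given daily volume."""
    return per_day / SECONDS_PER_DAY


baseline = average_rate(10_000_000)
print(f"baseline: {baseline:.1f} notifications/second")
```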

2. High-Level Architecture

  Triggering Services (Order, Auth, Marketing)
              │
              ▼
    ┌─────────────────────┐
    │  Notification Service│  ← validates, enriches, deduplicates
    │  (API + Fan-out)     │
    └──────────┬──────────┘
               │
    ┌──────────▼──────────────────────────────┐
    │           Message Queue (Kafka)          │
    │  topics: notif.critical  notif.high      │
    │          notif.normal    notif.low       │
    └──────┬──────────┬──────────┬────────────┘
           │          │          │
    ┌──────▼──┐ ┌─────▼───┐ ┌───▼────────┐
    │  Push   │ │   SMS   │ │   Email    │
    │ Workers │ │ Workers │ │  Workers   │
    │FCM/APNs │ │ Twilio  │ │ SendGrid   │
    └──────┬──┘ └─────┬───┘ └───┬────────┘
           └──────────┴─────────┘
                      │
               ┌──────▼──────┐
               │ Delivery Log │  (delivery_logs table + Redis)
               └─────────────┘

3. Notification Channels Deep Dive

Each channel has different providers, failure modes, and latency characteristics.

Channel        | Provider                               | Delivery Latency | Cost                    | Limitations
---------------+----------------------------------------+------------------+-------------------------+------------------------------------------------------------
Push (Android) | FCM (Firebase Cloud Messaging)         | 1–3 seconds      | Free                    | Device must have app installed + internet
Push (iOS)     | APNs (Apple Push Notification service) | 1–3 seconds      | Free                    | Requires Apple developer cert; strict payload size (4KB)
SMS            | Twilio, AWS SNS, Vonage                | 2–10 seconds     | $0.0079/msg (Twilio US) | Character limits; spam filters; regulatory compliance (TCPA)
Email          | SendGrid, AWS SES, Mailgun             | 1–60 seconds     | $0.0001/email (SES)     | Spam filters; unsubscribe compliance (CAN-SPAM)

Push Notification Flow (FCM)

FCM is the primary channel for mobile apps. The push worker fetches the device token from the devices table and calls the FCM HTTP v1 API. FCM returns immediately with an acceptance; actual delivery to the device is asynchronous.

FCM HTTP v1 API: send push

    POST https://fcm.googleapis.com/v1/projects/{project_id}/messages:send

    {
      "message": {
        "token": "device_registration_token_here",
        "notification": {
          "title": "Your order has shipped!",
          "body": "Order #1234 is on its way. Track it here."
        },
        "data": {
          "order_id": "1234",
          "deep_link": "myapp://orders/1234"
        },
        "android": { "priority": "high" },
        "apns": { "headers": { "apns-priority": "10" } }
      }
    }

    // Response 200: message accepted by FCM
    // Response 404: device token no longer valid, DELETE from DB
    // Response 429: FCM rate limit, back off and retry
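
A push worker needs two small pieces around this call: something that builds the request body and something that turns the HTTP status into a next step. The sketch below is illustrative only; `build_fcm_message` and `next_action` are hypothetical helper names, the 5xx handling follows the retry rules in section 6, and treating every other status as a permanent failure is an assumption. It makes no network calls.

```python
import json


def build_fcm_message(token: str, title: str, body: str, data: dict) -> dict:
    """Build the JSON body for the FCM HTTP v1 messages:send request."""
    return {
        "message": {
            "token": token,
            "notification": {"title": title, "body": body},
            # FCM data payload values must be strings.
            "data": {key: str(value) for key, value in data.items()},
            "android": {"priority": "high"},
            "apns": {"headers": {"apns-priority": "10"}},
        }
    }


def next_action(status_code: int) -> str:
    """Map an FCM HTTP status to what the worker does next (names are hypothetical)."""
    if status_code == 200:
        return "mark_sent"            # accepted by FCM; delivery is still async
    if status_code == 404:
        return "delete_token"         # token no longer valid
    if status_code == 429 or status_code >= 500:
        return "retry_with_backoff"   # rate limit or transient server error
    return "fail_permanent"           # assumption: other statuses are not retried


example = build_fcm_message(
    "device_registration_token_here",
    "Your order has shipped!",
    "Order #1234 is on its way. Track it here.",
    {"order_id": 1234, "deep_link": "myapp://orders/1234"},
)
print(json.dumps(example, indent=2))
```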

4. Fan-Out Architecture

Fan-out is the process of taking one notification event and dispatching it to potentially millions of recipients (e.g. a marketing blast to all users). Naive fan-out — looping through users in the API handler — will time out and block the system. The correct approach is asynchronous multi-stage fan-out.

Two-Stage Fan-Out

  1. Stage 1 (Segment Fan-out): The campaign service publishes one event with a user segment ID. A fan-out worker pages through the user segment table in batches of 1,000 users and publishes one Kafka message per batch, so a 50M-user segment becomes 50,000 messages.
  2. Stage 2 (Channel Dispatch): Channel-specific workers (push, SMS, email) consume from Kafka, look up device tokens / phone numbers / email addresses, check user preferences, and call the external provider API.
Campaign: "Send promo to all 50M users"
    │
    ▼
Fan-out Worker
    reads users in batches of 1,000
    │
    ▼ (publishes 50,000 Kafka messages)
┌──────────────────────────────────────┐
│  topic: notif.low (marketing tier)   │
│  50,000 messages × 1,000 users       │
└──────┬───────────────────────────────┘
       │
       ▼ (100 parallel push workers)
  Push Workers call FCM
  → ~50,000 FCM calls/second
  → Full blast completes in ~17 minutes
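
Stage 1 in the diagram above can be sketched as a generator that slices the segment into fixed-size batches and publishes one message per batch. This is a minimal sketch: `fan_out` and the `publish` callback are hypothetical stand-ins for a real Kafka producer, and a production worker would page through the database (for example by primary key) rather than hold user IDs in memory.

```python
from itertools import islice
from typing import Callable, Iterable, Iterator, List


def batches(user_ids: Iterable[int], size: int = 1000) -> Iterator[List[int]]:
    """Yield consecutive lists of at most `size` user IDs."""
    iterator = iter(user_ids)
    while True:
        chunk = list(islice(iterator, size))
        if not chunk:
            return
        yield chunk


def fan_out(campaign_id: str, user_ids: Iterable[int],
            publish: Callable[[dict], None], size: int = 1000) -> int:
    """Publish one message per batch; return how many messages were published."""
    published = 0
    for chunk in batches(user_ids, size):
        publish({"campaign_id": campaign_id, "user_ids": chunk})
        published += 1
    return published


# Usage: collect messages in a list instead of sending them to Kafka.
outbox: List[dict] = []
count = fan_out("promo-1", range(2500), outbox.append)
print(count, [len(message["user_ids"]) for message in outbox])
```

Keeping the batch size as a parameter makes it easy to trade message count against per-message processing time in the channel workers.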

Interview Tip — Fan-out on Read vs Write

For social notifications (e.g. "X liked your post"), fan-out on write means pre-computing recipient lists at event time. For marketing, fan-out on read (query the segment at send time) is preferred so you always use the current user set. Most systems use a hybrid: write fan-out for small social graphs, read fan-out for large broadcasts.

5. Priority Queue Design

Not all notifications are equal. An OTP code that expires in 60 seconds must not be delayed behind a marketing email. Use separate Kafka topics per priority tier.

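Priority Tier | Examples                           | Kafka Topic    | Workers                      | Rate Limit
--------------+------------------------------------+----------------+------------------------------+--------------------
CRITICAL      | OTP, fraud alert, password reset   | notif.critical | Dedicated (always-on)        | None
HIGH          | Order shipped, payment failed      | notif.high     | Dedicated pool               | None for individual
NORMAL        | Social activity, reminders         | notif.normal   | Shared pool                  | 50/user/day
LOW           | Marketing, newsletters, promotions | notif.low      | Shared pool (lower priority) | 5/user/day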
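
The routing rule in the table is small enough to express as a lookup. A minimal sketch, assuming the per-user daily counter is tracked elsewhere (for example in Redis) and passed in as `sent_today`; `route` is a hypothetical helper, and reading "None for individual" as "no per-user cap" is an assumption.

```python
from enum import Enum
from typing import Optional


class Priority(Enum):
    CRITICAL = "CRITICAL"
    HIGH = "HIGH"
    NORMAL = "NORMAL"
    LOW = "LOW"


TOPIC_BY_PRIORITY = {
    Priority.CRITICAL: "notif.critical",
    Priority.HIGH: "notif.high",
    Priority.NORMAL: "notif.normal",
    Priority.LOW: "notif.low",
}

# Per-user daily caps from the table; tiers not listed here are uncapped.
DAILY_CAP_PER_USER = {Priority.NORMAL: 50, Priority.LOW: 5}


def route(priority: Priority, sent_today: int) -> Optional[str]:
    """Return the Kafka topic to publish to, or None if the user's cap is reached."""
    cap = DAILY_CAP_PER_USER.get(priority)
    if cap is not None and sent_today >= cap:
        return None
    return TOPIC_BY_PRIORITY[priority]


print(route(Priority.CRITICAL, sent_today=10_000))
print(route(Priority.LOW, sent_today=5))
```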

6. Retry Logic and Dead-Letter Queues

External providers (FCM, Twilio, SendGrid) fail transiently. A robust retry strategy ensures high delivery rates without hammering providers during outages.

  • Transient failures (5xx, timeout, rate limit): Retry with exponential backoff — 1s, 5s, 30s, 5min, 30min. Max 5 retries.
  • Permanent failures (invalid token, unsubscribed, invalid phone): Do not retry. Mark as FAILED_PERMANENT. Clean up the bad device/contact record.
  • Dead-letter queue (DLQ): After max retries, move the message to a DLQ topic. A separate DLQ worker inspects failures, alerts on-call engineers if failure rate exceeds 1%, and can replay messages after the provider recovers.
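
The three rules above reduce to one decision per failed attempt. The sketch below is one way to encode it; the error labels and the `decide` helper are hypothetical, and `retries_done` counts retries already made for this message.

```python
from typing import Optional, Tuple

# Backoff schedule from the text: 1s, 5s, 30s, 5min, 30min (max 5 retries).
BACKOFF_SECONDS = [1, 5, 30, 300, 1800]

# Hypothetical labels for errors that should never be retried.
PERMANENT_ERRORS = {"invalid_token", "unsubscribed", "invalid_phone"}


def decide(error: str, retries_done: int) -> Tuple[str, Optional[int]]:
    """Return (action, delay_seconds) for a failed delivery attempt."""
    if error in PERMANENT_ERRORS:
        return ("FAILED_PERMANENT", None)   # do not retry; clean up the record
    if retries_done >= len(BACKOFF_SECONDS):
        return ("DLQ", None)                # retries exhausted; park for inspection
    return ("RETRY", BACKOFF_SECONDS[retries_done])


print(decide("timeout", 0))
print(decide("timeout", 5))
print(decide("invalid_token", 0))
```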

Watch Out — Retry Storms

If FCM has a 10-minute outage, millions of retries will queue up and hit simultaneously when FCM recovers. Add jitter (±30% randomness) to backoff delays to spread the retry load. Without jitter, the thundering herd can overload the provider, or your own rate limits, the moment it comes back.
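
Jitter is a one-line change to the delay calculation. A minimal sketch, assuming uniform jitter of ±30% around each base delay; `jittered` is a hypothetical helper, and the `rng` parameter exists only so tests can pass a seeded generator.

```python
import random


def jittered(delay_s: float, spread: float = 0.30, rng=random) -> float:
    """Scale a base delay by a random factor in [1 - spread, 1 + spread]."""
    return delay_s * rng.uniform(1 - spread, 1 + spread)


# Each worker picks a slightly different delay, so retries spread out
# instead of landing at the same instant.
for base in (1, 5, 30, 300, 1800):
    print(base, round(jittered(base), 2))
```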

7. Deduplication

At-least-once delivery from Kafka means the same message can be processed twice — e.g. a worker crashes after calling FCM but before committing the Kafka offset. Deduplication prevents the user from receiving duplicate notifications.

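Redis: deduplication check (Python, redis-py)

    import redis

    r = redis.Redis()  # assumes a reachable Redis instance

    def dispatch_once(event_id, user_id, channel, send):
        # Dedup key: composed of event_id + user_id + channel
        dedup_key = f"notif:sent:{event_id}:{user_id}:{channel}"
        # Claim the key atomically BEFORE dispatching. SET with NX returns None
        # when the key already exists, so two workers cannot both pass the check
        # (a separate EXISTS followed by SET leaves that race open).
        if not r.set(dedup_key, "1", nx=True, ex=86400):  # expire after 24h
            return False  # already sent or in flight, skip
        try:
            send()  # hypothetical callable that performs the provider call
        except Exception:
            r.delete(dedup_key)  # release the claim so a retry can go through
            raise
        return True

    # The 24h window covers the retry period: any redelivery within 24h sees
    # the key and is skipped. Trade-off: a crash between the claim and send()
    # suppresses that message until the key expires.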

8. Database Schema

The schema covers device registration, notification templates, and delivery tracking.

SQL: core notification tables (MySQL)

    CREATE TABLE devices (
      id          BIGINT UNSIGNED NOT NULL AUTO_INCREMENT,
      user_id     BIGINT UNSIGNED NOT NULL,
      platform    ENUM('ios','android','web') NOT NULL,
      token       VARCHAR(512) NOT NULL,
      app_version VARCHAR(20) NULL,
      created_at  DATETIME NOT NULL DEFAULT CURRENT_TIMESTAMP,
      updated_at  DATETIME NOT NULL DEFAULT CURRENT_TIMESTAMP ON UPDATE CURRENT_TIMESTAMP,
      PRIMARY KEY (id),
      -- Uniqueness is enforced on the first 64 characters of the token only.
      UNIQUE KEY uk_user_platform_token (user_id, platform, token(64)),
      KEY idx_user_id (user_id)
    );

    CREATE TABLE notification_templates (
      id         INT UNSIGNED NOT NULL AUTO_INCREMENT,
      type       VARCHAR(60) NOT NULL,  -- 'ORDER_SHIPPED', 'OTP', etc.
      channel    ENUM('push','sms','email') NOT NULL,
      title_tmpl VARCHAR(200) NULL,
      body_tmpl  TEXT NOT NULL,
      priority   ENUM('CRITICAL','HIGH','NORMAL','LOW') NOT NULL DEFAULT 'NORMAL',
      PRIMARY KEY (id),
      UNIQUE KEY uk_type_channel (type, channel)
    );

    CREATE TABLE delivery_logs (
      id              BIGINT UNSIGNED NOT NULL AUTO_INCREMENT,
      notification_id VARCHAR(64) NOT NULL,  -- idempotency key
      user_id         BIGINT UNSIGNED NOT NULL,
      channel         ENUM('push','sms','email') NOT NULL,
      status          ENUM('QUEUED','SENT','DELIVERED','FAILED','SKIPPED') NOT NULL,
      provider_msg_id VARCHAR(200) NULL,
      error_code      VARCHAR(60) NULL,
      created_at      DATETIME NOT NULL DEFAULT CURRENT_TIMESTAMP,
      updated_at      DATETIME NOT NULL DEFAULT CURRENT_TIMESTAMP ON UPDATE CURRENT_TIMESTAMP,
      -- MySQL requires every unique key, including the primary key, to contain
      -- the partitioning column, so created_at is part of the primary key.
      PRIMARY KEY (id, created_at),
      KEY idx_notification_id (notification_id),
      KEY idx_user_status (user_id, status)
    )
    PARTITION BY RANGE (YEAR(created_at)) (
      PARTITION p2025   VALUES LESS THAN (2026),
      PARTITION p2026   VALUES LESS THAN (2027),
      PARTITION pfuture VALUES LESS THAN MAXVALUE
    );
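
The `body_tmpl` column holds the reusable text, and the worker fills it in at send time. The schema does not say which placeholder syntax the templates use, so the sketch below assumes `$name` placeholders and Python's standard `string.Template`; `render` is a hypothetical helper.

```python
from string import Template


def render(body_tmpl: str, variables: dict) -> str:
    """Fill a template; raises KeyError if a placeholder has no value."""
    return Template(body_tmpl).substitute(variables)


print(render("Order #$order_id is on its way.", {"order_id": "1234"}))
```

Failing loudly on a missing variable is a deliberate choice here: a notification with a blank order number is usually worse than one that is held back and logged.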

9. User Preference Management

Users must be able to opt out of specific notification types and channels. Respecting these preferences is both a UX requirement and a legal requirement (CAN-SPAM, GDPR, TCPA).

  • Store preferences in a user_preferences table: (user_id, notif_type, channel, enabled, dnd_start, dnd_end, timezone)
  • Cache preferences in Redis on first lookup with a 1-hour TTL. Invalidate on user update.
  • Check preferences in the channel worker before dispatching. If enabled = false, log status as SKIPPED and stop.
  • For DND hours: convert the current UTC time to the user's timezone. If within DND window, delay the notification to DND end time using a scheduled re-queue.
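
The DND rule has two easy-to-miss cases: windows that wrap past midnight (22:00 to 07:00) and computing the delay in real elapsed time rather than wall-clock time. A minimal sketch, assuming the caller passes a timezone-aware UTC timestamp and a `tzinfo` object for the user; `dnd_delay` is a hypothetical helper.

```python
from datetime import datetime, time, timedelta, timezone, tzinfo


def dnd_delay(now_utc: datetime, user_tz: tzinfo,
              dnd_start: time, dnd_end: time) -> timedelta:
    """Return how long to hold a notification; zero if outside the DND window."""
    local = now_utc.astimezone(user_tz)
    t = local.time()
    if dnd_start <= dnd_end:
        in_dnd = dnd_start <= t < dnd_end
    else:
        # Window wraps past midnight, e.g. 22:00 to 07:00.
        in_dnd = t >= dnd_start or t < dnd_end
    if not in_dnd:
        return timedelta(0)
    end_local = local.replace(hour=dnd_end.hour, minute=dnd_end.minute,
                              second=0, microsecond=0)
    if end_local <= local:
        end_local += timedelta(days=1)  # DND ends tomorrow in local time
    # Subtract in UTC so the result is real elapsed time.
    return end_local.astimezone(timezone.utc) - now_utc


# Usage with a fixed UTC-5 offset: 04:30 UTC is 23:30 local, inside 22:00-07:00.
user_tz = timezone(timedelta(hours=-5))
now = datetime(2025, 1, 15, 4, 30, tzinfo=timezone.utc)
print(dnd_delay(now, user_tz, time(22, 0), time(7, 0)))
```

With an IANA name stored in the timezone column, the same function accepts `zoneinfo.ZoneInfo("America/New_York")` in place of the fixed offset.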

Best Practice — Unsubscribe One-Click

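Per Google/Yahoo's 2024 email requirements, bulk senders must support one-click unsubscribe (RFC 8058). Include a signed unsubscribe URL in every email, e.g. https://yourapp.com/unsubscribe?token=signed_jwt, and advertise it in the List-Unsubscribe header together with List-Unsubscribe-Post: List-Unsubscribe=One-Click. The token encodes user_id + notif_type + expiry. When the mailbox provider sends the one-click POST (or the user clicks the link), update user_preferences without requiring login.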
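
The sketch below shows the shape of such a token and the two headers RFC 8058 relies on. To stay within the standard library it uses a plain HMAC-SHA256 signed payload as a stand-in for a JWT; in practice you would likely use a JWT library. `SECRET`, `make_token`, `verify_token`, and `unsubscribe_headers` are hypothetical names.

```python
import base64
import hashlib
import hmac
import json
import time

SECRET = b"replace-with-a-real-secret"  # hypothetical; load from a secret store


def _b64(data: bytes) -> str:
    return base64.urlsafe_b64encode(data).rstrip(b"=").decode()


def _unb64(text: str) -> bytes:
    return base64.urlsafe_b64decode(text + "=" * (-len(text) % 4))


def _sign(payload: str) -> str:
    return _b64(hmac.new(SECRET, payload.encode(), hashlib.sha256).digest())


def make_token(user_id: int, notif_type: str, ttl_s: int = 30 * 86400, now=None) -> str:
    """Encode user_id + notif_type + expiry and append an HMAC signature."""
    now = time.time() if now is None else now
    claims = {"uid": user_id, "type": notif_type, "exp": int(now) + ttl_s}
    payload = _b64(json.dumps(claims).encode())
    return f"{payload}.{_sign(payload)}"


def verify_token(token: str, now=None):
    """Return the claims if the signature is valid and unexpired, else None."""
    now = time.time() if now is None else now
    try:
        payload, sig = token.split(".")
        if not hmac.compare_digest(sig.encode(), _sign(payload).encode()):
            return None
        claims = json.loads(_unb64(payload))
    except ValueError:
        return None
    return claims if claims["exp"] > now else None


def unsubscribe_headers(token: str) -> dict:
    """Email headers that advertise one-click unsubscribe (RFC 8058)."""
    url = f"https://yourapp.com/unsubscribe?token={token}"
    return {
        "List-Unsubscribe": f"<{url}>",
        "List-Unsubscribe-Post": "List-Unsubscribe=One-Click",
    }


token = make_token(42, "MARKETING")
print(unsubscribe_headers(token)["List-Unsubscribe-Post"])
```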

10. Scaling to Billions of Notifications

At marketing-blast scale (50M users in one campaign), the bottleneck shifts from your services to external provider rate limits.

  • FCM: Throughput is bounded by per-project quotas rather than by a single connection, so check the current quota for your project and request an increase before a large blast. Reuse HTTP/2 connections to the HTTP v1 API and keep many sends in flight per worker; the older batch-send endpoint (up to 500 messages per request) has been deprecated by Firebase.
  • Twilio: Default 1 message/second per long code. Use short codes (100 msg/sec) or toll-free numbers for blasts. Pre-register sender pool.
  • SendGrid: IP warm-up required for new IPs. Dedicated IPs for high volume. Monitor bounce/spam rates to protect sender reputation.
  • Your infrastructure: Kafka partitioning = parallelism. 100 push workers × 500 sends/second each = 50,000 pushes/second. A 50M blast completes in 50M ÷ 50,000 = 1,000 seconds, or ~17 minutes.
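
The blast-duration estimate is worth being able to redo on a whiteboard with different inputs. A small sketch, assuming each worker sustains a steady 500 sends per second and the provider quota is not the limiting factor; `blast_duration_s` is a hypothetical helper.

```python
def blast_duration_s(recipients: int, workers: int, sends_per_worker_per_s: int) -> float:
    """Seconds to send to all recipients at a steady aggregate rate."""
    return recipients / (workers * sends_per_worker_per_s)


seconds = blast_duration_s(50_000_000, workers=100, sends_per_worker_per_s=500)
print(f"{seconds:.0f} s = {seconds / 60:.1f} min")
```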

Frequently Asked Questions — Notification System Design