WhatsApp System Design



Quick Reference

Scale: 2B+ users, 100B+ messages/day, 1B+ groups

Key Components: Messaging service, real-time delivery, end-to-end encryption, presence system, media storage, group messaging

Challenges: Real-time delivery at scale, end-to-end encryption, presence management, group messaging, media delivery


Clear Definition

WhatsApp is a global messaging platform serving 2B+ users, handling 100B+ messages per day with end-to-end encryption. It requires real-time message delivery, presence management, group messaging, and media sharing at unprecedented scale.

💡 Key Insight: WhatsApp uses Erlang/OTP for high concurrency and fault tolerance, message queues for reliable delivery, and Signal Protocol for end-to-end encryption. The system is designed for reliability and security.


System Requirements

Functional Requirements

  1. Messaging

    • Send/receive text messages
    • Real-time delivery
    • Message status (sent, delivered, read)
    • Message history
  2. Group Messaging

    • Create groups (up to 256 members in this design; WhatsApp has since raised the limit to 1,024)
    • Send messages to groups
    • Group management (add/remove members)
    • Group metadata
  3. Media Sharing

    • Send images, videos, documents
    • Voice messages
    • File sharing
    • Media compression
  4. Presence

    • Online/offline status
    • Last seen timestamp
    • Typing indicators
    • Read receipts
  5. Security

    • End-to-end encryption
    • User authentication
    • Message integrity
    • Forward secrecy

Non-Functional Requirements

  1. Scale

    • 2B+ users globally
    • 100B+ messages/day
    • Handle message bursts
    • Support group messaging
  2. Performance

    • Low latency message delivery (< 1 second)
    • Real-time presence updates
    • Fast media delivery
  3. Reliability

    • 99.9% message delivery
    • Handle network failures
    • Message persistence
    • No message loss
  4. Security

    • End-to-end encryption
    • User privacy
    • Forward secrecy
    • No message interception
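
As a rough sizing check, the daily volume in the scale requirements can be converted into a per-second rate. The sketch below assumes traffic is spread evenly across the day, and the 3x peak factor is a hypothetical figure chosen for illustration, not a number reported by WhatsApp.

```python
# Back-of-the-envelope throughput from the stated scale numbers.
MESSAGES_PER_DAY = 100_000_000_000   # 100B+ messages/day (from the requirements)
SECONDS_PER_DAY = 24 * 60 * 60       # 86,400
PEAK_FACTOR = 3                      # assumption: peak traffic is ~3x the average

avg_per_second = MESSAGES_PER_DAY / SECONDS_PER_DAY
peak_per_second = avg_per_second * PEAK_FACTOR

print(f"average: {avg_per_second:,.0f} msg/s")
print(f"peak (assumed {PEAK_FACTOR}x): {peak_per_second:,.0f} msg/s")
```

The average works out to a little over one million messages per second, which is why the rest of the design leans on horizontal scaling rather than larger single machines.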

High-Level Architecture

┌──────────────────────────────────────────────────────────────┐
│                    WhatsApp Clients                          │
│  (Mobile: iOS, Android, Web, Desktop)                        │
└───────────────────────┬──────────────────────────────────────┘
                        │
                        │ XMPP / Custom Protocol
                        │ (TLS Encrypted)
                        ▼
┌──────────────────────────────────────────────────────────────┐
│                    Load Balancer                             │
└───────────────────────┬──────────────────────────────────────┘
                        │
        ┌───────────────┼───────────────┐
        │               │               │
        ▼               ▼               ▼
┌──────────────┐ ┌──────────────┐ ┌──────────────┐
│   Message    │ │   Presence   │ │   Media      │
│   Service    │ │   Service    │ │   Service    │
│  (Erlang)    │ │  (Erlang)    │ │  (Erlang)    │
└──────┬───────┘ └──────┬───────┘ └──────┬───────┘
       │                │                │
       │                │                │
       ▼                ▼                ▼
┌──────────────────────────────────────────────────────────────┐
│                    Message Queue                             │
│              (RabbitMQ / Custom Queue)                       │
└───────────────────────┬──────────────────────────────────────┘
                        │
                        ▼
┌──────────────────────────────────────────────────────────────┐
│                    Database Layer                            │
│  (MySQL for metadata, HBase for messages)                    │
└───────────────────────┬──────────────────────────────────────┘
                        │
                        ▼
┌──────────────────────────────────────────────────────────────┐
│              Media Storage (Object Storage)                  │
│         (S3-like for images, videos, files)                  │
└──────────────────────────────────────────────────────────────┘

Core Components

1. Client Applications

Responsibilities:

  • User interface
  • Message encryption/decryption
  • Presence management
  • Media handling

Technologies:

  • Mobile: Native apps (iOS, Android)
  • Web: WebSocket-based web app
  • Desktop: Electron-based desktop app

Key Features:

  • Offline Support: Queue messages when offline
  • Message Sync: Sync messages across devices
  • Media Compression: Compress before sending
  • Encryption: End-to-end encryption using Signal Protocol

2. Message Service (Erlang/OTP)

Why Erlang?

  • Concurrency: Handle millions of concurrent connections
  • Fault Tolerance: Self-healing, no single point of failure
  • Hot Code Swapping: Update code without downtime
  • Message Passing: Natural fit for messaging systems

Responsibilities:

  • Accept incoming messages
  • Route messages to recipients
  • Handle message delivery
  • Manage message queues
  • Handle offline messages

Message Flow:

1. User sends message
   │
2. Client encrypts message (Signal Protocol)
   │
3. Message sent to Message Service
   │
4. Message Service stores in queue
   │
5. If recipient online: Deliver immediately
   │
6. If recipient offline: Store for later delivery
   │
7. Message delivered to recipient
   │
8. Delivery receipt sent to sender

Message Storage:

  • Online Users: In-memory queue (fast delivery)
  • Offline Users: Persistent queue (database)
  • Message History: Stored in database (HBase)

3. Presence Service

Responsibilities:

  • Track user online/offline status
  • Track last seen timestamp
  • Handle typing indicators
  • Manage read receipts

Presence States:

  • Online: User is active
  • Offline: User is not connected
  • Last Seen: Timestamp of last activity
  • Typing: User is typing a message

Implementation:

  • In-Memory: Store presence in memory (fast)
  • Distributed: Share presence across servers
  • TTL: Presence expires after inactivity
  • Privacy: Users can hide last seen
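
A minimal sketch of TTL-based presence follows, assuming clients send periodic heartbeats. The `PresenceStore` class, the 30-second TTL, and the `hide_last_seen` flag are illustrative assumptions; a production system would keep this state in a distributed cache rather than a single dictionary.

```python
import time

class PresenceStore:
    """In-memory presence with TTL expiry; a stand-in for a distributed cache."""

    def __init__(self, ttl_seconds=30.0, clock=time.time):
        self.ttl = ttl_seconds
        self.clock = clock            # injectable so expiry can be tested without sleeping
        self._last_heartbeat = {}     # user_id -> timestamp of the last heartbeat

    def heartbeat(self, user_id):
        """Record activity; called whenever the client pings or sends traffic."""
        self._last_heartbeat[user_id] = self.clock()

    def is_online(self, user_id):
        """A user is online only if a heartbeat arrived within the TTL."""
        seen = self._last_heartbeat.get(user_id)
        return seen is not None and self.clock() - seen < self.ttl

    def last_seen(self, user_id, hide_last_seen=False):
        """Return the last heartbeat time, or None if the user hides it."""
        if hide_last_seen:
            return None
        return self._last_heartbeat.get(user_id)


now = [0.0]                                   # fake clock for the demonstration
store = PresenceStore(ttl_seconds=30.0, clock=lambda: now[0])
store.heartbeat("alice")
now[0] = 10.0
print(store.is_online("alice"))               # still within the TTL
now[0] = 45.0
print(store.is_online("alice"))               # heartbeat has expired
```

Expiry by TTL means a client that drops off the network without a clean disconnect still goes offline automatically.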

4. End-to-End Encryption

Signal Protocol:

  • Double Ratchet: Forward secrecy
  • Pre-keys: Asynchronous messaging
  • Key Exchange: Secure key exchange
  • Message Authentication: Prevent tampering

How it Works:

1. User A wants to send message to User B
   │
2. User A fetches User B's pre-key bundle (public identity key and pre-keys) from the server
   │
3. User A derives a shared session key from the bundle and encrypts the message with a per-message key
   │
4. Encrypted message sent to server
   │
5. Server forwards encrypted message to User B
   │
6. User B derives the same session key using its private keys and decrypts the message
   │
7. Server cannot read message (end-to-end encrypted)

Key Features:

  • Forward Secrecy: Old messages cannot be decrypted if key compromised
  • Message Authentication: Verify message integrity
  • Content Confidentiality: Server cannot read message content (it still handles routing metadata such as sender and recipient)
  • Group Encryption: Encrypt group messages
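
Forward secrecy comes from deriving a fresh key for every message and discarding old keys. The sketch below shows only the symmetric-key chain of a Double Ratchet, using HMAC-SHA256 with distinct constants to derive the message key and the next chain key. The shared secret is a placeholder, and the Diffie-Hellman ratchet, pre-key handling, and the actual message encryption are all omitted, so this is an illustration and not a usable implementation of the Signal Protocol.

```python
import hashlib
import hmac

def ratchet_step(chain_key: bytes) -> tuple[bytes, bytes]:
    """Derive (next_chain_key, message_key) from the current chain key."""
    message_key = hmac.new(chain_key, b"\x01", hashlib.sha256).digest()
    next_chain_key = hmac.new(chain_key, b"\x02", hashlib.sha256).digest()
    return next_chain_key, message_key

# Assumption: both parties already share this secret from a key agreement step.
chain_key = hashlib.sha256(b"example shared secret").digest()

message_keys = []
for _ in range(3):
    # The old chain key is overwritten, so earlier message keys cannot be re-derived.
    chain_key, mk = ratchet_step(chain_key)
    message_keys.append(mk)

for i, mk in enumerate(message_keys, start=1):
    print(f"message {i} key: {mk.hex()[:16]}...")
```

Because HMAC is one-way, someone who obtains the current chain key can derive future keys in this chain but cannot walk backwards to the keys of earlier messages.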

5. Group Messaging

Challenges:

  • Scale: Up to 256 members per group
  • Delivery: Deliver to all members
  • Consistency: Ensure all members receive messages
  • Ordering: Maintain message order

Group Architecture:

Group Message
    │
    ▼
┌──────────────────┐
│  Group Service   │
│  - Get members   │
│  - Route message │
└────────┬─────────┘
         │
         │ Fan-out
         ▼
┌─────────────────────────────────┐
│  Deliver to each member         │
│  - Online: Immediate            │
│  - Offline: Queue for later     │
└─────────────────────────────────┘

Group Features:

  • Member Management: Add/remove members
  • Admin Controls: Group admins
  • Group Metadata: Name, description, icon
  • Message Delivery: Deliver to all members

6. Media Service

Responsibilities:

  • Accept media uploads
  • Store media files
  • Generate thumbnails
  • Deliver media to recipients

Media Flow:

1. User sends image/video
   │
2. Client compresses media
   │
3. Media uploaded to Media Service
   │
4. Media Service stores in object storage
   │
5. Thumbnail generated (for images/videos)
   │
6. Media URL sent to recipient
   │
7. Recipient downloads media

Storage:

  • Object Storage: S3-like storage (AWS S3, Google Cloud Storage)
  • CDN: Use CDN for media delivery
  • Compression: Compress before storage
  • Thumbnails: Generate thumbnails for fast preview

Optimizations:

  • Compression: Compress images/videos before upload
  • Progressive Loading: Load low quality first, then high quality
  • Caching: Cache frequently accessed media
  • CDN: Use CDN for global delivery

7. Database Layer

Database Architecture:

  • MySQL: User data, contacts, group metadata
  • HBase: Message history (time-series data)
  • Redis: Presence, online users, caching

Data Sharding:

  • User Data: Shard by user ID
  • Messages: Shard by conversation ID
  • Groups: Shard by group ID
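
Shard selection can be sketched as hashing the shard key and taking it modulo the number of shards. The shard count of 16 and both function names are assumptions made for the example. A stable hash from `hashlib` is used because Python's built-in `hash()` for strings varies between processes.

```python
import hashlib

NUM_SHARDS = 16  # assumption for the example

def shard_for(key: str, num_shards: int = NUM_SHARDS) -> int:
    """Map a user, conversation, or group ID to a shard index."""
    digest = hashlib.sha256(key.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % num_shards

def conversation_id(user_a: str, user_b: str) -> str:
    """Order-independent ID so both participants resolve to the same shard."""
    return ":".join(sorted((user_a, user_b)))


print(shard_for("user:12345"))
print(shard_for(conversation_id("alice", "bob")))
print(shard_for(conversation_id("bob", "alice")))   # same shard as the line above
```

Plain modulo hashing remaps most keys when the shard count changes, so a design that expects to add shards often would likely use consistent hashing or a lookup table instead.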

Replication:

  • Read Replicas: For read scalability
  • Master-Slave (Primary-Replica): For failover and read scaling; write scalability comes from sharding
  • Geographic Distribution: Replicate across regions

Data Flow

Message Send Flow

1. User A sends message to User B
   │
2. Client encrypts message (Signal Protocol)
   │
3. Message sent to Message Service
   │
4. Message Service validates and stores
   │
5. Check if User B is online
   │
6. If online: Deliver immediately over User B's persistent connection (WebSocket for web clients)
   │
7. If offline: Store in persistent queue
   │
8. When User B comes online: Deliver queued messages
   │
9. Send delivery receipt to User A once User B's device acknowledges the message
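
Receipts can arrive late, duplicated, or out of order, so message status (sent, delivered, read) is best treated as a value that only moves forward. The sketch below is one simple way to model that; the `Status` enum and `apply_receipt` function are illustrative names, not part of any WhatsApp API.

```python
from enum import IntEnum

class Status(IntEnum):
    SENT = 1
    DELIVERED = 2
    READ = 3

def apply_receipt(current: Status, receipt: Status) -> Status:
    """Status only moves forward; late or duplicate receipts are ignored."""
    return max(current, receipt)


status = Status.SENT
status = apply_receipt(status, Status.READ)        # read receipt arrives first
status = apply_receipt(status, Status.DELIVERED)   # delayed delivery receipt is ignored
print(status.name)
```

Taking the maximum makes receipt handling idempotent, which is convenient when the same receipt may be retried by the queue.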

Group Message Flow

1. User sends message to group
   │
2. Message Service gets group members
   │
3. Fan-out: Deliver to each member
   │
4. For each member:
   - If online: Deliver immediately
   - If offline: Queue for later
   │
5. Track delivery status for each member
   │
6. Update message status (sent, delivered, read)
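
The fan-out and per-member tracking steps can be sketched as follows. Both functions are hypothetical helpers, membership and online state are passed in as plain collections, and the rule that a group message counts as delivered only once every member has it is an assumption about how the aggregate status is computed.

```python
def fan_out(sender, members, online_users):
    """Return the per-member delivery state for one group message."""
    status = {}
    for member in members:
        if member == sender:
            continue                      # the sender already has the message
        status[member] = "delivered" if member in online_users else "queued"
    return status

def group_status(per_member):
    """Aggregate status: delivered only when every member has the message."""
    if all(state == "delivered" for state in per_member.values()):
        return "delivered"
    return "sent"


per_member = fan_out("alice", ["alice", "bob", "carol"], online_users={"bob"})
print(per_member)
print(group_status(per_member))
```

With a bounded group size, this write-time fan-out stays cheap: one send produces at most a few hundred queue entries.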

Presence Update Flow

1. User comes online
   │
2. Presence Service updates status
   │
3. Notify user's contacts
   │
4. Update last seen timestamp
   │
5. User goes offline
   │
6. Presence Service updates status
   │
7. Notify contacts (optional, privacy settings)

Scaling Strategies

1. Horizontal Scaling

Erlang/OTP:

  • Spawn processes for each connection
  • Distribute across multiple servers
  • No shared state between processes (each process owns its own connection state)
  • Easy to scale horizontally

Message Queues:

  • Multiple queue servers
  • Distribute load
  • Handle message bursts

2. Database Sharding

User Data:

  • Shard by user ID
  • Distribute across databases
  • Handle cross-shard queries

Messages:

  • Shard by conversation ID
  • Time-based sharding for history
  • Archive old messages

3. Caching

Presence:

  • Cache presence in Redis
  • Fast lookups
  • TTL-based expiration

User Data:

  • Cache user profiles
  • Cache contacts
  • Cache group metadata

4. Geographic Distribution

Data Centers:

  • Multiple regions globally
  • Route users to nearest region
  • Replicate critical data

Message Routing:

  • Route messages within region
  • Cross-region routing when needed
  • Optimize for latency

Key Design Decisions

1. Erlang/OTP

Decision: Use Erlang/OTP for backend services

Rationale:

  • High concurrency (millions of connections)
  • Fault tolerance (self-healing)
  • Message passing (natural fit)
  • Hot code swapping (zero downtime)

Trade-offs:

  • ✅ Excellent concurrency
  • ✅ Fault tolerant
  • ✅ Natural fit for messaging
  • ❌ Smaller talent pool
  • ❌ Less common technology

2. End-to-End Encryption

Decision: Implement end-to-end encryption (Signal Protocol)

Rationale:

  • User privacy
  • Security requirement
  • Regulatory compliance
  • Competitive advantage

Trade-offs:

  • ✅ User privacy
  • ✅ Security
  • ✅ Compliance
  • ❌ Cannot read messages (support challenges)
  • ❌ More complex implementation

3. Message Queues

Decision: Use message queues for reliable delivery

Rationale:

  • Reliable delivery
  • Handle offline users
  • Decouple services
  • Handle message bursts

Trade-offs:

  • ✅ Reliable delivery
  • ✅ Handle offline users
  • ✅ Decouple services
  • ❌ More complexity
  • ❌ Message ordering challenges
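
A common way to handle the ordering trade-off is to number messages per conversation and have the receiver hold back anything that arrives early. The sketch below assumes the sender assigns consecutive sequence numbers starting at 1; `ReorderBuffer` is an illustrative name.

```python
class ReorderBuffer:
    """Releases messages in sequence order, holding back any that arrive early."""

    def __init__(self):
        self.next_seq = 1     # next sequence number that may be released
        self.pending = {}     # seq -> payload for messages that arrived early

    def receive(self, seq, payload):
        """Accept one message and return the payloads that are now deliverable."""
        if seq < self.next_seq or seq in self.pending:
            return []                         # duplicate of something already seen
        self.pending[seq] = payload
        ready = []
        while self.next_seq in self.pending:  # release every consecutive message
            ready.append(self.pending.pop(self.next_seq))
            self.next_seq += 1
        return ready


buf = ReorderBuffer()
print(buf.receive(2, "second"))   # held back until message 1 arrives
print(buf.receive(1, "first"))    # releases both, in order
print(buf.receive(1, "first"))    # duplicate is dropped
```

The same duplicate check also makes retried deliveries safe, since a message redelivered by the queue is simply ignored.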

4. Hybrid Storage

Decision: Use MySQL for metadata, HBase for messages

Rationale:

  • MySQL: Fast queries, ACID transactions
  • HBase: Time-series data, scalable writes
  • Right tool for right job

Trade-offs:

  • ✅ Optimized for each use case
  • ✅ Better performance
  • ❌ More complexity
  • ❌ Multiple systems to manage

Challenges and Solutions

Challenge 1: Real-time Delivery at Scale

Problem: Deliver messages to 2B+ users in real-time

Solution:

  • Erlang/OTP for high concurrency
  • Persistent client connections (WebSocket for web clients)
  • Message queues for reliability
  • Horizontal scaling

Challenge 2: End-to-End Encryption

Problem: Implement encryption without breaking functionality

Solution:

  • Signal Protocol (proven)
  • Client-side encryption
  • Key management
  • Forward secrecy

Challenge 3: Group Messaging

Problem: Deliver messages to groups (up to 256 members)

Solution:

  • Fan-out architecture
  • Message queues per member
  • Optimize for common case (small groups)
  • Handle large groups efficiently

Challenge 4: Presence Management

Problem: Track presence for 2B+ users

Solution:

  • In-memory storage (Redis)
  • Distributed presence
  • TTL-based expiration
  • Privacy controls

Challenge 5: Media Delivery

Problem: Deliver media files efficiently

Solution:

  • Object storage (S3-like)
  • CDN for global delivery
  • Compression before upload
  • Progressive loading

Monitoring and Observability

Key Metrics

Performance Metrics:

  • Message delivery latency
  • Message delivery rate
  • Presence update latency
  • Media upload/download time

Reliability Metrics:

  • Message delivery success rate
  • System uptime
  • Error rate
  • Queue depth

Business Metrics:

  • Daily active users
  • Messages per day
  • Group count
  • Media shared

Alerting

  • Alert on high message delivery latency
  • Alert on message delivery failures
  • Alert on system errors
  • Alert on queue buildup
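
The alerts above can be expressed as simple threshold checks over recent samples. In this sketch the 1,000 ms latency limit and the 0.1% failure limit mirror the non-functional requirements, while the queue-depth limit and the function names are assumptions chosen for illustration.

```python
import statistics

def p99(latencies_ms):
    """99th percentile of the latency samples (needs at least two samples)."""
    return statistics.quantiles(latencies_ms, n=100)[98]

def check_alerts(latencies_ms, queue_depth, failures, total,
                 max_p99_ms=1000.0, max_queue_depth=10_000, max_failure_rate=0.001):
    """Return the list of alert names that should fire for this window."""
    alerts = []
    if p99(latencies_ms) > max_p99_ms:
        alerts.append("high delivery latency")
    if queue_depth > max_queue_depth:
        alerts.append("queue buildup")
    if total and failures / total > max_failure_rate:
        alerts.append("delivery failures")
    return alerts


print(check_alerts([120.0] * 100, queue_depth=500, failures=0, total=1000))
print(check_alerts([2500.0] * 100, queue_depth=50_000, failures=20, total=1000))
```

Alerting on a high percentile rather than the mean catches slow tail deliveries that an average would hide.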

Best Practices

1. Message Delivery

  • Use message queues for reliability
  • Handle offline users
  • Retry failed deliveries
  • Track delivery status
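
Retrying failed deliveries is usually paired with exponential backoff and jitter so that many clients do not retry in lockstep. The sketch below assumes a `send` callable that returns True on success; the attempt count, base delay, and cap are illustrative defaults.

```python
import random
import time

def backoff_delays(max_attempts=5, base=0.5, cap=30.0, rng=random.random):
    """Yield one delay (seconds) per attempt: exponential growth with full jitter."""
    for attempt in range(max_attempts):
        ceiling = min(cap, base * 2 ** attempt)
        yield rng() * ceiling

def deliver_with_retry(send, max_attempts=5, sleep=time.sleep):
    """Call send() until it returns True or the attempts are used up."""
    for delay in backoff_delays(max_attempts):
        if send():
            return True
        sleep(delay)          # wait before the next attempt
    return False


outcomes = iter([False, False, True])            # fails twice, then succeeds
ok = deliver_with_retry(lambda: next(outcomes), sleep=lambda s: None)
print(ok)
```

If every attempt fails, the message should stay in the persistent queue for a later pass rather than being dropped.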

2. Encryption

  • Use proven protocols (Signal Protocol)
  • Implement forward secrecy
  • Secure key management
  • Regular security audits

3. Presence Management

  • Cache presence for performance
  • Respect privacy settings
  • Handle presence storms
  • TTL-based expiration

4. Group Messaging

  • Optimize for small groups
  • Handle large groups efficiently
  • Maintain message order
  • Track delivery to all members

Quick Reference Summary

WhatsApp: Global messaging platform with 2B+ users, 100B+ messages/day.

Key Components:

  • Message Service (Erlang/OTP)
  • Presence Service
  • End-to-end encryption (Signal Protocol)
  • Media Service
  • Group Messaging

Key Design Decisions:

  • Erlang/OTP for concurrency and fault tolerance
  • End-to-end encryption for security
  • Message queues for reliable delivery
  • Hybrid storage (MySQL + HBase)

Scaling Strategies:

  • Horizontal scaling with Erlang/OTP
  • Database sharding
  • Caching (Redis)
  • Geographic distribution

Remember: WhatsApp's success comes from combining high concurrency (Erlang), reliable message delivery (queues), and strong security (end-to-end encryption) at massive scale.

