WhatsApp System Design
Quick Reference
Scale: 2B+ users, 100B+ messages/day, 1B+ groups
Key Components: Messaging service, real-time delivery, end-to-end encryption, presence system, media storage, group messaging
Challenges: Real-time delivery at scale, end-to-end encryption, presence management, group messaging, media delivery
Clear Definition
WhatsApp is a global messaging platform serving 2B+ users, handling 100B+ messages per day with end-to-end encryption. It requires real-time message delivery, presence management, group messaging, and media sharing at unprecedented scale.
💡 Key Insight: WhatsApp uses Erlang/OTP for high concurrency and fault tolerance, message queues for reliable delivery, and the Signal Protocol for end-to-end encryption. The system is designed for reliability and security.
System Requirements
Functional Requirements
- Messaging
  - Send/receive text messages
  - Real-time delivery
  - Message status (sent, delivered, read)
  - Message history
- Group Messaging
  - Create groups (up to 256 members)
  - Send messages to groups
  - Group management (add/remove members)
  - Group metadata
- Media Sharing
  - Send images, videos, documents
  - Voice messages
  - File sharing
  - Media compression
- Presence
  - Online/offline status
  - Last seen timestamp
  - Typing indicators
  - Read receipts
- Security
  - End-to-end encryption
  - User authentication
  - Message integrity
  - Forward secrecy
Non-Functional Requirements
- Scale
  - 2B+ users globally
  - 100B+ messages/day
  - Handle message bursts
  - Support group messaging
- Performance
  - Low latency message delivery (< 1 second)
  - Real-time presence updates
  - Fast media delivery
- Reliability
  - 99.9% message delivery
  - Handle network failures
  - Message persistence
  - No message loss
- Security
  - End-to-end encryption
  - User privacy
  - Forward secrecy
  - No message interception
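As a rough back-of-envelope check on the scale requirement above, the sketch below converts 100B messages/day into an average per-second rate. The peak multiplier of 3 is an assumed illustrative value, not a figure from this document.

```python
# Back-of-envelope throughput estimate from the stated scale.
MESSAGES_PER_DAY = 100_000_000_000   # 100B+ messages/day (from the requirements)
SECONDS_PER_DAY = 24 * 60 * 60       # 86,400

avg_per_second = MESSAGES_PER_DAY / SECONDS_PER_DAY

# ASSUMPTION: peak traffic is about 3x the average; the real ratio is not given here.
PEAK_MULTIPLIER = 3
peak_per_second = avg_per_second * PEAK_MULTIPLIER

print(f"average: {avg_per_second:,.0f} msg/s")
print(f"assumed peak: {peak_per_second:,.0f} msg/s")
```

The average works out to roughly 1.16 million messages per second, which is why the design leans on horizontal scaling and queues rather than any single server.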
High-Level Architecture
┌─────────────────────────────────────────────────────────────┐
│                      WhatsApp Clients                       │
│            (Mobile: iOS, Android, Web, Desktop)             │
└─────────────────────────────┬───────────────────────────────┘
                              │
                              │ XMPP / Custom Protocol
                              │ (TLS Encrypted)
                              ▼
┌─────────────────────────────────────────────────────────────┐
│                        Load Balancer                        │
└─────────────────────────────┬───────────────────────────────┘
                              │
             ┌────────────────┼────────────────┐
             │                │                │
             ▼                ▼                ▼
      ┌──────────────┐ ┌──────────────┐ ┌──────────────┐
      │   Message    │ │   Presence   │ │    Media     │
      │   Service    │ │   Service    │ │   Service    │
      │   (Erlang)   │ │   (Erlang)   │ │   (Erlang)   │
      └──────┬───────┘ └──────┬───────┘ └──────┬───────┘
             │                │                │
             │                │                │
             ▼                ▼                ▼
┌─────────────────────────────────────────────────────────────┐
│                        Message Queue                        │
│                  (RabbitMQ / Custom Queue)                  │
└─────────────────────────────┬───────────────────────────────┘
                              │
                              ▼
┌─────────────────────────────────────────────────────────────┐
│                       Database Layer                        │
│          (MySQL for metadata, HBase for messages)           │
└─────────────────────────────┬───────────────────────────────┘
                              │
                              ▼
┌─────────────────────────────────────────────────────────────┐
│               Media Storage (Object Storage)                │
│             (S3-like for images, videos, files)             │
└─────────────────────────────────────────────────────────────┘
Core Components
1. Client Applications
Responsibilities:
- User interface
- Message encryption/decryption
- Presence management
- Media handling
Technologies:
- Mobile: Native apps (iOS, Android)
- Web: WebSocket-based web app
- Desktop: Electron-based desktop app
Key Features:
- Offline Support: Queue messages when offline
- Message Sync: Sync messages across devices
- Media Compression: Compress before sending
- Encryption: End-to-end encryption using Signal Protocol
2. Message Service (Erlang/OTP)
Why Erlang?
- Concurrency: Handle millions of concurrent connections
- Fault Tolerance: Self-healing, no single point of failure
- Hot Code Swapping: Update code without downtime
- Message Passing: Natural fit for messaging systems
Responsibilities:
- Accept incoming messages
- Route messages to recipients
- Handle message delivery
- Manage message queues
- Handle offline messages
Message Flow:
1. User sends message
↓
2. Client encrypts message (Signal Protocol)
↓
3. Message sent to Message Service
↓
4. Message Service stores in queue
↓
5. If recipient online: Deliver immediately
↓
6. If recipient offline: Store for later delivery
↓
7. Message delivered to recipient
↓
8. Delivery receipt sent to sender
Message Storage:
- Online Users: In-memory queue (fast delivery)
- Offline Users: Persistent queue (database)
- Message History: Stored in database (HBase)
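The online/offline split described above can be sketched as a small in-memory router. This is a minimal illustration, not WhatsApp's implementation: `MessageRouter`, its method names, and the callback-based "connection" are all hypothetical, and a plain dictionary of deques stands in for the persistent queue.

```python
from collections import defaultdict, deque


class MessageRouter:
    """Toy router: deliver to online users, queue for offline users."""

    def __init__(self):
        self.online = {}                          # user_id -> delivery callback
        self.offline_queue = defaultdict(deque)   # user_id -> pending ciphertexts

    def connect(self, user_id, deliver):
        """Register a connection and flush anything queued while offline."""
        self.online[user_id] = deliver
        pending = self.offline_queue.pop(user_id, deque())
        while pending:
            deliver(pending.popleft())

    def disconnect(self, user_id):
        self.online.pop(user_id, None)

    def send(self, recipient_id, ciphertext):
        """Return 'delivered' or 'queued'; the payload is never decrypted here."""
        deliver = self.online.get(recipient_id)
        if deliver is not None:
            deliver(ciphertext)
            return "delivered"
        self.offline_queue[recipient_id].append(ciphertext)
        return "queued"


router = MessageRouter()
inbox_b = []
status_offline = router.send("user_b", b"ciphertext-1")   # user_b not connected yet
router.connect("user_b", inbox_b.append)                  # queued message is flushed
status_online = router.send("user_b", b"ciphertext-2")
```

Queued messages are flushed in arrival order on reconnect, so the recipient sees them in the order the server accepted them.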
3. Presence Service
Responsibilities:
- Track user online/offline status
- Track last seen timestamp
- Handle typing indicators
- Manage read receipts
Presence States:
- Online: User is active
- Offline: User is not connected
- Last Seen: Timestamp of last activity
- Typing: User is typing a message
Implementation:
- In-Memory: Store presence in memory (fast)
- Distributed: Share presence across servers
- TTL: Presence expires after inactivity
- Privacy: Users can hide last seen
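A minimal sketch of TTL-based presence, assuming clients send periodic heartbeats. The class name, the 30-second TTL, and the `hide_last_seen` flag are illustrative assumptions; a production system would keep this state in a shared store rather than a local dictionary.

```python
import time


class PresenceStore:
    """In-memory presence with TTL expiry (a stand-in for a shared cache)."""

    def __init__(self, ttl_seconds=30.0, clock=time.monotonic):
        self.ttl = ttl_seconds
        self.clock = clock
        self._last_heartbeat = {}   # user_id -> time of last heartbeat

    def heartbeat(self, user_id):
        self._last_heartbeat[user_id] = self.clock()

    def is_online(self, user_id):
        seen = self._last_heartbeat.get(user_id)
        return seen is not None and (self.clock() - seen) < self.ttl

    def last_seen(self, user_id, hide_last_seen=False):
        """Return the last heartbeat time, or None if unknown or hidden."""
        if hide_last_seen:
            return None
        return self._last_heartbeat.get(user_id)


# A fake clock keeps the example deterministic.
now = [0.0]
store = PresenceStore(ttl_seconds=30.0, clock=lambda: now[0])
store.heartbeat("alice")
now[0] = 10.0
online_at_10s = store.is_online("alice")   # within the TTL
now[0] = 45.0
online_at_45s = store.is_online("alice")   # TTL elapsed without a heartbeat
```

Expiry here is computed on read, so no background sweeper is needed; the cost is that stale entries stay in memory until overwritten or cleaned up separately.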
4. End-to-End Encryption
Signal Protocol:
- Double Ratchet: Forward secrecy
- Pre-keys: Asynchronous messaging
- Key Exchange: Secure key exchange
- Message Authentication: Prevent tampering
How it Works:
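1. User A wants to send message to User B
↓
2. User A fetches User B's pre-key bundle (public identity key, signed pre-key, one-time pre-key) from the server
↓
3. User A runs a key agreement (X3DH) to derive a shared secret, then encrypts the message with a symmetric message key derived by the Double Ratchet
↓
4. Encrypted message sent to server
↓
5. Server forwards encrypted message to User B
↓
6. User B derives the same shared secret from their own private keys and decrypts the message
↓
7. Server cannot read message (end-to-end encrypted)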
Key Features:
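- Forward Secrecy: Past messages cannot be decrypted even if a current key is later compromised
- Message Authentication: Verify message integrity
- Content Confidentiality: Server cannot read message content (routing metadata such as sender, recipient, and timing is still visible to it)
- Group Encryption: Group messages are encrypted too (Signal's Sender Keys scheme)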
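To make the forward-secrecy property of the Double Ratchet more concrete, here is a sketch of only its symmetric-key chain: each message key is derived from a chain key, and the chain key is then advanced one way. It deliberately omits the Diffie-Hellman ratchet, the key agreement, and the actual encryption. The HMAC-SHA256 construction with constants 0x01 and 0x02 follows the recommendation in the Double Ratchet specification; the starting secret is a made-up value for illustration.

```python
import hashlib
import hmac


def ratchet_step(chain_key: bytes) -> tuple[bytes, bytes]:
    """Derive (next_chain_key, message_key) from the current chain key."""
    message_key = hmac.new(chain_key, b"\x01", hashlib.sha256).digest()
    next_chain_key = hmac.new(chain_key, b"\x02", hashlib.sha256).digest()
    return next_chain_key, message_key


# ASSUMPTION: both parties already share this secret (in Signal it would come
# from the key agreement); a fixed value is used here only for illustration.
shared_secret = hashlib.sha256(b"illustrative shared secret").digest()

sender_ck = receiver_ck = shared_secret
sender_keys, receiver_keys = [], []
for _ in range(3):
    sender_ck, mk = ratchet_step(sender_ck)
    sender_keys.append(mk)
    receiver_ck, mk = ratchet_step(receiver_ck)
    receiver_keys.append(mk)
```

Both sides derive the same sequence of distinct message keys. Because HMAC is one-way, an attacker who steals the current chain key cannot walk backwards to earlier message keys, provided those keys were deleted after use.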
5. Group Messaging
Challenges:
- Scale: Up to 256 members per group
- Delivery: Deliver to all members
- Consistency: Ensure all members receive messages
- Ordering: Maintain message order
Group Architecture:
Group Message
         │
         ▼
┌─────────────────┐
│ Group Service   │
│ - Get members   │
│ - Route message │
└────────┬────────┘
         │
         │ Fan-out
         ▼
┌────────────────────────────┐
│ Deliver to each member     │
│ - Online: Immediate        │
│ - Offline: Queue for later │
└────────────────────────────┘
Group Features:
- Member Management: Add/remove members
- Admin Controls: Group admins
- Group Metadata: Name, description, icon
- Message Delivery: Deliver to all members
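The fan-out step above can be sketched as a pure function that splits one group message into immediate and deferred deliveries. The function and parameter names are hypothetical, and membership and presence are passed in directly rather than looked up from services.

```python
def fan_out(group_members, sender_id, online_users, message):
    """Split a group message into immediate deliveries and queued deliveries."""
    deliver_now, queue_later = [], []
    for member_id in group_members:
        if member_id == sender_id:
            continue                      # the sender already has the message
        target = deliver_now if member_id in online_users else queue_later
        target.append((member_id, message))
    return deliver_now, queue_later


members = ["alice", "bob", "carol", "dave"]
now, later = fan_out(members, "alice", online_users={"bob", "dave"}, message=b"ciphertext")
```

With a cap of 256 members, this per-message loop stays small, which is one reason fan-out on write is a reasonable fit for group chat.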
6. Media Service
Responsibilities:
- Accept media uploads
- Store media files
- Generate thumbnails
- Deliver media to recipients
Media Flow:
1. User sends image/video
↓
2. Client compresses media
↓
3. Media uploaded to Media Service
↓
4. Media Service stores in object storage
↓
5. Thumbnail generated (for images/videos)
↓
6. Media URL sent to recipient
↓
7. Recipient downloads media
Storage:
- Object Storage: S3-like storage (AWS S3, Google Cloud Storage)
- CDN: Use CDN for media delivery
- Compression: Compress before storage
- Thumbnails: Generate thumbnails for fast preview
Optimizations:
- Compression: Compress images/videos before upload
- Progressive Loading: Load low quality first, then high quality
- Caching: Cache frequently accessed media
- CDN: Use CDN for global delivery
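A toy version of the upload and download path, assuming an in-memory dictionary in place of object storage and `zlib` as a stand-in for media compression (real images and videos use format-specific codecs). The content-derived key is an illustrative choice, not something this document specifies.

```python
import hashlib
import zlib


class MediaStore:
    """Toy media service: an in-memory dict stands in for object storage."""

    def __init__(self):
        self._objects = {}   # object key -> compressed bytes

    def upload(self, data: bytes) -> str:
        """Compress, store under a content-derived key, and return that key."""
        key = hashlib.sha256(data).hexdigest()
        self._objects.setdefault(key, zlib.compress(data))   # identical uploads dedupe
        return key

    def download(self, key: str) -> bytes:
        return zlib.decompress(self._objects[key])


store = MediaStore()
payload = b"fake image bytes " * 100
media_key = store.upload(payload)    # this key plays the role of the media URL
same_key = store.upload(payload)     # a re-upload maps to the same stored object
```

Sending only the key (or URL) through the messaging path keeps large payloads out of the message queues.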
7. Database Layer
Database Architecture:
- MySQL: User data, contacts, group metadata
- HBase: Message history (time-series data)
- Redis: Presence, online users, caching
Data Sharding:
- User Data: Shard by user ID
- Messages: Shard by conversation ID
- Groups: Shard by group ID
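One simple way to shard by ID is a stable hash modulo the shard count, sketched below. The shard count of 16 and the helper names are assumptions for illustration. A cryptographic hash is used because Python's built-in `hash()` for strings varies between processes.

```python
import hashlib

NUM_SHARDS = 16   # ASSUMPTION: illustrative shard count


def shard_for(key: str, num_shards: int = NUM_SHARDS) -> int:
    """Map a user, conversation, or group ID to a shard with a stable hash."""
    digest = hashlib.sha256(key.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % num_shards


def conversation_id(user_a: str, user_b: str) -> str:
    """Order-independent ID so both directions of a chat share one shard."""
    return ":".join(sorted((user_a, user_b)))


shard_ab = shard_for(conversation_id("alice", "bob"))
shard_ba = shard_for(conversation_id("bob", "alice"))
```

Plain modulo sharding remaps most keys when the shard count changes; consistent hashing or a lookup directory are common alternatives when resharding is expected.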
Replication:
- Read Replicas: For read scalability
- Master-Slave (primary-replica): For failover and high availability; write scalability comes from sharding, not replication
- Geographic Distribution: Replicate across regions
Data Flow
Message Send Flow
1. User A sends message to User B
↓
2. Client encrypts message (Signal Protocol)
↓
3. Message sent to Message Service
↓
4. Message Service validates and stores
↓
5. Check if User B is online
↓
6. If online: Deliver immediately over User B's open connection (WebSocket for web clients)
↓
7. If offline: Store in persistent queue
↓
8. Acknowledge to User A that the server accepted the message (status: sent)
↓
9. When User B comes online: Deliver queued messages
↓
10. Send delivery receipt to User A once User B's device has received the message
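The message status values used in this design (sent, delivered, read) form a one-way progression. A small sketch of that rule, assuming receipts can arrive late or out of order; the enum and function names are hypothetical.

```python
from enum import IntEnum


class MessageStatus(IntEnum):
    SENT = 1        # accepted by the server
    DELIVERED = 2   # received by the recipient's device
    READ = 3        # opened by the recipient


def advance(current: MessageStatus, update: MessageStatus) -> MessageStatus:
    """Apply a receipt; status only moves forward, so stale receipts are ignored."""
    return max(current, update)


status = MessageStatus.SENT
status = advance(status, MessageStatus.READ)        # read receipt arrives first
status = advance(status, MessageStatus.DELIVERED)   # late delivery receipt is a no-op
```

Treating status as monotonic means retried or reordered receipts never move a message backwards from read to delivered.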
Group Message Flow
1. User sends message to group
↓
2. Message Service gets group members
↓
3. Fan-out: Deliver to each member
↓
4. For each member:
   - If online: Deliver immediately
   - If offline: Queue for later
↓
5. Track delivery status for each member
↓
6. Update message status (sent, delivered, read)
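Per-member tracking has to be rolled up into a single status for the sender. One possible rule, shown here as an assumption rather than a documented behavior, is to report the least-advanced status across all recipients.

```python
STATUS_ORDER = {"sent": 0, "delivered": 1, "read": 2}


def group_message_status(per_member: dict) -> str:
    """Overall status is the least-advanced status across all recipients."""
    if not per_member:
        return "sent"
    return min(per_member.values(), key=STATUS_ORDER.__getitem__)


receipts = {"bob": "read", "carol": "delivered", "dave": "sent"}
before = group_message_status(receipts)
receipts["dave"] = "delivered"            # dave comes online and receives the message
after = group_message_status(receipts)
```

Under this rule the message shows as delivered only once every member has it, and as read only once every member has opened it.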
Presence Update Flow
1. User comes online
↓
2. Presence Service updates status
↓
3. Notify user's contacts
↓
4. Update last seen timestamp
↓
5. User goes offline
↓
6. Presence Service updates status
↓
7. Notify contacts (optional, privacy settings)
Scaling Strategies
1. Horizontal Scaling
Erlang/OTP:
- Spawn processes for each connection
- Distribute across multiple servers
- No shared memory between processes (state is isolated per process and exchanged by message passing)
- Easy to scale horizontally
Message Queues:
- Multiple queue servers
- Distribute load
- Handle message bursts
2. Database Sharding
User Data:
- Shard by user ID
- Distribute across databases
- Handle cross-shard queries
Messages:
- Shard by conversation ID
- Time-based sharding for history
- Archive old messages
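For time-ordered message history in a sorted key-value store, a common pattern is a row key made of the conversation ID plus a reversed timestamp, so that a scan returns the newest messages first. The key layout and the 13-digit timestamp bound below are assumptions for illustration; the document does not specify a schema.

```python
MAX_TS_MS = 10**13 - 1   # ASSUMPTION: 13-digit millisecond timestamps


def row_key(conversation_id: str, timestamp_ms: int) -> str:
    """Key sorts by conversation, then newest message first."""
    reversed_ts = MAX_TS_MS - timestamp_ms
    return f"{conversation_id}#{reversed_ts:013d}"


keys = sorted([
    row_key("alice:bob", 1_700_000_000_000),
    row_key("alice:bob", 1_700_000_005_000),
    row_key("alice:carol", 1_700_000_001_000),
])
```

After sorting, the two `alice:bob` rows are adjacent with the later message first, so loading the most recent page of a chat is a short prefix scan.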
3. Caching
Presence:
- Cache presence in Redis
- Fast lookups
- TTL-based expiration
User Data:
- Cache user profiles
- Cache contacts
- Cache group metadata
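Caching profiles, contacts, and group metadata usually follows a cache-aside pattern: read from the cache, fall back to the database on a miss, and invalidate on write. A minimal sketch, with a hypothetical `CacheAside` class, a 300-second TTL chosen for illustration, and a plain function standing in for the database.

```python
import time


class CacheAside:
    """Cache-aside reads: check the cache first, fall back to the database."""

    def __init__(self, load_from_db, ttl_seconds=300.0, clock=time.monotonic):
        self.load_from_db = load_from_db
        self.ttl = ttl_seconds
        self.clock = clock
        self._entries = {}   # key -> (value, expires_at)

    def get(self, key):
        entry = self._entries.get(key)
        if entry is not None and entry[1] > self.clock():
            return entry[0]                       # cache hit
        value = self.load_from_db(key)            # miss or expired: go to the source
        self._entries[key] = (value, self.clock() + self.ttl)
        return value

    def invalidate(self, key):
        """Call after a write so the next read reloads fresh data."""
        self._entries.pop(key, None)


db_reads = []

def load_profile(user_id):
    db_reads.append(user_id)
    return {"id": user_id, "name": f"user-{user_id}"}

profiles = CacheAside(load_profile)
first = profiles.get(42)
second = profiles.get(42)      # served from cache, no second database read
profiles.invalidate(42)
third = profiles.get(42)       # reloaded after invalidation
```

Only two database reads happen for three lookups; the TTL bounds how long a missed invalidation can serve stale data.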
4. Geographic Distribution
Data Centers:
- Multiple regions globally
- Route users to nearest region
- Replicate critical data
Message Routing:
- Route messages within region
- Cross-region routing when needed
- Optimize for latency
Key Design Decisions
1. Erlang/OTP
Decision: Use Erlang/OTP for backend services
Rationale:
- High concurrency (millions of connections)
- Fault tolerance (self-healing)
- Message passing (natural fit)
- Hot code swapping (zero downtime)
Trade-offs:
- ✅ Excellent concurrency
- ✅ Fault tolerant
- ✅ Natural fit for messaging
- ❌ Smaller talent pool
- ❌ Less common technology
2. End-to-End Encryption
Decision: Implement end-to-end encryption (Signal Protocol)
Rationale:
- User privacy
- Security requirement
- Regulatory compliance
- Competitive advantage
Trade-offs:
- ✅ User privacy
- ✅ Security
- ✅ Compliance
- ❌ Cannot read messages (support challenges)
- ❌ More complex implementation
3. Message Queues
Decision: Use message queues for reliable delivery
Rationale:
- Reliable delivery
- Handle offline users
- Decouple services
- Handle message bursts
Trade-offs:
- ✅ Reliable delivery
- ✅ Handle offline users
- ✅ Decouple services
- ❌ More complexity
- ❌ Message ordering challenges
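The ordering challenge is often handled with per-conversation sequence numbers and a small reorder buffer on the receiving side. The sketch below assumes the sender or server assigns consecutive sequence numbers starting at 1; the class name and that numbering scheme are illustrative.

```python
class ReorderBuffer:
    """Release messages in sequence order even if they arrive out of order."""

    def __init__(self):
        self.next_seq = 1
        self.pending = {}   # seq -> message held until its turn

    def receive(self, seq, message):
        """Return the list of messages that can now be shown, in order."""
        if seq < self.next_seq or seq in self.pending:
            return []                       # duplicate from a retry; drop it
        self.pending[seq] = message
        ready = []
        while self.next_seq in self.pending:
            ready.append(self.pending.pop(self.next_seq))
            self.next_seq += 1
        return ready


buf = ReorderBuffer()
out = []
for seq, text in [(2, "b"), (1, "a"), (1, "a"), (4, "d"), (3, "c")]:
    out.extend(buf.receive(seq, text))
```

The same sequence numbers also make retried deliveries idempotent, since a repeated number is simply dropped.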
4. Hybrid Storage
Decision: Use MySQL for metadata, HBase for messages
Rationale:
- MySQL: Fast queries, ACID transactions
- HBase: Time-series data, scalable writes
- Right tool for right job
Trade-offs:
- ✅ Optimized for each use case
- ✅ Better performance
- ❌ More complexity
- ❌ Multiple systems to manage
Challenges and Solutions
Challenge 1: Real-time Delivery at Scale
Problem: Deliver messages to 2B+ users in real-time
Solution:
- Erlang/OTP for high concurrency
- WebSocket connections
- Message queues for reliability
- Horizontal scaling
Challenge 2: End-to-End Encryption
Problem: Implement encryption without breaking functionality
Solution:
- Signal Protocol (proven)
- Client-side encryption
- Key management
- Forward secrecy
Challenge 3: Group Messaging
Problem: Deliver messages to groups (up to 256 members)
Solution:
- Fan-out architecture
- Message queues per member
- Optimize for common case (small groups)
- Handle large groups efficiently
Challenge 4: Presence Management
Problem: Track presence for 2B+ users
Solution:
- In-memory storage (Redis)
- Distributed presence
- TTL-based expiration
- Privacy controls
Challenge 5: Media Delivery
Problem: Deliver media files efficiently
Solution:
- Object storage (S3-like)
- CDN for global delivery
- Compression before upload
- Progressive loading
Monitoring and Observability
Key Metrics
Performance Metrics:
- Message delivery latency
- Message delivery rate
- Presence update latency
- Media upload/download time
Reliability Metrics:
- Message delivery success rate
- System uptime
- Error rate
- Queue depth
Business Metrics:
- Daily active users
- Messages per day
- Group count
- Media shared
Alerting
- Alert on high message delivery latency
- Alert on message delivery failures
- Alert on system errors
- Alert on queue buildup
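A sketch of how these alerts might be evaluated over one metrics window. The latency threshold (1,000 ms) and failure-rate threshold (0.1%) mirror the non-functional requirements in this document; the queue-depth limit of 10,000 and all function names are assumptions.

```python
def percentile(values, pct):
    """Nearest-rank percentile of a non-empty list; pct is an integer 1..100."""
    ordered = sorted(values)
    rank = max(1, (pct * len(ordered) + 99) // 100)   # integer ceiling
    return ordered[rank - 1]


def check_alerts(latencies_ms, queue_depth, failures, attempts,
                 max_p99_ms=1000, max_queue_depth=10_000, max_failure_rate=0.001):
    """Return the list of alert names that should fire for this window."""
    alerts = []
    if latencies_ms and percentile(latencies_ms, 99) > max_p99_ms:
        alerts.append("high delivery latency")
    if attempts and failures / attempts > max_failure_rate:
        alerts.append("delivery failures")
    if queue_depth > max_queue_depth:
        alerts.append("queue buildup")
    return alerts


latencies = [120] * 98 + [1500, 2000]     # two slow deliveries out of 100
fired = check_alerts(latencies, queue_depth=500, failures=1, attempts=10_000)
```

Alerting on a high percentile rather than the mean catches tail latency that an average would hide.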
Best Practices
1. Message Delivery
- Use message queues for reliability
- Handle offline users
- Retry failed deliveries
- Track delivery status
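Retrying failed deliveries is typically done with exponential backoff and jitter so that many clients do not retry in lockstep. A minimal sketch; the attempt count, delays, and the `send` callback contract (returns True on success) are assumptions.

```python
import random
import time


def deliver_with_retry(send, message, max_attempts=5,
                       base_delay=0.5, max_delay=30.0, sleep=time.sleep):
    """Try send(message) up to max_attempts times; return True on success."""
    for attempt in range(max_attempts):
        if send(message):
            return True
        if attempt < max_attempts - 1:
            # Full jitter: random wait up to the capped exponential delay.
            sleep(random.uniform(0, min(max_delay, base_delay * 2 ** attempt)))
    return False


calls = []

def flaky_send(message):
    calls.append(message)
    return len(calls) >= 3     # fails twice, then succeeds

waits = []
ok = deliver_with_retry(flaky_send, b"ciphertext", sleep=waits.append)
```

Passing `sleep` as a parameter lets the example record the waits instead of actually pausing. When all attempts fail, the caller would fall back to the persistent queue rather than drop the message.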
2. Encryption
- Use proven protocols (Signal Protocol)
- Implement forward secrecy
- Secure key management
- Regular security audits
3. Presence Management
- Cache presence for performance
- Respect privacy settings
- Handle presence storms
- TTL-based expiration
4. Group Messaging
- Optimize for small groups
- Handle large groups efficiently
- Maintain message order
- Track delivery to all members
Quick Reference Summary
WhatsApp: Global messaging platform with 2B+ users, 100B+ messages/day.
Key Components:
- Message Service (Erlang/OTP)
- Presence Service
- End-to-end encryption (Signal Protocol)
- Media Service
- Group Messaging
Key Design Decisions:
- Erlang/OTP for concurrency and fault tolerance
- End-to-end encryption for security
- Message queues for reliable delivery
- Hybrid storage (MySQL + HBase)
Scaling Strategies:
- Horizontal scaling with Erlang/OTP
- Database sharding
- Caching (Redis)
- Geographic distribution
Remember: WhatsApp's success comes from combining high concurrency (Erlang), reliable message delivery (queues), and strong security (end-to-end encryption) at massive scale.