WebSockets + Presence
Persistent connections for real-time chat, notifications, and live collaboration.
WebSockets are the tool you reach for when the browser and server both need to talk at any moment: chat messages, typing indicators, multiplayer moves, collaborative cursors, and live presence. They start as ordinary HTTP, then upgrade into a persistent full-duplex connection where either side can send frames without waiting for a request.
The problem: real-time state lives on one server
Plain HTTP is stateless: any request can land on any application server. WebSockets are different. Once a client connects, the TCP connection is physically held by one process on one machine. That server now owns a small piece of sticky, in-memory state: socket id, subscribed rooms, last heartbeat, and user identity.
    client A ───── WebSocket ─────▶ ws-server-1
    client B ───── WebSocket ─────▶ ws-server-7

    User A sends "hello B"
      app code on ws-server-1 must somehow reach B's socket on ws-server-7

    If ws-server-7 crashes:
      B's socket disappears
      presence must expire
      B must reconnect and resume missed messages

The failure mode is not that one WebSocket is hard. The failure mode is that millions of WebSockets create millions of tiny stateful anchors spread across your fleet. If you only keep presence in local memory, other servers cannot find users. If you only broadcast locally, users on other boxes miss messages. If a server dies, stale "online" state can linger forever unless it has an expiry.
HTTP upgrade: how the connection begins
A WebSocket does not begin as magic new transport. The browser sends an HTTP request with Upgrade: websocket. If the server accepts, it responds with 101 Switching Protocols. After that point the same TCP connection stops carrying HTTP request/response messages and starts carrying WebSocket frames.
    GET /socket?token=eyJ... HTTP/1.1
    Host: api.example.com
    Connection: Upgrade
    Upgrade: websocket
    Sec-WebSocket-Key: dGhlIHNhbXBsZSBub25jZQ==
    Sec-WebSocket-Version: 13

    HTTP/1.1 101 Switching Protocols
    Connection: Upgrade
    Upgrade: websocket
    Sec-WebSocket-Accept: s3pPLMBiTxaQ9kYGzzhZRbK+xOo=

    -- from here on, both sides exchange WebSocket frames over the same TCP connection --

- Authentication: validate the cookie or bearer token during the upgrade. After the upgrade there is no ordinary per-request auth middleware, so bind the authenticated userId to the connection context immediately.
- Full-duplex frames: the server can send a notification while the client sends a typing event. Neither side has to wait for the other side to initiate a request.
- Load balancers: the load balancer must support HTTP upgrade and long-lived idle connections. Timeouts that are fine for REST endpoints can kill WebSockets unexpectedly.
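The handshake above can be checked by hand. The sketch below shows how a server derives Sec-WebSocket-Accept from the client's Sec-WebSocket-Key, using the fixed GUID from RFC 6455, and how it might recognize an upgrade request. The function names `compute_accept` and `is_upgrade_request` are hypothetical helpers for illustration, not part of any library.

```python
import base64
import hashlib
from typing import Dict

# Fixed GUID defined by RFC 6455 for the opening handshake.
WS_GUID = "258EAFA5-E914-47DA-95CA-C5AB0DC85B11"


def compute_accept(sec_websocket_key: str) -> str:
    """Return the Sec-WebSocket-Accept value for a client's Sec-WebSocket-Key."""
    digest = hashlib.sha1((sec_websocket_key + WS_GUID).encode("ascii")).digest()
    return base64.b64encode(digest).decode("ascii")


def is_upgrade_request(headers: Dict[str, str]) -> bool:
    """Check the headers that mark a request as a WebSocket upgrade (hypothetical helper)."""
    lowered = {name.lower(): value for name, value in headers.items()}
    # The Connection header may carry several comma-separated tokens.
    tokens = [t.strip().lower() for t in lowered.get("connection", "").split(",")]
    return (
        "upgrade" in tokens
        and lowered.get("upgrade", "").lower() == "websocket"
        and lowered.get("sec-websocket-version") == "13"
        and "sec-websocket-key" in lowered
    )


# The key from the request above yields the accept value from the response above.
print(compute_accept("dGhlIHNhbXBsZSBub25jZQ=="))
```

In practice a WebSocket library or framework performs this step; the sketch only makes the header exchange concrete.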
Scaling: connection servers + Redis pub/sub
A common production shape is a stateless API tier plus a separate WebSocket tier. Each WebSocket server keeps only its own local sockets in memory. Cross-server routing goes through a backplane such as Redis Pub/Sub, Redis Streams, NATS, or Kafka, depending on durability needs.
    onServerStart():
        // one subscription per server process, not one per socket
        redis.subscribe("deliver:" + serverId)

    onConnect(socket, userId):
        localSockets.add(socket.id, socket)
        // presence is a 30-second lease; heartbeats refresh it
        redis.setex("presence:user:" + userId, 30, serverId + ":" + socket.id)
        redis.sadd("room:doc:42", userId)

    sendToUser(userId, event):
        location = redis.get("presence:user:" + userId)
        if location == null:
            storeOfflineNotification(userId, event)
            return
        targetServerId, socketId = parse(location)
        redis.publish("deliver:" + targetServerId, { socketId, event })

    onRedisMessage({ socketId, event }):
        socket = localSockets.get(socketId)
        if socket != null:   // the socket may have closed since the lookup
            socket.send(JSON.stringify(event))

What the backplane does
- Location lookup: Redis tells the system which connection server currently owns a user's socket. The API server never needs to inspect any connection server's process memory.
- Fanout: when a document changes, publish one event to a room channel and let every connection server deliver it to its local subscribers for that room.
- Decoupling: application services emit events; connection servers translate those events into WebSocket frames.
| Approach | How it routes | Failure mode |
|---|---|---|
| Local memory only | Each server knows only its own sockets | Messages to users on other servers are lost |
| Sticky sessions only | Load balancer tries to keep user on same box | Crash or rescale still loses location; no cross-server fanout |
| Redis presence + pub/sub | Shared presence keys and per-server delivery channels | Redis becomes critical infrastructure; design reconnect and expiry |
| Durable log backplane | Kafka or streams retain events for replay | More latency and complexity, but better resume semantics |
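To make the "Redis presence + pub/sub" row concrete, here is a minimal in-process sketch of location lookup plus per-server delivery channels. `FakeBackplane` is a hypothetical in-memory stand-in, not a Redis client, and all class and function names are assumptions for illustration; a real deployment would replace it with calls to the actual backplane.

```python
from typing import Callable, Dict, List


class FakeBackplane:
    """In-memory stand-in for the shared presence store and pub/sub layer."""

    def __init__(self) -> None:
        self.presence: Dict[str, str] = {}  # userId -> "serverId:socketId"
        self.subscribers: Dict[str, List[Callable[[dict], None]]] = {}

    def subscribe(self, channel: str, handler: Callable[[dict], None]) -> None:
        self.subscribers.setdefault(channel, []).append(handler)

    def publish(self, channel: str, message: dict) -> int:
        handlers = self.subscribers.get(channel, [])
        for handler in handlers:
            handler(message)
        return len(handlers)


class ConnectionServer:
    """Owns only its local sockets; each socket is modeled as a list of sent frames."""

    def __init__(self, server_id: str, backplane: FakeBackplane) -> None:
        self.server_id = server_id
        self.backplane = backplane
        self.local_sockets: Dict[str, List[dict]] = {}
        # One subscription per server, made at startup.
        backplane.subscribe("deliver:" + server_id, self.on_message)

    def on_connect(self, socket_id: str, user_id: str) -> None:
        self.local_sockets[socket_id] = []
        self.backplane.presence[user_id] = self.server_id + ":" + socket_id

    def on_message(self, message: dict) -> None:
        outbox = self.local_sockets.get(message["socketId"])
        if outbox is not None:  # the socket may have closed since the lookup
            outbox.append(message["event"])


def send_to_user(backplane: FakeBackplane, user_id: str, event: dict) -> bool:
    """Route an event to the server that owns the user's socket; False if offline."""
    location = backplane.presence.get(user_id)
    if location is None:
        return False  # the caller would store an offline notification instead
    target_server_id, socket_id = location.split(":", 1)
    backplane.publish("deliver:" + target_server_id,
                      {"socketId": socket_id, "event": event})
    return True
```

Connecting user A to `ws-server-1` and user B to `ws-server-7`, then calling `send_to_user(backplane, "userB", ...)`, appends the event only to B's socket on `ws-server-7`. The sender never touches another server's memory.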
Presence: heartbeats, TTL keys, and fanout
Presence is not a boolean column called online. It is a lease. The client proves it is still connected by sending heartbeats; the server refreshes a Redis key with a short TTL. If the process crashes, the network disappears, or the user closes a laptop, the key expires naturally and the user becomes offline without a perfect cleanup path.
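The lease idea can be sketched without Redis. The class below is a hypothetical in-memory store that mimics a key with a TTL plus a compare-and-delete; the injectable clock is an assumption added so that expiry can be demonstrated without sleeping.

```python
import time
from typing import Callable, Dict, Optional, Tuple


class LeaseStore:
    """In-memory sketch of TTL-backed presence keys."""

    def __init__(self, clock: Callable[[], float] = time.monotonic) -> None:
        self._clock = clock
        self._entries: Dict[str, Tuple[str, float]] = {}  # key -> (value, expires_at)

    def set_with_ttl(self, key: str, ttl_seconds: float, value: str) -> None:
        """Create or renew a lease; every heartbeat calls this again."""
        self._entries[key] = (value, self._clock() + ttl_seconds)

    def get(self, key: str) -> Optional[str]:
        """Return the value, or None if the lease is missing or has expired."""
        entry = self._entries.get(key)
        if entry is None:
            return None
        value, expires_at = entry
        if self._clock() >= expires_at:
            del self._entries[key]
            return None
        return value

    def delete_if_value_matches(self, key: str, expected: str) -> bool:
        """Delete only if this connection still owns the key."""
        if self.get(key) == expected:
            del self._entries[key]
            return True
        return False


# Usage with a fake clock: the user goes offline without any cleanup call.
now = [0.0]
store = LeaseStore(clock=lambda: now[0])
store.set_with_ttl("presence:user:42", 30, "ws-server-7:sock-b")
now[0] = 31.0
print(store.get("presence:user:42"))  # None: the lease lapsed on its own
```

The compare-and-delete matters when a user reconnects to a different server before the old server notices the disconnect: the old server's cleanup must not erase the new location.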
    every 10 seconds from client:
        socket.send({ type: "ping", lastSeenEventId: 18421 })

    onHeartbeat(userId, socketId):
        // renew the 30-second lease; three missed heartbeats let it expire
        redis.setex("presence:user:" + userId, 30, serverId + ":" + socketId)
        redis.zadd("presence:last_seen", nowMillis(), userId)

    onDisconnect(userId, socketId):
        // best effort only; TTL is the real safety net
        // delIfValueMatches is pseudocode for an atomic compare-and-delete (for example, a small Lua script),
        // so a late disconnect cannot erase the key written by a newer connection
        redis.delIfValueMatches("presence:user:" + userId, serverId + ":" + socketId)

For a chat room or collaborative document, each connection also subscribes to one or more topics: room:team-7, doc:42, or game:abc. When an event arrives, connection servers fan it out only to local sockets that subscribed to that topic. This avoids one central server writing to every user socket directly.
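The local half of topic fanout can be sketched as a small registry that each connection server keeps in memory. The class and method names are hypothetical, and `send` stands in for whatever function writes a frame to a socket.

```python
from collections import defaultdict
from typing import Callable, Dict, Set


class LocalTopicRegistry:
    """Tracks which local sockets subscribed to which topic on one server."""

    def __init__(self) -> None:
        self._topic_to_sockets: Dict[str, Set[str]] = defaultdict(set)
        self._socket_to_topics: Dict[str, Set[str]] = defaultdict(set)

    def subscribe(self, socket_id: str, topic: str) -> None:
        self._topic_to_sockets[topic].add(socket_id)
        self._socket_to_topics[socket_id].add(topic)

    def remove_socket(self, socket_id: str) -> None:
        """Drop every subscription held by a socket that disconnected."""
        for topic in self._socket_to_topics.pop(socket_id, set()):
            members = self._topic_to_sockets[topic]
            members.discard(socket_id)
            if not members:
                del self._topic_to_sockets[topic]

    def fan_out(self, topic: str, event: dict,
                send: Callable[[str, dict], None]) -> int:
        """Send the event to local subscribers only; returns how many were reached."""
        delivered = 0
        for socket_id in sorted(self._topic_to_sockets.get(topic, ())):
            send(socket_id, event)
            delivered += 1
        return delivered
```

Each server would call `fan_out` when a room event arrives from the backplane, so the publisher sends one message per topic rather than one per user.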
Reconnect and resume missed work
Reconnection is the normal path, not an edge case. Mobile devices switch networks, browsers sleep, load balancers drain nodes, and deploys restart processes. A resilient client reconnects with exponential backoff and sends the last event id it processed. When the product cannot tolerate gaps, the server replays the missed events from a durable store.
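A minimal sketch of the client-side backoff, assuming a "full jitter" policy in which the delay is drawn uniformly between zero and an exponentially growing ceiling. The base of 0.5 seconds and the cap of 30 seconds are illustrative values, not recommendations from this lesson, and the function names are hypothetical.

```python
import random
from typing import Optional


def backoff_ceiling(attempt: int, base_seconds: float = 0.5,
                    cap_seconds: float = 30.0) -> float:
    """Upper bound for the delay before reconnect attempt number `attempt` (0-based)."""
    if attempt < 0:
        raise ValueError("attempt must be >= 0")
    exponent = min(attempt, 32)  # keep 2**exponent small enough for float math
    return min(cap_seconds, base_seconds * (2 ** exponent))


def reconnect_delay(attempt: int, base_seconds: float = 0.5,
                    cap_seconds: float = 30.0,
                    rng: Optional[random.Random] = None) -> float:
    """Pick a random delay in [0, ceiling] so clients do not reconnect in lockstep."""
    source = rng if rng is not None else random
    return source.uniform(0.0, backoff_ceiling(attempt, base_seconds, cap_seconds))


# Ceilings for the first few attempts: 0.5, 1.0, 2.0, 4.0 seconds, then capped at 30.
print([backoff_ceiling(n) for n in range(4)])
```

The jitter matters most after a deploy or a server crash, when thousands of clients lose their sockets at the same instant and would otherwise all retry on the same schedule.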
    client state:
        lastSeenEventId = 18421

    on reconnect:
        // a normal WebSocket upgrade request that carries the resume cursor
        GET /socket?lastSeenEventId=18421

    server:
        missed = messageStore.readAfter(userId, 18421)
        for event in missed:
            socket.send(event)
        socket.send({ type: "resume_complete" })

- Ephemeral events: typing indicators and cursor positions can be dropped. Replaying stale typing events after a reconnect would actively confuse users.
- Durable events: chat messages, notifications, and document edits need ids and replay from a database or log.
- Backpressure: slow clients need bounded queues. If a socket cannot keep up, disconnect it and force resume rather than letting memory grow without limit.
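The backpressure rule above can be sketched as a per-socket outbound buffer with a hard cap. `OutboundQueue` and the default limit of 256 pending events are assumptions for illustration; the right limit depends on event size and available memory.

```python
from collections import deque
from typing import Deque, List


class OutboundQueue:
    """Per-socket send buffer that gives up on clients that fall too far behind."""

    def __init__(self, max_pending: int = 256) -> None:
        self.max_pending = max_pending
        self.pending: Deque[dict] = deque()
        self.closed = False

    def enqueue(self, event: dict) -> bool:
        """Queue an event; returns False once the queue has been closed."""
        if self.closed:
            return False
        if len(self.pending) >= self.max_pending:
            # Too slow: drop the buffer and close. The client reconnects and
            # replays durable events from its last seen event id.
            self.pending.clear()
            self.closed = True
            return False
        self.pending.append(event)
        return True

    def drain(self, max_frames: int) -> List[dict]:
        """Pop up to max_frames events that the socket is ready to write."""
        frames: List[dict] = []
        while self.pending and len(frames) < max_frames:
            frames.append(self.pending.popleft())
        return frames
```

When `enqueue` returns False, the connection server would close the socket. Memory per connection stays bounded, and correctness is preserved by the resume path rather than by an ever-growing buffer.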
Gotchas and real-world examples
Real systems usually combine this pattern with other primitives. A chat app stores messages in a database, publishes delivery events to Redis, and uses WebSockets for immediate delivery. A collaborative editor stores operations durably, uses WebSockets for low-latency cursors, and treats presence as TTL-backed hints. A delivery app may use WebSockets for the driver console, but use SSE for customer-facing one-way location updates.
- Ordering: Redis Pub/Sub does not give a global ordering across every channel. Put sequence numbers on events when clients must detect gaps or reorder.
- Multi-device users: one user may have a phone, laptop, and tablet connected. Store presence per connection, not just per user, when delivery to all devices matters.
- Deploys: drain servers gracefully. Stop accepting new connections, tell clients to reconnect, then wait for active sockets to leave before killing the process.
- Redis limits: Pub/Sub is fast but not durable. If missing an event is unacceptable, publish to a durable log and use Redis only for live fanout.
Key takeaways
- WebSockets begin with an HTTP upgrade, then become a persistent full-duplex connection over the same TCP socket.
- The scaling problem is sticky state: each socket lives on exactly one connection server, so other servers need a shared backplane to find it.
- Presence should be modeled as a short-lived lease refreshed by heartbeats, not as a permanent boolean column.
- Fanout works by publishing events to the servers or rooms that have subscribers, then each server writes only to its local sockets.
- Reconnect and resume are mandatory: clients send last-seen ids, durable events replay, and ephemeral events can be safely dropped.