Relational DB + Read Replicas
A Postgres primary for writes plus replicas for reads — the workhorse of most systems.
A relational primary with read replicas is the default shape for transactional systems that have one authoritative database but many read requests. The primary accepts every insert, update, and delete. Replicas continuously copy those changes and answer read-only traffic so the primary can stay focused on correctness-critical writes.
The problem: read load can drown the writer
Relational databases are wonderful at enforcing invariants: foreign keys, uniqueness, transactions, and constraints. The catch is that the same primary that protects those invariants is also doing buffer-cache work, query planning, index lookups, sorting, and connection management for reads. A dashboard query or hot profile page can consume the CPU and I/O budget needed by checkout, posting, or account updates.
clients
├─ POST /orders ─┐
├─ PATCH /users ─┼──▶ primary database
├─ GET /feed ─┤ writes + reads + constraints + indexes
└─ GET /profile ─┘ one CPU / IOPS budgetRead replicas split that work. They do not make the database magically distributed for writes; every write still lands on the primary first. But they give you more machines that can answer repeatable, read-only questions such as product pages, timelines, reporting dashboards, and search filters backed by SQL.
| Traffic pattern | Replica benefit | Why |
|---|---|---|
| Read-heavy app | High | Most GETs can leave the primary, freeing it for writes |
| Write-heavy app | Low | Every write still commits on the same primary and then replicates |
| Analytical dashboards | Medium to high | Long reads can run on a separate replica or reporting follower |
| Strong read-after-write workflow | Conditional | You may need to read from primary for a short window |
How it works: primary writes, replicas replay changes
In Postgres, MySQL, and similar systems, the primary records each committed change in a durable replication log. Replicas stream that log and replay it in the same order. The log is the source of truth for copying state, which is why replicas eventually converge on the same data as the primary.
1. client sends UPDATE users SET plan = 'pro' WHERE id = 42
2. primary validates constraints and commits the transaction
3. primary appends the change to its replication log / WAL / binlog
4. replica streams the log entry over the network
5. replica replays the change into its local storage
6. later GET /users/42 can be served by that replicaAsynchronous versus synchronous replication
| Mode | Write latency | Data-loss risk | Typical use |
|---|---|---|---|
| Async replica | Low | A tiny window if primary dies before replica catches up | Most read scaling |
| Sync replica | Higher | Lower because commit waits for another node | Critical HA pairs |
| Semi-sync | Middle | Reduced but not eliminated | Managed database defaults in some setups |
Many production systems combine these: one synchronous standby for high availability and several asynchronous read replicas for scale. The important mental model is that a replica is not an independent writer. It is a follower replaying the primary's history.
Connection routing: separate read and write pools
Applications usually expose two database paths: a writer pool that only connects to the primary, and a reader pool that load-balances across replicas. This can be explicit in code, hidden behind an ORM, or handled by a proxy such as PgBouncer, ProxySQL, RDS endpoints, or a service mesh.
function dbFor(request) {
if (request.method !== 'GET') return writerPool; // mutations go primary
if (request.needsFreshData) return writerPool; // read-your-writes path
return readerPool.pickLeastLoadedReplica(); // ordinary reads
}- Writer pool: used for transactions, locks, uniqueness checks, and any read that must observe the latest committed write.
- Reader pool: used for idempotent pages, feed browsing, analytics, exports, and other reads that tolerate slight staleness.
- Health-aware routing: a replica with high lag or failed replication should be removed from the read pool until it catches up.
Replica lag and the read-your-writes problem
Replication is often asynchronous, so a user can write successfully and then immediately read from a replica that has not replayed that write yet. This is the classic read-your-writes failure mode: the system accepted the change, but the user interface appears to deny it happened.
T+000ms POST /posts → primary commits post_id=900
T+020ms response returns 201
T+050ms GET /me/posts → routed to replica A
T+051ms replica A last replayed T-2s, so post_id=900 is missing
T+300ms replica A catches up; refresh now shows the postMitigations
| Technique | How it works | Trade-off |
|---|---|---|
| Read from primary after write | Mark the next few reads as fresh and send them to the writer | Adds primary read load |
| Sticky routing | Keep the user/session on primary or one caught-up replica for a short TTL | Requires session metadata |
| Version checks | Send last-seen commit timestamp or LSN and choose a replica that has reached it | More plumbing, best correctness |
| UI optimism | Render the submitted item locally while replicas catch up | Must reconcile failures carefully |
The simplest production rule is to set a short freshness window after a mutation. For example, after a profile update, store fresh_until = now() + 5sin the request context or session and route that user's reads to the primary until the window expires. Larger systems route by replication position: the primary returns a commit LSN, and the router only selects replicas that have replayed at least that LSN.
Failover: promotion changes who the primary is
Replicas are also your escape hatch when the primary dies. A failover controller detects that the primary is unhealthy, chooses the best caught-up replica, promotes it, and moves the writer endpoint to the new primary. Clients reconnect and continue writing to the promoted node.
before: app writer endpoint ─▶ primary P
replicas R1, R2 stream from P
failure: P stops accepting writes
failover:
1. fence P so it cannot come back as a second writer
2. choose R1 because it has the least replication lag
3. promote R1 to primary
4. repoint writer endpoint to R1
5. rebuild P as a replica after repair- Fencing matters: the old primary must be prevented from accepting writes after a network partition, otherwise you risk two primaries diverging.
- Lag matters: promoting a lagging replica can lose the last acknowledged async writes. Synchronous standbys reduce that window at the cost of slower writes.
- Connection draining matters: application pools cache old connections. Set sensible timeouts and retry transactions that fail during promotion.
Trade-offs, gotchas, and real-world examples
- Replicas help reads, not writes: a social app with 95% timeline reads can benefit enormously. A payment ledger with mostly inserts and balance updates probably needs schema/index tuning, batching, or sharding instead.
- Long replica queries can create lag: heavy exports and dashboards may delay replay or compete for disk. Use a dedicated reporting replica when business intelligence jobs are noisy.
- Schema changes replicate too: migrations must be designed for rolling deploys. A destructive migration can break readers on replicas just as surely as it breaks the primary.
- Replica reads are still database reads: replicas reduce load on the primary, but they do not replace caching, pagination, good indexes, or query limits.
Common examples include ecommerce product pages served from replicas while checkout writes to the primary, SaaS dashboards running against a reporting follower, and social profiles where recent edits read from the primary for a few seconds before returning to the replica pool.
- A relational primary handles all writes; read replicas stream and replay its log so they can serve read-only traffic.
- Replicas are best for read-heavy workloads; they do not increase the primary write capacity.
- Asynchronous replication creates lag, which causes stale reads and read-your-writes failures unless you route fresh reads carefully.
- Use separate read/write pools, health checks, lag thresholds, and read-only replica credentials to keep routing safe.
- Failover promotes a caught-up replica to primary, but fencing, connection retries, and potential async data loss must be planned.
Mark it complete to track your progress through the workbook.