DrawLintDrawLint.ai
🗺️Design Patterns·6 min read

Cassandra (Wide-Column)

A write-optimized NoSQL store for high-throughput, time-ordered, append-heavy data.

Cassandra is a distributed wide-column database built for enormous write throughput, predictable horizontal scaling, and availability across failures. It is excellent when your access patterns are known ahead of time, your data is naturally partitioned, and your app can tolerate tunable eventual consistency instead of relational joins.

🔭Think of it like…
Cassandra is like a warehouse with many identical packing stations. A label on each package says which station owns it, and packages at that station are stacked in a deliberate order. Add more stations and the warehouse can accept more packages per second, but workers will not run across the building to assemble a package from five unrelated stations.

The problem: one write leader cannot absorb every firehose

Relational databases are general-purpose and deeply powerful, but a single primary can become the write bottleneck for append-heavy systems: telemetry pings, chat messages, ad impressions, sensor readings, clickstream events, and time-series metrics. These workloads often need to accept millions of small writes, keep working during node failures, and read data by a narrow key plus time range.

append-heavy workload shape
writes:  device_id=abc, ts=10:00:01, temp=21.4
writes:  device_id=abc, ts=10:00:02, temp=21.5
writes:  device_id=xyz, ts=10:00:02, temp=19.9

reads:   latest readings for device abc in the last hour
         messages in conversation 123 after cursor T
         events for customer 42 on 2026-06-10

Cassandra chooses a different set of trade-offs. It gives up joins, foreign keys, and arbitrary ad hoc queries so it can route each write directly to the nodes responsible for that partition. If your query can be answered by a known partition key and ordered clustering columns, Cassandra can be extremely fast and resilient.

NeedCassandra fitReason
High write throughputStrongWrites are distributed across partitions and nodes
Time-series lookupsStrongPartition by entity/time bucket and cluster by timestamp
Complex joinsPoorNo join engine; model one table per query
Strict multi-row transactionsPoorConsistency is tunable, not classic relational ACID across many rows

Data model: partition key plus clustering columns

A Cassandra table is organized around its primary key, which has two jobs. The partition key decides which nodes store a group of rows. The clustering columns decide how rows are sorted inside that partition. This is why data modeling starts with the read query, not with an entity-relationship diagram.

time-bucketed readings table
CREATE TABLE readings_by_device_day (
  device_id text,
  day date,
  reading_ts timestamp,
  temperature double,
  battery_percent int,
  PRIMARY KEY ((device_id, day), reading_ts)
) WITH CLUSTERING ORDER BY (reading_ts DESC);

-- partition key: (device_id, day)
-- clustering column: reading_ts
-- efficient query: latest readings for one device on one day
  • Partition key: spreads data across the cluster. A good key has high cardinality and avoids routing most traffic to one hot node.
  • Clustering columns: define on-disk order within a partition. They make range scans such as newest messages or last hour of metrics efficient.
  • Time buckets: bound partition size. A partition per device forever can become huge; a partition per device per day or month is easier to compact, repair, and read.
Query-first means denormalized by design
If you need three query shapes, you often create three tables containing overlapping data. Cassandra optimizes predictable reads and writes, not normalized storage. The application or pipeline keeps those denormalized views in sync.

The LSM-tree write path: commit log → memtable → SSTable

Cassandra is fast at writes because it avoids random in-place updates. A write is appended to a durable commit log, placed in an in-memory memtable, and later flushed as immutable sorted files called SSTables. This Log-Structured Merge-tree design turns many tiny writes into large sequential disk writes.

Cassandra write path
client write
  └─▶ coordinator node receives mutation
        ├─ append to commit log          // durability before ack
        ├─ update memtable               // in-memory sorted structure
        ├─ send to replica nodes         // based on replication factor
        └─ ack when consistency level is satisfied

memtable fills
  └─▶ flush immutable SSTable to disk

background
  └─▶ compact SSTables, merging rows and discarding old tombstones

Why reads need more work

Because updates create new versions instead of editing old bytes, reading may consult the memtable, row cache, Bloom filters, partition indexes, and several SSTables before returning the latest value. Compaction is the background process that merges SSTables, removes overwritten data, and keeps read amplification under control.

ComponentWrite roleRead role
Commit logDurable append before acknowledgmentReplayed after crash recovery
MemtableReceives recent writes in memoryChecked first for freshest rows
SSTableImmutable flushed data on diskScanned with indexes and Bloom filters
CompactionMerges many immutable files laterReduces stale versions and tombstones

Tunable consistency: ONE, QUORUM, ALL

Cassandra is leaderless: for a given partition, multiple replicas can accept writes. The client chooses how many replicas must acknowledge a read or write through a consistency level. With replication factor 3, the common choices are ONE, QUORUM, and ALL.

LevelAcknowledgment ruleTrade-off
ONEOne replica repliesLowest latency and highest availability, but stale reads are more likely
QUORUMA majority repliesBalanced default; read quorum plus write quorum overlap
ALLEvery replica repliesStrongest freshness, but one slow replica can fail the operation
quorum overlap with replication factor 3
replication factor = 3
write CL = QUORUM  → at least 2 replicas store the write
read  CL = QUORUM  → at least 2 replicas answer the read

Any read quorum of 2 overlaps any write quorum of 2,
so the read should encounter the latest write or trigger repair.

This is not the same as full relational serializability. Clock skew, conflicting writes, tombstones, and repair behavior still matter. But tunable consistency lets you decide per operation whether latency, availability, or freshness is most important. This idea pairs with the quorum ideas in WAL + quorum.

No joins: model one table per query

Cassandra does not reward asking new questions at runtime. You design the table so the query can be answered by a partition lookup and a narrow ordered scan. If a product feature needs a different sort order or lookup key, you add another denormalized table rather than adding a join.

same messages stored for two query shapes
-- Query 1: load a conversation in time order
messages_by_conversation (
  conversation_id,
  bucket_day,
  message_ts,
  message_id,
  sender_id,
  body,
  PRIMARY KEY ((conversation_id, bucket_day), message_ts, message_id)
)

-- Query 2: load recent messages sent by a user
messages_by_sender (
  sender_id,
  bucket_day,
  message_ts,
  message_id,
  conversation_id,
  body,
  PRIMARY KEY ((sender_id, bucket_day), message_ts, message_id)
)
  • Write amplification is expected: one logical message may be inserted into multiple tables.
  • Deletes create tombstones, which remain until compaction and can hurt reads if overused.
  • Cross-partition scans are a smell. If a query needs to scan the whole cluster, use a search engine, analytics store, or different database.
Related scaling primitive
Cassandra bakes in partitioning as a first-class design constraint. For broader partitioning trade-offs, see sharding and partitioning.

When to use it, and the gotchas to respect

Cassandra is a strong fit for high-volume time-series metrics, IoT readings, chat/event timelines, fraud signals, activity feeds, and append-only audit-style data where the access patterns are known and locality is clear. It is usually a poor fit for highly relational product catalogs, financial ledgers needing strict transactions across accounts, and admin tools full of ad hoc filters.

  • Hot partitions: celebrity users, global counters, or a partition key such as country = 'US' can overload a small set of nodes.
  • Unbounded partitions: a forever-growing user timeline can become expensive to read and compact. Add time buckets.
  • Tombstone storms: TTLs and deletes create tombstones. Massive tombstone scans are a classic cause of slow reads.
  • Operational discipline: repairs, compaction strategy, replication factor, and disk utilization are part of the product, not background trivia.
Key takeaways
  • Cassandra is a wide-column, leaderless database optimized for massive, partitionable write throughput.
  • The partition key decides data placement; clustering columns decide order within a partition, so model from queries first.
  • The LSM write path appends to a commit log, updates a memtable, flushes immutable SSTables, and relies on compaction.
  • Consistency is tunable per operation with levels such as ONE, QUORUM, and ALL; quorum is a balanced default, not magic serializability.
  • Use Cassandra for known, append-heavy access patterns such as time-series and feeds; avoid it for joins, ad hoc queries, and strict multi-row transactions.
Cassandra can efficiently read by partition key and clustering order, but it does not join arbitrary tables at query time. Each important access pattern needs a table shaped for that lookup, even if that means storing denormalized copies.
The database avoids random in-place updates. It appends to a durable log, updates memory, and later flushes sorted immutable files. Background compaction pays the merge cost after the write has been accepted.
A quorum write reaches at least two replicas, and a quorum read checks at least two replicas. Those sets overlap, which gives a practical balance of freshness, latency, and availability for many workloads.
Finished this lesson?

Mark it complete to track your progress through the workbook.