
Elasticsearch for Search

A search index fed from your database via CDC, for full-text, geo, and faceted queries.

Elasticsearch is a distributed search engine you run alongside your source-of-truth database. It powers full-text search, relevance ranking, faceted filters, geo queries, and aggregations by maintaining a purpose-built search index. It is not the system of record; it is a fast, query-optimized projection of authoritative data.

🔭Think of it like…
Your relational database is the courthouse archive: authoritative, carefully updated, and legally binding. Elasticsearch is the index desk at the front: it can quickly tell you which files mention a phrase, fall in a date range, or relate to a location, but the official answer still comes from the archive itself.

The problem: transactional indexes are not search engines

A database B-tree index is excellent for equality, ordering, and range predicates such as user_id = 42 or created_at > now() - interval '7 days'. Product search asks a different kind of question: find documents containing related words, rank them by relevance, tolerate typos, filter by facets, group counts by brand, and maybe restrict results to a map viewport.

queries that push past ordinary SQL indexes
search: "waterproof hiking boot"
filters: brand in ["Merrell", "Salomon"], size = 10, price < 150
rank: exact phrase > all terms > fuzzy matches > popularity boost
facets: count matching results by brand, size, color
geo: only stores within 25 km of the user

You can force some of this into SQL, but relevance scoring, analyzers, token positions, fuzzy matching, and distributed aggregations are what Elasticsearch was built to do. The trade-off is that the search index is eventually consistent with your database and must be rebuilt or replayed if it drifts.

| System | Best at | Not best at |
| --- | --- | --- |
| Relational database | Transactions, constraints, authoritative writes | Full-text ranking at large scale |
| Elasticsearch | Search, facets, geo, aggregations | Being the final source of truth |
| Cache | Serving known hot answers | Discovering ranked matches across text |

Index pipeline: feed search from the source of truth

The safest architecture writes business facts to the database first and then feeds Elasticsearch from those changes. Many teams use Change Data Capture from the database log; others use an outbox table that is written in the same transaction as the business change. Either way, search is downstream from truth.

CDC-fed search index
application
  └─▶ Postgres transaction commits product/order/venue row
        └─▶ WAL / outbox event records the change
              └─▶ CDC connector or outbox worker publishes event
                    └─▶ indexer transforms row into search document
                          └─▶ Elasticsearch bulk index / update / delete
  • Database first: writes, constraints, and inventory checks happen in the authoritative store.
  • Indexer second: a worker projects the row into a document optimized for search, often denormalizing related fields.
  • Validation last: correctness-critical actions such as booking a ticket, buying inventory, or editing permissions re-check the database even if Elasticsearch found the candidate.
Related pipeline pattern
The reliable way to move committed database changes into search is the Outbox / CDC pattern. For the lower-level indexing concepts, see Search Engines.
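
As a rough illustration of the indexer step, the sketch below turns change events into the newline-delimited body that the Elasticsearch `_bulk` endpoint accepts. The event shape (`op`, `row`), the column names (`brand_name`, `stock`), and the function names are assumptions made for this example, not part of any connector's real output.

```python
import json

def to_bulk_lines(event, index="products"):
    """Turn one change event into _bulk NDJSON lines.

    `event` is a hypothetical shape: {"op": "upsert" | "delete", "row": {...}}.
    """
    row = event["row"]
    doc_id = str(row["id"])
    if event["op"] == "delete":
        # A delete action is a single line with no document body.
        return [json.dumps({"delete": {"_index": index, "_id": doc_id}})]
    # Project the row into a search document, denormalizing as needed.
    doc = {
        "title": row["title"],
        "brand": row["brand_name"],
        "price_cents": row["price_cents"],
        "available": row["stock"] > 0,
    }
    # An index action is an action line followed by the document source.
    return [
        json.dumps({"index": {"_index": index, "_id": doc_id}}),
        json.dumps(doc),
    ]

def build_bulk_body(events, index="products"):
    """Join the lines for many events into one request body."""
    lines = []
    for event in events:
        lines.extend(to_bulk_lines(event, index))
    return "\n".join(lines) + "\n"  # the bulk body must end with a newline

events = [
    {"op": "upsert", "row": {"id": 1, "title": "Waterproof hiking boot",
                             "brand_name": "Merrell", "price_cents": 12000, "stock": 3}},
    {"op": "delete", "row": {"id": 2}},
]
print(build_bulk_body(events))
```

Using the database primary key as `_id` keeps replays idempotent: reprocessing the same event overwrites the same document instead of creating a duplicate.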

Inverted index recap: terms point to documents

A normal row store answers by starting from rows. An inverted index starts from terms. During indexing, Elasticsearch analyzes text into tokens and records which documents contain each token, plus positions, frequencies, and optional norms for scoring. At query time it jumps straight to candidate documents instead of scanning every product name or article body.

tiny inverted index
documents:
  1: "waterproof hiking boot"
  2: "lightweight trail shoe"
  3: "waterproof rain jacket"

inverted index:
  waterproof → [1, 3]
  hiking     → [1]
  boot       → [1]
  trail      → [2]
  jacket     → [3]

query "waterproof boot" → intersect/score postings for waterproof + boot

Real indexes add token positions for phrase queries, term frequencies for scoring, skip lists for fast traversal, doc values for sorting and aggregations, and segment-level metadata for pruning. The beginner mental model remains simple: search is fast because terms already point to candidate documents.
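
The tiny index above can be reproduced in a few lines of Python. This is a toy sketch of the idea only: it splits on whitespace, ignores scoring and positions, and the function names are invented for the example.

```python
from collections import defaultdict

def build_inverted_index(docs):
    """Map each lowercase token to the set of document ids containing it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for token in text.lower().split():
            index[token].add(doc_id)
    return index

def search_all_terms(index, query):
    """Return ids of documents that contain every query term."""
    postings = [index.get(term, set()) for term in query.lower().split()]
    if not postings:
        return []
    # Intersect the posting lists instead of scanning every document.
    return sorted(set.intersection(*postings))

docs = {
    1: "waterproof hiking boot",
    2: "lightweight trail shoe",
    3: "waterproof rain jacket",
}
index = build_inverted_index(docs)
print(search_all_terms(index, "waterproof boot"))  # [1]
print(search_all_terms(index, "waterproof"))       # [1, 3]
```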

Mappings and analyzers shape search behavior

A mapping tells Elasticsearch how each field should be indexed: full-text text, exact keyword, numeric, date, geo point, nested object, and so on. An analyzer controls how text becomes tokens: lowercasing, stemming, synonyms, stop-word removal, edge n-grams for autocomplete, or language-specific rules.
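
To make the analyzer idea concrete, here is a deliberately crude Python imitation of an analysis chain. It is not how Elasticsearch's analyzers are implemented; the stop-word list, the suffix-stripping rule, and the function names are simplifications assumed for illustration.

```python
import re

STOP_WORDS = {"a", "an", "the", "for", "and", "of"}

def toy_stem(token):
    """Crude suffix stripping; real stemmers are far more careful."""
    for suffix in ("ing", "es", "s"):
        if token.endswith(suffix) and len(token) - len(suffix) >= 3:
            return token[: -len(suffix)]
    return token

def analyze(text):
    """Lowercase, tokenize, drop stop words, then stem each token."""
    tokens = re.findall(r"[a-z0-9]+", text.lower())
    return [toy_stem(t) for t in tokens if t not in STOP_WORDS]

def edge_ngrams(token, min_gram=2, max_gram=5):
    """Prefixes of a token, the building block of prefix autocomplete."""
    upper = min(max_gram, len(token))
    return [token[:n] for n in range(min_gram, upper + 1)]

print(analyze("Waterproof Boots for Hiking"))  # ['waterproof', 'boot', 'hik']
print(analyze("hikes"))                        # ['hik']
print(edge_ngrams("boot"))                     # ['bo', 'boo', 'boot']
```

Because "Hiking" and "hikes" reduce to the same token, a query for one can match a document containing the other. A keyword field skips this chain entirely and stores the original string as a single exact term.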

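mapping with text, keyword, geo, and facets (the autocomplete analyzer is a custom one defined in settings)
PUT products
{
  "settings": {
    "analysis": {
      "filter": {
        "autocomplete_filter": { "type": "edge_ngram", "min_gram": 2, "max_gram": 15 }
      },
      "analyzer": {
        "autocomplete": { "type": "custom", "tokenizer": "standard", "filter": ["lowercase", "autocomplete_filter"] }
      }
    }
  },
  "mappings": {
    "properties": {
      "title": { "type": "text", "analyzer": "english" },
      "title_suggest": { "type": "text", "analyzer": "autocomplete", "search_analyzer": "standard" },
      "brand": { "type": "keyword" },
      "size": { "type": "keyword" },
      "price_cents": { "type": "integer" },
      "available": { "type": "boolean" },
      "store_location": { "type": "geo_point" },
      "updated_at": { "type": "date" }
    }
  }
}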
| Field choice | Use it for | Common mistake |
| --- | --- | --- |
| text | Full-text search with analysis and relevance | Using it for exact filters or aggregations |
| keyword | Exact match, sorting, facets, IDs | Expecting stemming or typo tolerance |
| numeric/date | Ranges, sorting, histograms | Indexing numbers as strings |
| geo_point | Distance and bounding-box queries | Forgetting coordinate normalization |

Mappings are hard to change in place
Changing a field from text to keyword, or replacing an analyzer, usually requires creating a new index and reindexing. Production systems use versioned index names and an alias cutover to avoid downtime.
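
The alias cutover can be sketched as a single request to the `_aliases` endpoint that removes the alias from the old index and adds it to the new one. The helper below only builds that request body; the alias and index names (`products`, `products_v1`, `products_v2`) and the function name are placeholders assumed for this example.

```python
import json

def alias_cutover_actions(alias, old_index, new_index):
    """Build a body for POST /_aliases that moves `alias` to `new_index`.

    Both actions travel in one request, so searches against the alias
    switch from the old index to the new one in a single step.
    """
    return {
        "actions": [
            {"remove": {"index": old_index, "alias": alias}},
            {"add": {"index": new_index, "alias": alias}},
        ]
    }

body = alias_cutover_actions("products", "products_v1", "products_v2")
print(json.dumps(body, indent=2))
```

Applications keep querying `products` throughout; only the index behind the alias changes, which is why the old index can be kept around briefly as a rollback target.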

Query types: match, term, bool, geo, aggregations

Elasticsearch exposes different query families because not every field should be interpreted the same way. A full-text search for a sentence is different from an exact filter on a brand, and both are different from a distance query or a facet count.

| Query type | What it means | Example |
| --- | --- | --- |
| match | Analyze query text and score full-text matches | Search title for waterproof boots |
| term | Exact token match, usually keyword fields | brand is exactly Patagonia |
| bool | Combine must, should, filter, must_not clauses | Text query plus price and availability filters |
| geo | Distance, polygon, or bounding-box search | Stores within 25 km |
| aggregations | Group, count, histogram, percentile over matches | Facet counts by brand and size |

bool query with full-text, filters, geo, and facets
GET products/_search
{
  "query": {
    "bool": {
      "must": [
        { "match": { "title": "waterproof hiking boot" } }
      ],
      "filter": [
        { "term": { "available": true } },
        { "range": { "price_cents": { "lte": 15000 } } },
        { "geo_distance": { "distance": "25km", "store_location": "47.6,-122.3" } }
      ]
    }
  },
  "aggs": {
    "brands": { "terms": { "field": "brand" } },
    "sizes": { "terms": { "field": "size" } }
  }
}

Filters versus scoring

Put yes-or-no constraints in filter context so Elasticsearch can cache bitsets and avoid spending relevance work on them. Put text relevance in query context so matches can be scored and ranked.
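
One way to keep that separation honest in application code is to build the request body from two distinct lists. The sketch below is only a body builder, not a client; the function and parameter names are hypothetical, and the field names follow the mapping shown earlier.

```python
def build_search_body(text, max_price_cents=None, only_available=True, brands=None):
    """Build a bool query: scored text in `must`, yes/no constraints in `filter`."""
    filters = []
    if only_available:
        filters.append({"term": {"available": True}})
    if max_price_cents is not None:
        filters.append({"range": {"price_cents": {"lte": max_price_cents}}})
    if brands:
        # `terms` matches any of several exact values on a keyword field.
        filters.append({"terms": {"brand": list(brands)}})
    return {
        "query": {
            "bool": {
                "must": [{"match": {"title": text}}],  # contributes to the score
                "filter": filters,                      # never affects the score
            }
        }
    }

body = build_search_body(
    "waterproof hiking boot",
    max_price_cents=15000,
    brands=["Merrell", "Salomon"],
)
print(body["query"]["bool"]["filter"])
```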

Near-real-time refresh, shards, and replicas

Elasticsearch is near real time. Indexed documents become searchable only after a refresh, which by default runs roughly once per second (the index.refresh_interval setting). Each refresh publishes new immutable segments for search. That is fast enough for product search and logs, but it is not the same as reading your own committed database transaction immediately.

near-real-time visibility
T+000ms database transaction commits product price change
T+030ms CDC event reaches indexer
T+080ms Elasticsearch indexes the document into an in-memory buffer
T+1000ms refresh opens a new searchable segment
T+1001ms search results can now include the new price

| Concept | What it does | Design impact |
| --- | --- | --- |
| Primary shard | Owns a slice of the index | More shards can spread indexing/search, but too many add overhead |
| Replica shard | Copy of a primary shard | Improves availability and read throughput |
| Refresh | Makes buffered changes searchable | Lower intervals improve freshness but cost CPU/I/O |
| Segment | Immutable Lucene index file | Merges happen in the background |

  • Shard count is a capacity decision. Oversharding creates cluster-state and heap overhead; undersharding can make shards too large to move or query efficiently.
  • Replicas help search throughput and availability, but they do not make stale CDC events fresher.
  • Bulk indexing is usually much faster and cheaper than one document per network request.
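
The gap between "indexed" and "searchable" can be pictured with a toy model: writes land in a buffer, and only a refresh makes them visible to search. This is a conceptual sketch with invented class and method names, not a description of Lucene internals.

```python
class ToyIndex:
    """Toy model of near-real-time visibility."""

    def __init__(self):
        self._buffer = {}      # indexed but not yet searchable
        self._searchable = {}  # visible to search after a refresh

    def index(self, doc_id, doc):
        """Accept a write; it is not visible to search yet."""
        self._buffer[doc_id] = doc

    def refresh(self):
        """Publish buffered writes so searches can see them."""
        self._searchable.update(self._buffer)
        self._buffer.clear()

    def search_price(self, doc_id):
        """Return the price a search would currently see, or None."""
        doc = self._searchable.get(doc_id)
        return None if doc is None else doc["price_cents"]

idx = ToyIndex()
idx.index(1, {"price_cents": 12000})
print(idx.search_price(1))  # None: indexed, but no refresh has happened
idx.refresh()
print(idx.search_price(1))  # 12000
idx.index(1, {"price_cents": 9900})
print(idx.search_price(1))  # 12000: the old price until the next refresh
```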

Why the relational DB stays authoritative

Search results are candidates, not final decisions. Elasticsearch may be behind because CDC lagged, a refresh has not happened, a shard was relocating, or a previous indexing job failed and needs replay. The source-of-truth database is where you enforce permissions, inventory, uniqueness, payment state, and business invariants.

  • Stale availability: search can show a hotel room or ticket that was just booked. Reservation must happen in the database.
  • Out-of-order events: CDC consumers need versions or timestamps so an older update does not overwrite a newer document.
  • Deletes: hard deletes, soft deletes, and privacy removals must be propagated and monitored carefully.
  • Reindexing: analyzer or mapping changes require building a new index from the authoritative store, then switching an alias.
Validate before committing user-visible truth
Use Elasticsearch to discover candidates quickly. Before charging, booking, granting access, or showing private data, re-read the database in the transaction or authorization path that owns the invariant.
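
The out-of-order problem listed above is commonly handled with a version guard: apply an event only if its version is newer than what the index already holds. The sketch below models this with a plain dictionary standing in for the index; the event shape and function name are assumptions for the example. Elasticsearch offers a comparable check through external versioning (`version_type=external`), where the version would typically come from a database row version or log sequence number.

```python
def apply_event(index, event):
    """Apply a change event only if it is newer than the stored version.

    `index` is a dict standing in for the search index; `event` is a
    hypothetical shape: {"id", "version", "op", "doc"}.
    Returns True if the event was applied, False if it was skipped.
    """
    doc_id = event["id"]
    current = index.get(doc_id)
    if current is not None and current["version"] >= event["version"]:
        return False  # stale or duplicate event: keep the newer state
    if event["op"] == "delete":
        # Keep a tombstone so a late, older upsert cannot resurrect the document.
        index[doc_id] = {"version": event["version"], "deleted": True, "doc": None}
    else:
        index[doc_id] = {"version": event["version"], "deleted": False, "doc": event["doc"]}
    return True

index = {}
print(apply_event(index, {"id": 7, "version": 2, "op": "upsert", "doc": {"price_cents": 9900}}))   # True
print(apply_event(index, {"id": 7, "version": 1, "op": "upsert", "doc": {"price_cents": 12000}}))  # False
print(index[7]["doc"])  # {'price_cents': 9900}
```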
Key takeaways
  • Elasticsearch is a search index fed from the source-of-truth database; it is not the system of record.
  • Inverted indexes make search fast by mapping analyzed terms to matching documents, positions, and scoring metadata.
  • Mappings and analyzers determine whether fields support full-text relevance, exact filters, facets, geo search, ranges, and sorting.
  • Search queries combine match, term, bool, geo, and aggregation clauses; filters constrain results while query clauses score them.
  • Near-real-time refresh, CDC lag, shards, and replicas create freshness and operational trade-offs, so correctness-critical actions must validate against the database.