Search Engines (Elasticsearch)
Full-text and faceted search built on inverted indexes — and why it's a separate system.
A search engine such as Elasticsearch is a specialized system for fast, ranked, full-text discovery over documents. It answers questions like "which products match these words, filters, typos, and facets?" much better than a primary database that is optimized for transactions and exact lookups.
The problem: search is not just filtering rows
The failure mode is treating a search box as a simple SQL filter. A query like WHERE title LIKE '%phone%' cannot use an ordinary B-tree index because of the leading wildcard, so it scans every row. It also misses synonyms and typos, has no notion of relevance ranking, and struggles to return facet counts such as brand, size, and price range. Users expect the opposite: instant results, spelling tolerance, ranked matches, and filters that update as they type.
- Text is messy: users type plurals, punctuation, casing, misspellings, abbreviations, and phrases.
- Ranking matters: a document that matches the title should usually rank above one that only matches a footnote.
- Facets matter: commerce and document search often need counts by category, tag, price range, author, or geography.
- Operational isolation matters: expensive exploratory queries should not slow down checkout, payments, or writes.
The inverted index: the key data structure
A normal index maps a document ID to fields. An inverted index flips that around: it maps each term to the documents that contain it, often with positions, frequencies, and field information. A query becomes a fast lookup of posting lists, then a merge and rank operation.
Documents
D1: "fast red bike"
D2: "red running shoes"
D3: "bike repair manual"
Inverted index
bike -> D1(pos=3), D3(pos=1)
fast -> D1(pos=1)
manual -> D3(pos=3)
red -> D1(pos=2), D2(pos=1)
running -> D2(pos=2)
shoes -> D2(pos=3)
Query: "red bike"
lookup red + bike -> intersect/merge -> D1 (both terms) ranks highest; D2 and D3 (one term each) may rank lower
Real indexes store much more than document IDs. Positions enable phrase queries like "red bike". Term frequencies help scoring. Field names let a title match count more than a body match. Compression keeps posting lists compact enough to search quickly in memory and on disk.
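The toy index above can be reproduced in a few lines of Python. This is a minimal sketch, assuming whitespace tokenization and in-memory dictionaries; the function names are illustrative and have nothing to do with how Lucene actually stores posting lists.

```python
from collections import defaultdict

def tokenize(text):
    # Naive analysis: lowercase and split on whitespace.
    return text.lower().split()

def build_inverted_index(docs):
    # term -> {doc_id: [1-based positions of the term in that document]}
    index = defaultdict(dict)
    for doc_id, text in docs.items():
        for pos, term in enumerate(tokenize(text), start=1):
            index[term].setdefault(doc_id, []).append(pos)
    return index

def search_all_terms(index, query):
    # AND semantics: intersect the posting lists of every query term.
    postings = [set(index.get(term, {})) for term in tokenize(query)]
    return sorted(set.intersection(*postings)) if postings else []

def search_phrase(index, query):
    # A phrase matches when each term sits one position after the previous one.
    terms = tokenize(query)
    hits = []
    for doc_id in search_all_terms(index, query):
        starts = index[terms[0]][doc_id]
        if any(all(s + i in index[t][doc_id] for i, t in enumerate(terms))
               for s in starts):
            hits.append(doc_id)
    return hits

docs = {
    "D1": "fast red bike",
    "D2": "red running shoes",
    "D3": "bike repair manual",
}
index = build_inverted_index(docs)
print(dict(index["bike"]))                   # posting list for "bike"
print(search_all_terms(index, "red bike"))   # documents containing both terms
print(search_phrase(index, "bike red"))      # word order matters for phrases
```

Storing positions is what makes the phrase check possible: without them the index could only say that D1 contains both words, not that they are adjacent and in order.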
Analysis and relevance scoring
Before text enters the inverted index, Elasticsearch runs an analyzer. Analysis turns raw text into normalized tokens. The same analysis, or a compatible query-time analysis, runs when the user searches.
| Step | What happens | Example |
|---|---|---|
| Character filters | Clean or rewrite raw characters | Remove HTML, normalize punctuation |
| Tokenizer | Split text into tokens | Quick brown fox -> quick, brown, fox |
| Token filters | Normalize tokens | Lowercase, stem running -> run, remove stop words |
| Synonyms | Add equivalent terms | tv -> television, laptop -> notebook |
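The stages in the table can be chained into a small pipeline. The sketch below is a rough Python imitation, not Elasticsearch's analyzer: the stop-word list, synonym map, and `toy_stem` rule are assumptions made up for illustration, and real stemmers are far more careful.

```python
import html
import re

STOP_WORDS = {"a", "an", "and", "the", "of"}                 # illustrative only
SYNONYMS = {"tv": ["television"], "laptop": ["notebook"]}    # illustrative only

def char_filter(text):
    # Character filter: drop HTML tags, then decode entities such as &amp;.
    return html.unescape(re.sub(r"<[^>]+>", " ", text))

def tokenizer(text):
    # Tokenizer: split into runs of letters and digits.
    return re.findall(r"[A-Za-z0-9]+", text)

def toy_stem(token):
    # Deliberately crude suffix stripping, just enough for the examples here.
    if token.endswith("ing") and len(token) > 5:
        token = token[:-3]
        if len(token) > 2 and token[-1] == token[-2]:
            token = token[:-1]          # running -> runn -> run
    elif token.endswith("s") and not token.endswith("ss") and len(token) > 3:
        token = token[:-1]              # shoes -> shoe
    return token

def analyze(text):
    # Token filters: lowercase, remove stop words, stem, then add synonyms.
    tokens = [t.lower() for t in tokenizer(char_filter(text))]
    tokens = [t for t in tokens if t not in STOP_WORDS]
    out = []
    for t in tokens:
        out.append(toy_stem(t))
        out.extend(SYNONYMS.get(t, []))
    return out

print(analyze("<b>The</b> Quick brown fox"))
print(analyze("Running shoes &amp; a TV"))
```

Running the same `analyze` function on both documents and queries is what lets a search for "run" find a document that says "Running".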
Why one match ranks above another
Relevance scoring asks, "How well does this document satisfy the query?" Classic TF-IDF rewards terms that appear often in a document but are rare across the corpus. Elasticsearch commonly uses BM25, a modern descendant that also accounts for document length so very long documents do not win just because they contain more words.
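To make the scoring idea concrete, here is a small sketch of the textbook BM25 formula in Python. It assumes whitespace tokenization, a non-empty corpus, and the commonly quoted parameters k1 = 1.2 and b = 0.75; Lucene's production implementation differs in details, so treat the numbers as illustrative.

```python
import math
from collections import Counter

def bm25_scores(docs, query, k1=1.2, b=0.75):
    # docs: {doc_id: text}. Returns {doc_id: score} for the whole corpus.
    tokenized = {doc_id: text.lower().split() for doc_id, text in docs.items()}
    n_docs = len(tokenized)
    avg_len = sum(len(tokens) for tokens in tokenized.values()) / n_docs
    # Document frequency: in how many documents each term appears.
    doc_freq = Counter(term for tokens in tokenized.values() for term in set(tokens))

    scores = {}
    for doc_id, tokens in tokenized.items():
        tf = Counter(tokens)
        score = 0.0
        for term in query.lower().split():
            if tf[term] == 0:
                continue
            # Rare terms get a larger inverse document frequency.
            idf = math.log(1 + (n_docs - doc_freq[term] + 0.5) / (doc_freq[term] + 0.5))
            # Longer-than-average documents are penalized via b.
            length_norm = 1 - b + b * len(tokens) / avg_len
            # Term frequency saturates: each extra occurrence adds less.
            score += idf * tf[term] * (k1 + 1) / (tf[term] + k1 * length_norm)
        scores[doc_id] = score
    return scores

docs = {
    "D1": "fast red bike",
    "D2": "red running shoes",
    "D3": "bike repair manual",
}
scores = bm25_scores(docs, "red bike")
print(sorted(scores, key=scores.get, reverse=True))  # D1 comes first
```

D1 matches both query terms, so its score is the sum of two term contributions, while D2 and D3 each receive only one.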
score(document, query) roughly increases when:
- query terms appear in important fields (title > description > comments)
- a term appears multiple times, up to a saturation point
- a term is rare across the whole corpus ("kubernetes" beats "the")
- matched words are close together for phrase/proximity queries
score decreases or normalizes when:
- the document is very long and matches only weakly
- filters exclude it (tenant, visibility, stock status); a filtered-out document is dropped from the results rather than scored lower
Why search is separate from the primary database
Your primary database is built for correctness: transactions, constraints, joins, point lookups, and writes. A search engine is built for retrieval: tokenized text, relevance scoring, aggregations, and horizontal query fan-out. Combining those jobs in one system usually makes both worse.
| Concern | Primary database | Search engine |
|---|---|---|
| Source of truth | Authoritative records and transactions | Derived projection that can be rebuilt |
| Query shape | Exact lookup, joins, constraints | Full-text, fuzzy, ranked, faceted search |
| Freshness | Committed data is immediately authoritative | Near-real-time; may lag seconds |
| Scaling | Protect write path and transactional load | Shard indexes and replicate for query throughput |
Because the index is derived, correctness-critical actions must still re-check the database. If search says a hotel room, seat, or product is available, the booking or checkout transaction confirms it against the source of truth before committing.
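A minimal sketch of that re-check, using an in-memory SQLite table as a stand-in for the primary database. The `inventory` table, the `reserve` function, and the stale `search_hit` are hypothetical names for illustration; the point is that the conditional UPDATE, not the search result, decides whether the purchase goes through.

```python
import sqlite3

db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE inventory (product_id TEXT PRIMARY KEY, stock INTEGER NOT NULL)")
db.execute("INSERT INTO inventory VALUES ('product_123', 1)")
db.commit()

# Hypothetical stale search document: the index still says the item is available.
search_hit = {"id": "product_123", "in_stock": True}

def reserve(conn, product_id):
    # The conditional UPDATE is the authoritative check; the search hit is only a hint.
    with conn:  # commits on success, rolls back if an exception is raised
        cur = conn.execute(
            "UPDATE inventory SET stock = stock - 1 "
            "WHERE product_id = ? AND stock > 0",
            (product_id,),
        )
        return cur.rowcount == 1

first = reserve(db, search_hit["id"])    # one unit left, so this succeeds
second = reserve(db, search_hit["id"])   # search still shows "in stock", database says no
print(first, second)
```

The second buyer sees the product in search results because the index lags, but the database refuses the reservation, which is the behavior you want.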
Feeding Elasticsearch with CDC and projections
Applications usually feed search asynchronously. On every database change, a pipeline builds a search document: a denormalized JSON projection containing exactly the fields needed for search, filters, display snippets, and ranking. Change Data Capture (CDC) is a common way to do this without making every write synchronously update Elasticsearch.
Postgres transaction commits
-> WAL/binlog records the change
-> Debezium captures it
-> Kafka topic: product.updated
-> indexer service builds search document
-> Elasticsearch: PUT /products/_doc/product_123
Search document example:
{
  "id": "product_123",
  "title": "Red running shoes",
  "brand": "Contoso",
  "category": "shoes",
  "price": 49.99,
  "in_stock": true,
  "popularity": 0.82
}
- At-least-once events: indexers must be idempotent because CDC events can be retried.
- Deletes matter: tombstone or remove documents when the source row is deleted or becomes invisible.
- Backfills matter: you need a safe way to rebuild an index from the database when mappings or analyzers change.
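One way to get idempotency and safe deletes is to attach a monotonically increasing version to every change and ignore anything that is not newer than what the index already holds. The sketch below uses a plain dictionary as a stand-in for the search index; the event shape (`op`, `id`, `version`, `row`) is an assumption for illustration, not Debezium's actual format. Elasticsearch's index API has an external versioning option that can play a similar role.

```python
def build_search_document(row):
    # Projection: keep only the fields that search, filters, and display need.
    fields = ("id", "title", "brand", "category", "price", "in_stock")
    return {name: row[name] for name in fields if name in row}

def apply_event(index, event):
    # index: doc_id -> {"version": int, "doc": dict or None (tombstone)}
    doc_id, version = event["id"], event["version"]
    current = index.get(doc_id)
    if current is not None and current["version"] >= version:
        return False  # duplicate or out-of-order delivery: safe to ignore
    if event["op"] == "delete":
        # Keep a tombstone so a late retry of an older upsert cannot resurrect it.
        index[doc_id] = {"version": version, "doc": None}
    else:
        index[doc_id] = {"version": version, "doc": build_search_document(event["row"])}
    return True

index = {}
row_v1 = {"id": "product_123", "title": "Red running shoes",
          "price": 49.99, "internal_cost": 20.0}
row_v2 = dict(row_v1, price=44.99)
events = [
    {"op": "upsert", "id": "product_123", "version": 1, "row": row_v1},
    {"op": "upsert", "id": "product_123", "version": 2, "row": row_v2},
    {"op": "upsert", "id": "product_123", "version": 1, "row": row_v1},  # redelivered
    {"op": "delete", "id": "product_123", "version": 3},
    {"op": "upsert", "id": "product_123", "version": 2, "row": row_v2},  # late retry
]
applied = [apply_event(index, e) for e in events]
print(applied)
print(index["product_123"])
```

Replaying the whole event list a second time changes nothing, which is the property an at-least-once pipeline needs.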
Shards, replicas, refresh, facets, and aggregations
Elasticsearch distributes an index into primary shards. Each shard is a Lucene index that owns a slice of the documents. Replica shards copy primaries for high availability and query throughput. A query fans out to relevant shards, each shard returns top candidates, and the coordinating node merges the results.
client query: "red shoes" + filters brand=Contoso
        │
        ▼
coordinating node
  ├─ shard 0 searches local inverted index -> top 10 + facet counts
  ├─ shard 1 searches local inverted index -> top 10 + facet counts
  └─ shard 2 searches local inverted index -> top 10 + facet counts
        │
        ▼
merge scores, sort globally, combine aggregations, return page 1
Elasticsearch is near-real-time. A write is indexed, then a refresh makes new segments visible to search, commonly about once per second. Lower refresh intervals improve freshness but increase indexing and merge overhead; higher intervals improve throughput but increase visible lag.
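The scatter-gather step in the diagram can be sketched in Python as follows. The scores and product IDs are made up, and the functions are illustrative rather than Elasticsearch internals: each shard returns only its local top k, and the coordinator merges those candidates and sums the per-shard facet counts.

```python
import heapq
from collections import Counter

def shard_top_k(shard_hits, k):
    # Each shard ranks only its own documents and returns its local top k.
    return heapq.nlargest(k, shard_hits, key=lambda hit: hit[0])

def coordinate(shards, k):
    # Gather local winners from every shard, then pick the global top k.
    candidates = [hit for shard in shards for hit in shard_top_k(shard, k)]
    return heapq.nlargest(k, candidates, key=lambda hit: hit[0])

def merge_facets(per_shard_counts):
    # Facet counts from different shards can simply be added together.
    total = Counter()
    for counts in per_shard_counts:
        total.update(counts)
    return total

# (score, doc_id) pairs per shard; values are invented for the example.
shards = [
    [(2.1, "p1"), (0.4, "p2"), (1.7, "p3")],
    [(3.0, "p4"), (0.9, "p5")],
    [(1.2, "p6"), (2.8, "p7"), (0.1, "p8")],
]
top = coordinate(shards, k=3)
print(top)
print(merge_facets([{"Contoso": 2}, {"Contoso": 1, "Fabrikam": 4}]))
```

Because every shard returns at least its own best k hits, the global top k is always among the gathered candidates, so the coordinator never needs the full result set from any shard.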
Facets and aggregations
Faceted search computes counts over the result set: brand, color, price range, rating, author, file type, or region. Aggregations also power log dashboards and analytics, such as errors per service over time.
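Conceptually, a facet is a grouped count over the documents that match the query and filters, much like what a terms aggregation returns. Here is a minimal sketch in Python, assuming a tiny in-memory product list with invented values; `search_with_facets` is a hypothetical helper, not an Elasticsearch call.

```python
from collections import Counter

products = [  # invented sample data
    {"title": "Red running shoes", "brand": "Contoso", "color": "red", "price": 49.99},
    {"title": "Trail running shoes", "brand": "Fabrikam", "color": "black", "price": 79.99},
    {"title": "Black running shoes", "brand": "Contoso", "color": "black", "price": 129.99},
    {"title": "Red bike", "brand": "Contoso", "color": "red", "price": 89.00},
]

def search_with_facets(docs, text, max_price, facet_fields):
    terms = text.lower().split()
    # Match: every query term appears in the title and the price filter passes.
    hits = [
        d for d in docs
        if all(t in d["title"].lower().split() for t in terms) and d["price"] < max_price
    ]
    # Facets: count field values over the matching documents only.
    facets = {field: Counter(d[field] for d in hits) for field in facet_fields}
    return hits, facets

hits, facets = search_with_facets(products, "running shoes", 100, ["brand", "color"])
print([d["title"] for d in hits])
print(facets)
```

The counts are computed over the hits, not the whole catalog, which is why facet numbers change as the user adds filters.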
query: "running shoes", filter: price < 100
hits:
  1. Red running shoes, $49.99
  2. Trail running shoes, $79.99
facets:
  brand:
    Contoso: 120
    Fabrikam: 88
  color:
    red: 41
    black: 132
  size:
    9: 57
    10: 64
- Elasticsearch uses inverted indexes: terms point to posting lists of documents, positions, and frequencies for fast lookup and ranking.
- Analysis turns messy text into searchable tokens; BM25/TF-IDF-style scoring ranks documents by term rarity, frequency, field importance, and length normalization.
- Search is a separate derived system because primary databases optimize for transactions and correctness, not fuzzy ranked retrieval and facets.
- CDC pipelines feed search indexes asynchronously, so search is near-real-time and must be revalidated against the database for correctness-critical actions.
- Shards distribute data, replicas add availability/query capacity, refresh controls visibility lag, and aggregations power faceted search and dashboards.