Elasticsearch
Basic Concepts
-
Document - Basic unit of data (JSON)
-
Index - It’s a collection of documents
-
Mapping - the schema of the documents in an index. It defines the fields a document in the index can have, the data type of each field, and other properties such as indexing behavior
-
Field - A key-value field inside the document
-
Lucene - the core search engine library underlying Elasticsearch. It provides indexing, scoring, searching, and segment storage.
-
Segment - a low-level storage unit inside a Lucene index. It is an immutable block of indexed documents.
-
Shard - A Lucene instance storing a part of the index
-
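Primary Shard - the original shard that receives documents for indexing. All write operations go here first; after indexing into the primary, Elasticsearch replicates the operation to its replicas. Primary shards are created when the index is created, and their number is fixed at that point (changing it requires the Shrink, Split, or Reindex API).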
-
Replica Shard - a copy of a primary shard. Replicas provide high availability and let search load be balanced across copies. They serve read/search requests; writes reach them only through replication from the primary. A replica is never allocated to the same node as its primary shard.
-
Cluster - A collection of one or more nodes working together
-
Node - A single server instance in the cluster
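To picture how an index is split into primary shards, the following is a minimal sketch of hash-based routing. It is a simplification and not Elasticsearch's actual code: Elasticsearch hashes the routing value (by default the document id) with Murmur3 and applies an extra routing factor, while this sketch substitutes `zlib.crc32` and a plain modulo. The function names and document ids are made up for illustration.

```python
import zlib


def pick_primary_shard(routing_value: str, num_primary_shards: int) -> int:
    """Map a routing value (by default the document id) to a primary shard number.

    Assumption: crc32 stands in for the Murmur3 hash that Elasticsearch uses.
    """
    return zlib.crc32(routing_value.encode("utf-8")) % num_primary_shards


def distribute(doc_ids, num_primary_shards):
    """Group document ids by the primary shard each one would be routed to."""
    shards = {number: [] for number in range(num_primary_shards)}
    for doc_id in doc_ids:
        shards[pick_primary_shard(doc_id, num_primary_shards)].append(doc_id)
    return shards


if __name__ == "__main__":
    layout = distribute([f"doc-{i}" for i in range(10)], num_primary_shards=3)
    for shard, ids in layout.items():
        print(f"primary shard {shard}: {ids}")
```

Because the shard number depends on the number of primary shards, changing that number would send existing ids to different shards, which is one way to see why the primary shard count is fixed when the index is created.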
Types of Nodes
-
Master Node
-
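A node with the master role is master-eligible, meaning it can be elected as the cluster's master.
-
The elected master maintains the cluster state.
-
Master-eligible nodes vote to elect the master node.
-
The elected master manages creating/deleting indices, assigning shards, and other cluster-level operations.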
-
-
Data Node
-
Stores the actual data and handles indexing, search queries and aggregations
-
Manages shard storage and operations
-
-
Ingest Node
- used to preprocess and transform data before indexing
-
Coordinating-only Node
-
A node becomes coordinating-only when all of its roles are disabled (an empty node.roles list).
-
It is responsible for coordinating search requests across the cluster: it receives the request from the client, forwards it to the nodes holding the relevant shards, and merges their results into the response.
-
-
ML Node
- supports ML features of Elasticsearch such as anomaly detection, regression, classification, and outlier detection.
APIs
-
Index API - add or replace a single document in the index
-
Get API - get a single document by id
-
Update API - modify a single document by id using partial updates (send only the fields that need to change), scripted updates (use a Painless script for logic such as incrementing a counter), or upserts (create the document if it doesn't exist, otherwise update it)
-
Delete API - delete a document by id
-
Bulk API - batch index/update/delete/create operations
-
Multi-Get (MGET) API - retrieve multiple documents from one or more indices by id
-
Delete By Query API - Delete all documents that match a query
-
Update By Query API - Update all documents that match a query
-
Reindex API - copy data from one index to another. This is typically used when changing field mappings, upgrading cluster versions, changing index settings, or transforming data.
-
Refresh API
-
A refresh makes recent operations performed on indices available for search. Elasticsearch performs a refresh every 1 second by default. This can be changed using the index.refresh_interval setting.
-
Write operations also accept a refresh query parameter with the following values
-
refresh=false (default) - do not refresh after this operation. The change becomes searchable after the next scheduled refresh
-
refresh=true - perform a refresh immediately after the operation
-
refresh=wait_for - wait until the next scheduled refresh before returning, so the change is searchable when the call returns. No refresh is forced
-
-
The refresh parameter can be used with the following operations - index, update, delete, _bulk, update_by_query, delete_by_query (the two by-query operations accept only true or false)
-
-
Search API - Perform full-text search, filtering, sorting, and aggregations
-
Explain API - a debugging tool used to understand why a specific document received its search score, or why it matched (or didn’t match) a specific query.
-
Profile API - to analyze query execution performance
-
Multi Search API - run multiple search queries in a single request
-
Validate Query API - used to validate a query without executing it
-
Field Capabilities API - used to get information about the capabilities of specific fields across one or more indices
-
Rank Evaluation API - evaluate quality of ranked search results
-
Open/Close index API - close indices to reduce resource usage. Can be reopened later
-
Aliases API - add/remove/update index aliases
-
Mapping API - to modify field mappings
-
Shrink / Split / Clone index API - used to manage and optimize shard distribution within a cluster
-
Force Merge API - merge the segments of a shard into fewer segments, which improves search performance on read-heavy indices that no longer receive writes
-
Flush API - to permanently store in-memory index operations to disk and clear the internal transaction log (translog)
-
Clear Cache API - to manually evict data from internal memory caches to free up resources or prepare for performance testing
-
Index Stats API - get statistics on indexing, search, caching, merges
-
Segments API - inspect Lucene segments inside shards.
-
Cluster Health API - view cluster health (green/yellow/red)
-
Cluster State API - used for debugging, diagnostics, and monitoring
-
Cluster Stats API - view operational statistics across nodes
-
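Cluster Settings API - view and change cluster-wide settings such as shard allocation and recovery behavior. Per-index settings such as the number of replicas and the refresh interval are changed through the index settings API, and the number of primary shards cannot be changed on an existing index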
-
Pending Tasks API - view tasks waiting to be processed by the cluster
-
Reroute API - used to manually move shards across the cluster
-
Allocation Explain API - used to identify why shards are not being distributed as expected within a cluster
-
Nodes Info API - view modules, plugins, roles, JVM info etc
-
Nodes Stats API - view node-level metrics such as memory, CPU, cache, utilization, latency, throughput etc.
-
Hot Threads API - a diagnostic tool that identifies performance bottlenecks by providing a snapshot of the busiest Java threads running on a node
-
Usage API - provide insights into how system features are being utilized, how much storage different components consume, and how frequently specific data structures are accessed
-
Cat APIs - provide a human-readable, command-line-friendly way to quickly monitor and troubleshoot your cluster
-
/_cat/indices - lists all the indices in the cluster
-
/_cat/nodes - shows a summary of all the nodes in the cluster
-
/_cat/shards - shows shard level details for every index
-
/_cat/health - shows a quick view of cluster-wide health
-
/_cat/count - gives the total number of documents in the selected indices
-
/_cat/aliases - shows alias to index mapping
-
/_cat/repositories - lists snapshot repositories registered in the cluster
-
/_cat/snapshots - lists snapshots inside a specific repository
-
-
Ingest Pipeline APIs - create and manage pipelines of processors: Put Pipeline, Get Pipeline, Delete Pipeline, Simulate Pipeline
-
Nodes Stats API (ingest section) - track ingest pipeline performance.
-
Snapshot API - create/restore/get/delete snapshots of indices/cluster state.
-
Snapshot Repository API - Register local or cloud-backed repositories.
-
User APIs - create/update/delete users
-
Role APIs - assign permissions
-
Role Mapping API - map users/groups to roles
-
API Keys API - create/manage API keys
-
Token API - manage access tokens for login flows
-
SSL / Certificates API - view information about the X.509 certificates used for TLS in the cluster
-
ML APIs - Anomaly Detection, Data Frame Analytics, Model Management, Deployment etc
-
Index Lifecycle Management API - define policies: hot, warm, cold, delete phases.
-
Explain Lifecycle API - see which lifecycle phase an index is in
-
Script API - used for managing, storing, testing, and executing custom scripts within the cluster
-
Index Template API - index templates are the mechanism by which Elasticsearch applies settings, mappings, and other configuration when creating indices
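As an illustration of the Update and Bulk APIs listed above, here is a minimal sketch that only builds the request bodies in Python; it does not send them to a cluster. The helper names, the `products` index, and the field names are hypothetical. The body shapes (`doc`, `script`, `upsert`, and the newline-delimited bulk format) follow the documented request formats as far as I know them.

```python
import json


def partial_update(fields: dict) -> dict:
    """Body for a partial update: only the fields that change are sent."""
    return {"doc": fields}


def scripted_increment(field: str, amount: int, initial: int) -> dict:
    """Body for a scripted update that increments a counter, with an upsert fallback."""
    return {
        "script": {
            "lang": "painless",
            "source": f"ctx._source.{field} += params.amount",
            "params": {"amount": amount},
        },
        # Indexed as the new document when the id does not exist yet.
        "upsert": {field: initial},
    }


def bulk_body(actions) -> str:
    """Serialize (action, metadata, source) tuples into newline-delimited JSON.

    source is None for delete actions, which carry no second line.
    """
    lines = []
    for action, meta, source in actions:
        lines.append(json.dumps({action: meta}))
        if source is not None:
            lines.append(json.dumps(source))
    # The bulk body must end with a newline.
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    print(json.dumps(partial_update({"price": 9.5})))
    print(json.dumps(scripted_increment("views", amount=1, initial=1)))
    print(bulk_body([
        ("index", {"_index": "products", "_id": "1"}, {"name": "pen"}),
        ("update", {"_index": "products", "_id": "1"}, partial_update({"price": 2})),
        ("delete", {"_index": "products", "_id": "2"}, None),
    ]), end="")
```

The update bodies would be sent to `POST /products/_update/<id>` and the bulk body to `POST /_bulk` with the `application/x-ndjson` content type.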
Mapping & Schema
-
A mapping defines the data type of each field and its indexing behavior
-
Automatic field creation and mapping can be achieved using dynamic mapping. Elasticsearch creates new fields automatically based on incoming document structure.
-
A manually defined schema can be achieved using explicit mapping
-
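Use {index_name}/_mapping to view mappings or add new fields. The type of an existing field cannot be changed in place and a field cannot be removed from the mapping; both require reindexing into a new index.
-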
Commonly used field types include keyword, text, numeric types, date, boolean, geo_point, geo_shape, object, and nested
-
The analyzer of a text field determines the tokenization strategy, stemming, case folding, stopword removal, etc.
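The following sketch shows what an explicit mapping could look like, built as a Python dictionary that would be sent as the body of an index creation request. The `products` index and all field names are hypothetical; the field types and the `dynamic` and `analyzer` parameters are the standard mapping options.

```python
import json


def build_index_body() -> dict:
    """Index creation body with an explicit mapping for a hypothetical products index."""
    return {
        "mappings": {
            # "strict" rejects documents containing fields that are not mapped,
            # which turns off automatic field creation for this index.
            "dynamic": "strict",
            "properties": {
                "sku": {"type": "keyword"},            # exact-match identifier
                "description": {"type": "text", "analyzer": "english"},
                "price": {"type": "float"},
                "created_at": {"type": "date"},
                "in_stock": {"type": "boolean"},
                "warehouse": {"type": "geo_point"},
                "reviews": {                           # array of objects queried independently
                    "type": "nested",
                    "properties": {
                        "rating": {"type": "integer"},
                        "comment": {"type": "text"},
                    },
                },
            },
        }
    }


if __name__ == "__main__":
    # Would be sent as the body of: PUT /products
    print(json.dumps(build_index_body(), indent=2))
```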
Search & Query Concepts
-
Full-text Queries - used for human language search where analysis occurs
-
match - analyzes search text and finds relevant docs
-
multi_match - searches multiple fields at once
-
match_phrase - exact sequence of words with ordering and proximity control
-
query_string - Lucene query syntax; supports AND/OR/NOT, field selectors, and wildcards
-
simple_query_string - a safer but limited version of query_string
-
-
Term-level Queries - the search term is not analyzed; useful for exact matches
-
term - exact match for keywords, numbers, booleans
-
terms - match any from a list
-
range - filter by numeric, date, or string ranges
-
exists - field must be present
-
prefix, wildcard, regexp - pattern-based queries
-
-
Boolean
- bool - combines clauses: must (AND, contributes to the score), filter (AND, no scoring), should (OR), must_not (NOT)
-
Pagination
-
from/size - basic pagination. from + size cannot exceed index.max_result_window (10,000 by default)
-
search_after - for deep pagination. It uses the sort values of the last result as the starting point for the next page.
-
scroll - used for large exports or batch processing. Not meant for real-time user queries
-
-
Aggregations
-
Metric aggregations - compute values
-
min/max - minimum/maximum value of numeric field
-
avg/sum - average/sum of numeric field
-
stats - returns count, min, max, sum, average
-
extended_stats - returns variance, standard deviation, sum of squares, standard deviation bounds
-
cardinality - approximate count of unique values
-
percentiles / percentile_ranks - compute percentiles
-
top_hits - return sample documents from each bucket
-
value_count - count number of values for a field
-
-
Bucket aggregations - group documents into buckets
-
terms - group by a field
-
range - group into buckets based on the ranges provided for numeric fields
-
date_range - group into buckets based on the ranges provided for date fields
-
histogram - group numeric data into buckets of a provided fixed interval
-
date_histogram - group date data into buckets of a provided calendar or fixed interval
-
filters - manually define buckets based on queries
-
-
Nested aggregations - work on nested fields
- nested - used when documents contain an array of objects mapped with the nested field type
-
Pipeline aggregations - perform computations on the output of other aggregations, rather than directly on the documents in an index. They enable complex statistical and mathematical calculations like moving averages, cumulative sums, and derivatives by chaining aggregation results together
-
Multi-Level aggregations (Sub-Aggs) - aggregations can be nested to build hierarchical analytics
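The query, pagination, and aggregation concepts above can be combined in a single search request. Below is a minimal sketch that builds such a request body in Python without contacting a cluster. The field names (`description`, `category`, `price`, `in_stock`, `created_at`, `sku`) and helper names are hypothetical; `sku` is assumed to be a unique keyword field used as a sort tie-breaker.

```python
import copy
import json


def build_search(text: str, category: str, max_price: float, page_size: int = 20) -> dict:
    """Search body combining a bool query, a sort for search_after, and a sub-aggregation."""
    return {
        "size": page_size,
        "query": {
            "bool": {
                "must": [{"match": {"description": text}}],       # analyzed, scored
                "filter": [                                         # exact, not scored
                    {"term": {"category": category}},
                    {"range": {"price": {"lte": max_price}}},
                ],
                "must_not": [{"term": {"in_stock": False}}],
            }
        },
        # search_after needs a deterministic sort; sku breaks ties between equal dates.
        "sort": [{"created_at": "desc"}, {"sku": "asc"}],
        "aggs": {
            "by_category": {                                        # bucket aggregation
                "terms": {"field": "category"},
                "aggs": {"avg_price": {"avg": {"field": "price"}}},  # metric sub-aggregation
            }
        },
    }


def next_page(body: dict, last_hit: dict) -> dict:
    """Return a copy of the body that continues after the last hit of the previous page."""
    page = copy.deepcopy(body)
    page["search_after"] = last_hit["sort"]
    return page


if __name__ == "__main__":
    first = build_search("wireless mouse", "accessories", 50)
    # A hit as it might appear in a response; the values are made up.
    last_hit = {"_id": "7", "sort": ["2024-01-01T00:00:00Z", "SKU-7"]}
    print(json.dumps(next_page(first, last_hit), indent=2))
```

Each hit in a sorted response carries a `sort` array, which is why the next page can be requested by passing the last hit's values back as `search_after`.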
-
FAQs
-
Why doesn't an update in Elasticsearch show up in search results immediately?
-
Elasticsearch applies the update immediately: it is written to the in-memory indexing buffer and recorded in the transaction log (translog). The delay comes from how Elasticsearch handles index refreshes. The update is not visible to searches until the next refresh, and the default refresh interval is 1 second. Until that refresh happens, searches still see the previous state of the document.
-
Elasticsearch avoids refreshing on every update because a refresh is expensive. Each refresh creates a new segment, so refreshing on every update would produce too many small segments and hurt performance significantly. Instead, it batches changes and refreshes periodically based on the index.refresh_interval setting.
-
When a document is updated, Elasticsearch fetches the old version, rewrites it with new changes, marks the old version as deleted and indexes the new version.
-
To force a refresh, pass ?refresh=true as a query parameter on the update request. To wait for the next scheduled refresh without forcing one, pass ?refresh=wait_for instead.
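The behavior described in this answer can be mimicked with a small toy model. The sketch below is an assumption-laden illustration, not how Lucene or Elasticsearch is implemented: writes land in a buffer, `refresh()` turns the buffer into a new read-only segment, and search only looks at segments. The class and method names are invented for the example.

```python
from types import MappingProxyType


class ToyShard:
    """Toy model of refresh-based search visibility (illustration only)."""

    def __init__(self):
        self.buffer = {}      # doc_id -> source, written but not yet searchable
        self.segments = []    # read-only segments, oldest first
        self.deleted = set()  # (segment_number, doc_id) pairs marked as deleted

    def _live_copy(self, doc_id):
        """Return the latest refreshed, non-deleted copy of a document, or None."""
        for number in range(len(self.segments) - 1, -1, -1):
            if doc_id in self.segments[number] and (number, doc_id) not in self.deleted:
                return self.segments[number][doc_id]
        return None

    def index(self, doc_id, source):
        self.buffer[doc_id] = dict(source)

    def update(self, doc_id, changes):
        # Fetch the old version, merge the changes, and index the result as a new version.
        old = self.buffer.get(doc_id) or self._live_copy(doc_id) or {}
        self.index(doc_id, {**old, **changes})

    def refresh(self):
        if not self.buffer:
            return
        # Older copies of re-indexed documents are marked as deleted, not rewritten.
        for number, segment in enumerate(self.segments):
            for doc_id in self.buffer:
                if doc_id in segment:
                    self.deleted.add((number, doc_id))
        self.segments.append(MappingProxyType(dict(self.buffer)))
        self.buffer.clear()

    def search(self, field, value):
        """Return ids of refreshed, non-deleted documents whose field equals value."""
        hits = []
        for number, segment in enumerate(self.segments):
            for doc_id, source in segment.items():
                if (number, doc_id) not in self.deleted and source.get(field) == value:
                    hits.append(doc_id)
        return hits


if __name__ == "__main__":
    shard = ToyShard()
    shard.index("1", {"status": "draft"})
    shard.refresh()
    shard.update("1", {"status": "published"})
    print(shard.search("status", "published"))  # [] (update not refreshed yet)
    shard.refresh()
    print(shard.search("status", "published"))  # ['1']
```

In this model, calling `refresh()` right after every `update()` corresponds to `?refresh=true`: the change is visible at once, at the cost of one new segment per write.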
-