Elasticsearch: Search Architecture


The complexity of full-text search

To understand why full-text search is a difficult problem to solve, let’s think of an example. Suppose you are hosting a blog publishing website with hundreds of millions or even billions of blog posts, each containing hundreds of words, similar to CSDN.

Performing a full-text search means that any user can search for something like “java” or “learn to program”, and you need to find every blog post in which those words appear, in milliseconds. Not only that, you also want to rank those posts based on a variety of factors: how often the words appear in a post, how many claps or comments it has, how recently it was written, whether its author is a top content creator, or whether the words appear in the title, and so on.

Also, you know that users make typos, so you need to handle that too. You also need to consider word order: “learn Java” should mean roughly the same thing as “Java learning”, but sometimes the order matters a great deal, for example “carbon dioxide” is very different from “dioxide carbon” (this is just an example, I don’t know much about chemistry).

Just matching words won’t work either, because some words carry more meaning for a post than others. For example, a blog post titled “Learn Java” is a relevant result when a user searches for “Java”, but much less relevant when a user searches only for “learn”. The same post is also a relevant result when a user searches for “programming”, even though that word never appears in the post!

These challenges are so complex that at first glance the problem may seem almost impossible to solve. Yet every day people open a food ordering app and search tens of thousands of dishes across thousands of restaurants, search LinkedIn for people with a specific job title among hundreds of millions of users, or search billions of blog posts for a specific topic.

Elasticsearch is a database designed to solve this problem. Let’s see how it works.

Understanding terminology

Before we start using Elasticsearch, we should familiarize ourselves with the terminology. To understand things better, let’s take an example. Let’s say you store blog posts on Elasticsearch.

Nodes

Nodes are just individual Elasticsearch processes.

Typically, you’ll run one Elasticsearch process on each machine, so it’s easiest to think of them as separate servers. Each of these processes runs independently of the others and communicates with them only over the network. Elasticsearch typically runs as a large distributed system, which means you usually run multiple machines (or nodes).

Once all these nodes are running together, they form a “cluster”. A cluster is more than the sum of its parts: it is not simply a number of nodes running in isolation. Instead, the nodes know that they are part of a cluster and communicate with each other when performing different operations. In a way, an Elasticsearch cluster is a completely new entity.
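As a quick, hedged illustration (assuming a cluster is already running and reachable on the default port), you can ask any node about the cluster it belongs to and get back the cluster name, its health status, and the number of nodes it currently contains:

GET _cluster/health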

An Elasticsearch cluster has a large number of responsibilities, such as storing documents, searching those documents, performing different analysis and aggregation tasks, backing up data, and so on. It also has to manage itself, for example keeping track of which nodes are healthy and which are not. Therefore, in any large cluster, it is important to dedicate different nodes to different areas of responsibility.

While there may be many such distinctions, one obvious one is between nodes that store data and perform heavy, data-intensive tasks such as searching, and dedicated nodes that manage the cluster: tracking node health, deciding which document goes to which node, and so on. Creating this distinction matters because these nodes may even require different hardware. Data nodes may need larger machines, with faster networks and disks and large amounts of memory, while nodes that perform administrative tasks may have completely different requirements.

The nodes that store data and serve searches are called “data” nodes, and the nodes that perform the administrative tasks are called “master” nodes.
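In recent Elasticsearch versions, a node’s role is declared in its elasticsearch.yml. A minimal sketch (the exact settings depend on your version):

# elasticsearch.yml on a data node
node.roles: [ data ]

# elasticsearch.yml on a dedicated master-eligible node
node.roles: [ master ]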

For a more detailed description of nodes, please read the article “Some important concepts in Elasticsearch: cluster, node, index, document, shards and replica”. In fact, in addition to data and master nodes, Elasticsearch has several other node types as well.

Indices and documents

Documents are simple JSON objects that you store in Elasticsearch. They are analogous to rows in a relational database or individual documents in MongoDB.

For our example, a single document might look like this:

{
  "_id": "9a91473c-522e-4174-bf7f-f55293b8e526",
  "post_title": "Learning about Elasticsearch",
  "author_name": "Zhang san",
  .....
}

An index is a collection of similar documents. It is analogous to a table in a relational database (where each row is a single item) or a collection in MongoDB.

So for our example we will have an index that stores blog posts. Let’s call it blog_posts. If we want to store some other data, say users, we can create another index, users. The blog_posts index stores blog post documents, each containing fields related to blog posts, while the users index stores user documents containing fields such as user_name, email, etc.
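For illustration (blog_posts and the field names are just our running example), indexing such a blog post document is a single request; by default, Elasticsearch will even auto-create the index on the first write if it does not exist yet:

PUT /blog_posts/_doc/9a91473c-522e-4174-bf7f-f55293b8e526
{
  "post_title": "Learning about Elasticsearch",
  "author_name": "Zhang san"
}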

Shards – Sharding

The documents in an index are divided into shards. Each shard stores a subset of the index’s documents. We’ll see later why it’s important to divide an index’s documents into shards, but for now let’s focus on how sharding works.

For example, let’s say we have some blog_posts documents.

If we create three shards for this index (e.g. shard A, shard B, shard C), then all our documents will be divided into these three shards.

These shards will then reside on different data nodes in the cluster.

This is important because distributing these documents across multiple shards gives you several advantages,

  1. Searches can be parallelized. A search has to consider every document in the index, which would be very time-consuming if all of those documents lived on a single server. Sharding lets you spread the documents across multiple servers, so a single search runs in parallel on different hardware.
  2. Other operations, such as inserting a document (called indexing in Elasticsearch) or retrieving a document by a specific ID, are also spread across the nodes, so no single node bears all the load.
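Had we created the blog_posts index ourselves instead of letting it be auto-created, we could have chosen the shard count explicitly. A minimal sketch (the number is illustrative, and the primary shard count cannot be changed after the index is created):

PUT /blog_posts
{
  "settings": {
    "number_of_shards": 3
  }
}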

However, our architecture is still incomplete. If a node dies, the shards it stores (and the data on those shards) are lost forever.

Let’s look at primary shard and replica shard to understand this better.

Primary shards, replica shards and distinct shards

Just a quick revision of what we’ve covered so far: a single shard contains multiple documents.

Each shard resides on a specific node.

One problem with our architecture so far is that if a specific node (let’s say 10.192.0.3) dies or becomes unavailable, the data in “shard A” is lost forever. To solve this problem, we introduce the concepts of “replica shards” and “primary shards”. Primary shards are the shards we have been discussing so far (label them “primary” for now).

A replica shard stores the same documents as its associated primary shard; it simply “duplicates”, or replicates, a specific primary shard.

In the above image, you can see that each primary shard has an associated replica shard, and each replica shard stores the same documents as its primary. Here we have one replica per primary shard, but this number can be raised, for example to two replicas per primary. For now, we continue with one replica per primary shard.

These replica shards do not need to be on the same node as their primary shard (in fact, a replica only makes sense on a different node than its primary). Both primary and replica shards are distributed across all the nodes in the cluster.

In the above diagram, the primary and replica copies of each shard live on different nodes, so a single node failure will not make any data unavailable. For example, if node 10.192.0.3 becomes unavailable, neither shard A’s nor shard B’s data is lost: shard A’s data is still available on node 10.192.0.2, and shard B’s data is still available on node 10.192.0.1.

This means our cluster can survive the loss of a single node. However, our cluster may not survive the loss of two nodes. For example, the simultaneous loss of the 10.192.0.3 and 10.192.0.2 nodes will cause the documents of shard A to be completely unavailable. We can configure higher replication, for example, using two replicas per primary shard to mitigate this situation. But for now, we continue with one replica per primary shard.
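Unlike the primary shard count, the number of replicas can be changed on a live index. A hedged sketch of raising our example index to two replicas per primary (we stay at one replica in the rest of this article):

PUT /blog_posts/_settings
{
  "index": {
    "number_of_replicas": 2
  }
}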

Finally, let’s look at “distinct shards”. A distinct shard is just a term for the grouping of a primary shard together with its replicas. So, in our current example, we have three primary shards, three replica shards (one replica per primary), six shards in total (three primaries + three replicas), and three distinct shards.

It will become clear later why it is useful to group a primary shard and its corresponding replicas into a single “distinct shard”. To reiterate, a “distinct shard” is just a logical grouping of shards; it does not change the architecture we have drawn so far.
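You can see this grouping directly with the _cat API: each row lists a shard number together with whether that copy is a primary (p) or a replica (r), and the node it lives on. For example:

GET _cat/shards/blog_posts?v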

Let’s look at some real query examples

To wrap up the architecture discussion, let’s see how search queries and get queries work in our example cluster.

The first step…

Let’s see what happens when we perform a search or get query.

This is what our cluster looks like now,

The client sends a search or get request to any of these nodes. The node that receives the request becomes the “coordinator” node for that request. Larger clusters might even have dedicated coordinating nodes (a dedicated coordinating node is one with no other node role; it only receives requests from clients and coordinates them), but we don’t need that right now. A more detailed architecture diagram can be seen at the beginning of the article.

This coordinator node is responsible for receiving requests, communicating with other nodes (if needed), combining the results received from multiple nodes and returning the results.
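In recent versions, a dedicated coordinating-only node is simply a node whose role list is left empty in elasticsearch.yml. A sketch:

# elasticsearch.yml on a coordinating-only node
node.roles: [ ]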

Search

When searching, the search query must hit every distinct shard, because each shard performs the search locally over the documents it holds.

The coordinator node then communicates with multiple nodes to get results from each distinct shard. Recall that in our example there is one replica per primary shard, so the query only needs to hit one copy of each distinct shard, i.e. half of the shards in the cluster (either the primary or its replica).
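A minimal search request against our example index might look like this (the query text is illustrative); whichever node receives it acts as the coordinator and fans the query out to one copy of every distinct shard:

GET /blog_posts/_search
{
  "query": {
    "match": {
      "post_title": "learn java"
    }
  }
}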

For details on how to complete a request, please read the article “Elasticsearch: How is the data read?”.

Querying by ID

When a query asks for a specific document by ID, the coordinator node already knows which shard holds that document, so there is no need to hit every node. It simply forwards the request to a node that stores the relevant shard and sends the response back to the client. This works because, whenever a document is indexed, Elasticsearch hashes its routing value (by default the document’s ID) and stores the document on the shard that the hash points to. This also keeps storage balanced across shards, so no single shard becomes a hotspot.

shard_num = hash(_routing) % num_primary_shards

Given a document’s ID, we can therefore calculate which shard holds it.
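A get by ID therefore touches exactly one shard. A sketch using the document from earlier:

GET /blog_posts/_doc/9a91473c-522e-4174-bf7f-f55293b8e526

Because the shard number is derived from the hash of the routing value (which defaults to the document ID), the coordinator can forward this request directly to a node that holds that shard.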

Conclusion

This is a very simple introduction to the Elasticsearch architecture. I hope everyone can have a clearer understanding of the cluster architecture of Elasticsearch. For more information about the terms and concepts of Elasticsearch, please read the article “Some important concepts in Elasticsearch: cluster, node, index, document, shards and replica”.