Internal principles of Elasticsearch shards: near real-time search and persistent changes

Table of Contents

1. Near real-time search
   refresh API
2. Persistent changes
   flush API


1. Near real-time search

With the development of per-segment search, the delay between a new document being indexed and it becoming searchable has been reduced dramatically. New documents can be made searchable within minutes, but that is still not fast enough.

The disk becomes the bottleneck here. Committing a new segment to disk requires an fsync to ensure that the segment is physically written, so that data is not lost in the event of a power failure. But fsync is expensive; performing one every time a document is indexed would cause a huge performance problem.

What we need is a more lightweight way to make a document searchable, which means fsync needs to be removed from the entire process.

Sitting between Elasticsearch and the disk is the filesystem cache. As described previously, documents in the in-memory indexing buffer (Figure 19, "A Lucene index with new documents in the in-memory buffer") are written to a new segment (Figure 20, "The buffer contents have been written to a searchable segment, but it has not yet been committed"). The new segment is written to the filesystem cache first, which is cheap, and only flushed to disk later, which is expensive. But once a file is in the cache, it can be opened and read just like any other file.

Figure 19. A Lucene index with new documents in the in-memory buffer

Lucene allows new segments to be written and opened, making the documents they contain visible to search, without performing a full commit. This is a much cheaper operation than a commit, and it can be done frequently without hurting performance.

Figure 20. The buffer contents have been written to a searchable segment, but it has not yet been committed

refresh API

In Elasticsearch, this lightweight process of writing and opening a new segment is called a refresh. By default, every shard is refreshed automatically once per second. This is why we say that Elasticsearch has near real-time search: changes to a document are not visible to search immediately, but become visible within one second.

This behavior can be confusing for new users: they index a document, try to search for it, and it isn't there. The solution is to perform a manual refresh with the refresh API:

POST /_refresh
POST /blogs/_refresh 

Refresh all indexes.

Refresh only the blogs index.
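
To make this concrete, here is a minimal sketch of the sequence (the my_test index and document are invented for illustration; on versions before Elasticsearch 6.x the document path would use a mapping type instead of _doc):

PUT /my_test/_doc/1
{ "title": "near real-time search" }

GET /my_test/_search?q=title:near

POST /my_test/_refresh

GET /my_test/_search?q=title:near

The first search may return no hits because the document has not yet been refreshed into a searchable segment; after the explicit refresh, the second search finds it.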

Although a refresh is a much lighter operation than a commit, it still has a performance cost. Manual refreshes are useful when writing tests, but don't refresh every time you index a document in a production environment. Instead, your application needs to be aware of the near real-time nature of Elasticsearch and accept this limitation.
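
If a particular write does need to be visible before your application continues, Elasticsearch 5.0 and later also accept a refresh parameter on the write request itself. Setting refresh=wait_for blocks the request until the next scheduled refresh, rather than forcing an extra refresh of its own (the index and document below are again hypothetical):

PUT /my_test/_doc/2?refresh=wait_for
{ "title": "visible once this request returns" }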

Not every use case requires a refresh every second. Perhaps you are using Elasticsearch to index a large number of log files, and you want to optimize for indexing speed rather than near real-time search. You can lower the refresh frequency of an index by setting refresh_interval:

PUT /my_logs
{
  "settings": {
    "refresh_interval": "30s"
  }
}

Refresh the my_logs index every 30 seconds.

refresh_interval can be updated dynamically on an existing index. In a production environment, when you are building a large new index, you can turn off automatic refreshes and turn them back on when you start using the index:

PUT /my_logs/_settings
{ "refresh_interval": -1 }

PUT /my_logs/_settings
{ "refresh_interval": "1s" } 

Turn off automatic refresh.

Automatically refresh every second.
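
Putting this together, a typical bulk-loading workflow might look like the following sketch. The my_logs index and log documents are invented, and on older Elasticsearch versions the bulk path may need to include a mapping type:

PUT /my_logs/_settings
{ "refresh_interval": -1 }

POST /my_logs/_bulk
{ "index": {} }
{ "message": "first log line" }
{ "index": {} }
{ "message": "second log line" }

PUT /my_logs/_settings
{ "refresh_interval": "1s" }

POST /my_logs/_refresh

The final explicit refresh makes the bulk-loaded documents searchable immediately, instead of waiting for the next scheduled refresh.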

2. Persistent changes

Without an fsync to flush the data in the filesystem cache to disk, we cannot guarantee that the data will still be there after a power failure, or even after a normal program exit. For Elasticsearch to be reliable, it needs to ensure that changes are persisted to disk.

In the discussion of dynamically updatable indices, we said that a full commit flushes the segments to disk and writes a commit point, which lists all known segments. Elasticsearch uses this commit point during startup, or when reopening an index, to decide which segments belong to the current shard.

Even with near real-time search from refreshing every second, we still need to do full commits regularly to make sure we can recover from failure. But what about the documents that change between commits? We don't want to lose those either.

Elasticsearch added a translog, or transaction log, which records every operation in Elasticsearch as it happens. With the translog, the whole process looks like this:

  1. After a document is indexed, it is added to the memory buffer and appended to the translog.

    Figure 21. New documents are added to the memory buffer and appended to the transaction log

  2. A refresh leaves the shard in the state described in Figure 22, "After a refresh, the buffer is cleared but the transaction log is not". Once every second, the shard is refreshed:

    • The documents in the memory buffer are written to a new segment without fsync operations.
    • The segment is opened, making it searchable.
    • The memory buffer is cleared.

    Figure 22. After a refresh, the buffer is cleared but the transaction log is not

  3. The process continues to work, and more documents are added to the memory buffer and appended to the transaction log.

    Figure 23. The transaction log accumulates documents

  4. Every so often, for example when the translog is getting too big, the index is flushed: a new translog is created, and a full commit is performed:

    • All documents in the memory buffer are written to a new segment.
    • The buffer is cleared.
    • A commit point is written to disk.
    • The file system cache is flushed via fsync.
    • The old translog is deleted.

The translog provides a persistent record of all operations that have not yet been flushed to disk. When Elasticsearch starts, it uses the last commit point to restore known segments from disk, and then replays all operations in the translog that happened after the last commit.

The translog is also used to provide real-time CRUD. When you try to retrieve, update, or delete a document by ID, the translog is checked first for any recent changes before the document is fetched from the relevant segment. This means that the latest version of a document is always available, in real time.
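
A short sketch of this behavior, using a hypothetical my_index (older versions use a mapping type in place of _doc): the GET by ID below returns the just-indexed document immediately, even if no refresh has made it visible to search yet.

PUT /my_index/_doc/1
{ "title": "retrieved in real time" }

GET /my_index/_doc/1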

Figure 24. After a flush, the segments are fully committed and the transaction log is cleared

flush API

The act of performing a commit and truncating the translog is called a flush in Elasticsearch. Shards are flushed automatically every 30 minutes, or when the translog becomes too big. These thresholds can be adjusted with the translog settings.
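
For example, the size at which the translog triggers a flush can be adjusted with the index.translog.flush_threshold_size setting. The 1gb value below is purely illustrative, and the exact setting names can differ between Elasticsearch versions:

PUT /my_index/_settings
{ "index.translog.flush_threshold_size": "1gb" }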

The flush API can be used to perform a manual flush:

POST /blogs/_flush

POST /_flush?wait_for_ongoing 

Flush the blogs index.

Flush all indexes and wait for any ongoing flushes to complete before returning.

You rarely need to perform a manual flush yourself; usually, automatic flushing is enough.

That said, it is beneficial to flush your indices before restarting a node or closing an index. When Elasticsearch tries to recover or reopen an index, it has to replay all of the operations in the translog, so the shorter the log, the faster the recovery.

How safe is the translog?

The purpose of the translog is to ensure that operations are not lost. This raises the question: how safe is the translog?

Writes to a file will not survive a reboot until the file has been fsynced to disk. By default, the translog is fsynced every 5 seconds, or after each write request (such as index, delete, update, or bulk) completes. This happens on both primary and replica shards, which ultimately means that your client will not receive a 200 OK response until the entire request has been fsynced to the translog of the primary and the replica shards.

Fsyncing after every request does have some performance cost, although in practice it is relatively small (especially for bulk ingestion, where a single request amortizes the cost across many documents).

However, for some high-volume clusters where losing a few seconds of data is not a serious problem, it can be advantageous to fsync asynchronously: the written data is buffered in memory, and an fsync is performed every 5 seconds.

This behavior can be enabled by setting the durability parameter to async:

PUT /my_index/_settings
{
    "index.translog.durability": "async",
    "index.translog.sync_interval": "5s"
}

This option can be set per index and can be changed dynamically. If you decide to use the asynchronous translog, you need to be sure that you can afford to lose the operations performed during the sync_interval window when a crash occurs. Be aware of this trade-off before deciding.

If you are unsure of the consequences of this behavior, it is best to keep the default setting ( "index.translog.durability": "request" ) to avoid losing data.
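
Because the setting is dynamic, you can also switch an index back to per-request durability at any time, for example once a large import has finished (a sketch, reusing the my_index index from above):

PUT /my_index/_settings
{ "index.translog.durability": "request" }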
