Elasticsearch index principle

Introduction

Elasticsearch is a distributed, scalable, real-time search and analytics engine built on top of the full-text search library Apache Lucene(TM). Elasticsearch is more than just Lucene, though: besides full-text search, it can also do the following:

  • Distributed real-time document storage, where every field is indexed and searchable.
  • Distributed search engine for real-time analytics.
  • The ability to scale out to hundreds of servers and handle petabytes of structured or unstructured data.

Basic concepts

Let’s start with how Elasticsearch stores data. Elasticsearch is document-oriented: a record is a document, and JSON is used as the document serialization format. For example, the following user data:

{
    "name": "John",
    "sex" : "Male",
    "age": 25,
    "birthDate": "1990/05/01",
    "about" : "I love to go rock climbing",
    "interests": [ "sports", "music" ]
}

With a relational database such as MySQL, you would naturally create a User table with the corresponding fields. In Elasticsearch, this record is a document, the document belongs to a type (here, a user type), and types live inside an index. Here is a simple comparison of Elasticsearch and relational database terms:

| Relational database | Elasticsearch |
| -- | -- |
| Database | Index |
| Table | Type |
| Row | Document |
| Column | Field |

An Elasticsearch cluster can contain multiple indexes (databases), each index can contain multiple types (tables), each type contains multiple documents (rows), and each document contains multiple fields (columns). You can interact with Elasticsearch through the Java API or directly through the RESTful HTTP API. For example, to insert a record, we can simply send an HTTP request:

PUT /megacorp/employee/1
{
    "name": "John",
    "sex" : "Male",
    "age": 25,
    "about" : "I love to go rock climbing",
    "interests": [ "sports", "music" ]
}

Updates and queries work in a similar way. For the full set of operations, please refer to Elasticsearch: The Definitive Guide.
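For instance, a full-text query against the record we just inserted can be sent the same way. Below is a minimal sketch in Python, assuming the requests library and a hypothetical local cluster at http://localhost:9200 (the typed URL matches the pre-7.x style API used above):

import requests

ES = "http://localhost:9200"  # hypothetical local cluster; adjust for your environment

# Full-text match query on the 'about' field of the employee documents.
resp = requests.post(
    f"{ES}/megacorp/employee/_search",
    json={"query": {"match": {"about": "rock climbing"}}},
)
for hit in resp.json()["hits"]["hits"]:
    print(hit["_id"], hit["_source"]["name"], hit["_score"])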

Index

The most important thing Elasticsearch provides is powerful indexing. InfoQ’s “Secrets of Time Series Database (2) – Index” covers this very well; this article reorganizes that material together with my own understanding, in the hope of helping you follow the topic more easily.

The essence of Elasticsearch index:

Everything is designed to improve search performance

Put another way: to improve search performance, something else inevitably has to be sacrificed, such as insert/update performance; otherwise there would be no need to use other databases alongside it. We saw earlier that inserting a record into Elasticsearch is just PUTting a JSON object with several fields, such as name, sex, age, about and interests in the example above. While storing this data, Elasticsearch also quietly builds an index for each of these fields – an inverted index – because search is Elasticsearch’s core function.

How does Elasticsearch achieve fast indexing

The InfoQ article said that the inverted index used by Elasticsearch is faster than the B-Tree index of the relational database. Why?

What is a B-Tree index?

Back in university, our teachers taught us that searching a binary tree takes O(log N), and inserting a new node does not require moving all existing nodes, so a tree structure balances insertion and query performance. Building on this, and taking the disk’s read characteristics (sequential vs. random reads) into account, traditional relational databases adopt B-Tree/B+Tree structures:


To improve query efficiency and reduce the number of disk seeks, multiple values are stored as an array in contiguous blocks, so that a single seek reads several entries at once while also reducing the height of the tree.

What is an inverted index?


Continuing the example above, suppose we have the following documents (for simplicity, the about and interests fields are omitted):

|ID|Name|Age|Sex|
| -- |:------------:| -----:| -----:|
| 1 | Kate | 24 | Female
| 2 | John | 24 | Male
| 3 | Bill | 29 | Male

ID is the document id assigned by Elasticsearch; the indexes Elasticsearch builds look like this:

Name:

| Term | Posting List |
| -- |:----:|
| Kate | 1 |
| John | 2 |
| Bill | 3 |

Age:

| Term | Posting List |
| -- |:----:|
| 24 | [1,2] |
| 29 | 3 |

Sex:

| Term | Posting List |
| -- |:----:|
| Female | 1 |
| Male | [2,3] |

Posting List

Elasticsearch builds an inverted index for each field. Kate, John, 24, Female and so on are called terms, and [1,2] is the posting list: an array of ints storing the ids of all documents that match a given term.
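As a tiny sketch of what these inverted indexes look like in code (my own illustration, not how Lucene actually stores them):

from collections import defaultdict

docs = {
    1: {"name": "Kate", "age": 24, "sex": "Female"},
    2: {"name": "John", "age": 24, "sex": "Male"},
    3: {"name": "Bill", "age": 29, "sex": "Male"},
}

# One inverted index per field: term -> sorted posting list of document ids.
inverted = defaultdict(lambda: defaultdict(list))
for doc_id, doc in sorted(docs.items()):
    for field, term in doc.items():
        inverted[field][term].append(doc_id)

print(inverted["age"][24])      # [1, 2]
print(inverted["sex"]["Male"])  # [2, 3]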

Seeing this, don’t think it’s over, the exciting part has just begun…

With posting lists, search already looks fast. For example, to find the classmates with age=24, Xiao Ming, who loves answering questions, immediately raises his hand: “I know, the classmates with ids 1 and 2.” But what if there are tens of millions of records? What if we want to search by name?

Term Dictionary

To find a given term quickly, Elasticsearch sorts all the terms and looks them up with binary search, giving O(log N) lookups, just like searching through a dictionary. This is the Term Dictionary. At this point it looks quite similar to a traditional database’s B-Tree, so why is it faster than a B-Tree query?

Term Index

A B-Tree improves query performance by reducing the number of disk seeks. Elasticsearch follows the same idea and would ideally look terms up directly in memory without touching the disk. But if there are too many terms, the term dictionary becomes too large to keep in memory. Hence the Term Index: like the index pages of a dictionary, which tell you on which page the terms starting with A begin, it can be thought of as a tree:


This tree does not contain all terms, it contains some prefixes of terms. Through the term index, you can quickly locate a certain offset in the term dictionary, and then search sequentially from this position.


Therefore, the term index does not need to store every term, only the mapping from some of their prefixes to blocks of the Term Dictionary. Combined with FST (Finite State Transducer) compression, the term index can be cached in memory. Once the corresponding block of the term dictionary is located via the term index, the term itself is read from disk, which greatly reduces the number of random disk reads.
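A rough sketch of this two-level lookup (illustration only: the real term index is an FST and the dictionary blocks live on disk): keep a small in-memory index mapping the first term of each block to the block’s position, binary-search that index, then scan only the one block.

from bisect import bisect_right

# Sorted term dictionary, split into blocks (stored on disk in the real implementation).
blocks = [
    ["able", "acid", "after"],
    ["ball", "bell", "bird"],
    ["cold", "cool", "cup"],
]
# In-memory term index: first term of each block, in order.
index_terms = [block[0] for block in blocks]   # ["able", "ball", "cold"]

def lookup(term):
    i = bisect_right(index_terms, term) - 1    # which block could contain the term?
    if i < 0:
        return False
    return term in blocks[i]                   # one "disk" read, then a scan within the block

print(lookup("bell"), lookup("zebra"))         # True False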

At this point Xiao Ming, who likes asking questions, raised his hand again: “So what on earth is this FST thing?”

You can tell at a glance that Xiao Ming, like me, didn’t pay much attention in college; the data structures teacher must have covered FSTs. Never mind, I’ve forgotten too, so here is a refresher:

FSTs are finite-state machines that map a term (byte sequence) to an arbitrary output.

Suppose we now want to map mop, moth, pop, star, stop and top (term prefixes in the term index) to the sequence numbers 0, 1, 2, 3, 4, 5 (block positions in the term dictionary). The simplest way is to define a Map and let each key find its own seat, but if we care about using as little memory as possible, is there a better way? The answer is FST (there is a theoretical paper behind it, although I suspect 99% of readers will not read it carefully):


(In the figure: a circle represents a state, and an arrow (–>) represents a transition between states, labeled with a letter and a weight.)

Each word is spelled out letter by letter along the states and arrows; weights of 0 are not shown. Where paths branch, a weight is attached to the arc, and the sum of the weights along a word’s entire path is the sequence number for that word.
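To make the weight-summing concrete, here is a hand-built toy version of the transducer for these six words (my own sketch: the arcs are written out by hand, whereas real FSTs also share suffixes and are built by a dedicated construction algorithm). Each arc carries a letter and a weight, and a word’s output is the sum of the weights along its path:

# Arcs of a toy transducer: (state, letter) -> (next_state, weight).
ARCS = {
    ("start", "m"): ("m", 0),  ("start", "p"): ("p", 2),
    ("start", "s"): ("s", 3),  ("start", "t"): ("t", 5),
    ("m", "o"): ("mo", 0),     ("mo", "p"): ("end", 0),  ("mo", "t"): ("mot", 1),
    ("mot", "h"): ("end", 0),
    ("p", "o"): ("po", 0),     ("po", "p"): ("end", 0),
    ("s", "t"): ("st", 0),     ("st", "a"): ("sta", 0),  ("st", "o"): ("sto", 1),
    ("sta", "r"): ("end", 0),  ("sto", "p"): ("end", 0),
    ("t", "o"): ("to", 0),     ("to", "p"): ("end", 0),
}

def output(term):
    """Sum the weights along the path spelled out by the term's letters."""
    state, total = "start", 0
    for letter in term:
        state, weight = ARCS[(state, letter)]
        total += weight
    return total

print([output(w) for w in ["mop", "moth", "pop", "star", "stop", "top"]])  # [0, 1, 2, 3, 4, 5]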


FST stores terms byte by byte, and this compression effectively reduces the storage space, making the term index small enough to fit in memory; the trade-off is that lookups require more CPU.

What follows is even more exciting; readers who are getting tired can grab a cup of coffee first…

Compression techniques

Besides compressing the term index with FST as described above, Elasticsearch also compresses posting lists.

After his coffee, Xiao Ming raised his hand again: “Doesn’t the posting list just store document ids? Does it really need compression?”

Well, let’s go back to the first example. What happens if Elasticsearch needs to index the students’ gender (at this point a traditional relational database has already fainted in the toilet…)? If there are tens of millions of students and only two genders in the world, male and female, each posting list will contain millions of document ids. How does Elasticsearch compress these document ids effectively?

Frame Of Reference

Delta encoding plus compression: turn large numbers into small ones and store them in as few bytes as possible.

First, Elasticsearch requires posting lists to be ordered (to improve search performance, even such a capricious requirement is honored). One benefit of this is that it makes compression easy. See the following illustration:


As long as your math was not taught by your PE teacher, this compression technique is fairly easy to follow.

The principle is delta encoding: replace the original large numbers with small increments, then carefully pack the bits so that each value takes only as many bytes as it needs, instead of spending a full 4-byte int even on a value like 2.
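A minimal sketch of the idea (my own illustration, not Lucene’s actual encoder): delta-encode the sorted ids, split the deltas into blocks, and give each block only as many bits as its largest delta needs.

def frame_of_reference(doc_ids, block_size=3):
    """Toy delta + bit-width encoding of a sorted posting list."""
    # 1. Delta-encode: store the gaps instead of the absolute ids.
    deltas, prev = [], 0
    for doc_id in doc_ids:
        deltas.append(doc_id - prev)
        prev = doc_id
    # 2. For each block, keep only as many bits as its largest delta needs.
    blocks = []
    for i in range(0, len(deltas), block_size):
        block = deltas[i:i + block_size]
        bits = max(d.bit_length() for d in block) or 1
        blocks.append((bits, block))           # a real encoder packs these bits tightly
    return blocks

# [73, 300, 302, 332, 343, 372] -> deltas [73, 227, 2, 30, 11, 29]
# -> [(8, [73, 227, 2]), (5, [30, 11, 29])]: 8 and 5 bits per value instead of 32
print(frame_of_reference([73, 300, 302, 332, 343, 372]))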

Roaring bitmaps

Roaring bitmaps are best explained by starting with plain bitmaps. A bitmap is a data structure; suppose we have the posting list:

[1,3,4,7,10]

The corresponding bitmap is:

[1,0,1,1,0,0,1,0,0,1]

Very intuitive: 0/1 indicates whether a value is present; for example, the value 10 corresponds to the 10th bit being 1. One byte can thus represent 8 document ids. Older versions of Lucene (before 5.0) compressed posting lists this way, but it is still not efficient enough: with 100 million documents, 12.5 MB is needed, and that is for a single indexed field (we usually index many fields). So someone came up with a more efficient data structure: Roaring bitmaps.

The drawback of a bitmap is that its storage grows linearly with the number of documents, regardless of how many ids it actually holds. Roaring bitmaps use a few tricks to break this curse:

Split the posting list into blocks on 65536 boundaries: the first block holds document ids in the range 0~65535, the second 65536~131071, and so on. Each id is then represented as a (quotient, remainder) pair with respect to 65536, so the value stored inside each block always falls in the range 0~65535. Since the values inside a block can never grow without bound, each block can be stored in whatever way is most space-efficient.


Careful Xiao Ming raised his hand again at this time: “Why is 65535 the limit?”

In the programmer’s world, besides 1024, 65535 is another classic number: it equals 2^16-1, the largest value that fits in two bytes, i.e. one short. Also note the last line in the figure above: “If a block has more than 4096 values, encode as a bit set, and otherwise as a simple array using 2 bytes per value” — large blocks are stored as a bitset, and small blocks as a plain array with 2 bytes per value, which is conveniently just a short[].

Then why use 4096 to distinguish between large and small blocks?

My personal understanding: the programmer’s world runs on powers of two. 4096 values × 2 bytes = 8192 bytes = 8 KB, which is exactly the size of the fixed bitset for a 65536-value block (65536 bits / 8 = 8 KB). Beyond 4096 values the 2-byte-per-value array would take more space than the bitset, so the bitset wins; below that threshold the array is smaller.
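A toy sketch of the container idea (my own illustration, not the actual Roaring implementation): split each id into a 16-bit high part (the block key) and a 16-bit low part, then store each block either as a sorted array of 2-byte values or as a fixed 8 KB bitset, depending on how many values it holds.

ARRAY_LIMIT = 4096  # above this, the fixed 8 KB bitset is more compact than 2-byte entries

def roaring_like(doc_ids):
    """Toy Roaring-style containers: block key -> ('array', [...]) or ('bitset', bytearray)."""
    grouped = {}
    for doc_id in sorted(doc_ids):
        key, low = doc_id >> 16, doc_id & 0xFFFF    # quotient and remainder of / 65536
        grouped.setdefault(key, []).append(low)
    containers = {}
    for key, lows in grouped.items():
        if len(lows) <= ARRAY_LIMIT:
            containers[key] = ("array", lows)       # 2 bytes per value in a real implementation
        else:
            bits = bytearray(65536 // 8)            # fixed-size bitset
            for low in lows:
                bits[low >> 3] |= 1 << (low & 7)
            containers[key] = ("bitset", bits)
    return containers

print(roaring_like([1, 2, 100000, 196608])[1])      # ('array', [34464]); 100000 = 1*65536 + 34464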

Joint Index

Everything above is about indexing a single field. When a query combines several indexed fields, how does the inverted index support fast lookups?

  • Use a skip list data structure to quickly compute the “AND” (intersection), or
  • Use the bitsets mentioned above and AND them bit by bit

First, look at the skip list data structure:


Take an ordered linked list as level0 and promote some of its elements to level1 and level2; the higher the level, the fewer elements it holds. A search goes from the highest level down: to find 55, we first reach 31 at level2, then 47 at level1, and finally 55 at level0 — three steps in total. The search efficiency is comparable to that of a binary tree, at the cost of some space redundancy.

Suppose the following three posting lists need to be queried jointly:


With skip lists, we take each id in the shortest posting list and check, one by one, whether it exists in the other two posting lists; the ids present in all of them form the intersection.

With bitsets it is even more intuitive: AND the bitsets directly, and the result is the intersection.
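A rough sketch of both approaches (illustration only; Lucene’s real skip lists and bitsets are more involved): intersect sorted posting lists by probing from the shortest list, or AND bit-level representations.

from bisect import bisect_left

def contains(sorted_ids, x):
    """Membership test on a sorted posting list (a skip list plays a similar role)."""
    i = bisect_left(sorted_ids, x)
    return i < len(sorted_ids) and sorted_ids[i] == x

def intersect_lists(postings):
    shortest, *rest = sorted(postings, key=len)
    return [x for x in shortest if all(contains(p, x) for p in rest)]

def intersect_bitsets(postings):
    result = None
    for posting in postings:
        bits = 0
        for doc_id in posting:
            bits |= 1 << doc_id                     # toy bitset: one big Python int
        result = bits if result is None else result & bits
    return [i for i in range(result.bit_length()) if result >> i & 1]

postings = [[1, 3, 4, 7, 10], [1, 4, 7, 9], [2, 4, 7, 10]]
print(intersect_lists(postings), intersect_bitsets(postings))   # [4, 7] [4, 7]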

Summary and reflection

Elasticsearch index ideas:

Keep as much as possible in memory and minimize random disk reads (while still exploiting the disk’s strength at sequential reads), combine all kinds of tricks and clever compression algorithms, and treat memory with an extremely frugal attitude.

Therefore, when using Elasticsearch for indexing, you need to pay attention to:

  • Fields that do not need to be indexed should be declared explicitly, because by default every field is indexed
  • Likewise, string fields that do not need analysis should be declared explicitly, because by default strings are analyzed (see the mapping sketch after this list)
  • Choosing a regular ID is very important; highly random IDs (such as Java’s UUID) hurt query performance
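As an illustration of the first two points (my own sketch using the 7.x-style mapping syntax against a hypothetical local cluster; older versions use types and not_analyzed instead of the keyword type), an explicit mapping might look like this:

import requests

mapping = {
    "mappings": {
        "properties": {
            "name":      {"type": "keyword"},                   # exact value, not analyzed
            "about":     {"type": "text"},                      # full text, analyzed
            "interests": {"type": "keyword"},
            "internal_note": {"type": "text", "index": False},  # stored but not searchable
        }
    }
}
requests.put("http://localhost:9200/megacorp", json=mapping)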

Regarding the last point, I personally think there are multiple factors:

One factor (perhaps not the most important): the compression algorithms above all compress large numbers of ids in posting lists, so if the ids are sequential, or at least share some regularity such as a common prefix, the compression ratio will be higher;

Another factor, probably the one that affects query performance the most: the last step of a query is to fetch the document data from disk using the ids in the posting list. Elasticsearch stores data in segments, and how efficiently a segment can be located by the rough range of the ID directly affects the final query performance. If the IDs follow a pattern, segments that cannot contain the ID can be skipped quickly, reducing unnecessary disk reads. For details, see the article on how to choose an efficient global ID scheme (the comments there are also well worth reading).
