Elasticsearch inverted index principle

An inverted index is, first of all, a kind of index, and the point of any index is to retrieve the data we store quickly.

Each database has its own problems to solve (its own areas of expertise) and, correspondingly, its own data structures; different usage scenarios and data structures require different indexes to speed up queries as much as possible.

For MySQL, the B+ tree index optimizes the storage layout of existing data: when fast updates are not required, pre-sorting and similar techniques are traded for smaller storage and faster retrieval. The cost is that every update must rebalance the B+ tree, so updates become slower. Elasticsearch uses Lucene’s inverted index to achieve faster filtering than relational databases, and its support for multi-condition filtering is especially good.

Elasticsearch is a search engine built on top of the full-text search library Lucene. It hides Lucene’s complexity behind a simple and consistent RESTful API, but that cannot hide the fact that Lucene sits underneath.
The inverted index of Elasticsearch is really Lucene’s inverted index.

The origin of the name of the inverted index

Before search engines existed, we typed in a URL and got back a site’s content. Our behavior at that point was:

document -> words

Given a document, we obtain the words inside it; this is the so-called forward index.

Later, we wanted to be able to type in a word and find the articles that contain it or relate to it:

word -> documents

So this kind of index is called an inverted index. A more literal rendering of the term would be “reverse index”, but “inverted index” is the translation that stuck.

Internal structure of inverted index

First, when data is written, say a document whose content is “survival or death”, a tokenizer splits it into the three words “survival”, “or” and “death”, and the meaningless stop word “or” may then be discarded.

The two remaining words and the corresponding document ID are then saved:

word documentId
survive 1
death 1

Then we insert another document whose content is “survival”, and the index becomes:

word documentId
survive 1, 2
death 1

The next time someone searches for “survival”, both documents 1 and 2 are returned.
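The index-building steps above can be sketched in a few lines of Python (a toy model for illustration, nothing like Lucene’s actual implementation; the stop-word list and whitespace tokenizer are simplifying assumptions):

```python
from collections import defaultdict

STOP_WORDS = {"or"}  # assumed stop-word list for this example

def build_inverted_index(docs):
    """Map each remaining word to the list of document IDs containing it."""
    index = defaultdict(list)
    for doc_id, text in docs.items():
        for word in text.lower().split():  # trivial whitespace tokenizer
            if word in STOP_WORDS:
                continue
            if doc_id not in index[word]:
                index[word].append(doc_id)
    return index

index = build_inverted_index({1: "survival or death", 2: "survival"})
print(index["survival"])  # [1, 2]
print(index["death"])     # [1]
```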

But this alone is far from enough. There are a huge number of distinct terms across the world’s languages; without further structure, looking up a term would mean a full scan, which is very inefficient. Sorting is needed so that binary search and similar techniques can speed up the lookup: this sorted term table is Lucene’s term dictionary. On the other hand, sorting alone still leaves a problem: the data lives on disk, so lookups pay for slow disk I/O, while pulling all of it into memory would fill memory up.

Therefore, on top of the table above, Lucene’s inverted index adds a layer on the left: the term index, a prefix tree (trie). It does not store all the terms, only term prefixes. Through it, one can find the block of the term dictionary in which the term sits, i.e. the term’s approximate position, then binary-search within that block to find the exact term, and finally follow it to the term’s document list.

(Figure: internal structure of Lucene’s inverted index)

In addition, to save memory further, Lucene compresses the term index using an FST (Finite State Transducer). The term index is kept in memory in FST form, whose main virtue is compactness. The term dictionary is stored on disk in blocks, with each block compressed by its common prefix: if all the words in a block start with “Ab”, the “Ab” can be stored once and omitted elsewhere. In this way the term dictionary uses less disk space than a B-tree would.
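One simple form of common-prefix compression is front coding: within a sorted block, store only how many leading characters each term shares with its predecessor, plus the differing suffix. This is a sketch of the idea only, not Lucene’s actual on-disk block format:

```python
def prefix_compress(sorted_terms):
    """Front-code a sorted term block as (shared_prefix_len, suffix) pairs."""
    out, prev = [], ""
    for term in sorted_terms:
        shared = 0
        while shared < min(len(prev), len(term)) and prev[shared] == term[shared]:
            shared += 1
        out.append((shared, term[shared:]))
        prev = term
    return out

def prefix_decompress(encoded):
    """Rebuild the original terms from the front-coded pairs."""
    terms, prev = [], ""
    for shared, suffix in encoded:
        prev = prev[:shared] + suffix
        terms.append(prev)
    return terms

block = ["abandon", "ability", "able", "abort"]
coded = prefix_compress(block)
print(coded)  # [(0, 'abandon'), (2, 'ility'), (2, 'le'), (2, 'ort')]
assert prefix_decompress(coded) == block  # lossless round trip
```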

Improvements to the Posting List

The native posting list has two areas that can be improved:

  • How to compress it to save disk space
  • How to compute intersections and unions quickly

Compression

Suppose there is such an array:

[73, 300, 302, 332, 343, 372]

How to compress it?

In Lucene, data is stored in segments, and each segment can hold at most 65536 document IDs, so a document ID ranges from 0 to 2^16 - 1. Stored raw, each element therefore occupies 2 bytes; for the array above, that is 6 * 2 = 12 bytes.

Compression means making each element occupy as little space as possible while still being able to restore the original information losslessly.

Incremental encoding

Record only the increments between consecutive elements, and the array becomes:

[73, 227, 2, 30, 11, 29]
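Incremental (delta) encoding and its inverse are easy to sketch in Python (illustration only):

```python
def delta_encode(doc_ids):
    """Keep the first value, then store each element's gap from its predecessor."""
    return doc_ids[:1] + [b - a for a, b in zip(doc_ids, doc_ids[1:])]

def delta_decode(deltas):
    """A running sum restores the original document IDs losslessly."""
    out, total = [], 0
    for d in deltas:
        total += d
        out.append(total)
    return out

ids = [73, 300, 302, 332, 343, 372]
print(delta_encode(ids))  # [73, 227, 2, 30, 11, 29]
assert delta_decode(delta_encode(ids)) == ids
```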

Split into chunks

Lucene packs 256 document IDs into each block. After incremental encoding, the gaps within a block are typically small and fit into few bits, and the fixed block size is also convenient for the skip-table operations used later for intersection and union.

For demonstration purposes, let’s assume each chunk is 3 document IDs:

[73, 227, 2], [30, 11, 29]

Allocate space on demand

For the first block, [73, 227, 2], the largest element is 227, which needs 8 bits, so each element in that block is allocated 8 bits.

For the second block, [30, 11, 29], the largest element is only 30, which needs just 5 bits, so allocating 5 bits per element is enough.

The three steps above together constitute an encoding technique called Frame Of Reference (FOR):

(Figure: the FOR encoding technique in Elasticsearch/Lucene)
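The three steps can be combined into a small sketch that computes the FOR layout for the running example (the 3-ID block size is kept from the demonstration above; real Lucene bit-packs 256-ID blocks, which this toy version does not attempt):

```python
def bits_needed(n):
    """Bits required to represent n (at least 1)."""
    return max(1, n.bit_length())

def for_layout(doc_ids, block_size=3):
    """Delta-encode, split into blocks, and record each block's bit width."""
    deltas = doc_ids[:1] + [b - a for a, b in zip(doc_ids, doc_ids[1:])]
    blocks = [deltas[i:i + block_size] for i in range(0, len(deltas), block_size)]
    return [(bits_needed(max(block)), block) for block in blocks]

ids = [73, 300, 302, 332, 343, 372]
print(for_layout(ids))  # [(8, [73, 227, 2]), (5, [30, 11, 29])]
# Payload: 3*8 + 3*5 = 39 bits, versus 6 * 16 = 96 bits stored raw.
```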

Quick intersection and union

When querying in Lucene, there is usually more than one query condition, for example, if you want to search:

  • Documents containing the term “survival”
  • The document was published in the last month
  • The document publisher is a contributing author of the platform

This means searching three inverted indexes, one per field. As described in the previous section, the data on disk is compressed with FOR, so it must first be reversed, that is, decompressed, to restore the original document IDs; then the three document-ID arrays are intersected in memory.

Even without multi-condition queries, Lucene needs frequent unions, because Lucene stores data in separate segments.

The problems encountered by Lucene can be simplified into an algorithm problem.

Suppose you have the following three arrays:

[64, 300, 303, 343]

[73, 300, 302, 303, 343, 372]

[303, 311, 333, 343]

Find their intersection.

Integer array

Using the raw document IDs directly, a one-by-one traversal of the arrays does find the intersection, but it is poor in both space and performance.

For sorted arrays, a skip table makes traversal more efficient, but plain integer arrays are still unattractive in both performance and space. Suppose there are 100M document IDs at 2 bytes each: that is already 200 MB, and the data must sit in memory to be processed. Decompressing that much from disk and throwing it into memory would certainly overwhelm memory.

Bitmap

Suppose there is such an array:

[3,6,7,10]

Then it can be represented with a bitmap:

[0,0,0,1,0,0,1,1,0,0,1]

A 0 at a subscript means that number is absent, and a 1 means it is present.

This brings two benefits:

  • Save space: each document ID needs only 1 bit. With 100M documents, that is 100M bits = 100M * 1/8 bytes = 12.5 MB, a huge saving compared with the 200 MB of the integer-array approach.
  • Faster operations: 0s and 1s are natural operands for bitwise operations, AND for intersection and OR for union; everything returns to the computer’s native primitives.
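A bitmap and its bitwise intersection and union can be sketched with a Python integer used as a bitset (real implementations pack bits into fixed-size machine words, but the idea is the same):

```python
def to_bitmap(doc_ids):
    """Set bit i for every document ID i."""
    bm = 0
    for d in doc_ids:
        bm |= 1 << d
    return bm

def from_bitmap(bm):
    """Recover the sorted document IDs from the bitset."""
    return [i for i in range(bm.bit_length()) if bm >> i & 1]

a = to_bitmap([3, 6, 7, 10])
b = to_bitmap([3, 7, 9])
print(from_bitmap(a & b))  # [3, 7]            intersection via bitwise AND
print(from_bitmap(a | b))  # [3, 6, 7, 9, 10]  union via bitwise OR
```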

Roaring Bitmaps

Bitmaps have a flaw: the space they occupy is the same no matter how many documents there are. As mentioned before, each block of a Lucene posting list can hold at most 65536 document IDs. Take an extreme example, an array containing only two document IDs:

[0, 65535]

Represented as a bitmap, it needs:

[1,0,0,0,…(a very long run of 0s)…,0,0,1]

That is 65536 bits, i.e. 65536 / 8 = 8192 bytes, whereas an integer array needs only 2 * 2 bytes = 4 bytes.

It can be seen that when the number of documents is small, using an Integer array saves more memory.

The critical value is simple to compute. The bitmap costs 8192 bytes regardless of the number of documents, while the integer array grows linearly with the number of documents at 2 bytes per document ID, so:

8192 / 2 = 4096

When the number of documents is below 4096, use an integer array; otherwise, use a bitmap.
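The break-even rule can be written down directly (a sketch of the per-block container choice Roaring bitmaps make; the function name is invented for this example):

```python
BITMAP_COST = 65536 // 8  # a fixed 8192 bytes per block

def choose_container(num_docs):
    """Pick the cheaper representation for a block holding num_docs IDs."""
    array_cost = 2 * num_docs  # 2 bytes per document ID
    return "array" if array_cost < BITMAP_COST else "bitmap"

print(choose_container(2))     # array   (4 bytes vs 8192)
print(choose_container(4095))  # array   (8190 bytes vs 8192)
print(choose_container(4096))  # bitmap  (break-even point)
```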

For integer arrays, use a Skip List (skip table) for the combined computation

Build a skip table over each integer array to be searched, then traverse starting from the shortest posting list; during the traversal many elements can be skipped, as in the following example:

(Figure: Lucene skip-table example)

The figure shows three posting lists that need to be combined with AND to get their intersection. Start from the shortest posting list and traverse from small to large; along the way some elements can be skipped. For example, once we have reached the green 13, the blue 3 can be skipped, because 3 is smaller than 13.

The whole process is as follows:

Next -> 2
Advance(2) -> 13
Advance(13) -> 13
Already on 13
Advance(13) -> 13 MATCH!!!
Next -> 17
Advance(17) -> 22
Advance(22) -> 98
Advance(98) -> 98
Advance(98) -> 98 MATCH!!!
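The Next/Advance trace above can be sketched as a leapfrog-style intersection of sorted lists; here `bisect` stands in for the skip table’s Advance operation (a toy model, not Lucene’s actual conjunction scorer):

```python
import bisect

def advance(lst, pos, target):
    """Smallest index >= pos whose value is >= target (skip-table stand-in)."""
    return bisect.bisect_left(lst, target, pos)

def intersect(lists):
    """Drive from the shortest posting list; Advance the others to each candidate."""
    lists = sorted(lists, key=len)
    positions = [0] * len(lists)
    result = []
    for candidate in lists[0]:
        ok = True
        for i in range(1, len(lists)):
            positions[i] = advance(lists[i], positions[i], candidate)
            if positions[i] == len(lists[i]) or lists[i][positions[i]] != candidate:
                ok = False  # this list does not contain the candidate
                break
        if ok:
            result.append(candidate)
    return result

a = [64, 300, 303, 343]
b = [73, 300, 302, 303, 343, 372]
c = [303, 311, 333, 343]
print(intersect([a, b, c]))  # [303, 343]
```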

The relationship between Roaring bitmaps and Frame Of Reference

Frame Of Reference compresses data to reduce disk usage, so fetching data from disk requires the reverse process, decompression, which restores the document-ID array shown earlier: [73, 300, 302, 303, 343, 372]. To process that data, that is, to take intersections or unions, it must be held in memory. We have three such arrays, they may be very large, and memory is more precious than disk, so a stronger compression scheme is needed, one that also lends itself to fast intersection and union. That is why the Roaring Bitmaps structure exists.

In addition, Lucene caches data fetched from disk in memory after converting it to Roaring bitmaps; Lucene calls this the filter cache.

Why Elasticsearch/Lucene retrieval can be faster than MySQL

MySQL has only the term-dictionary layer, stored on disk sorted in a B+ tree; retrieving a term can take several random disk I/Os. Lucene adds a term index on top of the term dictionary to speed up retrieval, and the term index is cached in memory in tree form. Once the term index has located the block of the term dictionary that holds the term, the term is read from disk, which greatly reduces the number of random disk accesses.
