MySQL index module

1. Why does MySQL database use B + tree to store indexes?

If you only select one piece of data, hashing is indeed faster. However, multiple items are often selected in the database. At this time, because the B + tree index is ordered and connected by a linked list, its query efficiency is much faster than hash.
The indexes of the file system and database are stored on the hard disk, and if the amount of data is large, it may not be loaded into the memory at one time. If it is in memory, the red-black tree is more efficient than the b-tree, but when it comes to disk operations, the b-tree is even better.
Moreover, the index in the database is generally on the disk. If the amount of data is large, it may not be loaded into the memory at one time. The design of the B + tree can allow data to be loaded in batches. At the same time, the height of the tree is low, which improves search efficiency.
1. Reduce disk operations: B+ trees do not store Value in non-leaves, only keys, so more nodes can be loaded at one time and the number of times the disk is loaded into memory is reduced. The non-leaf nodes of the B+ tree do not store actual record data, but only store indexes. Therefore, when the amount of data is the same, compared with the B-tree that stores both indexes and records, the non-leaf nodes of the B+ tree can store more index, so the B + tree can be "chunkier" than the B tree, and the number of disk I/Os required to query the underlying nodes will be less.

2. More advantages of range search: MySQL often has range search. There is a doubly linked list between the leaf nodes of the B + tree, which supports range search very well. The leaf nodes of the B + tree are connected with a linked list, which is conducive to range queries. However, the B tree needs to implement range queries, so the range queries can only be completed through tree traversal, which will involve disk I/O operations on multiple nodes. , the range query efficiency is not as good as B + tree.

3. Fast search speed: B+ tree is similar to a multi-path tree, so for the same node, the height of the tree is lower and the search speed is faster (of course, the more important point is still the reduction). B+ tree has a large number of redundant nodes (all non-leaf nodes are redundant indexes). These redundant indexes make B+ tree more efficient in inserting and deleting. For example, when deleting the root node, it will not be like B Trees can undergo complex tree changes

2 What is a B-tree? Why do file indexes use B-trees instead of binary search trees?

Tree structures such as B-tree, B+ tree, and binary search tree are all ordered, so the query efficiency is very high, and the target data can be found in O(logn) time complexity.

Although the hash table can find the target data in O(1), if we want to perform a fuzzy search, we can only traverse all the data. And if an extreme situation occurs and there are too many conflicting elements in the hash table, it will also cause Linear time search efficiency.

In terms of search efficiency (that is, the number of comparisons), binary trees can actually be said to be the fastest. However, our file index is stored on the disk, so we must not only consider search efficiency, but also disk addressing. Loading times, and this is why we use B-trees.

When loading data from disk into memory, it is loaded in units of pages, and we also know that the data between nodes is discontinuous, so different nodes are likely to be distributed in different locations. in the disk page.

For the B-tree, since each node of the B-tree can store multiple elements, the number of disk addressing and loading will be relatively small, and the operation speed in the memory is very fast, at least faster than the disk addressing and loading speed. Hundreds of times faster, and when we perform numerical comparisons, they are done in memory. Although the number of comparisons of the B-tree may be more than that of the binary search tree, the number of disk operations is fewer, so overall, the B-tree is faster. Many, which is why we use B-trees for storage.

In fact, the number of loading times on the disk is basically related to the height of the tree. The higher the height, the more loading times, and the shorter the height, the fewer loading times. So for the storage of this kind of file index, we usually choose a short and fat tree structure. For example, if there are 1000 elements, if it is a binary search tree, the height may be as high as 10 levels, but if a 10-order B-tree is used, only three or four levels are needed.

In addition to being used in a small number of file indexes (database indexes), B-trees are most commonly used in file systems.

3. Index

The emergence of indexes is actually to improve the efficiency of data query, just like the table of contents of a book.
Common index models: hash table, ordered array, search tree

Hash table: key – value. Put the value in the array, use a hash function to convert the key to a certain position, and then place the value at this position in the array. How to deal with hash conflicts: linked lists. Applicable scenarios for hash tables: Scenarios with only equivalent queries

Ordered array: stored in order. The query can be quickly queried using the dichotomy method, and the time complexity is: O(log(N)). Ordered array query efficiency is high, but update efficiency is low. Applicable scenarios for ordered arrays: static storage engines.

Binary search tree: The left son of each node is smaller than the parent node, and the parent node is smaller than the right son. The query time complexity is O(log(N)), and the update time complexity is O(log(N)). Most database storage is not suitable for binary trees because the tree height is too high, so N-ary trees will be suitable.

InnoDB: B+Tree

Index type: primary key index, non-primary key index

The leaf nodes of the primary key index store the entire row of data (clustered index), and the content of the leaf nodes of the non-primary key index is the value of the primary key (secondary index)
The difference between primary key index and ordinary index: primary key index only needs to search the B + Tree of ID to get the data. Ordinary indexes first search the index to get the primary key value, and then search the primary key index tree once (return to the table)
When a data page is full, according to the B + Tree algorithm, a new data page is added, which is called page splitting, which will cause performance degradation. Space utilization is reduced by about 50%. When the utilization rate of two adjacent data pages is very low, the data pages will be merged. The merging process is the reverse process of the splitting process.

Covering index

Query only primary key values: covering index. Because covering indexes can reduce the number of tree searches and significantly improve query performance, using covering indexes is a common performance optimization method.

We know that the ID number is the unique identifier of a citizen. In other words, if there is a need to query citizen information based on ID number, we only need to create an index on the ID number field. Is it a waste of space to create a joint index of (ID card number, name)? If there is a high-frequency request to query a citizen’s name based on his ID card number, this joint index will make sense. It can use covering indexes for this high-frequency request, eliminating the need to look up the entire row of records in the table and reducing the execution time of statements.

Leftmost prefix

In an index structure such as a B+ tree, you can use the “leftmost prefix” of the index to locate records and how to arrange the order of fields in the index when building a joint index.

Our evaluation criterion here is the reusability of the index. Because the leftmost prefix can be supported, when there is already a joint index of (a, b), there is generally no need to create a separate index on a. Therefore, the first principle is that if one less index can be maintained by adjusting the order, then this order is often the one that needs to be prioritized.

So, what if there are both joint queries and queries based on a and b? If there is only b in the query condition, the joint index of (a, b) cannot be used. At this time, you have to maintain another index, which means that you need to maintain both (a, b) and (b) at the same time. index. At this time, the principle we need to consider is space.

Index pushdown

select * from tuser where name like 张%’ and age=10 and ismale=1;
Union index (name, age)

Before MySQL 5.6, you could only return tables one by one starting from ID3. Find the data row on the primary key index, and then compare the field values.

The index pushdown optimization (index condition pushdown) introduced in MySQL 5.6 can first judge the fields included in the index during the index traversal process, directly filter out records that do not meet the conditions, and reduce the number of table returns.

4. What are the differences between the indexes of MyISAM and InnoDB?

MyISAM

MyISAM’s index and row records are stored separately, called an UnClustered Index.
There is no essential difference between its primary key index and ordinary index:

(1) There are continuous aggregation areas that store row records separately;

(2) The leaf node of the primary key index stores the primary key and the pointer of the corresponding row record;

(3) The leaf node of the ordinary index stores the index column and the pointer of the corresponding row record;

The primary key index and the ordinary index are two independent index B+ trees. When searching through the index column, the leaf node of the B+ tree is first located, and then the row record is located through the pointer.

InnoDB

InnoDB’s primary key index and row records are stored together, so it is called a clustered index:

(1) There is no separate area to store row records;

(2) The leaf node of the primary key index stores the primary key and the corresponding row record (instead of a pointer)

Because of this feature, InnoDB tables must have clustered indexes:

(1) If the table defines PK, then PK is the clustered index;

(2) If the table does not define a PK, the first non-empty unique column is a clustered index;

(3) Otherwise, InnoDB will create a hidden row-id as a clustered index;

There can only be one clustered index, because the data rows can only have one clustered storage on the physical disk.

Both MyISAM and InnoDB use B+ trees to implement indexes:

(1) MyISAM’s index and data are stored separately;

(2) The index leaf storage pointer of MyISAM is not much different from the primary key index and the ordinary index;

(3) InnoDB’s clustered index and data rows are stored uniformly;

(4) InnoDB’s clustered index stores the data row itself, and the ordinary index stores the primary key;

(5) InnoDB must have and only one clustered index;

(6) InnoDB recommends using trend increasing integers as PK, rather than using longer columns as PK;
nnoDB supports row locks and transactions, but MyISAM does not