MySQL index pitfalls: you only know them once you've stepped on them

The index can fairly be called the heart of a database. Without indexes, a database loses much of its point: it is little different from an ordinary file. A good index is therefore especially important to a database system. Today let's talk about MySQL indexes, looking at the benefits of the B+ tree index in MySQL from the perspective of implementation details and real business, along with the points we need to watch out for when using indexes.

1. Proper use of indexes

At work, the most direct way to judge whether a field in a table needs an index is: does this field often appear in our WHERE conditions? At a macro level this reasoning is fine, but in the long run a more detailed analysis is sometimes needed, for example: should we create an index on this field alone, or is a composite index across several fields better? Take a user table as an example: it may contain the user's name, ID card number, home address, and so on.

1. Disadvantages of ordinary indexes

Now there is a requirement: find a user's name from the ID card number. The first approach that comes to mind is to create an index on id_card. Strictly speaking it should be a unique index, because ID card numbers are guaranteed to be unique. So we execute the following query:

SELECT name FROM user WHERE id_card=xxx

Its process should be like this:

  • First search on the id_card index tree to find the primary key id corresponding to id_card
  • Search on the primary key index by id to find the corresponding name
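
For concreteness, the index behind this two-step lookup might be created like this (a minimal sketch; the table layout and index name are assumed):

ALTER TABLE user ADD UNIQUE INDEX uk_id_card (id_card);
-- Step 1: find the primary key id in uk_id_card's B+ tree
-- Step 2: find name by id in the primary key (clustered) B+ tree
SELECT name FROM user WHERE id_card = '341124199408203232';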

The result is correct, but in terms of efficiency this query is somewhat expensive, because it searches two B+ trees. Suppose each tree has a height of 3; the combined height is 6. Since the root nodes stay in memory (two root nodes here), the final number of disk IOs is 4. If one random disk IO averages 10ms, the query ultimately costs 40ms. That is a mediocre number, not a fast one.

2. Traps of primary key index

Since the problem is the table return (the extra lookup on the primary key index), which forces searches in both trees, the core question is whether we can search only one tree. From a business angle you may have spotted an entry point: the ID card number is unique, so why not drop the default auto-increment ID and make the ID card number the primary key? Then the whole table needs only one index, and all required data, including the name, can be found through the ID card number. At first glance this seems reasonable: just set the primary key to the ID card number on every insert. But think about it more carefully and there is a problem.

Let's look at the characteristics of the B+ tree. Data in a B+ tree is stored on the leaf nodes and managed in pages, each page being 16K. What does this mean? Even if we have only one row right now, it occupies a 16K data page; only when a data page is full will a new data page be used. The new data page and the old one are not necessarily physically contiguous, but one key point is that even though the pages may be physically discontinuous, the data remains logically ordered.

Maybe you are curious: what does this have to do with using the ID card number as the primary key? Pay attention to the keyword "ordered". ID card numbers do not arrive in order. What does that mean? When we insert an out-of-order key, data may have to be moved to keep the order. For example, if a page currently holds 1 -> 5 and we insert 3, then 5 must be moved after 3. You may say that costs little, but if inserting 3 fills page A up, it depends on whether the following page B has room. If it does, the record overflowing from page A is moved to the start of page B, which again means moving data. If page B has no room either, a new page C must be allocated, part of the data moved onto it, and page C linked in between A and B; at the code level this is just re-pointing a linked list, but the page split itself is not free.

In summary, using out-of-order ID card numbers as primary keys may incur the overhead of moving page data, random IO, and frequent allocation of new pages. With an auto-increment primary key the IDs are strictly sequential, out-of-order inserts cause no data movement or extra random IO, and the insert overhead is naturally smaller.

In fact, there is another reason not to use the ID card number as the primary key: as a number it is too large and must be stored as a bigint, whereas an int is normally enough. We know one page holds 16K; the more space each index entry takes, the fewer entries fit on a page. So for the same amount of data, bigint keys need more pages than int keys, which means more storage.
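
As a rough sketch (the table layout is assumed), the conventional design keeps a small sequential surrogate key and stores the ID card number as an ordinary column:

CREATE TABLE user (
  id      INT UNSIGNED AUTO_INCREMENT PRIMARY KEY, -- 4 bytes, always inserted in order
  id_card VARCHAR(18),                             -- would need 8 bytes as a BIGINT key
  name    VARCHAR(32)
) ENGINE=InnoDB;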

3. The spear and shield of composite indexes

Two conclusions can be drawn from the above:

  • Try not to return to the table (avoid the extra lookup on the primary key index)
  • The ID card number is not suitable as the primary key

So it is natural to think of a composite index: create a composite index of [id_card + name]. Pay attention to the order of the columns; it must comply with the leftmost-prefix principle. Now when we execute the following SQL:

select name from user where id_card=xxx
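
For reference, the composite (covering) index behind it might be created like this (a sketch; the index name is made up):

ALTER TABLE user ADD INDEX idx_idcard_name (id_card, name);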

With such an index we can get the name field we need without returning to the table. However, this still does not solve the problem that the ID card number itself takes too much space, which is a problem of the business data itself. If we want to address it, we can use a conversion algorithm to map the large value to a small one, for example crc32:

// Go's hash/crc32 package: hash a string ID down to a 4-byte checksum
crc32.ChecksumIEEE([]byte("341124199408203232"))

An ID card number that originally needed 8 bytes of storage can be replaced by a 4-byte crc code. The database then needs an extra field crc_id_card, and the composite index changes from [id_card + name] to [crc32(id_card) + name], so the index takes less space. But the conversion has its own costs:

  • Computing the crc on every insert and query costs extra CPU
  • The extra field makes the index smaller, but the field itself takes space
  • crc values can collide, so after the index lookup we must filter again on id_card; the cost depends on the number of duplicates: the more collisions, the slower the filter (see the sketch below)
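
A sketch of this scheme (column and index names assumed), including the collision re-check mentioned above:

ALTER TABLE user ADD COLUMN crc_id_card INT UNSIGNED;
ALTER TABLE user ADD INDEX idx_crc_name (crc_id_card, name);

-- CRC32() is a built-in MySQL function; collisions force the extra id_card check:
SELECT name FROM user
WHERE crc_id_card = CRC32('341124199408203232')
  AND id_card = '341124199408203232';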

Regarding composite index storage, here is a small detail worth optimizing. Suppose fields A and B take 8 and 20 bytes respectively. With a composite index [A, B] already in place, if we still need to query B on its own, the natural move is to also create an index on B; the two indexes then occupy 8 + 20 + 20 = 48 bytes per row. But if the business allows, we can instead create [B, A] plus an index on A. Queries through A alone or B alone can still use an index, and the total space is smaller: 20 + 8 + 8 = 36 bytes per row.
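
A sketch of the two layouts (table t and the field sizes are assumed):

-- Option 1: (A, B) plus B  ->  (8 + 20) + 20 = 48 bytes of key per row
ALTER TABLE t ADD INDEX idx_a_b (A, B), ADD INDEX idx_b (B);

-- Option 2: (B, A) plus A  ->  (20 + 8) + 8 = 36 bytes of key per row
ALTER TABLE t ADD INDEX idx_b_a (B, A), ADD INDEX idx_a (A);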

4. Short and concise prefix index

Sometimes the field we need to index is a very long string. We want it indexed, but we don't want the index to take too much space. In that case we can consider a prefix index: build the index on the first part of the field's characters, enjoying the index while saving space. Note that when the prefix has many duplicates, the prefix index will be noticeably slower than a full-column index, because more rows share the same prefix and each must be checked against the full value.

alter table user add index(name(7)); -- index on only the first 7 characters of name
select name from user where name="JamesBond"
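
Before settling on a prefix length, it can help to measure selectivity (a sketch; the candidate lengths are arbitrary):

SELECT COUNT(DISTINCT LEFT(name, 5)) / COUNT(*) AS sel_5,
       COUNT(DISTINCT LEFT(name, 7)) / COUNT(*) AS sel_7,
       COUNT(DISTINCT name)          / COUNT(*) AS sel_full
FROM user;
-- Pick the shortest prefix whose ratio is close to sel_full.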

5. The fast and the slow of unique indexes

Before talking about unique indexes, let's first understand a characteristic of ordinary indexes. We know that in a B+ tree, the data on the leaf nodes is kept in order.

Suppose we query for the value 2. When the first 2 is found through the index tree, the storage engine does not stop, because there may be several 2s; it continues scanning the leaf nodes backwards. After finding a second 2, does it stop? No, because it cannot know whether more 2s follow, so it keeps scanning until it hits the first value that is not 2, namely 3, and only then stops. This is the retrieval process of an ordinary index.

A unique index is different. Uniqueness guarantees no duplicates, so the engine returns as soon as the target is found, without the extra backward scan of an ordinary index. From this angle the unique index is faster, though when all the matching entries of the ordinary index sit on one page the difference is negligible. For inserts, the unique index may be slightly worse: because of uniqueness, every insert must first check whether the value already exists, logic an ordinary index does not need; and most importantly, a unique index cannot use the change buffer (see below).

6. Don’t blindly add indexes

At work you may run into the question: does this field need an index? The common test is: will this field appear in queries? If it is often in the query conditions, we consider indexing it. But judging by that alone can lead to a wrong index. Consider an example: a user table with about 1 million rows has a sex field, with men and women each roughly half. Now we want all the men's information, so we add an index on the sex field and write the SQL like this:

select * from user where sex="male"

In all likelihood, InnoDB will not choose the sex index, because using it means a table return for every matching row. With a large data volume, what does returning to the table cost?

The main cost is massive IO: as computed earlier, one row can cost 4 IOs. What about 500,000 rows? The result is easy to imagine. So in this case MySQL's optimizer will most likely do a full table scan, scanning the primary key index directly, because that is probably cheaper.
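
You can verify the optimizer's choice with EXPLAIN (a sketch; idx_sex is the hypothetical index):

EXPLAIN SELECT * FROM user WHERE sex = 'male';
-- Expect type = ALL (a full scan of the primary key) rather than ref on idx_sex.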

7. Things about index failure

Sometimes MySQL cannot use an index because of improper use on our side, which typically happens with type conversion. You may say: doesn't MySQL already support implicit conversion? For example, with an integer user_id index field, suppose we carelessly write:

select xx from user where user_id="1234"

Note that 1234 here is a string. In this case MySQL is indeed smart enough to convert the string "1234" into the number 1234 and happily use the user_id index. But if user_id is a string field and we carelessly write:

select xx from user where user_id=1234

then there is a problem: the index will not be used. You may ask why MySQL does not convert this time; surely turning the number 1234 into the string "1234" would do? The conversion rule must be explained here. When comparing a string with a number, remember: MySQL converts the string into a number.
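
The rule is easy to verify (a minimal sketch):

SELECT '10' > 9;
-- Returns 1: the string '10' is cast to the number 10 before comparing.
-- So for a string column, WHERE user_id = 1234 casts every stored user_id
-- to a number first, and the index on user_id cannot be used.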

Maybe you will ask again: why can't the index be used once the string user_id field is converted to a number? This comes down to the structure of the B+ tree index. A B+ tree branches and sorts by index value. When we apply a type conversion to the indexed field, the values change: a value that was A may become B after the integer conversion (int(A) = B). The tree was built and ordered according to A, not B, so the index cannot be used.

2. Index optimization

1. change buffer

We know that when updating a row, we first check whether the page holding that row is in memory. If it is, we update the in-memory page directly; if not, we must read the page from disk into memory and then update it. What's the problem with that?

  • Reading from disk is a bit slow
  • If many rows are updated at once, a lot of discrete (random) IO may occur

The change buffer exists to solve the speed problem in this scenario. First, don't be misled by the word "buffer": besides living in the shared buffer pool, the change buffer is also persisted to disk. With the change buffer, if during an update the target data page is not in memory, we do not read it from disk; instead we record the pending change in the change buffer. When is the change buffer merged back to disk, and what if a read happens first? A background thread periodically merges the change buffer to disk; and if a read of the page occurs before that, the read itself also triggers the merge of the change buffer data for that page.

Note that not every index can use the change buffer; primary key indexes and unique indexes cannot. Because of their uniqueness constraint, an update must check whether the value already exists, and if the data page is not in memory it has to be read from disk anyway, so buffering the change gains nothing. Ordinary indexes have no uniqueness to verify, so they can benefit.

The larger the change buffer, the greater the theoretical benefit: there are fewer discrete read IOs, and when a data page receives multiple changes they only need to be merged to disk once. Of course, not every workload suits the change buffer. If your business reads data immediately after updating it, the change buffer becomes counterproductive: merges are triggered constantly, random IO does not decrease and may even increase, and you pay the overhead of maintaining the change buffer on top.
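
The relevant knobs are real MySQL variables (the values here are only illustrative):

SHOW VARIABLES LIKE 'innodb_change_buffering';   -- all / none / inserts / deletes ...
SET GLOBAL innodb_change_buffer_max_size = 25;   -- max % of the buffer pool it may use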

2. Index pushdown

We talked about composite indexes earlier. A composite index must satisfy the leftmost-prefix principle; that is, with a composite index on [A, B] of table t, the following SQL can use the index:

select * from t where A="xx"
select * from t where A="xx" AND B="xx"

In fact, a composite index can also match on a prefix of a column's value, for example:

select * from t where A like "Zhao%" AND B="Shanghai"

Note that because only a prefix of A is used, before MySQL 5.6 the SQL above would, after locating each index entry starting with "Zhao", immediately return to the table (we select *) and only then check whether B equals "Shanghai". Confused? Why not check B directly on the composite index and cut down the table returns? The reason is again the leftmost prefix: the index is only partially usable through A, and B is not used at all. That looks a bit "silly", so MySQL 5.6 introduced Index Condition Pushdown (ICP). With ICP, even though only a prefix of A matches, MySQL also filters out rows whose B does not match while still on the composite index, greatly reducing the number of table returns.
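
ICP is visible in the execution plan (a sketch, assuming table t with a composite index on (A, B)):

EXPLAIN SELECT * FROM t WHERE A LIKE 'Zhao%' AND B = 'Shanghai';
-- With ICP the Extra column shows "Using index condition".
-- For comparison, it can be disabled per session:
SET optimizer_switch = 'index_condition_pushdown=off';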

3. Flushing neighbor pages

Before talking about neighbor-page flushing, let's talk about dirty pages. We know that when updating a row, we first check whether its page is in memory; if not, the page is read into memory first and then updated there. At that point the page in memory holds the latest data while the page on disk still holds the old data: the in-memory page is a dirty page, and it must be flushed to disk at some point to make the two consistent.

So the questions are: when to flush, and how many dirty pages per flush? Flushing on every change would perform terribly; flushing too rarely lets dirty pages pile up, leaving fewer usable pages in the buffer pool and hurting normal operation. So flushing must be neither too eager nor too late. MySQL has a cleaner thread that runs periodically, ensuring it is not too eager; and when there are too many dirty pages or the redo log is nearly full, a flush is triggered immediately, ensuring it is timely.

While flushing a dirty page, InnoDB has an optimization: if the neighbor pages of that dirty page are also dirty, they are flushed together. The benefit is fewer random IOs, which on mechanical disks is a sizable win. But there is a potential pitfall: if a neighbor page flushed along becomes dirty again immediately because of new changes, the shared flush was wasted effort. Worse, if the neighbor's neighbor is also dirty... the chain reaction may cause a temporary performance dip.
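
Neighbor flushing is controlled by a real InnoDB variable (the value shown is just one choice):

SET GLOBAL innodb_flush_neighbors = 0;
-- 1 (default before MySQL 8.0): also flush contiguous dirty neighbors, good for spinning disks
-- 0 (default in 8.0): flush only the page itself, usually better for SSDs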

4. MRR

In real business we are often told to use covering indexes as much as possible and avoid table returns, because returning to the table costs more IO and takes longer. But sometimes we have to return to the table, and the worse problem is not just extra IO but discrete (random) IO. Consider:

select * from user where grade between 60 and 70

We want the users with grades between 60 and 70, so the SQL is written as above, and the grade field is of course indexed. Normally, we first find the entry with grade=60 on the grade index, then look up the primary key index with the id attached to that entry, then come back to the grade index for the next entry, repeating the same action...

Suppose grade=60 maps to id=1 whose row is on page_no_1, grade=61 maps to id=10 on page_no_2, and grade=62 maps to id=2, back on page_no_1. The real access pattern is then: read page_no_1, switch to page_no_2, switch back to page_no_1. Yet the lookups for id=1 and id=2 could be merged into a single read of page_no_1, saving IO and avoiding the random access. That is exactly what MRR (Multi-Range Read) does: with MRR, the secondary index does not return to the table immediately; instead the fetched primary key ids are put into a buffer and sorted, and the primary key index is then read in order, greatly reducing discrete IO.
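
MRR can be forced and observed with real MySQL settings (a sketch; the buffer size is arbitrary):

SET optimizer_switch = 'mrr=on,mrr_cost_based=off';  -- force MRR instead of letting cost decide
SET read_rnd_buffer_size = 2 * 1024 * 1024;          -- the buffer that holds ids to be sorted
EXPLAIN SELECT * FROM user WHERE grade BETWEEN 60 AND 70;
-- With MRR in effect the Extra column shows "Using MRR".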
