MySQL’s optimization tool: Pushing down index conditions improves performance by 273% under tens of millions of data

MySQL’s optimization tool: the index conditions are pushed down, and the performance of tens of millions of data is improved by 273%

Foreword

In the previous stage, we talked about the selection of field types in MySQL and lamented the performance differences of different types under tens of millions of data.

Time type: How to choose the time type of MySQL field? Performance improvement of 10%~30% under tens of millions of data

Character type: How to choose the character type of MySQL field? There is a 30% difference in performance between varchar and char under tens of millions of data

In the new stage, let’s talk about the optimization measures of indexes in MySQL. This article mainly talks about the index condition push in MySQL.

Students can read this article with these questions:

  1. How are statements with multiple query conditions executed in MySQL? How do the server layer and the storage engine layer interact?
  2. What is the difference between clustered index and secondary index storage content?
  3. What is a return form? What are the costs of returning the table? How to avoid table return?
  4. What is index condition pushdown?
  5. When can I use index conditional pushdown?
  6. What problems can be solved by pushing down index conditions?
  7. How much performance can be improved by pushing down index conditions under tens of millions of data volumes?

server layer and storage engine layer

The MySQL server can be divided into server layer and storage engine layer. The storage engine layer mainly stores records and can be implemented with different storage engines (innodb, myisam).

The server layer has different components that handle different functions, such as: receiving client requests (connector), checking SQL syntax (parser), determining cache hits (removed in query cache 8.0), optimizing SQL and selecting indexes to generate execution plans ( Optimizer), calling the storage engine to obtain records (executor)

image.png

Interaction between server layer and storage engine layer

Take the student table as an example

CREATE TABLE `student` (
  `id` bigint(20) NOT NULL AUTO_INCREMENT,
  `age` smallint(6) DEFAULT NULL COMMENT 'age',
  `student_name` varchar(20) DEFAULT NULL COMMENT 'name',
  `info` varchar(30) DEFAULT NULL COMMENT 'information',
  PRIMARY KEY (`id`),
  KEY `idx_age_name` (`age`,`student_name`)
) ENGINE=InnoDB AUTO_INCREMENT=1 DEFAULT CHARSET=utf8;

Clustered (primary key) index stores the value of the entire record in order with the primary key id

image.png

Secondary indexes only store specified index columns and primary keys, and are ordered in the order of index columns and primary key values

When the secondary index is a (age, student_name) joint index, age is in order as a whole. When ages are equal, student_name is in order. When student_name is equal, the primary key is in order.

image.png

When a multi-condition query occurs (where has multiple conditions), after the executor obtains the data from the storage engine layer, it still needs to filter other query conditions at the server layer.

For example, select * from student where age = 18 and student_name like 'c%'; (query students whose age is 18 and whose name starts with c in the student table)

There is a joint index of (age, student_name). The optimizer will think that the joint index is optimal, so it generates an execution plan using the joint index of (age, student_name). The executor calls the storage engine layer according to the execution plan.

At the storage engine layer, matching will be performed based on age = 18. When this condition is met, the clustered index will be returned to the table first.

What is a return form?

Secondary indexes only store the required columns and primary keys, while clustered (primary key) indexes store all data

Since the index we use does not store the columns required for the query list, we need to query again in the clustered (primary key) index to obtain the values of other columns.

image.png

During this processthe primary key values may be out of order, so when querying the clustered index back to the table, random IO (high overhead) will occur

The unit of interaction between the server layer and the storage engine layer is the record

image.png

  1. The server layer optimizer generates an execution plan based on the index, and the executor calls the storage engine layer
  2. The storage engine layer searches for records that satisfy age=18 in the joint index.
  3. Each time a record is found, return it to the table and query the clustered index to obtain the values of other columns.
  4. Then return to the server layer for where filtering
  5. 2-4 is actually a loop until the first record that does not meet the conditions is found.

A problem will be found in this process: student_name like 'c%' can be judged in the joint index of the storage engine layer. There is no need to return to the table to query the clustered index and then return to the server layer for judgment.

Index Condition Push

English name of index condition push: Index Condition Push

Push the judgment where condition from the server layer to the storage engine layer, which means that the storage engine layer will also judge other query conditions.

For example, age=18 and student_name like 'c%', you need to determine whether student_name is satisfied before returning to the table.

The first and third records in the figure do not satisfy student_name like 'c%', so they are skipped without returning to the table.

image.png

Pushing ICP down under index conditions prevents judgments from the storage engine layer, which can be judged at the storage engine layer, but still returns to the server layer after querying the table, thus reducing the number of table returns

image.png

Execution steps after joining ICP:

  1. The server layer optimizer generates an execution plan based on the index, and the executor calls the storage engine layer
  2. The storage engine layer searches the index for records that satisfy age=18
  3. After finding a record that satisfies the condition, other query conditions are judged based on the existing columns on the index. If the record is not met, the record will be skipped.
  4. If satisfied, return to the table to query the values of other columns in the clustered index.
  5. After obtaining the value that needs to be queried, return to the server layer for where filtering
  6. Steps 2-5 are executed in a loop until the first record that does not meet the conditions is found.

Test

Enable function creation

#Enable function creation
set global log_bin_trust_function_creators=1;

#ON means it is turned on
show variables like 'log_bin_trust%';

Define a randomly generated string function

#The delimiter is changed from; to $$
delimiter$$
#Function name ran_string requires a parameter int type and return type varchar(255)
create function ran_string(n int) returns varchar(255)
begin
#Declare variable chars_str default'abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ'
declare chars_str varchar(100) default 'abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ';
#Declare variable return_str defaults to ''
declare return_str varchar(255) default '';
#Declaration variable i defaults to 0
declare i int default 0;
#Loop condition i<n
while i < n do
set return_str = concat(return_str,substring(chars_str,floor(1 + rand()*52),1));
set i=i + 1;
end while;
return return_str;
end $$

Define range generation shaping function

# Generate range-generating function for integer
delimiter$$
create function range_nums(min_num int(10),max_num int(10)) returns int(5)
begin
declare i int default 0;
set i = FLOOR(RAND() * (max_num - min_num + 1)) + min_num;
return i;
end $$

Define insert function

#Insert starting from parameter start and insert max_num pieces of data
delimiter$$
create procedure insert_students_tests(in start int(10),in max_num int(10))
begin
declare i int default start;
set autocommit = 0;
repeat
set i = i + 1;
#SQL statement
insert into student(student_name,age,info)
values (ran_string(10),range_nums(0,100),ran_string(20));
until i=max_num
end repeat;
commit;
end $$

implement

#Execute the insert function
delimiter;
call insert_students_tests(0,19000000);

The amount of data I tested is 1900 million

Remember to create an index

alter table student add index idx_age_name(age,student_name);

Index condition pushdown is enabled by default, SQL_NO_CACHE does not use cache (MySQL version 5.7 also has cache)

select SQL_NO_CACHE * from student where age = 18 and student_name like 'c%'
> OK
> Time: 1.339s

So how to judge whether index condition pushdown is used?

We use explain to view the execution plan. When there is Using index condition in the additional information, it means using index condition to push down.

image.png

So how to turn off index condition pushdown?

Here we use session level shutdown

SET optimizer_switch = 'index_condition_pushdown=off';

After closing, check the execution plan and find that there is no more Using index condition in the additional information.

image.png

select SQL_NO_CACHE * from student where age = 18 and student_name like 'c%'
> OK
> Time: 5.039s

(5.039 – 1.339) / 1.339 = 276%, the performance improved by using index condition pushdown is actually 276%

After the previous analysis, index condition pushdown optimizes performance by reducing the number of table returns, so the performance improved here actually saves unnecessary table return overhead

When querying a large amount of data, returning the table not only requires multiple clustered index queries, but may also lead to random IO (increased interaction with the disk)

Although the number of table returns can be reduced through index condition pushdown optimization, there will still be records that meet the conditions that need to be returned to the table.

So is there any way to avoid table returns or reduce the cost of table returns?

If you have already obtained the columns that need to be queried on the secondary index (such as querying age, student_name, id), then there is no need to return the table

So if we still need to query other columns in the clustered index, how to reduce the cost of returning the table?

This issue will be discussed in the next chapter. If you think of any solutions, you can also share them in the comment area~

Summary

MySQL server is divided into server layer and storage engine layer. The storage engine layer can store records through different implementations (innodb, myisam)

The server layer has different components with a clear division of labor: connector (manage request connections), analyzer (process SQL syntax, part-of-speech analysis), optimizer (optimize SQL, generate execution plans based on different indexes), executor (according to execution Plan to call the storage engine to obtain records)

The server layer and the storage engine layer interact in units of records. The server layer executor calls the storage engine layer to obtain records according to the execution plan

The secondary index stores the values of the index column and primary key, and is sorted by the index column and primary key. When there are multiple index columns, the current index column is ordered only when the previous index column is equal; the clustered index stores the entire record. values, ordered by primary key

When using a secondary index and the columns on the secondary index do not meet the query conditions, you need to query the clustered index back to the table to obtain the values of other columns; when querying the clustered index back to the table, the primary key values are out of order and may cause random IO

When index conditions are pushed down, in the case of multiple query conditions, the storage engine layer judges where other query conditions once more, and uses other columns on the secondary index to determine whether the record satisfies other query conditions. If not, there is no need to return to the table. , reduce the number of table returns

When querying a large amount of data, the cost of table return is very high. Only when the columns in the secondary index meet the columns required by the query, the table will not be returned. The random IO generated by the table return must be optimized through other means< /strong>

Finally (don’t do it in vain, just click three times in a row to get it~)

This article is included in the column. From point to line, from line to surface, build the MySQL knowledge system. Interested students can continue to pay attention.

The notes and cases of this article are included in gitee-StudyJava and github-StudyJava. Interested students can continue to pay attention under stat~

If you have any questions, you can discuss them in the comment area. If you think Cai Cai’s writing is good, you can like, follow, and collect it to support it~

Follow Cai Cai and share more useful information, public account: Cai Cai’s back-end private kitchen

This article is published by OpenWrite, a blog posting platform!

syntaxbug.com © 2021 All Rights Reserved.