MySQL advanced articles – sorting, grouping, paging optimization

Navigation:

[Java Notes + Stepping Pit Summary] Java Basics + Advanced + JavaWeb + SSM + SpringBoot + St. Regis Takeaway + SpringCloud + Dark Horse Travel + Guli Mall + Xuecheng Online + MySQL Advanced Chapter + Design Patterns + Niuke Interview Questions

Directory

5. Sorting optimization

5.1 Sorting Optimization Suggestions

5.2 Testing

5.2.1 Case verification

5.3.2 Exercises

5.3 Index field selection in range query

5.4 filesort algorithm

5.4.1 Two-way sorting and one-way sorting

5.4.2 Tuning filesort

6. Group optimization

7. Paging query optimization

7.1 Deep paging query optimization

7.2 Deep paging optimization with sorting

5. Sorting optimization

5.1 Sorting Optimization Suggestions

Question: Add an index on the WHERE condition field, but why add an index on the ORDER BY field?

In MySQL, two sorting methods are supported, namely FileSort and Index sorting.

Index sorting:In index sorting, the index can ensure the order of the data, no need to sort again, more efficient, recommended.
FileSort sorting:FileSort sorting is generally performed in memory, which takes up more CPU. If the result to be sorted is large, temporary file I/O will be generated to disk for sorting, lower efficiency.

Optimization suggestion:

The optimizer automatically selects the sorting method:MySQL supports index sorting and FileSort sorting. The index ensures the order of records and has high performance. It is recommended to use. FileSort sorting is in-memory sorting. When the amount of data is large, temporary files are generated and sorted on the disk. The efficiency is low and it takes up a lot of CPU. It’s not that FileSort is necessarily inefficient, it may be efficient in some cases. For example, for left fuzzy and “not equal to” queries that do not cover the index, the efficiency of full table scanning is higher than that of index traversal and then returning to the table.
To meet the leftmost prefix: where post-condition and order by field create a joint index, the order needs to meet the leftmost prefix. For example, index (a,b,c), query where a=1 order by b,c.
The sorting index on the right side of the range query is invalid: For example, the index (a,b,c), query where a>1 order by b,c, resulting in b,c sorting cannot use the index, and filesort is required.
Either all ascending or all descending: The sort order must be either all DESC or all ASC. Out of order will cause the index to fail.
When the amount of data to be sorted is large, the index will fail: The amount of data to be sorted exceeds about 10,000, so the index will not be used and filesort will be used. It is recommended to use limit and where to filter to reduce the amount of data. When the amount of data is large, you need to go back to the table to check all the data after sorting the index, and the performance is very poor. It is not as efficient as FileSort for sorting in memory. It does not mean that using limit will definitely use index sorting. The key is the amount of data. When the amount of data is too large, the optimizer will use FileSort to sort.
Priority range field plus index: When [range condition] and [group by or order by] field appear two options, if there are enough filtered data but not many data to be sorted, priority Put an index on the range field. In this way, even if the range query causes the sorting index to fail, the efficiency is still higher than when only the sorting field is indexed. If you can only filter a little bit, then put the index on the sort field first.
Tuning FileSort: When Index sorting cannot be used, the FileSort method needs to be tuned. For example, increase sort_buffer_size (sort buffer size) and max_length_for_sort_data (maximum length of sorted data)

5.2 Test

5.2.1 Case verification

Delete the created indexes in the student table and class table.

# way 1
DROP INDEX idx_monitor ON class;
DROP INDEX idx_cid ON student;
DROP INDEX idx_age ON student;
DROP INDEX idx_name ON student;
DROP INDEX idx_age_name_classId ON student;
DROP INDEX idx_age_classId_name ON student;

# Method 2: call delete function
call proc_drop_index('atguigudb2','student');

Whether the index can be used in the following, Can you remove using filesort

Direct filesort without indexing:

# Indexing failed. no limit
EXPLAIN SELECT SQL_NO_CACHE * FROM student ORDER BY age,classid;

After adding the index, the limit of the order by will cause the data volume to be too large, so the index will fail:

CREATE INDEX idx_age_classid_name ON student(age,classId,name);
# Indexing failed. no limit
EXPLAIN SELECT SQL_NO_CACHE * FROM student ORDER BY age,classid;

#The index is successful, key_len is 73
EXPLAIN SELECT SQL_NO_CACHE * FROM student ORDER BY age,classid LIMIT 10000;

Process 3: The order is wrong when order by, and the index is invalid

#Create index age, classid, stuno
#call proc_drop_index('atguigudb2','student');
CREATE INDEX idx_age_classid_stuno ON student (age,classid,stuno);
#Index is invalid, does not meet the leftmost prefix
EXPLAIN SELECT * FROM student ORDER BY classid LIMIT 10;
#Index is invalid, does not meet the leftmost prefix
EXPLAIN SELECT * FROM student ORDER BY classid,name LIMIT 10;
#Index is invalid, does not meet the leftmost prefix
EXPLAIN SELECT * FROM student WHERE classid=1 ORDER BY age, stuno;
#Full index, although it does not meet the leftmost prefix, but because the query volume is small, the optimizer first sorts three fields, and then finds 10 where to return.
# The optimizer thinks that the index is more efficient than filesort, so it uses the index
EXPLAIN SELECT * FROM student WHERE classid=1 ORDER BY age, stuno LIMIT 10;
#Index is successful, conforming to the leftmost prefix
EXPLAIN SELECT * FROM student ORDER BY age, classid, stuno LIMIT 10;
#Index is successful, conforming to the leftmost prefix
EXPLAIN SELECT * FROM student ORDER BY age, classid LIMIT 10;

Process 4: The rules are inconsistent when order by, and the index fails (the order is wrong, no indexing; the direction is reversed, no indexing)

Must match the leftmost prefix and “all ascending or all descending”

#Create index age, classid, stuno
CREATE INDEX idx_age_classid_stuno ON student (age,classid,stuno);
# Does not meet the "all ascending or descending order", the index is invalid
EXPLAIN SELECT * FROM student ORDER BY age DESC,classid ASC LIMIT 10;
# Does not meet the leftmost prefix, the index is invalid
EXPLAIN SELECT * FROM student ORDER BY classid DESC,name DESC LIMIT 10;
# Does not meet the "all ascending or descending order", the index is invalid
EXPLAIN SELECT * FROM student ORDER BY age ASC,classid DESC LIMIT 10;
#Conform to the leftmost prefix, conform to "all ascending or all descending order", the index is successful
EXPLAIN SELECT * FROM student ORDER BY age DESC,classid DESC LIMIT 10;

Process 5: The amount of limit data is small, and it is possible to go to the index if the leftmost prefix is not satisfied. First sort and then where filter.

CREATE INDEX idx_age_classid_stuno ON student (age,classid,stuno);
CREATE INDEX idx_age_classid_name ON student(age,classId,name);

# All gone index.
EXPLAIN SELECT * FROM student WHERE age=45 ORDER BY classid LIMIT 10;
# All gone index.
EXPLAIN SELECT * FROM student WHERE age=45 ORDER BY classid,name;
#No index is used, it does not meet the leftmost prefix
EXPLAIN SELECT * FROM student WHERE classid=45 order by age;
# All gone index. Because the amount of limit data is small, the optimizer directly uses the sort field index to sort first, and then filters 10 where
EXPLAIN SELECT * FROM student WHERE classid=45 order by age limit 10;

The range search causes the index to fail: there are indexes (userDbid, addressDbid, createTime) below, userDbid, addressDbid go to the index, because addressDbid is a range search, causing the createTime index to fail.

5.3.2 Exercise

INDEX a_b_c(a,b,c)
order by can use the leftmost prefix of the index
- ORDER BY a
-ORDER BY a,b
-ORDER BY a,b,c
- ORDER BY a DESC, b DESC, c DESC
If WHERE is defined as a constant using the leftmost prefix of the index, then order by can use the index
- WHERE a = const ORDER BY b,c
- WHERE a = const AND b = const ORDER BY c
- WHERE a = const ORDER BY b,c
- WHERE a = const AND b > const ORDER BY b,c
cannot use index for sorting
- ORDER BY a ASC,b DESC,c DESC /* sorting is inconsistent */
- WHERE g = const ORDER BY b,c /*losing a index*/
- WHERE a = const ORDER BY c /*losing b index*/
- WHERE a = const ORDER BY a,d /*d is not part of the index*/
- WHERE a in (...) ORDER BY b,c /*For sorting, multiple equality conditions are also range queries*/

5.3 Index field selection for range query

mysql automatically selects the optimal solution: If two indexes exist at the same time, mysql automatically selects the optimal solution. (For this example, mysql chooses idx_age_stuno_name). However, as the amount of data changes, the selected index will also change accordingly.
When the filter ratio is high, add index to the filter field first: When the [range condition] and [group by or order by] fields appear to choose one or the other, the filter quantity of the condition field is given priority, if the filtered data is sufficient When there is a lot of data and there is not much data to be sorted, the index is first placed on the range field. vice versa.

For the ORDER BY clause, try to use the Index method for sorting, and avoid using the FileSort method for sorting.

Before executing the case, clear the index on the student, leaving only the primary key:
DROP INDEX idx_age ON student;
DROP INDEX idx_age_classid_stuno ON student;
DROP INDEX idx_age_classid_name ON student;

#or
call proc_drop_index('atguigudb2','student');

Scenario: Query students whose age is 30 and whose student number is less than 101000, sort by user name

EXPLAIN SELECT SQL_NO_CACHE * FROM student WHERE age = 30 AND stuno <101000 ORDER BY NAME ;

The query results are as follows:

mysql> SELECT SQL_NO_CACHE * FROM student WHERE age = 30 AND stuno <101000 ORDER BY NAME;
 + --------- + -------- + -------- + ------ + --------- +
| id | stuno | name | age | classId |
 + --------- + -------- + -------- + ------ + --------- +
| 922 | 100923 | elTLXD | 30 | 249 |
| 3723263 | 100412 | hKcjLb | 30 | 59 |
| 3724152 | 100827 | iHLJmh | 30 | 387 |
| 3724030 | 100776 | LgxWoD | 30 | 253 |
| 30 | 100031 | LZMOIa | 30 | 97 |
| 3722887 | 100237 | QzbJdx | 30 | 440 |
| 609 | 100610 | vbRimN | 30 | 481 |
| 139 | 100140 | ZqFbuR | 30 | 351 |
 + --------- + -------- + -------- + ------ + --------- +
8 rows in set, 1 warning (3.16 sec)

Conclusion: type is ALL, which is the worst case. Using filesort also appeared in Extra, which is also the worst case. Optimization is a must.

Scheme 1: In order to remove filesort, we create an index, and the query efficiency is a little higher

#Create a new index
CREATE INDEX idx_age_name ON student(age,NAME);
EXPLAIN SELECT SQL_NO_CACHE * FROM student WHERE age = 30 AND stuno <101000 ORDER BY NAME;

Only age has gone index:

So we optimized out using filesort

The query results are as follows:

Scheme 2: Try to use the upper index for the filter conditions and sorting of where, and find that the query efficiency is higher

Build a combined index of three fields and find that using filesort still exists:

DROP INDEX idx_age_name ON student;
CREATE INDEX idx_age_stuno_name ON student (age,stuno,NAME);
EXPLAIN SELECT SQL_NO_CACHE * FROM student WHERE age = 30 AND stuno <101000 ORDER BY NAME;

age and stuno left the index:

We found that using filesort still exists, so the name does not use the index, and the type is still range, just looking at the name is actually not good. The reason is that because stuno is a range filter, the fields behind the index will not use the index anymore.

The result is as follows:

mysql> SELECT SQL_NO_CACHE * FROM student
-> WHERE age = 30 AND stuno <101000 ORDER BY NAME ;
 + ----- + -------- + -------- + ------ + --------- +
| id | stuno | name | age | classId |
 + ----- + -------- + -------- + ------ + --------- +
| 167 | 100168 | AClxEF | 30 | 319 |
| 323 | 100324 | bwbTpQ | 30 | 654 |
| 651 | 100652 | DRwIac | 30 | 997 |
| 517 | 100518 | HNSYqJ | 30 | 256 |
| 344 | 100345 | JuepiX | 30 | 329 |
| 905 | 100906 | JuWALd | 30 | 892 |
| 574 | 100575 | kbyqjX | 30 | 260 |
| 703 | 100704 | KJbprS | 30 | 594 |
| 723 | 100724 | OTdJkY | 30 | 236 |
| 656 | 100657 | Pfgqmj | 30 | 600 |
| 982 | 100983 | qywLqw | 30 | 837 |
| 468 | 100469 | sLEKQW | 30 | 346 |
| 988 | 100989 | UBYqJl | 30 | 457 |
| 173 | 100174 | UltkTN | 30 | 830 |
| 332 | 100333 | YjWiZw | 30 | 824 |
 + ----- + -------- + -------- + ------ + --------- +
15 rows in set, 1 warning (0.00 sec)

The result turned out to be that filesort ran faster than the index, and much faster, and the results appeared almost instantly.

Reason:

All sorting is performed after conditional filtering. Therefore, if the conditions filter out most of the data, the remaining hundreds or thousands of data to be sorted is not very performance-intensive. Even if the index optimizes the sorting, the actual performance improvement is very limited. Relative to the stuno<101000 condition, if no index is used, tens of thousands of pieces of data must be scanned, which consumes a lot of performance. Therefore, indexing on this field is the most cost-effective and the best choice

Conclusion:

Two indexes exist at the same time, mysql automatically chooses the optimal solution. (For this example, mysql chooses idx_age_stuno_name). However, As the amount of data changes, the selected index will also change accordingly.
When the fields of [range condition] and [group by or order by] appear, the number of filters in the condition field will be observed first. If there are enough filtered data and there are not many data to be sorted, the index will be placed in the range first. field. vice versa.

Thinking: Here we use the following index, is it feasible?
DROP INDEX idx_age_stuno_name ON student;

CREATE INDEX idx_age_stuno ON student(age,stuno);
sure.

5.4 filesort algorithm

5.4.1 Dual-way sorting and single-way sorting

If the sorted field is not on the index column, filesort will have two algorithms: two-way sorting and one-way sorting

Two-way sort (slow)

Before MySQL 4.1, two-way sorting is used, which literally means scanning the disk twice to finally get the data, read the row pointer and order by column, sort them, and then scan the sorted list. Re-read the corresponding data output from the list according to the value in the list
Get the sorting fields from the disk, sort in the buffer, and then get other fields from the disk.

To fetch a batch of data, it is necessary to scan the disk twice. As we all know, IO is very time-consuming, so after mysql4.1, a second improved algorithm appeared, which is one-way sorting.

Single-way sorting (fast)

Read all the columns required by the query from the disk, sort them in the buffer according to the order by column, and then scan the sorted list for output. It is more efficient and avoids the second reading of data. And turn random IO into sequential IO, but it will use more space, because it keeps each row in memory.

Conclusion and raised questions

Generally speaking, it is better than two-way
But there is a problem with single channel
- In sort_buffer, single-way takes up a lot more space than multi-way, because single-way takes out all the fields, so it is possible that the total size of the data taken out exceeds the capacity of sort_buffer, resulting in each Only the data with the capacity of sort_buffer can be taken and sorted (creating tmp files, multi-way merged), after sorting, the capacity of sort_buffer can be taken, and then sorted… thus multiple I/O.
- The single channel originally wanted to save an I/O operation, but it resulted in a large number of I/O operations, and the gain outweighed the loss.

5.4.2 Tuningfilesort

1. Try increasing sort_buffer_size

2. Try increasing max_length_for_sort_data

SHOW VARIABLES LIKE '%max_length_for_sort_data%';
#default 1924 bytes

Increasing this parameter will increase the probability of using the improved algorithm. But if it is set too high, the probability of the total data capacity exceeding the sort buffer size will increase. The obvious symptoms are high disk IO activity and low processor usage. If the total length of the columns to be returned is greater than max_length_for_sort data, use a two-way algorithm, otherwise use a single-way algorithm. Adjust between 1024-8192 bytes

3. When ordering by, select * is a taboo. It is best to only query the required fields.

When the sum of the field sizes of Query is less than max_ength_for_sort_data, and the sorting field is not of TEXTBLOB type, the improved algorithm – single-way sorting will be used, otherwise the old algorithm – multi-way sorting will be used.
The data of both algorithms may exceed the capacity of sort_bufer_size. After exceeding, a tmp file will be created for merge sorting, resulting in multiple I/0s. However, the risk of using a single-way sorting algorithm will be greater, so the sort_bufer_size should be increased.

6. Group optimization

Similar to sorting optimization: The principle of using indexes in group by is almost the same as that of order by. Group by can use indexes directly even if there is no filter condition to use indexes.
Leftmost prefix:group by sort first and then group, follow the best left prefix rule built by index
Tuning FileSort: When index columns cannot be used, increase the settings of max_length_for_sort_data and sort_buffer_size parameters
Where is more efficient than having, so don’t write in having the conditions that can be limited in where. Where is filtering before grouping, and having is filtering after grouping.
Try not to sort groups, save cpu: Reduce the use of order by, and communicate with business without sorting, or put the sorting on the terminal. Order by, group by, and distinct statements consume more CPU, and the CPU resources of the database are extremely precious.
Use limit: Contains query statements such as order by, group by, and distinct. Please keep the result set filtered by the where condition within 1000 rows, otherwise the SQL will be very slow.

7. Paging query optimization

7.1 Deep paging query optimization

In general pagination query, the performance can be better improved by creating covering index.

Current problem: When the offset is very large, it is necessary to query a large amount of useless data and then page, and the performance is poor.

A common and very troublesome problem is limit 2000000,10 At this time, MySQL needs to sort the first 200000010 records, only return 2000000~2000010 records, other records are discarded, and the cost of query sorting is very high. And select * needs to return to the table, which is more time-consuming.

EXPLAIN SELECT * FROM student LIMIT 2000000,10;

Table with primary key auto-increment: Directly check the 10 data after the range. A Limit query can be converted into a location query.

EXPLAIN SELECT * FROM student WHERE id > 2000000 LIMIT 10;

Tables with non-incrementing primary keys: The current table inner join sorts and intercepts the primary key table, and the connection field is the primary key.

EXPLAIN SELECT * FROM student t,(SELECT id FROM student ORDER BY id LIMIT 2000000,10) a WHERE t.id = a.id;

You can also use subqueries, which are optimized into associated queries.

7.2 Deep paging optimization with sorting

Before optimization: Query deep pagination arranged in reverse order according to age

EXPLAIN SELECT * FROM student order by age desc LIMIT 2000000,10;

Optimization plan 1: The optimization idea is the same as before, the inner connection field is id

EXPLAIN SELECT * FROM student t1,(SELECT id FROM student ORDER BY age desc LIMIT 2000000,10) t2 WHERE t1.id=t2.id

Optimization plan 2: If you turn pages sequentially, you can get the last record x of the previous page, then all record ids of the target page number are smaller than x.id (because of the reverse order , and the sorting basis is actually age, id), and the age of all records of the target page number is less than or equal to x.age.

EXPLAIN SELECT * FROM student WHERE id<#{x.id} AND age>=#{x.age} ORDER BY age DESC LIMIT 10;