MySQL index optimization practice & EXPLAIN analysis

Let’s introduce the specific business scenario first

When the user logs in, he needs to view the courses that he can learn. Different users see different courses. The courses have permissions. The permissions are controlled by the following lesson_user_permissions table, where sys_user_id and lesson_id are used as the joint primary key

There is also a lesson table

Our business requirement is to sort the courses in descending order according to the most recent learning time.

Initial logic and general logic

Take out the user id from the token, then go to the lesson_user_permissions table to find all the lesson_ids, then go to the lessons table to find the data according to the lesson_id set, put the data of the two tables together, and then use the structure object according to the latest Learning time to sort in descending order.

The idea of using index and linked table query for optimization

We create a joint index for lesson_user_permissions. The three fields of the joint index are sys_user_id in ascending order, updated_at in descending order, and lesson_id in descending order. At this time, when checking the lesson_user_permissions table, the lesson_id we found out is already sorted according to updated_at. Then we take the sorted lesson_id and directly search for the course according to the primary key.

I checked navicat and found that it cannot create a descending index through the visualized pages. Navicat must also have a terminal command line

Then execute the command to index the table in the terminal

The SQL statement is as follows:

Explain
select l2.name,l1.updated_at from lesson_user_permissions as l1 INNER JOIN lessons as l2
where l1.lesson_id = l2.id and l1.sys_user_id = 1 ORDER BY l1.updated_at DESC

The Explain result is as follows

We can just use this result to compare the theoretical knowledge we have learned

Analysis of Explain

One, id

The id of both fields is 1

It is the same as the above theory, so we won’t explain too much

Second, select_type

Our big query actually contains a lot of small queries, and this field indicates the query type of each small query. This field has several values. The above result is SIMPLE. In addition to SIMPLE, there are also PRIMARY, UNION, etc. Since it is not used, it will not be expanded. Explain the meaning of SIMPLE. Queries that do not contain UNION or subqueries in the query statement are counted as SIMPLE types. Both of our records above meet this requirement, so they are SIMPLE.

Three, table

is the name of the query table

Fourth, partitions can be ignored

5. type is the access method for querying this table (single table access method)

The complete access methods are as follows: system, const, eq_ref, ref, fulltext , ref_or_null, index_merge, unique_subquery, index_subquery, range, index , ALL.

The query method of our first table is ref, let’s analyze it in detail, and then look at the original SQL statement

select l2.name,l1.updated_at from lesson_user_permissions as l1 INNER JOIN lessons as l2
where l1.lesson_id = l2.id and l1.sys_user_id = 1 ORDER BY l1.updated_at DESC

For the access process of lesson_user_permissions, l1.lesson_id = l2.id and l1.sys_user_id = 1 is the filter condition, we ignore the previous filter condition, and only look at the latter filter condition l1.sys_user_id = 1

The definition of the ref access method is: an ordinary secondary index column is compared with a constant for equality.

First of all, the leftmost column of the index idx_userId_updatedAt_lessonId_desc we created above is userId. If it is not the leftmost column, it should not work. As for why it is the leftmost column, I will not introduce it in detail here, as long as everyone knows the conclusion. At the same time, we are also satisfied with the equivalent query, so our access method here is ref.

Next, analyze the access method PRIMARY of the second table, which is easy to understand here. We query in the lessons table through the lesson_id of the lesson_user_permissions table, l1.lesson_id = l2.id At this time, the filter condition comes into play. The final lessons are queried based on the primary key.

The above analysis may be a bit vague. Let’s analyze in detail the process of linking table query through theory, so that we can understand the query process of the second table more clearly.

Then the general execution process of this join query is as follows (the principle of join table query):

  1. Select the driver table, select lesson_user_permissions as the driver table, find the filter condition sys_user_id = 1 for this table, select the access method ref to access, and then get the result set.
  2. For each record in the result set taken from the driver table, search for records in the lessons table respectively. At this time, for the lessons table, the filter condition l1.lesson_id = l2.id is changed. For example, l2 at this time .id = 5, this 5 is one of the result set found in the first step, and then the lessons table is queried through the primary key at this time, and the equivalent query through the primary key is the fastest. According to the normal situation, the type field here It should be const, but it shows an eq_ref. Let’s take a look at the specific meaning of eq_ref: when connecting to a query, the driven table is accessed through the primary key or the unique secondary index column for equivalent matching. Perfectly fit our situation.

The following is possible_keys, this field is the index that may be used, and the following key is the specific index used

It can be seen that for lesson_user_permissions, the index possible_keys that may be used is PRIMARY, idx_userId_updatedAt_lessonId_desc, and the index key actually used is idx_userId_updatedAt_lessonId_desc. There is theoretical support for why the index used by lesson_user_permissions is idx_userId_updatedAt_lessonId_desc instead of a primary key in the joint primary key as an index. At the same time, this is what we deliberately designed to achieve index optimization. The specific principles are as follows:

Let’s first check the situation of all the indexes in a table (the index situation of the analysis table)

Through Non_unique, we can know that there are two indexes in this table, Key_name is the name of the index, Non_unique, Seq_in_index two fields are known together, both indexes are joint indexes, Seq_in_index is the field sorting in the joint index, and Column_name is The specific name of the sort field, and Collation is whether the index field is sorted in ascending or descending order.

Cardinality indicates the estimated value of the number of unique values in the index column. We see that the value of sys_user_id in the first row is 377, that is to say, in the entire table, 337 IDs of different students are stored. This value is normal, because The group is about that many people. Then the second line of code is lesson_id, indicating that there are 1575 different courses in this table.

Sub_part For columns that store strings or byte strings, sometimes we only want to index the first n characters or bytes, so this attribute represents the n value. If the complete column is indexed, the value of this property is Null.

Packed, ignored

Null represents whether the index column is stored as a Null value

Index_type uses the type of index, our most common is BTREE, B + tree index

Comment Index column comment information

Index_comment index comment information

So why choose the composite index we built instead of the primary key index? (Index Eligibility & Index Cost Analysis)

Let’s look at the SQL statement again

select l2.name,l1.updated_at from lesson_user_permissions as l1
INNER JOIN lessons as l2
where l1.lesson_id = l2.id
and l1.sys_user_id = 1
ORDER BY l1.updated_at DESC

According to the analysis of the join table query connection principle, we can know that lesson_user_permissions is used as the driving table. When we use the single table method, the filter condition is l1.sys_user_id = 1 ORDER BY l1.updated_at DESC. We have two indexes PRIMARY, idx_userId_updatedAt_lessonId_desc, The key to choosing idx_userId_updatedAt_lessonId_desc to build a self-index is that we use updated_at for descending order. Because idx_userId_updatedAt_lessonId_desc is first sorted according to userid in ascending order, and then according to updated_at in descending order, if we use idx_userId_updatedAt_lessonId_desc, it will be sorted directly. If we use PRIMARY, we have to sort again.

In order to verify this, we rewrite a SQL and change the descending order to ascending order (the default is ascending order if the descending order is not written)

select l2.name,l1.updated_at from lesson_user_permissions as l1
INNER JOIN lessons as l2
where l1.lesson_id = l2.id
and l1.sys_user_id = 1
ORDER BY l1.updated_at

At this time, our self-built index idx_userId_updatedAt_lessonId_desc is no longer valid.

Let’s look at other fields later

Six, key_len

This represents the maximum length of the index record when the optimizer thinks that an index is used to execute a query, and it consists of the following three parts:

  1. For a fixed-length index column, the maximum length of the storage space it actually occupies is the fixed value. For a variable-length index column with a specified character set, for example, the type of the index column is VARCHAR(100), If the character set used is utf8, then the maximum storage space actually occupied by this column is 100 * 3 = 300 bytes.
  2. If the index column can store NULL values, key_len is 1 byte more than when it cannot store NUll values
  3. For variable-length fields, there will be 2 bytes of space to store the actual length of the variable-length field

This explains why the two records we found are both 8.

In the lesson_user_permissions table, we use the idx_userId_updatedAt_lessonId_desc index. According to my expectation, the first field must be used, and the second field must also be used. At this time, key_len must be greater than 8, but this The time structure is indeed 8, and I haven’t analyzed the reason for it yet.

In the lessons table, we use the primary key index, the type of the primary key is bigint type, the primary key cannot be empty, and the primary key is a fixed-length field, so it is 8.

Seven, ref

The meaning of the representative is: when using the index column equivalent matching condition to execute the query, that is, when the access method is const, eq_ref, ref_or_null, unique_subquery, index_subquery , the ref column shows the equivalent matching with the index column, for example, a constant, or a column

In the lesson_user_permissions driver table, ref is const, indicating that when using idx_userId_updatedAt_lessonId_desc, the equivalent match with the user_id column is a constant 1

In the lessons driven table, the access method is eq_ref, and the value of the corresponding ref column is exam.l1.lesson_id, which means that the PRIMARY index will be used when accessing the driven table, that is, the clustered index will be associated with a The conditions for equivalence matching in the column, the object for equivalence matching in the lessons table is exam.l1.lesson_id, and l1 is the lesson_user_permissions table.

Eight, rows

If the query optimizer decides to use a full table scan to execute a query on a table, the rows column of the execution plan represents the expected number of rows to be scanned. If an index is used to execute the query, the rows column of the plan execution represents the expected number of rows Number of index record rows scanned.

Nine, filtered

This field is relatively unfamiliar. Let’s introduce it in detail. We said before that the query in MySQL adopts the circular nested join algorithm. The driving table is accessed once, and the driven table may be accessed multiple times. Therefore, for two-table join query , his query cost consists of two parts: one is the cost of a single query of the driving table, and the other is the cost of multiple queries of the driven table (the specific number of queries depends on how many records are in the result set of the query on the driving table)

We call the number of entries obtained by querying the driver table fan-out. The smaller the fan-out value, the fewer queries are made to the driven table, and the lower the overall cost of join queries. When the query optimizer wants to calculate the cost of the entire joint query query, it needs to calculate the fan-out value of the driving table.

Sometimes the calculation of the fan-out value is easy, such as a full table scan of the driver table, or a query through a secondary index.

It is not easy to calculate. When querying the driving table, there are no indexed fields, only non-indexed fields, or there are indexed fields, but there are also non-indexed fields, or there are multiple indexed fields. At this time, we can only determine If one index field is applied, the other index fields will no longer be index fields at this time.

If it is not for the filtering of index fields, and the query optimizer will not actually execute it, then it can only be guessed.

If you use a single-table query executed by full table scan, you need to guess how many records meet the search conditions when calculating the fan-out value of the driving table.

If you are using a single-table scan performed by an index, you need to guess how many records satisfy other search conditions except for the search conditions using the corresponding index column when calculating the fan-out of the driving table.

The process of guessing is condition filtering.

Our value here is 100 because we have no query conditions other than indexed columns.

Ten, Extra

Explain some additional situations. Through this additional information, we can more accurately understand how MySQL will execute the given query statement. There are many situations here, and the result of this field is Using index. The meaning of this field is: when my query list and search conditions only contain columns that belong to the index, that is, when the index can be used to cover, Extra will prompt this additional information. This shows that we indeed sorted by idx_userId_updatedAt_lessonId_desc, and did not perform the return table operation.

Explain is gone here, but we have not seen the cost of executing the index to query. If you want to check the execution cost, you can use the following method

?
Explain FORMAT = JSON

select l2.name,l1.updated_at from lesson_user_permissions as l1
INNER JOIN lessons as l2
where l1.lesson_id = l2.id
and l1.sys_user_id = 1
ORDER BY l1.updated_at Desc

?
{
  "query_block": {
    "select_id": 1, #The entire query has only one SELECT keyword, and the id number corresponding to this keyword is 1
    "cost_info": {
      "query_cost": "3.85" #Estimated cost of executing the entire query
    },
    "ordering_operation": {
      "using_filesort": false, #No file sorting is used
      "nested_loop": [ #Nested loop join algorithm is used between several tables to execute
        {
          "table": {
            "table_name": "l1", #l1 is the driver table
            "access_type": "ref", #access method
            "possible_keys": [ #possible index
              "PRIMARY",
              "idx_userId_updatedAt_lessonId_desc"
            ],
            "key": "idx_userId_updatedAt_lessonId_desc", #The index used
            "used_key_parts": [ #The fields of the index used
              "sys_user_id"
            ],
            "key_length": "8", #index length
            "ref": [ #When using the index column equivalence matching conditions to execute the query, that is, when the access method is const, eq_ref, ref_or_null, unique_subquery, index_subquery, the ref column shows the same as the index column value matches something
              "const" #Represents a constant
            ],
            "rows_examined_per_scan": 8, #To query the driver table once, you need to scan about 8 pieces of data
            "rows_produced_per_join": 8, #The fan-out of the driver table is 8
            "filtered": "100.00", #Satisfies the data other than the filter condition of the index column / the data filtered by the index condition
            "using_index": true, #Whether there is index coverage
            "cost_info": {
              "read_cost": "0.25", #The following highlights
              "eval_cost": "0.80",#The following highlights
              "prefix_cost": "1.05",#The following highlights
              "data_read_per_join": "320"#The following highlights
            },
            "used_columns": [ #The columns used in this table, that is, the columns included in the results returned to the user
              "sys_user_id",
              "lesson_id",
              "updated_at"
            ]
          }
        },
        {
          "table": {
            "table_name": "l2",
            "access_type": "eq_ref",
            "possible_keys": [
              "PRIMARY"
            ],
            "key": "PRIMARY",
            "used_key_parts": [
              "id"
            ],
            "key_length": "8",
            "ref": [
              "exam.l1.lesson_id"
            ],
            "rows_examined_per_scan": 1,
            "rows_produced_per_join": 8,
            "filtered": "100.00",
            "cost_info": {
              "read_cost": "2.00",
              "eval_cost": "0.80",
              "prefix_cost": "3.85",
              "data_read_per_join": "54K"
            },
            "used_columns": [
              "id",
              "name"
            ]
          }
        }
      ]
    }
  }
}

read_cost consists of two parts: IO cost + cpu cost for detecting rows * (1 – filter) records

eval_cost is the cost of detecting rows * filter records

prefix_cost The cost of querying the s1 table alone read_cost + eval_cost

data_read_per_join indicates the amount of data that needs to be read in this query

Summary

Knowledge is not only to be learned, but also to be applied. Use theory to guide practice, and practice to further test the theory.

Legacy issues

When we use Explain, we can view the indexes that may be used and the indexes that are actually used, but we cannot see how MySQL specifically selects the index that is actually used, which involves MySQL Deciding which index to use for cost calculations. These contents are going to be written in the next blog.