SELECT COUNT(*) will cause a full table scan? Go back and wait for the notification

Foreword

Will SELECT COUNT(*) lead to slow query caused by full table scan?

SELECT COUNT(*) FROM SomeTable

There is a saying on the Internet that for COUNT(*) without where_clause, MySQL is optimized, and the optimizer will choose the auxiliary index query count with the lowest cost, but in fact it has the highest performance , is this statement correct?

In response to this question, I first went to production to find a table with tens of millions of levels and used EXPLAIN to query the execution plan

EXPLAIN SELECT COUNT(*) FROM SomeTable

The result is as follows

As shown in the figure: I found that this statement does not use the primary key index in this example, but the auxiliary index. In fact, I tested it in this example, whether it is COUNT(1) , or COUNT(*), MySQL will use the auxiliary index query method with minimum cost to count, that is, use COUNT(*) because MySQL The optimization has guaranteed its query performance is the best! Incidentally, COUNT(*) is a standard syntax for counting rows defined by SQL92, and it is highly efficient, so please use COUNT(*) directly to query the number of rows in a table !

So this statement is indeed true. But there is a premise that this kind of optimization is only available in versions after MySQL 5.6.

So how to define the minimum cost? Sometimes multiple conditions are specified in WHERE. Why does MySQL finally choose another index or even no index when executing it?

This article will give you the answer, this article will analyze from the following two aspects

How to Calculate the Execution Cost of SQL Selected Index
Examples

How to calculate the execution cost of SQL selected index

As mentioned above, in the case of multiple indexes, before querying data, MySQL will choose to use the corresponding index based on the principle of minimum cost. The cost here mainly includes two aspects.

IO cost: the cost of loading data from disk to memory. By default, the IO cost of reading data pages is 1. MySQL reads data in the form of pages, that is, when a certain data is used, it does not Only this data will be read, and the data adjacent to this data will also be read into the memory together. This is the well-known principle of program locality, so MySQL will read a whole page each time, and the cost of one page is 1. So the cost of IO is mainly related to the page size
CPU cost: After the data is read into the memory, the cost of CPU operations such as checking whether the data meets the conditions and sorting is also required. Obviously it is related to the number of rows. By default, the cost of checking records is 0.2.

Example Description

In order to calculate the final cost of using the index based on the above two costs, we first prepare a table (the following operations are based on MySQL 5.7.18)

CREATE TABLE `person` (
  `id` bigint(20) NOT NULL AUTO_INCREMENT,
  `name` varchar(255) NOT NULL,
  `score` int(11) NOT NULL,
  `create_time` timestamp NOT NULL DEFAULT CURRENT_TIMESTAMP ON UPDATE CURRENT_TIMESTAMP,
  PRIMARY KEY (`id`),
  KEY `name_score` (`name`(191), `score`),
  KEY `create_time` (`create_time`)
) ENGINE=InnoDB DEFAULT CHARSET=utf8mb4;

In addition to the primary key index, this table has two other indexes, name_score and create_time. Then we insert 10 w rows of data in this table, just write a stored procedure call, as follows:

CREATE PROCEDURE insert_person()
begin
    declare c_id integer default 1;
    while c_id<=100000 do
    insert into person values(c_id, concat('name',c_id), c_id + 100, date_sub(NOW(), interval c_id second));
    set c_id=c_id + 1;
    end while;
end

After inserting, we now use EXPLAIN to calculate which index is used to count the total number of rows

EXPLAIN SELECT COUNT(*) FROM person

From the results, it selects the create_time auxiliary index. Obviously, MySQL believes that using this index to query has the lowest cost, which is also in line with our expectations. Using the auxiliary index to query is indeed the highest performance!

Let’s look at which index the following SQL will use

SELECT * FROM person WHERE NAME >'name84059' AND create_time>'2020-05-23 14:39:18'

With a full table scan! Theoretically, you should use name_score or create_time index. From the WHERE query conditions, you can indeed hit the index, so whether to use SELECT * The cost of returning to the table is too high, let’s try it in the form of a covering index

SELECT create_time FROM person WHERE NAME >'name84059' AND create_time > '2020-05-23 14:39:18'

As a result, MySQL still chose a full table scan! This is more interesting. In theory, using a covering index for search performance is definitely better than full table scanning. Why does MySQL choose full table scanning? Since it thinks that full table scanning is more performant than using a covering index Even better, let’s compare the query time with these two executions

-- full table scan execution time: 4.0 ms
SELECT create_time FROM person WHERE NAME >'name84059' AND create_time>'2020-05-23 14:39:18'
  
-- Execution time with covering index: 2.0 ms
SELECT create_time FROM person force index(create_time) WHERE NAME >'name84059' AND create_time>'2020-05-23 14:39:18'

From the actual execution effect, the execution time of using the covering index query is twice as fast as that of using the full table scan! Explain that the cost estimation made by MySQL before the query is inaccurate! Let’s first take a look at the cost of MySQL’s full table scan.

We said earlier that the cost is mainly related to IO cost and CPU cost. For a full table scan, it is the number of pages occupied by the clustered index and the number of records in the table. Execute the following command

SHOW TABLE STATUS LIKE 'person'

It can be found

The number of rows is 100264. Didn’t we insert the data of 10w rows? How could the calculated data be more? In fact, the calculation here is an estimate, and it is also possible that the number of rows here is more than 10w Less, if you are interested in the estimation method, you can search it online. This is not the focus of this article, so I won’t expand it. Knowing the number of rows, then we know that the CPU cost is 100264 * 0.2 = 20052.8.
The data length is 5783552, the size of each InnoDB page is 16 KB, and the number of pages can be calculated to be 353.

That is to say, the cost of full table scan is 20052.8 + 353 = 20406.

Is this result correct? We can use a tool to verify it. In MySQL 5.6 and later versions, we can use the optimizer trace function to view the entire process of the optimizer generating the plan, which lists the execution plan cost of selecting each index and the final selection result. We can rely on this information to further optimize our SQL.

The optimizer_trace function is used as follows

SET optimizer_trace="enabled=on";
SELECT create_time FROM person WHERE NAME >'name84059' AND create_time > '2020-05-23 14:39:18';
SELECT * FROM information_schema. OPTIMIZER_TRACE;
SET optimizer_trace="enabled=off";

After execution, we mainly observe the cost of using name_score, create_time index and full table scan.

Let’s first look at the estimated execution cost of using the name_score index:

{
    "index": "name_score",
    "ranges": [
      "name84059 <= name"
    ],
    "index_dives_for_eq_ranges": true,
    "rows": 25372,
    "cost": 30447
}

It can be seen that the execution cost is 30447, which is higher than the full table scan cost we calculated before: 20406. So this index is not selected for execution

Note: 30447 here is the sum of the IO cost and CPU cost of querying the secondary index, plus the sum of the IO cost and CPU cost of querying the clustered index back to the table.

Let’s look at the estimated execution cost of index execution using create_time:

{
    "index": "create_time",
    "ranges": [
      "0x5ec8c516 < create_time"
    ],
    "index_dives_for_eq_ranges": true,
    "rows": 50132,
    "cost": 60159,
    "cause": "cost"
}

It can be seen that the cost is 60159, which is much greater than the full table scan cost of 20406, so naturally this index was not selected.

Let’s look at the calculated full table scan cost:

{
    "considered_execution_plans": [
      {
        "plan_prefix": [
        ],
        "table": "`person`",
        "best_access_path": {
          "considered_access_paths": [
            {
              "rows_to_scan": 100264,
              "access_type": "scan",
              "resulting_rows": 100264,
              "cost": 20406,
              "chosen": true
            }
          ]
        },
        "condition_filtering_pct": 100,
        "rows_for_plan": 100264,
        "cost_for_plan": 20406,
        "chosen": true
      }
    ]
}

Notice the cost: 20406, which is exactly the same as what we calculated earlier! This value is the smallest among the execution costs calculated by the above three, so in the end MySQL chose to use the full table scan method to execute this SQL.

In fact, optimizer trace lists the cost statistics of the covering index and return table in detail. If you are interested, you can study it.

From the above analysis, it can be seen that the execution plan selected by MySQL may not be the best. There are many reasons. For example, the row count statistics mentioned above are not accurate. It can be considered that the shortest execution time is optimal, but the small cost that MySQL considers does not necessarily mean the shortest execution time.

Summary

This article uses an example to deeply analyze how MySQL’s execution plan is selected, and why its selection may not be the best we think. This also reminds us that if there are multiple indexes in production, use WHERE to filter The index you think may not be selected, we can use EXPLAIN,optimizer trace in advance to optimize our query statement.

Thanks for reading, I hope it can help you 🙂

Source: Programmer Dabin

Recommended

A Collection of Java Interview Questions

Technical introversion group, let’s learn together! !

PS: Because the official account platform has changed the push rules, if you don’t want to miss the content, remember to click “Looking” after reading and add a “Star”, so that every time The push of new articles will appear in your subscription list for the first time. Click “Watching” to support us!

The knowledge points of the article match the official knowledge files, and you can further learn relevant knowledge Java skill treeHomepageOverview 108415 people are studying systematically