JOIN algorithm principle and optimization

Article directory

Preface

1. JOIN operation method

2. Driver table and driven table

1. What is a driver table?

3. Principle of JOIN algorithm

4. Summary of JOIN algorithm

Summarize

Foreword

JOIN: used in mysql to perform join table operations, to match data from two tables, and to filter and merge the result set that meets our requirements.

Experiment creation table:

create table student(
id int(11) not null,
number VARCHAR(20) default null,
username VARCHAR(20) default null,
primary key(id)
)engine=innodb default charset=utf8mb4;

create table score(
id int(11) not null,
number varchar(20) default null,
chinese double(4,0) default null,
math double(4,0) default null,
PRIMARY key(id)
)engine=innodb default charset=utf8mb4;

insert into student values(1,'10086','Big Brother');
insert into student values(2,'10087','Great Priest');
insert into student values(3,'10088','Da Sima');
insert into student values(4,'10081','General');



insert into score values(1,'10086',99,60);
insert into score values(2,'10087',99,60);
insert into score values(3,'10089',78,100);

Tips: The following is the text of this article, the following cases are for reference

1. JOIN operation method

Left outer join:

#Left outer join, query all the data in the left table and the matching data in the right table. If it does not exist, it is represented by empty.
explain
select * from student s1 left join score c1 on s1.number = c1.number;

Right outer join:

#Right outer join, query all the data in the right table and the matching data in the left table. If it does not exist, it is represented by empty.
explain
select * from student s1 right join score c1 on s1.number = c1.number;

Inner join:

#Inner join, only obtain the intersection data of two tables.
explain
select * from student s1 inner join score c1 on s1.number = c1.number;

2. Driver table and driven table

1. What is a driver table

1) When performing multi-table related queries, the first table to be processed is the driver table, and the driver table is used to associate other tables.

2) The determination of the driving table is very critical, it will directly affect the order of multi-table association, and also determines the query of subsequent association queries.

Verify the selection of driver tables under different connection query situations by explaining the execution plan.

1) The connection query does not have other conditions for where except the connection condition.

left outer join

# When left outer join, the selection of driver table: the left one is the driver table, so student is the driver table
explain
select * from student s1 left join score c1 on s1.number = c1.number;

Note: In order of execution, the first table is the driver table.

right outer join

# When right outer join, the selection of the driver table, the following indicates the driver table
explain
select * from student s1 right join score c1 on s1.number = c1.number;

Inner join

#Inner join, selection of driver table: Whoever has less data will be the driver table
explain
select * from student s1 inner join score c1 on s1.number = c1.number;

2) The connection query has a where condition (the table with the where condition is the driving table, otherwise it is the driven table)

#Left outer join
explain
select * from student s1 left join score c1 on s1.number = c1.number where c1.id = 1;

The above where condition is the primary key index column, and the lower one is the ordinary column. It can be seen that the driving table is still the table of the where condition.

#Left outer join
explain
select * from student s1 left join score c1 on s1.number = c1.number where c1.number = '10086';

The same goes for right joins and inner joins.

Summary: The driving table selection principle: when it has no impact on the final result set, give priority to the table with the smaller result set as the driving table.

3. Principle of JOIN algorithm

Simple Nested Loop JOIN Simple nested loop connection

A simple nested loop connection is a double-layer for loop, which loops through the row data of the outer table and compares it with the row data of the inner table one by one to obtain the result.

#Connect users and order tables, connection conditions, user table id = user_id of order table
select * from user u left join order o on u.id = o.id;


#After converting into code
for(URow: user table){
    for(ORow : order){
        if(URow.id = ORow.id){
            return URow;
        }
    }
}

Features of SNL:

It is simple, crude, and easy to understand. It compares data through a double-layer loop to obtain the results.

The query efficiency is very low. Assume that table A has N rows and table B has M rows. SNL overhead: N*M

Index Nested Loop joinindex nested loop join

The difference from SNL: the number of data matches in the inner table is reduced, mainly because the index is created on the join field (which must be the field used for join in the driven table).

Principle: The number of matching times in the inner loop is reduced. After using INLJ, only 3 ios are needed to match.

Note: To use the Index Nested loop join algorithm, the premise is that the matching fields of the driven table must be indexed.

Block Nested Loop Join (cache block nested loop join)

The optimization idea is to reduce the number of scans of the inner table. Through the simple nested loop query diagram, we can see that each record in the left table will scan the right table once. This process is actually very performance consuming. .

The bnlj algorithm reduces the number of outer loops by caching multiple pieces of data in the outer table at once (using Join Buffer cache). Therefore, the number of scans of the inner table is reduced, thereby improving performance.

If index nested loop join cannot be used, the database uses the bnlj algorithm by default.

1. To use the bnlj algorithm, you need to turn on the optimizer_switch and set block_nested_loop to on. It is turned on by default:

show variables like ‘%optimizer_switch%’;

2. Set the join buffer size

The size of the join buffer can be set through the join_buffer_size parameter.

show variables like '%join_buffer_size%';

262144/1024 = 256Kb

It is known that a page size is 16Kb and can accommodate 256/16 = 16 pages.

In the case where the driven table has no index, the bnlj (block nested loop) algorithm is used:

n data pages are loaded from the driver table every time (n<=join_buffer_size/1024/16). After optimization, the number of IO times in the left table is reduced.

4. Summary of JOIN algorithm

INLJ uses the index mechanism to reduce the number of loop matching of the inner table to achieve optimization effects. BNLJ reduces the number of IOs of the outer table by caching multiple pieces of data at a time and batch matching, and also reduces the number of table scans of the inner table. .

Optimization ideas for table joins:

1. Use a small table (small result set) to drive a large table (large result set) – essentially reducing the amount of data in the outer loop

2. Add indexes to the conditions matched by the driven table (reduce the number of matches in the inner loop)

3. Increase the join buffer size

4. Reduce unnecessary field queries (the fewer fields, the more data cached in the join buffer)

5. Regarding the execution plan of the join driven table and the driven table with or without indexes

Situation 1: Both tables are indexed. The driving table will not be indexed, but the driven table will be indexed.

explain
select * from student s inner join score c on s.number = c.number;

Summary: Because the driver table is an outer loop, all data needs to be taken out, so a full table scan is better. If you want to drive a table to use an index, you need a where clause, and the where column needs to have an index.

explain
select * from student s inner join score c on s.number = c.number where c.id=2;