MySQL InnoDB data storage structure

Content involved in this article

1. Database storage structure: page

The index structure provides us with an efficient indexing method, but the index information and data records are stored in the file, or in the page structure to be precise. On the other hand, indexes are implemented in the storage engine. The storage engine on the MySQL server is responsible for reading and writing data in the table.

The formats stored in different storage engines are generally different, and even some storage engines such as Memory do not use disks to store data.

Since InnoDB is the default storage engine of MySQL, this chapter mainly introduces the data storage structure of the InnoDB storage engine.

1.1 The basic unit of disk and memory interaction: page

1.2 Overview of page structure

Page a, page b, page c . . . page n These pages do not need to be connected in the logistics structure, as long as they are related through a two-way linked list. The records in each data page will form a one-way linked list in order from small to large primary key values. Each data page will generate a page directory for the records stored in it. When searching for a record through the primary key, you can Use the dichotomy method to quickly locate the corresponding slot in the page directory, and then traverse the records in the corresponding group of the slot to quickly find the specified record.

1.3 page size

Different database management systems (DBMS for short) have different page sizes. For example, in MySQL’s InnoDB storage engine, the default page size is 16KB. We can check it with the following command:

show variables like '%innodb_page_size%'

The page size in SQL Server is 8KB, while in Oracle we use the term “block” (Block) to represent “page”. The block size supported by Oracle is 2KB. 4KB, 8KB, 16KB, 32KB and 64KB.

Superstructure of the 1.4 page

In addition, in the database, there are also concepts of extent, segment and tablespace. The relationship between rows, pages, extents, segments, and table spaces is as shown in the figure below:

Extent: It is a storage structure one level larger than a page. In the InnoDB storage engine, an extent will allocate 64 consecutive pages. Because the default page size in InnoDB is 16KB, the size of an extent is 64 * 16KB = 1MB
Segment: It consists of one or more areas. The area is a continuously allocated space in the file system (in InnoDB, it is 64 consecutive pages). However, the area and area are not required in the segment. Districts are adjacent to each other. Segment is the allocation unit in the database, and different types of database objects exist in different segment forms. When we create data tables and indexes, corresponding segments will be created accordingly. For example, when creating a table, a table segment will be created, and when creating an index, an index segment will be created.
Tablespace: It is a logical container. The objects stored in table space are segments. There can be one or more segments in a table space, but a segment can only belong to one table space. The database consists of one or more table spaces. Table spaces can be managed into system table spaces, user table spaces, undo table spaces, temporary table spaces, etc.

2. Internal structure of the page

If pages are divided by type, common ones include data pages (save B + tree nodes), system tables, Undo pages and transaction data pages, etc. The data page is the page we use most often.

The 16KB storage space of the data page is divided into seven parts, namely File Header, Page Header, Maximum and Minimum Records (Infimum + supremum), User Records ( User Records), Free Space, Page Directory and File Tailer.

A schematic diagram of the page structure is shown below:

As shown in the following table:

We can divide these 7 structures into 3 parts.

Part 1: File Header and File Trailer

See the file InnoDB database storage structure.mmap

Part 2: User Records, maximum and minimum records, Free Space

See the file InnoDB database storage structure.mmap

Part 3: Page Directory and Page Header

See the file InnoDB database storage structure.mmap

2.3 How to query B + tree from the perspective of database page

A B+ tree can be divided into two parts according to byte type:

Leaf node, the node at the bottom of the B + tree, the height of the node is 0, and stores row records.
Non-leaf nodes, the height of the node is greater than 0, store index keys and page pointers, and do not store the row record itself.

When we understand the structure of the B + tree from the page structure, it can help us understand some principles of retrieval through indexes:

3. InnoDB row format (or record format)

See the file InnoDB database storage structure.mmap

4. Area, segment and fragment area

4.1 Why should there be zones?

The pages in each level of the B+ tree will form a doubly linked list. If storage space is allocated in units of pages, the logistics positions between two adjacent pages in the doubly linked list may be very far apart. When introducing the applicable scenarios of the B + tree index, it was specifically mentioned that the range query only needs to locate the leftmost record and the rightmost record, and then scan along the doubly linked list. If there are two adjacent pages in the linked list, The physical locations are very far apart, which is what is called random I/O. Once again, the speed of the disk is several orders of magnitude different from the speed of the memory. Random I/O is very slow, so we should try to make the logistics positions of adjacent pages in the linked list adjacent, so that when performing range queries Only then can you use so-called sequential I/O.

Introducing the concept of area, an area is 64 consecutive pages in physical location. Because the default page size in InnoDB is 16KB, the size of an extent is 64 * 16KB = 1MB. When the data in the table is the largest, when allocating space for an index, it is no longer allocated in units of pages, but in units of locations. Even when there is a lot of data in the table, more than one index can be allocated at once. a continuous area. Although it may cause a little waste of space (the data is not enough to fill the entire area), from a performance perspective, it can eliminate a lot of random I/O, and the merits outweigh the disadvantages! ! !

4.2 Why paragraphs?

For range queries, the records in the leaf nodes of the B + tree are actually sequentially scanned. If the leaf nodes and non-leaf nodes are not distinguished, and all the pages represented by the nodes are placed in the applied area, the effect of range scanning will be It’s a big discount. Therefore, InnoDB treats the leaf nodes and non-leaf nodes of the B + tree differently, that is to say, the leaf nodes have their own unique areas, and the non-leaf nodes also have their own unique frontal areas. The set of areas that store leaf nodes is considered a segment, and the set of areas that store non-leaf nodes is also considered a segment. In other words, an index will generate 2 segments, a leaf node segment and a non-leaf node segment.

In addition to the leaf node segments and non-leaf node segments of the index, InnoDB also has segments defined for storing some special data, such as rollback segments. Therefore, common segments include data segments, index segments, and rollback segments. The data segment is the leaf node of the B + tree, and the index segment is the non-leaf node of the B + tree.

In the InnoDB storage engine, the management of segments is completed by the engine itself, and the DBA cannot and does not need to control it. This simplifies the DBA’s management of segments to a certain extent.

A segment does not actually correspond to a continuous physical area in the table space, but is a logical concept consisting of several scattered pages and some complete areas.

4.3 Why is there a fragmentation area?

By default, a table using the InnoDB storage engine has only one clustered index. One index will generate 2 segments, and the segments apply for storage space in units of zones. By default, one zone occupies 1M (64 * 16KB = 1024KB) storage. space, so by default a small table that only stores a few records requires 2M of storage space. In the future, every time you add an index, you will need to apply for 2M more storage space. This is too wasteful for tables that store relatively few records. The crux of this problem is that the areas introduced so far are very pure, that is, an area is allocated entirely to a certain segment, or all pages in the area exist to store data in the same segment, even if If the data in a segment cannot fill all the pages in the zone, the remaining pages cannot be used for other purposes.

In order to consider the situation that allocating a complete area to a segment is too wasteful of storage space for a table with a small amount of data, InnoDB proposes the concept of a fragment area. In a fragmented area, not all pages exist to store data of the same segment, but the pages in the fragmented area can be used for different purposes. For example, some pages are used for segment A, and some pages are used for segment A. Segment B, some pages do not even belong to any segment. The fragmented area belongs directly to the table space and does not belong to any segment.

Therefore, the strategy for allocating storage space for a certain segment in the future is as follows:

When you first want to insert data into the table, the segment is allocated storage space in single page units from a fragmented area.
When a segment has occupied 32 fragment area pages, it will apply to allocate storage space in complete area units.

Now a segment cannot just be defined as a collection of certain areas, but more accurately it should be a collection of some scattered pages and some complete areas.

4.4 Classification of areas

Districts can generally be divided into 4 types:

Free area (FREE): No pages in this area are currently used.
Fragmentation area with remaining space (FREE_FRAG): Indicates that there are still available pages in the fragmentation area.
Fragmentation area with no remaining space (FULL_FRAG): Indicates that all pages in the fragmentation area are used and there are no free pages.
Extent attached to a segment (FSEG): Each index can be divided into leaf node segments and non-leaf node segments.

Areas in the three states of FREE, FREE_FRAG and FULL_FRAG are independent and belong directly to the table space. The area in FSEG state is attached to a certain segment.

If the table space is compared to a group army, sections are equivalent to divisions and districts are equivalent to regiments. Generally, groups belong to a certain division, just like the areas in FSEG all belong to a certain segment, but the areas in the three states of FREE, FREE_FRAG and FULL_FRAG are directly affiliated with the table space, just like independent groups. It’s like taking orders directly from the military.

5. Table space

The table space can be regarded as the highest level of the logical architecture of the InnoDB storage engine. All data is stored in the table space. Table space is a logical container. The objects stored in table space are segments. There can be one or more segments in a table space, but a segment can only belong to one table space. Table space database consists of one or more table spaces. Table spaces can be divided into system table space (System Tablespace), independent table space (File-Per-Table Tablespace), undo table space (Undo Tablespace) and temporary table from the management point of view. Space (Temporary Tablespace)

5.1 Independent table space

Independent table space, that is, each table has an independent table space, that is, the data and index information will be stored in its own table space. Independent tablespaces (ie: single tables) can be migrated between different databases.

The space can be reclaimed (the DROP TABLE operation can automatically reclaim the table space; otherwise, the table space cannot be reclaimed by itself). For statistical analysis or log tables, after deleting a large amount of data, you can use: alter table TableName engine=innodb; to reclaim unused space. For tables that use independent table spaces, no matter how they are deleted, the fragmentation of the table space will not seriously affect the performance, and there is still a chance to deal with it.

Independent table space structure

An independent table space consists of segments, areas, and pages.

The file size corresponding to the real table space

When we look in the data directory, we will find that the .ibd file corresponding to a newly created table only occupies 96K, which is only 6 pages in size (in MySQL5.7). This is because the space occupied by the table space is very small at the beginning. There is no data in it. But don’t forget that these .ibd files are self-expanding. As the data in the table increases, the files corresponding to the table space also gradually increase.

View the table space type of InnoDB:

show variables like 'innodb_file_per_table'

You can see innodb_file_per_table=ON, which means that each table will be saved in an .ibd file.

5.2 System table space

The structure of the system table space is basically similar to that of the independent table space, except that since the entire MySQL process has only one system table space, some additional pages about the entire system information will be recorded in the system table space, which is not available in the independent table space.

InnoDB Data Dictionary

Whenever we insert a record into a table, the MySQL verification process is as follows: First, we must verify whether the table corresponding to the insertion statement exists, and whether the inserted columns match the columns in the table. If there is no problem with the syntax, You also need to know which page the root page corresponds to the clustered index and all secondary indexes of the table, and then insert the record into the B-tree of the corresponding index. Therefore, in addition to saving the user data we inserted, MySQL also needs to save a lot of additional information, for example:

Which table space a certain table belongs to and how many columns are there in the table.
What is the type of each column corresponding to the table.
How many indexes does the table have, which fields do each index correspond to, and which page in which table space the root page corresponding to the index is located.
What foreign keys does the table have, and which columns of which table do the foreign keys correspond to?
What is the file path on the file system corresponding to a certain table space.

Deleting this data is not the user data we inserted using the INSERT statement. It is actually some additional data that was introduced to better manage our user data. These data pages are called metadata. The InnoDB storage engine specifically defines a number of internal system tables to record these metadata:

These system tables are also called data dictionary, and they are stored in a page in the system table space in the form of B + tree. Among them, SYS_TABLES, SYS_COLUMNS, SYS_INDEXES, SYS_FIELDS are particularly important. They are called basic system tables. Let’s first look at the structure of these four tables:

Note: Users cannot directly access these internal system tables of InnoDB unless you directly parse the files on the file system corresponding to the system table space. However, considering that viewing the contents of these tables may help you analyze the problem, some tables starting with innodb_sys are provided in the system database information_schema:

USE information_schema;

SHOW TABLES LIKE 'innodb_sys%';

The tables starting with INNODB_SYS in the information_scheme database are not real internal system tables (the internal system tables are what we have above starting with SYS Those tables), instead, when the storage engine starts, it reads these system tables starting with SYS, and then populates these tables starting with INNODB_SYS. The fields in the table starting with INNODB_SYS and the table starting with SYS are not exactly the same, but it is enough for your reference only.

Appendix: Three ways to load data pages

InnoDB reads data from disk. The minimum unit is the data page. The data with id = xxx you want to get is one of the many rows in this data page.

For the data stored in MySQL, we call it a table logically. At the physical level such as disk, it is stored in the form of data page. When it is loaded into MySQL, we call it Cache page.

If the buffer pool does not have the page data, the buffer pool has the following three ways to read data, and the reading rate of each method is different:

1. Memory reading

If the data exists in memory, the execution time is basically about 1ms, which is very efficient.

2. Random reading

If the data is not in memory, you need to query the page on the disk. The overall time statement is about 10ms. Of these 10ms, 6ms are the actual busy time of the disk (including seek and semicircle rotation time), and 3ms are An estimate of the queuing events that may occur, plus a transfer time of 1ms to transfer the page from the disk server buffer to the database buffer. This 10ms seems very fast, but in fact it takes a very long time for the database, because it is still the time to read one page.

3. Sequential reading

Sequential reading is actually a batch reading method, because the data we request are often stored adjacent to each other on the disk. Sequential reading can help us read pages in batches, so that they can be loaded into the buffer pool at once. There is no need to perform separate disk I/O operations on other pages. If the throughput of a disk is 40MB/S, then for a 16KB page, 2560 (40MB/16KB) pages can be read sequentially at one time, which is equivalent to a page read time of 0.4ms. Using batch reading, even reading from disk is more efficient than reading a single page from memory.