[MySQL Indexing and Optimization] InnoDB data storage structure

Article directory

- 1. Database storage structure: page
- - 1.1 The basic unit of disk and memory interaction: page
  - 1.2 Overview of page structure
  - 1.3 Page Superstructure
- 2. Internal structure of the page
- 3. InnoDB row format (or record format)
- - 3.1 Compact line format
  - 3.2 Dynamic and Compressed row formats
  - 3.3 Redundant row format
- 4. Area, segment and fragment area
- - 4.1 Why should there be zones?
  - 4.2 Why do we need paragraphs?
  - 4.3 Why is there a fragmentation area?
  - 4.4 Classification of zones
- 5. Table space
- - 5.1 Independent table space
  - 5.2 System table space

1. Database storage structure: page

The index structure provides us with an efficient indexing method, but the index information and data records are stored in the file, or in the page structure to be precise. On the other hand, indexes are implemented in the storage engine. The storage engine on the MySQL server is responsible for reading and writing data in the table. The format of storage in different storage engines is generally different.

Since InnoDB is MySOL’s default storage engine, this chapter analyzes the data storage structure of the InnoDB storage engine.

1.1 Basic unit of disk and memory interaction: page

InnoDB divides data into several pages. The default page size in InnoDB is 16KB (you can see the innodb_page_size parameter through the show variables command). A page is the basic unit of disk and memory interaction.

Records are stored in rows, but database reads are not in row units. Otherwise, one read (that is, one I/O operation) can only process one row of data, and the efficiency will be very low.

1.2 Overview of page structure

Page a, page b, page c…page n These pages may not be connected in physical structure, as long as they are related through double linked list. The records in each data page will form a one-way linked list in order of primary key value from small to large. Each data page will generate a page directory< for the records stored in it. /code>, when searching for a record through the primary key, you can use the dichotomy method in the page directory to quickly locate the corresponding slot, and then traverse the records in the corresponding group of the slot to quickly find the specified record of.


1.3 Page Superstructure
In addition, in the database, there are also concepts of extent, segment and tablespace. The relationship between rows, pages, extents, segments, and table spaces is as shown in the figure below:


An extent (Extent) is a storage structure one level larger than a page. In the InnoDB storage engine, an extent will allocate 64 consecutive pages. Because the default page size in InnoDB is 16KB, the size of an extent is 64*16KB= 1MB.
A segment is composed of one or more areas. In a segment, areas are not required to be adjacent. A segment is an allocation unit in a database, Different types of database objects exist in different segments. When we create data tables and indexes, corresponding segments will be created accordingly. For example, when creating a table, a table segment will be created, and when creating an index, an index segment will be created.
Tablespace is a logical container. The objects stored in tablespace are segments. There can be one or more segments in a tablespace, but a segment can only belong to one tablespace. The database consists of one or more table spaces. Table spaces can be divided into system table space, user table space, and undo table space from a management perspective. , Temporary table space, etc.
2. Internal structure of the page
If pages are divided by type, common ones include data page (save B + node), system page, Undo page and transaction Data page etc. The data page is the page we use most often.

The 16KB storage space of the data page is divided into seven parts, namely File Header, Page Header, Maximum and Minimum Records (Infimum + supremum), User Records ( User Records), Free Space, Page Directory and File Tailer.

The functions of these seven parts are explained as follows:

How does B + tree perform record retrieval?
Query row records through the index of the B + tree. First, start from the root of the B + tree and search layer by layer until the leaf node is found, that is, the corresponding data page is found. The data page is loaded into the memory. The page directory The slot (sot) uses binary search to first find a rough record group (a group has less than 8 records, the page directory structure search diagram is as follows), and then passes  in the group Find records using linked list traversal.

3. InnoDB row format (or record format)
We usually insert data into the table in units of rows. The way these records are stored on disk is also called row format or record format. The InnoDB storage engine has designed 4 different types of row formats, namely Compact, Redundant, Dynamic and Compressed row format.

Check the default row format of MySQL 8.0 or the row format used by a specific table:
SELECT @@innodb_default_row_format; -- default dynamic

SHOW TABLE STATUS LIKE 'table name'; -- View the row format Row_format column used by a specific table

3.1 Compact line format



Variable length field length list: stores the number of bytes occupied by all variable length fields, the storage order is reversed


NULL value list: 1 represents NULL, 0 represents non-NULL. If the table does not allow NULL values to be stored, the NULL value list will no longer exist, so you can consider setting table fields as much as possible is non-NULL, note: The storage order is also reversed, there is a schematic diagram below


Record header information (5 bytes)


Reserved bits 1 and 2: each occupy 1 bit and are not used.


delete_mask: occupies 1 bit, delete identifier


min_rec_mask: 1 bit, this mark will be added to the smallest record in the non-leaf node of each layer of the B+ tree


n_owned: 4 bits, indicating the number of records owned by the current record slot. Only the largest record in the group will be recorded.


heap_no: occupies 13 bits, indicating the location information of the current record in the record heap


record_type: 3 bits, indicating the type of the current record, 0 represents an ordinary record, 1 represents a B + tree non-leaf node record, 2 represents the minimum record, 3 represents the maximum record


next_record: occupies 16 bits, indicating the relative position of the next record




Real data recorded: In addition to the data of the self-defined columns, there will also be three hidden columns

DB_ROW_ID: 6 bytes, row ID, uniquely identifies a record. If a table has neither a primary key nor a unique index, this hidden column will be added as the primary key by default
DB_TRX_ID: Transaction ID
DB_ROLL_PTR: rollback pointer



NULL value columns indicate intent:

Schematic diagram of recording header information:

3.2 Dynamic and Compressed row format
Compared with Compact row format, Dynamic and Compressed only have different processing methods for row overflow.
Row overflow: The size of a page is generally 16KB, which is 16384 bytes, and a VARCHAR(M) type column can store up to 65533 bytes (2 bytes storage Variable length, 1 storage NULL flag), so that one page may not be able to store one record. This phenomenon is called row overflow

The two record formats, Compressed and Dynamic, use complete row overflow for data stored in BLOBs. As shown in the figure, only a 20-byte pointer (the address of the overflow page) is stored in the data page, and the actual data is stored in the Off Page (overflow page).

Another function of the Compressed row record format is that the row data stored in it will be compressed using the zlib algorithm, so large-length data such as BLOB, TEXT, and VARCHAR can be stored very efficiently.


The two formats of Compact and Redundant will store a part of the data (storing 768 prefix bytes) at the actual data of the record, and store the remaining data in several other pages for paging Store and then record the real data using 20 bytes to store the addresses pointing to these pages (of course these 20 bytes also include the number of bytes occupied by the data scattered in other pages), so that you can Find the page where the remaining data is located


Extensions for Compact and Redundant pages:

Compressed and Dynamic:

3.3 Redundant row format
This format is supported to be compatible with the page format of versions prior to MySQL 5.0. It is not used in actual development, so too much analysis will not be done.
4. Area, segment and fragment area
4.1 Why should there be zones? 
The pages in each layer of the B + tree will form a doubly linked list. If storage space is allocated in page units, the physical distance between two adjacent pages in the doubly linked list will The location may be very far away. When we introduced the applicable scenarios of the B + tree index, we specifically mentioned that the range query only needs to locate the leftmost record and the rightmost record, and then scan along the doubly linked list. If the two adjacent records in the linked list The physical location of pages that are very far apart is called random I/O. Once again, the speed of the disk is several orders of magnitude different from the speed of the memory. Random I/O is very slow, so we should try to make the physical locations of adjacent pages in the linked list adjacent when performing range queries. Only then can you use the so-called Sequential I/O.
Introducing the concept of an area, an area is a continuous 64 pages in a physical location. Because the default page size in InnoDB is 16KB, the size of an extent is 64*16KB= 1MB. When the amount of data in the table is large, when allocating space for an index, it is no longer allocated in units of pages, but in units of areas, or even When there is a lot of data in the table, multiple consecutive areas can be allocated at once. Although it may cause a little waste of space (the data is not enough to fill the entire area), from a performance perspective, it can eliminate a lot of random I/O, and does more good than harm !
4.2 Why do we need paragraphs? 
For range queries, the records in the leaf nodes of the B + tree are actually sequentially scanned. If the leaf nodes and non-leaf nodes are not distinguished, and all the pages represented by the nodes are placed in the applied area, the effect of range scanning will be It's a big discount. Therefore, InnoDB treats the leaf nodes and non-leaf nodes of the B + tree differently. That is to say, the leaf nodes have their own unique areas, and the non-leaf nodes also have their own unique areas. Some areas. The collection of areas storing leaf nodes is considered a segment, and the collection of areas storing non-leaf nodes is also considered a segment. In other words, an index will generate 2 segments, a leaf node segment and a non-leaf node segment.
In addition to the leaf node segments and non-leaf node segments of the index, lnnoDB also has segments defined for storing some special data, such as rollback segments. Therefore, common segments include data segment, index segment, and rollback segment. The data segment is the leaf node of the B + tree, and the index segment is the non-leaf node of the B + tree.
In the InnoDB storage engine, the management of segments is completed by the engine itself, and the DBA does not and does not need to control it. This simplifies the DBA's management of segments to a certain extent.
A segment does not actually correspond to a continuous physical area in the table space, but is a logical concept consisting of several scattered pages and some complete areas.

4.3 Why is there a fragmented area
By default, a table using the InnoDB storage engine has only one clustered index. One index will generate 2 segments, and the segments apply for storage space in units of zones. By default, one zone occupies 1M (64*16Kb = 1024Kb) of storage. Space, so by default does a small table that only stores a few records also need 2M of storage space? Will I have to apply for 2M more storage space every time I add an index? This is simply a problem for tables with relatively few storage records. What a waste. The crux of this problem is that the areas we have introduced so far are very pure, that is, an area is allocated entirely to a certain segment, or all pages in the area are used to store the same It exists for segment data. Even if the segment data does not fill all the pages in the area, the remaining pages cannot be used for other purposes.
In order to consider the situation that allocating a complete area to a certain segment is too wasteful of storage space for tables with small data volume, lnnoDB proposed a fragment area code> concept. In a fragmented area, not all pages exist to store data of the same segment, but the pages in the fragmented area can be used for different purposes. For example, some pages are used for segment A, and some pages are used for segment A. Segment B, some pages do not even belong to any segment. The fragmented area belongs directly to the table space and does not belong to any segment.
So the strategy for allocating storage space for a certain segment is as follows:

When data is first inserted into the table, the segment is allocated storage space in single page units from a fragmented area.
When a segment has occupied 32 fragmented areas pages, it will apply to allocate storage space in units of complete areas.

So now a segment cannot only be defined as a collection of certain areas, but more accurately it should be a collection of some scattered pages and some complete areas
4.4 Classification of areas

Free area (FREE): No pages in this area are currently used.
Fragmentation area with remaining space (FREE_FRAG): Indicates that there are still available pages in the fragmentation area
Fragmentation area with no remaining space (FULL_FRAG): Indicates that all pages in the fragmentation area are used and there are no free pages.
Extent attached to a segment (FSEG): Each index can be divided into leaf node segments and non-leaf node segments

Areas in the three states of FREE, FREE_FRAG and FULL_FRAG are independent and belong directly to the table space. The area in the FSEG state is attached to a certain segment

If the table space is compared to a group army, sections are equivalent to divisions and districts are equivalent to regiments. Generally, groups belong to a certain division, just like the areas in FSEG all belong to a certain segment, while the areas in the three states of FREE, FREE_FRAG and FULL_FRAG are directly affiliated with the dry table space, just like independent groups. It's like taking orders directly from the military.

5. Table space
The table space can be regarded as the highest level of the logical structure of the InnoDB storage engine. All data is stored in the table space.
Table space is a logical container. The objects stored in table space are segments. There can be one or more segments in a table space, but a segment can only belong to one table space. A table space database is composed of one or more table spaces. Table spaces can be divided into system table space (Systemtablespace) and independent table space (File-per-table) from a management perspective. tablespace), Undo Tablespace (Undo Tablespace) and Temporary Tablespace (Temporary Tablespace), etc.
5.1 Independent table space
Independent table space, that is, each table has an independent table space, that is, the data and index information will be stored in its own table space. Independent table spaces (ie: single tables) can be migrated between different databases.
The space can be reclaimed (the DROP TABLE operation can automatically reclaim the table space. In other cases, the table space cannot be reclaimed by itself). For statistical analysis or log tables, after deleting a large amount of data, you can use: alter table TableName engine=innodb; to reclaim unused space. For tables that use independent table spaces, no matter how they are deleted, the fragmentation of the table space will not seriously affect the performance, and there is still a chance to deal with it.
Independent table space structure: It consists of segments, areas, and pages.
File size corresponding to the real table space: If you look in the data directory, you will find that the .ibd file corresponding to a newly created table only occupies 96K, which is only 6 pages in size ( In MySQL5.7), these .ibd files are self-expanding. As the data in the table increases, the files corresponding to the table space also gradually increase.
5.2 System table space
The structure of the system table space is basically similar to that of the independent table space, except that since the entire MySQL process has only one system table space, some additional pages about the entire system information will be recorded in the system table space, which is not available in the independent table space.
InnoDB data dictionary, SYS_TABLES and other tables starting with SYS, but users cannot directly access these tables, but the storage engine will store the data of SYS_ and other tables when it starts. Populate into tables starting with INNODB_SYS in the information_schema database.