Discussion of MySQL bulk data insertion schemes

Because my company needed to insert a large amount of data, I took some time to research the topic and have summarized 5 methods:

  1. Inserting rows one by one in a loop
  2. Splicing one multi-row INSERT via SQL
  3. Batch insertion with MyBatis-Plus (mp)
  4. Combining SQL splicing with loop insertion
  5. An upgraded version of method 4 that runs everything in the same session

These are the 5 methods I have summarized; a detailed verification of each follows. Please correct me if anything is wrong.
1. Preparation
1. Create table

Since this is only a test, the table has just a few fields.

CREATE TABLE `user` (
  `id` int NOT NULL AUTO_INCREMENT,
  `name` varchar(255) CHARACTER SET utf8mb4 COLLATE utf8mb4_general_ci NULL DEFAULT NULL,
  `tel` varchar(255) CHARACTER SET utf8mb4 COLLATE utf8mb4_general_ci NULL DEFAULT NULL,
  `dept_id` int NULL DEFAULT NULL,
  PRIMARY KEY (`id`) USING BTREE
) ENGINE = InnoDB AUTO_INCREMENT = 1 CHARACTER SET = utf8mb4 COLLATE = utf8mb4_general_ci ROW_FORMAT = Dynamic;
2. Basic code

Next come the mapper, service, and controller layers. Because they are straightforward, the full code is not posted; a minimal sketch of the entity and mapper is shown below.
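For reference, here is roughly what the entity and mapper might look like (the Lombok annotation and the exact names are my assumptions, not the original code):

// User.java - entity mapped to the user table above
import com.baomidou.mybatisplus.annotation.IdType;
import com.baomidou.mybatisplus.annotation.TableId;
import com.baomidou.mybatisplus.annotation.TableName;
import lombok.Data;

@Data
@TableName("user")
public class User {
    @TableId(type = IdType.AUTO)
    private Integer id;
    private String name;
    private String tel;
    private Integer deptId;
}

// UserMapper.java - BaseMapper provides insert(); batchInsert is backed by the XML shown later
import com.baomidou.mybatisplus.core.mapper.BaseMapper;
import org.apache.ibatis.annotations.Param;
import java.util.List;

public interface UserMapper extends BaseMapper<User> {
    int batchInsert(@Param("list") List<User> list);
}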

3. Configuration file

Batch rewriting must be enabled on the JDBC connection by adding rewriteBatchedStatements=true to the URL; with it, the MySQL driver can rewrite a batch of single-row INSERTs into multi-row INSERTs.

datasource:
  type: com.zaxxer.hikari.HikariDataSource
  driver-class-name: com.mysql.cj.jdbc.Driver
  url: jdbc:mysql://localhost:3306/demo?serverTimezone=Asia/Shanghai&rewriteBatchedStatements=true
4. Test notes

After each test the table is cleared, so old data does not affect the next run.
Insertion speed is also affected by hardware such as the CPU; since all tests run on the same machine, this can be ignored.
Each test is performed 3 times, and the service is restarted between runs.

2. Start verification
1. Inserting rows one by one in a loop

Because this approach is very slow, I only insert 10,000 rows.

@Override
public int batchInsert() {
    int max = 10000;
    List<User> list = new ArrayList<>();
    // Build 10,000 test rows in memory
    for (int i = 0; i < max; i++) {
        User user = new User();
        user.setName("name:" + i);
        user.setTel("tel:" + i);
        list.add(user);
    }
    long start = System.currentTimeMillis();
    // Insert one row per mapper call: every row is its own statement and commit
    for (User user : list) {
        mapper.insert(user);
    }
    long time = System.currentTimeMillis() - start;
    System.out.println("Total time spent: " + time);
    return 1;
}

The results of the 3 tests are as follows:

Total time spent: 66188
Total time spent: 65873
Total time spent: 61469

As you can see, it takes more than a minute to insert 10,000 rows, which is clearly unacceptable. Most of the time goes into per-statement overhead such as obtaining and releasing connections and closing resources.

2. SQL splicing and insertion

This method uses MySQL's own multi-row insert syntax, namely: insert into table(...) values(...),(...),(...)...
The code is as follows:

@Override
public int batchInsert() {
    int max = 10000;
    List<User> list = new ArrayList<>();
    for (int i = 0; i < max; i++) {
        User user = new User();
        user.setName("name:" + i);
        user.setTel("tel:" + i);
        list.add(user);
    }
    long start = System.currentTimeMillis();
    // One mapper call sends a single multi-row INSERT built by the XML below
    mapper.batchInsert(list);
    long time = System.currentTimeMillis() - start;
    System.out.println("Total time spent: " + time);
    return 1;
}

The xml file of the batchInsert method is as follows:

<insert id="batchInsert" parameterType="club.gggd.demo.domain.entity.User">
    INSERT INTO user(`name`, tel) VALUES
    <foreach collection="list" separator="," item="item">
        (#{item.name},#{item.tel})
    </foreach>
</insert>

The results of the 3 tests are as follows:

Total time spent: 986
Total time spent: 1116
Total time spent: 1182

As you can see, splicing all the rows into one SQL statement improves efficiency dramatically: only one statement is executed, so no time is spent repeatedly establishing connections and the like.
However, when the data volume grows the SQL string becomes very long, and this table has only 3 columns; with more columns it grows even faster, so keep the spliced SQL from getting too long; see the check below.
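The hard limit here is MySQL's max_allowed_packet: a spliced statement larger than that value is rejected by the server. A quick way to check it from Java (a sketch under my own assumptions; dataSource is the Hikari pool configured earlier):

import java.sql.Connection;
import java.sql.ResultSet;
import java.sql.Statement;
import javax.sql.DataSource;

// Print max_allowed_packet so we know how large a spliced INSERT may safely get
public void printMaxAllowedPacket(DataSource dataSource) throws Exception {
    try (Connection conn = dataSource.getConnection();
         Statement st = conn.createStatement();
         ResultSet rs = st.executeQuery("SHOW VARIABLES LIKE 'max_allowed_packet'")) {
        if (rs.next()) {
            System.out.println(rs.getString("Variable_name") + " = " + rs.getString("Value") + " bytes");
        }
    }
}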

3. Batch insertion with mp

mp ships with a batch insertion method, saveBatch, which we can use directly.
This and the remaining methods are fairly fast, so I increased the data volume to 100,000 rows to make the differences more obvious.

@Override
public int batchInsert() {
    int max = 100000;
    List<User> list = new ArrayList<>();
    for (int i = 0; i < max; i++) {
        User user = new User();
        user.setName("name:" + i);
        user.setTel("tel:" + i);
        list.add(user);
    }
    long start = System.currentTimeMillis();
    // mp's built-in batch insert (saveBatch comes from mp's IService/ServiceImpl)
    this.saveBatch(list);
    long time = System.currentTimeMillis() - start;
    System.out.println("Total time spent: " + time);
    return 1;
}

The results of the 3 tests are as follows:

Total time spent: 5392
Total time spent: 5220
Total time spent: 5330

As you can see, mp's built-in batch insertion takes only a little over 5 seconds for 100,000 rows, which is quite good.
Looking at the source code, mp splits the data to be inserted into batches of 1,000 rows and flushes each batch in a loop, which is why the efficiency is high; a simplified sketch of the idea follows.
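Conceptually, saveBatch behaves roughly like this (my simplification, not the actual mp source; the 1,000-row batch size is mp's default):

import org.apache.ibatis.session.ExecutorType;
import org.apache.ibatis.session.SqlSession;
import org.apache.ibatis.session.SqlSessionFactory;
import java.util.List;

// Queue single-row inserts on a BATCH session and flush every 1,000 rows
public void saveBatchSketch(SqlSessionFactory sqlSessionFactory, List<User> list) {
    try (SqlSession session = sqlSessionFactory.openSession(ExecutorType.BATCH)) {
        UserMapper userMapper = session.getMapper(UserMapper.class);
        int i = 0;
        for (User user : list) {
            userMapper.insert(user);       // queued by the BATCH executor, not sent yet
            if (++i % 1000 == 0) {
                session.flushStatements(); // send the queued 1,000 single-row INSERTs
            }
        }
        session.flushStatements();         // flush any remainder
        session.commit();
    }
}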

4. Combining SQL splicing with loop insertion

As method 2 showed, splicing SQL can be very efficient, but if the amount of data is uncontrolled the SQL becomes too long and errors out. We can therefore combine methods 1 and 2: splice in batches and loop over the batches. The code is as follows:

@Override
public int batchInsert() {
    int max = 100000;
    List<User> list = new ArrayList<>();
    for (int i = 0; i < max; i++) {
        User user = new User();
        user.setName("name:" + i);
        user.setTel("tel:" + i);
        list.add(user);
    }
    // Split the data into batches of 1,000 (ListUtil here is presumably Hutool's)
    List<List<User>> split = ListUtil.split(list, 1000);
    long start = System.currentTimeMillis();
    // Each batch becomes one multi-row INSERT; each call is its own commit
    for (List<User> users : split) {
        mapper.batchInsert(users);
    }
    long time = System.currentTimeMillis() - start;
    System.out.println("Total time spent: " + time);
    return 1;
}

The results of the 3 tests are as follows:

Total time spent: 4169
Total time spent: 3848
Total time spent: 3908

As you can see, this method is more efficient than mp's batch insertion. The reason, I believe, is that mp's batch insert flushes every 1,000 rows but still sends those 1,000 rows as single-row INSERTs, whereas this method merges every 1,000 rows into one multi-row INSERT, so the efficiency is higher.

5. Insert into the same session

This method combines methods 3 and 4. In method 4 each loop iteration is its own commit, so time is still spent opening and closing connections. What if, borrowing mp's approach, we run every 1,000-row batch inside a single session and commit only once?

@Override
public int batchInsert() {
    int max = 100000;
    List<User> list = new ArrayList<>();
    for (int i = 0; i < max; i++) {
        User user = new User();
        user.setName("name:" + i);
        user.setTel("tel:" + i);
        list.add(user);
    }
    SqlSession sqlSession = null;
    long start = 0L;
    try {
        // Open a session in BATCH executor mode with auto-commit turned off
        sqlSession = sqlSessionFactory.openSession(ExecutorType.BATCH, false);
        // Get a mapper bound to this batch session
        UserMapper userMapper = sqlSession.getMapper(UserMapper.class);
        // Split the data into batches of 1,000
        List<List<User>> split = ListUtil.split(list, 1000);
        start = System.currentTimeMillis();
        for (List<User> users : split) {
            userMapper.batchInsert(users);
        }
        // Commit once, after all data has been inserted
        sqlSession.commit();
    } catch (Exception e) {
        e.printStackTrace();
    } finally {
        // Remember to close the resource
        if (sqlSession != null) {
            sqlSession.close();
        }
    }
    long time = System.currentTimeMillis() - start;
    System.out.println("Total time spent: " + time);
    return 1;
}

The results of the 3 tests are as follows:

Total time spent: 2699
Total time spent: 2761
Total time spent: 2690

As you can see, efficiency improves further: all inserts share one connection and are committed together, so the connection overhead is paid only once. In my view this is currently the best of the five schemes.
The above is my modest understanding of batch data insertion. If there are any mistakes, please feel free to correct me.