If there are tens of millions of rows of data, how do you use Java to query them quickly?

Foreword

  • Interviewer: Tell me, how would you go about querying 10 million rows of data?

  • Brother B: Just paginate the query directly, using LIMIT.

  • Interviewer: Have you actually tried it in practice?

  • Brother B: Sure.

Now cue the song "Liang Liang" ("Cool, Cool"): this interview just went cold.

Maybe some people have never worked with a table containing tens of millions of rows, and have no idea what happens when you query one.

Today I will walk you through it in practice; the tests below are based on MySQL 5.7.26.

Prepare data

What if you don't have 10 million rows of data?

Create them, then.

Generate ten million rows from application code? Not a good idea: it is far too slow and could genuinely take a whole day to run. A database script executes much faster.

Create table

CREATE TABLE `user_operation_log` (
  `id` int(11) NOT NULL AUTO_INCREMENT,
  `user_id` varchar(64) CHARACTER SET utf8mb4 COLLATE utf8mb4_general_ci NULL DEFAULT NULL,
  `ip` varchar(20) CHARACTER SET utf8mb4 COLLATE utf8mb4_general_ci NULL DEFAULT NULL,
  `op_data` varchar(255) CHARACTER SET utf8mb4 COLLATE utf8mb4_general_ci NULL DEFAULT NULL,
  `attr1` varchar(255) CHARACTER SET utf8mb4 COLLATE utf8mb4_general_ci NULL DEFAULT NULL,
  `attr2` varchar(255) CHARACTER SET utf8mb4 COLLATE utf8mb4_general_ci NULL DEFAULT NULL,
  `attr3` varchar(255) CHARACTER SET utf8mb4 COLLATE utf8mb4_general_ci NULL DEFAULT NULL,
  `attr4` varchar(255) CHARACTER SET utf8mb4 COLLATE utf8mb4_general_ci NULL DEFAULT NULL,
  `attr5` varchar(255) CHARACTER SET utf8mb4 COLLATE utf8mb4_general_ci NULL DEFAULT NULL,
  `attr6` varchar(255) CHARACTER SET utf8mb4 COLLATE utf8mb4_general_ci NULL DEFAULT NULL,
  `attr7` varchar(255) CHARACTER SET utf8mb4 COLLATE utf8mb4_general_ci NULL DEFAULT NULL,
  `attr8` varchar(255) CHARACTER SET utf8mb4 COLLATE utf8mb4_general_ci NULL DEFAULT NULL,
  `attr9` varchar(255) CHARACTER SET utf8mb4 COLLATE utf8mb4_general_ci NULL DEFAULT NULL,
  `attr10` varchar(255) CHARACTER SET utf8mb4 COLLATE utf8mb4_general_ci NULL DEFAULT NULL,
  `attr11` varchar(255) CHARACTER SET utf8mb4 COLLATE utf8mb4_general_ci NULL DEFAULT NULL,
  `attr12` varchar(255) CHARACTER SET utf8mb4 COLLATE utf8mb4_general_ci NULL DEFAULT NULL,
  PRIMARY KEY (`id`) USING BTREE
) ENGINE = InnoDB AUTO_INCREMENT = 1 CHARACTER SET = utf8mb4 COLLATE = utf8mb4_general_ci ROW_FORMAT = Dynamic;

Create data script

Batch inserts are much more efficient, and we commit every 1,000 rows; if a single batch grows too large, the batch insert itself becomes slow.

DELIMITER ;;
CREATE PROCEDURE batch_insert_log()
BEGIN
  DECLARE i INT DEFAULT 1;
  DECLARE userId INT DEFAULT 10000000;
  SET @execSql = 'INSERT INTO `test`.`user_operation_log`(`user_id`, `ip`, `op_data`, `attr1`, `attr2`, `attr3`, `attr4`, `attr5`, `attr6`, `attr7`, `attr8`, `attr9`, `attr10`, `attr11`, `attr12`) VALUES';
  SET @execData = '';
  WHILE i <= 10000000 DO
    SET @attr = "'测试很长很长很长很长很长很长很长很长很长很长很长很长很长很长很长很长很长的属性'";
    -- append one row of values to the pending batch
    SET @execData = concat(@execData, "(", userId + i, ", '10.0.69.175', 'User login operation'", ",", @attr, ",", @attr, ",", @attr, ",", @attr, ",", @attr, ",", @attr, ",", @attr, ",", @attr, ",", @attr, ",", @attr, ",", @attr, ",", @attr, ")");
    IF i % 1000 = 0 THEN
      -- every 1000 rows, build the full INSERT, execute it and commit
      SET @stmtSql = concat(@execSql, @execData, ";");
      PREPARE stmt FROM @stmtSql;
      EXECUTE stmt;
      DEALLOCATE PREPARE stmt;
      COMMIT;
      SET @execData = "";
    ELSE
      SET @execData = concat(@execData, ",");
    END IF;
    SET i = i + 1;
  END WHILE;
END;;
DELIMITER ;
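
Once the procedure is defined, you run it from the MySQL client. A minimal invocation looks like this (the procedure name comes from the script above; the table is assumed to live in the `test` schema, as in the INSERT statement):

CALL batch_insert_log();

Expect it to run for quite a while.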

Start testing

My computer's configuration is fairly low-end: Windows 10, a mediocre standard-voltage i5, and an SSD with read/write speeds of roughly 500 MB/s.

Because of the low configuration, only 3,148,000 rows were prepared for this test, which took up 5 GB of disk (without any secondary indexes) and ran for 38 minutes. If your machine is better, feel free to insert more data for testing.
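
If you want to check how much disk space the table actually occupies, one possible query is the following sketch against information_schema (the schema name `test` is an assumption carried over from the insert script above):

SELECT table_name,
       round((data_length + index_length) / 1024 / 1024 / 1024, 2) AS size_gb
FROM information_schema.tables
WHERE table_schema = 'test' AND table_name = 'user_operation_log'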

SELECT count(1) FROM `user_operation_log`

Return result: 3148000

The three query times are:

  • 14060ms

  • 13755ms

  • 13447ms

Normal pagination query

MySQL supports the LIMIT clause for selecting a specified number of rows, while Oracle can use ROWNUM for the same purpose.

The MySQL pagination query syntax is as follows:

SELECT * FROM table LIMIT [offset,] rows | rows OFFSET offset
  • The first parameter specifies the offset of the first returned record row

  • The second parameter specifies the maximum number of rows to return
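
For example, the two statements below should return the same ten rows; they are just the two spellings of the same pagination:

SELECT * FROM `user_operation_log` LIMIT 1000, 10
SELECT * FROM `user_operation_log` LIMIT 10 OFFSET 1000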

Let’s start testing the query results:

SELECT * FROM `user_operation_log` LIMIT 10000, 10

The three query times are:

  • 59ms

  • 49ms

  • 50ms

The speed looks fine, but this is a local database, so it is naturally fast.

Test from another angle

Same offset, different data volume

SELECT * FROM `user_operation_log` LIMIT 10000, 10
SELECT * FROM `user_operation_log` LIMIT 10000, 100
SELECT * FROM `user_operation_log` LIMIT 10000, 1000
SELECT * FROM `user_operation_log` LIMIT 10000, 10000
SELECT * FROM `user_operation_log` LIMIT 10000, 100000
SELECT * FROM `user_operation_log` LIMIT 10000, 1000000

The query time is as follows:

(chart: query times for the six statements above)

From the above results, we can conclude that the more rows requested, the longer the query takes.

Same amount of data, different offsets

SELECT * FROM `user_operation_log` LIMIT 100, 100
SELECT * FROM `user_operation_log` LIMIT 1000, 100
SELECT * FROM `user_operation_log` LIMIT 10000, 100
SELECT * FROM `user_operation_log` LIMIT 100000, 100
SELECT * FROM `user_operation_log` LIMIT 1000000, 100

(chart: query times for the five statements above)

From the above results, we can conclude that the larger the offset, the longer the query takes.

You can also compare selecting all columns with selecting only a couple of them at the same offset:

SELECT * FROM `user_operation_log` LIMIT 100, 100
SELECT id, attr1 FROM `user_operation_log` LIMIT 100, 100

How to optimize

After all the tossing around above, we have reached some conclusions. For the two problems identified, large offsets and large amounts of returned data, we will now optimize each one separately.

Optimize the problem of large offset

Use subquery method

We can first locate the id at the offset position, and then query the data starting from that id.

SELECT * FROM `user_operation_log` LIMIT 1000000, 10
SELECT id FROM `user_operation_log` LIMIT 1000000, 1
SELECT * FROM `user_operation_log` WHERE id >= (SELECT id FROM `user_operation_log` LIMIT 1000000, 1) LIMIT 10

The query results are as follows:

(chart: query times for the three statements above)

It can be concluded from the above results that:

  • The first one takes the most time, the third one is slightly better than the first one

  • The subquery is faster because it can use the index and only needs the id column

Disadvantage: this only works when the id increases monotonically.

The following form can be used when the id is not strictly incrementing, but the downside is that the pagination has to live inside the subquery.

Note: some MySQL versions do not support LIMIT inside an IN clause, which is why an extra nested SELECT is used.

SELECT * FROM `user_operation_log` WHERE id IN (SELECT t.id FROM (SELECT id FROM `user_operation_log` LIMIT 1000000, 10) AS t)

Using an id restriction

This method is more demanding: the id must be continuously increasing, and you must be able to calculate the id range in advance and then use BETWEEN. The SQL is as follows:

SELECT * FROM `user_operation_log` WHERE id between 1000000 AND 1000100 LIMIT 100

SELECT * FROM `user_operation_log` WHERE id >= 1000000 LIMIT 100

The query results are as follows:

(chart: query times for the two statements above)

It can be seen from the results that this method is very fast

Note: the LIMIT here only limits the number of rows returned; no offset is used.
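
Building on this idea, you can walk through the table page by page by remembering the largest id returned by the previous page. A rough sketch follows; the ORDER BY is an addition to make the page order deterministic, and 1000100 merely stands in for whatever the last id of the previous page was:

SELECT * FROM `user_operation_log` WHERE id > 1000100 ORDER BY id LIMIT 100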

Optimize the problem of large amount of data

The amount of data returned will also directly affect the speed

SELECT * FROM `user_operation_log` LIMIT 1, 1000000

SELECT id FROM `user_operation_log` LIMIT 1, 1000000

SELECT id, user_id, ip, op_data, attr1, attr2, attr3, attr4, attr5, attr6, attr7, attr8, attr9, attr10, attr11, attr12 FROM `user_operation_log` LIMIT 1, 1000000

The query results are as follows:

(chart: query times for the three statements above)

From the results, it can be seen that the query efficiency can be significantly improved by reducing unnecessary columns

The first and third queries take about the same time. At this point you will surely complain: then why should I bother writing out so many fields? Just use * and be done with it.

Note that my MySQL server and client are on the same machine, so their query times are very close. If you have the means, test with the client and MySQL on separate machines.

Isn't SELECT * good enough?

By the way, let me add why SELECT * is commonly forbidden. Isn't it simple and effortless?

Two main points:

  • With SELECT *, the database has to resolve more objects, columns, permissions, attributes and other metadata. With complex SQL statements and frequent hard parsing, this places a heavy burden on the database.

  • It increases network overhead. SELECT * can accidentally drag in large, useless text fields such as log or IconMD5 columns, and the amount of data transferred grows dramatically. If MySQL and the application are not on the same machine, this overhead is especially noticeable.

End

Finally, I hope everyone will try this out for themselves; you will definitely get more out of it that way. Feel free to leave a comment!

I have even written the script for you. What are you waiting for?!

Link: https://juejin.cn/post/6863668253898735629

(Copyright belongs to the original author; the post will be removed upon request in case of infringement.)

– EOF –
