Disadvantages of UUID and snowflake algorithm

# Question

Why do you need a distributed globally unique ID and the business requirements of a distributed ID?

In complex distributed systems, it is often necessary to uniquely identify a large amount of data and messages

Such as finance, payment, catering, and hotels in Meituan-Dianping;

The data in the system of Maoyan Movies and other products is increasing day by day. After the data is divided into databases and tables, a unique ID is required to identify a piece of data or message;

Special ones such as orders, riders, and coupons also need to be identified by unique IDs.

At this time, a system that can generate a globally unique ID is very necessary

Part of the mandatory requirements for ID generation rules

① Globally unique

That is, no duplicate ID numbers can appear

②Increasing trend

Clustered indexes are used in Mysql’s InnoDB engine. Since most RDBMSs use BTree data structures as storage index data, we should try to use ordered primary keys to ensure write performance in the selection of primary keys.

③Monotonically increasing

Ensure that the next ID must be greater than the previous ID, such as transaction version number, sorting and other special requirements

④Information security

If the ID is continuous, it is very easy for malicious users to pick up the work, just download the specified URL directly in order; if it is an order number, it is even more dangerous, and the competition can directly know our daily order quantity. Therefore, in some application scenarios, the ID needs to be irregular and irregular, making it difficult for competitors to guess

⑤ With time stamp

In development, you can quickly understand the time when this distributed id is generated

Usability Requirements for ID Number Generation Systems

①High availability

Send a request to obtain a distributed ID, and the server must guarantee that it can return a distributed ID to me in 99.999% of cases

②Low latency

return faster

③High QPS (resistance to high access)

For example, 100,000 in one second, the server needs to be able to resist and generate 100,000 distributed IDs

# General general plan

(1) UUID

It comes with jdk and generates 36-bit characters in the form of 8-4-4-4-12, including 4 serial numbers -, which can be used if the single type only needs to be unique

Benefits: high performance, local generation, no network consumption

Disadvantages: Unordered and long strings, storing in the database will increase the pressure on the database, and the performance of entering the database is poor. Mysql officially recommends that the shorter the primary key, the better

Index, split of B+tree index

Since the distributed id is the primary key, then the primary key contains the index, and then the mysql index is implemented through the b + tree. Every time a new UUID data is inserted, in order to optimize the query, the b + tree at the bottom of the index will be processed Modified, because UUID data is unordered. So every insertion of UUID data will greatly modify the b + tree of the primary key, which is very bad. The insertion is completely out of order, which will not only cause some intermediate nodes to split, but also create a lot of unsaturated nodes in vain, which greatly reduces the performance of database insertion

(2) Database auto-increment primary key

Implementation principle of database auto-increment primary key: database auto-increment id and replace into

The meaning of REPLACE INTO is to insert a record, and if the value of the unique index in the table encounters a conflict, replace the old data

Is the database self-incrementing ID mechanism suitable for distributed ID? The answer is not suitable

1: It is difficult to expand the system horizontally. For example, after defining the step size and the number of machines, what should I do if I want to add machines? Assume that there is only one machine with the number 1, 2.3.4.5 (step size is 1), at this time One machine needs to be expanded. You can do this by setting the initial value of the second machine much higher than the first one. It seems to be fine. Now imagine that if we have 100 machines online, what should we do to expand the capacity at this time? It is a nightmare. Therefore, the horizontal expansion scheme of the system is complicated and difficult to realize.

2: The pressure on the database is still very high. Every time you get an ID, you have to read and write the database once, which greatly affects performance and does not meet the rules of low latency and high QPS in distributed IDs (under high concurrency, if you go to the database to get id, that is very performance-affecting)

(3) Redis generates a global id strategy

Because Redis is a single-line inherently guaranteed atomicity, atomic operations INCR and INCRBY can be used to achieve

Note: In the case of Redis cluster, it is also necessary to set different growth steps like MySQL, and at the same time, the key must be set to a valid period

Redis Cluster can be used for higher throughput.

Suppose there are 5 Redis in a cluster. The values that can be initialized for each Redis are 1, 2, 3, 4, 5 respectively, and then the step size is 5.

The ID generated by each Redis is:

A: 1,6,11,16,21

B: 2,7,12,17,22

C: 3,8,13,18,23

D: 4,9,14,19,24

E: 5,10,15,20,25

# snowflake

(I. Overview

Official website: twitter-archive/snowflake: Snowflake is a network service for generating unique ID numbers at high scale with some simple guarantees. (github.com)

Twitter’s distributed self-incrementing ID algorithm

Features:

① Can be generated sequentially according to time

②The result of the id generated by the Snowflake algorithm is a 64bit integer, which is a Long type, and converted into a string with a maximum of 19 digits

③ There will be no ID collision in the distributed system (distinguished by datacenter and workerld) and the efficiency is high.

(2) Structure

bit means +-, id is generally positive, so it is fixed at 0

The range of timestamp bits is 0-2^41, which can be used for about 69 years (counting from 1970)

The maximum number of working process bits is 2^10, that is, 1024 nodes, including 5 datacenters and 5 workerlds for distinction

12bit-serial number, serial number, used to record different ids generated within the same millisecond. The largest positive integer that can be represented by 12 bits (bit) is 2^12-1=4059

(3) Code

/**
 *Twitter_Snowflake<br>
 * The structure of SnowFlake is as follows (each part is separated by -):<br>
 * 0 - 0000000000 0000000000 0000000000 0000000000 0 - 00000 - 00000 - 000000000000<br>
 * 1-bit identification, since the basic type of long is signed in Java, the highest bit is the sign bit, the positive number is 0, and the negative number is 1, so id is generally a positive number, and the highest bit is 0<br>
 * 41-bit time interval (millisecond level), note that the 41-bit time interval is not the time interval for storing the current time, but the difference between the storage time interval (current time interval - start time interval)
 * The value obtained), the start time here is generally the time when our id generator starts to use, which is specified by our program (see the startTime attribute of the IdWorker class in the program below). 41-bit time cut, can use 69 years, year T = (1L << 41) / (1000L * 60 * 60 * 24 * 365) = 69<br>
 * 10-digit data machine bits can be deployed on 1024 nodes, including 5-digit datacenterId and 5-digit workerId<br>
 * 12-bit sequence, counting within milliseconds, 12-bit counting sequence number supports each node to generate 4096 ID serial numbers per millisecond (same machine, same time cut)<br>
 * It adds up to exactly 64 bits, which is a Long type.<br>
 * The advantage of SnowFlake is that it is generally sorted according to time increment, and ID collisions will not occur in the entire distributed system (distinguished by data center ID and machine ID), and the efficiency is high. After testing, SnowFlake can generate About 260,000 IDs.
 */
public class SnowflakeIdWorker {
    // ================================Fields================== ============================
    /** Start time cutoff (2020-08-28) */
    private final long twepoch = 1598598185157L;
    /** The number of digits occupied by the machine id */
    private final long workerIdBits = 5L;
    /** The number of digits occupied by the data identifier id */
    private final long datacenterIdBits = 5L;
    /** The maximum supported machine id, the result is 31 (this shift algorithm can quickly calculate the maximum decimal number that can be represented by several binary numbers) */
    private final long maxWorkerId = -1L ^ (-1L << workerIdBits);
    /** The maximum supported data ID, the result is 31 */
    private final long maxDatacenterId = -1L ^ (-1L << datacenterIdBits);
    /** The number of digits in the id of the sequence */
    private final long sequenceBits = 12L;
    /** Machine ID shifted 12 bits to the left */
    private final long workerIdShift = sequenceBits;
    /** The data identifier id is shifted to the left by 17 bits (12 + 5) */
    private final long datacenterIdShift = sequenceBits + workerIdBits;
    /** Time cut to the left by 22 bits (5 + 5 + 12) */
    private final long timestampLeftShift = sequenceBits + workerIdBits + datacenterIdBits;
    /** Generate the sequence mask, here is 4095 (0b111111111111=0xfff=4095) */
    private final long sequenceMask = -1L ^ (-1L << sequenceBits);
    /** Working machine ID(0~31) */
    private long workerId;
    /** Data Center ID(0~31) */
    private long datacenterId;
    /** Sequence within milliseconds (0~4095) */
    private long sequence = 0L;
    /** The last time the ID was generated */
    private long lastTimestamp = -1L;
    //================================Constructors================== =====================
    /**
     * Constructor
     * @param workerId job ID (0~31)
     * @param datacenterId Data center ID (0~31) This method is to judge whether the incoming computer room number and machine number exceed the maximum value, that is, 31, or less than 0
     */
    public SnowflakeIdWorker(long workerId, long datacenterId) {
        if (workerId > maxWorkerId || workerId < 0) {
            throw new IllegalArgumentException(String.format("worker Id can't be greater than %d or less than 0", maxWorkerId));
        }
        if (datacenterId > maxDatacenterId || datacenterId < 0) {
            throw new IllegalArgumentException(String.format("datacenter Id can't be greater than %d or less than 0", maxDatacenterId));
        }
        this. workerId = workerId;
        this.datacenterId = datacenterId;
    }
    // ===============================Methods================= ===========================
    /*
     * Core method
     * Get the next ID (this method is thread-safe)
     * @return SnowflakeId
     */
    public synchronized long nextId() {
        //1. Get the current system time
        long timestamp = timeGen();
        //If the current time is less than the timestamp of the last ID generation, it means that the system clock has rolled back and an exception should be thrown
        if (timestamp < lastTimestamp) {
            throw new RuntimeException(
                    String.format("Clock moved backwards. Refusing to generate id for %d milliseconds", lastTimestamp - timestamp));
        }
        //If it is generated at the same time, perform a sequence within milliseconds
        if (lastTimestamp == timestamp) {
            // The sequence needs to be increased by 1, but it is necessary to prevent the sequence from exceeding the maximum value of 4095, so it must be bitwise ANDed with SEQUENCE_MASK
            // That is, if the sequence is equal to 4095 at this time, after adding 1, it will be 4096, and after bitwise ANDing with 4095, the result will be 0
            sequence = (sequence + 1) & sequenceMask;
            // sequence overflow in milliseconds
            if (sequence == 0) {
                //Block until the next millisecond, get a new timestamp
                timestamp = tilNextMillis(lastTimestamp);
            }
        }
        //The timestamp changes, the sequence resets within milliseconds
        else {
            sequence = 0L;
        }
        //The time when the ID was last generated
        //Assign the current time to lastTime, so that the next time you can judge whether it is within the same millisecond
        lastTimestamp = timestamp;
        //Shift and put together by OR operation to form a 64-bit ID
         long id = ((timestamp - twepoch) << timestampLeftShift) // subtract the default time from the timestamp and then shift left by 22 bits AND operation
                | (datacenterId << datacenterIdShift) // computer room number shifted left by 17 digits AND operation
                | (workerId << workerIdShift) // machine number left shifted 12 bits AND operation
                | sequence; // The serial number does not need to be shifted to the left, and it can be directly ANDed
        return id;
    }
    /**
     * Block until the next millisecond until a new timestamp is obtained
     * @param lastTimestamp The last time the ID was generated
     * @return current timestamp
     */
    protected long tilNextMillis(long lastTimestamp) {
        long timestamp = timeGen();
        while (timestamp <= lastTimestamp) {
            timestamp = timeGen();
        }
        return timestamp;
    }
    /**
     * Returns the current time in milliseconds
     * @return current time (milliseconds)
     */
    protected long timeGen() {
        return System. currentTimeMillis();
    }
    //================================Test================== ==============================
    /** test */
    public static void main(String[] args) {
        SnowflakeIdWorker idWorker = new SnowflakeIdWorker(0, 0);
        for (int i = 0; i < 1000; i ++ ) {
            long id = idWorker. nextId();
            System.out.println(id);
        }
    }
}

(4) Advantages and disadvantages

Advantages:

The number of milliseconds is in the high position, the self-increment sequence is in the low position, and the entire ID is increasing in trend. It does not rely on third-party systems such as databases, and is deployed as a service, with higher stability and high performance in generating IDs. Bits can be allocated according to their own business characteristics, which is very flexible.

Cons:

Relying on the machine clock, if the machine clock is dialed back, it will cause the duplicate ID generation to increase on a single machine, but due to the design of a distributed environment, the clocks on each machine cannot be completely synchronized, and sometimes there will be situations that are not globally incremented (This shortcoming can be considered insignificant. Generally, distributed IDs only require an increasing trend, not a grid requirement. 90% of the requirements only require an increasing trend.)

Solutions for Weaknesses

You can refer to the following two to synchronize the machine clock

Baidu open source distributed unique ID generator UidGenerator

Leaf–Meituan Dianping Distributed ID Generation System

The rollover accident caused by Lombok is too bad!

After JVM tuning, the effect is remarkable, and the performance is improved by 15%

Do you really know the function and principle of SpringBoot Starter?

An Online Fault: Thoughts on Database Connection Pool Leakage

Interviewer: What are the key points in designing a high-traffic and high-concurrency system?

What is the fuse mechanism of Spring Cloud microservices and the meaning of fuse?

Reply【Dry goods】Get selected dry goods video tutorials

Reply【Add group】Join the exchange group for solving difficult problems

Reply【mat】A detailed document tutorial for memory overflow problem analysis

Reply【Make money】Get a WeChat robot that can make money in java

Reply【Sideline】Get a programmer’s sideline strategy

Good article, please like + share

The knowledge points of the article match the official knowledge files, and you can further learn relevant knowledge algorithm skill treehome page overview 47440 people are learning