Disadvantages of UUID and Snowflake Algorithm

  • question

  • General universal solution

  • snowflake algorithm

Picture

picture

Foreword

The unique system ID is a problem we often encounter when designing a system, and we often struggle with this problem.

This article is to provide you with an idea for generating a distributed unique global ID generation solution. I hope it can help you.

If there are any shortcomings, please give me some advice! !

Question

Why we need distributed globally unique ID and the business requirements of distributed ID

In complex distributed systems, it is often necessary to uniquely identify a large amount of data and messages, such as finance, payment, catering, and hotels in Meituan-Dianping.

The data in the system of products such as Maoyan Movies is gradually growing. After the database is divided into tables, a unique ID is required to identify a piece of data or information;

Special Ian’s orders, riders, and coupons all need to be identified with a unique ID.

At this time, a system that can generate a globally unique ID is very necessary.

Picture

picture

Hard requirements for ID generation rules
  • Globally unique

  • Increasing trend

    • Clustered indexes are used in MySQL’s InnoDB engine. Since most RDBMS use the Btree data structure to store indexes, when selecting primary keys, we should try to use ordered primary keys to ensure writing performance.

  • monotonically increasing

    • Ensure that the next ID must be greater than the previous ID, such as transaction version number, IM incremental message, sorting and other special requirements

  • information security

    • If the ID is continuous, it will be very easy for malicious users to crawl. They can just download the specified URL in order. If it is an order number, it will be dangerous. Competitors can directly know our daily order volume, so in some application scenarios Next, the ID needs to be irregular and irregular, making it difficult for competitors to guess.

  • with timestamp

    • You can also quickly understand when the distributed ID was generated during development.

Availability requirements for ID number generation systems
  • High availability

    • When I issue a request to obtain a distributed ID, the server will ensure that it will create a unique distributed ID for me 99.999% of the time.

  • low latency

    • Send a request to obtain the distributed ID, the server must be fast, extremely fast

  • High QPS

    • For example, if 100,000 requests to create distributed IDs are sent simultaneously, the server must be able to withstand it and successfully create 100,000 distributed IDs at once.

General solutions

UUID

UUID.randomUUID(), the standard type of UUID contains 32 hexadecimal digits, divided into five segments with hyphens, 36 characters in the form of 8-4-4-4-12, performance Very high, locally generated, no network consumption.

There is a problem

The performance of entering the database is poor because UUID is unordered

Unordered, it is impossible to predict the order in which it is generated, and it cannot generate increasing order numbers.

First of all, distributed IDs are generally used as initial keys, but according to the official recommendation of MySQL, the primary key should be as short as possible. Each UUID is very long, so it is not recommended.

Primary key, when ID is used as the primary key, there will be some problems in certain circumstances

For example, in the scenario of DB primary key, UUID is very unsuitable. MySQL official has clear instructions.

Index, split of B + tree index

Since the distributed ID is the primary key, and the primary key contains the index, and the mysql index is implemented through the B + tree, every time new UUID data is inserted, in order to optimize the query, the B + tree underlying the index will be Modification, because UUID data is unordered, every insertion of UUID data will greatly modify the B + tree of the primary key. This is very bad. The insertion is completely unordered, which will not only cause some intermediate nodes to split. It will also create a lot of unsaturated nodes in vain, which greatly reduces the performance of database insertion.

UUID can only guarantee global uniqueness, does not meet the subsequent increasing trend, and increases monotonically.

Database auto-increment primary key
Single machine

In distributed systems, the main principle of the database’s self-increasing ID mechanism is: database self-increasing ID and mysql database’s replace into implementation, here the replace into and insert functions Similar, but the difference is: replace into first tries to insert into the data list. If it is found that this row of data already exists in the table (judged based on the primary key or unique index), it will be deleted first and then inserted. Otherwise, a new row will be inserted directly. data.

The meaning of REPLACE INTO is to insert a record. If the value of the unique index in the table encounters a conflict, the old data will be replaced.

Picture

picture

REPLACE into t_test(stub) values('b');
select LAST_INSERT_ID();

Every time we insert, we find that the original data will be replaced, and the ID will also increase.

This is enough

  • Incremental

  • Monotonicity

  • uniqueness

In a distributed situation and there is not much concurrency, this solution can be used to obtain a global unique ID.

Cluster distributed cluster

Is the database auto-increment ID mechanism suitable for distributed ID? The answer is not suitable

It is difficult to expand the system horizontally. For example, after defining the step size and the number of machines, what should you do if you want to add machines? Suppose there is a machine with the number: 1, 2, 3, 4, 5 (the step size is 1). At this time, you need to expand the capacity of one machine. You can do this: Set the initial value of the second machine to be much higher than the first machine. It seems to be okay, but if there are 100 machines online, how to expand the capacity at this time? It is simply a nightmare, so the system horizontal expansion plan is complicated and difficult to implement.

The database is still under a lot of pressure. Every time you get the ID, you have to read and write the database, which greatly affects performance and does not comply with the low latency and high QPS rules of distributed ID (under high concurrency, if you go to the database to get the ID, then It greatly affects performance)

Generate global ID strategy based on Redis
Stand-alone version

Because Redis is single-threaded and inherently guaranteed to be atomic, it can be implemented using the atomic operations INCR and INCRBY.

INCRBY: Set the growth step size

Cluster distributed

Note: In the case of Redis cluster, different growth steps need to be set like MySQL, and the key must have a validity period. You can use Redis cluster to obtain higher throughput.

Suppose there are 5 Redis in a cluster. You can initialize the values of each Redis to 1, 2, 3, 4, 5 respectively, and then set the step size to 5.

The IDs generated by each Redis are:

A:1 6 11 16 21
B:2 7 12 17 22
C:3 8 13 18 23
D:4 9 14 19 24
E:5 10 15 20 25

But the problem is that the maintenance and upkeep of the Redis cluster is troublesome, and the configuration is troublesome. Because a single point of failure needs to be set up, sentinels are on duty

But the main problem is that for one ID, the entire Redis cluster needs to be introduced, which feels like killing a chicken with a big knife.

Snowflake Algorithm

What is it

Twitter’s distributed self-increasing ID algorithm, Snowflake

Initially, Twitter migrated the storage system from MySQL to Cassandra (an open source distributed NoSQL database system developed by Facebook). Because Cassandra did not have a sequential ID generation mechanism, it developed such a globally unique ID generation service.

Twitter’s distributed snowflake algorithm SnowFlake has been tested and can generate 260,000 self-increasing and sortable IDs per second.

  • Twitter’s SnowFlake generated IDs can be generated in order by time

  • The result of the ID generated by the SnowFlake algorithm is a 64-bit integer, which is a Long type (the maximum length after conversion into a string is 19)

  • There will be no ID collision in the distributed system (distinguished by datacenter and workerID) and it is more efficient.

In distributed systems, there are some scenarios that require globally unique IDs. Basic requirements for generating IDs.

  • In a distributed environment, global uniqueness is required

  • Generally, monotonically increasing is required, because generally unique IDs will exist in the database, and the characteristic of InnoDB is to store content in leaf nodes on the primary key index, and increase from left to right. Considering the performance of the database, it is generally the most efficient way to generate IDs. Good is monotonically increasing. In order to prevent ID conflicts, 36-bit UUIDs can be used, but UUIDs have some disadvantages. First, they are relatively long, and UUIDs are generally unordered.

  • You may also need to have no rules, because if you use a unique ID as the order number, you need this kind of rule in order to prevent others from knowing the order volume in a day.

Structure

Several core components of the snowflake algorithm

Picture

picture

In Java, the 64-bit certificate is of long type, so the ID generated by the SnowFlake algorithm is stored in the long class.

Part 1

The highest bit in binary is the sign bit, 1 represents a negative number and 0 represents a positive number. The generated ID is generally an integer, so the highest bit is fixed to 0.

Part 2

The second part is the 41bit timestamp bit, used to record timestamps, millisecond level

41 bits can represent 2^41 -1 numbers

If it is only used to represent positive integers, the representable range is: 0 – 2^41 -1, minus 1 because the representable numerical range is calculated from 0, not from 1.

In other words, 41 bits can represent a value of 2^41 – 1 millisecond, which is 69.73 years when converted into unit years.

Part 3

The third part is the working machine ID, 10Bit is used to record the working machine ID

Can be deployed on 2^10 = 1024 nodes, including 5-digit datacenterId (data center, computer room) and 5-digit workerID (machine code)

The largest positive integer that can be represented by 5 digits is 2^5 = 31 numbers to represent different data centers and machine codes

Part 4

The positive integer that can be represented by 12 bits is 2^12 = 4095, that is, 0 1 2 … 4094 can be used to represent 4095 ID serial numbers generated by the same machine in the same timestamp.

SnowFlake can guarantee

All generated IDs trend upward over time

There will be no duplicate IDs in the entire distributed system because datacenterId and workerId are used to distinguish them.

Implementation

Snowflake algorithm is written by scala algorithm. Some people use java to implement it. Github address

https://github.com/beyondfengyu/SnowFlake/blob/master/SnowFlake.java

/**
 * Twitter's snowflake algorithm -- java implementation
 *
 * @author beyond
 */
public class SnowFlake {

    /**
     * Starting timestamp
     */
    private final static long START_STMP = 1480166465631L;

    /**
     * Number of bits occupied by each part
     */
    private final static long SEQUENCE_BIT = 12; //The number of digits occupied by the sequence number
    private final static long MACHINE_BIT = 5; //The number of bits occupied by the machine identification
    private final static long DATACENTER_BIT = 5;//The number of bits occupied by the data center

    /**
     * Maximum value of each part
     */
    private final static long MAX_DATACENTER_NUM = -1L ^ (-1L << DATACENTER_BIT);
    private final static long MAX_MACHINE_NUM = -1L ^ (-1L << MACHINE_BIT);
    private final static long MAX_SEQUENCE = -1L ^ (-1L << SEQUENCE_BIT);

    /**
     * Displacement of each part to the left
     */
    private final static long MACHINE_LEFT = SEQUENCE_BIT;
    private final static long DATACENTER_LEFT = SEQUENCE_BIT + MACHINE_BIT;
    private final static long TIMESTMP_LEFT = DATACENTER_LEFT + DATACENTER_BIT;

    private long datacenterId; //data center
    private long machineId; //machine identification
    private long sequence = 0L; //sequence number
    private long lastStmp = -1L;//Last timestamp

    public SnowFlake(long datacenterId, long machineId) {
        if (datacenterId > MAX_DATACENTER_NUM || datacenterId < 0) {
            throw new IllegalArgumentException("datacenterId can't be greater than MAX_DATACENTER_NUM or less than 0");
        }
        if (machineId > MAX_MACHINE_NUM || machineId < 0) {
            throw new IllegalArgumentException("machineId can't be greater than MAX_MACHINE_NUM or less than 0");
        }
        this.datacenterId = datacenterId;
        this.machineId = machineId;
    }

    /**
     * Generate next ID
     *
     * @return
     */
    public synchronized long nextId() {
        long currStmp = getNewstmp();
        if (currStmp < lastStmp) {
            throw new RuntimeException("Clock moved backwards. Refusing to generate id");
        }

        if (currStmp == lastStmp) {
            //In the same millisecond, the sequence number increases automatically
            sequence = (sequence + 1) & MAX_SEQUENCE;
            //The number of sequences in the same millisecond has reached the maximum
            if (sequence == 0L) {
                currStmp = getNextMill();
            }
        } else {
            // Within different milliseconds, the sequence number is set to 0
            sequence = 0L;
        }

        lastStmp = currStmp;

        return (currStmp - START_STMP) << TIMESTMP_LEFT //Time stamp part
                | datacenterId << DATACENTER_LEFT //Data center part
                | machineId << MACHINE_LEFT //Machine identification part
                | sequence; //serial number part
    }

    private long getNextMill() {
        long mill = getNewstmp();
        while (mill <= lastStmp) {
            mill = getNewstmp();
        }
        return mill;
    }

    private long getNewstmp() {
        return System.currentTimeMillis();
    }

    public static void main(String[] args) {
        SnowFlake snowFlake = new SnowFlake(2, 3);

        for (int i = 0; i < (1 << 12); i + + ) {
            System.out.println(snowFlake.nextId());
        }

    }
}
Project implementation experience

hutools toolkit

Address: https://github.com/looly/hutool

SpringBoot integrates snowflake algorithm

Introducing the hutool tool class

<dependency>
    <groupId>cn.hutool</groupId>
    <artifactId>hutool-all</artifactId>
    <version>5.3.1</version>
</dependency>

Integrate

/**
 * Snowflake algorithm
 *
 * @author: Moxi
 */
public class SnowFlakeDemo {
    private long workerId = 0;
    private long datacenterId = 1;
    private Snowflake snowFlake = IdUtil.createSnowflake(workerId, datacenterId);

    @PostConstruct
    public void init() {
        try {
            //Convert network ip to long
            workerId = NetUtil.ipv4ToLong(NetUtil.getLocalhostStr());
        } catch (Exception e) {
            e.printStackTrace();
        }
    }

    /**
     * Get Snowflake ID
     * @return
     */
    public synchronized long snowflakeId() {
        return this.snowFlake.nextId();
    }

    public synchronized long snowflakeId(long workerId, long datacenterId) {
        Snowflake snowflake = IdUtil.createSnowflake(workerId, datacenterId);
        return snowflake.nextId();
    }

    public static void main(String[] args) {
        SnowFlakeDemo snowFlakeDemo = new SnowFlakeDemo();
        for (int i = 0; i < 20; i + + ) {
            new Thread(() -> {
                System.out.println(snowFlakeDemo.snowflakeId());
            }, String.valueOf(i)).start();
        }
    }
}

got the answer

1251350711346790400
1251350711346790402
1251350711346790401
1251350711346790403
1251350711346790405
1251350711346790404
1251350711346790406
1251350711346790407
1251350711350984704
1251350711350984706
1251350711350984705
1251350711350984707
1251350711350984708
1251350711350984709
1251350711350984710
1251350711350984711
1251350711350984712
1251350711355179008
1251350711355179009
1251350711355179010
Advantages and Disadvantages
Advantages
  • The number of milliseconds is in the high dimension, the auto-increasing sequence is in the low dimension, and the entire ID has an increasing trend.

  • It does not rely on third-party systems such as databases and is deployed as a service. It has higher stability and the performance of generating IDs is also very high.

  • Bits can be allocated according to its own business characteristics, which is very flexible

Disadvantages
  • Depends on the machine clock. If the machine clock is set back, duplicate IDs will be generated.

  • It is incremental on a single machine, but due to the distributed environment, the clocks on each machine cannot be completely synchronized, and sometimes it does not increase globally. This shortcoming can be considered insignificant. Generally, distributed IDs only require trend increments. , and does not strictly require increments, 90% of the demands only require trend increments.

Other supplements
  • In order to solve the problem of clock dialback, which leads to duplicate IDs, someone later specially proposed a solution.

    • Baidu’s open source distributed unique ID generator UidGenerator

    • Leaf – Meituan-Dianping distributed ID generation system

The knowledge points of the article match the official knowledge files, and you can further learn related knowledge. Algorithm skill tree Home page Overview 56687 people are learning the system