You will understand the various implementations of distributed locks after reading this article!

Foreword

Word count: 10,000+

Reading time: 15 min

Keywords: distributed lock, Redis, Etcd, ZooKeeper

Today we are going to talk about distributed locks. There is a lot of related material online, but it is scattered. Having just finished studying the topic, I have summarized it here to share with you. The article covers a fair amount of ground, so let's first get an overview from the mind map.

What is a distributed lock

Distributed locks are a way to control synchronized access to shared resources across the nodes of a distributed system, maintaining consistency through mutual exclusion.

Before understanding distributed locks, first understand thread locks and process locks:

Thread lock: mainly used to lock methods and code blocks. When a method or code block is guarded by a lock, only one thread executes it at a time. Thread locks only take effect within the same JVM, because their implementation fundamentally relies on memory shared between threads, for example synchronized and Lock in Java.

Process lock: controls access by multiple processes on the same operating system to a shared resource. Because processes are independent and cannot directly access each other's memory, process locks cannot be implemented with thread-level constructs such as synchronized.

For example, Go's sync package provides basic synchronization primitives such as the mutex, sync.Mutex, as the sketch below shows.
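Here is a minimal Go sketch of a thread (goroutine) lock using sync.Mutex. Note that it only protects access within a single process:

// A mutex guarding a counter shared by goroutines inside ONE process
package main

import (
    "fmt"
    "sync"
)

func main() {
    var mu sync.Mutex
    var wg sync.WaitGroup
    counter := 0

    for i := 0; i < 100; i++ {
        wg.Add(1)
        go func() {
            defer wg.Done()
            mu.Lock() // only one goroutine enters at a time
            counter++
            mu.Unlock()
        }()
    }
    wg.Wait()
    fmt.Println(counter) // always 100; without the mutex this would be racy
}

Run two copies of this program, though, and the two mutexes know nothing about each other, which is exactly the limitation described next.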

Both of the above are suitable for a monolithic architecture. In a distributed system, however, multiple service instances run as separate processes on different machines, so when they compete for a resource, locks that only guard node-local resources are useless.

This is where distributed locks come in: they control access to shared resources by multiple processes across a distributed system. In short, distributed locks solve the distributed mutual-exclusion problem!

Characteristics of distributed locks

Mutually exclusive

Mutual exclusion is easy to understand and is the most basic requirement: at any moment, only one client can hold the lock; no two clients may hold it at the same time.

Avoid deadlock

Why does a deadlock occur? If the client that acquired the lock fails to release it for some reason (the machine goes down, etc.), no other client can ever acquire it, and the whole process grinds to a halt.

Faced with this situation, of course there is a solution!

Introduce an expiration time: we usually set a TTL (Time To Live) to avoid deadlock, but this alone cannot fully solve the problem.

  1. Suppose the TTL is 5 seconds, and process A acquires the lock.
  2. Process A fails to finish within 5 seconds, so the system automatically releases the lock, and process B acquires it.
  3. At the 6th second, process A finishes and releases the lock, but the lock it releases now belongs to process B.

So adding only an expiration time introduces two problems: premature lock expiration, and releasing someone else's lock.

Attach a unique identity to the lock: to avoid releasing someone else's lock, each client process can set a unique ID as the lock's value and check that ID at the application layer before releasing.

Automatic renewal: the premature-expiration problem arises because we cannot estimate how long the lock will be held. Set the TTL too short and the lock may expire early; set it too long and, after a failure, the lock may not be freed for a long time.

There is a way to handle this too: start a daemon (a watchdog) that monitors the expiration time and renews the lease. In the Java stack, for example, Redisson handles this for you.

Reentrancy

What happens if a thread acquires the lock, but while executing, tries to acquire the lock again?

With an ordinary, non-reentrant lock, the thread ends up waiting for itself: it holds the lock while blocking on it, occupying the lock resource and deadlocking.

Let’s define [reentrancy]: the same thread can acquire the same lock multiple times while holding it, without causing deadlock. That is, a thread can re-acquire a lock it already holds without waiting for the lock to be released.

Solution: to make a Redis distributed lock reentrant, for example, use a Redis Lua script with a reference counter, which guarantees correct reentrant behavior for the same thread, as sketched below.
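Here is a minimal sketch of that idea, loosely modeled on the hash-plus-counter approach popularized by Redisson. It assumes the go-redis v9 client (github.com/redis/go-redis/v9); the key, field, and function names are illustrative:

import (
    "context"

    "github.com/redis/go-redis/v9"
)

// Acquire: if the lock is free, or already held by this client,
// bump the client's counter and refresh the TTL
var acquireScript = `
if redis.call("EXISTS", KEYS[1]) == 0
   or redis.call("HEXISTS", KEYS[1], ARGV[1]) == 1
then
    redis.call("HINCRBY", KEYS[1], ARGV[1], 1)
    redis.call("PEXPIRE", KEYS[1], ARGV[2])
    return 1
end
return 0
`

func tryLock(ctx context.Context, rdb *redis.Client, key, clientID string, ttlMs int) (bool, error) {
    res, err := rdb.Eval(ctx, acquireScript, []string{key}, clientID, ttlMs).Int()
    return res == 1, err
}

Release mirrors this: decrement the counter with HINCRBY ... -1, and DEL the key only once the counter reaches 0.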

Fault tolerance

Fault tolerance means that when some nodes (Redis nodes, etc.) go down, clients can still acquire and release locks. Broadly, there are two approaches:

One approach, taken by etcd/ZooKeeper, is a lock service that is itself a cluster and can fail over automatically. The other is to run multiple independent lock services: the client requests several of them, and if one lock service fails, lock information can still be obtained from the others. The drawback is obvious: the client must talk to multiple lock services.

Categories

This article describes four implementations of distributed locks. By acquisition style they fall into two types: spinning and watch-based monitoring.

Spin mode

In the database-based and Etcd-based implementations, a client that has not obtained the lock enters a loop and keeps retrying until it succeeds or times out.

Monitoring method

Here the client only needs to Watch a certain key; when the lock becomes available, the client is notified, so no repeated polling is needed. The ZooKeeper-based and Etcd-based distributed locks use this approach.

Implementation methods

Distributed locks can be implemented with a database, with a cache such as Redis, with ZooKeeper, with Etcd, and so on. This article walks through these implementations alongside the problems they must solve!

Based on MySQL

Are you a little puzzled by using a database table to implement a distributed lock? I had the same doubts when researching this article. Although this method is not recommended, it is worth understanding as a possible solution. Let's see how it is done:

For example, create a table in the database containing a method-name field, with a unique index on that field. To execute a method, insert a record with that method name into the table: if the insert succeeds, you hold the lock; deleting the row releases it.

//Lock record table
CREATE TABLE `lock_info` (
  `id` int(11) unsigned NOT NULL AUTO_INCREMENT COMMENT 'primary key',
  `method_name` varchar(64) NOT NULL COMMENT 'method name',
  PRIMARY KEY (`id`),
  UNIQUE KEY `idx_method_name` (`method_name`)
) ENGINE=InnoDB;

The key point is the unique index on the method-name field, which guarantees the uniqueness of the record; releasing the lock is simply deleting the record. A sketch of both operations follows.
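A minimal Go sketch of acquire/release against this table, assuming database/sql with the go-sql-driver/mysql driver (the function names are illustrative):

import (
    "database/sql"

    _ "github.com/go-sql-driver/mysql" // MySQL driver
)

// Acquire: the unique index on method_name lets exactly one caller insert the row
func acquire(db *sql.DB, method string) bool {
    _, err := db.Exec("INSERT INTO lock_info (method_name) VALUES (?)", method)
    return err == nil // a duplicate-key error means someone else holds the lock
}

// Release: delete the row so the next caller's INSERT can succeed
func release(db *sql.DB, method string) error {
    _, err := db.Exec("DELETE FROM lock_info WHERE method_name = ?", method)
    return err
}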

There are also many disadvantages:

  1. The database is a single point, so the lock depends heavily on database availability
  2. The TTL must be maintained separately
  3. Database reads and writes are slow under high concurrency

No need for more words here; in practice, distributed locks are almost always implemented with in-memory stores.

Based on Redis

Interviewer: "Do you know about distributed locks?" Most candidates will answer with the Redis implementation. OK, let's get to the point: [distributed locks based on Redis].

For a Redis distributed lock, can I just use the SETNX command and then set an expiration time?

SETNX lKey lValue
EXPIRE lKey 30

Usually this works, but there is a problem: although SETNX itself is atomic, SETNX + EXPIRE together are not. They execute as two steps, so [locking and setting the timeout] are separate operations. If EXPIRE fails, the lock will never be released.

The reason both locking and a timeout are needed was covered above under [Avoid deadlock]; reread it if anything is unclear.

What is the correct locking command for Redis?

//Ensure atomic execution of commands
SET lKey randId NX PX 30000

randId is a random string generated by the client, unique per lock acquisition; its main purpose is to avoid releasing someone else's lock, as in the sketch below.
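In Go this is a single SetNX call, one atomic command that sets the value and the TTL together. A sketch assuming go-redis v9 and github.com/google/uuid for the random ID (the key name and TTL are illustrative):

import (
    "context"
    "time"

    "github.com/google/uuid"
    "github.com/redis/go-redis/v9"
)

// lock issues SET lKey randId NX PX 30000 as one atomic command
func lock(ctx context.Context, rdb *redis.Client) (randId string, ok bool, err error) {
    randId = uuid.NewString() // unique per client, checked again on release
    ok, err = rdb.SetNX(ctx, "lKey", randId, 30*time.Second).Result()
    return
}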

Let's walk through a problematic sequence of events:

  1. Client1 acquires the lock successfully.
  2. Client1’s business logic takes too long, the lock's expiration time is reached, and the lock is automatically released.
  3. Client2 acquires the lock for the same resource.
  4. Client1 finishes its business logic and releases the lock, but it actually releases the lock now held by Client2.
  5. Client3 can then acquire the lock while Client2 still believes it holds it, and everything is a mess.

The randId prevents this: when releasing, the client first gets the lock's value (randId) and deletes the key only if the value matches its own.

// Check then delete: note these are two separate commands
if (redis.get(lKey).equals(randId)) {
    redis.del(lKey);
}

Locking required atomicity; how do we make releasing the lock atomic too?

This is a good question. We used an atomic command when locking precisely to avoid the case where setting the expiration fails. But releasing with GET + DEL is again two separate commands, and between them the lock can change hands, so the problem of releasing someone else's lock is still there.

My head is buzzing, why are there so many cases to consider? Take a break if you're tired... no, let's keep reading!

The root of the problem is that the lock is checked on the client but released on the server:

Therefore, both the check and the deletion should happen on the Redis server. We can use a Lua script to guarantee atomicity: write the core release logic (GET, compare, DEL) into a Lua script and let Redis execute it, which makes all three steps atomic.

// Release it after judging that the lock belongs to you
if redis.call("GET",KEYS[1]) == ARGV[1]
then
    return redis.call("DEL",KEYS[1])
else
    return 0
end
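From Go, this script runs as a single Eval call, so the three steps cannot be interleaved with another client's commands. Again a go-redis v9 sketch; the unlock name is illustrative:

import (
    "context"

    "github.com/redis/go-redis/v9"
)

// GET + compare + DEL all execute inside Redis as one atomic script
var releaseScript = `
if redis.call("GET",KEYS[1]) == ARGV[1]
then
    return redis.call("DEL",KEYS[1])
else
    return 0
end
`

func unlock(ctx context.Context, rdb *redis.Client, randId string) (bool, error) {
    res, err := rdb.Eval(ctx, releaseScript, []string{"lKey"}, randId).Int()
    return res == 1, err
}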

What if Client1 acquires the lock but its business logic takes longer than the lock's expiration time?

Since execution can outlive the TTL, we can renew the lock: start a daemon that periodically checks the lock's expiration time and, when it is about to expire, automatically extends it.

This is what the Redisson framework's WatchDog does: the watchdog mechanism is enabled when you lock without specifying a lease time. The default lease is 30 seconds, and every 10 seconds the watchdog checks whether the lock is still held; if so, it resets the TTL back to 30 seconds. Renewal stops once the lock is released or the client process dies.
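Outside the Java stack you can sketch the same watchdog yourself: a goroutine that, while the lock is held, keeps pushing the TTL back out, extending it only if the lock still belongs to us. This is a hedged illustration of the mechanism, not Redisson's actual code (go-redis v9 assumed; names are illustrative):

import (
    "context"
    "time"

    "github.com/redis/go-redis/v9"
)

// Renew the TTL only if the lock still holds our randId
var renewScript = `
if redis.call("GET",KEYS[1]) == ARGV[1]
then
    return redis.call("PEXPIRE",KEYS[1],ARGV[2])
else
    return 0
end
`

// watchdog resets the TTL to 30s every 10s until ctx is cancelled
func watchdog(ctx context.Context, rdb *redis.Client, key, randId string) {
    ticker := time.NewTicker(10 * time.Second)
    defer ticker.Stop()
    for {
        select {
        case <-ctx.Done(): // lock released: stop renewing
            return
        case <-ticker.C:
            rdb.Eval(ctx, renewScript, []string{key}, randId, 30000)
        }
    }
}

Cancel the context right after releasing the lock, so renewal stops at the same moment.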

Hmm, this should be more stable!

Hehe, all of the above are problems that can arise even when locking on a "single" Redis instance, and indeed a single-node distributed lock meets most needs. In practice, however, [Redis Cluster] or [Sentinel mode] is usually deployed for high availability, and that introduces master-slave synchronization problems.

Imagine this scenario:

  1. Client1 locks successfully on the Master
  2. The Master crashes before the lock information has been synchronized to the slave (master-slave replication is asynchronous)
  3. Sentinel promotes the slave Slave1 to be the new master, and the lock information does not exist on the new master (it was never synchronized to Slave1)

Facing this problem, the author of Redis proposed a solution called Redlock, an implementation based on multiple independent Redis nodes (all masters). The scheme rests on two premises:

  1. Slave instances and sentinels are no longer deployed; only master instances are used
  2. Multiple masters are required, and the official recommendation is at least 5 instances

Redlock locking process:

  1. The Client first records the "current timestamp T1"
  2. The Client sends locking requests to the five Redis instances in sequence (using the SET command above), each with a request timeout (millisecond level, much shorter than the lock's validity time). If locking on one instance fails (network timeout, lock held by someone else, or any other abnormal situation), it immediately moves on to the next Redis instance.
  3. If the Client locks successfully on >= 3 instances (a majority), it records the "current timestamp T2". If T2 - T1 is less than the lock's validity time, the lock is considered acquired; otherwise acquisition has failed.
  4. After locking successfully, the Client operates on the shared resource (modifying a MySQL row, initiating an API request, etc.)
  5. If locking fails, the Client sends a release request to "all nodes" (the Lua release script above)

Redlock releases the lock:

The client initiates a lock release operation to all Redis nodes
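A condensed Go sketch of the whole flow, including the elapsed-time check from step 3. This is illustrative only; in production, libraries such as go-redsync are the usual choice. It reuses the releaseScript from the unlock sketch above:

import (
    "context"
    "time"

    "github.com/redis/go-redis/v9"
)

// tryRedlock locks on every master; the lock holds only if a majority
// succeeded AND the total time spent stayed under the lock's validity.
func tryRedlock(ctx context.Context, masters []*redis.Client, key, randId string, ttl time.Duration) bool {
    start := time.Now() // "current timestamp T1"
    n := 0
    for _, rdb := range masters {
        // per-instance timeout, much shorter than the lock's validity
        reqCtx, cancel := context.WithTimeout(ctx, 50*time.Millisecond)
        if ok, err := rdb.SetNX(reqCtx, key, randId, ttl).Result(); err == nil && ok {
            n++
        }
        cancel()
    }
    if n >= len(masters)/2+1 && time.Since(start) < ttl { // the T2 - T1 check
        return true
    }
    // Failed: release on ALL nodes to clear any partial or unconfirmed locks
    for _, rdb := range masters {
        rdb.Eval(ctx, releaseScript, []string{key}, randId)
    }
    return false
}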

Question 1: Why lock on multiple instances?

Essentially, for fault tolerance. The master nodes together form a distributed system, and in a distributed system some nodes will always misbehave. By locking on multiple instances, even if a few instances crash, the lock still succeeds on the remaining ones and the whole lock service stays available!

Question 2: Why is it necessary to calculate the cumulative locking time after the locking is successful in step 3?

The locking operation targets multiple nodes, so it inevitably takes longer than with a single instance, and network delay, packet loss, and timeouts must all be considered: the more network requests, the higher the probability of an anomaly.

So even if N/2 + 1 nodes locked successfully, the lock is meaningless if the cumulative time spent locking already exceeds the lock's expiration time.

Question 3: Why do we need to operate all nodes to release the lock?

Mainly to ensure that residual locks caused by node abnormalities are cleared!

For example: a lock request to some Redis node may appear to fail for "network reasons".

Or the client actually locks a Redis instance successfully, but the response is lost to a network problem while being read; the lock nevertheless exists on that Redis.

Therefore, when releasing, the client must release on all nodes, regardless of whether it believes it locked them successfully.

There is a debate about the security of Redlock. I will briefly mention it here. If you are interested, you can take a look:

Java Interview 365: RedLock Security Debate (Part 1)

Based on Etcd

Etcd is a highly reliable key-value store written in Go. It often holds critical data in distributed systems and is typically used for configuration centers, service discovery and registration, distributed locks, and similar scenarios.

Here we look at Etcd from the distributed-lock perspective: how does Etcd implement one? Let's Go!

Etcd feature introduction:

  • Lease mechanism: Lease mechanism (TTL, Time To Live), etcd can set a lease for the stored kv pair. When the lease expires, the kv will be invalid and deleted; it also supports renewal and keepalive.
  • Revision mechanism: each key has a Revision attribute. Every etcd transaction increments the global Revision by 1, so each key's Revision is globally unique; comparing Revisions reveals the order in which writes occurred.
  • When implementing distributed locks, multiple clients contending for the lock can be granted it in order of their Revision values, which avoids the "thundering herd" effect and yields a fair lock.
  • Prefix mechanism: also called directory mechanism, all keys and their corresponding attribute values in the directory can be obtained based on the prefix.
  • Watch mechanism: watch supports watching a fixed key or a prefix directory. When the watch key changes, the client will receive a notification.

Why do these features allow Etcd to implement distributed locks? Because these features of Etcd can meet the following requirements for implementing distributed locks:

  • Lease mechanism (Lease): used to support the automatic release of locks under abnormal circumstances
  • Prefix and Revision mechanism: used to support the ability to acquire locks fairly and queue up.
  • Monitoring mechanism (Watch): used to support lock grabbing capabilities
  • Cluster mode: used to support high availability of lock services

With this theory in hand, let's see how Etcd implements a distributed lock. Since I am a Go developer myself, there is some code to go with it.

Look at the process first, and then combine it with code comments!

package main

import (
    "context"
    "fmt"
    "time"

    clientv3 "go.etcd.io/etcd/client/v3"
)

func main() {
    config := clientv3.Config{
        Endpoints:   []string{"xxx.xxx.xxx.xxx:2379"},
        DialTimeout: 5 * time.Second,
    }
 
    // Get client connection
    client, err := clientv3.New(config)
    if err != nil {
        fmt.Println(err)
        return
    }
 
    // 1. Lock (create a lease, automatically renew the lease, and use the lease to seize a key)
    // Used to apply for a lease
    lease := clientv3.NewLease(client)
 
    //Apply for a 10s lease
    leaseGrantResp, err := lease.Grant(context.TODO(), 10) //10s
    if err != nil {
        fmt.Println(err)
        return
    }
 
    // Get the id of the lease
    leaseID := leaseGrantResp.ID
 
    // Prepare a context for canceling lease renewal
    ctx, cancelFunc := context.WithCancel(context.TODO())
 
    // Ensure that automatic lease renewal will stop after the function exits
    defer cancelFunc()
    // Ensure that the lease is revoked after the function exits
    defer lease.Revoke(context.TODO(), leaseID)
 
    // Automatically renew the lease
    keepRespChan, err := lease.KeepAlive(ctx, leaseID)
    if err != nil {
        fmt.Println(err)
        return
    }
 
    // Goroutine that handles lease keep-alive responses
    go func() {
        for {
            keepResp := <-keepRespChan
            if keepResp == nil {
                fmt.Println("lease has expired")
                return
            }
            // A keep-alive response arrives periodically (roughly every TTL/3)
            fmt.Println("Received automatic lease renewal response", keepResp.ID)
        }
    }()
 
    // if key does not exist, then set it, else fails to grab the lock
    kv := clientv3.NewKV(client)
    //Create transaction
    txn := kv.Txn(context.TODO())
    // If the key does not exist (CreateRevision == 0), grab the lock
    txn.If(clientv3.Compare(clientv3.CreateRevision("/cron/jobs/job7"), "=", 0)).
        Then(clientv3.OpPut("/cron/jobs/job7", "", clientv3.WithLease(leaseID))).
        Else(clientv3.OpGet("/cron/jobs/job7")) // the key already exists
 
    // Submit transaction
    txnResp, err := txn.Commit()
    if err != nil {
        fmt.Println(err)
        return
    }
 
    // Determine whether the lock has been grabbed
    if !txnResp.Succeeded {
        fmt.Println("The lock is occupied:", string(txnResp.Responses[0].GetResponseRange().Kvs[0].Value))
        return
    }
 
    // 2. Process business (locked, very safe)
 
    fmt.Println("Processing task")
    time.Sleep(5 * time.Second)
 
    // 3. Release the lock (cancel automatic lease renewal and release the lease)
    // defer will cancel the lease renewal and release the lock
}

That said, the concurrency package shipped with clientv3 already implements a distributed lock, so we can do this much more conveniently; the internal logic is similar:

  1. First, concurrency.NewSession creates a Session object
  2. The Session then creates a Mutex object via concurrency.NewMutex
  3. Locking and unlocking call Lock and Unlock respectively, as in the sketch below
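A sketch of those three steps (the lock prefix is illustrative):

import (
    "context"
    "fmt"

    clientv3 "go.etcd.io/etcd/client/v3"
    "go.etcd.io/etcd/client/v3/concurrency"
)

func lockWithConcurrency(client *clientv3.Client) error {
    // 1. The Session creates a lease and keeps it alive in the background
    session, err := concurrency.NewSession(client)
    if err != nil {
        return err
    }
    defer session.Close()

    // 2. A Mutex over a key prefix
    mutex := concurrency.NewMutex(session, "/my-lock/")

    // 3. Lock blocks (via Watch) until the lock is acquired
    if err := mutex.Lock(context.TODO()); err != nil {
        return err
    }
    fmt.Println("got the lock, processing task")
    // ... business logic ...
    return mutex.Unlock(context.TODO())
}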

Based on ZooKeeper

ZooKeeper stores its data in a tree structure composed of nodes called znodes.

The locking/releasing process is as follows:

  1. Clients try to create a znode, e.g. /lock. The first to arrive, say Client1, creates it successfully, which is equivalent to acquiring the lock.
  2. Other clients fail to create it (the znode already exists) and thus fail to acquire the lock.
  3. Client2 can enter a waiting state; when the /lock node is deleted, ZooKeeper notifies it through the watch mechanism.
  4. After Client1 finishes with the shared resource, it deletes the znode, releasing the lock.
  5. Client2 then completes its own lock acquisition.

ZooKeeper needs no expiration time because it uses [ephemeral nodes]. After a client takes the lock, it holds it for as long as its connection is maintained; if the client crashes, the corresponding ephemeral znode is deleted automatically, guaranteeing the lock is released. A minimal Go sketch follows.
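This sketch assumes the github.com/go-zookeeper/zk client; the server address and node path are illustrative:

package main

import (
    "fmt"
    "time"

    "github.com/go-zookeeper/zk"
)

func main() {
    conn, _, err := zk.Connect([]string{"127.0.0.1:2181"}, 5*time.Second)
    if err != nil {
        panic(err)
    }
    defer conn.Close()

    for {
        // Creating the EPHEMERAL node == acquiring the lock; the node
        // disappears automatically if our session dies
        _, err := conn.Create("/lock", []byte{}, zk.FlagEphemeral, zk.WorldACL(zk.PermAll))
        if err == nil {
            break // we hold the lock
        }
        // Node already exists: watch it and block until it changes
        exists, _, ch, werr := conn.ExistsW("/lock")
        if werr != nil {
            panic(werr)
        }
        if exists {
            <-ch // notified when /lock is deleted, then retry Create
        }
    }

    fmt.Println("got the lock, processing task")
    // Release: delete the znode (-1 matches any version)
    if err := conn.Delete("/lock", -1); err != nil {
        fmt.Println(err)
    }
}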

How does ZooKeeper detect that a client has crashed?

Each client maintains a Session with ZooKeeper, kept alive by periodic heartbeats.

If ZooKeeper does not receive a client's heartbeat for too long, it considers the Session expired and deletes that session's ephemeral nodes.

Of course this is not a perfect solution

In the following scenario, Client1 and Client2 may hold the lock at the same time during a window of time:

  1. Client 1 creates the znode /lock and acquires the lock.
  2. Client 1 enters a long GC pause (or the network drops, or the heartbeat thread on the zk side stalls, etc.).
  3. Client 1's ZooKeeper Session expires, and the znode /lock is automatically deleted.
  4. Client 2 creates the znode /lock, thereby acquiring the lock.
  5. Client 1 recovers from the GC pause and still believes it holds the lock.

Okay, now let's summarize ZooKeeper's advantages and disadvantages as a distributed lock:

Advantages of ZooKeeper:

  1. No need to consider the lock's expiration time, which makes it more convenient to use.
  2. The watch mechanism: if locking fails, a client can watch the key and wait for the lock's release, implementing optimistic locking.

Disadvantages:

  1. Performance is not as good as Redis
  2. High deployment and operation costs
  3. If a client loses contact with ZooKeeper for too long, its lock is released (the GC-pause scenario above).

Summary

This article covers a lot of ground and touches many knowledge points. If it doesn't all sink in on the first read, bookmark it and come back a few times to build up a mental picture of distributed locks.

To summarize: this article surveyed distributed locks and how to use them. There are many ways to implement one.

Database: a lock is represented by a unique record; if inserting the record succeeds, the lock is acquired, and deleting the record releases it. Performance bottlenecks appear easily, however, so databases are rarely used for distributed locks.

Redis: provides efficient operations for acquiring and releasing locks, and combined with Lua scripts, Redisson, and so on, handles abnormal cases well. Because it is memory-based, read and write performance is very high.

Etcd: its lease, watch, and revision mechanisms offer a straightforward way to build distributed locks. Cluster mode lets Etcd handle heavy read/write load with excellent performance, but configuration is complex and consistency issues still exist.

ZooKeeper: uses znodes and sessions to implement distributed locks; there is no expiration time to set, and lock release in abnormal situations is handled automatically.

If your business data is highly sensitive, keep this in mind when using distributed locks: you can never assume a distributed lock is 100% safe.

Of course, match the choice to your own business. In most cases we still reach for Redis as the distributed lock: it is familiar, it performs well, and it has mature ways of handling abnormal situations. I think it satisfies the majority of business scenarios.

Thank you for reading to the end, I hope this article is helpful to you~

Likes, bookmarks, and follows (the "triple combo") are all welcome~

I will continue to work hard~

References:

https://mp.weixin.qq.com/s/Fkga3KaU0fBv5zXM-b8JhA

https://zhuanlan.zhihu.com/p/378797329

https://www.cnblogs.com/aganippe/p/