“Redis” SkipList bottom layer implementation and application

References & Acknowledgments

Do you really understand the underlying data structure skiplist of ZSet in Redis? Riemann Chow

In-depth understanding of skip table and its application in Redis JD Cloud Developer

Bottom layer implementation of Redis jump table

Article directory

“Redis” SkipList bottom layer implementation and application
- @[toc]
- 1. What is a skip list (skiplist)
- 2. How to understand the jump table
- 3. Time complexity analysis of jump table
- 4. Implementation principle of jump table
- - structure definition
  - skiplist creation
  - skiplist insert node
  - skiplist update node
  - skiplist find node
- 5. Application of jump table in ZSet
- - Application of jump table in Redis
  - Implementation details of the jump table in ZSet
  - - The realization principle of the number of random layers
    - The average number of layers of skip table nodes
- 6. Summary

1. What is a skiplist

A skip list (skiplist) is an ordered data structure, which achieves the purpose of quickly accessing nodes by maintaining multiple pointers to other nodes in each node. Unlike data structures such as linked lists and dictionaries that are widely used in Redis, Redis only uses jump tables in two places, one is to implement ordered set keys, and the other is used as an internal data structure in cluster nodes. Apart from that, skip lists have no other purpose in Redis.

Let’s think about it, why do you choose skiplist for these two scenarios in Redis?

Since the jump list is an ordered data structure, let’s consider the situation of finding a specific element in the ordered sequence:

If the sequence is stored in a linear structure (array) that supports random access, then we can easily do it with binary search.
However, considering the efficiency of addition and deletion and memory scalability, in many cases, if you want to use a linear structure (linked list) that does not support random access for storage, you can only traverse from the beginning and compare one by one.
As a compromise, if a binary tree structure (BST) is used for storage, binary search can be performed without relying on the random access feature.

We know that the more orderly the insertion elements of ordinary BST are, the lower the efficiency is, and in the worst case, it will degenerate back to a linked list. Therefore, many bigwigs have proposed a self-balancing BST structure, so that its addition, deletion, and query operations maintain a time complexity of O(logn) under any circumstances. The representative of self-balancing BST is the AVL tree and its derived red-black tree. If it is promoted, it is not limited to binary trees, and the familiar B-trees and B+ trees also belong to this category, which are often used in file systems and databases.

Self-balancing BST is obviously very fragrant, but it still has a point that is not so fragrant: the self-balancing process of the tree is relatively complicated, and it is troublesome to implement. In the case of high concurrency, locking will also bring considerable overhead. For example, the AVL tree requires four rotation operations of LL, LR, RL, and RR to maintain balance, while the red-black tree requires three operations of left rotation, right rotation, and node color change. The animation below shows the balancing process of the AVL tree when inserting elements.

So, is there a simpler implementation method that is similar in efficiency to self-balancing BST? The answer is the jump table, and it is much simpler, let’s take a look at it below.

2. How to understand the jump table

For a singly linked list, even if the data stored in the linked list is ordered, if we want to find a certain data in it, we can only traverse the linked list from beginning to end. In this way, the search efficiency will be very low, and the time complexity will be very high, which is O(n).

So how to improve the search efficiency? Please see the diagram I have drawn below. In this linked list, every other node has an additional link pointing to its nodes in the first two positions in the list. Because of this, in the worst case, at most n/2 + 1 nodes. For example, if we want to check the node 90, according to the previous single-linked list search, we need 8 nodes, but now we only need 5 nodes.

Let’s extend this idea and get the following graph, where every 4th node has a link to the next 4th node in front of the node, and only n/4 + 1 nodes are examined.

Here we use the idea of mathematics to expand for generality. Every 2^ith node has a link to the next 2^ith node chain in front of this node. The total number of chains is simply doubled, but now at most logn nodes are considered in a lookup. It is not difficult to see that the total time consumption of a lookup is O(logn), because the lookup consists of a chain going forward to a new node or descending to a lower level at the same node. The total time consumption of each step during a lookup is at most O(logn). Note that the search in this data structure is basically a binary search (Binary Search).

I only gave two examples. Here you can imagine for yourself that when the length of the linked list is n, the search efficiency becomes more prominent.

The structure of this linked list plus multi-level index is a jump list. Next, let’s quantitatively analyze how fast it is to query with a jump table.

3. Time complexity analysis of skip table

We know that the time complexity of querying a certain data in a singly linked list is O(n). So what is the time complexity of querying a certain data in a jump table with multi-level indexes?

Let me decompose the problem, first look at such a problem, if there are n nodes in the linked list, how many levels of indexes will there be?

According to what we said above, the number of chain nodes in the first-level index is about n/2, the number of chain nodes in the second-level index is about n/4, and the number of chain nodes in the third-level index is about n /8, and so on, that is to say, the number of chain nodes of the k-th level index is 1/2 of the number of chain nodes of the k-1th level index, then the number of k-th level index nodes is n/( 2k).

Suppose the index has h levels and the highest level index has 2 nodes. Through the above formula, we can get n/(2h)=2, so as to obtain h=log2n-1. If the layer of the original linked list is included, the height of the entire jump list is log2n. When we query a certain data in the skip table, if each layer needs to traverse m nodes, then the time complexity of querying a data in the skip table is O(m*logn).

What is the value of this m? According to the previous index structure, we only need to traverse up to 3 nodes at each level of index, that is to say, m=3, why is it 3? Let me explain.

Suppose the data we are looking for is x, in the k-level index, after we traverse to the y node, we find that x is greater than y, and smaller than the subsequent node z, so we use the down pointer of y to descend from the k-level index to the k-1 level index. In the k-1th level index, there are only 3 nodes between y and z (including y and z), so we only need to traverse up to 3 nodes in the k-1 level index, and so on, each level Indexes only need to traverse up to 3 nodes.

Through the above analysis, we get m=3, so the time complexity of querying arbitrary data in the jump table is O(logn). The time complexity of this search is the same as binary search. In other words, we actually realized the binary search based on the singly linked list. The premise is that many levels of indexes are established, which is the design idea of exchanging space for time we mentioned.

Our time complexity is excellent, so what is the space complexity of the jump table?

In fact, in software development, we don’t have to care too much about the extra space occupied by the index. When talking about data structures and algorithms, we habitually regard the data to be processed as integers, but in actual software development, the original linked list may store large objects, and the index node only needs to store the key Values and several pointers do not need to store objects, so when the object is much larger than the index node, the extra space occupied by the index can be ignored.

4. Implementation principle of jump table

The skip table of Redis is defined by two structures redis.h/zskiplistNode and redis.h/zskiplist, where zskiplistNode is used to represent skip nodes, and zskiplist structure is used for Save the relevant information of the jump table nodes, such as the number of nodes and pointers to the head node and tail node, etc.

The leftmost of the above figure is the zskiplist structure, which contains the following attributes:

header: point to the header node of the jump table
tail: point to the tail node of the skip list
level: Record the number of layers of the node with the largest number of layers in the current jump table (the number of layers of the header node is not counted)
length: Record the length of the jump table, that is, the number of nodes currently contained in the jump table (the header node is not counted)

To the right of the zskiplist structure are four zskiplistNode structures that contain the following properties:

Level: Each level of the node is marked with L1, L2, L3, etc. in the node, L1 represents the first level, L2 represents the second level, and so on. Each layer comes with two properties: forward pointer and span. The forward pointer is used to access other nodes in the direction of the end of the table, and the span records the distance between the node pointed by the forward pointer and the current node.
Backward pointer: the backward pointer of the node identified by BW in the node, which points to the previous node located at the current node. The back pointer is used when the program traverses from the end of the list to the head of the list.
Score: 1.0, 2.0 and 3.0 in each node are the scores saved by the node. In the jump table, the nodes are arranged in ascending order according to the scores they save.
Member object (obj): o1, o2 and o3 in each node are member objects saved by the node.

Let’s take a look at how Redis implements skiplist.

Structure Definition

The structure definition of zskiplist:

The structure definition of zskiplistNode:

skiplist creation

What needs to be noted here is the constant ZSKIPLIST_MAXLEVEL, which defines the maximum number of layers of zskiplist, with a value of 32, which is why the node is only up to L32.

skiplist insert node

Its general execution process is as follows:

Find a suitable insertion position according to the search process mentioned above. Note that zset allows the scores to be the same, in which case they will be sorted according to the lexicographical order of the node data obj.
Call the zslRandomLevel() method to randomly get the level of the node to be inserted.
Call the zslCreateNode() method to create a new node according to the level, score and data obj.
Traverse each layer, modify the forward pointer forward and jump length span of the new node and its front and rear nodes, and also update the bottom backward pointer backward.
Two arrays update and rank are maintained. The update array is used to record the node whose last score of each layer is smaller than the score to be inserted, that is, the insertion position. The rank array is used to record the rank of the previous node at the insertion position, so as to update the span value at last.

5.4 skiplist delete node

skiplist update node

Both the update process and the insert process use the zadd method. First, it is judged whether the value exists. If it exists, it is the update process. If it does not exist, it is the insertion process. In the update process, if the Value is found, delete it first, and then add it. The disadvantage of this is that it will do two searches, which is slower in terms of performance.

In Redis 5.0 version, Antirez, the author of Redis, optimized the update process. The current update process is to judge whether the value exists, and if it exists, it will be updated directly, and then adjust the score sorting of the entire jump table, so that there is no need two searches.

skiplist find node

5. Application of jump table in ZSet

Generally, when discussing search problems, the first thing that comes to mind is the balanced tree and the hash table, but the data structure of the jump table is also very sharp, and its performance and implementation complexity can be compared with the red-black tree. In some scenarios, due to the red-black tree, from Invented in 1990, it is widely used in various scenarios, including data storage engines such as Redis and LevelDB, which will be introduced in detail later.

Application of jump table in Redis

The ZSet structure contains both a dictionary and a jump table, and the jump table saves all set elements from small to large by score. The dictionary holds the mapping from member to score. These two structures share the member and score of the same element through pointers, without wasting additional memory.

typedef struct zset {
    dict *dict;
    zskiplist *zsl;
} zset;

Dictionary and skip list layout in ZSet:

Implementation details of jump table in ZSet

The realization principle of the number of random layers

The skip table is a probabilistic data structure, and the number of insertion levels of elements is randomly specified. William Pugh described its calculation process in the paper as follows: specify the maximum layer number MaxLevel of the node, specify the probability p, and the default layer number lvl is 1

Generate a random number r from 0 to 1, if r

Repeat step 2 until the generated r >p, at this time lvl is the number of layers to be inserted.

Pseudocode for generating random layers in the paper:

The implementation of the jump table in Redis basically follows this idea, but there are minor differences. Take a look at the random source code src/z_set.c of Redis about the number of layers of the skip table:

/* Returns a random level for the new skiplist node we are going to create.
 * The return value of this function is between 1 and ZSKIPLIST_MAXLEVEL
 * (both inclusive), with a powerlaw-alike distribution where higher
 * levels are less likely to be returned. */
int zslRandomLevel(void) {<!-- -->
    int level = 1;
    while ((random() & amp;0xFFFF) < (ZSKIPLIST_P * 0xFFFF))
        level + = 1;
    return (level<ZSKIPLIST_MAXLEVEL) ? level : ZSKIPLIST_MAXLEVEL;
}

Two of the macros are defined in redis.h:

#define ZSKIPLIST_MAXLEVEL 32 /* Should be enough for 2^32 elements */
#define ZSKIPLIST_P 0.25 /* Skiplist P = 1/4 */

You can see in while:

(random() & amp;0xFFFF) < (ZSKIPLIST_P*0xFFFF)

When I saw this formula at first glance, I was a little surprised because it involved bit operations. I need to study why Antirez uses bit operations to write this way?

The initial guess is that random() returns a floating-point number [0-1], so I found a tool for converting floating-point numbers to binary online, and entered 0.5 to see the result:

It can be seen that the result of 0.5 32bit conversion hexadecimal is 0x3f000000, if it is ANDed with 0xFFFF, the result is still 0, which is not as expected.

In my impression, the math library of the C language does not seem to have a direct random function, so I went to the Redis source code to look for it, so I downloaded the 3.2 version of the code, and did not find the implementation of random(), but I found it in several other places. application:

The use of random() in dict.c:

The use of random() in cluster.c:

Seeing the modulo operation here, I realized with hindsight that random() is a floating point number of [0-1], but now it seems that it is uint32, so the formula of Antirez is easy to understand.

ZSKIPLIST_P*0xFFFF

Since ZSKIPLIST_P=0.25, it is equivalent to shifting 2 bits to the right of 0xFFFF to become 0x3FFF. Assuming random() is relatively uniform, after clearing the high 16 bits of 0xFFFF, the value of the low 16 bits will fall to 0x0000 Between -0xFFFF, the probability of while being true is only 1/4, more generally speaking, the probability of being true is 1/ZSKIPLIST_P.

The implementation of the number of random layers is not uniform, the important thing is the generation of random numbers, the generation code for the number of jump table layers in LevelDB is as follows:

template <typename Key, typename Value>
int SkipList<Key, Value>::randomLevel() {<!-- -->

  static const unsigned int kBranching = 4;
  int height = 1;
  while (height < kMaxLevel & amp; & amp; ((::Next(rnd_) % kBranching) == 0)) {<!-- -->
    height + + ;
  }
  assert(height > 0);
  assert(height <= kMaxLevel);
  return height;
}

uint32_t Next( uint32_t & amp; seed) {<!-- -->
  seed = seed & 0x7fffffffu;

  if (seed == 0 || seed == 2147483647L) {<!-- -->
    seed = 1;
  }
  static const uint32_t M = 2147483647L;
  static const uint64_t A = 16807;
  uint64_t product = seed * A;
  seed = static_cast<uint32_t>((product >> 31) + (product & M));
  if (seed > M) {<!-- -->
    seed -= M;
  }
  return seed;
}

It can be seen that leveldb uses random numbers and kBranching to take the modulus. If the value is 0, an additional layer is added. Although floating-point numbers are not used, the probability balance is also achieved.

The average number of layers of jump table nodes

We can easily see that the higher the number of node layers, the lower the probability of occurrence. In any case, the number of layers always satisfies the power law. The larger the number, the smaller the probability of occurrence.

If the frequency of something occurs in a power relationship with a certain property of it, then this frequency can be said to obey a power law.

The performance of the power law is that the frequency of a few events accounts for the majority of the overall frequency, while most of the remaining events only account for a small part of the overall frequency.

When the power law is applied to the random layers of the jump table, most of the node layers are yellow parts, and only a few are green parts, and the probability is very low.

The quantitative analysis is as follows:

The number of node layers is at least 1, and the number of node layers greater than 1 satisfies a probability distribution.
The probability that the number of node layers is exactly equal to 1 is p^0(1-p)
The probability that the number of node layers is exactly equal to 2 is p^1(1-p)
The probability that the number of node layers is exactly equal to 3 is p^2(1-p)
The probability that the number of node layers is exactly equal to 4 is p^3(1-p)

The probability that the number of layers of recursive nodes is exactly equal to K is p^(k-1)(1-p)

Therefore, if we require the average number of layers of nodes, then it is transformed into an expectation problem of probability distribution. The soul painter Dabai is online again:

In the table, P is the probability, and V is the corresponding value, which gives the possibility of all values and probabilities, so the expectation of this probability distribution can be calculated.

The formula in the square brackets is actually the geometric sequence learned in the first grade. Commonly used techniques are misplaced, subtracted and summed. From this, we can see that the expected value of the number of node layers is inversely proportional to 1-p.

For Redis, the expected number of node layers is 1.33 when p=0.25.

In the Redis source code, there is a detailed process of inserting, deleting, and adjusting the jump table. This article will not expand on it. The code is not difficult to understand. It is written in pure C without so many special effects. Feel free to read it boldly.

6. Summary

This article mainly describes the basic concept and simple principle of jump table, as well as related parts such as index node level, time and space complexity, etc., and does not involve probability balance and engineering implementation, and uses the underlying data structure zset in Redis as a typical application Let’s expand, and further see the practical application of the jump list.

The time complexity of the jump table is the same as that of the AVL tree and the red-black tree, which can reach O(logN). However, the AVL tree must maintain a height balance, and the red-black tree must maintain an approximate balance of height, which will lead to a delay when inserting or deleting nodes. Some time overhead, so compared to the AVL tree and red-black tree, the jump table saves the time overhead of maintaining a high balance, but correspondingly pays more space to store nodes of multiple layers, so the jump table A table is a data structure that trades space for time.

It should be noted that the principle, application, and implementation details of the jump list are also hot topics in the interview, and it is worth spending time to study and master.