Redis data structure ziplist

Foreword

In order to improve memory efficiency, Redis designed a special data structure ziplist (compressed list). The essence of ziplist is a byte array, which adopts a compact and continuous storage format, which can effectively compress data and improve memory efficiency.

When the data volume of hash and zset is relatively small, ziplist will be used first to store data. Otherwise, a large number of pointers will be allocated, and the pointers themselves will take up space and increase the fragmentation rate of memory. Taking hash as an example, you can use the following configuration to use compressed list storage when the number of Entry is less than 512 and the data length is less than 64, otherwise use hash table storage.

hash-max-ziplist-entries=512
hash-max-ziplist-value=64

ZipList

Advantages of ziplist:

  • High memory efficiency, no need to allocate additional pointers, data is stored compactly together
  • Efficient sequential access because it is a continuous memory space
  • Reduce memory fragmentation

Of course, ziplist also has some shortcomings, otherwise Redis would not give up on it when the amount of data becomes large.

  • Search time complexity O(N)
  • Memory reallocation, the ziplist length is fixed and cannot be dynamically expanded. You can only re-apply for a piece of memory.
  • Chain update problem

The shortcomings of ziplist are still obvious, but you cannot use it just because of its shortcomings. When the amount of data is small and there are not many write operations, it can indeed save memory. Isn’t it the design philosophy of Redis to use different data structures to store according to the characteristics of the data?

The layout of ziplist is described in the source code as follows. All data is compactly arranged:

<zlbytes> <zltail> <zllen> <entry> <entry> ... <entry> <zlend>

td>

Properties Type Length Description
zlbytes uint32_t 4 bytes ziplist occupied bytes
zltail uint32_t 4 bytes The distance between the tail node and the starting address
zllen uint16_t 2 bytes Number of nodes
entry Variable length Node
zlend uint8_t 1 byte End character 0xff

Because the location of the tail node is recorded, ziplist can quickly find the head and tail nodes, but it needs to be traversed for the intermediate nodes.

Entry represents the elements in the ziplist and consists of three parts:

<previous_entry_length> <encoding> <content>
  • previous_entry_length: The length of the previous Entry, using variable length design, occupying 1 or 5 bytes. If the length of the previous Entry is less than 254, use 1 byte to store it, otherwise use 5 bytes to store it.
  • encoding: the encoding type of the data, recording the type and length of the data. The high-order bits starting with 1 represent numbers, and the high-order bits starting with 0 represent byte arrays.
  • content: the value of the storage node

image.png
Redis will encode encoding and content according to the type of value, for example:

  • Starting with encoding 1111, the last 4 bits represent an integer value, and there is no content part.
  • Starting with encoding 11111110, it means content stores an 8-bit integer value
  • Starting with encoding 11000000, it means content stores a 16-bit integer value
  • encoding starts with 00, the last 6 bits indicate the length, and content is a byte array with a maximum length of 63
  • encoding 01 starts, the last 6 bits + 1 byte represent the length, content is a byte array with a maximum length of 2^14-1

Redis defines various types of encoding:

/* Different encoding/length possibilities */
#define ZIP_STR_MASK 0xc0
#define ZIP_INT_MASK 0x30
#define ZIP_STR_06B (0 << 6)
#define ZIP_STR_14B (1 << 6)
#define ZIP_STR_32B (2 << 6)
#define ZIP_INT_16B (0xc0 | 0<<4)
#define ZIP_INT_32B (0xc0 | 1<<4)
#define ZIP_INT_64B (0xc0 | 2<<4)
#define ZIP_INT_24B (0xc0 | 3<<4)
#define ZIP_INT_8B 0xfe

Source code

The source code files corresponding to ziplist are src/ziplist.c and src/ziplist.h.

  • ziplistNew

Create an empty ziplist. Because it is essentially a byte array, there is no structure and a pointer is returned.

unsigned char *ziplistNew(void) {<!-- -->
    // ziplist space header(4 + 4 + 2) + end(1)
    unsigned int bytes = ZIPLIST_HEADER_SIZE + ZIPLIST_END_SIZE;
    // Allocate memory
    unsigned char *zl = zmalloc(bytes);
    //Write total length
    ZIPLIST_BYTES(zl) = intrev32ifbe(bytes);
    // Write zltail because there is no element yet, pointing to the end of the header
    ZIPLIST_TAIL_OFFSET(zl) = intrev32ifbe(ZIPLIST_HEADER_SIZE);
    // write zllen=0
    ZIPLIST_LENGTH(zl) = 0;
    //Write the end character
    zl[bytes-1] = ZIP_END;
    return zl;
}
  • zlentry

Redis defines the structure zlentry to carry node Entry data, but this does not represent the actual storage method of Entry data.

typedef struct zlentry {<!-- -->
    unsigned int prevrawlensize; // Record the number of bytes occupied by the length of the previous node 1 or 5
    unsigned int prevrawlen; // length of the previous node
    unsigned int lensize; // Number of bytes occupied by the length of the current node 1 or 5
    unsigned int len; //The length of the current node
    unsigned int headersize; //Node header length
    unsigned char encoding; // Encoding method integer/string
    unsigned char *p; // pointer to store value
} zlentry;
  • zipEntry

The zlentry structure is relatively complex, and Redis provides the zipEntry() method to encode it:

static inline void zipEntry(unsigned char *p, zlentry *e) {<!-- -->
    //Node length before decoding
    ZIP_DECODE_PREVLEN(p, e->prevrawlensize, e->prevrawlen);
    //Set encoding method
    ZIP_ENTRY_ENCODING(p + e->prevrawlensize, e->encoding);
    ZIP_DECODE_LENGTH(p + e->prevrawlensize, e->encoding, e->lensize, e->len);
    assert(e->lensize != 0);
    e->headersize = e->prevrawlensize + e->lensize;
    e->p = p;
}
  • ziplistPush

Used to insert new nodes to the head or tail of the ziplist.

unsigned char *ziplistPush(unsigned char *zl, unsigned char *s, unsigned int slen, int where) {<!-- -->
    unsigned char *p;
    // Determine whether to insert to the head or the tail
    p = (where == ZIPLIST_HEAD) ? ZIPLIST_ENTRY_HEAD(zl) : ZIPLIST_ENTRY_END(zl);
    return __ziplistInsert(zl,p,s,slen);
}
  • __ziplistInsert

Insert a new node into the ziplist at the specified location. The main steps are:

  • Encode the content of an element
  • Reapply for a piece of memory space
  • data copy
/**
 * ziplist insert node
 * @param zl ziplist pointer
 * @param p insert position before p
 * @param s content
 * @param slen length
 * @return
 */
unsigned char *__ziplistInsert(unsigned char *zl, unsigned char *p, unsigned char *s, unsigned int slen) {<!-- -->
    size_t curlen = intrev32ifbe(ZIPLIST_BYTES(zl)), reqlen, newlen;
    unsigned int prevlensize, prevlen = 0;
    size_t offset;
    int nextdiff = 0;
    unsigned char encoding = 0;
    long long value = 123456789;
    zlentry tail;
    if (p[0] != ZIP_END) {<!-- --> // The compression list is not empty, find the position to be inserted based on p
        ZIP_DECODE_PREVLEN(p, prevlensize, prevlen); // Decode prevlen length
    } else {<!-- -->
        unsigned char *ptail = ZIPLIST_ENTRY_TAIL(zl);
        if (ptail[0] != ZIP_END) {<!-- -->
            prevlen = zipRawEntryLengthSafe(zl, curlen, ptail);
        }
    }

    //Try to encode as integer, if failed, save the character array directly
    if (zipTryEncoding(s,slen, & amp;value, & amp;encoding)) {<!-- -->
        reqlen = zipIntSize(encoding);
    } else {<!-- -->
        reqlen = slen;
    }
    // reqlen represents the current node length = preceding node length + encoding length + content length
    reqlen + = zipStorePrevEntryLength(NULL,prevlen);
    reqlen + = zipStoreEntryEncoding(NULL,encoding,slen);
    int forcelarge = 0;
    nextdiff = (p[0] != ZIP_END) ? zipPrevLenByteDiff(p,reqlen) : 0;
    if (nextdiff == -4 & amp; & amp; reqlen < 4) {<!-- -->
        nextdiff = 0;
        forcelarge = 1;
    }
    offset = p-zl;
    newlen = curlen + reqlen + nextdiff;
    zl = ziplistResize(zl,newlen);
    p = zl + offset;

    if (p[0] != ZIP_END) {<!-- -->
        /* Subtract one because of the ZIP_END bytes */
        memmove(p + reqlen,p-nextdiff,curlen-offset-1 + nextdiff);

        /* Encode this entry's raw length in the next entry. */
        if (forcelarge)
            zipStorePrevEntryLengthLarge(p + reqlen,reqlen);
        else
            zipStorePrevEntryLength(p + reqlen,reqlen);

        /* Update offset for tail */
        ZIPLIST_TAIL_OFFSET(zl) =
            intrev32ifbe(intrev32ifbe(ZIPLIST_TAIL_OFFSET(zl)) + reqlen);

        /* When the tail contains more than one entry, we need to take
         * "nextdiff" in account as well. Otherwise, a change in the
         * size of prevlen doesn't have an effect on the *tail* offset. */
        assert(zipEntrySafe(zl, newlen, p + reqlen, & amp;tail, 1));
        if (p[reqlen + tail.headersize + tail.len] != ZIP_END) {<!-- -->
            ZIPLIST_TAIL_OFFSET(zl) =
                intrev32ifbe(intrev32ifbe(ZIPLIST_TAIL_OFFSET(zl)) + nextdiff);
        }
    } else {<!-- -->
        /* This element will be the new tail. */
        ZIPLIST_TAIL_OFFSET(zl) = intrev32ifbe(p-zl);
    }
    if (nextdiff != 0) {<!-- -->
        offset = p-zl;
        zl = __ziplistCascadeUpdate(zl,p + reqlen);
        p = zl + offset;
    }
    p + = zipStorePrevEntryLength(p,prevlen);
    p + = zipStoreEntryEncoding(p,encoding,slen);
    if (ZIP_IS_STR(encoding)) {<!-- -->
        memcpy(p,s,slen);
    } else {<!-- -->
        zipSaveInteger(p,value,encoding);
    }
    //Update zllen field
    ZIPLIST_INCR_LENGTH(zl,1);
    return zl;
}

Through the source code, we found that the overhead of inserting elements in ziplist is still very high. In fact, the same is true for deletion and update, because the data is closely arranged. Once new data is written, or the updated data is larger than before, there will be no storage space. An embarrassing situation, at which time a new memory space had to be reallocated.

  • ziplistFind

The time complexity of searching for ziplist elements is O(N). The element comparison is traversed from beginning to end until the end character is encountered.

unsigned char *ziplistFind(unsigned char *zl, unsigned char *p, unsigned char *vstr, unsigned int vlen, unsigned int skip) {<!-- -->
    int skipcnt = 0;
    unsigned char vencoding = 0;
    long long vll = 0;
    size_t zlbytes = ziplistBlobLen(zl);
    while (p[0] != ZIP_END) {<!-- --> // Traverse from beginning to end until the end character is encountered
        struct zlentry e;
        unsigned char *q;
        assert(zipEntrySafe(zl, zlbytes, p, & amp;e, 1));
        q = p + e.prevrawlensize + e.lensize;
        if (skipcnt == 0) {<!-- -->
            if (ZIP_IS_STR(e.encoding)) {<!-- -->
                // String comparison method, character by character comparison
                if (e.len == vlen & amp; & amp; memcmp(q, vstr, vlen) == 0) {<!-- -->
                    return p;
                }
            } else {<!-- -->
                //Comparison of integers
                if (vencoding == 0) {<!-- -->
                    if (!zipTryEncoding(vstr, vlen, & amp;vll, & amp;vencoding)) {<!-- -->
                        vencoding = UCHAR_MAX;
                    }
                    assert(vencoding);
                }
                if (vencoding != UCHAR_MAX) {<!-- -->
                    long long ll = zipLoadInteger(q, e.encoding);
                    if (ll == vll) {<!-- -->
                        return p;
                    }
                }
            }
            skipcnt = skip;
        } else {<!-- -->
            skipcnt--;
        }
        //Move to next node
        p = q + e.len;
    }
    return NULL;
}

Chain update

In addition to the high cost of additions, deletions and modifications, ziplist also has a risk point, which is chain updates.
Assume that the ziplist now stores 100 elements, and the length is 253. At this time, each element can use 1 byte to store the size of the previous element, and everyone is fine. At this time, there was suddenly an update operation, which changed the data length of the table header to exceed 254. At this time, the impact is not just on itself, because its length exceeds 254, resulting in that its subsequent element can no longer use 1 byte to store the length, but must be changed to 5 bytes, and the latter element must be changed to 5 bytes. After storing the length of the previous element, it will cause its own length to exceed 254, which will cause the next element to use 5 bytes to store its length, and so on, resulting in a chain reaction.
The impact of chain updates is very large, but fortunately the probability of it happening is not high.

Tail

In order to improve memory efficiency, Redis will give priority to using a data structure called ziplist to store data when the amount of data stored in hash, list, and zset is small. It is essentially a byte array. All elements are encoded according to a specific encoding format and arranged compactly together to improve memory efficiency. The disadvantage is that data updates require memory reallocation.
Overall, ziplist is suitable for storing smaller ordered collections or hash table data. But for large data, where frequent expansion is required, or where efficient iteration is required, a regular sorted set or hash table may be more suitable. When using ziplist, you need to weigh the pros and cons and choose an appropriate data structure based on actual needs.