Representation of Redis string

Representation of string

Redis is developed in C language, but the string type used by Redis does not adopt the string type of C language. Next, let’s take a look at why such a design is adopted.

The c language represents a string as a character array, ending with a character like ‘\0

1. Representation of Redis strings – SDS

Redis itself builds an abstract type called simple dynamic string (SDS) and uses SDS as Redis’s default string representation.

struct sdshdr {<!-- -->
    //len saves the length of the string saved by SDS
    int len;
    //free records the number of unused bytes in the buf array
    int free;
    //buf[] array is used to store each element of the string
    char buf[];
}

1. Binary security

Because C strings use the null character as the identifier of the end of the string, and for some binary files (such as pictures, etc.), the content may include empty strings, so the C string cannot be accessed correctly;

All SDS APIs process the elements in buf in a binary manner, and SDS does not use the empty string to determine whether it ends, but uses the length represented by the len attribute to determine whether the string ends strong>.

2. Reduce the number of memory reallocations for modifying strings

If you want to modify a string in C language, you must reallocate the memory (release it first and then apply for it), because if there is no reallocation, the increase in string length will cause memory buffer overflow< /strong>, when the string length is reduced, it will cause memory leak.

For SDS, due to the existence of the len attribute and the free attribute, SDS implements two strategies: space pre-allocation and lazy space release for modifying strings:

1. Space pre-allocation: When space expansion is performed on a string, more memory is expanded than actually required, which can reduce the number of memory reallocations required for continuous string growth operations.

When the size of the string stored in redis is less than 1MB, it stores any string, and its free size is always the same as its own size; when the string size is greater than 1MB, it allocates a free size that is fixed at 1MB.

2. Lazy space release: When shortening a string, the program does not immediately use memory reallocation to recycle the excess bytes after shortening, but uses the free attribute to record the number of these bytes. , waiting for subsequent use. (Of course, SDS also provides corresponding APIs, and we can also manually release these unused spaces when needed.)

3. Compatible with some C string functions

Although SDS is binary safe, it still follows the convention that each string is terminated by an empty string, so that some functions in the C language library can be reused.

4. Prevent buffer overflow

We know that when using the strcat function in C language to concatenate two strings, once the memory space of sufficient length is not allocated, a buffer overflow will occur. For the SDS data type, when modifying characters, it will first check whether the memory space meets the requirements according to the recorded len attribute. If not, the corresponding space will be expanded, and then the modification operation will be performed. , so buffer overflow will not occur

5. String length O(1)

Due to the existence of the len attribute, we only need to read the len attribute to obtain the length of the SDS string, and the time complexity is O(1). For C language, obtaining the length of a string is usually achieved by traversing the count, and the time complexity is O(n). You can get the string length of key through strlen key command

2. After Redis3.2 version, 5 structural types have been further designed

Before the redis3.2 branch appeared, strings only used one type: sdshdr (the SDS mentioned above). This structure had a drawback, such as creating a string each time. Since len + free (int type, generally The operating system occupies 4 bytes) and occupies at least 8 bytes. Therefore, no matter how long the string is, it must occupy at least 8 bytes, which is quite wasteful.

The 3.2 branch introduces five sdshdr types. Each time an sds is created, the actual length of the sds is used to determine which type of sdshdr should be selected. Different types of sdshdr occupy different memory spaces. This subdivision can greatly save space. The following is the definition of sdshdr in version 3.2:

1. sdshdr5

In fact, this type of redis will not be used, because there is no remaining space for fields, making it inconvenient to expand. 【Ignorable】

struct __attribute__ ((__packed__)) sdshdr5 {<!-- -->
    //Actually, this type of redis will not be used because there is no field with remaining space and it is inconvenient to expand. Its internal structure is also different from other sdshdr, just look at sdshdr8.
    unsigned char flags;
    //A total of 8 bits, the lower 3 bits are used to store the real flags (type), and the upper 5 bits are used to store len (length).
    char buf[];
    //The actual storage location of sds
};

img

According to the picture above, flags is 1 byte of char type, represented by the first byte of the string. Since sds has 5 types, the first three digits of flags represent the sds type, and the last 5 digits represent the length of the stored data. , so this type can only store data smaller than 2^5 bytes.

2. sdshdr8

struct __attribute__ ((__packed__)) sdshdr8 {<!-- -->
  uint8_t len;//Indicates the length of the current sds (unit is byte)
  uint8_t alloc; //Indicates the memory size allocated for sds (unit is bytes)
  //Use one byte to represent the current sdshdr type. Because there are five types of sdshdr, at least 3 bits are needed to represent it.
  //000:sdshdr5, 001:sdshdr8, 010:sdshdr16, 011:sdshdr32, 100:sdshdr64. The upper 5 bits are not used so they are all 0.
  unsigned char flags;
  char buf[];//The actual storage location of sds
};

img

  • len: Indicates the length of the current sds, excluding the ‘/0’ terminator. The length can be obtained directly. Note that the unit is bytes, the first byte of the string.
  • alloc: The currently allocated size (free used in versions before 3.2 means that there are free bytes of free space left), excluding the ‘/0’ terminator. Note that the unit is bytes, the second byte of the string
  • flags represents the current sdshdr type. If declared as char, it means there is 1 byte (8 bits) in total. Only the lower three bits can be used to represent all 5 sdshdr types (refer to the figure above). The upper 5 bits are not used. So they are all 0. , the third byte of the string.

The lower three digits represent: 000:sdshdr5, 001:sdshdr8, 010:sdshdr16, 011:sdshdr32, 100:sdshdr64

3. sdshdr16, sdshdr32, sdshdr64

struct __attribute__ ((__packed__)) sdshdr16 {<!-- -->
    uint16_t len;
    /* used */
    uint16_t alloc;
    /* excluding the header and null terminator */
    unsigned char flags;
    /* 3 lsb of type, 5 unused bits */
    char buf[];
};
struct __attribute__ ((__packed__)) sdshdr32 {<!-- -->
    uint32_t len;
    /* used */
    uint32_t alloc;
    /* excluding the header and null terminator */
    unsigned char flags;
    /* 3 lsb of type, 5 unused bits */
    char buf[];
};
struct __attribute__ ((__packed__)) sdshdr64 {<!-- -->
    uint64_t len;
    /* used */
    uint64_t alloc;
    /* excluding the header and null terminator */
    unsigned char flags;
    /* 3 lsb of type, 5 unused bits */
    char buf[];
};

The above structure can be compared to sdshdr8

4. How to choose which structure to use

Before redis creates an sds, it will call “sdsReqType(size_t string_size) to determine which sdshdr to use”. This function passes the length of an sds as a parameter and returns the sdshdr type that should be used.

#define SDS_TYPE_5 0 //00000000
#define SDS_TYPE_8 1 //00000001
#define SDS_TYPE_16 2 //00000010
#define SDS_TYPE_32 3 //00000011
#define SDS_TYPE_64 4 //00000100

#define SDS_TYPE_MASK 7 //00000111, as a mask that takes the lower 3 bits of flags

static inline char sdsReqType(size_t string_size) {<!-- -->
    if (string_size < 1<<5) //Less than 2^5, the high 5 bits of the flags member can be expressed
        return SDS_TYPE_5;
    if (string_size < 1<<8) //Less than 2^8, 8-bit integer (uint8_t in sdshdr8) can represent string_size
        return SDS_TYPE_8;
    if (string_size < 1<<16) //Less than 2^16, 16-bit integer (uint16_t in sdshdr16) can represent string_size
        return SDS_TYPE_16;
    //Less than 2^32, a 32-bit integer (uint32_t in sdshrd32) can represent string_size,
    //1ll means 1long long (at least 64 bits). If there is no ll, 1 is an int. Assume that int is 4 bytes and 32 bits.
    //1<<32 will lead to undefined behavior.
    if (string_size < 1ll<<32)
        return SDS_TYPE_32;
    return SDS_TYPE_64; //If the length of sds exceeds 2^64, all types cannot represent the len of this sds.
}

Therefore, some string-related functions are stored in the sds.h file. For example, the function to find the length of a string only needs to use sds as a parameter. Judge the function by comparing flags & SDS_TYPE_MASK and SDS_TYPE_n Which type of sdshdr does the sds belong to, and then retrieve the relevant information of the sds according to the specified sdshdr type. For example sdslen function (get string length)

Note that in order to determine which type to use sdshrd, we only need to compare flags & SDS_TYPE_MASK with SDS_TYPE_n (the reason why SDS_TYPE_MASK is needed is because there is a special case of sdshdr5, and its high 5 bits are not necessarily 0)

//Returns a pointer of type T containing sdshdr of string s
#define SDS_HDR(T,s) ((struct sdshdr##T *)((s)-(sizeof(struct sdshdr##T))))
//Use the flags member variable of sdshdr5 as a parameter to return the length of sds. This is actually an impossible hack.
#define SDS_TYPE_5_LEN(f) ((f)>>SDS_TYPE_BITS)
#define SDS_TYPE_BITS 3
static inline size_t sdslen(const sds s) {<!-- -->
    //Through s[-1] we can get the member variable flags of sdshdr to which sds belongs.
    unsigned char flags = s[-1];
    switch(flags & amp;SDS_TYPE_MASK) {<!-- -->
        case SDS_TYPE_5:
            return SDS_TYPE_5_LEN(flags);
        case SDS_TYPE_8:
            return SDS_HDR(8,s)->len;//Remove the len member of sdshdr
        case SDS_TYPE_16:
            return SDS_HDR(16,s)->len;
        case SDS_TYPE_32:
            return SDS_HDR(32,s)->len;
        case SDS_TYPE_64:
            return SDS_HDR(64,s)->len;
    }
    return 0;

Similar to sdslen, there are the following functions that use sds to find the sdshdr type, so we will not analyze them one by one:

static inline size_t sdsavail(const sds s)
static inline void sdssetlen(sds s, size_t newlen)
static inline void sdsinclen(sds s, size_t inc)
static inline size_t sdsalloc(const sds s)
static inline void sdssetalloc(sds s, size_t newlen)

3. Summary

This section mainly explains how Redis represents strings. The reason why string representation in C language is not used is mainly based on security, memory allocation and improving the time complexity of obtaining character length, etc., and it will be adopted after 3.2 The sdshdr structure in 5 represents different strings, which saves memory space even more.

syntaxbug.com © 2021 All Rights Reserved.