Redis Cache Penetration, Cache Breakdown, and Cache Avalanche: Causes and Solutions

1. Foreword

In day-to-day development we rely on a database for data storage. Ordinary workloads with little concurrency pose no problem, but as soon as a large volume of requests arrives at once, for example during a flash sale or an instantaneous spike of visits to the home page, a system that stores data only in the database runs into severe performance trouble: disks read and write slowly, while thousands of concurrent requests demand thousands of read/write operations in a very short time. The database often cannot bear that load, is easily brought down, and ultimately the service goes down with it, causing serious production incidents.

To overcome this, projects usually introduce NoSQL technology: a memory-based data store that also provides a degree of persistence.

Redis is one such NoSQL technology, but introducing Redis can create problems of its own: cache penetration, cache breakdown, and cache avalanche. This article analyzes these three issues in depth.

2. A First Look at the Three Problems

  • Cache penetration: the requested key does not exist in the data source at all. Since it can never be found in the cache, every request for it goes through to the data source, which may be overwhelmed. For example, a lookup with a nonexistent user id hits neither the cache nor the database; if an attacker exploits this hole, the database may be crushed.
  • Cache breakdown: the key does exist, but its entry in Redis has expired. If a large number of concurrent requests arrives at that moment, they all miss the cache, load the data from the backend DB, and set it back into the cache, and that burst of concurrent loads may instantly overwhelm the DB.
  • Cache avalanche: the cache server restarts, or a large number of keys expires within the same short window, so that a flood of requests suddenly lands on the backend system (e.g. the DB).

3. Cache Penetration Solution

Consider data that is absent from the cache and cannot be found in the database either. Because the cache is only written passively on a miss, and, for fault tolerance, data that cannot be found in the storage layer is not written to the cache, every request for this nonexistent data must go to the storage layer, which defeats the purpose of caching.

There are several effective defenses against cache penetration. The most common is a Bloom filter: hash all keys that could possibly exist into a sufficiently large bitmap, so that a lookup for data that definitely does not exist is intercepted by the bitmap and never reaches the underlying storage (a sketch follows the pseudocode below). A simpler, cruder method (the one we use) is: if a query returns an empty result, whether because the data does not exist or because of a system fault, cache the empty result anyway, but with a short expiration time, no longer than five minutes.

Pseudocode for the crude approach:

//pseudocode
public object GetProductListNew() {
    int cacheTime = 30;
    String cacheKey = "product_list";

    String cacheValue = CacheHelper.Get(cacheKey);
    if (cacheValue != null) {
        return cacheValue;
    }

    //cache miss: query the database (the result may be empty)
    cacheValue = GetProductListFromDB();
    if (cacheValue == null) {
        //the data does not exist, so cache a default (empty) value
        cacheValue = string.Empty;
    }
    //cache even the empty value, with a short expiration time
    CacheHelper.Add(cacheKey, cacheValue, cacheTime);
    return cacheValue;
}
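
For the Bloom filter approach mentioned above, here is a minimal sketch using Guava's BloomFilter class; the ProductCacheGuard wrapper and the warm-up step are illustrative assumptions, not part of the original scheme:

import com.google.common.hash.BloomFilter;
import com.google.common.hash.Funnels;
import java.nio.charset.StandardCharsets;

public class ProductCacheGuard {
    //expect up to one million ids with a ~1% false-positive rate
    private final BloomFilter<String> knownIds = BloomFilter.create(
            Funnels.stringFunnel(StandardCharsets.UTF_8), 1_000_000, 0.01);

    //populate the filter once at startup with every id that actually exists
    public void warmUp(Iterable<String> allProductIds) {
        for (String id : allProductIds) {
            knownIds.put(id);
        }
    }

    //call this before the cache/DB lookup: a negative answer is definitive,
    //so the request can be rejected without touching Redis or the database
    public boolean mightExist(String productId) {
        return knownIds.mightContain(productId);
    }
}

A Bloom filter can return false positives but never false negatives, so a "might exist" answer still goes through the normal cache-then-DB path, while definitely-absent keys are stopped at the door.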

4. Cache Breakdown Solution

A hot key may be hit with very high concurrency at the exact moment it expires. This raises the problem of the cache being “broken down”: all of those concurrent requests miss the cache at once.

Use a mutex key

A common industry practice is a mutex. In short, when the cache misses (the value read back is empty), do not load the DB immediately. First use a cache operation that atomically sets a key and reports success (such as Redis's SETNX or Memcache's ADD) to set a mutex key. If that operation succeeds, perform the DB load and set the value back into the cache; otherwise, retry the whole get-from-cache method.

SETNX is short for “SET if Not eXists”: the key is set only if it does not already exist, which can be used to achieve a locking effect.

public String get(String key) {
    String value = redis.get(key);
    if (value == null) { //the cached value has expired
        //set a 3-minute timeout on the mutex so that if the del below
        //fails, the next cache miss can still load from the db
        if (redis.setnx(key_mutex, 1, 3 * 60) == 1) { //lock acquired
            value = db.get(key);
            redis.set(key, value, expire_secs);
            redis.del(key_mutex);
            return value;
        } else {
            //another thread is already loading the db and refilling the
            //cache; sleep briefly, then retry the whole method
            sleep(50);
            return get(key); //retry
        }
    } else {
        return value;
    }
}
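
Note that the plain SETNX command takes no TTL; in practice the lock is usually acquired with SET ... NX EX so the key and its expiry are set atomically. A minimal sketch with the Jedis client, where loadFromDB() is an assumed stand-in for the real database query:

import redis.clients.jedis.Jedis;
import redis.clients.jedis.params.SetParams;

public class MutexCache {
    private static final int LOCK_SECONDS = 3 * 60;

    public String get(Jedis jedis, String key) throws InterruptedException {
        String value = jedis.get(key);
        while (value == null) {
            //SET key:mutex 1 NX EX 180 - atomically take the lock with a TTL
            String locked = jedis.set(key + ":mutex", "1",
                    SetParams.setParams().nx().ex(LOCK_SECONDS));
            if ("OK".equals(locked)) {
                value = loadFromDB(key);     //assumed database loader
                jedis.setex(key, 60, value); //set the value back, 60s TTL
                jedis.del(key + ":mutex");
            } else {
                Thread.sleep(50);            //another thread holds the lock
                value = jedis.get(key);      //retry the cache read
            }
        }
        return value;
    }

    private String loadFromDB(String key) {
        return "..."; //assumed: run the real SQL query here
    }
}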

Memcache code:

if (memcache.get(key) == null) {
    //3-minute timeout so a crashed mutex holder does not block others forever
    if (memcache.add(key_mutex, "1", 3 * 60) == true) {
        value = db.get(key);
        memcache.set(key, value);
        memcache.delete(key_mutex);
    } else {
        sleep(50);
        retry();
    }
}

Other approaches: contributions are welcome.

5. Cache Avalanche Solution

The difference from cache breakdown is scale: an avalanche involves many keys at once, whereas breakdown concerns a single key.

(Diagram: under normal operation, requests are served from the Redis cache.)

(Diagram: at the moment the cache becomes invalid, all requests fall through to the database.)

The avalanche effect that cache invalidation unleashes on the underlying system is terrifying! Most system designers therefore use locks or queues to guarantee that no large number of threads reads from and writes to the database at the same moment, so that concurrent requests do not all land on the underlying storage when keys expire. Another simple solution is to spread out the cache expiration times: for example, add a random offset, say 1 to 5 minutes, to the base expiration time, so that expiration times rarely coincide and collective-expiry events become hard to trigger.
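
Before the locking pseudocode, here is a minimal sketch of the random-offset idea, reusing the hypothetical CacheHelper API from the pseudocode in this article:

import java.util.concurrent.ThreadLocalRandom;

public class JitteredCache {
    private static final int BASE_MINUTES = 30;

    //spread expirations out with a random 1-5 minute offset so that
    //keys written at the same time do not all expire at the same time
    public static void addWithJitter(String key, Object value) {
        int jitter = ThreadLocalRandom.current().nextInt(1, 6); //1..5
        CacheHelper.Add(key, value, BASE_MINUTES + jitter);
    }
}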

Pseudocode for locking and queuing:

//pseudocode
public object GetProductListNew() {
    int cacheTime = 30;
    String cacheKey = "product_list";
    String lockKey = cacheKey;

    String cacheValue = CacheHelper.get(cacheKey);
    if (cacheValue != null) {
        return cacheValue;
    } else {
        //only one thread rebuilds the cache; the others queue here
        synchronized (lockKey) {
            //re-check: another thread may have refilled the cache while we waited
            cacheValue = CacheHelper.get(cacheKey);
            if (cacheValue != null) {
                return cacheValue;
            } else {
                //usually a SQL query here
                cacheValue = GetProductListFromDB();
                CacheHelper.Add(cacheKey, cacheValue, cacheTime);
            }
        }
        return cacheValue;
    }
}

Locking and queuing only relieves pressure on the database; it does not improve system throughput. Suppose that under high concurrency the key is locked while its cache is rebuilt: 999 of the last 1000 requests are left blocked, and users may time out waiting. It treats the symptom, not the disease!

Note: in a distributed environment, locking and queuing brings its own concurrency problem and may in turn require a distributed lock; threads are still blocked, and the user experience is poor! It is therefore rarely used in genuinely high-concurrency scenarios!

Pseudocode for the cache-tag approach:

//pseudocode
public object GetProductListNew() {
    int cacheTime = 30;
    String cacheKey = "product_list";
    //cache tag key
    String cacheSign = cacheKey + "_sign";

    String sign = CacheHelper.Get(cacheSign);
    //get the cached value
    String cacheValue = CacheHelper.Get(cacheKey);
    if (sign != null) {
        return cacheValue; //not expired, return directly
    } else {
        //re-set the tag first so that only one caller triggers the refresh
        CacheHelper.Add(cacheSign, "1", cacheTime);
        ThreadPool.QueueUserWorkItem((arg) -> {
            //usually a SQL query here
            cacheValue = GetProductListFromDB();
            //the data is cached for twice the tag time, so stale
            //data can still be served while the refresh runs
            CacheHelper.Add(cacheKey, cacheValue, cacheTime * 2);
        });
        return cacheValue; //return the (possibly stale) value meanwhile
    }
}

Explanation:

  • Cache tag: records whether the cached data has expired; when the tag expires, another thread is triggered to refresh the actual key's cache in the background;
  • Cached data: its expiration time is twice that of the tag, for example the tag lives 30 minutes while the data is cached for 60 minutes. That way, when the tag key expires, the actual cache can still return the old data to callers, and the new data is returned only once the background thread has completed the refresh.

For the cache avalanche, three solutions have been proposed here: using locks or queues, setting an expiration tag to refresh the cache in the background, and staggering expiration times across keys. A further approach, known as a “second-level cache”, is sketched below.
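
A minimal sketch of the second-level cache idea, assuming an in-process map as L1 in front of Redis as L2; the TwoLevelCache class and its wiring are illustrative, not from the original article:

import java.util.concurrent.ConcurrentHashMap;
import redis.clients.jedis.Jedis;

public class TwoLevelCache {
    //L1: in-process copy that can still serve reads during a Redis
    //outage or a mass expiry of Redis keys
    private final ConcurrentHashMap<String, String> local = new ConcurrentHashMap<>();

    public String get(Jedis jedis, String key) {
        String value = local.get(key);  //check L1 first
        if (value != null) {
            return value;
        }
        value = jedis.get(key);         //fall back to L2 (Redis)
        if (value != null) {
            local.put(key, value);      //repopulate L1
        }
        return value;                   //null means: go to the database
    }
}

In a real system the L1 entries would also need their own (short) expiry and an invalidation path, omitted here for brevity.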

6. Summary

For business systems the answer is always case-by-case analysis; there is no best solution, only the most suitable one.

Other cache issues, such as a full cache and data loss, are left for self-study. Finally, three acronyms worth remembering: LRU, RDB, and AOF. We usually handle memory overflow with an LRU eviction policy, and rely on Redis's RDB and AOF persistence strategies to keep data safe under the appropriate circumstances.
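
As a pointer for that self-study, here is a hedged redis.conf fragment that wires those three mechanisms together; the values are illustrative defaults, not recommendations from this article:

# evict least-recently-used keys once memory is full (LRU)
maxmemory 2gb
maxmemory-policy allkeys-lru

# RDB: snapshot to disk if at least 1 key changed within 900 seconds
save 900 1

# AOF: append-only log of writes, fsynced once per second
appendonly yes
appendfsync everysec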