### Filter usage scenarios

Consider the following requirements:

1. Suppose a database already holds 1 billion numbers, and 100,000 new numbers arrive. How do we quickly and accurately judge whether each of these 100,000 numbers is already among the 1 billion?

Solution 1: Store the 1 billion numbers in a database and query it for each lookup. This is accurate, but slow.

Solution 2: Load the 1 billion numbers into memory, for example into a Redis cache. The memory required is roughly 1 billion × 8 bytes ≈ 8 GB. In-memory queries give us both accuracy and speed, but spending about 8 GB of memory on this is quite wasteful.
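The arithmetic behind Solution 2, and the savings a Bloom filter offers, can be sketched quickly (the 1% target error rate below is an illustrative assumption, not from the text):

```python
# Rough memory estimate: 1 billion 64-bit integers held raw in RAM,
# versus a Bloom filter bitmap sized for a ~1% false-positive rate.
import math

n = 1_000_000_000          # number of keys
raw_bytes = n * 8          # 8 bytes per 64-bit integer
print(raw_bytes / 2**30)   # ≈ 7.45 GiB, the "about 8 GB" above

# Standard Bloom sizing formula: m = -n * ln(p) / (ln 2)^2 bits
p = 0.01                   # assumed target false-positive rate
m_bits = -n * math.log(p) / math.log(2) ** 2
print(m_bits / 8 / 2**30)  # ≈ 1.1 GiB: far smaller than the raw data
```

The formula shows why the bitmap approach scales so well: cost grows with the number of elements and the tolerated error rate, not with the size of each element.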

2. Anyone who has worked with crawlers will recognize this need: with thousands of websites to crawl, how do we judge, for a newly discovered URL, whether we have already crawled it?

The candidate solutions are the same two as above, and obviously neither is very good.

3. Spam email filtering poses a similar problem.

So, for such large data sets, how can we accurately and quickly judge whether a given item is in the set without occupying too much memory? This is the problem filters were born to solve.

## Bloom Filter

A Bloom filter is a data structure built on a long binary vector (a bitmap), which can be regarded as a binary array. Each slot stores either 0 or 1, and every slot is initially 0.

**1. Add data**

When adding an element (a key) to the Bloom filter, we compute several values for it with multiple hash functions, and then set the bit at each resulting position to 1.

**2. How to judge whether it exists**

Pass the new data through the same hash functions to compute each position, then check whether the corresponding bits are all 1. If any of them is not 1, we can say the new data is definitely not present in this Bloom filter.

Conversely, if every position computed by the hash functions holds a 1, can we definitely conclude that the data exists in this Bloom filter?

The answer is no, because different data can produce overlapping hash results, so a given position may have been set to 1 by some other data.

We can draw a conclusion:

**The Bloom filter can judge that a certain data must not exist, but it cannot judge that it must exist.**

For example, element 3 in the figure below does not actually exist, but the positions it hashes to were already set to 1 by element 1 and element 2.
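The add and query behavior above can be sketched in a few lines of Python (the class name and the salted-SHA-256 hash construction are illustrative assumptions, not a prescribed implementation):

```python
# A minimal Bloom filter sketch showing why "all bits set" only means
# *possibly* present, while any unset bit means *definitely* absent.
import hashlib

class BloomFilter:
    def __init__(self, m: int, k: int):
        self.m, self.k = m, k        # m slots, k hash functions
        self.bits = bytearray(m)     # one byte per bit, for clarity

    def _positions(self, key: str):
        # Derive k positions by salting a single base hash
        # (one simple construction among many).
        for i in range(self.k):
            h = hashlib.sha256(f"{i}:{key}".encode()).hexdigest()
            yield int(h, 16) % self.m

    def add(self, key: str):
        for p in self._positions(key):
            self.bits[p] = 1

    def might_contain(self, key: str) -> bool:
        return all(self.bits[p] for p in self._positions(key))

bf = BloomFilter(m=1024, k=3)
bf.add("hello")
print(bf.might_contain("hello"))   # True: an added key always reports present
print(bf.might_contain("world"))   # no false negatives: a False here is definitive
```

Note that `might_contain` returning `True` is only ever "possibly present": another key's bits could coincide with all of this key's positions, which is exactly the element-3 situation described above.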

**3. Advantages and disadvantages of Bloom filter**

Advantages: the advantages are obvious. The binary array occupies very little memory, and both insertion and query are fast.

Disadvantages: as the amount of data grows, the false-positive rate rises; it cannot confirm that data definitely exists; **and, importantly, data cannot be deleted.**

When deploying a Bloom filter with Redis, there are three common arrangements:

1. Both the hash algorithm and the bitmap data live on the client

2. The algorithm runs on the client, while the bitmap data lives in Redis

3. Both the algorithm and the bitmap data live in Redis

## Upgraded version of the Bloom filter (Counting Bloom Filter)

The principle is to upgrade each bit of the bitmap to a counter. Adding an element increments each corresponding counter by 1, and deleting an element decrements each corresponding counter by 1, which is what makes deletion possible.
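A minimal sketch of this counter upgrade, extending the bit array to integer counters (names and the salted-hash construction are illustrative assumptions):

```python
# Counting Bloom filter sketch: each slot is a counter instead of a bit,
# so deletion becomes possible by decrementing.
import hashlib

class CountingBloomFilter:
    def __init__(self, m: int, k: int):
        self.m, self.k = m, k
        self.counters = [0] * m      # counters instead of bits

    def _positions(self, key: str):
        for i in range(self.k):
            h = hashlib.sha256(f"{i}:{key}".encode()).hexdigest()
            yield int(h, 16) % self.m

    def add(self, key: str):
        for p in self._positions(key):
            self.counters[p] += 1

    def remove(self, key: str):
        # Only safe if `key` was actually added; removing a key that was
        # never inserted corrupts the counters.
        for p in self._positions(key):
            self.counters[p] -= 1

    def might_contain(self, key: str) -> bool:
        return all(self.counters[p] > 0 for p in self._positions(key))

cbf = CountingBloomFilter(m=1024, k=3)
cbf.add("hello")
print(cbf.might_contain("hello"))  # True
cbf.remove("hello")
print(cbf.might_contain("hello"))  # False: the counters dropped back to zero
```

The cost is that each slot now needs several bits for its counter rather than one, multiplying the memory footprint.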

## Cuckoo filter

To solve the Bloom filter's inability to delete elements, the authors of the paper “Cuckoo Filter: Practically Better Than Bloom” proposed the cuckoo filter. Compared with the cuckoo filter, the Bloom filter has the following disadvantages: weak query performance, low space efficiency, no support for the reverse operation (deletion), and no support for counting.

**Weak query performance** is because a Bloom filter must use multiple hash functions to probe several different locations in the bitmap. These locations are widely scattered in memory, which leads to a low CPU cache-line hit rate.

**Low space efficiency** is because, at the same false-positive rate, a cuckoo filter's space utilization is significantly higher than a Bloom filter's, saving more than 40% of space. On the other hand, a Bloom filter does not require the bitmap length to be a power of 2, while a cuckoo filter does, so in that respect the Bloom filter's space is more flexible to scale.

**No support for deletion** hits the Bloom filter's real weakness. In a dynamic system, elements constantly come and go. A Bloom filter is like an ink stain: every arrival leaves a trace, and the trace cannot be cleaned up even after the element leaves. For example, if your system currently holds only 10 million live elements but hundreds of millions have passed through over time, the Bloom filter is helpless: it keeps the imprints of all those departed elements forever. As time passes the filter becomes more and more crowded, until one day you find its false-positive rate is too high and you have to rebuild it.

**Cuckoo Hash**

The simplest cuckoo hash structure is a one-dimensional array, with two hash functions that map each new element to two positions in the array. If either of the two positions is empty, the element is placed there directly. If both positions are occupied, it randomly kicks one occupant out and takes that position for itself.

```
p1 = hash1(x) % l
p2 = hash2(x) % l
```

Unlike the real cuckoo, the cuckoo hashing algorithm helps these victims (the displaced “eggs”) find a new nest. Since every element can live in either of two positions, an element can be inserted as long as either one is free. So the poor egg that was squeezed out checks whether its other position is free; if it is, it moves there and everyone is happy. But what if that position is also occupied? Then it plays “stealing the magpie's nest” again, passing the victim role on to the next occupant. Each new victim repeats the process until every egg has found a nest.

**Cuckoo hashing problem**

But there is a problem: if the array is too crowded, a single insertion can trigger hundreds of evictions without settling, which seriously hurts insertion efficiency. Cuckoo hashing therefore sets a threshold: when a chain of consecutive evictions exceeds it, the array is considered nearly full, and it must be expanded and all elements rehashed.
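The insert-evict-relocate loop with a give-up threshold can be sketched as follows (the class name, hash construction, and the threshold value of 500 are illustrative assumptions):

```python
# Basic cuckoo hashing: two candidate positions per key, with eviction
# chains capped by a threshold that signals "table too full, resize".
import hashlib

MAX_KICKS = 500  # assumed eviction threshold before declaring the table full

def _h(seed: int, key: str, size: int) -> int:
    d = hashlib.sha256(f"{seed}:{key}".encode()).hexdigest()
    return int(d, 16) % size

class CuckooHash:
    def __init__(self, size: int):
        self.size = size
        self.table = [None] * size

    def insert(self, key: str) -> bool:
        p1, p2 = _h(1, key, self.size), _h(2, key, self.size)
        if self.table[p1] is None:
            self.table[p1] = key
            return True
        if self.table[p2] is None:
            self.table[p2] = key
            return True
        # Both nests taken: evict an occupant and relocate it, repeatedly.
        pos = p1
        for _ in range(MAX_KICKS):
            key, self.table[pos] = self.table[pos], key  # kick occupant out
            # The victim's other candidate position:
            alt1, alt2 = _h(1, key, self.size), _h(2, key, self.size)
            pos = alt2 if pos == alt1 else alt1
            if self.table[pos] is None:
                self.table[pos] = key
                return True
        return False  # threshold exceeded: caller should expand and rehash

    def contains(self, key: str) -> bool:
        return key in (self.table[_h(1, key, self.size)],
                       self.table[_h(2, key, self.size)])

ch = CuckooHash(size=16)
ch.insert("a")
print(ch.contains("a"))   # True
```

A `False` return from `insert` is the "threshold exceeded" signal described above: the table would normally be grown and all elements reinserted.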

There is another possible problem: eviction cycles. For example, two different elements may hash to exactly the same pair of positions. That alone is fine, since each can take one of the two. But if a third element then arrives whose two hashed positions are the same pair, an eviction cycle occurs. That said, the probability of three different elements all sharing the same two positions after hashing is very low, unless your hash functions are very poor.

Cuckoo hashing's attitude toward such cycles is to treat them as a sign that the array is too crowded and expand it (even though that is not actually the cause).

**Optimization**

The average space utilization of the basic cuckoo hash algorithm above is not high, only about 50%. Once that level is reached, chains of consecutive evictions quickly exceed the threshold. A hash table with such low utilization is of limited value, so it needs improvement.

One improvement is to add more hash functions, so that each element has not just two candidate nests but three or four. This greatly reduces the probability of collision and raises space utilization to about 95%.

Another improvement is to attach multiple seats (slots) to each array position, so that when two elements hash to the same position they need not immediately steal each other's nest: with several seats available, each can simply take a free one. Eviction is needed only when all of the seats are occupied, which clearly reduces evictions significantly. This scheme's space utilization is only about 85%, but query efficiency is very high, because the seats at one position are contiguous in memory and make effective use of the CPU cache.

A more efficient solution therefore combines the two improvements, for example using 4 hash functions with 2 seats per position. This gains both time efficiency and space efficiency; such a combination can even push space utilization to 99%, which is remarkable.

## Cuckoo filter structure

The cuckoo filter has the same structure as a cuckoo hash table: a one-dimensional array. The difference is that a cuckoo hash table stores the whole element, while the cuckoo filter stores only the element's fingerprint (a few bits, similar to a Bloom filter). The filter sacrifices data accuracy for space efficiency, and precisely because only fingerprints are stored, it has a false-positive rate, just like a Bloom filter.

First of all, the cuckoo filter still uses only two hash functions, but each position can hold multiple seats. The choice of these two hash functions is special, because the filter stores only fingerprint information: when a fingerprint is squeezed out of its position, we must compute its dual position, and the usual computation requires the original element itself. Recall the earlier position formulas.

```
fp = fingerprint(x)
p1 = hash1(x) % l
p2 = hash2(x) % l
```

In the filter we know p1 and the fingerprint of x, but there is no way to compute p2 directly from them.

**Special hash function**

The ingenuity of the cuckoo filter lies in its specially designed hash function, which lets p2 be computed directly from p1 and the element's fingerprint, without needing the complete element x.

```
fp = fingerprint(x)
p1 = hash(x)
p2 = p1 ^ hash(fp)  // XOR
```

From the formula above, knowing fp and p1 we can compute p2 directly; symmetrically, knowing p2 and fp we can compute p1. This is the duality:

```
p1 = p2 ^ hash(fp)
```

So we do not need to know whether the current position is p1 or p2 at all: XOR-ing the current position with hash(fp) always yields the dual position. We only need to ensure hash(fp) != 0 to guarantee p1 != p2, so an element never kicks itself out in an infinite loop.

You may ask why the hash function here does not take the result modulo the array length. It actually does, but the cuckoo filter requires the array length to be a power of 2, so taking the modulus is equivalent to keeping the low n bits of the hash value. After the XOR operation, everything above the low n bits is masked off, and the low n bits of the result are the final dual position.
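Putting the pieces together, here is a minimal sketch of a cuckoo filter using this XOR trick: fingerprints only, multiple seats per bucket, and a power-of-2 table so the modulus is a bit mask (the fingerprint width, bucket size, kick threshold, and hash choices are all illustrative assumptions):

```python
# Cuckoo filter sketch: buckets hold short fingerprints, and the dual
# position is recovered as pos ^ hash(fp), masked to the table size.
import hashlib

BUCKET_SIZE = 4   # assumed seats per position
MAX_KICKS = 500   # assumed eviction threshold

def _sha(x: str) -> int:
    return int(hashlib.sha256(x.encode()).hexdigest(), 16)

class CuckooFilter:
    def __init__(self, nbuckets: int):
        assert nbuckets & (nbuckets - 1) == 0, "length must be a power of 2"
        self.mask = nbuckets - 1                 # % nbuckets == & mask
        self.buckets = [[] for _ in range(nbuckets)]

    def _fp(self, key: str) -> int:
        fp = _sha("fp:" + key) & 0xFF            # assumed 8-bit fingerprint
        return fp or 1                           # avoid a zero fingerprint

    def _alt(self, pos: int, fp: int) -> int:
        # Dual position: XOR with hash(fp), keep only the low n bits.
        return (pos ^ _sha(str(fp))) & self.mask

    def insert(self, key: str) -> bool:
        fp = self._fp(key)
        p1 = _sha(key) & self.mask
        for p in (p1, self._alt(p1, fp)):
            if len(self.buckets[p]) < BUCKET_SIZE:
                self.buckets[p].append(fp)
                return True
        pos = p1
        for _ in range(MAX_KICKS):               # evict and relocate
            victim = self.buckets[pos].pop(0)
            self.buckets[pos].append(fp)
            fp, pos = victim, self._alt(pos, victim)
            if len(self.buckets[pos]) < BUCKET_SIZE:
                self.buckets[pos].append(fp)
                return True
        return False                             # table considered full

    def contains(self, key: str) -> bool:
        fp = self._fp(key)
        p1 = _sha(key) & self.mask
        return fp in self.buckets[p1] or fp in self.buckets[self._alt(p1, fp)]

    def delete(self, key: str) -> bool:
        fp = self._fp(key)
        p1 = _sha(key) & self.mask
        for p in (p1, self._alt(p1, fp)):
            if fp in self.buckets[p]:
                self.buckets[p].remove(fp)
                return True
        return False

cf = CuckooFilter(nbuckets=64)
cf.insert("hello")
print(cf.contains("hello"))   # True
cf.delete("hello")
```

Note that `_alt` is its own inverse: applying it twice returns the original position, which is exactly the duality the formulas above describe, and it never needs the original key. That is what makes eviction, lookup, and deletion possible when only fingerprints are stored.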