[C++] Application of hashing — Bloom filter

Article directory

  • 1. Bloom filter proposed
  • 2. Bloom filter concept
  • 3. Selection of the number of Bloom filter hash functions
  • 4. Implementation of Bloom filter
    • 1.Insertion of Bloom filter
    • 2. Searching for Bloom filters
    • 3. Bloom filter removal
    • 4. Complete code implementation
  • 5. Summary of Bloom filter
    • 1. Advantages of Bloom filter
    • 2. Bloom filter defects
    • 3. Application of Bloom filter
    • 4. Bloom filter related interview questions

1. Bloom filter proposed

We know that bitmaps can quickly determine whether a certain data is in a collection, but bitmaps have the following shortcomings:

1. Bitmaps are only suitable for scenarios where the data range is concentrated. When the data is too scattered, there will be a waste of space.

2. Bitmaps can only be used for shaping, and non-shaped data cannot be processed.

We can use functors to solve the shortcoming that bitmaps can only be used for integers, that is, define a HashFunc function for a specific type and convert it into an integer. For example, when the data type is a string, we can use a string The hash algorithm converts the string into an integer and then maps the integer into a bitmap

But there is a big flaw in this method – the values converted by different strings through the same HashFunc function may be the same. In other words, we will misjudge when we judge that a number is not there. In this case Down:

1. The existence of this string in the bitmap is inaccurate, because the bit of the string may have been originally 0, but it conflicted with other strings, resulting in a misjudgment, causing the bit to become 1.

2. It is accurate that the string in the bitmap does not exist, because a bit of 0 means that the string and other strings that may conflict with the string do not exist, so it is not accurate.

In addition, we usually modulo the value converted by the string hash function to save space, because the range of the converted result is uncertain, but modulo will increase the probability of hash collision.

So how can we reduce the false positive rate? The answer is Bloom filter

2. Bloom filter concept

Bloom filter is a compact and clever probabilistic data structure proposed by Burton Howard Bloom in 1970. It is characterized by efficient insertion and query, and can be used to tell You “something must not exist or may exist”, it uses multiple hash functions to map a data into a bitmap structure. This method can not only improve query efficiency, but also save a lot of memory space.

We can see that the Bloom filter reduces the misjudgment rate by using multiple hash functions, that is, allowing the same element to map multiple subscript positions, and only when these positions are all 1 during the query. It means that the value exists, and the probability that different subscripts mapped by different hash functions of the same element are misjudged at the same time is definitely much lower than the probability that one subscript position is misjudged.

It should be noted that the Bloom filter can only reduce the rate of false positives, but cannot completely eliminate them.

3. Selection of the number of Bloom filter hash functions

So is it better to map more subscript positions? Of course not, because the more subscript positions mapped to an element, the more space is wasted. At this time, we can choose a hash table or a red-black tree. , the result of such query is still accurate. Regarding how to choose the number of hash functions and the length of the Bloom filter, here is a blog written by an expert. You can refer to: Detailed explanation of the principle, usage scenarios and precautions of the Bloom filter.

We can know from this blog that the hash length, Bloom filter length, number of inserted elements and misjudgment rate are as follows:

and the relationship between them:

For the above K value, we can get:

1. When k is 3, m is about 4.3n, that is, an inserted element consumes about 4 bits.

1. When k is 5, m is about 7.2n, that is, an inserted element consumes about 7 bits.

1. When k is 8, m is about 11.6n, that is, an inserted element consumes about 12 bits.

Combining relationship diagrams and relationship expressions, it can be concluded that 3-5 hash functions are more appropriate.

4. Implementation of Bloom filter

The implementation of the Bloom filter is very simple. Just use the bitset in the library for the bitmap. For the string hash algorithm, you can select several algorithms with higher ratings from the algorithms introduced in the following blog: Various string Hash functions

1. Insertion of Bloom filter

To insert data, we only need to call set for each bitmap, and it can be achieved, such as the following example:

Insert into bloom filter: “baidu”

Insert into bloom filter: “tencent”

Code:

void set(const T & amp; key)
{<!-- -->
size_t hashi1 = HashFunc1()(key) % (N * X);
size_t hashi2 = HashFunc2()(key) % (N * X);
size_t hashi3 = HashFunc3()(key) % (N * X);
size_t hashi4 = HashFunc4()(key) % (N * X);

_bs.set(hashi1);
_bs.set(hashi2);
_bs.set(hashi3);
_bs.set(hashi4);
}

2. Bloom filter search

The idea of a Bloom filter is to map an element into a bitmap using multiple hash functions, so the bit at the mapped position must be 1. Therefore, you can search in the following way: Calculate whether the bit position corresponding to each hash value is stored as zero. As long as one is zero, it means that the element must not be in the hash table, otherwise it may be in the hash table.

Note: If the Bloom filter says that an element does not exist, the element must not exist. If the element exists, the element may exist because some hash functions have certain misjudgments.

For example: when searching for “alibaba” in the Bloom filter, assume that the hash values calculated by the three hash functions are: 1, 3, 7, which happen to overlap with the bits of other elements. At this time, the Bloom filter tells the The element exists, but the element does not exist.

The code is implemented as follows:

bool test(const T & amp; key)
{<!-- -->
size_t hashi1 = HashFunc1()(key) % (N * X);
if (!_bs.test(hashi1))
{<!-- -->
return false;
}

size_t hashi2 = HashFunc2()(key) % (N * X);
if (!_bs.test(hashi2))
{<!-- -->
return false;
}

size_t hashi3 = HashFunc3()(key) % (N * X);
if (!_bs.test(hashi3))
{<!-- -->
return false;
}

size_t hashi4 = HashFunc4()(key) % (N * X);
if (!_bs.test(hashi4))
{<!-- -->
return false;
}

// The previous judgments are all accurate and there is no misjudgment.
return true; // There may be a misjudgment. If several mapping positions conflict, there will be a misjudgment.
}

3. Bloom filter deletion

Bloom filters cannot directly support deletion because when one element is deleted, other elements may be affected.

For example: delete the “tencent” element in the above picture. If you directly set the binary bit position corresponding to this element to 0, the “baidu” element will also be deleted, because these two elements are on the bit positions calculated by multiple hash functions. There happens to be overlap.

A method to support deletion: expand each bit in the Bloom filter into a small counter, add one to k counters (hash addresses calculated by k hash functions) when inserting elements, and delete elements When , k counters are decremented by one, and deletion operations are increased by occupying several times more storage space.

defect:

1. Unable to confirm whether the element is actually in the bloom filter

2. Existence count wraparound

4. Complete code implementation

#pragma once

namespace hdp
{
struct BKDRHash
{
size_t operator()(const string & amp; key)
{
size_t hash = 0;
for (auto ch : key)
{
hash *= 131;
hash + = ch;
}
return hash;
}
};

struct APHash
{
size_t operator()(const string & amp; key)
{
unsigned int hash = 0;
int i = 0;

for (auto ch : key)
{
if ((i & 1) == 0)
{
hash ^= ((hash << 7) ^ (ch) ^ (hash >> 3));
}
else
{
hash ^= (~((hash << 11) ^ (ch) ^ (hash >> 5)));
}

+ + i;
}

return hash;
}
};

struct DJBHash
{
size_t operator()(const string & amp; key)
{
unsigned int hash = 5381;

for (auto ch : key)
{
hash + = (hash << 5) + ch;
}

return hash;
}
};

struct JSHash
{
size_t operator()(const string & s)
{
size_t hash = 1315423911;
for (auto ch : s)
{
hash ^= ((hash << 5) + ch + (hash >> 2));
}
return hash;
}
};
template
classBloomFilter
{
public:
void set(const T & amp; key)
{<!-- -->
size_t hashi1 = HashFunc1()(key) % (N * X);
size_t hashi2 = HashFunc2()(key) % (N * X);
size_t hashi3 = HashFunc3()(key) % (N * X);
size_t hashi4 = HashFunc4()(key) % (N * X);

_bs.set(hashi1);
_bs.set(hashi2);
_bs.set(hashi3);
_bs.set(hashi4);
}

bool test(const T & amp; key)
{<!-- -->
size_t hashi1 = HashFunc1()(key) % (N * X);
if (!_bs.test(hashi1))
{<!-- -->
return false;
}

size_t hashi2 = HashFunc2()(key) % (N * X);
if (!_bs.test(hashi2))
{<!-- -->
return false;
}

size_t hashi3 = HashFunc3()(key) % (N * X);
if (!_bs.test(hashi3))
{<!-- -->
return false;
}

size_t hashi4 = HashFunc4()(key) % (N * X);
if (!_bs.test(hashi4))
{<!-- -->
return false;
}

// The previous judgments are all accurate and there is no misjudgment.
return true; // There may be a misjudgment. If several mapping positions conflict, there will be a misjudgment.
}
private:
std::bitset _bs;
};

void test_bloomfilter1()
{
// Continue at 10:46
string str[] = { "Zhu Bajie", "Sun Wukong", "Sha Wujing", "Tang Sanzang", "White Dragon Horse 1", "1 White Dragon Horse", "White 1 Dragon Horse", "White 11 Dragon Horse", "1 White Dragon Horse 1" };
BloomFilter<10> bf;
for (auto & str : str)
{
bf.set(str);
}

for (auto & s : str)
{
cout << bf.test(s) << endl;
}
cout << endl;

srand(time(0));
for (const auto & s : str)
{
cout << bf.test(s + to_string(rand())) << endl;
}
}

void test_bloomfilter2()
{
srand(time(0));
const size_t N = 100000;
BloomFilter bf;

std::vector v1;
std::string url = "https://www.cnblogs.com/-clq/archive/2012/05/31/2528153.html";

for (size_t i = 0; i < N; + + i)
{
v1.push_back(url + std::to_string(i));
}

for (auto & str : v1)
{
bf.set(str);
}

// v2 and v1 are similar string sets, but different
std::vector v2;
for (size_t i = 0; i < N; + + i)
{
std::string url = "https://www.cnblogs.com/-clq/archive/2012/05/31/2528153.html";
url + = std::to_string(999999 + i);
v2.push_back(url);
}

size_t n2 = 0;
for (auto & str : v2)
{
if (bf.test(str))
{
+ + n2;
}
}
cout << "Misjudgement rate of similar strings:" << (double)n2 / (double)N << endl;

// Dissimilar string set
std::vector v3;
for (size_t i = 0; i < N; + + i)
{
string url = "zhihu.com";
url + = std::to_string(i + rand());
v3.push_back(url);
}

size_t n3 = 0;
for (auto & str : v3)
{
if (bf.test(str))
{
+ + n3;
}
}
cout << "Misjudgment rate of dissimilar strings:" << (double)n3 / (double)N << endl;
}
}

We can use the above two functions to test the Bloom filter. We will eventually find that the false positive rate is controllable. We can test and adjust the number of hash functions and the length of the Bloom filter according to the specific application scenario. , and finally realize the Bloom filter that best suits the current application scenario.

5. Bloom filter summary

1. Advantages of Bloom filter

1. The time complexity of adding and querying elements is: O(K), (K is the number of hash functions, which is generally relatively small), and has nothing to do with the amount of data.

2. Hash functions have no relationship with each other and facilitate hardware parallel operations.

3. Bloom filters do not need to store the elements themselves, which has great advantages in some situations with strict confidentiality requirements.

4. When it can withstand a certain amount of misjudgment, the Bloom filter has a huge space advantage over other data structures.

5. When the amount of data is large, Bloom filters can represent the entire set, but other data structures cannot

6. Bloom filters using the same set of hash functions can perform intersection, union, and difference operations.

2. Bloom filter defects

1. There is a false positive rate, that is, there is a false positive (False Position), that is, it cannot accurately determine whether the element is in the set (remedial method: create a whitelist to store data that may be misjudged)

2. The element itself cannot be obtained

3. Generally, elements cannot be removed from Bloom filters.

4. If you use counting method to delete, there may be a counting wraparound problem.

3. Application of Bloom filter

The Bloom filter was proposed to solve the problem that bitmaps can only handle shaping and data range concentration. However, hash functions and modulo will cause hash conflicts and misjudgment. In order to reduce the misjudgment rate, we need reasonable choices. A tree of hash functions and the length of a bloom filter

Bloom filters are suitable for scenarios that do not require complete accuracy and allow certain misjudgments, such as the following scenarios:

1. Judgment of nicknames when users register: Some websites do not allow duplicate nicknames during registration, and all registered nicknames are stored in the database of the server. Because the database is stored on the disk, the access speed is very slow, so if the user does not It is very inefficient to choose a nickname and check whether it exists in the database. At this time, we can add a bloom filter in front of the server for filtering - all registered nicknames are mapped to the bloom filter, which is not necessarily accurate. At this time, the running user can use the nickname and return directly , no longer need to search in the database. If the nickname is in the Bloom filter, the user is prompted to re-enter it, but this does not affect the user's use. We can also search it in the database and return accurate results.

2. Query personal data: For example, we need to use the ID number as the key value to find the specific information of a certain user in the company's customer data. Since the efficiency of direct access to data is very low, at this time, we can add a Bloom filter in front of the server. The search method is the same as the first point. If it does not exist, it will be returned directly. If it exists, it will be searched in the database and the result of the search will be returned.

In other areas, such as determining whether a website is a dangerous website, the information should be displayed when we click on the website. At this time, we can use the Bloom filter to display it quickly.

4. Bloom filter related interview questions

Given two files, each with 10 billion queries, we only have 1G of memory. How to find the intersection of the two files? Exact algorithms and approximate algorithms are given respectively

We need to use the idea of hash cutting - use the same hash function to cut the two files separately, and the cutting results are A0-Ai, B0-Bi; because the hash function is the same, the query and occurrence in Ai and Bi The conflicting queries are all in the same small file. At this time, we only need to find the intersection of Ai and Bi in the small files with the same subscript. It should be noted that if the small file is large, it means that one or several queries have a large number of repetitions. At this time, change a hash function and perform hash cutting on the recursive sub-files of the Ai and Bi small files respectively.

For an accurate algorithm, we need to first store all the elements in the small file No. Ai into set/map, and then take the data in the small file No. Bi and query it in set/map to get the intersection, but the final result Deduplication processing is required.