[[Hash Application] Bitmap/Bloom Filter]

Bitmap/Bloom filter

bitmap
- bitmap concept
- Use of bitmaps
- Bitmap simulation implementation
bloom filter
- Bloom filter concept
- Use of bloom filter
- Bloom filter simulation implementation
Bitmap/Bloom Filter Application: Massive Data Processing
- Hash splitting

Bitmap

Bitmap concept

Computers usually use bits as the smallest storage unit of data. There are only two binary states: 0 and 1. This determines that bits can be used to store yes/no information for a certain condition, and this method takes up the least amount of system memory. . Therefore, the standard library in C++ provides the bitset class, which stores data in bits as the smallest unit and is mainly used to provide bit-level operations.

Bitmap uses each bit to store a certain state. It is suitable for scenarios where massive data is not repeated. It is usually used to determine whether a certain data exists.

Use of bitmaps

Initialize bitset, initialize and give the number of bits n

bitset<16> foo;
bitset<16> bar(0xfa2);
bitset<16> baz("0101111001"); //Read from right to left. If the string length is less than n digits, add 0
foo: 0000000000000000
bar: 0000111110100010
baz: 0000000101111001

bitset operation summary

Bitmap simulation implementation

template<size_t N>
class bitset {<!-- -->
bitset()
{<!-- -->
_bt.resize(N / 8 + 1,0);
}
//Set the x bit to 1
void set(size_t x)
{<!-- -->
size_t i = x / 8;
size_t j = x % 8;
_bt[i] |= (1 << j);
}
//Set the x bit to 0
void reset(size_t x)
{<!-- -->
size_t i = x / 8;
size_t j = x % 8;
_bt[i] & amp;= ~(1 << j);
}
//Query whether x exists
bool test(size_t x)
{<!-- -->
size_t i = x / 8;
size_t j = x % 8;
return _bt[i] & amp; (1 << j);
}
private:
vector<char> _bt;
};

//When processing massive data
void BitSetTest()
{<!-- -->
// Open 4.29 billion bits
bitset<-1> bs1;
bitset<0xffffffff> bs2;
}

Bloom filter

Bloom filter concept

Bloom filter is a compact and clever probabilistic data structure proposed by Burton Howard Bloom in 1970. It is characterized by efficient insertion and query. It can be used to tell you “something must not be exists or may exist”, it uses multiple hash functions to map a data into a bitmap structure. This method can not only improve query efficiency, but also save a lot of memory space.

Differences from bitmaps:
Bloom filter: Bloom filter is mainly used to quickly determine whether an element may exist in the data collection. There may be a certain misjudgment rate
Bitmap: Bitmap is mainly used to accurately determine whether an integer exists in a set. It canfail to misjudge.

Using Bloom filters

Initialization
Bloom filters use a bitmap of length m and initialize all bits to 0. At the same time, k different hash functions need to be selected.

Insert
When an element is added to a Bloom filter, k hash calculations are performed on the element to obtain k hash values (usually integers). Then set the corresponding k positions in the bit array to 1.

Query
The Bloom filter performs k hash calculations and obtains k hash values.
If any of the positions is 0, it can be determined that the element must not exist in the Bloom filter; otherwise, it may exist (there is a certain misjudgment rate here, because different elements may have the same hash value, As a result, the same position in the bitmap represents the existence of different elements)

Delete
Bloom filters cannot directly support deletion because when one element is deleted, other elements may be affected.
Optimization: Expand each bit in the Bloom filter into a small counter. When inserting an element, the hash function is calculated to k positions, and the counter at that position is + 1. When an element is deleted, the corresponding position counter of the element is -1, increase the deletion operation by occupying several times more storage space.

Note:

Selection of hash function: The hash function should be able to distribute the hash values evenly to minimize the possibility of collisions.
The size of the bit array: The length m of the bit array and the number of hash functions k will directly affect the false positive rate of the Bloom filter. Usually, the size of m is related to the expected number of storage elements and the tolerated false positive rate.
False positive rate: Bloom filter has a certain false positive rate, that is, it may be judged that an element exists in the Bloom filter, but it does not actually exist. This is due to possible collisions between different elements on the bit array.

Bloom filter simulation implementation

template<size_t N, size_t X = 5, class K = string,
class HashFunc1 = BKDRHash,
class HashFunc2 = APHash,
class HashFunc3 = DJBHash>
classBloomFilter
{<!-- -->
public:
 void Set(const K & amp; key)
 {<!-- -->
 size_t len = X*N;
 size_t index1 = HashFunc1()(key) % len;
 size_t index2 = HashFunc2()(key) % len;
 size_t index3 = HashFunc3()(key) % len;
 _bs.set(index1);
 _bs.set(index2);
 _bs.set(index3);
 }
 bool Test(const K & amp; key)
 {<!-- -->
 size_t len = X*N;
 size_t index1 = HashFunc1()(key) % len;
 if (_bs.test(index1) == false)
 return false; //Accurate
 size_t index2 = HashFunc2()(key) % len;
 if (_bs.test(index2) == false)
 return false; //Accurate
 size_t index3 = HashFunc3()(key) % len;
 if (_bs.test(index3) == false)
 return false; //Accurate
 return true; // There is a misjudgment

 }
 // Deletion is not supported and deletion may affect other values.
 void Reset(const K & amp; key);
-private:
bitset<X*N> _bs;

Bitmap/Bloom filter application: massive data processing

Bitmap applications

Quickly find whether a certain data is in a collection

1. Given 4 billion unique unsigned integers, not sorted. Given an unsigned integer, how to quickly determine whether a number is among these 4 billion numbers. 【Tencent】
answer:
1G=2^30=1 billion bytes
4 billion ints=16 billion bytes=16G
The amount of data is very large and takes up a lot of memory space. Bitmaps can be used to reduce the memory space occupied.
Whether the data is in the given integer data, the result is present or not, which are exactly two states. Then a binary bit can be used to represent the information of whether the data exists. If the binary bit is 1, it means it exists, and if it is 0, it means does not exist. for example:

2. Given 10 billion integers, design an algorithm to find an integer that appears only once?
Represented by two bitmaps, the possible combinations are:

Combination	Number of occurrences
00	0
01	1
10	2
11	More than 3 times

The number stored in the statistical bitmap is 01 which is what is required.

Find the intersection and union of two sets

3. Given two files, each with 10 billion integers, and we only have 1G of memory, how to find the intersection of the two files?
Initial idea: map all values in a file, read another file, and determine whether it exists
Problem: There may be multiple identical elements in the file, and the resulting intersection needs to be deduplicated, which is time-consuming.
Optimization: Two files and two bitmaps, corresponding position & operation, the element represented by the position with the result of 1 is the intersection

Disk block marking in the operating system

Bloom filter application

Given two files, each with 10 billion queries, and we only have 1G of memory, how to find the intersection of the two files? Give an exact algorithm
Exact algorithm:
Assuming that each query occupies 30 bytes, each of the above files has 300 billion bytes == 300G. The files are very large. We can first divide them into several small files and find the intersection between each small file.
We use hash-cutting to split the files.

In this way, the same query must enter the small file with the same label. We only need to find the intersection of two small files with the same label. Next, use set to find the intersection, read Ai and put it in set, read Bi to see if it is in set, or delete it if it is not.

Hash split

Hash sharding is often used in databases and distributed systems. Here are some common situations where hashing can be considered:

1.Database sharding: When the amount of database data is huge and a single database server cannot handle it, the data can be divided into multiple shards and a hash function can be used to distribute the data to different shards. . This improves database scalability and performance.

2.Load balancing: In a distributed system, load balancing can be achieved by using hash sharding. Based on the specific attributes of the request (such as client ID, request URL, etc.), an identifier is calculated through a hash function, and then the identifier is used to select the corresponding server to handle the request, thereby making the load distribution more even.

3.Distributed cache: In a distributed cache system, hash sharding can be used to store data in multiple cache nodes, thereby improving cache capacity and read performance. Calculate the key value of the data through the hash function, and then store the data on the corresponding node.

It should be noted that when using hash slicing, ensure that the hash function has good uniformity and randomness to avoid data skew and hot spot problems. In addition, since hash sharding may introduce data migration and consistency issues, it requires careful design and implementation.