[C++ and data structures] Bitmaps and Bloom filters

Table of Contents

1. Bitmap

1. The concept of bitmap

2. Implementation of bitmap

①.Basic structure

②、set

③、reset:

④、test

⑤. Question:

⑥. Advantages, disadvantages and applications of bitmaps:

⑦. Complete code and testing

2. Bloom filter

1. Proposal of Bloom filter

2. Implementation of Bloom filter

①.Basic structure

②. Implementation of three Hash functors

③、 set

④、 test

⑤. Delete

⑥. Complete code and testing

⑦、Advantages and disadvantages


1. Bitmap

1. The concept of bitmap

1. Interview questions

Give
40
Billions of unique unsigned integers, not sorted. Given an unsigned integer, how to quickly determine whether a number is in

this
40
Among the billions. 【Tencent】

Finding the presence of a number is actually a key model. The common solutions are as follows:

  • 1. Traversal, time complexity O(N)
  • 2. Sorting (O(NlogN)), using binary search: logN
  • 3. Bitmap solution

The problem with the first two solutions: the amount of data is too large and cannot be placed in the memory.

We can first consider how much space 4 billion non-repeating unsigned integers occupy?

One billion bytes is 1 G, and 4 billion integers, one integer is 4 bytes, that is, 16 billion (9 zeros) bytes are needed, that is, 16G is needed. These data take up too much space and cannot be stored in the memory. No less.

How to solve the problem of bitmap?

The so-called bitmap is to use each bit to store a certain state. It is suitable for massive amounts of data and there is no duplication of data. It is usually used to determine whether a certain data exists. Whether the data is in the given integer data, the result is present or not, which are exactly two states. Then a binary bit can be used to represent the information of whether the data exists. If the binary bit is
1
, represents existence, 0 represents absence. (Bitmap is a type of direct addressing method)

For 4 billion data, we need to open at least 4.2 billion spaces. Why? If there is a number 4200000000, which position do you want to map it to? When we open space, we need to open up the range of data so that it can be fully mapped.

Then we directly open 2^32 spaces (here each position is a bit, 2^32≈4.29 billion, because the bitmap It is the bits stored in each position) to establish a mapping relationship between all numbers using direct addressing methods, that is, we open up 4.29 billion bits, and first treat 4.29 billion as bytes, then It ≈ 4G, and 4G ≈ 4000M (or MB), and here are bits, so it needs to be divided by 8, 4000/8 = 500M, sobitmap storage can occupy 500M, which saves space and efficiency And very high(high efficiency, because the direct addressing method has no hash collision)

2. Implementation of bitmap

①, basic structure

How many bits should be opened for the initialization of the bitmap, because the bitmap saves space by storing bits. The N you pass is also a bit. Secondly, the type of vector is int, because < strong>The type does not support bits

class bitset
{
public:
bitset(size_t N)
{//N represents how many bits to open (abbreviation: bits)
_bits.resize(N / 32 + 1, 0);
_num = N;
}

private:
//int* _bits;
std::vector<int> _bits;
size_t _num; //Indicates how many bits are stored and how much data is mapped and stored
};
_bits.resize(N / 32 + 1, 0);

Why N / 32 + 1?

Because the resize opening space of vector is based on integer units, and each position of the bitmap stores a bit, and an integer has 32 bits. N represents the number of bits, then N/32 gets the number of integers, but + 1 is needed. Why? For example, N=100, 100/32=3, that means opening 3 integer positions, that is, 32*3=96 bits, but there is still no room for 100, so you open 96 bits (same principle as 97, 98… .There are no positions), in order to avoid this problem, we will + 1, that is, open one more shaping, which will only waste 31 bits at most.

②,set

Function: Set the xth bit to 1, indicating that this bit exists

void set(size_t x)
{//Set the x-th bit to 1, indicating that this position exists

size_t index = x / 32; //Calculate the integer where the mapping position is
size_t pos = x % 32; //Calculate the position of x in the integer

_bits[index] |= (1 << pos); //For this integer, the position of pos or 1
//1<<pos means first moving 1 to the bit position at the same position as pos, because pos=how many times, 1 will be shifted to the left by how many times.
//|= means OR, that is, except for the pos position, which changes to 1, the other positions are not affected.
}

In fact, it is to examine how to change a certain bit of an integer to 1 without affecting the other 31 bits

The left shift on the little-endian machine is to the left, and the left shift on the big-endian machine is to the right

This is a bug in the C language design. It is a problem left over from history and can easily mislead people. The development of computer technology is flourishing, and then it is integrated and unified.

③, reset:

Function: Set the xth position to 0, indicating that this bit does not exist

void reset(size_t x)
{//Set the x-th bit to 0, indicating that this position does not exist

size_t index = x / 32; //Calculate the integer where the mapping position is
size_t pos = x % 32; //Calculate the position of x in the integer

_bits[index] & amp;= ~(1 << pos); //For this integer, the position of pos is the same as 0
}

④, test

Function: Determine whether the x-th bit is present (that is, determine whether the bit mapped by x is 1)

//Determine whether x is present (that is, whether the bit mapped by x is 1)
bool test(size_t x)
{
size_t index = x / 32; //Calculate the integer where the mapping position is
size_t pos = x % 32; //Calculate the position of x in the integer

return _bits[index] & amp; (1 << pos); //If the result is not 0, it is true, if it is 0, it is false
}

⑤, Question:

Theoretically, these 4 billion numbers cannot be stored in the memory. They should be stored in files. Then we have to read the file. Since the 4 billion numbers need to be opened according to the range, 4.29 billion of the space needs to be opened. How can it be opened like this? Big space? Common methods are as follows:

①, bitset bs(-1); //Because the constructor parameter of the bitmap is size_t, then -1 if viewed as an unsigned number is the maximum value of the integer

②, bitset bs(0xffffffff);

⑥. Advantages, disadvantages and applications of bitmaps:

Advantages: Space saving and high efficiency

Disadvantages: Can only handle plastic surgery

Application:

  • Quickly find whether a certain data is in a collection
  • Sort + remove duplicates
  • Find the intersection, union, etc. of two sets
  • Disk block marking in the operating system

⑦, complete code and test

#pragma once
#include
#include

namespace mz
{
class bitset
{
public:
bitset(size_t N)
{//N represents how many bits to open (abbreviation: bits)
_bits.resize(N / 32 + 1, 0);
_num = N;
}

void set(size_t x)
{//Set the x-th bit to 1, indicating that this position exists

size_t index = x / 32; //Calculate the integer where the mapping position is
size_t pos = x % 32; //Calculate the position of x in the integer

_bits[index] |= (1 << pos); //For this integer, the position of pos or 1
//1< _bits;
size_t _num; //Indicates how many bits are stored and how much data is mapped and stored
};

void test_bitset()
{
bitset bs(100);
bs.set(99);
bs.set(98);
bs.set(97);
bs.reset(98);

for (size_t i = 0; i < 100; + + i)
{
printf("[%d]:%d\
", i, bs.test(i));
}
}

}

Some test results are as follows:

2. Bloom filter

1. The proposal of Bloom filter

When we use the news client to read news, it will constantly recommend new content to us. Each time it recommends it, it will reiterate and remove content that we have already seen. The question is, how does the news client recommendation system implement push duplication removal? The server is used to record all the historical records that the user has seen. When the recommendation system recommends news, it will filter out the historical records of each user and filter out those records that already exist. How to search quickly?

  • 1. Use hash table to store user records. Disadvantages: Waste of space
  • 2. Use bitmap to store user records. Disadvantage: Bitmap generally can only process integers. If the content number is a string, it cannot be processed Got it
  • 3. Combine hash with bitmap, that is, Bloom filter

Bloom filters are
By Blum (
Burton Howard Bloom
)exist
1970
A compact and relatively ingenious method proposed in
General
Rate data structure
,feature is
Efficient insertion and querying can be used to tell you

Something must not exist or may exist
at

, which uses multiple hash functions to map a data into a bitmap structure. this way
Not only can it improve query efficiency, but also
Can save a lot of memory space
.

2. Implementation of Bloom filter

①, basic structure

Generally, Bloom filters store strings or structures, etc. Generally, they will not store integers, because integers are stored using bitmaps.

The bottom layer of the Bloom filter is implemented using bitmaps. Because strings are more commonly used, we directly use strings as the default template parameters, and the other three Hash The template parameter represents using three positions to map a value

How many spaces does the constructor open?

Some experts have already calculated that 10 elements require a length of 43 bits, so we might as well start with 5 times

template<class K = string, class Hash1 = HashStr1, class Hash2 = HashStr2, class Hash3 = HashStr3>
class bloomfilter
{
public:
    //It will be a problem to open it directly, because I may not have mapped a few values.
// Then open space according to how much data you will probably store.
//Someone has calculated how many values are better to open, that is, how many values you store will be mapped to how many bits.
bloomfilter(size_t num)
:_bs(5 * num)
,_N(5 * num)
{}

private:
bitset _bs; //The bottom layer is a bitmap
size_t_N;
};

②. Implementation of three Hash functors

Because three mapping positions are used to map a value, three functions need to be written to convert strings into integers. And because the string type is more commonly used, these three functors will be used as default template parameters

The following are string algorithms that can reduce hash conflicts (we use the following three algorithms for the following three functors):

struct HashStr1
{
size_t operator()(const string & amp; str)
{ //Use BKDRHash
size_t hash = 0;
for (size_t i = 0; i < str.size(); + + i)
{
hash *= 131;
hash + = str[i];
}

return hash;
}
};

struct HashStr2
{
size_t operator()(const string & amp; str)
{ //Use RSHash
size_t hash = 0;
size_t magic = 63689; //magic number
for (size_t i = 0; i < str.size(); + + i)
{
hash *= magic;
hash + = str[i];
magic *= 378551;
}

return hash;
}
};

struct HashStr3
{
size_t operator()(const string & amp; str)
{ //Use SDBMHash
size_t hash = 0;
for (size_t i = 0; i < str.size(); + + i)
{
hash *= 65599;
hash + = str[i];
}

return hash;
}
};

③, set

The function wants to mark the key as existing, and we have said that we need to use three positions to map the key value, so we call the Hash functor to first calculate the mapping position of the string, %_N is because we opened _N bits for the bitmap at the beginning, and the mapping position you calculated is likely to be greater than _N, so it is %_N, which can both store and unify, then A key value can be represented by three mapping positions

Note: Hash1 is a functor type, and Hash1() is a functor object. Of course, you can also write it as Hash1 hs1; hs1(key) % _N; but it is obviously more troublesome

void set(const K & amp; key)
{
size_t index1 = Hash1()(key) % _N;//Use anonymous objects of type Hash1
size_t index2 = Hash2()(key) % _N;
size_t index3 = Hash3()(key) % _N;

_bs.set(index1);//Indicates that the position index1 exists
_bs.set(index2);
_bs.set(index3);
}

④, test

Function: Determine whether the key value exists

Because a value uses three mapping positions, we determine whether the three mapping positions of the calculated key exist at the same time in the bitmap. The key value exists only if it exists at the same time. Otherwise, it is sure that one does not exist. does not exist

bool test(const K & amp; key)
{
size_t index1 = Hash1()(key) % _N;
if (_bs.test(index1) == false)
return false;

size_t index2 = Hash2()(key) % _N;
if (_bs.test(index2) == false)
return false;

size_t index3 = Hash3()(key) % _N;
if (_bs.test(index3) == false)
return false;

return true; //But it is not necessarily true here, there may still be a misjudgment

//The judgment is inaccurate and there may be misjudgment.
//The judgment is not there, it is accurate
}

⑤, delete

Bloom filters cannot directly support deletion because when one element is deleted, other elements may be affected.

A method to support deletion: Expand each bit in the Bloom filter into a small counter, and give
k counters
(k
hash address calculated by a hash function
)
Adding one, when deleting an element, gives
K counters are decremented by one, and deletion operations are increased by occupying several times more storage space.

Defects:

1.
Unable to confirm if element is actually in bloom filter

2.
presence count wraparound

void reset(const K & amp; key)
{
//Just set the mapped position to 0?
//Deletion is not supported, and there may be accidental deletion, so Bloom filters generally do not support deletion.
}

⑥, complete code and test

#pragma once
#include"bitset.h"
#include

using std::string;
using std::cout;
using std::endl;

namespace mz
{
struct HashStr1
{
size_t operator()(const string & amp; str)
{ //Use BKDRHash
size_t hash = 0;
for (size_t i = 0; i < str.size(); + + i)
{
hash *= 131;
hash + = str[i];
}

return hash;
}
};

struct HashStr2
{
size_t operator()(const string & amp; str)
{ //Use RSHash
size_t hash = 0;
size_t magic = 63689; //magic number
for (size_t i = 0; i < str.size(); + + i)
{
hash *= magic;
hash + = str[i];
magic *= 378551;
}

return hash;
}
};

struct HashStr3
{
size_t operator()(const string & amp; str)
{ //Use SDBMHash
size_t hash = 0;
for (size_t i = 0; i < str.size(); + + i)
{
hash *= 65599;
hash + = str[i];
}

return hash;
}
};

template
class bloomfilter
{
public:
//It will be a problem to open it directly, because I may not have mapped a few values.
// Then open space accordingly based on how much data you will probably store.
//Someone has calculated how much to open, that is, how many values you store will be mapped to how many bits.
bloomfilter(size_t num)
:_bs(5 * num)
,_N(5 * num)
{}

void set(const K & amp; key)
{
size_t index1 = Hash1()(key) % _N;//Use anonymous objects of type Hash1
size_t index2 = Hash2()(key) % _N;
size_t index3 = Hash3()(key) % _N;

_bs.set(index1);
_bs.set(index2);
_bs.set(index3);

}

bool test(const K & amp; key)
{
size_t index1 = Hash1()(key) % _N;
if (_bs.test(index1) == false)
return false;

size_t index2 = Hash2()(key) % _N;
if (_bs.test(index2) == false)
return false;

size_t index3 = Hash3()(key) % _N;
if (_bs.test(index3) == false)
return false;

return true; //But it is not necessarily true here, there may still be a misjudgment
\t\t
//The judgment is inaccurate and there may be misjudgment.
//The judgment is not there, it is accurate
}

void reset(const K & amp; key)
{
//Just set the mapped position to 0?
//Deletion is not supported, and there may be accidental deletion, so Bloom filters generally do not support deletion.
}

private:
bitset _bs; //The bottom layer is a bitmap
size_t_N;
};

void test_bloomfilter()
{
bloomfilter bf(100); //No string is given here, just use <> because string is the default
bf.set("abcd");
bf.set("aadd");
bf.set("bcad");

cout << bf.test("abcd") << endl;
cout << bf.test("aadd") << endl;
cout << bf.test("bcad") << endl;
cout << bf.test("cbad") << endl;

}
}

⑦, advantages and disadvantages

Advantages of Bloom Filters

  • 1. The time complexity of adding and querying elements is: O(K), (K is the number of hash functions, which is generally small), and it has nothing to do with the size of the data.
  • close
  • 2. Hash functions have no relationship with each other, which facilitates hardware parallel operations.
  • 3. Bloom filters do not need to store the elements themselves, which has great advantages in some situations with strict confidentiality requirements.
  • 4. When it can withstand certain misjudgments, Bloom filters have a huge space advantage over other data structures.
  • 5. When the amount of data is large, Bloom filters can represent the entire set, but other data structures cannot
  • 6. Bloom filters using the same set of hash functions can perform intersection, union, and difference operations

Bloom filter defects

  • 1. There is a false positive rate, that is, there is a false positive (False Position), that is, it cannot accurately determine whether the element is in the set (remedial method: then
  • Establish a whitelist to store data that may be misjudged)
  • 2. Unable to obtain the element itself
  • 3. Generally, elements cannot be removed from Bloom filters
  • 4. If you use counting method to delete, there may be a counting wraparound problem.

The knowledge points of the article match the official knowledge files, and you can further learn related knowledge. Algorithm skill tree Home page Overview 56975 people are learning the system