[C++] Bitmap (massive data processing)

Article directory

  • Throwing Questions: Bringing in Bitmaps
    • bitmap resolution
  • The concept of bitmap
  • Implementation of bitmap
    • structure
    • Constructor
    • set bit
    • clear bit
    • Check if the number exists
    • reverse bit
    • size and count
    • print function
  • bitmap application

Throwing a problem: importing bitmap

Problem: Given 4 billion non-repeating unsigned integers that are not sorted, and an unsigned integer, how to quickly determine whether this number is among the 4 billion numbers.

Problem solving ideas:

  1. Traversal, time complexity O(N);
  2. First sort (O(NlogN)), and then use binary search (logN).
  3. Put it into a hash table or red-black tree.

Now let’s analyze whether the above idea is feasible. First of all, we need to know how much space is occupied by 4 billion unsigned integers: 1G=1024MB=1024x2024KB=1024x1024x1024Byte, which is approximately equal to 1 billion bytes, so 4 billion unsigned integers (an integer needs to occupy 4 bytes) need to occupy about 16G memory space.

4 billion unsigned integers require 16G of space, so we can only use merge sort, but the binary search algorithm cannot be completed in the file, so idea 1 is not applicable.

If it is placed in a hash table or a red-black tree, we must first load the data in the file into the memory, which can be loaded step by step, and then insert the data into the hash table or red-black tree. After the construction is completed, Find the data again. That’s not as good as a brute force lookup, right?

In summary, the important reason why the above ideas are not applicable is: insufficient memory.

Bitmap resolution

Whether the data is in the given integer data, the results are nothing more than two, yes or no, then we can use a binary bit to represent whether the data exists, if the binary bit is 1, it means it exists, and 0 means it does not exist. Often we can convert integers into chars for storage. One char is equal to 1 byte = 8 bits, so 4 billion unsigned integers only need about 0.5G of memory space. for example:

The concept of bitmap

Bitmap is to use each bit to store a certain state. Suitable for massive data without duplication of data, usually used to judge whether a certain data exists.

Bitmap implementation

Structure

  • takes character array
  • Use non-type template parameters, N indicates the number of bits to be set
namespace zxn
{<!-- -->
//simulate bitmap
template<size_t N>
class bitset
{<!-- -->
public:
\t\t//Constructor
bitset();
//set bit
void set(size_t x);
//clear bit
void reset(size_t x);
//Check if the data exists
bool test(size_t x);
// reverse bit
void flip(size_t pos);
//Get the number of bits that can be accommodated
size_t size();
// Get the number of bits that are set
size_t count();
// print function
void Print();
private:
vector<char> _bits; //bitmap
};
}

Constructor

N is a non-type template parameter, indicating the number of bits we need to set, because we use a character array, the storage type is char, and the char type occupies one byte, so N/8 + 1 indicates the char type space we need. Use the resize function to initialize it to 0 while opening the space.

//Constructor
bitset()
{<!-- -->
_bits.resize(N / 8 + 1, 0);
}

Set bit

How to set the bit corresponding to the data to 1 (exists)?

Here, pay attention to distinguishing left shift <<, right shift >>, and low address and high address in virtual memory.

  • Review the concept of big and small endian

Big-endian byte order: high addresses exist in low bits
Little endian: low address exists in low address

Note: When reading, it is from low address to high address >

  • Why use left shift (observe virtual memory address)

void set(size_t x)
{<!-- -->
//Calculate the position of the bit mapped to the data x in the i-th char array
size_t i = x / 8;
//Calculate the bit of data x mapping in the jth bit of this char
size_t j = x % 8;

_bits[i] |= (1 << j);
}

Clear bit

The function of the reset function is to set the corresponding bit of a data to non-existent.

void reset(size_t x)
{<!-- -->
//Calculate the position of the bit mapped to the data x in the i-th char array
size_t i = x / 8;
//Calculate the bit of data x mapping in the jth bit of this char
size_t j = x % 8;

_bits[i] & amp;= ~(1 << j);
}

Judge whether this number exists

The function of the test function is to judge whether the unsigned normal is in the 4 billion unique data. If the bit state corresponding to this data is 1, it means that this data exists and returns true.

bool test(size_t x)
{<!-- -->
size_t i = x / 8;
size_t j = x % 8;

return _bits[i] & amp; (1 << j);
//or return (_bits[i]<<j) & amp;1;
}

Invert bit

  1. Calculate the bit position corresponding to the data.
  2. Shift 1 to the left by j bits and perform XOR operation with the i-th bit in char. (1 if different, 0 if same)
//Reverse bit
void flip(size_t x)
{<!-- -->
size_t i = x / 8;
size_t j = x % 8;
_bits[i] ^= (1 << j);//XOR: 1 if different, 0 if same
}

size and count

//Get the number of bits that can be accommodated
size_t size()
{<!-- -->
return N;
}

How to get the number of 1s in the bitmap:

  1. Perform & amp; AND operation on original tree n and n-1 to get new n;
  2. Determine whether n is 0, if n is not 0, go to the first step in a loop.
  3. This cycle continues until n is finally 0, and the number of times the code is executed represents how many 1s there are in the binary bit.

Principle: n & amp; (n-1) This operation will eliminate the rightmost 1 of the binary every time it is performed.

//Get the number of bits with 1
size_t count()
{<!-- -->
size_t count = 0;
\t\t
for (auto e : _bits)
{<!-- -->
//e is char type

while (e)
{<!-- -->
e = e & (e - 1);
count + + ;
}
}
return count;//The number of bits to be set, that is, the number of 1 in the bitmap
}

Print function

The function of the print function is to traverse the bitmap, print each bit, and count the number of bits during the printing process.

void Print()
{<!-- -->
int count = 0;
size_t n = _bits.size();//Easy to get the last character of the array

//Print the first n-1 chars first
for (size_t i = 0; i < n - 1; i ++ )
{<!-- -->
for (size_t j = 0; j < 8; j++ )
{<!-- -->
if (_bits[i] & amp; (1 << j))
{<!-- -->
cout << "1";
}
else
{<!-- -->
cout << "0";
}
count + + ;
}
}
//Print the first N%8 digits of the last character
for (size_t j = 0; j < N % 8; j ++ )
{<!-- -->
if (_bits[n - 1] & amp; (1 << j))
{<!-- -->
cout << "1";
}
else
{<!-- -->
cout << "0";
}
count + + ;
}
cout << " " << count << endl;
}

Application of bitmap

  1. Quickly find out whether a certain data is in a collection;
  2. Sort + deduplicate;
  3. Find the intersection and union of two sets;
  4. Disk block markers in the operating system.

Bitmap advantages and disadvantages:

  1. Advantages: fast, save space.
  2. Disadvantages: only integer types can be mapped, and other types, such as floating point numbers, strings, etc., cannot store mappings.

Problem 1: Given 10 billion integers, design an algorithm to find the integer that occurs only once.

A bit in a bitmap can only represent two states, so we can open up two bitmaps, and the corresponding positions of the two bitmaps represent the first bit and the first bit of the integer mapped to the position respectively .

Can be marked as follows:

  1. 00 means does not exist, has not appeared
  2. 01 means appear once
  3. 10 means more than one occurrence


Code implementation: multiplexing bitset

template<size_t N>
class twobitset
{<!-- -->
public:
//Set bit: 00 means 0 occurrences, 01 means only one occurrence, 10 means more than one occurrence
void set(size_t x)
{<!-- -->
//00 -> 01
if (_bs1. test(x) == false & & _bs2.test(x) == false)
{<!-- -->
_bs2.set(x);
}
else if (_bs1. test(x) == false & & _bs2.test(x) == true)
{<!-- -->
//01 -> 10
_bs1.set(x);
_bs2.reset(x);
}
}
void Print()
{<!-- -->
// traverse the entire bitmap
for (size_t i = 0; i < N; i ++ )
{<!-- -->
//Print out the integer that only appears once, that is, the first digit is 0, and the second digit is 1,01
if (_bs2. test(i))
{<!-- -->
cout << i << endl;
}
}
}
private:
bitset<N> _bs1;
bitset<N> _bs2;
};
void test_twobitset()
{<!-- -->
int a[] = {<!-- --> 3, 45, 53, 32, 32, 43, 3, 2, 5, 2, 32, 55, 5, 53,43,9,8,7,8 };
twobitset<100> bs;
/ / Map the data in the array a to the bitmap
for (auto e : a)
{<!-- -->
bs.set(e);
}
//Print out the data that only appears once
bs. Print();
}

Note: storing 10 billion integers requires about 40G of memory space, so the data must be stored in the file; in order to map all integers, the size of the bitmap must be 2^32 bits, that is, 4294967295, so opening a bitmap is about It needs 0.5G of memory space, and two bitmaps need 1G of memory space, so the code chooses to open up space in the heap area. If it is opened in the stack area, it will cause a stack overflow problem.

Question 2: Given two files, each with 10 billion integers, we only need 1G memory, how to find the intersection of the two files.

Idea one:

1. Read the value of the first file in turn, read it into a bitmap of the memory, and then read a file to judge whether it is in the first bitmap or not, it is the intersection. This method requires about 0.5G of memory , but there is a problem, that is, there are duplicate values in the found intersection, and we need to deduplicate them.

2. Improvement method: Every time the value of the intersection is found, the value corresponding to the first bitmap is set to 0, so that the problem of repeated values in the found intersection can be solved.

Idea 2:
1. Read the data of file 1 sequentially and map it to bitmap 1;
2. Read the data of file 2 sequentially and map it to bitmap 2;
3. Perform the AND & amp; & amp; operation on the data of bitmap 1 and bitmap 2, and the result is 1, which is the intersection.

Question 3: There are 10 billion ints in a file, 1G memory, design an algorithm to find all integers that appear no more than 2 times.

The idea of this question is the same as that of question 1. We can use the following four binary representations to represent the four states:

  1. 00 appears 0 times
  2. 01 appears 1 time
  3. 10 appears 2 times
  4. 11 appears more than 2 times
#include <iostream>
#include <vector>
#include <bitset>
using namespace std;

int main()
{<!-- -->
int a[] = {<!-- --> 3, 45, 53, 32, 32, 43, 3, 2, 5, 2, 32, 55, 5, 53,43,9,8,7,8 ,55};
// Allocate space on the heap
bitset<4294967295>* bs1 = new bitset<4294967295>;
bitset<4294967295>* bs2 = new bitset<4294967295>;
for (auto e : a)
{<!-- -->
if (!bs1->test(e) & amp; & amp; !bs2->test(e)) //00->01
{<!-- -->
bs2->set(e);
}
else if (!bs1->test(e) & amp; & amp; bs2->test(e)) //01->10
{<!-- -->
bs1->set(e);
bs2->reset(e);
}
else if (bs1->test(e) & amp; & amp; !bs2->test(e)) //10->11
{<!-- -->
bs2->set(e);
}
\t\t
}
for (size_t i = 0; i < 4294967295; i ++ )
{<!-- -->
if ((!bs1->test(i) & amp; & amp; bs2->test(i)) || (bs1->test(i) & amp; & amp; !bs2->test(i)) ) //01 or 10
cout << i << endl;
}
return 0;
}