Unordered series associative containers–detailed explanation and usage examples of hash structure

Directory

  • unordered series associative containers
    • unordered_map
  • Hash
    • Hash concept
    • Hash function
      • Direct addressing method:
      • Division with remainder method:
    • Hash collision
    • Resolve hash collisions
      • Closed hash:
      • Open hash:
        • Underlying structure hash table–code implementation:
        • Use hash structure to implement set:
        • Use hash structure to implement map:
  • Applications of hashing
    • bitmap
      • bitmap concept
      • Bitmap implementation
    • bloom filter
      • Bloom filter concept
      • Bloom filter implementation

unordered series associative containers

As explained before, STL in C++98 provides a series of associative containers with red-black tree structure at the bottom layer, which can achieve query efficiency of

l

o

g

2

N

log_2N

log2?N, that is, in the worst case, the height of the red-black tree needs to be compared. When there are very many nodes in the tree, the query efficiency is not ideal. The best query is to find the element with a small number of comparisons. Therefore, in C++11, STL provides four more unordered series of associative containers. These four containers The usage of an associative container with a red-black tree structure is basically similar, except that its underlying structure is different using a hash structure. unordered_map, unordered_set, unordered_multimap and unordered_multiset

unordered_map

  1. unordered_map is an associative container that stores key-value pairs, which allows fast indexing to the corresponding value through keys.
  2. In unordered_map, the key value is usually used to uniquely identify the element, while the map value is an object whose content is associated with this key. Keys and mapped values may be of different types.
  3. Internally, unordered_map does not sort in any specific order. In order to find the value corresponding to the key within a constant range, unordered_map places key-value pairs with the same hash value in in the same bucket.
  4. The unordered_map container is faster than map for accessing individual elements by key, but it is generally less efficient at range iteration over a subset of elements.
  5. unordered_maps implements the direct access operator (operator[]), which allows direct access to value using key as parameter.
  6. Its iterators are only forward iterators.

third party

  • The usage is the same as map
    But the operation performance is higher than map: O(1) complexity
 //Used in the same way as map
unordered_map<int, int> m;
//Operation performance is higher than map: O(1) complexity
m.insert(make_pair(1, 1));
m[2] = 2;
  • The difference is that the values traversed by map are ordered and the values traversed by unordered_map are unordered
    unordered_map only has forward iterators and no reverse iterators.
 for (int i = 3; i < 100; + + i)
{<!-- -->
m[i] = i;
}
\t
//Compared to map/set, unordered_map/set only has forward iterators
//Iterator traversal, not ordered
unordered_map<int, int>::iterator it = m.begin();
while (it != m.end())
{<!-- -->
cout << it->first << " ";
}
cout << endl;
  • equal_range: an interval that is closed on the left and open on the right, query key

Since it is a map, Key duplication is not allowed, so only one value 3 is output; if it is a multimap that can have multiple key values, more will be output.

 //equal_range:
auto range = m.equal_range(3);
it = range.first;
while (it != range.second)
{<!-- -->
cout << it->first << " ";
+ + it;
}
cout << endl;

Hash

The unordered series of associative containers are more efficient because they use a hash structure at the bottom.

Hash concept

In sequential structures and balanced trees, there is no corresponding relationship between element keys and their storage locations. Therefore, when looking for an element, multiple comparisons of the keys must be made. The time complexity of sequential search is O(N). In a balanced tree, it is the height of the tree, which is O(

l

o

g

2

N

log_2 N

log2?N), the efficiency of the search depends on the number of comparisons of elements during the search process.
Ideal search method: you can get the element to be searched directly from the table at once without any comparison.
If you construct a storage structure and use a certain function (hashFunc) to establish a one-to-one mapping relationship between the storage location of the element and its key code, then you can quickly find the element through this function during search. .
When entering this structure:

  • Insert element

According to the key code of the element to be inserted, this function calculates the storage location of the element and stores it according to this location.

  • Search element

Perform the same calculation on the key code of the element, use the obtained function value as the storage location of the element, and compare the elements based on this location in the structure. If the key codes are equal, the search is successful.

This method is the hash (hash) method. The conversion function used in thehash method is called the hash (hash) function, and the constructed structure is called a hash table (Hash Table) (or hash table). list)

Hash function

Direct addressing method:

kx + b: Suitable for position calculation of small range data. If the data range is too large, space will be wasted.

Division with remainder method:

x% space size: general
For example: data set {1, 7, 6, 4, 5, 9};
The hash function is set to: hash(key) = key % capacity; capacity is the total size of the underlying space for storing elements.

Hash collision

Keywords for two data elements

k

i

k_i

ki?and

k

j

k_j

kj?(i != j), there is

k

i

k_i

ki? !=

k

j

k_j

kj?, but there is: Hash(

k

i

k_i

ki?) ==Hash(

k

j

k_j

kj?), that is: Different keywords calculate the same hash address through the same hash number. This phenomenon is called hash conflict or hash collision. As shown in Figures 4 and 14 above, the hash addresses are the same.
Data elements with different keys but the same hash address are called “synonyms”.

Resolving hash conflicts

Closed hash:

  • Linear detection
    Starting from the calculated hash position, find the first free position and store the data.
//Record position status (data deletion in the hash table is equivalent to pseudo deletion, because if the query data finds a vacant position, the search will stop, such as 5, 15, 25. After deleting 15, when querying 25 When you reach position 15 and find it empty, the search will stop, so you cannot delete it directly when deleting 15)
enum STATE
{<!-- -->
EXIST, //exists
DELETE, //Delete
EMPTY //Empty
};

template <class K, class V>
struct hashNode
{<!-- -->
pair<k, V> _kv;
STATE _state = EMPTY;
};

//Sequence table implements hash
template <class K, class V>
classHashTable
{<!-- -->
public:
typedef HashNode<K, V> Node;

HashTable(size_t n = 10)
:_hTable(n)
,_size(0)
{<!-- -->}

bool insert(const pair<K, V> & amp; kv)
{<!-- -->
//0. Check capacity
checkCapacity();
//1. Calculate the hash position
int idx = kv.first % _hTable.size();

//2. Determine whether the key exists
while (_hTable[idx]._state != EMPTY)
{<!-- -->
//If the current location data is valid and the key is the same, the insertion fails
if (_hTable[idx]._state == EXIST & amp; & amp; kv.first == _hTable[idx]._kv.first)
{<!-- -->
return false;
}
//continue searching
+ +idx;
if (idx == _hTable.size())
idx = 0;
}
//Insert
_hTable[idx] = kv;
_hTable[idx]._state = EXIST;
+ + _size;
\t\t
return true;
}

void checkCapacity()
{<!-- -->
//Load factor: <1 Number of valid elements/capacity size
//The smaller the load factor, the more elements can be stored, but the more elements are wasted, so the trade-off is: 0.7
if (_hTable.size() == 0 || _size * 10 / _hTable.size() >= 7)
{<!-- -->
//Open a new table
int newC = _hTable.size() == 0 ? 10 : 2 * _hTable.size();
HashTable<K, V> newHt(newC);

for (int i = 0; i < _hTable.size(); + + i)
{<!-- -->
//Insert data with status exist
if (_hTable[i]._state == EXIST)
{<!-- -->
newHt.insert(_hTable[i]._kv);
}
}
Swap(newHt);
}
}

void Swap(HashTable<K, V> & amp; Ht)
{<!-- -->
swap(_hTable, Ht._hTable);
swap(_size, Ht._size);
}

Node* find(const K & amp; key)
{<!-- -->
//Calculate position
int idx = key % _hTable.size();
while (_hTable[idx]._state != EMPTY)
{<!-- -->
if (_hTable[idx]._state == EXIST & amp; & amp; key == _hTable[idx]._kv.first)
{<!-- -->
return &_hTable[idx];
}
+ +idx;
if (idx == _hTable.size())
{<!-- -->
idx = 0;
}
}
return nullptr;
}

bool erase(const K & amp; key)
{<!-- -->
Node* node = find(key);
if(node)
{<!-- -->
//fake delete
--_size;
node->_state = DELETE;
return true;
}
return false;
}

private:
vector<Node> _hTable;
size_t _size; //The number of valid elements
};

void test()
{<!-- -->
HashTable<int, int> ht;
ht.insert(make_pair(1, 1));
ht.insert(make_pair(14, 14));
ht.insert(make_pair(16, 16));
ht.insert(make_pair(11, 11));

cout << ht.erase(11) << endl;
cout << ht.erase(100) << endl;
}
  • secondary detection
    The flaw of linear detection is that conflicting data accumulates together, which is related to finding the next empty position, because the way to find empty positions is to find them one by one. Therefore, in order to avoid this problem, secondary detection needs to find the next empty position. The method for empty position is:

    H

    i

    H_i

    Hi? = (

    H

    0

    H_0

    H0? +

    i

    2

    i^2

    i2 )% m, or:

    H

    i

    H_i

    Hi? = (

    H

    0

    H_0

    H0? –

    i

    2

    i^2

    i2 )% m. Among them: i =1,2,3…,

    H

    0

    H_0

    H0? is the position obtained by calculating the key code key of the element through the hash function Hash(x), and m is the size of the table.
    The difference from linear detection is that it finds an empty position. Linear detection searches for one empty position. Secondary detection searches for the second position each time. For example, the first position is found for the first time, and the second position is found for the second time. location, the third time looking for the fourth location.

Open hash:

The open hash method is also called the chain address method (open chain method). First, a hash function is used to calculate the hash address of the key set. Key codes with the same address belong to the same sub-set, and each sub-set is called a Buckets, the elements in each bucket are linked through a singly linked list, and the head node of each linked list is stored in the hash table.

Underlying structure hash table – code implementation:
using namespace std;
//************************************************ ***************************************hash bucket

//open hash

//Nodes of singly linked list
//Because the underlying structure of map and set needs to be implemented, only value is placed in the node, not key.
template <class V>
struct HashNode
{<!-- -->
V_val;
HashNode<V>* _next;

HashNode(const V & val)
:_val(val)
, _next(nullptr)
{<!-- -->}
};

//Predeclaration of hash table
template <class K, class V, class KeyOfValue, class HashFun = HashFun<K>>
class HashTable;

//Hash table iterator: custom type
//Implemented by encapsulating the nodes of a singly linked list
template <class K, class V, class KeyOfValue, class HashFun = HashFun<K>>
struct HashIterator
{<!-- -->
typedef HashIterator<K, V, KeyOfValue, HashFun> Self;
typedef HashTable<K, V, KeyOfValue, HashFun> HT;
//Member: node pointer
typedef HashNode<V> Node;
Node* _node;
HT* _hPtr;

HashIterator(Node* node, HT* hPtr)
:_node(node)
, _hPtr(hptr)
{<!-- -->}

V & operator*()
{<!-- -->
return _node->_val;
}

V* operator->()
{<!-- -->
return _node->_val;
}

bool operator!=(const Self & amp; it)
{<!-- -->
return _node != it._node;
}

Self & operator + + ()
{<!-- -->
if (_node->_next)
{<!-- -->
_node = _node->_next;
}
else
{<!-- -->
//Find the head node of the next non-empty linked list
//Calculate the position of the current node in the hash table
KeyOfValue kov;
HashFun hfun;
size_t idx = hfun(kov(_node->_val)) & _hPtr->_ht.size();
//Start searching from the next hash position
+ +idx;
for (; idx < _hPtr->_ht.size(); + + idx)
{<!-- -->
if (_hPtr->_ht[idx])
{<!-- -->
//Find the next non-empty linked list
_node = _hPtr->_ht[idx];
break;
}
}
if (idx == _hPtr->_ht.size())
{<!-- -->
_node = nullptr;
}
}
return *this;
}
};

template <class K, class V, class KeyOfValue, class HashFun>
classHashTable
{<!-- -->
public:
typedef HashNode<V> Node;
typedef HashIterator<K, V, KeyOfValue, HashFun> iterator;

//Declare the iterator as a friend class
template <class K, class V, class KeyOfValue, class HashFun>
friend struct HashIterator;

HashTable(int n = 10)
: _ht(n)
,_size(0)
{<!-- -->}

iterator begin()
{<!-- -->
//Head node of the first non-empty linked list
for (size_t i = 0; i < _ht.size(); + + i)
{<!-- -->
if (_ht[i])
{<!-- -->
return iterator(_ht[i], this);
}
}
return iterator(nullptr, this);
}

iterator end()
{<!-- -->
return iterator(nullptr, this);
}

bool insert(const V & val)
{<!-- -->
//0. Check capacity
checkCapacity();

//1. Calculate hash position
KeyOfValue kov;
HashFun hfun;
int idx = hfun(kov(val)) % _ht.size();

//2. Search
Node* cur = _ht[idx];
while (cur)
{<!-- -->
if (kov(cur->_val) == kov(val))
{<!-- -->
//key repeats
return false;
}
cur = cur->_next;
}

//3. Insert: header plug
cur = new Node(val);
cur->_next = _ht[idx];
_ht[idx] = cur;
+ + _size;
reutrn true;
}

void checkCapacity()
{<!-- -->
if (_size == _ht.size())
{<!-- -->
//int newC = _size == 0 ? 10 : 2 * _size;
size_t newC = GetNextPrime(_ht.size());
\t
//Create a new pointer array
vector<Node*> newHt(newC);

KeyOfValue kov;
HashFun hfun;
//Traverse the old table
for (size_t i = 0; i < _ht.size(); + + i)
{<!-- -->
Node* cur = _ht[i];
//Traverse singly linked list
while (cur)
{<!-- -->
Node* next = cur->_next;
//Calculate new position
int idx = hfun(kov(cur->_val)) & amp; newHt.size();
//Head plug
cur->_next = newHt[idx];
newHt[idx] = cur;

cur = next;
}
//Clear the old table pointer
_ht[i] = nullptr;
}
//New table, old table exchange
swap(_ht, newHt);
}
}

private:
//array of pointers
vector<Node*> _ht;
int _size;
};

template <class K>
struct HashFun
{<!-- -->
size_t operator()(const K & amp; key)
{<!-- -->
return key;
}
};

template <>
struct HashFun<string>
{<!-- -->
size_t operator()(const string & amp; key)
{<!-- -->
size_t hash = 0;
for (const auto & amp; ch : key)
{<!-- -->
hash = hash * 131 + ch;
}
return hash;
}
};
Use hash structure to implement set:
template <class K>
class Set
{<!-- -->
struct SetKeyOfValue
{<!-- -->
const K & amp; operator()(const K & amp; key)
{<!-- -->
return key;
}
};
public:

typedef typename HashTable<K, K, SetKeyOfValue>::iterator iterator;

iterator begin()
{<!-- -->
return _ht.begin();
}

iterator end()
{<!-- -->
reutrn _ht.end();
}

pair<iterator, bool> insert(const K & amp; key)
{<!-- -->
return _ht.insert(key);
}
private:
HashTable<K, K, SetKeyOfValue> _ht;
};

void test_set()
{<!-- -->
Set<int>s;
s.insert(1);
s.insert(2);
s.insert(3);
s.insert(4);
s.insert(41);
s.insert(42);
s.insert(11);

Set<int>::iterator it = s.begin();
while (it != s.end())
{<!-- -->
cout << *it << " ";
+ + it;
}
cout << endl;
\t
for (const auto & e : s)
{<!-- -->
cout << e << " ";
}
cout << endl;
}
Use hash structure to implement map:
template <class K, class V>
class Map
{<!-- -->
struct MapKeyOfValue
{<!-- -->
const K & amp; operator()(const pair<K, V> & amp; val)
{<!-- -->
return val.first;
}
};
public:
typedef typename HashTable<K, pair<K, V>, MapKeyOfValue>::iterator iterator;
\t
pair<iterator, bool> insert(const pair<K,V> & amp; val)
{<!-- -->
return _ht.insert(val);
}

iterator begin()
{<!-- -->
return _ht.begin();
}

iterator end()
{<!-- -->
reutrn _ht.end();
}
\t
V & amp; operator[](const K & amp; key)
{<!-- -->
pair<iterator, bool> ret = _ht.insert(make_pair(key, V()));
return ret.first->second;
}
private:
HashTable<K, pair<K, V>, MapKeyOfValue> _ht;
};

void test_map()
{<!-- -->
Map<int, int> m;
m.insert(make_pair(1, 1));
m[20] = 20;
m[30] = 30;
m[1] = 10;

Map<int, int>::iterator it = m.begin();
while (it != m.end())
{<!-- -->
//it's dereference is the pair object
cout << it->first << "-->" << it->second << endl;
+ + it;
}

cout << "range for" << endl;
for (const auto & amp; p : m)
{<!-- -->
cout << p.first << "-->" << p.second << endl;
}
}

Applications of hashing

Bitmap

Bitmap concept

Bitmap is to use each bit to store a certain state. It is suitable for scenarios where massive data is not repeated. It is usually used to determine whether a certain data exists.

Normally storing an integer requires 4 bytes and 32 bits; using a bitmap to store the information of an integer only requires one bit, which is equivalent to 1/32 of the normal storage.

Bitmap implementation

//Bitmap: Usage scenario-->Storage simple information of non-duplicate data, no need to store the data itself
//Advantages: space saving, high search efficiency: O(1)
class BitSet
{<!-- -->
public:
//The memory size of the bitmap is related to the data range
BitSet(size_t range)
:_bit(range / 32 + 1)
{<!-- -->}

//Storage information
void set(size_t num)
{<!-- -->
//Calculate position
//1. Calculate the integer position: / 32
int idx = num / 32;
//2. Calculate bit position: % 32
int bitIdx = num % 32;
//Set the corresponding bit position to 1
//Bitwise OR operation
_bit[idx] |= 1 << bitIdx;
}

\t//Finding information
bool find(size_t num)
{<!-- -->
int idx = num / 32;
int bitIdx = num % 32;
//Move the high bit of bitIdx to the right to bit 0, and AND it with 1
return (_bit[idx] >> bitIdx) & amp; 1;
}

\t//delete message
void reset(size_t num)
{<!-- -->
int idx = num / 32;
int bitIdx = num % 32;
_bit[idx] & amp;= ~(1 << bitIdx);
}
private:
//integer array
vector<int> _bit;
};

void test()
{<!-- -->
//Open 512 data locations to record the binary information of 512 data
BitSet bit(512);
//Set 1 to the following data position
bit.set(1);
bit.set(32);
bit.set(512);

//If the value exists, return 1, if it does not exist, return 0
bool ret = bit.find(1);
ret = bit.find(3);

//Delete the value from the bitmap (set the position to 0)
bit.reset(32);
}

Bloom filter

Bloom filter concept

Probabilistic data structure, characterized by efficient insertion and query, can be used to tell you “something must not exist or may exist”. It uses multiple hash functions to A data is mapped into a bitmap structure. This method can not only improve query efficiency, but also save a lot of memory space.

Bloom filter implementation

//m: required bit size
//n: number of elements
//k: number of hash functions
// k = m/n * ln2
// m = k * n * 1.4
//Usage scenario: store simple information of various data
//Probabilistic container: Judgment exists, the result may not be correct
//Generally there is no deletion operation: there is accidental deletion
//Time complexity: O(k)
template <class T, class Hash1, class Hash2, class Hash3>
classBloomFilter
{<!-- -->
public:
BloomFilter(size_t num)
//5 ≈ 3*1.4
:_bit(5 * num)
, _bitCount(5 * num)
{<!-- -->}

//Storage information: use multiple bits to store information
void set(const T & val)
{<!-- -->
{<!-- -->
Hash1 h1;
Hash2 h2;
Hash3 h3;
int idx1 = h1(val) % _bitCount;
int idx2 = h2(val) % _bitCount;
int idx3 = h3(val) % _bitCount;
_bit.set(idx1);
_bit.set(idx2);
_bit.set(idx3);
}
}

bool test(const T & val)
{<!-- -->
{<!-- -->
Hash1 h1;
Hash2 h2;
Hash3 h3;
int idx1 = h1(val) % _bitCount;
int idx2 = h2(val) % _bitCount;
int idx3 = h3(val) % _bitCount;
\t\t\t
if (!_bit.find(idx1))
return false;
if (!_bit.find(idx2))
return false;
if (!_bit.find(idx3))
return false;

return true; //may exist
}
}
private:
BitSet_bit;
//Number of bits
size_t _bitCount;
};

struct HashFun1
{<!-- -->
size_t operator()(const string & amp; str)
{<!-- -->
size_t hash = 0;
for (const auto & ch : str)
{<!-- -->
hash = hash * 131 + ch;
}
return hash;
}
};

struct HashFun2
{<!-- -->
size_t operator()(const string & amp; str)
{<!-- -->
size_t hash = 0;
for (const auto & ch : str)
{<!-- -->
hash = hash * 65599 + ch;
}
return hash;
}
};

struct HashFun3
{<!-- -->
size_t operator()(const string & amp; str)
{<!-- -->
size_t hash = 0;
for (const auto & ch : str)
{<!-- -->
hash = hash * 1313131 + ch;
}
return hash;
}
};

void test_b()
{<!-- -->
BloomFilter<string, HashFun1, HashFun2, HashFun3> bf(10);

string str1 = "https://blog.csdn.net/m0_49619206?spm=1000.2115.3001.5343";
string str2 = "https://mp.csdn.net/mp_blog/creation/success/134140411";
string str3 = "https://blog.csdn.net/m0_49619206/article/details/134034481?spm=1001.2014.3001.5501";

bf.set(str1);
bf.set(str2);

bool ret = bf.test(str1);
ret = bf.test(str2);
ret = bf.test(str3);
}