[C++] Hash – unordered series containers | Hash conflict | Closed hashing | Open hashing

Article directory

    • 1. Unordered series of associative containers
    • 2. Hash concept
    • 3. Hash collision
    • 4. Hash function
    • 5. Resolve hash conflicts
      • 1. Closed hashing – open addressing method
      • 2. Code implementation
      • 3. Open hashing – open chain method
      • 4. Code implementation
    • 6. Conclusion

1. Unordered series of associative containers

In C++98, the STL provides a series of associative containers whose underlying structure is a red-black tree. Query efficiency reaches O(log2N): in the worst case the number of key comparisons equals the height of the tree, so when the tree holds very many nodes the query efficiency is not ideal. The ideal search finds an element with as few comparisons as possible. Therefore, in C++11, the STL added four unordered associative containers. Their usage is basically similar to the red-black-tree-based associative containers; only the underlying structure differs: the unordered containers are more efficient because they use a hash structure.

unordered_set:


The difference from set is that it does not support reverse iterators, only one-way (forward) iterators; the other interfaces are basically the same as set.

#include <iostream>
#include <unordered_set>
using namespace std;

int main()
{
    unordered_set<int> us;
    us.insert(10);
    us.insert(1);
    us.insert(10);
    us.insert(3);
    us.insert(4);
    us.insert(4);

    auto it = us.begin();
    while (it != us.end())
    {
        cout << *it << " ";
        it++;
    }
    cout << endl;
    return 0;
}

The output shows the elements are stored unordered, with duplicates removed.

unordered_map:


Its iterators are likewise one-way (forward) iterators; otherwise the interface is basically the same as map.

#include <iostream>
#include <string>
#include <unordered_map>
using namespace std;

int main()
{
    unordered_map<string, int> countMap;
    string arr[] = { "apple", "banana", "apple" };
    for (auto& e : arr)
    {
        /*auto it = countMap.find(e);
        if (it == countMap.end())
        {
            countMap.insert(make_pair(e, 1));
        }
        else
        {
            it->second++;
        }*/
        countMap[e]++;
    }
    for (auto& kv : countMap)
    {
        cout << kv.first << ":" << kv.second << endl;
    }
    return 0;
}


Bucket operation of unordered_map

Function declaration                   Function introduction
size_t bucket_count() const            Returns the total number of buckets in the hash table
size_t bucket_size(size_t n) const     Returns the number of valid elements in bucket n
size_t bucket(const K& key) const      Returns the bucket number where the element key is located

Practical use:

Find the element repeated N times in an array of length 2N

You are given an integer array nums with the following properties:

    • nums.length == 2 * n
    • nums contains n + 1 distinct elements
    • Exactly one element in nums is repeated n times

Find and return the element repeated n times.


class Solution {
public:
    int repeatedNTimes(vector<int>& nums) {
        unordered_map<int, int> CountMap;
        //count occurrences
        for (auto& e : nums)
        {
            CountMap[e]++;
        }
        //find the element repeated n == nums.size() / 2 times
        for (auto& kv : CountMap)
        {
            if (kv.second == nums.size() / 2)
                return kv.first;
        }
        return -1;
    }
};


Insert/find/erase performance comparison: the unordered containers are more efficient with random numbers, but less so with ordered numbers.

#include <iostream>
#include <vector>
#include <set>
#include <unordered_set>
#include <ctime>
#include <cstdlib>
using namespace std;

int main()
{
    const size_t N = 1000000;

    unordered_set<int> us;
    set<int> s;

    vector<int> v;
    v.reserve(N);
    srand(time(0));
    for (size_t i = 0; i < N; ++i)
    {
        v.push_back(rand());
    }

    size_t begin1 = clock();
    for (auto e : v)
    {
        s.insert(e);
    }
    size_t end1 = clock();
    cout << "set insert:" << end1 - begin1 << endl;

    size_t begin2 = clock();
    for (auto e : v)
    {
        us.insert(e);
    }
    size_t end2 = clock();
    cout << "unordered_set insert:" << end2 - begin2 << endl;

    size_t begin3 = clock();
    for (auto e : v)
    {
        s.find(e);
    }
    size_t end3 = clock();
    cout << "set find:" << end3 - begin3 << endl;

    size_t begin4 = clock();
    for (auto e : v)
    {
        us.find(e);
    }
    size_t end4 = clock();
    cout << "unordered_set find:" << end4 - begin4 << endl;

    cout << s.size() << endl;
    cout << us.size() << endl;

    size_t begin5 = clock();
    for (auto e : v)
    {
        s.erase(e);
    }
    size_t end5 = clock();
    cout << "set erase:" << end5 - begin5 << endl;

    size_t begin6 = clock();
    for (auto e : v)
    {
        us.erase(e);
    }
    size_t end6 = clock();
    cout << "unordered_set erase:" << end6 - begin6 << endl;

    return 0;
}


2. Hash concept

In a sequential structure or a balanced tree, there is no mapping between an element's key and its storage position, so searching for an element requires multiple key comparisons: sequential search is O(N), and in a balanced tree the cost is the height of the tree, O(log2N). Search efficiency is determined by the number of comparisons made during the search.
Ideal search method: obtain the element from the table directly, in one step, without any comparisons. If we construct a storage structure that uses some function (hashFunc) to establish a one-to-one mapping between an element's storage position and its key, the element can be found immediately through that function.

Hash map: establish an association between key value and storage location

When inserting elements into this structure
According to the key code of the element to be inserted, this function calculates the storage location of the element and stores it according to this location
Search element
Perform the same calculation on the key code of the element, use the obtained function value as the storage location of the element, and compare the elements based on this location in the structure. If the key codes are equal, the search is successful

This method is called hashing. The conversion function used in hashing is called the hash function, and the structure built with it is called a hash table (Hash Table).

Set the hash function to hash(key) = key % capacity, where capacity is the total size of the underlying storage space. For example, for the data set {1, 7, 6, 4, 5, 9}:

index: 0   1   2   3   4   5   6   7   8   9
value:     1           4   5   6   7       9

3. Hash conflict

Searching with this method requires no repeated key comparisons, so it is fast. But when element 44 is inserted, a hash conflict occurs.

Hash collision: different keys compute the same hash address through the same hash function; this is called a hash conflict or hash collision. For example, 44 % 10 = 4, but slot 4 is already taken by 4.

4. Hash function

If the hash function is not designed reasonably enough, it will cause a hash conflict.

Hash function design principles:

The domain of the hash function must include all keys that need to be stored, and if the hash table allows m addresses, its value range must be between 0 and m-1
The addresses calculated by the hash function can be evenly distributed throughout the space
The hash function should be simple

Common hash functions

  1. Direct addressing method – (commonly used)
    Take a linear function of the key as the hash address: Hash(key) = A*key + B. Advantages: simple and uniform. Disadvantages: the distribution of the keys must be known in advance. Usage scenario: suitable when the keys are relatively small and continuous.
  2. Division with remainder method – (commonly used)
    Assume the hash table allows m addresses; take a prime number p that is not greater than m but closest to (or equal to) m as the divisor, and convert the key into a hash address with the hash function: Hash(key) = key % p (p <= m).
  3. Square-middle method – (for understanding)
    Suppose the key is 1234; its square is 1522756, and the middle 3 digits 227 are extracted as the hash address. Similarly, for the key 4321, its square is 18671041 and the middle 3 digits 671 (or 710) are extracted. The square-middle method suits cases where the key distribution is unknown and the number of digits is not very large.
  4. Folding method – (for understanding)
    Split the key from left to right into several parts with equal digit counts (the last part may be shorter), sum these parts, and, according to the hash table length, take the last few digits as the hash address. The folding method does not require knowing the key distribution in advance and suits keys with many digits.
  5. Random number method – (for understanding)
    Choose a random function and take the random function value of the key as its hash address: H(key) = random(key), where random is a random-number function. This method is usually used when keys have different lengths.
  6. Digit analysis method – (for understanding)
    Suppose there are n keys of d digits each, where each digit may be one of r different symbols. These r symbols may not appear with the same frequency in every digit position: in some positions they are evenly distributed, while in others only certain symbols appear frequently. According to the size of the hash table, select the positions in which the symbols are evenly distributed as the hash address.

The more sophisticated the hash function is designed, the lower the possibility of hash collisions, but hash collisions cannot be avoided.

5. Resolve hash conflicts

Two common ways to resolve hash collisions are: closed hashing and open hashing

1. Closed hashing – open addressing method

Closed hashing, also called the open addressing method: when a hash conflict occurs, as long as the hash table is not full there must be an empty position somewhere in it, so the key can be stored in the "next" empty position after the conflicting one. How is that next empty position found?

  • Linear Detection

Starting from the position where the conflict occurs, probe backwards until the next empty position is found.

Insertion: Obtain the position of the element to be inserted in the hash table through the hash function

(figure: inserting 44 — hash(44) = 4 is occupied, so linear probing walks forward and places 44 in the first empty slot, index 8)

Deletion: when closed hashing handles conflicts, an existing element cannot simply be physically removed from the table, because emptying its slot would break the search for other elements. In the example above, if 27 is deleted and we then search for 38, probing hits the now-empty slot and stops early, so 38 is never found. Solution: linear probing uses marked pseudo-deletion — each slot carries a state flag, and deleting an element just sets its flag.

In a limited space, as more and more data is inserted, the probability of conflict grows and search efficiency drops. A closed hash table therefore must never be allowed to fill up, which is why the load factor is introduced:

Load factor: the number of valid entries in the table divided by the table size; it measures how full the table is. In closed hashing the load factor cannot exceed 1 (1 means full); in practice it is kept around 0.7. The smaller the load factor, the lower the conflict probability but the more space is wasted; the larger the load factor, the higher the conflict probability but the better the space utilization.

When the load factor exceeds 0.7, the table must be expanded. Expansion cannot simply copy the old slots: each element's mapped position depends on the table size, so every position must be recalculated against the new size.

Advantages of linear probing: the logic is simple and easy to implement.
Disadvantages of linear probing: once hash conflicts occur, the conflicting entries pile up contiguously, producing data "clustering": different keys compete for the same run of free slots, so locating a key can require many comparisons before an empty slot is reached, which reduces search efficiency.

  • Quadratic probing

The flaw of linear probing is that conflicting data accumulates in one run, which stems from how the next empty position is found: slot by slot (start + i). To mitigate this, quadratic probing searches with squared offsets instead: start + i^2.


But in essence it still does not solve the problem: a conflicting key still occupies a slot that belongs to another key.

2. Code implementation

#include <iostream>
#include <vector>
#include <string>
using namespace std;

//Functor: converts a key into an integer that can be taken modulo
template<class K>
struct HashFunc
{
    size_t operator()(const K& key)
    {
        return (size_t)key;
    }
};
//Specialization for string
template<>
struct HashFunc<string>
{
    size_t operator()(const string& key)
    {
        size_t hash = 0;
        for (auto ch : key)
        {
            hash *= 131;
            //multiplying first keeps order significant: "abc" != "cba"
            hash += ch;
        }
        return hash;
    }
};
//Closed hashing
namespace closehash
{
    enum State
    {
        EMPTY,
        EXIST,
        DELETE
    };

    template<class K, class V>
    struct HashData
    {
        pair<K, V> _kv;
        State _state = EMPTY;
    };

    template<class K, class V, class Hash = HashFunc<K>>
    class HashTable
    {
        typedef HashData<K, V> Data;
    public:
        HashTable()
            :_n(0)
        {
            _tables.resize(10);
        }

        bool Insert(const pair<K, V>& kv)
        {
            if (Find(kv.first))
                return false;

            //expand when the load factor reaches 0.7
            if (_n * 10 / _tables.size() >= 7)
            {
                HashTable<K, V, Hash> newHT;
                newHT._tables.resize(_tables.size() * 2);
                for (auto& e : _tables)
                {
                    if (e._state == EXIST)
                    {
                        newHT.Insert(e._kv);
                    }
                }
                _tables.swap(newHT._tables);
            }

            Hash hf;
            size_t hashi = hf(kv.first) % _tables.size();
            //linear probing: walk forward to the next non-EXIST slot
            while (_tables[hashi]._state == EXIST)
            {
                ++hashi;
                hashi %= _tables.size();
            }
            _tables[hashi]._kv = kv;
            _tables[hashi]._state = EXIST;
            ++_n;
            return true;
        }

        Data* Find(const K& key)
        {
            Hash hf;
            size_t hashi = hf(key) % _tables.size();
            size_t starti = hashi;
            while (_tables[hashi]._state != EMPTY)
            {
                if (_tables[hashi]._state == EXIST && key == _tables[hashi]._kv.first)
                {
                    return &_tables[hashi];
                }
                ++hashi;
                hashi %= _tables.size();
                //wrapped all the way around: not found
                if (hashi == starti)
                {
                    break;
                }
            }
            return nullptr;
        }

        bool Erase(const K& key)
        {
            Data* ret = Find(key);
            if (ret)
            {
                //pseudo-deletion: just mark the slot
                ret->_state = DELETE;
                --_n;
                return true;
            }
            else
            {
                return false;
            }
        }
    private:
        vector<Data> _tables;
        size_t _n = 0;
    };

    void TestHT1()
    {
        HashTable<int, int> ht;
        int a[] = { 18, 8, 7, 27, 57, 3, 38, 18 };
        for (auto e : a)
        {
            ht.Insert(make_pair(e, e));
        }

        ht.Insert(make_pair(17, 17));
        ht.Insert(make_pair(5, 5));

        cout << ht.Find(7) << endl;
        cout << ht.Find(8) << endl;

        ht.Erase(7);
        cout << ht.Find(7) << endl;
        cout << ht.Find(8) << endl;
    }

    void TestHT2()
    {
        string arr[] = { "apple", "watermelon", "banana", "strawberry", "apple", "watermelon",
            "apple", "apple", "watermelon", "apple", "banana", "apple", "banana" };

        //HashTable<string, int, HashFuncString> countHT;
        HashTable<string, int> countHT;
        for (auto& e : arr)
        {
            HashData<string, int>* ret = countHT.Find(e);
            if (ret)
            {
                ret->_kv.second++;
            }
            else
            {
                countHT.Insert(make_pair(e, 1));
            }
        }

        HashFunc<string> hf;
        cout << hf("abc") << endl;
        cout << hf("bac") << endl;
        cout << hf("cba") << endl;
        cout << hf("aad") << endl;
    }
}


Points to note in the implementation:

1. Functor: when counting occurrences of strings, the key cannot be used with the modulo operator directly, so HashTable takes a functor parameter Hash that converts any key type into an integer that can be taken modulo; a template specialization of HashFunc for string handles the string case.

2. String hashing: character order matters. For "abc" and "cba", simply summing the characters would give the same result, so the running hash is multiplied by 131 before each character is added.

3. Open hashing – open chain method

Open hashing: also called the chained address method (open chain method, or separate chaining). First, the hash function computes a hash address for every key in the set; keys with the same address belong to the same sub-collection, called a bucket. The elements in each bucket are linked together in a singly linked list, and the head node of each list is stored in the hash table.


Each bucket in open hashing holds the elements that collided at that address, and within a bucket the elements need not be in any particular order.

Open hash capacity expansion problem:

Since the number of buckets is fixed, the number of elements per bucket keeps growing as elements are inserted; in extreme cases one bucket can accumulate a long linked list, which degrades the hash table's performance, so the table must be expanded under some condition.

The best case for open hashing is exactly one node per bucket; beyond that point, every further insertion causes a collision.

Therefore, once the number of elements equals the number of buckets, the table's capacity can be increased. Analysis shows that using a prime number as the hash table length reduces conflicts as much as possible, so a table of primes can be defined in advance.

4. Code implementation

//Open hashing
namespace buckethash
{
    template<class K, class V>
    struct HashNode
    {
        pair<K, V> _kv;
        HashNode<K, V>* _next;

        HashNode(const pair<K, V>& kv)
            :_kv(kv)
            , _next(nullptr)
        {}
    };

    template<class K, class V, class Hash = HashFunc<K>>
    class HashTable
    {
        typedef HashNode<K, V> Node;
    public:
        HashTable()
            :_n(0)
        {
            _tables.resize(__stl_next_prime(0));
        }

        ~HashTable()
        {
            for (size_t i = 0; i < _tables.size(); ++i)
            {
                //free every node in this bucket's list
                Node* cur = _tables[i];
                while (cur)
                {
                    Node* next = cur->_next;
                    delete cur;
                    cur = next;
                }
                _tables[i] = nullptr;
            }
        }

        bool Insert(const pair<K, V>& kv)
        {
            if (Find(kv.first))
            {
                return false;
            }

            //expand when the element count reaches the bucket count
            if (_tables.size() == _n)
            {
                //Reusing Insert on a new table (below) would allocate new
                //nodes; moving the existing nodes over avoids that cost
                /*HashTable<K, V, Hash> newHT;
                newHT._tables.resize(_tables.size() * 2);
                for (auto cur : _tables)
                {
                    while (cur)
                    {
                        newHT.Insert(cur->_kv);
                        cur = cur->_next;
                    }
                }
                _tables.swap(newHT._tables);*/

                vector<Node*> newTables;
                newTables.resize(__stl_next_prime(_tables.size()), nullptr);
                for (size_t i = 0; i < _tables.size(); i++)
                {
                    Node* cur = _tables[i];
                    while (cur)
                    {
                        Node* next = cur->_next;
                        //relink each node to the head of its new bucket
                        size_t hashi = Hash()(cur->_kv.first) % newTables.size();
                        cur->_next = newTables[hashi];
                        newTables[hashi] = cur;
                        cur = next;
                    }
                    _tables[i] = nullptr;
                }
                _tables.swap(newTables);
            }

            //head insertion into the target bucket
            size_t hashi = Hash()(kv.first) % _tables.size();
            Node* newnode = new Node(kv);
            newnode->_next = _tables[hashi];
            _tables[hashi] = newnode;
            ++_n;
            return true;
        }

        Node* Find(const K& key)
        {
            size_t hashi = Hash()(key) % _tables.size();
            Node* cur = _tables[hashi];
            while (cur)
            {
                if (cur->_kv.first == key)
                {
                    return cur;
                }
                else
                {
                    cur = cur->_next;
                }
            }
            return nullptr;
        }

        bool Erase(const K& key)
        {
            size_t hashi = Hash()(key) % _tables.size();
            Node* prev = nullptr;
            Node* cur = _tables[hashi];
            while (cur)
            {
                if (cur->_kv.first == key)
                {
                    if (cur == _tables[hashi])
                    {
                        _tables[hashi] = cur->_next;
                    }
                    else
                    {
                        prev->_next = cur->_next;
                    }
                    delete cur;
                    --_n;
                    return true;
                }
                else
                {
                    prev = cur;
                    cur = cur->_next;
                }
            }
            return false;
        }

        //next table size: the smallest prime in the list greater than n
        inline unsigned long __stl_next_prime(unsigned long n)
        {
            static const int __stl_num_primes = 28;
            static const unsigned long __stl_prime_list[__stl_num_primes] =
            {
                53, 97, 193, 389, 769,
                1543, 3079, 6151, 12289, 24593,
                49157, 98317, 196613, 393241, 786433,
                1572869, 3145739, 6291469, 12582917, 25165843,
                50331653, 100663319, 201326611, 402653189, 805306457,
                1610612741, 3221225473ul, 4294967291ul
            };

            for (int i = 0; i < __stl_num_primes; ++i)
            {
                if (__stl_prime_list[i] > n)
                {
                    return __stl_prime_list[i];
                }
            }
            return __stl_prime_list[__stl_num_primes - 1];
        }

    private:
        vector<Node*> _tables;
        size_t _n = 0;
    };

    void TestHT1()
    {
        HashTable<int, int> ht;
        int a[] = { 18, 8, 7, 27, 57, 3, 38, 18, 17, 88, 38, 28 };
        for (auto e : a)
        {
            ht.Insert(make_pair(e, e));
        }

        ht.Insert(make_pair(5, 5));

        ht.Erase(17);
        ht.Erase(57);
    }

    void TestHT2()
    {
        string arr[] = { "apple", "watermelon", "banana", "strawberry", "apple", "watermelon",
            "apple", "apple", "watermelon", "apple", "banana", "apple", "banana" };

        //HashTable<string, int, HashFuncString> countHT;
        HashTable<string, int> countHT;
        for (auto& e : arr)
        {
            auto ret = countHT.Find(e);
            if (ret)
            {
                ret->_kv.second++;
            }
            else
            {
                countHT.Insert(make_pair(e, 1));
            }
        }
    }
}


6. Conclusion

Comparison of open hashing and closed hashing
Applying the chained address method to handle overflow requires an extra link pointer per node, which looks like added storage overhead. In fact, the open addressing method must maintain a large amount of free space to guarantee search efficiency (for example, quadratic probing requires a load factor a <= 0.7), and a table entry usually occupies much more space than a pointer, so the chained address method actually saves storage compared with open addressing.
