unordered_map/set

1.unordered series associative containers

In C++98, STL provides a series of associative containers with a red-black tree structure at the bottom. However, when there are many nodes, the query efficiency is not ideal. Therefore, in C++11, STL provides There are four unordered series of associative containers. These four containers are basically similar to the associative containers of the red-black tree structure. The only difference is that their underlying structures are different and the query efficiency is faster.

2.unordered_map

(1). unordered_map is an associative container that stores key-value pairs, which allows quick indexing through keys.

Its corresponding value.

(2). In unordered_map, the key value is usually used to uniquely identify the element, and the map value is an object whose content is the same as this

key association. Keys and mapped values may be of different types.

(3). Internally, unordered_map does not sort in any specific order, in order to be within a constant range

Find the value corresponding to the key, and unordered_map puts key-value pairs with the same hash value in the same bucket.

(4). The unordered_map container accesses a single element through key faster than map, but it usually iterates over the range of a subset of elements.

Generation efficiency is low.

(5). unordered_maps implements the direct access operator (operator[]), which allows direct access using key as a parameter

value.

(6). Its iterator is at least a forward iterator.

#pragma once
#include "HashTable.h"

namespace L
{

// A type to be used as the key of unordered_map/unordered_set must support conversion to a modulo functor and == comparison
// For a type to be used as a set/map key, it must support < comparison
template <class K, class V, class Hash = HashFun<K>>
class unordered_map
{
public:
struct KeyOfT_map
{
const K & amp; operator()(const pair<K, V> & amp; kv)
{
return kv.first;
}
};

typedef typename HashBucket::HashBucket<K, pair<const K, V>, KeyOfT_map,Hash>::iterator iterator;
typedef typename HashBucket::HashBucket<K, pair<const K, V>, KeyOfT_map, Hash>::const_iterator const_iterator;

iterator begin()
{
return _bucket.begin();
}

iterator end()
{
return _bucket.end();
}

const_iterator begin() const
{
return _bucket.begin();
}

const_iterator end() const
{
return _bucket.end();
}

V & amp; operator[](const K & amp; key)
{
pair<iterator, bool> ret = _bucket.Insert(make_pair(key,V()));
return ret.first->second;
}

pair<iterator, bool> Insert(const pair<K, V> & amp; kv)
{
return _bucket.Insert(kv);
}
private:
HashBucket::HashBucket<K, pair<const K, V>, KeyOfT_map,Hash> _bucket;
};

3.unordered_set

(1) Data is no longer stored in the form of key-value pairs, but the value of the data is stored directly.

(2) The values of each element stored inside the container are not equal to each other and cannot be modified.

(3) The internally stored data will not be sorted (this is related to the fact that the bottom layer of the container uses a hash table structure to store data.

#pragma once
#include "HashTable.h"

namespace L
{
// A type to be used as the key of unordered_map/unordered_set must support conversion to a modulo functor and == comparison
// For a type to be used as a set/map key, it must support < comparison

template <class K, class Hash = HashFun<K>>
class unordered_set
{
public:

struct KeyOfT_set
{
const K & amp; operator()(const K & amp; key)
{
return key;
}
};

typedef typename HashBucket::HashBucket<K, K, KeyOfT_set,Hash>::const_iterator iterator;
typedef typename HashBucket::HashBucket<K, K, KeyOfT_set, Hash>::const_iterator const_iterator;

iterator begin()
{
return _bucket.begin();
}

iterator end()
{
return _bucket.end();
}

const_iterator begin() const
{
return _bucket.begin();
}

const_iterator end() const
{
return _bucket.end();
}

pair<iterator, bool> Insert(const K & amp; key)
{
return _bucket.Insert(key);
}
private:
HashBucket::HashBucket<K, K, KeyOfT_set,Hash> _bucket;
};

4. Differences from map/set

The differences between the new unordere series of map and set in C++11 and the map and set in C++98 are:

(1) The underlying structures are different. The underlying structure of C++98 is a red-black tree structure, and the underlying structure of C++11 is a hash table/hash bucket structure.

(2) Query efficiency is different, C++11’s map and set query efficiency is higher.

(3) The functions are slightly different. C++98’s map and set support sorting by key, while unordered_map/set does not support sorting.

5. Underlying structure

The unordered series of associative containers are more efficient because they use a hash structure at the bottom.

Use a certain function (hashFunc) to establish the relationship between the storage location of the element and its key code
One-to-one mapping relationship, then the element can be found quickly through this function during search.

This method is the hash (hash) method. The conversion function used in the hash method is called the hash (hash) function, and the constructed structure is called

It is a Hash Table (or hash table).

5.1 Hash collision

Different keywords calculate the same hash address using the same hash number. This phenomenon is called hash conflict or hash collision. In order to avoid as many hash conflicts as possible, a hash function is introduced.

5.2 Hash function

Common Hash Functions

1. Direct addressing method–(commonly used)

Take a linear function of the key as the hash address: Hash (Key) = A*Key + B.

Advantages: Simple and uniform.

Disadvantages: Need to know the distribution of keywords in advance.

2. Division with remainder method–(commonly used)

Suppose the number of addresses allowed in the hash table is m, and take a prime number p that is no greater than m, but closest to or equal to m as the divisor,

According to the hash function: Hash(key) = key% p(p<=m), convert the key code into a hash address.

3. Method of finding the middle of squares–(understand)

Assume that the keyword is 1234, square it is 1522756, and extract the three digits 227 in the middle as the hash address;

For another example, the keyword is 4321. Square it is 18671041, and extract the three digits 671 (or 710) in the middle as the hash address.

The square-middle method is more suitable for situations where the distribution of keywords is not known and the number of digits is not very large.

4. Folding method–(understand)

The folding method is to divide the keyword into several parts with equal digits from left to right (the last part can have shorter digits), and then divide these parts into

Several parts are superimposed and summed, and according to the length of the hash table, the last few bits are taken as the hash address.

The folding method is suitable for situations where the distribution of keywords does not need to be known in advance and is suitable for situations where there are a large number of keyword digits.

5. Random number method–(understand)

Choose a random function and take the random function value of the keyword as its hash address, that is, H(key) = random(key), where

random is a random number function.

6. Mathematical Analysis–(Understand)

Suppose there are n d digits. Each digit may have r different symbols. The frequency of these r different symbols appearing in each digit is not necessarily certain.

The same, it may be relatively evenly distributed on some bits, and each symbol has an equal chance of appearing, but it may be unevenly distributed on some bits.

There are certain symbols that appear frequently. According to the size of the hash table, several bits in which various symbols are evenly distributed can be selected as the hash table.

Column address.

Numerical analysis is usually suitable for dealing with situations where the number of keywords is relatively large, if the distribution of keywords is known in advance and the

Several bits are evenly distributed.

Note: The more sophisticated the hash function is, the lower the possibility of hash collisions, but hash collisions cannot be avoided.

5.3 Hash conflict resolution

Two common methods of resolving hash collisions are: Closed hashing and Open hashing

5.3.1 Closed Hashing

Closed hashing: also called open addressing method. When a hash conflict occurs, if the hash table is not full, it means that there must be more in the hash table

Empty position, then the key can be stored in the “next” empty position in the conflicting position.

It is further divided into linear detection and quadratic linear detection.

Linear detection: Starting from the position where the conflict occurs, detect backwards until the next empty position is found.

Generally, when the load factor is greater than or equal to 0.7, the hash table begins to expand.

 size_t hashi = kv.first % _tables.size();
size_t i = 1;
size_t index = hashi;
while (_tables[index]._state == EXIST)
{
index = hashi + i;
index %= _tables.size();
+ + i;
}

_tables[index]._kv = kv;
_tables[index]._state = EXIST;
+ + _n;

return true;
}

The flaw of linear detection is that conflicting data accumulates together, which is related to finding the next empty position, because finding the empty position

The way to locate them is to look for them one by one, so the second detection is to avoid this problem.

The secondary detection is plus i squared

The biggest flaw of closed hashing is its relatively low space utilization, which is also a flaw of hashing.

5.3.2 Open hashing (chain address method)

The open hash method is also called the chain address method (open chain method). First, the hash function is used to calculate the hash address of the key code set, which has the same characteristics

The key codes of the address belong to the same sub-set, each sub-set is called a bucket, and the elements in each bucket are linked through a singly linked list

After that, the head nodes of each linked list are stored in the hash table.

5.3.2.1 Capacity expansion issue

The number of buckets is certain. As elements are continuously inserted, the number of elements in each bucket continues to increase. In extreme cases, it can

This may lead to a large number of linked list nodes in a bucket, which will affect the performance of the hash table. Therefore, under certain conditions, it is necessary to modify the hash

When the table is expanded, there is exactly one node in each hash bucket. When elements are inserted again, a hash conflict will occur every time. Therefore, when the number of elements is exactly equal to the number of buckets, the hash table can be added. Allow.

If the key is of string type, you can provide a functor to convert the string into size_t type.

size_t operator()(const string & amp; str)
{
size_t hash = 0;
for (auto ch : str)
{
hash + = ch;
hash *= 31;
}

return hash;
}

// A type that needs to be the key of unordered_map/unordered_set must support conversion to a modulo functor and == comparison.
// For a type to be used as a set/map key, it must support < comparison, because its bottom layer is a red-black tree, and a red-black tree is essentially a binary search tree.

//If a bucket in the hash bucket links too many nodes, this bucket can be changed to a red-black tree.

5.3.3 Rehashing

Construct multiple different hash functions at the same time, and when a hash conflict occurs, use the second, third… and other hash functions to calculate the address until no conflicts occur. Although aggregation is less likely to occur, it increases calculation time.

5.4 Get value through key

struct KeyOfT_map
{
const K & amp; operator()(const pair<K, V> & amp; kv)
{
return kv.first;
}
};