Code implementation of a hash structure (open hashing, closed hashing)

Hash structure

The associative containers of the unordered series are relatively efficient because their underlying implementation uses a hash structure.

Hash mapping establishes a mapping relationship between a key and its storage location.

A hash table is a storage structure in which a hash function (hashFunc) establishes a one-to-one mapping between each element's storage location and its key, so that during a search the element can be found quickly by evaluating this function.

Common hash functions

Hash function design principles: the domain of the hash function must include every key that needs to be stored, and if the hash table allows m addresses, its range must lie within 0 to m-1; the addresses computed by the hash function should be distributed evenly across the whole space; and the hash function should be relatively simple.

  1. Direct addressing method – (commonly used)
    Take a linear function of the key as the hash address:

    Hash(Key) = A * Key + B

    Advantages: simple, uniform, no hash collision;
    Disadvantages: the distribution of keys must be known in advance, and when a batch of integer values is hash-mapped into the table, values that are too scattered (such as 2, 3, 12, 9999) waste too much space;
    Usage scenario: suitable for integer keys in a small, relatively continuous range

  2. Remainder method – (commonly used)
    Let the number of addresses allowed in the hash table be m, and take as the divisor a prime number p that is closest to, but not greater than, m. The hash function Hash(key) = key % p (p <= m) converts the key into a hash address. Hash collisions can still occur. In practice, p is usually taken to be the size() of the hash table (a small sketch of both hash functions follows this list).

    Note: the more carefully designed the hash function, the less likely hash collisions become, but they cannot be avoided entirely.
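As a small sketch (the function names are mine, not from the original), the two hash functions above might be written as:

#include <cstddef>

// direct addressing: a linear function of the key
size_t HashDirect(size_t key, size_t a, size_t b)
{
    return a * key + b;
}

// remainder method: map the key into [0, p), with p <= m
size_t HashMod(size_t key, size_t p)
{
    return key % p;
}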

Hash Collision

Concept

When different keys compute the same hash address through the same hash function, the phenomenon is called a hash collision (also a hash conflict). In short, different values map to the same location. Data elements that have different keys but the same hash address are called “synonyms”.

For the data set 3, 5, 7, 12, 999, 13, use the remainder method with p = 10, so Hash(key) = key % p (p <= m). For the keys 3 and 13, 3 % 10 = 3 and 13 % 10 = 3, so should position 3 of the hash table store 3 or 13?

Solution 1: Closed hashing

Closed hashing is also called the open addressing method. When a hash collision occurs, as long as the hash table is not full there must be an empty position somewhere in it, so the key can be stored in the “next” empty position after the collision position. Closed hashing never allows the storage space to become completely full; it relies on the load factor to decide when the table counts as full.

The storage node structure in the hash table is defined as

enum State
{
    EMPTY,  // the slot is empty, a new key can be stored here
    EXIST,  // the slot already holds a key
    DELETE, // the slot's data has been deleted, a new key can be stored here
};

template<class K, class V>
struct HashData
{
    std::pair<K, V> _kv;
    State _state; // marks the status of the slot; a search stops probing when it reaches an EMPTY slot
};
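The code in this section refers to members _tables and _n that are not shown; a minimal closed-hash table skeleton consistent with those names (the exact layout is an assumption) might look like:

template<class K, class V, class Hash>
class HashTable
{
public:
    // Insert, Find, and Erase are sketched below
private:
    std::vector<HashData<K, V>> _tables; // the slots themselves
    size_t _n = 0;                       // number of slots whose state is EXIST
};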

Find the mapping location according to the key

The key is an integer type

When the Key is an integer value, the mapping position is found with Hash(Key) = Key % _tables.size(). Note that size() must be used here rather than capacity(), because when a subscript is used to access the element at that position, vector's operator[] checks that the index does not exceed the valid data range size(). In practice the table is kept with size() == capacity() to avoid wasting space.
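The primary template of the HashFunc functor is not shown in the original; a minimal sketch for integer-like keys, consistent with the specialization below, might be:

template<class K>
struct HashFunc
{
    size_t operator()(const K& key)
    {
        return (size_t)key; // integer-like keys are cast directly to size_t
    }
};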

The key is a character type – BKDR idea

When the Key is of another type, a secondary mapping is needed, done with the help of a functor that converts the key into an integer. For strings, the BKDR hash idea is used:

template<> // specialization for std::string
struct HashFunc<std::string>
{
    size_t operator()(const std::string& key)
    {
        size_t hash = 0;
        for (auto ch : key)
        {
            hash *= 131; // BKDR idea: the multiplier can be 31, 131, 1313, 13131, ...
            hash += ch;
        }
        return hash;
    }
};

When closed hashing is used to handle hash collisions, an existing element in the hash table cannot simply be physically deleted, because removing it directly would break the search path of other elements. Therefore linear probing deletes an element with a marked pseudo-deletion: the slot's state is set to DELETE.
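A minimal Erase sketch based on this pseudo-deletion (assuming a Find member that returns a pointer to the slot, or nullptr when the key is absent):

bool Erase(const K& key)
{
    HashData<K, V>* ret = Find(key);
    if (ret == nullptr)
        return false;

    ret->_state = DELETE; // pseudo-deletion: only the state flag changes
    --_n;
    return true;
}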

Load factor – controlling expansion

The load factor of a hash table is defined as a = number of elements stored in the table / length (size) of the hash table. a indicates how full the hash table is.

Since the table length is fixed, a is proportional to the number of elements stored in the table: the larger a is, the more elements the table holds, the greater the chance of collisions, and the higher the space utilization; conversely, the smaller a is, the fewer elements the table holds, the smaller the chance of collisions, and the lower the space utilization. This is a typical case of trading space for time.

In fact, the average search length of a hash table is a function of the load factor a, though different collision-handling methods give different functions.
For the open addressing method, the load factor is particularly important and should be kept strictly below 0.7 to 0.8. Beyond 0.8, CPU cache misses during table lookups rise along an exponential curve.

If the chosen maximum load factor is exceeded, expand the hash table first and then re-establish the mapping relationships.

Closed hashing makes colliding keys occupy positions that belong to other key-value pairs, so its space utilization is relatively low. So how is the next empty position found?

Method 1 for finding empty positions: linear probing

Starting from the position where the collision occurs, probe the subsequent positions one by one until the next empty position is found.

The advantage of linear probing is that it is very simple to implement. The disadvantage is that once hash collisions occur, the colliding entries pile up in a contiguous run, easily producing data “clustering”: different keys occupy each other's empty positions, so locating a given key may require many comparisons and search efficiency drops.

while (_tables[hashi]._state == EXIST)
{
    // linear probing
    ++hashi;
    hashi %= _tables.size(); // wrap around to avoid going out of bounds
}
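Putting the load factor check and linear probing together, a minimal closed-hash Insert sketch (the concrete 0.7 threshold and the rehash-by-reinsertion are assumptions consistent with the text; the table is assumed to be constructed with a non-zero size):

bool Insert(const std::pair<K, V>& kv)
{
    if (Find(kv.first))
        return false;

    if (_n * 10 / _tables.size() >= 7) // keep the load factor below 0.7
    {
        // expand, then re-insert every EXIST element to rebuild the mapping
        HashTable<K, V, Hash> newHT;
        newHT._tables.resize(_tables.size() * 2);
        for (auto& data : _tables)
        {
            if (data._state == EXIST)
                newHT.Insert(data._kv);
        }
        _tables.swap(newHT._tables);
    }

    Hash hf;
    size_t hashi = hf(kv.first) % _tables.size();
    while (_tables[hashi]._state == EXIST) // linear probing
    {
        ++hashi;
        hashi %= _tables.size();
    }
    _tables[hashi]._kv = kv;
    _tables[hashi]._state = EXIST;
    ++_n;
    return true;
}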

Method 2 for finding empty positions: quadratic probing

The method to find the next empty position is:

H_i = (H_0 + i^2) % m, or: H_i = (H_0 - i^2) % m, where i = 1, 2, 3, ..., H_0 is the position obtained by applying the hash function Hash(x) to the element's key, and m is the size of the table (_tables.size()).

size_t hashi = hf(kv.first) % _tables.size(); // H_0, computed with the hash functor
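A minimal quadratic-probing loop built on the H_0 line above (a sketch, not from the original code):

size_t start = hashi; // remember H_0
size_t i = 1;
while (_tables[hashi]._state == EXIST)
{
    hashi = (start + i * i) % _tables.size(); // H_i = (H_0 + i^2) % m
    ++i;
}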

Solution 2: Open hashing (better)

Open hashing is also known as the chaining method (open chain method). The hash function first computes a hash address for each key in the key set; keys with the same address belong to the same subset, and each subset is called a bucket. The elements in each bucket are linked in a singly linked list without a sentinel head node, and the head pointer of each list is stored in the hash table.

When an unordered container stores key-value pairs, it first requests a whole contiguous block of storage. This block does not hold the key-value pairs directly; it holds the head pointers of the linked lists, while each key-value pair actually lives in a node of its list. The STL standard library usually chooses a vector to store the list head pointers.
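The Insert code below uses a Node type and members _tables and _n that the original does not show; a minimal open-hash sketch consistent with how they are used:

template<class K, class V>
struct HashNode
{
    std::pair<K, V> _kv;
    HashNode<K, V>* _next;

    HashNode(const std::pair<K, V>& kv)
        : _kv(kv)
        , _next(nullptr)
    {}
};

template<class K, class V, class Hash = HashFunc<K>>
class HashTable
{
    typedef HashNode<K, V> Node;
public:
    // Insert and the destructor are shown below
private:
    std::vector<Node*> _tables; // head pointer of each bucket
    size_t _n = 0;              // number of stored nodes
};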

Insert()

When a new key-value pair is stored in an unordered container, the whole storage process divides into the following steps:

  1. If the load factor exceeds 1.0, expand the capacity;
  2. Pass the Key of the key-value pair to the designed hash function to obtain a hash value (an integer, denoted H);
  3. Take H modulo the number n of buckets owned by the container (that is, H % n); the result is the number of the bucket in which this key-value pair should be stored;
  4. Allocate a new node with new to hold the key-value pair, and head-insert the node into the bucket with that number.
//Insert, plain version
bool Insert(const std::pair<K, V>& kv)
{
    if (Find(kv.first))
        return false;

    if (_tables.size() == _n) // by default, the maximum load factor of unordered containers is 1.0
    {
        // expansion
        HashTable<K, V, Hash> newHT;
        newHT._tables.resize(_tables.size() * 2);
        for (auto cur : _tables)
        {
            while (cur)
            {
                newHT.Insert(cur->_kv); // every node is allocated again, unsuitable for large data volumes
                cur = cur->_next;
            }
        }
        _tables.swap(newHT._tables);
    }

    Hash hf;
    // find the bucket index from the mapping relationship
    size_t hashi = hf(kv.first) % _tables.size();

    Node* newnode = new Node(kv);
    newnode->_next = _tables[hashi];
    _tables[hashi] = newnode;
    ++_n;
    return true;
}

The plain version of insertion creates a new HashTable, resizes that table's _tables, and then re-inserts every node, which redistributes the data and reduces hash collisions. Its downsides: 1. every node is re-allocated with new and then inserted, which is unsuitable when the amount of data is large; 2. HashTable has its own destructor, so when the temporary table is released the destructor must be called to free all of its nodes. In short, the cost comes from the nodes not being reused.

The improved idea is to create a new _tables directly and recycle the old table's nodes: remap each node of the old table (this step cannot be skipped), head-insert it into the corresponding position of the new table, and set the pointer stored in the old table to null so that the old table's destructor has nothing substantive left to do. Finally, swapping the old and new tables completes the expansion.

bool Insert(const std::pair<K, V>& kv)
{
    if (Find(kv.first))
        return false;

    if (_tables.size() == _n) // by default, the maximum load factor of unordered containers is 1.0
    {
        // expansion
        std::vector<Node*> newtables;
        newtables.resize(_tables.size() * 2, nullptr);
        for (size_t i = 0; i < _tables.size(); ++i)
        {
            Node* cur = _tables[i];
            while (cur)
            {
                Node* next = cur->_next;
                size_t hashi = Hash()(cur->_kv.first) % newtables.size();
                // head-insert into the new table
                cur->_next = newtables[hashi];
                newtables[hashi] = cur;
                // walk on through the old table
                cur = next;
            }
            // the old table's nodes are reused, so its destructor must not destroy them
            _tables[i] = nullptr;
        }
        _tables.swap(newtables); // swap the old and new tables
    }

    Hash hf;
    // find the bucket index from the mapping relationship
    size_t hashi = hf(kv.first) % _tables.size();

    Node* newnode = new Node(kv);
    newnode->_next = _tables[hashi];
    _tables[hashi] = newnode;
    ++_n;
    return true;
}

Load factor

The hash table storage structure has an important property called the load factor, and it applies to the unordered containers as well. It measures how empty or full the container is: the larger the load factor, the fuller the container, meaning more key-value pairs hang off each linked list, which undoubtedly reduces the efficiency of finding a target key-value pair; conversely, the smaller the load factor, the emptier the container, although that does not necessarily mean every linked list holds fewer key-value pairs.

The load factor is calculated as: load factor = total key-value pairs stored in the container / number of buckets.

if (_tables.size() == _n)//By default, the maximum load factor for unordered containers is 1.0

**By default, the unordered containers have a maximum load factor of 1.0.** If the load factor exceeds this maximum while the container is operating, the container automatically increases the number of buckets and rehashes to bring the load factor back down. Note that this process invalidates the container's iterators, but references and pointers to individual key-value pairs remain valid.
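For reference, std::unordered_map exposes this behavior directly (a small standalone illustration):

#include <iostream>
#include <unordered_map>

int main()
{
    std::unordered_map<int, int> um;
    std::cout << um.max_load_factor() << '\n'; // 1.0 by default

    for (int i = 0; i < 100; ++i)
        um[i] = i;

    // load_factor() == size() / bucket_count(), kept <= max_load_factor()
    std::cout << um.load_factor() << ' ' << um.bucket_count() << '\n';

    um.rehash(1024); // request at least 1024 buckets; rehashing invalidates iterators
    return 0;
}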

Destructor

With the open-hash structure, the destructor must be written by hand; the compiler-generated one would only destroy the vector of head pointers and leak the nodes.

~HashTable()
{
    for (size_t i = 0; i < _tables.size(); ++i)
    {
        Node* cur = _tables[i];
        while (cur)
        {
            Node* next = cur->_next;
            delete cur;
            cur = next;
        }
        _tables[i] = nullptr;
    }
}

The size() of the hash table should preferably be a prime number

Researchers have found that hash collisions are less likely when the modulus is a prime number.

size_t hash = hf(kv.first) % _tables.size();

// For the division/remainder method it is best to use a prime modulus, i.e. the size of the hash table should be a prime number
inline unsigned long __stl_next_prime(unsigned long n)
{
    static const int __stl_num_primes = 28;
    static const unsigned long __stl_prime_list[__stl_num_primes] =
    {
        53, 97, 193, 389, 769,
        1543, 3079, 6151, 12289, 24593,
        49157, 98317, 196613, 393241, 786433,
        1572869, 3145739, 6291469, 12582917, 25165843,
        50331653, 100663319, 201326611, 402653189, 805306457,
        1610612741, 3221225473, 4294967291
    };

    for (int i = 0; i < __stl_num_primes; ++i)
    {
        if (__stl_prime_list[i] > n)
        {
            return __stl_prime_list[i];
        }
    }
    return __stl_prime_list[__stl_num_primes - 1];
}
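With this helper, the expansion step can grow the table to the next prime instead of simply doubling; a sketch of the change (an assumption, not in the original code):

// inside Insert's expansion branch, replace the doubling resize with:
newtables.resize(__stl_next_prime(_tables.size()), nullptr);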

In some scenarios, such as very large data sets, and weighing space against efficiency, the underlying structure of a bucket can be changed from a singly linked list to a red-black tree.

Comparison of open hash and closed hash

Handling overflow with the chaining method requires adding a link pointer to every node, which seems to increase storage overhead. In fact, since the open addressing method must keep a large amount of free space to guarantee search efficiency (for example, quadratic probing requires a load factor a <= 0.7) and a table entry occupies far more space than a pointer, the chaining method actually saves storage compared with the open addressing method.

Note that map/set and unordered_map/unordered_set place different requirements on their keys: the ordered containers require that keys can be compared, while the unordered containers require that keys can be converted into integers (hashed).
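As an illustration with a hypothetical key type (not from the original), a struct used as an unordered key needs both a hash functor and an equality predicate:

#include <cstddef>
#include <functional>
#include <unordered_set>

struct Point { int x, y; };

struct PointHash
{
    size_t operator()(const Point& p) const
    {
        return std::hash<int>()(p.x) * 131 + std::hash<int>()(p.y); // BKDR-style combination
    }
};

struct PointEq
{
    bool operator()(const Point& a, const Point& b) const
    {
        return a.x == b.x && a.y == b.y;
    }
};

int main()
{
    std::unordered_set<Point, PointHash, PointEq> s;
    s.insert({ 1, 2 });
    return 0;
}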