[Data structure] Hash underlying structure

Contents

1. Hash concept

2. Hash implementation

1. Closed hashing

1.1. Linear probing

1.2. Quadratic probing

2. Open hashing

2.1. The concept of open hashing

2.2. Open hash structure

2.3. Open hash lookup

2.4. Open hash insertion

2.5. Open hash deletion

3. Performance Analysis


1. Hash concept

In sequential structures and balanced trees, there is no mapping between an element's key and its storage location, so searching for an element requires multiple key comparisons. Sequential search has O(N) time complexity; in a balanced tree, search cost is proportional to the height of the tree, O(logN). Search efficiency depends on the number of key comparisons performed during the search.

The ideal search method: the element to be searched can be fetched from the table directly, in one step, without any comparisons. If a storage structure is constructed in which a one-to-one mapping between an element's storage location and its key is established through some function (hashFunc), then the element can be found quickly through this function when searching.

With this structure:

  • Insert an element:
    Compute the element's storage position from its key with the hash function, and store the element at that position.
  • Search for an element:
    Apply the same computation to the key, take the resulting value as the element's storage position, and compare the element at that position in the structure; if the keys are equal, the search succeeds.

This method is called hashing; the conversion function it uses is called a hash function, and the structure built this way is called a hash table.

For example, given the data set {1, 7, 6, 4, 5, 9}, set the hash function as hash(key) = key % capacity, where capacity is the total size of the underlying storage space.

Searching this way does not require comparing the key many times, so lookups are fast. However, it may cause hash collisions.
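To make the example concrete, here is a minimal sketch (assuming capacity = 10, a value the text does not fix) that prints each element's slot and shows how a key such as 44 would collide with 4:

#include <iostream>
using namespace std;

int main()
{
    size_t capacity = 10; // assumed size of the underlying space
    int data[] = { 1, 7, 6, 4, 5, 9 };
    for (int key : data)
        cout << key << " -> index " << key % capacity << endl; // each key gets its own slot
    cout << 44 << " -> index " << 44 % capacity << endl;       // same slot as 4: a hash collision
    return 0;
}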

Overall code structure of the hash table:

enum State
{
    EMPTY,
    EXIST,
    DELETE
};

template<class K, class V>
struct HashData
{
    pair<K, V> _kv;
    State _state = EMPTY;
};

template<class K>
struct HashFunc
{
    // the key itself can be implicitly converted to size_t
    size_t operator()(const K& key)
    {
        return key;
    }
};

template<>
struct HashFunc<string>
{
    // BKDR hash
    size_t operator()(const string& s)
    {
        size_t hash = 0;
        for (auto ch : s)
        {
            hash += ch;
            hash *= 31; // reduces collisions between different strings whose characters
                        // have the same ASCII sum; the multiplier can be 31, 131, 1313, 131313, etc.
        }
        return hash;
    }
};

template<class K, class V, class Hash = HashFunc<K>>
class HashTable
{
public:
    bool Insert(const pair<K, V>& kv)
    {}

private:
    vector<HashData<K, V>> _tables;
    size_t _n = 0; // number of stored elements
};

Among the template parameters, Hash is a functor that converts the key into an integer. If the key is a string, the HashFunc<string> specialization converts it to an integer with the BKDR method.
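As a quick illustration of how the functor is chosen, here is a small sketch, assuming the HashFunc definitions above are in scope:

#include <iostream>
#include <string>
using namespace std;

// ... HashFunc primary template and string specialization from above ...

int main()
{
    HashFunc<int> intHash;
    HashFunc<string> strHash;
    cout << intHash(42) << endl;     // 42: an integral key converts directly
    cout << strHash("abcd") << endl; // BKDR value of "abcd"
    // HashTable<string, int> picks HashFunc<string> automatically
    // through the default template argument Hash = HashFunc<K>.
    return 0;
}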

2. Hash implementation

For the keys k_i and k_j (i != j) of two data elements, even though k_i != k_j, it can happen that Hash(k_i) == Hash(k_j); that is, different keys produce the same hash address through the same hash function. This phenomenon is called a hash collision (or hash conflict). Data elements that have different keys but the same hash address are called “synonyms”.

One cause of hash collisions is a hash function that is not designed reasonably enough. Hash function design principles:

  • The domain of the hash function must include all keys that need to be stored, and if the hash table allows m addresses, its range must fall within 0 to m-1.
  • The addresses computed by the hash function should be evenly distributed across the entire space.
  • The hash function should be relatively simple.

Common hash functions:

  1. Direct addressing method — (commonly used)
    Take a linear function of the key as the hash address: Hash(Key) = A*Key + B.
    Advantages: simple and uniform.
    Disadvantage: the distribution of the keys must be known in advance.
    Usage scenario: suitable when the keys are relatively small and continuous.
  2. Division (remainder) method — (commonly used)
    Assuming the hash table allows m addresses, take a prime number p that is not greater than m but closest to m as the divisor, and convert the key into a hash address with the hash function Hash(key) = key % p (p <= m). A sketch of choosing p follows this list.

There are two main methods for resolving hash collisions: closed hashing and open hashing.

1. Closed hashing

Closed hashing, also known as open addressing: when a hash collision occurs, if the hash table is not full, there must still be an empty slot in the table, so the key can be stored in the “next” empty slot after the conflicting position. So how do we find that next empty slot?

1.1. Linear probing

Linear probing: starting from the position where the collision occurs, probe the subsequent slots one by one until the next empty slot is found.

Insert:

  1. Compute the position of the element to be inserted in the hash table through the hash function.
  2. If there is no element at that position, insert the new element directly. If there is already an element there, a hash collision has occurred: use linear probing to find the next empty slot and insert the new element there.

Insert code:

bool Insert(const pair<K, V>& kv)
{
    if (Find(kv.first))
    {
        return false;
    }

    Hash hash;

    size_t hashi = hash(kv.first) % _tables.size(); // size, not capacity:
                                                    // operator[] cannot access slots beyond size
    // linear probing
    size_t i = 1;
    size_t index = hashi;
    while (_tables[index]._state == EXIST)
    {
        index = hashi + i;
        index %= _tables.size();
        ++i;
    }

    _tables[index]._kv = kv;
    _tables[index]._state = EXIST;
    ++_n;
    return true;
}

Because the size may be 0, or the table may no longer have enough room, an expansion operation is required:

// if size is 0, or the load factor exceeds 0.7, grow the table
if (_tables.size() == 0 || _n * 10 / _tables.size() >= 7)
{
    size_t newsize = _tables.size() == 0 ? 10 : _tables.size() * 2;
    vector<HashData<K, V>> newtables(newsize); // create a new vector object
    // traverse the old table and remap every element into the new one
    for (auto& data : _tables)
    {
        if (data._state == EXIST)
        {
            // recompute the position in the new table
            size_t hashi = hash(data._kv.first) % newtables.size();
            size_t i = 1;
            size_t index = hashi;
            while (newtables[index]._state == EXIST)
            {
                index = hashi + i;
                index %= newtables.size();
                ++i;
            }

            newtables[index]._kv = data._kv;
            newtables[index]._state = EXIST;
        }
    }
    _tables.swap(newtables);
}

Note that expansion must allocate a new vector and re-insert all the data; simply enlarging the original vector will not do, because the mapping positions change with the table size: values that did not collide before may collide afterwards, and values that collided before may no longer collide.
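As a tiny illustration of why positions change, assuming the remainder hash above:

#include <iostream>
using namespace std;

int main()
{
    cout << 15 % 10 << endl; // 5: with size 10, key 15 collides with key 5
    cout << 15 % 20 << endl; // 15: with size 20, key 15 gets its own slot
    return 0;
}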

Because the version above duplicates the probing logic, it can be simplified as follows:

bool Insert(const pair<K, V>& kv)
{
    if (Find(kv.first))
    {
        return false;
    }

    Hash hash;

    // grow when size is 0 or the load factor exceeds 0.7
    if (_tables.size() == 0 || _n * 10 / _tables.size() >= 7)
    {
        size_t newsize = _tables.size() == 0 ? 10 : _tables.size() * 2;
        HashTable<K, V, Hash> newht; // create a new hash table object
        newht._tables.resize(newsize); // grow
        // traverse the old table and remap into the new one
        for (auto& data : _tables)
        {
            if (data._state == EXIST)
            {
                newht.Insert(data._kv);
            }
        }
        _tables.swap(newht._tables);
    }

    size_t hashi = hash(kv.first) % _tables.size(); // size, not capacity

    // linear probing
    size_t i = 1;
    size_t index = hashi;
    while (_tables[index]._state == EXIST)
    {
        index = hashi + i;
        index %= _tables.size();
        ++i;
    }

    _tables[index]._kv = kv;
    _tables[index]._state = EXIST;
    ++_n;
    return true;
}

Delete:

When closed hashing is used to handle collisions, an existing element cannot simply be removed from the table physically, because deleting it outright would break the search for other elements. For example, if element 4 were deleted directly, the lookup of 44 could stop at the now-empty slot and fail. So linear probing deletes an element with a pseudo-deletion mark (the DELETE state).

Find and delete code:

HashData<K, V>* Find(const K& key)
{
    if (_tables.size() == 0)
    {
        return nullptr;
    }

    Hash hash;

    size_t hashi = hash(key) % _tables.size();

    size_t i = 1;
    size_t index = hashi;
    while (_tables[index]._state != EMPTY)
    {
        if (_tables[index]._state == EXIST && _tables[index]._kv.first == key)
        {
            return &_tables[index];
        }

        index = hashi + i;
        index %= _tables.size();
        ++i;

        // if we have gone all the way around, every slot is EXIST or DELETE
        if (index == hashi)
        {
            break;
        }
    }
    return nullptr;
}

bool Erase(const K& key)
{
    HashData<K, V>* ret = Find(key);
    if (ret)
    {
        ret->_state = DELETE;
        --_n;
        return true;
    }
    else
    {
        return false;
    }
}

Note the full-circle check in Find: if the table contained only EXIST and DELETE slots (no EMPTY one), the loop would otherwise never terminate, so the search is limited to one pass over the table.
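Putting the pieces together, here is a minimal usage sketch, assuming the HashTable above with the Insert, Find, and Erase members from this section:

#include <iostream>
#include <utility>
using namespace std;

int main()
{
    HashTable<int, int> ht;
    int a[] = { 1, 7, 6, 4, 5, 9, 44 }; // 44 collides with 4 and is placed by linear probing
    for (int e : a)
        ht.Insert(make_pair(e, e));

    ht.Erase(4); // marks the slot DELETE instead of emptying it
    cout << (ht.Find(44) != nullptr) << endl; // 1: probing walks past the DELETE mark
    return 0;
}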

Advantage of linear probing: very simple to implement.
Disadvantage of linear probing: once hash collisions occur, the colliding elements pile up in consecutive slots, i.e. different keys occupy each other's free positions, so locating a given key can take many comparisons and search efficiency drops.

1.2. Quadratic probing

The defect of linear probing is that colliding data piles up, and this is tied to how it finds the next empty slot: one slot after another. To avoid this, quadratic probing finds the next empty slot with H_i = (H_0 + i^2) % m, or H_i = (H_0 - i^2) % m, where i = 1, 2, 3, ..., H_0 is the position obtained by running the element's key through the hash function Hash(x), and m is the size of the table.

Research shows that when the length of the table is a prime number and the load factor a does not exceed 0.5, a new entry can always be inserted and no slot is probed twice. Therefore, as long as half of the slots in the table are empty, the table can never be full. Fullness can be ignored during search, but insertion must keep the load factor a at or below 0.5, and the table must grow once that would be exceeded. Hence the biggest drawback of closed hashing: relatively low space utilization.
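Only the probing step differs from linear probing. Here is a minimal sketch; FindSlot is a hypothetical helper name, and it assumes the table layout above with at least one non-EXIST slot (guaranteed when a <= 0.5):

// quadratic probing: step by i*i instead of i
template<class K, class V>
size_t FindSlot(const vector<HashData<K, V>>& tables, size_t start)
{
    size_t i = 1;
    size_t index = start;
    while (tables[index]._state == EXIST)
    {
        index = (start + i * i) % tables.size(); // H_i = (H_0 + i^2) % m
        ++i;
    }
    return index;
}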

2. Open hashing

2.1. The concept of open hashing

The open hashing method is also called the chained address method (open chaining). First, the hash function is applied to the key set to compute hash addresses; keys with the same address belong to the same sub-collection, and each sub-collection is called a bucket. The elements in a bucket are linked through a singly linked list, and the head node of each list is stored in the hash table.

Each bucket in an open hash table thus contains exactly the elements whose keys collide at that hash address.

2.2. Open hash structure

template<class K, class V>
struct HashNode
{
    HashNode<K, V>* _next;
    pair<K, V> _kv;

    HashNode(const pair<K, V>& kv)
        : _next(nullptr)
        , _kv(kv)
    {}
};

template<class K>
struct HashFunc
{
    // the key itself can be implicitly converted to size_t
    size_t operator()(const K& key)
    {
        return key;
    }
};

template<>
struct HashFunc<string>
{
    // BKDR hash
    size_t operator()(const string& s)
    {
        size_t hash = 0;
        for (auto ch : s)
        {
            hash += ch;
            hash *= 31; // reduces collisions between different strings whose characters
                        // have the same ASCII sum; the multiplier can be 31, 131, 1313, 131313, etc.
        }
        return hash;
    }
};

template<class K, class V, class Hash = HashFunc<K>>
class HashTable
{
    typedef HashNode<K, V> Node;
public:
    ~HashTable()
    {
        for (auto& cur : _tables)
        {
            while (cur)
            {
                Node* next = cur->_next;
                delete cur;
                cur = next;
            }
            cur = nullptr;
        }
    }
    bool Insert(const pair<K, V>& kv)
    {}

private:
    vector<Node*> _tables;
    size_t _n = 0; // number of stored elements
};
2.3. Open hash lookup

Node* Find(const K& key)
{
    if (_tables.size() == 0)
        return nullptr;

    Hash hash;

    size_t hashi = hash(key) % _tables.size();
    Node* cur = _tables[hashi];
    while (cur)
    {
        if (cur->_kv.first == key)
            return cur;
        cur = cur->_next;
    }
    return nullptr;
}

2.4. Open hash insertion

bool Insert(const pair<K, V>& kv)
{
    if (Find(kv.first))
        return false;

    Hash hash;

    size_t hashi = hash(kv.first) % _tables.size();
    // head insertion
    Node* newnode = new Node(kv);
    newnode->_next = _tables[hashi];
    _tables[hashi] = newnode;
    ++_n;

    return true;
}

Because the size of the hash table may be 0, or the table may need to grow, an expansion step is required here too. The larger the load factor, the higher the probability of collision and the lower the search efficiency, but the higher the space utilization.

Because the nodes in the old table are heap-allocated and will not be destructed automatically, there is no need to create new ones: we just recompute each old node's position and relink it into the new table.

bool Insert(const pair<K, V>& kv)
{
    if (Find(kv.first))
        return false;

    Hash hash;

    // grow when the load factor reaches 1
    if (_n == _tables.size())
    {
        size_t newsize = _tables.size() == 0 ? 10 : _tables.size() * 2;
        vector<Node*> newtables(newsize, nullptr);
        for (Node*& cur : _tables)
        {
            while (cur)
            {
                Node* next = cur->_next;
                size_t hashi = hash(cur->_kv.first) % newtables.size();
                // head-insert the old node into the new table
                cur->_next = newtables[hashi];
                newtables[hashi] = cur;

                cur = next;
            }
        }
        _tables.swap(newtables);
    }

    size_t hashi = hash(kv.first) % _tables.size();
    // head insertion
    Node* newnode = new Node(kv);
    newnode->_next = _tables[hashi];
    _tables[hashi] = newnode;
    ++_n;
    return true;
}

2.5. Open hash deletion

bool Erase(const K& key)
{
    if (_tables.size() == 0) // guard against an empty table
        return false;

    Hash hash;

    size_t hashi = hash(key) % _tables.size();
    Node* prev = nullptr;
    Node* cur = _tables[hashi];
    while (cur)
    {
        if (cur->_kv.first == key)
        {
            if (prev == nullptr)
            {
                _tables[hashi] = cur->_next;
            }
            else
            {
                prev->_next = cur->_next;
            }
            delete cur;
            --_n; // keep the stored-element count consistent
            return true;
        }
        else
        {
            prev = cur;
            cur = cur->_next;
        }
    }
    return false;
}
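A minimal usage sketch, assuming the open hash HashTable above with Insert, Find, and Erase as public members:

#include <iostream>
#include <string>
#include <utility>
using namespace std;

int main()
{
    HashTable<string, int> dict; // picks HashFunc<string> (BKDR) automatically
    dict.Insert(make_pair(string("apple"), 1));
    dict.Insert(make_pair(string("banana"), 2));

    HashNode<string, int>* node = dict.Find("apple");
    if (node)
        cout << node->_kv.second << endl; // 1

    dict.Erase("banana"); // unlinks and deletes the node
    return 0;
}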

Handling overflow with the chained address method requires adding a link pointer to every node, which seems to increase storage overhead. In fact, since open addressing must keep a large proportion of slots free to guarantee search efficiency (quadratic probing, for example, requires a load factor a <= 0.7), and a table entry is much larger than a pointer, the chained address method actually saves storage compared with open addressing.

3. Performance analysis

For an open hash table, the average time complexity of insertion, deletion, lookup, and update is O(1). In the worst case, all values land in the same slot, i.e. in the same bucket, and the complexity degrades to O(N); but because of the expansion operation, this worst case almost never occurs.

If an extreme situation nevertheless puts a huge amount of data into one bucket, the bucket can be converted to a red-black tree once it exceeds a certain length: make the slot's data type a structure that holds the linked-list pointer, the length of the bucket, and a pointer to the tree. If the bucket length exceeds the chosen threshold, the tree pointer is used; otherwise the linked-list pointer is used.
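A rough structural sketch of that idea, with hypothetical names (RBTree is only a stand-in declaration; the code in this article does not implement the conversion):

struct RBTree; // hypothetical red-black tree type, declaration only

template<class K, class V>
struct Bucket
{
    size_t _len = 0;                 // current number of nodes in this bucket
    HashNode<K, V>* _list = nullptr; // singly linked list used while _len is small
    RBTree* _tree = nullptr;         // used once _len exceeds a chosen threshold
};
// Lookup checks _tree first: if it is non-null the bucket has been
// converted to a tree; otherwise the linked list is walked as before.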

That’s all for the underlying structure of hash tables. I hope it helps; if anything is wrong, please point it out. Thank you!
