[Translation] Analyzing Hash Table Implementations in Mainstream Languages

  • Original address: An Analysis of Hash Map Implementations in Popular Languages
  • Author: Russell Cohen
  • The translation comes from: Nuggets Translation Project
  • Permanent link to this article: github.com/xitu/gold-m…
  • Translator: wangxuanni
  • Proofreader: Quincy_Ye, CompetitiveLin

Analyze the implementation of hash tables in mainstream languages

In real-world development, few data structures are more common than hash tables. Almost every major programming language ships one in its standard library or builds it into the runtime. However, there is no single optimal implementation strategy, so mainstream languages differ substantially in how they implement them! I surveyed the hash table implementations in Go, Python, Ruby, Java, C#, C++, and Scala, comparing and contrasting how they work.

Note: The remainder of this article assumes knowledge of how hash tables work and the most common strategies for implementing them. In case you need a refresher, Wikipedia provides a fairly readable explanation. Beyond the basics, you'll want background on chaining with linked lists and on open addressing.

For each hash table I compare:

  • Scheme: How does the hash table handle collisions? What optimizations motivate the chosen strategy?
  • Growth rate: When the hash table expands, by how much does it grow?
  • Load factor: How full must the hash table be to trigger expansion? (the ratio num_keys / slots)

Some words to remember:

  • Perturbation function: Many of the hash tables below always have a power-of-2 size, so computing hash % 2**i throws away the high-order bits of the hash code. To compensate, these tables mix part of the hash code back into itself via XOR or simple addition, ensuring the influence of the upper bits is not lost.
  • Tombstone: In open addressing, a deleted element must be marked as deleted rather than actually removed. That way, when a probe hits the tombstone it keeps searching for elements further down the chain instead of stopping early.
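To make the tombstone mechanics concrete, here is a minimal open-addressing map in Python. It uses plain linear probing for simplicity (the languages below use fancier probe sequences), omits resizing, and all names are illustrative:

```python
# Sentinel objects to distinguish "never used" from "deleted".
EMPTY, TOMBSTONE = object(), object()

class OpenAddressingMap:
    """Toy open-addressing hash table with linear probing and tombstones."""

    def __init__(self, size=8):
        self.slots = [EMPTY] * size  # each slot: EMPTY, TOMBSTONE, or (key, value)

    def _probe(self, key):
        i = hash(key) % len(self.slots)
        while True:
            yield i
            i = (i + 1) % len(self.slots)  # linear probing: just try the next slot

    def put(self, key, value):
        for i in self._probe(key):
            entry = self.slots[i]
            if entry is EMPTY:
                self.slots[i] = (key, value)
                return
            if entry is not TOMBSTONE and entry[0] == key:
                self.slots[i] = (key, value)  # overwrite existing key
                return

    def get(self, key):
        for i in self._probe(key):
            entry = self.slots[i]
            if entry is EMPTY:        # true end of the probe chain
                raise KeyError(key)
            if entry is TOMBSTONE:    # deleted slot: keep probing past it
                continue
            if entry[0] == key:
                return entry[1]

    def delete(self, key):
        for i in self._probe(key):
            entry = self.slots[i]
            if entry is EMPTY:
                raise KeyError(key)
            if entry is not TOMBSTONE and entry[0] == key:
                self.slots[i] = TOMBSTONE  # mark deleted, don't physically remove
                return
```

If delete simply emptied the slot, a later get for a key that had probed past it would stop too early and wrongly report a miss; the tombstone keeps the chain intact.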

Although all implementations are quite different, there are certain things in common:

  • Open addressing (Python, Ruby, C++) and chaining with linked lists (Java, Scala, Go) are roughly equally represented.
  • None of the surveyed languages use a “fancy” scheme like cuckoo hashing, although most include varying degrees of optimization that make the code look complex.
  • Most languages try to add entropy to hash codes by mixing the high and low bits at some point in the process. This is necessary because the hash functions for some primitive types in these languages are low-entropy.
  • All grow by at least 2x, and most guarantee the size is always a power of 2.

On to the details:

Python (CPython)

source code comments

Scheme: Open addressing with a custom probe sequence:

next = ((5*prev) + 1 + perturb) % TABLE_SIZE

perturb is initially the hash code itself. For each successive probe in the chain, perturb = perturb >> 5, so a 32-bit hash code influences roughly the first 7 probes before perturb reaches 0. Amusingly, this is neither the linear nor the quadratic probing everyone learned in school.
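As a rough sketch (not CPython's actual code), here's the probe sequence this scheme generates for a given hash code and table size:

```python
def cpython_probe(hash_code, table_size, num_probes):
    """Return the first num_probes slot indices CPython's scheme would visit.

    Illustrative sketch of next = (5*prev + 1 + perturb) % TABLE_SIZE,
    with perturb >>= 5 after each step.
    """
    perturb = hash_code            # perturb starts as the hash code itself
    i = hash_code % table_size
    slots = []
    for _ in range(num_probes):
        slots.append(i)
        i = (5 * i + 1 + perturb) % table_size
        perturb >>= 5              # high bits stop contributing after ~7 steps
    return slots
```

Note how two hash codes that collide in their low bits (e.g. 32 and 64 with a table of 8 slots) follow different probe sequences once perturb mixes in the higher bits.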

Growth rate: At least 2x, and the size is always a power of 2. In the absence of deletions, the size simply doubles. Because of tombstones (see the explanation above), it is possible to have very long probe chains even though the number of live entries doesn't exceed the load factor. To guarantee the table actually grows whenever a resize is triggered, the new size is computed as NUM_ITEMS*2 + capacity/2 [1], which takes the current capacity into account.

Load factor: 0.66

Other considerations:

  • Although Python and Ruby use different perturbation strategies, they share the same underlying probing scheme, next = ((5*prev) + 1) mod TABLE_SIZE.

  • CPython implements a special-cased hash table for dictionaries whose keys are all unicode strings [2]. The motivation is that much of Python's internals relies on dictionaries with unicode keys (e.g. looking up local variables) [3].

    • When all keys are strings, the keys are stored directly in one array and pointers to the values in a separate, parallel array. This enables some optimizations and means that no pointer dereference is needed to read a key.
    • For non-string keys, each key-value pair is stored together in a struct, and those structs are stored in an array.
  • The perturbation strategy is specifically targeted at, and heavily tuned for, Python's poorly performing integer hash function, hash(i) == i.

  • Many magic numbers were tuned empirically [4].

  • In 3.3.0, the growth rate changed from used*4 to used*2.

Ruby

source code comments

Scheme: Open addressing, using j = ((5*j) + 1 + perturb) mod TABLE_SIZE. This is largely the same structure as Python's, though the perturbation strategies differ somewhat.

Growth rate: 2x. The number of slots is always a power of 2.

Load factor: 0.5

Other considerations:

  • The old implementation used chaining with linked lists. The new one is reportedly about 40% faster [5].
  • The entries array (for fast iteration) is kept separate from the bins array used for hash lookup.
  • Small tables skip the bins entirely and use a linear scan.
  • Ruby tried quadratic probing (you can enable it at compile time with #define QUADRATIC_PROBE), but it was slower in practice [6].

Java

Scheme: Linked list; when a chain grows longer than 8 elements, it is converted into a TreeMap. This conversion helps most when:

  • K implements the Comparable<> interface
  • Keys collide modulo the table size, but their full hash codes are not equal

Growth rate: 2x. The number of slots is always a power of 2.

Load factor: 0.75

Other considerations:

  • Since the size of a Java hash table is always a power of 2, hash_code % table_size ignores the upper bits until the table reaches 2^32 slots. To fix this, Java XORs the hash code with itself shifted right by 16 bits, ensuring the upper bits have some effect: int h = key.hashCode(); h = h ^ (h >>> 16);
  • When resizing, each element moves to one of exactly two positions: k or k + oldSize. This is a convenient consequence of doubling the capacity each time.
  • The code is genuinely hard to follow, mainly because of the conversions between trees and linked lists.
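Both ideas can be sketched in Python, modeling Java's 32-bit ints with masked Python integers (the function names are mine, not the JDK's):

```python
MASK32 = 0xFFFFFFFF  # emulate Java's 32-bit int

def java_mix(hash_code):
    """Java's supplemental hash: fold the high 16 bits into the low 16."""
    h = hash_code & MASK32
    return h ^ (h >> 16)

def resize_target(mixed_hash, old_size):
    """With power-of-2 sizes, doubling sends an entry to slot k or k + old_size.

    Which of the two it gets depends on exactly one additional hash bit.
    """
    k = mixed_hash % old_size
    new_slot = mixed_hash % (2 * old_size)
    assert new_slot in (k, k + old_size)
    return new_slot
```

Because of this split property, Java's resize never needs to fully rehash a key; it only inspects one more bit of the (already mixed) hash code.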

Scala

Immutable hash table

Most hash tables in Scala (source code) are immutable, so I'll discuss those first.

Scheme: Hash trie. A hash trie is a recursive data structure (it's trees all the way down). The Scala documentation provides a decent explanation; for a more in-depth look, Phil Bagwell's paper is an excellent resource. A brief summary:

For hash tables of size 0 to 4, it uses hard-coded implementations. For larger tables it uses a hash trie. Each level of the trie considers some subset of the bits of the hash code. When inserting or retrieving, the implementation recursively descends into the branch of the tree matching the next subset of bits. Scala's hash trie has a branching factor of 32, so each level considers 5 bits of the hash code (2^5 = 32). Since hash codes in Java/Scala are 32-bit integers, this means that if all hash codes are unique, a hash trie can store 2^32 elements without collisions.
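The bit-slicing can be sketched in a few lines of Python (illustrative only, not Scala's actual code): each level of the trie indexes one 5-bit chunk of the hash code.

```python
def trie_path(hash_code, levels=7):
    """Split a 32-bit hash code into the per-level branch indices of a
    branching-factor-32 hash trie. 7 levels * 5 bits covers all 32 bits
    (the last level only sees the remaining 2 bits)."""
    return [(hash_code >> (5 * level)) & 0b11111 for level in range(levels)]
```

Two keys share a path prefix exactly as long as their hash codes agree in their low-order 5-bit chunks, which is why unique hash codes never collide in the trie.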

When hash codes do collide, a linked list is used, wrapped in the HashMapCollision data structure.

Scala also provides a mutable hash table. It's the only implementation I surveyed that lacks complicating optimizations, which makes it refreshingly straightforward.

Mutable hash table

source code

Scheme: Linked list

Growth rate: 2x

Load factor: 0.75

Other considerations:

  • This is what I originally expected a hash table implementation to look like: under 500 lines total, with fewer than 100 lines of core code, uncomplicated and easy to read.
  • Like many other implementations, it tries to increase the entropy of the incoming hashcode with some mixing:
var h: Int = hcode + ~(hcode << 9)
h = h^(h >>> 14)
h = h + (h << 4)
h ^ (h >>> 10)

Golang

source code

Scheme: Linked list with some optimizations. The chains are made up of buckets, and each bucket holds 8 slots. Once all 8 slots are full, an overflow bucket is chained to the first one. Storing 8 key-value pairs in contiguous memory reduces memory accesses and allocations when reading and writing the table.

Growth rate: 2x. When a large number of deletes occurs, a hash table of the same size is allocated to garbage collect unused buckets.

Load Factor: 6.5! Not 6.5%, but 6.5 times. This means that on average, the hash table will resize when each bucket has 6.5 elements. This is in stark contrast to all other hash table implementations that use load factors less than 1.

Other considerations:

  • In all the other implementations, copying elements from the old array to the new one happens all at once, during the expansion triggered by a single insertion. In Go, the work of moving elements into the new array is done incrementally as more keys are added! For each new or updated element, 2 keys are moved from the old array to the new one, ensuring that no single write costs O(n). Once all keys have been moved out of the old array, it can be deallocated.
  • Two conditions can trigger growth:
  1. The number of elements >= 6.5x the size of the array. The new array is double the size of the old one.
  2. There are too many overflow buckets.

Under condition #2, the newly allocated array is the same size as the old one. This seemingly pointless behavior comes from this commit: after heavy deletion, allocating a new array and slowly moving into it means the old buckets get garbage collected rather than slowly leaked. They chose this incremental approach to ensure that iterators keep working correctly.
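The incremental evacuation idea can be sketched in Python, using plain dicts as stand-ins for the old and new bucket arrays (EVACUATE_PER_WRITE and all other names are mine, not Go's; a real implementation evacuates whole buckets):

```python
class IncrementalMap:
    """Toy map that amortizes resizing: each write moves a couple of keys
    from the old array to the new one, so no single operation pays O(n)."""

    EVACUATE_PER_WRITE = 2

    def __init__(self):
        self.new = {}    # stand-in for the new bucket array
        self.old = None  # non-None only while a resize is in progress

    def start_resize(self):
        self.old, self.new = self.new, {}

    def put(self, key, value):
        if self.old is not None:
            # Evacuate a couple of old keys on every write.
            for old_key in list(self.old)[: self.EVACUATE_PER_WRITE]:
                self.new[old_key] = self.old.pop(old_key)
            if not self.old:
                self.old = None  # evacuation finished; old array can be freed
        self.new[key] = value

    def get(self, key):
        # A key may live in either array until evacuation completes.
        if self.old is not None and key in self.old:
            return self.old[key]
        return self.new[key]
```

Lookups transparently check both arrays while the resize is in flight, which is also what lets iterators keep working during the move.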

C#

source code

Scheme: linked list

Growth rate: >2x. The new size is the smallest prime number greater than 2 times the old size.

Load Factor: 1

Other considerations:

  • Although it uses chaining, it does so in a neat way: the hash table stores two arrays.
  1. An array of ints: to look up a key, take its hash code modulo the array length; the value at that index is an index into array #2.
  2. An array of Entry structs, where each Entry stores a key, a value, and the index of the next Entry in its chain within the same array.
private struct Entry
{
  public TKey key;     // Key of entry
  public TValue value; // Value of entry
  public int next;     // Index of next entry, -1 if last (or only) item in chain
}

The neat thing here is that when we need to chain entries together, no linked-list nodes have to be allocated: the entries are pre-allocated. They also live in one contiguous block of memory, which improves cache locality.
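The same two-array scheme can be sketched in Python (names are illustrative; resizing, deletion, and the free list of a real implementation are omitted):

```python
class ArrayChainedMap:
    """Toy chained hash table where chains are index-linked entries in one
    pre-allocated array, in the style of C#'s Dictionary."""

    def __init__(self, num_buckets=8):
        self.buckets = [-1] * num_buckets  # -1 marks an empty bucket
        self.entries = []                  # list of (key, value, next_index)

    def put(self, key, value):
        b = hash(key) % len(self.buckets)
        i = self.buckets[b]
        while i != -1:                     # walk the chain by index, not pointer
            k, _, nxt = self.entries[i]
            if k == key:
                self.entries[i] = (key, value, nxt)  # overwrite in place
                return
            i = nxt
        # Prepend a new entry: its `next` is the old chain head.
        self.entries.append((key, value, self.buckets[b]))
        self.buckets[b] = len(self.entries) - 1

    def get(self, key):
        i = self.buckets[hash(key) % len(self.buckets)]
        while i != -1:
            k, v, nxt = self.entries[i]
            if k == key:
                return v
            i = nxt
        raise KeyError(key)
```

Because chain links are just ints into one contiguous entries array, "following a chain" usually stays within cache lines instead of chasing heap pointers.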

C++ (GCC STL)

I originally got this section completely wrong because I was looking at the wrong source code. Thanks to reddit user u/raevnos for setting me straight.

source code which #includes the actual source code

Scheme: linked list

Growth rate: >2x. The new size is the smallest prime number greater than 2 times the old size.

Load Factor: 1

Precautions:

  • The C++ standard doesn't explicitly mandate chaining, but its requirements seem impossible to satisfy without it. As explained in this Stack Overflow answer, the specification that added hash tables to the standard effectively rules out implementing the standard library hash table with open addressing.
  • Growth behavior is similar to C#
  • The table size is always a prime number [7]. This surprised me; I assumed C++ would prefer powers of 2 to play nicely with malloc.
  • The C++ standard library is hard to read

Summary

I find it interesting that so many different hash table implementations are used in production languages. Ruby's transition from chaining to open addressing is particularly interesting, since it apparently delivered a sizable improvement on benchmarks. It would be fun to write an open addressing hash table for Java or Go and compare the performance.

[1] github.com/python/cpyt…

[2] github.com/python/cpyt…

[3] Perhaps surprisingly, starting the Python interpreter and running a few commands unrelated to dictionaries triggers about 100 dictionary lookups.

[4] github.com/python/cpyt…

[5] github.com/ruby/ruby/b…

[6] github.com/ruby/ruby/b…

[7] github.com/gcc-mirror/…

The Nuggets Translation Project is a community for translating high-quality Internet technical articles. The source of the articles is the English shared articles on Nuggets. The content covers Android, iOS, front-end, back-end, blockchain, product, design, artificial intelligence and other fields. If you want to view more high-quality translations, please continue to pay attention to the Nuggets Translation Project, official Weibo, and Zhihu column.