Why is the load factor of HashMap not set to 1

Among the Java fundamentals, the collection classes are a key topic, and they come up constantly in day-to-day development; List and Map in particular appear everywhere in real code.

In my opinion, the JDK engineers put a great deal of optimization into the implementation of HashMap. If you had to name the classes with the most hidden gems in the entire JDK source, I think HashMap would easily rank in the top five.

Precisely because of this, many of its details are easy to overlook. Today we will focus on one of them:

Why is the load factor of HashMap set to 0.75 instead of 1 or 0.5? What are the considerations behind this?

Don’t underestimate this question: the load factor is a very important concept in HashMap, and it comes up frequently in senior-level interviews.

In addition, some people set this value incorrectly, which raises a follow-up question:

Since the load factor can be modified, is it appropriate to change it to 1? And why doesn’t HashMap use 1 as the default value of the load factor?

1. What is loadFactor

First, let’s introduce what the load factor (loadFactor) is. If you already understand this part, feel free to skip ahead.

We know that when a HashMap is created, its capacity is specified (the default is 16 if none is given). As we keep putting elements into the HashMap, the number of elements may outgrow that capacity, so there needs to be an expansion mechanism.

So-called expansion (resizing) simply means enlarging the capacity of the HashMap; the relevant addEntry method from the JDK 7 implementation is shown below.

From the code we can see that, while elements are being added to the HashMap, once the number of elements (size) has reached the threshold (and the target bucket is already occupied), an automatic expansion (resize) is performed. After the expansion, the existing elements must be rehashed, that is, redistributed from the old buckets into the new ones.

In HashMap, threshold = loadFactor * capacity.

loadFactor is the load factor, which indicates how full the HashMap is allowed to get. Its default value is 0.75f. In other words, by default, when the number of elements in the HashMap reaches 3/4 of its capacity, the map automatically expands.

void addEntry(int hash, K key, V value, int bucketIndex) {
    // Resize once size has reached the threshold and the target bucket is already occupied
    if ((size >= threshold) && (null != table[bucketIndex])) {
        resize(2 * table.length);                    // double the capacity
        hash = (null != key) ? hash(key) : 0;
        bucketIndex = indexFor(hash, table.length);  // recompute the bucket index for the new table
    }
    createEntry(hash, key, value, bucketIndex);
}
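
To make the numbers concrete, here is a minimal sketch (plain arithmetic, not JDK source) of how the default threshold falls out of the formula above:

// A minimal sketch: computing the resize threshold from HashMap's defaults.
public class ThresholdDemo {
    public static void main(String[] args) {
        int defaultCapacity = 16;         // HashMap's default initial capacity
        float defaultLoadFactor = 0.75f;  // HashMap's default load factor
        int threshold = (int) (defaultCapacity * defaultLoadFactor);
        System.out.println("resize threshold = " + threshold); // prints 12
    }
}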

2. Why expand

Remember what we said above: during a resize, HashMap not only enlarges its capacity but also rehashes its contents. This process is therefore quite time-consuming, and the more elements the Map holds, the longer it takes.

Rehashing means recomputing the hash of every element in the map and recalculating which bucket each one should be placed in.

So has anyone ever wondered: since it is so costly, why expand at all? Isn’t HashMap an array of linked lists? Even without expanding, it could keep storing elements indefinitely. Why bother?

This is actually related to hash collisions.

Hash Collision

We know that HashMap is built on top of a hash function, and hash functions have this basic property: if two inputs produce different hash values under the same hash function, the inputs must be different; but if two inputs produce the same hash value, the inputs are not necessarily the same.

When two different inputs produce the same hash value under the same hash function, we call it a hash collision.

An important measure of a hash function’s quality is its collision probability; equally important is how collisions are resolved.

There are many ways to resolve hash collisions. One of the most common is separate chaining (the “chained address” method), which is the approach HashMap uses.

HashMap combines an array with linked lists to get the advantages of both: we can think of it as an array of linked lists, and that is exactly the data structure it is implemented on.

When we put an element into the HashMap, we first need to locate which linked list in the array it belongs to, and then append the element to the end of that list.

When we get an element from the HashMap, we likewise locate the right linked list in the array and then traverse it element by element until we find the one we need.
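
As a rough sketch (simplified, not the actual JDK source), locating the bucket amounts to mapping the key’s hash code onto an array index; because the table length is always a power of two, a bit mask can stand in for the modulo operation:

// A simplified sketch of how a key is mapped to a bucket index.
public class BucketIndexDemo {
    public static void main(String[] args) {
        int capacity = 16;                        // table length, always a power of two in HashMap
        String key = "example";
        int hash = key.hashCode();                // simplified; the real HashMap also spreads the high bits
        int bucketIndex = (capacity - 1) & hash;  // masks off the high bits, giving an index in [0, capacity)
        System.out.println("bucket index = " + bucketIndex);
    }
}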

However, if there are too many collisions in a HashMap, the array of linked lists degenerates into what is effectively a single long linked list, and lookup speed drops sharply.

Therefore, to keep HashMap reads fast, we need a way to keep the collision rate low.

Expanding to avoid hash collisions

So how can we effectively avoid hash collisions?

Let’s think about it in reverse first: what causes a HashMap to have more hash collisions?

It comes down to no more than two situations:

1. The capacity is too small. The smaller the capacity, the higher the probability of collision: more wolves and less meat means more fighting.

2. The hash algorithm is not good enough. If the algorithm is poor, elements may all land in the same bucket or in just a few buckets; an uneven distribution also leads to contention.

Accordingly, HashMap tackles hash collisions from these same two angles.

Both points are well reflected in HashMap: it expands the array at the appropriate time, and it uses a well-designed hash algorithm to decide which bucket each element goes into. Together, these two measures greatly reduce the probability of collisions and avoid the problem of slow lookups.
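
For reference, JDK 8’s HashMap tackles the second point by spreading the high bits of the key’s hash code into the low bits before the bucket index is computed; its hash method looks essentially like this:

// JDK 8 spreads the higher bits of the hash code into the lower bits so that
// keys whose hash codes differ only in the high bits still spread across the
// buckets of a small table.
static final int hash(Object key) {
    int h;
    return (key == null) ? 0 : (h = key.hashCode()) ^ (h >>> 16);
}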

3. Why the default loadFactor is 0.75

So far we know that loadFactor is an important concept in HashMap: it indicates how full the HashMap is allowed to get.

To avoid excessive hash collisions, HashMap needs to expand at the right time: when the number of elements reaches a critical value (the threshold), which, as mentioned earlier, depends on loadFactor. In other words, a reasonable loadFactor goes a long way toward avoiding hash collisions.

So, what is the proper loadFactor setting?

This value is now 0.75 in the JDK sources:

/**
 * The load factor used when none specified in constructor.
 */
static final float DEFAULT_LOAD_FACTOR = 0.75f;

So why choose 0.75? What considerations lie behind it? Why not 1, why not 0.8 or 0.5, but 0.75?

The official JDK documentation has this to say:

As a general rule, the default load factor (.75) offers a good tradeoff between time and space costs. Higher values decrease the space overhead but increase the lookup cost (reflected in most of the operations of the HashMap class, including get and put).

Just imagine: if we set the load factor to 1 and keep the default initial capacity of 16, the HashMap will not expand until it is completely “full”.

In the best case, those 16 elements land in 16 different buckets after hashing; otherwise hash collisions are inevitable. And as the number of elements grows, the probability of collisions grows with it, and lookups get slower.
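
For what it’s worth, HashMap exposes a constructor that takes both an initial capacity and a load factor, so the scenario above is easy to reproduce; the snippet below is only an illustration:

import java.util.HashMap;
import java.util.Map;

// Illustration: a load factor of 1 versus the default 0.75, both with capacity 16.
public class LoadFactorOneDemo {
    public static void main(String[] args) {
        // No resize happens while it holds up to 16 entries (threshold = 16 * 1.0 = 16),
        // so collisions pile up before any expansion occurs.
        Map<Integer, String> packedMap = new HashMap<>(16, 1.0f);

        // Resizes around the 13th entry (threshold = 16 * 0.75 = 12).
        Map<Integer, String> defaultMap = new HashMap<>(16, 0.75f);

        for (int i = 0; i < 16; i++) {
            packedMap.put(i, "value" + i);
            defaultMap.put(i, "value" + i);
        }
    }
}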

Mathematical basis of 0.75

We can also reason mathematically about what a suitable value might be.

Assume the hash function distributes keys uniformly, and consider the point at which a given bucket is equally likely to be empty or non-empty (probability 0.5). Let s denote the capacity (the number of buckets) and n the number of keys that have been inserted.

By the binomial distribution, the probability that a given bucket is still empty is:

P(0) = C(n, 0) * (1/s)^0 * (1 - 1/s)^(n - 0) = (1 - 1/s)^n

Setting P(0) = 0.5 and solving for n, a given bucket is more likely to be empty than not as long as the number of inserted keys is less than:

log(2) / log(s / (s - 1))

If n is chosen so that P(0) = 0.5, then as s tends to infinity the ratio n/s quickly approaches log(2):

log(2) ≈ 0.693…

Therefore, by this reasoning, a reasonable load factor is somewhere around 0.7.
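
To see how quickly n/s converges, here is a small sketch (not part of the original article) that evaluates the formula above for a few table sizes:

// A small sketch: evaluating n = log(2) / log(s / (s - 1)) for a few table sizes
// to show how n/s approaches log(2) ≈ 0.693.
public class LoadFactorEstimate {
    public static void main(String[] args) {
        for (int s : new int[] {16, 64, 256, 1024}) {
            // n = number of insertions at which a given bucket is empty with probability 0.5
            double n = Math.log(2) / Math.log((double) s / (s - 1));
            System.out.printf("s = %4d, n = %7.2f, n/s = %.3f%n", s, n, n / s);
        }
    }
}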

Of course, this derivation does not appear in the official Java documentation, and we have no way of knowing whether it was actually part of the designers’ reasoning; much as we can’t know what Lu Xun was thinking when he wrote an essay, we can only speculate. This particular speculation comes from Stack Overflow (https://stackoverflow.com/questions/10901752/what-is-the-significance-of-load-factor-in-hashmap).

Why exactly 0.75

In theory, the load factor should not be too large, or there will be a large number of hash collisions; nor should it be too small, or space will be wasted.

And from the mathematical reasoning above, a value of around 0.7 looks reasonable.

So why was 0.75 chosen in the end?

Remember the formula we mentioned earlier, that is, threshold = loadFactor * capacity.

We also know that HashMap’s expansion mechanism guarantees that the capacity is always a power of 2.

So, to make loadFactor * capacity come out as an integer, 0.75 (that is, 3/4) is a sensible choice: multiplied by any power of 2 of at least 4, it always yields an integer (for example, 16 * 0.75 = 12 and 32 * 0.75 = 24).

4. Summary

HashMap is a key-value (K-V) structure. To make lookups and insertions fast, it is implemented on top of an array of linked lists.

A hash function is used to compute each element’s position, and collisions are resolved with separate chaining. This approach has two extremes.

If the collision probability in a HashMap is high, the HashMap effectively degenerates into a linked list (not literally, but operations behave as if they were working directly on a linked list), and we know the biggest drawback of a linked list is slow lookups: you have to traverse it from the head, one node at a time.

Therefore, to avoid a large number of hash collisions, the HashMap must be expanded at an appropriate time.

Expansion is triggered when the number of elements reaches a critical value, which HashMap computes as:

threshold = loadFactor * capacity

Here the load factor represents how full the array is allowed to become. Its value should be neither too large nor too small.

If loadFactor is too large, say 1, the probability of hash collisions becomes high, which greatly slows down lookups.

If loadFactor is too small, say 0.5, frequent expansion wastes a lot of space.

So the value needs to fall somewhere between 0.5 and 1, and the mathematical derivation suggests that something close to log(2) ≈ 0.693 is reasonable.

In addition, to make expansion efficient, HashMap requires that the capacity always be a power of 2.

With a loadFactor of 3/4, the product with the capacity comes out as an integer.

Therefore, in general, we do not recommend modifying the value of loadFactor unless there are special reasons.

For example, if I know for certain that my Map will only ever hold 5 key-value pairs and will never change, I might consider tuning loadFactor.

But even then I don’t recommend it; the same goal can be achieved by specifying the capacity instead, as shown below.
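
For instance, a common sizing idiom (shown here as a sketch with assumed numbers, not an official helper) derives the initial capacity from the expected entry count and the default load factor:

import java.util.HashMap;
import java.util.Map;

// Sketch: size the map by capacity instead of changing the load factor.
public class PresizedMapDemo {
    public static void main(String[] args) {
        int expectedEntries = 5;
        // A capacity of at least expected / 0.75 guarantees no resize while inserting them;
        // HashMap rounds the requested capacity up to the next power of two (8 here).
        Map<String, String> map = new HashMap<>((int) (expectedEntries / 0.75f) + 1);
        map.put("k1", "v1"); // ... up to 5 entries, no resize occurs
    }
}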