Containers (collections) in Java, the underlying principles of HashMap, the differences between ArrayList, LinkedList, and Vector, and why HashMap's load factor is 0.75

1. Containers in Java

Collections are mainly divided into two top-level interfaces: Collection and Map. The sub-interfaces of Collection include List and Set. The implementation classes of List include ArrayList (backed by an array), LinkedList (backed by a doubly linked list), and Vector; the implementation classes of Set include HashSet and TreeSet; the implementation classes of Map include HashMap, TreeMap, and Hashtable.

(Supplement: Hashtable is similar to HashMap but thread-safe; its subclass Properties is also thread-safe.)

1. The underlying principle of HashMap?

HashMap stores data in the form of key-value pairs. Its underlying structure is a hash table: before JDK 1.8 it was an array plus linked lists; since JDK 1.8 it is an array plus linked lists plus red-black trees. (Default array length: 16)

When adding elements, if the length of a bucket's linked list reaches 8 but the array length is less than 64, the array is resized to twice its original length; if the length of a linked list reaches 8 and the array length is at least 64, the linked list is converted into a red-black tree. A red-black tree is a self-balancing binary search tree, so lookups within such a bucket stay efficient.

When elements are removed and the number of nodes in a bucket drops to 6 or fewer, the red-black tree is converted back into a linked list.

(Additional: before JDK 1.8, new nodes were inserted at the head of a bucket's linked list; since JDK 1.8 they are appended at the tail. In JDK 1.7 the array is created with a default capacity of 16 when the map is constructed; in JDK 1.8 no array is allocated at construction time, and it is created lazily with an initial length of 16 on the first add.

Hash conflict resolution strategies: the chain address method (separate chaining), the open addressing method, rehashing, and establishing a public overflow area.)
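
The thresholds above can be summed up in a small decision sketch. The constants below mirror the values documented in the OpenJDK HashMap source (TREEIFY_THRESHOLD, UNTREEIFY_THRESHOLD, MIN_TREEIFY_CAPACITY); the method itself is only an illustration of the rule, not the actual JDK code:

// Simplified illustration of HashMap's treeify-vs-resize rule; not the JDK implementation.
public class TreeifyDecisionSketch {
    // These values mirror the constants declared in java.util.HashMap (OpenJDK 8+).
    static final int DEFAULT_INITIAL_CAPACITY = 16;
    static final float DEFAULT_LOAD_FACTOR = 0.75f;
    static final int TREEIFY_THRESHOLD = 8;      // chain length at which treeification is considered
    static final int UNTREEIFY_THRESHOLD = 6;    // a tree shrinks back to a list at or below this size
    static final int MIN_TREEIFY_CAPACITY = 64;  // below this table size, resize instead of treeifying

    static String onChainGrew(int chainLength, int tableLength) {
        if (chainLength >= TREEIFY_THRESHOLD) {
            return (tableLength < MIN_TREEIFY_CAPACITY)
                    ? "resize the table to " + (tableLength * 2)
                    : "convert this bucket's list into a red-black tree";
        }
        return "keep the linked list";
    }

    public static void main(String[] args) {
        System.out.println(onChainGrew(8, 16)); // resize the table to 32
        System.out.println(onChainGrew(8, 64)); // convert this bucket's list into a red-black tree
    }
}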

2. What are the differences between ArrayList, LinkedList, and Vector collections?

The bottom layer of the ArrayList collection is an array, which makes it well suited to traversal and random access; when an element is added to a full array, the capacity is expanded to 1.5 times the original length. (With the no-argument constructor the backing array starts with length 0; after the first call to add, the length becomes 10 if no capacity was specified.)

The bottom layer of the LinkedList collection is a doubly linked (non-circular) list. Inserting and deleting elements in the middle is relatively efficient, but traversal and random access are relatively slow.

The Vector collection is similar to ArrayList: the bottom layer is also an array, but it is thread-safe because each method is modified by synchronized, so execution efficiency is lower. (Each expansion is 2 times the original array length.)

(Supplement: for thread safety you can also use CopyOnWriteArrayList from java.util.concurrent (juc), which copies the array on write.)
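
A small usage sketch contrasting these implementations and the thread-safe alternatives; the class and variable names are made up for illustration:

import java.util.ArrayList;
import java.util.Collections;
import java.util.LinkedList;
import java.util.List;
import java.util.Vector;
import java.util.concurrent.CopyOnWriteArrayList;

public class ListChoicesSketch {
    public static void main(String[] args) {
        List<String> arrayList = new ArrayList<>();          // array-backed; O(1) random access; grows to ~1.5x when full
        LinkedList<String> linkedList = new LinkedList<>();  // doubly linked list; cheap insert/remove at known positions
        List<String> vector = new Vector<>();                // array-backed and synchronized; grows to 2x when full

        arrayList.add("a");
        linkedList.addFirst("b");   // LinkedList also works as a deque
        vector.add("c");

        // Thread-safe alternatives:
        List<String> cow = new CopyOnWriteArrayList<>(arrayList);              // copies the array on every write; good for read-heavy use
        List<String> synced = Collections.synchronizedList(new ArrayList<>()); // wraps each call in synchronized
        cow.add("d");
        synced.add("e");
    }
}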

2. Why is the loading factor of HashMap 0.75?

Why does HashMap need a loading factor?

  • What are some ways to resolve conflicts?

    • 1. Open addressing method

    • 2. Rehashing

    • 3. Create a public overflow area

    • 4. Chain address method (zipper method)

  • Why does the HashMap loading factor have to be 0.75? Instead of 0.8, 0.6?

  • So why can’t it be 0.8 or 0.6?

The bottom layer of HashMap is a hash table, a structure that stores key-value pairs. It needs some calculation to determine where a piece of data is stored in the table:

static final int hash(Object key) {
    int h;
    // Spread the high bits of hashCode() into the low bits, so that small tables
    // (which only use the low bits when indexing) still benefit from them.
    return (key == null) ? 0 : (h = key.hashCode()) ^ (h >>> 16);
}
// AbstractMap
public int hashCode() {
    int h = 0;
    Iterator<Entry<K,V>> i = entrySet().iterator();
    while (i.hasNext())
        h += i.next().hashCode();
    return h;
}
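
The spread hash is then mapped to a table slot by masking it with the table length; HashMap does this with (n - 1) & hash, which gives the same result as hash % n because the table length is always a power of two. A minimal sketch of that step (the method name and example key are illustrative):

// Sketch: mapping the spread hash to a bucket index, as HashMap does with (n - 1) & hash.
static int bucketIndex(Object key, int n) {      // n is the table length, always a power of two
    int h = (key == null) ? 0 : key.hashCode();
    int spread = h ^ (h >>> 16);                 // same spreading as HashMap.hash()
    return (n - 1) & spread;                     // low-bit mask; same result as spread % n
}
// Example: bucketIndex("example", 16) returns a slot in the range 0..15.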

Most data structures are either fast to query or fast to insert; HashMap is a data structure that is relatively slow to insert but fast to query.

However, this data structure is prone to two problems:

① If space utilization is high, then when the hash algorithm computes a storage location, it will often find that the location already holds data (a hash conflict);

② If the array capacity is increased in order to avoid hash conflicts, it will lead to low space utilization.

The loading factor indicates how full the hash table is.

1. Loading factor

Loading factor = number of elements filled in the table / length of the hash table

The larger the loading factor, the more elements are filled and the higher the space utilization, but the chance of conflict increases;

The smaller the loading factor, the fewer elements are filled, and the chance of conflict is reduced, but more space is wasted, and the number of expansion rehash operations will be increased.

The greater the chance of conflict, the more often the data being sought has to be found by probing further, and the higher the cost of a lookup. Therefore, a balance and compromise must be found between “chance of conflict” and “space utilization”.

So we can also know that the main factors that affect search efficiency include the following:

  • Can a hash function evenly hash the data in a hash table?

  • How to deal with conflicts?

  • How to choose the loading factor of hash table?

2. What are some ways to resolve conflicts?

1. Open addressing method

Hi = (H(key) + di) MOD m, where i = 1, 2, …, k (k ≤ m − 1)

H(key) is the hash function, m is the length of the hash table, di is an increment sequence, and i is the number of conflicts that have occurred so far. Depending on how the step size is chosen, the open addressing method comes in three variants:

1.1 Linear Probing: di = 1,2,3,…,m-1

Simply put, starting from the position where the conflict occurred, it keeps searching with a step size of 1 until an empty slot is found; if the whole table is traversed without finding one, the container is full. As an analogy, it is like going out at meal time and walking from restaurant to restaurant to see whether there is a free seat.

1.2 Quadratic Probing: di = ±1², ±2², ±3², …, ±k² (k ≤ m/2)

Compared with linear probing, this searches with a step size of di = i² until an empty slot is found. Continuing the example above: instead of going door to door, you work out the step on your phone and go straight to the i²-th restaurant to ask whether it has a seat.

1.3 Pseudo-random probing: di = a pseudo-random number sequence

This takes a pseudo-random number as the step size. Using the same example again, this time you simply pick a restaurant on a whim and ask whether it has a seat.
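
A minimal sketch of open addressing with linear probing (di = 1); the class name and the fixed String/Integer types are chosen only for this illustration:

// Minimal sketch of open addressing with linear probing (di = 1); illustrative only.
public class LinearProbingSketch {
    private final String[] keys;
    private final Integer[] values;
    private final int m;                      // table length

    LinearProbingSketch(int m) {
        this.m = m;
        this.keys = new String[m];
        this.values = new Integer[m];
    }

    void put(String key, Integer value) {
        int i = (key.hashCode() & 0x7fffffff) % m;   // H(key)
        int probes = 0;
        while (keys[i] != null && !keys[i].equals(key)) {
            i = (i + 1) % m;                          // Hi = (H(key) + di) MOD m with di = 1, 2, 3, ...
            if (++probes == m) throw new IllegalStateException("table is full");
        }
        keys[i] = key;
        values[i] = value;
    }

    Integer get(String key) {
        int i = (key.hashCode() & 0x7fffffff) % m;
        int probes = 0;
        while (keys[i] != null) {
            if (keys[i].equals(key)) return values[i];
            i = (i + 1) % m;                          // keep probing along the same sequence
            if (++probes == m) break;
        }
        return null;
    }
}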

But the open addressing method has these shortcomings:

  • A hash table built this way tends to cluster (pile up) data when there are many conflicts, which hurts search performance;

  • When deleting a node, you cannot simply clear its slot, otherwise the search path of the synonym nodes filled in after it would be cut off. Therefore, a deleted node can only be given a deletion mark; it cannot actually be removed;

  • If the hash table space is full, an overflow table needs to be created to store the extra elements.

2. Rehashing

Hi = RHi(key), where i = 1, 2, …, k

RHi() is a family of hash functions different from H(). When synonyms collide at an address, another hash function is used to compute a new address, and this repeats until no conflict occurs. This method is less prone to clustering but increases computation time.

So the disadvantage of rehashing is: increased calculation time.
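
A small sketch of the idea, using a short sequence of alternative hash functions; the functions and the method name here are made up purely for illustration:

// Sketch of rehashing: try RH1, then RH2, then RH3 until a free address is found.
static int slotFor(String key, int m, boolean[] occupied) {
    int h = key.hashCode() & 0x7fffffff;
    int[] candidates = {
        h % m,                                 // RH1(key)
        ((h * 31 + 7) & 0x7fffffff) % m,       // RH2(key): a different, illustrative hash function
        ((h * 131 + 13) & 0x7fffffff) % m      // RH3(key)
    };
    for (int slot : candidates) {
        if (!occupied[slot]) return slot;      // first non-conflicting address wins
    }
    return -1;                                 // every function in the sequence collided
}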

3. Create a public overflow area

Assume the value range of the hash function is [0, m−1]. Let the vector HashTable[0, …, m−1] be the basic table, each component of which stores one record, and let the vector OverTable[0, …, v] be the overflow table. The basic table stores the keyword records; once a conflict occurs, regardless of what hash address the hash function computes, the conflicting element is placed into the overflow table.

But the disadvantage of this method is that, when looking up data that once conflicted, the overflow table has to be traversed.

4. Chain address method (zipper method)

The elements that hash to the same position are organized into a linked list. When adding data, if its hash address conflicts with an element already in the hash table, it is appended to the linked list at that position.

Advantages of the zipper method:

  • Conflicts are handled simply and no clustering occurs; keys that are not synonyms never conflict with each other, so the average search length is shorter;

  • Since the node space for each linked list in the zipper method is allocated dynamically, it is better suited to situations where the table length cannot be determined before the table is created;

  • Node deletion is easy to implement: simply remove the corresponding node from the linked list.

Disadvantages of the zipper method: Requires additional storage space.
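
A minimal sketch of the chain address (zipper) method; the class, node type, and key/value types are illustrative only:

// Minimal sketch of separate chaining: each array slot holds a linked list of entries.
import java.util.LinkedList;

public class ChainingSketch {
    static class Node {
        final String key;
        Integer value;
        Node(String key, Integer value) { this.key = key; this.value = value; }
    }

    private final LinkedList<Node>[] table;

    @SuppressWarnings("unchecked")
    ChainingSketch(int m) {
        table = new LinkedList[m];
        for (int i = 0; i < m; i++) table[i] = new LinkedList<>();
    }

    void put(String key, Integer value) {
        LinkedList<Node> bucket = table[(key.hashCode() & 0x7fffffff) % table.length];
        for (Node n : bucket) {
            if (n.key.equals(key)) { n.value = value; return; }  // same key: overwrite
        }
        bucket.add(new Node(key, value));                        // conflict or empty slot: append to the chain
    }

    Integer get(String key) {
        for (Node n : table[(key.hashCode() & 0x7fffffff) % table.length]) {
            if (n.key.equals(key)) return n.value;
        }
        return null;
    }
}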

From the underlying structure of HashMap we can see that it uses an array plus linked lists / red-black trees, i.e. it resolves hash conflicts with the chain address method: each array slot holds a chain, which is treeified when it grows long.

3. Why must the HashMap loading factor be 0.75, rather than 0.8 or 0.6?

The bottom layer of HashMap is a hash table, and its way of resolving conflicts is the chain address method. The default initial capacity of HashMap is 16. In order to reduce the probability of conflicts, when the number of elements in the HashMap reaches a critical value, an expansion is triggered and all elements are rehashed into the enlarged container, which is quite a time-consuming operation.

This critical value is determined by the load factor and the capacity of the current container:

Threshold = DEFAULT_INITIAL_CAPACITY * DEFAULT_LOAD_FACTOR

That is, by default, once 16 × 0.75 = 12 elements are present, the expansion operation is triggered.
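
A short sketch of that arithmetic as the table doubles; the values simply follow threshold = capacity × load factor, with illustrative class and variable names:

// Sketch: how the resize threshold follows capacity * load factor as the table doubles.
public class ThresholdSketch {
    public static void main(String[] args) {
        float loadFactor = 0.75f;   // DEFAULT_LOAD_FACTOR
        int capacity = 16;          // DEFAULT_INITIAL_CAPACITY
        for (int i = 0; i < 3; i++) {
            int threshold = (int) (capacity * loadFactor);
            System.out.println("capacity " + capacity + " -> resize after " + threshold + " entries");
            capacity *= 2;          // each expansion doubles the table
        }
        // prints: 16 -> 12, 32 -> 24, 64 -> 48
    }
}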

So why was 0.75 chosen as the loading factor of HashMap? This is related to a very important distribution in statistics: the Poisson distribution.

The Poisson distribution is a common discrete probability distribution in statistics and probability theory, used to describe the probability of a given number of random events occurring per unit time. Those interested can refer to Wikipedia or this article by teacher Ruan Yifeng: Poisson distribution and exponential distribution.

Its formula is P(N(t) = n) = ((λt)^n · e^(−λt)) / n!. On the left side of the equals sign, P represents probability, N represents a counting function of events, t represents time, and n represents the number of events; on the right side, λ represents the frequency (rate) at which the event occurs.

There is such a comment in the source code of HashMap:

* Ideally, under random hashCodes, the frequency of
* nodes in bins follows a Poisson distribution
* (http://en.wikipedia.org/wiki/Poisson_distribution) with a
* parameter of about 0.5 on average for the default resizing
* threshold of 0.75, although with a large variance because of
* resizing granularity. Ignoring variance, the expected
* occurrences of list size k are (exp(-0.5) * pow(0.5, k) /
* factorial(k)). The first values are:
* 0: 0.60653066
* 1: 0.30326533
* 2: 0.07581633
* 3: 0.01263606
* 4: 0.00157952
* 5: 0.00015795
* 6: 0.00001316
* 7: 0.00000094
* 8: 0.00000006
* more: less than 1 in ten million

Ideally, using random hash codes and with the default resizing threshold (loading factor) of 0.75, the number of nodes appearing in a hash bucket follows a Poisson distribution whose parameter averages about 0.5. Ignoring the variance, with λt = 0.5 the probability of a bucket holding exactly k nodes is P(k) = exp(−0.5) · 0.5^k / k!.

The calculation results are shown in the list above. When the length of the linked list in a bin reaches 8 elements, the probability is about 0.00000006, which is almost an impossible event.

So in fact the constant 0.5 is the parameter substituted into the Poisson distribution, with the loading factor of 0.75 as the precondition: when size/capacity of a HashMap reaches 0.75 it is expanded, and under this condition the chain (zipper) lengths produced by conflicts and their probabilities are:

0: 0.60653066
1: 0.30326533
2: 0.07581633
3: 0.01263606
4: 0.00157952
5: 0.00015795
6: 0.00001316
7: 0.00000094
8: 0.00000006
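
A short sketch that reproduces those probabilities from the formula exp(−0.5) · 0.5^k / k!; the class name is illustrative:

// Reproduces the probabilities quoted from the HashMap source comment:
// P(k) = exp(-0.5) * 0.5^k / k!, with lambda*t = 0.5.
public class PoissonBucketSketch {
    public static void main(String[] args) {
        double lambdaT = 0.5;
        double factorial = 1.0;
        for (int k = 0; k <= 8; k++) {
            if (k > 0) factorial *= k;                 // k! built up incrementally
            double p = Math.exp(-lambdaT) * Math.pow(lambdaT, k) / factorial;
            System.out.printf("%d: %.8f%n", k, p);     // 0: 0.60653066 ... 8: 0.00000006
        }
    }
}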

4. Why can’t it be 0.8 or 0.6?

In addition to the hash algorithm, two parameters in HashMap affect performance: the initial capacity and the loading factor. The initial capacity is the capacity of the hash table when it is created, and the loading factor is a measure of how full the hash table is allowed to become before its capacity is automatically expanded.

5. Load factors as described in Wikipedia:

For open addressing, the load factor is a particularly important parameter and should be kept strictly below roughly 0.7-0.8. If it exceeds 0.8, CPU cache misses during table lookups rise sharply. For this reason, some hash libraries, such as Java's system library, limit the loading factor to 0.75: if this value is exceeded, the hash table is resized.

When setting the initial capacity, the number of entries the map will need to hold and its loading factor should be taken into account, so as to minimize the number of expansion (rehash) operations. Therefore, when using HashMap, it is generally recommended to set the initial capacity according to the estimated number of entries in order to reduce expansion operations.
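
A common way to do that is to divide the expected number of entries by the load factor; the helper name below is made up for this sketch (Guava's Maps.newHashMapWithExpectedSize performs essentially the same calculation):

// Sketch: choose an initial capacity so the expected entries stay below capacity * 0.75.
import java.util.HashMap;
import java.util.Map;

public class InitialCapacitySketch {
    static <K, V> Map<K, V> mapForExpectedSize(int expectedEntries) {
        // Pick a capacity such that expectedEntries <= capacity * 0.75;
        // HashMap rounds the requested capacity up to the next power of two internally.
        int initialCapacity = (int) (expectedEntries / 0.75f) + 1;
        return new HashMap<>(initialCapacity);
    }

    public static void main(String[] args) {
        Map<String, Integer> m = mapForExpectedSize(1000); // requests 1334, so the table becomes 2048
        m.put("example", 1);
    }
}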

Choosing 0.75 as the default loading factor is completely a compromise between time and space costs.