Source code and analysis of specific implementations of the Set interface

The author was asked this question in an interview at a bank: have you ever looked into the implementation principle of HashSet? I had only ever used this class in the ordinary way and had never read its source code, so I was left at a loss for words.

(Using OpenJDK 19 as an example)

Source code

To analyze HashSet, we first need to look at the methods declared by the Set interface it implements.

Set interface

public interface Set<E> extends Collection<E> {
    int size();

    boolean isEmpty();

    boolean contains(Object o);

    Iterator<E> iterator();

    Object[] toArray();

    <T> T[] toArray(T[] a);

    boolean add(E e);

    boolean remove(Object o);

    boolean containsAll(Collection<?> c);

    boolean addAll(Collection<? extends E> c);

    boolean retainAll(Collection<?> c);

    boolean removeAll(Collection<?> c);

    void clear();

    boolean equals(Object o);

    int hashCode();

    @Override
    default Spliterator<E> spliterator() {
        return Spliterators.spliterator(this, Spliterator.DISTINCT);
    }
}
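
To make the interface concrete, here is a minimal usage sketch that exercises a few of the core methods above through the Set type. The class name and element values are illustrative, not from the JDK.

import java.util.HashSet;
import java.util.Set;

public class SetDemo {
    public static void main(String[] args) {
        Set<String> set = new HashSet<>();
        System.out.println(set.add("a"));      // true: element was absent
        System.out.println(set.add("a"));      // false: duplicates are rejected
        System.out.println(set.size());        // 1
        System.out.println(set.contains("a")); // true
        set.remove("a");
        System.out.println(set.isEmpty());     // true
    }
}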

Thanks to a mature development process and a long history, the comments in the source code are very thorough. We can already infer the purpose of each method from its name and Javadoc, but some annotations, such as NotNull, deserve extra attention.

HashSet implementation

We should first look at the class-level comment in the source code. The original text is not reproduced here; interested readers can look it up themselves.

Load factor

In the underlying source code of HashSet, several constructors take a loadFactor parameter:

    public HashSet(int initialCapacity, float loadFactor) {
        map = new HashMap<>(initialCapacity, loadFactor);
    }

    // Package-private constructor used by LinkedHashSet; the dummy
    // parameter only distinguishes it from the public constructor above.
    HashSet(int initialCapacity, float loadFactor, boolean dummy) {
        map = new LinkedHashMap<>(initialCapacity, loadFactor);
    }

    public HashSet(Collection<? extends E> c) {
        map = new HashMap<>(Math.max((int) (c.size()/.75f) + 1, 16));
        addAll(c);
    }

The load factor measures how full the hash table is. Its formula is as follows:

load factor = number of elements / hash table capacity

When elements are added, HashMap checks whether the number of entries has exceeded capacity × load factor and, if so, resizes the table (doubling its capacity). The .75f in the source code above corresponds to the default load factor of 0.75, an empirically derived balance point between memory usage and lookup efficiency.
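
As a rough sketch of the arithmetic (the class and variable names here are illustrative, not the JDK's internal fields):

public class LoadFactorDemo {
    public static void main(String[] args) {
        int capacity = 16;          // default HashMap capacity
        float loadFactor = 0.75f;   // default load factor
        int threshold = (int) (capacity * loadFactor); // 12

        // Once a 13th entry is added, the table resizes and capacity doubles.
        System.out.println("resize after exceeding " + threshold + " entries");

        // This also explains the constructor above: sizing the map to
        // c.size()/.75f + 1 lets the copied elements fit without a resize.
        int elementsToHold = 100;
        int initialCapacity = Math.max((int) (elementsToHold / .75f) + 1, 16);
        System.out.println("initial capacity for 100 elements: " + initialCapacity); // 134
    }
}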

Source code comment summary

  1. HashSet is an implementation of the Set interface based on a hash table.
  2. This implementation does not guarantee the iteration order of the set, nor does it guarantee that the order will remain constant over time.
  3. This implementation provides constant-time performance for the basic operations add, remove, contains, and size.
  4. Iterating over a HashSet takes time proportional to the sum of the set's size and the capacity of the backing hash table.
  5. HashSet can store null, but at most one. A set rejects duplicates, and since null equals null, any further attempt to add null is treated as a duplicate of the one already stored (see the snippet after this list).
  6. HashSet is not thread-safe. When multiple threads access it concurrently and at least one thread modifies it, external synchronization is required.
  7. The iterator of HashSet is fail-fast: if the set is modified during iteration by anything other than the iterator's own remove method, the iterator throws ConcurrentModificationException.
  8. Fail-fast behavior is not guaranteed, because no hard guarantees can be made in the presence of unsynchronized concurrent modification.
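
A quick illustration of point 5, as a minimal sketch (the class name is illustrative):

import java.util.HashSet;
import java.util.Set;

public class NullElementDemo {
    public static void main(String[] args) {
        Set<String> set = new HashSet<>();
        System.out.println(set.add(null)); // true: the first null is stored
        System.out.println(set.add(null)); // false: rejected as a duplicate
        System.out.println(set.size());    // 1
    }
}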

What is fail-fast, and is there fail-slow?

In Java development, "fail-fast" is a behavioral mechanism of an iterator (Iterator) or a collection (Collection). When an iterator detects that the collection has been modified during iteration (other than through the iterator's own remove method), it immediately throws a ConcurrentModificationException to avoid unpredictable behavior in subsequent iterations.

The purpose of the fail-fast mechanism is to detect and report concurrent modification early, so that the program fails during iteration rather than producing indeterminate results later. This helps developers find and fix potential concurrency problems early and improves program reliability and stability.

There is no "fail-slow" mechanism in Java's collections. If there were one, an iterator could silently continue after the collection was modified mid-iteration, without throwing an exception, which could cause the iterator to produce incorrect results in subsequent operations.
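
A minimal sketch that triggers the fail-fast behavior, followed by the iterator-based removal that avoids it (the class name is illustrative):

import java.util.HashSet;
import java.util.Iterator;
import java.util.Set;

public class FailFastDemo {
    public static void main(String[] args) {
        Set<Integer> set = new HashSet<>(Set.of(1, 2, 3));

        try {
            for (Integer i : set) {
                set.remove(i); // structural modification during iteration
            }
        } catch (java.util.ConcurrentModificationException e) {
            System.out.println("fail-fast triggered: " + e);
        }

        // The safe alternative: remove through the iterator itself.
        Iterator<Integer> it = set.iterator();
        while (it.hasNext()) {
            it.next();
            it.remove(); // allowed: keeps the modification count consistent
        }
        System.out.println(set.isEmpty());
    }
}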

Summary overview

Overview

HashSet, LinkedHashSet, and TreeSet are all implementation classes of the Set interface. All three guarantee element uniqueness, and none of them is thread-safe. The main difference among the three is the underlying data structure each uses.

HashSet uses a hash table as its underlying data structure and is in fact implemented on top of HashMap. A hash table stores elements by mapping their hash codes to bucket indexes, and resolves hash collisions with linked lists or, since JDK 8, red-black trees. HashSet suits scenarios where no ordering of elements needs to be guaranteed. Thanks to the hash table, HashSet offers fast add, remove, and lookup operations with an average time complexity of O(1).

The underlying data structure of LinkedHashSet combines a hash table with a linked list. It inherits from HashSet and maintains insertion order by threading a doubly linked list through the hash table entries. Specifically, when an element is added to a LinkedHashSet, it is linked at the end of the list, so iteration visits elements in insertion (FIFO) order. LinkedHashSet suits scenarios where the insertion order of elements must be preserved.

The underlying data structure of TreeSet is a red-black tree (a self-balancing binary search tree). It implements the NavigableSet interface (a subinterface of SortedSet), which guarantees that elements are kept in sorted order. TreeSet supports both natural ordering of elements and custom ordering via a Comparator. Due to the red-black tree, the elements in a TreeSet are ordered, and insertion, deletion, and search all run in O(log n) time. TreeSet suits scenarios that require sorted elements or custom sorting rules.
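
The difference in iteration order is easy to see side by side; a small sketch (class name illustrative):

import java.util.*;

public class SetOrderDemo {
    public static void main(String[] args) {
        List<Integer> input = List.of(5, 1, 4, 2, 3);

        Set<Integer> hash = new HashSet<>(input);
        Set<Integer> linked = new LinkedHashSet<>(input);
        Set<Integer> tree = new TreeSet<>(input);

        System.out.println(hash);   // no guaranteed order
        System.out.println(linked); // [5, 1, 4, 2, 3] -- insertion order
        System.out.println(tree);   // [1, 2, 3, 4, 5] -- natural (sorted) order

        // Custom ordering via a Comparator:
        Set<Integer> reversed = new TreeSet<>(Comparator.reverseOrder());
        reversed.addAll(input);
        System.out.println(reversed); // [5, 4, 3, 2, 1]
    }
}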

Summary

  • HashSet uses a hash table as its underlying data structure and suits scenarios where element order does not matter.
  • LinkedHashSet combines a hash table with a linked list, so iteration follows insertion (FIFO) order.
  • TreeSet uses a red-black tree; its elements are kept sorted, with support for both natural ordering and custom Comparator-based ordering.
  • These differences in underlying data structure lead to different application scenarios for HashSet, LinkedHashSet, and TreeSet.
  • Use HashSet when order is irrelevant, LinkedHashSet when insertion order must be preserved, and TreeSet when sorted or custom-ordered elements are required.

The following are source code snippets for the three implementations, focusing on the data structures they use:

HashSet:

public class HashSet<E> extends AbstractSet<E> implements Set<E>, Cloneable, Serializable {
    private transient HashMap<E,Object> map;

    // Shared dummy value used as the map value for every key.
    private static final Object PRESENT = new Object();

    // ...

    public boolean add(E e) {
        return map.put(e, PRESENT) == null;
    }

    // ...
}

LinkedHashSet:

public class LinkedHashSet<E> extends HashSet<E> implements Set<E>, Cloneable, Serializable {
    // LinkedHashSet declares no map field and no add method of its own;
    // it reuses the map inherited from HashSet, which the dummy
    // constructor (shown earlier) initializes to a LinkedHashMap.
    public LinkedHashSet(int initialCapacity, float loadFactor) {
        super(initialCapacity, loadFactor, true);
    }

    public LinkedHashSet() {
        super(16, .75f, true);
    }

    // ...
}

TreeSet:

public class TreeSet<E> extends AbstractSet<E> implements NavigableSet<E>, Cloneable, Serializable {
    // The backing map, in practice a TreeMap.
    private transient NavigableMap<E,Object> m;

    // Shared dummy value used as the map value for every key.
    private static final Object PRESENT = new Object();

    // ...

    public boolean add(E e) {
        return m.put(e, PRESENT) == null;
    }

    // ...
}

The snippets above show the underlying data structures of HashSet, LinkedHashSet, and TreeSet: HashSet is backed by a HashMap, LinkedHashSet by a LinkedHashMap (created through the dummy constructor it inherits from HashSet), and TreeSet by a NavigableMap (in practice a TreeMap).

During interviews or coding exams, the key-ordering features of TreeMap (and, by extension, TreeSet) can make many problems easier to solve.
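
For example, TreeMap's navigation methods give O(log n) access to the nearest keys, which is handy for range-style problems; a small sketch (class name and data illustrative):

import java.util.TreeMap;

public class TreeMapDemo {
    public static void main(String[] args) {
        TreeMap<Integer, String> map = new TreeMap<>();
        map.put(10, "ten");
        map.put(20, "twenty");
        map.put(30, "thirty");

        System.out.println(map.firstKey());     // 10: smallest key
        System.out.println(map.floorKey(25));   // 20: greatest key <= 25
        System.out.println(map.ceilingKey(25)); // 30: smallest key >= 25
        System.out.println(map.headMap(30));    // {10=ten, 20=twenty}: keys < 30
    }
}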
