HashSet deduplication principle

1. What is Hashset

  1. Collections in Java are divided into Collection collections (single-column collections) and Map collections (double-column collections)
  2. The Hashset collection is an implementation class of the set interface. The set interface also inherits from the top-level parent class Collection interface, so HashSet can have methods common to Collection.
  3. The characteristics of the set interface: unordered, no subscripts, and no repeated elements.
  4. HashSet definition: This class implements the Set interface, and the underlying layer is supported by a hash table (actually a HashMap instance). The iteration order of set is not guaranteed. In particular, this class allows null elements.

2. The principle of non-duplication of elements in HashSet

1. HashSet bottom layer

It can be seen that the underlying layer inherits HashMap and is a hash table.

2. Duplicate data of HashSet

Create a Student class and put it as an object in the HashSet collection

public class Student {
private String name;
private int age;
\t
\t
public Student() {
super();
}
\t
public Student(String name, int age) {
super();
this.name = name;
this.age = age;
}

public String getName() {
return name;
}
public void setName(String name) {
this.name = name;
}
public int getAge() {
return age;
}
public void setAge(int age) {
this.age = age;
}

@Override
public String toString() {
return "Student [name=" + name + ", age=" + age + "]";
}

Create a test class test and add repeated data

public class Test {
public static void main(String[] args) {
//Create a HashSet collection to store Student objects
HashSet<Student> hset=new HashSet<Student>();
//Create three Student specific implementations
Student stu1=new Student("张三", 21);
Student stu2=new Student("张三", 21);
Student stu3=new Student("李思", 28);
//Put the above three objects into the HashSet collection
hset.add(stu1);
boolean b1=hset.add(stu2);//Insert duplicate objects (different addresses, but the same content), insertion is allowed
hset.add(stu3);
boolean b2=hset.add(stu1);//Insert duplicate objects (same address), insertion is not allowed
//Use enhanced for loop to output elements in HashSet
for (Student student : hset) {
System.out.println(student);
}
//Observe the situation when two duplicate data are added to the HashSet collection
System.out.println("Insert duplicate object (different address, same content)" + b1);
System.out.println("Insert duplicate object (same address)" + b2);
}

  • As can be seen from the above results, we added two types of duplicate data, one successful and one failed. We think that the two added Zhang San objects have the same content, and they should also be duplicate data and should not be added to the HashSet collection. According to the characteristics of the Ste interface inherited by HashSet, it should be unordered and not repeated. So why was the repeated element Zhang San added.
  • This is related to the compiler’s method of identifying duplicate data. How to let the compiler identify objects with the same content. This requires us to override the equals() method in the Student class to compare the similarities and differences of object attribute values. If equals() returns true, it is duplicate data, otherwise it is not.

In the Student class, you can directly generate the equals() method or define it yourself, and then run the test class again

//Customized overriding of the equal() method
@Override
public boolean equals(Object obj) {
if(this==obj) {
return true;
}
if(obj==null || this.getClass() != obj.getClass()) {
return false;
}
//downward transformation
Student stu=(Student) obj;
return Objects.equals(name, stu.name) & amp; & amp;
Objects.equals(age, stu.age);
}

We will be surprised to find that the results are different from what we expected, and the equals() method we overridden has no effect. So why?

?In fact, it is to save memory space and save resources. Although the user has written the override eauals() method at this time, the program compiler thinks it is not necessary to use it. That’s right – “It can be used, but it’s not necessary.”

The reason is simple. If equals() is called every time, how many times will it be called to insert 5 objects? (Suppose the total number of comparisons is n)

Inserting object 1 requires 0 comparisons (n=0);

Inserting object 2 requires 1 comparison (n=1);

Inserting object 3 requires 2 comparisons (n=3);

Inserting object 4 requires 3 comparisons (n=6);

Inserting object 5 requires 4 comparisons (n=10);

Just inserting 5 objects requires using the equals() method 10 times, which greatly reduces the efficiency of the program. Obviously, the compiler did not call this method. From this we can also see that the equals() method has usage conditions, and this condition is the HashCode() encoding value. Only when the values returned by the HashCode() method are the same (that is, the addresses are the same), the equals() method is called to compare the contents.

This also proves why we use equals(); alone without achieving the expected deduplication effect.

Therefore, in order to ensure that objects with different addresses and the same content can also be checked for duplication, we need to intervene manually. The method of intervention is to make objects with different addresses and the same content (same attributes) have consistent HashCode, so that we can Use the equals() method to compare the similarities and differences in their contents. So we can customize the HashCode() method to override the HashCode() method of the parent class. How to define it? Of course, arbitrary definition is obviously unreasonable. It needs to be combined with the HashCode of each attribute of the object, which is the HashCode of the entire object;

Override the HashCode() method and run the test class again

//Customized override HashCode() method
public int hashCode() {
return Objects.hash(name,age);
}

It can be found that at this time, there is only one Zhang San object in the HashSet collection.

3. How does HashSet identify duplicate elements

1. When our HashSet calls the add() method to store an object, it first calls the object’s HashCode() to get a hash value, and then checks whether there are objects with the same hash value in the set.
  1. If there are no objects with the same hash value, add them directly to the collection without going through the equals() method for comparison.
  2. If there are objects with the same hash value, further use the equals() method to compare and remove duplicates.
2. When storing our customized objects in the set collection:
  1. The two methods equals() and HashSet() must be rewritten in the class (must be rewritten at the same time, you cannot just rewrite one of them separately, it will have no effect)
  2. HashCode(): The return values of objects with the same attributes must be the same, and the return values of objects with different attributes must be different.
  3. equals(): Returns true if the attributes are the same, returns false if the attributes are different; (this method is the final test to determine whether they are the same object)

3. Summary

1. Specific flow chart

2. Summary
  • HashSet implements non-repeatable elements based on HashCode
  • When the HashCode of the stored elements is the same, equals() will be called to confirm. If the result is true, the deposit will be refused.

I hope all readers will like and support~

The knowledge points of the article match the official knowledge files, and you can further learn related knowledge. Java Skill TreeSetSet Interface 138681 people are learning the system