4 billion QQ numbers, limited to 1G memory, how to deduplicate?

4 billion QQ numbers, limited to 1G memory, how to deduplicate?

4 billion unsigned ints, if stored directly in memory, need:

4*4000000000 /1024/1024/1024 = 14.9G, considering that there are some repetitions, the space of 1G is basically not enough.

To achieve this function, you can use a bitmap.

If you use a bitmap, a number only needs to occupy 1 bit, so 4 billion numbers are:

4000000000 * 1 /8 /1024/1024 = 476M

Compared with the previous 14.9G, it greatly saves a lot of space.

For example, if I want to put my QQ number “907607222” in the Bitmap, I need to find the position of 907607222, and then set it to 1.

In this way, after putting 4 billion numbers into the Bitmap, all positions of 1 indicate existence, and those not of 1 indicate non-existence. The same QQ number only needs to be set to 1 once. Then, in the end, all are The number of 1 can be traversed out.

What is BitMap? What is the use?

Bitmap (BitMap), the basic idea is to use a bit to mark an element. Bit is the smallest unit in a computer, that is, what we often call 0 and 1 in a computer. This is represented by a bit.

The so-called bitmap is actually a bit array, that is, each position is a bit, and the value in it can be 0 or 1

Like the bitmap above, it can be used to represent 1, 4, 6:

If we don’t use bitmap, if we want to record the three integer types of 1, 4, and 6, we need to use three unsigned ints. It is known that each unsigned int occupies 4 bytes, then it is 34 = 12 bytes, one byte has 8 bits, then 128 = 96 bits.

So, the biggest advantage of bitmap is to save space.

Bitmaps have many uses, especially in scenarios such as deduplication and sorting. The famous Bloom filter is implemented based on bitmaps.

But the bitmap also has certain limitations, that is, it can only represent 0 and 1, and cannot store other numbers. So he is only suitable for this kind of scene that can express true or false.

What is a Bloom filter and what is its implementation principle?

A Bloom filter is a data structure used to quickly retrieve whether an element is likely to exist in a set (bit array).

Its basic principle is to use multiple hash functions to map an element into multiple bits, and then set these bits to 1. When querying for an element, if these bits are all set to 1, the element is considered likely to exist in the collection, otherwise it definitely does not

Therefore, the Bloom filter can accurately determine whether an element must not exist, but because of the existence of hash collisions, it cannot determine that an element must exist. It can only be judged that it may exist.

Therefore, the Bloom filter may have misjudgment, that is, when a non-existing Hero element, after hash1, hash2, and hash3, just conflicts with the hash results of other values. Then it will be misjudged to exist, but in fact he does not exist.

To reduce the probability of this misjudgment, the main method is to reduce the probability of hash collision and introduce more hash algorithms.

The following is the working process of Bloom filter:

1. Initialize Bloom filter

When initializing the Bloom filter, you need to specify the size of the set and the false positive rate. The Bloom filter contains a bit array and multiple hash functions, and each hash function generates an index value.

2. Add elements to Bloom filter

To add an element to a Bloom filter, first pass the element through multiple hash functions to generate multiple index values, and then set the bits corresponding to these index values to 1. If these index values have already been set to 1, there is no need to set them again.

3. Whether the query element exists in the Bloom filter

To query whether an element exists in the Bloom filter, it is necessary to generate multiple index values for the element through multiple hash functions, and determine whether the bits corresponding to these index values are all set to 1. If these bits are all set to 1, the element is considered likely to be present in the set, otherwise it is definitely not present.

The main advantage of the Bloom filter is that it can quickly determine whether an element belongs to a certain set, and can achieve high efficiency in space and time. However, it also has some disadvantages, such as:

Bloom filter has a certain misjudgment rate when judging whether an element exists. ,
Bloom filter removal is difficult because removing an element requires setting its corresponding bits to 0, but these bits may be shared by other elements.

Application scenario

The Bloom filter is widely used because of its high efficiency. The typical scenarios are as follows:

1. Web crawler: Crawler programs can use Bloom filters to filter out crawled webpages, avoiding repeated crawling and waste of resources.

2. Cache system: The cache system can use Bloom filters to judge whether a query may exist in the cache, thereby reducing the number of query caches and improving query efficiency. Bloom filters are also often used to solve the problem of cache penetration.

3. Distributed system: In a distributed system, a Bloom filter can be used to determine whether an element exists in the distributed cache, avoiding queries on all nodes and reducing network load.

4. Spam filtering: Bloom filter can be used to judge whether an email address is in the spam list, so as to filter out spam.

5. Blacklist filtering: Bloom filters can be used to determine whether an IP address or mobile phone number is in the blacklist, thereby preventing malicious requests.

How to use

Bloom filters can be implemented using third-party libraries in Java. Common ones include Google Guava library, Apache Commons library, and Redis.

Such as Guava:

import com.google.common.hash.BloomFilter;
import com.google.common.hash.Funnels;
public class BloomFilterExample {<!-- -->
    public static void main(String[] args) {<!-- -->
        // Create a Bloom filter, expect to insert 100 elements, and the false positive rate is 0.01
        BloomFilter<String> bloomFilter = BloomFilter.create(Funnels.stringFunnel(), 100, 0.01);
        // insert element
        bloomFilter. put("Lynn");
        bloomFilter. put("666");
        bloomFilter.put("eight-legged essay");
        // Check if element exists
        System.out.println(bloomFilter.mightContain("Lynn")); // true
        System.out.println(bloomFilter.mightContain("Zhang San")); // false
    }
}

Apache Commons:

import org.apache.commons.lang3.StringUtils;
import org.apache.commons.collections4.BloomFilter;
import org.apache.commons.collections4.functors.HashFunctionIdentity;
public class BloomFilterExample {<!-- -->
    public static void main(String[] args) {<!-- -->
        // Create a Bloom filter, expect to insert 100 elements, and the false positive rate is 0.01
        BloomFilter<String> bloomFilter = new BloomFilter<>(HashFunctionIdentity. hashFunction(StringUtils::hashCode), 100, 0.01);
        // insert element
        bloomFilter. put("Lynn");
        bloomFilter. put("666");
        bloomFilter.put("eight-legged essay");
        // Check if element exists
        System.out.println(bloomFilter.mightContain("Lynn")); // true
        System.out.println(bloomFilter.mightContain("Zhang San")); // false
    }
}

Redis can be used through the Bloom module, and Redisson can be used to:

Config config = new Config();
config.useSingleServer().setAddress("redis://127.0.0.1:6379");
RedissonClient redisson = Redisson.create(config);
RBloomFilter<String> bloomFilter = redisson.getBloomFilter("myfilter");
bloomFilter. tryInit(100, 0.01);
bloomFilter.add("Lynn");
bloomFilter. add("666");
bloomFilter.add("eight-legged essay");
System.out.println(bloomFilter.contains("Lynn"));
System.out.println(bloomFilter.contains("Zhang San"));
redisson. shutdown();

First create a RedissonClient object, and then obtain an RBloomFilter object through this object, use the tryInit method to initialize the Bloom filter, specify the maximum number of elements that can be added as 100, and the false positive rate is 0.01.

Then, use the add method to add the elements “Hollis”, “666” and “Style” to the Bloom filter, and use the contains method to check whether the element exists in the Bloom filter.

Or Jedis can also:

Jedis jedis = new Jedis("localhost");
jedis.bfCreate("myfilter", 100, 0.01);
jedis.bfAdd("myfilter", "Lynn");
jedis.bfAdd("myfilter", "666");
jedis.bfAdd("myfilter", "stereotyped essay");
System.out.println(jedis.bfExists("myfilter", "Lynn"));
System.out.println(jedis.bfExists("myfilter", "Zhang San"));
jedis. close();