Meituan Ermian: How to check whether a username exists among 1 billion users?

Coder Technology Column 2023-11-01 08:50 Published in Zhejiang

The following article comes from JAVA Xuyang, author JAVA Xuyang

JAVA Xuyang.

Rising Sun, I hope I can be like the rising sun, full of hope for everything~~

Java Backend Interviewer

Selected Java backend full-stack interview questions, personal website: java-family.cn, database, message queue, architecture… everything you want to know is here!

17 pieces of original content

No public

Hello everyone, I am Bucai Chen~

I don’t know if you have noticed that when registering with some apps, you are prompted that your username has been occupied and you need to change it. How is this achieved? You may be thinking, isn’t this very simple? Just go to the database and check if there are any. So what if there are a lot of users, reaching hundreds of millions?

Database scheme

The first solution is to check the database. Everyone can think of it. The code is as follows:

public class UsernameUniquenessChecker {
    private static final String DB_URL = "jdbc:mysql://localhost:3306/your_database";
    private static final String DB_USER = "your_username";
    private static final String DB_PASSWORD = "your_password";

    public static boolean isUsernameUnique(String username) {
        try (Connection conn = DriverManager.getConnection(DB_URL, DB_USER, DB_PASSWORD)) {
            String sql = "SELECT COUNT(*) FROM users WHERE username = ?";
            try (PreparedStatement stmt = conn.prepareStatement(sql)) {
                stmt.setString(1, username);
                try (ResultSet rs = stmt.executeQuery()) {
                    if (rs.next()) {
                        int count = rs.getInt(1);
                        return count == 0; // If count is 0, username is unique
                    }
                }
            }
        } catch (SQLException e) {
            e.printStackTrace();
        }
        return false; // In case of an error, consider the username as non-unique
    }

    public static void main(String[] args) {
        String desiredUsername = "new_user";
        boolean isUnique = isUsernameUnique(desiredUsername);
        if (isUnique) {
            System.out.println("Username '" + desiredUsername + "' is unique. Proceed with registration.");
        } else {
            System.out.println("Username '" + desiredUsername + "' is already in use. Choose a different one.");
        }
    }
}

This method will bring the following problems:

Performance issues, high latency. If the amount of data is large, the query speed is slow. Additionally, database queries involve network communication between the application server and the database server. The time it takes to establish a connection, send a query, and receive a response also contributes to latency.
Database load is too high. Frequently execute SELECT queries to check username uniqueness, each query requires database resources, including CPU and I/O.
Poor scalability. The database has limits on concurrent connections and resources. If registration rates continue to grow, the database server may have difficulty handling the increased number of incoming requests. Vertically scaling a database (adding more resources to a single server) can be costly and potentially limiting.

Caching scheme

In order to solve the performance problem of database calling user name uniqueness check, efficient Redis cache is introduced.

public class UsernameCache {

    private static final String REDIS_HOST = "localhost";
    private static final int REDIS_PORT = 6379;
    private static final int CACHE_EXPIRATION_SECONDS = 3600;

    private static JedisPool jedisPool;

    // Initialize the Redis connection pool
    static {
        JedisPoolConfig poolConfig = new JedisPoolConfig();
        jedisPool = new JedisPool(poolConfig, REDIS_HOST, REDIS_PORT);
    }

    // Method to check if a username is unique using the Redis cache
    public static boolean isUsernameUnique(String username) {
        try (Jedis jedis = jedisPool.getResource()) {
            // Check if the username exists in the Redis cache
            if (jedis.sismember("usernames", username)) {
                return false; // Username is not unique
            }
        } catch (Exception e) {
            e.printStackTrace();
            // Handle exceptions or fallback to database query if Redis is unavailable
        }
        return true; // Username is unique (not found in cache)
    }

    // Method to add a username to the Redis cache
    public static void addToCache(String username) {
        try (Jedis jedis = jedisPool.getResource()) {
            jedis.sadd("usernames", username); // Add the username to the cache set
            jedis.expire("usernames", CACHE_EXPIRATION_SECONDS); // Set expiration time for the cache
        } catch (Exception e) {
            e.printStackTrace();
            // Handle exceptions if Redis cache update fails
        }
    }

    // Cleanup and close the Redis connection pool
    public static void close() {
        jedisPool.close();
    }
}

The biggest problem with this solution is that it takes up too much memory. If each user name requires about 20 bytes of memory. If you want to store 1 billion usernames, you will need 20G of memory.

Total memory = Memory usage per record * Number of records = 20 bytes/record * 1,000,000,000 records = 20,000,000,000 bytes = 20,000,000 KB = 20,000 MB = 20 GB

Bloom filter scheme

Direct caching determines that the memory usage is too large. Is there any better way? Bloom filters are a good choice.

So what exactly is a Bloom filter?

Bloom filter (Bloom Filter) is a data structure used to quickly check whether an element exists in a large data set. It is usually used to quickly filter out impossible to exist in some cases. elements to reduce subsequent more expensive query operations. The main advantage of a Bloom filter is that it provides fast lookup and insertion operations and is very efficient in terms of memory footprint.

The specific implementation principle and data structure are shown in the figure below:

The core idea of the Bloom filter is to use a bit array (bit array) and a set of hash functions.

Bit Array: Bloom filters use an array containing a large number of bits, usually initialized to all zeros. Each bit can store two values, usually 0 or 1. These bits are used to indicate the presence or possible presence of an element.
Hash Functions: Bloom filters use multiple hash functions, each of which maps input elements to one or more positions in a bit array. These hash functions must be independent and uniformly distributed.

So how exactly is it done?

Add elements: As shown in the figure above, when the strings “xuyang“, “alvin” are inserted into the Bloom filter, the elements are mapped in place through multiple hash functions multiple locations in the array and then sets the bits at those locations to 1.
Query elements: When you want to check whether an element exists in a Bloom filter, map the elements to the corresponding positions of the bit array through the same hash function, and then check whether the bits in these positions are all 1. If any bit is 0, then you can be sure that the element does not exist in the data set. But if all bits are 1, the element may exist in the data set, but it may also be a false positive.

Redis itself supports the data structure of Bloom filter. Let’s simply implement it with code to understand:

import redis.clients.jedis.Jedis;
import redis.clients.jedis.JedisPool;
import redis.clients.jedis.JedisPoolConfig;

public class BloomFilterExample {
    public static void main(String[] args) {
        JedisPoolConfig poolConfig = new JedisPoolConfig();
        JedisPool jedisPool = new JedisPool(poolConfig, "localhost", 6379);

        try (Jedis jedis = jedisPool.getResource()) {
            // Create a Bloom filter named "usernameFilter", you need to specify the expected number of elements and the expected error rate
            jedis.bfCreate("usernameFilter", 10000000, 0.01);
            
            //Add username to bloom filter
            jedis.bfAdd("usernameFilter", "alvin");
            
            // Check if the username already exists
            boolean exists = jedis.bfExists("usernameFilter", "alvin");
            System.out.println("Username exists: " + exists);
        }
    }
}

In the above example, we first create a bloom filter called “usernameFilter” and then add the username to the bloom filter using bfAdd. Finally, use bfExists to check if the username already exists.

advantage:

Save memory space. Compared with using data structures such as hash tables, Bloom filters usually require less memory space because it does not store the actual elements, but only the hash values of the elements. If 1 billion billion records are stored with a probability of error of 0.001, only 1.67 GB of memory is required, compared with the original 20G , greatly reduced.
Efficient search, Bloom filter can quickly find whether an element exists in the set in constant time (O(1)) without traversing the entire set.

shortcoming:

Existence of false positive rate: Bloom filter has a certain false positive rate when judging whether an element exists. This means that in some cases it may incorrectly report that an element exists, but not that it does not.
Elements cannot be deleted: Bloom filters generally do not support deleting elements from a collection, because deleting an element will affect the hash values of other elements, increasing the false positive rate.

Summary

The Redis Bloom filter solution provides an efficient memory-based solution for uniqueness verification under large data volumes. It requires a balance between memory consumption and error rate. Of course, Bloom filters have more application scenarios, such as preventing cache penetration, preventing malicious access, etc.

Final words (don’t prostitute for nothing, please pay attention)

Every article written by Chen is carefully written. If this article is helpful or inspiring to you, please like, read, repost, and collect it. Your support is my biggest motivation to persevere!

In addition, Chen’s Knowledge Planet has been opened. It only costs 199 yuan to join. The value of planet feedback is huge. Currently, it has updated the Code Yuan chronic disease cloud management practical project, the Spring family bucket practical series, the billion-level data sub-database and table practical, and the DDD micro Service practice column, I want to join a big factory, Spring, Mybatis and other framework source codes, 22 lectures on architecture practice, RocketMQ, etc…

More introduction