.NET high-performance development-bitmap index

First, let’s assume such a business scenario. Everyone should be familiar with air tickets. When purchasing air tickets, you first select your desired departure and arrival city and time, then select the cabin class (business class, economy class), and click Query. A flight list will appear. If you click on a flight at will, you will find that there are many sets of prices. Because air tickets are different from train tickets, their rights and rules are more complicated. For example, there are discount tickets for age groups, and there are discount tickets for different age groups. Exclusive tickets for students have different free checked baggage allowances, meals, different cancellation and change rules, and even Moutai cashback when buying air tickets.

19b14e5ddeec9b881f0bce37454be4a2.png

There are dozens of airlines, hundreds of airports, thousands of routes, and tens of thousands of flights in China. Each flight has dozens or hundreds of product types. This is one day’s data. Tickets can be purchased one year in advance. Total There should be billions, and they change in real time. No database can solve the problem of high concurrency and real-time search at this level.

Solutions in the industry load data into memory for calculation, but memory computing also poses challenges. How to process billions of data and present search results to customers in just tens of milliseconds?

There are many places to talk about. Today we will mainly talk about a small optimization point of large-scale real-time search engine technology; through this simple scenario, we will see how to use .NET to build a memory bitmap index to optimize the search engine calculation speed.

Disclaimer: To simplify knowledge and facilitate understanding, the scenarios and solutions in this article are fictitious. Any similarity is purely coincidental.

Due to space issues, this series of articles is divided into four articles:

  1. Introducing what a bitmap index is and how to build and use it in .NET

  2. Performance of bitmap index, .NET BCL library source code analysis, how to accelerate the calculation of bitmap index through SIMD

  3. Has CPU SIMD come to an end? What’s the next step?

  4. Build an efficient Bitmap memory index library and achieve observability (to be determined, there is not so much time to organize now)

What is a bitmap index

To answer such a question, let’s first assume a case where we abstract the flight rules into the following record type, and then some flight rule data like the following is loaded into the memory:

/// <summary>
/// cabin class
/// </summary>
public enum CabinClass {
    // first class
    F,
    // Economy class
    Y
}

/// <summary>
/// Flight rules
/// </summary>
/// <param name="Airline">Airline</param>
/// <param name="Class">Class</param>
/// <param name="Origin">Departure airport</param>
/// <param name="Destination">Arrival at the airport</param>
/// <param name="DepartureTime">Departure Time</param>
public record FlightRule(string Airline, CabinClass Class, string Origin, string Destination, string FlightNo, DateTime DepartureTime);

var flightRules = new FlightRule[]
{
    new ("A6", CabinClass.F, "PEK", "SHA", "A61234", DateTime.Parse("2023-10-11 08:00:00")) ,
    new ("CA", CabinClass.Y, "SHA", "PEK", "CA1234", DateTime.Parse("2023-10-13 08:00:00")) ,
    new ("CA", CabinClass.Y, "SHA", "PEK", "CA1234", DateTime.Parse("2023-10-14 08:00:00")) ,
    new ("CA", CabinClass.Y, "SHA", "PEK", "CA1234", DateTime.Parse("2023-10-15 08:00:00")) ,
    new ("CA", CabinClass.F, "SHA", "PEK", "CA1234", DateTime.Parse("2023-10-15 08:00:00")) ,
    new ("MU", CabinClass.F, "PEK", "CSX", "MU1234", DateTime.Parse("2023-10-16 08:00:00")) ,
    new ("9C", CabinClass.Y, "PEK", "CSX", "9C1234", DateTime.Parse("2023-10-17 08:00:00")) ,
};

Then there is a search form record type. If you want to write a search method for this record to filter the search results, I believe everyone will be able to implement a code soon. , for example, the following uses a simple for loop to achieve all this.

// Search method condition is the search condition
FlightRule[] SearchRule(FlightRuleSearchCondition condition)
{
    var matchRules = new List<FlightRule>();
    foreach (var rule in flightRules)
    {
        if (rule.Airline == condition.Airline & amp; & amp;
            rule.Class == condition.Class & amp; & amp;
            rule.Origin == condition.Origin & amp; & amp;
            rule.Destination == condition.Destination & amp; & amp;
            rule.DepartureTime.Date == condition.DepartureTime.Date)
        {
            matchRules.Add(rule);
        }
    }
    
    return matchRules.ToArray();
}

This solution is perfect when the amount of data is small, but its time complexity is O(N). You can recall the conclusion of the previous article on how to quickly traverse the List collection. We know that even if it is an empty loop, faced with dozens of When the amount of data is tens of thousands or millions, it will take several seconds.

4e491f388581976402bcd18afa1af969.png

image-20231015155730785

When the database engine faces this problem, it solves it through various index algorithms, such as B + tree, hash, inversion, skip list, etc., and of course the bit index we are going to mention today. Figure index.

Let’s first look at the definition of bitmap index: Bitmap index is a database index method that creates a bit vector for each possible column value. Each bit represents a row, and the bit is 1 if the row’s column value is equal to the value of the bit vector, and 0 otherwise. Particularly useful for working with columns with a small number of possible values. Sounds abstract, right? It doesn’t matter, we will know what it is through the following examples.

Build bitmap index

It’s still the flight rule data mentioned above. For example, the first Bit array is the row where the airline is CA, then the 0th bit represents the 0th element in the flight rule array, and its airline is CA, so this Bit It is True and assigned a value of 1; similarly, the first bit represents the first element in the flight rule data. Its airline is not CA, so it is assigned a value of 0.

new ("A6", CabinClass.F, "PEK", "SHA", "A61234", DateTime.Parse("2023-10-11 08:00:00\ ")),
new ("CA", CabinClass.Y, "SHA", "PEK", "CA1234", DateTime.Parse("2023-10-13 08:00:00")) ,
new ("CA", CabinClass.Y, "SHA", "PEK", "CA1234", DateTime.Parse("2023-10-14 08:00:00")) ,
new ("CA", CabinClass.Y, "SHA", "PEK", "CA1234", DateTime.Parse("2023-10-15 08:00:00")) ,
new ("CA", CabinClass.F, "SHA", "PEK", "CA1234", DateTime.Parse("2023-10-15 08:00:00")) ,
new ("MU", CabinClass.F, "PEK", "CSX", "MU1234", DateTime.Parse("2023-10-16 08:00:00")) ,
new ("9C", CabinClass.Y, "PEK", "CSX", "9C1234", DateTime.Parse("2023-10-17 08:00:00")) ,
Characteristics Rule 0 Rule 1 Rule 2 Rule 3 Rule 4 Rule 5 Rule 6
Airline CA 0 1 1 1 1 0 0

According to this rule, we can construct several Bit arrays with different dimensions according to its different dimensions. These arrays are combined together to form a Bitmap.

Rule number Airline CA Airline A6 Airline MU Airline 9C Economy Class Departure Airport PEK Departure Airport SHA Departure Airport CSX Arrival airport PEK Arrival airport SHA Arrival airport CSX
0 0 1 0 0 0 1 0 0 0 1 0
1 1 0 0 0 1 0 1 0 1 0 0
2 1 0 0 0 1 0 1 0 1 0 0
3 1 0 0 0 1 0 1 0 1 0 0
4 1 0 0 0 0 0 1 0 1 0 0
5 0 0 1 0 0 1 0 0 0 0 1
6 0 0 0 1 1 1 0 0 0 0 1

The word length of modern CPUs is 64 bits. It can process 64 bit data in one cycle. According to a loose algorithm, it is 64 times faster than direct for search (of course this is not the limit, The reason will be explained in a later article).

Bitmap index logical operation

The bitmap index has been built, so how to perform the search operation?

AND operation

For example, if we need to query the flights from CA with airline CA and departure airport SHA to PEK, we can use ANDoperator, perform AND operations on them respectively.

You can get the following Bit array, and the bit subscript corresponding to the bit 1 in this Bit array is the rule that meets the conditions. You can see that subscripts 1~4 are all rules that meet the conditions. .

Rule number Airline CA Departure airport SHA Arrival airport PEK AND results
0 0 0 0 0
1 1 1 1 1
2 1 1 1 1
3 1 1 1 1
4 1 1 1 1
5 0 0 0 0
6 0 0 0 0
OR operation

What should I do if I want to search for flights departing on October 13th and October 15th? In fact, it is very simple, just use the OR operator to first get the rules for October 13th and October 15th (please note , in actual projects, for high-cardinality data such as time, indexes will not be created for each day, but will be created using BSI, REBSI, etc.; or a more efficient index algorithm such as B + Tree will be used):

Rule number Departure date October 13th Departure date October 15th OR result
0 0 0 0
1 1 0 1
2 0 0 0
3 0 1 1
4 0 1 1
5 0 0 0
6 0 0 0

Then AND the result array shown above. You can see that only rules 1, 3 and 4 meet the requirements.

Rule number Last operation result OR result This time result
0 0 0 0
1 1 1 1
2 1 0 0
3 1 1 1
4 1 1 1
5 0 0 0
6 0 0 0
Not operation

So what should users do if they don’t want to take economy class? We have not constructed a Bit array for non-economy class here; the solution is actually very simple, we perform the NOT operation for the economy class:

Rule number Economy class NOT result
0 0 1
1 1 0
2 1 0
3 1 0
4 0 1
5 0 1
6 1 0

Then AND the results above can be used to get a list of flights that meet the above conditions but are not economy class. It can be found that only rule 4 is left to meet the needs:

Rule number Last operation result NOT result This time result
0 0 1 0
1 1 0 0
2 0 0 0
3 1 0 0
4 1 1 1
5 0 1 0
6 0 0 0

Code implementation

Please note that the code in this article is generated by AI and is for demonstration and reference only. It cannot be used in actual production environments. Please use other more mature implementations (such as BitArray).

So how to implement a Bitmap index? In fact, it is very simple. .NET already comes with the BitArray class. Bitmap can be implemented by combining multiple BitArrays using Dictionary .

In order to describe the principle in detail here, we do not use the officially provided BitArray, but implement a simple one by ourselves, which is actually a stored array and simple bit operations.

public class MyBitArray
{
    private long[] _data;

    // Each long type has 64 bits
    private const int BitsPerLong = 64;

    public int Length { get; }

    public MyBitArray(int length)
    {
        Length = length;
        // Calculate how many longs are needed to store all bits
        _data = new long[(length + BitsPerLong - 1) / BitsPerLong];
    }

    public bool this[int index]
    {
        // Get the value of the specified bit
        get => (_data[index / BitsPerLong] & amp; (1L << (index % BitsPerLong))) != 0;
        set
        {
            // Set the value of the specified bit
            if(value)
                _data[index / BitsPerLong] |= (1L << (index % BitsPerLong));
            else
                _data[index / BitsPerLong] & amp;= ~(1L << (index % BitsPerLong));
        }
    }

    public void And(MyBitArray other, MyBitArray result)
    {
        // Perform AND operation on two MyBitArrays
        for (int i = 0; i < _data.Length; i + + )
            result._data[i] = _data[i] & amp; other._data[i];
    }

    public void Or(MyBitArray other, MyBitArray result)
    {
        // Perform OR operation on two MyBitArrays
        for (int i = 0; i < _data.Length; i + + )
            result._data[i] = _data[i] | other._data[i];
    }

    public void Xor(MyBitArray other, MyBitArray result)
    {
        // Perform XOR operation on two MyBitArrays
        for (int i = 0; i < _data.Length; i + + )
            result._data[i] = _data[i] ^ other._data[i];
    }

    public void Not(MyBitArray result)
    {
        // Perform NOT operation on MyBitArray
        for (int i = 0; i < _data.Length; i + + )
            result._data[i] = ~_data[i];
    }
}

Then we can use Dictionary to implement a multi-dimensional BitMap:

//Define a class named MyBitmap
public class MyBitmap
{
    //Define a dictionary to store the mapping between strings and MyBitArray
    private readonly Dictionary<string, MyBitArray> _bitmaps;
    //Define an integer to store the length of the bitmap
    private readonly int _length;

    //Constructor, receives an integer as a parameter, and initializes the dictionary and length
    public MyBitmap(int length)
    {
        _bitmaps = new Dictionary<string, MyBitArray>();
        _length = length;
    }

    //Define an indexer to get or set MyBitArray through the string key
    public MyBitArray this[string key]
    {
        get
        {
            //If the key exists in the dictionary, return the corresponding MyBitArray
            //If it does not exist, create a new MyBitArray, add it to the dictionary, and return
            if (_bitmaps.TryGetValue(key, out MyBitArray? value)) return value;
            value = new MyBitArray(_length);
            _bitmaps[key] = value;
            return value;
        }
        set
        {
            //Set the MyBitArray corresponding to the key in the dictionary
            _bitmaps[key] = value;
        }
    }

    //Define an And method, receiving a string key, a MyBitArray and a result MyBitArray as parameters
    //Perform And operation on the MyBitArray corresponding to the key and the incoming MyBitArray, and store the result in the result MyBitArray
    public void And(string key, MyBitArray bitArray, MyBitArray result)
    {
        this[key].And(bitArray, result);
    }

    //Define an Or method that receives a string key, a MyBitArray and a result MyBitArray as parameters
    //Or operate the MyBitArray corresponding to the key and the incoming MyBitArray, and store the result in the result MyBitArray
    public void Or(string key, MyBitArray bitArray, MyBitArray result)
    {
        this[key].Or(bitArray, result);
    }

    //Define an Xor method that receives a string key, a MyBitArray and a result MyBitArray as parameters
    //Perform Xor operation on the MyBitArray corresponding to the key and the incoming MyBitArray, and store the result in the result MyBitArray
    public void Xor(string key, MyBitArray bitArray, MyBitArray result)
    {
        this[key].Xor(bitArray, result);
    }

    //Define a Not method that receives a string key and a result MyBitArray as parameters
    //Perform Not operation on the MyBitArray corresponding to the key, and store the result in the result MyBitArray
    public void Not(string key, MyBitArray result)
    {
        this[key].Not(result);
    }
}

Then write a Build method to create FlightRule[] into MyBitmap. This process can be done automatically using code generation without manual work. Writing, let’s demonstrate it here:

MyBitmap Build(FlightRule[] rules)
{
    var bitmap = new MyBitmap(rules.Length);
    for (int i = 0; i < rules.Length; i + + )
    {
        // Construct the bitmap index dimension
        // You don’t need to write like this in actual projects. You can use code generation technology to automatically build it. This is just an example.
        bitmap["Airline-A6"][i] = rules[i].Airline == "A6";
        bitmap["Airline-CA"][i] = rules[i].Airline == "CA";
        bitmap["Airline-MU"][i] = rules[i].Airline == "MU";
        bitmap["Airline-9C"][i] = rules[i].Airline == "9C";
        bitmap["CabinClass-F"][i] = rules[i].Class == CabinClass.F;
        bitmap["CabinClass-Y"][i] = rules[i].Class == CabinClass.Y;
        bitmap["Origin-PEK"][i] = rules[i].Origin == "PEK";
        bitmap["Origin-SHA"][i] = rules[i].Origin == "SHA";
        bitmap["Destination-CSX"][i] = rules[i].Destination == "CSX";
        bitmap["Destination-PEK"][i] = rules[i].Destination == "PEK";
        //..........Other dimensions
    }

    return bitmap;
}

Call the Build method and simply perform a calculation query (the airline is CA, first class). The code and operation results are as follows:

var flightRuleBitmap = Build(flightRules);

//Search for CA first class flights
var result = new MyBitArray(flightRules.Length);
flightRuleBitmap.And("Airline-CA", flightRuleBitmap["CabinClass-F"], result);

// Output the index that is true in the result
for (int i = 0; i < result.Length; i + + )
{
    if (result[i])
        Console.WriteLine(i);
}

7ca6923f039777bbe3cf322b19fc882e.png

image-20231015170206603

In actual projects, Bitmap indexes can be established for most fields. For those that cannot be established, it does not matter. After the initial screening of the Bitmap index, you can use the for loop to traverse and finely screen the desired data.

The advantages and disadvantages of bitmap index

Of course, bitmap index has its own advantages and disadvantages. We should use it in appropriate scenarios, maximize its advantages, and try to avoid its disadvantages.

Advantages
  1. Efficient set operations: Bitmap indexes can efficiently handle complex query conditions using bit operations (such as AND, OR, NOT, etc.), which are often difficult to achieve in other types of indexes.

  2. Space efficiency: For low-cardinality data, bitmap indexes are generally more space-efficient than other types of indexes. Because each row only requires one bit, not a complete key value and pointer (you can do a simple calculation, 100 million data in one dimension only requires 12MB of space, even if there are 300 dimensions, that is only 3.5GB of space. In addition, there are many bitmap index compression algorithms (such as BBC, RLE, Roaring, etc.), and the space usage will become lower.).

  3. Range queries: Bitmap indexes can efficiently handle range queries, and only require OR operations on related bitmaps (such as the several construction methods mentioned above, BSI, REBSI, etc.).

Disadvantages
  1. Update overhead: If the data changes frequently, the cost of maintaining a bitmap index can be high. Each data change may require updating multiple bitmaps. Therefore, bitmap indexes are typically used in data warehouses and other environments that are primarily read-only, rather than in online transaction processing (OLTP) environments that require frequent updates.

  2. High-cardinality data: For high-cardinality data, that is, data with many possible values, bitmap indexes may take up a lot of space. Each possible value requires a bitmap, so if the data has very many possible values, a bitmap index may not be the best choice.

  3. Concurrency issues: Bitmap indexes can have problems handling large numbers of concurrent writes because each update requires the bitmap to be locked and modified. This may become a performance bottleneck in a highly concurrent OLTP environment (usually solved using Copy On Write).

Summary

In this sharing, we explore the principles and applications of bitmap indexing through a business scenario of air ticket search. As an efficient data indexing method, bitmap index can optimize the calculation speed of search engines, reduce memory usage and improve performance under large-scale data volumes. We introduced in detail the construction of bitmap indexes and how to perform search operations through logical operations. At the same time, we also implemented a simple bitmap index and demonstrated it through an example. Finally, we also discussed the advantages and disadvantages of bitmap indexes, giving us a more comprehensive understanding of the characteristics and applicable scenarios of bitmap indexes.

Although bitmap indexes have significant advantages when processing large-scale data, they may have problems in scenarios with frequent data updates, high cardinality data, and concurrent writes. Therefore, how to optimize bitmap indexes in these scenarios to better adapt to different business needs will be a question that we need to further explore in the future. In addition, how to combine other indexing algorithms, such as B + tree, hashing, inversion, skip list, etc., and how to use the features of modern CPUs, such as SIMD, to further improve the performance of bitmap indexes are also our future research directions. .

Next issue preview

In the next issue, we will delve into the performance issues of bitmap indexes, including analysis of the .NET BCL library source code, and how to accelerate the calculation of bitmap indexes through SIMD. I hope you can continue to pay attention to the following sharing and discuss relevant knowledge of .NET high-performance development.