Hash and BloomFilter, bitmap

Produced by Zero Sound Academy Teacher Mark QQ: 2548898954

Hash and BloomFilter, bitmap

  • background
  • need
  • balanced binary tree
  • hash table
  • hash function
    • select the hash
    • load factor
    • conflict handling
  • bloom filter
    • background
    • constitute
    • principle
    • Application Scenario
    • Application Analysis
      • variable relationship
      • determine n and p
      • Choose a hash function
  • Distributed consistency hash
    • background
    • Application Scenario
    • hash offset
    • virtual node
  • think

Overall Knowledge Context

Background

  • When using a word document, how does word determine whether a word is spelled correctly?
  • Web crawler program, how to make it not crawl the same url page?
  • How is the spam filtering algorithm designed?
  • When the public security handles a case, how to judge whether a suspect is on the fugitive list?
  • How to solve the cache penetration problem?

Requirements

Query whether a certain string exists from massive data?

Balanced binary tree

The time complexity of adding, deleting, modifying and checking is

o

(

l

o

g

2

no

)

O(log2n)

O(log2n)
The purpose of balance is to ensure that half of the data can be stably excluded in the next search after adding, deleting and modifying;

o

(

l

o

g

2

no

)

O(log2n)

Intuitive understanding of O(log2n): 1 million nodes, up to 20 comparisons; 1 billion nodes, up to 30 comparisons;
Summary: Ensure the order by comparison, and achieve the purpose of fast indexing by excluding half of elements each time;

Hash table

The data structure for calculating the position of the key in the table according to the key; it is the mapping relationship between the key and its storage address;
Note: kv is stored together in the nodes of the hash table;

struct node {<!-- -->
void *key;
void *val;
struct node *next;
};

hash function

Mapping function Hash(key)=addr; the hash function may map two or more different keys to the same address, which is
This situation is called Conflict (or hash collision);

Select hash

  • calculation speed
  • Strong random distribution (equal probability, evenly distributed in the entire address space)
  • murmurhash1, murmurhash2, murmurhash3, siphash (used in redis6.0, the hash algorithm chosen by most languages such as rust to implement hashmap), cityhash all have strong random distribution; the test address is as follows: https://github.com/aappleby/ smhasher
  • siphash mainly solves the strong random distribution of strings;

Load factor

  • The number of array storage elements/data length; used to describe the storage density of the hash table; the smaller the load factor, the smaller the conflict, and the larger the load factor, the greater the conflict;

Conflict handling

  • Linked list method:

    Refer to the linked list to handle hash conflicts; that is, link the conflicting elements with a linked list; this is also a common way to deal with conflicts; but there may be an extreme situation where there are many conflicting elements and the conflicting list is too long, this At this time, this linked list can be converted into a red-black tree; from the time complexity of the original linked list to the time complexity of a red-black tree

    o

    (

    no

    )

    O(n)

    O(n); Then what is the basis for judging that the linked list is too long? When more than 256 (experience value) nodes can be used, the linked list structure can be converted into a red-black tree structure;

  • Open addressing method:

Store all elements in the array of the hash table without using additional data structures; generally use the idea of linear exploration to solve:

  1. When inserting a new element, use the hash function to locate the element position in the hash table;
  2. Checks if there is an element at this slot index in the array. If the slot is empty, insert, otherwise 3;
  3. Add a certain step to the index of the slot detected by 2 and then check 2; adding a certain step can be divided into the following types:
    1. i+1,i+2,i+3,i+4, … ,i+n
    2. i-12 ,i + 22 ,i-32 ,1 + 42 , … this Both will lead to similar hash aggregation; that is, the approximate value has similar hash values, so its array slots are also close to each other, forming hash aggregation; the first type of similar aggregation conflicts comes first, and the second type only delays aggregation conflicts; In addition, double hashing can be used to solve the hash aggregation phenomenon above:

      The hash function Hk in the .net HashTable class is defined as follows:
      Hk(key) = [GetHash(key) + k * (1 + (((GetHash(key) >> 5) + 1) %
      (hashsize – 1)))] % hashsize
      Here (1 + (((GetHash(key) >> 5) + 1) % (hashsize – 1))) with
      hash size
      Mutually prime numbers (two numbers being mutually prime means that they have no common prime factors);
      After performing hashsize probes, each location in the hash table is visited exactly once,
      That is to say, for a given key, Hi and Hj will not be used at the same time for the same position in the hash table;

    3. Specifically https://www.cnblogs.com/organic/p/6283476.html

Bloom filter

Background

Bloom filter is a probabilistic data structure, which is characterized by efficient insertion and query, and can determine that a certain string must not exist or is possible exists;
The Bloom filter does not store specific data, so it takes up little space , and the query results have errors, but the errors are controllable , and delete operations are not supported;

Composition

Bitmap (BIT array) + n hash functions
m % 2n= m & (2n – 1)

Principle

When an element is added to the bitmap, map this element to k points of the bitmap through k hash functions, and set them to 1;
When retrieving, check whether the k points of the bitmap are all 1 through k hash function operations; if there is a point that is not 1, then the key does not exist; if all are 1, it may exist;

Why is delete operation not supported?

  • Each slot in the bitmap has only two states ( 0 or 1 ), a slot is set to 1 state, but it is not sure how many times it is set; that is, it is not known how many key hashes are mapped And which specific hash function is mapped from;

Application scenario

Bloom filters are usually used to judge the scene where a certain key must not exist, and at the same time allow the judgment that there is an error when it exists;
Common processing scenarios: ① cache penetration solution; ② hot key current limiting;

  • Describe the caching scenario. In order to reduce the access pressure of the database (mysql), a cache is added between the server and the database (mysql) to store hot data;
  • Describe the cache penetration, when the server side requests data, The cache and database do not contain the data, and finally the request pressure all floods to the database;
  • Data request steps, as shown in Figure 2;
  • Cause: Hackers use vulnerabilities to forge data attacks or internal business bugs cause a large number of repeated requests for non-existent data;
  • Solution: as shown in figure 3;

Application analysis

In practical applications, how many hash functions should be selected? How much space to allocate for the bitmap? How many elements are expected to be stored? How to control the error?
The formula is as follows:

n -- the expected number of elements in the Bloom filter, as shown in the figure above, there are only two elements, str1 and str2, then n=
p -- false positive rate, between 0-1 0.000000
m -- the space occupied by the bitmap
k -- the number of hash functions
The formula is as follows:
n = ceil(m / (-k / log(1 - exp(log(p) / k))))
p = pow(1 - exp(-k / (m / n)), k)
m = ceil((n * log(p)) / log(1 / pow(2, log(2))));
k = round((m / n) * log(2));

Variable relationship

Assume 4 initial values:
n = 4000
p = 0.
m = 172532
k = 30



Why does i*31 appear during the implementation of the Baidu hash function in the interview?

  • i * 31 = i * (32-1) = i * (1<<5 -1) = i << 5 - i;
  • 31 prime numbers, hash random distribution is the best;

Determining n and p

When actually using the Bloom filter, you first need to determine n and p, and get m and k through the above operations; usually you can choose the appropriate value on the following website;
https://hur.st/bloomfilter

Choose a hash function

Select a hash function, pass different seed offset values to the hash, and use linear search to construct multiple hash functions;

#define MIX_UINT64(v) ((uint32_t)((v>>32)^(v)))
uint64_t hash1 = MurmurHash2_x64(key, len, Seed);
uint64_t hash2 = MurmurHash2_x64(key, len, MIX_UINT64(hash1));
for (i = 0 ; i < k; i ++ ) // k is the number of hash functions
{<!-- -->
Pos[i] = (hash1 + i*hash2) % m; // m is the size of the bitmap
}

Distributed consistency hash

Background

The distributed consistent hash algorithm organizes the hash space into a virtual ring, and the size of the ring is 232;

The algorithm is: hash(ip) % 232, and finally an unsigned integer between [0, 232– 1] will be obtained, which represents the server number; multiple servers map a point on the hash ring in this way to identify the location of the server; when the user operates a certain key, generates a value through the same algorithm, and locates a server clockwise along the ring, then The key is on the server
in the server; the picture comes from the Internet;

Application scenario

Distributed cache; distribute data evenly among different servers to share the pressure on the cache server;
Solve the change in the number of cache servers so as not to affect cache invalidation;

hash offset

The results obtained by the hash algorithm are random, and it cannot be guaranteed that the server nodes are evenly distributed on the hash ring; uneven distribution causes uneven request access and uneven pressure on the server; the pictures are from the Internet;

Virtual node

In order to solve the problem of hash offset, the concept of virtual nodes is added; in theory, the more nodes on the hash ring, the more balanced the data distribution;
Calculate multiple hash nodes (virtual nodes) for each service node; usually, hash(“IP:PORT:seqno”)% 232;

The picture comes from the Internet;

Thinking

  • Distributed consistency hash How to add or delete nodes for data migration?
    Reference: https://github.com/metang326/consistent_hashing_cpp