Bitmap/hyperloglog/GEO of Redis

bitmap/hyperlog/GEO of Redis

  • Introduction of interview questions
  • Two types of statistics
  • three hyperloglog
    • 3.1 Industry Terminology
    • 3.2 hyperloglog basis
      • 3.2.1 Base
      • 3.2.2 Definition
      • 3.2.3 Cardinality statistics
      • 3.2.4 Basic commands
    • 3.3 Principle of HyperLogLog
      • 3.3.1 Ways to deduplicate statistics
      • 3.3.2 Principle
    • 3.4 HyperLogLog case study
      • 3.4.1 Requirements
      • 3.4.2 Scheme Discussion
      • 3.4.3 HyperLogLogService
  • Four geos
    • 4.1 Introduction of interview questions
    • 4.2 redis GEO basic commands
    • 4.3 Case
      • 4.3.1 Key points
      • 4.3.2 Code combat
  • Five bitmaps
    • 5.1 Introduction of interview questions
    • 5.2 Definition
    • 5.3 Function
    • 5.4 Basic commands

Introduction of interview questions

  • Douyin e-commerce live broadcast, the product introduced by the anchor has comments, 1 product corresponds to 1 series of comments, sort + display + extract 10 records.
  • The user’s check-in and check-in information on the mobile APP corresponds to the check-in record of the 1st series yo curling in one day. How to count whether Dingding check-in and check-in come or not?
  • Apply web page access information on the website: 1 web page corresponds to 1 series of access clicks, Taobao home page, how many people visit the home page every day?
  • After the company’s system goes online, what are the UV, PV, and DAU respectively?
  • Perform statistics on the data in the collection:
    • In mobile applications, it is necessary to count the number of new users every day and the number of retained users on the second day.
    • In product reviews on e-commerce websites, it is necessary to count the latest reviews in the review table.
    • In the check-in and check-in process, it is necessary to count the number of users who have checked in continuously within a month.
    • In the webpage access records, it is necessary to count the number of Unique Visitors (VU)
    • Pain points: The access levels of users like Toutiao, Douyin, and Taobao are all 100 million. How to deal with it?

Second type of statistics

In billion-level systems, there are four common statistics

  • Aggregate statistics: count the aggregation results of multiple collection elements (intersection and union statistics)

  • Sorting statistics: Please design a display list for the scene of the latest comments and comments on Douyin short videos. Investigate the understanding of data structure and design ideas

    • zset:

    • When faced with the need to display the latest list, leaderboard, etc., if the data is frequently updated or needs to be displayed in pages, it is recommended to use Zset.

  • Binary statistics: There are only two values of set elements: 0 and 1. In the scenario of checking in and clocking in at work on DingTalk, we only use the record of whether there is a sign-in (1) or no sign-in (0). (bitmap)

  • Cardinality statistics: refers to counting the number of unique elements in a set. (hyperloglog)

Three hyperloglog

3.1 Industry Terminology

  • UV: Unique Visitor, an independent visitor, generally understood as the client IP. need to reconsider
  • PV: Page View, page views.
  • DAU: Daily Active User, the number of daily active users. The number of users who have logged in or used a certain product (remove users who have logged in repeatedly). Frequent users reflect the operation of websites, Internet applications or online games.
  • MAU: Monthly Active User, the number of monthly active users.

3.2 hyperloglog basis

3.2.1 Base

It is a data set, the real number after deduplication.

3.2.2 Definition

Redis Hyperloglog is an algorithm for cardinality statistics. The advantage of hyperloglog is that when the number or volume of input elements is very large, the space required to calculate the cardinality is always fixed and small.
In Redis, each HyperLogLog key only needs 12KB of memory to calculate the cardinality of nearly 2^64 different elements. This is in stark contrast to collections where more elements consume more memory when computing cardinality.
However, because HyperLogLog only calculates the cardinality based on the input elements, and does not store the input elements themselves, HyperLogLog cannot return the individual elements of the input like a collection.

3.2.3 Cardinality Statistics

It is used to count the number of unique elements in a set, which is the calculation of the remaining elements after deduplication of the set.

3.2.4 Basic commands

3.3 Principle of HyperLogLog

3.3.1 How to remove duplicate statistics

  • HashSet

  • bitMap

    If the amount of data is large, billion-level statistics, using bitmaps will also have this problem.
    The bitmap uses a bit array to indicate whether each element appears, each element corresponds to one bit, and the total memory required is N bits.
    Cardinality The base number is one bit in the bit array corresponding to each element, such as the bit array 010010101 (according to the subscript starting from zero, some are 1, 4, 6, 8).
    The newly entered elements only need to perform a bitwise OR calculation on the existing bit input and the newly added elements. This method can greatly reduce the memory usage and the bit operation is fast.
    But, Assume that a sample case is 100 million cardinal place value data, and a sample is 100 million.
    If you want to count the base digit value of 100 million data, you need about 100000000/8/1024/1024 of memory, which is about 12M, and the effect of reducing memory usage is remarkable. In this way, it takes 12M to obtain the base value of counting an object sample.
    If 10,000 object samples are counted, 117.18G or nearly 120G is required. It can be seen that using bitmaps is still not suitable for cardinality counting scenarios under large data volumes (billion level).
    bitmaps methods are computed exactly.

  • Summary of the above

    The more sample elements there are, the memory consumption will increase sharply, which is difficult to control + all kinds of slowness, which is not suitable for billion-level statistics.

  • Solution: Probabilistic Algorithms

    In exchange for space by sacrificing accuracy, it can be used in scenarios that do not require absolute accuracy. Because the probability algorithm does not directly store the data itself, it estimates the base value through a certain probability and statistics method, and at the same time ensures that the error is within a certain range. Since it does not store data, it can greatly save memory.
    HyperLogLog is an implementation of a probabilistic algorithm.

3.3.2 Principle

HyperLogLog only performs non-repeated cardinality statistics, it is not a collection or saves data, it only records the quantity rather than the specific content.
There is an error: HyperLogLog provides an imprecise deduplication counting solution, sacrificing accuracy in exchange for space, and the error is only about 0.81%.

3.4 HyperLogLog case study

3.4.1 Requirements

  • UV statistics need to be deduplicated, and multiple visits by a user within a day can only be counted as one.
  • The UV on the homepage of Taobao and Tmall is about 100-150 million per day on average.
  • 150 million IPs are stored every day. When the visitor comes, first check whether it exists, if not, join.

3.4.2 Program Discussion

  • Use mysql (o(╥﹏╥)o)

  • Use the hash structure of redis to store

    redis -hash = >
    According to the structure of ipv4, each ipv4 address is up to 15 bytes, 150 million*15 bytes in a day = 2G, 60G a month, the storage capacity is too large.

  • hyperLogLog:

3.4.3 HyperLogLogService

@Service
public class HyperLogLogService{<!-- -->

@Resource
private RedisTemplate redisTemplate;
\t
//Simulate ip click access
@PostConstruct
public void initIp(){<!-- -->
new Thread(()->{<!-- -->
String ip = null;
for(int i =0;i<200;i ++ ){<!-- -->
Random random = new Random();
ip = random. nextInt(256) + "." +
random. nextInt(256) + "." +
random. nextInt(256) + "." +
random. nextInt(256) ;
\t\t\t\t
redisTemplate.opfForHyperLogLog().add("hll",ip);

//simulation pause for 3 seconds
try{<!-- -->
TimeUnit. SECODNS. sleep(3);
}catch(InterruptedException e){<!-- -->
e.printStackTrace();
}
}
},"t1").start();

}

//Get UV
public long uv(){<!-- -->
return redisTemplate.opsForHyperLogLog().size("hll");
}

}

Four GEO

4.1 Introduction of interview questions

In the era of mobile Internet, there are more and more LBS applications, such as people attached to dating software, gourmet shops attached to food delivery software, vehicles attached to taxi-hailing software, etc. Then how is the selection of various XXX address locations in this attachment realized?
What could be the problem?

  • Query performance issues. If the concurrency is high and the amount of data is large, this kind of query will destroy the mysql database.
  • Generally, mysql query is a flat rectangular access, while the car service is covered by a circle with a radius of N kilometers as the center.
  • The problem of accuracy, we know that the earth is not a plane coordinate system, but a sphere. This kind of rectangular calculation will have a large error in long-distance calculations, and mysql is not suitable.

4.2 redis GEO basic commands

  • GEOADD : Add latitude and longitude coordinates
  • GEOPOS: returns latitude and longitude
  • GEOHASH: Returns the GEOHASH representation of the coordinates
  • GEODIST: Distance between two locations
  • GEORADIUS: Find nearby XXX around a radius
  • GEORADIUSBYMEMBER:

4.3 Case

4.3.1 Key points

GEORADIUS: With a given latitude and longitude as the center, find elements within a certain radius.

4.3.2 Code combat

GeoController

@RestController
public class GeoController{<!-- -->

@Resource
private GeoService geoService;

//Add latitude and longitude coordinates
@GetMapping("/geoadd")
public String geoAdd(){<!-- -->
return geoService. getAdd();
}
// Get latitude and longitude coordinates geopos
@GetMapping("/geopos")
public Point position(String member){<!-- -->
return geoService. position(member);
}

//Get the base32 encoded value geohash generated by latitude and longitude
@GetMapping("/geohash")
public Point hash(String member){<!-- -->
return geoService.hash(member);
}

//Get the base32 encoded value geohash generated by latitude and longitude
@GetMapping("/geohash")
public Point hash(String member){<!-- -->
return geoService.hash(member);
}

//Get the distance between the given two positions
@GetMapping("/geodist")
public Distance distance(String member1,String memeber2){<!-- -->
return geoService.distance(member1,member2);
}

// Obtain the position near a fixed point by latitude and longitude (the position is hard-coded)
@GetMapping("/georadius")
public GeoResults radiuByxy(){<!-- -->
return geoService.radiuByxy();
}
}

GeoService

@Service
public class GeoService{<!-- -->
\t
public static final String CITY = "city";
\t
@Autowired
private RedisTemplate redisTemplate;

public String geoAdd(){<!-- -->
Map<String,Point> map = new HashMap<>();
map.put("Tiananmen",new Point(116.41338 , 39.91092 ));
map.put("Forbidden City",new Point(116.40341 , 39.92409 ));
map.put("Great Wall",new Point(116.02407 , 40.36264 ));
redisTemplate.opsForGeo().add(CITY,map);
return map.toString();
}

public Point position(String member){<!-- -->
List<Point> list = redisTemplate.opsForGeo().position(CITY,member);
return list. get(0);
}

public Point hash(String member){<!-- -->
List<String> list = redisTemplate.opsForGeo().hash(CITY,member);
return list. get(0);
}

public Distance distance(String member1,String memeber2){<!-- -->
Distance distance = redisTemplate.opsForGeo().distance(CITY,member1,member2,
RedisGeoCommands.DistanceUnit.KILOMETERS);
return distance;
}

public GeoResults radiuByxy(){<!-- -->
//Latitude and longitude: 116.40048, 39.91680 Zhongshan Park
Circle circle = new Circle(116.40048 , 39.91680,Metrics.KILMETERS.getMultiplier());
//return 50
RedisGeoCommands.GeoRadiusCommandArgs args = RedisGeoCommands.GeoRadiusCommandArgs
.newGeoRadiusArgs().includeDistance()
.includeCoordinates().sortDescending().limit(50);
GeoResults redius = redisTemplate.opsForGeo().radius(CITY,circle,args)
return redius;
}
}

five bitmap

5.1 Introduction of interview questions

  • Daily statistics
  • Continuous check-in
  • Active users in the last week
  • Count the number of days logged in by a specified user in a year
  • According to the 365 days of the year, which days did the user log in, and which days did not log in, how many days did the user log in in the whole year?

5.2 Definition

A bit array of binary bits represented by 0 and 1 states.

Using the String type as the underlying data structure to implement a statistical binary state data type.
A bitmap is essentially an array, which is a bitwise operation based on the String data type. The array consists of bits, each of which corresponds to an offset (we can call it an index or bit cell). The maximum number of bits supported by Bitmap is 2^32 bits, which can greatly save storage space, and can store up to 4.29 billion bytes of information using 512 memory.

5.3 Function

Used for state statistics, Y, N, similar to AtomicBoolean.

  • Whether the user has logged in Y, N
  • Whether movies and advertisements have been clicked to play
  • Dingding clock-in and commute, sign-in statistics

5.4 Basic commands

The offset of Bitmap is calculated from zero.