Causes and Solutions for Hive Data Skew

Table of contents

    • Manifestations of Hive data skew
    • Causes of Hive data skew
    • Solutions for Hive data skew
      • Group by aggregation skew
      • Map and Reduce optimization
      • HQL containing count(distinct)
      • Join optimization in HQL
      • Summary of the above
    • The underlying processing logic of ODPS MR

When Hive data skew comes up, some readers may not know what data skew actually is, so let's discuss Hive data skew from three angles: its manifestations, its causes, and its solutions.

Manifestations of Hive data skew

As we all know, Hive's underlying engine is MapReduce (MR): HQL statements are converted into MR jobs, which saves us a great deal of the time we would otherwise spend writing MR by hand. Sometimes, however, you will find that when you run a task, all the map tasks have finished and 99% of the reduce tasks have finished too, yet the last one or two reduce tasks keep running and show no progress no matter how long you wait. This situation is usually data skew. Put bluntly, Hive data skew is essentially MapReduce data skew (interested readers can confirm this on the task monitoring page).

Causes of Hive data skew

Data skew comes mainly from join skew and aggregation skew. The problem is in fact very common in the MapReduce programming model, and its root cause is that a large number of identical keys are assigned to a single reduce task: that one reduce task is worked to exhaustion while the other reduce tasks sit idle. The common causes of data skew are:

1. Uneven key distribution
2. Characteristics of the business data itself: some values are naturally concentrated
3. Insufficient thought when designing the table
4. Some SQL statements are inherently prone to skew, for example:
(1) Joining a large table with a small table: the keys of the small table are concentrated, so the data sent to one or a few reducers is far above the average.
(2) Joining large tables on null or meaningless values: if many rows are missing the join key, those null values become highly concentrated during the join and drag down overall progress.
(3) Group by: the grouping dimension has too few distinct values and some values occur far too often, so the reduce task handling those values is extremely time-consuming.
(4) Count distinct: too many rows share a few special values, so the reduce task processing those values is slow.

Solutions for Hive data skew

Group by aggregation skew

  • Reason:
    The grouping dimension has too few distinct values and some values occur far too often, so the reduce task processing those values takes a very long time.
    When data is aggregated by type, all rows sharing the same group-by key are pulled to the same node for aggregation. If one type holds far more data than the others, the computations for the other groups finish while that one reduce task is still running, and every other node waits for it. That is why you keep seeing "map 100%, reduce 99%".

  • Solution: enable load balancing when the group-by data is skewed:

set hive.map.aggr=true;
set hive.groupby.skewindata=true;
  • Principle:
    hive.map.aggr=true enables map-side (partial) aggregation.
    When hive.groupby.skewindata=true, the generated query plan contains two MR jobs. In the first job, the map output is distributed randomly across the reducers; each reducer performs a partial aggregation and emits its result, so rows with the same group-by key may land on different reducers, which balances the load. The second job then distributes the pre-aggregated results to the reducers by the group-by key (this guarantees that identical group-by keys reach the same reducer) and completes the final aggregation.
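For illustration, here is a minimal sketch of the two settings applied to a query; the table events and its skewed column event_type are hypothetical:

set hive.map.aggr=true;            -- partial aggregation on the map side
set hive.groupby.skewindata=true;  -- two-job plan: random shuffle first, final aggregation second
select event_type, count(*) as cnt
from events
group by event_type;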

Map and Reduce optimization

  • Reason 1: There are too many small files, and they need to be merged. This can be solved with set hive.merge.mapredfiles=true;

  • Reason 2: The input data has a severe mix of large and small blocks, for example one 128MB file alongside 1000 files of 1KB each. Solution: merge the files before the task runs, combining the many small files into one large file; this can also be done with set hive.merge.mapredfiles=true;

  • Reason 3: A single file is slightly larger than the configured block size. In this case the number of maps should be increased appropriately. Solution: set mapred.map.tasks;

  • Reason 4: The file sizes are moderate, but the map-side computation is very heavy, for example select id, count(*), sum(case when ...), sum(case when ...) ... Here more maps (and reducers) are needed. Solution: set mapred.map.tasks and set mapred.reduce.tasks; a combined sketch follows this list.
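A minimal combined sketch of these settings; the numeric values are illustrative and should be tuned for your cluster:

set hive.merge.mapredfiles=true;  -- merge small files produced at the end of an MR job
set mapred.map.tasks=100;         -- more maps for compute-heavy map phases (illustrative)
set mapred.reduce.tasks=20;       -- more reducers to match (illustrative)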

HQL containing count(distinct)

  • Reason: When the data volume is very large, SQL such as select a, count(distinct b) from t group by a; suffers from data skew, because all the distinct values of b for a given a must be processed by a single reduce task.
  • Solution: Rewrite it as a nested group by plus sum, for example: select a, sum(1) from (select a, b from t group by a, b) tmp group by a; (note that Hive requires an alias on the subquery, here tmp). A side-by-side sketch follows.
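The same rewrite with comments, assuming a hypothetical table t(a, b):

-- skew-prone: for each value of a, every b funnels into one reduce task
-- select a, count(distinct b) from t group by a;
-- skew-resistant: the inner group by deduplicates (a, b) pairs across many
-- reducers before the outer query simply counts the surviving rows
select a, sum(1) as distinct_b_cnt
from (select a, b from t group by a, b) tmp
group by a;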

Join optimization in HQL

  • Reason: A large table is joined with a small table. Use mapjoin to load the small table (here a) into memory, for example:
select /*+ MAPJOIN(a) */ a.c1, b.c1, b.c2 from a join b on a.c1 = b.c1;
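In addition to the hint, Hive can convert such joins to map joins automatically; a minimal sketch (the threshold shown is the usual default and is illustrative here):

set hive.auto.convert.join=true;               -- convert joins against a small enough table to map joins
set hive.mapjoin.smalltable.filesize=25000000; -- "small table" threshold, in bytes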

When a join is required but the join key contains null values, for example when the id of table 1 must be joined to the id of table 2:

  • Solution 1: Exclude the rows whose id is null from the join, for example:
select * from log a
join users b
on a.id is not null and a.id = b.id
union all
select * from log a
where a.id is null;
  • Solution 2: Assign a random key to the null values, for example:
select * from log a
left outer join users b
on case when a.user_id is null
then concat('hive', rand())
else a.user_id end = b.user_id;
Solution 2 is generally more efficient than solution 1: the log table is read only once, the work runs as a single job, and the randomized keys spread the formerly concentrated null rows across many reducers.

Join optimization recap

  • Large table joined with a small table: use map join to load the small dimension table into memory and complete the join on the map side, which is efficient.
  • Large table joined with a large table: turn null-valued keys into a string plus a random number, so the skewed rows are spread across different reducers. Because null values can never match, the final result is unaffected.

Summary of the above

1) Normally, a job generates one or more map tasks from its input directory.
The main determining factors are the total number of input files, the input file sizes, and the block size configured for the cluster.
2) Are more maps always better?
The answer is no. If a task has many small files (much smaller than the 128MB block size), each small file is treated as a block and handled by its own map task, and since starting and initializing a map task takes far longer than its actual processing, this wastes enormous resources. Moreover, the number of maps that can run concurrently is limited.
3) If every map processes close to a 128MB block, can we sit back and relax?
Not necessarily. For example, a 127MB file would normally be handled by one map, but if that file has only one or two small fields yet tens of millions of records, and the map's processing logic is fairly complex, completing it with a single map task will still be time-consuming.
Problems 2) and 3) call for opposite remedies: reducing the number of maps in the first case and increasing it in the second, as sketched below.
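A minimal sketch of both remedies; the parameter values are illustrative:

-- Problem 2): many small files -- combine them into larger splits so fewer maps start
set hive.input.format=org.apache.hadoop.hive.ql.io.CombineHiveInputFormat;
set mapred.max.split.size=256000000;  -- aim for up to 256MB per split
-- Problem 3): a single compute-heavy file -- request more map tasks
set mapred.map.tasks=10;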

The underlying processing logic of ODPS MR can be summarized in the following steps:

Task submission: when a user submits an MR task to ODPS, ODPS schedules resources and starts a corresponding number of Mapper and Reducer tasks on the distributed computing cluster.

Data splitting: the input data is divided into blocks, and each block is assigned to a Mapper task. The splitting strategy can be configured according to data size, data format, and so on.

Map phase: each Mapper task reads the data block it is responsible for, parses the data into key-value pairs, and writes the results to a temporary storage area.

Shuffle phase: after the Map phase, the system merges, sorts, and repartitions the Mapper output so it can be transmitted to the Reducer tasks. The main goal of the Shuffle phase is to ensure that data with the same key is sent to the same Reducer task.

Reduce phase: each Reducer task receives its share of the shuffled data, aggregates and processes it, and writes the final results to the output table.

Data merging: the outputs of the multiple Reducer tasks are merged into the final output. How they are merged can be configured to suit the application, for example into a single file or several files.

Result output: the final output is saved to the user-specified table, where it is available for subsequent querying, analysis, and storage.
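To relate these stages back to Hive, here is a minimal sketch annotating a simple HQL aggregation with the stage each part of the query exercises (the table orders is hypothetical):

select city,                 -- Map phase: rows are parsed into (city, amount) pairs
       sum(amount) as total  -- Reduce phase: per-key aggregation of the shuffled rows
from orders                  -- data splitting: input blocks are assigned to Mapper tasks
group by city;               -- Shuffle phase: rows with the same city reach the same Reducer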

Throughout the ODPS MR pipeline, the input, splitting, Map, Shuffle, and Reduce stages are highly parallelized so that large-scale data sets can be processed. At the same time, ODPS automatically handles task scheduling, resource allocation, fault-tolerance recovery, and monitoring to ensure that tasks execute smoothly and the data stays accurate.

References:
https://blog.csdn.net/wind96/article/details/127696043
https://blog.csdn.net/weixin_43542605/article/details/122326059
https://baijiahao.baidu.com/s?id=1764400201507671499&wfr=spider&for=pc