Partition table and bucket table operations in Hive

Table of Contents

Partition table

Single-level partition table

Knowledge points

Example

Multi-level partition table

Knowledge points

Example

Partition table operations

Example

Hadoop/Hive documentation

Bucket table

Important parameters

Basic bucket table

Knowledge points

Example

Bucket table sorting

Knowledge points

Example

Bucketing principle

The difference between partition table and bucket table


Partition table

Features/benefits of a partitioned table: each partition gets its own directory on HDFS, and filtering on the partition field at query time lets Hive read only the matching directories, avoiding a full table scan and improving query efficiency.

Note on efficiency: if a table is partitioned but the query does not filter on the partition field, there is no efficiency gain.

Note on partition field names: a partition field name must not duplicate an existing column name, because the partition field is spliced into the table as an extra column.
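
To make the last note concrete, a minimal sketch (the table and column names here are illustrative): the partition field behaves like an extra column in queries, which is why it cannot share a name with a data column.

-- 'year' appears only in partitioned by, never in the column list
create table demo_order(
    oid string,
    price float
) partitioned by (year string);
-- the partition field can be selected and filtered like an ordinary column
select oid, price, year from demo_order where year='2023';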

Single-level partition table

Knowledge points

Create a partitioned table: create [external] table [if not exists] table name (field name field type, field name field type, ...) partitioned by (partition field name partition field type) ...;

Automatically generate a partition directory and insert data: load data [local] inpath 'file path' into table partition table name partition (partition field name='value');

Note: with local, the file path that follows is a Linux local path; without it, it is an HDFS path.
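
A side-by-side sketch of the two forms (the paths are placeholders; one_part_order is the table created in the example below):

-- with local: a Linux path; the file is copied into the partition directory
load data local inpath '/home/hadoop/order2022.txt' into table one_part_order partition (year='2022');
-- without local: an HDFS path; the file is moved into the partition directory
load data inpath '/source/order2022.txt' into table one_part_order partition (year='2022');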

Example

-- 1. Single-level partition table (partitioned by year; one_part_order is reused in the operations section below)
-- Create table
create table one_part_order(
    oid string,
    name string,
    price float,
    num int
) partitioned by (year string)
row format delimited fields terminated by ' ';
-- Insert data: one partition directory per year
load data inpath '/source/order202251.txt'
    into table one_part_order
    partition (year='2022');
load data inpath '/source/order202351.txt'
    into table one_part_order
    partition (year='2023');
-- Verify the data
select * from one_part_order;
-- Benefit of partitioned tables: avoid full table scans and improve query efficiency
-- Requirement: total product sales in 2023 (only the year=2023 directory is scanned)
select sum(price*num) from one_part_order where year='2023';

Multi-level partition table

Knowledge points

Create a partitioned table: create [external] table [if not exists] table name (field name field type, field name field type, ...) partitioned by (first-level partition field name partition field type, second-level partition field name partition field type, ...);

Automatically generate a partition directory and insert data: load data [local] inpath 'file path' into table partition table name partition (first-level partition field name='value', second-level partition field name='value', ...);

Note: with local, the file path is a Linux local path; without it, it is an HDFS path.

Example

-- 2. Multi-level partition table
-- Create table
create table more_part_order(
    oid string,
    name string,
    price float,
    num int
) partitioned by (year string, month string, day string)
row format delimited fields terminated by ' ';
-- Insert data
load data inpath '/source/order202251.txt'
    into table more_part_order
    partition (year='2022',month='2022-05',day='2022-05-01');
load data inpath '/source/order2023415.txt'
    into table more_part_order
    partition (year='2023',month='2023-04',day='2023-04-15');
load data inpath '/source/order202351.txt'
    into table more_part_order
    partition (year='2023',month='2023-05',day='2023-05-01');
load data inpath '/source/order202352.txt'
    into table more_part_order
    partition (year='2023',month='2023-05',day='2023-05-02');
-- Verify the data
select * from more_part_order;
-- Benefit of partitioned tables: avoid full table scans and improve query efficiency
-- Requirement: total product sales in 2023
select sum(price*num) from more_part_order where year='2023';
-- Requirement: total product sales in May 2023
select sum(price*num) from more_part_order where month='2023-05'; -- 128
-- Requirement: total product sales on May 1, 2023
select sum(price*num) from more_part_order where day='2023-05-01'; -- 78

Partition table operations

Add a partition: alter table partition table name add partition (partition field name='value', ...);

Delete a partition: alter table partition table name drop partition (partition field name='value', ...);

Rename a partition: alter table partition table name partition (partition field name='old value', ...) rename to partition (partition field name='new value', ...);

View all partitions: show partitions partition table name;

Synchronize/repair partitions: msck repair table partition table name;

Example

-- Add a partition (essentially creates a partition directory on HDFS)
alter table one_part_order add partition (year='2024');
alter table more_part_order add partition (year='2024',month='2024-05',day='2024-05-01');
-- Rename a partition (essentially renames the partition directory on HDFS)
alter table one_part_order partition (year='2024') rename to partition (year='2025');
-- Note: the old year=2024 and month=2024-05 directories are kept (now empty); the leaf
-- directory 2024-05-01 is renamed and moved into a newly created 2025/2025-05/2025-05-01 path
alter table more_part_order partition (year='2024',month='2024-05',day='2024-05-01')
    rename to partition (year='2025',month='2025-05',day='2025-05-01');
-- View all partitions
show partitions one_part_order;
show partitions more_part_order;
-- Delete partitions
alter table one_part_order drop partition (year='2025'); -- the year=2025 directory is deleted
alter table more_part_order drop partition (year='2023',month='2023-05',day='2023-05-01'); -- 2023-05-01 is deleted
alter table more_part_order drop partition (year='2023',month='2023-05'); -- 2023-05 is deleted
alter table more_part_order drop partition (year='2023'); -- 2023 is deleted
-- A folder created directly on HDFS in the partition directory format can be registered with msck repair
-- Example: manually create a year=2033 directory (a sketch is shown below)
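-- Hedged sketch: one way to create that directory from the Hive CLI, assuming the
-- default warehouse path /user/hive/warehouse and the default database:
dfs -mkdir -p /user/hive/warehouse/one_part_order/year=2033;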
msck repair table one_part_order;
msck repair table more_part_order;
-- View all partitions again after the repair
show partitions one_part_order;
show partitions more_part_order;

Hadoop/Hive documentation

Hive documentation: https://cwiki.apache.org/confluence/display/Hive/Configuration+Properties
HDFS documentation: https://hadoop.apache.org/docs/stable/hadoop-project-dist/hadoop-hdfs/hdfs-default.xml
YARN documentation: https://hadoop.apache.org/docs/stable/hadoop-yarn/hadoop-yarn-common/yarn-default.xml
MapReduce documentation: https://hadoop.apache.org/docs/stable/hadoop-mapreduce-client/hadoop-mapreduce-client-core/mapred-default.xml

Bucket table

Features/benefits of a bucket table: data is split into a fixed number of bucket files, which improves efficiency for specific operations on the bucket field (filtering, joining, grouping, and sampling).

Note on efficiency: if the bucket field is not used when querying, efficiency is unchanged.

Note on bucketing field names: the bucketing field must be one of the table's existing columns, because bucketing computes a remainder from the field's value and places rows with the same remainder into the same bucket file.

Important parameters

-- hive.enforce.bucketing: enabled by default; removed in Hive 2.x (bucketing is always enforced)
set hive.enforce.bucketing; -- viewing it reports undefined because the parameter has been removed
set hive.enforce.bucketing=true; -- set it (only relevant on versions that still have it)

-- Check the number of reducers
-- Parameter priority: set command > Hive configuration > Hadoop configuration
set mapreduce.job.reduces; -- default is -1, meaning the reducer count automatically matches the number of buckets
set mapreduce.job.reduces=3; -- set it explicitly

Basic bucket table

Knowledge points

Create a basic bucket table:
create [external] table [if not exists] table name (
    field name field type
)
clustered by (bucket field name)
into bucket number buckets;

Example

-- 1. Create a basic bucket table with 3 buckets
create table course(
    id int,
    subject string,
    name string
) clustered by (id) into 3 buckets
row format delimited fields terminated by '\t';
-- Load data
load data inpath '/source/course.txt' into table course;
-- Verify the data
select * from course;
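
Sampling is one of the operations bucketing speeds up. A minimal sketch against the course table just created, using Hive's tablesample clause: because the table is clustered by id into 3 buckets, sampling on id can read a single bucket file instead of the whole table.

-- Read only the rows that fall into bucket 1 of the 3 buckets
select * from course tablesample(bucket 1 out of 3 on id);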

Bucket table sorting

Knowledge points

Create a bucket table that is sorted within each bucket:
create [external] table [if not exists] table name (
    field name field type
)
clustered by (bucket field name)
sorted by (sort field name asc|desc) -- asc = ascending (default), desc = descending
into bucket number buckets;

Example

-- 1. Create a bucket table with 3 buckets, sorted by id in descending order within each bucket
create table course_sort(
    id int,
    subject string,
    name string
) clustered by (id) sorted by (id desc) into 3 buckets
row format delimited fields terminated by '\t';
-- Load data
load data inpath '/source/course.txt' into table course_sort;
-- Verify the data
select * from course_sort;

Bucketing principle

Bucketing principle:
If the bucket field is numeric: the value itself is taken modulo the number of buckets.
If the bucket field is a string: the value is first hashed to a number, and that number is taken modulo the number of buckets.

Hash: a hash function maps an input value to a number. We will not discuss the algorithm in detail; the key property is that the same input always produces the same output.
For example: suppose the string 'sisi' hashes to 3530540 (an illustrative value only); then no matter how many times it is computed, 'sisi' will always hash to 3530540.
Remainder: hash('sisi') % 3 == 2
Note: the same data always gets the same result. Since the remainder for 'sisi' is 2, it will be 2 every time, so 'sisi' rows always land in the same bucket file.
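
As a check on the principle above, Hive's built-in hash() and pmod() functions can reproduce this computation (the real hash value will differ from the illustrative 3530540):

-- Which of 3 buckets does 'sisi' land in?
select hash('sisi'), pmod(hash('sisi'), 3);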

The difference between partition table and bucket table

Partition table
Keywords when creating a table: partitioned by (partition field name partition field type)
Note on the partition field name: it is a new field, its type must be specified, and it must not share a name with an existing column.
Benefit of partitioned tables: when the partition field is used as a filter condition, Hive goes straight to the matching partition directory, avoiding a full table scan and improving query efficiency.
Most direct effect of a partition table: under the table's HDFS directory, data is split into multiple partition directories (year=xxxx, month=xx, day=xx).
Uploading files directly into the table's HDFS root directory is not recommended: the partition table cannot read data from such files, because it only looks for data files inside partition directories.
Loading HDFS files with load data: the file is essentially moved into the corresponding partition directory.

Bucket table
Keywords when creating a table: clustered by (bucket field name) into bucket number buckets
Note on the bucket field name: it refers to an existing column, so no type is specified.
Benefit of bucket tables: specific operations on the bucket field, such as sampling, can also be made more efficient.
Most direct effect of a bucket table: in the table's HDFS directory (or in each partition directory), data is split into multiple bucket files (000000_0, 000001_0, 000002_0, ...), as shown in the sketch below.
Uploading files directly into the table's HDFS root directory is not recommended: the bucket table can read data from such files, but they get no bucketing effect.
Loading HDFS files with load data: the data is essentially copied and redistributed into the bucket files.
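
A quick way to see the difference on disk is to list both table directories from the Hive CLI. A sketch, assuming the default warehouse location /user/hive/warehouse and the default database (adjust the paths for your environment):

-- Partition table: the table directory contains partition directories (year=..., month=..., day=...)
dfs -ls -R /user/hive/warehouse/more_part_order;
-- Bucket table: the table directory contains bucket files (000000_0, 000001_0, 000002_0)
dfs -ls /user/hive/warehouse/course;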
