Table of Contents
Partition Table
First level partition table
Knowledge points
Example
Multi-level partitioning
Knowledge points
Example
Partition table operations
Example
hadoop_hive documentation
Bucket table
Important parameters
Basic bucket table
Knowledge points
Example
Bucket table sorting
Knowledge points
Example
Bucketing principle
The difference between partition table and bucket table
Partition table
Partitioned table features/benefits: a partition directory is generated for each partition, and filtering on the partition field at query time avoids a full table scan, improving query efficiency.
Note on efficiency: if a partitioned table is queried without filtering on the partition field, there is no efficiency gain.
Note on partition field names: a partition field name must not duplicate an existing column name, because the partition field is appended to the table as an extra column.
Level 1 partition table
Knowledge points
Create a partitioned table: create [external] table [if not exists] table name (field name field type, field name field type, ...) partitioned by (partition field name partition field type) ...;
Automatically generate a partition directory and insert data: load data [local] inpath 'file path' into table partition table name partition (partition field name='value');
Note: with local, the file path is a Linux local path; without it, it is an HDFS path.
Example
-- 1. Level 1 partition table
-- Create table
create table one_part_order(
    oid string,
    name string,
    price float,
    num int
) partitioned by (year string)
row format delimited fields terminated by ' ';
-- Insert data
load data inpath '/source/order202251.txt' into table one_part_order partition (year='2022');
load data inpath '/source/order2023415.txt' into table one_part_order partition (year='2023');
-- Verify the data
select * from one_part_order;
-- Benefit of partitioned tables: avoid full table scans and improve query efficiency
-- Requirement: total product sales in 2023
select sum(price*num) from one_part_order where year='2023';
Multi-level partition
Knowledge points
Create a partitioned table: create [external] table [if not exists] table name (field name field type, field name field type, ...) partitioned by (first-level partition field name partition field type, second-level partition field name partition field type, ...);
Automatically generate partition directories and insert data: load data [local] inpath 'file path' into table partition table name partition (first-level partition field name='value', second-level partition field name='value', ...);
Note: with local, the file path is a Linux local path; without it, it is an HDFS path.
Example
-- 2. Multi-level partition table
-- Create table
create table more_part_order(
    oid string,
    name string,
    price float,
    num int
) partitioned by (year string, month string, day string)
row format delimited fields terminated by ' ';
-- Insert data
load data inpath '/source/order202251.txt' into table more_part_order partition (year='2022',month='2022-05',day='2022-05-01');
load data inpath '/source/order2023415.txt' into table more_part_order partition (year='2023',month='2023-04',day='2023-04-15');
load data inpath '/source/order202351.txt' into table more_part_order partition (year='2023',month='2023-05',day='2023-05-01');
load data inpath '/source/order202352.txt' into table more_part_order partition (year='2023',month='2023-05',day='2023-05-02');
-- Verify the data
select * from more_part_order;
-- Benefit of partitioned tables: avoid full table scans and improve query efficiency
-- Requirement: total product sales in 2023
select sum(price*num) from more_part_order where year='2023';
-- Requirement: total product sales in May 2023
select sum(price*num) from more_part_order where month='2023-05';
-- Requirement: total product sales on May 1, 2023
select sum(price*num) from more_part_order where day='2023-05-01';
Partition table operations
Add partition: alter table partition table name add partition (partition field name='value', ...);
Delete partition: alter table partition table name drop partition (partition field name='value', ...);
Rename partition: alter table partition table name partition (partition field name='old value', ...) rename to partition (partition field name='new value', ...);
View all partitions: show partitions partition table name;
Synchronize/repair partitions: msck repair table partition table name;
Example
-- Add a partition (essentially creates a partition directory on HDFS)
alter table one_part_order add partition (year=2024);
alter table more_part_order add partition (year=2024,month='2024-05',day='2024-05-01');
-- Rename a partition (essentially renames the partition directory on HDFS)
alter table one_part_order partition (year=2024) rename to partition (year=2025);
-- The original 2024 directory and its 2024-05 subdirectory are retained; only the
-- 2024-05-01 partition directory is renamed, and a new partition directory
-- 2025/2025-05/2025-05-01 is created.
alter table more_part_order partition (year=2024,month='2024-05',day='2024-05-01') rename to partition (year=2025,month='2025-05',day='2025-05-01');
-- View all partitions
show partitions one_part_order;
show partitions more_part_order;
-- Delete partitions
alter table one_part_order drop partition (year=2025);                                    -- partition directory 2025 is deleted
alter table more_part_order drop partition (year=2023,month='2023-05',day='2023-05-01'); -- 2023-05-01 is deleted
alter table more_part_order drop partition (year=2023,month='2023-05');                  -- 2023-05 is deleted
alter table more_part_order drop partition (year=2023);                                  -- 2023 is deleted
-- If a folder that conforms to the partition directory format is created directly on
-- HDFS, msck repair can register it as a partition.
-- Example: manually create a year=2033 directory, then run:
msck repair table one_part_order;
msck repair table more_part_order;
-- View all partitions again after the repair
show partitions one_part_order;
show partitions more_part_order;
hadoop_hive documentation
Hive documentation: https://cwiki.apache.org/confluence/display/Hive/Configuration+Properties
HDFS documentation: https://hadoop.apache.org/docs/stable/hadoop-project-dist/hadoop-hdfs/hdfs-default.xml
YARN documentation: https://hadoop.apache.org/docs/stable/hadoop-yarn/hadoop-yarn-common/yarn-default.xml
MapReduce documentation: https://hadoop.apache.org/docs/stable/hadoop-mapreduce-client/hadoop-mapreduce-client-core/mapred-default.xml
Bucket table
Bucket table features/benefits: data is split into bucket files, which improves efficiency for specific operations during queries (filtering, joining, grouping and sampling).
Note on efficiency: if the bucket field is not used when querying, efficiency is unchanged.
Note on bucketing field names: the bucketing field must be an existing column, because bucketing takes a remainder of that column's value and puts rows with the same remainder into the same bucket file.
Important parameters
-- hive.enforce.bucketing: enabled by default; the parameter was removed in Hive 2.x (bucketing is always enforced)
set hive.enforce.bucketing;        -- view: undefined, because it has been removed
set hive.enforce.bucketing=true;   -- modify
-- Check the number of reducers
-- Parameter priority: set command > Hive defaults > Hadoop defaults
set mapreduce.job.reduces;         -- view: the default is -1, meaning the number of reducers automatically matches the number of buckets
set mapreduce.job.reduces=3;       -- set the parameter
Basic bucket table
Knowledge points
Create a basic bucket table:
create [external] table [if not exists] table name (
    field name field type, ...
)
clustered by (bucket field name)
into number-of-buckets buckets;
Example
-- 1. Create a basic bucket table with 3 buckets
create table course(
    id int,
    subject string,
    name string
) clustered by (id) into 3 buckets
row format delimited fields terminated by '\t';
-- Load data
load data inpath '/source/course.txt' into table course;
-- Verify the data
select * from course;
Bucket table sorting
Knowledge points
Create a bucket table that is sorted within each bucket:
create [external] table [if not exists] table name (
    field name field type, ...
)
clustered by (bucket field name)
sorted by (sort field name asc|desc)  -- Note: asc = ascending (default), desc = descending
into number-of-buckets buckets;
Example
-- 1. Create a bucket table with 3 buckets, sorted by id in descending order within each bucket
create table course_sort(
    id int,
    subject string,
    name string
) clustered by (id) sorted by (id desc) into 3 buckets
row format delimited fields terminated by '\t';
-- Load data
load data inpath '/source/course.txt' into table course_sort;
-- Verify the data
select * from course_sort;
Bucketing principle
Bucketing principle:
If the bucket field is a numeric type: the value itself is taken modulo the number of buckets.
If the bucket field is a string type: the underlying layer first computes a number from the string with a hash function, then takes that number modulo the number of buckets.
Hash: a hash function maps a value to a number. We will not discuss its internals; the only property we need is that the same input always produces the same output.
For example: suppose the string 'sisi' hashes to 3530540 (just for illustration); then no matter how many times it is computed, hashing 'sisi' will always yield 3530540.
Compute the remainder: hash('sisi') % 3 == 2, so no matter how many times it is computed, every 'sisi' row lands in the same bucket.
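The modulo rule above can be sketched in Python. This is only an illustration: the Java-style string hash and the sign-bit masking below are assumptions about the underlying behavior, not a guaranteed match for what Hive actually runs; note that this emulation does reproduce 3530540 for 'sisi', the value used in the example above.

```python
def java_string_hashcode(s: str) -> int:
    """Emulate Java's String.hashCode: h = 31*h + ch over each character,
    kept within 32-bit signed integer range (an assumption about Hive's string hash)."""
    h = 0
    for ch in s:
        h = (31 * h + ord(ch)) & 0xFFFFFFFF
    return h - 0x100000000 if h >= 0x80000000 else h

def bucket_number(value, num_buckets: int) -> int:
    """Assign a value to a bucket: numeric values are used directly, strings are
    hashed first; the sign bit is masked so the result is non-negative, then
    the remainder modulo the number of buckets picks the bucket file."""
    h = value if isinstance(value, int) else java_string_hashcode(value)
    return (h & 0x7FFFFFFF) % num_buckets

print(java_string_hashcode("sisi"))   # 3530540 -- same input, same output, every time
print(bucket_number("sisi", 3))       # 2       -- matches hash('sisi') % 3 == 2 above
print(bucket_number(10, 3))           # 1       -- numeric field: 10 % 3
```

Running `bucket_number` repeatedly on the same value always returns the same bucket, which is exactly the determinism the note above relies on.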
The difference between partition table and bucket table
Partition Table
Keyword used when creating the table: partitioned by (partition field name partition field type)
Notes on partition field names: it is a new field; its type must be specified, and it must not share a name with an existing column.
Advantage of partitioned tables: when the partition field is used as a filter condition, the underlying layer goes straight to the corresponding partition directory, avoiding a full table scan and improving query efficiency.
The most direct effect of a partitioned table: under the HDFS table directory, data is split into multiple partition directories (year=xxxx, month=xx, day=xx)
Uploading files directly under the table's root path on HDFS is not recommended: the partitioned table cannot recognize the data in such files, because it looks for data files inside the partition directories.
Loading files from HDFS with load: the essence is to move the files into the corresponding partition directory.
Bucket table
Use keywords when creating a table: clustered by (bucket field name) into bucket number buckets
Notes on bucketing field names: it must be an existing column, so its type does not need to be specified.
Benefit of bucket tables: when the bucket field is used for specific operations such as filtering, joining, grouping and sampling, efficiency also improves.
The most direct effect of a bucket table: in the HDFS table directory (or partition directory), data is split into multiple bucket files (000000_0, 000001_0, 000002_0, ...)
Uploading files directly under the table's root path on HDFS is not recommended: the bucket table can read the data in such files, but there is no bucketing effect.
Loading files from HDFS with load: the essence is to copy the data into the individual bucket files.
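As a toy illustration of the bucket-file layout described above, the following Python sketch (the sample rows are made up) distributes rows into bucket "files" named the way the text describes, reusing the numeric modulo rule from the bucketing principle section:

```python
# Toy simulation: distribute rows into 3 bucket files, as a table
# clustered by (id) into 3 buckets would (numeric field: value % bucket count).
rows = [(1, "math"), (2, "english"), (3, "physics"), (4, "chemistry"), (6, "art")]
num_buckets = 3

buckets = {i: [] for i in range(num_buckets)}
for row in rows:
    buckets[row[0] % num_buckets].append(row)  # same remainder -> same bucket file

for i in range(num_buckets):
    # Bucket files on HDFS are named 000000_0, 000001_0, 000002_0, ...
    print(f"{i:06d}_0 -> {buckets[i]}")
```

Rows with ids 3 and 6 share remainder 0 and land in the same bucket file, which is why filtering or joining on the bucket field only needs to touch one file per value.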