Article directory

- 1. Python generates data
  - 1.1 Code description
  - 1.2 Code reference
- 2. Data migration
  - 2.1 Upload from local machine to server
  - 2.2 Check source data format
  - 2.3 Check size and upload to HDFS
- 3. Beeline table creation
  - 3.1 Create a test table and import test data
  - 3.2 Table creation output
- 4. Handling the header row of the CSV files
  - 4.1 Create a new table
  - 4.2 Filter out the header row and insert into the new table
1. Python generates data
1.1 Code Description
This Python code generates simulated personal information and saves it as a series of CSV files.

- Import the necessary modules:
  - `csv`: for reading and writing CSV files.
  - `random`: for generating random numbers.
  - `faker`: a library for generating simulated data.
- Define the basic parameters:
  - `file_base_path`: the base path of the generated CSV files.
  - `rows_per_file`: the number of rows in each CSV file.
  - `num_rows`: the total number of rows to generate.
  - `fake`: a `faker.Faker()` instance used to generate the simulated data.
- Define the dictionaries of simulated data:
  - `nationalities`: nationality codes and the corresponding countries.
  - `regions`: region codes and the corresponding region names.
  - `source_codes`: a list of source codes.
- Use a counter, `row_counter`, to keep track of the number of rows generated.
- Use a loop to generate multiple CSV files, each containing `rows_per_file` rows of data.
- In each file, generate random personal information and write it to the CSV file.
- Print the progress every 10,000 rows during generation.
- After all data is generated, print the total number of rows.

The code uses the Faker library to generate the simulated personal information. Each CSV file contains a fixed number of rows, and the data fields are Rowkey, Name, Age, Email, Address, IDNumber, PhoneNumber, Nationality, Region, and SourceCode.
1.2 Code Reference
```python
import csv
import random

import faker

# File base path
file_base_path = './output/personal_info_extended'
# Number of rows per file
rows_per_file = 10000
# Total number of rows
num_rows = 10000000

# Create Faker instance
fake = faker.Faker()

# Define data dictionaries
nationalities = {
    1: 'US', 2: 'CA', 3: 'UK', 4: 'AU', 5: 'FR', 6: 'DE', 7: 'JP',
}
regions = {
    1: 'North', 2: 'South', 3: 'East', 4: 'West', 5: 'Central',
}
source_codes = ['A123', 'B456', 'C789', 'D101', 'E202']

# Counter to keep track of the number of rows generated
row_counter = 0

# Loop to generate the data files
for file_number in range(1, num_rows // rows_per_file + 1):
    file_path = f"{file_base_path}_{file_number}.csv"
    # Open the CSV file to write data
    with open(file_path, 'w', newline='') as csvfile:
        csv_writer = csv.writer(csvfile)
        # Write the header row only once, at the start of the first file
        if row_counter == 0:
            csv_writer.writerow(['Rowkey', 'Name', 'Age', 'Email', 'Address',
                                 'IDNumber', 'PhoneNumber', 'Nationality',
                                 'Region', 'SourceCode'])
        # Generate and write the specified number of rows of simulated data
        for _ in range(rows_per_file):
            name = fake.name()
            age = random.randint(18, 99)
            email = fake.email()
            # Replace newlines in the address so each record stays on one line
            address = fake.address().replace('\n', ' ')
            id_number = fake.ssn()
            phone_number = fake.phone_number()
            nationality_code = random.randint(1, len(nationalities))
            nationality = nationalities[nationality_code]
            region_code = random.randint(1, len(regions))
            region = regions[region_code]
            source_code = random.choice(source_codes)
            data_row = [row_counter + 1, name, age, email, address, id_number,
                        phone_number, nationality, region, source_code]
            csv_writer.writerow(data_row)
            row_counter += 1
    # Report progress after each file, i.e. every 10,000 rows
    print(f'{row_counter} rows of data have been generated')

print(f'{num_rows} rows of extended personal information simulation data have been generated')
```
2. Data migration
2.1 Upload from local machine to server
```shell
[root@hadoop10 personInfo]# pwd
/opt/data/personInfo
[root@hadoop10 personInfo]# ls -l | wc -l
215
[root@hadoop10 personInfo]# wc -l *
...
10000 personal_info_extended_98.csv
10000 personal_info_extended_99.csv
10000 personal_info_extended_9.csv
2131609 total
```
The output shows that the 215 generated CSV files have been uploaded to the /opt/data/personInfo directory on the server; together they contain 2,131,609 rows.
2.2 Check source data format
```shell
[root@hadoop10 personInfo]# head personal_info_extended_1.csv
Rowkey,Name,Age,Email,Address,IDNumber,PhoneNumber,Nationality,Region,SourceCode
1,Hayley Jimenez,58,[email protected],"92845 Davis Circles Apt. 198 East Jerryshire, NV 35424",657-35-2900,(141)053-9917,DE,North,C789
2,Amy Johnson,23,[email protected],"119 Manning Rapids Suite 557 New Randyburgh, MN 58113",477-76-9570,+1-250-531-6115,UK,North,D101
3,Sara Harper,31,[email protected],"98447 Robinson Dale Garzatown, ME 35917",254-77-4980,7958192189,AU,East,A123
4,Alicia Wang,53,[email protected],"531 Lucas Vista New Laura, MO 62148",606-19-1971,001-295-093-9174x819,DE,West,C789
5,Lauren Rodriguez,71,[email protected],"060 Gomez Ports Suite 355 Lake Aarontown, CO 38284",186-61-7463,8458236624,DE,East,E202
6,Juan Harris,98,[email protected],"50325 Alvarez Forge Apt. 800 New Ericchester, AL 16131",529-53-1492,+1-302-675-5810,CA,East,B456
7,Stephanie Price,90,[email protected],"9668 Erik Inlet Port Joshua, MO 62524",303-11-9577,628.011.4670,UK,East,C789
8,Nicole Parker,61,[email protected],"485 Elliott Branch Scottshire, NJ 03885",473-55-5636,001-625-925-3712x952,FR,West,A123
9,Joel Young,54,[email protected],"9413 Houston Flats Apt. 095 West Peggy, MD 56240",547-31-2815,920.606.0727x27740,JP,Central,E202
```
Use the `head` command to inspect the beginning of the file and find the header fields in the first row. These fields can be used to write the table creation statement.
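As a sketch of this step, the header fields shown by `head` can be turned into the table creation statement programmatically. The helper below is hypothetical (not part of the original workflow) and treats every column as STRING, as the article does:

```python
# Hypothetical helper: derive a Hive CREATE TABLE statement from a CSV header line.
def build_create_table(table_name, header_line, delimiter=','):
    columns = [f"    {name.strip()} STRING" for name in header_line.split(delimiter)]
    return (
        f"CREATE TABLE {table_name} (\n"
        + ",\n".join(columns)
        + "\n)\nROW FORMAT DELIMITED\n"
        + f"FIELDS TERMINATED BY '{delimiter}'\n"
        + "STORED AS TEXTFILE;"
    )

header = "Rowkey,Name,Age,Email,Address,IDNumber,PhoneNumber,Nationality,Region,SourceCode"
print(build_create_table("personal_info", header))
```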
2.3 Check size and upload to HDFS
```shell
[root@hadoop10 data]# du -h
282M    ./personInfo
282M    .
[root@hadoop10 data]# hdfs dfs -put /opt/data/personInfo /testdir/
[root@hadoop10 data]# hdfs dfs -du -h /testdir/
281.4 M  281.4 M  /testdir/personInfo
```
The data occupies 282M on the local Linux filesystem, and 281.4M after being uploaded to the /testdir/ directory on the HDFS cluster.
3. Beeline table creation
3.1 Create a test table and import test data
```sql
CREATE TABLE personal_info (
    Rowkey STRING,
    Name STRING,
    Age STRING,
    Email STRING,
    Address STRING,
    IDNumber STRING,
    PhoneNumber STRING,
    Nationality STRING,
    Region STRING,
    SourceCode STRING
)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ','
STORED AS TEXTFILE;

LOAD DATA INPATH '/testdir/personInfo/*.csv' INTO TABLE personal_info;
```
If every CSV file begins with the same header row, add TBLPROPERTIES ("skip.header.line.count"="1") to the end of the table creation statement so that the first line of each file is skipped.
In this case, because the files were generated by the Python script above, only the first CSV file contains a header row; the remaining files do not. We will handle that single header row separately later.
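One way to confirm that only the first file carries a header is to check the first line of each generated file. This is a minimal sketch; the directory layout in the commented example is an assumption:

```python
import csv

# The header written by the generation script.
EXPECTED_HEADER = ['Rowkey', 'Name', 'Age', 'Email', 'Address', 'IDNumber',
                   'PhoneNumber', 'Nationality', 'Region', 'SourceCode']

def has_header(path):
    """Return True if the first row of the CSV file at `path` is the header row."""
    with open(path, newline='') as f:
        first_row = next(csv.reader(f), None)
    return first_row == EXPECTED_HEADER

# Example usage (the glob pattern is an assumption about the local layout):
# import glob
# with_header = [p for p in glob.glob('./output/personal_info_extended_*.csv')
#                if has_header(p)]
# print(with_header)  # expected to contain only the first generated file
```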
3.2 Table creation output
```
0: jdbc:hive2://hadoop10:10000> CREATE TABLE personal_info (
. . . . . . . . . . . . . . . .> Rowkey STRING,
. . . . . . . . . . . . . . . .> Name STRING,
. . . . . . . . . . . . . . . .> Age STRING,
. . . . . . . . . . . . . . . .> Email STRING,
. . . . . . . . . . . . . . . .> Address STRING,
. . . . . . . . . . . . . . . .> IDNumber STRING,
. . . . . . . . . . . . . . . .> PhoneNumber STRING,
. . . . . . . . . . . . . . . .> Nationality STRING,
. . . . . . . . . . . . . . . .> Region STRING,
. . . . . . . . . . . . . . . .> SourceCode STRING
. . . . . . . . . . . . . . . .> )
. . . . . . . . . . . . . . . .> ROW FORMAT DELIMITED
. . . . . . . . . . . . . . . .> FIELDS TERMINATED BY ','
. . . . . . . . . . . . . . . .> STORED AS TEXTFILE;
No rows affected (0.147 seconds)
```
```
0: jdbc:hive2://hadoop10:10000> LOAD DATA INPATH '/testdir/personInfo/*.csv' INTO TABLE personal_info;
No rows affected (2.053 seconds)
```
```
0: jdbc:hive2://hadoop10:10000> select * from personal_info limit 5;
+-----------------------+---------------------+--------------------+----------------------+------------------------------------------------+-------------------------+----------------------------+----------------------------+-----------------------+---------------------------+
| personal_info.rowkey  | personal_info.name  | personal_info.age  | personal_info.email  | personal_info.address                          | personal_info.idnumber  | personal_info.phonenumber  | personal_info.nationality  | personal_info.region  | personal_info.sourcecode  |
+-----------------------+---------------------+--------------------+----------------------+------------------------------------------------+-------------------------+----------------------------+----------------------------+-----------------------+---------------------------+
| Rowkey                | Name                | Age                | Email                | Address                                        | IDNumber                | PhoneNumber                | Nationality                | Region                | SourceCode                |
| 1                     | Hayley Jimenez      | 58                 | [email protected]  | "92845 Davis Circles Apt. 198 East Jerryshire  | NV 35424"               | 657-35-2900                | (141)053-9917              | DE                    | North                     |
| 2                     | Amy Johnson         | 23                 | [email protected]  | "119 Manning Rapids Suite 557 New Randyburgh   | MN 58113"               | 477-76-9570                | +1-250-531-6115            | UK                    | North                     |
| 3                     | Sara Harper         | 31                 | [email protected]  | "98447 Robinson Dale Garzatown                 | ME 35917"               | 254-77-4980                | 7958192189                 | AU                    | East                      |
| 4                     | Alicia Wang         | 53                 | [email protected]  | "531 Lucas Vista New Laura                     | MO 62148"               | 606-19-1971                | 001-295-093-9174x819       | DE                    | West                      |
+-----------------------+---------------------+--------------------+----------------------+------------------------------------------------+-------------------------+----------------------------+----------------------------+-----------------------+---------------------------+
5 rows selected (0.52 seconds)
```
4. Handling the header row of the CSV files
4.1 Create a new table
The solution is to select the entire old table, insert the result into a new table, and then drop the old table. If this method is used in a production environment, machine performance and storage capacity should be taken into account.
```sql
CREATE TABLE pinfo (
    Rowkey STRING,
    Name STRING,
    Age STRING,
    Email STRING,
    Address STRING,
    IDNumber STRING,
    PhoneNumber STRING,
    Nationality STRING,
    Region STRING,
    SourceCode STRING
)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ','
STORED AS TEXTFILE;
```
Query the number of rows in the old table.
```
0: jdbc:hive2://hadoop10:10000> select count(*) from personal_info;
+----------+
|   _c0    |
+----------+
| 2131609  |
+----------+
1 row selected (45.762 seconds)
```
4.2 Filter out the header row and insert into the new table
```sql
INSERT OVERWRITE TABLE pinfo
SELECT
    t.Rowkey, t.Name, t.Age, t.Email, t.Address,
    t.IDNumber, t.PhoneNumber, t.Nationality, t.Region, t.SourceCode
FROM (
    SELECT Rowkey, Name, Age, Email, Address,
           IDNumber, PhoneNumber, Nationality, Region, SourceCode
    FROM personal_info
) t
WHERE t.Name != 'Name';
```
```
0: jdbc:hive2://hadoop10:10000> select * from pinfo limit 5;
+---------------+-------------------+------------+----------------------+------------------------------------------------+-----------------+--------------------+--------------------+---------------+-------------------+
| pinfo.rowkey  | pinfo.name        | pinfo.age  | pinfo.email          | pinfo.address                                  | pinfo.idnumber  | pinfo.phonenumber  | pinfo.nationality  | pinfo.region  | pinfo.sourcecode  |
+---------------+-------------------+------------+----------------------+------------------------------------------------+-----------------+--------------------+--------------------+---------------+-------------------+
| 1             | Hayley Jimenez    | 58         | [email protected]  | "92845 Davis Circles Apt. 198 East Jerryshire  | NV 35424"       | 657-35-2900        | (141)053-9917      | DE            | North             |
| 2             | Amy Johnson       | 23         | [email protected]  | "119 Manning Rapids Suite 557 New Randyburgh   | MN 58113"       | 477-76-9570        | +1-250-531-6115    | UK            | North             |
| 3             | Sara Harper       | 31         | [email protected]  | "98447 Robinson Dale Garzatown                 | ME 35917"       | 254-77-4980        | 7958192189         | AU            | East              |
| 4             | Alicia Wang       | 53         | [email protected]  | "531 Lucas Vista New Laura                     | MO 62148"       | 606-19-1971        | 001-295-093-9174x819 | DE          | West              |
| 5             | Lauren Rodriguez  | 71         | [email protected]  | "060 Gomez Ports Suite 355 Lake Aarontown      | CO 38284"       | 186-61-7463        | 8458236624         | DE            | East              |
+---------------+-------------------+------------+----------------------+------------------------------------------------+-----------------+--------------------+--------------------+---------------+-------------------+
5 rows selected (0.365 seconds)
0: jdbc:hive2://hadoop10:10000>
```
Check the progress of the insert job in YARN. When the new table's row count is exactly one less than the old table's, the insertion is complete and the header row has been filtered out.
```
0: jdbc:hive2://hadoop10:10000> select count(*) from pinfo;
+----------+
|   _c0    |
+----------+
| 2131608  |
+----------+
1 row selected (0.291 seconds)
```
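The effect of the `WHERE t.Name != 'Name'` filter can be mirrored in plain Python to see why exactly one row disappears: among all loaded rows, only the header row carries the literal string 'Name' in its Name column. A minimal sketch with hypothetical sample rows:

```python
def drop_header_rows(rows, name_index=1):
    """Keep only the rows whose Name column is not the literal header value 'Name'."""
    return [row for row in rows if row[name_index] != 'Name']

# Hypothetical sample: the header row was loaded into the table as ordinary data.
rows = [
    ['Rowkey', 'Name', 'Age'],
    ['1', 'Hayley Jimenez', '58'],
    ['2', 'Amy Johnson', '23'],
]
filtered = drop_header_rows(rows)
print(len(rows) - len(filtered))  # exactly one row (the header) is removed
```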