Use Python's Faker to generate large csv test data files and import them into the Hive data warehouse

Contents

  • 1. Python generates data
    • 1.1 Code Description
    • 1.2 Code Reference
  • 2. Data migration
    • 2.1 Upload from local machine to server
    • 2.2 Check source data format
    • 2.3 Check size and upload to HDFS
  • 3. Beeline table creation
    • 3.1 Create a test table and import test data
    • 3.2 Table creation output
  • 4. Processing the first row (column names) of the csv files
    • 4.1 Create a new table
    • 4.2 Filter out the header row and insert the old table's data into the new table

1. Python generates data

1.1 Code Description

This Python code generates simulated personal information and saves it as a series of CSV files.

  1. Import the necessary modules:

    • csv: module for processing CSV files.
    • random: module for generating random numbers.
    • faker: library for generating simulated data.
  2. Define the basic information needed to generate the data:

    • file_base_path: The base path of the generated CSV file.
    • rows_per_file: The number of rows contained in each CSV file.
    • num_rows: The total number of rows to generate.
    • fake: Create an instance of faker.Faker() for generating simulated data.
  3. Define dictionaries of simulated data:

    • nationalities: Contains nationality codes and corresponding countries.
    • regions: Contains region codes and corresponding region names.
    • source_codes: Contains a set of source codes.
  4. Use a counter, row_counter, to keep track of the number of rows generated.

  5. Use a loop to generate multiple CSV files, each containing rows_per_file rows of data.

  6. In each file, generate random personal information data and write it to a CSV file.

  7. During generation, progress is printed after each file, i.e. every 10,000 rows.

  8. After all data is generated, print the total number of rows generated.

This code uses the Faker library to generate simulated personal information. Each CSV file contains rows_per_file rows of data, with the fields Rowkey, Name, Age, Email, Address, IDNumber, PhoneNumber, Nationality, Region, and SourceCode. Note that the header row is written only to the first file (while row_counter is still 0); the remaining files contain data rows only, which matters when the files are loaded into Hive later.

1.2 Code Reference

import csv
import os
import random
import faker

# Base path for the generated CSV files
file_base_path = './output/personal_info_extended'
# Number of rows per file
rows_per_file = 10000
# Total number of rows to generate
num_rows = 10000000

# Make sure the output directory exists before writing any files
os.makedirs(os.path.dirname(file_base_path), exist_ok=True)

# Create a Faker instance
fake = faker.Faker()

#Define data dictionary
nationalities = {
    1: 'US',
    2: 'CA',
    3: 'UK',
    4: 'AU',
    5: 'FR',
    6: 'DE',
    7: 'JP',
}

regions = {
    1: 'North',
    2: 'South',
    3: 'East',
    4: 'West',
    5: 'Central',
}

source_codes = ['A123', 'B456', 'C789', 'D101', 'E202']

# Counter to keep track of the number of rows generated
row_counter = 0

# Loop to generate data files
for file_number in range(1, num_rows // rows_per_file + 1):
    file_path = f"{file_base_path}_{file_number}.csv"

    # Open CSV file to write data
    with open(file_path, 'w', newline='') as csvfile:
        csv_writer = csv.writer(csvfile)

        # Write the header row of the CSV file
        if row_counter == 0:
            csv_writer.writerow(['Rowkey', 'Name', 'Age', 'Email', 'Address', 'IDNumber', 'PhoneNumber', 'Nationality', 'Region', 'SourceCode'])

        # Generate and write extended personal information simulation data with the specified number of rows
        for _ in range(rows_per_file):
            name = fake.name()
            age = random.randint(18, 99)
            email = fake.email()
            address = fake.address().replace('\n', ' ')  # Replace newlines so each address stays on one line
            id_number = fake.ssn()
            phone_number = fake.phone_number()
            nationality_code = random.randint(1, len(nationalities))
            nationality = nationalities[nationality_code]
            region_code = random.randint(1, len(regions))
            region = regions[region_code]
            source_code = random.choice(source_codes)

            data_row = [row_counter + 1, name, age, email, address, id_number, phone_number, nationality, region, source_code]
            csv_writer.writerow(data_row)

            row_counter += 1

    print(f'{row_counter} rows of data have been generated')

print(f'{num_rows} rows of extended personal information simulation data have been generated')


2. Data migration

2.1 Upload from local machine to server

[root@hadoop10 personInfo]# pwd
/opt/data/personInfo
[root@hadoop10 personInfo]# ls -l| wc -l
215
[root@hadoop10 personInfo]# wc -l *
...
    10000 personal_info_extended_98.csv
    10000 personal_info_extended_99.csv
    10000 personal_info_extended_9.csv
  2131609 total

The output shows the generated csv files, which have been uploaded to the /opt/data/personInfo directory and contain 2,131,609 lines in total (the ls -l | wc -l count of 215 includes ls's own summary line).

2.2 Check source data format

[root@hadoop10 personInfo]# head personal_info_extended_1.csv
Rowkey,Name,Age,Email,Address,IDNumber,PhoneNumber,Nationality,Region,SourceCode
1,Hayley Jimenez,58,[email protected],"92845 Davis Circles Apt. 198 East Jerryshire, NV 35424",657-35-2900,(141)053-9917,DE,North,C789
2,Amy Johnson,23,[email protected],"119 Manning Rapids Suite 557 New Randyburgh, MN 58113",477-76-9570,+1-250-531-6115,UK,North,D101
3,Sara Harper,31,[email protected],"98447 Robinson Dale Garzatown, ME 35917",254-77-4980,7958192189,AU,East,A123
4,Alicia Wang,53,[email protected],"531 Lucas Vista New Laura, MO 62148",606-19-1971,001-295-093-9174x819,DE,West,C789
5,Lauren Rodriguez,71,[email protected],"060 Gomez Ports Suite 355 Lake Aarontown, CO 38284",186-61-7463,8458236624,DE,East,E202
6,Juan Harris,98,[email protected],"50325 Alvarez Forge Apt. 800 New Ericchester, AL 16131",529-53-1492,+1-302-675-5810,CA,East,B456
7,Stephanie Price,90,[email protected],"9668 Erik Inlet Port Joshua, MO 62524",303-11-9577,628.011.4670,UK,East,C789
8,Nicole Parker,61,[email protected],"485 Elliott Branch Scottshire, NJ 03885",473-55-5636,001-625-925-3712x952,FR,West,A123
9,Joel Young,54,[email protected],"9413 Houston Flats Apt. 095 West Peggy, MD 56240",547-31-2815,920.606.0727x27740,JP,Central,E202

The head command shows the file's first row of field names; from these fields we can write the table creation statement.

2.3 Check size and upload to HDFS

[root@hadoop10 data]# du -h
282M    ./personInfo
282M    .
[root@hadoop10 data]# hdfs dfs -put /opt/data/personInfo /testdir/

[root@hadoop10 data]# hdfs dfs -du -h /testdir/
281.4 M 281.4 M /testdir/personInfo

The files occupy 282M on the local Linux filesystem; after uploading to the /testdir/ directory on the HDFS cluster, they occupy 281.4M.

3. Beeline table creation

3.1 Create a test table and import test data

CREATE TABLE personal_info (
    Rowkey STRING,
    Name STRING,
    Age STRING,
    Email STRING,
    Address STRING,
    IDNumber STRING,
    PhoneNumber STRING,
    Nationality STRING,
    Region STRING,
    SourceCode STRING
)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ','
STORED AS TEXTFILE;

LOAD DATA INPATH '/testdir/personInfo/*.csv' INTO TABLE personal_info;

If every csv file had a header row of column names, you would add TBLPROPERTIES ("skip.header.line.count"="1") at the end of the table creation statement; this makes Hive skip the first line of each file, as sketched below.
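A minimal sketch of such a table definition (the table name personal_info_with_header is hypothetical; this statement is not part of this article's workflow):

CREATE TABLE personal_info_with_header (
    Rowkey STRING,
    Name STRING,
    Age STRING,
    Email STRING,
    Address STRING,
    IDNumber STRING,
    PhoneNumber STRING,
    Nationality STRING,
    Region STRING,
    SourceCode STRING
)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ','
STORED AS TEXTFILE
-- Skip the first line of every file loaded into this table
TBLPROPERTIES ("skip.header.line.count"="1");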

In this case, because the files were generated by the Python script above, only the first csv file has column names and the rest do not, so this property is not used here. We will deal with that single header row separately later.

3.2 Table creation output

0: jdbc:hive2://hadoop10:10000> CREATE TABLE personal_info (
. . . . . . . . . . . . . . . . . > Rowkey STRING,
. . . . . . . . . . . . . . . . . > Name STRING,
. . . . . . . . . . . . . . . . . > Age STRING,
. . . . . . . . . . . . . . . . . > Email STRING,
. . . . . . . . . . . . . . . . . > Address STRING,
. . . . . . . . . . . . . . . . . > IDNumber STRING,
. . . . . . . . . . . . . . . . . > PhoneNumber STRING,
. . . . . . . . . . . . . . . . . > Nationality STRING,
. . . . . . . . . . . . . . . . . > Region STRING,
. . . . . . . . . . . . . . . . . > SourceCode STRING
. . . . . . . . . . . . . . . . . > )
. . . . . . . . . . . . . . . . . > ROW FORMAT DELIMITED
. . . . . . . . . . . . . . . . . > FIELDS TERMINATED BY ','
. . . . . . . . . . . . . . . . . > STORED AS TEXTFILE;
No rows affected (0.147 seconds)
0: jdbc:hive2://hadoop10:10000> LOAD DATA INPATH '/testdir/personInfo/*.csv' INTO TABLE personal_info;
No rows affected (2.053 seconds)
0: jdbc:hive2://hadoop10:10000> select * from personal_info limit 5;
+----------------------+--------------------+-------------------+---------------------+-----------------------------------------------+------------------------+---------------------------+---------------------------+----------------------+--------------------------+
| personal_info.rowkey | personal_info.name | personal_info.age | personal_info.email | personal_info.address                         | personal_info.idnumber | personal_info.phonenumber | personal_info.nationality | personal_info.region | personal_info.sourcecode |
+----------------------+--------------------+-------------------+---------------------+-----------------------------------------------+------------------------+---------------------------+---------------------------+----------------------+--------------------------+
| Rowkey               | Name               | Age               | Email               | Address                                       | IDNumber               | PhoneNumber               | Nationality               | Region               | SourceCode               |
| 1                    | Hayley Jimenez     | 58                | [email protected]  | "92845 Davis Circles Apt. 198 East Jerryshire | NV 35424"              | 657-35-2900               | (141)053-9917             | DE                   | North                    |
| 2                    | Amy Johnson        | 23                | [email protected]  | "119 Manning Rapids Suite 557 New Randyburgh  | MN 58113"              | 477-76-9570               | +1-250-531-6115           | UK                   | North                    |
| 3                    | Sara Harper        | 31                | [email protected]  | "98447 Robinson Dale Garzatown                | ME 35917"              | 254-77-4980               | 7958192189                | AU                   | East                     |
| 4                    | Alicia Wang        | 53                | [email protected]  | "531 Lucas Vista New Laura                    | MO 62148"              | 606-19-1971               | 001-295-093-9174x819      | DE                   | West                     |
+----------------------+--------------------+-------------------+---------------------+-----------------------------------------------+------------------------+---------------------------+---------------------------+----------------------+--------------------------+
5 rows selected (0.52 seconds)

4. Processing the first row (column names) of the csv files

4.1 Create a new table

The solution is to select the entire old table, filter out the header row, insert the result into a new table, and then drop the old table. If this method is used in a production environment, machine performance and storage capacity should be considered first.

CREATE TABLE pinfo (
    Rowkey STRING,
    Name STRING,
    Age STRING,
    Email STRING,
    Address STRING,
    IDNumber STRING,
    PhoneNumber STRING,
    Nationality STRING,
    Region STRING,
    SourceCode STRING
)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ','
STORED AS TEXTFILE;

Query the number of rows in the old table.

0: jdbc:hive2://hadoop10:10000> select count(*) from personal_info;
+----------+
|   _c0    |
+----------+
| 2131609  |
+----------+
1 row selected (45.762 seconds)
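Before filtering, a quick sanity check (not one of the original steps, just plain HiveQL) confirms how many header rows ended up in the table. Since only the first csv file was written with column names, it should return exactly 1:

SELECT count(*) FROM personal_info WHERE Name = 'Name';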

4.2 Filter out the header row and insert the old table's data into the new table

INSERT OVERWRITE TABLE pinfo
SELECT
    t.Rowkey,
    t.Name,
    t.Age,
    t.Email,
    t.Address,
    t.IDNumber,
    t.PhoneNumber,
    t.Nationality,
    t.Region,
    t.SourceCode
FROM (
    SELECT
        Rowkey,
        Name,
        Age,
        Email,
        Address,
        IDNumber,
        PhoneNumber,
        Nationality,
        Region,
        SourceCode
    FROM personal_info
) t
WHERE t.Name != 'Name';
0: jdbc:hive2://hadoop10:10000> select * from pinfo limit 5;
+--------------+------------------+-----------+--------------------+-----------------------------------------------+----------------+----------------------+----------------------+--------------+------------------+
| pinfo.rowkey | pinfo.name       | pinfo.age | pinfo.email        | pinfo.address                                 | pinfo.idnumber | pinfo.phonenumber    | pinfo.nationality    | pinfo.region | pinfo.sourcecode |
+--------------+------------------+-----------+--------------------+-----------------------------------------------+----------------+----------------------+----------------------+--------------+------------------+
| 1            | Hayley Jimenez   | 58        | [email protected] | "92845 Davis Circles Apt. 198 East Jerryshire | NV 35424"      | 657-35-2900          | (141)053-9917        | DE           | North            |
| 2            | Amy Johnson      | 23        | [email protected] | "119 Manning Rapids Suite 557 New Randyburgh  | MN 58113"      | 477-76-9570          | +1-250-531-6115      | UK           | North            |
| 3            | Sara Harper      | 31        | [email protected] | "98447 Robinson Dale Garzatown                | ME 35917"      | 254-77-4980          | 7958192189           | AU           | East             |
| 4            | Alicia Wang      | 53        | [email protected] | "531 Lucas Vista New Laura                    | MO 62148"      | 606-19-1971          | 001-295-093-9174x819 | DE           | West             |
| 5            | Lauren Rodriguez | 71        | [email protected] | "060 Gomez Ports Suite 355 Lake Aarontown     | CO 38284"      | 186-61-7463          | 8458236624           | DE           | East             |
+--------------+------------------+-----------+--------------------+-----------------------------------------------+----------------+----------------------+----------------------+--------------+------------------+
5 rows selected (0.365 seconds)
0: jdbc:hive2://hadoop10:10000>


You can check the progress of the insert job in YARN.

Finally, when the new table's row count is exactly one less than the old table's (the filtered header row), the insertion has completed successfully.

0: jdbc:hive2://hadoop10:10000> select count(*) from pinfo;
+----------+
|   _c0    |
+----------+
| 2131608  |
+----------+
1 row selected (0.291 seconds)
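The new table is exactly one row short of the old table (2,131,609 - 1 header row = 2,131,608), so the filtered copy is complete. Following the plan in section 4.1, which ends with deleting the old table, the final step would simply be:

DROP TABLE personal_info;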