Batch-processing multiple Excel files with Python pandas: processing, statistics, and improvement ideas

Processing goal: Read the enterprise information of each city over a period of time, stored in more than 800 excels (each city has at least one excel of data), and count the number of enterprises founded in each secondary manufacturing industry in each city from 2012 to 2023.

Data size: more than 800 excels, 45 GB in total; a single excel ranges from 1 MB to 250 MB.

Requirements analysis: The two required dimensions, secondary manufacturing industry and year, map naturally onto the rows and columns of a pandas DataFrame. In addition, given the data volume, a notebook of ordinary performance needs a simplified processing pipeline to shorten the program's running time, with attention to the details of string handling and of input, processing, and output. Once the code is written, test it on a single excel first, then on several excels, then add some output (to monitor the program's progress), and finally run it on the full data to get the results.

Program design: To shorten the running time, all required data is processed in memory and written out to excel once at the end. Following the information-expert design principle, the year and manufacturing sub-industry counts are information owned by a city, so we need a City class; we also need functions to filter the qualifying data, a function to process a single excel, and a persistence function.

Let's first look at the function that processes a single excel. Because the excel columns come in two variants, before reading the whole file we first peek at the header with

pd.read_excel(file_path, engine='openpyxl', nrows=1).columns.tolist()

The columns are read out and saved as a list, which decides between the two processing branches. When reading excel data, a very effective way to improve efficiency is to read only the required columns: the usecols parameter of read_excel ensures that only the listed columns are read and stored in the DataFrame. filtered_df is built so that only the rows of the original df meeting the conditions are kept: .str[:4] takes only the first 4 characters, .fillna('0') fills blanks with '0', .astype(int) converts to an int, .between(2012, 2023) keeps only values in that range, & chains the next condition, and when a condition is written as a function, df[column].apply(function_name) applies it, passing each value in as the argument.
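As a minimal sketch of this filtering chain on made-up data (the rows below are illustrative, not from the real files, and the stand-in membership test only knows two industries):

```python
import pandas as pd

def is_manufacturing_industry(value):
    # Stand-in for the real membership test against the 31 sub-industries
    return value in {'Food manufacturing industry', 'Textile industry'}

df = pd.DataFrame({
    'Date of establishment': ['2015-03-01', '2008-11-20', None, '2023-06-15'],
    'Secondary industry classification': ['Food manufacturing industry', 'Textile industry',
                                          'Food manufacturing industry', 'Retail'],
})

# Year must fall in 2012-2023 AND the industry must be in the manufacturing set
year_ok = df['Date of establishment'].str[:4].fillna('0').astype(int).between(2012, 2023)
industry_ok = df['Secondary industry classification'].apply(is_manufacturing_industry)
filtered_df = df[year_ok & industry_ok]
print(len(filtered_df))  # only the 2015 row survives both filters
```

Note how the None date becomes '0' via fillna and is safely rejected by between, which is exactly why fillna sits in the middle of the chain.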

Then, with the industry classifications as columns and the years as rows, create a DataFrame with all elements initialized to 0; traverse filtered_df row by row, tallying each row into result_df (its year selects the row, its industry selects the column), and finally return result_df. That is the logic for processing a single excel.

# Process each excel
def data_analyse(file_path):
    print(file_path)
    cols = pd.read_excel(file_path, engine='openpyxl', nrows=1).columns.tolist()
    # Excel with a 'Secondary industry classification' column
    if 'Secondary industry classification' in cols:
        df = pd.read_excel(file_path, engine='openpyxl', usecols=['Date of establishment', 'Secondary industry classification'])  # Take only the two needed columns
        # Keep only 2012-2023 data in the 31 manufacturing sub-industries
        filtered_df = df[
            df['Date of establishment'].str[:4].fillna('0').astype(int).between(2012, 2023) & df['Secondary industry classification'].apply(
                is_manufacturing_industry)]
        # Total number of qualifying enterprises
        count = len(filtered_df)
        # Create a DataFrame with industries as columns, years as index and all elements 0
        index_years = list(range(2012, 2024))
        columns_names = manufacturing_industries
        result_df = pd.DataFrame(0, index=index_years, columns=columns_names)
        # Traverse each qualifying row and tally it in result_df
        for index, row in filtered_df.iterrows():
            year = int(row['Date of establishment'][:4])
            sub_industry = row['Secondary industry classification']
            result_df.at[year, sub_industry] += 1
        # Sum each row of result_df: the city's total manufacturing registrations per year
        result_df['annual total manufacturing volume'] = result_df.sum(axis=1)
        return result_df, count
    # No secondary classification column, only an 'Industry' column
    else:
        df = pd.read_excel(file_path, engine='openpyxl', usecols=['Date of establishment', 'Industry'])  # Take only the two needed columns
        # Keep only 2012-2023 data in the 31 manufacturing sub-industries
        filtered_df = df[
            df['Date of establishment'].str[:4].fillna('0').astype(int).between(2012, 2023) & df['Industry'].apply(
                is_manufacturing_industry)]
        # Total number of qualifying enterprises
        count = len(filtered_df)
        # Create a DataFrame with industries as columns, years as index and all elements 0
        index_years = list(range(2012, 2024))
        columns_names = manufacturing_industries
        result_df = pd.DataFrame(0, index=index_years, columns=columns_names)
        # Traverse each qualifying row and tally it in result_df
        for index, row in filtered_df.iterrows():
            year = int(row['Date of establishment'][:4])
            sub_industry = row['Industry']
            result_df.at[year, sub_industry] += 1
        # Sum each row of result_df: the city's total manufacturing registrations per year
        result_df['annual total manufacturing volume'] = result_df.sum(axis=1)
        return result_df, count
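An aside on running time: the iterrows tally above is the slowest part of this function. A vectorized alternative (a sketch, not the code this article uses) builds the same year × industry count matrix in one pass with pd.crosstab; the data below is hypothetical and uses only two industries for brevity:

```python
import pandas as pd

# Hypothetical filtered data: one row per qualifying enterprise
filtered_df = pd.DataFrame({
    'year': [2012, 2012, 2015, 2015, 2015],
    'industry': ['Textile industry', 'Food manufacturing industry',
                 'Textile industry', 'Textile industry', 'Food manufacturing industry'],
})
industries = ['Food manufacturing industry', 'Textile industry']

# crosstab counts every (year, industry) pair at once; reindex then pins the
# result onto the fixed 2012-2023 grid of industries, filling gaps with 0
result_df = (pd.crosstab(filtered_df['year'], filtered_df['industry'])
             .reindex(index=range(2012, 2024), columns=industries, fill_value=0))
print(result_df.loc[2015, 'Textile industry'])  # 2
```

This replaces the Python-level loop with a single pandas operation, which matters at the 250 MB-per-file scale.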

Now let's look at the main program.

city_flag is a switch variable that records whether two adjacent excels hold data from the same city. The check is done by check_cities, which inspects the excel filename and uses the global variable last_city to remember which city the previous excel's data belonged to. If the cities are the same, the result_df of the two excels are added element-wise via the city_merge function; if they differ, a new City object is created and appended to the city list. When the loop finishes, the results are persisted.

if __name__ == '__main__':
    # Used to distinguish different data tables of the same city
    city_flag = False  # Default: a different city
    last_city = ''
    city_list = []  # Container for all city objects
    directory_path = r"G:\Complete information of China's industrial and commercial registered enterprises (updated in 2023.9)\National data"
    all_files = os.listdir(directory_path)

    # Start looping through files in the directory
    for file in all_files:
        # Check whether the city already has existing data and adjust city_flag
        last_city = check_cities(file)
        # Absolute file path
        file_path = os.path.join(directory_path, file)
        # Quantity matrix
        result_df, count = data_analyse(file_path)
        if city_flag: # Different excel, the same city, add the corresponding elements of df
            city_merge(result_df, count)
        else:
            # Create a new city, create an object, and add it to the city list
            temp_city = City(last_city, result_df, count)
            city_list.append(temp_city)

    # Output excel results
    save_as_excel()
    print('done!')

The check_cities function extracts the city name from the filename, compares it with last_city and updates it; a regular expression matches all the Chinese characters in the filename, and the return value is the updated last_city.

# City name check; the return value is this file's city name, which becomes last_city for the next check
def check_cities(file):
    global city_flag  # Declare global variables
    global last_city
    # Extract the city name: keep only the Chinese characters in the filename
    simplified_name = ''.join(re.findall(r'[\u4e00-\u9fa5]', file))
    if last_city != simplified_name:
        city_flag = False
        last_city = simplified_name
    else:
        city_flag = True
    return last_city
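A quick sanity check of the extraction on a hypothetical filename (the real files are named after Chinese cities):

```python
import re

def extract_city(filename):
    # \u4e00-\u9fa5 covers the common CJK Unified Ideographs range
    return ''.join(re.findall(r'[\u4e00-\u9fa5]', filename))

print(extract_city('北京市_part1.xlsx'))  # 北京市
```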

Merging same-city data: if the city is the same, the last element of city_list belongs to the current city, so add the current result_df onto that element's result_df to complete the merge.

# For the same city, the corresponding elements are merged
def city_merge(result_df, count):
    global city_list
    city_list[-1].result_df = city_list[-1].result_df + result_df # Element-wise add this excel's df to the city's accumulated df
    city_list[-1].count = city_list[-1].count + count

Persistent storage of results

The final result in memory is city_list. The output has some format requirements: the row above each DataFrame must carry the city_name, and only the first DataFrame needs column names (so that only the first row in the excel carries headers). Two writing methods are used: 1. ws.cell, which writes directly into a cell; 2. DataFrame.to_excel, which writes a whole DataFrame into the excel at once.

# Store in excel
def save_as_excel():
    with pd.ExcelWriter('result.xlsx', engine='openpyxl') as writer:
        startrow = 0
        for idx, city in enumerate(city_list):

            # Write city_name at the current startrow position
            ws = writer.sheets['Sheet1'] if 'Sheet1' in writer.sheets else None
            if ws is None:
                ws = writer.book.create_sheet('Sheet1')
            ws.cell(row=startrow + 1, column=1, value=city.city_name)

            # Sum of each city's total manufacturing volume over 2012-2023
            # ws.cell(row=startrow + 1, column=2, value='Total manufacturing volume from 2012 to 2023')
            # ws.cell(row=startrow + 1, column=3, value=city.count)

            # Only the first DataFrame writes the header row
            header = idx == 0

            # Move startrow to the row after city_name and write the DataFrame there
            startrow += 1
            city.result_df.to_excel(writer, sheet_name='Sheet1', startrow=startrow, index=True, header=header)

            # Advance startrow past the DataFrame (plus its header row, if one was
            # written) and one extra line reserved before the next city_name
            startrow += city.result_df.shape[0] + (1 if header else 0) + 1

All source code:

import os
import re
import pandas as pd

# Secondary classification of manufacturing industry
manufacturing_industries = ['Agricultural and sideline food processing industry', 'Food manufacturing industry', 'Alcohol, beverage and refined tea manufacturing industry', 'Tobacco products industry', 'Textile industry',
                            'Textile, clothing and apparel industry', 'Leather, fur, feather and their products and shoemaking industry', 'Wood processing and wood, bamboo, rattan, palm and straw products industry',
                            'Furniture manufacturing', 'Paper and paper products industry', 'Printing and recording media reproduction industry',
                            'Cultural, educational, industrial and aesthetic, sports and entertainment product manufacturing', 'Petroleum, coal and other fuel processing industry',
                            'Chemical raw materials and chemical products manufacturing industry', 'Pharmaceutical manufacturing industry', 'Chemical fiber manufacturing industry', 'Rubber and plastic products industry',
                            'Non-metal mineral products industry', 'Ferrous metal smelting and rolling processing industry', 'Non-ferrous metal smelting and rolling processing industry',
                            'Metal products industry', 'General equipment manufacturing industry', 'Special equipment manufacturing industry', 'Automotive manufacturing industry',
                            'Railway, shipbuilding, aerospace and other transportation equipment manufacturing', 'Electrical machinery and equipment manufacturing',
                            'Computer, communications and other electronic equipment manufacturing', 'Instrument manufacturing', 'Other manufacturing',
                            'Comprehensive utilization of waste resources industry', 'Metal products, machinery and equipment repair industry']


# City class
class City:
    def __init__(self, city_name, result_df, count):
        self.city_name = city_name
        self.result_df = result_df
        self.result_df.index.name = city_name # Mark the city name on this city's df
        self.count = count # Total number of manufacturing enterprises in the city, 2012-2023


# Judge whether a value is one of the manufacturing sub-industries; returns a Boolean
def is_manufacturing_industry(value):
    # Membership test against the module-level list of 31 manufacturing sub-industries
    return value in manufacturing_industries


# Process each excel
def data_analyse(file_path):
    print(file_path)
    cols = pd.read_excel(file_path, engine='openpyxl', nrows=1).columns.tolist()
    # Excel with a 'Secondary industry classification' column
    if 'Secondary industry classification' in cols:
        df = pd.read_excel(file_path, engine='openpyxl', usecols=['Date of establishment', 'Secondary industry classification'])  # Take only the two needed columns
        # Keep only 2012-2023 data in the 31 manufacturing sub-industries
        filtered_df = df[
            df['Date of establishment'].str[:4].fillna('0').astype(int).between(2012, 2023) & df['Secondary industry classification'].apply(
                is_manufacturing_industry)]
        # Total number of qualifying enterprises
        count = len(filtered_df)
        # Create a DataFrame with industries as columns, years as index and all elements 0
        index_years = list(range(2012, 2024))
        columns_names = manufacturing_industries
        result_df = pd.DataFrame(0, index=index_years, columns=columns_names)
        # Traverse each qualifying row and tally it in result_df
        for index, row in filtered_df.iterrows():
            year = int(row['Date of establishment'][:4])
            sub_industry = row['Secondary industry classification']
            result_df.at[year, sub_industry] += 1
        # Sum each row of result_df: the city's total manufacturing registrations per year
        result_df['annual total manufacturing volume'] = result_df.sum(axis=1)
        return result_df, count
    # No secondary classification column, only an 'Industry' column
    else:
        df = pd.read_excel(file_path, engine='openpyxl', usecols=['Date of establishment', 'Industry'])  # Take only the two needed columns
        # Keep only 2012-2023 data in the 31 manufacturing sub-industries
        filtered_df = df[
            df['Date of establishment'].str[:4].fillna('0').astype(int).between(2012, 2023) & df['Industry'].apply(
                is_manufacturing_industry)]
        # Total number of qualifying enterprises
        count = len(filtered_df)
        # Create a DataFrame with industries as columns, years as index and all elements 0
        index_years = list(range(2012, 2024))
        columns_names = manufacturing_industries
        result_df = pd.DataFrame(0, index=index_years, columns=columns_names)
        # Traverse each qualifying row and tally it in result_df
        for index, row in filtered_df.iterrows():
            year = int(row['Date of establishment'][:4])
            sub_industry = row['Industry']
            result_df.at[year, sub_industry] += 1
        # Sum each row of result_df: the city's total manufacturing registrations per year
        result_df['annual total manufacturing volume'] = result_df.sum(axis=1)
        return result_df, count


# City name check; the return value is this file's city name, which becomes last_city for the next check
def check_cities(file):
    global city_flag  # Declare global variables
    global last_city
    # Extract the city name: keep only the Chinese characters in the filename
    simplified_name = ''.join(re.findall(r'[\u4e00-\u9fa5]', file))
    if last_city != simplified_name:
        city_flag = False
        last_city = simplified_name
    else:
        city_flag = True
    return last_city


# For the same city, the corresponding elements are merged
def city_merge(result_df, count):
    global city_list
    city_list[-1].result_df = city_list[-1].result_df + result_df # Element-wise add this excel's df to the city's accumulated df
    city_list[-1].count = city_list[-1].count + count


# Store in excel
def save_as_excel():
    with pd.ExcelWriter('result.xlsx', engine='openpyxl') as writer:
        startrow = 0
        for idx, city in enumerate(city_list):

            # Write city_name at the current startrow position
            ws = writer.sheets['Sheet1'] if 'Sheet1' in writer.sheets else None
            if ws is None:
                ws = writer.book.create_sheet('Sheet1')
            ws.cell(row=startrow + 1, column=1, value=city.city_name)

            # Sum of each city's total manufacturing volume over 2012-2023
            # ws.cell(row=startrow + 1, column=2, value='Total manufacturing volume from 2012 to 2023')
            # ws.cell(row=startrow + 1, column=3, value=city.count)

            # Only the first DataFrame writes the header row
            header = idx == 0

            # Move startrow to the row after city_name and write the DataFrame there
            startrow += 1
            city.result_df.to_excel(writer, sheet_name='Sheet1', startrow=startrow, index=True, header=header)

            # Advance startrow past the DataFrame (plus its header row, if one was
            # written) and one extra line reserved before the next city_name
            startrow += city.result_df.shape[0] + (1 if header else 0) + 1


if __name__ == '__main__':
    # Used to distinguish different data tables of the same city
    city_flag = False  # Default: a different city
    last_city = ''
    city_list = []  # Container for all city objects
    directory_path = r"G:\Complete information of China's industrial and commercial registered enterprises (updated in 2023.9)\National data"
    all_files = os.listdir(directory_path)

    # Start looping through files in the directory
    for file in all_files:
        # Check whether the city already has existing data and adjust city_flag
        last_city = check_cities(file)
        # Absolute path to file
        file_path = os.path.join(directory_path, file)
        # Quantity matrix
        result_df, count = data_analyse(file_path)
        if city_flag: # Different excel, the same city, add the corresponding elements of df
            city_merge(result_df, count)
        else:
            # Create a new city, create an object, and add it to the city list
            temp_city = City(last_city, result_df, count)
            city_list.append(temp_city)

    # Output excel results
    save_as_excel()
    print('done!')
    # To do: optimize the running time, then run on the full data

Improvement ideas:

Statistics over excel files, especially ones with a large amount of data, are CPU-intensive. This program runs on a single core, and processing the full 45 GB took about 20 hours. CPU-bound work is well suited to multi-process programming, which makes full use of the computer's performance.

A simple example demonstrates multi-process programming with Python's multiprocessing library.

In the worker function a shared data structure is used. Note that a list created via multiprocessing.Manager() is a proxy object whose individual operations (such as append) are already serialized by the manager process, so a single append needs no explicit lock; the .get_lock() method exists only on shared ctypes objects such as multiprocessing.Value. An explicit multiprocessing.Lock is only needed when several operations on the shared structure must happen atomically together.

import multiprocessing
from functools import partial

# Define a worker function that accepts a parameter and performs the task
def process_function(arg1, shared_list):
    # Perform the task here using the parameter arg1
    result = arg1 * 2

    # Append the result to the shared list; a Manager list proxy already
    # serializes individual operations, so a plain append is process-safe
    shared_list.append(result)

if __name__ == "__main__":
    # Create a shared list through a Manager
    manager = multiprocessing.Manager()
    shared_list = manager.list()

    # Create a process pool sized to the CPU count
    num_processes = multiprocessing.cpu_count()
    pool = multiprocessing.Pool(processes=num_processes)

    # The task arguments
    arg1_values = [1, 2, 3, 4, 5]

    # Bind the shared list with partial so map only has to pass arg1
    process_function_with_shared_list = partial(process_function, shared_list=shared_list)

    # Run the tasks with map
    pool.map(process_function_with_shared_list, arg1_values)

    # Wait for all tasks to complete and close the process pool
    pool.close()
    pool.join()

    # Print the contents of the shared list
    print("Shared List:", list(shared_list))
