Python methods for batch table files: data subsetting, row-by-row differencing, and cross-file merging

This article shows how to use the Python language to process a large number of Excel table files in a folder: for each file, first subset the rows we need according to the values of a certain column, then compute row-by-row differences on the subset, and finally merge in the data of same-named files from several other folders that hold the same large number of Excel table files.

First of all, let’s clarify the specific needs of this article. There is a folder containing a large number of Excel table files (in this article, we take files in .csv format as an example), and the name of each file indicates the ID of the data source point it corresponds to, as shown in the figure below.

Each of these table files has the data format shown in the figure below; the first column holds time data expressed as day-of-year values, and the time span between consecutive rows is 8 days.

What we hope to achieve is the following. First, for each file in this folder, keep only the rows from DOY 2022001 (that is, day 1 of 2022) onward. Then, for every column of the subset except the first (which holds the time data), compute row-by-row differences: for example, subtract the 2022001 value from the 2022009 value, then subtract the 2022009 value from the 2022017 value, and so on, appending the differences as new columns after the original ones. We also want to extract some key information from the current file's name and from the day counts in the first column and append it as new columns (the aim here is to produce training data for a deep neural network regression, so various kinds of data need to be combined). In addition, we have 2 other folders that hold the same large number of files, with the same file naming rules and the same data format; we want to merge the same-day data from the same-named files in these 2 folders into each file in the current folder.
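Before writing the full script, here is a minimal sketch of what the row-by-row differencing means in Pandas (the reflectance values are made up, and only the DOY column and one band are shown): diff() gives each row's value minus the previous row's value, so the first row of the new column is NaN.

import pandas as pd

# Toy data: three DOY-stamped rows, 8 days apart, with made-up "blue" values
toy = pd.DataFrame({"DOY": [2022001, 2022009, 2022017],
                    "blue": [0.10, 0.12, 0.15]})
toy["blue_dif"] = toy["blue"].diff()  # 2022009 minus 2022001, 2022017 minus 2022009
print(toy)
#        DOY  blue  blue_dif
# 0  2022001  0.10       NaN
# 1  2022009  0.12      0.02
# 2  2022017  0.15      0.03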

After understanding the requirements, we can start writing code. The code used in this article is as follows.

# -*- coding: utf-8 -*-
"""
Created on Thu May 18 11:36:41 2023

@author: fkxxgis
"""

import os
import numpy as np
import pandas as pd

original_path = "E:/01_Reflectivity/99_Model_Training/00_Data/02_Extract_Data/17_HANTS"
era5_path = "E:/01_Reflectivity/99_Model_Training/00_Data/03_Meteorological_Data/02_AllERA5"
history_path = "E:/01_Reflectivity/99_Model_Training/00_Data/02_Extract_Data/18_AllYearAverage_2"
output_path = "E:/01_Reflectivity/99_Model_Training/00_Data/02_Extract_Data/19_2022Data"

era5_files = os.listdir(era5_path)
history_files = os.listdir(history_path)

# Loop over every file in the original (reflectance) data folder
for file in os.listdir(original_path):
    file_path = os.path.join(original_path, file)
    
    if file.endswith(".csv") and os.path.isfile(file_path):
        point_id = file[4:-4]   # point ID embedded in the file name
        
        # Read the file, keep only rows from DOY 2022001 onward, and reset the index
        df = pd.read_csv(file_path)
        filter_df = df[df["DOY"] >= 2022001]
        filter_df = filter_df.reset_index(drop = True)
        # Row-by-row differences of each band; the first row of every _dif column is NaN
        filter_df["blue_dif"] = filter_df["blue"].diff()
        filter_df["green_dif"] = filter_df["green"].diff()
        filter_df["red_dif"] = filter_df["red"].diff()
        filter_df["inf_dif"] = filter_df["inf"].diff()
        filter_df["si1_dif"] = filter_df["si1"].diff()
        filter_df["si2_dif"] = filter_df["si2"].diff()
        filter_df["NDVI_dif"] = filter_df["NDVI"].diff()
        # Extra columns: point type from the file name, day-of-year from the DOY value
        filter_df["PointType"] = file[4 : 7]
        filter_df["days"] = filter_df["DOY"] % 1000
        
        # Find the ERA5 file whose name contains this point ID
        for era5_file in era5_files:
            if point_id in era5_file:
                era5_df = pd.read_csv(os.path.join(era5_path, era5_file))
                
                rows_num = filter_df.shape[0]
                # For each filtered row, locate the ERA5 row with the same date
                for i in range(rows_num):
                    day = filter_df.iloc[i, 0]
                    row_need_index = era5_df.index[era5_df.iloc[:, 1] == day]
                    row_need = row_need_index[0]
                    # Sum fixed windows of the rows preceding the matched date
                    # (columns: 2 = solar radiation, 3 = temperature,
                    #  4 = precipitation, 5 to 7 = soil moisture)
                    sola_data_all = era5_df.iloc[row_need - 2 : row_need, 2]
                    temp_data_all = era5_df.iloc[row_need - 6 : row_need - 2, 3]
                    prec_data_all = era5_df.iloc[row_need - 5 : row_need - 1, 4]
                    soil_data_all = era5_df.iloc[row_need - 6 : row_need - 2, 5 : 7 + 1]
                    sola_data = np.sum(sola_data_all.values)
                    temp_data = np.sum(temp_data_all.values)
                    prec_data = np.sum(prec_data_all.values)
                    soil_data = np.sum(soil_data_all.values)
                    filter_df.loc[i, "sola"] = sola_data
                    filter_df.loc[i, "temp"] = temp_data
                    filter_df.loc[i, "prec"] = prec_data
                    filter_df.loc[i, "soil"] = soil_data
                break
            
        # Find the historical-average file whose name contains this point ID
        for history_file in history_files:
            if point_id in history_file:
                history_df = pd.read_csv(os.path.join(history_path, history_file)).iloc[ : , 1 : ]
                history_df.columns = ["blue_h", "green_h", "red_h", "inf_h", "si1_h", "si2_h", "ndvi_h"]
                break
        
        filter_df_new = pd.concat([filter_df, history_df], axis = 1)
                
        output_file = os.path.join(output_path, file)
        filter_df_new.to_csv(output_file, index = False)

The code first defines several folder paths: the original data folder (that is, the folder shown in the first figure of this article), the ERA5 meteorological data folder, the historical data folder, and the output folder. Then, all file names in the ERA5 meteorological data folder and the historical data folder are obtained with the os.listdir() function, for use in the subsequent loops.
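One detail worth remembering about os.listdir(): it returns bare file names, not full paths, which is why each name is joined back onto its folder with os.path.join() before reading. A small illustration with a hypothetical folder:

import os

names = os.listdir("E:/some_folder")                   # hypothetical folder; returns names only, e.g. ["point_001.csv"]
full_path = os.path.join("E:/some_folder", names[0])   # full path suitable for read_csv()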

Next is a for loop that traverses all files in the original data folder; if a file's name ends with .csv and it is a regular file, we process it. The point ID is extracted from the file name, and the file's data is read with the read_csv() function in Pandas. Boolean indexing then keeps only the rows whose DOY is greater than or equal to 2022001, the index is reset, and the row-by-row differences of the reflectance columns are computed with diff(). Finally, some extra columns are added to the filtered data, including the point type and the day count.
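The two derived columns are worth spelling out. The modulo operation strips the year prefix from each DOY value, and string slicing pulls the point type and the point ID out of the file name (the file name below is hypothetical, chosen only to match the slice positions used in the code):

doy = 2022009
print(doy % 1000)          # 9: the day-of-year without the 2022 prefix

file = "HANT123_456.csv"   # hypothetical name following the same pattern
print(file[4:-4])          # "123_456" -> point_id (everything after the prefix, minus ".csv")
print(file[4:7])           # "123"     -> PointType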

Next are two for loops, which process the ERA5 meteorological data and the historical data respectively. For the ERA5 data, we first find the ERA5 file that matches the current point ID and read it with the read_csv() function in Pandas. Then, for each row of the filtered data, the iloc[] accessor locates the ERA5 row whose date matches the current DOY, and fixed windows of the preceding rows are summed to obtain the solar radiation, temperature, precipitation, and soil moisture values. Finally, these sums are added to the filtered data as new columns.
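One fragile spot in this loop is worth flagging: if a date from the reflectance file has no matching row in the ERA5 file, row_need_index is empty and row_need_index[0] raises an IndexError. The lookup inside the loop could be guarded like this (a sketch, not part of the original code):

row_need_index = era5_df.index[era5_df.iloc[:, 1] == day]
if len(row_need_index) == 0:
    continue   # no ERA5 row for this date; skip it
row_need = row_need_index[0]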

When processing the historical data, we first find the historical data file that matches the current point ID and read it with the read_csv() function in Pandas. The first column is dropped with iloc[], and the remaining columns are renamed blue_h, green_h, red_h, inf_h, si1_h, si2_h, and ndvi_h. Finally, the concat() function in Pandas combines the filtered data and the historical data into a new DataFrame.
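Note that pd.concat() with axis=1 aligns rows by index rather than by date, which is why resetting filter_df's index earlier matters; it also assumes the historical file lists its rows in the same day order as the filtered data. A toy illustration of the positional pairing:

import pandas as pd

a = pd.DataFrame({"x": [1, 2]})    # index 0, 1
b = pd.DataFrame({"y": [10, 20]})  # index 0, 1
print(pd.concat([a, b], axis=1))   # rows pair up through the shared 0, 1 index
#    x   y
# 0  1  10
# 1  2  20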

Finally, use the to_csv() function in Pandas to save the new DataFrame to the output folder.

Run the above code, and we obtain the many merged table files; the columns of each file, with all the different kinds of information now combined, are as shown in the figure below.

In this way, the production process of our neural network training dataset is complete.

And with that, the job is done.

Welcome to follow: Crazy learning GIS