After Python batch processes Excel data, import it into SQL Server | Practice

1. Foreword

Immediately after yesterday’s article, Windows downloads, installs, configures SQL Server, SSMS, and uses Python to connect to read and write data. We have installed and configured sqlserver, and have successfully tested how to use Python to connect, read and write data to the database.

Today we officially started to meet the demand: there are a lot of Excel, which need to be processed in batches and then stored in different data tables.

2. Start to use your brain

2.1 Disassembly + clear requirements

1) What needs to be modified in the excel data?

  • There is a column of data DocketDate is a short-term value in excel, which needs to be converted into a normal year-month-day format;
    eg. 44567 –> 2022/1/6

  • Some data needs to be deduplicated according to SOID, and the most recent data should be kept according to DocketDate;

  • There is a column of data that needs to be converted to a date format.
    eg. 06/Jan/2022 12:27 –> 2022-1-6

Mainly involved: date format processing, data deduplication processing

2) Does each Excel correspond to a different data sheet? Is the name of the table consistent with the name of the Excel attachment?

  • Some Excel corresponds to the same table, some are separate

  • The table name is inconsistent with the name of the Excel attachment, but there is a corresponding relationship
    eg. Attachments test1 and test2 correspond to table testa, and attachment test3 corresponds to testb

Mainly involved: data merge processing

2.2 Install third-party packages

pip3 install sqlalchemy pymssql pandas xlrd xlwt
  • Sqlalchemy: You can map the table structure of a relational database to an object, and then process the database content by processing the object;

  • pymssql: the driver program for python to connect to the sqlserver database, and you can also directly use it to connect to the database and perform read and write operations;

  • pandas: handle various data, built-in many data processing methods, very convenient;

  • xlrd xlwt: read and write excel files, pandas read and write excel will call them.

Import package:

import pandas as pd
from datetime import date, timedelta, datetime
import time
import os
from sqlalchemy import create_engine
import pymssql

2.3 Read excel data

Reading data is relatively simple, just call the read_excel function of pandas directly. If the file has any special format, such as encoding, you can also customize the settings.

# read excel data
def get_excel_data(filepath):
    data = pd. read_excel(filepath)
    
    return data

2.4 Special data processing

1) The number of days in the date is shortened to the date

This is difficult. It is very simple to transfer directly in excel. Just select the data to be transferred, and then select the short date in the start-data format column.

At that time, I didn’t know the conversion law at first glance. After searching for a long time, I didn’t find any similar problems or explanations. First of all, it must not be a timestamp. can solve other data transformation problems.

First, we need to judge the null value, then set the number of days to calculate the starting time, use the timedelta function of the datetime module to convert the number of days into a time difference, and then directly calculate the starting date to get the date it represents.

# The number of days to shorten the date
def days_to_date(days):
    # handle nan values
    if pd.isna(days):
        return
    #44567 2022/1/6
    # Calculate the shortened date of excel days from 1899.12.30
    start = date(1899,12,30)
    # Convert days to timedelta type, which can be directly calculated with date
    delta = timedelta(days)
    # Start date + time difference to get the corresponding short date
    offset = start + delta
    return offset

The more difficult thing to think about here is the starting date for calculating the number of days, but after thinking about it, it is actually easy to calculate. From excel, we can directly convert the number of days into a short date. The equation already exists, only An unknown x, we only need to list a linear equation in one variable to solve the unknown x.

from datetime import date, timedelta

date_days = 44567
# Convert the number of days into a date type time interval
delta = timedelta(date_days)
# result date
result = date(2022,1,6)
# calculate unknown start date
x = result - delta
print(x)

'''
Output: 1899-12-30
'''

2) Convert the English in the date into a number

At first, I thought of using regular matching to extract the year, month, and day, and then convert the English month into a number. Later, I found that the English month can be directly recognized in the date.

The code is as follows, first convert the string into date type data according to the format, the original data is 06/Jan/2022 12:27 (digital day/English month/digital year digital hour: digital minute), press The corresponding relationship in the date formatting symbol explanation table can be replaced.

# Convert the official date format to a common format
def date_to_common(time):
    # handle nan values
    if pd.isna(time):
        return
    #06/Jan/2022 12:27 2022-1-6
    # test print(time,':', type(time))
    # convert string to date
    time_format = datetime.strptime(time,'%d/%b/%Y %H:%M')
    # Convert to specified date format
    common_date = datetime.strftime(time_format, '%Y-%m-%d')
    return common_date

Explanation table of date formatting symbols
@CSDN-The Messenger of Boating

3) Deduplication by order number SOID

In addition to deduplication by the specified column, the deduplication here also needs to retain the latest data by date.

My idea is to first call the sort_values function of pandas to sort all the data in ascending order according to the date column, and then call the drop_duplicates function to specify the SOID The column is deduplicated, and the value of keep is specified as last, indicating that the last row of data is kept in the duplicate data.

code show as below:

# Remove duplicate values SOID duplicates Remove the earliest data by date
def delete_repeat(data):
    # First sort by the date column Docket Rec.Date & amp; Time The default is descending order to ensure that the date left is the latest
    data.sort_values(by=['Docket Rec.Date & amp; Time'], inplace=True)
    # remove duplicate rows by SOID
    data.drop_duplicates(subset=['SOID #'], keep='last', inplace=True)
    
    return data

2.5 Other requirements

Multiple Excel data correspond to a database table

You can write a dictionary to store database tables and corresponding Excel data names, and then store them in the corresponding database tables one by one (or merge the data after processing the data in advance).

  • Merge Excel tables of the same type

# Merge data in the same table, import the merged excel list
def merge_excel(elist, files_path):
    data_list = [get_excel_data(files_path + i) for i in elist]
    data = pd.concat(data_list)
    return data

Here you can pass in the same type of Excel file name list (elist) and the absolute/relative path of the data storage folder (files_path). You can get the absolute/relative path of the Excel data table file through the absolute/relative path of the file + the name of the Excel file. Then call the get_excel_data function to read the data.

Traversing and reading the Excel table data uses the list comprehension formula, and finally uses the concat function of pandas to merge the corresponding data.

  • Data storage to sqlserver

# Initialize the database connection engine
# create_engine("database type + database driver://database username:database password@IP address:port/database", other parameters)
engine = create_engine("mssql + pymssql://sa:123456@localhost/study?charset=GBK")

# Storing data
def data_to_sql(data, table_naem, columns):
    # Do some processing on the data again, select the specified column and store it in the database
    data1 = data[columns]
    
    # The first parameter: table name
    # Second parameter: database connection engine
    # The third parameter: whether to store the index
    # The fourth parameter: append data if the table exists
    t1 = time.time() # Timestamp in seconds
    print('Data insertion start time: {0}'.format(t1))
    data1.to_sql(table_naem, engine, index=False, if_exists='append')
    t2 = time.time() # Timestamp in seconds
    print('Data insertion end time: {0}'.format(t2))
    print('Successfully inserted data %d, '%len(data1), 'Time spent: %.5f seconds.'%(t2-t1))

Pay attention to pitfalls when sqlalchemy + pymssql connect to sqlserver: you need to specify the database code, the database created by slqserver defaults to GBK code, for the installation and use of sqlserver, you can check the article Windows download, install and configure SQL Server, SSMS, and use Python to connect to read and write data.

2.6 Complete calling code

'''
Batch process all excel data
'''
# Data files are stored in a specified directory, such as:
files_path = './data/'
bf_path = './process/'

# Get all file names in the current directory
# files = os. listdir(files_path)
# files

# Table name: attachment excel name
data_dict = {
    'testa': ['test1.xls', 'test2.xls'],
    'testb': ['test3.xls'],
    'testc': ['test4.xls']
}

# Select the specified column in the attachment, and only save the specified column data
columns_a = ['S/No', 'SOID #', 'Current MileStone', 'Store In Date Time']
columns_b = ['Received Part Serial No', 'Received Product Category', 'Received Part Desc']
columns_c = ['From Loc', 'Orig Dispoition Code']

columns = [columns_a, columns_b, columns_c]
flag = 0 # column selection flag

# Traversing the dictionary, merging the relevant excel and processing the data, storing it in sql
for k,v in data_dict.items():
    table_name = k
    data = merge_excel(v, files_path)
    # 1. Processing data
    if 'SOID #' not in data.columns:
        # If the column to be processed is not included, it will be simply deduplicated and stored in the database
        data. drop_duplicates(inplace=True)
    else:
        # special processing data
        data = process_data(data)
    # 2. Store data
    # To be on the safe side, save a copy locally
    data.to_excel(bf_path + table_name + '.xls')
    # store in database
    data_to_sql(data, table_name, columns[flag])
    flag + =1

Finally: The complete software testing video tutorial below has been sorted out and uploaded, friends who need it can get it by themselves [100% free guarantee]

Software testing interview document

We must study to find a high-paying job. The following interview questions are the latest interview materials from first-tier Internet companies such as Ali, Tencent, and Byte, and some Byte bosses have given authoritative answers. Finish this set The interview materials believe that everyone can find a satisfactory job.

image

Get the whole set of information