Python: Use the pandas package to complete the transpose operation of excel data

Directory

  • question
  • primary method
    • Import Data
    • Observation data
    • Processing ideas and code
  • Advanced method
    • Import & observe data
    • Processing ideas and code
      • Step by step version:
        • Extract column index
        • Extract row index
        • Extract value
        • Build a new data frame
        • Transpose
      • Function version:

Question

Due to business requirements, Table A needs to be converted into Table B to facilitate subsequent association of other information and analysis of product delivery status.

Table A

Table B

Primary method

The key functions used are: fillna(), MultiIndex.from_product(), stack(), reset_index()

Import data

Import xlsx using pd.read_excel()

It should be noted that pandas uses the xlrd package to import xlsx, and only the old version of the xlrd (1.2.0) package supports importing xlsx files. The latest version of the xlrd package does not support xlsx but supports xls.

import pandas as pd
df_raw = pd.read_excel(r'XXXXX\practice.xlsx')

Observation data

After importing the data, you need to perform corresponding operations according to the format presented by the data.
Figure 1
As can be seen:

  1. Stores that are column information in excel become row information after being read;
  2. The merged cells are split, and only one cell has information after the split.

Processing ideas and code

  1. Set the product category and product quality fields to index, then set the store level and store to multi-level columns;
  2. Use stack() to transpose the table. At this time, the multi-level columns also become index;
  3. Use reset_index() to restore index to column and that’s it.

first step:

from collections import OrderedDict #This package is used to generate a column name list with deduplication and unchanged order.
def remove_duplicates(lst):#Define a function that removes duplicates from a list and does not change the order
    return list(OrderedDict.fromkeys(lst))

df = df_raw.copy()
df.fillna(method = 'ffill',inplace = True)
#Generate a second-level column list
col_index = remove_duplicates(df.iloc[0,:][df.iloc[0,:].notna()].to_list())
#Delete the first line
df = df.drop(0,axis = 0)
#set_index
df = df.set_index(['product category','product quality'])
#Add columns index
col_index2 = [x for x in df.columns.to_list() if 'Unnamed'not in x ]#Generate the first-level column name list. This step must be done after set_index, otherwise the redundant column names will be included.
columns_multiindex = pd.MultiIndex.from_product([col_index2,col_index], names=['store level', 'store'])
df.columns=columns_multiindex

After adding index and multi-level columns, the effect is as follows:
Figure 2
Steps 2 and 3:

#stack, convert column index into index (analogous to the row and column transposition operation in excel), and then restore all indexes to columns through reset_index
df = df.stack(['store level', 'store']).reset_index()
df

Effect:
Figure 3
PS: After such an operation, I finally have a deeper understanding of the index of dataframe, and the concepts of iloc and loc are also clear.

Advanced methods

When processing tabular data in actual work, I found that the above method only applies to the following column names with inclusion relationships (the main reason is the use of MultiIndex.from_product): Example header 1
But when the relationship becomes more complex… it doesn’t apply:
Example header 2

Import & Observe Data

Data reading results
It can still be seen:

  1. The column information in Excel becomes row information after reading;
  2. The merged cells are split, and only the leftmost cell has information after the split.

Processing ideas and code

The overall idea is similar to that in the primary method. It still operates around the index of the data frame. Before stack(), the row index, column index and value are extracted and processed respectively to build a new data frame. The overall steps and functions are instead Much simpler.
Extract each part separately

Step-by-step version:

Extract column index

Extract column indexes (including the current column index) as values and fill the rows with null values.

df_column = df.iloc[0:4,2:]
df_cols = pd.DataFrame([df_column.columns.to_list()],columns = df_column.columns.to_list(),index = [0])
df_column = pd.concat([df_cols,df_column]).reset_index(drop = True)
df_column.loc[0,df_column.loc[0,:].str.contains('Unnamed') == True] = np.nan
df_column.fillna(method = 'ffill',inplace = True,axis = 1)

Column Index

Extract row index

Extract row index

df_index = df.loc[:,['Column Name 1', 'Column Name 2']]
#df_index.fillna(method = 'ffill',inplace = True)
df_index = df_index.drop([0,1,2,3],axis = 0).reset_index(drop = True)
df_index

Note: When slicing df and passing it to a new variable df_index = df.loc[:,['Column Name 1', 'Column Name 2']], you need to use iloc or loc, otherwise it will appear SettingWithCopyWarning.

Extract value
df_value = df.iloc[4:,2:]
df_value

Build a new data frame
df_value.index = df_index.T.values.tolist()
df_value.columns = df_column.values.tolist()
df_value.index.names = ['Column Name 1', 'Column Name 2'] #The name here is the name of the row index
df_value.columns.names = ['colname1','colname2','colname3','colname4','colname5']#The name here is the name of the column index
#Row index names and column index names appear as column names in the final result table


The result presented in jupyter is like this. It seems that there is no alignment, which may cause misunderstanding. In fact, it is in the desired form. You can export it to csv to verify:

Transpose
df_new = df_value.stack(['colname1','colname2','colname3','colname4','colname5']).reset_index()

NA lines are also automatically removed

Function version:

def multicolumnlevel_to_one(raw_dataframe,row_level_cnt,column_level_cnt,row_index_name,column_index_name):
    #raw_dataframe: The pending table dataframe requires deletion of unnecessary columns
    #row_level_cnt: Number of row index columns in the original table int
    #column_level_cnt: Number of column name levels in the original table int
    #row_index_name: The column name of the column that needs to be set as the row index. The order of the list is from left to right. The name of the column to be used as the row index.
    #column_index_name: column index name list order is different column names from top to bottom
    if column_level_cnt >= 2:
        #step 1 Extract column index
        df_column = raw_dataframe.iloc[:(column_level_cnt-1),row_level_cnt:]
        df_cols = pd.DataFrame([df_column.columns.to_list()],columns = df_column.columns.to_list(),index = [0])
        df_column = pd.concat([df_cols,df_column]).reset_index(drop = True)
        df_column.loc[0,df_column.loc[0,:].str.contains('Unnamed') == True] = np.nan
        df_column.fillna(method = 'ffill',inplace = True,axis = 1)

        #step 2 Extract row index
        df_index = raw_dataframe.loc[:,row_index_name]
        df_index.fillna(method = 'ffill',inplace = True)
        df_index = df_index.drop(list(range(column_level_cnt-1)),axis = 0).reset_index(drop = True)

        #step 3 Extract value
        df_value = raw_dataframe.iloc[(column_level_cnt-1):,row_level_cnt:]

        #step 4 Build a new data frame
        df_value.index = df_index.T.values.tolist()
        df_value.columns = df_column.values.tolist()
        df_value.index.names = row_index_name
        df_value.columns.names = column_index_name

        #step 5 Transpose
        df_new = df_value.stack(column_index_name).reset_index()
    else: #If there is only one row of column names
        df_new = raw_dataframe.set_index(row_index_name)
        df_new.columns.name = column_index_name[0]
        df_new = df_new.stack().reset_index()
    return df_new

The function also satisfies the following transposition of tables:

End