Python Teaching | Pandas Data Indexing and Data Selection

Directory

Part 1 Preface

Index of Part 2 Tabular Data

Part 3 Pandas data selection

1. Select the data field

2. Select data row

3. Select a single data value

4. Select any data

(1) Select according to the index: loc function

(2) Select according to the order: iloc function

What is the use of Part 4 data selection?

1. View data or generate new data

2. Modify the data value

Part 5 Summary

Part 6 Python Tutorial

Preface to Part 1

In the first two articles of the Python tutorial series, we introduced the data types in the data analysis library Pandas and the methods of reading and writing table files. After reading, you should have a preliminary understanding of Pandas. In data processing, Importing data-analyzing and processing data-exporting result data is a very common processing flow. The previous article has introduced to you in detail how Pandas imports and exports data, so the following In technical articles, we will focus on how to analyze and process data. As the most popular data processing framework in the Python language, Pandas is powerful, complete, and full of functions, so we plan to use multiple articles to introduce the main functions of Pandas in detail according to different needs of data processing. It is the data selection (also known as data slicing) function of Pandas.

What is data selection? In short, it is to obtain the data value of a certain position or a certain area in the table according to the row and column order or the row and column index of the table. If you complete simple data selection in Excel software, you only need to move the mouse. You can use Pandas to do data selection, but you need to learn to use two obscure functions first. It is not easy to find difficult, although the direct interaction between people and data in the programming language is not as good as office software, but the specific functions of Pandas data selection and the aspect of replacing human processing of big data cannot be replaced by Excel, so the data selection Definitely a necessary step in learning Python data manipulation.

This tutorial is written based on pandas version 1.5.3

All the Python code in this article is written in the integrated development environment Visual Studio Code (VScode) using the interactive development environment Jupyter Notebook. Please use VScode to open the code shared in this article.

Index of Part 2 table data

When using tabular data daily, it is often necessary to check the data somewhere in the table. At this time, it is inevitable to locate according to the field name and row number of the table. For example, when displaying tabular data in office software such as Excel/WPS, the software left The side and top have their own row numbers and column numbers. We can locate the data according to the row numbers and column numbers (or field names), as shown in the figure below.

In Python, Pandas is a Python third-party library specially used to process tabular data. In Pandas, the data type representing tabular data is DataFrame. In order to be able to locate data as easily as Excel, DataFrame also has row number and column number attributes. Below we read a table data for demonstration.

#import pandas library
import pandas as pd
# read the demo data
DATA = pd.read_excel('./Partial information of A-share listed companies.xlsx')
# Process the field name and simplify it, don't worry about this step
DATA.columns = [COL.split('\\
')[0] for COL in list(DATA.columns)]
# display data
DATA

As shown in the figure above, the tabular data type DataFrame in Pandas also has row number and column number attributes. The row and column numbers of the DataFrame type and the row and column numbers of the Excel table look somewhat similar. In fact, there are quite a few differences between them. The author puts them The main differences between are listed below.

In an Excel table, the table does not necessarily have a field name, so there must be a column number above (A, B, C…); while in Pandas, the table type must have a field name, and use the field name as the column number (Column index or field name: column), if you find that the table does not have a field name when reading a piece of data, Pandas will use the content of the first row as the field name by default. Of course, this may not be reasonable, but we can Set the parameter header=None when reading the data, indicating that the table has no field name. At this time, Pandas will set the field name of the table as a natural number (from left to right is 0, 1, 2… )
The row number in the Excel table is actually the serial number of the row in the table, from top to bottom, starting from 1, up to 1048576, because the data storage limit of the Excel table is 1048576 (2 to the 20th power) rows. And the row number is fixed and will not change with the deletion or hiding of data rows. In Pandas, the data row number is called row index(index), when reading data, Pandas will automatically assign table data Row index (as shown in the figure above), the row index starts from 0 by default and increases downward. The row index is the number of a row of data. With the deletion and addition of data rows, the row index will also change. For example, after the row of data with index 1 is deleted, the data row index will become 0 , 2, 3… So the row index in Pandas is not continuous and constant. This is to make the positioning of the data more stable and not affected by the change of the data volume.
The row and column numbers of an Excel table are fixed, the row index is a positive integer from 1 to 1048576, and the column number is a number composed of 26 uppercase English letters. The row index of the table (DataFrame) in Pandas is an increasing natural number by default, but the index value can be defined according to the needs of the programmer, and the row index can also be a string type. At any stage of data processing, the user can change the index of a certain (several) rows of data, and can also reset the index of the data row at any time to make it a default continuous natural number.
In fact, Excel does not have a strict concept of field names, but everyone is used to writing the field name on the first line, so in Excel, the value of the field name is arbitrary and can be repeated. In Pandas, although the same field name or row index is also supported, it is not recommended to do so. When reading table data and matching table data, if at least two field names are found to be the same, Pandas will automatically make a distinction ,As shown below.

The above are the main differences between the row and column numbers in Excel software and the row index and field name of the tabular data type DataFrame in Pandas. I believe that everyone has a more comprehensive understanding of the DataFrame index at this point. Next, we will take the default data index as an example to demonstrate how Pandas selects data.

Part 3 Pandas data selection

The data table used for demonstration in this section is part of the information of A-share listed companies (as of May 12, 2023). For the code to read the data and the data preview, please refer to the previous section [Index of table data].

1. Select data field

In many cases, there are too many fields in the form, and we can filter out the ones we need. In Pandas, we can select the required data fields according to the field names. The sample code is as follows (the DATA in the code is the variable obtained by reading the data in the previous section).

## If you only need to select one field, you can use this code: <variable name>[<field name>], the example is as follows
DATA['Company Chinese name'] # The result selected at this time is a Series, no more pictures will be displayed here

## If you need to select several fields, and the result is still a table
## You can use: <variable name>[<field name list>], the example is as follows
DATA[['Company Chinese name', 'Company English name']]

2. Select data row

There is a simple function of selecting data rows in Pandas, which can select a continuous range of data rows, for example, you can select all data between the mth row (including m) and the nth row (not including n). However, the serial numbers m and n here are not the row index values of the data, but the digital serial numbers of the data rows. This row digital serial number is almost the same as the row serial number in Excel. The only difference is that Excel The row number in Pandas starts from 1, while the row number in Pandas starts from 0 (as shown in the figure above). The sample code is as follows.

## Select the 10th to 15th rows of data in the data DATA (the row numbers mentioned here are counted from 1)
# Since the row number of DataaFrame starts from 0, and it is a left-closed right-open interval, the correct code is as follows
DATA[9:15] # Note the row number

3. Select a single data value

In the section [Field Selection] above, we used the code DATA['Company Chinese Name'] to select a field. The author mentioned in the comment that using this code to select The data field is a one-dimensional Series type. In the previous article introducing the Pandas data type >>Python Teaching | Pandas (Basic Articles), an Essential Tool for Data Processing”>>>>Python Teaching | Data Processing Essentials Pandas of Tools (Basic), we have introduced the Series type, which is a one-dimensional data type with an index. If it is a Series obtained from a DataFrame, then the obtained Series also has the same index value as the DataFrame, for example :

DATA['Company Chinese name']

Therefore, the common way to select a single data value is to select a column of data first, and then use the row index to retrieve the target value. For example, if you need to retrieve the data value with the row index value of 2 in the Company Chinese Name field in the table, then The code below can be used.

DATA['Company Chinese name'][2]
# Get: 'Guohai Securities Co., Ltd.' You can refer to the above picture

For students who are new to Pandas, this is the most common way to locate a specific data value, and it is often used in loop codes that process data in batches.

4. Select any data

In the previous sections, we introduced several methods for selecting rows, columns, and values in Pandas, but these methods cannot meet more diverse data needs. In Pandas, there are two functions loc and iloc, which can be used to index or rank according to the row and column of the table The serial number can be used to obtain data in any area or satisfying certain conditions. Let’s introduce the usage of these two functions in detail below.

(1) Select by index: loc function

There is a loc function in the tabular data type DataFrame in Pandas. The usage of the function is as follows:

DataFrame.loc[row_label, column_label]

There are two parameters row_label and column_label in the loc function, which respectively correspond to the row index and field name of the selected area. As we all know, table data is two-dimensional data, and we can find the corresponding data through the row and column numbers of the table data. The loc function selects data through this principle, that is, loc The function selects data according to the index value of the data. For the convenience of use, there are many built-in parameter forms of the loc function. Below we will introduce the usage of the loc function according to the parameter forms.

① The parameter is a single index value

When using the loc function, the parameters can be set to a single row index and column index (the column index is the field name), and this method can be used to obtain the specific value of a certain position in the table. The sample code is as follows.

## Purpose: Select the data value whose row index is 3 in the field of "Securities Abbreviation"
DATA.loc[3, 'Security abbreviation'] # Note that the row index comes first, the field name comes after, separated by commas
# Get: 'GF Securities'

The meaning of the above code DATA.loc[3, 'stock abbreviation'] is to get 'stock abbreviation' in the data DATA A value whose row index is 3. Corresponding to the table is: ‘GF Securities’.

② The parameter is an index list

If we want to get the specified rows and columns of data, we can also set the index parameter to a list containing index values, the sample code is as follows.

## Select the two fields of 'company Chinese name' and 'total number of employees', and the row index is the data of 1, 3, 4, 7
DATA.loc[[1,3,4,7], ['Company Chinese name', 'Total number of employees']]

The selected data is still two-dimensional data, so the result of the selection is still a DataFrame.

③ The parameter is the index range

When we need to select data in a certain range of values, for example, we need to select data with row index values from 100 to 200 (including 200), we can directly use the range parameter 100:200 to act as a parameter of the loc function (for the usage of string slices, please refer to this article: Python Teaching | Loop Structure in Python (Part 1)). Similarly, the field name (column index) can also use the range parameter, the sample code is as follows.

## Select all data from the 'Security Code' field to the 'Company Attribute' field, with row index values from 100 to 200
DATA.loc[100:200, 'Securities code':'Company property']

When using the slice range as the loc function, there are the following points to note.

When using row index slicing, it is necessary to ensure that the row index of the data is continuous, and the slice range is reasonable, and the right index value must be greater than or equal to the left index value;
Similarly, field name slices are selected according to the order of the fields, and the logic of the order must also be guaranteed;
In Python, general slices are left-closed and right-open intervals, that is, the values on the left side of the slice are included, and the values on the right side of the slice are not included. But the loc function is an exception. The slice parameter in it is a left-closed right-closed interval, which also includes index values on both sides of the slice. For example, in the above example, slice 100:200 actually includes the data row whose row index is 200, so the data volume of the selection result is only 101 rows.
When the parameter is an index range slice, the values on both sides of the slice can be omitted. Omitting the left side means starting from the smallest row index value (or the leftmost field name), and omitting the right side means counting from the largest row index value (or the most leftmost field name). The field name on the right side), although the values on both sides of the slice can be omitted, but the colon (:) in the middle of the slice must be retained.

Below is another sample code for the loc function when it uses a range slice as an argument.

## select the row index from 0 to 5; the field name is from 'company Chinese name' to the last data
DATA.loc[:5, 'Company Chinese name':] # Some values in the slice are omitted here

④ The parameters are simple conditions

In addition to directly selecting data according to the index value as in the above cases, the loc function can also indirectly obtain the specified data according to the index value. What does “indirect” refer to here? ? In Pandas, data index values can also be generated by using conditional expressions. The sample code for indirectly obtaining index values according to conditions is as follows.

## Obtain the row index of all the data whose 'total number of employees' in DATA is not less than 10000
DATA['total number of employees'] >= 10000

So we can complete the corresponding data filtering according to the above conditions. The corresponding selection code is as follows.

## Select all field data in DATA where the total number of employees is not less than 10000
DATA.loc[DATA['total number of employees'] >= 10000, :] # The row index parameters here use simple conditions, and the column indexes use range slices,
                                        # and omit the value before and after slicing, indicating all columns

In the above code, the row index parameter uses simple conditions, and the column index uses range slicing. In fact, the column index parameter can also use judgment conditions, but in our daily data use habits, most of the data is filtered according to the data row. According to There are very few scenarios where field names filter fields. Below, the author only introduces an example of filtering data based on field name conditional expressions.

## select all the data in which the total number of employees in DATA is not less than 10000 and the field name contains the word 'company'
DATA.loc[DATA['total number of employees'] >= 10000, DATA.columns.str.contains('company')]

Finally, one thing needs to be clearly stated. When introducing the parameter types of the loc function, the author above will set the row index parameter and column index parameter to the same type. This is for the convenience of everyone learning row and column indexes. The use of parameters. In fact, the loc function is a very flexible data selection function, in which the row index value parameter and the column index value parameter (namely the field name parameter) are independent of each other, that is to say, their types are arbitrary and unnecessary Keep the types consistent, you can mix and match at will, and in addition to the four commonly used parameter types introduced above, there are also other parameter types, and the infinite usage needs to be learned and explored by yourself.

(2) Select according to order: iloc function

Another data selection function that is very similar to the loc function and its usage is the iloc function, and its usage is as follows.

DataFrame.iloc[row_label, column_label]

It can be seen from the function form that the iloc function may have the same usage as the loc function. In fact, this is true in terms of usage. However, since the two functions choose The basis of the data is different, resulting in slight differences in the parameter types between the two. The loc function introduced above selects data according to the row and column indexes of the data, while the iloc function introduced below selects data according to the row and column indexes of the data. This is very similar to what we say about the number of rows and columns, the difference is that The order of rows and columns in Pandas starts from 0, not from 1, so it is still used when using Pay attention to the row and column order values. Below we introduce the iloc function according to different parameter types.

① The parameter is a single serial number value

In the iloc function, the row order value and column order value are all natural numbers, and have nothing to do with the row index value of the data.

## Select row 2, column 6 data
DATA.iloc[1, 5]
# get: 728

After running the above code, the output value is 728, which is exactly the value of line 2 and line 6 in the data DATA, and the location of the data is shown in the figure below.

It should be noted that since the row and column serial numbers of Pandas start from 0, the serial number value in the code is 1 less than the described order.

② The parameter is a list of serial numbers

Like the loc function, the parameter form of the iloc function can be a list. In the loc function, the value in the list is the index value of the row and column, but in the iloc function, the value in the list is the serial number value of the row and column. The sample code is as follows.

## Select rows 3, 4, 6, 7 and columns 2, 6, 8, 9, and 10
DATA.iloc[[2,3,5,6], [1,5,7,8,9]]

③ The parameter is the serial number range

The iloc function also supports passing in range slices as the serial number value for data selection. For example, when you need to select the 100th to 200th (including 200) rows of data, you can set the row serial number slice to 99:200 , the sample code is as follows.

## Select the 100th to 200th (including 200) rows, all column data
DATA.iloc[99:200, :]

Careful students should have discovered the difference between this and the loc function. According to the rules of line numbers, the representation method of lines 100 to 200 (including 200) should not be 99:199. Well, why is 99:200 in the code? This is about the difference between the two functions in the slicing rules.

The index value slice in the loc function corresponds to a left-closed right-close interval, while the sequence number slice in the iloc function is a < strong>left close rightopen range, so the value on the right side of the range needs to be 1 more.
When using the iloc function to select data, there is no requirement for the row and column index values of the data, that is, the index of the data can be discontinuous.

What is the use of Part 4 data selection?

We have already exhausted the two functions of data selection above, so what are the functions and roles of these two functions in data processing? Let’s make a simple explanation below.

1, view data or generate new data

As demonstrated in the introduction of the data selection function, data selection can facilitate us to view the selected data, or when the data needs to be clipped, the data selection function can also be used to generate a new data object. For example, when we only need certain columns of data in the table, we can assign the result of data selection to a new variable, and then process and analyze the new variable. The sample code is as follows.

# Select the fields of the company's Chinese name, company attributes, enterprise size, registered capital, and total number of employees to generate new data
DATA_NEW = DATA[['Company Chinese Name', 'Company Attributes', 'Enterprise Size', 'Registered Capital', 'Total Number of Employees']]
# or: DATA_NEW = DATA.loc[:, ['Company Chinese Name', 'Company Attributes', 'Enterprise Size', 'Registered Capital', 'Total Number of Employees']]
DATA_NEW

2, modify data value

When processing an Excel table, if you need to modify some data values, you can open the Excel table file and modify it manually. But if it is in Pandas, since the read table data is in the memory, we cannot manually modify the value of the table data in the memory. At this time, we need to use the data selection function to modify the data value in the table. The sample code is as follows.

## Assuming that the data needs to be updated manually, modify the total number of employees in the first two lines to 3640 and 4000 respectively

# method 1
DATA_NEW['total number of employees'][0] = 3640
DATA_NEW['total number of employees'][1] = 4000
# Method 2
DATA_NEW.loc[0, 'total number of employees'] = 3640
DATA_NEW.loc[1, 'total number of employees'] = 4000
# Method 3
DATA_NEW.loc[[0,1], 'total number of employees'] = (3640, 4000)
# Use any of the above methods

After modification, it is found that the corresponding position data has been replaced with the value we need, as shown in the figure below.

Of course, this is only a method of manually modifying a few data values. In Pandas, most of the scenarios that really need to modify data values are in the needs of processing “big data”, and in these scenarios, a considerable part of them need to use data selection The way to modify data, so the content introduced in this article is an important knowledge of Python processing big data.

Part 5 Summary

The data selection in Padans is very flexible, especially the data selection functions loc and iloc, many beginners cannot fully understand the similarities and differences between them in a short time, so When the author introduced them, he made clear the differences between them. It has to be admitted that there are indeed indistinguishable differences in the parameter forms and usages of the two data selection functions, so the best way to master these two functions is to practice more and explore more. Finally, help readers and friends to study smoothly and everything will be fine.

In the next article, we will learn about data filtering in Pandas. In Pandas, in addition to basic data filtering like Excel, you can also use regular expressions to complete some wonderful data filtering operations. Stay tuned for the next issue!

Part 6 Python Tutorial

The first step in learning Python – environment installation and configuration
Python primitive data types
Python string manipulation (on)
Python string manipulation (below)
Python variables and basic operations
Composite Data Type – List
Composite data type – collection (contains examples)
Composite Data Types – Dictionaries & Tuples
Branching structure in Python (judgment statement)
Loop structure in Python (Part 1)
Loop structure in Python (Part 2)
Python teaching | Definition and calling of Python functions
Python Teaching | Python Built-in Functions
Python Teaching | One of the most commonly used standard libraries – os
Python teaching | “Xiaobai” friendly regular expression teaching (1)
Python teaching | “Xiaobai” friendly regular expression teaching (2)
Python teaching | “Xiaobai” friendly regular expression teaching (3)
Python teaching | Pandas, an essential tool for data processing (basics)
Python teaching | Pandas, an essential tool for data processing (reading and exporting data)
This period