[python data] conditional query of pandas

1. Description

When using Pandas’ DataFrame for data mining, various conditional queries are required, but what are the basic syntax of these queries and how flexible are the queries? This article will list them in detail for future reference.

2. Pandas condition query method

2.1 Simple condition query

1. Use the ” [] " symbol for simple conditional query

Basic syntax:

For example:

<code>import pandas as pd

df = pd.read_csv('data.csv')
df[df['col1'] > 10] # Query the rows greater than 10 in the col1 column</code>

Expression features: it will be composed of comparison operators such as >, ==, <, and so on.

2. Multi-condition query connected by “& amp;”

Basic syntax

Use multiple conditions to perform compound conditional query, the “[ ]” expression is connected with “& amp;”,

Example:

<code>df[(df['col1'] > 5) & amp; (df['col2'] < 10)] # Query col1 column greater than 5 and col2 column less than 10 OK
</code>

3. str.contains() string condition query

Syntax

Use the str.contains() method to perform string conditional query, which is to find out the rows with “apple” in the substring of the string.

Example:

<code>df[df['col1'].str.contains('apple')] # Query the rows containing the 'apple' string in the col1 column
</code>

4. Multiple string contents are queried with isin conditions

Syntax

Use the isin() method for inclusion queries,

Example:

<code>df[df['col1'].isin(['apple', 'banana'])] # Query col1 column contains 'apple' or 'banana\ 'line
</code>

5. Between condition query

Use the between() method for range query, for example:

<code>df[df['col1'].between(5, 10)] # Query the rows between 5 and 10 in column col1
</code>

6. Query null values

Use the isna() or isnull() method to query for null values, for example:

<code>df[df['col1'].isna()] # Query rows with null values in column col1
</code>

2.2, advanced query

2.2.1 Inline statement query

Example 1: For example, I want to find information about all people with the surname Zhang:

df[[x.startswith('Zhang') for x in df['name']]]

Month Name Gender Payable Salary Actual Salary Position
0 1 Zhang San Male 2000 1500 Principal culprit
1 2 Zhang San Male 2000 1000 Principal culprit
2 3 Zhang San Female 2000 15000 Principal culprit
3 4 Zhang San Female 2000 1500 Principal culprit
4 5 Zhang San Female 2000 1500 Principal culprit

The stratswith method here is the string method that comes with Python, click here to view the detailed description.

There is another way:

criterion = df['name'].map(lambda x: x.startswith('Zhang'))

df[criterion]

Speed comparison:

# the first method
%timeit df[[x.startswith(‘Zhang’) for x in df[‘name’]]]
203 μs ± 8.92 μs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

# The second method
%timeit criterion = df[‘name’].map(lambda x: x.startswith(‘Zhang’))
93.2 μs ± 6.21 μs per loop (mean ± std. dev. of 7 runs, 10000 loops each)

%timeit df[criterion]
201 μs ± 2.44 μs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

As you can see, the second method is actually not significantly faster. And if you add the time to add search rules, it will be slower.

2.2.2 The where method that can be used to modify the content

First look at the usage:

df.where(df[‘gender’] == ‘male’)

Month Name Gender Payable Salary Actual Salary Position
0 1.0 Zhang San Male 2000.0 1500.0 Principal culprit
1 2.0 Zhang San Male 2000.0 1000.0 Principal culprit
2 NaN NaN NaN NaN NaN NaN NaN
3 NaN NaN NaN NaN NaN NaN NaN
4 NaN NaN NaN NaN NaN NaN NaN
5 2.0 Li Si Male 1800.0 1300.0 Accomplice
6 3.0 Li Si Male 1800.0 1300.0 Accomplice
7 4.0 Li Si Male 1800.0 1300.0 Accomplice
8 5.0 Li Si Male 1800.0 1300.0 Accomplice
9 NaN NaN NaN NaN NaN NaN NaN
10 NaN NaN NaN NaN NaN NaN NaN
11 NaN NaN NaN NaN NaN NaN NaN
12 NaN NaN NaN NaN NaN NaN NaN

The use of where here is different from the way of directly accessing labels, which is to keep all items that meet the conditions as they are, and set all other items to NaN. If you want to replace data, you need more complex expressions, here is just a simple example:

dates = pd.date_range(‘1/1/2000’, periods=8)

df = pd.DataFrame(np.random.randn(8, 4),index=dates, columns=[‘A’, ‘B’, ‘C’, ‘D’])

df.where(df < 0, -df)

A B C D
2000-01-01 -2.843891 -0.140803 -1.816075 -0.248443
2000-01-02 -0.195239 -1.014760 -0.621017 -0.308201
2000-01-03 -0.773316 -0.411646 -1.091336 -0.486160
2000-01-04 -1.753884 -0.596536 -0.273482 -0.685287
2000-01-05 -1.125159 -0.549449 -0.275434 -0.861960
2000-01-06 -1.059645 -1.600819 -0.085352 -0.406073
2000-01-07 -1.692449 -1.767384 -0.266578 -0.593165
2000-01-08 -0.163517 -1.645777 -1.509307 -0.637490

Insert a sentence here: In fact, numpy also has a where method, which is similar in usage. You can refer to: Return subscript operation function in Python Numpy – a time-saving tool

2.2.3 Fast query method query

df.query(‘name>sex’)

Month Name Gender Payable Salary Actual Salary Position
2 3 Zhang San Female 2000 15000 Principal culprit
3 4 Zhang San Female 2000 1500 Principal culprit
4 5 Zhang San Female 2000 1500 Principal culprit
9 1 Wang Wu Female 1800 1300 Auxiliary
10 2 Wang Wu Female 1800 1300 Auxiliary
11 3 Wang Wu Female 1800 1300 Actor
12 4 Wang Wu Female 1800 1300 Actor

Here, the comparison of strings can be found in Python’s string comparison. Of course, it can be seen here that the query method is mainly used for column comparison.

2.3 The shift() function in pandas

grammar:

shift(periods, freq, axis)

Parameter notes:

parameter	Parameter meaning
period	Indicates the range of movement, which can be a positive number or It is a negative number, the default value is 1, and 1 means to move once. Note that all the data is moved here, but the index is not moved. If there is no corresponding value after the move, it will be assigned a value of NaN.
freq	DateOffset, timedelta, or time rule string, optional parameter, the default value is None, only applicable to Time series, if this parameter exists, the time index will be moved according to the parameter value, and the data value will not change.
axis	0 is vertical, 1 is horizontal

example code

# form data generation
import pandas as pd
import numpy as np
import datetime
df = pd.DataFrame(np.arange(16).reshape(4,4),columns=['A','B','C','D'],index= pd.date_range('20130101', periods=4))

>>>df
           A B C D
2013-01-01 0 1 2 3
2013-01-02 4 5 6 7
2013-01-03 8 9 10 11
2013-01-04 12 13 14 15

#The default is the setting of axis = 0 axis, when the period is positive, move down
# Indicates that the table is valid from the second row of the original data
df. shift(2)
A B C D
2013-01-01 NaN NaN NaN NaN
2013-01-02 NaN NaN NaN NaN
2013-01-03 0.0 1.0 2.0 3.0
2013-01-04 4.0 5.0 6.0 7.0
#The default is the setting of axis = 0 axis, when the period is negative, move down
df.shift(-2)
?
A B C D
2013-01-01 8.0 9.0 10.0 11.0
2013-01-02 12.0 13.0 14.0 15.0
2013-01-03 NaN NaN NaN NaN NaN
2013-01-04 NaN NaN NaN NaN
#axis = 1, when the period is positive, it moves to the right, and when it is negative, it moves to the left
df.shift(2,axis=1)
A B C D
2013-01-01 NaN NaN 0.0 1.0
2013-01-02 NaN NaN 4.0 5.0
2013-01-03 NaN NaN 8.0 9.0
2013-01-04 NaN NaN 12.0 13.0
# The frep parameter determines that the index is a date, positive plus negative minus
df.shift(freq=datetime.timedelta(1))
A B C D
2013-01-02 0 1 2 3
2013-01-03 4 5 6 7
2013-01-04 8 9 10 11
2013-01-05 12 13 14 15
df.shift(freq=datetime.timedelta(-1))
A B C D
2012-12-31 0 1 2 3
2013-01-01 4 5 6 7
2013-01-02 8 9 10 11
2013-01-03 12 13 14 15

In addition to the above methods, there are: conditional processing of the query method, processing in the case of MultiIndex, get method, lookup method, etc.

3. More content

(updating..)

The knowledge points of the article match the official knowledge files, and you can further learn relevant knowledgePython entry skill treeStructured data analysis tool PandasPandas overview 298962 people are learning systematically