Article source: Learn Python for rookies
Hello everyone, I'm Brother Tao. Today I'll share a summary of advanced Pandas operations. The full text is about 3,000 words and takes roughly 8 minutes to read.
During data analysis and modeling, data needs to be cleaned and organized, and sometimes fields need to be added to or removed from the data. The following sections cover Pandas in detail: complex queries, data type conversion, sorting, modification, iteration, and function application.
01. Complex query
Actual business needs often require querying data by certain conditions, or even by complex combinations of conditions. Next, I will show how to take advantage of Pandas' flexible data filtering and work with data however you like.
1. Logical operations
df.Q1 > 36                            # Q1 score is greater than 36
(df.Q1 < 60) & (df['team'] == 'C')    # Q1 score is less than 60 and the member is in Group C
2. Logically filter data
Slicing ([ ]), .loc[ ] and .iloc[ ] all support the logical expressions introduced above.
The following is an example of logical filtering for slices ([ ]):
df[df['Q1'] == 8]      # Q1 equals 8
df[~(df['Q1'] == 8)]   # Q1 not equal to 8
df[df.name == 'Ben']   # name is Ben
df[df.Q1 > df.Q2]      # Q1 greater than Q2
Here are .loc[ ] and .iloc[ ] examples:
# The expressions are consistent with slicing
df.loc[df['Q1'] > 90, 'Q1':]          # Q1 greater than 90; show Q1 and all subsequent columns
df.loc[(df.Q1 > 80) & (df.Q2 < 15)]   # "and" relationship
df.loc[(df.Q1 > 90) | (df.Q2 < 90)]   # "or" relationship
df.loc[df['Q1'] == 8]                 # equal to 8
df.loc[df.Q1 == 8]                    # same as above
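As a quick runnable check of the combined conditions above, here is a minimal sketch on a made-up frame (the names and scores are illustrative, not the article's dataset):

```python
import pandas as pd

# Hypothetical sample data, just for illustration
df = pd.DataFrame({
    'name': ['Ann', 'Ben', 'Cat'],
    'team': ['A', 'C', 'C'],
    'Q1': [95, 36, 50],
    'Q2': [10, 20, 30],
})

# Each condition must be wrapped in parentheses before combining with & or |
high = df.loc[(df.Q1 > 80) & (df.Q2 < 15)]
c_team = df[(df.Q1 < 60) & (df['team'] == 'C')]

print(high['name'].tolist())    # ['Ann']
print(len(c_team))              # 2
```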
3. Function filtering
# Query the value at the maximum index
df.Q1[lambda s: max(s.index)]   # the value is 21
# Calculate the maximum index
max(df.Q1.index)                # 99
df.Q1[df.index == 99]
4. Comparison function
# The following is equivalent to df[df.Q1 == 60]
df[df.Q1.eq(60)]
df.ne()   # not equal              !=
df.le()   # less than or equal     <=
df.lt()   # less than              <
df.ge()   # greater than or equal  >=
df.gt()   # greater than           >
5. Query df.query()
df.query('Q1 > Q2 > 90')   # write an SQL-like where clause directly
It also supports using the @ character to reference variables:
# Variables can be passed in, e.g. scores more than 40 points above the mean
a = df.Q1.mean()
df.query('Q1 > @a + 40')
df.query('Q1 > `Q2` + @a')
df.eval() is similar to df.query() and can also be used for expression filtering.
# df.eval() usage is similar to df.query()
df[df.eval("Q1 > 90 > Q3 > 10")]
df[df.eval("Q1 > `Q2` + @a")]
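A small runnable sketch of query() with an @ variable, on toy data (the values are illustrative):

```python
import pandas as pd

# Toy data (hypothetical values)
df = pd.DataFrame({'Q1': [90, 50, 70], 'Q2': [30, 60, 65]})

a = df.Q1.mean()                   # 70.0
over_mean = df.query('Q1 > @a')    # rows where Q1 exceeds the mean
same = df[df.eval('Q1 > @a')]      # df.eval builds the same Boolean mask

print(over_mean.Q1.tolist())   # [90]
```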
6. Filter df.filter()
df.filter(items=['Q1', 'Q2'])   # select two columns
df.filter(regex='Q', axis=1)    # columns whose names contain Q
df.filter(regex='e$', axis=1)   # columns ending with e
df.filter(regex='1$', axis=0)   # regex: index names ending with 1
df.filter(like='2', axis=0)     # index contains 2
# Index starts with 2 and column names contain Q
df.filter(regex='^2', axis=0).filter(like='Q', axis=1)
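To see the regex variants in action, a minimal sketch on a hypothetical three-column frame:

```python
import pandas as pd

# Illustrative frame: two score columns and a name column
df = pd.DataFrame({'Q1': [1], 'Q2': [2], 'name': ['Ben']})

q_cols = df.filter(regex='^Q', axis=1).columns.tolist()   # names starting with Q
e_cols = df.filter(regex='e$', axis=1).columns.tolist()   # names ending with e

print(q_cols)  # ['Q1', 'Q2']
print(e_cols)  # ['name']
```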
7. Query by data type
df.select_dtypes(include=['float64'])   # select float64 columns
df.select_dtypes(include='bool')
df.select_dtypes(include=['number'])    # only numeric types
df.select_dtypes(exclude=['int'])       # exclude int type
df.select_dtypes(exclude=['datetime64'])
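A quick sketch of include/exclude on a mixed-type frame (made-up columns):

```python
import pandas as pd

# Hypothetical frame with int, float, and string columns
df = pd.DataFrame({'a': [1, 2], 'b': [1.5, 2.5], 'c': ['x', 'y']})

numeric = df.select_dtypes(include='number').columns.tolist()
non_int = df.select_dtypes(exclude=['int']).columns.tolist()

print(numeric)  # ['a', 'b']
print(non_int)  # ['b', 'c']
```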
02. Data type conversion
Before starting data analysis, we need to assign appropriate types to the data so that we can process the data efficiently. Different data types lend themselves to different processing methods.
# Specify one type for all fields
df = pd.DataFrame(data, dtype='float32')
# Specify each field separately
df = pd.read_excel(data, dtype={'team': 'string', 'Q1': 'int32'})
1. Inferred type
# Automatically convert to appropriate data types
df.infer_objects()          # inferred DataFrame
df.infer_objects().dtypes
2. Specify type
# Convert based on a general type
m = ['1', 2, 3]
pd.to_numeric(m)                 # convert to numbers
pd.to_datetime(m)                # convert to datetimes
pd.to_timedelta(m)               # convert to time deltas
pd.to_datetime(m, errors='coerce')   # invalid values become NaT
pd.to_numeric(m, errors='ignore')    # leave invalid values unchanged
pd.to_numeric(pd.Series(m), errors='coerce').fillna(0)   # coerce, then fill NaN with 0
pd.to_datetime(df[['year', 'month', 'day']])  # combine columns into a date
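The errors='coerce' plus fillna() pattern above can be sketched end to end on a small made-up list:

```python
import pandas as pd

# A mixed list where one value cannot be parsed as a number
m = ['1', '2', 'oops']

nums = pd.to_numeric(m, errors='coerce')   # 'oops' becomes NaN
filled = pd.Series(nums).fillna(0)         # then fill NaN with 0

print(filled.tolist())   # [1.0, 2.0, 0.0]
```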
3. Type conversion astype()
df.Q1.astype('int32').dtypes   # dtype('int32')
df.astype({'Q1': 'int32', 'Q2': 'int32'}).dtypes
4. Convert to time type
t = pd.Series(['20200801', '20200802'])
pd.to_datetime(t)   # convert to datetime64
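A runnable version of the conversion above, checking the parsed components:

```python
import pandas as pd

t = pd.Series(['20200801', '20200802'])
dt = pd.to_datetime(t)   # pandas recognizes the YYYYMMDD layout

print(dt.dt.month.tolist())  # [8, 8]
print(dt.dt.day.tolist())    # [1, 2]
```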
03. Data sorting
Data sorting rearranges data in a certain order. It helps users discover trends, provides business clues, and also helps correct and classify data.
1. Index sorting df.sort_index()
s.sort_index()                  # sort ascending
df.sort_index()                 # df is also sorted by index
df.team.sort_index()
s.sort_index(ascending=False)   # sort descending
s.sort_index(inplace=True)      # sort in place, modifying the original data
# Reset the index to 0..(n-1); very useful for getting sort positions
s.sort_index(ignore_index=True)
s.sort_index(na_position='first')   # nulls first; 'last' puts nulls last
s.sort_index(level=1)               # with a MultiIndex, sort one level
s.sort_index(level=1, sort_remaining=False)   # do not sort the remaining levels
# Column-axis sorting (sort the header)
df.sort_index(axis=1)   # sorts the columns by column name
2. Numeric sorting sort_values()
df.Q1.sort_values()
df.sort_values('Q4')
df.sort_values(by=['team', 'name'], ascending=[True, False])
Other methods:
s.sort_values(ascending=False)     # descending order
s.sort_values(inplace=True)        # sort in place
s.sort_values(na_position='first') # null values first
# Sort df by the specified field
df.sort_values(by=['team'])
df.sort_values('Q1')
# Sort by multiple fields: team first, then Q1 within the same team
df.sort_values(by=['team', 'Q1'])
# All descending
df.sort_values(by=['team', 'Q1'], ascending=False)
# team ascending, Q1 descending
df.sort_values(by=['team', 'Q1'], ascending=[True, False])
# Reset the index to 0..(n-1)
df.sort_values('team', ignore_index=True)
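The multi-key, mixed-direction sort above, sketched on toy data:

```python
import pandas as pd

# Illustrative data: sort by team ascending, then Q1 descending within each team
df = pd.DataFrame({'team': ['B', 'A', 'A'], 'Q1': [1, 3, 2]})
out = df.sort_values(by=['team', 'Q1'], ascending=[True, False], ignore_index=True)

print(out.team.tolist())  # ['A', 'A', 'B']
print(out.Q1.tolist())    # [3, 2, 1]
```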
3. Mixed sorting
df.set_index('name', inplace=True)     # set name as the index
df.index.names = ['s_name']            # name the index
df.sort_values(by=['s_name', 'team'])  # sort by index and column together
4. Sort nsmallest() and nlargest() by value size
s.nsmallest(3)   # the 3 smallest values
s.nlargest(3)    # the 3 largest values
# Specify columns
df.nlargest(3, 'Q1')
df.nlargest(5, ['Q1', 'Q2'])
df.nsmallest(5, ['Q1', 'Q2'])
04. Adding and modifying data
Modification, addition and deletion of data often happen during data organization. Modifications generally include correcting errors, converting formats, changing data types, and so on.
1. Modify the value
df.iloc[0, 0]            # query the value: 'Liver'
df.iloc[0, 0] = 'Lily'   # modify the value
df.iloc[0, 0]            # view the result: 'Lily'
# Change scores below 60 to 60
df[df.Q1 < 60] = 60
df.Q1   # check
# Generate a list of length 100
v = [1, 3, 5, 7, 9] * 20
2. Replace data
s.replace(0, 5)                      # replace 0 with 5 in the Series
df.replace(0, 5)                     # replace all 0s in the data with 5s
df.replace([0, 1, 2, 3], 4)          # replace all of 0-3 with 4
df.replace([0, 1, 2, 3], [4, 3, 2, 1])   # element-wise replacement
s.replace([1, 2], method='bfill')    # fill from the following value
df.replace({0: 10, 1: 100})          # dict-based replacement
df.replace({'Q1': 0, 'Q2': 5}, 100)  # replace the given values in the given columns with 100
df.replace({'Q1': {0: 100, 4: 400}}) # replace given values in a given column with other given values
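A compact runnable check of the scalar, list, and dict forms of replace(), on a made-up Series:

```python
import pandas as pd

s = pd.Series([0, 1, 2, 3])

single = s.replace(0, 5)                       # scalar replacement
paired = s.replace([0, 1, 2, 3], [4, 3, 2, 1]) # element-wise replacement
mapped = s.replace({0: 10, 1: 100})            # dict-based replacement

print(single.tolist())  # [5, 1, 2, 3]
print(paired.tolist())  # [4, 3, 2, 1]
print(mapped.tolist())  # [10, 100, 2, 3]
```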
3. Fill in empty values
df.fillna(0)   # change all null values to 0
# method: {'backfill', 'bfill', 'pad', 'ffill', None}, default is None
df.fillna(method='ffill')   # fill nulls with the previous value
values = {'A': 0, 'B': 1, 'C': 2, 'D': 3}
df.fillna(value=values)            # fill a different value per column
df.fillna(value=values, limit=1)   # only fill the first null per column
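The per-column fill dictionary can be sketched like this (hypothetical data):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'A': [np.nan, 2.0], 'B': [np.nan, np.nan]})

out = df.fillna({'A': 0, 'B': 1})   # a different fill value for each column

print(out.A.tolist())  # [0.0, 2.0]
print(out.B.tolist())  # [1.0, 1.0]
```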
4. Modify the index name
df.rename(columns={'team':'class'})
Commonly used methods are as follows:
df.rename(columns={"Q1": "a", "Q2": "b"})   # rename columns
df.rename(index={0: "x", 1: "y", 2: "z"})   # rename index entries
df.rename(index=str)                        # convert the index type
df.rename(str.lower, axis='columns')        # pass a function for the column axis
df.rename({1: 2, 2: 4}, axis='index')       # rename index entries
s.rename_axis("animal")
df.rename_axis("animal")                    # defaults to the row index name
df.rename_axis("limbs", axis="columns")     # specify the column axis
# With a MultiIndex, rename a level, e.g. type -> class
df.rename_axis(index={'type': 'class'})
# set_axis can also be used to set labels
s.set_axis(['a', 'b', 'c'], axis=0)
df.set_axis(['I', 'II'], axis='columns')
df.set_axis(['i', 'ii'], axis='columns', inplace=True)
5. Add columns
df['foo'] = 100                  # add a foo column, all values 100
df['foo'] = df.Q1 + df.Q2        # the new column is the sum of the two columns
df['foo'] = df['Q1'] + df['Q2']  # same as above
# Sum all integer columns
df['total'] = df.select_dtypes(include=['int']).sum(1)
df['total'] = df.loc[:, 'Q1':'Q4'].apply(lambda x: sum(x), axis='columns')
df.loc[:, 'Q10'] = "I'm new here"   # also works
# Add a column and assign values; rows not matching the condition get NaN
df.loc[df.num >= 60, 'score'] = 'qualified'
df.loc[df.num < 60, 'score'] = 'unqualified'
6. Insert column df.insert()
# Insert a new total column in the third position; values are each row's total score
df.insert(2, 'total', df.sum(1))
7. Specify column df.assign()
# Add a total column
df.assign(total=df.sum(1))
# Add two columns
df.assign(total=df.sum(1), Q=100)
df.assign(total=df.sum(1)).assign(Q=100)
Other usage examples:
df.assign(Q5=[100]*100)        # add a new column Q5
df = df.assign(Q5=[100]*100)   # assignment takes effect
df.assign(Q6=df.Q2/df.Q1)      # compute and add Q6
df.assign(Q7=lambda d: d.Q1 * 9 / 5 + 32)   # use a lambda
# Add a column whose values are the expression result: True or False
df.assign(tag=df.Q1 > df.Q2)
# Comparison as numbers: True is 1, False is 0
df.assign(tag=(df.Q1 > df.Q2).astype(int))
# Map to labels
df.assign(tag=(df.Q1 > 60).map({True: 'passed', False: 'failed'}))
# Add multiple columns; Q8 is not saved back to df, so df.Q8 cannot be used directly afterwards
df.assign(Q8=lambda d: d.Q1 * 5, Q9=lambda d: d.Q8 + 1)
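Because assign() returns a new frame rather than modifying df, the behavior can be verified on toy data:

```python
import pandas as pd

df = pd.DataFrame({'Q1': [50, 100], 'Q2': [80, 40]})

out = df.assign(
    total=df.Q1 + df.Q2,
    tag=lambda d: (d.Q1 > d.Q2).astype(int),  # the lambda receives the frame being built
)

print(out.total.tolist())     # [130, 140]
print(out.tag.tolist())       # [0, 1]
print('total' in df.columns)  # False — assign returned a new frame
```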
8. Execute expression df.eval()
# Pass in a total-score expression
df.eval('total = Q1 + Q2 + Q3 + Q4')
Other methods:
df['C1'] = df.eval('Q2 + Q3')
df.eval('C2 = Q2 + Q3')               # calculation
a = df.Q1.mean()
df.eval("C3 = `Q3` + @a")             # use a variable
df.eval("C3 = Q2 > (`Q3` + @a)")      # add a Boolean column
df.eval('C4 = name + team', inplace=True)   # takes effect immediately
9. Add rows
# Add a new row with index 100
df.loc[100] = ['tom', 'A', 88, 88, 88, 88]
Other methods:
df.loc[101] = {'Q1': 88, 'Q2': 99}   # specify columns; unlisted columns become NaN
df.loc[df.shape[0] + 1] = {'Q1': 88, 'Q2': 99}   # auto-increment the index
df.loc[len(df) + 1] = {'Q1': 88, 'Q2': 99}
# Batch operations can use iteration
rows = [[1, 2], [3, 4], [5, 6]]
for row in rows:
    df.loc[len(df)] = row
10. Additional merge
df = pd.DataFrame([[1, 2], [3, 4]], columns=list('AB'))
df2 = pd.DataFrame([[5, 6], [7, 8]], columns=list('AB'))
df.append(df2)
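Note that DataFrame.append was deprecated in pandas 1.4 and removed in 2.0; pd.concat is its replacement. The same merge, sketched with concat:

```python
import pandas as pd

df = pd.DataFrame([[1, 2], [3, 4]], columns=list('AB'))
df2 = pd.DataFrame([[5, 6], [7, 8]], columns=list('AB'))

# Stack the rows and renumber the index from 0
out = pd.concat([df, df2], ignore_index=True)

print(out.A.tolist())  # [1, 3, 5, 7]
print(len(out))        # 4
```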
11. Delete
# Delete the item with index 3
s.pop(3)   # returns the removed value, e.g. 93
s          # view the result
12. Delete null values
df.dropna()                 # drop rows that contain any missing value
df.dropna(axis='columns')   # keep only columns with no missing values
df.dropna(how='all')        # drop only rows/columns that are entirely missing
df.dropna(thresh=2)         # keep only rows with at least two non-missing values
df.dropna(inplace=True)     # drop and make the change effective
05. Advanced filtering
Here are several very useful data filtering and output methods for complex data processing.
1. df.where()
# Keep values greater than 70; everything else becomes NaN
df.where(df > 70)
2. np.where()
# Below 60 counts as a fail
np.where(df >= 60, 'Qualified', 'Unqualified')
3. df.mask()
# Values that meet the condition become NaN
df.mask(s > 80)
4. df.lookup()
# With equal numbers of row and column labels, returns an array
df.lookup([1, 3, 4], ['Q1', 'Q2', 'Q3'])   # array([36, 96, 61])
df.lookup([1], ['Q1'])                     # array([36])
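df.lookup() was deprecated in pandas 1.2 and removed in 2.0; the documented replacement is get_indexer plus NumPy fancy indexing. A sketch on made-up scores:

```python
import pandas as pd

# Hypothetical scores; we look up the pairs (1, 'Q1') and (2, 'Q2')
df = pd.DataFrame({'Q1': [10, 36, 20], 'Q2': [1, 2, 3]})

rows, cols = [1, 2], ['Q1', 'Q2']
vals = df.to_numpy()[df.index.get_indexer(rows), df.columns.get_indexer(cols)]

print(vals.tolist())  # [36, 3]
```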
06. Data iteration
1. Iteration Series
# Iterate over a specific column
for i in df.name:
    print(i)
# Iterate over the index and two specific columns
for i, n, q in zip(df.index, df.name, df.Q1):
    print(i, n, q)
2. df.iterrows()
# Iterate, using the name and Q1 data
for index, row in df.iterrows():
    print(index, row['name'], row.Q1)
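Each iteration yields the index plus the row as a Series, so both bracket and attribute access work. A runnable sketch on toy data:

```python
import pandas as pd

df = pd.DataFrame({'name': ['Ann', 'Ben'], 'Q1': [90, 36]})

pairs = []
for index, row in df.iterrows():
    # row is a Series; row['name'] and row.Q1 are equivalent access styles
    pairs.append((int(index), row['name'], int(row.Q1)))

print(pairs)  # [(0, 'Ann', 90), (1, 'Ben', 36)]
```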
3. df.itertuples()
for row in df.itertuples():
    print(row)
4. df.items()
# Take the first three items of each column's Series
for label, ser in df.items():
    print(label)
    print(ser[:3], end='\n\n')
5. Iterate by column
# Iterating over a DataFrame directly yields the column names
for column in df:
    print(column)
07. Function application
1. pipe()
Applies to the entire DataFrame or Series.
# Apply multiple functions to df
f(g(h(df), arg1=a), arg2=b, arg3=c)
# Use pipe to chain them
(df.pipe(h)
   .pipe(g, arg1=a)
   .pipe(f, arg2=b, arg3=c)
)
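A concrete runnable version of the chain, with two hypothetical helper functions standing in for g and h:

```python
import pandas as pd

# Hypothetical helpers for the chain
def double(df):
    return df * 2

def add(df, n):
    return df + n

df = pd.DataFrame({'x': [1, 2]})
out = df.pipe(double).pipe(add, n=3)   # same as add(double(df), n=3)

print(out.x.tolist())  # [5, 7]
```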
2. apply()
Applies to the rows or columns of a DataFrame; defaults to columns.
# Convert all names to lowercase
df.name.apply(lambda x: x.lower())
3. applymap()
Applied to each element of the DataFrame.
# Calculate the length of each value
def mylen(x):
    return len(str(x))

df.applymap(lambda x: mylen(x))   # apply the function
df.applymap(mylen)                # same effect as above
4. map()
Applies to each element of a column in a Series or DataFrame.
# Enumeration replacement
df.team.map({'A': 'Class 1', 'B': 'Class 2', 'C': 'Class 3', 'D': 'Class 4'})
df['name'].map(f)
5. agg()
df.agg('max')   # maximum of each column
# Aggregate all columns, producing sum and min rows
df.agg(['sum', 'min'])
# Multiple aggregations per column
df.agg({'Q1': ['sum', 'min'], 'Q2': ['min', 'max']})
# Aggregation after grouping
df.groupby('team').agg('max')
df.Q1.agg(['sum', 'mean'])
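A small runnable check of grouped and per-column aggregation (illustrative data):

```python
import pandas as pd

df = pd.DataFrame({'team': ['A', 'A', 'B'], 'Q1': [1, 2, 3], 'Q2': [9, 7, 8]})

per_team = df.groupby('team').agg('max')       # one row per team
mixed = df.agg({'Q1': ['sum'], 'Q2': ['min', 'max']})

print(per_team.Q1.tolist())          # [2, 3]
print(int(mixed.loc['sum', 'Q1']))   # 6
```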
6. transform()
df.transform(lambda x: x * 2)     # apply an anonymous function
df.transform([np.sqrt, np.exp])   # call multiple functions
7. copy()
s = pd.Series([1, 2], index=["a", "b"])
s_1 = s
s_copy = s.copy()
s_1 is s      # True
s_copy is s   # False
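The practical consequence is that mutating a copy leaves the original untouched, while a plain assignment is just another name for the same object:

```python
import pandas as pd

s = pd.Series([1, 2], index=['a', 'b'])
s_1 = s            # another name for the same object
s_copy = s.copy()  # an independent copy

s_copy['a'] = 99   # mutating the copy...

print(int(s['a']))   # 1  ...leaves the original unchanged
print(s_1 is s)      # True
print(s_copy is s)   # False
```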