When analyzing and modeling data, you usually need to clean and organize it first, and sometimes add or delete fields. The following introduces Pandas' complex queries, data type conversion, sorting, modification, iteration, and function application.
01. Complex queries
Real business needs often require querying data by certain conditions, or even by complex combinations of conditions. This section introduces Pandas' flexible data filtering so you can access data however you like.
1. Logical operations
# Q1 score is greater than 36
df.Q1 > 36
# Q1 score is not less than 60, and the team is 'C'
~(df.Q1 < 60) & (df['team'] == 'C')
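As a quick sketch, these boolean expressions can be run against a tiny hypothetical frame (the names, teams, and scores below are made up for illustration):

```python
import pandas as pd

# hypothetical sample data shaped like the frame the text assumes
df = pd.DataFrame({
    'name': ['Ben', 'Eli', 'Ack'],
    'team': ['A', 'C', 'C'],
    'Q1':   [90, 75, 57],
})

# each comparison yields a boolean Series aligned with df's index
mask = ~(df.Q1 < 60) & (df['team'] == 'C')
print(mask.tolist())  # [False, True, False]
```

Only the second row has Q1 not less than 60 and team 'C', so only it is True.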
2. Logical filtering
Slicing ([ ]), .loc[ ], and .iloc[ ] all support the logical expressions described above.
The following is an example of logical filtering for slices ([ ]):
df[df['Q1'] == 8]     # Q1 is equal to 8
df[~(df['Q1'] == 8)]  # Q1 is not equal to 8
df[df.name == 'Ben']  # name is Ben
df[df.Q1 > df.Q2]     # Q1 is greater than Q2
Here are examples with .loc[ ] and .iloc[ ]:
# expressions are consistent with slicing
df.loc[df['Q1'] > 90, 'Q1':]         # Q1 greater than 90; show Q1 and all later columns
df.loc[(df.Q1 > 80) & (df.Q2 < 15)]  # "and" relationship
df.loc[(df.Q1 > 90) | (df.Q2 < 90)]  # "or" relationship
df.loc[df['Q1'] == 8]                # equal to 8
df.loc[df.Q1 == 8]                   # same as above
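A minimal sketch of both forms, using a made-up two-row frame:

```python
import pandas as pd

df = pd.DataFrame({'name': ['Ben', 'Tom'], 'Q1': [95, 60], 'Q2': [10, 90]})

# .loc accepts the boolean mask plus a column slice
sub = df.loc[(df.Q1 > 80) & (df.Q2 < 15), 'Q1':]
print(sub.Q1.tolist())                   # [95]

# plain slicing takes the same boolean mask
print(df[df.Q1 > df.Q2].name.tolist())   # ['Ben']
```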
3. Filtering with functions
# query the value at the largest index
df.Q1[lambda s: max(s.index)]  # the value is 21
# compute the largest index
max(df.Q1.index)               # 99
df.Q1[df.index == 99]          # the same row by boolean indexing
4. Comparison function
# the following is equivalent to df[df.Q1 == 60]
df[df.Q1.eq(60)]
df.ne()  # not equal, !=
df.le()  # less than or equal, <=
df.lt()  # less than, <
df.ge()  # greater than or equal, >=
df.gt()  # greater than, >
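These comparison methods return boolean Series interchangeable with the operators; a quick sketch on made-up scores:

```python
import pandas as pd

df = pd.DataFrame({'Q1': [60, 36, 60]})

print(df[df.Q1.eq(60)].shape[0])  # 2 rows equal 60
print(df.Q1.gt(50).tolist())      # [True, False, True]
```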
5. Query df.query()
df.query('Q1 > Q2 > 90')  # write a SQL-like where clause directly
It also supports the @ symbol for referencing variables:
# supports passing in variables, e.g. scores more than 40 above the mean
a = df.Q1.mean()
df.query('Q1 > @a + 40')
df.query('Q1 > `Q2` + @a')
df.eval() is similar to df.query() and can also be used for expression filtering.
# df.eval() usage is similar to df.query()
df[df.eval("Q1 > 90 > Q3 > 10")]
df[df.eval("Q1 > `Q2` + @a")]
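A runnable sketch of the @-variable form, with hypothetical scores:

```python
import pandas as pd

df = pd.DataFrame({'Q1': [96, 50, 88], 'Q2': [10, 20, 30]})

a = df.Q1.mean()           # 78.0
out = df.query('Q1 > @a')  # rows where Q1 is above the mean
print(out.Q1.tolist())     # [96, 88]
```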
6. Filter df.filter()
df.filter(items=['Q1', 'Q2'])  # select two columns
df.filter(regex='Q', axis=1)   # columns whose name contains Q
df.filter(regex='e$', axis=1)  # columns whose name ends with e
df.filter(regex='1$', axis=0)  # regex: index names ending with 1
df.filter(like='2', axis=0)    # index names containing 2
# index starts with 2 and column name contains Q
df.filter(regex='^2', axis=0).filter(like='Q', axis=1)
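A small sketch of label-based filtering, on a made-up frame:

```python
import pandas as pd

df = pd.DataFrame({'Q1': [1], 'Q2': [2], 'name': ['x']})

print(list(df.filter(regex='^Q', axis=1).columns))  # ['Q1', 'Q2']
print(list(df.filter(like='am', axis=1).columns))   # ['name']
```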
7. Query by data type
df.select_dtypes(include=['float64'])  # select float64 columns
df.select_dtypes(include='bool')
df.select_dtypes(include=['number'])   # only numeric types
df.select_dtypes(exclude=['int'])      # exclude int types
df.select_dtypes(exclude=['datetime64'])
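For example, on a hypothetical frame with mixed dtypes:

```python
import pandas as pd

df = pd.DataFrame({'a': [1.5], 'b': [2], 'c': ['x']})

print(list(df.select_dtypes(include=['number']).columns))  # ['a', 'b']
print(list(df.select_dtypes(exclude=['number']).columns))  # ['c']
```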
02. Data type conversion
Before starting data analysis, we need to assign the appropriate type to the data so that the data can be processed efficiently. Different data types are suitable for different processing methods.
# specify a unified type for all fields
df = pd.DataFrame(data, dtype='float32')
# specify a type for each field separately
df = pd.read_excel(data, dtype={'team': 'string', 'Q1': 'int32'})
1. Inferred type
# automatically convert to the appropriate data types
df.infer_objects()  # inferred DataFrame
df.infer_objects().dtypes
2. Specified type
# convert by general type
m = ['1', 2, 3]
pd.to_numeric(m)                    # convert to numbers
pd.to_datetime(m)                   # convert to datetimes
pd.to_timedelta(m)                  # convert to time differences
pd.to_datetime(m, errors='coerce')  # coerce unparsable values
pd.to_numeric(m, errors='ignore')
pd.to_numeric(m, errors='coerce').fillna(0)   # fill failures with 0
pd.to_datetime(df[['year', 'month', 'day']])  # combine into a date
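The errors='coerce' path can be sketched end to end, turning an unparsable string into NaN and then filling it:

```python
import pandas as pd

# 'x' cannot be parsed as a number, so coerce turns it into NaN
s = pd.to_numeric(pd.Series(['1', '2', 'x']), errors='coerce')
print(s.fillna(0).tolist())  # [1.0, 2.0, 0.0]
```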
3. Type conversion astype()
df.Q1.astype('int32').dtypes  # dtype('int32')
df.astype({'Q1': 'int32', 'Q2': 'int32'}).dtypes
4. Convert to time type
t = pd.Series(['20200801', '20200802'])
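Continuing the snippet above, pd.to_datetime() parses those strings into a datetime64 Series:

```python
import pandas as pd

t = pd.Series(['20200801', '20200802'])
d = pd.to_datetime(t)     # parses the yyyymmdd strings
print(d.dt.day.tolist())  # [1, 2]
print(str(d.dtype))
```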
03. Data sorting
Sorting rearranges data in a given order, which helps users discover trends, provides business clues, and also aids error correction and classification.
1. Index sorting df.sort_index()
s.sort_index()                 # ascending order
df.sort_index()                # df is also sorted by index
df.team.sort_index()
s.sort_index(ascending=False)  # descending order
s.sort_index(inplace=True)     # sort in place, changing the original data
# re-number the index from 0 to n-1; useful to get the sort rank
s.sort_index(ignore_index=True)
s.sort_index(na_position='first')  # nulls first; 'last' puts nulls at the end
s.sort_index(level=1)              # for a MultiIndex, sort by one level
s.sort_index(level=1, sort_remaining=False)  # don't sort the other levels
# column-axis sorting: arranges columns in order of their names
df.sort_index(axis=1)
2. Numerical sorting sort_values()
df.Q1.sort_values()
df.sort_values('Q4')
df.sort_values(by=['team', 'name'], ascending=[True, False])
Other methods:
s.sort_values(ascending=False)      # descending order
s.sort_values(inplace=True)         # modification takes effect
s.sort_values(na_position='first')  # nulls first
# sort df by the specified field
df.sort_values(by=['team'])
df.sort_values('Q1')
# by multiple fields: first by team, then by Q1 within the same team
df.sort_values(by=['team', 'Q1'])
# all descending
df.sort_values(by=['team', 'Q1'], ascending=False)
# team ascending, Q1 descending
df.sort_values(by=['team', 'Q1'], ascending=[True, False])
# re-number the index from 0 to n-1
df.sort_values('team', ignore_index=True)
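The per-column ascending list can be sketched on a tiny hypothetical frame:

```python
import pandas as pd

df = pd.DataFrame({'team': ['B', 'A', 'A'], 'Q1': [5, 9, 7]})

# team ascending, Q1 descending, index re-numbered from 0
out = df.sort_values(by=['team', 'Q1'], ascending=[True, False],
                     ignore_index=True)
print(out.Q1.tolist())  # [9, 7, 5]
print(list(out.index))  # [0, 1, 2]
```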
3. Mixed sort
df.set_index('name', inplace=True)     # set name as the index
df.index.names = ['s_name']            # name the index
df.sort_values(by=['s_name', 'team'])  # sort by index and column together
4. Sorting by value with nsmallest() and nlargest()
s.nsmallest(3)  # the smallest 3
s.nlargest(3)   # the largest 3
# specify columns
df.nlargest(3, 'Q1')
df.nlargest(5, ['Q1', 'Q2'])
df.nsmallest(5, ['Q1', 'Q2'])
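For example, on made-up scores:

```python
import pandas as pd

df = pd.DataFrame({'Q1': [3, 9, 7, 1]})

print(df.nlargest(2, 'Q1').Q1.tolist())  # [9, 7]
print(df.Q1.nsmallest(2).tolist())       # [1, 3]
```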
04. Add and modify
Data modification, addition and deletion often occur in the process of data organization. Modifications generally include modification of errors, format conversion, and data type modification.
1. Modify the value
df.iloc[0, 0]           # query the value
# 'Liver'
df.iloc[0, 0] = 'Lily'  # modify the value
df.iloc[0, 0]           # view the result
# 'Lily'
# set every value in rows where Q1 < 60 to 60
df[df.Q1 < 60] = 60
df.Q1                   # check
# generate a list of length 100
v = [1, 3, 5, 7, 9] * 20
2. Replace data
s.replace(0, 5)                         # change 0 in the Series to 5
df.replace(0, 5)                        # replace every 0 in df with 5
df.replace([0, 1, 2, 3], 4)             # replace 0 through 3 with 4
df.replace([0, 1, 2, 3], [4, 3, 2, 1])  # element-wise replacement
s.replace([1, 2], method='bfill')       # backfill
df.replace({0: 10, 1: 100})             # replace by dictionary
df.replace({'Q1': 0, 'Q2': 5}, 100)     # replace the given values in the given columns with 100
df.replace({'Q1': {0: 100, 4: 400}})    # in column Q1, replace 0 with 100 and 4 with 400
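The list and dictionary forms can be sketched on a small Series:

```python
import pandas as pd

s = pd.Series([0, 1, 2, 3])

print(s.replace([0, 1, 2, 3], [4, 3, 2, 1]).tolist())  # [4, 3, 2, 1]
print(s.replace({0: 10, 1: 100}).tolist())             # [10, 100, 2, 3]
```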
3. Fill empty values
df.fillna(0)  # replace all NaN with 0
# method: {'backfill', 'bfill', 'pad', 'ffill', None}, default None
df.fillna(method='ffill')  # fill each NaN with the previous value
values = {'A': 0, 'B': 1, 'C': 2, 'D': 3}
df.fillna(value=values)           # fill a different value per column
df.fillna(value=values, limit=1)  # only fill the first NaN per column
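A sketch of per-column filling with made-up data:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'A': [np.nan, 2.0], 'B': [np.nan, np.nan]})

out = df.fillna(value={'A': 0, 'B': 1})  # a different fill value per column
print(out.A.tolist())  # [0.0, 2.0]
print(out.B.tolist())  # [1.0, 1.0]
```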
4. Modify the index name
df.rename(columns={'team':'class'})
Common methods are as follows:
df.rename(columns={"Q1": "a", "Q2": "b"})  # rename columns
df.rename(index={0: "x", 1: "y", 2: "z"})  # rename index labels
df.rename(index=str)                       # change the index type
df.rename(str.lower, axis='columns')       # pass a function for column labels
df.rename({1: 2, 2: 4}, axis='index')      # rename index labels
# rename the axis itself
s.rename_axis("animal")
df.rename_axis("animal")                 # defaults to the row index name
df.rename_axis("limbs", axis="columns")  # name the column axis
# for a MultiIndex, rename the level 'type' to 'class'
df.rename_axis(index={'type': 'class'})
# labels can also be set with set_axis
s.set_axis(['a', 'b', 'c'], axis=0)
df.set_axis(['I', 'II'], axis='columns')
df.set_axis(['i', 'ii'], axis='columns', inplace=True)
5. Add columns
df['foo'] = 100                  # add a column foo; all values are 100
df['foo'] = df.Q1 + df.Q2        # the new column is the sum of two columns
df['foo'] = df['Q1'] + df['Q2']  # same as above
# sum all numeric columns
df['total'] = df.select_dtypes(include=['int']).sum(1)
df['total'] = df.loc[:, 'Q1':'Q4'].apply(lambda x: sum(x), axis='columns')
df.loc[:, 'Q10'] = "I'm new here"  # also works
# add a column and assign conditionally; unmatched rows get NaN
df.loc[df.num >= 60, 'grade'] = 'qualified'
df.loc[df.num < 60, 'grade'] = 'unqualified'
6. Insert column df.insert()
# insert a new column total at the third position; the value is each row's total score
df.insert(2, 'total', df.sum(1))
7. Assign columns with df.assign()
# add a total column
df.assign(total=df.sum(1))
# add two columns
df.assign(total=df.sum(1), Q=100)
df.assign(total=df.sum(1)).assign(Q=100)

Other usage examples:

df.assign(Q5=[100]*100)      # add a new column Q5
df = df.assign(Q5=[100]*100) # the assignment takes effect
df.assign(Q6=df.Q2/df.Q1)    # compute and add Q6
df.assign(Q7=lambda d: d.Q1 * 9 / 5 + 32)  # use a lambda
# add a column whose value is the expression result: True or False
df.assign(tag=df.Q1 > df.Q2)
# comparison then cast: True becomes 1, False becomes 0
df.assign(tag=(df.Q1 > df.Q2).astype(int))
# map booleans to labels
df.assign(tag=(df.Q1 > 60).map({True: 'pass', False: 'fail'}))
# add multiple columns; Q8 is not yet on df, so refer to it via the lambda, not df.Q8
df.assign(Q8=lambda d: d.Q1 * 5, Q9=lambda d: d.Q8 + 1)
8. Execute the expression df.eval()
# pass in a total-score expression
df.eval('total = Q1 + Q2 + Q3 + Q4')
Other methods:
df['C1'] = df.eval('Q2 + Q3')
df.eval('C2 = Q2 + Q3')           # calculation
a = df.Q1.mean()
df.eval("C3 = `Q3` + @a")         # use a variable
df.eval("C3 = Q2 > (`Q3` + @a)")  # add a boolean column
df.eval('C4 = name + team', inplace=True)  # takes effect immediately
9. Add rows
# add a row with index 100
df.loc[100] = ['tom', 'A', 88, 88, 88, 88]
Other methods:
df.loc[101] = {'Q1': 88, 'Q2': 99}  # specified columns; others get NaN
df.loc[df.shape[0] + 1] = {'Q1': 88, 'Q2': 99}  # grow the index automatically
df.loc[len(df) + 1] = {'Q1': 88, 'Q2': 99}
# batch operation via iteration
rows = [[1, 2], [3, 4], [5, 6]]
for row in rows:
    df.loc[len(df)] = row
10. Appending with df.append()
df = pd.DataFrame([[1, 2], [3, 4]], columns=list('AB'))
df2 = pd.DataFrame([[5, 6], [7, 8]], columns=list('AB'))
df.append(df2)
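Note that df.append() was deprecated and removed in pandas 2.0; pd.concat() is the replacement for the same example:

```python
import pandas as pd

df = pd.DataFrame([[1, 2], [3, 4]], columns=list('AB'))
df2 = pd.DataFrame([[5, 6], [7, 8]], columns=list('AB'))

# concatenate row-wise and re-number the index from 0
out = pd.concat([df, df2], ignore_index=True)
print(out.A.tolist())  # [1, 3, 5, 7]
```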
11. Delete
# remove and return the value at index 3
s.pop(3)
# 93
s  # the value is gone from s
12. Drop missing values
df.dropna()                # drop a row if it has any missing value
df.dropna(axis='columns')  # keep only columns with no missing values
df.dropna(how='all')       # drop only rows/columns that are entirely empty
df.dropna(thresh=2)        # keep rows with at least two non-NA values
df.dropna(inplace=True)    # drop and make it take effect in place
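The difference between the variants shows up on a small made-up frame:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'A': [1.0, np.nan, np.nan],
                   'B': [1.0, 2.0, np.nan]})

print(len(df.dropna()))           # 1: only the complete row survives
print(len(df.dropna(how='all')))  # 2: only the all-NaN row is dropped
print(len(df.dropna(thresh=2)))   # 1: rows need at least two non-NA values
```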
05. Advanced filtering
This section introduces several very useful filtering and output methods for complex data processing.
1. df.where()
# keep values greater than 70; everything else becomes NaN
df.where(df > 70)
2. np.where()
# below 60 is unqualified
np.where(df >= 60, 'qualified', 'unqualified')
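np.where() returns a plain NumPy array, as this sketch with made-up scores shows:

```python
import numpy as np
import pandas as pd

s = pd.Series([55, 80])
labels = np.where(s >= 60, 'qualified', 'unqualified')
print(labels.tolist())  # ['unqualified', 'qualified']
```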
3. df.mask()
# values meeting the condition become NaN
df.mask(s > 80)
4. df.lookup()
# same number of rows and columns; returns an array
df.lookup([1, 3, 4], ['Q1', 'Q2', 'Q3'])  # array([36, 96, 61])
df.lookup([1], ['Q1'])                    # array([36])
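df.lookup() was deprecated in pandas 1.2 and removed in 2.0; the documented replacement is NumPy fancy indexing over positional indices, sketched here on made-up data:

```python
import pandas as pd

df = pd.DataFrame({'Q1': [10, 36], 'Q2': [20, 40]})
rows, cols = [0, 1], ['Q2', 'Q1']

# translate labels to positions, then index the underlying array
ridx = df.index.get_indexer(rows)
cidx = df.columns.get_indexer(cols)
vals = df.to_numpy()[ridx, cidx]
print(vals.tolist())  # [20, 36]
```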
06. Data iteration
1. Iterate Series
# iterate over a specified column
for i in df.name:
    print(i)
# iterate over the index and two specified columns
for i, n, q in zip(df.index, df.name, df.Q1):
    print(i, n, q)
2. df.iterrows()
# iterate, using the name and Q1 data
for index, row in df.iterrows():
    print(index, row['name'], row.Q1)
3. df.itertuples()
for row in df.itertuples():
    print(row)
4. df.items()
# print the first three values of each Series
for label, ser in df.items():
    print(label)
    print(ser[:3], end='\n\n')
5. Iterate by column
# iterate over the DataFrame directly to get the column names
for column in df:
    print(column)
07. Function application
1. pipe()
Applies to the entire DataFrame or Series.
# applying multiple functions to df
f(g(h(df), arg1=a), arg2=b, arg3=c)
# use pipe to chain them
(df.pipe(h)
   .pipe(g, arg1=a)
   .pipe(f, arg2=b, arg3=c)
)
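A concrete sketch with two hypothetical helper functions (double and add are made up for illustration):

```python
import pandas as pd

df = pd.DataFrame({'Q1': [1, 2]})

def double(d):  # hypothetical helper
    return d * 2

def add(d, n):  # hypothetical helper
    return d + n

# same as add(double(df), n=1), but reads left to right
out = df.pipe(double).pipe(add, n=1)
print(out.Q1.tolist())  # [3, 5]
```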
2. apply()
Applies to rows or columns of the DataFrame, defaults to columns.
# change all names to lowercase
df.name.apply(lambda x: x.lower())
3. applymap()
Applied to each element of the DataFrame.
# calculate the length of each value
def mylen(x):
    return len(str(x))

df.applymap(lambda x: mylen(x))  # apply the function
df.applymap(mylen)               # same effect as above
4. map()
Applied to each element of a column of a Series or DataFrame.
# enumeration replacement
df.team.map({'A': 'Class One', 'B': 'Class Two',
             'C': 'Class Three', 'D': 'Class Four'})
df['name'].map(f)
5. agg()
# the maximum value of each column
df.agg('max')
# aggregate all columns, producing two rows: sum and min
df.agg(['sum', 'min'])
# multiple aggregations per column
df.agg({'Q1': ['sum', 'min'], 'Q2': ['min', 'max']})
# aggregate after grouping
df.groupby('team').agg('max')
df.Q1.agg(['sum', 'mean'])
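A runnable sketch of both the grouped and Series forms, on made-up data:

```python
import pandas as pd

df = pd.DataFrame({'team': ['A', 'A', 'B'], 'Q1': [1, 3, 5]})

print(df.groupby('team').agg('max').Q1.tolist())  # [3, 5]
print(df.Q1.agg(['sum', 'mean']).tolist())        # sum=9, mean=3.0
```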
6. transform()
df.transform(lambda x: x * 2)    # apply an anonymous function
df.transform([np.sqrt, np.exp])  # call multiple functions
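A common use worth sketching (an extension of the examples above): with groupby, transform keeps the original shape, so each row receives its group's aggregate:

```python
import pandas as pd

df = pd.DataFrame({'team': ['A', 'A', 'B'], 'Q1': [1, 3, 5]})

# every row gets the mean of its own team
df['team_avg'] = df.groupby('team').Q1.transform('mean')
print(df.team_avg.tolist())  # [2.0, 2.0, 5.0]
```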
7. copy()
s = pd.Series([1, 2], index=["a", "b"])
s_1 = s
s_copy = s.copy()
s_1 is s     # True
s_copy is s  # False