Article source: Learn Python for rookies
Hello everyone, I'm Brother Tao. Today I'll share a summary of advanced Pandas operations. The full text is about 3,000 words and takes roughly 8 minutes to read.
During data analysis and modeling, data needs to be cleaned and organized, and sometimes fields need to be added to or removed from the data. The following sections cover Pandas in detail: complex queries, data type conversion, sorting, modification, iteration, and function application.
01. Complex query
Actual business needs often require querying data by certain conditions, or even by complex combinations of conditions. Next, I will show how to take advantage of Pandas' flexible data filtering and work with data however you like.
1. Logical operations
df.Q1 > 36                            # Q1 score is greater than 36
(df.Q1 < 60) & (df['team'] == 'C')    # Q1 score is less than 60 and the member is in Group C
2. Logically filter data
Slicing ([ ]), .loc[ ] and .iloc[ ] all support the logical expressions introduced above.
The following is an example of logical filtering for slices ([ ]):
df[df['Q1'] == 8]      # Q1 equals 8
df[~(df['Q1'] == 8)]   # Q1 not equal to 8
df[df.name == 'Ben']   # name is Ben
df[df.Q1 > df.Q2]      # Q1 greater than Q2
Here are .loc[ ] and .iloc[ ] examples:
# The expressions are consistent with slicing
df.loc[df['Q1'] > 90, 'Q1':]          # Q1 greater than 90; show Q1 and all subsequent columns
df.loc[(df.Q1 > 80) & (df.Q2 < 15)]   # "and" relationship
df.loc[(df.Q1 > 90) | (df.Q2 < 90)]   # "or" relationship
df.loc[df['Q1'] == 8]                 # equal to 8
df.loc[df.Q1 == 8]                    # same as above
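As a quick runnable check of the combined conditions above, here is a minimal sketch on a made-up frame (the names and scores are illustrative, not the article's dataset):

```python
import pandas as pd

# Hypothetical sample data, just for illustration
df = pd.DataFrame({
    'name': ['Ann', 'Ben', 'Cat'],
    'team': ['A', 'C', 'C'],
    'Q1': [95, 36, 50],
    'Q2': [10, 20, 30],
})

# Each condition must be wrapped in parentheses before combining with & or |
high = df.loc[(df.Q1 > 80) & (df.Q2 < 15)]
c_team = df[(df.Q1 < 60) & (df['team'] == 'C')]

print(high['name'].tolist())    # ['Ann']
print(len(c_team))              # 2
```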
3. Function filtering
# Query the value at the maximum index
df.Q1[lambda s: max(s.index)]   # the value is 21
# Calculate the maximum index
max(df.Q1.index)                # 99
df.Q1[df.index == 99]
4. Comparison function
# The following is equivalent to df[df.Q1 == 60]
df[df.Q1.eq(60)]
df.ne()   # not equal              !=
df.le()   # less than or equal     <=
df.lt()   # less than              <
df.ge()   # greater than or equal  >=
df.gt()   # greater than           >
5. Query df.query()
df.query('Q1 > Q2 > 90')   # write an SQL-like where clause directly
It also supports using the @ character to reference variables:
# Variables can be passed in, e.g. scores more than 40 points above the mean
a = df.Q1.mean()
df.query('Q1 > @a + 40')
df.query('Q1 > `Q2` + @a')
df.eval() is similar to df.query() and can also be used for expression filtering.
# df.eval() usage is similar to df.query()
df[df.eval("Q1 > 90 > Q3 > 10")]
df[df.eval("Q1 > `Q2` + @a")]
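A small runnable sketch of query() with an @ variable, on toy data (the values are illustrative):

```python
import pandas as pd

# Toy data (hypothetical values)
df = pd.DataFrame({'Q1': [90, 50, 70], 'Q2': [30, 60, 65]})

a = df.Q1.mean()                   # 70.0
over_mean = df.query('Q1 > @a')    # rows where Q1 exceeds the mean
same = df[df.eval('Q1 > @a')]      # df.eval builds the same Boolean mask

print(over_mean.Q1.tolist())   # [90]
```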
6. Filter df.filter()
df.filter(items=['Q1', 'Q2'])   # select two columns
df.filter(regex='Q', axis=1)    # columns whose names contain Q
df.filter(regex='e$', axis=1)   # columns ending with e
df.filter(regex='1$', axis=0)   # regex: index names ending with 1
df.filter(like='2', axis=0)     # index contains 2
# Index starts with 2 and column names contain Q
df.filter(regex='^2', axis=0).filter(like='Q', axis=1)
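To see the regex variants in action, a minimal sketch on a hypothetical three-column frame:

```python
import pandas as pd

# Illustrative frame: two score columns and a name column
df = pd.DataFrame({'Q1': [1], 'Q2': [2], 'name': ['Ben']})

q_cols = df.filter(regex='^Q', axis=1).columns.tolist()   # names starting with Q
e_cols = df.filter(regex='e$', axis=1).columns.tolist()   # names ending with e

print(q_cols)  # ['Q1', 'Q2']
print(e_cols)  # ['name']
```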
7. Query by data type
df.select_dtypes(include=['float64'])   # select float64 columns
df.select_dtypes(include='bool')
df.select_dtypes(include=['number'])    # only numeric types
df.select_dtypes(exclude=['int'])       # exclude int type
df.select_dtypes(exclude=['datetime64'])
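A quick sketch of include/exclude on a mixed-type frame (made-up columns):

```python
import pandas as pd

# Hypothetical frame with int, float, and string columns
df = pd.DataFrame({'a': [1, 2], 'b': [1.5, 2.5], 'c': ['x', 'y']})

numeric = df.select_dtypes(include='number').columns.tolist()
non_int = df.select_dtypes(exclude=['int']).columns.tolist()

print(numeric)  # ['a', 'b']
print(non_int)  # ['b', 'c']
```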
02. Data type conversion
Before starting data analysis, we need to assign appropriate types to the data so that we can process the data efficiently. Different data types lend themselves to different processing methods.
# Specify one type for all fields
df = pd.DataFrame(data, dtype='float32')
# Specify each field separately
df = pd.read_excel(data, dtype={'team': 'string', 'Q1': 'int32'})
1. Inferred type
# Automatically convert to appropriate data types
df.infer_objects()          # inferred DataFrame
df.infer_objects().dtypes
2. Specify type
# Convert based on a general type
m = ['1', 2, 3]
pd.to_numeric(m)                 # convert to numbers
pd.to_datetime(m)                # convert to datetimes
pd.to_timedelta(m)               # convert to time deltas
pd.to_datetime(m, errors='coerce')   # invalid values become NaT
pd.to_numeric(m, errors='ignore')    # leave invalid values unchanged
pd.to_numeric(pd.Series(m), errors='coerce').fillna(0)   # coerce, then fill NaN with 0
pd.to_datetime(df[['year', 'month', 'day']])  # combine columns into a date
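The errors='coerce' plus fillna() pattern above can be sketched end to end on a small made-up list:

```python
import pandas as pd

# A mixed list where one value cannot be parsed as a number
m = ['1', '2', 'oops']

nums = pd.to_numeric(m, errors='coerce')   # 'oops' becomes NaN
filled = pd.Series(nums).fillna(0)         # then fill NaN with 0

print(filled.tolist())   # [1.0, 2.0, 0.0]
```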
3. Type conversion astype()
df.Q1.astype('int32').dtypes   # dtype('int32')
df.astype({'Q1': 'int32', 'Q2': 'int32'}).dtypes
4. Convert to time type
t = pd.Series(['20200801', '20200802'])
pd.to_datetime(t)   # convert to datetime64
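A runnable version of the conversion above, checking the parsed components:

```python
import pandas as pd

t = pd.Series(['20200801', '20200802'])
dt = pd.to_datetime(t)   # pandas recognizes the YYYYMMDD layout

print(dt.dt.month.tolist())  # [8, 8]
print(dt.dt.day.tolist())    # [1, 2]
```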
03. Data sorting
Data sorting rearranges data in a certain order. It helps users discover trends, provides business clues, and also helps correct and classify data.
1. Index sorting df.sort_index()
s.sort_index()                  # sort ascending
df.sort_index()                 # df is also sorted by index
df.team.sort_index()
s.sort_index(ascending=False)   # sort descending
s.sort_index(inplace=True)      # sort in place, modifying the original data
# Reset the index to 0..(n-1); very useful for getting sort positions
s.sort_index(ignore_index=True)
s.sort_index(na_position='first')   # nulls first; 'last' puts nulls last
s.sort_index(level=1)               # with a MultiIndex, sort one level
s.sort_index(level=1, sort_remaining=False)   # do not sort the remaining levels
# Column-axis sorting (sort the header)
df.sort_index(axis=1)   # sorts the columns by column name
2. Numeric sorting sort_values()
df.Q1.sort_values()
df.sort_values('Q4')
df.sort_values(by=['team', 'name'], ascending=[True, False])
Other methods:
s.sort_values(ascending=False)     # descending order
s.sort_values(inplace=True)        # sort in place
s.sort_values(na_position='first') # null values first
# Sort df by the specified field
df.sort_values(by=['team'])
df.sort_values('Q1')
# Sort by multiple fields: team first, then Q1 within the same team
df.sort_values(by=['team', 'Q1'])
# All descending
df.sort_values(by=['team', 'Q1'], ascending=False)
# team ascending, Q1 descending
df.sort_values(by=['team', 'Q1'], ascending=[True, False])
# Reset the index to 0..(n-1)
df.sort_values('team', ignore_index=True)
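The multi-key, mixed-direction sort above, sketched on toy data:

```python
import pandas as pd

# Illustrative data: sort by team ascending, then Q1 descending within each team
df = pd.DataFrame({'team': ['B', 'A', 'A'], 'Q1': [1, 3, 2]})
out = df.sort_values(by=['team', 'Q1'], ascending=[True, False], ignore_index=True)

print(out.team.tolist())  # ['A', 'A', 'B']
print(out.Q1.tolist())    # [3, 2, 1]
```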
3. Mixed sorting
df.set_index('name', inplace=True)     # set name as the index
df.index.names = ['s_name']            # name the index
df.sort_values(by=['s_name', 'team'])  # sort by index and column together
4. Sort nsmallest() and nlargest() by value size
s.nsmallest(3)   # the 3 smallest values
s.nlargest(3)    # the 3 largest values
# Specify columns
df.nlargest(3, 'Q1')
df.nlargest(5, ['Q1', 'Q2'])
df.nsmallest(5, ['Q1', 'Q2'])
04. Adding and modifying data
Modification, addition and deletion of data often happen during data organization. Modifications generally include correcting errors, converting formats, changing data types, and so on.
1. Modify the value
df.iloc[0, 0]            # query the value: 'Liver'
df.iloc[0, 0] = 'Lily'   # modify the value
df.iloc[0, 0]            # view the result: 'Lily'
# Change scores below 60 to 60
df[df.Q1 < 60] = 60
df.Q1   # check
# Generate a list of length 100
v = [1, 3, 5, 7, 9] * 20
2. Replace data
s.replace(0, 5)                      # replace 0 with 5 in the Series
df.replace(0, 5)                     # replace all 0s in the data with 5s
df.replace([0, 1, 2, 3], 4)          # replace all of 0-3 with 4
df.replace([0, 1, 2, 3], [4, 3, 2, 1])   # element-wise replacement
s.replace([1, 2], method='bfill')    # fill from the following value
df.replace({0: 10, 1: 100})          # dict-based replacement
df.replace({'Q1': 0, 'Q2': 5}, 100)  # replace the given values in the given columns with 100
df.replace({'Q1': {0: 100, 4: 400}}) # replace given values in a given column with other given values
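A compact runnable check of the scalar, list, and dict forms of replace(), on a made-up Series:

```python
import pandas as pd

s = pd.Series([0, 1, 2, 3])

single = s.replace(0, 5)                       # scalar replacement
paired = s.replace([0, 1, 2, 3], [4, 3, 2, 1]) # element-wise replacement
mapped = s.replace({0: 10, 1: 100})            # dict-based replacement

print(single.tolist())  # [5, 1, 2, 3]
print(paired.tolist())  # [4, 3, 2, 1]
print(mapped.tolist())  # [10, 100, 2, 3]
```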
3. Fill in empty values
df.fillna(0)   # change all null values to 0
# method: {'backfill', 'bfill', 'pad', 'ffill', None}, default is None
df.fillna(method='ffill')   # fill nulls with the previous value
values = {'A': 0, 'B': 1, 'C': 2, 'D': 3}
df.fillna(value=values)            # fill a different value per column
df.fillna(value=values, limit=1)   # only fill the first null per column
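The per-column fill dictionary can be sketched like this (hypothetical data):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'A': [np.nan, 2.0], 'B': [np.nan, np.nan]})

out = df.fillna({'A': 0, 'B': 1})   # a different fill value for each column

print(out.A.tolist())  # [0.0, 2.0]
print(out.B.tolist())  # [1.0, 1.0]
```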
4. Modify the index name
df.rename(columns={'team':'class'})
Commonly used methods are as follows:
df.rename(columns={"Q1": "a", "Q2": "b"})   # rename columns
df.rename(index={0: "x", 1: "y", 2: "z"})   # rename index entries
df.rename(index=str)                        # convert the index type
df.rename(str.lower, axis='columns')        # pass a function for the column axis
df.rename({1: 2, 2: 4}, axis='index')       # rename index entries
s.rename_axis("animal")
df.rename_axis("animal")                    # defaults to the row index name
df.rename_axis("limbs", axis="columns")     # specify the column axis
# With a MultiIndex, rename a level, e.g. type -> class
df.rename_axis(index={'type': 'class'})
# set_axis can also be used to set labels
s.set_axis(['a', 'b', 'c'], axis=0)
df.set_axis(['I', 'II'], axis='columns')
df.set_axis(['i', 'ii'], axis='columns', inplace=True)
5. Add columns
df['foo'] = 100                  # add a foo column, all values 100
df['foo'] = df.Q1 + df.Q2        # the new column is the sum of the two columns
df['foo'] = df['Q1'] + df['Q2']  # same as above
# Sum all integer columns
df['total'] = df.select_dtypes(include=['int']).sum(1)
df['total'] = df.loc[:, 'Q1':'Q4'].apply(lambda x: sum(x), axis='columns')
df.loc[:, 'Q10'] = "I'm new here"   # also works
# Add a column and assign values; rows not matching the condition get NaN
df.loc[df.num >= 60, 'score'] = 'qualified'
df.loc[df.num < 60, 'score'] = 'unqualified'
6. Insert column df.insert()
# Insert a new total column in the third position; values are each row's total score
df.insert(2, 'total', df.sum(1))
7. Specify column df.assign()
# Add a total column
df.assign(total=df.sum(1))
# Add two columns
df.assign(total=df.sum(1), Q=100)
df.assign(total=df.sum(1)).assign(Q=100)
Other usage examples:
df.assign(Q5=[100]*100)        # add a new column Q5
df = df.assign(Q5=[100]*100)   # assignment takes effect
df.assign(Q6=df.Q2/df.Q1)      # compute and add Q6
df.assign(Q7=lambda d: d.Q1 * 9 / 5 + 32)   # use a lambda
# Add a column whose values are the expression result: True or False
df.assign(tag=df.Q1 > df.Q2)
# Comparison as numbers: True is 1, False is 0
df.assign(tag=(df.Q1 > df.Q2).astype(int))
# Map to labels
df.assign(tag=(df.Q1 > 60).map({True: 'passed', False: 'failed'}))
# Add multiple columns; Q8 is not saved back to df, so df.Q8 cannot be used directly afterwards
df.assign(Q8=lambda d: d.Q1 * 5, Q9=lambda d: d.Q8 + 1)
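Because assign() returns a new frame rather than modifying df, the behavior can be verified on toy data:

```python
import pandas as pd

df = pd.DataFrame({'Q1': [50, 100], 'Q2': [80, 40]})

out = df.assign(
    total=df.Q1 + df.Q2,
    tag=lambda d: (d.Q1 > d.Q2).astype(int),  # the lambda receives the frame being built
)

print(out.total.tolist())     # [130, 140]
print(out.tag.tolist())       # [0, 1]
print('total' in df.columns)  # False — assign returned a new frame
```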
8. Execute expression df.eval()
# Pass in a total-score expression
df.eval('total = Q1 + Q2 + Q3 + Q4')
Other methods:
df['C1'] = df.eval('Q2 + Q3')
df.eval('C2 = Q2 + Q3')               # calculation
a = df.Q1.mean()
df.eval("C3 = `Q3` + @a")             # use a variable
df.eval("C3 = Q2 > (`Q3` + @a)")      # add a Boolean column
df.eval('C4 = name + team', inplace=True)   # takes effect immediately
9. Add rows
# Add a new row with index 100
df.loc[100] = ['tom', 'A', 88, 88, 88, 88]
Other methods:
df.loc[101] = {'Q1': 88, 'Q2': 99}   # specify columns; unlisted columns become NaN
df.loc[df.shape[0] + 1] = {'Q1': 88, 'Q2': 99}   # auto-increment the index
df.loc[len(df) + 1] = {'Q1': 88, 'Q2': 99}
# Batch operations can use iteration
rows = [[1, 2], [3, 4], [5, 6]]
for row in rows:
    df.loc[len(df)] = row
10. Additional merge
df = pd.DataFrame([[1, 2], [3, 4]], columns=list('AB'))
df2 = pd.DataFrame([[5, 6], [7, 8]], columns=list('AB'))
df.append(df2)
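Note that DataFrame.append was deprecated in pandas 1.4 and removed in 2.0; pd.concat is its replacement. The same merge, sketched with concat:

```python
import pandas as pd

df = pd.DataFrame([[1, 2], [3, 4]], columns=list('AB'))
df2 = pd.DataFrame([[5, 6], [7, 8]], columns=list('AB'))

# Stack the rows and renumber the index from 0
out = pd.concat([df, df2], ignore_index=True)

print(out.A.tolist())  # [1, 3, 5, 7]
print(len(out))        # 4
```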
11. Delete
# Delete the item with index 3
s.pop(3)   # returns the removed value, e.g. 93
s          # view the result
12. Delete null values
df.dropna()                 # drop rows that contain any missing value
df.dropna(axis='columns')   # keep only columns with no missing values
df.dropna(how='all')        # drop only rows/columns that are entirely missing
df.dropna(thresh=2)         # keep only rows with at least two non-missing values
df.dropna(inplace=True)     # drop and make the change effective
05. Advanced filtering
Here are several very useful data filtering and output methods for complex data processing.
1. df.where()
# Keep values greater than 70; everything else becomes NaN
df.where(df > 70)
2. np.where()
# Below 60 counts as a fail
np.where(df >= 60, 'Qualified', 'Unqualified')
3. df.mask()
# Values that meet the condition become NaN
df.mask(s > 80)
4. df.lookup()
# With equal numbers of row and column labels, returns an array
df.lookup([1, 3, 4], ['Q1', 'Q2', 'Q3'])   # array([36, 96, 61])
df.lookup([1], ['Q1'])                     # array([36])
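df.lookup() was deprecated in pandas 1.2 and removed in 2.0; the documented replacement is get_indexer plus NumPy fancy indexing. A sketch on made-up scores:

```python
import pandas as pd

# Hypothetical scores; we look up the pairs (1, 'Q1') and (2, 'Q2')
df = pd.DataFrame({'Q1': [10, 36, 20], 'Q2': [1, 2, 3]})

rows, cols = [1, 2], ['Q1', 'Q2']
vals = df.to_numpy()[df.index.get_indexer(rows), df.columns.get_indexer(cols)]

print(vals.tolist())  # [36, 3]
```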
06. Data iteration
1. Iteration Series
# Iterate over a specific column
for i in df.name:
    print(i)
# Iterate over the index and two specific columns
for i, n, q in zip(df.index, df.name, df.Q1):
    print(i, n, q)
2. df.iterrows()
# Iterate, using the name and Q1 data
for index, row in df.iterrows():
    print(index, row['name'], row.Q1)
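Each iteration yields the index plus the row as a Series, so both bracket and attribute access work. A runnable sketch on toy data:

```python
import pandas as pd

df = pd.DataFrame({'name': ['Ann', 'Ben'], 'Q1': [90, 36]})

pairs = []
for index, row in df.iterrows():
    # row is a Series; row['name'] and row.Q1 are equivalent access styles
    pairs.append((int(index), row['name'], int(row.Q1)))

print(pairs)  # [(0, 'Ann', 90), (1, 'Ben', 36)]
```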
3. df.itertuples()
for row in df.itertuples():
    print(row)
4. df.items()
# Take the first three items of each column's Series
for label, ser in df.items():
    print(label)
    print(ser[:3], end='\n\n')
5. Iterate by column
# Iterating over a DataFrame directly yields the column names
for column in df:
    print(column)
07. Function application
1. pipe()
Applies to the entire DataFrame or Series.
# Apply multiple functions to df
f(g(h(df), arg1=a), arg2=b, arg3=c)
# Use pipe to chain them
(df.pipe(h)
   .pipe(g, arg1=a)
   .pipe(f, arg2=b, arg3=c)
)
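A concrete runnable version of the chain, with two hypothetical helper functions standing in for g and h:

```python
import pandas as pd

# Hypothetical helpers for the chain
def double(df):
    return df * 2

def add(df, n):
    return df + n

df = pd.DataFrame({'x': [1, 2]})
out = df.pipe(double).pipe(add, n=3)   # same as add(double(df), n=3)

print(out.x.tolist())  # [5, 7]
```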
2. apply()
Applies to the rows or columns of a DataFrame; defaults to columns.
# Convert all names to lowercase
df.name.apply(lambda x: x.lower())
3. applymap()
Applied to each element of the DataFrame.
# Calculate the length of each value
def mylen(x):
    return len(str(x))

df.applymap(lambda x: mylen(x))   # apply the function
df.applymap(mylen)                # same effect as above
4. map()
Applies to each element of a column in a Series or DataFrame.
# Enumeration replacement
df.team.map({'A': 'Class 1', 'B': 'Class 2', 'C': 'Class 3', 'D': 'Class 4'})
df['name'].map(f)
5. agg()
df.agg('max')   # maximum of each column
# Aggregate all columns, producing sum and min rows
df.agg(['sum', 'min'])
# Multiple aggregations per column
df.agg({'Q1': ['sum', 'min'], 'Q2': ['min', 'max']})
# Aggregation after grouping
df.groupby('team').agg('max')
df.Q1.agg(['sum', 'mean'])
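A small runnable check of grouped and per-column aggregation (illustrative data):

```python
import pandas as pd

df = pd.DataFrame({'team': ['A', 'A', 'B'], 'Q1': [1, 2, 3], 'Q2': [9, 7, 8]})

per_team = df.groupby('team').agg('max')       # one row per team
mixed = df.agg({'Q1': ['sum'], 'Q2': ['min', 'max']})

print(per_team.Q1.tolist())          # [2, 3]
print(int(mixed.loc['sum', 'Q1']))   # 6
```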
6. transform()
df.transform(lambda x: x * 2)     # apply an anonymous function
df.transform([np.sqrt, np.exp])   # call multiple functions
7. copy()
s = pd.Series([1, 2], index=["a", "b"])
s_1 = s
s_copy = s.copy()
s_1 is s      # True
s_copy is s   # False
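The practical consequence is that mutating a copy leaves the original untouched, while a plain assignment is just another name for the same object:

```python
import pandas as pd

s = pd.Series([1, 2], index=['a', 'b'])
s_1 = s            # another name for the same object
s_copy = s.copy()  # an independent copy

s_copy['a'] = 99   # mutating the copy...

print(int(s['a']))   # 1  ...leaves the original unchanged
print(s_1 is s)      # True
print(s_copy is s)   # False
```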