When analyzing and modeling data, you usually need to clean and organize it first, and sometimes add or delete fields. The following introduces Pandas' complex queries, data type conversion, sorting, modification, iteration, and function application.
01. Complex queries
Real business needs often require querying data by certain conditions, or even by complex combinations of conditions. This section introduces Pandas' flexible data filtering so you can access data however you like.
1. Logical operations
# Q1 score is greater than 36
df.Q1 > 36
# Q1 score is not less than 60, and the team is 'C'
~(df.Q1 < 60) & (df['team'] == 'C')
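As a quick sketch, these boolean expressions can be run against a tiny hypothetical frame (the names, teams, and scores below are made up for illustration):

```python
import pandas as pd

# hypothetical sample data shaped like the frame the text assumes
df = pd.DataFrame({
    'name': ['Ben', 'Eli', 'Ack'],
    'team': ['A', 'C', 'C'],
    'Q1':   [90, 75, 57],
})

# each comparison yields a boolean Series aligned with df's index
mask = ~(df.Q1 < 60) & (df['team'] == 'C')
print(mask.tolist())  # [False, True, False]
```

Only the second row has Q1 not less than 60 and team 'C', so only it is True.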
2. Logical filtering
Slicing ([ ]), .loc[ ], and .iloc[ ] all support the logical expressions described above.
The following is an example of logical filtering for slices ([ ]):
df[df['Q1'] == 8]     # Q1 is equal to 8
df[~(df['Q1'] == 8)]  # Q1 is not equal to 8
df[df.name == 'Ben']  # name is Ben
df[df.Q1 > df.Q2]     # Q1 is greater than Q2
Here are examples with .loc[ ] and .iloc[ ]:
# expressions are consistent with slicing
df.loc[df['Q1'] > 90, 'Q1':]         # Q1 greater than 90; show Q1 and all later columns
df.loc[(df.Q1 > 80) & (df.Q2 < 15)]  # "and" relationship
df.loc[(df.Q1 > 90) | (df.Q2 < 90)]  # "or" relationship
df.loc[df['Q1'] == 8]                # equal to 8
df.loc[df.Q1 == 8]                   # same as above
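A minimal sketch of both forms, using a made-up two-row frame:

```python
import pandas as pd

df = pd.DataFrame({'name': ['Ben', 'Tom'], 'Q1': [95, 60], 'Q2': [10, 90]})

# .loc accepts the boolean mask plus a column slice
sub = df.loc[(df.Q1 > 80) & (df.Q2 < 15), 'Q1':]
print(sub.Q1.tolist())                   # [95]

# plain slicing takes the same boolean mask
print(df[df.Q1 > df.Q2].name.tolist())   # ['Ben']
```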
3. Filtering with functions
# query the value at the largest index
df.Q1[lambda s: max(s.index)]  # the value is 21
# compute the largest index
max(df.Q1.index)               # 99
df.Q1[df.index == 99]          # the same row by boolean indexing
4. Comparison function
# the following is equivalent to df[df.Q1 == 60]
df[df.Q1.eq(60)]
df.ne()  # not equal, !=
df.le()  # less than or equal, <=
df.lt()  # less than, <
df.ge()  # greater than or equal, >=
df.gt()  # greater than, >
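These comparison methods return boolean Series interchangeable with the operators; a quick sketch on made-up scores:

```python
import pandas as pd

df = pd.DataFrame({'Q1': [60, 36, 60]})

print(df[df.Q1.eq(60)].shape[0])  # 2 rows equal 60
print(df.Q1.gt(50).tolist())      # [True, False, True]
```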
5. Query df.query()
df.query('Q1 > Q2 > 90')  # write a SQL-like where clause directly
It also supports the @ symbol for referencing variables:
# supports passing in variables, e.g. scores more than 40 above the mean
a = df.Q1.mean()
df.query('Q1 > @a + 40')
df.query('Q1 > `Q2` + @a')
df.eval() is similar to df.query() and can also be used for expression filtering.
# df.eval() usage is similar to df.query()
df[df.eval("Q1 > 90 > Q3 > 10")]
df[df.eval("Q1 > `Q2` + @a")]
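A runnable sketch of the @-variable form, with hypothetical scores:

```python
import pandas as pd

df = pd.DataFrame({'Q1': [96, 50, 88], 'Q2': [10, 20, 30]})

a = df.Q1.mean()           # 78.0
out = df.query('Q1 > @a')  # rows where Q1 is above the mean
print(out.Q1.tolist())     # [96, 88]
```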
6. Filter df.filter()
df.filter(items=['Q1', 'Q2'])  # select two columns
df.filter(regex='Q', axis=1)   # columns whose name contains Q
df.filter(regex='e$', axis=1)  # columns whose name ends with e
df.filter(regex='1$', axis=0)  # regex: index names ending with 1
df.filter(like='2', axis=0)    # index names containing 2
# index starts with 2 and column name contains Q
df.filter(regex='^2', axis=0).filter(like='Q', axis=1)
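A small sketch of label-based filtering, on a made-up frame:

```python
import pandas as pd

df = pd.DataFrame({'Q1': [1], 'Q2': [2], 'name': ['x']})

print(list(df.filter(regex='^Q', axis=1).columns))  # ['Q1', 'Q2']
print(list(df.filter(like='am', axis=1).columns))   # ['name']
```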
7. Query by data type
df.select_dtypes(include=['float64'])  # select float64 columns
df.select_dtypes(include='bool')
df.select_dtypes(include=['number'])   # only numeric types
df.select_dtypes(exclude=['int'])      # exclude int types
df.select_dtypes(exclude=['datetime64'])
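For example, on a hypothetical frame with mixed dtypes:

```python
import pandas as pd

df = pd.DataFrame({'a': [1.5], 'b': [2], 'c': ['x']})

print(list(df.select_dtypes(include=['number']).columns))  # ['a', 'b']
print(list(df.select_dtypes(exclude=['number']).columns))  # ['c']
```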
02. Data type conversion
Before starting data analysis, we need to assign the appropriate type to the data so that the data can be processed efficiently. Different data types are suitable for different processing methods.
# specify a unified type for all fields
df = pd.DataFrame(data, dtype='float32')
# specify a type for each field separately
df = pd.read_excel(data, dtype={'team': 'string', 'Q1': 'int32'})
1. Inferred type
# automatically convert to the appropriate data types
df.infer_objects()  # inferred DataFrame
df.infer_objects().dtypes
2. Specified type
# convert by general type
m = ['1', 2, 3]
pd.to_numeric(m)                    # convert to numbers
pd.to_datetime(m)                   # convert to datetimes
pd.to_timedelta(m)                  # convert to time differences
pd.to_datetime(m, errors='coerce')  # coerce unparsable values
pd.to_numeric(m, errors='ignore')
pd.to_numeric(m, errors='coerce').fillna(0)   # fill failures with 0
pd.to_datetime(df[['year', 'month', 'day']])  # combine into a date
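The errors='coerce' path can be sketched end to end, turning an unparsable string into NaN and then filling it:

```python
import pandas as pd

# 'x' cannot be parsed as a number, so coerce turns it into NaN
s = pd.to_numeric(pd.Series(['1', '2', 'x']), errors='coerce')
print(s.fillna(0).tolist())  # [1.0, 2.0, 0.0]
```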
3. Type conversion astype()
df.Q1.astype('int32').dtypes  # dtype('int32')
df.astype({'Q1': 'int32', 'Q2': 'int32'}).dtypes
4. Convert to time type
t = pd.Series(['20200801', '20200802'])
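Continuing the snippet above, pd.to_datetime() parses those strings into a datetime64 Series:

```python
import pandas as pd

t = pd.Series(['20200801', '20200802'])
d = pd.to_datetime(t)     # parses the yyyymmdd strings
print(d.dt.day.tolist())  # [1, 2]
print(str(d.dtype))
```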
03. Data sorting
Sorting rearranges data in a given order, which helps users discover trends, provides business clues, and also aids error correction and classification.
1. Index sorting df.sort_index()
s.sort_index()                 # ascending order
df.sort_index()                # df is also sorted by index
df.team.sort_index()
s.sort_index(ascending=False)  # descending order
s.sort_index(inplace=True)     # sort in place, changing the original data
# re-number the index from 0 to n-1; useful to get the sort rank
s.sort_index(ignore_index=True)
s.sort_index(na_position='first')  # nulls first; 'last' puts nulls at the end
s.sort_index(level=1)              # for a MultiIndex, sort by one level
s.sort_index(level=1, sort_remaining=False)  # don't sort the other levels
# column-axis sorting: arranges columns in order of their names
df.sort_index(axis=1)
2. Numerical sorting sort_values()
df.Q1.sort_values()
df.sort_values('Q4')
df.sort_values(by=['team', 'name'], ascending=[True, False])
Other methods:
s.sort_values(ascending=False)      # descending order
s.sort_values(inplace=True)         # modification takes effect
s.sort_values(na_position='first')  # nulls first
# sort df by the specified field
df.sort_values(by=['team'])
df.sort_values('Q1')
# by multiple fields: first by team, then by Q1 within the same team
df.sort_values(by=['team', 'Q1'])
# all descending
df.sort_values(by=['team', 'Q1'], ascending=False)
# team ascending, Q1 descending
df.sort_values(by=['team', 'Q1'], ascending=[True, False])
# re-number the index from 0 to n-1
df.sort_values('team', ignore_index=True)
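The per-column ascending list can be sketched on a tiny hypothetical frame:

```python
import pandas as pd

df = pd.DataFrame({'team': ['B', 'A', 'A'], 'Q1': [5, 9, 7]})

# team ascending, Q1 descending, index re-numbered from 0
out = df.sort_values(by=['team', 'Q1'], ascending=[True, False],
                     ignore_index=True)
print(out.Q1.tolist())  # [9, 7, 5]
print(list(out.index))  # [0, 1, 2]
```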
3. Mixed sort
df.set_index('name', inplace=True)     # set name as the index
df.index.names = ['s_name']            # name the index
df.sort_values(by=['s_name', 'team'])  # sort by index and column together
4. Sorting by value with nsmallest() and nlargest()
s.nsmallest(3)  # the smallest 3
s.nlargest(3)   # the largest 3
# specify columns
df.nlargest(3, 'Q1')
df.nlargest(5, ['Q1', 'Q2'])
df.nsmallest(5, ['Q1', 'Q2'])
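For example, on made-up scores:

```python
import pandas as pd

df = pd.DataFrame({'Q1': [3, 9, 7, 1]})

print(df.nlargest(2, 'Q1').Q1.tolist())  # [9, 7]
print(df.Q1.nsmallest(2).tolist())       # [1, 3]
```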
04. Add and modify
Data modification, addition and deletion often occur in the process of data organization. Modifications generally include modification of errors, format conversion, and data type modification.
1. Modify the value
df.iloc[0, 0]           # query the value
# 'Liver'
df.iloc[0, 0] = 'Lily'  # modify the value
df.iloc[0, 0]           # view the result
# 'Lily'
# set every value in rows where Q1 < 60 to 60
df[df.Q1 < 60] = 60
df.Q1                   # check
# generate a list of length 100
v = [1, 3, 5, 7, 9] * 20
2. Replace data
s.replace(0, 5)                         # change 0 in the Series to 5
df.replace(0, 5)                        # replace every 0 in df with 5
df.replace([0, 1, 2, 3], 4)             # replace 0 through 3 with 4
df.replace([0, 1, 2, 3], [4, 3, 2, 1])  # element-wise replacement
s.replace([1, 2], method='bfill')       # backfill
df.replace({0: 10, 1: 100})             # replace by dictionary
df.replace({'Q1': 0, 'Q2': 5}, 100)     # replace the given values in the given columns with 100
df.replace({'Q1': {0: 100, 4: 400}})    # in column Q1, replace 0 with 100 and 4 with 400
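The list and dictionary forms can be sketched on a small Series:

```python
import pandas as pd

s = pd.Series([0, 1, 2, 3])

print(s.replace([0, 1, 2, 3], [4, 3, 2, 1]).tolist())  # [4, 3, 2, 1]
print(s.replace({0: 10, 1: 100}).tolist())             # [10, 100, 2, 3]
```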
3. Fill empty values
df.fillna(0)  # replace all NaN with 0
# method: {'backfill', 'bfill', 'pad', 'ffill', None}, default None
df.fillna(method='ffill')  # fill each NaN with the previous value
values = {'A': 0, 'B': 1, 'C': 2, 'D': 3}
df.fillna(value=values)           # fill a different value per column
df.fillna(value=values, limit=1)  # only fill the first NaN per column
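A sketch of per-column filling with made-up data:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'A': [np.nan, 2.0], 'B': [np.nan, np.nan]})

out = df.fillna(value={'A': 0, 'B': 1})  # a different fill value per column
print(out.A.tolist())  # [0.0, 2.0]
print(out.B.tolist())  # [1.0, 1.0]
```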
4. Modify the index name
df.rename(columns={'team':'class'})
Common methods are as follows:
df.rename(columns={"Q1": "a", "Q2": "b"})  # rename columns
df.rename(index={0: "x", 1: "y", 2: "z"})  # rename index labels
df.rename(index=str)                       # change the index type
df.rename(str.lower, axis='columns')       # pass a function for column labels
df.rename({1: 2, 2: 4}, axis='index')      # rename index labels
# rename the axis itself
s.rename_axis("animal")
df.rename_axis("animal")                 # defaults to the row index name
df.rename_axis("limbs", axis="columns")  # name the column axis
# for a MultiIndex, rename the level 'type' to 'class'
df.rename_axis(index={'type': 'class'})
# labels can also be set with set_axis
s.set_axis(['a', 'b', 'c'], axis=0)
df.set_axis(['I', 'II'], axis='columns')
df.set_axis(['i', 'ii'], axis='columns', inplace=True)
5. Add columns
df['foo'] = 100                  # add a column foo; all values are 100
df['foo'] = df.Q1 + df.Q2        # the new column is the sum of two columns
df['foo'] = df['Q1'] + df['Q2']  # same as above
# sum all numeric columns
df['total'] = df.select_dtypes(include=['int']).sum(1)
df['total'] = df.loc[:, 'Q1':'Q4'].apply(lambda x: sum(x), axis='columns')
df.loc[:, 'Q10'] = "I'm new here"  # also works
# add a column and assign conditionally; unmatched rows get NaN
df.loc[df.num >= 60, 'grade'] = 'qualified'
df.loc[df.num < 60, 'grade'] = 'unqualified'
6. Insert column df.insert()
# insert a new column total at the third position; the value is each row's total score
df.insert(2, 'total', df.sum(1))
7. Assign columns with df.assign()
# add a total column
df.assign(total=df.sum(1))
# add two columns
df.assign(total=df.sum(1), Q=100)
df.assign(total=df.sum(1)).assign(Q=100)

Other usage examples:

df.assign(Q5=[100]*100)      # add a new column Q5
df = df.assign(Q5=[100]*100) # the assignment takes effect
df.assign(Q6=df.Q2/df.Q1)    # compute and add Q6
df.assign(Q7=lambda d: d.Q1 * 9 / 5 + 32)  # use a lambda
# add a column whose value is the expression result: True or False
df.assign(tag=df.Q1 > df.Q2)
# comparison then cast: True becomes 1, False becomes 0
df.assign(tag=(df.Q1 > df.Q2).astype(int))
# map booleans to labels
df.assign(tag=(df.Q1 > 60).map({True: 'pass', False: 'fail'}))
# add multiple columns; Q8 is not yet on df, so refer to it via the lambda, not df.Q8
df.assign(Q8=lambda d: d.Q1 * 5, Q9=lambda d: d.Q8 + 1)
8. Execute the expression df.eval()
# pass in a total-score expression
df.eval('total = Q1 + Q2 + Q3 + Q4')
Other methods:
df['C1'] = df.eval('Q2 + Q3')
df.eval('C2 = Q2 + Q3')           # calculation
a = df.Q1.mean()
df.eval("C3 = `Q3` + @a")         # use a variable
df.eval("C3 = Q2 > (`Q3` + @a)")  # add a boolean column
df.eval('C4 = name + team', inplace=True)  # takes effect immediately
9. Add rows
# add a row with index 100
df.loc[100] = ['tom', 'A', 88, 88, 88, 88]
Other methods:
df.loc[101] = {'Q1': 88, 'Q2': 99}  # specified columns; others get NaN
df.loc[df.shape[0] + 1] = {'Q1': 88, 'Q2': 99}  # grow the index automatically
df.loc[len(df) + 1] = {'Q1': 88, 'Q2': 99}
# batch operation via iteration
rows = [[1, 2], [3, 4], [5, 6]]
for row in rows:
    df.loc[len(df)] = row
10. Appending with df.append()
df = pd.DataFrame([[1, 2], [3, 4]], columns=list('AB'))
df2 = pd.DataFrame([[5, 6], [7, 8]], columns=list('AB'))
df.append(df2)
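Note that df.append() was deprecated and removed in pandas 2.0; pd.concat() is the replacement for the same example:

```python
import pandas as pd

df = pd.DataFrame([[1, 2], [3, 4]], columns=list('AB'))
df2 = pd.DataFrame([[5, 6], [7, 8]], columns=list('AB'))

# concatenate row-wise and re-number the index from 0
out = pd.concat([df, df2], ignore_index=True)
print(out.A.tolist())  # [1, 3, 5, 7]
```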
11. Delete
# remove and return the value at index 3
s.pop(3)
# 93
s  # the value is gone from s
12. Drop missing values
df.dropna()                # drop a row if it has any missing value
df.dropna(axis='columns')  # keep only columns with no missing values
df.dropna(how='all')       # drop only rows/columns that are entirely empty
df.dropna(thresh=2)        # keep rows with at least two non-NA values
df.dropna(inplace=True)    # drop and make it take effect in place
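The difference between the variants shows up on a small made-up frame:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'A': [1.0, np.nan, np.nan],
                   'B': [1.0, 2.0, np.nan]})

print(len(df.dropna()))           # 1: only the complete row survives
print(len(df.dropna(how='all')))  # 2: only the all-NaN row is dropped
print(len(df.dropna(thresh=2)))   # 1: rows need at least two non-NA values
```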
05. Advanced filtering
This section introduces several very useful filtering and output methods for complex data processing.
1. df.where()
# keep values greater than 70; everything else becomes NaN
df.where(df > 70)
2. np.where()
# below 60 is unqualified
np.where(df >= 60, 'qualified', 'unqualified')
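np.where() returns a plain NumPy array, as this sketch with made-up scores shows:

```python
import numpy as np
import pandas as pd

s = pd.Series([55, 80])
labels = np.where(s >= 60, 'qualified', 'unqualified')
print(labels.tolist())  # ['unqualified', 'qualified']
```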
3. df.mask()
# values meeting the condition become NaN
df.mask(s > 80)
4. df.lookup()
# same number of rows and columns; returns an array
df.lookup([1, 3, 4], ['Q1', 'Q2', 'Q3'])  # array([36, 96, 61])
df.lookup([1], ['Q1'])                    # array([36])
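df.lookup() was deprecated in pandas 1.2 and removed in 2.0; the documented replacement is NumPy fancy indexing over positional indices, sketched here on made-up data:

```python
import pandas as pd

df = pd.DataFrame({'Q1': [10, 36], 'Q2': [20, 40]})
rows, cols = [0, 1], ['Q2', 'Q1']

# translate labels to positions, then index the underlying array
ridx = df.index.get_indexer(rows)
cidx = df.columns.get_indexer(cols)
vals = df.to_numpy()[ridx, cidx]
print(vals.tolist())  # [20, 36]
```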
06. Data iteration
1. Iterate Series
# iterate over a specified column
for i in df.name:
    print(i)
# iterate over the index and two specified columns
for i, n, q in zip(df.index, df.name, df.Q1):
    print(i, n, q)
2. df.iterrows()
# iterate, using the name and Q1 data
for index, row in df.iterrows():
    print(index, row['name'], row.Q1)
3. df.itertuples()
for row in df.itertuples():
    print(row)
4. df.items()
# print the first three values of each Series
for label, ser in df.items():
    print(label)
    print(ser[:3], end='\n\n')
5. Iterate by column
# iterate over the DataFrame directly to get the column names
for column in df:
    print(column)
07. Function application
1. pipe()
Applies to the entire DataFrame or Series.
# applying multiple functions to df
f(g(h(df), arg1=a), arg2=b, arg3=c)
# use pipe to chain them
(df.pipe(h)
   .pipe(g, arg1=a)
   .pipe(f, arg2=b, arg3=c)
)
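A concrete sketch with two hypothetical helper functions (double and add are made up for illustration):

```python
import pandas as pd

df = pd.DataFrame({'Q1': [1, 2]})

def double(d):  # hypothetical helper
    return d * 2

def add(d, n):  # hypothetical helper
    return d + n

# same as add(double(df), n=1), but reads left to right
out = df.pipe(double).pipe(add, n=1)
print(out.Q1.tolist())  # [3, 5]
```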
2. apply()
Applies to rows or columns of the DataFrame, defaults to columns.
# change all names to lowercase
df.name.apply(lambda x: x.lower())
3. applymap()
Applied to each element of the DataFrame.
# calculate the length of each value
def mylen(x):
    return len(str(x))

df.applymap(lambda x: mylen(x))  # apply the function
df.applymap(mylen)               # same effect as above
4. map()
Applied to each element of a column of a Series or DataFrame.
# enumeration replacement
df.team.map({'A': 'Class One', 'B': 'Class Two',
             'C': 'Class Three', 'D': 'Class Four'})
df['name'].map(f)
5. agg()
# the maximum value of each column
df.agg('max')
# aggregate all columns, producing two rows: sum and min
df.agg(['sum', 'min'])
# multiple aggregations per column
df.agg({'Q1': ['sum', 'min'], 'Q2': ['min', 'max']})
# aggregate after grouping
df.groupby('team').agg('max')
df.Q1.agg(['sum', 'mean'])
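A runnable sketch of both the grouped and Series forms, on made-up data:

```python
import pandas as pd

df = pd.DataFrame({'team': ['A', 'A', 'B'], 'Q1': [1, 3, 5]})

print(df.groupby('team').agg('max').Q1.tolist())  # [3, 5]
print(df.Q1.agg(['sum', 'mean']).tolist())        # sum=9, mean=3.0
```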
6. transform()
df.transform(lambda x: x * 2)    # apply an anonymous function
df.transform([np.sqrt, np.exp])  # call multiple functions
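A common use worth sketching (an extension of the examples above): with groupby, transform keeps the original shape, so each row receives its group's aggregate:

```python
import pandas as pd

df = pd.DataFrame({'team': ['A', 'A', 'B'], 'Q1': [1, 3, 5]})

# every row gets the mean of its own team
df['team_avg'] = df.groupby('team').Q1.transform('mean')
print(df.team_avg.tolist())  # [2.0, 2.0, 5.0]
```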
7. copy()
s = pd.Series([1, 2], index=["a", "b"])
s_1 = s
s_copy = s.copy()
s_1 is s     # True
s_copy is s  # False