Pandas index operation
Index object Index
1. The indexes in both Series and DataFrame are Index objects
Sample code:
print(type(ser_obj.index)) print(type(df_obj2. index)) print(df_obj2. index)
operation result:
<class 'pandas.indexes.range.RangeIndex'> <class 'pandas. indexes. numeric. Int64Index'> Int64Index([0, 1, 2, 3], dtype='int64')
2. Index objects are immutable, ensuring data security
Sample code:
# index object is immutable df_obj2.index[0] = 2
operation result:
---------------------------------------------- ---------------------------- TypeError Traceback (most recent call last) <ipython-input-23-7f40a356d7d1> in <module>() 1 # Index objects are immutable ----> 2 df_obj2.index[0] = 2 /Users/Power/anaconda/lib/python3.6/site-packages/pandas/indexes/base.py in __setitem__(self, key, value) 1402 1403 def __setitem__(self, key, value): -> 1404 raise TypeError("Index does not support mutable operations") 1405 1406 def __getitem__(self, key): TypeError: Index does not support mutable operations
Common Index Types
- Index, index
- Int64Index, integer index
- MultiIndex, hierarchical index
- DatetimeIndex, timestamp type
Series index
1. index specifies the row index name
Sample code:
ser_obj = pd.Series(range(5), index = ['a', 'b', 'c', 'd', 'e']) print(ser_obj. head())
operation result:
a 0 b1 c 2 d 3 e 4 dtype: int64
2. Row index
ser_obj[‘label’], ser_obj[pos]
Sample code:
# row index print(ser_obj['b']) print(ser_obj[2])
operation result:
1 2
3. Slice index
ser_obj[2:4], ser_obj[‘label1’: ‘label3’]
Note that when slicing by index name, the termination index is included.
Sample code:
# slice index print(ser_obj[1:3]) print(ser_obj['b':'d'])
operation result:
b 1 c 2 dtype: int64 b1 c 2 d 3 dtype: int64
4. Discontinuous index
ser_obj[[‘label1’, ‘label2’, ‘label3’]]
Sample code:
# discontinuous index print(ser_obj[[0, 2, 4]]) print(ser_obj[['a', 'e']])
operation result:
a 0 c 2 e 4 dtype: int64 a 0 e 4 dtype: int64
5. Boolean index
Sample code:
# boolean index ser_bool = ser_obj > 2 print(ser_bool) print(ser_obj[ser_bool]) print(ser_obj[ser_obj > 2])
operation result:
a False b False c False d True e True dtype: bool d 3 e 4 dtype: int64 d 3 e 4 dtype: int64
DataFrame index
1. columns specify column index name
Sample code:
import numpy as np df_obj = pd.DataFrame(np.random.randn(5,4), columns = ['a', 'b', 'c', 'd']) print(df_obj. head())
operation result:
a b c d 0 -0.241678 0.621589 0.843546 -0.383105 1 -0.526918 -0.485325 1.124420 -0.653144 2 -1.074163 0.939324 -0.309822 -0.209149 3 -0.716816 1.844654 -2.123637 -1.323484 4 0.368212 -0.910324 0.064703 0.486016
2. Column index
df_obj[[‘label’]]
Sample code:
# column index print(df_obj['a']) # return Series type
operation result:
0 -0.241678 1 -0.526918 2-1.074163 3 -0.716816 4 0.368212 Name: a, dtype: float64
3. Discontinuous index
df_obj[[‘label1’, ‘label2’]]
Sample code:
# discontinuous index print(df_obj[['a','c']])
operation result:
a c 0 -0.241678 0.843546 1 -0.526918 1.124420 2 -1.074163 -0.309822 3 -0.716816 -2.123637 4 0.368212 0.064703
Advanced Indexing: Tags, Positions and Blends
There are 3 types of advanced indexes in Pandas
1. loc tag index
DataFrame cannot be sliced directly, it can be sliced by loc
loc is an index based on the label name, which is our custom index name
Sample code:
# tag index loc # Series print(ser_obj['b':'d']) print(ser_obj.loc['b':'d']) #DataFrame print(df_obj['a']) # The first parameter indexes the row, the second parameter is the column print(df_obj.loc[0:2, 'a'])
operation result:
b 1 c 2 d 3 dtype: int64 b1 c 2 d 3 dtype: int64 0-0.241678 1 -0.526918 2-1.074163 3 -0.716816 4 0.368212 Name: a, dtype: float64 0-0.241678 1 -0.526918 2-1.074163 Name: a, dtype: float64
2. iloc position index
The function is the same as loc, but it is indexed based on the index number
Sample code:
# Integer location index iloc # Series print(ser_obj[1:3]) print(ser_obj. iloc[1:3]) #DataFrame print(df_obj.iloc[0:2, 0]) # Note the difference with df_obj.loc[0:2, 'a']
operation result:
b 1 c 2 dtype: int64 b1 c 2 dtype: int64 0-0.241678 1 -0.526918 Name: a, dtype: float64
3. ix tag and position mixed index
ix is a combination of the above two. It can use both index numbers and custom indexes. It depends on the situation.
If the index contains both numbers and English, then this method is not recommended, and it will easily lead to confusion in positioning.
This method is deprecated in the new version
Sample code:
# mixed index ix # Series print(ser_obj.ix[1:3]) print(ser_obj.ix['b':'c']) #DataFrame print(df_obj.loc[0:2, 'a']) print(df_obj.ix[0:2, 0])
operation result:
b 1 c 2 dtype: int64 b1 c 2 dtype: int64 0-0.241678 1 -0.526918 2-1.074163 Name: a, dtype: float64
Note
DataFrame index operation, which can be regarded as index operation of ndarray
The slice index of the label is inclusive of the end position
Pandas alignment operation
It is an important process of data cleaning. It can be operated according to index alignment. If there is no alignment, NaN will be filled, and NaN can also be filled at the end.
Series alignment operation
1. Series is aligned by row and index
Sample code:
s1 = pd.Series(range(10, 20), index = range(10)) s2 = pd.Series(range(20, 25), index = range(5)) print('s1: ' ) print(s1) print('') print('s2: ') print(s2)
operation result:
s1: 0 10 1 11 2 12 3 13 4 14 5 15 6 16 7 17 8 18 9 19 dtype: int64 s2: 0 20 1 21 2 22 3 23 4 24 dtype: int64
2. Alignment operation of Series
Sample code:
# Series alignment operation s1 + s2
operation result:
0 30.0 1 32.0 2 34.0 3 36.0 4 38.0 5 NaN 6 NaN 7 NaN 8 NaN 9 NaN dtype: float64
Alignment operation of DataFrame
1. DataFrame is aligned by row and column index
Sample code:
df1 = pd.DataFrame(np.ones((2,2)), columns = ['a', 'b']) df2 = pd.DataFrame(np.ones((3,3)), columns = ['a', 'b', 'c']) print('df1: ') print(df1) print('') print('df2: ') print(df2)
operation result:
df1: a b 0 1.0 1.0 1 1.0 1.0 df2: a b c 0 1.0 1.0 1.0 1 1.0 1.0 1.0 2 1.0 1.0 1.0
2. Alignment operation of DataFrame
Sample code:
# DataFrame alignment operation df1 + df2
operation result:
a b c 0 2.0 2.0 NaN 1 2.0 2.0 NaN 2 NaN NaN NaN
Fill unaligned data for operation
1. fill_value
While using
add
,sub
,div
,mul
,Specify the filling value by
fill_value
, and the unaligned data will be calculated with the filling value
Sample code:
print(s1) print(s2) s1. add(s2, fill_value = -1) print(df1) print(df2) df1.sub(df2, fill_value = 2.)
operation result:
# print(s1) 0 10 1 11 2 12 3 13 4 14 5 15 6 16 7 17 8 18 9 19 dtype: int64 # print(s2) 0 20 1 21 2 22 3 23 4 24 dtype: int64 # s1. add(s2, fill_value = -1) 0 30.0 1 32.0 2 34.0 3 36.0 4 38.0 5 14.0 6 15.0 7 16.0 8 17.0 9 18.0 dtype: float64 # print(df1) a b 0 1.0 1.0 1 1.0 1.0 # print(df2) a b c 0 1.0 1.0 1.0 1 1.0 1.0 1.0 2 1.0 1.0 1.0 # df1. sub(df2, fill_value = 2.) a b c 0 0.0 0.0 1.0 1 0.0 0.0 1.0 2 1.0 1.0 1.0
Arithmetic method table:
Pandas statistical calculation and description
Sample code:
arr1 = np.random.rand(4,3) pd1 = pd.DataFrame(arr1,columns=list('ABC'),index=list('abcd')) f = lambda x: '%.2f'% x pd2 = pd1.applymap(f).astype(float) pd2
operation result:
A B C a 0.87 0.26 0.67 b 0.69 0.89 0.17 c 0.94 0.33 0.04 d 0.35 0.46 0.29
Common statistical calculations
sum, mean, max, min…
Axis=0 statistics by column, axis=1 statistics by row
skipna excludes missing values, defaults to True
Sample code:
pd2.sum() #By default, the Series of this column is calculated, and all rows are summed pd2.sum(axis='columns') #Specify the sum of all columns in each row pd2.idxmax()#Check the label index of the maximum value of all rows in each column, and we can also find the label index of the maximum value of all columns in each row through axis='columns'
operation result:
A 2.85 B 1.94 C 1.17 dtype: float64 a 1.80 b 1.75 c 1.31 d 1.10 dtype: float64 Ac B b C a dtype: object
Common statistical description
describe generates multiple statistics
Sample code:
pd2.describe()#View summary
operation result:
A B C count 4.000000 4.00000 4.000000 mean 0.712500 0.48500 0.292500 std 0.263613 0.28243 0.271585 min 0.350000 0.26000 0.040000 25% 0.605000 0.31250 0.137500 50% 0.780000 0.39500 0.230000 75% 0.887500 0.56750 0.385000 max 0.940000 0.89000 0.670000 # Percentage: divided by the original amount pd2.pct_change() #View the percentage change of the row, also specify the percentage change of axis='columns'column and column A B C a NaN NaN NaN b -0.206897 2.423077 -0.746269 c 0.362319 -0.629213 -0.764706 d -0.627660 0.393939 6.250000