07-Pandas function and operation

Pandas index operation

Index object Index

1. The indexes in both Series and DataFrame are Index objects

Sample code:

print(type(ser_obj.index))
print(type(df_obj2. index))

print(df_obj2. index)

operation result:

<class 'pandas.indexes.range.RangeIndex'>
<class 'pandas. indexes. numeric. Int64Index'>
Int64Index([0, 1, 2, 3], dtype='int64')

2. Index objects are immutable, ensuring data security

Sample code:

# index object is immutable
df_obj2.index[0] = 2

operation result:

---------------------------------------------- ----------------------------
TypeError Traceback (most recent call last)
<ipython-input-23-7f40a356d7d1> in <module>()
      1 # Index objects are immutable
----> 2 df_obj2.index[0] = 2

/Users/Power/anaconda/lib/python3.6/site-packages/pandas/indexes/base.py in __setitem__(self, key, value)
   1402
   1403 def __setitem__(self, key, value):
-> 1404 raise TypeError("Index does not support mutable operations")
   1405
   1406 def __getitem__(self, key):

TypeError: Index does not support mutable operations

Common Index Types

  • Index, index
  • Int64Index, integer index
  • MultiIndex, hierarchical index
  • DatetimeIndex, timestamp type

Series index

1. index specifies the row index name

Sample code:

ser_obj = pd.Series(range(5), index = ['a', 'b', 'c', 'd', 'e'])
print(ser_obj. head())

operation result:

a 0
b1
c 2
d 3
e 4
dtype: int64

2. Row index

ser_obj[‘label’], ser_obj[pos]

Sample code:

# row index
print(ser_obj['b'])
print(ser_obj[2])

operation result:

1
2

3. Slice index

ser_obj[2:4], ser_obj[‘label1’: ‘label3’]

Note that when slicing by index name, the termination index is included.

Sample code:

# slice index
print(ser_obj[1:3])
print(ser_obj['b':'d'])

operation result:

b 1
c 2
dtype: int64
b1
c 2
d 3
dtype: int64

4. Discontinuous index

ser_obj[[‘label1’, ‘label2’, ‘label3’]]

Sample code:

# discontinuous index
print(ser_obj[[0, 2, 4]])
print(ser_obj[['a', 'e']])

operation result:

a 0
c 2
e 4
dtype: int64
a 0
e 4
dtype: int64

5. Boolean index

Sample code:

# boolean index
ser_bool = ser_obj > 2
print(ser_bool)
print(ser_obj[ser_bool])

print(ser_obj[ser_obj > 2])

operation result:

a False
b False
c False
d True
e True
dtype: bool
d 3
e 4
dtype: int64
d 3
e 4
dtype: int64

DataFrame index

1. columns specify column index name

Sample code:

import numpy as np

df_obj = pd.DataFrame(np.random.randn(5,4), columns = ['a', 'b', 'c', 'd'])
print(df_obj. head())

operation result:

 a b c d
0 -0.241678 0.621589 0.843546 -0.383105
1 -0.526918 -0.485325 1.124420 -0.653144
2 -1.074163 0.939324 -0.309822 -0.209149
3 -0.716816 1.844654 -2.123637 -1.323484
4 0.368212 -0.910324 0.064703 0.486016

2. Column index

df_obj[[‘label’]]

Sample code:

# column index
print(df_obj['a']) # return Series type

operation result:

0 -0.241678
1 -0.526918
2-1.074163
3 -0.716816
4 0.368212
Name: a, dtype: float64

3. Discontinuous index

df_obj[[‘label1’, ‘label2’]]

Sample code:

# discontinuous index
print(df_obj[['a','c']])

operation result:

 a c
0 -0.241678 0.843546
1 -0.526918 1.124420
2 -1.074163 -0.309822
3 -0.716816 -2.123637
4 0.368212 0.064703

Advanced Indexing: Tags, Positions and Blends

There are 3 types of advanced indexes in Pandas

1. loc tag index

DataFrame cannot be sliced directly, it can be sliced by loc

loc is an index based on the label name, which is our custom index name

Sample code:

# tag index loc
# Series
print(ser_obj['b':'d'])
print(ser_obj.loc['b':'d'])

#DataFrame
print(df_obj['a'])

# The first parameter indexes the row, the second parameter is the column
print(df_obj.loc[0:2, 'a'])

operation result:

b 1
c 2
d 3
dtype: int64
b1
c 2
d 3
dtype: int64

0-0.241678
1 -0.526918
2-1.074163
3 -0.716816
4 0.368212
Name: a, dtype: float64
0-0.241678
1 -0.526918
2-1.074163
Name: a, dtype: float64

2. iloc position index

The function is the same as loc, but it is indexed based on the index number

Sample code:

# Integer location index iloc
# Series
print(ser_obj[1:3])
print(ser_obj. iloc[1:3])

#DataFrame
print(df_obj.iloc[0:2, 0]) # Note the difference with df_obj.loc[0:2, 'a']

operation result:

b 1
c 2
dtype: int64
b1
c 2
dtype: int64

0-0.241678
1 -0.526918
Name: a, dtype: float64

3. ix tag and position mixed index

ix is a combination of the above two. It can use both index numbers and custom indexes. It depends on the situation.

If the index contains both numbers and English, then this method is not recommended, and it will easily lead to confusion in positioning.
This method is deprecated in the new version

Sample code:

# mixed index ix
# Series
print(ser_obj.ix[1:3])
print(ser_obj.ix['b':'c'])

#DataFrame
print(df_obj.loc[0:2, 'a'])
print(df_obj.ix[0:2, 0])

operation result:

b 1
c 2
dtype: int64
b1
c 2
dtype: int64

0-0.241678
1 -0.526918
2-1.074163
Name: a, dtype: float64

Note

DataFrame index operation, which can be regarded as index operation of ndarray

The slice index of the label is inclusive of the end position

Pandas alignment operation

It is an important process of data cleaning. It can be operated according to index alignment. If there is no alignment, NaN will be filled, and NaN can also be filled at the end.

Series alignment operation

1. Series is aligned by row and index

Sample code:

s1 = pd.Series(range(10, 20), index = range(10))
s2 = pd.Series(range(20, 25), index = range(5))

print('s1: ' )
print(s1)

print('')

print('s2: ')
print(s2)

operation result:

s1:
0 10
1 11
2 12
3 13
4 14
5 15
6 16
7 17
8 18
9 19
dtype: int64

s2:
0 20
1 21
2 22
3 23
4 24
dtype: int64

2. Alignment operation of Series

Sample code:

# Series alignment operation
s1 + s2

operation result:

0 30.0
1 32.0
2 34.0
3 36.0
4 38.0
5 NaN
6 NaN
7 NaN
8 NaN
9 NaN
dtype: float64

Alignment operation of DataFrame

1. DataFrame is aligned by row and column index

Sample code:

df1 = pd.DataFrame(np.ones((2,2)), columns = ['a', 'b'])
df2 = pd.DataFrame(np.ones((3,3)), columns = ['a', 'b', 'c'])

print('df1: ')
print(df1)

print('')
print('df2: ')
print(df2)

operation result:

df1:
     a b
0 1.0 1.0
1 1.0 1.0

df2:
     a b c
0 1.0 1.0 1.0
1 1.0 1.0 1.0
2 1.0 1.0 1.0

2. Alignment operation of DataFrame

Sample code:

# DataFrame alignment operation
df1 + df2

operation result:

 a b c
0 2.0 2.0 NaN
1 2.0 2.0 NaN
2 NaN NaN NaN

Fill unaligned data for operation

1. fill_value

While using add, sub, div, mul,

Specify the filling value by fill_value, and the unaligned data will be calculated with the filling value

Sample code:

print(s1)
print(s2)
s1. add(s2, fill_value = -1)

print(df1)
print(df2)
df1.sub(df2, fill_value = 2.)

operation result:

# print(s1)
0 10
1 11
2 12
3 13
4 14
5 15
6 16
7 17
8 18
9 19
dtype: int64

# print(s2)
0 20
1 21
2 22
3 23
4 24
dtype: int64

# s1. add(s2, fill_value = -1)
0 30.0
1 32.0
2 34.0
3 36.0
4 38.0
5 14.0
6 15.0
7 16.0
8 17.0
9 18.0
dtype: float64


# print(df1)
     a b
0 1.0 1.0
1 1.0 1.0

# print(df2)
     a b c
0 1.0 1.0 1.0
1 1.0 1.0 1.0
2 1.0 1.0 1.0


# df1. sub(df2, fill_value = 2.)
     a b c
0 0.0 0.0 1.0
1 0.0 0.0 1.0
2 1.0 1.0 1.0

Arithmetic method table:

Pandas statistical calculation and description

Sample code:

arr1 = np.random.rand(4,3)
pd1 = pd.DataFrame(arr1,columns=list('ABC'),index=list('abcd'))
f = lambda x: '%.2f'% x
pd2 = pd1.applymap(f).astype(float)
pd2

operation result:

 A B C
a 0.87 0.26 0.67
b 0.69 0.89 0.17
c 0.94 0.33 0.04
d 0.35 0.46 0.29

Common statistical calculations

sum, mean, max, min…

Axis=0 statistics by column, axis=1 statistics by row

skipna excludes missing values, defaults to True

Sample code:

pd2.sum() #By default, the Series of this column is calculated, and all rows are summed
pd2.sum(axis='columns') #Specify the sum of all columns in each row
pd2.idxmax()#Check the label index of the maximum value of all rows in each column, and we can also find the label index of the maximum value of all columns in each row through axis='columns'

operation result:

A 2.85
B 1.94
C 1.17
dtype: float64


a 1.80
b 1.75
c 1.31
d 1.10
dtype: float64


Ac
B b
C a
dtype: object

Common statistical description

describe generates multiple statistics

Sample code:

pd2.describe()#View summary

operation result:

 A B C
count 4.000000 4.00000 4.000000
mean 0.712500 0.48500 0.292500
std 0.263613 0.28243 0.271585
min 0.350000 0.26000 0.040000
25% 0.605000 0.31250 0.137500
50% 0.780000 0.39500 0.230000
75% 0.887500 0.56750 0.385000
max 0.940000 0.89000 0.670000
# Percentage: divided by the original amount
pd2.pct_change() #View the percentage change of the row, also specify the percentage change of axis='columns'column and column
     A B C
a NaN NaN NaN
b -0.206897 2.423077 -0.746269
c 0.362319 -0.629213 -0.764706
d -0.627660 0.393939 6.250000

Commonly used statistical description methods: