Data Analysis: pandas

1. Introduction

pandas is a tool built on NumPy, created to solve data analysis tasks. It incorporates a number of libraries and some standard data models, providing the tools needed to manipulate large data sets efficiently, along with a large number of functions and methods for processing data quickly and conveniently. (Excerpted from Baidu Encyclopedia)

2. Import

pandas must be imported before use.
The import methods are as follows:

import pandas
import pandas as pd  # pd is the conventional abbreviation for pandas
from pandas import *  # imports everything into the namespace; generally discouraged

3. Basic data structure

1. Series

(1) Introduction
A Series is a one-dimensional array, similar to a one-dimensional array in NumPy and to Python's built-in list. A Series can hold values of different data types: strings, bool values, numbers, and so on.
(2) Create a Series
A one-dimensional Series can be initialized with a one-dimensional list:

import pandas as pd
s=pd.Series([1,2,3,4,5,6,7])
print(s)

Note: a null value can be written as np.nan (with NumPy imported as np)
The output is:

0 1
1 2
2 3
3 4
4 5
5 6
6 7
dtype: int64

The first column is the index, the second column is the data, and the last row outputs the data type.
(3) Modify the index name
By default, a Series is indexed by integers starting at 0; a custom index can be passed as an additional parameter.

import pandas as pd
s=pd.Series([1,2,3,4,5,6,7],["a","b","c","d","e","f","g"])
print(s)

The output is:

a 1
b 2
c 3
d 4
e 5
f 6
g 7
dtype: int64

View the index: s.index
Modify the index: s.index=[name1,name2,…,namen]
Set an overall name for the index: s.index.name=" "
(4) Values
View all values: s.values
Take a single value: s[index]
(5) Slicing
s[start : end (not included) : step]
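The index, value, and slicing operations above can be sketched with a small Series; the labels and values here are illustrative:

```python
import pandas as pd

s = pd.Series([1, 2, 3, 4, 5, 6, 7],
              index=["a", "b", "c", "d", "e", "f", "g"])

print(s.index)    # the index labels
print(s.values)   # the underlying array of values
print(s["c"])     # take a single value by label
print(s[1:5:2])   # positional slice: positions 1 and 3, end point excluded
```

Note that with a label index, an integer slice like `s[1:5:2]` selects by position.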

2. DataFrame

(1) Introduction
Two-dimensional tabular data structure. Many functions are similar to data.frame in R. DataFrame can be understood as a container for Series. (Excerpted from Baidu Encyclopedia)
(2) Create a DataFrame structure
Can be initialized with a two-dimensional list:
df=pd.DataFrame(two-dimensional array, index=[row index], columns=[column index])

import pandas as pd
df=pd.DataFrame([[1,2,3,4,5],[6,7,8,9,10]],index=['a','b'],columns=[1,2,3,4,5])
print(df)

The output is:

 1 2 3 4 5
a 1 2 3 4 5
b 6 7 8 9 10

The first column is the row index (index), and the first row is the column index (columns).

You can also use a dictionary to pass in data:
df=pd.DataFrame(dictionary, index=[row index])
Each key of the dictionary represents a column, and its value can be a variety of objects that can be converted into a Series.

import pandas as pd
df=pd.DataFrame({'a':'test','b':[4,5,6]},index=[1,2,3])
print(df)

The output is:

 a b
1 test 4
2 test 5
3 test 6

Unlike a NumPy array, which requires all elements to have the same type, a DataFrame only requires each column to have a consistent type.
View the data type of each column in the DataFrame:
df.dtypes
(3) View data
View the first n rows (default is 5 rows): df.head(n)
View the last n rows (default is 5 rows): df.tail(n)
(4) Row index, column index, data viewing
Row index: df.index
Column index: df.columns
Data values: df.values
(5) Read data
Generally the data is an Excel file, with the suffix .xlsx
If the file is in the same directory as the Jupyter notebook:
df=pd.read_excel('file name + suffix')
If the paths differ, you need to include the path:
df=pd.read_excel(r'path')
(6) Row operations
Take a single row:
df.iloc[position] (by integer position) or df.loc[label] (by index label)

Take several rows:
df.iloc[start : end (not included) : step]
df.loc[start : end (included) : step]
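The difference between the two is easiest to see side by side; this is a minimal sketch with made-up labels:

```python
import pandas as pd

df = pd.DataFrame([[1, 2], [3, 4], [5, 6]],
                  index=["a", "b", "c"], columns=["x", "y"])

print(df.iloc[0])       # first row by position
print(df.loc["a"])      # the same row by label
print(df.iloc[0:2])     # positions 0 and 1, end point excluded
print(df.loc["a":"b"])  # labels "a" through "b", end point included
```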

Add a row:
s=pd.Series(dictionary)
s.name=row index
df=df._append(s)
Note: DataFrame.append was removed in pandas 2.0; _append is a private method, so pd.concat is the recommended alternative.
For example:

import pandas as pd
df=pd.DataFrame({'a':'test','b':[4,5,6,7,8]})
s=pd.Series({'a':'test','b':3})
s.name=5
df=df._append(s)
print(df)

After running, a new row is appended:
5 test 3

Delete a row:
df=df.drop(row index)

(7) Column operations
View a column:
df[column index][row start:row end (exclusive)]

View certain columns:
df[[column index 1, column index 2, …, column index n]]

Add a column:
df[column index]=added content

Delete a column:
df=df.drop(column index, axis=1)
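The column operations above can be sketched as follows; the column names and values are illustrative:

```python
import pandas as pd

df = pd.DataFrame({"a": [1, 2, 3], "b": [4, 5, 6]})

print(df["a"][0:2])           # view one column, first two rows
print(df[["a", "b"]])         # view several columns
df["c"] = df["a"] + df["b"]   # add a column computed from existing ones
df = df.drop("b", axis=1)     # delete a column
print(df)
```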

(8) Use index to retrieve data
df.loc[index,column]
df.loc[[index1,index2,…,indexn],[column1,column2,…,columnn]]

(9)Conditional selection
df[condition][row start : row end (not included)]
Note: "and" is written &, "or" is written |, and each condition must be wrapped in parentheses.
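A minimal sketch of combining conditions with & (the column names and data are made up):

```python
import pandas as pd

df = pd.DataFrame({"score": [3, 8, 6, 9],
                   "year": [2001, 2002, 2003, 2004]})

# parentheses around each condition are required when using & or |
high = df[(df["score"] > 5) & (df["year"] > 2002)]
print(high)
```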

(10) Handling of missing values and outliers
dropna: Filter missing values in labels and delete missing values
fillna: fill in missing values
isnull: Returns a Boolean value to determine whether those values are missing values
notnull: the negative of isnull

Determine missing values:
df[column index].isnull()

Fill in missing values:
df[column index].fillna(fill data, inplace=True)

Remove missing values:
df.dropna()
Optional arguments:
how='all': delete only rows or columns whose values are all null
inplace=True: overwrite the original data
axis=0 / axis=1: choose to delete rows or columns; the default is 0 (rows)
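The missing-value functions above can be sketched on a small frame (data made up for illustration):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"a": [1.0, np.nan, 3.0],
                   "b": [np.nan, np.nan, 6.0]})

print(df["a"].isnull())         # True where values are missing
filled = df.fillna(0)           # fill all missing values with 0
dropped = df.dropna()           # drop rows containing any missing value
all_nan = df.dropna(how="all")  # drop only rows that are entirely null
print(filled)
print(dropped)
```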

Handling outliers:
df=df[condition]

Change the index after deleting rows:
df.index=range(len(df))

Replace data:
df[index].replace(original data, new data, inplace=True)
df[index].replace([Original data 1,…,Original data n],[New data 1,…,New data n],inplace=True)
Note: slices use a reference mechanism. If you replace data inside a slice, append .copy() to the statement that extracts the data, so that the variable holds an independent copy rather than a view of df.

import pandas as pd
df=pd.read_excel(r'C:\Users\Desktop\Data used in the course\film.xlsx')
df1=df[:6].copy()
df1['Premiere Location'].replace('United States','usa',inplace=True)
print(df1)

(11) Data Saving
df.to_excel('file name + suffix')

(12) Data format conversion
View format:
df[index].dtype

Change format:
df[index]=df[index].astype(target format)

(13) Sorting
Sort by value:
df.sort_values(by=index)
The default is ascending order. To change it to descending order:
df.sort_values(by=index, ascending=False)

Sort by multiple columns:
df.sort_values(by=[index 1, index 2, …])
Sorts first by index 1, then by index 2, and so on.
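A short sketch of both sorting forms (column names and values are illustrative):

```python
import pandas as pd

df = pd.DataFrame({"score": [7, 9, 7], "year": [2003, 2001, 2002]})

# descending order by one column
print(df.sort_values(by="score", ascending=False))

# ties on "score" are broken by "year"
print(df.sort_values(by=["score", "year"]))
```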

4. Basic statistical analysis

1. Descriptive statistics

df.describe()
Can provide a series of statistical data to help find some outliers

2. Find some commonly used statistical data

Function                              Syntax
Maximum                               df[index].max()
Minimum                               df[index].min()
Mean                                  df[index].mean()
Median                                df[index].median()
Variance                              df[index].var()
Standard deviation                    df[index].std()
Sum                                   df[index].sum()
Correlation coefficient               df[[index 1, index 2]].corr()
Covariance                            df[[index 1, index 2]].cov()
Count                                 len(df)
Number of unique values               len(df[index].unique())
Occurrences of each unique value      df[index].value_counts()
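The statistics above can be sketched on a small made-up frame:

```python
import pandas as pd

df = pd.DataFrame({"score": [6.0, 8.0, 7.0, 9.0],
                   "votes": [10, 30, 20, 40]})

print(df.describe())                  # count, mean, std, min, quartiles, max
print(df["score"].mean())             # mean of one column
print(df["score"].median())           # median
print(df[["score", "votes"]].corr())  # correlation matrix
print(df["score"].value_counts())     # occurrences of each unique value
```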

5. Data Pivot

1. Basic form

pd.pivot_table(df,index=[index])

(1) Average calculation is performed by default, but you can also specify a function to count different values:
pd.pivot_table(df,index=[],aggfunc=function)
Perform different functions on different values:
pd.pivot_table(df,index=[ ],values=[ ],aggfunc={index 1: function,…, index n: function})

(2) You can also specify the data to be summarized:
pd.pivot_table(df,index=[index],values=[index])

2. View all results

When too much data is processed, the output is abbreviated. To view all the data:
Display up to n columns: pd.set_option('display.max_columns', n)
Display up to n rows: pd.set_option('display.max_rows', n)

3. Handling missing values (NaN)

These can be replaced with 0 using fill_value:
pd.pivot_table(df,index=[ ],fill_value=0)

4. Display total data

Add margins=True to display totals below the table:
pd.pivot_table(df,index=[],margins=True)
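The pivot-table options above combined in one sketch (the column names and data are made up):

```python
import pandas as pd

df = pd.DataFrame({
    "genre": ["drama", "drama", "comedy", "comedy"],
    "year":  [2001, 2002, 2001, 2002],
    "score": [7.0, 9.0, 6.0, 8.0],
})

# mean score per genre (mean is the default aggregation)
print(pd.pivot_table(df, index=["genre"], values=["score"]))

# a different aggregation, missing cells filled with 0, totals added
print(pd.pivot_table(df, index=["genre"], values=["score"],
                     aggfunc="sum", fill_value=0, margins=True))
```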

6. Data reshaping and axial rotation

1. Hierarchical index of Series

s=pd.Series([ ],index=[[Large Index],[Small Index]])
For example:

import pandas as pd
s=pd.Series([1,2,3,4,5,6],index=[['a','a','a','b','b','b'],[1,2,3,1,2,3]])
print(s)

The output is:

a 1 1
  2 2
  3 3
b 1 4
  2 5
  3 6
dtype: int64

Here a and b are the large indexes, and 1, 2, 3 are the small indexes.

Taking values:
s[large index]: take all values under that large index
s[:,1]: take the value with small index 1 from each large-index group
s[large index][small index]: take a single value
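The three access patterns above, sketched on the Series from the earlier example:

```python
import pandas as pd

s = pd.Series([1, 2, 3, 4, 5, 6],
              index=[["a", "a", "a", "b", "b", "b"], [1, 2, 3, 1, 2, 3]])

print(s["a"])      # all values under large index "a"
print(s[:, 2])     # the value with small index 2 from each group
print(s["b"][3])   # a single value: large index "b", small index 3
```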

2. Hierarchical index of DataFrame

df=pd.DataFrame([ ],index=[[Large Index],[Small Index]],columns=[[Large Index],[Small Index]])

import pandas as pd
df=pd.DataFrame([[1,2,3],[6,7,8],[4,5,0]],index=[['one','one','two'],[1,2,1]],columns=[['a','b','b'],[1,1,2]])
print(df)

The output is:

       a  b
       1  1  2
one 1  1  2  3
    2  6  7  8
two 1  4  5  0

Here, one and two are the large indexes of the rows, 1 and 2 are the small indexes of the rows, a and b are the large indexes of the columns, and 1 and 2 are the small indexes of the columns.

You can also name big and small indexes for rows and columns:
Name the row index: df.index.names=[name1,name2]
Name the column index: df.columns.names=[name1,name2]

import pandas as pd
df=pd.DataFrame([[1,2,3],[6,7,8],[4,5,0]],index=[['one','one','two'],[1,2,1]],columns=[['a','b','b'],[1,1,2]])
df.index.names=['row1','row2']
df.columns.names=['col1','col2']
print(df)

The output is:

col1       a  b
col2       1  1  2
row1 row2
one  1     1  2  3
     2     6  7  8
two  1     4  5  0

After running, row1 is the general name of the large index of the row, row2 is the general name of the small index of the row, col1 is the general name of the large index of the column, and col2 is the general name of the small index of the column.
These names can be used to swap index levels:
df=df.swaplevel(name1, name2)

Convert column to index: df=df.set_index([column name])
Change index into column: df=df.reset_index([index name])

Cancel hierarchical index: df=df.reset_index()

Each index entry is a tuple, which can be viewed with df.index

Data rotation: df.T can directly exchange rows and columns of data

3. Exchange of Series and DataFrame

(1) Convert Series to DataFrame: s.unstack()

import pandas as pd
s=pd.Series([1,2,3,4,5,6],index=[['a','a','a','b','b','b'],[1,2,3,1,2,3]])
s=s.unstack()
print(s)

The output is:

 1 2 3
a 1 2 3
b 4 5 6

A large index becomes a row index, and a small index becomes a column index.
(2) Convert DataFrame to Series: df.stack()

import pandas as pd
df=pd.DataFrame([[1,2,3],[6,7,8],[4,5,0]],index=[['one','one','two'],[1,2,1]],columns=[['a','b','b'],[1,1,2]])
df=df.stack()
print(df)

The output result is:

           a  b
one 1 1  1.0  2
      2  NaN  3
    2 1  6.0  7
      2  NaN  8
two 1 1  4.0  5
      2  NaN  0

The row index becomes a larger index, and the column index becomes a smaller index.

Note: Conversion may cause null values to appear

7. Data grouping and grouping operations

GroupBy technology: implements data grouping and grouping operations, similar to a pivot table

1. Grouping

group=df.groupby(grouping column)
Multiple grouping columns can also be passed in:
group=df.groupby([grouping column 1, grouping column 2])

2. Calculate various statistics after grouping

group.mean() etc.
Note: only numeric columns are aggregated. Columns whose data should not be aggregated can be stored as strings.
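The grouping steps above sketched on a made-up frame:

```python
import pandas as pd

df = pd.DataFrame({
    "genre": ["drama", "comedy", "drama", "comedy"],
    "score": [7.0, 6.0, 9.0, 8.0],
})

group = df.groupby("genre")
print(group.mean())          # mean of the numeric columns per group
print(group["score"].sum())  # one statistic on one column
print(group.size())          # number of rows in each group
```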

8. Discretization processing

Discretization can also be called binning, or grouping into intervals.

pd.cut(x,bins,right=True,labels=None,retbins=False,precision=3,include_lowest=False)

x: the array, Series, or DataFrame object to be discretized
bins: the grouping criteria
right=True: whether to include the right endpoint; the default is True (included)
labels=None: names for the intervals
retbins=False: whether to also return the list of bin edges
precision=3: precision setting
include_lowest=False: whether to include the left endpoint; the default is False (not included)
example:

df1=pd.cut(df['rating'],[0,3,5,7,9,10],labels=['E','D','C','B','A'])

Explanation: Rating (0,3] is E, (3,5] is D, (5,7] is C, (7,9] is B, (9,10] is A
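The example above assumes a df with a 'rating' column; a self-contained version with made-up data:

```python
import pandas as pd

df = pd.DataFrame({"rating": [2.5, 4.0, 6.5, 8.0, 9.5]})

# bin ratings into letter grades; right endpoints are included by default
grades = pd.cut(df["rating"], [0, 3, 5, 7, 9, 10],
                labels=["E", "D", "C", "B", "A"])
print(grades)
```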

9. Merge data sets

1. append

append splices two data sets vertically:
df1.append(df2)
Note: DataFrame.append was removed in pandas 2.0; use pd.concat([df1, df2]) instead.

2. merge

pd.merge(left,right,how='inner',on=None,left_on=None,right_on=None,left_index=False,right_index=False,sort=True,suffixes=('_x','_y'))

left/right: the objects to merge
how='inner': take the intersection of the keys
on: the join key; must be a column in both left and right
left_on: use the selected column of left as the join key
right_on: use the selected column of right as the join key
left_index: use the row index of left as its join key
right_index: use the row index of right as its join key
sort=True: sort the merged data by the join key; the default is True. Disabling this can give better performance on large data sets
suffixes=('_x','_y'): tuple of strings appended to overlapping column names; the default is ('_x','_y'). For example, if both left and right DataFrame objects have a column "data", the result will contain "data_x" and "data_y"
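A small sketch of inner and outer joins (the key and column names are illustrative):

```python
import pandas as pd

left = pd.DataFrame({"key": ["a", "b", "c"], "x": [1, 2, 3]})
right = pd.DataFrame({"key": ["b", "c", "d"], "y": [4, 5, 6]})

inner = pd.merge(left, right, on="key")               # intersection of keys
outer = pd.merge(left, right, on="key", how="outer")  # union; gaps become NaN
print(inner)
print(outer)
```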

3. concat

concat can merge multiple data sets in one step:
df=pd.concat([df1,df2,df3],axis=0)
axis defaults to 0, which stacks rows;
set it to 1 to place the data sets side by side as columns.
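A sketch of both axis settings (the frames here are made up):

```python
import pandas as pd

df1 = pd.DataFrame({"a": [1, 2]})
df2 = pd.DataFrame({"a": [3, 4]})
df3 = pd.DataFrame({"a": [5, 6]})

rows = pd.concat([df1, df2, df3], axis=0, ignore_index=True)  # stack vertically
cols = pd.concat([df1, df2], axis=1)                          # side by side
print(rows)
print(cols)
```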