Data analysis: Matplotlib

1. Introduction

Matplotlib is a Python 2D plotting library that produces publication-quality graphics in a variety of hardcopy formats and in a cross-platform interactive environment. (Excerpted from Baidu Encyclopedia)

2. Import

Import the library before use:

import matplotlib.pyplot as plt

3. Display charts

By default, a chart is not displayed when it is drawn; call plt.show() to display it.
When run as a script, the chart opens in a new window that provides buttons for operating on the image.

4. Draw a line graph

Specify both x and y, or pass only y, in which case x defaults to 0~n-1 (n is the length of y):

plt.plot(x,y)
plt.show()
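
To illustrate the default x values, here is a minimal sketch. The data and the output file name are made up for illustration, and the Agg backend is selected only so that the code also runs without a display.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend; remove this line to get a window from plt.show()
import matplotlib.pyplot as plt

y = [1, 4, 9, 16]

# Only y is passed, so x defaults to 0, 1, ..., len(y) - 1.
(line,) = plt.plot(y)

# The Line2D object remembers the x values that were actually used.
x_used = list(line.get_xdata())
print(x_used)

plt.savefig("line_default_x.png")  # hypothetical file name
```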

5. Basic settings

1. Set x-axis/y-axis/title name

plt.xlabel('x',fontsize=18)
plt.ylabel('y',fontsize=18)
plt.title('title',fontsize=20)

fontsize: sets the font size
The tick labels can also be rotated:

plt.xticks(rotation=90) #Rotate the x-axis tick labels by 90 degrees

2. Set the display range of x-axis/y-axis

plt.axis([xmin,xmax,ymin,ymax])
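
The settings above can be combined as in the following sketch. The data, label text, and file name are assumptions chosen for illustration.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so the sketch runs without a display
import matplotlib.pyplot as plt

x = [1, 2, 3, 4]
y = [1, 4, 9, 16]

plt.figure()
plt.plot(x, y)
plt.xlabel('x', fontsize=18)
plt.ylabel('y', fontsize=18)
plt.title('y = x squared', fontsize=20)
plt.xticks(rotation=90)          # rotate the x-axis tick labels
plt.axis([0, 5, 0, 20])          # [xmin, xmax, ymin, ymax]

# Read the limits back from the current Axes to confirm the range.
ax = plt.gca()
limits = (ax.get_xlim(), ax.get_ylim())

plt.savefig("labels_and_range.png")  # hypothetical file name
```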

3. Normal display of Chinese characters and negative signs

By default, Chinese characters and negative signs will not be displayed properly. You need to enter the following statement:

plt.rcParams['font.sans-serif']=['SimHei'] #Set the font so that Chinese characters can be displayed normally
plt.rcParams['axes.unicode_minus']=False #Making negative signs display normally

4. Character parameters

(1) Characters that set the color:

Character parameter   Color
'b'                   Blue
'g'                   Green
'r'                   Red
'c'                   Cyan
'm'                   Magenta
'y'                   Yellow
'k'                   Black
'w'                   White

(2) Characters that set the line style or marker type:

Character parameter   Type
'-'                   Solid line
'--'                  Dashed line
'-.'                  Dash-dot line
':'                   Dotted line
'.'                   Point
','                   Pixel
'o'                   Circle
'v'                   Downward triangle
'^'                   Upward triangle
'<'                   Left-pointing triangle
'>'                   Right-pointing triangle
's'                   Square
'p'                   Pentagon
'*'                   Star
'h'                   Hexagon 1
'H'                   Hexagon 2
'+'                   Plus sign
'x'                   X (cross)
'D'                   Diamond
'd'                   Thin diamond
'_'                   Horizontal line
'1'                   Downward tri marker
'2'                   Upward tri marker
'3'                   Left-pointing tri marker
'4'                   Right-pointing tri marker

Example: draw red dots

plt.plot([1,2,3,4],[1,4,9,16],'ro')
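
Color, marker, and line-style characters can be combined in one format string. The sketch below uses made-up data and a hypothetical file name, and reads the resulting properties back from the returned line objects.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so the sketch runs without a display
import matplotlib.pyplot as plt

x = [1, 2, 3, 4]

plt.figure()
(dots,) = plt.plot(x, [1, 4, 9, 16], 'ro')     # red circles, no connecting line
(dashes,) = plt.plot(x, [1, 2, 3, 4], 'g--')   # green dashed line
(tri,) = plt.plot(x, [2, 4, 6, 8], 'b^-.')     # blue dash-dot line with upward triangles

plt.savefig("format_strings.png")  # hypothetical file name
```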

5. Line attributes

(1) Set with character parameters, as in the red-dot example above

(2) Use keywords to set

The linewidth keyword changes the line width, and the color keyword changes the line color.
Example:

plt.plot(x,y,linewidth=4.0,color='r') #Set the line width to 4 and the color to red

(3) Use the return value of plt.plot()

The plot function returns a list of Line2D objects; each object corresponds to one pair of inputs.
Example:

line1,=plt.plot(x,y) #plt.plot() returns a list, so unpack its single Line2D object
line1.set_antialiased(False) #Turn off anti-aliasing
plt.show()

(4) Use plt.setp()

example:

line=plt.plot(x,y)
plt.setp(line,color='g',linewidth=4.0)
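
The keyword, return-value, and plt.setp() approaches can be compared side by side in a short sketch. The data and file name are assumptions for illustration.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so the sketch runs without a display
import matplotlib.pyplot as plt

x = [0, 1, 2, 3]
y = [0, 1, 4, 9]

plt.figure()

# Keywords in the plot call itself.
(line1,) = plt.plot(x, y, linewidth=4.0, color='r')

# Setter methods on the returned Line2D object.
line1.set_antialiased(False)

# plt.setp() applies the same properties to every line in a list.
lines = plt.plot(x, [v + 1 for v in y])
plt.setp(lines, color='g', linewidth=2.0)

plt.savefig("line_properties.png")  # hypothetical file name
```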

6. Subplots

(1)figure() function

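The figure function creates a figure with the specified number num (or activates that figure if it already exists)

plt.figure(num,figsize=(10,6)) #The number is num, and figsize sets the size of the figure

Note: an explicit plt.figure(1) call can be omitted, because pyplot creates a figure automatically when one is needed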

(2)subplot() function

plt.subplot(numrows,numcols,fignum)

Here numrows is the number of rows, numcols is the number of columns, and fignum is the index of the subplot, counted from 1.
The total number of subplots is sum=numrows*numcols
When sum<10, the commas can be omitted, for example plt.subplot(211)

Example:

import matplotlib.pyplot as plt
plt.figure(figsize=(10,6))
plt.subplot(211) #Equivalent to plt.subplot(2,1,1)
plt.plot([0,1,2,3,4],'r-')
plt.subplot(212)
plt.plot([0,1,2,3,4],[0,3,6,9,12],'b--')
plt.show()

[Figure: run results]

7. Suppress warnings

import warnings
warnings.filterwarnings('ignore')

8. Coordinate system setting

a=plt.gca() #Get the current coordinate system (Axes)
a.patch.set_facecolor('gray') #Set the background color
a.patch.set_alpha(0.3) #Set the background transparency; the value range is 0~1

Add grid lines to the background:

plt.grid()
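
Put together, the coordinate system settings look like the following sketch. The plotted data and the file name are made up for illustration.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so the sketch runs without a display
import matplotlib.pyplot as plt

plt.figure()
plt.plot([0, 1, 2], [0, 1, 4])

ax = plt.gca()                    # current Axes (the coordinate system)
ax.patch.set_facecolor('gray')    # background color of the plotting area
ax.patch.set_alpha(0.3)           # 0 is fully transparent, 1 is opaque
plt.grid()                        # turn on grid lines

plt.savefig("axes_background.png")  # hypothetical file name
```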

9. Add data to the graph

plt.text(x,y,n)

Here (x, y) is the position of the text in data coordinates, and n is the content to display

10. Add annotations

plt.annotate(text,xy=(x,y),xytext=(x + 10,y + 10),arrowprops=dict(facecolor='black',edgecolor='red'))

Note:
text: the annotation content
xy: coordinates of the point being annotated
xytext: coordinates of the annotation text
arrowprops=dict(facecolor='black',edgecolor='red'): sets the fill color and edge color of the arrow
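
A minimal sketch of plt.text() and plt.annotate() used together follows. The curve, the label positions, and the file name are assumptions chosen for illustration.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so the sketch runs without a display
import matplotlib.pyplot as plt

x = [0, 1, 2, 3, 4]
y = [0, 1, 4, 9, 16]

plt.figure()
plt.plot(x, y, 'b-')

# Plain text placed at a data coordinate.
label = plt.text(1, 1, '(1, 1)')

# Annotation: the arrow points at xy, the text sits at xytext.
note = plt.annotate('largest value',
                    xy=(4, 16),
                    xytext=(2, 14),
                    arrowprops=dict(facecolor='black', edgecolor='red'))

plt.savefig("text_and_annotation.png")  # hypothetical file name
```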

11. Add legend

plt.legend()
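
plt.legend() builds its entries from the label given to each plotted line, so the lines need labels. A minimal sketch, with made-up data, labels, and file name:

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so the sketch runs without a display
import matplotlib.pyplot as plt

x = [0, 1, 2, 3]

plt.figure()
plt.plot(x, [v for v in x], 'r-', label='y = x')
plt.plot(x, [2 * v for v in x], 'b--', label='y = 2x')

# The legend picks up the label of each line.
legend = plt.legend(loc='upper left')
legend_labels = [t.get_text() for t in legend.get_texts()]

plt.savefig("legend.png")  # hypothetical file name
```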

6. Bar chart (bar)

1. Basic format:

plt.bar(x,y)

2. Add data to the graph

plt.text(x,y,n,ha='center',va='bottom')

Here (x, y) is the position of the text, and n is the content to display
ha='center': the text is centered horizontally on x
va='bottom': the number sits above the top of the bar
va='top': the number sits below the top of the bar
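
As a sketch of how the value labels are usually added, the loop below places one text above each bar. The bar positions, heights, and file name are made up for illustration.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so the sketch runs without a display
import matplotlib.pyplot as plt

positions = [1, 2, 3]
heights = [5, 8, 3]

plt.figure()
bars = plt.bar(positions, heights)

# One label per bar, centered on the bar and sitting just above its top.
labels = [plt.text(xi, yi, str(yi), ha='center', va='bottom')
          for xi, yi in zip(positions, heights)]

plt.savefig("bar_with_labels.png")  # hypothetical file name
```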

7. Pie chart (pie)

plt.pie(x,explode=None,labels=None,colors=None,autopct=None,pctdistance=0.6,shadow=False,labeldistance=1.1,startangle=None,radius=None)

Note:
x: the size of each wedge; if sum(x)>1, the values are normalized by sum(x)
labels: the explanatory text displayed outside each wedge
explode: how far each wedge is offset from the center
startangle: the angle at which drawing starts. By default, wedges are drawn counterclockwise starting from the positive x-axis; if set to 90, drawing starts from the positive y-axis
shadow: whether to draw a shadow
labeldistance: where the labels are drawn, as a proportion of the radius; if less than 1, the labels are drawn inside the pie
autopct: controls the percentage text shown inside each wedge; it can be a format string or a format function
In a format string such as '%1.1f', the numbers give the minimum field width and the number of digits after the decimal point (the value is not padded with spaces)
pctdistance: similar to labeldistance; specifies where the autopct text is placed, as a proportion of the radius
radius: the radius of the pie chart
Return value:
If autopct is not set, returns (patches, texts)
If autopct is set, returns (patches, texts, autotexts)
Example:

import matplotlib.pyplot as plt
import pandas as pd
df=pd.read_excel(r'C:\Users\Gong Xihui\Desktop\data used in the course\film.xlsx')
plt.figure(figsize=(10,10))
data=pd.cut(df['rating'],[0,3,5,7,9,10]).value_counts()
y=data.values
y=y/sum(y)
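plt.title('Movie rating ratio') #Call plt.title(); assigning to it would overwrite the function
plt.pie(y,labels=data.index,autopct='%.1f%%',colors=['b','y','g','r']) #'%%' prints a literal percent sign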
plt.show()

[Figure: output pie chart]
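
Because the example above depends on a local Excel file, here is a self-contained sketch of the same call with made-up counts. The labels, colors, and file name are assumptions for illustration only.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so the sketch runs without a display
import matplotlib.pyplot as plt

counts = [2, 1, 1]                      # made-up data
names = ['low', 'medium', 'high']       # made-up labels

plt.figure(figsize=(6, 6))
patches, texts, autotexts = plt.pie(
    counts,
    labels=names,
    explode=[0.1, 0, 0],        # pull the first wedge away from the center
    autopct='%.1f%%',           # one decimal place, then a literal percent sign
    startangle=90,              # start drawing from the positive y-axis
    colors=['b', 'y', 'g'],
)

# autotexts holds the percentage labels drawn inside the wedges.
percent_labels = [t.get_text() for t in autotexts]

plt.savefig("pie.png")  # hypothetical file name
```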

8. Frequency distribution histogram (hist)

plt.hist(arr)

hist has many parameters; only the first is required, and the rest are optional

arr: the one-dimensional array whose histogram is computed
bins: the number of bars (bins) in the histogram, optional, default is 10
density (named normed in old Matplotlib versions; normed was removed in Matplotlib 3.1): whether to normalize the resulting histogram. Default is False
facecolor: histogram color
edgecolor: histogram border color
alpha: transparency
histtype: histogram type, one of bar, barstacked, step, stepfilled
Return value:
n: the histogram values (counts per bin); whether they are normalized is set by the density parameter
bins: the edges of the bins
patches: the list of graphical objects (the bars) used to draw the histogram
Example:

import matplotlib.pyplot as plt
import pandas as pd
df=pd.read_excel(r'C:\Users\Gong Xihui\Desktop\data used in the course\film.xlsx')
plt.figure(figsize=(10,10))
plt.hist(df['rating'],bins=20,edgecolor='k')
plt.show()

[Figure: output histogram]
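
The return values can be inspected with a self-contained sketch. The random data below stands in for the rating column and is entirely made up; the file name is also hypothetical.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so the sketch runs without a display
import matplotlib.pyplot as plt
import numpy as np

rng = np.random.default_rng(0)
ratings = rng.normal(loc=7, scale=1, size=1000)   # made-up data

plt.figure(figsize=(8, 6))
n, bins, patches = plt.hist(ratings, bins=20, edgecolor='k', alpha=0.7)

# n: count in each bin; bins: the 21 edges of the 20 bins; patches: the bars.
total = n.sum()

plt.savefig("hist.png")  # hypothetical file name
```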

9. Dual axis chart

Using twinx():

import matplotlib.pyplot as plt
plt.figure(figsize=(10,10))
a=plt.plot([0,1,2,3,4,5],'b-')
b=plt.twinx()
b.plot([0,1,4,9,16,25],'r--')
plt.show()

[Figure: output dual-axis chart]

10. Scatter plot

plt.scatter(x,y,marker='.')

marker sets the shape of the scatter points.
The marker characters are the same as those listed in the character parameter table above.
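
A small sketch of a scatter plot with a non-default marker follows. The points, the marker choice, and the file name are assumptions for illustration.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so the sketch runs without a display
import matplotlib.pyplot as plt

x = [1, 2, 3, 4, 5]
y = [2, 1, 4, 3, 5]

plt.figure()
# '^' draws each point as an upward triangle; color works as for plt.plot().
points = plt.scatter(x, y, marker='^', color='r')

plt.savefig("scatter.png")  # hypothetical file name
```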

11. Box plot (boxplot)

1. Introduction

A box plot, also known as a box-and-whisker plot, is a statistical chart that shows the dispersion of a set of data; it is named for its box-like shape. It is used in many fields and is common in quality management. It mainly reflects the distribution characteristics of the original data and can also compare the distributions of several groups of data. To draw a box plot, first find the median, the two quartiles, and the upper and lower edges of the data set; then draw a box spanning the two quartiles and connect the upper and lower edges to the box. The median line lies inside the box.

2. Drawing steps

(1) Calculate the upper quartile (Q3), the median, and the lower quartile (Q1)
(2) Calculate the difference between the upper and lower quartiles, that is, the interquartile range IQR = Q3-Q1
(3) Draw the box, with its upper edge at the upper quartile and its lower edge at the lower quartile. Draw a horizontal line inside the box at the median
(4) Values greater than Q3 + 1.5×IQR or less than Q1 - 1.5×IQR are classified as outliers
(5) Among the values that are not outliers, draw horizontal lines at the largest and the smallest value and connect them to the box; these are the whiskers of the box plot
(6) Extreme outliers, which lie more than 3×IQR beyond the quartiles, are drawn as solid points; mild outliers, which lie between 1.5×IQR and 3×IQR beyond the quartiles, are drawn as hollow points
(7) Add a name, a number axis, and so on to the box plot
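
Steps (1) to (6) can be sketched numerically with NumPy. The data set is made up, and the quartiles use NumPy's default linear interpolation; other quartile conventions give slightly different values.

```python
import numpy as np

data = np.array([1, 2, 3, 4, 5, 6, 7, 8, 9, 30])   # made-up data

# Steps (1) and (2): quartiles, median, and interquartile range.
q1, median, q3 = np.percentile(data, [25, 50, 75])
iqr = q3 - q1

# Step (4): fences at 1.5 * IQR beyond the quartiles.
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr
outliers = data[(data < lower_fence) | (data > upper_fence)]

# Step (5): whiskers end at the most extreme values that are not outliers.
inside = data[(data >= lower_fence) & (data <= upper_fence)]
whisker_low, whisker_high = inside.min(), inside.max()

# Step (6): extreme outliers lie more than 3 * IQR beyond the quartiles.
extreme = data[(data < q1 - 3 * iqr) | (data > q3 + 3 * iqr)]

print(q1, median, q3, iqr)
print(whisker_low, whisker_high, outliers, extreme)
```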

3. Syntax

plt.boxplot(x, notch=None, sym=None, vert=None, whis=None, positions=None, widths=None, patch_artist=None, meanline=None, showmeans=None, showcaps=None, showbox=None, showfliers=None, boxprops=None, labels=None, flierprops=None, medianprops=None, meanprops=None, capprops=None, whiskerprops=None)

x: the data to draw as a box plot;
notch: whether to draw the box with a notch; by default there is no notch;
sym: the marker used for outlier points (a + sign in old Matplotlib versions, a circle since Matplotlib 2.0);
vert: whether the box plot is drawn vertically; the default is vertical;
whis: the distance between the whiskers and the upper and lower quartiles; the default is 1.5 times the interquartile range;
positions: the positions of the boxes; the default is [1, 2, …, n] for n data sets;
widths: the width of the boxes; the default is 0.5;
patch_artist: whether to fill the boxes with color;
meanline: whether to show the mean as a line; by default it is shown as a point;
meanprops: the properties of the mean, such as point size and color;
capprops: the properties of the caps at the ends of the whiskers, such as color and thickness;
whiskerprops: the properties of the whiskers, such as color, thickness, and line type.

x can contain several data sets; one box is drawn for each
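
As an illustration of passing several data sets at once, the sketch below draws three boxes from made-up random data. The group names, fill color, and file name are assumptions.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so the sketch runs without a display
import matplotlib.pyplot as plt
import numpy as np

rng = np.random.default_rng(0)
groups = [rng.normal(0, 1, 100),      # made-up data sets
          rng.normal(1, 2, 100),
          rng.normal(-1, 0.5, 100)]

plt.figure(figsize=(8, 6))
result = plt.boxplot(groups, whis=1.5, widths=0.5,
                     patch_artist=True,   # needed so the boxes can be filled
                     showmeans=True)

# boxplot returns a dict of the artists it created, keyed by part name.
for box in result['boxes']:
    box.set_facecolor('lightblue')

plt.xticks([1, 2, 3], ['A', 'B', 'C'])   # boxes sit at positions 1, 2, 3 by default

plt.savefig("boxplot.png")  # hypothetical file name
```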

12. Correlation coefficient matrix graph

pandas also wraps some plotting functions of its own

1. scatter_matrix()

You can draw a scatter plot between each attribute, and the diagonal line is the distribution plot.

#In a Jupyter notebook, show figures inline (older notebooks use %pylab inline)
%matplotlib inline
result=pd.plotting.scatter_matrix(data,diagonal='hist',color='k',alpha=0.3,figsize=(10,10)) #pd.scatter_matrix in old pandas versions

diagonal='hist': the plots on the diagonal are histograms
diagonal='kde': the plots on the diagonal are density curves
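
Outside a notebook, the same call can be tried with a self-contained sketch. The DataFrame below is random, made-up data with hypothetical column names, and the file name is also an assumption.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so the sketch runs without a display
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
data = pd.DataFrame(rng.normal(size=(50, 3)), columns=['a', 'b', 'c'])  # made-up data

# One scatter plot per pair of columns, histograms on the diagonal.
axes = pd.plotting.scatter_matrix(data, diagonal='hist', color='k',
                                  alpha=0.3, figsize=(6, 6))

plt.savefig("scatter_matrix.png")  # hypothetical file name
```

The return value is a NumPy array of Axes with one row and one column per DataFrame column.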

2. seaborn

seaborn is a concise Python library for creating statistical charts, and it understands pandas' DataFrame type.

seaborn.heatmap(data, vmin=None, vmax=None, cmap=None, center=None, robust=False, annot=None, fmt='.2g', annot_kws=None, linewidths=0, linecolor='white', cbar=True, cbar_kws=None, cbar_ax=None, square=False, xticklabels='auto', yticklabels='auto', mask=None, ax=None, **kwargs)

(1) Heat map input data parameters:
data: the matrix data set, which can be a numpy array or a pandas DataFrame. If it is a DataFrame, its index and column information label the rows and columns of the heat map: df.index gives the row labels and df.columns gives the column labels.

(2) Heat map matrix block color parameters:
vmax, vmin: the maximum and minimum of the color value range of the heat map. By default they are determined from the values in data.

cmap: the mapping from data values to colors; the value is a colormap name or object from the matplotlib package, or a list of colors. The default depends on whether the center parameter is set.

center: the value at which the colormap is centered when the data diverges around a reference value; changing center shifts the overall lightness or darkness of the colors in the image. If data falls outside the range once center is set, manually set vmax and vmin values are adjusted automatically.

robust: default value False. If True and vmin or vmax is not set, the color range is computed from robust quantiles rather than from the extreme values.

(3) Heat map matrix block annotation parameters:
annot (abbreviation of annotate): if True, the data value is written in each square of the heat map; if it is a matrix of the same shape as data, the corresponding matrix values are written instead. By default nothing is written.

fmt: the string format code for the numbers written on the matrix, for example how many digits to keep after the decimal point

annot_kws: default value None. A dict of keyword arguments that set the size, color, and font of the numbers when annot is True; the available settings are those of the text.Text class in the matplotlib package.

(4) Spacing and dividing line parameters between heat map matrix blocks:
linewidths: the width of the gap between the matrix blocks that represent pairwise feature relationships in the heat map

linecolor: the color of the lines that divide the matrix blocks on the heat map. The default value is 'white'

(5) Heat map color scale bar parameters:
cbar: whether to draw a color scale bar on the side of the heat map; the default value is True

cbar_kws: a dict of keyword arguments passed to Matplotlib's colorbar when the color scale bar is drawn; the default value is None

cbar_ax: the Axes in which to draw the color scale bar; the default value is None, in which case space is taken from the main Axes

xticklabels, yticklabels: xticklabels controls the label names shown for each column, and yticklabels controls the label names shown for each row. The default value is 'auto'. If True, the column or row names of the DataFrame are used as labels. If False, no labels are drawn. If it is a list, the labels are taken from the list. If it is an integer K, every K-th label is drawn.
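
To tie the parameters back to the correlation coefficient matrix, here is a minimal sketch that draws DataFrame.corr() as a heat map. The DataFrame is random, made-up data with hypothetical column names, and the colormap and file name are assumptions.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so the sketch runs without a display
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import seaborn as sns

rng = np.random.default_rng(0)
df = pd.DataFrame(rng.normal(size=(100, 3)), columns=['a', 'b', 'c'])  # made-up data
df['c'] = df['a'] + 0.5 * df['c']     # make column c correlate with column a

corr = df.corr()                      # correlation coefficient matrix

plt.figure(figsize=(6, 5))
ax = sns.heatmap(corr,
                 vmin=-1, vmax=1, center=0,   # correlations lie in [-1, 1]
                 cmap='coolwarm',
                 annot=True, fmt='.2f',       # write each coefficient with 2 decimals
                 linewidths=0.5, linecolor='white',
                 cbar=True, square=True)

plt.savefig("corr_heatmap.png")  # hypothetical file name
```

The row and column labels of the heat map come from corr.index and corr.columns.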