This article summarizes several ways to calculate a correlation coefficient matrix in Python.
The correlation matrix is a basic tool for data analysis: it lets us see how different variables relate to each other. Python offers many ways to calculate it, and the sections below walk through them one by one.
Pandas
Pandas’ DataFrame object can directly create a correlation matrix using the corr method. Since most people in the data science field use Pandas to get their data, this is often one of the fastest and easiest ways to check for correlations in your data.
import pandas as pd
import seaborn as sns

data = sns.load_dataset('mpg')
correlation_matrix = data.corr(numeric_only=True)
correlation_matrix
If you work in statistics and analysis, you may ask: “Where is the p-value?” We cover that at the end.
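The corr method computes Pearson correlation by default; its method parameter also accepts 'spearman' and 'kendall'. The following is a minimal sketch on a small made-up DataFrame (the column names x and cubed are hypothetical, chosen only for illustration), showing how a rank-based coefficient can differ from Pearson when the relationship is monotonic but not linear:

```python
import numpy as np
import pandas as pd

# Toy data (an assumption for illustration): y grows monotonically
# with x, but the relationship is not linear.
x = np.arange(1, 11)
toy = pd.DataFrame({"x": x, "cubed": x ** 3})

pearson = toy.corr(method="pearson")    # linear association
spearman = toy.corr(method="spearman")  # rank-based association

print(pearson.loc["x", "cubed"])
print(spearman.loc["x", "cubed"])
```

Because the ranks of the two columns agree exactly, the Spearman coefficient is 1, while the Pearson coefficient stays below 1.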
NumPy
NumPy also provides a function for the correlation coefficient matrix, np.corrcoef, which we can call directly. Because it returns a plain ndarray with no row or column labels, the output is not as easy to read as the pandas version.
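import numpy as np
from sklearn.datasets import load_iris

iris = load_iris()
# rowvar=False treats each column (feature) as a variable;
# the default would correlate the 150 samples with each other
np.corrcoef(iris["data"], rowvar=False)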
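One detail worth checking with np.corrcoef is its rowvar argument: by default each row is treated as a variable, which is usually not what a samples-by-features array needs. The sketch below uses random data (an assumption, standing in for a real dataset) with hypothetical feature names f1 to f4, and also shows one way to put the labels back by wrapping the result in a DataFrame:

```python
import numpy as np
import pandas as pd

# Random stand-in data: 150 samples (rows) of 4 features (columns)
rng = np.random.default_rng(0)
samples = rng.normal(size=(150, 4))

by_rows = np.corrcoef(samples)                # rows as variables
by_cols = np.corrcoef(samples, rowvar=False)  # columns as variables

print(by_rows.shape)  # (150, 150)
print(by_cols.shape)  # (4, 4)

# Hypothetical feature names, used only to label the output
names = ["f1", "f2", "f3", "f4"]
labeled = pd.DataFrame(by_cols, index=names, columns=names)
print(labeled.round(2))
```

With rowvar=False the result matches what pandas returns for the same data.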
For better visualization, we can pass it directly to the sns.heatmap() function.
import seaborn as sns

data = sns.load_dataset('mpg')
correlation_matrix = data.corr(numeric_only=True)
sns.heatmap(correlation_matrix, annot=True, cmap='coolwarm')
With annot=True, each cell is labeled with its correlation coefficient. A common trick is to call sns.set_context('talk') first to make the labels easier to read.
This context is intended for slide presentations, so it produces larger fonts.
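Since a correlation matrix is symmetric, half of the heatmap repeats the other half. One possible refinement, sketched here on random toy data with hypothetical columns a, b and c, is to build a boolean mask for the cells above the diagonal. The sketch only prepares the mask and does not draw anything:

```python
import numpy as np
import pandas as pd

# Toy data (an assumption for illustration)
rng = np.random.default_rng(1)
toy = pd.DataFrame(rng.normal(size=(50, 3)), columns=["a", "b", "c"])
corr = toy.corr()

# True marks the cells to hide: everything strictly above the diagonal
mask = np.triu(np.ones(corr.shape, dtype=bool), k=1)

# Preview of what remains visible (hidden cells become NaN)
lower = corr.mask(mask)
print(lower.round(2))
```

sns.heatmap accepts such an array through its mask parameter, so a call like sns.heatmap(corr, mask=mask, annot=True, cmap='coolwarm') should show only the lower triangle and the diagonal.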
Statsmodels
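The statistical analysis library Statsmodels can also plot a correlation matrix:
import statsmodels.api as sm

correlation_matrix = data.corr(numeric_only=True)
# plot_corr draws the matrix and returns a Matplotlib figure
fig = sm.graphics.plot_corr(
    correlation_matrix,
    xnames=correlation_matrix.columns.tolist())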
plotly
By default plotly draws the first row of the matrix at the bottom, so the diagonal of 1.0 runs from bottom left to top right. This is the opposite of most other tools, so pay special attention if you use plotly.
import plotly.offline as pyo
import plotly.figure_factory as ff

pyo.init_notebook_mode(connected=True)

correlation_matrix = data.corr(numeric_only=True)
fig = ff.create_annotated_heatmap(
    z=correlation_matrix.values,
    x=list(correlation_matrix.columns),
    y=list(correlation_matrix.index),
    colorscale='Blues')
fig.show()
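If the bottom-left to top-right diagonal is confusing, one possible workaround is to reverse the row order of the matrix before handing it to plotly. The sketch below only prepares the inputs, using random toy data with hypothetical columns a, b and c, and does not call plotly itself:

```python
import numpy as np
import pandas as pd

# Toy data (an assumption for illustration)
rng = np.random.default_rng(2)
toy = pd.DataFrame(rng.normal(size=(40, 3)), columns=["a", "b", "c"])
corr = toy.corr()

# plotly draws the first row of z at the bottom, so reverse the rows
flipped = corr.iloc[::-1]

z = flipped.values
x = list(flipped.columns)
y = list(flipped.index)
print(y)  # ['c', 'b', 'a']
```

Passing these z, x and y values to ff.create_annotated_heatmap in place of the originals should place the 1.0 diagonal from top left to bottom right, as in the other tools.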
Pandas + Matplotlib for better visualization
A scatter-plot matrix can also be drawn directly with sns.pairplot(data). The figures produced by the two methods are similar, but seaborn needs only one line:
sns.pairplot(data[['mpg', 'weight', 'horsepower', 'acceleration']])
Here is how to get a similar figure with pandas and Matplotlib:
import matplotlib.pyplot as plt

pd.plotting.scatter_matrix(
    data, alpha=0.2, figsize=(6, 6), diagonal='hist')
plt.show()
p-value of correlation
If you want a simple correlation matrix that includes p-values, which is what many other tools (SPSS, Stata, R, SAS, etc.) produce by default, how do you get it in Python?
Here we need the scipy library for scientific computing. The following function implements it:
from scipy.stats import pearsonr
import pandas as pd
import seaborn as sns


def corr_full(df, numeric_only=True, rows=('corr', 'p-value', 'obs')):
    """
    Generates a correlation matrix with correlation coefficients,
    p-values, and observation count.

    Args:
    - df: Input dataframe
    - numeric_only (bool): Whether to consider only numeric columns
      for correlation. Default is True.
    - rows: Determines the information to show.
      Default is ('corr', 'p-value', 'obs').

    Returns:
    - formatted_table: The correlation matrix with the specified rows.
    """
    rows = list(rows)

    # Calculate Pearson correlation coefficients
    corr_matrix = df.corr(numeric_only=numeric_only)

    # Calculate the p-values using scipy's pearsonr
    pvalue_matrix = df.corr(
        numeric_only=numeric_only,
        method=lambda x, y: pearsonr(x, y)[1])

    # Calculate observation count for each pair of columns:
    # the number of rows where both columns are non-null
    cols = corr_matrix.columns
    valid = df[cols].notnull().to_numpy().astype(int)
    obs_matrix = pd.DataFrame(valid.T @ valid, index=cols, columns=cols)

    # Create a multi-index dataframe to store the formatted correlations
    formatted_table = pd.DataFrame(
        index=pd.MultiIndex.from_product([cols, rows]),
        columns=cols)

    # Assign values to the appropriate cells in the formatted table
    for col1 in cols:
        for col2 in cols:
            if 'corr' in rows:
                formatted_table.loc[(col1, 'corr'), col2] = (
                    corr_matrix.loc[col1, col2])
            if 'p-value' in rows:
                # Skip the diagonal: a column correlates perfectly with itself
                if col1 != col2:
                    formatted_table.loc[(col1, 'p-value'), col2] = (
                        f"({pvalue_matrix.loc[col1, col2]:.4f})")
            if 'obs' in rows:
                formatted_table.loc[(col1, 'obs'), col2] = (
                    obs_matrix.loc[col1, col2])

    return (formatted_table.fillna('')
            .style.set_properties(**{'text-align': 'center'}))
Calling the function gives the following result:
df = sns.load_dataset('mpg')
result = corr_full(df, rows=['corr', 'p-value'])
result
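The p-value step relies on passing a callable as the method argument of DataFrame.corr. A minimal sketch of that mechanism on its own, using random toy data with hypothetical columns x and y, might look like this:

```python
import numpy as np
import pandas as pd
from scipy.stats import pearsonr

# Toy data (an assumption for illustration): y depends loosely on x
rng = np.random.default_rng(3)
x = rng.normal(size=30)
toy = pd.DataFrame({"x": x, "y": 0.5 * x + rng.normal(size=30)})

# pandas calls the function once per pair of columns;
# index 1 of pearsonr's result is the p-value
pvalues = toy.corr(method=lambda a, b: pearsonr(a, b)[1])

# The same p-value computed directly, for comparison
direct = pearsonr(toy["x"], toy["y"])[1]

print(pvalues)
print(direct)
```

Note that pandas fills the diagonal with 1.0 without calling the function, so the diagonal entries of this matrix are not meaningful p-values. That is one reason to leave them out of the formatted table.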
Summary
We have introduced several methods for creating correlation coefficient matrices in Python; choose whichever is most convenient for you. The default output of most Python tools does not include p-values or observation counts, so if you need these statistics, you can use the corr_full function provided above. For a comprehensive correlation analysis, p-values and observation counts are a very helpful reference.