6 Ways to Create a Correlation Coefficient Matrix in Python


Source: Data STUDIO
This article is about 1,000 words; the recommended reading time is 6 minutes.
This article summarizes various Python calculation methods for correlation coefficient matrices. 

The correlation matrix is a basic tool for data analysis. It lets us understand how different variables relate to each other. Python offers many ways to calculate a correlation coefficient matrix, and this article summarizes them.

Pandas

Pandas’ DataFrame object can directly create a correlation matrix using the corr method. Since most people in the data science field use Pandas to get their data, this is often one of the fastest and easiest ways to check for correlations in your data.


import pandas as pd
import seaborn as sns

data = sns.load_dataset('mpg')
correlation_matrix = data.corr(numeric_only=True)
correlation_matrix

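The corr method computes Pearson coefficients by default, and its method parameter also accepts 'spearman' and 'kendall'. The following is a minimal sketch on a made-up five-row frame (the toy data and variable names are assumptions for illustration, not part of the mpg example) showing how the two measures differ on a monotonic but non-linear relationship:

```python
import pandas as pd

# Hypothetical toy data: y grows monotonically with x, but not linearly
toy = pd.DataFrame({"x": [1, 2, 3, 4, 5]})
toy["y"] = toy["x"] ** 3

pearson = toy.corr(method="pearson")    # measures linear association
spearman = toy.corr(method="spearman")  # measures rank (monotonic) association

print(pearson.loc["x", "y"])
print(spearman.loc["x", "y"])
```

Spearman works on ranks, so it reports a perfect association here, while Pearson stays below 1 because the relationship is not a straight line.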

If you work in statistics and analysis, you may ask “Where is the p-value?” We will introduce it at the end.

Numpy

NumPy also provides a function for computing the correlation coefficient matrix, which we can call directly. Because it returns an unlabeled ndarray, the result is not as easy to read as the pandas output.


import numpy as np
from sklearn.datasets import load_iris

iris = load_iris()
# Samples are stored in rows, so rowvar=False treats each column as a variable
np.corrcoef(iris["data"], rowvar=False)

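One detail worth checking with np.corrcoef is the orientation of the input: by default each row is treated as a variable, so data laid out with one sample per row needs rowvar=False. The sketch below uses made-up random data and hypothetical column names to show the difference, and wraps the result in a DataFrame to recover readable labels:

```python
import numpy as np
import pandas as pd

# Hypothetical data: 100 samples (rows) of 3 variables (columns)
rng = np.random.default_rng(0)
samples = rng.normal(size=(100, 3))
names = ["a", "b", "c"]

# Default rowvar=True treats every row as a variable: a 100 x 100 matrix
by_rows = np.corrcoef(samples)

# rowvar=False treats every column as a variable: the 3 x 3 matrix we want
corr = np.corrcoef(samples, rowvar=False)

# Attach labels so the ndarray reads like the pandas output
labeled = pd.DataFrame(corr, index=names, columns=names)
print(labeled)
```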

For a clearer picture, we can pass a correlation matrix directly to seaborn's sns.heatmap() function.


import seaborn as sns

data = sns.load_dataset('mpg')
# numeric_only=True skips the text columns (origin, name)
correlation_matrix = data.corr(numeric_only=True)

sns.heatmap(correlation_matrix,
            annot=True,
            cmap='coolwarm')


With annot=True, each cell is labeled with its correlation coefficient. A common trick is to call sns.set_context('talk') beforehand to make the output more readable.

This setting is intended for slide presentations: it enlarges the fonts so the figure is easier to read.
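As a rough sketch of how these options combine, the example below applies the 'talk' context before drawing an annotated heatmap. The random toy frame, the fixed color range (vmin/vmax), the fmt string, and the Agg backend are all assumptions added for illustration so that the snippet runs without a display or a dataset download:

```python
import matplotlib
matplotlib.use("Agg")  # assumption: non-interactive backend for headless runs
import numpy as np
import pandas as pd
import seaborn as sns

sns.set_context("talk")  # larger fonts, intended for slides

# Hypothetical toy data standing in for a real dataset
rng = np.random.default_rng(0)
toy = pd.DataFrame(rng.normal(size=(50, 3)), columns=["a", "b", "c"])
corr = toy.corr()

# annot=True writes the coefficient in every cell; fmt controls its precision.
# Fixing vmin/vmax to the range of a correlation keeps colors comparable.
ax = sns.heatmap(corr, annot=True, fmt=".2f",
                 cmap="coolwarm", vmin=-1, vmax=1)
```

In a notebook the backend line can be dropped and the figure displays inline.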

Statsmodels

The statistical analysis library Statsmodels can also plot a correlation matrix:


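import statsmodels.api as sm

# plot_corr draws the matrix and returns a Matplotlib figure
correlation_matrix = data.corr(numeric_only=True)
fig = sm.graphics.plot_corr(
    correlation_matrix,
    xnames=correlation_matrix.columns.tolist())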



plotly

By default, plotly places the first row at the bottom of the heatmap, so the diagonal of 1.0 values runs from bottom left to top right. This is the opposite of most other tools, so pay special attention when you use plotly.


import plotly.offline as pyo
pyo.init_notebook_mode(connected=True)

import plotly.figure_factory as ff

# Round so the cell annotations stay short
correlation_matrix = data.corr(numeric_only=True).round(2)

fig = ff.create_annotated_heatmap(
    z=correlation_matrix.values,
    x=list(correlation_matrix.columns),
    y=list(correlation_matrix.index),
    colorscale='Blues')

fig.show()


Pandas + Matplotlib for better visualization

A scatter-plot matrix can also be drawn directly with sns.pairplot(data). The plots produced by the two approaches are similar, but seaborn needs only one line:


sns.pairplot(data[['mpg','weight','horsepower','acceleration']])


Here is how to achieve the same result with Pandas and Matplotlib:


import matplotlib.pyplot as plt

pd.plotting.scatter_matrix(
    data, alpha=0.2,
    figsize=(6, 6),
    diagonal='hist')

plt.show()

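The pairplot call above selects a handful of columns, and the same idea applies to scatter_matrix, which otherwise plots every numeric column. A minimal sketch, assuming a made-up random frame and the Agg backend so that it runs without a display:

```python
import matplotlib
matplotlib.use("Agg")  # assumption: non-interactive backend for headless runs
import numpy as np
import pandas as pd

# Hypothetical toy data with four numeric columns
rng = np.random.default_rng(0)
toy = pd.DataFrame(rng.normal(size=(100, 4)), columns=["w", "x", "y", "z"])

# Plot only the columns of interest to keep the grid small and readable
cols = ["w", "x", "y"]
axes = pd.plotting.scatter_matrix(
    toy[cols], alpha=0.2,
    figsize=(6, 6),
    diagonal="hist")
```

The function returns a grid of Matplotlib axes, one per pair of columns, which can be customized further before calling plt.show().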

p-value of correlation

Many other tools (SPSS, Stata, R, SAS, etc.) produce a simple matrix with p-values by default. How do you get the same thing in Python?

For this we need scipy, the scientific computing library. The following function implements it:


from scipy.stats import pearsonr
import pandas as pd
import seaborn as sns


def corr_full(df, numeric_only=True, rows=['corr', 'p-value', 'obs']):
    """
    Generates a correlation matrix with correlation coefficients,
    p-values, and observation count.


    Args:
    - df: Input dataframe
    - numeric_only (bool): Whether to consider only numeric columns for
                            correlation. Default is True.
    - rows: Determines the information to show.
                            Default is ['corr', 'p-value', 'obs'].


    Returns:
    - formatted_table: The correlation matrix with the specified rows.
    """


    # Calculate Pearson correlation coefficients
    corr_matrix = df.corr(
        numeric_only=numeric_only)


    # Calculate the p-values using scipy's pearsonr
    pvalue_matrix = df.corr(
        numeric_only=numeric_only,
        method=lambda x, y: pearsonr(x, y)[1])


    # Count, for each pair of columns, the rows where both values are non-null
    valid = df[corr_matrix.columns].notnull().astype(int)
    obs_matrix = valid.T.dot(valid)


    # Create a multi-index dataframe to store the formatted correlations
    formatted_table = pd.DataFrame(
        index=pd.MultiIndex.from_product([corr_matrix.columns, rows]),
        columns=corr_matrix.columns
    )


    # Assign values to the appropriate cells in the formatted table
    for col1 in corr_matrix.columns:
        for col2 in corr_matrix.columns:
            if 'corr' in rows:
                formatted_table.loc[
                    (col1, 'corr'), col2] = corr_matrix.loc[col1, col2]


            if 'p-value' in rows:
                # Skip p-values on the diagonal, where each column correlates perfectly with itself
                if col1 != col2:
                    formatted_table.loc[
                        (col1, 'p-value'), col2] = f"({pvalue_matrix.loc[col1, col2]:.4f})"
            if 'obs' in rows:
                formatted_table.loc[
                    (col1, 'obs'), col2] = obs_matrix.loc[col1, col2]


    return(formatted_table.fillna('')
            .style.set_properties(**{'text-align': 'center'}))

Calling this function directly gives the following result:


df = sns.load_dataset('mpg')
result = corr_full(df, rows=['corr', 'p-value'])
result

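When the styled table is not needed, a lighter alternative is to return the coefficients and p-values as two plain DataFrames. The sketch below is one possible way to do that, not the function above: corr_with_pvalues is a hypothetical helper name, the toy data is made up, and missing values are handled by dropping incomplete rows per pair of columns.

```python
import numpy as np
import pandas as pd
from scipy.stats import pearsonr


def corr_with_pvalues(df):
    """Return (coefficients, p-values) as two plain DataFrames."""
    cols = df.select_dtypes(include="number").columns
    coefs = pd.DataFrame(np.eye(len(cols)), index=cols, columns=cols)
    pvals = pd.DataFrame(np.nan, index=cols, columns=cols)
    for i, a in enumerate(cols):
        for b in cols[i + 1:]:
            # Pairwise deletion: keep rows where both columns are present
            pair = df[[a, b]].dropna()
            r, p = pearsonr(pair[a], pair[b])
            coefs.loc[a, b] = coefs.loc[b, a] = r
            pvals.loc[a, b] = pvals.loc[b, a] = p
    return coefs, pvals


# Hypothetical toy data: y depends linearly on x, z is independent noise
rng = np.random.default_rng(0)
x = rng.normal(size=50)
toy = pd.DataFrame({"x": x, "y": 2 * x + 1, "z": rng.normal(size=50)})
toy.loc[[3, 7], "z"] = np.nan  # a few missing values

coefs, pvals = corr_with_pvalues(toy)
print(coefs.round(3))
print(pvals.round(4))
```

The diagonal of the p-value table is left as NaN, since testing a column against itself is not meaningful.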

Summary

We have introduced several methods for creating correlation coefficient matrices in Python; choose whichever is most convenient for you. The default output of most Python tools does not include p-values or observation counts, so if you need these statistics, you can use the function provided above. For a comprehensive correlation analysis, p-values and observation counts are very helpful as a reference.

Editor: Huang Jiyan

