With just a few lines of Python code, you can run a comprehensive, automated exploratory data analysis!

Source | Data STUDIO

Exploratory data analysis (EDA) is an important part of both model development and data set research in data science. When you receive a new data set, you usually have to spend a lot of time on EDA to understand the information it contains. Automated EDA packages can do much of this work in just a few lines of Python code. This article covers 10 Python packages that run EDA automatically and generate insights about the data, looking at what each one offers and how far it can go toward meeting our EDA needs.
  1. D-Tale
  2. Pandas-Profiling
  3. Sweetviz
  4. AutoViz
  5. Dataprep
  6. Klib
  7. Dabl
  8. Speedml
  9. DataTile
  10. edaviz
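
For context, the manual first pass that these packages automate might look roughly like the sketch below. It uses only pandas, and the small inline DataFrame with made-up values is a stand-in for a real file such as titanic.csv; it is an illustration of the idea, not how any of the packages work internally.

```python
import pandas as pd

# Tiny stand-in for a real dataset (values are made up for illustration).
df = pd.DataFrame(
    {
        "Age": [22.0, 38.0, None, 35.0],
        "Fare": [7.25, 71.28, 7.92, 53.10],
        "Sex": ["male", "female", "female", "male"],
    }
)

# Column-level overview: dtype, missing values, distinct values.
overview = pd.DataFrame(
    {
        "dtype": df.dtypes.astype(str),
        "missing": df.isna().sum(),
        "unique": df.nunique(),
    }
)

# Summary statistics and pairwise correlations for the numeric columns.
numeric_stats = df.describe()
correlations = df.select_dtypes(include="number").corr()

print(overview)
print(numeric_stats)
print(correlations)
```

Each of these steps has to be written, plotted, and checked by hand, which is the effort the packages below try to remove.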

1. D-Tale

D-Tale uses a Flask back end and a React front end, and it integrates seamlessly with IPython notebooks and the terminal. It supports pandas DataFrame, Series, MultiIndex, DatetimeIndex, and RangeIndex objects.

import dtale
import pandas as pd
dtale.show(pd.read_csv("titanic.csv"))

With a single line of code, D-Tale generates a report containing an overall summary of the data set, correlations, charts, and heatmaps, and it highlights missing values. Each chart in the report can be explored further, and the charts are interactive.

2. Pandas-Profiling

Pandas-Profiling generates summary reports for pandas DataFrames. It extends the DataFrame with a df.profile_report() method, works well on large data sets, and can create reports in seconds. (Newer releases of the package are published under the name ydata-profiling.)

# Install pandas and pandas-profiling before importing
import pandas as pd
from pandas_profiling import ProfileReport

# EDA using pandas-profiling
profile = ProfileReport(pd.read_csv('titanic.csv'), explorative=True)

# Save the results to an HTML file
profile.to_file("output.html")
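
If a full report turns out to be slow on a very large table, one common workaround is to profile a random sample and switch on the package's lighter `minimal=True` mode. The sketch below assumes that approach; the helper names `sample_for_profiling` and `write_minimal_report` and the 10,000-row limit are hypothetical choices for illustration, not part of pandas-profiling.

```python
import pandas as pd


def sample_for_profiling(df: pd.DataFrame, max_rows: int = 10_000, seed: int = 0) -> pd.DataFrame:
    """Return df unchanged if it is small, otherwise a reproducible random sample."""
    if len(df) <= max_rows:
        return df
    return df.sample(n=max_rows, random_state=seed)


def write_minimal_report(df: pd.DataFrame, out_path: str = "output.html") -> None:
    """Profile a (possibly down-sampled) frame using the lighter 'minimal' mode."""
    # Imported here so the sampling helper works without the package installed.
    from pandas_profiling import ProfileReport

    report = ProfileReport(sample_for_profiling(df), minimal=True)
    report.to_file(out_path)
```

A call such as `write_minimal_report(pd.read_csv("titanic.csv"))` would then write the report to output.html. Sampling trades completeness for speed, so rare values may be missing from the report.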

3. Sweetviz

Sweetviz is an open-source Python library that needs only two lines of code to generate attractive visualizations and launch an EDA report as an HTML application. It is built around quickly visualizing target values and comparing data sets.

import pandas as pd
import sweetviz as sv

# EDA using Sweetviz
sweet_report = sv.analyze(pd.read_csv("titanic.csv"))

# Save the results to an HTML file
sweet_report.show_html('sweet_report.html')

The reports generated by Sweetviz contain an overall summary of the data set, correlations, and associations between categorical and numerical features.
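
Because Sweetviz is built around target values and data set comparison, a second typical use is comparing two subsets of the same data. The sketch below assumes a simple random split and uses `sv.compare` with its `target_feat` argument; the helper names `split_frame` and `compare_report`, the 80/20 split, and the output file name are hypothetical.

```python
import pandas as pd


def split_frame(df: pd.DataFrame, train_frac: float = 0.8, seed: int = 0):
    """Split df into two disjoint parts (assumes a unique index)."""
    train = df.sample(frac=train_frac, random_state=seed)
    test = df.drop(train.index)
    return train, test


def compare_report(df: pd.DataFrame, target: str, out_path: str = "compare_report.html") -> None:
    """Build a Sweetviz report comparing the two parts, focused on `target`."""
    import sweetviz as sv  # imported lazily; requires the sweetviz package

    train, test = split_frame(df)
    report = sv.compare([train, "Train"], [test, "Test"], target_feat=target)
    report.show_html(out_path)
```

For the Titanic data this could be called as `compare_report(pd.read_csv("titanic.csv"), "Survived")`.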

4. AutoViz

The AutoViz package can automatically visualize data sets of any size with one line of code and generate reports in HTML, Bokeh, and other formats. Users can interact with the HTML reports it generates.

import pandas as pd
from autoviz.AutoViz_Class import AutoViz_Class

#EDA using Autoviz
autoviz = AutoViz_Class().AutoViz('train.csv')

5. Dataprep

DataPrep is an open-source Python package for analyzing, preparing, and processing data. It is built on top of pandas and Dask DataFrames and integrates easily with other Python libraries.

Of the 10 packages covered here, DataPrep ran the fastest, generating reports for pandas and Dask DataFrames in seconds.

from dataprep.datasets import load_dataset
from dataprep.eda import create_report

# load_dataset takes the name of a bundled dataset, without a file extension
df = load_dataset("titanic")
create_report(df).show_browser()

6. Klib

klib is a Python library for importing, cleaning, analyzing and preprocessing data.

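import klib
import pandas as pd

df = pd.read_csv('DATASET.csv')

# Visualize missing values per column
klib.missingval_plot(df)

# Clean the data first; column names are left as-is so 'Win_Prob' below still matches
df_cleaned = klib.data_cleaning(df, clean_col_names=False)

# Correlation heatmap of the cleaned data
klib.corr_plot(df_cleaned, annot=False)

# Distribution of a single numeric column ('Win_Prob' comes from the example dataset)
klib.dist_plot(df_cleaned['Win_Prob'])

# Overview of the categorical columns
klib.cat_plot(df, figsize=(50, 15))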

Although klib provides many analysis functions, we have to write the code for each analysis ourselves, so it is only semi-automated. If we need a more customized analysis, however, it is very convenient.
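
One way to make klib feel closer to a one-line tool is to wrap the individual calls in a small helper. The sketch below is an assumption about how that might be done; `numeric_columns` and `klib_overview` are hypothetical names, and only the klib functions already used in this section are called.

```python
import pandas as pd


def numeric_columns(df: pd.DataFrame) -> list:
    """Names of the numeric columns, in their original order."""
    return list(df.select_dtypes(include="number").columns)


def klib_overview(df: pd.DataFrame) -> None:
    """Run several klib plots in one call."""
    import klib  # imported lazily; requires the klib package

    klib.missingval_plot(df)
    klib.corr_plot(df, annot=False)
    # One distribution plot per numeric column.
    for col in numeric_columns(df):
        klib.dist_plot(df[col])
```

Calling `klib_overview(df)` then produces the missing-value plot, the correlation plot, and one distribution plot per numeric column, while each step stays easy to customize.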

7. Dabl

Dabl focuses less on statistical measures of individual columns and more on providing a quick overview through visualization, as well as convenient machine learning preprocessing and model search.

The plot() function in dabl visualizes the data by drawing several kinds of charts, including:

  • Target distribution plots
  • Scatter plots
  • Linear discriminant analysis

import pandas as pd
import dabl

df = pd.read_csv("titanic.csv")
dabl.plot(df, target_col="Survived")

8. Speedml

SpeedML is a Python package for quickly launching machine learning pipelines. It integrates several commonly used ML packages, including pandas, NumPy, scikit-learn, XGBoost, and Matplotlib, so it offers much more than automated EDA.

According to the SpeedML project, it supports an iterative development workflow and reduces coding time by 70%.

from speedml import Speedml

# Load the train and test sets, naming the target and ID columns
sml = Speedml('../input/train.csv', '../input/test.csv',
              target='Survived', uid='PassengerId')
sml.train.head()

# Correlation plot of the features
sml.plot.correlate()

# Distribution plots of the features
sml.plot.distribute()

# Plots for ordinal features
sml.plot.ordinal('Parch')
sml.plot.ordinal('SibSp')

# Plot for a continuous feature
sml.plot.continuous('Age')

9. DataTile

DataTile (formerly Pandas-Summary) is an open-source Python package for managing, summarizing, and visualizing data. It is essentially an extension of the pandas DataFrame describe() function.

import pandas as pd
from datatile.summary.df import DataFrameSummary

df = pd.read_csv('titanic.csv')
dfs = DataFrameSummary(df)
dfs.summary()
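
To give a feel for what "an extension of describe()" means, here is a pandas-only sketch that appends a few extra per-column rows to the output of describe(). It illustrates the idea and is not DataTile's implementation; the function name `extended_summary` and the extra row names are hypothetical.

```python
import pandas as pd


def extended_summary(df: pd.DataFrame) -> pd.DataFrame:
    """describe() plus per-column counts of missing and distinct values."""
    stats = df.describe(include="all")
    extra = pd.DataFrame(
        {
            "missing": df.isna().sum(),
            "missing_pct": df.isna().mean() * 100,
            "distinct": df.nunique(),
            "dtype": df.dtypes.astype(str),
        }
    ).T  # transpose so each statistic becomes a row, like describe()
    return pd.concat([stats, extra])


example = pd.DataFrame({"Age": [22.0, None, 35.0], "Sex": ["m", "f", "f"]})
print(extended_summary(example))
```

The result keeps the familiar describe() rows and adds the missing-value and distinct-value counts below them.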

10. edaviz

edaviz is a Python library for data exploration and visualization in Jupyter Notebook and JupyterLab. It was originally very easy to use, but it was later integrated into bamboolib, which was acquired by Databricks, so it is only mentioned briefly here.

Summary

This article introduced 10 Python packages for automated exploratory data analysis that generate data summaries and visualizations in a few lines of code. Automating this work can save a lot of time.

DataPrep is the EDA package I use most often. AutoViz and D-Tale are also good choices. If you need customized analysis, you can use Klib. SpeedML bundles many things, so it is not especially well suited to EDA alone. As for the other packages, choose according to personal preference; all of them are useful. Finally, edaviz is best left out because it is no longer open source.