With just a few lines of Python code, you can achieve comprehensive automatic exploratory data analysis!

Exploratory data analysis (EDA) is an important part of data science model development and dataset research. When you get a new dataset, you usually need to spend a lot of time on EDA to understand the information it contains. Automated EDA packages can do much of this work with just a few lines of Python code. This article covers 10 Python packages that run EDA automatically and generate insights about the data, looking at what each one offers and how far it can go toward automating our EDA needs.

Contents

- 1. D-Tale
- 2. Pandas-Profiling
- 3. Sweetviz
- 4. AutoViz
- 5. Dataprep
- 6. Klib
- 7. Dabl
- 8. Speedml
- 9. DataTile
- 10. edaviz
- Summary

1. D-Tale

D-Tale uses a Flask backend and a React frontend, and integrates seamlessly with IPython notebooks and the terminal. It supports the pandas DataFrame, Series, MultiIndex, DatetimeIndex and RangeIndex objects.

import dtale
import pandas as pd
dtale.show(pd.read_csv("titanic.csv"))

With a single line of code, D-Tale generates a report containing an overall summary of the dataset, correlations, charts and heatmaps, and highlighted missing values. D-Tale also lets you analyze each chart in the report; as the screenshot above shows, the charts are interactive.
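
To make concrete what such a report summarizes, here is a minimal pandas sketch of two of its ingredients, missing-value counts and numeric correlations. This is not how D-Tale computes them internally; the small DataFrame and its values are made up for illustration and stand in for titanic.csv.

```python
import pandas as pd

# Tiny stand-in dataset (hypothetical values, not the real titanic.csv)
df = pd.DataFrame({
    "Age": [22.0, 38.0, None, 35.0],
    "Fare": [7.25, 71.28, 7.92, 53.10],
    "Cabin": [None, "C85", None, "C123"],
})

# Missing values per column: the numbers behind a "missing values" highlight
missing_counts = df.isna().sum()

# Pairwise Pearson correlations of the numeric columns: the data behind a heatmap
correlations = df.select_dtypes("number").corr()

print(missing_counts)
print(correlations.round(2))
```

Writing these steps by hand for every new dataset is the repetitive work that the automated reports are meant to replace.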

2. Pandas-Profiling

Pandas-Profiling generates summary reports for a pandas DataFrame. It extends the DataFrame with a df.profile_report() method, works well on large datasets, and can create reports in seconds. (Newer releases of the package are published under the name ydata-profiling.)

# Install pandas and pandas-profiling before importing
import pandas as pd
from pandas_profiling import ProfileReport

# EDA using pandas-profiling
profile = ProfileReport(pd.read_csv('titanic.csv'), explorative=True)

# Save the results to an HTML file
profile.to_file("output.html")

3. Sweetviz

Sweetviz is an open source Python library that requires only two lines of Python code to generate beautiful visualizations and launch EDA (Exploratory Data Analysis) as an HTML application. The Sweetviz package is built around quickly visualizing target values and comparing data sets.

import pandas as pd
import sweetviz as sv

# EDA using Sweetviz
sweet_report = sv.analyze(pd.read_csv("titanic.csv"))

# Save the results to an HTML file
sweet_report.show_html('sweet_report.html')

The reports generated by Sweetviz contain an overall summary of the dataset, correlations, and associations between categorical and numerical features.

4. AutoViz

AutoViz can automatically visualize datasets of any size with one line of code and generate reports in HTML, Bokeh and other formats. The HTML reports it produces are interactive.

import pandas as pd
from autoviz.AutoViz_Class import AutoViz_Class
  
#EDA using Autoviz
autoviz = AutoViz_Class().AutoViz('train.csv')

5. Dataprep

DataPrep is an open source Python package for analyzing, preparing, and processing data. It is built on top of pandas and Dask DataFrames and integrates easily with other Python libraries.

Of the 10 packages covered here, DataPrep runs the fastest, generating reports for pandas or Dask DataFrames in seconds.

from dataprep.datasets import load_dataset
from dataprep.eda import create_report
  
df = load_dataset("titanic.csv")
create_report(df).show_browser()

6. Klib

klib is a Python library for importing, cleaning, analyzing and preprocessing data.

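import klib
import pandas as pd

df = pd.read_csv('DATASET.csv')

# Plot the missing values in each column
klib.missingval_plot(df)

# df_cleaned was not defined in the original; it is assumed here to be the
# output of klib.data_cleaning, with the original column names kept
df_cleaned = klib.data_cleaning(df, clean_col_names=False)

# Correlation heatmap of the cleaned data
klib.corr_plot(df_cleaned, annot=False)

# Distribution plot of a single numeric column
klib.dist_plot(df_cleaned['Win_Prob'])

# Overview of the categorical columns
klib.cat_plot(df, figsize=(50,15))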

Although klib provides many analysis functions, we have to write the code for each analysis ourselves, so it is only semi-automated. If we need more customized analysis, however, it is very convenient.
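
Because each klib analysis is a separate call, one way to get closer to a one-shot report is a small wrapper that applies a list of analysis functions to the same data. The sketch below is an assumption of how that could look, not part of klib; run_analyses and the demo inputs are hypothetical names, and the klib usage is shown only as a comment.

```python
from typing import Any, Callable, Dict, Iterable, Tuple


def run_analyses(
    data: Any, steps: Iterable[Tuple[str, Callable[[Any], Any]]]
) -> Dict[str, Any]:
    """Run each named analysis step on the same data and collect the results."""
    results: Dict[str, Any] = {}
    for name, step in steps:
        try:
            results[name] = step(data)
        except Exception as exc:  # record the failure and keep going
            results[name] = exc
    return results


# Hypothetical usage with klib (requires klib and a DataFrame named df):
# import klib
# results = run_analyses(df, [
#     ("missing values", klib.missingval_plot),
#     ("correlations", klib.corr_plot),
#     ("categories", klib.cat_plot),
# ])

# Self-contained demonstration with plain Python callables
demo = run_analyses([3, 1, 2], [("smallest", min), ("largest", max), ("count", len)])
print(demo)
```

Storing the exception instead of raising it means one analysis that does not apply to the dataset does not stop the remaining ones.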

7. Dabl

Dabl focuses less on statistical measures of individual columns and more on providing a quick overview through visualization, as well as convenient machine learning preprocessing and model search.

The plot() function in dabl produces several kinds of charts, including:

- target distribution plots
- scatter plots
- linear discriminant analysis

import pandas as pd
import dabl
  
df = pd.read_csv("titanic.csv")
dabl.plot(df, target_col="Survived")

8. Speedml

SpeedML is a Python package for quickly launching machine learning pipelines. It integrates commonly used ML packages, including pandas, NumPy, scikit-learn, XGBoost and Matplotlib, so it covers considerably more than automated EDA.

According to the SpeedML project, its iterative development workflow reduces coding time by 70%.

from speedml import Speedml
  
sml = Speedml('../input/train.csv', '../input/test.csv',
            target = 'Survived', uid = 'PassengerId')
sml.train.head()

sml.plot.correlate()

sml.plot.distribute()

sml.plot.ordinal('Parch')

sml.plot.ordinal('SibSp')

sml.plot.continuous('Age')

9. DataTile

DataTile (formerly Pandas-Summary) is an open source Python package for managing, summarizing, and visualizing data. It is essentially an extension of the pandas DataFrame describe() function.

import pandas as pd
from datatile.summary.df import DataFrameSummary
  
df = pd.read_csv('titanic.csv')
dfs = DataFrameSummary(df)
dfs.summary()
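
For comparison, the following sketch shows what plain describe() reports by default and how a few per-column statistics can be added alongside it with pandas alone. It only illustrates the idea of extending describe(); column_overview is a hypothetical helper, not DataTile's implementation, and the sample rows are invented.

```python
import pandas as pd


def column_overview(df: pd.DataFrame) -> pd.DataFrame:
    """Per-column dtype, non-null count and unique count for every column."""
    return pd.DataFrame({
        "dtype": df.dtypes.astype(str),
        "non_null": df.count(),
        "unique": df.nunique(),
    })


# Invented sample rows standing in for a real dataset
df = pd.DataFrame({
    "Survived": [0, 1, 1, 0],
    "Sex": ["male", "female", "female", "male"],
    "Age": [22.0, 38.0, None, 35.0],
})

print(df.describe())        # numeric columns only by default
print(column_overview(df))  # one row per column, numeric or not
```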

10. edaviz

edaviz is a Python library for data exploration and visualization in Jupyter Notebook and JupyterLab. It used to be very easy to use, but it was later acquired by Databricks and integrated into bamboolib, so only a brief introduction is given here.

Summary

In this article, we introduce 10 automated exploratory data analysis Python packages that can generate data summaries and visualizations in a few lines of Python code. Automating work can save us a lot of time.

DataPrep is the EDA package I use most often. AutoViz and D-Tale are also good choices. If you need customized analysis, use klib. SpeedML bundles many things, so using it only for EDA is not a great fit. The other packages can be chosen according to personal preference; all of them are useful. Finally, edaviz can be left out of consideration because it is no longer open source.
