Source | Data STUDIO
Exploratory data analysis (EDA) is an important part of data science model development and data set research. When you get a new data set, you first need to spend considerable time on EDA to understand the information it contains. Automated EDA packages can perform this analysis with just a few lines of Python code. This article covers 10 Python packages that run EDA automatically and generate insights about the data, looking at what each one offers and how far it goes toward automating EDA.
- D-Tale
- Pandas-Profiling
- Sweetviz
- AutoViz
- DataPrep
- klib
- dabl
- SpeedML
- DataTile
- edaviz
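For context, the manual checks that these packages bundle into a report can be sketched with plain pandas. This is only an illustrative baseline, not how any of the packages works internally; the sample values and the `manual_eda` helper are hypothetical.

```python
import pandas as pd

# Small made-up sample standing in for a real data set (hypothetical values).
df = pd.DataFrame({
    "age": [22.0, 38.0, None, 35.0],
    "fare": [7.25, 71.28, 7.92, 53.10],
    "sex": ["male", "female", "female", "female"],
})


def manual_eda(frame):
    """Collect the basic checks that automated EDA reports bundle together."""
    numeric = frame.select_dtypes(include="number")
    return {
        "shape": frame.shape,                    # (rows, columns)
        "missing": frame.isna().sum().to_dict(), # missing values per column
        "summary": numeric.describe(),           # count, mean, std, quartiles
        "correlation": numeric.corr(),           # pairwise Pearson correlation
    }


report = manual_eda(df)
print(report["shape"])
print(report["missing"])
print(report["summary"])
```

Each package below replaces some or all of these steps, and usually adds charts, with one or two calls.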
1. D-Tale
D-Tale uses a Flask backend and a React frontend, and integrates seamlessly with IPython notebooks and terminals. It supports pandas DataFrame, Series, MultiIndex, DatetimeIndex and RangeIndex objects.
import dtale
import pandas as pd
dtale.show(pd.read_csv("titanic.csv"))
With a single line of code, D-Tale generates a report containing an overall summary of the dataset, correlations, charts and heatmaps, with missing values highlighted. Each chart in the report can be analyzed further, and the charts are interactive.
2. Pandas-Profiling
Pandas-Profiling generates summary reports for pandas DataFrames. It extends the pandas DataFrame with a df.profile_report() method and works well on large datasets, creating reports in seconds.
# Install pandas and pandas-profiling before importing them
import pandas as pd
from pandas_profiling import ProfileReport
# EDA using pandas-profiling
profile = ProfileReport(pd.read_csv("titanic.csv"), explorative=True)
# Save the results to an HTML file
profile.to_file("output.html")
3. Sweetviz
Sweetviz is an open source Python library that requires only two lines of Python code to generate beautiful visualizations and launch EDA (Exploratory Data Analysis) as an HTML application. The Sweetviz package is built around quickly visualizing target values and comparing data sets.
import pandas as pd
import sweetviz as sv
# EDA using Sweetviz
sweet_report = sv.analyze(pd.read_csv("titanic.csv"))
# Save the results to an HTML file
sweet_report.show_html("sweet_report.html")
The reports generated by the Sweetviz library contain an overall summary of the dataset, correlations, classifications, and numerical feature associations.
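The idea of comparing two data sets can be illustrated with a rough pandas-only sketch. This is not Sweetviz's implementation, and its comparison report is far richer; the sample frames and the `compare_frames` helper are hypothetical.

```python
import pandas as pd

# Two made-up samples standing in for a train/test split (hypothetical values).
train = pd.DataFrame({"age": [22.0, 38.0, 26.0, 35.0],
                      "fare": [7.25, 71.28, 7.92, 53.10]})
test = pd.DataFrame({"age": [34.0, 47.0, 62.0],
                     "fare": [7.83, 7.00, 9.69]})


def compare_frames(left, right, names=("train", "test")):
    """Place the describe() statistics of two data sets side by side."""
    # Transposing gives one row per column; the dict keys become the
    # outer level of a two-level column index: (data set name, statistic).
    return pd.concat(
        {names[0]: left.describe().T, names[1]: right.describe().T},
        axis=1,
    )


comparison = compare_frames(train, test)
print(comparison[[("train", "mean"), ("test", "mean")]])
```

Large gaps between the two sides, for example in the mean of a column, are the kind of difference a comparison report is meant to surface.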
4. AutoViz
The AutoViz package can visualize data sets of any size with one line of code and generate reports in HTML, Bokeh and other formats. Users can interact with the HTML reports it generates.
import pandas as pd
from autoviz.AutoViz_Class import AutoViz_Class
# EDA using AutoViz
autoviz = AutoViz_Class().AutoViz("train.csv")
5. Dataprep
DataPrep is an open source Python package for analyzing, preparing, and processing data. It is built on top of pandas and Dask DataFrames and integrates easily with other Python libraries.
DataPrep runs the fastest of these 10 packages and can generate reports for pandas or Dask DataFrames in seconds.
from dataprep.datasets import load_dataset
from dataprep.eda import create_report
# load_dataset takes the name of a bundled sample data set, without a file extension
df = load_dataset("titanic")
create_report(df).show_browser()
6. Klib
klib is a Python library for importing, cleaning, analyzing and preprocessing data.
import klib
import pandas as pd
df = pd.read_csv("DATASET.csv")
klib.missingval_plot(df)
# Assumption: df_cleaned is the output of klib.data_cleaning, keeping the original column names
df_cleaned = klib.data_cleaning(df, clean_col_names=False)
klib.corr_plot(df_cleaned, annot=False)
klib.dist_plot(df_cleaned["Win_Prob"])
klib.cat_plot(df, figsize=(50, 15))
Although klib provides many analysis functions, each analysis has to be coded by hand, so it is best described as semi-automated. That makes it very convenient when more customized analysis is needed.
7. Dabl
Dabl focuses less on statistical measures of individual columns and more on providing a quick overview through visualization, as well as convenient machine learning preprocessing and model search.
The plot() function in dabl produces several kinds of charts, including:
- target distribution plot
- scatter plots
- linear discriminant analysis
import pandas as pd
import dabl
df = pd.read_csv("titanic.csv")
dabl.plot(df, target_col="Survived")
8. Speedml
SpeedML is a Python package for quickly launching machine learning pipelines. It integrates several commonly used ML packages, including pandas, NumPy, scikit-learn, XGBoost and Matplotlib, so it offers more than automated EDA.
According to the SpeedML project, it enables iterative development and reduces coding time by 70%.
from speedml import Speedml
sml = Speedml("../input/train.csv", "../input/test.csv", target="Survived", uid="PassengerId")
sml.train.head()
sml.plot.correlate()
sml.plot.distribute()
sml.plot.ordinal('Parch')
sml.plot.ordinal('SibSp')
sml.plot.continuous('Age')
9. DataTile
DataTile (formerly pandas-summary) is an open source Python package for managing, summarizing, and visualizing data. It is essentially an extension of the pandas DataFrame describe() function.
import pandas as pd
from datatile.summary.df import DataFrameSummary
df = pd.read_csv("titanic.csv")
dfs = DataFrameSummary(df)
dfs.summary()
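To give a feel for what extending describe() can mean, here is a minimal pandas-only sketch that adds dtype, missing-value and unique-value counts to the usual statistics. It is not DataTile's implementation; the sample values and the `extended_summary` helper are hypothetical.

```python
import pandas as pd

# Small made-up sample standing in for a real data set (hypothetical values).
df = pd.DataFrame({
    "age": [22.0, 38.0, None, 35.0],
    "fare": [7.25, 71.28, 7.92, 53.10],
    "sex": ["male", "female", "female", "female"],
})


def extended_summary(frame):
    """One row per column: describe() statistics plus dtype, missing and unique counts."""
    extra = pd.DataFrame({
        "dtype": frame.dtypes.astype(str),
        "missing": frame.isna().sum(),
        "missing_pct": frame.isna().mean() * 100,
        "unique": frame.nunique(),
    })
    # describe() covers only numeric columns here, so non-numeric
    # columns get NaN for the statistics after the left join.
    stats = frame.describe().T
    return extra.join(stats, how="left")


summary = extended_summary(df)
print(summary[["dtype", "missing", "unique", "mean"]])
```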
10. edaviz
edaviz is a Python library for data exploration and visualization in Jupyter Notebook and JupyterLab. It used to be very easy to use, but it was later integrated into bamboolib, which was acquired by Databricks, so it is only introduced briefly here.
Summary
In this article, we introduce 10 automated exploratory data analysis Python packages that can generate data summaries and visualizations in a few lines of Python code. Automating work can save us a lot of time.
DataPrep is the EDA package I use most often. AutoViz and D-Tale are also good choices. If you need customized analysis, you can use klib. SpeedML integrates many things, so it is not particularly well suited to EDA on its own. For the other packages, choose according to personal preference; they are all useful. Finally, edaviz is best left out of consideration because it is no longer open source.