Exploratory data analysis (EDA) is an important part of data science model development and data set research. When you get a new data set, you usually need to spend considerable time on EDA to understand the information it contains. Automated EDA packages can do much of this work in just a few lines of Python code. This article looks at 10 Python packages that automate EDA and generate insights about data, covering what each one offers and how far it can go in meeting our EDA needs.
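For context, the manual checks that these packages automate usually look something like the following. This is only a minimal sketch using plain pandas on a small made-up DataFrame; the column names and values are illustrative assumptions and do not come from any real data set.

```python
import pandas as pd

# Tiny made-up sample standing in for a real data set (values are illustrative only).
df = pd.DataFrame(
    {
        "age": [22.0, 38.0, None, 35.0, 54.0],
        "fare": [7.25, 71.28, 7.92, 53.10, 51.86],
        "sex": ["male", "female", "female", "female", "male"],
        "survived": [0, 1, 1, 1, 0],
    }
)

# 1. Missing values per column.
missing = df.isna().sum()

# 2. Summary statistics for the numeric columns.
numeric_summary = df.describe()

# 3. Frequency counts for a categorical column.
sex_counts = df["sex"].value_counts()

# 4. Pairwise correlations between the numeric columns.
correlations = df.select_dtypes(include="number").corr()

print(missing, numeric_summary, sex_counts, correlations, sep="\n\n")
```

Each of the packages below bundles steps like these, plus plotting, into one or two calls.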
Article directory
- 1. D-Tale
- 2. Pandas-Profiling
- 3. Sweetviz
- 4. AutoViz
- 5. Dataprep
- 6. Klib
- 7. Dabl
- 8. Speedml
- 9. DataTile
- 10. edaviz
- Summary
1. D-Tale
D-Tale uses Flask as its backend and React as its frontend, and it integrates seamlessly with IPython notebooks and the terminal. It supports pandas DataFrame, Series, MultiIndex, DatetimeIndex, and RangeIndex objects.
import dtale
import pandas as pd

dtale.show(pd.read_csv("titanic.csv"))
With a single line of code, D-Tale generates a report containing an overall summary of the data set, correlations, charts, and heatmaps, and it highlights missing values. Each chart in the report can be analyzed further, and the charts are interactive.
2. Pandas-Profiling
Pandas-Profiling generates summary reports for pandas DataFrames. It extends the DataFrame with a df.profile_report() method and handles large data sets well, creating reports in seconds. (The project has since been renamed ydata-profiling.)
# Install the libraries below before importing
import pandas as pd
from pandas_profiling import ProfileReport

# EDA using pandas-profiling
profile = ProfileReport(pd.read_csv('titanic.csv'), explorative=True)

# Save the results to an HTML file
profile.to_file("output.html")
3. Sweetviz
Sweetviz is an open source Python library that requires only two lines of Python code to generate beautiful visualizations and launch EDA (Exploratory Data Analysis) as an HTML application. The Sweetviz package is built around quickly visualizing target values and comparing data sets.
import pandas as pd
import sweetviz as sv

# EDA using Sweetviz
sweet_report = sv.analyze(pd.read_csv("titanic.csv"))

# Save the results to an HTML file
sweet_report.show_html('sweet_report.html')
The reports generated by the Sweetviz library contain an overall summary of the dataset, correlations, classifications, and numerical feature associations.
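To make the idea of comparing data sets more concrete, here is a rough pandas-only sketch of the kind of side-by-side numbers such a comparison presents. It does not use Sweetviz and does not reproduce its implementation; the `train` and `test` frames, their values, and the helper `compare_frames` are hypothetical and exist only for illustration.

```python
import pandas as pd


def compare_frames(left: pd.DataFrame, right: pd.DataFrame) -> pd.DataFrame:
    """Return mean and missing-value share per shared numeric column, side by side."""
    # Only columns present in both frames can be compared.
    shared = [col for col in left.columns if col in right.columns]
    # Assumption: a column that is numeric in `left` is also numeric in `right`.
    numeric = left[shared].select_dtypes(include="number").columns
    return pd.DataFrame(
        {
            "left_mean": left[numeric].mean(),
            "right_mean": right[numeric].mean(),
            "left_missing": left[numeric].isna().mean(),
            "right_missing": right[numeric].isna().mean(),
        }
    )


# Made-up train/test splits (illustrative values only).
train = pd.DataFrame(
    {
        "age": [20.0, 30.0, None, 40.0],
        "fare": [10.0, 20.0, 30.0, 40.0],
        "survived": [0, 1, 1, 0],
    }
)
test = pd.DataFrame({"age": [25.0, 35.0], "fare": [15.0, 45.0]})

comparison = compare_frames(train, test)
print(comparison)
```

A large gap between the two means, or between the two missing-value shares, is the kind of difference between data sets that a comparison report is meant to surface.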
4. AutoViz
The AutoViz package can automatically visualize data sets of any size with one line of code and generate reports in HTML, Bokeh, and other formats. The HTML reports it generates are interactive.
import pandas as pd
from autoviz.AutoViz_Class import AutoViz_Class

# EDA using AutoViz
autoviz = AutoViz_Class().AutoViz('train.csv')
5. Dataprep
DataPrep is an open source Python package for analyzing, preparing, and processing data. It is built on top of pandas and Dask DataFrames and integrates easily with other Python libraries.
Of the 10 packages covered here, DataPrep ran the fastest in my use, generating reports for pandas and Dask DataFrames in seconds.
from dataprep.datasets import load_dataset
from dataprep.eda import create_report

# load_dataset takes the name of a bundled data set, without a file extension
df = load_dataset("titanic")
create_report(df).show_browser()
6. Klib
klib is a Python library for importing, cleaning, analyzing and preprocessing data.
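import klib
import pandas as pd

df = pd.read_csv('DATASET.csv')
# Clean the data first so that df_cleaned is defined for the plots below
df_cleaned = klib.data_cleaning(df)

klib.missingval_plot(df)
klib.corr_plot(df_cleaned, annot=False)
klib.dist_plot(df_cleaned['Win_Prob'])
klib.cat_plot(df, figsize=(50,15))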
Although klib provides many analysis functions, we have to write the code for each analysis ourselves, so it is only semi-automated. When we need more customized analysis, however, it is very convenient.
7. Dabl
Dabl focuses less on statistical measures of individual columns and more on providing a quick overview through visualization, as well as convenient machine learning preprocessing and model search.
The plot() function in dabl produces several kinds of charts, including:
- target distribution plots
- scatter plots
- linear discriminant analysis
import pandas as pd
import dabl

df = pd.read_csv("titanic.csv")
dabl.plot(df, target_col="Survived")
8. Speedml
SpeedML is a Python package for quickly launching machine learning pipelines. It integrates several commonly used ML packages, including Pandas, NumPy, Sklearn, XGBoost, and Matplotlib, so it offers much more than automated EDA.
According to the SpeedML project, it enables iterative development and reduces coding time by 70%.
from speedml import Speedml

sml = Speedml('../input/train.csv', '../input/test.csv',
              target='Survived', uid='PassengerId')
sml.train.head()
sml.plot.correlate()
sml.plot.distribute()
sml.plot.ordinal('Parch')
sml.plot.ordinal('SibSp')
sml.plot.continuous('Age')
9. DataTile
DataTile (formerly Pandas-Summary) is an open source Python package for managing, summarizing, and visualizing data. It is essentially an extension of the pandas DataFrame describe() function.
import pandas as pd
from datatile.summary.df import DataFrameSummary

df = pd.read_csv('titanic.csv')
dfs = DataFrameSummary(df)
dfs.summary()
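As a rough illustration of what "extending describe()" can mean, the sketch below builds a per-column overview with pandas alone. It is not DataTile's implementation or output format; the helper `column_overview` and the sample data are hypothetical and only show the kind of extra columns such a summary might add.

```python
import pandas as pd


def column_overview(df: pd.DataFrame) -> pd.DataFrame:
    """Per-column dtype, non-null count, missing count and share, and unique count."""
    return pd.DataFrame(
        {
            "dtype": df.dtypes.astype(str),
            "count": df.count(),
            "missing": df.isna().sum(),
            "missing_pct": (df.isna().mean() * 100).round(1),
            "unique": df.nunique(),
        }
    )


# Made-up sample data (illustrative values only).
df = pd.DataFrame(
    {
        "age": [22.0, None, 35.0, 35.0],
        "embarked": ["S", "C", None, "S"],
    }
)

# The built-in summary, for comparison.
print(df.describe(include="all"))

# The extended per-column overview.
overview = column_overview(df)
print(overview)
```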
10. edaviz
edaviz is a Python library for data exploration and visualization in Jupyter Notebook and JupyterLab. It was very easy to use, but it was later integrated into bamboolib, which was in turn acquired by Databricks, so it gets only a brief mention here.
Summary
In this article, we introduce 10 automated exploratory data analysis Python packages that can generate data summaries and visualizations in a few lines of Python code. Automating work can save us a lot of time.
DataPrep is the EDA package I use most often. AutoViz and D-Tale are also good choices. If you need customized analysis, you can use klib. SpeedML integrates many things, so using it for EDA alone is not a particularly good fit. For the rest, you can choose based on personal preference, as they are all useful. Finally, edaviz is best left out of consideration because it is no longer open source.