Pandas 2.0 main advantages and code implementation

Foreword

The official release of pandas 2.0.0 caused quite a stir in the data science community.

Thanks to its wide functionality and versatility, data manipulation is almost impossible without import pandas as pd, right?

Now, hear me out: with all the large language model buzz going on over the past few months, I somehow missed the fact that pandas had just gone through a major release! Yes, pandas 2.0 arrived with guns blazing (see "What's new in 2.0.0 (April 3, 2023)" in the pandas documentation at pydata.org)!

While I wasn't aware of all the hype, the Data-Centric AI Community was quick to lend a helping hand.

Fun fact: did you realize this release was an amazing 3 years in the making? Now that's what I call a "commitment to the community"!

So what does pandas 2.0 bring? Let’s take a closer look right away!

1. Performance, speed and memory efficiency

As we know, pandas is built on top of numpy, which was never intentionally designed as a backend for dataframe libraries. For this reason, one of the main limitations of pandas has been the in-memory processing of larger datasets.

In this version, the big change comes from the introduction of the Apache Arrow backend for pandas data.

Essentially, Arrow is a standardized in-memory columnar data format with available libraries for multiple programming languages (C, C++, R, Python, etc.). For Python, there’s PyArrow, which is based on the C++ implementation of Arrow, so it’s fast!

So, long story short, PyArrow addresses the memory limitations of the 1.X pandas versions, allowing us to perform faster, more memory-efficient data operations, especially on larger datasets.

Here is a comparison between reading the data with and without the pyarrow backend, using the Hacker News dataset (approximately 650 MB, CC BY-NC-SA 4.0 license):

%timeit df = pd.read_csv("data/hn.csv")
# 12 s ± 304 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
%timeit df_arrow = pd.read_csv("data/hn.csv", engine='pyarrow', dtype_backend='pyarrow')
# 329 ms ± 65 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

Compare read_csv(): using the pyarrow backend is more than 35 times faster.

As you can see, the new backend makes reading the data about 35 times faster. Other aspects worth pointing out:

  • Without the pyarrow backend, each column/feature is stored with its own native data type: numeric features are stored as int64 or float64, while string values are stored as objects;

  • Using pyarrow, all columns use Arrow dtypes: note the [pyarrow] annotation and the different data types: int64, string, timestamp, and double:

df = pd.read_csv("data/hn.csv")
df.info()


# <class 'pandas.core.frame.DataFrame'>
# RangeIndex: 3885799 entries, 0 to 3885798
# Data columns (total 8 columns):
#  #   Column              Dtype
# ---  ------              -----
#  0   Object ID           int64
#  1   Title               object
#  2   Post Type           object
#  3   Author              object
#  4   Created At          object
#  5   URL                 object
#  6   Points              int64
#  7   Number of Comments  float64
# dtypes: float64(1), int64(2), object(5)
# memory usage: 237.2+ MB


df_arrow = pd.read_csv("data/hn.csv", dtype_backend='pyarrow', engine='pyarrow')
df_arrow.info()


# <class 'pandas.core.frame.DataFrame'>
# RangeIndex: 3885799 entries, 0 to 3885798
# Data columns (total 8 columns):
#  #   Column              Dtype
# ---  ------              -----
#  0   Object ID           int64[pyarrow]
#  1   Title               string[pyarrow]
#  2   Post Type           string[pyarrow]
#  3   Author              string[pyarrow]
#  4   Created At          timestamp[s][pyarrow]
#  5   URL                 string[pyarrow]
#  6   Points              int64[pyarrow]
#  7   Number of Comments  double[pyarrow]
# dtypes: double[pyarrow](1), int64[pyarrow](2), string[pyarrow](4), timestamp[s][pyarrow](1)
# memory usage: 660.2 MB

df.info(): inspecting the dtypes of each dataframe.

2. Arrow data types and numpy indexes

In addition to reading data (which is the simplest case), you can also expect other improvements for a range of other operations, especially those involving string manipulation, since pyarrow’s implementation of string data types is very efficient:

%timeit df["Author"].str.startswith('phy')
# 851 ms ± 7.89 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
          

%timeit df_arrow["Author"].str.startswith('phy')
# 27.9 ms ± 538 μs per loop (mean ± std. dev. of 7 runs, 10 loops each)

Comparing string operations: the efficiency of the Arrow implementation.

In fact, Arrow has more (and better supported) data types than numpy that are necessary outside the scientific (numeric) scope: dates and times, durations, binary, decimals, lists, and maps. It might actually be a good exercise to browse the equivalences between pyarrow supported data types and numpy data types so that you learn how to take advantage of them.
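As a quick illustration (not from the original benchmark, with made-up values), here is a minimal sketch of how Arrow-only types such as lists and decimals can be used in pandas 2.0 through pd.ArrowDtype:

import pandas as pd
import pyarrow as pa
from decimal import Decimal

# Minimal sketch: Arrow-backed dtypes with no direct numpy equivalent.
# A list-valued column:
s_list = pd.Series([[1, 2], [3]], dtype=pd.ArrowDtype(pa.list_(pa.int64())))

# An exact decimal column (precision=5, scale=2):
s_dec = pd.Series([Decimal("2.50"), Decimal("7.25")],
                  dtype=pd.ArrowDtype(pa.decimal128(5, 2)))

print(s_list.dtype)  # e.g. list<item: int64>[pyarrow]
print(s_dec.dtype)   # e.g. decimal128(5, 2)[pyarrow]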

It is also now possible to hold more numpy numeric types in indexes. The traditional int64, uint64, and float64 make room for all numpy numeric dtypes as Index values, so we can, for example, specify their 32-bit versions:

import numpy as np

pd.Index([1, 2, 3])
# Index([1, 2, 3], dtype='int64')

pd.Index([1, 2, 3], dtype=np.int32)
# Index([1, 2, 3], dtype='int32')

Make your code more memory efficient by taking advantage of 32-bit numpy indexes.

3. Easier to deal with missing values

Being built on numpy makes it difficult for pandas to handle missing values in an easy, flexible way, since numpy does not support null values for some data types.

For example, integers are automatically converted to floats, which is not ideal:

df = pd.read_csv("data/hn.csv")

points = df["Points"]
points.isna().sum()
# 0

points[0:5]
# 0    61
# 1    16
# 2     7
# 3     5
# 4     7
# Name: Points, dtype: int64

# Setting the first position to None
points.iloc[0] = None

points[0:5]
# 0     NaN
# 1    16.0
# 2     7.0
# 3     5.0
# 4     7.0
# Name: Points, dtype: float64

Missing values: converted to floating point numbers.

Notice how Points automatically changes from int64 to float64 after the single None value is introduced.

There's nothing worse for a data flow than wrong typing, especially within a data-centric AI paradigm.

Wrong types directly affect data preparation decisions, cause incompatibilities between different chunks of data, and, even when they pass silently, can compromise operations that then output nonsensical results.

For example, at the Data-Centric AI Community (discord.com) we are working on a project around synthetic data for data privacy (GitHub – Data-Centric-AI-Community/nist-crc-2023: NIST Collaborative Research Cycle on Synthetic Data. Learn about Synthetic Data week by week!). One of the features, NOC (number of children), has missing values and is therefore automatically converted to float when the data is loaded. When the data is then passed into a generative model as floats, we may get decimal output values such as 2.5, and unless you are a mathematician with 2 kids, a newborn, and a weird sense of humor, 2.5 children is not acceptable.

In pandas 2.0, we can take advantage of dtype_backend='numpy_nullable', where missing values are accounted for without any dtype change, so we keep the original data type (Int64 in this case):

df_null = pd.read_csv("data/hn.csv", dtype_backend='numpy_nullable')

points_null = df_null["Points"]
points_null.isna().sum()
# 0

points_null[0:5]
# 0    61
# 1    16
# 2     7
# 3     5
# 4     7
# Name: Points, dtype: Int64

points_null.iloc[0] = None

points_null[0:5]
# 0    <NA>
# 1      16
# 2       7
# 3       5
# 4       7
# Name: Points, dtype: Int64

Leveraging the 'numpy_nullable' dtype backend, pandas 2.0 handles missing values without changing the original data type.

This may seem like a subtle change, but under the hood it means that pandas can now natively leverage Arrow's way of dealing with missing values. This makes operations more efficient, since pandas does not have to implement its own version of handling null values for each data type.
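As a quick, hypothetical sketch (values invented for illustration), the Arrow-backed dtypes behave the same way: a missing value does not force a dtype change.

import pandas as pd

# Minimal sketch: an Arrow-backed integer Series keeps its dtype when a null appears.
s = pd.Series([61, 16, None, 7], dtype="int64[pyarrow]")

print(s.dtype)         # int64[pyarrow]
print(s.isna().sum())  # 1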

4. Copy-on-write optimization

Pandas 2.0 also adds a new lazy copy mechanism that delays copying dataframe and series objects until they are modified.

This means that when copy-on-write is enabled, some methods will return a view instead of a copy, which improves memory efficiency by minimizing unnecessary data duplication.

This also means that you need to be careful when using chained assignments.

If copy-on-write mode is enabled, chained assignments will not work, because they point to a temporary object that is the result of an indexing operation (which behaves like a copy under copy-on-write).

With copy_on_write disabled, operations such as slicing may return views, so modifying the new dataframe can also change the original df:

pd.options.mode.copy_on_write = False # disable copy-on-write (default in pandas 2.0)
          

df = pd.read_csv("data/hn.csv")
df.head()
          

# Throws a 'SettingWithCopy' warning
# SettingWithCopyWarning:
# A value is trying to be set on a copy of a slice from a DataFrame
df["Points"][0] = 2000
          

df.head() # <---- df changes

Copy-on-write disabled: the original dataframe is changed by the chained assignment.

When copy_on_write is enabled, a copy is created on assignment (python – What rules does Pandas use to generate a view vs a copy? – Stack Overflow), so the original dataframe is never changed. Pandas 2.0 raises a ChainedAssignmentError in these situations to avoid silent errors:

pd.options.mode.copy_on_write = True
          

df = pd.read_csv("data/hn.csv")
df.head()
          

# Throws a ChainedAssignmentError
df["Points"][0] = 2000
          

# ChainedAssignmentError: A value is trying to be set on a copy of a DataFrame
# or Series through chained assignment. When using the Copy-on-Write mode,
# such chained assignment never works to update the original DataFrame
# or Series, because the intermediate object on which we are setting
# values always behaves as a copy.
# Try using '.loc[row_indexer, col_indexer] = value' instead,
# to perform the assignment in a single step.
          

df.head() # <---- df does not change

Copy-on-write enabled: the original dataframe is not changed by the chained assignment.
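As the error message itself suggests, the assignment should be performed in a single step with .loc. A minimal sketch, assuming the same dataset as above:

pd.options.mode.copy_on_write = True

df = pd.read_csv("data/hn.csv")

# Single-step assignment with .loc works as expected under copy-on-write
df.loc[0, "Points"] = 2000

df.head() # <---- df is updated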

5. Optional dependencies

When using pip, version 2.0 gives us the flexibility to install optional dependencies, which is a plus when it comes to customization and optimization of resources.

We can tailor the installation to our specific requirements without spending disk space on stuff we don’t really need.

Additionally, it saves a lot of “dependency headaches” and reduces the possibility of compatibility issues or conflicts with other packages that may be present in the development environment:

pip install "pandas[postgresql, aws, spss]>=2.0.0"

Installing optional dependencies.
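If you want to verify which optional dependencies actually ended up in your environment, a quick way (a small sketch, not tied to any particular extras) is pd.show_versions(), which lists the versions of pandas' installed optional dependencies and shows None for the missing ones:

import pandas as pd

# Prints the pandas version plus the versions of its optional dependencies
# (missing dependencies are reported as None).
pd.show_versions()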

Let’s try it out!

However, the question lingered: is the hype really justified?

I was curious to see whether pandas 2.0 provided significant improvements for some of the packages I use on a daily basis: ydata-profiling, matplotlib, seaborn, scikit-learn.

Of these, I decided to give ydata-profiling a try: it has just added support for pandas 2.0, which seems like a must-have for the community! In the new release, users can rest assured that their pipelines won't break if they're using pandas 2.0, and that's a major advantage! But beyond that?

To be honest, ydata-profiling has been one of my favorite tools for exploratory data analysis, and it also makes for a nice, quick benchmark: on my end it's only 1 line of code, but under the hood it is full of computations that, as a data scientist, I need: descriptive statistics, histogram plotting, correlation analysis, and so on.

So, what better way than to test the impact of the pyarrow engine on all of those computations at once, with minimal effort?

import pandas as pd
from ydata_profiling import ProfileReport
          

# Using pandas 1.5.3 and ydata-profiling 4.2.0
%timeit df = pd.read_csv("data/hn.csv")
# 10.1 s ± 215 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
          

%timeit profile = ProfileReport(df, title="Pandas Profiling Report")
# 4.85 ms ± 77.9 μs per loop (mean ± std. dev. of 7 runs, 100 loops each)
          

%timeit profile.to_file("report.html")
# 18.5 ms ± 2.02 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
          

# Using pandas 2.0.2 and ydata-profiling 4.3.1
%timeit df_arrow = pd.read_csv("data/hn.csv", engine='pyarrow')
# 3.27 s ± 38.1 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
          

%timeit profile_arrow = ProfileReport(df_arrow, title="Pandas Profiling Report")
# 5.24 ms ± 448 μs per loop (mean ± std. dev. of 7 runs, 100 loops each)
          

%timeit profile_arrow.to_file("report.html")
# 19 ms ± 1.87 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

Use ydata-profiling for benchmarking.

Again, it’s definitely better to use the pyarrow engine to read the data, although creating data profiles doesn’t change significantly in terms of speed.

However, the differences may lie in memory efficiency, for which we would have to carry out a different analysis. Additionally, we could further investigate the type of analysis being performed on the data: for some operations, the difference between versions 1.5.3 and 2.0 seems negligible.

But the main thing I noticed that could make a difference here is that ydata-profiling does not yet leverage the pyarrow data types. That update could have a significant impact on both speed and memory, and it is something I look forward to!

Conclusion: Performance, flexibility, interoperability!

This new pandas 2.0 version brings a lot of flexibility and performance optimizations, as well as subtle but key changes “under the hood”.

Maybe they are not “flashy” to someone new to the world of data operations, but to experienced data scientists who have jumped through hoops to overcome the limitations of previous versions, they are like water in the desert.

To summarize, these are the main benefits introduced in the new version:

  • Performance optimizations: with the introduction of the Apache Arrow backend, support for more numpy dtypes in indexes, and copy-on-write mode;

  • Increased flexibility and customization: Allows users to control optional dependencies and take advantage of Apache Arrow data types (including nullability from the start!);

  • Interoperability: Perhaps a less “heralded” benefit of the new version, but one that has a huge impact. Because Arrow is language independent, in-memory data can be transferred not only between programs built on Python, but also between R, Spark, and other programs using the Apache Arrow backend!
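To make the interoperability point concrete, here is a minimal sketch (not from the original article) of round-tripping data between pandas and a pyarrow Table, which any Arrow-aware tool can then consume:

import pandas as pd
import pyarrow as pa

# Minimal sketch: pandas <-> Arrow round trip with Arrow-backed dtypes.
df = pd.DataFrame({"Points": [61, 16, 7]}, dtype="int64[pyarrow]")

table = pa.Table.from_pandas(df)                        # pandas -> Arrow Table
df_back = table.to_pandas(types_mapper=pd.ArrowDtype)   # Arrow -> pandas, keeping Arrow dtypes

print(table.schema)    # Points: int64
print(df_back.dtypes)  # Points    int64[pyarrow]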

I hope this summary puts to rest some of your questions about pandas 2.0 and its suitability for our data manipulation tasks.
