Pandas basics: Head and Tail, attributes and underlying data, accelerated operations, binary operations, descriptive statistics, function application

Basic usage

This section introduces the basic usage of Pandas data structures. The following code creates the sample data objects used in the examples below:

In [1]: index = pd.date_range('1/1/2000', periods=8)

In [2]: s = pd.Series(np.random.randn(5), index=['a', 'b', 'c', 'd', 'e'])

In [3]: df = pd.DataFrame(np.random.randn(8, 3), index=index,
   ...: columns=['A', 'B', 'C'])
   ...:

import string

n = 10
index = pd.date_range('20000101', periods=n)
print(index)

# np.random.rand eases porting code from Matlab and wraps random_sample.
# It takes the output shape as separate arguments, consistent with other
# NumPy functions such as numpy.zeros and numpy.ones, and creates an array
# of that shape filled with samples uniformly distributed on [0, 1).
s = pd.Series(np.random.rand(8), index=list(string.ascii_letters[:8]))
print(s)

df = pd.DataFrame(np.random.rand(n, 8), index=index,
                  columns=list(string.ascii_letters[:8]))
print(df)


DatetimeIndex(['2000-01-01', '2000-01-02', '2000-01-03', '2000-01-04',
               '2000-01-05', '2000-01-06', '2000-01-07', '2000-01-08',
               '2000-01-09', '2000-01-10'],
              dtype='datetime64[ns]', freq='D')
a    0.516861
b    0.220140
c    0.667483
d    0.550804
e    0.631757
f    0.494191
g    0.908128
h    0.544240
dtype: float64
                   a         b         c         d         e         f  \
2000-01-01  0.951364  0.165370  0.630935  0.898385  0.040457  0.846490
2000-01-02  0.636938  0.081687  0.899861  0.887537  0.427308  0.678841
2000-01-03  0.954679  0.590853  0.779015  0.758092  0.318983  0.797294
2000-01-04  0.000130  0.314111  0.346645  0.382747  0.644762  0.215349
2000-01-05  0.150452  0.720875  0.363832  0.954863  0.298944  0.880833
2000-01-06  0.006456  0.399737  0.081659  0.310611  0.495450  0.466368
2000-01-07  0.386115  0.415283  0.238122  0.993131  0.114366  0.098060
2000-01-08  0.325408  0.339976  0.122992  0.035576  0.910130  0.398590
2000-01-09  0.780130  0.928054  0.853599  0.879124  0.095143  0.117855
2000-01-10  0.764207  0.449446  0.367501  0.709571  0.872381  0.210814

                   g         h
2000-01-01  0.893280  0.336661
2000-01-02  0.182542  0.554913
2000-01-03  0.470463  0.322261
2000-01-04  0.513242  0.928213
2000-01-05  0.772144  0.953689
2000-01-06  0.404961  0.624106
2000-01-07  0.414454  0.517201
2000-01-08  0.198384  0.800897
2000-01-09  0.132946  0.581366
2000-01-10  0.578434  0.481795

Head and Tail

head() and tail() let you quickly preview a Series or DataFrame. They show five rows by default, but you can pass the number of rows to display.

In [4]: long_series = pd.Series(np.random.randn(1000))

In [5]: long_series.head()
Out[5]:
0   -1.157892
1   -1.344312
2    0.844885
3    1.075770
4   -0.109050
dtype: float64

In [6]: long_series.tail(3)
Out[6]:
997   -0.289388
998   -1.020544
999    0.589993
dtype: float64

Attributes and underlying data

Pandas objects expose their metadata through a number of attributes:

  • shape:
    • the axis dimensions of the object, consistent with ndarray
  • Axis labels:
    • Series: Index (the only axis)
    • DataFrame: Index (rows) and columns
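A quick sketch of these attributes, using the s and df objects created above (re-created here so the snippet is self-contained):

```python
import numpy as np
import pandas as pd

index = pd.date_range('1/1/2000', periods=8)
s = pd.Series(np.random.randn(5), index=['a', 'b', 'c', 'd', 'e'])
df = pd.DataFrame(np.random.randn(8, 3), index=index, columns=['A', 'B', 'C'])

# shape follows the ndarray convention: (rows,) for Series, (rows, cols) for DataFrame
print(s.shape)     # (5,)
print(df.shape)    # (8, 3)

# axis labels
print(s.index)     # the Series' only axis
print(df.index)    # row labels
print(df.columns)  # column labels
```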

Note: these attributes can safely be assigned to!

In [7]: df[:2]
Out[7]:
                   A         B         C
2000-01-01 -0.173215  0.119209 -1.044236
2000-01-02 -0.861849 -2.104569 -0.494929

In [8]: df.columns = [x.lower() for x in df.columns]

In [9]: df
Out[9]:
                   a         b         c
2000-01-01 -0.173215  0.119209 -1.044236
2000-01-02 -0.861849 -2.104569 -0.494929
2000-01-03  1.071804  0.721555 -0.706771
2000-01-04 -1.039575  0.271860 -0.424972
2000-01-05  0.567020  0.276232 -1.087401
2000-01-06 -0.673690  0.113648 -1.478427
2000-01-07  0.524988  0.404705  0.577046
2000-01-08 -1.715002 -1.039268 -0.370647

Pandas objects (Index, Series, DataFrame) can be thought of as containers for arrays, which hold the data and perform the actual computation. For most types, the underlying array is a numpy.ndarray. However, Pandas and third-party libraries extend NumPy's type system in order to support custom arrays (see data types).

Use the .array attribute to extract the data inside an Index or Series.

In [10]: s.array
Out[10]:
<PandasArray>
[0.4691122999071863, -0.2828633443286633, -1.5090585031735124,
 -1.1356323710171934, 1.2121120250208506]
Length: 5, dtype: float64

In [11]: s.index.array
Out[11]:
<PandasArray>
['a', 'b', 'c', 'd', 'e']
Length: 5, dtype: object

array here generally means an ExtensionArray. Exactly what an ExtensionArray is, and why Pandas uses it, is beyond the scope of this section; see data types for more information.

To extract a NumPy array instead, use to_numpy() or numpy.asarray().

In [12]: s.to_numpy()
Out[12]: array([ 0.4691, -0.2829, -1.5091, -1.1356, 1.2121])

In [13]: np.asarray(s)
Out[13]: array([ 0.4691, -0.2829, -1.5091, -1.1356, 1.2121])

When the Series or Index is backed by an ExtensionArray, to_numpy() may involve copying the data and coercing the values. See data types for details.

to_numpy() gives some control over the dtype of the resulting numpy.ndarray. Take timezone-aware datetimes as an example: NumPy has no dtype that represents a timezone-aware datetime, while Pandas offers two representations:

  1. A numpy.ndarray of Timestamp objects, each carrying the correct tz information.
  2. A datetime64[ns] numpy.ndarray, in which the values have been converted to UTC and the timezone discarded.

The timezone information can be preserved with dtype=object:

In [14]: ser = pd.Series(pd.date_range('2000', periods=2, tz="CET"))

In [15]: ser.to_numpy(dtype=object)
Out[15]:
array([Timestamp('2000-01-01 00:00:00+0100', tz='CET', freq='D'),
       Timestamp('2000-01-02 00:00:00+0100', tz='CET', freq='D')],
      dtype=object)

Or it can be discarded with dtype='datetime64[ns]':

In [16]: ser.to_numpy(dtype="datetime64[ns]")
Out[16]:
array(['1999-12-31T23:00:00.000000000', '2000-01-01T23:00:00.000000000'],
      dtype='datetime64[ns]')

Extracting the raw data from a DataFrame is slightly more involved. When all of the DataFrame's columns share a single data type, DataFrame.to_numpy() returns the underlying data:

In [17]: df.to_numpy()
Out[17]:
array([[-0.1732,  0.1192, -1.0442],
       [-0.8618, -2.1046, -0.4949],
       [ 1.0718,  0.7216, -0.7068],
       [-1.0396,  0.2719, -0.425 ],
       [ 0.567 ,  0.2762, -1.0874],
       [-0.6737,  0.1136, -1.4784],
       [ 0.525 ,  0.4047,  0.577 ],
       [-1.715 , -1.0393, -0.3706]])

When the DataFrame holds homogeneous data, Pandas can modify the underlying ndarray in place, and the modifications are reflected in the data structure. With heterogeneous data, i.e. when the DataFrame's columns have different dtypes, this is no longer the case. Also, unlike the axis labels, the values attribute cannot be assigned to.

Notice

When working with heterogeneous data, the dtype of the resulting ndarray is chosen so that it can accommodate all of the data involved. If the DataFrame contains strings, the resulting dtype is object. If there are only floats and integers, the resulting dtype is float.
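The dtype selection can be seen directly; a minimal sketch (the column names are illustrative):

```python
import numpy as np
import pandas as pd

# Mixed int and float columns: the common dtype is float64
df_num = pd.DataFrame({'i': [1, 2], 'f': [0.5, 1.5]})
print(df_num.to_numpy().dtype)   # float64

# Adding a string column forces the common dtype up to object
df_obj = df_num.assign(s=['x', 'y'])
print(df_obj.to_numpy().dtype)   # object
```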

Previously, Pandas recommended Series.values or DataFrame.values for extracting data from a Series or DataFrame. You will still find references to these in old code bases and online tutorials, but Pandas has improved on them: use .array or .to_numpy() instead of .values. .values has the following drawbacks:

  1. When a Series contains an extension type, it is unclear whether Series.values returns a NumPy array or the ExtensionArray. Series.array always returns the ExtensionArray and never copies data, while Series.to_numpy() always returns a NumPy array, at the cost of copying and coercing the values.
  2. When a DataFrame contains mixed data types, DataFrame.values may involve copying the data and coercing the values to a common dtype, a relatively expensive operation. DataFrame.to_numpy() always returns a NumPy array, making it clearer that the result may not be a view on the data in the DataFrame.
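The contrast can be sketched with a Series backed by an extension type (a categorical here; the variable names are illustrative):

```python
import numpy as np
import pandas as pd

ser = pd.Series(pd.Categorical(['a', 'b', 'a']))

# .array hands back the ExtensionArray itself (the Categorical), without copying
print(type(ser.array).__name__)

# .to_numpy() always produces a NumPy ndarray, copying/coercing as needed
arr = ser.to_numpy()
print(type(arr).__name__, arr.dtype)
```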

Accelerated operations

With the numexpr and bottleneck libraries, Pandas can accelerate certain types of binary numerical and boolean operations.

These libraries are especially useful when working with large data sets, and the speedups can be substantial. numexpr uses smart chunking, caching, and multiple cores. bottleneck is a set of specialized Cython routines that are especially fast when dealing with arrays that contain NaN values.

Consider the following example (a DataFrame of 100 columns x 100,000 rows):

Operation    0.11.0 (ms)    Prior version (ms)    Ratio to prior
df1 > df2    13.32          125.35                0.1063
df1 * df2    21.71          36.63                 0.5928
df1 + df2    22.04          36.50                 0.6039

Installing both libraries is highly recommended. See Recommended dependencies for more information.

Both libraries are enabled by default; this can be changed with the following options:

New in version 0.20.0.

pd.set_option('compute.use_bottleneck', False)
pd.set_option('compute.use_numexpr', False)

Binary operations

When performing binary operations between Pandas data structures, you should pay attention to the following two key points:

  • Broadcast mechanism between multi-dimensional (DataFrame) and low-dimensional (Series) objects;
  • Handling missing values in calculations.

These two problems can be dealt with at the same time, but let’s first introduce how to deal with them separately.
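For the missing-data side, the arithmetic methods accept a fill_value argument; a minimal sketch (the frames here are illustrative):

```python
import numpy as np
import pandas as pd

df1 = pd.DataFrame({'A': [1.0, np.nan], 'B': [3.0, 4.0]})
df2 = pd.DataFrame({'A': [10.0, 20.0], 'B': [np.nan, 40.0]})

# Plain + propagates NaN wherever either operand is missing
print(df1 + df2)

# add() with fill_value=0 substitutes 0 when exactly one side is missing;
# positions missing on both sides would still come out as NaN
print(df1.add(df2, fill_value=0))
```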

Matching/Broadcasting Mechanism

DataFrame has the methods add(), sub(), mul(), div(), together with the reversed variants radd(), rsub(), etc., for performing binary operations. For broadcasting behavior, Series input is the main point of interest. With these functions, the axis keyword controls whether to match on the index or the columns.

In [18]: df = pd.DataFrame({
   ....: 'one': pd.Series(np.random.randn(3), index=['a', 'b', 'c']),
   ....: 'two': pd.Series(np.random.randn(4), index=['a', 'b', 'c', 'd']),
   ....: 'three': pd.Series(np.random.randn(3), index=['b', 'c', 'd'])})
   ....:

In [19]: df
Out[19]:
        one       two     three
a  1.394981  1.772517       NaN
b  0.343054  1.912123 -0.050390
c  0.695246  1.478369  1.227435
d       NaN  0.279344 -0.613172

In [20]: row = df.iloc[1]

In [21]: column = df['two']

In [22]: df.sub(row, axis='columns')
Out[22]:
        one       two     three
a  1.051928 -0.139606       NaN
b  0.000000  0.000000  0.000000
c  0.352192 -0.433754  1.277825
d       NaN -1.632779 -0.562782

In [23]: df.sub(row, axis=1)
Out[23]:
        one       two     three
a  1.051928 -0.139606       NaN
b  0.000000  0.000000  0.000000
c  0.352192 -0.433754  1.277825
d       NaN -1.632779 -0.562782

In [24]: df.sub(column, axis='index')
Out[24]:
        one  two     three
a -0.377535  0.0       NaN
b -1.569069  0.0 -1.962513
c -0.783123  0.0 -0.250933
d       NaN  0.0 -0.892516

In [25]: df.sub(column, axis=0)
Out[25]:
        one  two     three
a -0.377535  0.0       NaN
b -1.569069  0.0 -1.962513
c -0.783123  0.0 -0.250933
d       NaN  0.0 -0.892516

In pandas, iat and iloc are both methods for selecting data by position. There are some subtle differences between the two:

  1. iat selects a single element by integer position, while iloc selects a subset by integer position.

  2. iat only accepts integer positions, while iloc also accepts slice objects for slicing.

  3. Both use 0-based positions: position 0 is always the first row or column, regardless of the labels on the index.

Example:

import pandas as pd

df = pd.DataFrame([[1, 2], [3, 4]], columns=list('AB'), index=[0, 1])

# iat selects a single element
df.iat[0, 1]  # 2

# iloc selects a subset
df.iloc[0:1, 0:1]
#    A
# 0  1

# iloc supports slicing
df.iloc[:, 0:1]
#    A
# 0  1
# 1  3

To summarize:

  • iat selects a single element by position
  • iloc selects subsets by position and supports slicing
  • both iat and iloc use 0-based integer positions

The axis parameter in df.sub(column, axis='index') specifies the axis along which to align for the calculation.

In Pandas, axis=0 or axis='index' means the calculation runs along the rows, i.e. it operates on each column.

axis=1 or axis='columns' means the calculation runs along the columns, i.e. it operates on each row.

For example:

import pandas as pd

df = pd.DataFrame([[1, 2], [3, 4]], columns=['A', 'B'])

# Along the rows: taking column A as the base, compute the difference
# between every column and column A
df.sub(df['A'], axis='index')

# result:
   A  B
0  0  1
1  0  1

# Along the columns: taking row 0 as the base, compute the difference
# between every row and row 0
df.sub(df.iloc[0], axis='columns')

# result:
   A  B
0  0  0
1  2  2

So df.sub(column, axis='index') takes the Series column as the base and computes the difference between every column of df and it.

The axis parameter is very commonly used in Pandas and can flexibly specify the direction of calculation.
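The same convention carries over to reductions; a quick sketch:

```python
import pandas as pd

df = pd.DataFrame([[1, 2], [3, 4]], columns=['A', 'B'])

# axis=0 / axis='index': collapse along the rows -> one value per column
print(df.sum(axis=0))   # A -> 4, B -> 6

# axis=1 / axis='columns': collapse along the columns -> one value per row
print(df.sum(axis=1))   # row 0 -> 3, row 1 -> 7
```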

You can also use a Series to align with one level of a multi-level index DataFrame.

In [26]: dfmi = df.copy()

In [27]: dfmi.index = pd.MultiIndex.from_tuples([(1, 'a'), (1, 'b'),
   ....: (1, 'c'), (2, 'a')],
   ....: names=['first', 'second'])
   ....:

In [28]: dfmi.sub(column, axis=0, level='second')
Out[28]:
                   one       two     three
first second
1     a      -0.377535  0.000000       NaN
      b      -1.569069  0.000000 -1.962513
      c      -0.783123  0.000000 -0.250933
2     a            NaN -1.493173 -2.385688

Series and Index also support the built-in divmod() function, which performs floor division and modulo in one step, returning a two-tuple of the same type as the left operand. For example:

In [29]: s = pd.Series(np.arange(10))

In [30]: s
Out[30]:
0    0
1    1
2    2
3    3
4    4
5    5
6    6
7    7
8    8
9    9
dtype: int64

In [31]: div, rem = divmod(s, 3)

In [32]: div
Out[32]:
0    0
1    0
2    0
3    1
4    1
5    1
6    2
7    2
8    2
9    3
dtype: int64

In [33]: rem
Out[33]:
0    0
1    1
2    2
3    0
4    1
5    2
6    0
7    1
8    2
9    0
dtype: int64

In [34]: idx = pd.Index(np.arange(10))

In [35]: idx
Out[35]: Int64Index([0, 1, 2, 3, 4, 5, 6, 7, 8, 9], dtype='int64')

In [36]: div, rem = divmod(idx, 3)

In [37]: div
Out[37]: Int64Index([0, 0, 0, 1, 1, 1, 2, 2, 2, 3], dtype='int64')

In [38]: rem
Out[38]: Int64Index([0, 1, 2, 0, 1, 2, 0, 1, 2, 0], dtype='int64')
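divmod() can also take an array-like divisor and operate elementwise; a short sketch building on the Series above:

```python
import numpy as np
import pandas as pd

s = pd.Series(np.arange(10))

# The divisor may itself be array-like: floor division and modulo
# are then applied elementwise
div, rem = divmod(s, [2] * 5 + [3] * 5)
print(div.tolist())   # [0, 0, 1, 1, 2, 1, 2, 2, 2, 3]
print(rem.tolist())   # [0, 1, 0, 1, 0, 2, 0, 1, 2, 0]
```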