Basic usage
This section introduces the basic usage of Pandas data structures. The following code creates the sample data objects used in the previous section:
```python
In [1]: index = pd.date_range('1/1/2000', periods=8)

In [2]: s = pd.Series(np.random.randn(5), index=['a', 'b', 'c', 'd', 'e'])

In [3]: df = pd.DataFrame(np.random.randn(8, 3), index=index,
   ...:                   columns=['A', 'B', 'C'])
```

The same idea in script form, using np.random.rand:

```python
import string

n = 10
index = pd.date_range('20000101', periods=n)
print(index)

# np.random.rand eases porting code from MATLAB; it wraps random_sample and
# takes the output shape as arguments, consistent with other NumPy functions
# such as numpy.zeros and numpy.ones. It creates an array of the given shape
# filled with random samples uniformly distributed on [0, 1).
s = pd.Series(np.random.rand(8), index=list(string.ascii_letters[:8]))
print(s)

df = pd.DataFrame(np.random.rand(n, 8), index=index,
                  columns=list(string.ascii_letters[:8]))
print(df)
```

```
DatetimeIndex(['2000-01-01', '2000-01-02', '2000-01-03', '2000-01-04',
               '2000-01-05', '2000-01-06', '2000-01-07', '2000-01-08',
               '2000-01-09', '2000-01-10'],
              dtype='datetime64[ns]', freq='D')
a    0.516861
b    0.220140
c    0.667483
d    0.550804
e    0.631757
f    0.494191
g    0.908128
h    0.544240
dtype: float64
                   a         b         c         d         e         f  \
2000-01-01  0.951364  0.165370  0.630935  0.898385  0.040457  0.846490
2000-01-02  0.636938  0.081687  0.899861  0.887537  0.427308  0.678841
2000-01-03  0.954679  0.590853  0.779015  0.758092  0.318983  0.797294
2000-01-04  0.000130  0.314111  0.346645  0.382747  0.644762  0.215349
2000-01-05  0.150452  0.720875  0.363832  0.954863  0.298944  0.880833
2000-01-06  0.006456  0.399737  0.081659  0.310611  0.495450  0.466368
2000-01-07  0.386115  0.415283  0.238122  0.993131  0.114366  0.098060
2000-01-08  0.325408  0.339976  0.122992  0.035576  0.910130  0.398590
2000-01-09  0.780130  0.928054  0.853599  0.879124  0.095143  0.117855
2000-01-10  0.764207  0.449446  0.367501  0.709571  0.872381  0.210814

                   g         h
2000-01-01  0.893280  0.336661
2000-01-02  0.182542  0.554913
2000-01-03  0.470463  0.322261
2000-01-04  0.513242  0.928213
2000-01-05  0.772144  0.953689
2000-01-06  0.404961  0.624106
2000-01-07  0.414454  0.517201
2000-01-08  0.198384  0.800897
2000-01-09  0.132946  0.581366
2000-01-10  0.578434  0.481795
```
Head and Tail
head() and tail() are used to quickly preview a Series or DataFrame. By default they display 5 rows, but the number of rows to show can also be specified.
```python
In [4]: long_series = pd.Series(np.random.randn(1000))

In [5]: long_series.head()
Out[5]:
0   -1.157892
1   -1.344312
2    0.844885
3    1.075770
4   -0.109050
dtype: float64

In [6]: long_series.tail(3)
Out[6]:
997   -0.289388
998   -1.020544
999    0.589993
dtype: float64
```
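head() also accepts a negative n, returning everything except the last |n| rows. A small sketch with a deterministic, made-up Series:

```python
import pandas as pd

s = pd.Series(range(10))

first_two = s.head(2)        # first 2 rows
last_two = s.tail(2)         # last 2 rows
all_but_last_3 = s.head(-3)  # negative n: all rows except the last 3

print(first_two.tolist())       # [0, 1]
print(last_two.tolist())        # [8, 9]
print(all_but_last_3.tolist())  # [0, 1, 2, 3, 4, 5, 6]
```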
Attributes and underlying data
Pandas exposes metadata through a number of attributes:

- shape: the axis dimensions of the object, consistent with ndarray
- Axis labels:
  - Series: index (the only axis)
  - DataFrame: index (rows) and columns
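A quick sketch of these attributes; the frame and labels here are made up for illustration:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(np.zeros((3, 2)), columns=['A', 'B'])

print(df.shape)          # (3, 2) -- rows x columns, same convention as ndarray
print(list(df.index))    # [0, 1, 2]
print(list(df.columns))  # ['A', 'B']

s = pd.Series([1, 2, 3], index=['x', 'y', 'z'])
print(s.shape)           # (3,) -- a Series has only one axis
print(list(s.index))     # ['x', 'y', 'z']
```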
Note: these attributes can be safely assigned to!
```python
In [7]: df[:2]
Out[7]:
                   A         B         C
2000-01-01 -0.173215  0.119209 -1.044236
2000-01-02 -0.861849 -2.104569 -0.494929

In [8]: df.columns = [x.lower() for x in df.columns]

In [9]: df
Out[9]:
                   a         b         c
2000-01-01 -0.173215  0.119209 -1.044236
2000-01-02 -0.861849 -2.104569 -0.494929
2000-01-03  1.071804  0.721555 -0.706771
2000-01-04 -1.039575  0.271860 -0.424972
2000-01-05  0.567020  0.276232 -1.087401
2000-01-06 -0.673690  0.113648 -1.478427
2000-01-07  0.524988  0.404705  0.577046
2000-01-08 -1.715002 -1.039268 -0.370647
```
Pandas objects (Index, Series, DataFrame) can be thought of as containers for arrays, which hold the data and perform the actual computation. For most types, the underlying array is a numpy.ndarray. However, Pandas and third-party libraries may extend the NumPy type system with custom array types (see Data types).
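One such extension type that ships with pandas is the nullable integer dtype; a minimal sketch:

```python
import pandas as pd

# "Int64" (capital I) is pandas' nullable integer extension dtype: unlike
# plain int64, it can hold missing values without upcasting to float64.
s = pd.Series([1, 2, None], dtype="Int64")

print(s.dtype)            # Int64
print(s.isna().tolist())  # [False, False, True]

# The backing store is an ExtensionArray, not a numpy.ndarray:
print(isinstance(s.array, pd.api.extensions.ExtensionArray))  # True
```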
The .array property extracts the data inside an Index or Series.
```python
In [10]: s.array
Out[10]:
<PandasArray>
[ 0.4691122999071863, -0.2828633443286633, -1.5090585031735124,
 -1.1356323710171934,  1.2121120250208506]
Length: 5, dtype: float64

In [11]: s.index.array
Out[11]:
<PandasArray>
['a', 'b', 'c', 'd', 'e']
Length: 5, dtype: object
```
.array generally returns an ExtensionArray. What an ExtensionArray is, and why Pandas uses them, is beyond the scope of this section; see Data types for more information.
To extract a NumPy array, use to_numpy() or numpy.asarray().
```python
In [12]: s.to_numpy()
Out[12]: array([ 0.4691, -0.2829, -1.5091, -1.1356,  1.2121])

In [13]: np.asarray(s)
Out[13]: array([ 0.4691, -0.2829, -1.5091, -1.1356,  1.2121])
```
When the Series or Index is backed by an ExtensionArray, to_numpy() may copy the data and coerce the values. See Data types for details.
to_numpy() gives some control over the dtype of the resulting numpy.ndarray. Take timezone-aware datetimes as an example: NumPy has no dtype that represents a datetime with time zone information, while Pandas offers two representations:
- A numpy.ndarray of Timestamp objects, which carry the correct tz information.
- A datetime64[ns] numpy.ndarray, where the values have been converted to UTC and the time zone discarded.
Time zone information can be preserved with dtype=object:
```python
In [14]: ser = pd.Series(pd.date_range('2000', periods=2, tz="CET"))

In [15]: ser.to_numpy(dtype=object)
Out[15]:
array([Timestamp('2000-01-01 00:00:00+0100', tz='CET', freq='D'),
       Timestamp('2000-01-02 00:00:00+0100', tz='CET', freq='D')],
      dtype=object)
```
Or dropped with dtype='datetime64[ns]':
```python
In [16]: ser.to_numpy(dtype="datetime64[ns]")
Out[16]:
array(['1999-12-31T23:00:00.000000000', '2000-01-01T23:00:00.000000000'],
      dtype='datetime64[ns]')
```
Extracting the raw data from a DataFrame is slightly more involved. When all of the DataFrame's columns share a single data type, DataFrame.to_numpy() returns the underlying data:
```python
In [17]: df.to_numpy()
Out[17]:
array([[-0.1732,  0.1192, -1.0442],
       [-0.8618, -2.1046, -0.4949],
       [ 1.0718,  0.7216, -0.7068],
       [-1.0396,  0.2719, -0.425 ],
       [ 0.567 ,  0.2762, -1.0874],
       [-0.6737,  0.1136, -1.4784],
       [ 0.525 ,  0.4047,  0.577 ],
       [-1.715 , -1.0393, -0.3706]])
```
When the DataFrame holds homogeneous data, Pandas can modify the underlying ndarray in place, and those modifications are reflected directly in the data structure. With heterogeneous data, i.e. when the DataFrame's columns have different dtypes, this is no longer the case. Unlike the axis labels, the values attribute itself cannot be assigned to.
Notice
With heterogeneous data, the dtype of the resulting ndarray is chosen to accommodate all of the data involved. If the DataFrame contains strings, the result's dtype is object; if there are only floats and integers, the result is floating point.
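The dtype-selection rules above can be checked directly; the small frames here are made up for illustration:

```python
import numpy as np
import pandas as pd

floats_only = pd.DataFrame({'a': [1.0, 2.0], 'b': [3.0, 4.0]})
ints_and_floats = pd.DataFrame({'a': [1, 2], 'b': [3.0, 4.0]})
with_strings = pd.DataFrame({'a': [1, 2], 'b': ['x', 'y']})

print(floats_only.to_numpy().dtype)      # float64
print(ints_and_floats.to_numpy().dtype)  # float64 -- ints are upcast to float
print(with_strings.to_numpy().dtype)     # object  -- strings force object dtype
```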
Previously, Pandas recommended Series.values or DataFrame.values for extracting data from a Series or DataFrame. You will still find this in older code bases and online tutorials, but Pandas has since improved on it: the current recommendation is to use .array or .to_numpy() instead, and to stop using .values. .values has the following drawbacks:
- When a Series contains an extension type, it is unclear whether Series.values returns a NumPy array or an ExtensionArray. Series.array always returns the ExtensionArray and never copies data, while Series.to_numpy() always returns a NumPy array, at the cost of copying and coercing the values.
- When a DataFrame contains a mix of data types, DataFrame.values may copy the data and coerce all values to a common dtype, a relatively expensive operation. DataFrame.to_numpy() always returns a NumPy array, which is clearer: it does not treat the DataFrame's data as if it were of a single type.
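The ambiguity of .values is easy to see with a categorical Series; a sketch using made-up data:

```python
import numpy as np
import pandas as pd

s = pd.Series(pd.Categorical(['a', 'b', 'a']))

# .values is ambiguous: for a categorical Series it returns the Categorical
# itself, not a numpy.ndarray
print(type(s.values).__name__)        # Categorical

# .array always returns the ExtensionArray, without copying
print(type(s.array).__name__)         # Categorical

# .to_numpy() always returns an ndarray, converting (and copying) as needed
arr = s.to_numpy()
print(type(arr).__name__, arr.dtype)  # ndarray object
```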
Accelerate operations
With the numexpr and bottleneck support libraries, Pandas can accelerate certain kinds of binary numeric and boolean operations.
Both libraries are particularly useful when working with large data sets, and the speedups can be substantial. numexpr uses smart chunking, caching, and multiple cores. bottleneck is a set of specialized Cython routines that are especially fast on arrays containing NaN values.
Consider the following example (a DataFrame of 100 columns x 100,000 rows):
| Operation | Version 0.11.0 (ms) | Prior version (ms) | Ratio to prior |
|---|---|---|---|
| df1 > df2 | 13.32 | 125.35 | 0.1063 |
| df1 * df2 | 21.71 | 36.63 | 0.5928 |
| df1 + df2 | 22.04 | 36.50 | 0.6039 |
Installing both libraries is strongly recommended. For more information, see Recommended support libraries.
Both support libraries are enabled by default and can be set with the following options:
New in version 0.20.0.
```python
pd.set_option('compute.use_bottleneck', False)
pd.set_option('compute.use_numexpr', False)
```
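These options only affect speed, not results. A quick sketch, with made-up random frames, showing that toggling the accelerated code path leaves the numbers unchanged:

```python
import numpy as np
import pandas as pd

df1 = pd.DataFrame(np.random.randn(1000, 10))
df2 = pd.DataFrame(np.random.randn(1000, 10))

# result with whatever acceleration is currently enabled/installed
accelerated = df1 + df2

# force the plain NumPy code path
pd.set_option('compute.use_numexpr', False)
pd.set_option('compute.use_bottleneck', False)
plain = df1 + df2

# the numbers are identical either way; only the speed differs
print(accelerated.equals(plain))  # True

# restore the defaults
pd.set_option('compute.use_numexpr', True)
pd.set_option('compute.use_bottleneck', True)
```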
Binary operations
When performing binary operations between Pandas data structures, you should pay attention to the following two key points:
- Broadcast mechanism between multi-dimensional (DataFrame) and low-dimensional (Series) objects;
- Handling missing values in calculations.
These two problems can be dealt with at the same time, but let’s first introduce how to deal with them separately.
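As a first taste of the second point: unlike the bare operators, the arithmetic methods accept a fill_value argument, which substitutes a value when exactly one of the two inputs is missing at a location. The data here is made up for illustration:

```python
import numpy as np
import pandas as pd

a = pd.Series([1.0, np.nan, 3.0], index=['x', 'y', 'z'])
b = pd.Series([10.0, 20.0, np.nan], index=['x', 'y', 'z'])

# the plain operator propagates NaN wherever either operand is missing
print((a + b).tolist())                 # [11.0, nan, nan]

# add() with fill_value treats a one-sided NaN as 0 before adding;
# locations missing on both sides would remain NaN
print(a.add(b, fill_value=0).tolist())  # [11.0, 20.0, 3.0]
```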
Matching/Broadcasting Mechanism
DataFrame provides the methods add(), sub(), mul(), div(), and the reflected variants radd(), rsub(), etc., for binary operations. For broadcasting behavior, the case of Series input is of primary interest. With these functions, the axis keyword controls whether to match on the index or the columns:
```python
In [18]: df = pd.DataFrame({
   ....:     'one': pd.Series(np.random.randn(3), index=['a', 'b', 'c']),
   ....:     'two': pd.Series(np.random.randn(4), index=['a', 'b', 'c', 'd']),
   ....:     'three': pd.Series(np.random.randn(3), index=['b', 'c', 'd'])})
   ....:

In [19]: df
Out[19]:
        one       two     three
a  1.394981  1.772517       NaN
b  0.343054  1.912123 -0.050390
c  0.695246  1.478369  1.227435
d       NaN  0.279344 -0.613172

In [20]: row = df.iloc[1]

In [21]: column = df['two']

In [22]: df.sub(row, axis='columns')
Out[22]:
        one       two     three
a  1.051928 -0.139606       NaN
b  0.000000  0.000000  0.000000
c  0.352192 -0.433754  1.277825
d       NaN -1.632779 -0.562782

In [23]: df.sub(row, axis=1)
Out[23]:
        one       two     three
a  1.051928 -0.139606       NaN
b  0.000000  0.000000  0.000000
c  0.352192 -0.433754  1.277825
d       NaN -1.632779 -0.562782

In [24]: df.sub(column, axis='index')
Out[24]:
        one  two     three
a -0.377535  0.0       NaN
b -1.569069  0.0 -1.962513
c -0.783123  0.0 -0.250933
d       NaN  0.0 -0.892516

In [25]: df.sub(column, axis=0)
Out[25]:
        one  two     three
a -0.377535  0.0       NaN
b -1.569069  0.0 -1.962513
c -0.783123  0.0 -0.250933
d       NaN  0.0 -0.892516
```
In pandas, iat and iloc are both positional data selectors. There are some subtle differences between them:

- iat selects a single scalar by integer position, while iloc selects a subset by integer position.
- iat accepts only integer positions, while iloc also accepts slice objects (as well as lists and boolean arrays) for slicing.
- Both are purely positional and 0-based, regardless of the labels on the index.
Example:
```python
import pandas as pd

df = pd.DataFrame([[1, 2], [3, 4]], columns=list('AB'), index=[0, 1])

# iat selects a single element
df.iat[0, 1]  # 2

# iloc selects a subset
df.iloc[0:1, 0:1]
#    A
# 0  1

# iloc supports slicing
df.iloc[:, 0:1]
#    A
# 0  1
# 1  3
```
To summarize:

- iat selects a single element by position
- iloc selects subsets by position and supports slicing
- both iat and iloc are 0-based positional indexers
The axis parameter in df.sub(column, axis='index') specifies the axis to align on. In Pandas, axis=0 or axis='index' matches the Series against the row index, so the operation is applied column by column; axis=1 or axis='columns' matches it against the columns, so the operation is applied row by row.
For example:
```python
import pandas as pd

df = pd.DataFrame([[1, 2], [3, 4]], columns=['A', 'B'])

# Match on the row index: subtract column A from every column
df.sub(df['A'], axis='index')
#    A  B
# 0  0  1
# 1  0  1

# Match on the columns: subtract row 0 from every row
df.sub(df.iloc[0], axis='columns')
#    A  B
# 0  0  0
# 1  2  2
```
So df.sub(column, axis='index') takes column as the baseline and computes the difference between each of df's columns and that column. The axis parameter appears throughout Pandas and flexibly specifies the direction of an operation.
You can also use a Series to align with one level of a DataFrame's MultiIndex.
```python
In [26]: dfmi = df.copy()

In [27]: dfmi.index = pd.MultiIndex.from_tuples([(1, 'a'), (1, 'b'),
   ....:                                         (1, 'c'), (2, 'a')],
   ....:                                        names=['first', 'second'])
   ....:

In [28]: dfmi.sub(column, axis=0, level='second')
Out[28]:
                   one       two     three
first second
1     a      -0.377535  0.000000       NaN
      b      -1.569069  0.000000 -1.962513
      c      -0.783123  0.000000 -0.250933
2     a            NaN -1.493173 -2.385688
```
Series and Index also support the built-in divmod() function, which performs floor division and the modulo operation in one step, returning a two-tuple of the same type as the left operand. For example:
```python
In [29]: s = pd.Series(np.arange(10))

In [30]: s
Out[30]:
0    0
1    1
2    2
3    3
4    4
5    5
6    6
7    7
8    8
9    9
dtype: int64

In [31]: div, rem = divmod(s, 3)

In [32]: div
Out[32]:
0    0
1    0
2    0
3    1
4    1
5    1
6    2
7    2
8    2
9    3
dtype: int64

In [33]: rem
Out[33]:
0    0
1    1
2    2
3    0
4    1
5    2
6    0
7    1
8    2
9    0
dtype: int64

In [34]: idx = pd.Index(np.arange(10))

In [35]: idx
Out[35]: Int64Index([0, 1, 2, 3, 4, 5, 6, 7, 8, 9], dtype='int64')

In [36]: div, rem = divmod(idx, 3)

In [37]: div
Out[37]: Int64Index([0, 0, 0, 1, 1, 1, 2, 2, 2, 3], dtype='int64')

In [38]: rem
Out[38]: Int64Index([0, 1, 2, 0, 1, 2, 0, 1, 2, 0], dtype='int64')
```