[Python] Pandas data analysis


Article directory

  • Rename column name rename
  • Deduplication function drop_duplicates()
  • Sort sort_values()
  • Merge merge()
  • Ranking rank()
  • Group groupby()
  • Transform transform()
  • String matching contains()
  • Regular expression matching match()
  • Capitalize()
  • Conditional judgment where()
  • Conditional filtering query()
  • Insert data insert()
  • Accumulate cumsum()
  • Random sampling sample()

Rename column name rename

When working with data in pandas, you sometimes need to rename a DataFrame's column names. This can be done with the rename function, whose usage is described below.

The basic syntax of the rename function is as follows:

DataFrame.rename(columns=None, inplace=False)

Parameter Description:

columns: A dictionary used to specify new column names (the keys of the dictionary are the original column names, and the values are the new column names), or a callable object (such as a function, lambda expression).

inplace: A Boolean value indicating whether to modify the DataFrame in place. The default is False, which means a renamed copy is created and returned. If set to True, the modification is made on the original DataFrame.

Code example:

df.rename(columns={'A': 'Column1'}, inplace=True)
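A runnable sketch with a small made-up DataFrame (the column names 'A' and 'B' are just placeholders), showing both the dictionary form and the callable form:

```python
import pandas as pd

# Hypothetical DataFrame with a column 'A' to rename
df = pd.DataFrame({'A': [1, 2], 'B': [3, 4]})

# Rename via a dictionary mapping old names to new names
df = df.rename(columns={'A': 'Column1'})

# A callable works too: uppercase every column name
df2 = df.rename(columns=str.upper)

print(list(df.columns))   # ['Column1', 'B']
print(list(df2.columns))  # ['COLUMN1', 'B']
```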

Deduplication function drop_duplicates()

The pandas DataFrame object provides the data-deduplication function drop_duplicates(). This section introduces the usage of this function in detail.
function format
The syntax format of the drop_duplicates() function is as follows:

df.drop_duplicates(subset=['A','B','C'],keep='first',inplace=True)

Parameter description is as follows:

subset: The column name(s) to consider when identifying duplicates; the default None means all columns are used.

keep: One of three options: first, last, and False. The default first keeps only the first occurrence of each duplicate and deletes the rest; last keeps only the last occurrence; False removes all duplicates.

inplace: Boolean parameter, default False, meaning a copy with duplicates removed is returned; if True, duplicates are deleted directly on the original data.

Code example:

df.drop_duplicates(subset=['email'],keep='first',inplace=True)
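A minimal runnable example with invented email data; keep='first' retains row id 1 and drops the later duplicate:

```python
import pandas as pd

# Hypothetical data with a duplicated email address
df = pd.DataFrame({
    'email': ['a@x.com', 'b@x.com', 'a@x.com'],
    'id': [1, 2, 3],
})

# keep='first' retains the first occurrence of each email
deduped = df.drop_duplicates(subset=['email'], keep='first')
print(deduped['id'].tolist())  # [1, 2]
```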

Sort sort_values()

The sort_values() function of the pandas library sorts a DataFrame by the data in one or more fields. It can sort by column data or by row data, using either a single field or several (it is most often used on a single column/row, but sort_values() can also handle multiple columns/rows). Series also has a sort_values() function, with slightly different parameters.

function format
The syntax format of the sort_values() function is as follows:

df.sort_values(by='column name or index value to sort by', axis=0, ascending=True, inplace=False, kind='quicksort', na_position='last', ignore_index=False, key=None)

Parameter description is as follows:

by specifies the column name(s) or index value(s) to sort by
axis if axis=0 or 'index', sort rows by the values of the specified column; if axis=1 or 'columns', sort columns by the values of the specified index. Default axis=0
ascending ascending=True sorts in ascending order; ascending=False sorts in descending order. The default is True. If a list of booleans is given, it must match the length of by
inplace whether the sorted data replaces the original data; the default is False, i.e. no replacement
ignore_index whether to reset the index; the default is not to reset

Code example:

# Single-column sort
# Sort by the 'hello' column in descending order
data = df.sort_values(by="hello", ascending=False, axis=0)
# axis=0 sorts rows by a column's values; likewise axis=1 sorts columns by a row's values
# Multi-column sort
# Sort by col1 descending; where col1 ties, sort by col3 ascending; reset the index
data = df.sort_values(by=['col1', 'col3'], ascending=[False, True], ignore_index=True)
# Note: with inplace=True the method returns None, so do not assign its result
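The multi-column case end to end on a small made-up frame (the names col1/col3 are placeholders):

```python
import pandas as pd

df = pd.DataFrame({'col1': [2, 1, 2], 'col3': [9, 5, 3]})

# col1 descending; ties in col1 broken by col3 ascending; index reset
out = df.sort_values(by=['col1', 'col3'], ascending=[False, True],
                     ignore_index=True)
print(out['col1'].tolist())  # [2, 2, 1]
print(out['col3'].tolist())  # [3, 9, 5]
```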

Merge merge()

The merge function is the preferred function in Pandas to perform basic data set merging. The function combines two datasets based on a given dataset index or column.

function format
The syntax format of merge() function is as follows:

pd.merge(left, right, how='inner', on=None, left_on=None, right_on=None,
         left_index=False, right_index=False, sort=False, suffixes=('_x', '_y'),
         copy=True, indicator=False, validate=None)
# Equivalently: left.merge(right, ...)

Parameter description is as follows:

left: The first DataFrame object to be merged.
right: The second DataFrame object to be merged.
how: merge method. Can be 'inner', 'outer', 'left' or 'right'.
on: Column name used for merging. If this parameter is specified, both left and right DataFrame objects must contain this column.
left_on: Column name in the left DataFrame object used for merging.
right_on: Column name in the right DataFrame object used for merging.
left_index: If True, use the index in the left DataFrame object as the join key.
right_index: If True, use the index in the right DataFrame object as the join key.
sort: Whether to sort the results. If True, the results are sorted by the join key.
suffixes: Suffixes for overlapping column names. If there are columns with the same name in both DataFrame objects, the specified suffix is added to the column names to distinguish them.
copy: Whether to copy data. If False, avoid copying data to improve performance.
indicator: Whether to add a column named '_merge' to indicate the source of each row. If True, the column is added.
validate: Verify that the connection key is unique. Can be 'one_to_one', 'one_to_many' or 'many_to_one'.

Code example:

df = employee.merge(department, left_on='departmentId', right_on='id', how='left')
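A runnable sketch of the left join above, with made-up employee and department tables; the left join keeps every employee even without a matching department:

```python
import pandas as pd

# Hypothetical employee/department tables
employee = pd.DataFrame({'name': ['Ann', 'Bob'], 'departmentId': [1, 3]})
department = pd.DataFrame({'id': [1, 2], 'dept': ['IT', 'HR']})

# Left join: Bob's departmentId has no match, so his dept becomes NaN
df = employee.merge(department, left_on='departmentId', right_on='id',
                    how='left')
print(df['dept'].tolist())  # ['IT', nan]
```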

Ranking rank()

The rank method computes the rank of each value along an axis, i.e. the position each value would occupy after sorting. Ranks start from 1.

Ties (equal values) are broken according to a chosen rule. By default (method='average'), rank breaks ties by assigning each tied group the average of its ranks.

function format
The syntax format of the rank() function is as follows:

Series.rank(axis=0,method='average',numeric_only=None,na_option='keep',ascending=True,pct=False)

Parameter description is as follows:

axis: {0 or 'index', 1 or 'columns'}, default 0, meaning ranks are computed along the index direction by default

method: {'average', 'min', 'max', 'first', 'dense'} specifies how ties are broken when ranking (values that are equal form one group)
method description
'average' default: each value in a tied group gets the group's average rank
'min' uses the minimum rank of the whole group
'max' uses the maximum rank of the whole group
'first' assigns ranks in the order the values appear in the original data
'dense' is like 'min', but the rank only increases by 1 between groups, i.e. a tied group occupies a single rank

ascending whether to rank in ascending order, default True

na_option determines how NaN values are handled
na_option description
'keep' keeps NaN values where they are (their rank stays NaN)
'top' assigns NaN values the smallest ranks
'bottom' assigns NaN values the largest ranks

pct whether to display the ranks as percentiles, default False

Code example:

# method='dense': tied scores share one rank, and the next rank increases by only 1
scores['rank'] = scores['score'].rank(method='dense', ascending=False)
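A self-contained version of the dense-ranking snippet, with invented scores; the two 4.0 values share rank 1 and the next score gets rank 2:

```python
import pandas as pd

scores = pd.DataFrame({'score': [4.0, 3.85, 4.0, 3.65]})

# 'dense': tied values share one rank; the next distinct value is +1
scores['rank'] = scores['score'].rank(method='dense', ascending=False)
print(scores['rank'].tolist())  # [1.0, 2.0, 1.0, 3.0]
```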

Group groupby()

During data analysis, it is often necessary to divide data into different groups. The groupby() function in pandas can perfectly complete various grouping operations.

Grouping is based on a certain field value of the DataFrame/Series, and the rows/columns with equal values of the field are divided into the same group. Each group is a new DataFrame or Series.

groupby() can also be grouped by multiple fields in the DataFrame. When the values of multiple fields are equal, they are divided into the same group.

groupby() is often used in conjunction with the batch processing function apply(), the aggregation function agg(), etc., to achieve multiple processing of data.

function format
The syntax format of the groupby() function is as follows:

df.groupby(by=None, axis=0, level=None, as_index=True, sort=True, group_keys=True, observed=False, dropna=True)

Parameter description is as follows:

by: Which field(s) to group by; the default is None. Pass a list to group by multiple fields. by can also be given as a positional argument.

axis: Whether to group rows or columns: 0 or 'index' groups rows by a column's values (the default), 1 or 'columns' groups columns by a row's values.

level: When the DataFrame has a MultiIndex, level selects the index level(s) to group by, either by position (0, 1, ...) or by name; pass a list for multiple levels. level cannot be used together with by: if both are given and by names an index level, level does not take effect; if by names a DataFrame column, an error is raised.

as_index: By default the grouping column's values become the index of the result: a single index when grouping by one column, a MultiIndex when grouping by several. Setting as_index=False resets the index (0, 1, ...).

sort: Results are sorted in ascending order of the grouping column's values. Setting sort=False skips the sort, which can improve performance.

dropna: By default, NaN values in the grouping column are not retained in the result. Set dropna=False to keep a NaN group.

The remaining parameters rarely need attention: group_keys is little used here, squeeze has been officially deprecated, and observed controls whether only observed combinations are shown for categorical groupers.

Code example:

 df.groupby('Department')
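A short runnable sketch (made-up Department/Salary data) combining groupby() with the aggregation function agg(); as_index=False keeps a flat integer index:

```python
import pandas as pd

df = pd.DataFrame({
    'Department': ['IT', 'IT', 'HR'],
    'Salary': [90, 70, 60],
})

# Named aggregation: one max_salary value per department
out = df.groupby('Department', as_index=False).agg(
    max_salary=('Salary', 'max'))
print(out.to_dict('records'))
# [{'Department': 'HR', 'max_salary': 60}, {'Department': 'IT', 'max_salary': 90}]
```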

Transform transform()

transform() is the transformation function in pandas: it applies the passed function to a DataFrame and returns a DataFrame of the same shape. It is used to transform the data in a DataFrame; this section introduces the transform() function in detail.

function format
The syntax format of the transform() function is as follows:

df.transform(func, axis=0, *args, **kwargs)

Parameter description is as follows:

func: The function used to transform the data. If it is a function, it must either work when passed a DataFrame or work when passed to DataFrame.apply().
func can be a function, a string function name, a list of functions, or a dictionary mapping row/column labels to function names.

axis: Set whether to convert by column or row. Set to 0 or index to apply the transformation function to each column, and set to 1 or columns to apply the transformation function to each row.

args: Positional arguments passed to function func.

kwargs: keyword arguments passed to function func

Code example:

max_salary = df.groupby('Department')['Salary'].transform('max')
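The same group-maximum idea as a runnable example with invented data; note how transform('max') broadcasts the group result back to every row, preserving the original shape:

```python
import pandas as pd

df = pd.DataFrame({
    'Department': ['IT', 'IT', 'HR'],
    'Salary': [90, 70, 60],
})

# transform aligns the group max with the original rows
df['max_salary'] = df.groupby('Department')['Salary'].transform('max')
print(df['max_salary'].tolist())  # [90, 90, 60]
```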

String matching contains()

The pandas str.contains() function is a string-matching function in the pandas library, used to find and filter strings in Series and DataFrame objects that satisfy a condition via regular-expression matching. Its detailed usage and examples are as follows.

patients['conditions'].str.contains(r'\bDIAB1') is a conditional expression used to filter records that meet the condition. Here patients['conditions'] retrieves the conditions column from the patients table, and the str.contains(pattern) method checks whether the conditions string in each record contains the specified pattern.

Here, r'\bDIAB1' is a regular expression pattern consisting of two parts:

\b is a word boundary, ensuring there is no word character in front of DIAB1, so strings like "XDIAB1" are not matched.
DIAB1 is the specific string to match, the code prefix.

function format
The syntax format of the contains() function is as follows:

Series.str.contains(pat, case=True, flags=0, na=None, regex=True)

Parameter description is as follows:

pat: String or regular expression, pattern used for search.

case: bool type, the default is True, indicating whether it is case sensitive.

flags: Flags used for re.compile() to compile regular expressions.

na: Representation used to replace missing values.

regex: bool type, the default is True, indicating whether the string is matched by a regular expression

Code example:

 df.loc[(patients['conditions'].str.contains(r'\bDIAB1'))]
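A self-contained demo of the \bDIAB1 filter with made-up patient records; the XDIAB1 row fails because X is a word character, so there is no boundary before DIAB1:

```python
import pandas as pd

# Hypothetical patients table
patients = pd.DataFrame({
    'conditions': ['DIAB100 MYOP', 'ACNE DIAB100', 'XDIAB100'],
})

# \b requires a word boundary before DIAB1
mask = patients['conditions'].str.contains(r'\bDIAB1')
print(mask.tolist())  # [True, True, False]
```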

Regular expression matching match()

Series.str can be used to access the values of a series as strings and apply several methods to them. The Pandas Series.str.match() function is used to determine whether each string in the underlying data of a given Series object matches a regular expression.

function format
The syntax format of the match() function is as follows:

Series.str.match(pat, case=True, flags=0, na=nan)

Parameter description is as follows:

pat : Regular expression pattern with capturing groups.

case : If True, case-sensitive
flags: A re module flag, such as re.IGNORECASE.
na: Default NaN, value to fill missing values

Return value: a Series/array of Boolean values
Common tokens in pat:
^: matches the beginning of the string. For example, ^hello matches strings starting with "hello".
$: matches the end of the string. For example, world$ matches strings ending in "world".
.: matches any character except a newline. For example, a.b matches "aab", "a@b", etc.
*: matches the preceding pattern zero or more times. For example, a*b matches "b", "ab", "aab", etc.
+: matches the preceding pattern one or more times. For example, a+b matches "ab", "aab", "aaab", etc.
?: matches the preceding pattern zero or one time. For example, a?b matches "b" and "ab".
[]: defines a character set. For example, [abc] matches any one of "a", "b", or "c".
[^]: a negated character set. For example, [^abc] matches any character except "a", "b", and "c".
\d: matches a digit. Equivalent to [0-9].
\w: matches a letter, digit, or underscore. Equivalent to [A-Za-z0-9_].
\s: matches whitespace characters, including spaces, tabs, newlines, etc.
\b: matches a word boundary. For example, \btest\b matches the single word "test".

Code example:

# Note: str.match() only anchors at the start, so $ is added to require the address to end in @leetcode.com
users = users[users['mail'].str.match(r'^[a-zA-Z][\w.-]*@leetcode\.com$')]
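A runnable sketch with invented mail addresses; the $ anchor is added here so the address must end exactly in @leetcode.com (match() alone only anchors at the start):

```python
import pandas as pd

users = pd.DataFrame({'mail': ['jack@leetcode.com',
                               '1jack@leetcode.com',
                               'jack@gmail.com']})

# Valid mails begin with a letter and end in @leetcode.com
pattern = r'^[A-Za-z][\w.-]*@leetcode\.com$'
valid = users[users['mail'].str.match(pattern)]
print(valid['mail'].tolist())  # ['jack@leetcode.com']
```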

Capitalize()

Series.str.capitalize() capitalizes the string elements of a pandas Series: the first character of each string becomes uppercase and the rest become lowercase.

function format
The syntax format of the capitalize() function is as follows:

 Series.str.capitalize()

Parameter description is as follows:

Parameter: None
Return: Series

Code example:

# Capitalize: first letter uppercase, the rest lowercase
users['name'] = users['name'].str.capitalize()
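A minimal runnable example with made-up names, showing that the rest of each string is lowercased:

```python
import pandas as pd

users = pd.DataFrame({'name': ['aLICE', 'BOB']})

# First character uppercase, remaining characters lowercase
users['name'] = users['name'].str.capitalize()
print(users['name'].tolist())  # ['Alice', 'Bob']
```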

Conditional judgment where()

The where function keeps values where a given condition holds and replaces values where it does not. If no replacement value is specified, the default replacement is NaN.

function format
The syntax format of the where() function is as follows:

df.where(cond, other=nan, inplace=False, axis=None, level=None, errors='raise', try_cast=False)
    Replace values where the condition is False.
Parameter description is as follows:

cond: boolean Series/DataFrame, array-like, or callable
    Where cond is True, the original value is kept; where it is False, the
    corresponding value from other is substituted. If cond is a callable,
    it is computed on the Series/DataFrame and should return a boolean
    Series/DataFrame or array. The callable must not change the input
    Series/DataFrame (though pandas does not check this). Callables are
    accepted since pandas 0.18.1.

other: scalar, Series/DataFrame, or callable
    Entries where cond is False are replaced with the corresponding value
    from other. If other is a callable, it is computed on the
    Series/DataFrame and should return a scalar or Series/DataFrame. The
    callable must not change the input Series/DataFrame (though pandas
    does not check this).

inplace: bool, default False
    Whether to operate on the data in place.

axis: int, default None
    Alignment axis, if needed.

level: int, default None
    Alignment level, if needed.

errors: str, {'raise', 'ignore'}, default 'raise'
    Note that currently this parameter does not affect the result, which
    is always coerced to a suitable dtype.

    - 'raise': allow exceptions to be raised.
    - 'ignore': suppress exceptions; on error, return the original object.

try_cast: bool, default False
    Try to cast the result back to the input type, if possible.

Code example:

# where(df['new_col'] > 0, 0) keeps values greater than 0 and replaces all other values with 0.
df['new_col'].where(df['new_col'] > 0, 0)
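A self-contained version of the snippet above (column name 'new_col' is illustrative):

```python
import pandas as pd

df = pd.DataFrame({'new_col': [5, -3, 0, 8]})

# Values where the condition is True are kept;
# values where it is False are replaced with 0
df['clipped'] = df['new_col'].where(df['new_col'] > 0, 0)
print(df['clipped'].tolist())  # [5, 0, 0, 8]
```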

Conditional filtering query()

The query function in the pandas library is a very useful tool that lets you filter data with a Boolean expression. Its main advantage is that the filtering is done in one line of code, without loops or other conditional statements. SQL is a language most of us know well, and the query() method lets us filter our data with something very similar to SQL.

The query() function can be used to filter data in a data frame, similar to the WHERE clause in SQL.

function format
The syntax format of the query() function is as follows:

DataFrame.query(expr, inplace=False, **kwargs)

Parameter description is as follows:

expr: a string representing a Boolean expression used to filter data.

inplace: a Boolean value, defaults to False. If True, operates on the original DataFrame and does not return a new DataFrame. If False, a new DataFrame is returned, leaving the original DataFrame unchanged.

**kwargs: Other optional parameters.

Code example:

df.query("population >= 25000000 or area >= 3000000")[["name", "population", "area"]]
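A runnable sketch with a made-up world table (names and figures are illustrative only):

```python
import pandas as pd

# Hypothetical world table
world = pd.DataFrame({
    'name': ['Afghanistan', 'Algeria', 'Andorra'],
    'population': [25500100, 37100000, 78115],
    'area': [652230, 2381741, 468],
})

# query() reads like a SQL WHERE clause
big = world.query("population >= 25000000 or area >= 3000000")
print(big['name'].tolist())  # ['Afghanistan', 'Algeria']
```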

Insert data insert()

When adding a column of data to the dataframe, it is added at the end by default. When we need to add it at any location, we can use the insert function.

To use this function, you only need to specify the insertion position, column name, and inserted object data.

function format
The syntax format of the insert() function is as follows:

Dataframe.insert(loc, column, value, allow_duplicates=False)
#Insert data into the specified column of the Dataframe.

Parameter description is as follows:

loc: int type, indicating which column; if data is inserted in the first column, then loc=0

column: Name the inserted column, such as column='new column'

value: a scalar, array, Series, etc. (try it yourself)

allow_duplicates: Whether to allow duplicate column names. Select True to allow new column names to duplicate existing column names.

Code example:

#Insert a column in the first column and name it 'haha'
df.insert(loc=0,column='haha',value=6)
#Insert a column in the first column and name it 'haha' (duplicate selection is allowed)
df.insert(loc=0,column='haha',value=6,allow_duplicates=True)
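A minimal runnable demonstration; the scalar 6 is broadcast to every row of the new column:

```python
import pandas as pd

df = pd.DataFrame({'A': [1, 2], 'B': [3, 4]})

# Insert a new first column named 'haha' filled with the scalar 6
df.insert(loc=0, column='haha', value=6)
print(list(df.columns))     # ['haha', 'A', 'B']
print(df['haha'].tolist())  # [6, 6]
```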

Accumulate cumsum()

Pandas provides an easy-to-use function for calculating cumulative sums: cumsum.

function format
The syntax format of cumsum() function is as follows:

DataFrame.cumsum(axis=None, skipna=True, *args, **kwargs)

Parameter description is as follows:

axis: index or axis name

skipna: exclude NA/null values

Code example:

# The example DataFrame contains yearly data for 3 groups. We may only care about yearly values, but sometimes we also need cumulative totals.
# A plain cumsum ignores the (A, B, C) groups, which is not meaningful when each group should accumulate separately. The simple, convenient fix is to combine groupby and cumsum.
df['cumsum_2'] = df.groupby('group')['value_2'].cumsum()
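A runnable comparison of a plain cumsum with a per-group cumsum, on made-up data (columns 'group' and 'value_2' follow the snippet above):

```python
import pandas as pd

df = pd.DataFrame({
    'group': ['A', 'A', 'B', 'B'],
    'value_2': [1, 2, 10, 20],
})

# Plain cumsum runs across all rows; groupby().cumsum() restarts per group
df['cumsum_all'] = df['value_2'].cumsum()
df['cumsum_2'] = df.groupby('group')['value_2'].cumsum()
print(df['cumsum_all'].tolist())  # [1, 3, 13, 33]
print(df['cumsum_2'].tolist())   # [1, 3, 10, 30]
```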

Random sampling sample()

Random sampling is a commonly used method in statistics, which can help us quickly build a set of data analysis models from a large amount of data. In Pandas, if you want to randomly sample a data set, you need to use the sample() function.

function format
The syntax format of the sample() function is as follows:

DataFrame.sample(n=None, frac=None, replace=False, weights=None, random_state=None, axis=None)

Parameter description is as follows:

n represents the number of rows to be extracted.
frac represents the proportion of extraction, such as frac=0.5, which represents 50% of the overall data extraction.
replace is a Boolean parameter, indicating whether to select with replacement sampling. The default is False, and the data will not be replaced after being taken out.
weights is an optional parameter giving a sampling weight for each row; it accepts a column name (string) or an array.
random_state is an optional parameter that seeds the random number generator. The default None produces a different sample on each call; passing an integer (e.g. 1) makes the sample reproducible.
axis is the direction to sample in (axis=1 samples columns / axis=0 samples rows).

Code example:

# Randomly select two rows (the default axis)
info.sample(n=2)
# Randomly select two columns
info.sample(n=2, axis=1)
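A runnable sketch with a made-up 10-row frame; random_state is fixed so the draw is reproducible:

```python
import pandas as pd

info = pd.DataFrame({'a': range(10), 'b': range(10, 20)})

# n=2: draw two rows; the seed makes the draw repeatable
rows = info.sample(n=2, random_state=42)
print(len(rows))  # 2

# frac=0.5: draw 50% of the rows
half = info.sample(frac=0.5, random_state=42)
print(len(half))  # 5

# axis=1: sample columns instead of rows
cols = info.sample(n=1, axis=1, random_state=42)
print(cols.shape)  # (10, 1)
```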