[Data Mining | Data Preprocessing] Handling missing values, duplicate values, and text: are you sure you don't want a look?

Personal homepage: @AI_magician
About the author: CSDN content partner, high-quality creator in the full-stack field.
Vision: aiming to grow together with more partners who love computers!
Statement: I am currently a college sophomore; my research interests are artificial intelligence and hardware (although I have not started playing with hardware yet, I have always been interested in it, and I hope you can help me).




Author: Computer Magician

Version: 1.0 (2023.8.27)

Abstract: This series aims to explain the core concepts that must be mastered on the road to deep learning. The articles are collected and written by the blogger; a "triple" of likes, favorites, and shares is the best support! The series will be continuously updated, and subscriptions are welcome.

This article belongs to the column:
"In-depth Analysis of Machine Learning: A Comprehensive Guide from Principles to Applications"

Data preprocessing

Handling missing values

The choice among the methods summarized below depends on the characteristics of the data set, the pattern of the missing values, and the analysis method to be used. In practice, select a method appropriate to the specific situation, then verify and evaluate it to ensure the treatment of missing values is effective and reasonable.

When there are null values in the data, look not only at the count of missing values but also at their proportion, which is more representative. A helper like the following performs both descriptive statistics and missing-value analysis:

```python
# Custom analysis function: descriptive statistics plus missing-value
# analysis, for initial data exploration.
import pandas as pd

def analysis(data: pd.DataFrame) -> None:
    print('Descriptive statistics:\n', data.describe())
    print('Missing-value percentage per attribute:\n',
          100 * (data.isnull().sum() / len(data)))
```
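
For example, assuming the data live in a CSV file (the file name data.csv is illustrative):

```python
df = pd.read_csv('data.csv')  # illustrative input file
analysis(df)
```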
| Name | Introduction | Advantages and Disadvantages |
| --- | --- | --- |
| Deletion | Delete the rows or columns that contain missing values. | Advantages: simple and fast; suitable when few values are missing. Disadvantages: useful information may be lost, especially when the missingness pattern is related to other variables; a large missing proportion shrinks the sample. |
| Imputation | Estimate missing values statistically and fill them in. Common imputation methods include the mean, median, mode, and regression; values can also be filled from domain knowledge, e.g., completing e-commerce phone listings from other models in the same series. | Advantages: preserves the sample size; no data are discarded. Disadvantages: may introduce estimation error and change the distribution of and relationships in the data; the choice and quality of the imputation method strongly affect the results. |
| Marking | Mark missing values with a special value (such as NaN or -1) or a label (such as "Unknown" or "Other"). | Advantages: simple and intuitive; does not change the distribution of or relationships in the data. Disadvantages: may bias some algorithms; the marked values must be handled carefully to avoid introducing errors. |
| Treating as a category | Treat missing values as a special category of their own. | Advantages: no information is lost; suitable when missingness itself is meaningful. Disadvantages: the data become more complex, and some algorithms need extra tuning to handle categorical features. |
| Multiple imputation | Impute iteratively using multiple imputation models. | Advantages: estimates missing values more accurately and provides uncertainty estimates. Disadvantages: higher computational cost and longer processing time; convergence and stability of the iterations must be handled carefully. |
| Model prediction | Train a machine learning model that uses the other features as input to predict the missing values. | Advantages: estimates missing values more accurately by exploiting relationships between features. Disadvantages: computationally expensive; the model must be trained and tuned, and model prediction error may be introduced. |
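
To make the first three strategies concrete, here is a minimal pandas sketch (the DataFrame and column names are illustrative):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'age': [25, np.nan, 31, 40],
                   'city': ['NY', None, 'SF', 'NY']})

dropped = df.dropna()                                # deletion: drop rows with any NaN
imputed = df.fillna({'age': df['age'].median(),      # imputation: median for numeric,
                     'city': df['city'].mode()[0]})  # mode for categorical
marked = df.fillna({'city': 'Unknown'})              # marking: label missing categories
print(dropped, imputed, marked, sep='\n\n')
```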
Interpolation

For time series data, the following interpolation methods are commonly used and recommended:

  1. Linear interpolation: the simplest and most commonly used method. It assumes the data change linearly between two known points and fills a null by evaluating the straight line between them. It is simple, fast, and adequate for most situations.

  2. Lagrange interpolation: a polynomial interpolation method that approximates changes in the data with a polynomial function. It fits nonlinear changes more accurately, but can be computationally expensive for large data sets and high-order polynomials.

  3. Spline interpolation: a smooth interpolation method that fits a smooth curve through the data, handling curvature and trend changes well. Common variants include linear and cubic spline interpolation.

  4. Time series model interpolation: a time series model can predict and fill the null values. Common choices include ARIMA, exponential smoothing, and neural network models, which forecast from trend, seasonality, and other characteristics of the series.

When choosing an interpolation method, pick the one best suited to the nature and characteristics of the time series. For a stationary series, linear or Lagrange interpolation may be sufficient; for a nonlinear or seasonal series, spline interpolation or a time series model is often more appropriate.

In addition, the method can be chosen based on the continuity and periodicity of the data: for missing periodic data, periodic interpolation methods such as a periodic moving average or periodic linear interpolation can be used.
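
As a concrete illustration, here is a minimal sketch of the first three approaches using pandas' Series.interpolate; the dates and values are made up, and method='spline' additionally requires SciPy.

```python
import numpy as np
import pandas as pd

# Illustrative daily series with interior gaps.
idx = pd.date_range('2023-01-01', periods=8, freq='D')
s = pd.Series([1.0, np.nan, np.nan, 4.0, 5.0, np.nan, 7.0, 8.0], index=idx)

linear = s.interpolate(method='linear')           # straight line between known points
time_aware = s.interpolate(method='time')         # weights gaps by actual time deltas
spline = s.interpolate(method='spline', order=3)  # cubic spline; requires SciPy

print(pd.DataFrame({'raw': s, 'linear': linear,
                    'time': time_aware, 'spline': spline}))
```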

Handling duplicate values

| Method name | Method introduction | Advantages and Disadvantages |
| --- | --- | --- |
| Remove duplicates | Remove all duplicate observations or rows from the data set. | Advantages: simple and fast. Disadvantages: may cause data loss, especially when the values of the other columns differ between the duplicated rows. |
| Keep unique values | Keep only the unique observations or rows and remove the duplicated ones. | Advantages: retains the unique information in the data set. Disadvantages: may cause data loss, especially when the values of the other columns differ. |
| Mark duplicate values | Flag the duplicate values so they can be identified in subsequent analysis. | Advantages: retains all information in the data set and makes duplicates identifiable. Disadvantages: increases the size of the data set and the complexity of later processing. |
| Aggregate data | Aggregate the duplicated values into a single value, e.g., by averaging numbers or merging text strings. | Advantages: retains all information and provides a summarized result. Disadvantages: depending on the situation, aggregation error or information loss may be introduced. |
| Keep first/last | Keep only the first or the last observation within each group of duplicates and delete the rest. | Advantages: simple and easy. Disadvantages: may introduce bias, because the retained observation may not represent the whole group of duplicates. |

These methods can be selected and adapted based on specific data sets and analysis needs. Before processing duplicate values, it is also often necessary to sort the data to ensure consistency between adjacent observations. In addition, it is important to understand the causes of duplicate values in your data set to help determine the most appropriate treatment method.

Note that when using pd.drop_duplicates(), pass the column(s) that define a duplicate via the subset parameter rather than comparing entire rows, to avoid deleting rows that should be kept.
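
Below is a minimal sketch of several options from the table, using pandas; the DataFrame and column names are illustrative.

```python
import pandas as pd

df = pd.DataFrame({
    'user_id': [1, 1, 2, 3, 3],
    'score':   [10, 12, 20, 30, 31],
})

# Keep first/last: `subset` restricts the duplicate comparison to user_id.
first = df.drop_duplicates(subset=['user_id'], keep='first')
last = df.drop_duplicates(subset=['user_id'], keep='last')

# Mark duplicates instead of dropping them, for later inspection.
df['is_dup'] = df.duplicated(subset=['user_id'], keep='first')

# Aggregate duplicates into a single value (here: mean score per user).
agg = df.groupby('user_id', as_index=False)['score'].mean()
print(first, last, agg, sep='\n\n')
```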

Text processing

Text preprocessing is an important step in natural language processing (NLP) tasks. It converts raw text data into a format that machine learning algorithms can understand and process. Below are several common text preprocessing techniques, with an introduction and their advantages and disadvantages.

| Name | Introduction | Advantages and Disadvantages |
| --- | --- | --- |
| Tokenization | Split text into words (tokens); a common approach is to split on spaces or punctuation. Libraries such as jieba handle Chinese tokenization. | Advantages: simple and fast; suitable for most NLP tasks. Disadvantages: cannot handle ambiguity and special cases such as abbreviations and compound words. |
| Stop word removal | Stop words appear frequently in text but usually carry little information (such as "the", "is", "and"). Ready-made stop-word lists exist; in practice, additional task-specific noise words should also be removed. | Advantages: reduces data dimensionality and improves later steps. Disadvantages: may sometimes remove important contextual information. |
| Normalization | Convert words into a standard form to eliminate the effect of morphological variation, e.g., unifying the tense, number, and person of a word. | Advantages: reduces vocabulary diversity and improves model generalization. Disadvantages: some information may be lost. |
| Stemming | Reduce a word to its stem by stripping suffixes, e.g., "running" and "runs" become "run". | Advantages: simple and fast; suitable for some information-retrieval tasks. Disadvantages: may produce word forms that do not actually exist. |
| Lemmatization | Reduce a word to its base form (lemma) with semantic accuracy, e.g., "am", "are", and "is" become "be". | Advantages: more accurate word forms; suitable for high-precision tasks. Disadvantages: higher computational cost and slower speed. |
| Cleaning | Remove noise from text: special characters, HTML tags and escaped entities (such as &amp;), emoticons, and other non-text data. Also remove data that is useless for the target, e.g., e-commerce reviews that default to "good review" when the buyer left no content. | Advantages: improves text quality and reduces irrelevant information. Disadvantages: some useful features may be lost. |
| Encoding | Convert text into a numerical representation that machine learning algorithms can process; common methods include one-hot encoding, bag-of-words models, and word embeddings. | Advantages: convenient for algorithms and retains some semantic information. Disadvantages: relationships and contextual information between words may not be captured. |

These algorithms are often used in combination depending on the characteristics of the specific task and data set. Choosing an appropriate text preprocessing step depends on the goals of the task and the characteristics of the data.
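
As a rough end-to-end illustration, the sketch below chains several of these steps with NLTK and scikit-learn; the sample sentence is made up, and the one-time nltk.download calls are shown as comments.

```python
# NLTK data packages must be downloaded once:
# nltk.download('punkt'); nltk.download('stopwords'); nltk.download('wordnet')
# (newer NLTK versions may also require nltk.download('punkt_tab'))
import re
import nltk
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer, WordNetLemmatizer
from sklearn.feature_extraction.text import CountVectorizer

text = "The <b>running</b> dogs are faster &amp; stronger!"

cleaned = re.sub(r'<[^>]+>|&\w+;', ' ', text)   # cleaning: strip tags and entities
tokens = nltk.word_tokenize(cleaned.lower())     # tokenization + case normalization
tokens = [t for t in tokens if t.isalpha()]      # drop punctuation tokens
stops = set(stopwords.words('english'))
tokens = [t for t in tokens if t not in stops]   # stop-word removal

print([PorterStemmer().stem(t) for t in tokens])           # stemming: 'running' -> 'run'
print([WordNetLemmatizer().lemmatize(t) for t in tokens])  # lemmatization (noun POS by default)

# Encoding: a bag-of-words representation of the cleaned tokens.
bow = CountVectorizer().fit_transform([' '.join(tokens)])
print(bow.toarray())
```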

At this point, if you still have any questions, you are welcome to send the blogger a private message, and the blogger will do his best to answer them!
If this article helped you, your like is the greatest support for the blogger!