[R Statistics] Various imputation methods to solve the problem of missing data!

  • Author: Cool on the Move
  • Copyright: This post was originally written by [Cool on the Move]. Please contact the author before reprinting.
  • This article is part of the [R Statistics] column, which introduces statistical analysis with R: descriptive statistics of data, t tests, analysis of variance, correlation, linear regression, and more.

Article directory

  • Build the data
  • Simple imputation
    • Mean/median imputation
    • Random imputation
  • Model-based imputation methods
    • Linear regression imputation
    • k-nearest neighbor imputation (k-NN)
    • Random forest imputation
  • Multiple imputation
  • Comparing imputation results

Missing data is a very common problem in everyday scientific work. In large datasets especially, some loss of data is almost inevitable. This raises a question: how should we handle missing values? Simply deleting the rows that contain them is straightforward, but it discards valid data along with the gaps. In this post I want to share several ways to impute missing data. Note that each method has pros and cons, and the most appropriate one depends on the characteristics of the data and the goals of the analysis.
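For reference, the naive approach mentioned above, deleting every incomplete row, is a single line in R. A minimal sketch using the built-in `airquality` dataset (which ships with R and contains NAs), since our `test_data.csv` is not available here:

```r
# Listwise deletion: drop every row that contains at least one NA.
# airquality has missing values in its Ozone and Solar.R columns.
complete_rows <- na.omit(airquality)

nrow(airquality)                  # 153 rows in total
nrow(complete_rows)               # 111 complete rows remain
sum(!complete.cases(airquality))  # 42 incomplete rows discarded
```

Losing more than a quarter of the rows here illustrates why imputation is usually preferable to deletion.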

Build the data

First, we read in an ecological dataset with 30 rows and 14 columns, which we will use to demonstrate how to handle missing values. We copy the SOC column into a new column, copy_SOC, and then artificially introduce missing values into it by random sampling.

# Read the data
test_data <- read.csv('H:/data/test_data.csv')

# Copy the SOC column; we will punch holes in the copy
test_data$copy_SOC <- test_data$SOC

# Number of values to replace with NA (20% of the rows)
num_na <- round(nrow(test_data) * 0.20)

# Randomly select 20% of the row indices
random_indices <- sample(1:nrow(test_data), size = num_na)

# Replace the selected values with NA (indexing by name is safer than by column number)
test_data$copy_SOC[random_indices] <- NA
colSums(is.na(test_data))
          sites          NPP         ANPP Root.biomass          SOC
              0            0            0            0            0
             TN           pH         Clay         Silt         Sand
              0            0            0            0            0
   Bulk.density   total.PLFA Fungal.PLFAs Bacterial.PLFAs  copy_SOC
              0            0            0               0         6

We can see that the copy_SOC column now contains 6 missing values.
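Besides `colSums(is.na())`, the mice package (used later for multiple imputation) provides `md.pattern()`, which summarizes which combinations of variables tend to be missing together. A quick sketch, again using the built-in `airquality` data since `test_data.csv` is not to hand:

```r
library(mice)

# Each row of the output is one missingness pattern: the left margin counts
# the rows showing that pattern, the right margin counts how many variables
# are missing in it (1 = observed, 0 = missing in the body of the table).
md.pattern(airquality, plot = FALSE)
```

With a single incomplete column, as in our copy_SOC example, the pattern table is trivial, but it becomes very useful once several variables have gaps.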

Simple imputation

Mean/median imputation

This is a very basic and commonly used method: simply replace each missing value with the mean or median of the variable. It is suitable when the data are missing at random.

# Fill missing values using the column's mean, median, or mode. This is the simplest way.
test_data$mean_copy_SOC <- test_data$copy_SOC
test_data$mean_copy_SOC[is.na(test_data$mean_copy_SOC)] <- mean(test_data$copy_SOC, na.rm = TRUE)

test_data$median_copy_SOC <- test_data$copy_SOC
test_data$median_copy_SOC[is.na(test_data$median_copy_SOC)] <- median(test_data$copy_SOC, na.rm = TRUE)

Random imputation

Randomly draw a value from the existing observations to replace each missing value. This method is suitable when the data are missing completely at random.

# Randomly draw values from the observed data to fill in missing values
library(Hmisc)
test_data$Hmisc_copy_SOC <- test_data$copy_SOC
test_data$Hmisc_copy_SOC <- impute(test_data$Hmisc_copy_SOC, 'random')

# impute() is designed mainly for numeric data, so make sure the column is numeric.
# It can also impute with the mean or median by passing the function itself:
# impute(test_data$Hmisc_copy_SOC, mean)
# impute(test_data$Hmisc_copy_SOC, median)

Model-based imputation methods

Linear regression imputation

Regress the variable containing missing values on the other variables, then replace the missing values with the model's predictions.

# Regress the incomplete variable on the other known variables, then use the
# fitted model to predict the missing values
test_data$lm_copy_SOC <- test_data$copy_SOC

# Keep only the complete rows for training
train_data <- test_data[!is.na(test_data$lm_copy_SOC), ]

# Build a linear model on the training data
lm_fit <- lm(lm_copy_SOC ~ NPP + ANPP + Root.biomass + TN + pH + Clay + Silt + Sand +
               Bulk.density + total.PLFA + Fungal.PLFAs + Bacterial.PLFAs,
             data = train_data)

# Stepwise regression to select variables
lm_fit2 <- step(lm_fit)

# Model summary
summary(lm_fit2)

Call:
lm(formula = lm_copy_SOC ~ NPP + ANPP + Root.biomass + TN + Clay +
    Sand + Bulk.density + total.PLFA + Fungal.PLFAs + Bacterial.PLFAs,
    data = train_data)

Residuals:
    Min      1Q  Median      3Q     Max
-2.9593 -2.0936  0.2103  1.0633  4.2886

Coefficients:
                 Estimate Std. Error t value Pr(>|t|)
(Intercept)     -50.34984   29.42610  -1.711   0.1308
NPP              -0.03340    0.01210  -2.760   0.0281 *
ANPP             -0.34054    0.27252  -1.250   0.2516
Root.biomass      0.05054    0.04098   1.233   0.2573
TN               15.00918    1.48659  10.096 2.01e-05 ***
Clay              1.17952    1.16784   1.010   0.3461
Sand              0.65299    0.38771   1.684   0.1360
Bulk.density     -9.35362    8.41716  -1.111   0.3032
total.PLFA       -1.39401    0.90615  -1.538   0.1678
Fungal.PLFAs      2.88526    1.91431   1.507   0.1755
Bacterial.PLFAs   2.53241    1.72284   1.470   0.1850
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 3.586 on 7 degrees of freedom
Multiple R-squared: 0.9934, Adjusted R-squared: 0.984
F-statistic: 105.4 on 10 and 7 DF, p-value: 1.149e-06

# Extract the rows with missing values, keeping only the predictors retained by the model
predict_data <- test_data[is.na(test_data$lm_copy_SOC), names(coefficients(lm_fit2))[-1]]

# Predict the missing values with the fitted model
predicted_values <- predict(lm_fit2, newdata = predict_data)

# Write the predictions back into the missing positions
test_data$lm_copy_SOC[is.na(test_data$lm_copy_SOC)] <- predicted_values

This method first builds a linear model using other known variables and then uses this model to predict missing values.

k-nearest neighbor imputation (k-NN)

This method fills each missing value using the k observations in the dataset that are closest to the incomplete row, typically by averaging their values.

# Use the knnImputation function from the DMwR package for k-NN-based imputation.
# DMwR has been archived from CRAN, so install it from the CRAN mirror on GitHub:
remotes::install_github("cran/DMwR")

library(DMwR)
test_data$DMwR_copy_SOC <- test_data$copy_SOC

# knnImputation computes distances, so keep only the numeric columns
knnImputation_data <- knnImputation(test_data[sapply(test_data, is.numeric)])

test_data$DMwR_copy_SOC <- knnImputation_data$DMwR_copy_SOC

Random forest imputation

Random forest is an ensemble learning method that can also be used to handle missing data.

# Use the missForest package, which imputes missing values with a random forest.

library(missForest)
test_data$missForest_copy_SOC <- test_data$copy_SOC

# missForest expects a data frame of numeric or factor columns;
# keep only the numeric columns here rather than coercing everything to a matrix
result <- missForest(test_data[sapply(test_data, is.numeric)])
result$OOBerror  # out-of-bag estimate of the imputation error

test_data_missForest <- result$ximp

test_data$missForest_copy_SOC <- test_data_missForest$missForest_copy_SOC

Multiple imputation

Multiple imputation is a more involved method, but it is currently regarded as one of the best ways to handle missing data.

# The mice package implements multiple imputation: it creates several imputed
# datasets, so an analysis can be run on each one and the results pooled.

library(mice)
test_data$mice_copy_SOC <- test_data$copy_SOC


# Perform interpolation
imputed_test_data <- mice(test_data[c(8:14,22)], m = 5, maxit = 50, method = 'pmm', seed = 10)
# m is the number of imputed datasets; maxit = 50 is the maximum number of
# iterations; 'pmm' is predictive mean matching. Other available methods:
# pmm                 any        Predictive mean matching
# midastouch          any        Weighted predictive mean matching
# sample              any        Random sample from observed values
# cart                any        Classification and regression trees
# rf                  any        Random forest imputations
# mean                numeric    Unconditional mean imputation
# norm                numeric    Bayesian linear regression
# norm.nob            numeric    Linear regression ignoring model error
# norm.boot           numeric    Linear regression using bootstrap
# norm.predict        numeric    Linear regression, predicted values
# lasso.norm          numeric    Lasso linear regression
# lasso.select.norm   numeric    Lasso select + linear regression
# quadratic           numeric    Imputation of quadratic terms
# ri                  numeric    Random indicator for nonignorable data
# logreg              binary     Logistic regression
# logreg.boot         binary     Logistic regression with bootstrap
# lasso.logreg        binary     Lasso logistic regression
# lasso.select.logreg binary     Lasso select + logistic regression
# polr                ordered    Proportional odds model
# polyreg             unordered  Polytomous logistic regression
# lda                 unordered  Linear discriminant analysis
# 2l.norm             numeric    Level-1 normal heteroscedastic
# 2l.lmer             numeric    Level-1 normal homoscedastic, lmer
# 2l.pan              numeric    Level-1 normal homoscedastic, pan
# 2l.bin              binary     Level-1 logistic, glmer
# 2lonly.mean         numeric    Level-2 class mean
# 2lonly.norm         numeric    Level-2 class normal
# 2lonly.pmm          any        Level-2 class predictive mean matching


# Inspect the imputed values
imputed_test_data$imp$mice_copy_SOC

# Extract the first completed dataset (complete(imp, n) returns the nth)
completed_test_data <- mice::complete(imputed_test_data)

test_data$mice_copy_SOC <- completed_test_data$mice_copy_SOC
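The real payoff of multiple imputation comes from pooling: fit the same model on each of the m imputed datasets and combine the estimates with Rubin's rules, so that the imputation uncertainty is reflected in the standard errors. A minimal, self-contained sketch using the built-in `airquality` data (the model here is purely illustrative, not the one from our SOC example):

```r
library(mice)

# Impute airquality (Ozone and Solar.R contain NAs), producing 5 datasets
imp <- mice(airquality, m = 5, method = 'pmm', seed = 10, printFlag = FALSE)

# Fit the same regression on each completed dataset, then pool the estimates
fit <- with(imp, lm(Ozone ~ Wind + Temp))
pooled <- pool(fit)
summary(pooled)  # pooled coefficients with between-imputation variance included
```

In our example above, the analysis model could be fitted on `imputed_test_data` the same way with `with()` and `pool()`, instead of analysing only the first completed dataset.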

Comparing imputation results

Finally, we can use scatter plots to see how each imputed column relates to the original data.

library(ggplot2)
library(dplyr)
library(tidyr)
library(ggpmisc)  # for stat_poly_eq()

# Plot theme
my_theme <- theme_bw() +
  theme(legend.position = "none",
        axis.ticks = element_line(color = "black"),
        axis.text = element_text(color = "black", size = 13),
        axis.title = element_text(color = "black", size = 13),
        axis.line = element_line(color = "black"),
        panel.grid.minor = element_blank(),
        panel.grid.major = element_blank())

test_data %>%
  dplyr::select(SOC, mean_copy_SOC, median_copy_SOC, Hmisc_copy_SOC,
                lm_copy_SOC, DMwR_copy_SOC, missForest_copy_SOC,
                mice_copy_SOC) %>%
  pivot_longer(cols = -SOC) %>%
  ggplot(aes(x = value, y = SOC)) +
  geom_point() +
  geom_smooth(method = 'lm', se = FALSE) +
  stat_poly_eq(use_label(c("R2", "P")), formula = y ~ x) +
  my_theme +
  labs(x = 'fitted', y = 'real') +
  facet_wrap(~ name, ncol = 3) +
  geom_abline(intercept = 0, slope = 1)  # 1:1 line


Missing data is a common problem in scientific research, but fortunately there are many ways to handle it. The methods introduced in this article are only a sample; many others are worth exploring and trying in practice. I hope this article is helpful! If you have any questions or suggestions, please leave a comment.