- Copyright: This post was originally written by [Cool on the Move]. Please contact the blogger before reprinting.
- This article is part of the [R Statistics] column, which covers statistical analysis with R: descriptive statistics, t-tests, analysis of variance, correlation, linear regression, and more.
Article directory
- Build data
- Simple imputation
  - Mean/median imputation
  - Random imputation
- Model-based imputation methods
  - Linear regression imputation
  - k-nearest neighbor imputation (k-NN)
  - Random forest imputation
- Multiple imputation
- Comparing imputation results
In day-to-day research work, missing data is a very common problem. In large datasets especially, some data loss is almost inevitable for reasons beyond our control. This raises the question: how should we handle missing values? Simply deleting the rows that contain them seems quick and direct, but it throws away valid data. In this post I want to share several ways of dealing with missing data. Note that each method has pros and cons, and the most appropriate one should be chosen based on the characteristics of your data and the goal of your analysis.
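For reference, the "just drop the rows" approach mentioned above (listwise deletion) is a one-liner in R. A minimal sketch on a small made-up data frame, not the dataset used below:

```r
# Listwise deletion: drop every row that contains at least one NA.
# Toy data frame for illustration only.
df <- data.frame(x = c(1, 2, NA, 4),
                 y = c(10, NA, 30, 40))
complete_df <- na.omit(df)  # keeps only rows 1 and 4
nrow(complete_df)           # 2 of the original 4 rows survive
```

Half of this toy dataset is lost even though only a quarter of its values are missing, which is exactly why the imputation methods below are worth knowing.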
Build data
First, we read in an ecological dataset with 30 rows and 14 columns, which we will use to demonstrate how to handle missing values. We add a copy_SOC column (a copy of SOC) and, by random sampling, artificially set 20% of its values to NA.
```r
# Read the data
test_data <- read.csv('H:/data/test_data.csv')
test_data$copy_SOC <- test_data$SOC

# Number of values to replace with NA (20% of rows)
num_na <- round(nrow(test_data) * 0.20)

# Randomly select 20% of the row indices
random_indices <- sample(1:nrow(test_data), size = num_na)

# Replace the selected values in column 15 (copy_SOC) with NA
test_data[random_indices, 15] <- NA

colSums(is.na(test_data))
```

```
          sites             NPP            ANPP    Root.biomass             SOC
              0               0               0               0               0
             TN              pH            Clay            Silt            Sand
              0               0               0               0               0
   Bulk.density      total.PLFA    Fungal.PLFAs Bacterial.PLFAs        copy_SOC
              0               0               0               0               6
```
The copy_SOC column now contains 6 missing values.
Simple imputation
Mean/median imputation
This is a very basic and commonly used method, suitable when the data are missing at random: each missing value is simply replaced with the mean or median of the variable.
```r
# Fill missing values with the column mean or median -- the simplest approach
test_data$mean_copy_SOC <- test_data$copy_SOC
test_data$mean_copy_SOC[is.na(test_data$mean_copy_SOC)] <- mean(test_data$copy_SOC, na.rm = TRUE)

test_data$median_copy_SOC <- test_data$copy_SOC
test_data$median_copy_SOC[is.na(test_data$median_copy_SOC)] <- median(test_data$copy_SOC, na.rm = TRUE)
```
Random imputation
Here a value is drawn at random from the existing observations to replace each missing value. This method is suitable when the data are missing completely at random.
```r
# Fill missing values with values randomly drawn from the observed data
library(Hmisc)
test_data$Hmisc_copy_SOC <- test_data$copy_SOC
test_data$Hmisc_copy_SOC <- impute(test_data$Hmisc_copy_SOC, 'random')

# Make sure your data are numeric: impute() is mainly designed for numeric data.
# impute() can also fill with the mean or median:
# impute(test_data$Hmisc_copy_SOC, 'mean')
# impute(test_data$Hmisc_copy_SOC, 'median')
```
Model-based imputation methods
Linear regression imputation
Regress the variable that has missing values on the other variables, then replace the missing values with the model's predictions.
```r
# Build a linear regression from the other known variables, then use the
# model to predict the missing values
test_data$lm_copy_SOC <- test_data$copy_SOC
train_data <- test_data[!is.na(test_data$lm_copy_SOC), ]

# Fit a linear model on the complete rows
lm_fit <- lm(lm_copy_SOC ~ NPP + ANPP + Root.biomass + TN + pH + Clay + Silt +
               Sand + Bulk.density + total.PLFA + Fungal.PLFAs + Bacterial.PLFAs,
             data = train_data)

# Stepwise regression to select variables
lm_fit2 <- step(lm_fit)

# Model summary
summary(lm_fit2)
```

```
Call:
lm(formula = lm_copy_SOC ~ NPP + ANPP + Root.biomass + TN + Clay +
    Sand + Bulk.density + total.PLFA + Fungal.PLFAs + Bacterial.PLFAs,
    data = train_data)

Residuals:
    Min      1Q  Median      3Q     Max
-2.9593 -2.0936  0.2103  1.0633  4.2886

Coefficients:
                 Estimate Std. Error t value Pr(>|t|)
(Intercept)     -50.34984   29.42610  -1.711   0.1308
NPP              -0.03340    0.01210  -2.760   0.0281 *
ANPP             -0.34054    0.27252  -1.250   0.2516
Root.biomass      0.05054    0.04098   1.233   0.2573
TN               15.00918    1.48659  10.096 2.01e-05 ***
Clay              1.17952    1.16784   1.010   0.3461
Sand              0.65299    0.38771   1.684   0.1360
Bulk.density     -9.35362    8.41716  -1.111   0.3032
total.PLFA       -1.39401    0.90615  -1.538   0.1678
Fungal.PLFAs      2.88526    1.91431   1.507   0.1755
Bacterial.PLFAs   2.53241    1.72284   1.470   0.1850
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 3.586 on 7 degrees of freedom
Multiple R-squared:  0.9934,	Adjusted R-squared:  0.984
F-statistic: 105.4 on 10 and 7 DF,  p-value: 1.149e-06
```

```r
# Rows with missing values, keeping only the predictors retained by the
# stepwise model
predict_data <- test_data[is.na(test_data$lm_copy_SOC),
                          names(coefficients(lm_fit2))[-1]]

# Predict the missing values with the model
predicted_values <- predict(lm_fit2, newdata = predict_data)

# Insert the predictions at the missing positions
test_data$lm_copy_SOC[is.na(test_data$lm_copy_SOC)] <- predicted_values
```
This method first builds a linear model using other known variables and then uses this model to predict missing values.
k-nearest neighbor imputation (k-NN)
This method finds the k observations in the dataset that are most similar to the row with the missing value and imputes from those neighbors' values.
```r
# Fill missing values with k-NN using knnImputation() from the DMwR package.
# DMwR is no longer on CRAN, so install it from the CRAN GitHub mirror.
remotes::install_github("cran/DMwR")
library(DMwR)

test_data$DMwR_copy_SOC <- test_data$copy_SOC
knnImputation_data <- knnImputation(test_data)
test_data$DMwR_copy_SOC <- knnImputation_data$DMwR_copy_SOC
```
Random forest imputation
Random forest is an ensemble learning method that can be used to deal with missing data problems.
```r
# The missForest package imputes missing values with a random forest algorithm
library(missForest)

test_data$missForest_copy_SOC <- test_data$copy_SOC
result <- missForest(as.matrix(test_data))
result$OOBerror  # out-of-bag imputation error estimate
test_data_missForest <- as.data.frame(result$ximp)
test_data$missForest_copy_SOC <- test_data_missForest$missForest_copy_SOC
```
Multiple Imputation
Multiple imputation is a more complex method, but it is widely regarded as one of the best ways to handle missing data.
```r
# Multiple imputation with the mice package: a more complex but widely
# accepted method that creates several imputed datasets so analyses can be
# run on each of them.
library(mice)
test_data$mice_copy_SOC <- test_data$copy_SOC

# Run the imputation:
# m = number of imputed datasets, maxit = maximum number of iterations,
# method = 'pmm' (predictive mean matching); other methods are listed below
imputed_test_data <- mice(test_data[c(8:14, 22)], m = 5, maxit = 50,
                          method = 'pmm', seed = 10)
```

The methods available for the `method` argument:

| Method | Type | Description |
|---|---|---|
| pmm | any | Predictive mean matching |
| midastouch | any | Weighted predictive mean matching |
| sample | any | Random sample from observed values |
| cart | any | Classification and regression trees |
| rf | any | Random forest imputations |
| mean | numeric | Unconditional mean imputation |
| norm | numeric | Bayesian linear regression |
| norm.nob | numeric | Linear regression ignoring model error |
| norm.boot | numeric | Linear regression using bootstrap |
| norm.predict | numeric | Linear regression, predicted values |
| lasso.norm | numeric | Lasso linear regression |
| lasso.select.norm | numeric | Lasso select + linear regression |
| quadratic | numeric | Imputation of quadratic terms |
| ri | numeric | Random indicator for nonignorable data |
| logreg | binary | Logistic regression |
| logreg.boot | binary | Logistic regression with bootstrap |
| lasso.logreg | binary | Lasso logistic regression |
| lasso.select.logreg | binary | Lasso select + logistic regression |
| polr | ordered | Proportional odds model |
| polyreg | unordered | Polytomous logistic regression |
| lda | unordered | Linear discriminant analysis |
| 2l.norm | numeric | Level-1 normal heteroscedastic |
| 2l.lmer | numeric | Level-1 normal homoscedastic, lmer |
| 2l.pan | numeric | Level-1 normal homoscedastic, pan |
| 2l.bin | binary | Level-1 logistic, glmer |
| 2lonly.mean | numeric | Level-2 class mean |
| 2lonly.norm | numeric | Level-2 class normal |
| 2lonly.pmm | any | Level-2 class predictive mean matching |

```r
# Inspect the imputed values
imputed_test_data$imp$mice_copy_SOC

# Extract the first completed dataset
completed_test_data <- mice::complete(imputed_test_data)
test_data$mice_copy_SOC <- completed_test_data$mice_copy_SOC
```
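For this comparison we only need one completed dataset, but the full multiple-imputation workflow usually fits the analysis model on all m datasets and pools the results with Rubin's rules. A minimal sketch of that workflow using mice's built-in nhanes example data (not the dataset used in this post):

```r
library(mice)

# Impute the built-in nhanes dataset, producing m = 5 completed datasets
imp <- mice(nhanes, m = 5, method = 'pmm', seed = 10, printFlag = FALSE)

# Fit the same model on each completed dataset, then pool the estimates
fits <- with(imp, lm(chl ~ bmi + age))
pooled <- pool(fits)
summary(pooled)  # pooled coefficients, standard errors, and p-values
```

Pooling propagates the uncertainty caused by the missing values into the standard errors, which a single completed dataset cannot do.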
Comparing imputation results
Finally, we can use scatter plots to visually compare each imputation method against the original data.
```r
library(ggplot2)
library(dplyr)
library(tidyr)
library(ggpmisc)  # provides stat_poly_eq() and use_label()

# Plot theme
the <- theme_bw() +
  theme(legend.position = "none",
        axis.ticks = element_line(color = "black"),
        axis.text  = element_text(color = "black", size = 13),
        axis.title = element_text(color = "black", size = 13),
        axis.line  = element_line(color = "black"),
        panel.grid.minor = element_blank(),
        panel.grid.major = element_blank())

test_data %>%
  dplyr::select(SOC, mean_copy_SOC, median_copy_SOC, Hmisc_copy_SOC,
                lm_copy_SOC, DMwR_copy_SOC, missForest_copy_SOC,
                mice_copy_SOC) %>%
  pivot_longer(cols = -SOC) %>%
  ggplot(aes(x = value, y = SOC)) +
  geom_point() +
  geom_smooth(method = 'lm', se = FALSE) +
  stat_poly_eq(use_label(c("R2", "P")), formula = y ~ x) +
  the +
  labs(x = 'fitted', y = 'real') +
  facet_wrap(~ name, ncol = 3) +
  geom_abline(intercept = 0, slope = 1)  # 1:1 line
```
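Since the true SOC values are known here, the methods can also be compared numerically. This small helper (my addition, not part of the original post) computes each method's root-mean-square error on the rows that were artificially blanked; `data`, `idx`, and `methods` correspond to `test_data`, `random_indices`, and the imputed columns built above:

```r
# RMSE between imputed values and the true SOC, computed only on the rows
# whose copy_SOC values were set to NA.
rmse_by_method <- function(data, idx, methods) {
  sapply(methods, function(m) {
    imputed <- as.numeric(data[idx, m])  # imputed values on the blanked rows
    truth   <- data$SOC[idx]             # true values on the same rows
    sqrt(mean((imputed - truth)^2))
  })
}

# e.g. with the columns built in this post (smaller is better):
# sort(rmse_by_method(test_data, random_indices,
#                     c("mean_copy_SOC", "median_copy_SOC", "Hmisc_copy_SOC",
#                       "lm_copy_SOC", "DMwR_copy_SOC",
#                       "missForest_copy_SOC", "mice_copy_SOC")))
```

Because the errors are evaluated only on the artificially missing cells, this gives a fair head-to-head ranking of the methods on this dataset.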
Missing data is a common problem in scientific research, but fortunately there are many ways to deal with it, and the methods introduced here are only a sample; many others are waiting to be explored in practice. I hope this article is helpful! If you have questions or suggestions, please leave a comment.