In the previous five sections, we used Alzheimer’s disease data to make a data preprocessing case, including the following:
GEO Bioinformatics Data Mining (1) Data Set Download and Preliminary Observation
GEO bioinformatics data mining (2) Download gene chip platform files and annotations
GEO Bioinformatics Data Mining (3) Chip Probe ID and Gene Name Mapping Processing
GEO Bioinformatics Data Mining (4) Data Cleaning (Outlier Processing, Low Expression Genes, Normalization, Log2 Processing)
GEO bioinformatics data mining (5) Extract clinical information to construct groups and visualize grouped data (drawing hierarchical clustering diagrams and PCA diagrams)
This section’s directory
Observation of tuberculosis gene expression data (GSE107994)
Clinical shape data preprocessing
Gene expression data preprocessing
Plot observation data
Tuberculosis gene expression data (GSE107994) observation
Since the data format you get during the data analysis process may be different, in this section we take the tuberculosis gene expression data (GSE107994) as an example to make a practical case. The clinical shape data and gene expression data of this data set are separated separately, and you need to modify the code to read and process them.
Let’s take a look at the gene expression data first. The probe annotation has been completed and does not need to be processed.
Looking at the clinical shape data again, we need to manually delete the previous annotations and retain the regular data in the second half.
Clinical Shape Data Preprocessing
# Manually delete the previous comments, read the file, and transpose
pdata <- t(read.delim("GSE107994_series_matrix_clean.txt", header = TRUE, sep = "\t"))
# Manually delete the previous comments, read the file, and transpose pdata <- t(read.delim("GSE107994_series_matrix_clean.txt", header = TRUE, sep = "\t")) pdata <-pdata[-1,] pdata_info = pdata[,c(1,7)] colnames(pdata_info) = c('geo_accession','type') #Observe what the values of the sample type are (tuberculosis, latent progress, control and latent) unique(pdata_info[,2]) #"Leicester_Active_TB" "Longitudnal_Leicester_LTBI_Progressor" "Leicester_Control" #"Leicester_LTBI" group_data = as.data.frame(pdata_info)
Before processing
After processing
Add different type labels and select experimental and control groups as needed
# Use the grepl function to determine whether the string contains 'Control' and modify it accordingly group_data$group_easy <- ifelse(grepl("Control", group_data$type ), "Control", "TB") # Use the grepl function to determine whether the string contains specific content, and then modify it accordingly group_data$group_easy <- ifelse(grepl("Control", group_data$type), "Control", ifelse(grepl("LTBI", group_data$type), "LTBI","TB")) # Use the grepl function to determine whether the string contains specific content, and then modify it accordingly group_data$group_more <- ifelse(grepl("Control", group_data$type), "Control", ifelse(grepl("LTBI_Progressor", group_data$type), "LTBI_Progressor", ifelse(grepl("LTBI", group_data$type), "LTBI","TB"))) #Try to exclude the progress group save(group_data,file = "group_data.Rdata")
For example, we can conduct comparative analysis of TB (tuberculosis) and LTBI (latent tuberculosis) experiments.
Gene expression data preprocessing
Read the data set
install.packages("openxlsx") library(openxlsx) # Read the gene expression matrix, the first column is the gene name ID gse_info<- as.data.frame(read.xlsx("GSE107994_Raw.xlsx", sheet = 1)) colnames(gse_info)
During the subsequent running of the code, it was found that there were all numbers in the gene name, and the deletion operation was performed here.
library(dplyr) dim(gse_info) #There are numbers in genes gse_info <- gse_info[!grepl("^\d + $", gse_info$ID), ] #valid #gene names are all empty gse_info = gse_info[gse_info$ID != "",] #No elimination dim(gse_info) #[1] 58023 176 #Negative value processing gse_info[gse_info <= 0] <- 0.0001 #Duplicate value check table(duplicated(gse_info$ID))
Group data condition filtering, TB (tuberculosis) and LTBI (latent tuberculosis)
# + ============================================= ======== #================================================ # + ========type grouped data condition filtering step3============ # + ==================================== #Before preprocessing, first filter out the data of TB group and LTBI group unique(group_data[,"group_more"]) #"TB" "LTBI_Progressor" "Control" "LTBI" #"TB" "LTBI" For comparison, remove "LTBI_Progressor" "Control" geo_accession_TB_LTBI <- group_data[group_data$group_more == "LTBI_Progressor" | group_data$group_more == "Control","geo_accession"] gse_TB_FTBI = gse_info[,!(names(gse_info) %in% geo_accession_TB_LTBI)] gse_TB_FTBI
Low expression filtering (average value less than 1)
# + ================================================ ===== #================================================ # + ========Delete low expression (average value less than 1) genes step4============ # + ==================================== # + ============================== #Add a new column to calculate the average gene_avg_expression <- rowMeans(gse_TB_FTBI[, -1]) # Calculate the average expression of each gene, excluding the first column (gene name) #Only remove genes whose expression level is zero in all samples (average value is less than 1) gse_TB_FTBI_filtered_genes_1 <- gse_TB_FTBI[gene_avg_expression >= 1, ]
Low expression filtering scheme 2 (retaining the top 50% of genes expressed in the sample)
# + ============================================= ==================== #================================================== =========== # + ========Delete low expression (top 50%) genes step5============ # + ========================================== # + ================================ #Retain only genes expressed in more than half of the samples # Calculate the average value of each gene in the gene expression matrix gene_means <- rowMeans(gse_TB_FTBI_filtered_genes_1[,-1]) # Calculate the ranking percentile of the gene mean gene_percentiles <- rank(gene_means) / length(gene_means) # Get the threshold threshold <- 0.25 # 25% threshold after deletion #threshold <- 0.5 # 50% threshold after deletion # Filter low expression genes based on threshold gse_TB_FTBI_filtered_genes_2 <- gse_TB_FTBI_filtered_genes_1[gene_percentiles > threshold, ] #Print the filtered gene expression matrix dim(gse_TB_FTBI_filtered_genes_2) #[1] 17049 176
Remove duplicate genes and average
# + ============================================= ==================== #================================================== =========== # + ========Duplicate genes, take the average step6============ # + ========================================== # + ================================ dim(filtered_genes_2) table(duplicated(filtered_genes_2$ID)) # Average the repeated Symbols averaged_data <- aggregate(.~ID, filtered_genes_2, mean, na.action = na.pass) ##Average the repeated Symbols #Name the row name SYMBOL row.names(averaged_data) <- averaged_data$ID dim(averaged_data) #Remove missing values matrix_na = na.omit(averaged_data) #Delete the Symbol column (usually the first column) matrix_final <- subset(matrix_na, select = -1) dim(matrix_final) #[1] 22687 175
Outlier handling
# + ================================================ ================= #================================================== =========== # + ========Outlier processing step7========================== # + ========================================== # + ================================ #Data outlier processing #Handle extreme values #Define vector extreme value processing function #Used to handle outliers and replace values outside a certain range with the median to reduce the impact of outliers on subsequent analysis. dljdz=function(x) { DOWNB=quantile(x,0.25)-1.5*(quantile(x,0.75)-quantile(x,0.25)) UPB=quantile(x,0.75) + 1.5*(quantile(x,0.75)-quantile(x,0.25)) x[which(x<DOWNB)]=quantile(x,0.5) x[which(x>UPB)]=quantile(x,0.5) return(x) } #The first column is set to the row name matrix_leave=matrix_final_TB_LTBI boxplot(matrix_leave,outline=FALSE, notch=T, las=2) ##Outline boxplot dim(matrix_leave) #Handle outliers matrix_leave_res=apply(matrix_leave,2,dljdz) boxplot(matrix_leave_res,outline=FALSE, notch=T, las=2) ##Outline boxplot dim(matrix_leave_res)
log2 processing
# + ============================================= ==================== #================================================== =========== # + ========log2 processing step8========================== # + ========================================== # + ================================ # limma function normalization, correction difference, automatic log2 expression matrix #1. Normalization is not absolutely necessary, but it is recommended. #In duplicate samples, external factors with no biological significance should affect the expression of a single sample. #For example, the samples prepared in the first batch will generally have higher expression than the samples prepared in the second batch. It is assumed that the range and distribution of expression values of all samples should be similar. #Normalization is required to ensure that the expression distribution of each sample is similar throughout the experiment. #2. Normalization should be done before log2 standardization library(limma) exprSet=normalizeBetweenArrays(matrix_leave_res) boxplot(exprSet,outline=FALSE, notch=T, las=2) ##Outline boxplot ## This step is very important to convert the matrix into a data frame. class(exprSet) ##Note: The format of the data at this time is a matrix (Matrix) exprSet <- as.data.frame(exprSet) #Standardize the expression matrix to be automatically log2-formed qx <- as.numeric(quantile(exprSet, c(0., 0.25, 0.5, 0.75, 0.99, 1.0), na.rm=T)) LogC <- (qx[5] > 100) || (qx[6]-qx[1] > 50 & amp; & amp; qx[2] > 0) || (qx[2] > 0 & amp; & amp; qx[2] < 1 & amp; & amp; qx[4] > 1 & amp; & amp; qx[4] < 2) # Set all negative values to empty #exprSet[exprSet <= 0] <- 0.0001 #Remove missing values #exprSet = na.omit(exprSet) #15654 #save (exprSet,file = "waitlog_data_TB_LTBI.Rdata") ## Start judging if (LogC) { exprSet [which(exprSet <= 0)] <- NaN ## Get log2 exprSet_clean <- log2(exprSet + 1) #@@@@Whether to add one. If adding one, it will not produce a negative value@#@¥@#@#@%@%¥@@@@@@ print("log2 transform finished") }else{ print("log2 transform not needed") } boxplot(exprSet_clean,outline=FALSE, notch=T, las=2) ##Outline boxplot dataset_TB_LTBI =exprSet_clean
Drawing observation data
# + ================================================ ================= #================================================== =========== # + ======== Draw box plots in different colors for the control group step9========================== # + ========================================== # + ================================ # Use the grepl function to determine whether the string contains 'LTBI' and color mark it for drawing. group_data_TB_LTBI$group_color <- ifelse(grepl("LTBI", group_data_TB_LTBI$group_more), "yellow", "blue") #Draw a box plot to view data distribution group_list_color = group_data_TB_LTBI$group_color boxplot( data.frame(dataset_TB_LTBI),outline=FALSE,notch=T,col=group_list_color,las=2) dev.off() # + ================================================ ================= #================================================== =========== # + ========Draw hierarchical clustering diagram step10========================== # + ========================================== # + ================================ # + #Check whether the sample names of the expression matrix and the order of sample names of the sublease information are consistent. colnames(dataset_TB_LTBI) group_data_TB_LTBI$geo_accession exprSet =dataset_TB_LTBI #Modify the name of GSM to group information #colnames(exprSet)=paste(group_list,1:ncol(exprSet),sep = '') #define nodePar nodePar=list(lab.cex=0.6,pch=c(NA,19),cex=0.7,col='blue') #clustering hc=hclust(dist(t(exprSet))) #t() means transpose #drawing plot(as.dendrogram(hc),nodePar = nodePar,horiz = TRUE) dev.off() # + ================================================ ================= #================================================== =========== # + ========Draw PCA scatter sample visualization step11==================== # + ========================================== # + ================================ ##PCA plot #install.packages('ggfortify') library(ggfortify) df=as.data.frame(t(exprSet)) #After transposition, it becomes a matrix dim(df) #View data dimensions dim(exprSet) df$group=group_data_TB_LTBI$group_more #Add sample group information autoplot(prcomp(df[,1:ncol(df)-1]),data=df,colour='group') #PCA scatter plot dev.off()
At this point, we have preprocessed the two data sets, and now we can perform difference analysis on the processed data.
The knowledge points of the article match the official knowledge files, and you can further learn relevant knowledge. Python entry skill treeHomepageOverview 377795 people are learning the system