GEO Bioinformatics Data Mining (6) Practical Case – Preprocessing Analysis of Four Classifications of Tuberculosis Gene Data

In the previous five sections, we used Alzheimer’s disease data to make a data preprocessing case, including the following:

GEO Bioinformatics Data Mining (1) Data Set Download and Preliminary Observation

GEO bioinformatics data mining (2) Download gene chip platform files and annotations

GEO Bioinformatics Data Mining (3) Chip Probe ID and Gene Name Mapping Processing

GEO Bioinformatics Data Mining (4) Data Cleaning (Outlier Processing, Low Expression Genes, Normalization, Log2 Processing)

GEO bioinformatics data mining (5) Extract clinical information to construct groups and visualize grouped data (drawing hierarchical clustering diagrams and PCA diagrams)

This section’s directory

Observation of tuberculosis gene expression data (GSE107994)

Clinical shape data preprocessing

Gene expression data preprocessing

Plot observation data


Tuberculosis gene expression data (GSE107994) observation

Since the data format you get during the data analysis process may be different, in this section we take the tuberculosis gene expression data (GSE107994) as an example to make a practical case. The clinical shape data and gene expression data of this data set are separated separately, and you need to modify the code to read and process them.

Let’s take a look at the gene expression data first. The probe annotation has been completed and does not need to be processed.

Looking at the clinical shape data again, we need to manually delete the previous annotations and retain the regular data in the second half.

Clinical Shape Data Preprocessing

# Manually delete the previous comments, read the file, and transpose
pdata <- t(read.delim("GSE107994_series_matrix_clean.txt", header = TRUE, sep = "\t"))

# Manually delete the previous comments, read the file, and transpose
pdata <- t(read.delim("GSE107994_series_matrix_clean.txt", header = TRUE, sep = "\t"))
pdata <-pdata[-1,]
pdata_info = pdata[,c(1,7)]
colnames(pdata_info) = c('geo_accession','type')

#Observe what the values of the sample type are (tuberculosis, latent progress, control and latent)
unique(pdata_info[,2])
#"Leicester_Active_TB" "Longitudnal_Leicester_LTBI_Progressor" "Leicester_Control" #"Leicester_LTBI"

group_data = as.data.frame(pdata_info)

Before processing

After processing

Add different type labels and select experimental and control groups as needed

# Use the grepl function to determine whether the string contains 'Control' and modify it accordingly
group_data$group_easy <- ifelse(grepl("Control", group_data$type ), "Control", "TB")

# Use the grepl function to determine whether the string contains specific content, and then modify it accordingly
group_data$group_easy <- ifelse(grepl("Control", group_data$type), "Control",
                                       ifelse(grepl("LTBI", group_data$type), "LTBI","TB"))
# Use the grepl function to determine whether the string contains specific content, and then modify it accordingly
group_data$group_more <- ifelse(grepl("Control", group_data$type), "Control",
                                ifelse(grepl("LTBI_Progressor", group_data$type), "LTBI_Progressor",
                                       ifelse(grepl("LTBI", group_data$type), "LTBI","TB")))

#Try to exclude the progress group

save(group_data,file = "group_data.Rdata")

For example, we can conduct comparative analysis of TB (tuberculosis) and LTBI (latent tuberculosis) experiments.

Gene expression data preprocessing

Read the data set


install.packages("openxlsx")
library(openxlsx)

# Read the gene expression matrix, the first column is the gene name ID
gse_info<- as.data.frame(read.xlsx("GSE107994_Raw.xlsx", sheet = 1))
colnames(gse_info)

During the subsequent running of the code, it was found that there were all numbers in the gene name, and the deletion operation was performed here.

library(dplyr)
dim(gse_info)
#There are numbers in genes
gse_info <- gse_info[!grepl("^\d + $", gse_info$ID), ] #valid

#gene names are all empty
gse_info = gse_info[gse_info$ID != "",] #No elimination
dim(gse_info) #[1] 58023 176

#Negative value processing
gse_info[gse_info <= 0] <- 0.0001

#Duplicate value check
table(duplicated(gse_info$ID))

Group data condition filtering, TB (tuberculosis) and LTBI (latent tuberculosis)

# + ============================================= ========
#================================================
# + ========type grouped data condition filtering step3============
# + ====================================

#Before preprocessing, first filter out the data of TB group and LTBI group
unique(group_data[,"group_more"]) #"TB" "LTBI_Progressor" "Control" "LTBI"

#"TB" "LTBI" For comparison, remove "LTBI_Progressor" "Control"
geo_accession_TB_LTBI <- group_data[group_data$group_more == "LTBI_Progressor" | group_data$group_more == "Control","geo_accession"]
gse_TB_FTBI = gse_info[,!(names(gse_info) %in% geo_accession_TB_LTBI)]



gse_TB_FTBI

Low expression filtering (average value less than 1)

# + ================================================ =====
#================================================
# + ========Delete low expression (average value less than 1) genes step4============
# + ====================================
# + ==============================

#Add a new column to calculate the average
gene_avg_expression <- rowMeans(gse_TB_FTBI[, -1]) # Calculate the average expression of each gene, excluding the first column (gene name)
#Only remove genes whose expression level is zero in all samples (average value is less than 1)
gse_TB_FTBI_filtered_genes_1 <- gse_TB_FTBI[gene_avg_expression >= 1, ]

Low expression filtering scheme 2 (retaining the top 50% of genes expressed in the sample)

# + ============================================= ====================
#================================================== ===========
# + ========Delete low expression (top 50%) genes step5============
# + ==========================================
# + ================================


#Retain only genes expressed in more than half of the samples

# Calculate the average value of each gene in the gene expression matrix
gene_means <- rowMeans(gse_TB_FTBI_filtered_genes_1[,-1])

# Calculate the ranking percentile of the gene mean
gene_percentiles <- rank(gene_means) / length(gene_means)

# Get the threshold
threshold <- 0.25 # 25% threshold after deletion
#threshold <- 0.5 # 50% threshold after deletion
# Filter low expression genes based on threshold
gse_TB_FTBI_filtered_genes_2 <- gse_TB_FTBI_filtered_genes_1[gene_percentiles > threshold, ]

#Print the filtered gene expression matrix
dim(gse_TB_FTBI_filtered_genes_2) #[1] 17049 176

Remove duplicate genes and average

# + ============================================= ====================
#================================================== ===========
# + ========Duplicate genes, take the average step6============
# + ==========================================
# + ================================


dim(filtered_genes_2)
table(duplicated(filtered_genes_2$ID))


# Average the repeated Symbols
averaged_data <- aggregate(.~ID, filtered_genes_2, mean, na.action = na.pass) ##Average the repeated Symbols

#Name the row name SYMBOL
row.names(averaged_data) <- averaged_data$ID
dim(averaged_data)

#Remove missing values
matrix_na = na.omit(averaged_data)

#Delete the Symbol column (usually the first column)
matrix_final <- subset(matrix_na, select = -1)
dim(matrix_final) #[1] 22687 175

Outlier handling

# + ================================================ =================
#================================================== ===========
# + ========Outlier processing step7==========================
# + ==========================================
# + ================================


#Data outlier processing
#Handle extreme values
#Define vector extreme value processing function
#Used to handle outliers and replace values outside a certain range with the median to reduce the impact of outliers on subsequent analysis.
dljdz=function(x) {
  DOWNB=quantile(x,0.25)-1.5*(quantile(x,0.75)-quantile(x,0.25))
  UPB=quantile(x,0.75) + 1.5*(quantile(x,0.75)-quantile(x,0.25))
  x[which(x<DOWNB)]=quantile(x,0.5)
  x[which(x>UPB)]=quantile(x,0.5)
  return(x)
}

#The first column is set to the row name
matrix_leave=matrix_final_TB_LTBI

boxplot(matrix_leave,outline=FALSE, notch=T, las=2) ##Outline boxplot
dim(matrix_leave)

#Handle outliers
matrix_leave_res=apply(matrix_leave,2,dljdz)

boxplot(matrix_leave_res,outline=FALSE, notch=T, las=2) ##Outline boxplot
dim(matrix_leave_res)

log2 processing

# + ============================================= ====================
#================================================== ===========
# + ========log2 processing step8==========================
# + ==========================================
# + ================================


# limma function normalization, correction difference, automatic log2 expression matrix

#1. Normalization is not absolutely necessary, but it is recommended.
#In duplicate samples, external factors with no biological significance should affect the expression of a single sample.
#For example, the samples prepared in the first batch will generally have higher expression than the samples prepared in the second batch. It is assumed that the range and distribution of expression values of all samples should be similar.
#Normalization is required to ensure that the expression distribution of each sample is similar throughout the experiment.
#2. Normalization should be done before log2 standardization


library(limma)

exprSet=normalizeBetweenArrays(matrix_leave_res)

boxplot(exprSet,outline=FALSE, notch=T, las=2) ##Outline boxplot

## This step is very important to convert the matrix into a data frame.
class(exprSet) ##Note: The format of the data at this time is a matrix (Matrix)
exprSet <- as.data.frame(exprSet)


#Standardize the expression matrix to be automatically log2-formed
qx <- as.numeric(quantile(exprSet, c(0., 0.25, 0.5, 0.75, 0.99, 1.0), na.rm=T))
LogC <- (qx[5] > 100) ||
  (qx[6]-qx[1] > 50 & amp; & amp; qx[2] > 0) ||
  (qx[2] > 0 & amp; & amp; qx[2] < 1 & amp; & amp; qx[4] > 1 & amp; & amp; qx[4] < 2)

# Set all negative values to empty
#exprSet[exprSet <= 0] <- 0.0001
#Remove missing values
#exprSet = na.omit(exprSet) #15654
#save (exprSet,file = "waitlog_data_TB_LTBI.Rdata")

## Start judging
if (LogC) {
  exprSet [which(exprSet <= 0)] <- NaN
  ## Get log2
  exprSet_clean <- log2(exprSet + 1) #@@@@Whether to add one. If adding one, it will not produce a negative value@#@¥@#@#@%@%¥@@@@@@
  print("log2 transform finished")
}else{
  print("log2 transform not needed")
}


boxplot(exprSet_clean,outline=FALSE, notch=T, las=2) ##Outline boxplot

dataset_TB_LTBI =exprSet_clean

Drawing observation data

# + ================================================ =================
#================================================== ===========
# + ======== Draw box plots in different colors for the control group step9==========================
# + ==========================================
# + ================================

# Use the grepl function to determine whether the string contains 'LTBI' and color mark it for drawing.
group_data_TB_LTBI$group_color <- ifelse(grepl("LTBI", group_data_TB_LTBI$group_more), "yellow", "blue")

#Draw a box plot to view data distribution
group_list_color = group_data_TB_LTBI$group_color
boxplot( data.frame(dataset_TB_LTBI),outline=FALSE,notch=T,col=group_list_color,las=2)

dev.off()


# + ================================================ =================
#================================================== ===========
# + ========Draw hierarchical clustering diagram step10==========================
# + ==========================================
# + ================================
# +

#Check whether the sample names of the expression matrix and the order of sample names of the sublease information are consistent.
colnames(dataset_TB_LTBI)
group_data_TB_LTBI$geo_accession



exprSet =dataset_TB_LTBI
#Modify the name of GSM to group information
#colnames(exprSet)=paste(group_list,1:ncol(exprSet),sep = '')


#define nodePar
nodePar=list(lab.cex=0.6,pch=c(NA,19),cex=0.7,col='blue')
#clustering
hc=hclust(dist(t(exprSet))) #t() means transpose

#drawing
plot(as.dendrogram(hc),nodePar = nodePar,horiz = TRUE)

dev.off()





# + ================================================ =================
#================================================== ===========
# + ========Draw PCA scatter sample visualization step11====================
# + ==========================================
# + ================================



##PCA plot
#install.packages('ggfortify')
library(ggfortify)
df=as.data.frame(t(exprSet)) #After transposition, it becomes a matrix
dim(df) #View data dimensions
dim(exprSet)

df$group=group_data_TB_LTBI$group_more #Add sample group information
autoplot(prcomp(df[,1:ncol(df)-1]),data=df,colour='group') #PCA scatter plot

dev.off()

At this point, we have preprocessed the two data sets, and now we can perform difference analysis on the processed data.

The knowledge points of the article match the official knowledge files, and you can further learn relevant knowledge. Python entry skill treeHomepageOverview 377795 people are learning the system