Basic process of bioinformatics analysis–single sample analysis

I’m currently learning single-cell bioinformatics analysis, and I’m a little confused. . . . Let’s make some notes first, and we’ll see what needs to be changed later.

Speaking of which, why is it that the B-site of biocredit analysis only ends with this basic process. . There should be more after reading the literature. . I don’t know how to learn.

Suggestion: After each step, it is recommended:

saveRDS(SZDX0224_val, file = "./Data/SZDX-0224_outs/SZDX0224_val.rds")

It is convenient to check at any time later:

SZDX0224_val <- readRDS(file = "./Data/SZDX-0224_outs/SZDX0224_val.rds")

The structure of this article: core code, explanation of some functions, and result graph

1. R package loading and installation

if(!require(multtest))BiocManager::install("multtest")
if(!require(Seurat))install.packages("Seurat")
if(!require(dplyr))install.packages("dplyr")
if(!require(patchwork))install.packages("patchwork")
if(!require(R.utils))install.packages("R.utils")
rm(list = ls());gc() #Delete all environment variables and organize memory space

2. Data reading

# Relative path to build data folder
relative_path <- file.path(base_path, "Data", "SZDX-0224_report", "SZDX-4-0224 starsolo", "SZDX-4-0224", "filtered")
# Automatically read 10X data, which are some tsv and mtx files.
SZDX0224 <- Read10X(relative_path)
# Single cell analysis is generally implemented using the Seurat package, so we also need to convert it into a Seurat object.
SZDX0224_project <- CreateSeuratObject(counts = SZDX0224, project = "SZDX0224", min.cells = 0, min.features = 200)

Here, we are using one of our own data, SZDX0224, obtained by 10x.

Among them, the parameters of the CreateSeuratObject function are:

CreateSeuratObject(counts = NULL, project = NULL, min.cells = 3, min.features = 200, sep = "_", group.by = NULL, block.by = NULL, merge.data = TRUE , do.normalize = TRUE, do.scale = TRUE, gene.column = 1, sample.names = NULL, expression.family = NegBinomial, nUMI = NULL, local.only = FALSE)
  • counts (required): A matrix or data frame containing raw count data from single-cell RNA sequencing, with rows representing genes and columns representing cells/Barcodes. The values in the matrix correspond to the number of cells containing the gene (we will have many cells when doing experiments)
  • project (optional): a string representing the name of the Seurat object you are using or the name of the project. Here is the variable name assigned by Read10x earlier.
  • min.cells (optional): An integer specifying the minimum number of genes a cell must have or be excluded from analysis. (A cell has multiple genes, and the settings can be modified to filter according to the actual situation)
  • min.feature (optional): an integer specifying the minimum number of UMIs (Unique Molecular Identifiers) that a cell must have, otherwise it will be excluded from analysis. (Minimum UMI number: UMI is the ID card of each transcribed RNA. When performing PCR, the UMI generated by each amplification (transcription) is different, that is, the ID number is different. How to calculate the UMI number? Amplified When the time comes, the Barcode will not change, that is, your nationality will not change, but there will be more people. So the same Barcode/nationality, but different UMI/ID card can be counted)
  • There are relatively few other parameters, you can learn about them by yourself

Get it from Read10, generally don’t pay attention to this:

i is the row index of non-zero values in the matrix, and p is the column index of non-zero values in the matrix, so i and p are the row and column coordinates of non-zero values. Dim is the matrix dimension. Dimnames are the row names (gene names) and column names (cell names) of this matrix. Dimnames[[1]] contains gene names, and Dimnames[[2]] contains cell names. x is a sparse matrix, and the contents inside are the values in the matrix. factors contains factor or grouping information associated with columns in the data matrix. These factors usually represent different samples, cell types, batches, etc., and little information is saved here.

Converted Seurat object SZDX0224_project:

A preliminary look at the Seurat object SZDX0224_project:

36601 Genes, 728 cells

Get the number of cells:

Both ncol(SZDX0224_project) and ncol(SZDX0224) are available

3. Calculate the proportion of mitochondrial genes

The reason for calculating the proportion of mitochondrial genes: When the mitochondrial RNA content is too high, it may mean that the cell’s nucleus is less active, which means that the normal cytoplasmic RNA content decreases. There are two situations: 1. The cell activity is not very good and needs to be filtered out. 2. The mitochondria content of the cell itself is relatively high. Here, percent is the RNA detected in this Barcode, and percent is mitochondrial RNA (the unit of percent is %)

SZDX0224_project[["percent.mt"]] <- PercentageFeatureSet(SZDX0224_project, pattern = "^MT-")




#### Check the effect ####
# Display the proportion and corresponding number of mitochondrial genes in the form of a bar chart
hist(SZDX0224_project[["percent.mt"]]$percent.mt)

# The human source data is MT, the rat source data needs to be replaced by mt, and the first 5 are selected.
head([email protected],5)

# Display the sequenced cells in the form of a violin plot
VlnPlot(SZDX0224_project, features = c("nFeature_RNA", "nCount_RNA", "percent.mt"), ncol = 3) # ncol is how many pictures are displayed in each line

#Represented in the form of a scatter plot
plot1 <- FeatureScatter(SZDX0224_project, feature1 = "nCount_RNA", feature2 = "percent.mt")
plot2 <- FeatureScatter(SZDX0224_project, feature1 = "nCount_RNA", feature2 = "nFeature_RNA")
if(!require(patchwork))install.packages("patchwork")
CombinePlots(plots = list(plot1, plot2)) 

PercentageFeatureSet function:

PercentageFeatureSet(object, features = NULL, pattern = NULL, slot = "RNAcounts", subset = NULL)
  • object (required): A Seurat object containing single-cell RNA sequencing data.
  • features (optional): A character vector containing the names of genes for which percentages are to be calculated. If you want to know the status of some genes, you can directly enter features = c(“gene name1”, “gene name2”, “gene name3”, “gene name4”, “gene Name 5”)
  • pattern (optional): One character used to filter gene names with a matching pattern. Here we use ^MT-, which stands for human mitochondrial gene. Generally speaking, just choose one of the two parameters features and pattern.

After searching, the meta.data of SZDX0224_project has changed:

Bar chart to view the proportion of mitochondrial genes:

Violin plot:

Scatter plot:

Each Each point is a cell, nFeature_RNA refers to the number of genes detected in each cell, and nCount_RNA represents the number of RNA recognized by each cell, which is UMI (several RNAs are detected for each gene). For the first picture, when the amount of RNA detected is very small, but the proportion of mitochondria is high (more towards the upper left corner), it means that the cell is likely to be dead.

It can be seen that the quality of our cells is still relatively good.

4. Filter unqualified data

SZDX0224_val <- subset(SZDX0224_project, subset = nFeature_RNA > 200 & nFeature_RNA < 7000 & percent.mt < 20)

subset function:

subset(object, subset, ...)
  • object (required): A Seurat object containing single-cell RNA sequencing data and other related information.
  • subset (required): A logical expression that specifies the conditions for extracting a subset of cells. Only cells that meet the criteria will be included in the subset.
  • …: Other parameters usually do not need to be used

Can be viewed Until 5 cells have been filtered out, 4 means that meta.data now has four data

5. Standardization

The purpose of normalization is to eliminate technical differences and batch effects between different samples, and to make expression values between different cells comparable in order of magnitude.

SZDX0224_val <- NormalizeData(SZDX0224_val, normalization.method = "LogNormalize", scale.factor = 10000) # Normalize to 10000
SZDX0224_val <- FindVariableFeatures(SZDX0224_val, selection.method = "vst", nGenes = 2000) # Find 2000 significantly variable gene features
SZDX0224_val <- ScaleData(SZDX0224_val, vars.to.regress = c("percent.mt")) # vars.to.regress is the proportion of mitochondrial genes removed

NormalizDatah function:

NormalizeData(object, normalization.method = "LogNormalize", scale.factor = 10000, do.scale = TRUE, genes.use = NULL)
  • object (required): A Seurat object containing raw single-cell RNA sequencing data.
  • normalization.method (optional): One character specifying the normalization method. Commonly used methods include: “LogNormalize” (default): logarithmically transforms the data to stabilize the variance and scales the data to a constant (default is 10,000). “CLR”: Performs Centered Log Ratio transformation. “SCT”: Use the SCTransform method for normalization.
  • scale.factor (optional): A number that determines the scaling factor for normalized data. The default value is 10,000.
  • do.scale (optional): A logical value indicating whether to scale the data. If TRUE (default), scaling is performed.
  • genes.use (optional): A character vector containing the gene names to use for normalization. If not specified, all genes will be normalized

FindVariableFeatures function:

FindVariableFeatures(object, selection.method = "vst", nfeatures = NULL, min.mean = 0.0125, max.mean = 3, min.disp = 0.5)
  • object (required): A Seurat object containing raw single-cell RNA sequencing data.
  • selection.method (optional): A character specifying the method used to select variant features. Commonly used methods include: “vst” (default value): Use the Variance Stabilizing Transformation (VST) method. “dispersion”: Use a dispersion-based method. “mean.var.plot”: Use mean variance plot.
  • nfeatures (optional) / is equivalent to nGene: an integer specifying the number of variant features to select. If not specified, the function selects features that meet other criteria.
  • Other parameters generally have default values.

ScaleData function:

ScaleData(object, features = NULL, vars.to.regress = NULL, model.use = "vars.to.regress", do.scale = TRUE, do.center = TRUE, block.size = 2000 )
  • object (required): A Seurat object containing raw single-cell RNA sequencing data.
  • vars.to.regress (optional): A character vector containing the names of technical variables to be removed in normalization. These variables often represent uninteresting technical effects, such as cell count or batch effects.
  • Other parameters generally have default values.

Standardized SZDX0224_val:

6. PCA dimensionality reduction/linear dimensionality reduction/principal component analysis

Generally, hypervariable genes are analyzed directly

SZDX0224_val <- RunPCA(SZDX0224_val, features = VariableFeatures(object = SZDX0224_val)) # Linear dimensionality reduction



# Displaying the first two dimensions in the form of a dot plot is of little significance
VizDimLoadings(SZDX0224_val, dims = 1:2, reduction = "pca")
# Obtain the scatter plot of the principal components. Based on this plot, you can select how many principal components are needed for dimensionality reduction.
ElbowPlot(SZDX0224_val, ndims = 50) # According to the picture here, you can see that it is enough to select 15 principal components.
# It doesn’t make much sense to determine which principal components to choose in the form of a heat map.
DimHeatmap(SZDX0224_val, dims = 1:20, cells = 500, balanced = T)
# Also used to select principal components
SZDX0224_val <- JackStraw(SZDX0224_val, num.replicate = 100)
SZDX0224_val <- ScoreJackStraw(SZDX0224_val, dims = 1:20)
JackStrawPlot(SZDX0224_val, dims = 1:20)

RunPCA function:

RunPCA(object, features = NULL, npcs = 50, ndims.print = 10, subset.row = NULL, do.print = FALSE, pcs.print = 1:5, seed.use = NULL)
  • object (required): A Seurat object containing single-cell RNA sequencing data.
  • features (optional): A character vector containing the names of the features (usually genes) to be used for PCA analysis. If not specified, PCA analysis will be performed on all features.
  • Other parameters can be left as default

Scatter plot:

What is obtained here is the difference between genes in each principal component. Generally, the way to choose the principal component depends on which point starts to smooth.

7. Nonlinear dimensionality reduction + clustering

SZDX0224_val <- FindNeighbors(SZDX0224_val, dims = 1:15)
SZDX0224_val <- FindClusters(SZDX0224_val, resolution = 0.6) # If you want to make the clustering more detailed (the groups can be divided into finer details), you can change it to a larger value, such as 1.2, etc. Generally, you can divide it into a coarser grouping first.
SZDX0224_val <- RunUMAP(SZDX0224_val, dims = 1:15) # Nonlinear dimensionality reduction
SZDX0224_val <- RunTSNE(SZDX0224_val, dims = 1:15)


#### Check the clustering effect
DimPlot(SZDX0224_val, reduction = "umap", label = T)
DimPlot(SZDX0224_val, reduction = "tsne", label = T)

I won’t explain each function in detail here. Basically, these parameters are used, and the others are kept as default.

tsne diagram:

The result of SZDX0224_val afterwards:

8. Remove multiple droplets

It is inevitable that there will be multiple packages. Use DoubletFinder to remove them here.

What is the specific removal method? . . I don’t know much about it, I just know that this can remove multiple bags of cells.

devtools::install_github('chris-mcginnis-ucsf/DoubletFinder')
library(DoubletFinder)
saveRDS(SZDX0224_val, file = "./Data/SZDX-0224_outs/SZDX0224.rds")
readRDS(SZDX0224_val, file = "./Data/SZDX-0224_outs/SZDX0224_val.rds")
SZDX0224_val_db <- SZDX0224_val
# Find the optimal pK value
sweep.res.list_SZDX0224_val <- paramSweep_v3(SZDX0224_val_db, PCs = 1:30, sct = F)
sweep.stats_SZDX0224_val <- summarizeSweep(sweep.res.list_SZDX0224_val, GT = F)
bcmvn_SZDX0224_val <- find.pK(sweep.stats_SZDX0224_val)
pK_value<-as.numeric(as.vector(bcmvn_SZDX0224_val$pK[which.max(bcmvn_SZDX0224_val$BCmetric)])) # Best pK value
# Double cell ratio calculation
annotations <- [email protected]$seurat_clusters
homotypic.prop <- modelHomotypic(annotations) # Proportion of cells of the same type
DoubletRate <- 0.004 # Look up table
nExp_poi <-round(DoubletRate*length(SZDX0224_val$seurat_clusters)) # It is better to provide celltype instead of seurat_clusters.
nExp_poi.adj <- round(nExp_poi*(1-homotypic.prop)) # Correction
#Set pN and pANN_value (this is usually F)
pN_value <- 0.25
pANN_value <- paste0("pANN_", pN_value, "_", pK_value, "_", nExp_poi)
# There are two methods for calculating multi-packets, whichever one works better, usually the second one
# SZDX0224_val_db <- doubletFinder_v3(SZDX0224_val, PCs = 1:30, pN = pN_value, pK = pK_value, nExp = nExp_poi, reuse.pANN = F)
SZDX0224_val_db <- doubletFinder_v3(SZDX0224_val, PCs = 1:30, pN = pN_value, pK = pK_value, nExp = nExp_poi.adj, reuse.pANN = F)
[email protected][1:5, ]
SZDX0224_val$doublets <- SZDX0224_val_db$DF.classifications_0.25_0.29_3 # According to the last column of the result of the above line of code, strictly speaking, it is the previous column



#### Draw tsne to view the effect
DimPlot(SZDX0224_val, reduction = "tsne", group.by = "doublets")
DimPlot(SZDX0224_val, reduction = "tsne", group.by = "seurat_clusters", label = T)

The multi-packet rate DoubletRate inside is checked according to the following table (here it is 10x):

Multipack cells picked out:

9. Highly expressed or specifically expressed genes

SZDX0224.markers <- FindAllMarkers(SZDX0224_val, only.pos = TRUE, min.pct = 0.25, logfc.threshold = 0.25)
# Save marker
write.csv(SZDX0224.markers, './Data/SZDX-0224_outs/SZDX0224.markers.csv')
# Obtain the data of the first two genes among the highly expressed genes of each cluster (including gene names, etc.)
SZDX0224.markers %>% group_by(cluster) %>% top_n(n = 2, wt = avg_log2FC)

# When you know the sample, you know which groups (which cells) there are, and based on the genes it has, check which group it is in.
#The following are the renderings of the following highly expressed genes in each group
#Features are the genes that you know a certain cell has
# tsne picture display
FeaturePlot(SZDX0224_val, features = c("HIST2H2AC", "HIST1H3B", "NEAT1", "MIR924HG", "BCLAF1", "HIST1H2BO", "MT-CO1\ ", "MT-ND5"), reduction = "tsne", label = T)

# The form of a violin plot shows the grouping of a certain gene in this sample (the form of a violin plot shows which group it is in)
VlnPlot(SZDX0224_val, features = c("HIST2H2AC", "HIST1H3B", "NEAT1", "MIR924HG", "BCLAF1", "HIST1H2BO", "MT-CO1\ ", "MT-ND5"), pt.size = 0)

# Display the expression of the top 10 genes among the highly expressed genes in each cluster in the form of a heat map
top10 <- SZDX0224.markers %>% group_by(cluster) %>% top_n(n = 10, wt = avg_log2FC)
DoHeatmap(SZDX0224_val, features = top10$gene, label = F, slot = "scale.data")

FindAllMarkers function: only.pos = TRUE indicates that only highly expressed genes are displayed, min.pct = 0.25 indicates that the proportion in cells is at least 25%, logfc.threshold = 0.25 is the minimum log-fold change of differentially expressed genes (log-fold change).

The following are the highly expressed genes in each cluster

The following is the tsne diagram of each hypervariable gene. The darker the color, the higher the expression level.

# If you want to compare a certain group with another group, you can run the following code
marker <- FindMarkers(SZDX0224_val, ident.1 = "0", ident.2 = "1", only.pos = TRUE, min.pct = 0.25, logfc.threshold = 0.25)

Here ident.1 = “0” and ident.2 = “1” represent group 0 and group 1. Others are the same as the FindAllMarkers function.

p_val is used to measure the degree of differential expression of genes. The smaller the value, the more significant the expression difference between different cell populations. p_val_adj is the corrected p_val. Generally use p_val_adj.

pct.1 and pct.2 are the expression rates of genes in the corresponding groups respectively.

10. Cell type labeling (that is, changing the original number to the corresponding cell name)

new.cluster.ids <- c("Cell Name0", "Cell Name1", "Cell Name2", "Cell Name3") # Cell type annotation vector
names(new.cluster.ids) <- levels(SZDX0224_val) # Match each element of the created new cell cluster name list to the cell cluster name (0, 1, 2, 3) in the original data
SZDX0224_val <- RenameIdents(SZDX0224_val, new.cluster.ids) # Change cell type labeling
DimPlot(SZDX0224_val, reduction = "tsne", label = T)
table(Idents(SZDX0224_val)) # Check how many cells there are
[email protected][1:5, ]
SZDX0224_val$cell_cluster_origin <- Idents(SZDX0224_val)
[email protected][1:5, ]
saveRDS(SZDX0224_val, file = "./Data/SZDX-0224_outs/SZDX0224_val.rds")