Bioinformatics analysis of miRNA sequencing data – Lecture 4, examples of bioinformatics analysis of unknown species

Bioinformatics analysis of miRNA sequencing data – Lecture 4, examples of bioinformatics analysis of unknown species
- 1. Download sequencing data
- 2. Raw data quality control – software fastqc
- 3. Annotate tRNA and rRNA, use Rfam database – software blast, Rfam_statistics.py script
- 4. Annotate miRNA, including species, sequence and quantification, target genes and mapping
- - 4.1 Identification, using the miRBase database – software blast
  - 4.2 Quantification and miRNA sequence extraction
  - 4.3 miRNA target genes, predicted using miRanda and TargetScan software
  - - 4.3.1 miRanda software
    - 4.3.2 TargetScan software
    - 4.3.3 Integrate the prediction results of two software-script Total_Target.py
  - 4.4 Drawing miRNA-target gene interaction diagram – software Cytoscape
- 5. Summary

Bioinformatics analysis of miRNA sequencing data – Lecture 4, examples of bioinformatics analysis of unknown species

Let me emphasize again: Unknown species here refer to species for which miRNA sequencing is performed, and no miRNA information exists in the miRBase, miRDB, or miRTarbase databases. rather than not knowing the species for which miRNA sequencing was performed

1. Download sequencing data

It is the same as the sequencing data in the blog post (bioinformatics analysis of miRNA sequencing data – Lecture 3, examples of bioinformatics analysis of known species). Assume that the species for this data is Oecanthus indicus.
SRA number: DRR463940 Single-end sequencing Sequencing type: miRNA-seq File DRR463940.fastq

2. Raw data quality control – software fastqc

3. Annotate tRNA and rRNA, use Rfam database – software blast, Rfam_statistics.py script

4. Annotate miRNA, including species, sequence and quantification, target genes and drawing

Sequenced species Oecanthus indicus (Oin). There is no miRNA information for this species in the miRBase, miRDB, or miRTarbase databases.

4.1 Identification, using the miRBase database – software blast

cd /home/zhaohuiyao/miRNA_seq/DRR463940/02miRNA/unknown/01miRBase
#Query whether the species Oecanthus indicus is in the database miRBase
grep “Oecanthus indicus” /home/zhaohuiyao/Database/miRBase/organisms.txt #No return

#The abbreviation for editing the three characters yourself is: oin
/home/zhaohuiyao/Biosoft/general/ncbi-blast-2.10.0 + /bin/makeblastdb -in /home/zhaohuiyao/Database/miRBase/mature.fa -dbtype nucl -out /home/zhaohuiyao/Database/miRBase/mature
#Keep only one comparison result
/home/zhaohuiyao/Biosoft/general/ncbi-blast-2.10.0 + /bin/blastn -task blastn-short -db /home/zhaohuiyao/Database/miRBase/mature -query …/…/00Rawdata/DRR463940.fasta – out DRR463940_miRBase.annotations -outfmt 6 -evalue 1e-5 -num_alignments 1
head DRR463940_miRBase.annotations

#statistics
wc -l ./DRR463940_miRBase.annotations #63113 comparison results (63113/311289=20.27%)
cut -f 2 ./DRR463940_miRBase.annotations | awk ‘{name=substr($1,5,length($1)); print name}’ | sort | uniq | wc -l #420 species of miRNA (not focused on species, only focused on miRNA species)

4.2 Quantification and miRNA sequence extraction

miRNA sequence extraction:
Step 1: Extract the sequence name of the sequencing Reads corresponding to a miRNA (not focusing on species), and perform sequence extraction in the sequencing Reads to obtain all possible sequences of the miRNA
Step 2: Multiple sequence alignment (MATTF) of all possible sequences
Step 3: Based on the MAFFT comparison results, extract the consensus sequence, which is the sequence of the sequenced species miRNA.
Step 4: Integrate all miRNA sequences
miRNA sequence quantification:
When running step one above, a sequence quantification file will be generated

Step 1: Extract the sequence name of the sequencing Reads corresponding to a miRNA (not focusing on species), and perform sequence extraction in the sequencing Reads to obtain the possible sequence of the miRNA.
cd /home/zhaohuiyao/miRNA_seq/DRR463940/02miRNA/unknown/02Sequence_Quantity
python ./miRBase_sequence_unknown.py -i …/01miRBase/DRR463940_miRBase.annotations -data /home/zhaohuiyao/miRNA_seq/DRR463940/00Rawdata/DRR463940.fasta -s oin -o ./

#miRNA quantification results

#A certain miRNA corresponds to all sequencing Reads sequences

Step 2: Multiple sequence alignment (MATTF) of all possible sequences
#Install MAFFT
Download the latest installation package from the official website: https://mafft.cbrc.jp/alignment/software/
cd /home/zhaohuiyao/Biosoft/general
wget https://mafft.cbrc.jp/alignment/software/mafft-7.505-with-extensions-src.tgz
tar -zxvf ./mafft-7.505-with-extensions-src.tgz
cd mafft-7.505-with-extensions/core/
#Edit files, vim Makefile. Modify the first line of content, PREFIX = /usr/local ? PREFIX = /home/zhaohuiyao/Biosoft/general/mafft-7.505-with-extensions
make clean
make
make install
#Executable file:/home/zhaohuiyao/Biosoft/general/mafft-7.505-with-extensions/bin/mafft

cd /home/zhaohuiyao/miRNA_seq/DRR463940/02miRNA/unknown/02Sequence_Quantity
mkdir MAFFT & amp; & amp; cd MAFFT/
#Edit script mafft_run.sh
/bin/bash ./mafft_run.sh
#Each miRNA sequence file corresponds to a .mafft result file

Step 3: Based on the MAFFT comparison results, extract the consensus sequence, which is the miRNA sequence of this species.
cd /home/zhaohuiyao/miRNA_seq/DRR463940/02miRNA/unknown/02Sequence_Quantity/MAFFT/
mkdir consensus & amp; & amp; cd consensus/
#Edit script consensus_run.sh and consensus_seq.py
Standard for consistent sequences: Based on the results of multiple sequence alignments, count the frequency of bases at each position. The base with the highest frequency is the base at that position.
Possible issues involved: ① When the most frequent character in this position is the “-” character, ignore this position. ② When there are multiple characters with the highest frequency at this position, it involves the problem of degenerate bases. We will not consider it this time and select one randomly.

/bin/bash ./consensus_run.sh
#Each .mafft result file corresponds to a .concensus file

Step 4: Integrate all miRNA sequences
cd /home/zhaohuiyao/miRNA_seq/DRR463940/02miRNA/unknown/02Sequence_Quantity
cat ./MAFFT/consensus/oin-* > DRR463940_miRBase.annotations.fa

4.3 miRNA target genes, predicted using miRanda and TargetScan software

#Three subdirectories miranda/, TargetScan/ and Total/
#The operation of both software requires the preparation of two files. miRNA sequence files and mRNA sequence files.
miRNA sequence file: ① Mature miRNA sequence, 20-24nt in length; ② Seed sequence, 2-8 nucleotides at the 5’ end of mature miRNA.
mRNA sequence file: ① It can include other ncRNA sequences (for example, lncRNA can also target and bind to miRNA); ② It can be a complete mRNA sequence, or it can be only the 3` UTR sequence of the mRNA; ③ One gene can correspond to one mRNA sequence, There can also be multiple (will eventually be integrated to the target gene level. However, the mRNA-Gene relationship file needs to be provided)

#Preparing Files
#miRNA sequence file: the miRNA sequence file obtained in 4.2 above
#mRNA sequence file: This file includes other ncRNA sequences, complete mRNA sequences, and one gene corresponding to multiple mRNA sequences.
#mRNA-Gene relationship file

cd /home/zhaohuiyao/miRNA_seq/DRR463940/02miRNA/unknown/03Target/
#mRNA sequence file
wget https://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/000/001/405/GCF_000001405.40_GRCh38.p14/GCF_000001405.40_GRCh38.p14_rna.fna.gz
gunzip ./GCF_000001405.40_GRCh38.p14_rna.fna.gz
#mRNA-Gene relationship file
grep ">" GCF_000001405.40_GRCh38.p14_rna.fna | awk -v OFS="\t" '{split($0,arr," "); name=substr(arr[1], 2); split($0,arr1,")"); split(arr1[1],arr2,"("); print name,arr2[2]}' > mrna_gene.info

#Convert fasta from multiple lines to single line
/home/zhaohuiyao/Biosoft/seqkit seq -i -w 0 GCF_000001405.40_GRCh38.p14_rna.fna > GCF_000001405.40_GRCh38.p14_rna.fna.tmp
rm GCF_000001405.40_GRCh38.p14_rna.fna
mv GCF_000001405.40_GRCh38.p14_rna.fna.tmp GCF_000001405.40_GRCh38.p14_rna.fna

4.3.1 miRanda software

#Install softwaremiRanda
cd /home/zhaohuiyao/Biosoft/general
wget http://cbio.mskcc.org/microrna_data/miRanda-aug2010.tar.gz
tar -zxvf ./miRanda-aug2010.tar.gz
cdmiRanda-3.3a/
./configure –prefix=/home/zhaohuiyao/Biosoft/general/miRanda-3.3a
make
make install
#Executable file:/home/zhaohuiyao/Biosoft/general/miRanda-3.3a/bin/miranda

cd /home/zhaohuiyao/miRNA_seq/DRR463940/02miRNA/unknown/03Target/miRanda
ln -s …/GCF_000001405.40_GRCh38.p14_rna.fna oin_mrna.fasta
ln -s …/mrna_gene.info oin_gene_mrna.info.txt
/home/zhaohuiyao/Biosoft/general/miRanda-3.3a/bin/miranda …/…/02Sequence_Quantity/DRR463940_miRBase.annotations.fa oin_mrna.fasta -quiet -out DRR463940_miRBase.annotations.miRanda.original
#Parameter-quiet: Indicates not to output the miRNA-mRNA relationship that has not been targeted.
#Result file DRR463940_miRBase.annotations.miRanda.original. Only the lines starting with “>>” in this file are the predicted miRNA-target mRNA information lines.
grep “>>” DRR463940_miRBase.annotations.miRanda.original > DRR463940_miRBase.annotations.miRanda.tmp
# Combine the mRNA-Gene relationship file (oin_gene_mrna.info.txt) to obtain the mRNA-target gene relationship file
python3 ./miRanda_Target.py -i ./DRR463940_miRBase.annotations.miRanda.tmp -db ./oin_gene_mrna.info.txt -o ./
#Result fileDRR463940_miRBase.annotations.miRanda

4.3.2 TargetScan software

#Install software TargetScan
cd /home/zhaohuiyao/Biosoft/general
mkdir targetscan_50 & amp; & amp; cd targetscan_50/
wget https://www.targetscan.org/vert_50/vert_50_data_download/targetscan_50.zip
unzip ./targetscan_50.zip & amp; & rm ./targetscan_50.zip
#Executable file:/home/zhaohuiyao/Biosoft/general/targetscan_50/targetscan_50.pl

cd /home/zhaohuiyao/miRNA_seq/DRR463940/02miRNA/unknown/03Target/TargetScan
ln -s …/GCF_000001405.40_GRCh38.p14_rna.fna oin_mrna.fasta
ln -s …/mrna_gene.info oin_gene_mrna.info.txt

#TargetScan has requirements for the input file format. Please modify the files used by miRDB as required to obtain the new input files DRR463940_miRBase.annotations.fa.TargetScan and oin_mrna.fasta.TargetScan
#miRNA only requires the seed sequence (the 2nd to 8th nucleotides at the 5’ end of the miRNA), species number: View 1982312 in NCBI’s Taxonomy database
awk ‘{if($0~/^>/){printf “%s\t”, substr($0,2) } else { printf “%s\t%s\\
”, substr($0,2 ,7),“1982312”}}’ …/…/02Sequence_Quantity/DRR463940_miRBase.annotations.fa > RR463940_miRBase.annotations.fa.TargetScan

awk ‘{if($0~/^>/){printf “%s\t”, substr($0,2) } else { printf “%s\t%s\\
”, “1982312”,$0 }}’ oin_mrna.fasta > oin_mrna.fasta.TargetScan

perl /home/zhaohuiyao/Biosoft/general/targetscan_50/targetscan_50.pl DRR463940_miRBase.annotations.fa.TargetScan oin_mrna.fasta.TargetScan DRR463940_miRBase.annotations.TargetScan.original
# Combine the mRNA-Gene relationship file (oin_gene_mrna.info.txt) to obtain the mRNA-target gene relationship file
python ./TargetScan_Target.py -i ./DRR463940_miRBase.annotations.TargetScan.original -db ./oin_gene_mrna.info.txt -o ./
#Result fileDRR463940_miRBase.annotations.TargetScan

4.3.3 Integrate the prediction results of two software-script Total_Target.py

#Take the union to obtain the final miRNA-Gene relationship file
cd /home/zhaohuiyao/miRNA_seq/DRR463940/02miRNA/unknown/03Target/Total
python ./Total_Target.py -db1 …/miRanda/DRR463940_miRBase.annotations.miRanda -db2 …/TargetScan/DRR463940_miRBase.annotations.TargetScan -o ./
#Result fileDRR463940_miRBase.annotations.target

4.4 Drawing miRNA-target gene interaction diagram – software Cytoscape

5. Summary

The above is the analysis of miRNA for unknown species. There was overlap with analyzes of known species, focusing on the use of two prediction software, miRanda and TargetScan. There are many scripts involved in the above steps, but they are all very simple file content extraction and comparison.