Installation and use of TrimGalore software

Table of Contents

  • Installation and use of TrimGalore software
      • 1. Software installation
        • Install dependent software fastqc
        • Install dependent software cutadapt
        • Install TrimGalore
      • 2. Usage example
        • Rawdata download
        • fastqc quality control
        • TrimGalore filter
        • Parameter details
        • Result files (four files for each read file)

Trim Galore is a wrapper around FastQC and Cutadapt. It applies to all high-throughput sequencing data, paired-end or single-end, including RRBS, Illumina, Nextera and small RNA libraries. The software processes the data in the following four steps, but whether to filter the raw data, and with which criteria, depends closely on the FastQC results.

The usual filtering criteria are as follows:
① Remove bases at the 5′ end: not because their quality is poor, but because the signal is unstable at the start of sequencing and adapter dimers may be present. (This shows up in the Per base sequence content module of the FastQC report.)

② Remove low-quality bases at the 3′ end: in sequencing by synthesis, base quality declines toward the 3′ end. (This shows up in the Per base sequence quality module of the FastQC report.) A cutoff of Q20 is commonly used.

③ Remove adapter sequences: normally the adapter sequence is specified with --adapter. If it is not specified, trim_galore automatically searches for three adapter types: Illumina, Small RNA and Nextera. By default it reads the first one million sequences, decides which of the three types is present, and then removes it. To skip auto-detection, specify the adapter type with --illumina, --nextera or --small_rna. (The Adapter Content module of the FastQC report shows these three classic adapter types.)

④ Remove sequences that are too short: by default, a sequence shorter than 20 nt is discarded.
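
To make steps ① and ④ concrete, here is a minimal awk sketch that clips a fixed number of bases from the 5′ end of every read and then drops reads that end up too short. It only illustrates the idea and is not how Trim Galore implements it; the function name clip_and_filter is made up for this example, and it assumes an uncompressed four-line-per-record FASTQ on standard input.

```shell
# clip_and_filter CLIP MINLEN < in.fastq > out.fastq
# Illustration only: remove CLIP bases from the 5' end of each read,
# then keep the record only if at least MINLEN bases remain.
clip_and_filter() {
    awk -v clip="$1" -v minlen="$2" '
        NR % 4 == 1 { header = $0 }                  # @read name
        NR % 4 == 2 { seq = substr($0, clip + 1) }   # bases after the clip
        NR % 4 == 3 { plus = $0 }                    # separator line
        NR % 4 == 0 {
            qual = substr($0, clip + 1)              # keep qualities in sync
            if (length(seq) >= minlen) {
                print header; print seq; print plus; print qual
            }
        }'
}

# Example (hypothetical file names):
# clip_and_filter 4 20 < reads.fastq > reads.clipped.fastq
```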

1. Software installation

Official website: https://github.com/FelixKrueger/TrimGalore/

cd /home/zhaohuiyao/Biosoft/general

Install dependent software fastqc

wget https://www.bioinformatics.babraham.ac.uk/projects/fastqc/fastqc_v0.11.9.zip
unzip ./fastqc_v0.11.9.zip
rm ./fastqc_v0.11.9.zip
cd FastQC/
chmod +x ./fastqc
#Add to environment variables
echo 'export PATH=/home/zhaohuiyao/Biosoft/general/FastQC/:$PATH' >> ~/.bashrc
source ~/.bashrc

Install dependent software cutadapt

conda create -n cutadapt
conda activate cutadapt
conda install -c conda-forge -c bioconda cutadapt

Install TrimGalore

After downloading the installation package from the official website, upload it to the server.

tar -zxvf TrimGalore-0.6.10.tar.gz
cd ./TrimGalore-0.6.10/

#Executable file location: /home/zhaohuiyao/Biosoft/general/TrimGalore-0.6.10/trim_galore
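
trim_galore calls both dependencies, so before the first run it may be worth checking that they can be found on PATH (with the conda setup above, that means running conda activate cutadapt first). A small sketch follows; check_tools is a helper name made up for this example.

```shell
# check_tools NAME...
# Report which of the given programs can be found on PATH.
# Returns non-zero if at least one is missing.
check_tools() {
    missing=0
    for tool in "$@"; do
        if command -v "$tool" >/dev/null 2>&1; then
            echo "found: $tool"
        else
            echo "missing: $tool" >&2
            missing=1
        fi
    done
    return "$missing"
}

# Example:
# check_tools fastqc cutadapt
```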

2. Usage example

Rawdata download

Using Aspera software

/home/zhaohuiyao/.aspera/connect/bin/ascp -v -QT -l 400m -P33001 -k1 -i /home/zhaohuiyao/.aspera/connect/etc/asperaweb_id_dsa.openssh era-fasp@fasp.sra.ebi.ac.uk:/vol1/fastq/SRR225/074/SRR22522274/SRR22522274_1.fastq.gz ./
/home/zhaohuiyao/.aspera/connect/bin/ascp -v -QT -l 400m -P33001 -k1 -i /home/zhaohuiyao/.aspera/connect/etc/asperaweb_id_dsa.openssh era-fasp@fasp.sra.ebi.ac.uk:/vol1/fastq/SRR225/074/SRR22522274/SRR22522274_2.fastq.gz ./
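
The downloaded files are gzip-compressed (.fastq.gz), while the commands in the following steps use .fastq, so a decompression step is presumably needed in between; that is an assumption, since the original steps do not show it. The sketch below first tests each archive and then writes an uncompressed copy next to it. decompress_fastq is a helper name made up for this example. FastQC and Trim Galore can also read gzipped FASTQ directly, so this step can be skipped if the .gz names are used instead.

```shell
# decompress_fastq FILE.gz...
# Test each archive with gzip -t, then write an uncompressed copy
# next to it (the .gz file itself is kept).
decompress_fastq() {
    for gz in "$@"; do
        if gzip -t "$gz" 2>/dev/null; then
            gzip -dc "$gz" > "${gz%.gz}"
            echo "ok: ${gz%.gz}"
        else
            echo "corrupt or incomplete: $gz" >&2
            return 1
        fi
    done
}

# Example:
# decompress_fastq SRR22522274_1.fastq.gz SRR22522274_2.fastq.gz
```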

fastqc quality control

fastqc -t 2 -o ./ SRR22522274_1.fastq SRR22522274_2.fastq

#Review the results of the 10 modules to prepare for the subsequent filtering
#Findings: ① 5′ end: clip 4 bp. ② 3′ end: no trimming needed, the quality is very high. ③ The adapter is the Illumina Universal Adapter and should be trimmed. ④ There are too many duplicated sequences that need to be removed (possibly because of a large number of adapter dimers, in which case the value will drop after adapter trimming. If other sequences are duplicated, other software can handle it. trim_galore also offers a --clock mode, but I did not try it.)
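
Besides the HTML report, each FastQC zip archive contains a summary.txt with one tab-separated line per module (status, module name, file name). A quick way to see which modules need attention is to list every module that did not pass. This is a rough sketch; summarize_fastqc is a helper name made up for this example.

```shell
# summarize_fastqc < summary.txt
# Print every FastQC module whose status is not PASS (i.e. WARN or FAIL).
summarize_fastqc() {
    awk -F '\t' '$1 != "PASS" { print $1 ": " $2 }'
}

# Example, reading summary.txt straight out of the zip archive:
# unzip -p SRR22522274_1_fastqc.zip SRR22522274_1_fastqc/summary.txt | summarize_fastqc
```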

TrimGalore filter

/home/zhaohuiyao/Biosoft/general/TrimGalore-0.6.10/trim_galore -q 20 --phred33 --fastqc --stringency 3 --length 20 -e 0.1 --paired --dont_gzip --clip_R1 4 --clip_R2 4 --illumina -o ./ …/SRR22522274_1.fastq …/SRR22522274_2.fastq
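
Since long option names are easy to mistype (every long option takes two hyphens), it can help to wrap the call in a small function with a dry-run switch that only prints the command. The sketch below uses the same options as the command above. run_trim and the DRY_RUN variable are names made up for this example, trim_galore is assumed to be on PATH, and the file names in the usage line are placeholders.

```shell
# run_trim R1.fastq R2.fastq OUTDIR
# Paired-end trimming with the options used in this walkthrough.
# With DRY_RUN=1 the command is only printed, not executed.
run_trim() {
    r1=$1; r2=$2; outdir=$3
    set -- trim_galore -q 20 --phred33 --fastqc --stringency 3 \
        --length 20 -e 0.1 --paired --dont_gzip \
        --clip_R1 4 --clip_R2 4 --illumina -o "$outdir" "$r1" "$r2"
    if [ "${DRY_RUN:-0}" = 1 ]; then
        echo "$*"            # show the command without running it
    else
        mkdir -p "$outdir"   # make sure the output directory exists
        "$@"
    fi
}

# Example (placeholder file names):
# DRY_RUN=1 run_trim sample_1.fastq sample_2.fastq ./trimmed
```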

Parameter details

**-q/--quality:** Phred quality score threshold for trimming low-quality 3′ ends. Default: 20.
**--phred33/--phred64:** Quality score encoding, either --phred33 or --phred64. Default: --phred33.
**--fastqc:** Run FastQC on the filtered result files. To pass options to FastQC, use --fastqc_args, for example --fastqc_args "-t 10 -o ./"
**--adapter:** Adapter sequence for R1. If omitted, Trim Galore auto-detects the most likely adapter type (reading the first one million sequences by default). Three types are searched, and each can also be selected explicitly with --illumina, --nextera or --small_rna.
**--adapter2:** Adapter sequence for R2. Used in the same way as --adapter.
**--illumina:** Illumina universal adapter AGATCGGAAGAGC
**--stranded_illumina:** Illumina stranded mRNA or Total RNA adapter ACTGTCTCTTATA
**--nextera:** Nextera transposase adapter CTGTCTCTTATA
**--small_rna:** Illumina Small RNA 3′ adapter TGGAATTCTCGG
**--max_length:** Maximum sequence length; longer reads are discarded. Used in small RNA sequencing to remove non-small-RNA sequences.
**--stringency:** Minimum overlap (in bases) with the adapter sequence required before a read is trimmed. Default: 1.
**-e:** Maximum allowed error rate. Default: 0.1.
**--length:** Minimum read length after trimming; shorter reads are discarded. Default: 20.
**--paired:** For paired-end data. If one read of a pair is discarded, its mate is discarded as well, whether or not it meets the criteria.
**--retain_unpaired:** For paired-end data. If one read of a pair meets the criteria but its mate is discarded, the passing read is saved to a separate file.
**--gzip and --dont_gzip:** Compress the cleaned output with gzip, or leave it uncompressed.
**--output_dir:** Output directory. Creating the directory in advance avoids errors on versions that do not create it automatically.
**--clip_R1/--clip_R2:** Number of bases to remove from the 5′ end of R1/R2.
**--three_prime_clip_R1/--three_prime_clip_R2:** Number of bases to remove from the 3′ end of R1/R2.
**-j:** Number of cores.
**--rrbs:** For RRBS data digested with MspI.
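
To get a feel for what -q 20 and --phred33 mean: in Phred+33 encoding a quality character's score is its ASCII code minus 33, and a score Q corresponds to an error probability of 10^(-Q/10). The two helpers below are a small sketch of that arithmetic; the function names are made up for this example.

```shell
# phred33_score CHAR
# Phred score of a single quality character in Phred+33 encoding.
phred33_score() {
    printf '%d\n' "$(( $(printf '%d' "'$1") - 33 ))"
}

# phred_error_prob Q
# Error probability for a Phred score Q: 10^(-Q/10).
phred_error_prob() {
    awk -v q="$1" 'BEGIN { printf "%.4f\n", 10 ^ (-q / 10) }'
}

# Examples:
# phred33_score I       # → 40
# phred33_score 5       # → 20
# phred_error_prob 20   # → 0.0100 (1 error in 100 bases)
```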

Result files (four files for each read file)

Trimming report: SRR22522274_1.fastq_trimming_report.txt
Filtered sequence file: SRR22522274_1_val_1.fq
FastQC results for the filtered sequence file: SRR22522274_1_val_1_fastqc.zip and SRR22522274_1_val_1_fastqc.html
# The results look fine after filtering. Calculate the retention rate
# Compute the statistics for each file separately

awk 'BEGIN{number=0;len=0} {if(NR%4 == 1){number += 1}} {if(NR%4 == 2){len += length($0)}} END{printf "%s%d\t%s%d\n", "Total number: ",number,"Total length: ",len }' ./SRR22522274_1_val_1.fq

# Calculate the percentage

echo "17159381/17207575" | bc -l
echo "1036068944 + 1036202087" | bc -l
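
The same calculation can be wrapped in two small helpers, sketched here with awk instead of bc so the percentage is formatted directly. fastq_read_count and retention_pct are names made up for this example, and the FASTQ files are assumed to be uncompressed with four lines per record.

```shell
# fastq_read_count FILE
# Number of reads in an uncompressed FASTQ file (four lines per record).
fastq_read_count() {
    awk 'END { printf "%d\n", NR / 4 }' "$1"
}

# retention_pct KEPT TOTAL
# Percentage of reads kept after filtering.
retention_pct() {
    awk -v kept="$1" -v total="$2" 'BEGIN { printf "%.2f%%\n", 100 * kept / total }'
}

# Examples:
# fastq_read_count ./SRR22522274_1_val_1.fq
# retention_pct 17159381 17207575   # → 99.72%
```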
