Identification and time estimation of genomic WGD events

1. Introduction to WGD

Whole-genome duplication (WGD) is one of the major drivers of biological evolution (the factors that lead to genome expansion include whole-genome duplication and transposable elements, TEs), so WGD analysis is a commonly used method in genome analysis.

There are two methods for detecting ancient WGD: collinearity analysis and the Ks distribution plot. Ks is defined as the average number of synonymous substitutions per synonymous site. Its counterpart, Ka, is the average number of nonsynonymous substitutions per nonsynonymous site.

If there has been no WGD or large segmental duplication, the Ks values of paralogous gene pairs in the genome follow an exponential distribution. Otherwise, an approximately normal peak caused by the WGD appears in the Ks distribution plot. The age of an ancient WGD can be estimated from the number of synonymous substitutions at these peaks (Tiley et al., 2018).
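To make the dating step concrete, here is a minimal sketch of the commonly used conversion T = Ks / (2r), where r is the synonymous substitution rate per site per year. The function name, the peak value 0.6 and the rate 6.5e-9 are illustrative assumptions, not values from this article; an appropriate rate has to be chosen for the lineage being studied.

```python
def wgd_age_mya(ks_peak: float, subs_per_site_per_year: float) -> float:
    """Convert a Ks peak into an age in million years using T = Ks / (2 * r)."""
    if subs_per_site_per_year <= 0:
        raise ValueError("substitution rate must be positive")
    years = ks_peak / (2.0 * subs_per_site_per_year)
    return years / 1e6


# Hypothetical inputs: a Ks peak at 0.6 and r = 6.5e-9 substitutions/site/year.
age = wgd_age_mya(0.6, 6.5e-9)
print(f"Estimated WGD age: {age:.1f} million years")
```

The factor 2 accounts for substitutions accumulating independently in both duplicated copies since the duplication.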

2. Ka/Ks and positive selection

Ka/Ks is the ratio of the nonsynonymous substitution rate (Ka) to the synonymous substitution rate (Ks). This ratio indicates whether a protein-coding gene has been under selective pressure.

A synonymous mutation (counted in Ks) does not change the amino acid sequence because of codon degeneracy, and therefore does not affect protein structure or function.

A nonsynonymous mutation (counted in Ka) changes the amino acid sequence, may alter protein structure and function, and may therefore be subject to natural selection.

Synonymous mutations are generally assumed to be free from natural selection, whereas nonsynonymous mutations are subject to it. In evolutionary analysis it is therefore informative to know the rates at which synonymous and nonsynonymous mutations occur. The synonymous substitution rate is Ks, the nonsynonymous substitution rate is Ka, and their ratio is Ka/Ks. If Ka/Ks > 1, the gene is considered to be under positive selection; if Ka/Ks = 1, it is considered to be evolving neutrally; if Ka/Ks < 1, it is considered to be under negative selection, also called purifying selection.
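The three cases can be written down as a small helper. This is only a sketch: the function name and the tolerance used to decide "equal to 1" are assumptions, and in practice a statistical test is usually needed before calling a ratio different from 1.

```python
def classify_selection(ka: float, ks: float, tol: float = 1e-9) -> str:
    """Classify a gene pair by its Ka/Ks ratio.

    Returns "positive", "neutral", "purifying", or "undefined" when Ks is
    zero or negative and the ratio cannot be computed.
    """
    if ks <= 0:
        return "undefined"
    omega = ka / ks
    if abs(omega - 1.0) <= tol:
        return "neutral"
    return "positive" if omega > 1.0 else "purifying"


# Hypothetical Ka and Ks values for three gene pairs.
for ka, ks in [(0.02, 0.20), (0.30, 0.10), (0.10, 0.10)]:
    print(ka, ks, classify_selection(ka, ks))
```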

3. MCScanX

MCScanX is a program that detects collinear blocks within the genome of one species or between the genomes of different species. It takes the blastp alignments of proteins within or between species, combines them with the genomic coordinates of the genes encoding these proteins, and derives the collinear blocks within or between species.

For installation and detailed usage of MCScanX, refer to the official website. It is relatively easy to install and use. http://chibba.pgml.uga.edu/mcscan2/#tm

3.1 Intra-species collinearity analysis

MCScanX requires two input files for synteny analysis: sample.gff (four columns of data) and sample.blast.

1. sample.gff

sample.gff contains four columns: the first is the chromosome ID, the second is the gene ID, and the third and fourth are the start and end positions.

## Prepare the sample.gff file from the gff3 file (assumes ID= is the first attribute in column 9)
awk -F'\t' -v OFS='\t' '$3=="gene"{split($9,a,";"); sub(/^ID=/,"",a[1]); print $1,a[1],$4,$5}' sample.gene.gff3 > sample.gff
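When the ID attribute is not guaranteed to come first in column 9, parsing the attributes by name is more robust than text substitution. The following is a minimal sketch of that idea; the function name and the example records are made up for illustration, and it assumes a standard tab-delimited GFF3 with key=value attributes separated by semicolons.

```python
def gff3_to_mcscanx(lines):
    """Yield (chromosome, gene_id, start, end) for each 'gene' feature."""
    for line in lines:
        if line.startswith("#") or not line.strip():
            continue  # skip comments, directives and blank lines
        cols = line.rstrip("\n").split("\t")
        if len(cols) < 9 or cols[2] != "gene":
            continue
        # Column 9 holds key=value pairs separated by semicolons.
        attrs = dict(
            field.strip().split("=", 1)
            for field in cols[8].split(";")
            if "=" in field
        )
        gene_id = attrs.get("ID")
        if gene_id:
            yield cols[0], gene_id, cols[3], cols[4]


# Hypothetical GFF3 records; note that ID is not the first attribute.
example = [
    "##gff-version 3\n",
    "Chr1\tsrc\tgene\t100\t900\t.\t+\t.\tName=abc;ID=gene1;biotype=protein_coding\n",
    "Chr1\tsrc\tmRNA\t100\t900\t.\t+\t.\tID=gene1.t1;Parent=gene1\n",
]
for row in gff3_to_mcscanx(example):
    print("\t".join(row))
```

To write a real file, the same generator can be fed an open file handle for sample.gene.gff3 and its rows written out tab-separated to sample.gff.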

2. sample.blast

## Build a BLAST database from the protein sequences
makeblastdb -in sample.pep.fa -dbtype prot -out index/sample.pep

## Run an all-vs-all self-alignment and write the result to sample.blast in tabular format (-outfmt 6); the trailing & runs the job in the background, so wait for it to finish before running MCScanX
blastp -query sample.pep.fa -db index/sample.pep -out sample.blast -evalue 1e-5 -num_threads 12 -outfmt 6 -num_alignments 5 &

3. Run MCScanX

In the directory that contains sample.gff and sample.blast, run MCScanX with the prefix sample:

MCScanX sample

Explanation of important parameters:

-s MATCH_SIZE, default: 5. The minimum number of genes required in each collinear block.

-m MAX_GAPS, default: 25. The maximum number of gaps allowed in a collinear block.

-b Patterns of collinear blocks. 0: intra- and inter-species (default); 1: intra-species; 2: inter-species.

3.2 Inter-species synteny analysis with MCScanX_h

MCScanX_h takes two input files:

  1. sample.gff
  2. sample.homology: a tab-delimited list of paired gene IDs, which can be extracted from the orthologous groups identified by software such as OrthoFinder or OrthoMCL.

3. Run MCScanX_h

The results are the RUF_JAP.collinearity and RUF_JAP.tandem files and the RUF_JAP.html folder. The information we need is in the RUF_JAP.collinearity result file.

(1) RUF_JAP.collinearity

The collinearity result file includes three parts:

parameters

Basic statistics: the total number of collinear genes, the total number of genes, and the proportion of collinear genes.

Collinear block information: each Alignment represents one collinear block (numbered from 0) and is followed by the gene pairs of that block. The first column is the block number, the second column is the gene pair number within the block, the third and fourth columns are the names of the two genes in the pair, and the fifth column is the e-value of the blast alignment.
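The gene pair lines described above can be read into a per-block structure with a short parser. This is a sketch under the assumption that pair lines look like "  0-  1:<TAB>geneA<TAB>geneB<TAB>e-value" and that header lines start with "#"; the example lines and gene names are invented for illustration.

```python
import re

# block number, "-", pair number, ":", then two gene names and an e-value
PAIR_RE = re.compile(r"^\s*(\d+)-\s*(\d+):\s+(\S+)\s+(\S+)\s+(\S+)\s*$")


def parse_collinearity(lines):
    """Return {block_number: [(gene_a, gene_b, e_value), ...]}."""
    blocks = {}
    for line in lines:
        if line.startswith("#"):
            continue  # parameters, statistics and "## Alignment" headers
        match = PAIR_RE.match(line)
        if match:
            block, _pair, gene_a, gene_b, e_value = match.groups()
            blocks.setdefault(int(block), []).append(
                (gene_a, gene_b, float(e_value))
            )
    return blocks


# Hypothetical excerpt of a .collinearity file.
example = [
    "## Alignment 0: score=250.0 e_value=1e-10 N=2 RUF_1&JAP_1 plus\n",
    "  0-  0:\tRUF_g1\tJAP_g7\t  1e-50\n",
    "  0-  1:\tRUF_g2\tJAP_g8\t      0\n",
    "## Alignment 1: score=300.0 e_value=1e-12 N=1 RUF_2&JAP_3 minus\n",
    "  1-100:\tRUF_g9\tJAP_g3\t  2e-30\n",
]
blocks = parse_collinearity(example)
for number, pairs in blocks.items():
    print(number, len(pairs), pairs[0][:2])
```

Matching the block and pair numbers with a regular expression, rather than splitting on whitespace, keeps the gene columns in place even when the two numbers run together without a space, as in "1-100:".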

(2) RUF_JAP.html

This folder contains one html file for each chromosome. Open an html file in a browser to see the following columns.

The first column is the duplication depth.

The second column lists all genes on the chromosome in order. Tandemly duplicated genes have a red background.

The third and subsequent columns are the names of the corresponding genes in the alignments.

(3) RUF_JAP.tandem

This file lists the IDs of tandemly duplicated genes within the genome.

Note: MCScanX assigns chromosomes to species according to the prefix of the chromosome ID in the gff file (the first 2 characters: RUF_;JAP_). If MCScanX detects that the input data contains multiple species, it does not generate a tandem file.

3.3 Extract collinear blocks (gene pairs)

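## Skip the header lines (starting with #) and split on tabs so that the two gene columns are picked up whatever the width of the block and pair numbers
grep -v "^#" RUF.collinearity | awk -F'\t' '{print $2"\t"$3}' > RUF.homolog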

4. Ka, Ks and 4DTv calculation

For details, please refer to the article on the calculation of Ka/Ks and 4DTv values.

Software involved: KaKs_Calculator2.0 and ParaAT

Recommended workflow: ParaAT.pl + KaKs_Calculator2.0 (see the detailed procedure for Ka/Ks and 4DTv calculation).

ParaAT.pl generates aligned CDS sequences for each gene pair from a list of homologous gene pairs, and the output format can be specified, for example axt.

KaKs_Calculator calculates Ka and Ks for each gene pair.

Two scripts will be used:

  • axt2one-line.py converts axt format into a single line
  • calculate_4DTV_correction.pl calculates 4DTv.

4.1 Calculate Ka and Ks values with KaKs_Calculator

The -m parameter specifies the calculation method for Ka and Ks; here the YN model is used.

#!/bin/bash
# Loop over all axt files; the -m parameter specifies the model
for i in *.axt; do KaKs_Calculator -i "$i" -o "${i}.kaks" -m YN; done

# Convert each multi-line axt file into a single line
for i in *.axt; do axt2one-line.py "$i" "${i}.one-line"; done
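To build a Ks distribution, the Ks values have to be collected from the .kaks files written by the KaKs_Calculator loop above. The sketch below assumes that each output file is tab-delimited with a header row containing the columns Sequence, Ka and Ks, and that missing values are written as NA; check these assumptions against your own output files. The function name and the example table are hypothetical.

```python
import csv
import io


def read_kaks(handle):
    """Return [(pair_name, ka, ks), ...] from a tab-delimited results table."""
    rows = []
    for record in csv.DictReader(handle, delimiter="\t"):
        try:
            ka = float(record["Ka"])
            ks = float(record["Ks"])
        except (KeyError, TypeError, ValueError):
            continue  # skip rows with missing or non-numeric values such as NA
        rows.append((record.get("Sequence", ""), ka, ks))
    return rows


# Hypothetical table with the assumed column names.
example = (
    "Sequence\tMethod\tKa\tKs\tKa/Ks\n"
    "pairA\tYN\t0.05\t0.50\t0.1\n"
    "pairB\tYN\tNA\tNA\tNA\n"
)
records = read_kaks(io.StringIO(example))
print(records)
```

For real data, the same function can be called on each open .kaks file and the Ks values pooled into a histogram.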

4.2 4DTv (transversion rate at fourfold degenerate sites)

Biological significance: the 4DTv values of gene pairs in collinear regions reflect, to some extent, the relative timing of species divergence events and WGD events during evolution.
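To illustrate what is being counted, the sketch below computes an uncorrected 4DTv for one pair of aligned CDS sequences. It assumes the standard genetic code, in which the eight codon families beginning with CT, GT, TC, CC, AC, GC, CG and GG are fourfold degenerate at the third position. It is not a reimplementation of calculate_4DTV_correction.pl: no correction for multiple substitutions is applied, and the function name and example sequences are hypothetical.

```python
# First two bases of the fourfold degenerate codon families (standard code).
FOURFOLD_PREFIXES = {"CT", "GT", "TC", "CC", "AC", "GC", "CG", "GG"}
PURINES = {"A", "G"}


def raw_4dtv(cds_a: str, cds_b: str):
    """Return (transversions, fourfold_sites, raw_4dtv) for two aligned CDS."""
    if len(cds_a) != len(cds_b):
        raise ValueError("aligned sequences must have the same length")
    a, b = cds_a.upper(), cds_b.upper()
    sites = transversions = 0
    for i in range(0, len(a) - 2, 3):
        codon_a, codon_b = a[i:i + 3], b[i:i + 3]
        # Both codons must share the same fourfold degenerate prefix.
        if codon_a[:2] != codon_b[:2] or codon_a[:2] not in FOURFOLD_PREFIXES:
            continue
        x, y = codon_a[2], codon_b[2]
        if x not in "ACGT" or y not in "ACGT":
            continue  # skip gaps and ambiguous bases
        sites += 1
        # A transversion swaps a purine (A/G) for a pyrimidine (C/T) or back.
        if (x in PURINES) != (y in PURINES):
            transversions += 1
    value = transversions / sites if sites else float("nan")
    return transversions, sites, value


# Hypothetical aligned sequences (four codons each).
print(raw_4dtv("GCTCCAGGAATG", "GCACCCGGGATG"))
```

In this example three codons are fourfold degenerate in both sequences (GCT/GCA, CCA/CCC, GGA/GGG); two of them differ by a transversion and one by a transition.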

5. Results visualization

TBtools is recommended for visualizing the results.

See: Collinearity analysis and visualization using TBtools

References:

Tiley, G.P., Barker, M.S., and Burleigh, J.G. (2018). Assessing the Performance of Ks Plots for Detecting Ancient Whole Genome Duplications. Genome Biol Evol 10, 2882–2898.

Huang, S., Li, R., Zhang, Z., et al. (2009). The genome of the cucumber, Cucumis sativus L. Nat Genet 41, 1275–1281. https://doi.org/10.1038/ng.475

For choosing a calculation model in KaKs_Calculator2.0, see KaKs_Calculator2.0_manual.

https://www.jianshu.com/p/9d28de3d18e6
