Gene family expansion and contraction analysis-CAFE5

CAFE (Computational Analysis of gene Family Evolution) is a software that analyzes gene family size changes in a way that explains phylogenetic history. This analysis is often called a gene family. Gene family expansions and contractions analysis.

CAFE uses the birth and death process to simulate gene acquisition and loss in a user-specified phylogenetic tree. It can calculate the gene family size transfer rate from parent node to child node, and can also infer the gene family size of ancestral species. Under this model The resulting gene family size distributions can provide a basis for assessing the significance of observed family size differences between taxa.

Since the Hahn research group proposed an algorithm to evaluate the evolutionary speed and pattern of gene families in 2005, after the first version of CAFE was published in 2006, the latest version CAFE5 was launched in 2020. The basic model of the previous old version assumed that all gene families have the same evolution. rate. The new version supports gamma distributed rate classes to explicitly model rate variation between families. Published in 2020, it is more convenient to operate than cafe4 and has a new Model (Gamma).

CAFE5 download

1.Installation

git clone https://github.com/hahnlab/CAFE5.git
cdCAFE5
./configure
make

Usage of 2.cafe5

There must be at least two input files, one is the gene family number statistics file Genefamilies_Count.tsv, and the other is the tree file tree.txt (with differentiation time). You can also add another lambda file

2.1 Common parameters

--infile, -i

Path to tab delimited gene families file to be analyzed - Required for estimation.

--n_gamma_cats, -k

Number of gamma categories to use. If specified, the Gamma model will be used to run calculations; otherwise the Base model will be used.

--pvalue, -P

P-value to use for determining significance of family size change, Default=0.05.

--tree, -t

Path to file containing newick formatted tree - Required for estimation.

--lambda_tree, -y

Path to lambda tree, for use with multiple lambdas.

--output_prefix, -o

Output directory - Name of directory automatically created for output. Default=results.

2.2 Input file preparation

2.2.1 Data preparation

2.2.1.1 Identifying gene families:

(1) Only representative transcripts of each gene are retained, and alternative splicing and redundant genes are removed;
(2) Establish a BLAST database and use blastp for all-by-all comparison;
(3) Use MCL to cluster based on blastp results. Gene sequences with similar gene sequences are usually a gene family;
(4) Parse the output results of MCL and use them as input to CAFE.

(1) Combine all longest transcripts into a single file

#Extract the longest transcript of each gene
python python_scripts/cafetutorial_longest_iso.py -d twelve_spp_proteins/
#Merge the longest transcripts of each species into 1 file
cat twelve_spp_proteins/longest_*.fa | /data1/spider/ytbiosoft/seqkit rmdup -> makeblastdb_input.fa

(2) All-by-all BLAST

##My blast tool is installed in a sub-environment of miniconda. First, I need to activate this environment.
source/data1/spider/miniconda3/bin/activate
conda activate bioinforspace

##First use makeblastdb to create the BLAST database
makeblastdb -in makeblastdb_input.fa -dbtype prot -out blastdb

##Afterwards, use blastp to perform sequence search to obtain similar sequences for each sequence.
blastp -num_threads 20 -db blastdb -query makeblastdb_input.fa -outfmt 7 -seg yes > blast_output.txt &
#parameter:
-seg parameter filters low complexity sequences (i.e. amino acid coded as X)
-num_threads number of threads, set here to 20

(3) Use MCL for sequence clustering
Find clusters based on sequence similarity information in the BLAST output, and these clusters will be used for gene families in subsequent CAFE analyses.

##Use shell command to convert BLAST into ABC format recognized by MCL grep -v "#" blast_output.txt | cut -f 1,2,11 > blast_output.abc

##Create network files (.mci) and dictionary files (.tab), and then perform clustering
source /data1/spider/miniconda3/bin/activateconda activate orthofinder

##mcxload is installed in orthofinder environment
mcxload -abc blast_output.abc --stream-mirror --stream-neg-log10 -stream-tf 'ceil(200)' -o blast_output.mci -write-tab blast_output.tab &

#parameter:
--stream-mirror: Create a mirror for the input, that is, every X-Y has a Y-X
--stream-neg-log10: Perform -log10 conversion on the input value
-stream-tf: Convert the input value to a one-variable function. Ceil(200) is used here.

#Clustering based on the mci file, the main parameter to adjust is -I, which determines the granularity of the clustering. The smaller the value, the greater the clustering density. Generally set to 3.
mcl blast_output.mci -I 3
mcxdump -icl out.blast_output.mci.I30 -tabr blast_output.tab -o dump.blast_output.mci.I30

(4) Organize the output results of MCL
The output of MCL in the previous step cannot be directly used in CAFE, and it needs to be parsed and filtered.

##1. Convert the original mcl format to a format that CAFE can use
python python_scripts/cafetutorial_mcl2rawcafe.py -i dump.blast_output.mci.I30 -o unfiltered_cafe_input.txt -sp "ENSG00 ENSPTR ENSPPY ENSPAN ENSNLE ENSMMU ENSCJA ENSRNO ENSMUS ENSFCA ENSECA ENSBTA"
##2. Eliminate those gene families with particularly large gene copy number variation, because it will cause errors in parameter prediction. The following script filters out gene families that have more than 100 gene copies in one or more species.
python python_scripts/cafetutorial_clade_and_size_filter.py -i unfiltered_cafe_input.txt -o filtered_cafe_input.txt -s
##3. Replace the original number with a meaningful species name
sed -i -e 's/ENSPAN/baboon/' -e 's/ENSFCA/cat/' -e 's/ENSBTA/cow/' -e 's/ENSNLE/gibbon/' -e 's/ENSECA/ horse/' -e 's/ENSG00/human/' -e 's/ENSMMU/macaque/' -e 's/ENSCJA/marmoset/' -e 's/ENSMUS/mouse/' -e 's/ENSPPY/ orang/' -e 's/ENSRNO/rat/' -e 's/ENSPTR/chimp/' filtered_cafe_input.txt
sed -i -e 's/ENSPAN/baboon/' -e 's/ENSFCA/cat/' -e 's/ENSBTA/cow/' -e 's/ENSNLE/gibbon/' -e 's/ENSECA/ horse/' -e 's/ENSG00/human/' -e 's/ENSMMU/macaque/' -e 's/ENSCJA/marmoset/' -e 's/ENSMUS/mouse/' -e 's/ENSPPY/ orang/' -e 's/ENSRNO/rat/' -e 's/ENSPTR/chimp/' large_filtered_cafe_input.txt

2.2.1.2 Species tree inference

Building a species tree is mainly divided into two steps: multi-sequence alignment and phylogenetic tree. Subsequently, a hypermetric tree is constructed based on the existing evolutionary tree and used as the input of CAFE.
①Multiple sequence alignment generally uses single copies of orthologous genes. Multiple sequence alignments are performed separately and then merged into a single file;
②Use phylogenetic tree prediction software to build the tree (Software: RAxML, PhyML, FastTree; MrBayes), and save the generated phylogenetic tree in NEWICK format as twelve_spp_raxml_cat_tree_midpoint_rooted.txt;

cat twelve_spp_raxml_cat_tree_midpoint_rooted.txt
(((cow:0.09289 ,(cat:0.07151,horse:0.05727):0.00398):0.02355,((((orang:0.01034,(chimp:0.00440,human:0.00396):0.00587):0.00184,gibbon:0.01331): 0.00573,(macaque:0.00443,baboon:0.00422):0.01431):0.01097,marmoset:0.03886):0.04239):0.03383,(rat:0.04110,mouse:0.03854):0.10918);

③Infer hypermetric tree
An ultrametric tree is also called a time tree, which means changing the scale of the phylogenetic tree to time, so that the distance from the root to all species is the same. There are many construction methods, the more commonly used one is r8s. Please refer to the article for details

#Use cafetutorial_prep_r8s.py to build the batch running script of r8s, and then extract the hypermetric tree
python python_scripts/cafetutorial_prep_r8s.py -i twelve_spp_raxml_cat_tree_midpoint_rooted.txt -o r8s_ctl_file.txt -s 35157236 -p 'human,cat' -c '94'
/data1/spider/liupiao/biosoft/r8s1.81/src/r8s -b -f r8s_ctl_file.txt > r8s_tmp.txt & amp;
tail -n 1 r8s_tmp.txt | cut -c 16- > twelve_spp_r8s_ultrametric.txt

2.2.2. Genefamilies_Count.tsv

Tab-delimited gene family count file, usually OrthoMCL,
Software such as OrthoFinder obtains counting information.
Example format

Desc Family ID human chimp orang baboon gibbon macaque marmoset rat mouse cat horse cow
ATPase ORTHOMCL1 52 55 54 57 54 56 56 53 52 57 55 54
(null) ORTHOMCL2 76 51 41 39 45 36 37 67 79 37 41 49
HMG box ORTHOMCL3 50 49 48 48 46 49 48 55 52 51 47 55
(null) ORTHOMCL4 43 43 47 53 44 47 46 59 58 51 50 55
Dynamin ORTHOMCL5 43 40 43 44 31 46 33 79 70 43 49 50
...
....
..
DnaJ ORTHOMCL10016 45 46 50 46 46 47 46 48 49 45 44 48

We first use the Orthogroups.GeneCount.tsv file of OrthoFinder to generate an input file that meets the requirements.

cp Results_May02/Orthogroups/Orthogroups.GeneCount.tsv CAFE/
awk 'OFS="\t" {$NF=""; print}' Orthogroups.GeneCount.tsv > tmp & amp; & amp; awk '{print "(null)""\t"$0}' tmp > cafe .input.tsv & amp; & amp; sed -i '1s/(null)/Desc/g' cafe.input.tsv & amp; & amp; rm tmp

View file format

Desc Orthogroup Aof.pro Ath.pro Atr.pro Cba.pro Cri.pro Csa.pro Csu.pro Kle.pro Mpo.pro Nco.pro Osa.pro Ppa.pro Smo.pro Tpl.pro Vca.pro Vvi.pro Zma.pro
(null) OG0000000 145 112 95 5 372 129 3 1 2 217 126 16 206 419 4 177 117
(null) OG0000001 9 4 3 1691 9 96 2 56 2 4 21 0 2 5 3 2 0
(null) OG0000002 32 117 62 1 92 117 2 0 20 81 119 77 40 193 5 107 161
(null) OG0000003 37 104 54 3 89 76 4 5 10 74 144 22 47 134 8 79 154
(null) OG0000004 73 104 51 4 40 80 2 10 12 76 87 33 22 136 5 97 135
(null) OG0000005 28 46 36 11 37 47 0 3 50 81 81 32 48 120 0 54 73
(null) OG0000006 41 43 74 6 38 57 0 4 25 57 52 19 33 155 0 87 40
(null) OG0000007 58 52 60 0 18 42 0 0 12 50 56 17 57 99 1 82 52
(null) OG0000008 38 57 26 7 52 47 4 6 19 40 59 43 20 29 1 41 80
(null) OG0000009 46 57 26 1 25 46 1 2 11 52 65 29 13 50 1 48 87

After generation, you need to eliminate gene families with too large copy number differences between different species, otherwise an error will be reported. There is a built-in script that can be used. I need to remove the first line when running it to use it.

python ~/soft/CAFE5/tutorial/clade_and_size_filter.py -i cafe.input.tsv -o gene_family_filter.txt -s

Another way:

awk 'NR==1 || $3<100 & amp; & amp; $4<100 & amp; & amp; $5<100 & amp; & amp; $6<100 & amp; & amp; $7<100 & amp; & amp; $8<100 & amp; & amp; $9<100 & amp; & amp; $10<100 & amp; & amp; $11<100 & amp; & amp; $12<100 & amp; & amp ; $13<100 & & $14<100 & & $15<100 & & $16<100 & & $17<100 & & $18<100& amp; & amp; $19<100 {print $0}' cafe.input.tsv >gene_family_filter.txt

For the final file format, ensure that the species name in the first line is consistent with the evolutionary tree.

Desc Orthogroup Aof Ath Atr Cba Cri Csa Csu Kle Mpo Nco Osa Ppa Smo Tpl Vca Vvi Zma
(null) OG0000020 26 37 23 4 35 28 0 1 24 28 43 24 27 47 0 35 42
(null) OG0000021 49 41 31 7 30 31 8 2 7 26 49 15 11 31 0 36 45
(null) OG0000022 27 25 31 0 27 34 2 1 23 25 46 18 27 44 1 39 45
(null) OG0000024 37 40 27 0 22 30 1 11 9 33 38 18 25 43 0 37 39
(null) OG0000029 28 26 23 2 24 25 1 2 5 32 34 31 17 35 1 30 40
(null) OG0000030 23 30 16 1 23 27 1 1 27 26 26 16 15 49 1 28 35
(null) OG0000031 28 36 26 3 27 23 8 1 3 17 37 10 18 38 3 34 28
(null) OG0000032 18 16 25 0 24 19 0 5 4 25 36 6 38 49 1 38 35
(null) OG0000033 19 17 20 0 12 16 4 6 18 35 42 4 23 39 3 45 28
(null) OG0000035 17 37 17 8 28 24 2 2 5 30 41 8 6 26 2 32 37
(null) OG0000036 22 15 17 3 22 19 7 12 13 20 24 20 30 38 3 35 22
(null) OG0000039 14 27 36 0 34 24 0 2 2 12 41 4 47 13 0 37 26
(null) OG0000040 15 30 9 1 19 35 0 2 11 25 26 19 12 48 0 39 27

2.2.2. tree.txt

This step directly uses the FigTree.tre file generated by mcmctree and can be used after modification.

grep "UTREE 1 =" FigTree.tre | sed -E -e "s/\[[^]]*\]//g" -e "s/[ \t]//g" -e " /^$/d" -e "s/UTREE1=//" > tree.txt

3.1 Running CAFE5

 cafe5 -i gene_family_filter.txt -t tree.txt -o out -c 1
## If using Gamma model and Poisson distribution
 cafe5 -i gene_family_filter.txt -t tree.txt -o out -c 1 -k 2 -p ##Note that -k can be adjusted, usually 2-5

3.2 CAFE5 output results

3.2.1 Result files

Gamma_asr.tre ## Tree file for each gene family
Gamma_branch_probabilities.tab ## Probability calculated for each branch
Gamma_category_likelihoods.txt
Gamma_change.tab ## The number of contractions and expansions of each gene family at each node
Gamma_clade_results.txt ##The number of expansion/contraction of gene families at each node
Gamma_count.txt ## The number of each gene family in each node
Gamma_family_likelihoods.txt
Gamma_family_results.txt ## p value of gene family changes and whether they are significant results
Gamma_results.txt ## Model, final likelihood value, final Lambda value and other parameter information.

Note: The files we mainly use are Gamma_asr.tre (mainly corresponding to the nodes in the following table), Gamma_change.tab (see which gene families change at which node), Gamma_clade_results.txt(Reflected on the tree, the number of contractions/expanions of gene families at each node), Gamma_family_results.txt(Gene families with significant expansion/contraction)

3.2.2 Reflection of the contraction/expansion number of gene families at each node

In fact, there is a drawing script, but it has not been updated for a long time. It may not be suitable for CAFE5. We can draw it ourselves.
To reflect the expansion/contraction number of a gene family on the tree, two files are required, Gamma_asr.tre, Gamma_clade_results.txt

cat Gamma_clade_results.txt
#Taxon_ID Increase Decrease
Mpo<21> 232 1298
Ppa<20> 2231 371
<31> 134 65
<25> 949 220
<23> 134 209
Atr<13> 516 922
Kle<29> 493 741
<12> 245 56
<28> 314 340
<4> 118 176
Cri<17> 1669 287
<22> 214 184
<19> 445 93
<16> 291 352
Osa<3> 579 572
Aof<6> 935 840
<8> 142 112
Csa<1> 326 834
<7> 640 138
Tpl<15> 1147 395
<14> 204 273
Nco<11> 631 559
Zma<2> 1776 232
Vvi<5> 958 433
<10> 413 112
<9> 345 66
Smo<18> 842 1315
Cba<24> 744 1664
<30> 23 17
Ath<0> 1090 291
Csu<27> 305 1560
Vca<26> 438 1040

less Gamma_asr.tre
BEGIN TREES;
  TREE OG0000021 = ((Kle<29>*_2:820.007,(((Mpo<21>*_7:428.285,Ppa<20>*_15:428.285)<23>_12:70.3982,((Cri<17>_30: 404.796,(Tpl<15>_31:308.175,(Atr<13>_31:208.47,(Nco<11>*_26:176.909,(((Osa<3>*_49:45.2652,Zma<2>_45:45.2652) <7>_45:60.4652,Aof<6>*_49:105.73)<9>*_44:25.8965,(Vvi<5>_36:101.378,(Csa<1>*_31:83.0358,Ath<0>*_41: 83.0358)<4>_36:18.3421)<8>_36:30.2489)<10>*_36:45.2824)<12>_31:31.5605)<14>_31:99.7053)<16>*_30:96.6205)<19>* _27:48.1939,Smo<18>_11:452.99)<22>*_13:45.694)<25>*_12:190.787,Cba<24>_7:689.47)<28>*_7:130.537)<31>_5:177.864 ,(Csu<27>*_8:874.03,Vca<26>*_0:874.03)<30>_5:123.841)<32>_5

Visualization:

Gene families shrink and expand

3.2.3 Other arrangements

Compared with CAFE4, these results do not directly reflect significantly expanded/shrunk gene families, or if we want to find the specific expanded genes of a certain node, we can further organize them based on the output files we have obtained so far.

cat Gamma_family_results.txt |grep "y"|cut -f1 >p0.05.significant
#Extract significantly expanded or contracted orthogroupsID
grep -f p0.05.significant Gamma_change.tab > Gamma_p0.05change.tab
#The number of contractions and expansions of significantly expanded/contracted gene families at each node
cat Gamma_p0.05change.tab | cut -f1,2 | grep " + [1-9]" | cut -f1 > node0significant.expand
#Ath significant expansion of orthogroupsID
wc -l node0significant.expand
#AthNumber of significantly expanded gene families
cp ../../Results_May02/Orthogroups/Orthogroups.tsv ./
grep -f node0significant.expand Orthogroups.tsv | cut -f3 | sed "s/ /\
/g" | sed "s/\t/\
/g" | sed "s/,//g" | sort | uniq > node0significant.expand.genes
#Extract genes with significantly expanded Ath, method one
cp ../../Results_May02/Orthogroups/Orthogroups.txt ./
grep -f node0significant.expand Orthogroups.txt |sed "s/ /\
/g" | grep "Ath" | sort | uniq > node0significant.expand.genes
#Extract genes with significantly expanded Ath, method 2

Reference tutorial:

Tutorial 2

Download, install and use CAFE5

Analysis of gene family contraction and expansion using cafe software

Orthologous gene analysis using OrthoFinder

How to visualize the results of cafes