LinSNPGT: Genotyping specified SNP loci on Linux systems
- General introduction
- Background
- Test data
- Installation
- SNPGT
- SNPGT-build
- Contact us
General introduction
- We previously developed WinSNPGT, a toolkit for calling variant sites on Windows systems, which is convenient for users with little Linux experience. It obtains genotypes at the specified SNP sites of our datasets from raw sequencing data. LinSNPGT is the Linux version of this toolkit.
- Below are the installation and usage instructions for the toolkit.
Background
- We developed CropGS-Hub, a phenotype prediction platform that hosts multiple high-quality datasets for important crops (such as rice and maize). These datasets serve as training sets for building phenotype prediction models. Users can upload the genotypes of their own samples to the platform for online phenotype prediction.
- The LinSNPGT toolkit was developed to ensure that the genotypes uploaded by users match the genotypes in the model training set, thereby avoiding biased prediction results. Users can run the program on a Linux system and complete the entire process, from raw sequencing files to genotypes, with a few simple operations.
Test data
- The sample data files are not included in the installation package and must be downloaded separately: Sample Data. After downloading, extract the archive:
tar -zxvf example-data.tar.gz
The sample data come from Oryza sativa (rice), so select a rice dataset (GSTP007 to GSTP009) in the toolkit when genotyping.
- To use LinSNPGT, you need to download a RefDataSetFile. The download links are listed below, each followed by the RefDataSet_File name to enter in the configuration file.
- Maize (Zea mays):
- GSTP001_8652_Hybrid : Maize_8652_Hybrid
- GSTP002_5820_Hybrid : Maize_5820_Hybrid
- GSTP003_1458_Inbred : Maize_1458_Inbred
- GSTP004_1404_Inbred : Maize_1404_Inbred
- GSTP005_350_Inbred : Maize_350_Inbred
- GSTP006_1604_Inbred : Maize_1604_Inbred
- Rice (Oryza sativa):
- GSTP007_1495_Hybrid : Rice_1495_Hybrid
- GSTP008_705_Inbred : Rice_705_Inbred
- GSTP009_378_Inbred : Rice_378_Inbred
- Cotton (Gossypium hirsutum):
- GSTP010_1245_Inbred : Cotton_1245_Inbred
- Millet (Setaria italica):
- GSTP011_827_Inbred : Millet_827_Inbred
- Chickpea (Cicer arietinum):
- GSTP012_2921_Inbred : Chickpea_2921_Inbred
- Rapeseed (Brassica napus):
- GSTP013_991_Inbred : Rapeseed_991_Inbred
- Soybean (Glycine max):
- GSTP014_2795_Inbred : Soybean_2795_Inbred
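The RefDataSet_File names above can be checked programmatically before they are written into the configuration file. The following is a minimal sketch: the names are copied from the list above, while the helper functions `datasets_for_species` and `is_known_dataset` are hypothetical and not part of LinSNPGT.

```python
# RefDataSet_File names copied from the list above, grouped by species.
# The helper functions are illustrative only; LinSNPGT does not ship them.
REF_DATASET_FILES = {
    "Maize": ["Maize_8652_Hybrid", "Maize_5820_Hybrid", "Maize_1458_Inbred",
              "Maize_1404_Inbred", "Maize_350_Inbred", "Maize_1604_Inbred"],
    "Rice": ["Rice_1495_Hybrid", "Rice_705_Inbred", "Rice_378_Inbred"],
    "Cotton": ["Cotton_1245_Inbred"],
    "Millet": ["Millet_827_Inbred"],
    "Chickpea": ["Chickpea_2921_Inbred"],
    "Rapeseed": ["Rapeseed_991_Inbred"],
    "Soybean": ["Soybean_2795_Inbred"],
}


def datasets_for_species(species):
    """Return the RefDataSet_File names listed for a species (case-insensitive)."""
    return list(REF_DATASET_FILES.get(species.strip().capitalize(), []))


def is_known_dataset(name):
    """Return True if `name` is one of the RefDataSet_File names listed above."""
    return any(name in names for names in REF_DATASET_FILES.values())


print(datasets_for_species("rice"))
print(is_known_dataset("Rice_705_Inbred"))
```

For the rice sample data, any of the three names returned by `datasets_for_species("rice")` is a valid choice.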
Installation
- LinSNPGT depends on the following environment and software:
- Python3
- bowtie2
- samtools
- java8
- Configuration & Installation
git clone https://gitee.com/Min-Zer0/LinSNPGT.git
cd LinSNPGT
chmod +x ./install.sh && ./install.sh
# install java8
./install.Java8.sh
# install bowtie2
sudo apt install bowtie2
# install samtools
sudo apt install samtools
# install seqtk if you want to use SNPGT-build
sudo apt-get install seqtk
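After installation, it can help to confirm that the dependencies are reachable. The sketch below is an assumption-laden check rather than part of LinSNPGT: it looks for `bowtie2`, `samtools`, and `java` on `PATH` with Python's standard `shutil.which`. Note that the install script may place Java 8 under `./jdk` instead, in which case `java` not being on `PATH` is not necessarily a problem.

```python
import shutil
import sys

# Executable names follow the dependency list above. Looking for "java" on
# PATH is an assumption: install.Java8.sh may provide ./jdk/bin/java instead.
REQUIRED_TOOLS = ["bowtie2", "samtools", "java"]


def missing_tools(tools=REQUIRED_TOOLS):
    """Return the entries of `tools` that cannot be found on PATH."""
    return [tool for tool in tools if shutil.which(tool) is None]


def python_ok(min_version=(3, 0)):
    """Return True if the running interpreter is at least `min_version`."""
    return sys.version_info >= min_version


print("Python 3:", "ok" if python_ok() else "too old")
for tool in missing_tools():
    print("not found on PATH:", tool)
```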
SNPGT
- LinSNPGT has 3 subfolders and 3 files after installation:
  - .sys/
  - 01.Reference_Genome/
  - 02.Input_Fastq/
  - SNPGT-build.py
  - SNPGT.config
  - SNPGT.py
- To run LinSNPGT via SNPGT.py, users need to fill in SNPGT.config and place the input files in the corresponding paths.
- Fill in the config.txt file
  - Software Path: These settings generally do not need to be changed.
  - Project_Name: Enter your project name, which is used as the output file prefix.
  - RefDataSet_File: Enter the dataset corresponding to the model to be used for prediction.
    - Available species and datasets are listed above, including their download links.
    - The species of the RefDataSet should match that of your raw sequencing data.
  - Thread_Count: Enter the number of threads available to run the program.
  - Samples_list: Enter each sample name and its raw sequencing read files.
    - Must follow the format
      | SAMPLE NAME | RAW READS NAME | RAW READS NAME |
    - Fields are separated by vertical bars. Each sample occupies its own line and each reads file its own column, with Read1 first and Read2 last.
- config.txt (example)
#==================== Software Path =======================#
Java_Path=./jdk/bin/java
Bowtie2_Path=bowtie2
Samtools_Path=samtools
#==================== LinSNPGT Config =======================#
*[Project]
Project_Name=Rice
*[Species and Dataset]
RefDataSet_File=Rice_705_Inbred
*[Running LinSNPGT Thread]
Thread_Count=10
*[Samples_list]
> ===========================================
> |sample | Read1 | Read2 |
>------------------------------------------------
| Line1 | test1_1.fastq.gz | test1_2.fastq.gz |
| Line2 | test2_1.fastq.gz | test2_2.fastq.gz |
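To make the layout of the configuration file concrete, here is a minimal sketch of how such a file could be read. It is not the parser used by SNPGT.py: the function `parse_config` is hypothetical, and it assumes that lines starting with `#`, `*`, or `>` are headers or decoration, that settings are `key=value` lines, and that sample rows start with a vertical bar.

```python
EXAMPLE_CONFIG = """\
Java_Path=./jdk/bin/java
Bowtie2_Path=bowtie2
Samtools_Path=samtools
*[Project]
Project_Name=Rice
*[Species and Dataset]
RefDataSet_File=Rice_705_Inbred
*[Running LinSNPGT Thread]
Thread_Count=10
*[Samples_list]
> |sample | Read1 | Read2 |
| Line1 | test1_1.fastq.gz | test1_2.fastq.gz |
| Line2 | test2_1.fastq.gz | test2_2.fastq.gz |
"""


def parse_config(text):
    """Split config text into a settings dict and a list of sample rows.

    Assumption: '#', '*' and '>' lines carry no data; this mirrors the
    example above but is not taken from the SNPGT.py source.
    """
    settings, samples = {}, []
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith(("#", "*", ">")):
            continue
        if line.startswith("|"):
            cells = [cell.strip() for cell in line.strip("|").split("|")]
            if len(cells) != 3 or not all(cells):
                raise ValueError("expected | sample | Read1 | Read2 |: " + raw)
            samples.append(tuple(cells))
        elif "=" in line:
            key, value = line.split("=", 1)
            settings[key.strip()] = value.strip()
    return settings, samples


settings, samples = parse_config(EXAMPLE_CONFIG)
print(settings["RefDataSet_File"], settings["Thread_Count"])
for name, read1, read2 in samples:
    print(name, read1, read2)
```

A row with a missing column raises `ValueError`, which is a cheap way to catch a malformed Samples_list before starting a long run.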
- Download the RefDataSetFile file (*.tar.gz) and move it to the path: ./01.Reference_Genome/
- Move raw sequencing data (*.fastq.gz) or (*.fastq) to the path: ./02.Input_Fastq
- Run command:
python SNPGT.py
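Before running `python SNPGT.py`, the file placement described above can be verified with a short pre-flight check. This is a sketch under stated assumptions: `missing_inputs` is a hypothetical helper, and the archive name `<RefDataSet_File>.tar.gz` is a guess based on the `*.tar.gz` pattern above, so adjust it if the downloaded file is named differently.

```python
import tempfile
from pathlib import Path


def missing_inputs(root, ref_dataset, read_files):
    """Return the expected input files that are absent, relative to `root`.

    Assumption: the reference archive is named '<ref_dataset>.tar.gz'.
    """
    root = Path(root)
    expected = [root / "01.Reference_Genome" / (ref_dataset + ".tar.gz")]
    expected += [root / "02.Input_Fastq" / name for name in read_files]
    return [p.relative_to(root).as_posix() for p in expected if not p.is_file()]


# Demonstration in a throwaway directory that mimics the LinSNPGT layout.
with tempfile.TemporaryDirectory() as tmp:
    demo = Path(tmp)
    (demo / "01.Reference_Genome").mkdir()
    (demo / "02.Input_Fastq").mkdir()
    (demo / "02.Input_Fastq" / "test1_1.fastq.gz").touch()
    print(missing_inputs(demo, "Rice_705_Inbred",
                         ["test1_1.fastq.gz", "test1_2.fastq.gz"]))
```

In a real run, `root` would be the LinSNPGT directory and `read_files` the read file names entered in Samples_list.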
- Output & follow-up
  - After the program is completed, the result files are written to the Result/ directory.
    - Standard format VCF (variant call format) file
    - *.Genotype.txt (sample genotyping matrix, the format is as follows)
      CHROM   POS      Line1   Line2   …   LineN
      Chr1    128960   A       .       …   C
      Chr1    133137   C       C       …   T
      …       …        …       …       …   …
      Chr12   321216   A       A       …   A
      Chr12   364257   A       C       …   C
      Chr12   364755   .       .       …   .
      …       …        …       …       …   …
  - Upload *.Genotype.txt to CropGS-Hub to complete subsequent analysis.
    - Phenotype prediction: PhenotypePrediction
    - Crossing design: CrossingDesign
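Before uploading, the genotype matrix can be given a quick sanity check. The sketch below computes the fraction of sites without a call for each sample. It rests on assumptions not stated in this document: that `.` marks a missing genotype, that columns are separated by whitespace, and that sample names contain no spaces. The function `missing_rates` is hypothetical.

```python
EXAMPLE_MATRIX = """\
CHROM POS Line1 Line2
Chr1 128960 A .
Chr1 133137 C C
Chr12 364755 . .
"""


def missing_rates(text):
    """Return {sample: fraction of sites whose call is '.'}.

    Assumptions: whitespace-separated columns, the first two columns are
    CHROM and POS, and '.' means that no genotype was called.
    """
    lines = [line for line in text.splitlines() if line.strip()]
    header = lines[0].split()
    samples = header[2:]
    missing = dict.fromkeys(samples, 0)
    n_sites = 0
    for line in lines[1:]:
        fields = line.split()
        if len(fields) != len(header):
            raise ValueError("unexpected column count: " + line)
        n_sites += 1
        for sample, call in zip(samples, fields[2:]):
            if call == ".":
                missing[sample] += 1
    return {s: (missing[s] / n_sites if n_sites else 0.0) for s in samples}


for sample, rate in missing_rates(EXAMPLE_MATRIX).items():
    print(sample, round(rate, 3))
```

For a real file, pass the contents of `Result/<Project_Name>.Genotype.txt` read with `open(...).read()`; the exact file name is an assumption based on the Project_Name prefix described earlier.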
SNPGT-build
- If you need to genotype against a dataset other than the 14 crop datasets we provide, you can use the SNPGT-build script to build your own RefDataSetFile and then use SNPGT to complete genotyping. To list the available options, run:
python SNPGT-build.py -h
usage: SNPGT-build.py [-h] [-F FASTA] [-B BIM] [-S SPECIES] [-N STRAIN] [-L BINLEN]
                      [--JavaPath JAVAPATH] [--SamtoolsPath SAMTOOLSPATH]
                      [--SeqtkPath SEQTKPATH] [--Bowtie2Path BOWTIE2PATH]

SNPGT-build (Tools for make RefGenome)

optional arguments:
  -h, --help            show this help message and exit
  -F FASTA, --fasta FASTA
                        Whole genome reference sequence
  -B BIM, --bim BIM     SNP site information(bim file)
  -S SPECIES, --species SPECIES
                        Specify species name; eg. Rice
  -N STRAIN, --strain STRAIN
                        Specify strain name; eg. 378_Inbred
  -L BINLEN, --binlen BINLEN
                        The simplified of the genome retains base length on both sides of the SNP. The default value is 400
  --JavaPath JAVAPATH   Path to java8. The default is ./jdk/bin/java
  --SamtoolsPath SAMTOOLSPATH
                        Path to samtools.
  --SeqtkPath SEQTKPATH
                        Path to Seqtk.
  --Bowtie2Path BOWTIE2PATH
                        Path to bowtie2-build.
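The `-L/--binlen` option keeps a stretch of reference sequence on both sides of each SNP. The sketch below illustrates that idea only; it is not taken from the SNPGT-build source. It assumes the standard PLINK .bim layout (chromosome, variant ID, genetic distance, base-pair position, allele 1, allele 2), and the merging of overlapping windows is my own simplification. The function `snp_windows` is hypothetical, and window ends are not clipped to chromosome length because that would require the FASTA file.

```python
EXAMPLE_BIM = [
    "1 snp1 0 1000 A G",
    "1 snp2 0 1500 C T",
    "1 snp3 0 5000 G A",
]


def snp_windows(bim_lines, binlen=400):
    """Return {chrom: [(start, end), ...]} covering `binlen` bases around each SNP.

    Positions are 1-based and inclusive. Overlapping or adjacent windows are
    merged; this merging step is an assumption made for illustration.
    """
    positions = {}
    for line in bim_lines:
        fields = line.split()
        if len(fields) < 4:
            continue  # skip blank or malformed lines
        positions.setdefault(fields[0], []).append(int(fields[3]))

    windows = {}
    for chrom, sites in positions.items():
        merged = []
        for pos in sorted(sites):
            start, end = max(1, pos - binlen), pos + binlen
            if merged and start <= merged[-1][1] + 1:
                merged[-1][1] = max(merged[-1][1], end)
            else:
                merged.append([start, end])
        windows[chrom] = [tuple(window) for window in merged]
    return windows


print(snp_windows(EXAMPLE_BIM))
```

With the default of 400, the first two SNPs in the example are close enough that their windows merge into a single interval, while the third keeps its own.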
- Run the example
python SNPGT-build.py -F path_to/Rice.fa -B path_to/Rice_378_Inbred.bim -S Rice -N 378_Inbred
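When building several reference datasets, the command above can be assembled from Python. The sketch below only uses the flags shown in the help text; `build_command` is a hypothetical wrapper, and the paths are the same placeholders as in the example.

```python
import shlex
import sys


def build_command(fasta, bim, species, strain, binlen=None, python=sys.executable):
    """Assemble the SNPGT-build.py argument list from the documented flags."""
    cmd = [python, "SNPGT-build.py",
           "-F", fasta, "-B", bim, "-S", species, "-N", strain]
    if binlen is not None:
        cmd += ["-L", str(binlen)]  # omit to keep the tool's default of 400
    return cmd


cmd = build_command("path_to/Rice.fa", "path_to/Rice_378_Inbred.bim",
                    "Rice", "378_Inbred")
# Print a shell-safe version of the command instead of executing it.
print(" ".join(shlex.quote(arg) for arg in cmd))
```

To actually run it, pass the list to `subprocess.run(cmd, check=True)` from inside the LinSNPGT directory.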
Contact us
- Jie Qiu ([email protected])
- Min Zhu ([email protected])
- Jiaxin Chen ([email protected])