LinSNPGT: Genotyping specified SNP loci on Linux systems

    • General introduction
    • Background
    • Test data
    • Installation
    • SNPGT
    • SNPGT-build
    • Contact us

General introduction

  • We have developed WinSNPGT, a toolkit for calling variant sites on Windows systems that is friendly to users with little Linux experience. It obtains genotypes at the specified SNP sites in our datasets from raw sequencing data. LinSNPGT is the Linux version of this toolkit.
  • Below are the installation and usage instructions for the toolkit.

Background

  • We developed a phenotype prediction platform, CropGS-Hub, which hosts multiple high-quality datasets for important crops (such as rice and maize). These datasets were used as training sets to build phenotype prediction models. Users can upload the genotypes of their own samples to the platform for online phenotype prediction.
  • The LinSNPGT toolkit was developed to ensure that the genotypes uploaded by users cover the same SNP sites as the genotypes in the training set, thereby avoiding biased prediction results. Users can run the program on a Linux system and complete the entire process, from raw sequencing files to genotypes, with a few simple operations.

Test data

  • The sample data files are not included in the installation package and can be downloaded by clicking: Sample Data. Extract the archive after downloading:

    tar -zxvf example-data.tar.gz

The sample data are from Oryza sativa (rice). When genotyping, select one of the rice datasets in the toolkit, such as GSTP007 to GSTP009.

  • To use LinSNPGT, you need to download a RefDataSetFile. The download links are listed below, each followed by the RefDataSet_File name to enter in the configuration file.
    • Maize (Zea mays):
      • GSTP001_8652_Hybrid : Maize_8652_Hybrid
      • GSTP002_5820_Hybrid : Maize_5820_Hybrid
      • GSTP003_1458_Inbred : Maize_1458_Inbred
      • GSTP004_1404_Inbred : Maize_1404_Inbred
      • GSTP005_350_Inbred : Maize_350_Inbred
      • GSTP006_1604_Inbred : Maize_1604_Inbred
    • Rice (Oryza sativa):
      • GSTP007_1495_Hybrid : Rice_1495_Hybrid
      • GSTP008_705_Inbred : Rice_705_Inbred
      • GSTP009_378_Inbred : Rice_378_Inbred
    • Cotton (Gossypium hirsutum):
      • GSTP010_1245_Inbred : Cotton_1245_Inbred
    • Millet (Setaria italica):
      • GSTP011_827_Inbred : Millet_827_Inbred
    • Chickpea (Cicer arietinum):
      • GSTP012_2921_Inbred : Chickpea_2921_Inbred
    • Rapeseed (Brassica napus):
      • GSTP013_991_Inbred : Rapeseed_991_Inbred
    • Soybean (Glycine max):
      • GSTP014_2795_Inbred : Soybean_2795_Inbred
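
As a convenience, the RefDataSet_File names listed above can be kept in a small lookup table, for example to check that the value entered in the config belongs to the species you sequenced. This is only an illustrative sketch: `REF_DATASETS` and `species_of` are hypothetical names and are not part of LinSNPGT.

```python
# RefDataSet_File names copied from the list above, grouped by species.
# REF_DATASETS and species_of are hypothetical helpers, not part of LinSNPGT.
REF_DATASETS = {
    "Maize": [
        "Maize_8652_Hybrid", "Maize_5820_Hybrid", "Maize_1458_Inbred",
        "Maize_1404_Inbred", "Maize_350_Inbred", "Maize_1604_Inbred",
    ],
    "Rice": ["Rice_1495_Hybrid", "Rice_705_Inbred", "Rice_378_Inbred"],
    "Cotton": ["Cotton_1245_Inbred"],
    "Millet": ["Millet_827_Inbred"],
    "Chickpea": ["Chickpea_2921_Inbred"],
    "Rapeseed": ["Rapeseed_991_Inbred"],
    "Soybean": ["Soybean_2795_Inbred"],
}


def species_of(ref_dataset_file):
    """Return the species a RefDataSet_File name belongs to, or None if unknown."""
    for species, names in REF_DATASETS.items():
        if ref_dataset_file in names:
            return species
    return None
```

For the sample data, `species_of("Rice_705_Inbred")` returns `"Rice"`, while a name that is not in the list returns `None`.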

Installation

  • LinSNPGT depends on the following environment and software

    • Python3
    • bowtie2
    • samtools
    • java8
  • Configuration & Installation

    git clone https://gitee.com/Min-Zer0/LinSNPGT.git
    
    cd LinSNPGT
    chmod +x ./install.sh && ./install.sh
    
    # install java8
    ./install.Java8.sh
    
    # install bowtie2
    sudo apt install bowtie2
    
    # install samtools
    sudo apt install samtools
    
    # install seqtk if you want to use SNPGT-build
    sudo apt-get install seqtk
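
After installation, it can help to confirm that the command-line dependencies are reachable. The sketch below uses Python's standard `shutil.which`; the command names are assumed to be the same ones used in the install commands above, and `find_missing` is a hypothetical helper, not part of LinSNPGT. Java is not checked here because the example config further below refers to it by path (./jdk/bin/java) rather than by command name.

```python
import shutil

# Command names assumed from the install commands above; adjust them if your
# system uses different names or explicit paths.
REQUIRED = ["bowtie2", "samtools"]
OPTIONAL = ["seqtk"]  # only needed for SNPGT-build


def find_missing(commands, which=shutil.which):
    """Return the commands that cannot be found on PATH."""
    return [cmd for cmd in commands if which(cmd) is None]


if __name__ == "__main__":
    for label, commands in (("required", REQUIRED), ("optional", OPTIONAL)):
        missing = find_missing(commands)
        status = "all found" if not missing else "missing: " + ", ".join(missing)
        print(f"{label} tools: {status}")
```

The `which` parameter exists only so the lookup can be replaced in tests; in normal use the default is sufficient.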
    

SNPGT

  • LinSNPGT has 3 subfolders and 3 files after installation

    • .sys/
    • 01.Reference_Genome/
    • 02.Input_Fastq/
    • SNPGT-build.py
    • SNPGT.config
    • SNPGT.py
  • To run LinSNPGT via SNPGT.py, users need to fill in SNPGT.config and place the input files in the corresponding paths.

  1. Fill in the SNPGT.config file
    • Software Path: These settings generally do not need to be changed.

    • Project_Name: Enter your project name, which will be used as the output file prefix.

    • RefDataSet_File: Enter the dataset corresponding to the model to be fitted.

      • Available species and datasets are listed above, including their download links.
      • The RefDataSet’s species should match your raw sequencing data.
    • Thread_Count: Enter the number of threads available to run the program.

    • Samples_list: Fill in your raw sequencing files and their corresponding sample names.

      • Must follow the format

        | SAMPLE NAME | RAW READS NAME | RAW READS NAME|

      • Fields are separated by vertical bars. Each sample goes on its own line and each reads file in its own column, with Read1 before Read2.
    • SNPGT.config (example)

      #==================== Software Path =======================#
      Java_Path=./jdk/bin/java
      Bowtie2_Path=bowtie2
      Samtools_Path=samtools
      
      #==================== LinSNPGT Config =======================#
      *[Project]
      Project_Name=Rice
      
      *[Species and Dataset]
      RefDataSet_File=Rice_705_Inbred
      
      *[Running LinSNPGT Thread]
      Thread_Count=10
      
      *[Samples_list]
      > ===========================================
      > |sample | Read1 | Read2 |
      >------------------------------------------------
        | Line1 | test1_1.fastq.gz | test1_2.fastq.gz |
        | Line2 | test2_1.fastq.gz | test2_2.fastq.gz |
      
  2. Download the RefDataSetFile file (*.tar.gz) and move it to the path: ./01.Reference_Genome/
  3. Move raw sequencing data (*.fastq.gz) or (*.fastq) to the path: ./02.Input_Fastq
  4. Run command:
    python SNPGT.py
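
Before launching a run, the steps above can be double-checked with a short pre-flight script. The sketch below reads the Samples_list table from the config and reports read files that are not yet in ./02.Input_Fastq. It assumes the layout of the example config (section markers starting with `*`, data rows starting with `|`, header lines starting with `>`) and paired-end reads with exactly three columns; `parse_samples` and `missing_reads` are hypothetical helpers, and SNPGT.py itself may parse the file differently.

```python
from pathlib import Path


def parse_samples(config_text):
    """Extract (sample, read1, read2) rows from the Samples_list table."""
    rows = []
    in_samples = False
    for raw in config_text.splitlines():
        line = raw.strip()
        if line.startswith("*"):
            # Section markers look like "*[Samples_list]" in the example config.
            in_samples = "Samples_list" in line
            continue
        if not in_samples or not line.startswith("|"):
            continue  # skips blank lines and the ">"-prefixed header lines
        cells = [cell.strip() for cell in line.strip("|").split("|")]
        if len(cells) != 3 or not all(cells):
            raise ValueError(f"expected | sample | read1 | read2 |, got: {raw!r}")
        rows.append(tuple(cells))
    return rows


def missing_reads(rows, fastq_dir="02.Input_Fastq"):
    """Return read files named in the config that are not present in fastq_dir."""
    folder = Path(fastq_dir)
    return [
        name
        for _, read1, read2 in rows
        for name in (read1, read2)
        if not (folder / name).is_file()
    ]
```

A typical call would be `missing_reads(parse_samples(Path("SNPGT.config").read_text()))`, run from the LinSNPGT directory; an empty list means every listed read file was found.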
    
  • Output & follow-up
    • After the program finishes, the result files are written to the Result/ directory.

      • A standard VCF (Variant Call Format) file
      • *.Genotype.txt (sample genotype matrix in the following format)
        CHROM   POS      Line1   Line2   LineN
        Chr1    128960   A       .       C
        Chr1    133137   C       C       T
        Chr12   321216   A       A       A
        Chr12   364257   A       C       C
        Chr12   364755   .       .       .
    • Upload *.Genotype.txt to CropGS-Hub to complete subsequent analysis.

      • Phenotype prediction: PhenotypePrediction
      • Crossing design: CrossingDesign
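
Before uploading, the *.Genotype.txt matrix can be inspected for samples with many uncalled sites. The sketch below computes the fraction of sites reported as `.` for each sample. It assumes that columns are separated by whitespace, that sample names contain no spaces, and that `.` marks a site without a genotype call (as it does in VCF); `missing_rate` is a hypothetical helper, not part of LinSNPGT.

```python
def missing_rate(lines):
    """Return {sample: fraction of sites whose genotype is '.'}."""
    rows = [line.split() for line in lines if line.strip()]
    if len(rows) < 2:
        return {}  # no header or no sites
    header, sites = rows[0], rows[1:]
    samples = header[2:]  # the first two columns are CHROM and POS
    counts = dict.fromkeys(samples, 0)
    for fields in sites:
        for sample, call in zip(samples, fields[2:]):
            if call == ".":
                counts[sample] += 1
    return {sample: counts[sample] / len(sites) for sample in samples}
```

It can be applied to an open file directly, for example `missing_rate(open(path))` where `path` points to the matrix in the Result/ directory.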

SNPGT-build

  • If you have genotyping needs beyond the 14 crop datasets we provide, you can use the SNPGT-build script to create your own RefDataSetFile and then use SNPGT to complete the genotyping.

    python SNPGT-build.py -h
    
    usage: SNPGT-build.py [-h] [-F FASTA] [-B BIM] [-S SPECIES] [-N STRAIN] [-L BINLEN] [--JavaPath JAVAPATH] [--SamtoolsPath SAMTOOLSPATH] [--SeqtkPath SEQTKPATH]
                          [--Bowtie2Path BOWTIE2PATH]
    
    SNPGT-build (Tools for make RefGenome)
    
    optional arguments:
      -h, --help            show this help message and exit
      -F FASTA, --fasta FASTA
                            Whole genome reference sequence
      -B BIM, --bim BIM     SNP site information(bim file)
      -S SPECIES, --species SPECIES
                            Specify species name; eg. Rice
      -N STRAIN, --strain STRAIN
                            Specify strain name; eg. 378_Inbred
      -L BINLEN, --binlen BINLEN
                            The simplified of the genome retains base length on both sides of the SNP. The default value is 400
      --JavaPath JAVAPATH   Path to java8. The default is ./jdk/bin/java
      --SamtoolsPath SAMTOOLSPATH
                            Path to samtools.
      --SeqtkPath SEQTKPATH
                            Path to Seqtk.
      --Bowtie2Path BOWTIE2PATH
                            Path to bowtie2-build.
    
  • Run the example

    python SNPGT-build.py -F path_to/Rice.fa -B path_to/Rice_378_Inbred.bim -S Rice -N 378_Inbred
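
To illustrate what the -L/--binlen option describes (keeping a fixed number of bases on both sides of each SNP), the sketch below derives the retained windows from a bim file. It is not taken from SNPGT-build, whose implementation is not described here; it only assumes the standard PLINK .bim layout, in which the first column is the chromosome and the fourth is the base-pair position. `snp_windows` is a hypothetical name, and merging overlapping windows is an assumption made for the illustration.

```python
from collections import defaultdict


def snp_windows(bim_lines, binlen=400):
    """Map each chromosome to merged (start, end) windows around its SNPs."""
    positions = defaultdict(list)
    for line in bim_lines:
        fields = line.split()
        if len(fields) < 4:
            continue  # skip blank or malformed lines
        # PLINK .bim: column 1 is the chromosome, column 4 the bp position.
        positions[fields[0]].append(int(fields[3]))

    windows = {}
    for chrom, sites in positions.items():
        merged = []
        for pos in sorted(sites):
            start, end = max(1, pos - binlen), pos + binlen
            if merged and start <= merged[-1][1]:
                # Overlaps the previous window, so extend it instead.
                merged[-1][1] = max(merged[-1][1], end)
            else:
                merged.append([start, end])
        windows[chrom] = [tuple(window) for window in merged]
    return windows
```

With the default of 400, two SNPs that lie within 800 bp of each other end up in a single window. Window ends are not clipped to the chromosome length here, since that would require the reference FASTA.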
    

Contact us