Genome-wide fragmentation abundance profile analysis of cfDNA following NC

Continuing with our NC Science series, what we shared earlier is about 16S amplicon sequencing and metagenomic data analysis. Considering that many of our partners are working on the human genome, this time I will share an analysis framework for early cancer screening, blood DELFI whole-genome fragmentation abundance spectrum detection. The title is: Detection and characterization of lung cancer using cell-free DNA fragmentomes.

Although the article is not particularly new, it was published in 2021. The code and data are disclosed quite thoroughly, even including modeling code and model, which is suitable for us to learn. The main content of the article is that the researchers performed low-depth whole-genome sequencing on blood sample cfDNA of the LUCAS discovery cohort (365 subjects), and calculated the whole-genome fragmentation abundance profile , and based on this, a DELFI lung cancer diagnostic model was constructed using machine learning and cross-validated. An independent validation cohort (431 subjects) was used to evaluate the performance of the lung cancer diagnostic model, demonstrating the diagnostic utility of DELFI in early-stage lung cancer. The research idea is shown in the figure below:

There are already many interpretations of the content of the article. I won’t repeat them here. You can search for them. For example, the tweets of this official account have a clear interpretation of the algorithm principle. Here I will focus on sharing the analysis of the code.

Overall structure of repo

There are 4 folders in this workflowr.

  • (1) analysis – Contains the code needed to generate each figure in the paper.

  • (2) code – this contains rpcr and rLucas (two packages widely used in the analysis), some individual r scripts with necessary functions, and 4 folders related to model creation.

    Models were trained and evaluated on 3 cohorts: a. LUCAS cohort (158 non-cancer, 129 cancer). b. LUCAS cohort excluding prior cancer. c. The LUCAS cohort does not include previous cancers and is limited to smokers aged 50-80 years.

    MODEL_CODE has the code needed to train the model on each cohort and perform external validation of the model from the first cohort. Models_c1, Models_c2 and Models_c3 contain trained models. For the first two cohorts, two model architectures were trained, for the third cohort, only one was trained.

  • (3) data – This contains the raw data used to train the model and generate graphs.

  • (4) docs – Contains markdown and html from the analysis, and the resulting graphics.

This repository is available on Github and can be run as a workflowr to generate a web page with all the code and graphics linked.

Code Overview

Preprocessing

The code is mainly concentrated in the code/preprocessing folder.
First, the precess.sh script calls the following scripts, which do the following:

  • fastp.sh — use fastp quality control for FASTQ.
  • align.sh –bowtie2 and samtools generate bam files.
  • post_alignment.sh –sambamba and samtools generate bed files.
  • bed_to_granges.sh –Convert the bed file generated in the previous steps to Granges in R.
  • gc_count ts.sh –Creates a table of fragment counts for each GC layer. Used for GC correction at the fragment level.
  • bin_right ted.sh — Create bin-level data for each sample.

Then, run getZcore res.sh, getCoverage.sh respectively to generate z-Score and Coverage features.
Finally, FeatureMatrix.sh generates the final feature matrix that can be used for subsequent modeling.
create-training-set.r and create-testing-set.r generate complete feature matrices, including clinical data, for training and validating the models used in this paper.
The following libraries must be installed from GitHub:
It seems that different sequencers also need corresponding calibration, which shows that the research is still very rigorous.
PlamaToolsNovaseq.hg19 for target GC distributions and bins for novelseq reference samples (not used in paper).
https://github.com/cancer-genomics/PlasmaToolsNovaseq.hg19.
PlamaToolsHiseq.hg19 for target GC distributions and bins (using) for HiSeq reference samples.
https://github.com/cancer-genomics/PlasmaToolsHiseq.hg19
Take precess.sh and fastp.sh as examples to appreciate the code, it is very standardized and neat, the qsub task is submitted, and the comments are clear, it is worth learning. To reproduce the results of the article, basically you only need to modify the paths of the software and reference files.
First look at precess.sh:

#!/bin/bash
#$ -cwd
#$ -j y
#$ -l mem_free=1G
#$ -l h_vmem=1G
# Job resource option: max runtime
#$ -l h_rt=96:00:00
qsub fastp.sh
qsub -hold_jid_ad fastp.sh align.sh
qsub -hold_jid_ad align.sh post_alignment.sh
qsub -hold_jid_ad post_alignment.sh,align.sh bed_to_granges.sh
qsub -hold_jid_ad post_alignment.sh, bed_to_granges.sh, align.sh gc_counts.sh
qsub -hold_jid post_alignment.sh, bed_to_granges.sh, gc_counts.sh, align.sh bin_corrected.sh

Look at fasp.sh again:

#!/bin/bash
#$ -cwd
#$ -j y
#$ -l h_fsize=100G
#$ -l mem_free=10G
#$ -l h_vmem=10G
#$ -l h_rt=96:00:00
## #$ -pe local 8
#$ -t 1-41

module load bowtie2 # used the module, interested students can search for relevant tutorials to learn
module load samtools

PATH="/users/scristia/.linuxbrew/bin:$PATH"
FASTP=/users/scristia/fastp
CORES=1 ##CORES=8

umask g + w

# Inputs, the folder settings are very clean
fqdir="../fastq"
outdir="../bam"
beddir="../bed"
scratchdir="tmp"
outdirfastp=$outdir/fastp

#Main
mkdir -p $outdir $outdir/dup_metrics $outdir/fastp $beddir $scratchdir

samplepath=$(find $fqdir -maxdepth 1 -name "*.fastq.gz" | \
    sed 's/.\{12\}$//' | \
    sort -u | \
    head -n $SGE_TASK_ID | \
    tail -n 1)
sample=$(basename $samplepath | awk '{ gsub("_WGS", "") ; print $0 }')

echo $sample

read1fq=$samplepath\_R1.fastq.gz
read2fq=$samplepath\_R2.fastq.gz

# Trimming with fastp
echo Trimming $sample with fastp
read1fqpaired=$scratchdir/$sample.paired.R1.fastq.gz
read2fqpaired=$scratchdir/$sample.paired.R2.fastq.gz

if [ -f $read1fqpaired ]; then
   echo Sample $sample has been processed. Exiting.
   exit 0
the fi

json=$outdirfastp/$sample.json
html=$outdirfastp/$sample.html
$FASTP -i $read1fq -I $read2fq -o $read1fqpaired -O $read2fqpaired \
    -Q -z 1 --detect_adapter_for_pe -j $json -h $html -w $CORES

Modeling

It is also a very detailed code, which is fully disclosed step by step, so it will not be listed here. Interested students can directly read the original text and check the code/model_code/ folder of the repo. Of course, the author also published an NC in 2019, and the code disclosure is also very complete, and it is also placed here (references 4 and 5 at the end of the article) for your reference and comparison.

Let’s stop here first for sharing, let’s learn! Although the era of artificial intelligence is coming, it may be for us to learn more and make good use of AI. If you don’t want to lie down, come on everyone!

Attachment:

1. Introduction to workflowr

An organized + reproducible + shareable data science framework in R, Workflowr combines programming (knitr and rmarkdown) and version control (Git via git2r) to produce an Stamping, versioning, and documenting the results page. Any R user can use it quickly and easily. It is designed to help researchers facilitate effective project management, reproducible analysis, collaboration and sharing of results.
workflowr sample web page

2. A missing file processing

In the process of learning and using, I found that the file cytosine_ref.rds is missing in code/preprocessing/01-bed-to-granges.r. If you are not familiar with the genome, it may not be easy to solve. The handsome guy raised this issue. The author also thoughtfully gave the solution and reason (the file size is too large, there are 3G), attached here:

# Most packages in the repo are installed with bioconductor
library(GenomicRanges)
library(Biostrings)
library(BSgenome.Hsapiens.UCSC.hg19)
library(Homo. sapiens)
cytosines <- vmatchPattern("C", Hsapiens)
saveRDS(cytosines, 'cytosine_ref.rds')

References:

[1] Mathios, D. et al. Detection and characterization of lung cancer using cell-free DNA fragmentomes. Nat Commun 12, 5060, doi:10.1038/s41467-021-24994-w (2021).
[2] https://www.sohu.com/a/575968290_100148274?qq-pf-to=pcqq.group
[3] https://github.com/cancer-genomics/reproduce_lucas_wflow
[4] Ulz, P. et al. Inference of transcription factor binding from cell-free DNA enables tumor subtype prediction and early detection. Nat Commun 10, 4666, doi:10.1038/s41467-019-12714-4 (2019).
[5] https://github.com/cancer-genomics/delfi_scripts