DNAscope LongRead¶
Introduction¶
This document describes using Sentieon® DNAscope to call germline variants from PacBio® HiFi or Oxford Nanopore (ONT) long reads.
Sentieon® DNAscope is able to take advantage of the long read length of PacBio® HiFi and ONT reads to perform quick and accurate variant calling using specially calibrated machine learning models.
In order to run this pipeline you need to use Sentieon software package version 202308 or higher; in addition, you will need to use a set of scripts that can be obtained from, https://github.com/Sentieon/sentieon-scripts/tree/master/dnascope_LongRead. You will also need a model bundle for your platform. The latest model bundles can be found at, https://github.com/Sentieon/sentieon-models.
The pipeline also requires python version >2.7 or >3.3, bcftools version 1.10 or
higher, and bedtools. The sentieon
, python
, bcftools
, and bedtools
executables will be accessed through the user's PATH
environment variable.
If you have any additional questions, please contact the technical support at Sentieon® Inc. at support@sentieon.com.
Input data requirements¶
Aligned reads - PacBio HiFi¶
As input, the pipeline will take PacBio® HiFi reads that have been aligned to a
reference genome with pbmm2
or minimap2
. When aligning reads with pbmm2,
setting -c 0 -y 70 --preset HIFI
is recommend. These settings turn off pbmm2's
legacy mapped concordance filter in favor of a gap compressed sequence identity
filter for output alignments and turn on PacBio's® recommended settings for
alignment of HiFi reads. When aligning reads with Sentieon minimap2, using the Sentieon
model for HiFi reads is recommended. When aligning reads with minimap2, setting
-x map-hifi
is recommend. This turns on the recommended minimap2 settings for
alignment of HiFi reads.
Aligned reads - ONT¶
The pipeline will accept Oxford Nanopore (ONT) long reads that have been aligned to the
reference genome with minimap2
. When aligning reads with Sentieon minimap2, using
the Sentieon model for ONT is recommended. When aligning reads with the open-source
minimap2, -x map-ont
is recommended.
The Reference genome¶
DNAscope will call variants present in the sample relative to a high quality reference
genome sequence. Besides the reference genome file, a samtools
fasta index file
(.fai) needs to be present. We recommend aligning to a reference genome without
alternate contigs.
The Sentieon® DNAscope LongRead pipeline¶
Pipeline overview¶
The pipeline will perform two passes of variant calling from the input alignment file and will merge the generated VCFs to produce the final output file. The steps of the pipeline are:
A first pass of variant calling, in which the pipeline will call variants present in the sample of interest.
SNVs called in the first pass are then phased using the long-read information.
A second pass of variant calling.
Across phased regions, variants are called from each haplotype separately.
Across unphased regions, variants are called with a more accurate diploid model.
Variants from the first and second passes are then combined to generate a final callset.
The pipeline requires a DNAscope machine learning model. Please refer to Sentieon's GitHub page to download the latest model.
Running the pipeline¶
Running the DNAscope LongRead pipeline is done via a script that makes the necessary calls to individual Sentieon commands. Different scripts are used for the HiFi and ONT pipelines.
A single command is run to call variants from PacBio HiFi reads and to apply the machine
learning models. The input alignment file should be an indexed BAM or CRAM file of HiFi
reads aligned with pbmm2
or minimap2
.
dnascope_HiFi.sh [-h] -r REFERENCE -i INPUT_BAM -m MODEL_BUNDLE [-d DBSNP_VCF] [-b DIPLOID_BED] [-t NUMBER_THREADS] [-g] [--] VARIANT_VCF
A single command is run to call variants from ONT reads and to apply the machine
learning models. The input alignment file should be an indexed BAM or CRAM file of ONT
reads aligned with minimap2
.
dnascope_ONT.sh [-h] -r REFERENCE -i INPUT_BAM -m MODEL_BUNDLE [-d DBSNP_VCF] [-b DIPLOID_BED] [-t NUMBER_THREADS] [-g] [--] VARIANT_VCF
The Sentieon® DNAscope LongRead pipeline requires the following arguments:
-r REFERENCE
: the location of the reference FASTA file. You should make sure that the reference is the same as the one used in the mapping stage.
-i INPUT_BAM
: the location of the input BAM or CRAM file.
-m MODEL_BUNDLE
: the location of the model bundle.
The Sentieon® DNAscope LongRead pipeline accepts the following optional arguments:
-d dbSNP
: the location of the Single Nucleotide Polymorphism database (dbSNP) used to label known variants. Only one file is supported. Supplying this file will annotate variants with their dbSNP refSNP ID numbers.
-b INTERVAL
: interval in the reference that will be used in the software, in BED file format. Supplying this file will limit variant calling to the intervals inside the BED file.
-t NUMBER_THREADS
: number of computing threads that will be used by the software to run parallel processes. The argument is optional; if omitted, the pipeline will use as many threads as the server has.
-g
: output variants in the gVCF format, in addition to the VCF output file. The tool will output a bgzip compressed gVCF file with a corresponding index file.
-h
: print the command-line help and exit.
The Sentieon® DNAscope LongRead pipeline requires the following positional arguments:
VARIANT_VCF
: the location and filename of the variant calling output. The tool will output a bgzip compressed VCF file with a corresponding index file.
Pipeline output¶
The DNAscope LongRead pipeline will output a bgzip compressed file (.vcf.gz) containing variant calls in the standard VCFv4.2 format along with a tabix index file (.vcf.gz.tbi). If the -g option is used, the pipeline will also output a bgzip compressed file (.g.vcf.gz) containing variant calls in the gVCF format along with a tabix index file (.g.vcf.gz.tbi).
Other considerations¶
Currently, the pipeline is only recommended for use with samples from diploid organisms.
For samples with both diploid and haploid chromosomes, the -b INTERVAL
option can be
used to limit variant calling to diploid chromosomes.