Typical usage for DNAscope Pangenome

The Sentieon® Genomics software includes a pipeline for pangenome alignment and germline variant calling. Compared to the linear-alignment DNAseq® and DNAscope pipelines, the Sentieon® DNAscope Pangenome pipeline uses pangenome graph data structures to improve short-read alignment and variant calling accuracy. The DNAscope Pangenome pipeline is recommended for datasets sequenced from human samples.

General

Pipeline Overview

The Sentieon® DNAscope Pangenome pipeline will process the input reads through the following steps:

Extract the k-mer spectra: This step generates a k-mer spectrum from the input reads as a KFF file.
Generate a personalized pangenome: This step generates a personalized pangenome using the sample’s k-mer spectrum. The personalized pangenome is then converted into a personalized reference genome.
Map reads to reference: This step aligns the reads contained in the FASTQ files to a reference genome in FASTA format. This step is skipped with aligned input data in BAM or CRAM format.
Map reads to the personalized reference: This step aligns a subset of reads to the sample’s personalized reference genome and lifts these alignments back to the standard reference genome.
Duplicate marking and metrics collection: This step identifies DNA molecules that were sequenced multiple times and marks the corresponding reads as duplicates so that they are excluded from downstream analysis. This step also generates a statistical summary of data quality from the aligned reads. If multiqc is available, a report is generated from the collected metrics. The duplicate-marked reads are output in BAM or CRAM format.
Small variant calling using DNAscope: This step identifies small variants (SNVs and indels) present in the sample relative to the reference genome and calculates sample genotypes.

These steps can be run as a single command using the sentieon-cli.

Input files

The Sentieon® DNAscope Pangenome pipeline is implemented in the sentieon-cli. Instructions for installing the sentieon-cli can be found on GitHub at https://github.com/Sentieon/sentieon-cli. A detailed description of the Sentieon® DNAscope Pangenome pipeline can be found in the DNAscope Pangenome documentation.

The pipeline requires the following inputs:

The FASTA file containing the nucleotide sequence of the GRCh38 reference genome corresponding to the sample you will analyze. Samtools and bwa index files for the reference genome are also required.
The pangenome file in GBZ format with a corresponding haplotype (.hapl) file. The pipeline requires that the pangenome incorporates GRCh38 contigs as a starting graph.
One or multiple GZIP-compressed FASTQ files containing the nucleotide sequence of the sample to be analyzed. These files contain the raw reads from the DNA sequencing. The software only supports files containing quality scores in Sanger format (Phred+33).
A population VCF file containing allele frequency information from population databases.
A pipeline and sequencing platform-specific model bundle file. Model bundle files can be accessed from https://github.com/Sentieon/sentieon-models.
(Optional) A BED file containing variant calling intervals. Recommended for whole-genome sequencing data to restrict variant calling to the canonical contigs. Recommended for whole-exome sequencing data to restrict variant calling to target regions.
(Optional) The Single Nucleotide Polymorphism database (dbSNP) data that you want to include in the pipeline. The data is supplied as a VCF file, which may be compressed with bgzip and indexed.
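
Because the pipeline needs the samtools and bwa index files alongside the reference FASTA, it can help to confirm they are present before starting a run. The following is a minimal sketch, assuming the index files sit next to the FASTA and use the standard suffixes written by samtools faidx (.fai) and bwa index (.amb, .ann, .bwt, .pac, .sa). The function name missing_ref_indexes is hypothetical and is not part of the Sentieon® software.

```shell
# Print each reference index file that is missing or empty.
# Assumption: index files are named <reference>.<suffix> and are stored
# in the same directory as the reference FASTA.
missing_ref_indexes() {
  ref="$1"
  for suffix in fai amb ann bwt pac sa; do
    [ -s "$ref.$suffix" ] || printf '%s\n' "$ref.$suffix"
  done
}

# Example (commented out so that the block has no side effects):
# missing_ref_indexes hg38_ucsc.fa
```

An empty result means every expected index file was found; any printed path should be created or downloaded before running the pipeline.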

Third-party tools

In addition to the Sentieon® software, the bioinformatics pipeline requires the following third-party tools:

samtools version 1.16 or higher. samtools is used to manipulate the aligned read data. Installation instructions can be found at https://github.com/samtools/samtools?tab=readme-ov-file#building-samtools.
bcftools version 1.22 or higher. bcftools is used to manipulate called variants. Installation instructions can be found at https://samtools.github.io/bcftools/howtos/install.html.
vg. vg is used to generate a personalized pangenome. x86 executables for vg can be downloaded from GitHub at https://github.com/vgteam/vg/releases/download/v1.68.0/vg.
KMC version 3 or higher. kmc is used to count k-mers in the input sample. x86 executables for kmc can be downloaded from GitHub at https://github.com/Sentieon/KMC/releases/download/v3.2.4-pipe2/KMC3.2.4.linux.x64.tar.gz.

The following third-party tools are optional and will add additional functionality to the pipeline:

MultiQC. multiqc is used to generate a report from the pipeline metrics. Installation instructions can be found at https://github.com/MultiQC/MultiQC?tab=readme-ov-file#installation.

Tools will be called from the user’s PATH.
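
Since the tools are resolved through PATH, a quick pre-flight check can report anything that is not installed or not visible to the shell. The sketch below uses the POSIX `command -v` builtin. The function name missing_tools is hypothetical, and the executable names in the usage comment are assumed to match how the tools are installed on your system.

```shell
# Print the name of every tool in the argument list that is not on PATH.
missing_tools() {
  for tool in "$@"; do
    command -v "$tool" >/dev/null 2>&1 || printf '%s\n' "$tool"
  done
}

# Example (commented out; executable names are assumptions):
# missing_tools sentieon-cli samtools bcftools vg kmc
# Optional tool:
# missing_tools multiqc
```

An empty result means all listed tools were found on PATH. This check does not verify tool versions.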

Warning

vg version 1.69.0 introduced breaking changes to the haplotype index (.hapl) file format, as described in the vg 1.69.0 release notes. Newer versions of vg cannot read haplotype indexes generated with earlier versions and will fail with an error similar to:

[vg haplotypes] Loading haplotype information from /pan/hprc-v2.0-mc-grch38.hapl
error: [Haplotypes] Loading from /pan/hprc-v2.0-mc-grch38.hapl failed: Haplotypes::simple_sds_load(): Expected version 5 to 6, got version 4

To resolve this error, either downgrade vg to a version compatible with your existing .hapl file, or regenerate the haplotype index for your pangenome with your current vg version. An updated .hapl file for the public HPRC v2.0 pangenome is also available on request; please contact Sentieon® support at support@sentieon.com.
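
To decide which side of the 1.69.0 boundary an installed vg falls on, a version comparison can be scripted. The following is a minimal sketch, assuming `sort -V` (version sort, as provided by GNU coreutils) is available. The function name version_ge is hypothetical, and the sketch deliberately takes the version number as a plain string rather than assuming a particular output format for the vg version command.

```shell
# Return success when version $1 is greater than or equal to version $2.
# Relies on `sort -V`, which orders dotted version numbers numerically.
version_ge() {
  [ "$(printf '%s\n%s\n' "$1" "$2" | sort -V | head -n 1)" = "$2" ]
}

# Example (commented out; vg_version is a hypothetical variable that you
# set by hand after inspecting the version reported by your vg install):
# vg_version="1.68.0"
# if version_ge "$vg_version" 1.69.0; then
#   echo "vg $vg_version reads the newer .hapl format"
# else
#   echo "vg $vg_version reads the older .hapl format"
# fi
```

The same helper can be reused to compare samtools and bcftools versions against their stated minimums.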

Example usage

The following example shows step-by-step usage of the Sentieon® DNAscope Pangenome pipeline using recommended input files to process a test sample.

Download input files

The following shell command downloads the GRCh38 reference genome and its index files, the canonical-contig BED file, the pangenome, the population VCF, and the dbSNP VCF used by the pipeline:

curl -L \
  -O 'https://ftp.sentieon.com/public/GRCh38/hg38_canonical.bed' \
  -O 'https://human-pangenomics.s3.amazonaws.com/pangenomes/freeze/release2/minigraph-cactus/hprc-v2.0-mc-grch38.gbz' \
  -O 'https://human-pangenomics.s3.amazonaws.com/pangenomes/freeze/release2/minigraph-cactus/hprc-v2.0-mc-grch38.hapl' \
  -O 'https://ftp.sentieon.com/public/GRCh38/population/pop-v20-20260528.vcf.gz' \
  -O 'https://ftp.sentieon.com/public/GRCh38/population/pop-v20-20260528.vcf.gz.tbi' \
  -o 'hg38_ucsc.fa' 'https://ngi-igenomes.s3.amazonaws.com/igenomes/Homo_sapiens/UCSC/hg38/Sequence/WholeGenomeFasta/genome.fa' \
  -o 'hg38_ucsc.fa.fai' 'https://ngi-igenomes.s3.amazonaws.com/igenomes/Homo_sapiens/UCSC/hg38/Sequence/WholeGenomeFasta/genome.fa.fai' \
  -o 'hg38_ucsc.fa.amb' 'https://ngi-igenomes.s3.amazonaws.com/igenomes/Homo_sapiens/UCSC/hg38/Sequence/BWAIndex/genome.fa.amb' \
  -o 'hg38_ucsc.fa.ann' 'https://ngi-igenomes.s3.amazonaws.com/igenomes/Homo_sapiens/UCSC/hg38/Sequence/BWAIndex/genome.fa.ann' \
  -o 'hg38_ucsc.fa.bwt' 'https://ngi-igenomes.s3.amazonaws.com/igenomes/Homo_sapiens/UCSC/hg38/Sequence/BWAIndex/genome.fa.bwt' \
  -o 'hg38_ucsc.fa.pac' 'https://ngi-igenomes.s3.amazonaws.com/igenomes/Homo_sapiens/UCSC/hg38/Sequence/BWAIndex/genome.fa.pac' \
  -o 'hg38_ucsc.fa.sa' 'https://ngi-igenomes.s3.amazonaws.com/igenomes/Homo_sapiens/UCSC/hg38/Sequence/BWAIndex/genome.fa.sa' \
  -O 'https://storage.googleapis.com/gcp-public-data--broad-references/hg38/v0/Homo_sapiens_assembly38.dbsnp138.vcf.gz' \
  -O 'https://storage.googleapis.com/gcp-public-data--broad-references/hg38/v0/Homo_sapiens_assembly38.dbsnp138.vcf.gz.tbi'
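
Large downloads are occasionally cut short, so it may be worth testing the compressed files before use. The sketch below runs `gzip -t` on each file and prints those that fail; bgzip output is gzip-compatible, so the same test applies to the .vcf.gz files. The function name corrupt_gz_files is hypothetical, and this check is only a heuristic: it detects most truncated or corrupted transfers but does not compare files against published checksums.

```shell
# Print each gzip- or bgzip-compressed file that fails an integrity test.
corrupt_gz_files() {
  for f in "$@"; do
    gzip -t "$f" 2>/dev/null || printf '%s\n' "$f"
  done
}

# Example (commented out so that the block has no side effects):
# corrupt_gz_files pop-v20-20260528.vcf.gz \
#   Homo_sapiens_assembly38.dbsnp138.vcf.gz
```

Any file printed by the function should be downloaded again.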

A model bundle file for the pipeline can be downloaded from the Sentieon® models repository at https://github.com/Sentieon/sentieon-models.

Run the DNAscope Pangenome pipeline with FASTQ input

A single command runs the DNAscope Pangenome pipeline on the input files:

sentieon-cli dnascope-pangenome \
  -r hg38_ucsc.fa \
  --hapl hprc-v2.0-mc-grch38.hapl \
  --gbz hprc-v2.0-mc-grch38.gbz \
  -m SentieonIlluminaPangenomeRealignWGS1.2.bundle \
  --pop_vcf pop-v20-20260528.vcf.gz \
  --r1_fastq HG002.novaseq.pcr-free.30x.R1.fastq.gz \
  --r2_fastq HG002.novaseq.pcr-free.30x.R2.fastq.gz \
  --readgroup "@RG\tID:HG002-1\tSM:HG002\tLB:HG002-LB-1\tPL:ILLUMINA" \
  -b hg38_canonical.bed \
  --dbsnp Homo_sapiens_assembly38.dbsnp138.vcf.gz \
  --pcr_free \
  HG002_pangenome.vcf.gz

DNAscope uses different priors for calling indels depending on whether the library preparation involved PCR amplification. The default setting is appropriate for samples sequenced with standard library preps. Setting --pcr_free selects a prior appropriate for samples sequenced using a PCR-free library prep.
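
When the command is scripted across many samples, the read group string and the PCR-free switch are the two arguments most likely to vary. The following is a minimal sketch of two helpers for them. The function names make_readgroup and pcr_flag, and the "pcr-free" label, are hypothetical conventions for this example; only the --readgroup value format and the --pcr_free flag come from the command above.

```shell
# Build a read group header in the form used by the example command.
# Arguments: ID, sample name (SM), library (LB), platform (PL).
# The output contains a literal backslash followed by "t" between fields,
# matching the double-quoted string passed to --readgroup.
make_readgroup() {
  printf '@RG\\tID:%s\\tSM:%s\\tLB:%s\\tPL:%s\n' "$1" "$2" "$3" "$4"
}

# Print --pcr_free for PCR-free libraries and nothing otherwise.
# Assumption: the caller labels PCR-free libraries as "pcr-free".
pcr_flag() {
  if [ "$1" = "pcr-free" ]; then
    printf '%s\n' '--pcr_free'
  fi
}

# Example (commented out; other arguments are elided for brevity):
# sentieon-cli dnascope-pangenome ... \
#   --readgroup "$(make_readgroup HG002-1 HG002 HG002-LB-1 ILLUMINA)" \
#   $(pcr_flag pcr-free) \
#   HG002_pangenome.vcf.gz
```

Leaving `$(pcr_flag ...)` unquoted is intentional: for a standard library prep it expands to nothing, so no empty argument is passed to the command.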

Output files

The following files are output by the pipeline with all optional features:

HG002_pangenome.vcf.gz: SNV and indel calls in VCF format.
HG002_pangenome_bwa_deduped.cram: bwa-aligned, coordinate-sorted, and duplicate-marked read data from the input FASTQ files.
HG002_pangenome_mm2_deduped.cram: pangenome-aligned, coordinate-sorted, and duplicate-marked read data from the input FASTQ files. The reads in this file are aligned to the pangenome and lifted back to GRCh38.
HG002_pangenome_metrics/: A directory containing QC metrics for the analyzed sample.
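
In the example, each output name shares the prefix of the output VCF with the .vcf.gz extension removed. Assuming that pattern holds in general (it is inferred from the file names listed here, not from a documented rule), the expected outputs for a run can be derived from the VCF name, for instance to check that a run completed. The function name expected_outputs is hypothetical.

```shell
# Print the output paths expected for a given output VCF name.
# Assumption: outputs follow the <prefix>_bwa_deduped.cram,
# <prefix>_mm2_deduped.cram, and <prefix>_metrics naming seen in the
# example, where <prefix> is the VCF name without ".vcf.gz".
expected_outputs() {
  vcf="$1"
  prefix="${vcf%.vcf.gz}"
  printf '%s\n' \
    "$vcf" \
    "${prefix}_bwa_deduped.cram" \
    "${prefix}_mm2_deduped.cram" \
    "${prefix}_metrics"
}

# Example (commented out): report any expected output that is missing.
# expected_outputs HG002_pangenome.vcf.gz | while read -r path; do
#   [ -e "$path" ] || echo "missing: $path"
# done
```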

Limitations of the Sentieon® DNAscope Pangenome pipeline

The Sentieon® DNAscope Pangenome pipeline currently supports Minigraph-Cactus pangenomes built against the GRCh38 or CHM13 (T2T) reference genomes, such as those generated by the Human Pangenome Reference Consortium (HPRC). The pipeline is also compatible with other Minigraph-Cactus pangenomes; please reach out to Sentieon® support for information on using the pipeline with other pangenomes.