Typical usage for Pangenome

The Sentieon® Genomics software includes a pipeline for pangenome alignment and germline variant calling. Compared to the linear alignment pipelines (see Typical usage for DNAseq® and Typical usage for DNAscope), the Sentieon® pangenome pipeline uses pangenome graph data structures to improve short-read alignment and variant calling accuracy. The pangenome pipeline is recommended for datasets sequenced from human samples.

General

Pipeline Overview

The Sentieon® Pangenome pipeline will process the input reads through the following steps:

  1. Extract the k-mer spectra: This step generates a k-mer spectrum from the input reads as a KFF file.

  2. Generate a personalized pangenome: This step generates a personalized pangenome using the sample’s k-mer spectrum. The personalized pangenome is then converted into a personalized reference genome.

  3. Map reads to reference: This step aligns the reads contained in the FASTQ files to a reference genome in FASTA format. This step is skipped when the input is already-aligned data in BAM or CRAM format.

  4. Map reads to the personalized reference: This step aligns a subset of reads to the sample’s personalized reference genome and lifts these alignments back to the standard reference genome.

  5. Duplicate marking and metrics collection: This step identifies DNA molecules that were sequenced multiple times and marks the corresponding reads as duplicates to exclude them from downstream analysis. This step also generates a statistical summary of data quality from the aligned reads. If multiqc is available, a report will be generated from the collected metrics. The duplicate-marked reads are output in BAM or CRAM format.

  6. Small variant calling using DNAscope: This step identifies small variants (SNVs and indels) present in the sample relative to the reference genome and calculates sample genotypes.

These steps can be run as a single command using the sentieon-cli.

Input files

The Sentieon® Pangenome pipeline is implemented in the sentieon-cli. Instructions for installing the sentieon-cli can be found on GitHub at https://github.com/Sentieon/sentieon-cli. A detailed description of the Sentieon® Pangenome pipeline can be found under Sentieon Pangenome.

The pipeline requires the following inputs:

  • The FASTA file containing the nucleotide sequence of the GRCh38 reference genome corresponding to the sample you will analyze. Samtools and bwa index files for the reference genome are also required.

  • The pangenome file in GBZ format with a corresponding haplotype (.hapl) file. The pipeline requires that the pangenome incorporates GRCh38 contigs as a starting graph.

  • One or multiple GZIP-compressed FASTQ files containing the nucleotide sequence of the sample to be analyzed. These files contain the raw reads from the DNA sequencing. The software only supports files containing quality scores in Sanger format (Phred+33).

  • A population VCF file containing allele frequency information from population databases.

  • A pipeline and sequencing platform-specific model bundle file. Model bundle files can be accessed from https://github.com/Sentieon/sentieon-models.

  • (Optional) A BED file containing variant calling intervals. Recommended for whole-genome sequencing data to restrict variant calling to the canonical contigs. Recommended for whole-exome sequencing data to restrict variant calling to target regions.

  • (Optional) A Single Nucleotide Polymorphism database (dbSNP) VCF file to include in the pipeline. The VCF file may be compressed with bgzip and indexed.
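
The Phred+33 requirement for FASTQ input listed above can be spot-checked before starting a run. The following is a minimal sketch and not part of the Sentieon® software: the function name looks_like_phred33 and the 1000-read sample size are illustrative assumptions, and the check assumes four-line FASTQ records. Quality characters in the ASCII range '!' to ':' can only occur in Phred+33 data, so finding one is good evidence of Sanger encoding. Finding none is inconclusive and worth a closer look.

```shell
# Hypothetical helper: succeeds if the first 1000 reads of a gzipped FASTQ
# contain a quality character that only Phred+33 encoding can produce.
looks_like_phred33() {
    # Every 4th line of a four-line FASTQ record is the quality string.
    gzip -cd "$1" 2>/dev/null \
        | head -n 4000 \
        | awk 'NR % 4 == 0' \
        | LC_ALL=C grep -q '[!-:]'
}

# Example (substitute your own FASTQ file):
# looks_like_phred33 sample.R1.fastq.gz && echo "consistent with Phred+33"
```

Only the start of the file is sampled, so the check stays fast on large inputs.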

Third-party tools

In addition to the Sentieon® software, the bioinformatics pipeline requires the following third-party tools:

The following third-party tools are optional and will add additional functionality to the pipeline:

Tools will be called from the user’s PATH.
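
Because tools are resolved from the PATH, a quick check before a long run can catch a missing executable early. The sketch below is an illustration only: the function name missing_tools is hypothetical, and the tool names passed to it are whatever your installation needs (sentieon-cli and multiqc are the names mentioned in this document).

```shell
# Hypothetical helper: print the name of every argument that cannot be
# resolved as a command in the current environment.
missing_tools() {
    for tool in "$@"; do
        command -v "$tool" >/dev/null 2>&1 || printf '%s\n' "$tool"
    done
}

# Example:
# missing_tools sentieon-cli multiqc
```

An empty result means every listed tool was found.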

Example usage

The following example shows step-by-step usage of the Sentieon® pangenome pipeline using recommended input files to process a test sample.

Download input files

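The following shell command downloads the input files used in this example: the GRCh38 reference FASTA with its samtools and bwa index files, the pangenome (GBZ and haplotype files), the population VCF, the canonical-contig BED file, and the dbSNP VCF: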

curl -L \
  -O 'https://ftp.sentieon.com/public/GRCh38/hg38_canonical.bed' \
  -O 'https://human-pangenomics.s3.amazonaws.com/pangenomes/freeze/release2/minigraph-cactus/hprc-v2.0-mc-grch38.gbz' \
  -O 'https://human-pangenomics.s3.amazonaws.com/pangenomes/freeze/release2/minigraph-cactus/hprc-v2.0-mc-grch38.hapl' \
  -O 'https://ftp.sentieon.com/public/GRCh38/population/pop-v20g41-20251216.vcf.gz' \
  -O 'https://ftp.sentieon.com/public/GRCh38/population/pop-v20g41-20251216.vcf.gz.tbi' \
  -o 'hg38_ucsc.fa' 'https://ngi-igenomes.s3.amazonaws.com/igenomes/Homo_sapiens/UCSC/hg38/Sequence/WholeGenomeFasta/genome.fa' \
  -o 'hg38_ucsc.fa.fai' 'https://ngi-igenomes.s3.amazonaws.com/igenomes/Homo_sapiens/UCSC/hg38/Sequence/WholeGenomeFasta/genome.fa.fai' \
  -o 'hg38_ucsc.fa.amb' 'https://ngi-igenomes.s3.amazonaws.com/igenomes/Homo_sapiens/UCSC/hg38/Sequence/BWAIndex/genome.fa.amb' \
  -o 'hg38_ucsc.fa.ann' 'https://ngi-igenomes.s3.amazonaws.com/igenomes/Homo_sapiens/UCSC/hg38/Sequence/BWAIndex/genome.fa.ann' \
  -o 'hg38_ucsc.fa.bwt' 'https://ngi-igenomes.s3.amazonaws.com/igenomes/Homo_sapiens/UCSC/hg38/Sequence/BWAIndex/genome.fa.bwt' \
  -o 'hg38_ucsc.fa.pac' 'https://ngi-igenomes.s3.amazonaws.com/igenomes/Homo_sapiens/UCSC/hg38/Sequence/BWAIndex/genome.fa.pac' \
  -o 'hg38_ucsc.fa.sa' 'https://ngi-igenomes.s3.amazonaws.com/igenomes/Homo_sapiens/UCSC/hg38/Sequence/BWAIndex/genome.fa.sa' \
  -O 'https://storage.googleapis.com/gcp-public-data--broad-references/hg38/v0/Homo_sapiens_assembly38.dbsnp138.vcf.gz' \
  -O 'https://storage.googleapis.com/gcp-public-data--broad-references/hg38/v0/Homo_sapiens_assembly38.dbsnp138.vcf.gz.tbi'
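
Since the pipeline expects the samtools and bwa index files next to the reference FASTA, it can help to confirm that the download produced all of them before continuing. The following is a minimal sketch; the function name check_reference_index is hypothetical, and the list of extensions simply mirrors the index files fetched by the command above.

```shell
# Hypothetical helper: verify that a reference FASTA and its samtools (.fai)
# and bwa (.amb .ann .bwt .pac .sa) index files exist and are non-empty.
check_reference_index() {
    fasta=$1
    status=0
    if [ ! -s "$fasta" ]; then
        echo "missing or empty: $fasta" >&2
        status=1
    fi
    for ext in fai amb ann bwt pac sa; do
        if [ ! -s "${fasta}.${ext}" ]; then
            echo "missing or empty: ${fasta}.${ext}" >&2
            status=1
        fi
    done
    return "$status"
}

# Example:
# check_reference_index hg38_ucsc.fa && echo "reference looks complete"
```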

A model bundle file for the pipeline can be downloaded from the sentieon-models repository at https://github.com/Sentieon/sentieon-models.

Run the pangenome pipeline with FASTQ input

A single command runs the pangenome pipeline on the input files:

sentieon-cli sentieon-pangenome \
  -r hg38_ucsc.fa \
  --hapl hprc-v2.0-mc-grch38.hapl \
  --gbz hprc-v2.0-mc-grch38.gbz \
  -m SentieonIlluminaPangenomeRealignWGS1.0.bundle \
  --pop_vcf pop-v20g41-20251216.vcf.gz \
  --r1_fastq HG002.novaseq.pcr-free.30x.R1.fastq.gz \
  --r2_fastq HG002.novaseq.pcr-free.30x.R2.fastq.gz \
  --readgroup "@RG\tID:HG002-1\tSM:HG002\tLB:HG002-LB-1\tPL:ILLUMINA" \
  -b hg38_canonical.bed \
  --dbsnp Homo_sapiens_assembly38.dbsnp138.vcf.gz \
  --pcr_free \
  HG002_pangenome.vcf.gz

DNAscope uses different priors for finding significant indel variants depending on whether PCR was used during library preparation. The default setting is appropriate for samples sequenced with standard library preps. Setting --pcr_free uses a prior appropriate for samples sequenced using a PCR-free library prep.
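
The --readgroup value in the command above is a SAM @RG header line written with literal \t separators. When several samples or lanes are processed, building that string from variables can reduce typing errors. The sketch below is an illustration only; the function name build_readgroup and its argument order are assumptions, and the fields mirror the ones used in the example command.

```shell
# Hypothetical helper: print an @RG string with literal "\t" separators,
# in the same form as the --readgroup argument shown above.
# Arguments: read-group ID, sample name, library name, platform.
build_readgroup() {
    printf '@RG\\tID:%s\\tSM:%s\\tLB:%s\\tPL:%s\n' "$1" "$2" "$3" "$4"
}

# Example:
# build_readgroup HG002-1 HG002 HG002-LB-1 ILLUMINA
# prints: @RG\tID:HG002-1\tSM:HG002\tLB:HG002-LB-1\tPL:ILLUMINA
```

The result can be passed to the pipeline as --readgroup "$(build_readgroup HG002-1 HG002 HG002-LB-1 ILLUMINA)".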

Output files

With all optional features enabled, the pipeline outputs the following files:

  • HG002_pangenome.vcf.gz: SNV and indel calls in VCF format.

  • HG002_pangenome_bwa_deduped.cram: bwa-aligned, coordinate-sorted, and duplicate-marked read data from the input FASTQ file.

  • HG002_pangenome_mm2_deduped.cram: pangenome-aligned, coordinate-sorted, and duplicate-marked read data from the input FASTQ file. The reads in this file are aligned to the pangenome and lifted back to GRCh38.

  • HG002_pangenome_metrics/: A directory containing QC metrics for the analyzed sample.
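
After a run, the presence of the files listed above can be confirmed with a short script. This is a minimal sketch under a stated assumption: the naming rule (strip .vcf.gz from the output VCF name, then append _bwa_deduped.cram, _mm2_deduped.cram, and _metrics) is inferred from the example file names in this document and is not a documented guarantee. The function name check_outputs is hypothetical.

```shell
# Hypothetical helper: given the output VCF path, report any expected
# output that is missing. The naming rule is inferred from the example above.
check_outputs() {
    vcf=$1
    prefix=${vcf%.vcf.gz}
    status=0
    for path in "$vcf" \
                "${prefix}_bwa_deduped.cram" \
                "${prefix}_mm2_deduped.cram" \
                "${prefix}_metrics"; do
        if [ ! -e "$path" ]; then
            echo "missing: $path" >&2
            status=1
        fi
    done
    return "$status"
}

# Example:
# check_outputs HG002_pangenome.vcf.gz && echo "all outputs present"
```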

Limitations of the Sentieon® pangenome pipeline

The Sentieon® pangenome pipeline currently supports only Minigraph-Cactus pangenomes with a GRCh38 reference sequence, such as those generated by the Human Pangenome Reference Consortium (HPRC). Please reach out to Sentieon® support for information on using the pipeline with other pangenomes.