Sentieon Pangenome
Sentieon® Pangenome is a pipeline for alignment and variant calling from short-read DNA sequence data using pangenome graphs. The Pangenome pipeline leverages graph-based reference representations to improve alignment and variant calling accuracy, particularly in complex genomic regions with high sequence diversity.
The pipeline accepts as input un-aligned reads in FASTQ format or aligned reads in BAM or CRAM format. The pipeline will output variants in the VCF format and aligned reads in BAM or CRAM formats.
Setup
Prerequisites
Sentieon® software package version 202503.02 or higher.
samtools version 1.16 or higher.
vg for personalized pangenome creation and pangenome manipulation.
KMC version 3 or higher for k-mer counting.
bcftools version 1.22 or higher for VCF operations.
MultiQC.
MultiQC is used to generate a report from the pipeline metrics.
The executables will be accessed through the user’s PATH environment
variable.
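Because the pipeline locates its dependencies through PATH, a quick pre-flight check can catch a missing tool before a long run. The following is a minimal sketch: the function name `check_tools` is hypothetical, and the executable names are assumed from the prerequisite list rather than taken from the pipeline itself. It does not verify tool versions.

```shell
# Print one "missing: NAME" line for each executable not found on PATH.
# Returns 0 when every tool is present, 1 otherwise.
check_tools() {
    rc=0
    for tool in "$@"; do
        if ! command -v "$tool" >/dev/null 2>&1; then
            echo "missing: $tool"
            rc=1
        fi
    done
    return "$rc"
}

# Executable names are assumptions based on the prerequisites listed above.
if check_tools sentieon sentieon-cli samtools vg kmc bcftools multiqc; then
    echo "all prerequisite tools found on PATH"
fi
```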
The Reference genome
The Sentieon® Pangenome pipeline calls variants relative to a high-quality reference genome sequence. Besides the reference genome file, a samtools FASTA index file (.fai) must be present. Read alignment also requires bwa index files.
We recommend using a reference genome without alternate contigs.
Currently, only GRCh38 with UCSC-style contig names (chr1, chr2, etc.)
is supported.
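The index requirements above can be checked before launching the pipeline. This sketch assumes the standard bwa index suffixes (.amb, .ann, .bwt, .pac, .sa); the function name `missing_ref_indices` and the reference path are hypothetical.

```shell
# Print the path of every index file that is absent for the given FASTA.
# Assumes the standard samtools (.fai) and bwa (.amb .ann .bwt .pac .sa) suffixes.
missing_ref_indices() {
    ref=$1
    for ext in fai amb ann bwt pac sa; do
        if [ ! -e "$ref.$ext" ]; then
            echo "$ref.$ext"
        fi
    done
}

# Hypothetical reference path; replace with your own.
REF=GRCh38.fa
missing_ref_indices "$REF"

# The missing files can typically be created with:
#   samtools faidx "$REF"        # writes GRCh38.fa.fai
#   sentieon bwa index "$REF"    # writes the bwa index files
```

An empty output means all six index files were found next to the FASTA.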
Pangenome files
The pipeline requires the following pangenome files:
GBZ file: The pangenome graph in GBZ format.
Haplotype file: Pangenome haplotype information.
The pangenome needs to incorporate GRCh38 as a reference sequence.
The sentieon-cli will check that the .gbz and .hapl files follow the
HPRC naming convention, ending with grch38.gbz and grch38.hapl,
respectively. This check can be disabled using the hidden argument,
--skip_pangenome_name_checks.
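The naming convention can be verified ahead of time with a simple suffix match. The sketch below only approximates the check described above and is not the sentieon-cli implementation; the function name `check_pangenome_names` and the file names are hypothetical.

```shell
# Return 0 when both names end with the suffixes described above.
# This approximates the check; it is not sentieon-cli's own implementation.
check_pangenome_names() {
    gbz=$1
    hapl=$2
    case "$gbz" in
        *grch38.gbz) ;;
        *) echo "GBZ name does not end with grch38.gbz: $gbz"; return 1 ;;
    esac
    case "$hapl" in
        *grch38.hapl) ;;
        *) echo "haplotype name does not end with grch38.hapl: $hapl"; return 1 ;;
    esac
    echo "pangenome file names look fine"
}

# Hypothetical file names for illustration.
check_pangenome_names pangenome-grch38.gbz pangenome-grch38.hapl
```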
Model bundle
A Sentieon model bundle containing machine learning models for variant calling is required. Model bundle files can be found in the sentieon-models repository.
Population VCF
The pipeline requires a population VCF file as input. The population VCF must match the model bundle that is used by the pipeline.
Optional input files
Additional input files are required for optional functionality:
A BED file containing variant calling intervals. Recommended for whole-genome sequencing data to restrict variant calling to the canonical contigs. Recommended for whole-exome sequencing data to restrict variant calling to target regions.
The Single Nucleotide Polymorphism database (dbSNP) data that you want to include in the pipeline. The data is used in the form of a VCF file; you can use a VCF file compressed with bgzip and indexed.
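One way to obtain a BED file covering the canonical contigs is to derive it from the reference .fai index, whose first two columns hold the contig name and length. This is a sketch under assumptions: the function name `canonical_bed` is hypothetical, and treating chr1 to chr22, chrX, chrY, and chrM as the canonical set is a choice you may want to adjust.

```shell
# Write a BED line (contig, 0, length) for each canonical contig in a .fai index.
# Treating chr1-chr22, chrX, chrY, and chrM as canonical is an assumption;
# adjust the pattern to your needs.
canonical_bed() {
    awk 'BEGIN { OFS = "\t" }
         $1 ~ /^chr([0-9]+|X|Y|M)$/ { print $1, 0, $2 }' "$1"
}

# Hypothetical usage:
#   canonical_bed GRCh38.fa.fai > canonical_contigs.bed
```

The anchored pattern excludes unplaced and random contigs such as those with a `_random` suffix.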
Usage
Alignment and variant calling from FASTQ
A single command is run to perform alignment, preprocessing and metrics collection, and SNV and indel calling from FASTQ files:
sentieon-cli sentieon-pangenome [-h] \
-r REFERENCE \
--hapl HAPL \
--gbz GBZ \
--r1_fastq R1_FASTQ ... \
--r2_fastq R2_FASTQ ... \
--readgroups READGROUP \
--pop_vcf POP_VCF \
-m MODEL_BUNDLE \
[-b INTERVAL_FILE] \
[--bam_format] \
[-d DBSNP] \
[--dry_run] \
[-t NUMBER_THREADS] \
[--pcr_free] \
SAMPLE_VCF
With FASTQ input, the Sentieon® pangenome pipeline requires the following arguments:
-r REFERENCE: the location of the reference FASTA file. A reference FASTA index (".fai" file) and bwa index files are also required.
--hapl HAPL: the location of the pangenome haplotype file.
--gbz GBZ: the location of the pangenome graph file in GBZ format.
--r1_fastq R1_FASTQ: the R1 input FASTQ. Can be used multiple times. Each R1_FASTQ must have a corresponding R2_FASTQ file. All FASTQ files are expected to be from the same sample.
--r2_fastq R2_FASTQ: the R2 input FASTQ. Can be used multiple times.
--readgroup READGROUP: readgroup information for the sample. A single readgroup can be supplied for the sample. An example argument is --readgroup "@RG\tID:HG002-1\tSM:HG002\tLB:HG002-LB-1\tPL:ILLUMINA"
--pop_vcf POP_VCF: the location of the population VCF file.
-m MODEL_BUNDLE: the location of the model bundle. Model bundle files can be found in the sentieon-models repository.
SAMPLE_VCF: the location of the output VCF file for SNVs and indels. The pipeline requires that the output file name end with the suffix .vcf.gz. The file path without the suffix will be used as the basename for other output files.
The Sentieon® pangenome pipeline accepts the following optional arguments:
-b INTERVAL_FILE: interval in the reference to restrict variant calling, in BED file format. Supplying this file will limit variant calling to the intervals inside the INTERVAL_FILE. We recommend using an INTERVAL_FILE to restrict variant calling to canonical contigs.
--bam_format: use BAM format instead of CRAM for output aligned files.
-d DBSNP: the location of the Single Nucleotide Polymorphism database (dbSNP) used to label known variants in VCF (.vcf) or bgzip compressed VCF (.vcf.gz) format. Only one file is supported. Supplying this file will annotate variants with their dbSNP refSNP ID numbers. A VCF index file is required.
--dry_run: print the pipeline commands, but do not actually execute them.
-t NUMBER_THREADS: number of computing threads that will be used by the software to run parallel processes. The argument is optional; if omitted, the pipeline will use as many threads as the server has.
--pcr_free: call variants using --pcr_indel_model NONE, which is appropriate for libraries prepared with a PCR-free library prep. Deduplication is still performed to identify optical duplicates.
-h: print the command-line help and exit.
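Putting the required and optional arguments together, a first run with --dry_run is a low-risk way to inspect what the pipeline would do. The sketch below follows the argument order and flag spelling of the synopsis above; every file name, the readgroup values, and the thread count are hypothetical placeholders.

```shell
# All file names below are hypothetical placeholders.
REF=GRCh38.fa
HAPL=pangenome-grch38.hapl
GBZ=pangenome-grch38.gbz
R1=HG002_R1.fastq.gz
R2=HG002_R2.fastq.gz
RG='@RG\tID:HG002-1\tSM:HG002\tLB:HG002-LB-1\tPL:ILLUMINA'
POP_VCF=population.vcf.gz
MODEL=model.bundle
BED=canonical_contigs.bed
OUT=HG002.vcf.gz   # must end in .vcf.gz; "HG002" becomes the output basename

if command -v sentieon-cli >/dev/null 2>&1; then
    # --dry_run prints the pipeline commands without executing them.
    # The readgroup flag is spelled as in the synopsis above.
    sentieon-cli sentieon-pangenome \
        -r "$REF" \
        --hapl "$HAPL" \
        --gbz "$GBZ" \
        --r1_fastq "$R1" \
        --r2_fastq "$R2" \
        --readgroups "$RG" \
        --pop_vcf "$POP_VCF" \
        -m "$MODEL" \
        -b "$BED" \
        --dry_run \
        -t 16 \
        "$OUT"
else
    echo "sentieon-cli not found on PATH; nothing was run"
fi
```

Removing --dry_run from the command would execute the pipeline instead of printing its commands.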
Variant calling from sorted BAM or CRAM
A single command is run to perform alignment, preprocessing and metrics collection, and SNV and indel calling from BAM or CRAM files:
sentieon-cli sentieon-pangenome [-h] \
-r REFERENCE \
--hapl HAPL \
--gbz GBZ \
-i SAMPLE_INPUT ... \
--pop_vcf POP_VCF \
-m MODEL_BUNDLE \
[-b INTERVAL_FILE] \
[-d DBSNP] \
[--dry_run] \
[-t NUMBER_THREADS] \
SAMPLE_VCF
With BAM or CRAM input, the pangenome pipeline requires the following new arguments:
-i SAMPLE_INPUT: the input sample file in uBAM or uCRAM format. One or more files can be supplied by passing multiple files after the -i argument.
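As an illustration of passing several files after -i, the sketch below supplies two inputs for one sample and keeps the argument order of the synopsis above. All file names are hypothetical placeholders.

```shell
# Hypothetical inputs: two unaligned BAM files from the same sample.
INPUTS="HG002_lane1.bam HG002_lane2.bam"

if command -v sentieon-cli >/dev/null 2>&1; then
    # $INPUTS is intentionally unquoted so each file becomes its own argument;
    # this assumes the paths contain no whitespace.
    sentieon-cli sentieon-pangenome \
        -r GRCh38.fa \
        --hapl pangenome-grch38.hapl \
        --gbz pangenome-grch38.gbz \
        -i $INPUTS \
        --pop_vcf population.vcf.gz \
        -m model.bundle \
        --dry_run \
        HG002.vcf.gz
else
    echo "sentieon-cli not found on PATH; nothing was run"
fi
```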
Pipeline output
List of output files
The following files are output when processing WGS FASTQ with all features if
the output file is sample.vcf.gz:
sample.vcf.gz: SNV and indel variant calls across the regions of the genome as defined in the -b BED file.
sample_bwa_deduped.cram or sample_bwa_deduped.bam: bwa aligned, coordinate-sorted and duplicate-marked read data from the input FASTQ file(s).
sample_mm2_deduped.cram or sample_mm2_deduped.bam: pangenome aligned, coordinate-sorted and duplicate-marked read data from the input FASTQ files. The reads in this file are aligned to the pangenome and lifted back to the reference genome.
sample_metrics/: A directory containing QC metrics for the analyzed sample.
sample_metrics/sample.txt.alignment_stat.txt: Metrics from the Sentieon® AlignmentStat algo.
sample_metrics/sample.txt.base_distribution_by_cycle.txt: Metrics from the Sentieon® BaseDistributionByCycle algo.
sample_metrics/coverage*: Coverage metrics from the Sentieon® CoverageMetrics algo.
sample_metrics/sample.txt.gc_bias*: Metrics from the Sentieon® GCBias algo.
sample_metrics/sample.txt.insert_size.txt: Metrics from the Sentieon® InsertSizeMetricAlgo algo.
sample_metrics/sample.txt.mean_qual_by_cycle.txt: Metrics from the Sentieon® MeanQualityByCycle algo.
sample_metrics/sample.txt.qual_distribution.txt: Metrics from the Sentieon® QualDistribution algo.
sample_metrics/sample.txt.wgs.txt: Metrics from the Sentieon® WgsMetricsAlgo algo.
sample_metrics/multiqc_report.html: Collected QC metrics aggregated by MultiQC.
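Since the other outputs take their basename from the SAMPLE_VCF path, the top-level output paths can be predicted with plain suffix stripping. This is a sketch based only on the names in the list above; the function name `expected_outputs` is hypothetical, and it does not enumerate index files or the individual files inside the metrics directory.

```shell
# Print the top-level output paths expected for a given SAMPLE_VCF argument.
# Pass "bam" as the second argument to mirror --bam_format; the default is cram.
expected_outputs() {
    vcf=$1
    fmt=${2:-cram}
    base=${vcf%.vcf.gz}
    echo "$vcf"
    echo "${base}_bwa_deduped.$fmt"
    echo "${base}_mm2_deduped.$fmt"
    echo "${base}_metrics/"
}

# Hypothetical output location.
expected_outputs results/sample.vcf.gz
```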
Limitations of the Sentieon® pangenome pipeline
The Sentieon® pangenome pipeline currently only supports Minigraph-Cactus pangenomes with a GRCh38 reference sequence, such as those generated by the Human Pangenome Reference Consortium (HPRC). Please reach out to Sentieon® support for information on using the pipeline with other pangenomes.