Sentieon-cli

The sentieon-cli provides complete implementations of Sentieon® pipelines in single commands. The sentieon-cli is designed to improve ease-of-use for complex multi-stage alignment and variant calling pipelines.

Instructions for installing the sentieon-cli can be found on GitHub at https://github.com/Sentieon/sentieon-cli.

DNAscope

Sentieon® DNAscope is a pipeline for alignment and germline variant calling (SNVs, SVs, indels, and CNVs) from short-read DNA sequence data. The sentieon-cli provides a complete implementation of the DNAscope pipeline in a single command. The pipeline is designed for accuracy, efficiency, and ease-of-use.

The DNAscope pipeline uses a combination of Bayesian statistical approaches and machine learning to achieve high variant calling accuracy. The DNAscope pipeline supports samples sequenced using whole-genome or targeted (hybrid-capture) enrichment library preps.

The pipeline accepts as input aligned reads in BAM or CRAM format, or un-aligned reads in FASTQ, uBAM, or uCRAM format. The pipeline will output variants in the VCF (or gVCF) formats and aligned reads in BAM or CRAM formats.

Setup

Prerequisites

  • Sentieon® software package version 202308 or higher.

  • samtools version 1.16 or higher for alignment of reads in uBAM or uCRAM format or re-alignment of previously aligned reads.

  • MultiQC version 1.18 or higher for metrics report generation.

The sentieon, samtools, and multiqc executables will be accessed through the user’s PATH environment variable.
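
Because the pipeline looks up its tools through PATH, it can help to confirm they resolve before starting a long run. The following is a minimal sketch; the function name check_tools is hypothetical and not part of the sentieon-cli.

```shell
# Report any required executables that cannot be found on PATH.
# Prints one line per missing tool to stderr and returns non-zero
# if at least one tool is missing.
check_tools() {
  missing=0
  for tool in "$@"; do
    if ! command -v "$tool" >/dev/null 2>&1; then
      echo "missing executable: $tool" >&2
      missing=1
    fi
  done
  return "$missing"
}

# Example for the DNAscope prerequisites listed above:
#   check_tools sentieon samtools multiqc
```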

The Reference genome

DNAscope will call variants present in the sample relative to a high quality reference genome sequence. Besides the reference genome file, a samtools fasta index file (.fai) needs to be present. Read alignment also requires bwa index files.

We recommend aligning to a reference genome without alternate contigs. If alternate contigs are present in the genome, please also supply a “.alt” file to activate alt-aware alignment in bwa.
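
The index files described above can be checked with a short helper before running the pipeline. This sketch assumes the bwa index uses the five standard suffixes written by `bwa index` (.amb, .ann, .bwt, .pac, .sa); the function name is hypothetical, and the optional ".alt" file is not checked.

```shell
# Print the path of every index file that is missing next to a reference FASTA.
# Prints nothing when the .fai index and all bwa index files are present.
missing_reference_files() {
  ref=$1
  for suffix in .fai .amb .ann .bwt .pac .sa; do
    [ -e "${ref}${suffix}" ] || echo "${ref}${suffix}"
  done
  return 0
}

# Example:
#   missing_reference_files /path/to/reference.fa
```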

Usage

Alignment and variant calling from FASTQ

A single command is run to align, preprocess, and call SNVs, indels, and structural variants from FASTQ:

sentieon-cli dnascope [-h] \
  -r REFERENCE \
  --r1_fastq R1_FASTQ ... \
  --r2_fastq R2_FASTQ ... \
  --readgroups READGROUPS ... \
  -m MODEL_BUNDLE \
  [-d DBSNP] \
  [-b INTERVAL_FILE] \
  [--interval_padding INTERVAL_PADDING] \
  [-t NUMBER_THREADS] \
  [--pcr_free] \
  [-g] \
  [--duplicate_marking DUP_MARKING] \
  [--assay ASSAY] \
  [--consensus] \
  [--dry_run] \
  [--bam_format] \
  SAMPLE_VCF

With FASTQ input, the DNAscope pipeline requires the following arguments:

  • -r REFERENCE: the location of the reference FASTA file. A reference fasta index, “.fai” file, and bwa index files, are also required.

  • --r1_fastq R1_FASTQ: the R1 input FASTQ. Can be used multiple times. R1_FASTQ files without a corresponding R2_FASTQ are assumed to be single-ended. Be aware that the pipeline performs single-sample processing, and all fastq are expected to be from the same sample.

  • --r2_fastq R2_FASTQ: the R2 input FASTQ. Can be used multiple times.

  • --readgroups READGROUPS: readgroup information for each FASTQ. The pipeline expects the same number of arguments to --r1_fastq and --readgroups.

    An example argument is, --readgroups "@RG\tID:HG002-1\tSM:HG002\tLB:HG002-LB-1\tPL:ILLUMINA"

  • -m MODEL_BUNDLE: the location of the model bundle. Model bundle files can be found in the sentieon-models repository.

  • SAMPLE_VCF: the location of the output VCF file for SNVs and indels. The pipeline requires the output file end with the suffix, .vcf.gz. The file path without the suffix will be used as the basename for other output files.

The DNAscope pipeline accepts the following optional arguments:

  • -d DBSNP: the location of the Single Nucleotide Polymorphism database (dbSNP) used to label known variants in VCF (.vcf) or bgzip compressed VCF (.vcf.gz) format. Only one file is supported. Supplying this file will annotate variants with their dbSNP refSNP ID numbers. A VCF index file is required.

  • -b INTERVAL_FILE: interval in the reference to restrict variant calling, in BED file format. Supplying this file will limit variant calling to the intervals inside the BED file. If a BED file is not supplied, the software will process the whole genome.

  • --interval_padding INTERVAL_PADDING: adds INTERVAL_PADDING bases padding to the edges of the input intervals. The default value is 0.

  • -t NUMBER_THREADS: number of computing threads that will be used by the software to run parallel processes. The argument is optional; if omitted, the pipeline will use as many threads as the server has.

  • --pcr_free: Call variants using --pcr_indel_model NONE, which is appropriate for libraries prepared with a PCR-free library prep. Deduplication is still performed to identify optical duplicates.

  • -g: output variants in the gVCF format, in addition to the VCF output file. The tool will output a bgzip compressed gVCF file with a corresponding index file.

  • --duplicate_marking DUP_MARKING: setting for duplicate marking. markdup will mark duplicate reads. rmdup will remove duplicate reads. none will skip duplicate marking. The default setting is markdup.

  • --assay ASSAY: assay setting for metrics collection, either WGS or WES. The default setting is WGS.

  • --consensus: generate consensus reads during duplicate marking.

  • -h: print the command-line help and exit.

  • --dry_run: print the pipeline commands, but do not actually execute them.

  • --bam_format: use BAM format instead of CRAM for output aligned files.
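
As an illustration of the arguments above, the sketch below wraps a two-lane, paired-end invocation in a function. Every file name, the model bundle path, the readgroup values, and the thread count are placeholders chosen for the example. The --dry_run flag is included so that the pipeline commands are printed rather than executed.

```shell
# Two-lane paired-end example. One readgroup string is supplied per R1 FASTQ,
# and both readgroups share the same SM tag because the pipeline processes
# a single sample. All paths and values are placeholders.
dnascope_fastq_example() {
  sentieon-cli dnascope \
    -r reference.fa \
    --r1_fastq lane1_R1.fastq.gz lane2_R1.fastq.gz \
    --r2_fastq lane1_R2.fastq.gz lane2_R2.fastq.gz \
    --readgroups \
      '@RG\tID:HG002-1\tSM:HG002\tLB:HG002-LB-1\tPL:ILLUMINA' \
      '@RG\tID:HG002-2\tSM:HG002\tLB:HG002-LB-1\tPL:ILLUMINA' \
    -m model.bundle \
    -t 16 \
    --dry_run \
    sample.vcf.gz
}
```

Removing --dry_run from the function would run the pipeline for real.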

Alignment and variant calling from uBAM or uCRAM

A single command is run to align, preprocess, and call SNVs, indels, and structural variants from uBAM or uCRAM files:

sentieon-cli dnascope [-h] \
  -r REFERENCE \
  -i SAMPLE_INPUT ... \
  --align \
  [--input_ref INPUT_REF] \
  -m MODEL_BUNDLE \
  [-d DBSNP] \
  [-b BED] \
  [--interval_padding INTERVAL_PADDING] \
  [-t NUMBER_THREADS] \
  [--pcr_free] \
  [-g] \
  [--duplicate_marking DUP_MARKING] \
  [--assay ASSAY] \
  [--consensus] \
  [--dry_run] \
  [--bam_format] \
  SAMPLE_VCF

With uBAM or uCRAM input, the DNAscope pipeline requires the following new arguments:

  • -i SAMPLE_INPUT: the input sample file in uBAM or uCRAM format. One or more files can be supplied by passing multiple files after the -i argument.

  • --align: directs the pipeline to align the input reads.

The DNAscope pipeline accepts the following new optional arguments:

  • --input_ref INPUT_REF: a reference fasta used for decoding the input file(s). Required with uCRAM input. Can be different from the fasta used with the -r argument.

Alignment and variant calling from coordinate-sorted BAM or CRAM

A single command is run to align, preprocess, and call SNVs, indels, and structural variants from BAM or CRAM files:

sentieon-cli dnascope [-h] \
  -r REFERENCE \
  -i SAMPLE_INPUT ... \
  --collate_align \
  [--input_ref INPUT_REF] \
  -m MODEL_BUNDLE \
  [-d DBSNP] \
  [-b BED] \
  [--interval_padding INTERVAL_PADDING] \
  [-t NUMBER_THREADS] \
  [--pcr_free] \
  [-g] \
  [--duplicate_marking DUP_MARKING] \
  [--assay ASSAY] \
  [--consensus] \
  [--dry_run] \
  [--bam_format] \
  SAMPLE_VCF

With BAM or CRAM input, the DNAscope pipeline requires the following new arguments:

  • --collate_align: directs the pipeline to collate and then align the input reads.

Variant calling from sorted BAM or CRAM

A single command is run to preprocess and call SNVs, indels, and structural variants from BAM or CRAM files:

sentieon-cli dnascope [-h] \
  -r REFERENCE \
  -i SAMPLE_INPUT ... \
  -m MODEL_BUNDLE \
  [-d DBSNP] \
  [-b BED] \
  [--interval_padding INTERVAL_PADDING] \
  [-t NUMBER_THREADS] \
  [--pcr_free] \
  [-g] \
  [--duplicate_marking DUP_MARKING] \
  [--assay ASSAY] \
  [--consensus] \
  [--dry_run] \
  [--bam_format] \
  SAMPLE_VCF

Not supplying the --align and --collate_align arguments will direct the pipeline to call variants directly from the input reads.
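
The three -i input modes differ only in the alignment flag that is passed. A small sketch of that mapping follows; the mode labels (unaligned, realign, aligned) and the function name are hypothetical names used for illustration.

```shell
# Map how the -i input should be treated to the DNAscope alignment flag.
#   unaligned -> --align          (uBAM or uCRAM input)
#   realign   -> --collate_align  (BAM or CRAM that should be re-aligned)
#   aligned   -> no flag          (call variants directly from sorted BAM or CRAM)
alignment_flag() {
  case "$1" in
    unaligned) echo "--align" ;;
    realign)   echo "--collate_align" ;;
    aligned)   : ;;
    *) echo "unknown input mode: $1" >&2; return 1 ;;
  esac
}

# Example:
#   alignment_flag realign    # prints --collate_align
```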

Pipeline output

List of output files

The following files are output when processing WGS FASTQ with default arguments if the output file is sample.vcf.gz:

  • sample.vcf.gz: SNV and indel variant calls across the regions of the genome as defined in the -b BED file.

  • sample_deduped.cram or sample_deduped.bam: aligned, coordinate-sorted and duplicate-marked read data from the input FASTQ files.

  • sample_svs.vcf.gz: structural variant calls from DNAscope and SVSolver.

  • sample_metrics: a directory containing QC metrics for the analyzed sample.

  • sample_metrics/coverage*: coverage metrics for the processed sample. Only available for WGS samples.

  • sample_metrics/{sample}.txt.alignment_stat.txt: Metrics from the AlignmentStat algo.

  • sample_metrics/{sample}.txt.base_distribution_by_cycle.txt: Metrics from the BaseDistributionByCycle algo.

  • sample_metrics/{sample}.txt.dedup_metrics.txt: Metrics from the Dedup algo.

  • sample_metrics/{sample}.txt.gc_bias*: Metrics from the GCBias algo. Only available for WGS samples.

  • sample_metrics/{sample}.txt.insert_size.txt: Metrics from the InsertSizeMetricAlgo algo.

  • sample_metrics/{sample}.txt.mean_qual_by_cycle.txt: Metrics from the MeanQualityByCycle algo.

  • sample_metrics/{sample}.txt.qual_distribution.txt: Metrics from the QualDistribution algo.

  • sample_metrics/{sample}.txt.wgs.txt: Metrics from the WgsMetricsAlgo algo. Only available for WGS samples.

  • sample_metrics/{sample}.txt.hybrid-selection.txt: Metrics from the HsMetricAlgo algo.

  • sample_metrics/multiqc_report.html: collected QC metrics aggregated by MultiQC.
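
Since the output basename is derived from the SAMPLE_VCF argument, the main output paths can be predicted ahead of a run. The helper below is a sketch with a hypothetical name. It assumes default arguments, so the aligned reads are written as CRAM; with --bam_format the aligned file would end in .bam instead.

```shell
# Print the main output paths for a given SAMPLE_VCF argument.
# Returns non-zero if the argument does not end in .vcf.gz, which the
# pipeline requires.
expected_outputs() {
  case "$1" in
    *.vcf.gz) base=${1%.vcf.gz} ;;
    *) echo "output must end in .vcf.gz: $1" >&2; return 1 ;;
  esac
  printf '%s\n' \
    "${base}.vcf.gz" \
    "${base}_deduped.cram" \
    "${base}_svs.vcf.gz" \
    "${base}_metrics"
}

# Example:
#   expected_outputs results/sample.vcf.gz
```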

DNAscope LongRead

Sentieon® DNAscope LongRead is a pipeline for alignment and germline variant calling (SNVs, SVs, CNVs, and indels) from long-read sequence data. The DNAscope LongRead pipeline is able to take advantage of longer read lengths to perform quick and accurate variant calling using specially calibrated machine learning models. The sentieon-cli provides a complete implementation of the DNAscope LongRead pipeline in a single command. The pipeline is designed for accuracy, efficiency, and ease-of-use.

The pipeline will accept as input aligned reads in BAM or CRAM format, or un-aligned reads in FASTQ, uBAM, or uCRAM format. The pipeline will output variants in the VCF (or gVCF) formats and aligned reads in BAM or CRAM formats.

Setup

Prerequisites

  • Sentieon® software package version 202308.01 or higher.

  • Python version 3.8 or higher.

  • bcftools version 1.10 or higher for the SNV and indel calling pipeline.

  • bedtools for the SNV and indel calling pipeline.

  • samtools version 1.16 or higher for alignment of read data in uBAM or uCRAM format or re-alignment of previously aligned reads.

  • mosdepth version 0.2.6 or higher for coverage metrics of long-read data.

  • hificnv version 1.0.0 or higher for CNV calling.

The sentieon, python, bcftools, bedtools, samtools, hificnv, and mosdepth executables will be accessed through the user’s PATH environment variable.
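
Several prerequisites are stated as "version X or higher". A minimal sketch of a dotted-version comparison is shown below; the function name is hypothetical, and it assumes every field is purely numeric (a suffix such as "-dev" is not handled).

```shell
# Return 0 if dotted version $1 is greater than or equal to $2.
# Fields are compared numerically from left to right; a missing field
# counts as 0, so 1.16 and 1.16.0 compare as equal.
version_ge() {
  a=$1
  b=$2
  while [ -n "$a" ] || [ -n "$b" ]; do
    x=${a%%.*}
    y=${b%%.*}
    [ "${x:-0}" -gt "${y:-0}" ] && return 0
    [ "${x:-0}" -lt "${y:-0}" ] && return 1
    case "$a" in *.*) a=${a#*.} ;; *) a= ;; esac
    case "$b" in *.*) b=${b#*.} ;; *) b= ;; esac
  done
  return 0
}

# Example, using the version printed on the first line of `samtools --version`:
#   version_ge "$(samtools --version | head -n 1 | awk '{print $2}')" 1.16
```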

The Reference genome

DNAscope LongRead will call variants present in the sample relative to a high quality reference genome sequence. Besides the reference genome file, a samtools fasta index file (.fai) needs to be present.

We recommend aligning to a reference genome without alternate contigs.

Usage

Alignment and variant calling from FASTQ

A single command is run to align and call SNVs, indels and structural variants from PacBio HiFi or ONT reads in the FASTQ format:

sentieon-cli dnascope-longread [-h] \
  -r REFERENCE \
  --fastq INPUT_FASTQ ... \
  --readgroups READGROUP ... \
  -m MODEL_BUNDLE \
  [-d DBSNP] \
  [-b DIPLOID_BED] \
  [-t NUMBER_THREADS] \
  [-g] \
  --tech HiFi|ONT \
  [--haploid_bed HAPLOID_BED] \
  [--cnv_excluded_regions CNV_EXCLUDE_BED] \
  SAMPLE_VCF

With FASTQ input, the DNAscope LongRead pipeline requires the following arguments:

  • -r REFERENCE: the location of the reference FASTA file. A reference fasta index, “.fai” file, is also required.

  • --fastq INPUT_FASTQ: the input sample file in FASTQ format. One or more files can be supplied by passing multiple files after the --fastq argument.

  • --readgroups READGROUP: readgroup information for the read data. The --readgroups argument is required if the input data is in the FASTQ format. This argument expects complete readgroup strings and these strings will be passed to minimap2 through the -R argument.

    An example argument is, --readgroups '@RG\tID:foo\tSM:bar'.

  • -m MODEL_BUNDLE: the location of the model bundle. Model bundle files can be found in the sentieon-models repository.

  • --tech HiFi|ONT: Sequencing technology used to generate the reads. Supported arguments are ONT or HiFi.

  • SAMPLE_VCF: the location of the output VCF file for SNVs and indels. The pipeline requires the output file end with the suffix, “.vcf.gz”. The file path without the suffix will be used as the basename for other output files.

The DNAscope LongRead pipeline accepts the following optional arguments:

  • -d DBSNP: the location of the Single Nucleotide Polymorphism database (dbSNP) used to label known variants in VCF (.vcf) or bgzip compressed VCF (.vcf.gz) format. Only one file is supported. Supplying this file will annotate variants with their dbSNP refSNP ID numbers. A VCF index file is required.

  • -b DIPLOID_BED: interval in the reference to restrict diploid variant calling, in BED file format. Supplying this file will limit diploid variant calling to the intervals inside the BED file.

  • --haploid_bed HAPLOID_BED: interval in the reference to restrict haploid variant calling, in BED file format. Supplying this file will perform haploid variant calling across the intervals inside the BED file.

  • --cnv_excluded_regions CNV_EXCLUDE_BED: a BED file of excluded CNV regions passed to hificnv. See the hificnv documentation for more details at https://github.com/PacificBiosciences/HiFiCNV.

  • -t NUMBER_THREADS: number of computing threads that will be used by the software to run parallel processes. The argument is optional; if omitted, the pipeline will use as many threads as the server has.

  • -g: output variants in the gVCF format, in addition to the VCF output file. The tool will output a bgzip compressed gVCF file with a corresponding index file.

  • -h: print the command-line help and exit.

  • --dry_run: print the pipeline commands, but do not actually execute them.
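
The sketch below illustrates a FASTQ invocation of the LongRead pipeline together with a guard on the --tech value. The function names, file names, and model bundle path are placeholders, and treating the --tech values as case-sensitive is an assumption of this example.

```shell
# Accept only the two --tech values named in the usage above.
validate_tech() {
  case "$1" in
    HiFi|ONT) return 0 ;;
    *) echo "unsupported --tech value: $1 (expected HiFi or ONT)" >&2; return 1 ;;
  esac
}

# Print the pipeline commands (--dry_run) for a single FASTQ file.
# $1: sequencing technology, HiFi or ONT. All paths are placeholders.
longread_fastq_example() {
  tech=$1
  validate_tech "$tech" || return 1
  sentieon-cli dnascope-longread \
    -r reference.fa \
    --fastq sample.fastq.gz \
    --readgroups '@RG\tID:foo\tSM:bar' \
    -m model.bundle \
    --tech "$tech" \
    --dry_run \
    sample.vcf.gz
}
```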

Alignment and variant calling from uBAM, uCRAM, BAM, or CRAM

A single command is run to align and call SNVs, indels, and structural variants from PacBio HiFi or ONT reads in the uBAM, uCRAM, BAM, or CRAM formats:

sentieon-cli dnascope-longread [-h] \
  -r REFERENCE \
  -i SAMPLE_INPUT ... \
  --align \
  -m MODEL_BUNDLE \
  [-d DBSNP] \
  [-b DIPLOID_BED] \
  [-t NUMBER_THREADS] \
  [-g] \
  --tech HiFi|ONT \
  [--haploid_bed HAPLOID_BED] \
  [--cnv_excluded_regions CNV_EXCLUDE_BED] \
  [--input_ref INPUT_REF] \
  SAMPLE_VCF

With uBAM, uCRAM, BAM, or CRAM input, the DNAscope LongRead pipeline requires the following new arguments:

  • -i SAMPLE_INPUT: the input sample file in uBAM, uCRAM, BAM, or CRAM format. One or more files can be supplied by passing multiple files after the -i argument.

  • --align: re-align the input read data to the reference genome using Sentieon® minimap2.

The DNAscope LongRead pipeline accepts the following new optional arguments:

  • --input_ref INPUT_REF: a reference fasta used for decoding the input file(s). Required with uCRAM or CRAM input. Can be different from the fasta used with the -r argument.

Variant calling from BAM or CRAM

A single command is run to call SNVs, indels, and structural variants from PacBio HiFi or ONT reads in the BAM or CRAM formats:

sentieon-cli dnascope-longread [-h] \
  -r REFERENCE \
  -i SAMPLE_INPUT ... \
  -m MODEL_BUNDLE \
  [-d DBSNP] \
  [-b DIPLOID_BED] \
  [-t NUMBER_THREADS] \
  [-g] \
  --tech HiFi|ONT \
  [--haploid_bed HAPLOID_BED] \
  [--cnv_excluded_regions CNV_EXCLUDE_BED] \
  SAMPLE_VCF

Not supplying the --align argument will direct the pipeline to call variants directly from the input reads.

Pipeline output

List of output files

The following files are output when processing FASTQ data or uBAM, uCRAM, BAM, or CRAM files with the --align argument:

  • sample.vcf.gz: SNV and indel variant calls across the regions of the genome as defined in the -b DIPLOID_BED file.

  • sample.sv.vcf.gz: structural variant calls from the Sentieon® LongReadSV tool.

  • sample_mm2_sorted_fq_*.cram: aligned and coordinate-sorted reads from the input FASTQ files.

  • sample_mm2_sorted_*.cram: aligned and coordinate-sorted reads from the input uBAM, uCRAM, BAM, or CRAM files.

  • sample.hificnv: the base name of HiFiCNV output files (with HiFi input).

Other considerations

Diploid and haploid variant calling

The default pipeline is recommended for use with samples from diploid organisms. For samples with both diploid and haploid chromosomes, the -b DIPLOID_BED option can be used to limit diploid variant calling to diploid chromosomes and the --haploid_bed HAPLOID_BED argument can be used to perform haploid variant calling across haploid chromosomes.

Diploid and haploid BED files for the human hg38 reference genome (with male samples) can be found in the /data folder in this repository.
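
For other references, whole-contig BED files can be derived from the .fai index. The helper below is a rough sketch with a hypothetical name. It treats each listed contig as entirely diploid or entirely haploid, which ignores pseudoautosomal regions, so the BED files provided in the repository remain the better choice for hg38.

```shell
# Write whole-contig BED intervals for selected contigs of a .fai index.
# $1: path to the .fai file; remaining arguments: contig names to include.
# The first two .fai columns are the contig name and its length.
fai_to_bed() {
  fai=$1
  shift
  for contig in "$@"; do
    awk -v c="$contig" -F '\t' '$1 == c { printf "%s\t0\t%s\n", $1, $2 }' "$fai"
  done
}

# Example for a male sample (contig names are assumptions about the reference):
#   fai_to_bed reference.fa.fai chr1 chr2 chr3 > diploid.bed
#   fai_to_bed reference.fa.fai chrX chrY > haploid.bed
```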

Modification

Scripts in this repository are made available under the BSD 2-Clause license.

The Python scripts in the sentieon_cli/scripts folder perform low-level manipulation of intermediate gVCF and VCF files generated by the pipeline. Due to the low-level data handling performed by these scripts, modification of these files by users is discouraged.

References

Sentieon DNAscope LongRead – A highly Accurate, Fast, and Efficient Pipeline for Germline Variant Calling from PacBio HiFi reads - A preprint describing the DNAscope LongRead pipeline for calling variants from PacBio HiFi data.

DNAscope Hybrid

Sentieon® DNAscope Hybrid is a pipeline for germline variant calling using combined short-read and long-read data from a single sample. The DNAscope Hybrid pipeline is able to utilize the strengths of both short and long-read technologies to generate variant callsets that are more accurate than those from either short-read or long-read data alone. The sentieon-cli provides a complete implementation of the DNAscope Hybrid pipeline in a single command.

The pipeline supports input data in the following formats; both short-read and long-read input are required:

  • Unaligned short-read data in gzipped FASTQ format.

  • Aligned short-reads in BAM or CRAM format.

  • Unaligned long-read data in the uBAM or uCRAM format.

  • Aligned long-read data in BAM or CRAM format.

By default, the pipeline will generate the following output files:

  • Small variants (SNVs and indels) in the VCF format.

  • Structural variants in the VCF format.

  • Copy-number variants in the VCF format.

If unaligned reads are used as input, the pipeline will also output aligned reads in BAM or CRAM format.

Setup

Prerequisites

  • Sentieon® software package version 202503.01 or higher.

  • Python version 3.8 or higher.

  • bcftools version 1.10 or higher.

  • bedtools

  • MultiQC version 1.18 or higher for metrics report generation.

  • samtools version 1.16 or higher.

  • mosdepth version 0.2.6 or higher for coverage metrics collection from long-read data.

The sentieon, python, bcftools, bedtools, samtools, multiqc, and mosdepth executables will be accessed through the user’s PATH environment variable.

The Reference genome

DNAscope Hybrid will call variants present in the sample relative to a high quality reference genome in FASTA format. Besides the reference genome file, a samtools fasta index file (.fai) needs to be present. Short-read alignment also requires bwa index files.

We recommend aligning to a reference genome without alternate contigs. If alternate contigs are present in the genome and the pipeline is performing short-read alignment, please also supply a “.alt” file to activate alt-aware alignment in bwa.

Usage

Germline variant calling from aligned short and long-read data

A single command is run to call SNVs, indels, SVs and CNVs from aligned short and long reads:

sentieon-cli dnascope-hybrid \
  -r REFERENCE \
  --sr_aln SR_ALN [SR_ALN ...] \
  --lr_aln LR_ALN [LR_ALN ...] \
  -m MODEL_BUNDLE \
  [-d DBSNP] \
  [-b DIPLOID_BED] \
  [-t NUMBER_THREADS] \
  sample.vcf.gz

The DNAscope Hybrid pipeline requires the following arguments:

  • -r REFERENCE: the location of the reference FASTA file. A reference fasta index, “.fai” file, is also required.

  • --sr_aln: the input short-read data in BAM or CRAM format. One or more files can be supplied by passing multiple files after the --sr_aln argument.

  • --lr_aln: the input long-read data in BAM or CRAM format. One or more files can be supplied by passing multiple files after the --lr_aln argument.

  • -m MODEL_BUNDLE: the location of the model bundle. Model bundle files can be found in the sentieon-models repository.

  • sample.vcf.gz: the location of the output VCF file for SNVs and indels. The pipeline requires the output file end with the suffix, “.vcf.gz”.

The DNAscope Hybrid pipeline accepts the following optional arguments:

  • -d DBSNP: the location of the Single Nucleotide Polymorphism database (dbSNP) used to label known variants in VCF (.vcf) or bgzip compressed VCF (.vcf.gz) format. Only one file is supported. Supplying this file will annotate variants with their dbSNP refSNP ID numbers. A VCF index file is required.

  • -b DIPLOID_BED: interval in the reference to restrict diploid variant calling, in BED file format. Supplying this file will limit diploid variant calling to the intervals inside the BED file.

  • -t NUMBER_THREADS: number of computing threads that will be used by the software to run parallel processes. The argument is optional; if omitted, the pipeline will use as many threads as the server has.

  • -h: print the command-line help and exit.

  • --dry_run: print the pipeline commands, but do not actually execute them.

Germline variant calling from unaligned short and long-read data

A single command is run to call SNVs, indels, SVs and CNVs from unaligned short and long reads:

sentieon-cli dnascope-hybrid \
  -r REFERENCE \
  --sr_r1_fastq SR_R1_FQ [SR_R1_FQ ...] \
  --sr_r2_fastq SR_R2_FQ [SR_R2_FQ ...] \
  --sr_readgroups SR_READGROUP [SR_READGROUP ...] \
  --lr_aln LR_ALN [LR_ALN ...] \
  --lr_align_input \
  -m MODEL_BUNDLE \
  [-d DBSNP] \
  [-b DIPLOID_BED] \
  [-t NUMBER_THREADS] \
  sample.vcf.gz

The DNAscope Hybrid pipeline requires the following arguments:

  • --sr_r1_fastq: the input R1 short-read data in gzipped FASTQ format. One or more files can be supplied by passing multiple files after the --sr_r1_fastq argument.

  • --sr_r2_fastq: the input R2 short-read data in gzipped FASTQ format. One or more files can be supplied by passing multiple files after the --sr_r2_fastq argument.

  • --sr_readgroups: readgroup information for each FASTQ. The pipeline will expect the same number of arguments to --sr_r1_fastq and --sr_readgroups. An example argument is, --sr_readgroups "@RG\tID:HG002-1\tSM:HG002\tLB:HG002-LB-1\tPL:ILLUMINA"

  • --lr_aln: the input long-read data in uBAM or uCRAM format. One or more files can be supplied by passing multiple files after the --lr_aln argument.

  • --lr_align_input: directs the pipeline to align the input long-reads.

The DNAscope Hybrid pipeline accepts the following optional arguments:

  • --sr_duplicate_marking: setting for duplicate marking. markdup will mark duplicate reads. rmdup will remove duplicate reads. none will skip duplicate marking. The default setting is markdup.

  • --lr_input_ref: a reference fasta used for decoding the input long-read file(s). Required with long-read uCRAM or CRAM input. Can be different from the fasta used with the -r argument.

  • --bam_format: use BAM format instead of CRAM for output aligned files.

Pipeline output

List of output files

The following files are output by the DNAscope Hybrid pipeline:

  • sample.vcf.gz: SNV and indel variant calls across the regions of the genome as defined in the -b DIPLOID_BED file.

  • sample.sv.vcf.gz: structural variant calls from the Sentieon® LongReadSV tool.

  • sample.cnv.vcf.gz: copy-number variant calls from the Sentieon® CNVscope tool.

  • sample_deduped.cram: aligned, coordinate-sorted and duplicate-marked short-read data from the input FASTQ files.

  • sample_mm2_sorted_*.cram: aligned and coordinate-sorted long-reads from the input uBAM, uCRAM, BAM, or CRAM files.

  • sample_metrics: a directory containing QC metrics for the analyzed sample.

Troubleshooting

The pipeline complains, Input has a different RG-SM tag

This error will occur if the pipeline detects that the input files have (or will have) different readgroup SM tags. To fix this error, please use the --rgsm argument to adjust the SM tags of the input files during variant calling. Note that with this argument, all reads in the input files will be used during variant calling.
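
A mismatch between SM tags can often be spotted before the pipeline is started. The sketch below compares the SM values of readgroup strings written in the form used by the --sr_readgroups example, where fields are separated by a literal backslash followed by "t". The function names are hypothetical, and readgroup lines with real tab characters (for example, lines copied from a BAM header) are not handled.

```shell
# Extract the SM value from a readgroup string whose fields are separated
# by a literal backslash-t sequence. Prints nothing if there is no SM field.
rg_sample() {
  printf '%s\n' "$1" | sed -n 's/.*\\tSM:\([^\\]*\).*/\1/p'
}

# Return 0 when every readgroup string passed in carries the same SM value.
same_sample() {
  first=$(rg_sample "$1")
  for rg in "$@"; do
    [ "$(rg_sample "$rg")" = "$first" ] || return 1
  done
  return 0
}

# Example:
#   same_sample '@RG\tID:lane1\tSM:HG002' '@RG\tID:lane2\tSM:HG002'
```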

Sentieon Pangenome

Sentieon® Pangenome is a pipeline for alignment and variant calling from short-read DNA sequence data using pangenome graphs. The Pangenome pipeline leverages graph-based reference representations to improve alignment and variant calling accuracy, particularly in complex genomic regions with high sequence diversity.

The pipeline accepts as input un-aligned reads in FASTQ format or aligned reads in BAM or CRAM format. The pipeline will output variants in the VCF format and aligned reads in BAM or CRAM formats.

Setup

Prerequisites

  • Sentieon® software package version 202503.02 or higher.

  • samtools version 1.16 or higher.

  • vg for personalized pangenome creation and pangenome manipulation.

  • KMC version 3 or higher for k-mer counting.

  • bcftools version 1.22 or higher for VCF operations.

  • MultiQC. MultiQC is used to generate a report from the pipeline metrics.

The executables will be accessed through the user’s PATH environment variable.

The Reference genome

The Sentieon® Pangenome pipeline will call variants relative to a high quality reference genome sequence. Besides the reference genome file, a samtools fasta index file (.fai) needs to be present. Read alignment also requires bwa index files.

We recommend using a reference genome without alternate contigs.

Currently, only GRCh38 with UCSC-style contig names (chr1, chr2, etc.) is supported.

Pangenome files

The pipeline requires the following pangenome files:

  • GBZ file: The pangenome graph in GBZ format.

  • Haplotype file: Pangenome haplotype information.

The pangenome needs to incorporate GRCh38 as a reference sequence.

The sentieon-cli will check that the .gbz and .hapl files follow the HPRC naming convention, ending with grch38.gbz and grch38.hapl, respectively. This check can be disabled using the hidden argument, --skip_pangenome_name_checks.
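
The naming convention can be checked up front with a simple suffix test. This is a sketch of the rule as stated above, not the sentieon-cli's own implementation; the function name is hypothetical, and an exact, case-sensitive suffix match is an assumption.

```shell
# Check that the pangenome file names follow the naming convention above.
# $1: GBZ file path, $2: haplotype file path.
pangenome_names_ok() {
  case "$1" in
    *grch38.gbz) ;;
    *) echo "GBZ file name should end with grch38.gbz: $1" >&2; return 1 ;;
  esac
  case "$2" in
    *grch38.hapl) ;;
    *) echo "haplotype file name should end with grch38.hapl: $2" >&2; return 1 ;;
  esac
  return 0
}

# Example with placeholder file names:
#   pangenome_names_ok example-grch38.gbz example-grch38.hapl
```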

Model bundle

A Sentieon model bundle containing machine learning models for variant calling is required. Model bundle files can be found in the sentieon-models repository.

Population VCF

The pipeline requires a population VCF file as input. The population VCF must match the model bundle that is used by the pipeline.

Optional input files

Additional input files are required for optional functionality:

  • A BED file containing variant calling intervals. Recommended for whole-genome sequencing data to restrict variant calling to the canonical contigs. Recommended for whole-exome sequencing data to restrict variant calling to target regions.

  • The Single Nucleotide Polymorphism database (dbSNP) file used to annotate known variants, supplied as a VCF file. The VCF file can be bgzip compressed and must be indexed.

Usage

Alignment and variant calling from FASTQ

A single command is run to perform alignment, preprocessing and metrics collection, and SNV and indel calling from FASTQ files:

sentieon-cli sentieon-pangenome [-h] \
  -r REFERENCE \
  --hapl HAPL \
  --gbz GBZ \
  --r1_fastq R1_FASTQ ... \
  --r2_fastq R2_FASTQ ... \
  --readgroups READGROUP \
  --pop_vcf POP_VCF \
  -m MODEL_BUNDLE \
  [-b INTERVAL_FILE] \
  [--bam_format] \
  [-d DBSNP] \
  [--dry_run] \
  [-t NUMBER_THREADS] \
  [--pcr_free] \
  SAMPLE_VCF

With FASTQ input, the Sentieon® pangenome pipeline requires the following arguments:

  • -r REFERENCE: the location of the reference FASTA file. A reference fasta index, “.fai” file, and bwa index files, are also required.

  • --hapl HAPL: the location of the pangenome haplotype file.

  • --gbz GBZ: the location of the pangenome graph file in GBZ format.

  • --r1_fastq R1_FASTQ: the R1 input FASTQ. Can be used multiple times. Each R1_FASTQ must have a corresponding R2_FASTQ file. All fastq are expected to be from the same sample.

  • --r2_fastq R2_FASTQ: the R2 input FASTQ. Can be used multiple times.

  • --readgroup READGROUP: readgroup information for the sample. A single readgroup can be supplied for the sample. An example argument is, --readgroup "@RG\tID:HG002-1\tSM:HG002\tLB:HG002-LB-1\tPL:ILLUMINA"

  • --pop_vcf POP_VCF: the location of the population VCF file.

  • -m MODEL_BUNDLE: the location of the model bundle. Model bundle files can be found in the sentieon-models repository.

  • SAMPLE_VCF: the location of the output VCF file for SNVs and indels. The pipeline requires the output file end with the suffix, .vcf.gz. The file path without the suffix will be used as the basename for other output files.

The Sentieon® pangenome pipeline accepts the following optional arguments:

  • -b INTERVAL_FILE: interval in the reference to restrict variant calling, in BED file format. Supplying this file will limit variant calling to the intervals inside the INTERVAL_FILE. We recommend using an INTERVAL_FILE to restrict variant calling to canonical contigs.

  • --bam_format: use BAM format instead of CRAM for output aligned files.

  • -d DBSNP: the location of the Single Nucleotide Polymorphism database (dbSNP) used to label known variants in VCF (.vcf) or bgzip compressed VCF (.vcf.gz) format. Only one file is supported. Supplying this file will annotate variants with their dbSNP refSNP ID numbers. A VCF index file is required.

  • --dry_run: print the pipeline commands, but do not actually execute them.

  • -t NUMBER_THREADS: number of computing threads that will be used by the software to run parallel processes. The argument is optional; if omitted, the pipeline will use as many threads as the server has.

  • --pcr_free: Call variants using --pcr_indel_model NONE, which is appropriate for libraries prepared with a PCR-free library prep. Deduplication is still performed to identify optical duplicates.

  • -h: print the command-line help and exit.

Variant calling from sorted BAM or CRAM

A single command is run to perform alignment, preprocessing and metrics collection, and SNV and indel calling from BAM or CRAM files:

sentieon-cli sentieon-pangenome [-h] \
  -r REFERENCE \
  --hapl HAPL \
  --gbz GBZ \
  -i SAMPLE_INPUT ... \
  --pop_vcf POP_VCF \
  -m MODEL_BUNDLE \
  [-b INTERVAL_FILE] \
  [-d DBSNP] \
  [--dry_run] \
  [-t NUMBER_THREADS] \
  SAMPLE_VCF

With BAM or CRAM input, the pangenome pipeline requires the following new arguments:

  • -i SAMPLE_INPUT: the input sample file in BAM or CRAM format. One or more files can be supplied by passing multiple files after the -i argument.

Pipeline output

List of output files

The following files are output when processing WGS FASTQ with all features if the output file is sample.vcf.gz:

  • sample.vcf.gz: SNV and indel variant calls across the regions of the genome as defined in the -b BED file.

  • sample_bwa_deduped.cram or sample_bwa_deduped.bam: bwa aligned, coordinate-sorted and duplicate-marked read data from the input FASTQ file(s).

  • sample_mm2_deduped.cram or sample_mm2_deduped.bam: pangenome aligned, coordinate-sorted and duplicate-marked read data from the input FASTQ files. The reads in this file are aligned to the pangenome and lifted back to the reference genome.

  • sample_metrics/: A directory containing QC metrics for the analyzed sample.

    • sample_metrics/sample.txt.alignment_stat.txt: Metrics from the Sentieon® AlignmentStat algo.

    • sample_metrics/sample.txt.base_distribution_by_cycle.txt: Metrics from the Sentieon® BaseDistributionByCycle algo.

    • sample_metrics/coverage*: Coverage metrics from the Sentieon® CoverageMetrics algo.

    • sample_metrics/sample.txt.gc_bias*: Metrics from the Sentieon® GCBias algo.

    • sample_metrics/sample.txt.insert_size.txt: Metrics from the Sentieon® InsertSizeMetricAlgo algo.

    • sample_metrics/sample.txt.mean_qual_by_cycle.txt: Metrics from the Sentieon® MeanQualityByCycle algo.

    • sample_metrics/sample.txt.qual_distribution.txt: Metrics from the Sentieon® QualDistribution algo.

    • sample_metrics/sample.txt.wgs.txt: Metrics from the Sentieon® WgsMetricsAlgo algo.

    • sample_metrics/multiqc_report.html: Collected QC metrics aggregated by MultiQC.

Limitations of the Sentieon® pangenome pipeline

The Sentieon® pangenome pipeline currently only supports Minigraph-Cactus pangenomes with a GRCh38 reference sequence, such as those generated by the Human Pangenome Reference Consortium (HPRC). Please reach out to Sentieon® support for information on using the pipeline with other pangenomes.