Sentieon-cli
The sentieon-cli provides complete implementations of Sentieon® pipelines
in single commands. The sentieon-cli is designed to improve ease-of-use for
complex multi-stage alignment and variant calling pipelines.
Instructions for installing the sentieon-cli can be found on GitHub at,
https://github.com/Sentieon/sentieon-cli.
DNAscope
Sentieon® DNAscope is a pipeline for alignment and germline variant calling (SNVs, SVs, indels, and CNVs) from short-read DNA sequence data. The sentieon-cli provides a complete implementation of the DNAscope pipeline in a single command. The pipeline is designed for accuracy, efficiency, and ease-of-use.
The DNAscope pipeline uses a combination of Bayesian statistical approaches and machine learning to achieve high variant calling accuracy. The DNAscope pipeline supports samples sequenced using whole-genome or targeted (hybrid-capture) enrichment library preps.
The pipeline accepts as input aligned reads in BAM or CRAM format, or un-aligned reads in FASTQ, uBAM, or uCRAM format. The pipeline will output variants in the VCF (or gVCF) formats and aligned reads in BAM or CRAM formats.
Setup
Prerequisites
- Sentieon® software package version 202308 or higher.
- samtools version 1.16 or higher for alignment of reads in uBAM or uCRAM format or re-alignment of previously aligned reads.
- MultiQC version 1.18 or higher for metrics report generation.
The sentieon, samtools, and multiqc executables will be
accessed through the user’s PATH environment variable.
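Before launching the pipeline, it can help to confirm that the required executables resolve through PATH. The following is a minimal sketch of such a check; the function name missing_tools is hypothetical and is not part of the sentieon-cli.

```shell
#!/bin/sh
# Print the name of every tool that cannot be found through PATH.
# missing_tools is a hypothetical helper, not part of the sentieon-cli.
missing_tools() {
    for tool in "$@"; do
        if ! command -v "$tool" >/dev/null 2>&1; then
            printf '%s\n' "$tool"
        fi
    done
}

# Tools named in the DNAscope prerequisites.
missing=$(missing_tools sentieon samtools multiqc)
if [ -n "$missing" ]; then
    printf 'Not found on PATH:\n%s\n' "$missing"
else
    printf 'All required executables were found on PATH.\n'
fi
```

The check only confirms that each executable can be located; it does not verify the installed versions.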
The Reference genome
DNAscope will call variants present in the sample relative to a high quality reference genome sequence. Besides the reference genome file, a samtools fasta index file (.fai) needs to be present. Read alignment also requires bwa index files.
We recommend aligning to a reference genome without alternate contigs. If alternate contigs are present in the genome, please also supply a “.alt” file to activate alt-aware alignment in bwa.
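A quick way to confirm that the index files sit next to the reference is sketched below. It assumes the standard suffixes written by samtools faidx (.fai) and bwa index (.amb, .ann, .bwt, .pac, .sa); the helper name missing_ref_files and the path reference.fa are hypothetical.

```shell
#!/bin/sh
# Print every expected index file that is missing for a reference FASTA.
# missing_ref_files is a hypothetical helper. The suffixes are the ones
# written by "samtools faidx" and "bwa index". The .alt file is only needed
# when the reference contains alternate contigs, so it is not checked here.
missing_ref_files() {
    ref=$1
    for suffix in fai amb ann bwt pac sa; do
        if [ ! -e "$ref.$suffix" ]; then
            printf '%s\n' "$ref.$suffix"
        fi
    done
}

# Example with a hypothetical reference path; prints one line per missing file.
missing_ref_files reference.fa
```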
Usage
Alignment and variant calling from FASTQ
A single command is run to align, preprocess, and call SNVs, indels, and structural variants from FASTQ:
sentieon-cli dnascope [-h] \
-r REFERENCE \
--r1_fastq R1_FASTQ ... \
--r2_fastq R2_FASTQ ... \
--readgroups READGROUPS ... \
-m MODEL_BUNDLE \
[-d DBSNP] \
[-b INTERVAL_FILE] \
[--interval_padding INTERVAL_PADDING] \
[-t NUMBER_THREADS] \
[--pcr_free] \
[-g] \
[--duplicate_marking DUP_MARKING] \
[--assay ASSAY] \
[--consensus] \
[--dry_run] \
[--bam_format] \
SAMPLE_VCF
With FASTQ input, the DNAscope pipeline requires the following arguments:
- -r REFERENCE: the location of the reference FASTA file. A reference fasta index, “.fai” file, and bwa index files, are also required.
- --r1_fastq R1_FASTQ: the R1 input FASTQ. Can be used multiple times. R1_FASTQ files without a corresponding R2_FASTQ are assumed to be single-ended. Be aware that the pipeline performs single-sample processing, and all FASTQ files are expected to be from the same sample.
- --r2_fastq R2_FASTQ: the R2 input FASTQ. Can be used multiple times.
- --readgroups READGROUPS: readgroup information for each FASTQ. The pipeline expects the same number of arguments to --r1_fastq and --readgroups. An example argument is, --readgroups "@RG\tID:HG002-1\tSM:HG002\tLB:HG002-LB-1\tPL:ILLUMINA"
- -m MODEL_BUNDLE: the location of the model bundle. Model bundle files can be found in the sentieon-models repository.
- SAMPLE_VCF: the location of the output VCF file for SNVs and indels. The pipeline requires the output file end with the suffix, .vcf.gz. The file path without the suffix will be used as the basename for other output files.
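Because one readgroup is expected per R1 FASTQ and all reads belong to one sample, the readgroup strings can be generated rather than typed by hand. The sketch below follows the example argument above; make_readgroup is a hypothetical helper, and it assumes that every lane comes from the same library and from an Illumina instrument.

```shell
#!/bin/sh
# Build one readgroup string per FASTQ pair for a single sample.
# make_readgroup is a hypothetical helper; it assumes every lane comes from
# the same library and an Illumina instrument, as in the example argument.
make_readgroup() {
    sample=$1
    lane=$2
    # '\\t' in the printf format emits a literal backslash followed by "t",
    # matching the escaped form used in the example argument.
    printf '@RG\\tID:%s-%s\\tSM:%s\\tLB:%s-LB-1\\tPL:ILLUMINA\n' \
        "$sample" "$lane" "$sample" "$sample"
}

# Two lanes from sample HG002: distinct ID tags, identical SM tags.
make_readgroup HG002 1
make_readgroup HG002 2
```

Each generated string would then be passed, quoted, as one argument to --readgroups, in the same order as the matching --r1_fastq files.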
The DNAscope pipeline accepts the following optional arguments:
- -d DBSNP: the location of the Single Nucleotide Polymorphism database (dbSNP) used to label known variants in VCF (.vcf) or bgzip compressed VCF (.vcf.gz) format. Only one file is supported. Supplying this file will annotate variants with their dbSNP refSNP ID numbers. A VCF index file is required.
- -b INTERVAL_FILE: interval in the reference to restrict variant calling, in BED file format. Supplying this file will limit variant calling to the intervals inside the BED file. If a BED file is not supplied, the software will process the whole genome.
- --interval_padding INTERVAL_PADDING: adds INTERVAL_PADDING bases padding to the edges of the input intervals. The default value is 0.
- -t NUMBER_THREADS: number of computing threads that will be used by the software to run parallel processes. The argument is optional; if omitted, the pipeline will use as many threads as the server has.
- --pcr_free: Call variants using --pcr_indel_model NONE, which is appropriate for libraries prepared with a PCR-free library prep. Deduplication is still performed to identify optical duplicates.
- -g: output variants in the gVCF format, in addition to the VCF output file. The tool will output a bgzip compressed gVCF file with a corresponding index file.
- --duplicate_marking DUP_MARKING: setting for duplicate marking. markdup will mark duplicate reads. rmdup will remove duplicate reads. none will skip duplicate marking. The default setting is markdup.
- --assay ASSAY: assay setting for metrics collection, WGS or WES. The default setting is WGS.
- --consensus: generate consensus reads during duplicate marking.
- -h: print the command-line help and exit.
- --dry_run: print the pipeline commands, but do not actually execute them.
- --bam_format: use BAM format instead of CRAM for output aligned files.
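To make the effect of --interval_padding concrete, the sketch below pads BED intervals by a fixed number of bases. It only illustrates the idea and is not how the pipeline implements the option; pad_bed is a hypothetical helper, it keeps only the first three BED columns, and it does not clamp interval ends to the contig length.

```shell
#!/bin/sh
# Pad every BED interval by a fixed number of bases on both sides.
# pad_bed is a hypothetical helper that only illustrates the idea of
# padding; it is not the pipeline's implementation of --interval_padding.
pad_bed() {
    awk -v pad="$1" 'BEGIN { FS = OFS = "\t" }
    {
        start = $2 - pad
        if (start < 0) start = 0   # BED start coordinates cannot be negative
        print $1, start, $3 + pad  # only the first three columns are kept
    }'
}

# Pad two example intervals by 100 bases.
printf 'chr1\t50\t200\nchr2\t1000\t2000\n' | pad_bed 100
```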
Alignment and variant calling from uBAM or uCRAM
A single command is run to align, preprocess, and call SNVs, indels, and structural variants from uBAM or uCRAM files:
sentieon-cli dnascope [-h] \
-r REFERENCE \
-i SAMPLE_INPUT ... \
--align \
[--input_ref INPUT_REF] \
-m MODEL_BUNDLE \
[-d DBSNP] \
[-b BED] \
[--interval_padding INTERVAL_PADDING] \
[-t NUMBER_THREADS] \
[--pcr_free] \
[-g] \
[--duplicate_marking DUP_MARKING] \
[--assay ASSAY] \
[--consensus] \
[--dry_run] \
[--bam_format] \
SAMPLE_VCF
With uBAM or uCRAM input, the DNAscope pipeline requires the following new arguments:
- -i SAMPLE_INPUT: the input sample file in uBAM or uCRAM format. One or more files can be supplied by passing multiple files after the -i argument.
- --align: directs the pipeline to align the input reads.
The DNAscope pipeline accepts the following new optional arguments:
--input_ref INPUT_REF: a reference fasta used for decoding the input file(s). Required with uCRAM input. Can be different from the fasta used with the -r argument.
Alignment and variant calling from coordinate-sorted BAM or CRAM
A single command is run to align, preprocess, and call SNVs, indels, and structural variants from BAM or CRAM files:
sentieon-cli dnascope [-h] \
-r REFERENCE \
-i SAMPLE_INPUT ... \
--collate_align \
[--input_ref INPUT_REF] \
-m MODEL_BUNDLE \
[-d DBSNP] \
[-b BED] \
[--interval_padding INTERVAL_PADDING] \
[-t NUMBER_THREADS] \
[--pcr_free] \
[-g] \
[--duplicate_marking DUP_MARKING] \
[--assay ASSAY] \
[--consensus] \
[--dry_run] \
[--bam_format] \
SAMPLE_VCF
With BAM or CRAM input, the DNAscope pipeline requires the following new arguments:
--collate_align: directs the pipeline to collate and then align the input reads.
Variant calling from sorted BAM or CRAM
A single command is run to preprocess and call SNVs, indels, and structural variants from BAM or CRAM files:
sentieon-cli dnascope [-h] \
-r REFERENCE \
-i SAMPLE_INPUT ... \
-m MODEL_BUNDLE \
[-d DBSNP] \
[-b BED] \
[--interval_padding INTERVAL_PADDING] \
[-t NUMBER_THREADS] \
[--pcr_free] \
[-g] \
[--duplicate_marking DUP_MARKING] \
[--assay ASSAY] \
[--consensus] \
[--dry_run] \
[--bam_format] \
SAMPLE_VCF
If neither the --align nor the --collate_align argument is supplied, the
pipeline will call variants directly from the input reads.
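The three ways of handling -i input can be summarized as a small lookup. This is a sketch only; alignment_flag and the state names unaligned, realign, and aligned are hypothetical labels chosen for the example.

```shell
#!/bin/sh
# Map the state of the -i input files to the alignment flag described above.
# alignment_flag and its state names are hypothetical labels for this sketch.
alignment_flag() {
    case $1 in
        unaligned) printf '%s\n' '--align' ;;          # uBAM or uCRAM input
        realign)   printf '%s\n' '--collate_align' ;;  # sorted BAM or CRAM to re-align
        aligned)   printf '\n' ;;                      # call variants directly
        *)         printf 'unknown input state: %s\n' "$1" >&2; return 1 ;;
    esac
}

for state in unaligned realign aligned; do
    printf '%-10s -> %s\n' "$state" "$(alignment_flag "$state")"
done
```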
Pipeline output
List of output files
The following files are output when processing WGS FASTQ with default
arguments if the output file is sample.vcf.gz:
- sample.vcf.gz: SNV and indel variant calls across the regions of the genome as defined in the -b BED file.
- sample_deduped.cram or sample_deduped.bam: aligned, coordinate-sorted and duplicate-marked read data from the input FASTQ files.
- sample_svs.vcf.gz: structural variant calls from DNAscope and SVSolver.
- sample_metrics: a directory containing QC metrics for the analyzed sample.
- sample_metrics/coverage*: coverage metrics for the processed sample. Only available for WGS samples.
- sample_metrics/{sample}.txt.alignment_stat.txt: Metrics from the AlignmentStat algo.
- sample_metrics/{sample}.txt.base_distribution_by_cycle.txt: Metrics from the BaseDistributionByCycle algo.
- sample_metrics/{sample}.txt.dedup_metrics.txt: Metrics from the Dedup algo.
- sample_metrics/{sample}.txt.gc_bias*: Metrics from the GCBias algo. Only available for WGS samples.
- sample_metrics/{sample}.txt.insert_size.txt: Metrics from the InsertSizeMetricAlgo algo.
- sample_metrics/{sample}.txt.mean_qual_by_cycle.txt: Metrics from the MeanQualityByCycle algo.
- sample_metrics/{sample}.txt.qual_distribution.txt: Metrics from the QualDistribution algo.
- sample_metrics/{sample}.txt.wgs.txt: Metrics from the WgsMetricsAlgo algo. Only available for WGS samples.
- sample_metrics/{sample}.txt.hybrid-selection.txt: Metrics from the HsMetricAlgo algo.
- sample_metrics/multiqc_report.html: collected QC metrics aggregated by MultiQC.
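Since the other output names are derived from the SAMPLE_VCF path, the expected top-level outputs can be listed before a run. The sketch below assumes the default CRAM output (no --bam_format) and covers only the main files named above; dnascope_outputs is a hypothetical helper.

```shell
#!/bin/sh
# List the main output paths derived from the SAMPLE_VCF argument.
# dnascope_outputs is a hypothetical helper; it assumes the default CRAM
# output (no --bam_format) and lists only the top-level outputs.
dnascope_outputs() {
    case $1 in
        *.vcf.gz) ;;
        *) printf 'output must end with .vcf.gz: %s\n' "$1" >&2; return 1 ;;
    esac
    base=${1%.vcf.gz}   # the basename shared by the other output files
    printf '%s\n' "$1" "${base}_deduped.cram" "${base}_svs.vcf.gz" "${base}_metrics"
}

# Hypothetical output location.
dnascope_outputs results/sample.vcf.gz
```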
DNAscope LongRead
Sentieon® DNAscope LongRead is a pipeline for alignment and germline variant calling (SNVs, SVs, CNVs, and indels) from long-read sequence data. The DNAscope LongRead pipeline is able to take advantage of longer read lengths to perform quick and accurate variant calling using specially calibrated machine learning models. The sentieon-cli provides a complete implementation of the DNAscope LongRead pipeline in a single command. The pipeline is designed for accuracy, efficiency, and ease-of-use.
The pipeline will accept as input aligned reads in BAM or CRAM format, or un-aligned reads in FASTQ, uBAM, or uCRAM format. The pipeline will output variants in the VCF (or gVCF) formats and aligned reads in BAM or CRAM formats.
Setup
Prerequisites
- Sentieon® software package version 202308.01 or higher.
- Python version 3.8 or higher.
- bcftools version 1.10 or higher for the SNV and indel calling pipeline.
- bedtools for the SNV and indel calling pipeline.
- samtools version 1.16 or higher for alignment of read data in uBAM or uCRAM format or re-alignment of previously aligned reads.
- mosdepth version 0.2.6 or higher for coverage metrics of long-read data.
- hificnv version 1.0.0 or higher for CNV calling.
The sentieon, python, bcftools, bedtools, samtools,
hificnv, and mosdepth executables will be accessed through the
user’s PATH environment variable.
The Reference genome
DNAscope LongRead will call variants present in the sample relative to a high quality reference genome sequence. Besides the reference genome file, a samtools fasta index file (.fai) needs to be present.
We recommend aligning to a reference genome without alternate contigs.
Usage
Alignment and variant calling from FASTQ
A single command is run to align and call SNVs, indels and structural variants from PacBio HiFi or ONT reads in the FASTQ format:
sentieon-cli dnascope-longread [-h] \
-r REFERENCE \
--fastq INPUT_FASTQ ... \
--readgroups READGROUP ... \
-m MODEL_BUNDLE \
[-d DBSNP] \
[-b DIPLOID_BED] \
[-t NUMBER_THREADS] \
[-g] \
--tech HiFi|ONT \
[--haploid_bed HAPLOID_BED] \
[--cnv_excluded_regions CNV_EXCLUDE_BED] \
SAMPLE_VCF
With FASTQ input, the DNAscope LongRead pipeline requires the following arguments:
- -r REFERENCE: the location of the reference FASTA file. A reference fasta index, “.fai” file, is also required.
- --fastq INPUT_FASTQ: the input sample file in FASTQ format. One or more files can be supplied by passing multiple files after the --fastq argument.
- --readgroups READGROUP: readgroup information for the read data. The --readgroups argument is required if the input data is in the FASTQ format. This argument expects complete readgroup strings and these strings will be passed to minimap2 through the -R argument. An example argument is, --readgroups '@RG\tID:foo\tSM:bar'.
- -m MODEL_BUNDLE: the location of the model bundle. Model bundle files can be found in the sentieon-models repository.
- --tech HiFi|ONT: Sequencing technology used to generate the reads. Supported arguments are ONT or HiFi.
- SAMPLE_VCF: the location of the output VCF file for SNVs and indels. The pipeline requires the output file end with the suffix, “.vcf.gz”. The file path without the suffix will be used as the basename for other output files.
The DNAscope LongRead pipeline accepts the following optional arguments:
- -d DBSNP: the location of the Single Nucleotide Polymorphism database (dbSNP) used to label known variants in VCF (.vcf) or bgzip compressed VCF (.vcf.gz) format. Only one file is supported. Supplying this file will annotate variants with their dbSNP refSNP ID numbers. A VCF index file is required.
- -b DIPLOID_BED: interval in the reference to restrict diploid variant calling, in BED file format. Supplying this file will limit diploid variant calling to the intervals inside the BED file.
- --haploid_bed HAPLOID_BED: interval in the reference to restrict haploid variant calling, in BED file format. Supplying this file will perform haploid variant calling across the intervals inside the BED file.
- --cnv_excluded_regions: a BED file of excluded CNV regions passed to hificnv. See the hificnv documentation for more details, https://github.com/PacificBiosciences/HiFiCNV.
- -t NUMBER_THREADS: number of computing threads that will be used by the software to run parallel processes. The argument is optional; if omitted, the pipeline will use as many threads as the server has.
- -g: output variants in the gVCF format, in addition to the VCF output file. The tool will output a bgzip compressed gVCF file with a corresponding index file.
- -h: print the command-line help and exit.
- --dry_run: print the pipeline commands, but do not actually execute them.
Alignment and variant calling from uBAM, uCRAM, BAM, or CRAM
A single command is run to align and call SNVs, indels, and structural variants from PacBio HiFi or ONT reads in the uBAM, uCRAM, BAM, or CRAM formats:
sentieon-cli dnascope-longread [-h] \
-r REFERENCE \
-i SAMPLE_INPUT ... \
--align \
-m MODEL_BUNDLE \
[-d DBSNP] \
[-b DIPLOID_BED] \
[-t NUMBER_THREADS] \
[-g] \
--tech HiFi|ONT \
[--haploid_bed HAPLOID_BED] \
[--cnv_excluded_regions CNV_EXCLUDE_BED] \
[--input_ref INPUT_REF] \
SAMPLE_VCF
With uBAM, uCRAM, BAM, or CRAM input, the DNAscope LongRead pipeline requires the following new arguments:
- -i SAMPLE_INPUT: the input sample file in uBAM, uCRAM, BAM, or CRAM format. One or more files can be supplied by passing multiple files after the -i argument.
- --align: re-align the input read data to the reference genome using Sentieon® minimap2.
The DNAscope LongRead pipeline accepts the following new optional arguments:
--input_ref INPUT_REF: a reference fasta used for decoding the input file(s). Required with uCRAM or CRAM input. Can be different from the fasta used with the -r argument.
Variant calling from BAM or CRAM
A single command is run to call SNVs, indels, and structural variants from PacBio HiFi or ONT reads in the BAM or CRAM formats:
sentieon-cli dnascope-longread [-h] \
-r REFERENCE \
-i SAMPLE_INPUT ... \
-m MODEL_BUNDLE \
[-d DBSNP] \
[-b DIPLOID_BED] \
[-t NUMBER_THREADS] \
[-g] \
--tech HiFi|ONT \
[--haploid_bed HAPLOID_BED] \
[--cnv_excluded_regions CNV_EXCLUDE_BED] \
SAMPLE_VCF
Not supplying the --align argument will direct the pipeline to call
variants directly from the input reads.
Pipeline output
List of output files
The following files are output when processing FASTQ data or uBAM,
uCRAM, BAM, or CRAM files with the --align argument:
- sample.vcf.gz: SNV and indel variant calls across the regions of the genome as defined in the -b DIPLOID_BED file.
- sample.sv.vcf.gz: structural variant calls from the Sentieon® LongReadSV tool.
- sample_mm2_sorted_fq_*.cram: aligned and coordinate-sorted reads from the input FASTQ files.
- sample_mm2_sorted_*.cram: aligned and coordinate-sorted reads from the input uBAM, uCRAM, BAM, or CRAM files.
- sample.hificnv: the base name of HiFiCNV output files (with HiFi input).
Other considerations
Diploid and haploid variant calling
The default pipeline is recommended for use with samples from diploid
organisms. For samples with both diploid and haploid chromosomes, the
-b DIPLOID_BED option can be used to limit diploid variant calling
to diploid chromosomes and the --haploid_bed HAPLOID_BED argument
can be used to perform haploid variant calling across haploid
chromosomes.
Diploid and haploid BED files for the human hg38 reference genome (with male samples) can be found in the /data folder in this repository.
Modification
Scripts in this repository are made available under the BSD 2-Clause license.
The Python scripts in the sentieon_cli/scripts folder perform
low-level manipulation of intermediate gVCF and VCF files generated by
the pipeline. Due to the low-level data handling performed by these
scripts, modification of these files by users is discouraged.
References
Sentieon DNAscope LongRead – A highly Accurate, Fast, and Efficient Pipeline for Germline Variant Calling from PacBio HiFi reads - A preprint describing the DNAscope LongRead pipeline for calling variants from PacBio HiFi data.
DNAscope Hybrid
Sentieon® DNAscope Hybrid is a pipeline for germline variant calling using combined short-read and long-read data from a single sample. The DNAscope Hybrid pipeline uses the strengths of both short-read and long-read technologies to generate variant callsets that are more accurate than those from either short-read or long-read data alone. The sentieon-cli provides a complete implementation of the DNAscope Hybrid pipeline in a single command.
The pipeline supports input data in the following formats; both short-read and long-read input are required:
- Unaligned short-read data in gzipped FASTQ format.
- Aligned short-read data in BAM or CRAM format.
- Unaligned long-read data in uBAM or uCRAM format.
- Aligned long-read data in BAM or CRAM format.
By default, the pipeline will generate the following output files:
- Small variants (SNVs and indels) in the VCF format.
- Structural variants in the VCF format.
- Copy-number variants in the VCF format.
If unaligned reads are used as input, the pipeline will also output aligned reads in BAM or CRAM format.
Setup
Prerequisites
- Sentieon® software package version 202503.01 or higher.
- Python version 3.8 or higher.
- bcftools version 1.10 or higher.
- MultiQC version 1.18 or higher for metrics report generation.
- samtools version 1.16 or higher.
- mosdepth version 0.2.6 or higher for coverage metrics collection from long-read data.
The sentieon, python, bcftools, bedtools, samtools,
multiqc, and mosdepth executables will be accessed through the
user’s PATH environment variable.
The Reference genome
DNAscope Hybrid will call variants present in the sample relative to a high quality reference genome in FASTA format. Besides the reference genome file, a samtools fasta index file (.fai) needs to be present. Short-read alignment also requires bwa index files.
We recommend aligning to a reference genome without alternate contigs. If alternate contigs are present in the genome and the pipeline is performing short-read alignment, please also supply a “.alt” file to activate alt-aware alignment in bwa.
Usage
Germline variant calling from aligned short and long-read data
A single command is run to call SNVs, indels, SVs and CNVs from aligned short and long reads:
sentieon-cli dnascope-hybrid \
-r REFERENCE \
--sr_aln SR_ALN [SR_ALN ...] \
--lr_aln LR_ALN [LR_ALN ...] \
-m MODEL_BUNDLE \
[-d DBSNP] \
[-b DIPLOID_BED] \
[-t NUMBER_THREADS] \
sample.vcf.gz
The DNAscope Hybrid pipeline requires the following arguments:
- -r REFERENCE: the location of the reference FASTA file. A reference fasta index, “.fai” file, is also required.
- --sr_aln: the input short-read data in BAM or CRAM format. One or more files can be supplied by passing multiple files after the --sr_aln argument.
- --lr_aln: the input long-read data in BAM or CRAM format. One or more files can be supplied by passing multiple files after the --lr_aln argument.
- -m MODEL_BUNDLE: the location of the model bundle. Model bundle files can be found in the sentieon-models repository.
- sample.vcf.gz: the location of the output VCF file for SNVs and indels. The pipeline requires the output file end with the suffix, “.vcf.gz”.
The DNAscope Hybrid pipeline accepts the following optional arguments:
- -d DBSNP: the location of the Single Nucleotide Polymorphism database (dbSNP) used to label known variants in VCF (.vcf) or bgzip compressed VCF (.vcf.gz) format. Only one file is supported. Supplying this file will annotate variants with their dbSNP refSNP ID numbers. A VCF index file is required.
- -b DIPLOID_BED: interval in the reference to restrict diploid variant calling, in BED file format. Supplying this file will limit diploid variant calling to the intervals inside the BED file.
- -t NUMBER_THREADS: number of computing threads that will be used by the software to run parallel processes. The argument is optional; if omitted, the pipeline will use as many threads as the server has.
- -h: print the command-line help and exit.
- --dry_run: print the pipeline commands, but do not actually execute them.
Germline variant calling from unaligned short and long-read data
A single command is run to call SNVs, indels, SVs and CNVs from unaligned short and long reads:
sentieon-cli dnascope-hybrid \
-r REFERENCE \
--sr_r1_fastq SR_R1_FQ [SR_R1_FQ ...] \
--sr_r2_fastq SR_R2_FQ [SR_R2_FQ ...] \
--sr_readgroups SR_READGROUP [SR_READGROUP ...] \
--lr_aln LR_ALN [LR_ALN ...] \
--lr_align_input \
-m MODEL_BUNDLE \
[-d DBSNP] \
[-b DIPLOID_BED] \
[-t NUMBER_THREADS] \
sample.vcf.gz
The DNAscope Hybrid pipeline requires the following arguments:
- --sr_r1_fastq: the input R1 short-read data in gzipped FASTQ format. One or more files can be supplied by passing multiple files after the --sr_r1_fastq argument.
- --sr_r2_fastq: the input R2 short-read data in gzipped FASTQ format. One or more files can be supplied by passing multiple files after the --sr_r2_fastq argument.
- --sr_readgroups: readgroup information for each FASTQ. The pipeline will expect the same number of arguments to --sr_r1_fastq and --sr_readgroups. An example argument is, --sr_readgroups "@RG\tID:HG002-1\tSM:HG002\tLB:HG002-LB-1\tPL:ILLUMINA"
- --lr_aln: the input long-read data in uBAM or uCRAM format. One or more files can be supplied by passing multiple files after the --lr_aln argument.
- --lr_align_input: directs the pipeline to align the input long-reads.
The DNAscope Hybrid pipeline accepts the following optional arguments:
- --sr_duplicate_marking: setting for duplicate marking. markdup will mark duplicate reads. rmdup will remove duplicate reads. none will skip duplicate marking. The default setting is markdup.
- --lr_input_ref: a reference fasta used for decoding the input long-read file(s). Required with long-read uCRAM or CRAM input. Can be different from the fasta used with the -r argument.
- --bam_format: use BAM format instead of CRAM for output aligned files.
Pipeline output
List of output files
The following files are output by the DNAscope Hybrid pipeline:
- sample.vcf.gz: SNV and indel variant calls across the regions of the genome as defined in the -b DIPLOID_BED file.
- sample.sv.vcf.gz: structural variant calls from the Sentieon® LongReadSV tool.
- sample.cnv.vcf.gz: copy-number variant calls from the Sentieon® CNVscope tool.
- sample_deduped.cram: aligned, coordinate-sorted and duplicate-marked short-read data from the input FASTQ files.
- sample_mm2_sorted_*.cram: aligned and coordinate-sorted long-reads from the input uBAM, uCRAM, BAM, or CRAM files.
- sample_metrics: a directory containing QC metrics for the analyzed sample.
Troubleshooting
The pipeline complains, Input … has a different RG-SM tag
This error will occur if the pipeline detects that the input files have
(or will have) different readgroup SM tags. To fix this error,
please use the --rgsm argument to adjust the SM tags of the
input files during variant calling. Note that with this argument, all
reads in the input files will be used during variant calling.
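To see which SM tags the input files carry before running the pipeline, the @RG lines of each file header can be inspected. The sketch below extracts the distinct SM values from SAM header text on standard input; sm_tags is a hypothetical helper, and the example header is made up for illustration.

```shell
#!/bin/sh
# Print the distinct SM values found in the @RG lines of a SAM header read
# from standard input. sm_tags is a hypothetical helper for this sketch.
sm_tags() {
    awk -F '\t' '$1 == "@RG" {
        for (i = 2; i <= NF; i++)
            if ($i ~ /^SM:/) print substr($i, 4)
    }' | sort -u
}

# Made-up header with two readgroups that disagree on the sample name.
printf '@HD\tVN:1.6\n@RG\tID:sr\tSM:HG002\n@RG\tID:lr\tSM:hg002_hifi\n' | sm_tags
```

With real data, the header text can be produced with samtools view -H and piped into the function. More than one line of output across the short-read and long-read inputs corresponds to the situation reported by this error.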
Sentieon Pangenome
Sentieon® Pangenome is a pipeline for alignment and variant calling from short-read DNA sequence data using pangenome graphs. The Pangenome pipeline leverages graph-based reference representations to improve alignment and variant calling accuracy, particularly in complex genomic regions with high sequence diversity.
The pipeline accepts as input un-aligned reads in FASTQ format or aligned reads in BAM or CRAM format. The pipeline will output variants in the VCF format and aligned reads in BAM or CRAM formats.
Setup
Prerequisites
- Sentieon® software package version 202503.02 or higher.
- samtools version 1.16 or higher.
- vg for personalized pangenome creation and pangenome manipulation.
- KMC version 3 or higher for k-mer counting.
- bcftools version 1.22 or higher for VCF operations.
- MultiQC for generating a report from the pipeline metrics.
The executables will be accessed through the user’s PATH environment
variable.
The Reference genome
The Sentieon® Pangenome pipeline will call variants relative to a high quality reference genome sequence. Besides the reference genome file, a samtools fasta index file (.fai) needs to be present. Read alignment also requires bwa index files.
We recommend using a reference genome without alternate contigs.
Currently, only GRCh38 with UCSC-style contig names (chr1, chr2, etc.)
is supported.
Pangenome files
The pipeline requires the following pangenome files:
- GBZ file: The pangenome graph in GBZ format.
- Haplotype file: Pangenome haplotype information.
The pangenome needs to incorporate GRCh38 as a reference sequence.
The sentieon-cli will check that the .gbz and .hapl files follow the
HPRC naming convention, ending with grch38.gbz and grch38.hapl,
respectively. This check can be disabled using the hidden argument,
--skip_pangenome_name_checks.
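The naming rule can be checked ahead of time with a few lines of shell. This sketch mirrors the documented convention only and is not the actual sentieon-cli check; check_pangenome_names and the example file names are hypothetical.

```shell
#!/bin/sh
# Check that pangenome file names end as the HPRC convention described above
# expects. check_pangenome_names is a hypothetical helper; it mirrors the
# documented rule, not the actual sentieon-cli implementation.
check_pangenome_names() {
    gbz=$1
    hapl=$2
    case $gbz in
        *grch38.gbz) ;;
        *) printf 'unexpected GBZ name: %s\n' "$gbz" >&2; return 1 ;;
    esac
    case $hapl in
        *grch38.hapl) ;;
        *) printf 'unexpected haplotype name: %s\n' "$hapl" >&2; return 1 ;;
    esac
}

# Hypothetical file names that satisfy the convention.
if check_pangenome_names pangenome-grch38.gbz pangenome-grch38.hapl; then
    printf 'pangenome file names look consistent\n'
fi
```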
Model bundle
A Sentieon model bundle containing machine learning models for variant calling is required. Model bundle files can be found in the sentieon-models repository.
Population VCF
The pipeline requires a population VCF file as input. The population VCF must match the model bundle that is used by the pipeline.
Optional input files
Additional input files are required for optional functionality:
- A BED file containing variant calling intervals. Recommended for whole-genome sequencing data to restrict variant calling to the canonical contigs. Recommended for whole-exome sequencing data to restrict variant calling to target regions.
- The Single Nucleotide Polymorphism database (dbSNP) data that you want to include in the pipeline. The data is used in the form of a VCF file; you can use a VCF file compressed with bgzip and indexed.
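For whole-genome data, a BED file covering the canonical contigs can be derived from the reference .fai index. The sketch below is one possible approach; canonical_bed is a hypothetical helper, and treating chr1 to chr22, chrX, and chrY as the canonical set (leaving out chrM) is an assumption of this example rather than a definition taken from the pipeline.

```shell
#!/bin/sh
# Write BED intervals covering the canonical contigs listed in a FASTA index.
# canonical_bed is a hypothetical helper; treating chr1-chr22, chrX, and chrY
# as the canonical set is an assumption of this sketch (chrM is left out).
canonical_bed() {
    awk 'BEGIN { FS = OFS = "\t" }
    $1 ~ /^chr([1-9]|1[0-9]|2[0-2]|X|Y)$/ { print $1, 0, $2 }' "$1"
}

# Small stand-in for a real reference.fa.fai; only the contig name and
# length columns are used by the function.
fai=$(mktemp)
printf 'chr1\t248956422\nchr1_KI270706v1_random\t175055\nchrX\t156040895\nchrM\t16569\n' > "$fai"
canonical_bed "$fai"
rm -f "$fai"
```

The resulting file would be passed to the pipeline through the -b argument.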
Usage
Alignment and variant calling from FASTQ
A single command is run to perform alignment, preprocessing and metrics collection, and SNV and indel calling from FASTQ files:
sentieon-cli sentieon-pangenome [-h] \
-r REFERENCE \
--hapl HAPL \
--gbz GBZ \
--r1_fastq R1_FASTQ ... \
--r2_fastq R2_FASTQ ... \
--readgroups READGROUP \
--pop_vcf POP_VCF \
-m MODEL_BUNDLE \
[-b INTERVAL_FILE] \
[--bam_format] \
[-d DBSNP] \
[--dry_run] \
[-t NUMBER_THREADS] \
[--pcr_free] \
SAMPLE_VCF
With FASTQ input, the Sentieon® pangenome pipeline requires the following arguments:
- -r REFERENCE: the location of the reference FASTA file. A reference fasta index, “.fai” file, and bwa index files, are also required.
- --hapl HAPL: the location of the pangenome haplotype file.
- --gbz GBZ: the location of the pangenome graph file in GBZ format.
- --r1_fastq R1_FASTQ: the R1 input FASTQ. Can be used multiple times. Each R1_FASTQ must have a corresponding R2_FASTQ file. All FASTQ files are expected to be from the same sample.
- --r2_fastq R2_FASTQ: the R2 input FASTQ. Can be used multiple times.
- --readgroup READGROUP: readgroup information for the sample. A single readgroup can be supplied for the sample. An example argument is, --readgroup "@RG\tID:HG002-1\tSM:HG002\tLB:HG002-LB-1\tPL:ILLUMINA"
- --pop_vcf POP_VCF: the location of the population VCF file.
- -m MODEL_BUNDLE: the location of the model bundle. Model bundle files can be found in the sentieon-models repository.
- SAMPLE_VCF: the location of the output VCF file for SNVs and indels. The pipeline requires the output file end with the suffix, .vcf.gz. The file path without the suffix will be used as the basename for other output files.
The Sentieon® pangenome pipeline accepts the following optional arguments:
- -b INTERVAL_FILE: interval in the reference to restrict variant calling, in BED file format. Supplying this file will limit variant calling to the intervals inside the INTERVAL_FILE. We recommend using an INTERVAL_FILE to restrict variant calling to canonical contigs.
- --bam_format: use BAM format instead of CRAM for output aligned files.
- -d DBSNP: the location of the Single Nucleotide Polymorphism database (dbSNP) used to label known variants in VCF (.vcf) or bgzip compressed VCF (.vcf.gz) format. Only one file is supported. Supplying this file will annotate variants with their dbSNP refSNP ID numbers. A VCF index file is required.
- --dry_run: print the pipeline commands, but do not actually execute them.
- -t NUMBER_THREADS: number of computing threads that will be used by the software to run parallel processes. The argument is optional; if omitted, the pipeline will use as many threads as the server has.
- --pcr_free: Call variants using --pcr_indel_model NONE, which is appropriate for libraries prepared with a PCR-free library prep. Deduplication is still performed to identify optical duplicates.
- -h: print the command-line help and exit.
Variant calling from sorted BAM or CRAM
A single command is run to perform alignment, preprocessing and metrics collection, and SNV and indel calling from BAM or CRAM files:
sentieon-cli sentieon-pangenome [-h] \
-r REFERENCE \
--hapl HAPL \
--gbz GBZ \
-i SAMPLE_INPUT ... \
--pop_vcf POP_VCF \
-m MODEL_BUNDLE \
[-b INTERVAL_FILE] \
[-d DBSNP] \
[--dry_run] \
[-t NUMBER_THREADS] \
SAMPLE_VCF
With BAM or CRAM input, the pangenome pipeline requires the following new arguments:
-i SAMPLE_INPUT: the input sample file in BAM or CRAM format. One or more files can be supplied by passing multiple files after the -i argument.
Pipeline output
List of output files
The following files are output when processing WGS FASTQ with all features if
the output file is sample.vcf.gz:
- sample.vcf.gz: SNV and indel variant calls across the regions of the genome as defined in the -b BED file.
- sample_bwa_deduped.cram or sample_bwa_deduped.bam: bwa aligned, coordinate-sorted and duplicate-marked read data from the input FASTQ file(s).
- sample_mm2_deduped.cram or sample_mm2_deduped.bam: pangenome aligned, coordinate-sorted and duplicate-marked read data from the input FASTQ files. The reads in this file are aligned to the pangenome and lifted back to the reference genome.
- sample_metrics/: A directory containing QC metrics for the analyzed sample.
- sample_metrics/sample.txt.alignment_stat.txt: Metrics from the Sentieon® AlignmentStat algo.
- sample_metrics/sample.txt.base_distribution_by_cycle.txt: Metrics from the Sentieon® BaseDistributionByCycle algo.
- sample_metrics/coverage*: Coverage metrics from the Sentieon® CoverageMetrics algo.
- sample_metrics/sample.txt.gc_bias*: Metrics from the Sentieon® GCBias algo.
- sample_metrics/sample.txt.insert_size.txt: Metrics from the Sentieon® InsertSizeMetricAlgo algo.
- sample_metrics/sample.txt.mean_qual_by_cycle.txt: Metrics from the Sentieon® MeanQualityByCycle algo.
- sample_metrics/sample.txt.qual_distribution.txt: Metrics from the Sentieon® QualDistribution algo.
- sample_metrics/sample.txt.wgs.txt: Metrics from the Sentieon® WgsMetricsAlgo algo.
- sample_metrics/multiqc_report.html: Collected QC metrics aggregated by MultiQC.
Limitations of the Sentieon® pangenome pipeline
The Sentieon® pangenome pipeline currently only supports Minigraph-Cactus pangenomes with a GRCh38 reference sequence, such as those generated by the Human Pangenome Reference Consortium (HPRC). Please reach out to Sentieon® support for information on using the pipeline with other pangenomes.