7. Detailed usage of the tools

7.1. DRIVER binary

DRIVER is the binary used to execute all stages of the bioinformatics pipeline. A single call to the driver binary can run multiple algorithms; for example, the metrics stage is implemented as a single driver command call that runs several metrics algorithms at once.

7.1.1. DRIVER syntax

The general syntax of the DRIVER binary is:

sentieon driver OPTIONS --algo ALGORITHM ALGO_OPTION OUTPUT \
[--algo ALGORITHM2 ALGO_OPTION2 OUTPUT2]

In the case of running multiple algorithms in the same driver call, the OPTIONS are shared among all the algorithms.

The arguments (OPTIONS) for this command include:

  • -t NUMBER_THREADS: number of computing threads that will be used by the software to run parallel processes. The argument is optional; if omitted, the driver tool will use as many threads as the server has.
  • -r REFERENCE: location of the reference FASTA file. This argument is required in all algorithms except LocusCollector and Dedup without consensus.
  • -i INPUT_FILE: location of the BAM input file. This argument can be used multiple times to have the software use multiple input files; the result will be the same as if the input BAM files had been merged into one BAM file. Sentieon® only supports BAM files sorted by coordinate; the @HD line with the SO:coordinate attribute must be present. Sentieon® also requires that the input BAM files have an RG tag in the alignment section and a corresponding @RG line with a matching ID tag defined in the header. This argument is required in all algorithms except for CollectVCMetrics, GVCFtyper, DNAModelApply, SVSolver, TNModelApply, VarCal, ApplyVarCal and TNfilter.
  • -q QUALITY_RECALIBRATION_TABLE: location of the quality recalibration table output from the BQSR stage that will be used as an input during the BQSR stage and the Variant Calling stage. This argument can be used multiple times to have the software use multiple input calibration tables; each calibration table will apply a recalibration to the BAM files by matching the read group of the reads.
  • --interval INTERVAL: interval in the reference that will be used in the software. This argument can be used multiple times to perform the calculation on the union of all the intervals. INTERVAL can be specified as:
    • CONTIG[:START-END]: calculation will be done only on the corresponding contig. START and END are optional numbers to further reduce the interval; both are 1-based (the first base of the contig is position 1). You can input a comma-separated list of multiple contigs.
    • BED_FILE: location of the BED file containing the intervals. Providing an empty file will result in no processing done, as if the interval had 0 length. A compressed BGZF file is also supported.
    • PICARD_INTERVAL_FILE: location of the file containing the intervals, following the Picard interval standard. Providing an empty file will result in no processing done, as if the interval had 0 length. A compressed BGZF file is also supported.
    • VCF_FILE: location of VCF containing variant records whose genomic coordinates will be used as intervals. A compressed VCF.gz file is also supported.
  • --interval_padding PADDING_SIZE: adds PADDING_SIZE bases padding to the edges of the input intervals. The default value is 0.
  • --help: option to display help. The option can be used together with the --algo ALGORITHM to display help for the specific algorithm.
  • --read_filter FILTER,OPTION=VALUE,OPTION=VALUE: perform a filter or transformation of reads prior to the application of the algorithm. Please refer to Section 7.1.3 to get additional information on the available filters and their functionality.
  • --temp_dir DIRECTORY: determines where the temporary files will be stored. The default is the folder where the command is run ($PWD).
  • --skip_no_coor: determines whether to skip unmapped reads.
  • --cram_read_options decode_md=0: CRAM input option to turn off the NM/MD tag in the input CRAM.
  • --replace_rg ORIG_RG="NEW_RG_STRING": modifies the @RG read group tag of the next BAM input file to update the ORIG_RG with the new information; the NEW_RG_STRING needs to be a valid and complete string. This argument can be used multiple times, and each will affect the next -i BAM input. You can check Section 8.7 for a detailed example of its usage.
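
For illustration, a driver call combining several of the options above could look like the following; the thread count, file names and interval are placeholders, and ALGORITHM stands for any of the algorithms listed below.

sentieon driver -t 16 -r reference.fasta -i sample_sorted.bam \
  --interval chr1:1000000-2000000 --interval_padding 100 \
  --algo ALGORITHM ALGO_OPTION OUTPUT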

The supported algorithms (ALGORITHM) for this command are:

  • LocusCollector, Dedup: used in the remove duplicates stage.
  • Realigner: used in the indel realignment stage.
  • QualCal: used in the base quality recalibration stage.
  • MeanQualityByCycle, QualDistribution, GCBias, AlignmentStat, InsertSizeMetricAlgo, HsMetricAlgo, CoverageMetrics, BaseDistributionByCycle, QualityYield, WgsMetricsAlgo, SequenceArtifactMetricsAlgo: used to calculate QC metrics.
  • Genotyper, Haplotyper: used for germline variant calling analysis.
  • DNAscope and DNAModelApply: used in DNAscope germline variant calling analysis.
  • DNAscope and SVSolver: used in the DNAscope germline structural variant analysis.
  • DNAscope and VariantPhaser: used in the DNAscope germline variant calling analysis for PacBio® HiFi® reads.
  • ReadWriter: used to output the BAM file after base quality recalibration or merge multiple BAM files.
  • VarCal, ApplyVarCal: used in the VQSR stage.
  • GVCFtyper: used for germline joint variant calling analysis.
  • TNsnv, TNhaplotyper, TNhaplotyper2, ContaminationModel, OrientationBias, TNfilter: used for somatic tumor-normal or tumor-only analysis.
  • TNscope, TNModelApply: used for tumor-normal somatic and structural variant analysis.
  • RNASplitReadsAtJunction: used in the stage to split RNA reads at junctions.
  • ContaminationAssessment: used to identify cross-sample contamination from BAM files.
  • CollectVCMetrics: used to calculate post-variant calling metrics.

7.1.2. DRIVER ALGORITHM syntax

7.1.2.1. LocusCollector ALGORITHM

The LocusCollector algorithm collects read information that will be used for removing duplicate reads.

The input to the LocusCollector algorithm is a BAM file; its output is the score file indicating which reads are likely duplicates.

The LocusCollector algorithm requires the following ALGO_OPTION:

  • --fun SCORE: scoring function to use. Possible values for SCORE are:
    • score_info: calculates the score of a read pair for the Dedup algorithm. This is the default score function.
  • --consensus: set this option to turn on consensus deduplication.
  • --umi_tag TAG: Logic UMI tag for UMI barcode aware deduplication. The default value is None.
  • --umi_ecc_dist DISTANCE: UMI barcode error correction distance. Set it to 0 to turn off the barcode error correction. The default value is 1.
  • --rna: set this option for RNA sequence data aligned with STAR.

7.1.2.2. Dedup ALGORITHM

The Dedup algorithm performs the marking/removing of duplicate reads.

The input to the Dedup algorithm is a BAM file; its output is the BAM file after removing duplicate reads.

The Dedup algorithm requires the following ALGO_OPTION:

  • --score_info LOCUS_COLLECTOR_OUTPUT: location of the output file of the LocusCollector command call.

The Dedup algorithm accepts the following optional ALGO_OPTION:

  • --rmdup: set this option to remove duplicated reads in the output BAM file. If this option is not set, the duplicated reads will be marked as such with a flag, but not removed from the BAM file.
  • --cram_write_options compressor=[gzip|bzip2|lzma|rans]: CRAM output compression options. compressor=gzip is default if not defined.
  • --cram_write_options version=[2.1|3.0]: CRAM output version options. version=3.0 is default if not defined.
  • --metrics METRICS_FILE: location and filename of the output file containing the metrics data from the deduping stage.
  • --optical_dup_pix_dist DISTANCE: determine the maximum distance between two duplicate reads for them to be considered optical duplicates. The default value is 100.
  • --bam_compression COMPRESSION_LEVEL[0-9]: gzip compression level for the output BAM file. The default value is 6.
  • --output_dup_read_name: when using this option, the output of the command will not be a BAM file with marked/removed duplicates, but a list of read names for reads that were marked as duplicate.
  • --dup_read_name DUPLICATE_READ_NAME_FILE: when using this input all reads contained in the DUPLICATE_READ_NAME_FILE will be marked as duplicate, regardless of whether they are primary or non-primary.
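
For illustration, the two deduplication stages described above can be run as the following sketch (placeholder file names); note that, per the -r description in Section 7.1.1, the reference is not required for LocusCollector or for Dedup without consensus:

sentieon driver -t 16 -i sample_sorted.bam \
  --algo LocusCollector --fun score_info sample_score.txt.gz
sentieon driver -t 16 -i sample_sorted.bam \
  --algo Dedup --score_info sample_score.txt.gz \
  --metrics sample_dedup_metrics.txt sample_deduped.bam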

7.1.2.3. Realigner ALGORITHM

The Realigner algorithm performs the indel realignment.

The input to the Realigner algorithm is a BAM file; its output is the BAM file after realignment.

The Realigner algorithm accepts the following optional ALGO_OPTION:

  • -k KNOWN_SITES: location of the VCF file used as a set of known sites. The known sites will be used to help identify likely sites where the realignment is necessary; only indel variants in the file will be used. You can include multiple collections of known sites by specifying multiple files and repeating the -k KNOWN_SITES option.
  • --interval_list INTERVAL: interval in the reference that will be used in the calculation of the realign targets. Only a single input INTERVAL will be considered: if you repeat the --interval_list option, only the INTERVAL in the last one will be considered. INTERVAL can be specified as:
    • BED_FILE: location of the BED file containing the intervals.
    • PICARD_INTERVAL_FILE: location of the file containing the intervals, following the Picard interval standard.
  • --bam_compression COMPRESSION_LEVEL[0-9]: gzip compression level for the output BAM file. The default value is 6.
  • --cram_write_options compressor=[gzip|bzip2|lzma|rans]: CRAM output compression options. compressor=gzip is default if not defined.
  • --cram_write_options version=[2.1|3.0]: CRAM output version options. version=3.0 is default if not defined.
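
A minimal Realigner sketch, assuming an indexed reference and a VCF of known indels (file names are placeholders):

sentieon driver -t 16 -r reference.fasta -i sample_deduped.bam \
  --algo Realigner -k known_indels.vcf.gz sample_realigned.bam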

7.1.2.4. QualCal ALGORITHM

The QualCal algorithm calculates the recalibration table necessary to perform BQSR. The QualCal algorithm can also apply the recalibration and create the data required to generate a report.

The recalibration math depends on the platform (PL) tag of the read group; the QualCal algorithm supports the following platforms: ILLUMINA, ION_TORRENT, LS454, PACBIO, COMPLETE_GENOMICS, DNBSEQ. Support for sequencing data from the SOLID platform is not currently implemented.

The input to the QualCal algorithm is a BAM file; its output is a recalibration table or a CSV file containing the data required to create a report.

The QualCal algorithm accepts the following optional ALGO_OPTION:

  • -k KNOWN_SITES: location of the VCF file used as a set of known sites. The known sites will be used to make sure that known variant locations do not get artificially low quality scores as a result of true variants being misidentified as sequencing errors. You can include multiple collections of known sites by specifying multiple files and repeating the -k KNOWN_SITES option. We strongly recommend using as many known sites as possible, as otherwise the recalibration will consider variant sites to be sequencing errors.
  • --plot: indicates whether the command is being used to generate the data required to create a report.
  • --cycle_val_max: maximum allowed cycle value for the cycle covariate.
  • --before RECAL_TABLE: location of the previously calculated recalibration table; it will be used to apply the recalibration.
  • --after RECAL_TABLE.POST: location of the previously calculated results of applying the recalibration table; it will be used to calculate the data required to create a report.
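
For illustration, a possible BQSR sequence using the options above (all file names are placeholders): the first call computes the recalibration table, and the second applies it with -q while computing the post-recalibration table.

sentieon driver -t 16 -r reference.fasta -i sample_deduped.bam \
  --algo QualCal -k dbsnp.vcf.gz -k known_indels.vcf.gz recal_data.table
sentieon driver -t 16 -r reference.fasta -i sample_deduped.bam -q recal_data.table \
  --algo QualCal -k dbsnp.vcf.gz -k known_indels.vcf.gz recal_data.table.post

A further QualCal call with --plot, --before recal_data.table and --after recal_data.table.post can then produce the CSV data required to create the report.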

7.1.2.5. MeanQualityByCycle ALGORITHM

The MeanQualityByCycle algorithm calculates the mean base quality score for each sequencing cycle.

The input to the MeanQualityByCycle algorithm is a BAM file; its output is the metrics data.

The MeanQualityByCycle algorithm does not accept any ALGO_OPTION.

7.1.2.6. QualDistribution ALGORITHM

The QualDistribution algorithm calculates the number of bases with a specific base quality score.

The input to the QualDistribution algorithm is a BAM file; its output is the metrics data.

The QualDistribution algorithm does not accept any ALGO_OPTION.

7.1.2.7. GCBias ALGORITHM

The GCBias algorithm calculates the GC bias in the reference and the sample.

The input to the GCBias algorithm is a BAM file; its output is the metrics data.

The GCBias algorithm accepts the following optional ALGO_OPTION:

  • --summary SUMMARY_FILE: location and filename of the output file summarizing the GC Bias metrics.
  • --accum_level LEVEL: determines the accumulation levels. The possible values of LEVEL are ALL_READS, SAMPLE, LIBRARY, READ_GROUP. The default value is ALL_READS.
  • --also_ignore_duplicates: determines whether the output metrics will be calculated using unique non duplicated reads.

7.1.2.8. HsMetricAlgo ALGORITHM

The HsMetricAlgo algorithm calculates the Hybrid Selection specific metrics for the sample and the AT/GC dropout metrics for the reference.

The input to the HsMetricAlgo algorithm is a BAM file; its output is the metrics data.

The HsMetricAlgo algorithm requires the following ALGO_OPTION:

  • --targets_list TARGETS_FILE: location and filename of the interval list input file that contains the locations of the targets.
  • --baits_list BAITS_FILE: location and filename of the interval list input file that contains the locations of the baits used.
  • --clip_overlapping_reads: determines whether to clip overlapping reads during the calculation.
  • --min_map_qual QUALITY: determines the filtering quality of the reads used. Any reads with mapping quality less than QUALITY will be filtered out.
  • --min_base_qual QUALITY: determines the filtering quality of the bases used in variant calling. Any base with quality less than QUALITY will be ignored.
  • --coverage_cap COVERAGE: determines the maximum coverage limit used in the histogram.
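
For illustration, a possible HsMetricAlgo call using the required target and bait interval lists (file names are placeholders):

sentieon driver -t 16 -r reference.fasta -i sample_deduped.bam \
  --algo HsMetricAlgo --targets_list targets.interval_list \
  --baits_list baits.interval_list hs_metrics.txt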

7.1.2.9. AlignmentStat ALGORITHM

The AlignmentStat algorithm calculates statistics about the alignment of the reads.

The input to the AlignmentStat algorithm is a BAM file; its output is the metrics data.

The AlignmentStat algorithm accepts the following optional ALGO_OPTION:

  • --adapter_seq SEQUENCE_LIST: the sequence of the adapters used in the sequencing, provided as a comma separated list. The default value is the list of default Illumina adapters.

7.1.2.10. InsertSizeMetricAlgo ALGORITHM

The InsertSizeMetricAlgo algorithm calculates the statistical distribution of insert sizes.

The input to the InsertSizeMetricAlgo algorithm is a BAM file; its output is the metrics data.

The InsertSizeMetricAlgo algorithm accepts the following optional ALGO_OPTION:

  • --deviation DEVIATION: Maximum multiples of standard deviation before the histogram is trimmed down. The default value is 10.0.
  • --hist_width HISTOGRAM_WIDTH: sets the HISTOGRAM_WIDTH to override the automatic truncation of histogram tail. The default value is 0.
  • --min_read_ratio RATIO: Minimum ratio of reads for a read category to be included in the histogram. The default value is 0.05.
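
Because a single driver call can run several algorithms, the QC metrics described in the preceding sections can be collected in one pass over the BAM file; the sketch below uses placeholder file names and combines five of them:

sentieon driver -t 16 -r reference.fasta -i sample_deduped.bam \
  --algo MeanQualityByCycle mq_metrics.txt \
  --algo QualDistribution qd_metrics.txt \
  --algo GCBias --summary gc_summary.txt gc_metrics.txt \
  --algo AlignmentStat aln_metrics.txt \
  --algo InsertSizeMetricAlgo is_metrics.txt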

7.1.2.11. CoverageMetrics ALGORITHM

The CoverageMetrics algorithm calculates the depth coverage of the BAM file. The coverage is aggregated by interval if the --interval option is included at the driver level; if no --interval option is included, the aggregation per interval will be done per contig for all bases in the reference and the per locus coverage output file will be about 60GB. The coverage is aggregated by gene if a RefSeq file is included.

The input to the CoverageMetrics algorithm is a BAM file; it is recommended to use the BAM file produced after deduplication for this analysis. Its outputs are files containing the metrics data organized by partition, aggregation and output type. Possible outputs are:

  • summary: contains the depth data.
  • statistics: contains the histogram of loci with specific depth.
  • cumulative_coverage_counts: contains the histogram of loci with depth larger than x.
  • cumulative_coverage_proportions: contains the normalized histogram of loci with depth larger than x.

Examples of output files when the output name is OUTPUT:

  • OUTPUT: the per locus coverage with no partition.
  • OUTPUT.sample_summary: the summary for PARTITION_GROUP sample, aggregated over all bases.
  • OUTPUT.library_interval_statistics: the statistics for PARTITION_GROUP library, aggregated by interval.

The CoverageMetrics algorithm accepts the following optional ALGO_OPTION:

  • --partition PARTITION_GROUP: determines how to partition the data. Possible values are readgroup or a comma separated combination of the RG attributes, namely sample, platform, library, center. The default value is sample. You can include multiple partition groups by repeating the --partition option, and each output file will be created once per --partition option.
  • --gene_list REFSEQ_FILE: location of the RefSeq file used to aggregate the results of the CoverageMetrics algorithm to the gene level.
  • Filtering options:
    • --min_map_qual MAP_QUALITY: determines the filtering quality of the reads used. Any reads with mapping quality less than MAP_QUALITY will be filtered out.
    • --max_map_qual MAP_QUALITY: determines the filtering quality of the reads used. Any reads with mapping quality larger than MAP_QUALITY will be filtered out.
    • --min_base_qual QUALITY: determines the filtering quality of the bases used. Any base with quality less than QUALITY will be ignored.
    • --max_base_qual QUALITY: determines the filtering quality of the bases used. Any base with quality larger than QUALITY will be ignored.
    • --cov_thresh THRESHOLD: add percentage of bases in the aggregation that have coverage larger than the threshold. You can include multiple thresholds by repeating the --cov_thresh argument.
  • Omit output options:
    • --omit_base_output: skip the output of the per locus coverage with no partition. This option can be used to save space when you do not use intervals.
    • --omit_sample_stat: skip the output of summary results aggregated over all bases (_summary)
    • --omit_locus_stat: skip the output of all histogram files (both _cumulative_coverage_counts and _cumulative_coverage_proportions).
    • --omit_interval_stat: skip the output of all interval statistics files (_interval_statistics).
  • --count_type TYPE: determines how to deal with overlapping reads from the same fragment. Possible options are:
    • 0: to count overlapping reads even if they come from the same fragment. This is the default value.
    • 1: to count overlapping reads from the same fragment only once.
    • 2: to count overlapping reads only if the reads in the fragment have consistent bases.
  • --print_base_counts: include the number of “AGCTND” in the output per locus coverage with no partition.
  • --include_ref_N: include the coverage data in loci where the reference genome is set to N.
  • --ignore_del_sites: ignore the coverage data in loci where there are deletions.
  • --include_del: this argument will interact with others as follows:
    • if ignore_del_sites is off, count Deletion as depth
    • if print_base_counts is on, include number of ‘D’
  • --histogram_scale [log/linear], --histogram_low MIN_DEPTH, --histogram_high MAX_DEPTH, --histogram_bin_count NUM_BINS: determine the scale type, range and number of bins for the histograms. The default values are log, 1, 500 and 499, respectively.
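
For illustration, a possible CoverageMetrics call restricted to a capture region and aggregated to the gene level (the BED and RefSeq file names are placeholders):

sentieon driver -t 16 -r reference.fasta -i sample_deduped.bam \
  --interval targets.bed --algo CoverageMetrics \
  --gene_list refGene.txt --cov_thresh 20 --cov_thresh 50 coverage_metrics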

7.1.2.12. CollectVCMetrics ALGORITHM

The CollectVCMetrics algorithm collects metrics related to the variants present in the input VCF.

The input to the CollectVCMetrics algorithm is a VCF file and a DBSNP file; its output is a pair of files containing information about the variants from the VCF file.

The CollectVCMetrics algorithm requires the following ALGO_OPTION:

  • -d dbSNP_FILE: location of the Single Nucleotide Polymorphism database (dbSNP). Only one file is supported.
  • -v INPUT: location of the VCF file on which the metrics will be calculated. Only one file is supported.
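
For illustration, a possible CollectVCMetrics call (no -i BAM input is needed at the driver level; file names are placeholders):

sentieon driver -t 16 -r reference.fasta --algo CollectVCMetrics \
  -d dbsnp.vcf.gz -v sample_variants.vcf.gz vc_metrics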

7.1.2.13. BaseDistributionByCycle

The BaseDistributionByCycle algorithm calculates the nucleotide distribution per sequencer cycle.

The input to the BaseDistributionByCycle algorithm is a BAM file; its output is the metrics data.

The BaseDistributionByCycle algorithm accepts the following optional ALGO_OPTION:

  • --aligned_reads_only: determines whether to calculate the base distribution over aligned reads only.
  • --pf_reads_only: determines whether to calculate the base distribution over PF reads only.

7.1.2.14. QualityYield

The QualityYield algorithm collects metrics related to reads that pass quality thresholds and Illumina-specific filters.

The input to the QualityYield algorithm is a BAM file; its output is the metrics data.

The QualityYield algorithm accepts the following optional ALGO_OPTION:

  • --include_secondary: determines whether to include bases from secondary alignments in the calculation.
  • --include_supplementary: determines whether to include bases from supplementary alignments in the calculation.

7.1.2.15. WgsMetricsAlgo

The WgsMetricsAlgo algorithm collects metrics related to the coverage and performance of whole genome sequencing (WGS) experiments.

The input to the WgsMetricsAlgo algorithm is a BAM file; its output is the metrics data.

The WgsMetricsAlgo algorithm accepts the following optional ALGO_OPTION:

  • --min_map_qual QUALITY: determines the filtering quality of the reads used in the calculation. Any read with quality less than QUALITY will be ignored. The default value is 20.
  • --min_base_qual QUALITY: determines the filtering quality of the bases used in the calculation. Any base with quality less than QUALITY will be ignored. The default value is 20.
  • --coverage_cap COVERAGE: determines the maximum coverage limit for the histogram. Any position with coverage higher than COVERAGE will have its coverage set to COVERAGE.
  • --sample_size SIZE: determines the Sample Size used for the Theoretical Het Sensitivity sampling. The default value is 10000.
  • --include_unpaired: determines whether to count unpaired reads and paired reads with one end unmapped.
  • --base_qual_histogram: determines whether to report the base quality histogram.

7.1.2.16. SequenceArtifactMetricsAlgo

The SequenceArtifactMetricsAlgo algorithm collects metrics that quantify single-base sequencing artifacts and OxoG artifacts.

The input to the SequenceArtifactMetricsAlgo algorithm is a BAM file; its output is the metrics data.

The SequenceArtifactMetricsAlgo algorithm accepts the following optional ALGO_OPTION:

  • --dbsnp FILE: location of the Single Nucleotide Polymorphism database (dbSNP) used to exclude regions around known polymorphisms. Only one file is supported.
  • --min_map_qual QUALITY: determines the filtering quality of the reads used in the calculation. Any read with quality less than QUALITY will be ignored. The default value is 30.
  • --min_base_qual QUALITY: determines the filtering quality of the bases used in the calculation. Any base with quality less than QUALITY will be ignored. The default value is 20.
  • --include_unpaired: determines whether to count unpaired reads and paired reads with one end unmapped.
  • --include_duplicates: determines whether to count duplicated reads.
  • --include_non_pf_reads: determines whether to count non-PF reads.
  • --min_insert_size ISIZE: determines the filtering insert size of the reads used in the calculation. Any read with insert size less than ISIZE will be ignored. The default value is 60.
  • --max_insert_size ISIZE: determines the filtering insert size of the reads used in the calculation. Any read with insert size larger than ISIZE will be ignored. The default value is 600.
  • --tandem_reads: determines whether the mate pairs are being sequenced from the same strand.
  • --context_size SIZE: determines the number of context bases to include on each side. The default value is 1.

7.1.2.17. Genotyper ALGORITHM

The Genotyper algorithm performs the Unified Genotyper variant calling.

The input to the Genotyper algorithm is a BAM file; its output is a VCF file.

The Genotyper algorithm accepts the following optional ALGO_OPTION:

  • --annotation 'ANNOTATION_LIST': determines additional annotations that will be added to the output VCF. Use a comma separated list to enable or disable annotations. Include 'none' to remove the default annotations; prefix annotations with the exclamation point (!) to disable the specific annotation. See Section 8.3 for supported annotations.
  • -d dbSNP_FILE: location of the Single Nucleotide Polymorphism database (dbSNP) used to label known variants. Only one file is supported.
  • --var_type VARIANT_TYPE: determine which variant types will be called; possible values for VARIANT_TYPE are:
    • SNP to call only Single Nucleotide Polymorphism. This is the default behavior.
    • INDEL to call only insertion-deletions.
    • both to call both SNPs and INDELs.
  • --call_conf CONFIDENCE: determines the threshold of variant quality to call a variant. Variants with quality less than CONFIDENCE will be removed.
  • --emit_conf CONFIDENCE: determines the threshold of variant quality to emit a variant. Variants with quality less than CONFIDENCE will not be added to the output VCF file.
  • --emit_mode MODE: determines what calls will be emitted. Possible values for MODE are:
    • variant: emit calls only at confident variant sites. This is the default behavior.
    • confident: emit calls at confident variant sites or confident reference sites.
    • all: emit all calls, regardless of their confidence.
  • --min_base_qual QUALITY: determines the filtering quality of the bases used in variant calling. Any base with quality less than QUALITY will be ignored. The default value is 17.
  • --ploidy PLOIDY: determines the ploidy number of the sample being processed. The default value is 2.
  • --given GIVEN_VCF: perform variant calling using only the variants provided in the GIVEN_VCF. The calling will only evaluate the locus and alleles provided in the file, and only if the variant has the FILTER column set to PASS or ".".
  • --genotype_model MODEL: determines which model to use for genotyping and QUAL calculation. MODEL can be coalescent to use the "so called exact model" based on the coalescent theory in population genetics or multinomial to use the simplified model that assumes variants are independent; the default is coalescent.
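
A minimal Genotyper sketch (placeholder file names) that applies a recalibration table with -q and calls both SNPs and INDELs:

sentieon driver -t 16 -r reference.fasta -i sample_deduped.bam -q recal_data.table \
  --algo Genotyper -d dbsnp.vcf.gz --var_type both \
  --call_conf 30 --emit_conf 30 sample_genotyper.vcf.gz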

7.1.2.18. Haplotyper ALGORITHM

The Haplotyper algorithm performs the Haplotype variant calling.

The input to the Haplotyper algorithm is a BAM file; its output is a VCF file.

The Haplotyper algorithm accepts the following optional ALGO_OPTION:

  • --annotation 'ANNOTATION_LIST': determines additional annotations that will be added to the output VCF. Use a comma separated list to enable or disable annotations. Include 'none' to remove the default annotations; prefix annotations with the exclamation point (!) to disable the specific annotation. See Section 8.3 for supported annotations.
  • -d dbSNP_FILE: location of the Single Nucleotide Polymorphism database (dbSNP) used to label known variants. Only one file is supported.
  • --call_conf CONFIDENCE: determines the threshold of variant quality to call a variant. Variants with quality less than CONFIDENCE will be removed. This option is ignored when the --emit_mode is gvcf.
  • --emit_conf CONFIDENCE: determines the threshold of variant quality to emit a variant. Variants with quality less than CONFIDENCE will not be added to the output VCF file. This option is ignored when the --emit_mode is gvcf.
  • --emit_mode MODE: determines what calls will be emitted. Possible values for MODE are:
    • variant: emit calls only at confident variant sites. This is the default behavior.
    • confident: emit calls at confident variant sites or confident reference sites.
    • all: emit all calls, regardless of their confidence.
    • gvcf: emits additional information required for joint calling. This option is required if you want to perform joint calling using the GVCFtyper algorithm.
  • --gq_bands LIST_OF_BANDS: determines the bands that will be used to compress variants of similar genotype quality (GQ) that will be emitted as a single VCF record in the GVCF output file. The LIST_OF_BANDS is a comma-separated list of bands where each band is defined by START-END/STEP. The default value is 1-60,60-99/10,99.
  • --min_base_qual QUALITY: determines the filtering quality of the bases used in variant calling. Any base with quality less than QUALITY will be ignored. The default value is 10.
  • --pcr_indel_model MODEL: PCR indel model used to weed out false positive indels more or less aggressively. The possible MODELs are NONE (used for PCR-free samples), HOSTILE, AGGRESSIVE and CONSERVATIVE, in order of decreasing aggressiveness. The default value is CONSERVATIVE.
  • --phasing [1/0]: flag to enable or disable phasing in the output when using emit_mode GVCF. The default value is 1 (on) and this flag has no impact when using an emit_mode other than GVCF. Phasing is only calculated for diploid samples.
  • --ploidy PLOIDY: determines the ploidy number of the sample being processed. The default value is 2.
  • --prune_factor FACTOR: minimum pruning factor in local assembly; paths with fewer supporting kmers than FACTOR will be pruned from the graph. The default value is 2.
  • --trim_soft_clip: determines whether soft clipped bases in the reads should be excluded from the variant calling. This argument is only recommended to process RNA reads.
  • --given GIVEN_VCF: perform variant calling using only the variants provided in the GIVEN_VCF. The calling will only evaluate the locus and alleles provided in the file, and only if the variant has the FILTER column set to PASS or ".". This option cannot be used in conjunction with --emit_mode gvcf.
  • --bam_output OUTPUT_BAM: output a BAM file containing modified reads after the local reassembly done by the variant calling. This option should only be used in conjunction with a small bed file for troubleshooting purposes.
  • --genotype_model MODEL: determines which model to use for genotyping and QUAL calculation. MODEL can be coalescent to use the "so called exact model" based on the coalescent theory in population genetics or multinomial to use the simplified model that assumes variants are independent; the default is coalescent.
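
A minimal Haplotyper sketch (placeholder file names) that emits a GVCF for later joint calling with GVCFtyper:

sentieon driver -t 16 -r reference.fasta -i sample_deduped.bam -q recal_data.table \
  --algo Haplotyper -d dbsnp.vcf.gz --emit_mode gvcf sample1.g.vcf.gz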

7.1.2.19. ReadWriter ALGORITHM

The ReadWriter algorithm outputs the result of applying the Base Quality Score Recalibration to a file.

The ReadWriter algorithm can also merge BAM files, and/or convert them into CRAM files.

The input to the ReadWriter algorithm is one or multiple BAM files and one or multiple recalibration tables; its output is the BAM file after recalibration. If the output file extension is CRAM, a CRAM file will be created. If multiple input files were used, the output file will be the result of merging all the files.

The ReadWriter algorithm accepts the following optional ALGO_OPTION:

  • --bam_compression COMPRESSION_LEVEL[0-9]: gzip compression level for the output BAM file. The default value is 6.

  • --cram_write_options compressor=[gzip|bzip2|lzma|rans]: CRAM output compression options. compressor=gzip is default if not defined.

  • --cram_write_options version=[2.1|3.0]: CRAM output version options. version=3.0 is default if not defined.

  • --output_mapq_filter min_map_qual=MIN_QUAL,max_map_qual=MAX_QUAL: filter reads based on their mapping quality; ReadWriter will only output reads that have a mapping quality equal or larger than MIN_QUAL and equal or smaller than MAX_QUAL.

  • --output_flag_filter MASK:VALUE[,VALUE,VALUE...]: filter reads based on the flags. MASK is the set of bits that should be considered, and VALUE is a comma separated list of FLAGS that need to be set for a read to be kept, or 0 if all bits in the MASK need to be unset. You can include multiple instances of this option and they will be applied sequentially in the same order they are in the command line. Some examples:

    • To only output reads that are PROPER_PAIR you would use --output_flag_filter PROPER_PAIR:PROPER_PAIR or --output_flag_filter 0x2:0x2.

    • To output all reads except unmapped reads with UNMAP flag set, you would use --output_flag_filter UNMAP:0 or --output_flag_filter 0x4:0

    • To write a BAM file that does not contain either duplicated reads or unmapped reads you would use two options that will be applied sequentially --output_flag_filter UNMAP:0 --output_flag_filter DUP:0 which will first only allow reads with the UNMAP flag unset, and then only allow reads that have the DUP flag unset.

    • To write a BAM file that does not contain reads where both mates are unmapped you would use --output_flag_filter UNMAP+MUNMAP:0,UNMAP,MUNMAP, which will allow reads as long as both UNMAP and MUNMAP are not set together.

    • More complex cases can be constructed with the help of a truth table, where the MASK is the sum of all bits you want to consider, and the VALUEs are the sum of those bits set for each of the different conditions; for instance, to find read pairs where the two non-secondary and non-supplementary reads have the same orientation (F1F2 or R1R2) hinting at a SV, you would create a table as follows:

      Flag           #      F1F2   R1R2
      PAIRED         0x001  x      x
      REVERSE        0x010  0      x
      MREVERSE       0x020  0      x
      SECONDARY      0x100  0      0
      SUPPLEMENTARY  0x800  0      0
      (sum)          0x931  0x001  0x031

      and the option would be --output_flag_filter 0x931:0x001,0x031 or, in more complex terms, --output_flag_filter PAIRED+REVERSE+MREVERSE+SECONDARY+SUPPLEMENTARY:PAIRED,PAIRED+REVERSE+MREVERSE

    The table below is a reminder of the flag definitions and their naming convention.

    Flag #   Flag name      Description                                              Decimal
    0x1      PAIRED         paired-end (or multiple-segment) sequencing technology   1
    0x2      PROPER_PAIR    each segment properly aligned according to the aligner   2
    0x4      UNMAP          segment unmapped                                         4
    0x8      MUNMAP         next segment in the template unmapped                    8
    0x10     REVERSE        SEQ is reverse complemented                              16
    0x20     MREVERSE       SEQ of the next segment in the template is reversed      32
    0x40     READ1          the first segment in the template                        64
    0x80     READ2          the last segment in the template                         128
    0x100    SECONDARY      secondary alignment                                      256
    0x200    QCFAIL         not passing quality controls                             512
    0x400    DUP            PCR or optical duplicate                                 1024
    0x800    SUPPLEMENTARY  supplementary alignment                                  2048

We recommend running the ReadWriter algorithm at the same command call as one of the variant calls, to reduce overhead.
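
For example, the recalibrated BAM can be written in the same call that performs variant calling, sharing a single pass over the input (a sketch with placeholder file names):

sentieon driver -t 16 -r reference.fasta -i sample_deduped.bam -q recal_data.table \
  --algo Haplotyper -d dbsnp.vcf.gz sample_variants.vcf.gz \
  --algo ReadWriter sample_recalibrated.bam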

7.1.2.20. GVCFtyper ALGORITHM

The GVCFtyper algorithm performs the joint variant calling of multiple samples, provided that each single sample has been previously processed using the Haplotyper algorithm with the option --emit_mode gvcf.

The GVCFtyper algorithm has no input at the driver level other than the reference file; its output is a VCF containing the jointly called variants for all samples.

The GVCFtyper algorithm requires the following ALGO_OPTION:

  • -v INPUT: location of the GVCF file from the variant calling algorithm performed on a single sample using the extra option --emit_mode gvcf. You can include GVCF files from multiple samples by specifying multiple files and repeating the -v INPUT option. You can use VCF files compressed with bgzip and indexed.
  • Alternatively, you can input a list of GVCF files at the end of the command after the output file. Thus, the following 2 commands are interchangeable:
sentieon driver -r REFERENCE --algo GVCFtyper \
  -v s1_VARIANT_GVCF -v s2_VARIANT_GVCF -v s3_VARIANT_GVCF VARIANT_VCF
sentieon driver -r REFERENCE --algo GVCFtyper \
  VARIANT_VCF s1_VARIANT_GVCF s2_VARIANT_GVCF s3_VARIANT_GVCF
  • You can read in the GVCF file list from a text file using the following commands:
gvcf_argument=""
while read -r line; do
 gvcf_argument=$gvcf_argument" -v $line"
done < "list_of_gvcfs"
sentieon driver -r REFERENCE --algo GVCFtyper $gvcf_argument output-joint.vcf
  • You can also read in the GVCF file list from a text file using the following command leveraging the stdin pipe (-):
cat list_of_gvcfs | sentieon driver -r REFERENCE --algo GVCFtyper output-joint.vcf -
  • You could input all files from a specific folder using the following command:
sentieon driver -r REFERENCE --algo GVCFtyper output-joint.vcf sample*.g.vcf

The GVCFtyper algorithm accepts the following optional ALGO_OPTION:

  • --annotation 'ANNOTATION_LIST': determines additional annotations that will be added to the output VCF. Use a comma separated list to enable or disable annotations. Include 'none' to remove the default annotations; prefix annotations with the exclamation point (!) to disable the specific annotation. See Section 8.3 for supported annotations.
  • -d dbSNP_FILE: location of the Single Nucleotide Polymorphism database (dbSNP) used to label known variants. Only one file is supported.
  • --call_conf CONFIDENCE: determines the threshold of variant quality to call a variant. Variants with quality less than CONFIDENCE will be removed. The default value is 30.
  • --emit_conf CONFIDENCE: determines the threshold of variant quality to emit a variant. Variants with quality less than CONFIDENCE will not be added to the output VCF file. The default value is 30.
  • --emit_mode MODE: determines what calls will be emitted. Possible values for MODE are:
    • variant: emit calls only at confident variant sites. This is the default behavior.
    • confident: emit calls at confident variant sites or confident reference sites.
    • all: emit all calls, regardless of their confidence.
  • --max_alt_alleles NUMBER: Maximum number of alternate alleles. The default value is 100.
  • --genotype_model MODEL: determines which model to use for genotyping and QUAL calculation. MODEL can be coalescent to use the "so called exact model" based on the coalescent theory in population genetics or multinomial to use the simplified model that assumes variants are independent; the default is coalescent.

7.1.2.21. VarCal ALGORITHM

The VarCal algorithm calculates the Variant Quality Score Recalibration (VQSR). VQSR assigns a well-calibrated probability score to individual variant calls, to enable more accurate control in determining the most likely variants. For that, VQSR uses highly confident known sites to build a recalibration model and determine the probability that called sites are true. For more information about the algorithm, you can check http://gatkforums.broadinstitute.org/discussion/39/variant-quality-score-recalibration-vqsr. For information on the recommended resources to use in VQSR, you can check https://www.broadinstitute.org/gatk/guide/article?id=1259.

The VarCal algorithm has no input at the driver level other than the reference file; its output is a recalibration file containing additional annotations related to the VQSR.

The VarCal algorithm requires the following ALGO_OPTION:

  • -v INPUT: location of the VCF file from the variant calling algorithm; you can use a VCF file compressed with bgzip and indexed.

  • --tranches_file TRANCHES_FILE: location and filename of the file containing the partition of the call sets into quality tranches.

  • --annotation ANNOTATION: determines the annotations that will be used during the recalibration. You can include multiple annotations in the optimization by repeating the --annotation ANNOTATION option. You can use all annotations present in the original variant call file.

  • --resource RESOURCE_FILE --resource_param PARAM: location of the VCF file used as a training/truth resource in VQSR, followed by parameters determining how the file will be used. You can include multiple collections by specifying multiple files and repeating the --resource RESOURCE_FILE --resource_param PARAM option. The PARAM argument follows the syntax:

    LABEL,known=IS_KNOWN,training=IS_TRAIN,truth=IS_TRUTH,prior=PRIOR
    
    • LABEL is a descriptive name for the resource.
    • IS_KNOWN can be true or false, and determines whether the sites contained in the resource will be used to stratify output metrics.
    • IS_TRAIN can be true or false, and determines whether the sites contained in the resource will be used for training the recalibration model.
    • IS_TRUTH can be true or false, and determines whether the sites contained in the resource will be considered true sites.
    • PRIOR is a value that reflects your confidence in how reliable the resource is as a truth set.

The VarCal algorithm accepts the following optional ALGO_OPTION:

  • --srand RANDOM_SEED: determines the seed to use in the random number generation. You can set RANDOM_SEED to 0 and the software will use the random seed from your computer. In order to generate a deterministic result, you should use a non-zero RANDOM_SEED and set the NUMBER_THREADS to 1.
  • --var_type VARIANT_TYPE: determine which variant types will be recalibrated; possible values for VARIANT_TYPE are:
    • SNP to recalibrate only Single Nucleotide Polymorphism. This is the default behavior.
    • INDEL to recalibrate only insertion-deletions.
    • (do not use) BOTH to recalibrate both SNPs and INDELs. This setting SHOULD NOT be used, as VQSR should be performed independently for SNPs and INDELs.
  • --tranche TRANCH_THRESHOLD: normalized quality threshold for each tranche; TRANCH_THRESHOLD is a number between 0 and 100. Multiple instances of the option are allowed and will create as many tranches as there are thresholds. The default values are 90, 99, 99.9 and 100.
  • --max_gaussians MAX_GAUSS: determines the maximum number of Gaussians that will be used for the positive recalibration model. The default value is 8 for SNP and 4 for INDEL.
  • --max_neg_gaussians MAX_GAUSS: determines the maximum number of Gaussians that will be used for the negative recalibration model. The default value is 2.
  • --max_iter MAX_ITERATIONS: determines the maximum number of iterations for the Expectation Maximization (EM) optimization. The default value is 150.
  • --max_mq MAPQ: indicates the maximum MQ in your data, which will be used to perform a logit jitter transform of the MQ to make the distribution closer to a Gaussian.
  • --aggregate_data AGGREGATE_VCF: location of an additional VCF file containing variants called from other similar samples; these additional data will increase the effective sample size for the statistical model calibration. Multiple instances of the option are allowed.
  • --plot_file PLOT_FILE: location of the temporary file containing the necessary data to generate the reports from the VarCal algorithm.
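
As a sketch of VarCal usage for the SNP model, the command below follows the option syntax above; the resource files, prior values and annotation names are placeholders reflecting common VQSR practice rather than values prescribed by this manual.

sentieon driver -t 16 -r reference.fasta --algo VarCal -v sample_variants.vcf.gz \
  --var_type SNP --tranches_file snp.tranches \
  --resource hapmap.vcf.gz --resource_param hapmap,known=false,training=true,truth=true,prior=15.0 \
  --resource dbsnp.vcf.gz --resource_param dbsnp,known=true,training=false,truth=false,prior=2.0 \
  --annotation QD --annotation MQ --annotation FS snp.recal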

7.1.2.22. ApplyVarCal ALGORITHM

The ApplyVarCal algorithm combines the output information from the VQSR with the original variant information.

The ApplyVarCal algorithm has no input at the driver level other than the reference file; its output is a copy of the original VCF containing additional annotations from the VQSR.

The ApplyVarCal algorithm requires the following ALGO_OPTION:

  • -v INPUT: location of the VCF file from the variant calling algorithm. It should be the same as the one used in the VarCal algorithm; you can use a VCF file compressed with bgzip and indexed.
  • --recal VARIANT_RECAL_DATA: location of the VCF file output from the VarCal algorithm.
  • --tranches_file TRANCHES_FILE: location of the tranches file output from the VarCal algorithm.
  • --var_type VARIANT_TYPE: determine which variant types will be recalibrated. This option should be consistent with the one used in the VarCal algorithm.

Alternatively, you can use the option --vqsr_model to input a comma-separated list of the required information for multiple VQSR models; this option allows you to apply both a SNP and INDEL mode in a single command line. The syntax of the option is:

--vqsr_model var_type=VARIANT_TYPE,\
              recal=VARIANT_RECAL_DATA,\
              tranches_file=TRANCHES_FILE,\
              sensitivity=SENSITIVITY

The ApplyVarCal algorithm accepts the following optional ALGO_OPTION:

  • --sensitivity SENSITIVITY: determines the sensitivity to the available truth sites; only tranches with a threshold larger than the sensitivity will be included in the recalibration. We recommend you use a sensitivity number that is included in the tranche threshold list of the VarCal algorithm; this will reduce rounding issues. The default value is NULL, so that no tranche filtering is applied and only the LOW_VQSLOD filter is applied.
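
Continuing the SNP sketch from the VarCal section, a possible ApplyVarCal call (placeholder file names) is:

sentieon driver -t 16 -r reference.fasta --algo ApplyVarCal -v sample_variants.vcf.gz \
  --var_type SNP --recal snp.recal --tranches_file snp.tranches \
  --sensitivity 99.9 sample_variants.vqsr.snp.vcf.gz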

7.1.2.23. TNsnv ALGORITHM

The TNsnv algorithm performs the somatic variant calling on the tumor-normal matched pair or the tumor and panel of normal data, using a Genotyper algorithm.

The input to the TNsnv algorithm is a BAM file; its output is a VCF file.

The TNsnv algorithm requires the following ALGO_OPTION:

  • --tumor_sample SAMPLE_NAME: name of the SM tag in the BAM file for the tumor sample.

Depending on the mode in which it is run, the TNsnv algorithm may require the following ALGO_OPTION:

  • --normal_sample SAMPLE_NAME: name of the SM tag in the BAM file for the normal sample.
  • --detect_pon: indicates that you are using the TNsnv algorithm to create a VCF file that will be part of a panel of normal.
  • --cosmic COSMIC_VCF: location of the Catalogue of Somatic Mutations in Cancer (COSMIC) VCF file used to create the panel of normal file. Only one file is supported.
  • --pon PANEL_OF_NORMAL_VCF: location of the file containing the variants detected in the Panel of Normal analysis that will be used to remove false positives. Only one file is supported.

The TNsnv algorithm accepts the following optional ALGO_OPTION:

  • --dbsnp dbSNP_FILE: location of the Single Nucleotide Polymorphism database (dbSNP). The variants in the dbSNP will be more likely to be marked as germline as they require more evidence of absence in the normal. Only one file is supported.
  • --call_stats_out CALL_STATS_FILE: location and filename of the file containing the call stats information from the somatic variant calling.
  • --stdcov_out COVERAGE_FILE: location and filename of the wiggle file containing the standard coverage.
  • --tumor_depth_out TUMOR_DEPTH_FILE: location and filename of the wiggle file containing the depth of the tumor sample reads.
  • --normal_depth_out NORMAL_DEPTH_FILE: location and filename of the wiggle file containing the depth of the normal sample reads.
  • --power_out POWER_FILE: location and filename of the power file.
  • --min_base_qual QUALITY: determines the filtering quality of the bases used in variant calling. Any base with quality less than QUALITY will be ignored. The default value is 5.
  • --min_init_tumor_lod NUMBER: minimum tumor log odds in the initial pass calling variants. The default value is 4.
  • --min_tumor_lod NUMBER: minimum tumor log odds in the final call of variants. The default value is 6.3.
  • --min_normal_lod NUMBER: minimum normal log odds used to check that the tumor variant is not a normal variant. The default value is 2.2.
  • --contamination_frac NUMBER: estimation of the contamination fraction from other samples. The default value is 0.02.
  • --min_cell_mutation_frac NUMBER: minimum fraction of cells which have mutation. The default value is 0.
  • --min_strand_bias_lod NUMBER: minimum log odds for calling strand bias. The default value is 2.
  • --min_strand_bias_power NUMBER: minimum power for calling strand bias. The default value is 0.9.
  • --min_dbsnp_normal_lod NUMBER: minimum log odds for calling normal non-variant at dbsnp sites. The default value is 5.5.
  • --min_normal_allele_frac NUMBER: minimum allele fraction to be considered in normal; this parameter is useful when the normal sample is contaminated with the tumor sample. The default value is 0.
  • --min_tumor_allele_frac NUMBER: minimum allelic fraction in tumor sample. The default value is 0.005.
  • --max_indel NUMBER: maximum nearby indel events that are allowed. The default value is 3.
  • --max_read_clip_frac NUMBER: maximum fraction of soft/hard clipped bases in a read. The default value is 0.3.
  • --max_mapq0_frac NUMBER: maximum fraction of reads whose MAPQ is 0, used to determine poorly mapped regions. The default value is 0.5.
  • --min_pir_median NUMBER: minimum read position median. The default value is 10.
  • --min_pir_mad NUMBER: minimum read position median absolute deviation. The default value is 3.
  • --max_alt_mapq NUMBER: maximum value of alt allele mapping quality score. The default value is 20.
  • --max_normal_alt_cnt NUMBER: maximum alt alleles count in normal pileup. The default value is 2.
  • --max_normal_alt_qsum NUMBER: maximum quality score sum of alt allele in normal pileup. The default value is 20.
  • --max_normal_alt_frac NUMBER: maximum fraction of alt allele in normal pileup. The default value is 0.03.
  • --power_allele_frac NUMBER: allele fraction used in power calculations. The default value is 0.3.

7.1.2.24. TNhaplotyper ALGORITHM

The TNhaplotyper algorithm performs the somatic variant calling on the tumor-normal matched pair or the tumor and panel of normal data, using a Haplotyper algorithm.

The input to the TNhaplotyper algorithm is a BAM file; its output is a VCF file.

The TNhaplotyper algorithm requires the following ALGO_OPTION:

  • --tumor_sample SAMPLE_NAME: name of the SM tag in the BAM file for the tumor sample.

Depending on the mode in which it is run, the TNhaplotyper algorithm may require the following ALGO_OPTION:

  • --normal_sample SAMPLE_NAME: name of the SM tag in the BAM file for the normal sample.
  • --detect_pon: indicates that you are using the TNhaplotyper algorithm to create a VCF file that will be part of a panel of normal.
  • --cosmic COSMIC_VCF: location of the Catalogue of Somatic Mutations in Cancer (COSMIC) VCF file used to create the panel of normal file. Only one file is supported.
  • --pon PANEL_OF_NORMAL_VCF: location of the file containing the variants detected in the Panel of Normal analysis that will be used to remove false positives. Only one file is supported.

The TNhaplotyper algorithm accepts the following optional ALGO_OPTION:

  • --dbsnp dbSNP_FILE: location of the Single Nucleotide Polymorphism database (dbSNP). The variants in the dbSNP will be more likely to be marked as germline as they require more evidence of absence in the normal. Only one file is supported.
  • --min_base_qual QUALITY: determines the filtering quality of the bases used in variant calling. Any base with quality less than QUALITY will be ignored. The default value is 10.
  • --prune_factor FACTOR: minimum pruning factor in local assembly; paths with fewer supporting kmers than FACTOR will be pruned from the graph. The default value is 2.
  • --pcr_indel_model MODEL: PCR indel model used to weed out false positive indels more or less aggressively. The possible MODELs are NONE (used for PCR-free samples), HOSTILE, AGGRESSIVE and CONSERVATIVE, in order of decreasing aggressiveness. The default value is HOSTILE.
  • --phasing [1/0]: flag to enable or disable phasing in the output.
  • --min_init_tumor_lod NUMBER: minimum tumor log odds in the initial pass calling variants. The default value is 4.
  • --min_init_normal_lod NUMBER: minimum normal log odds in the initial pass calling variants. The default value is 0.5.
  • --min_tumor_lod NUMBER: minimum tumor log odds in the final call of variants. The default value is 6.3.
  • --min_normal_lod NUMBER: minimum normal log odds used to check that the tumor variant is not a normal variant. The default value is 2.2.
  • --min_strand_bias_lod NUMBER: minimum log odds for calling strand bias. The default value is 2.
  • --min_strand_bias_power NUMBER: minimum power for calling strand bias. The default value is 0.9.
  • --min_pir_median NUMBER: minimum read position median. The default value is 10.
  • --min_pir_mad NUMBER: minimum read position median absolute deviation. The default value is 3.
  • --max_normal_alt_cnt NUMBER: maximum alt alleles count in normal pileup. The default value is 2.
  • --max_normal_alt_qsum NUMBER: maximum quality score sum of alt allele in normal pileup. The default value is 20.
  • --max_normal_alt_frac NUMBER: maximum fraction of alt allele in normal pileup. The default value is 0.03.
  • --tumor_contamination_frac NUMBER: estimation of the contamination fraction on the tumor sample from other samples. The default value is 0.
  • --normal_contamination_frac NUMBER: estimation of the contamination fraction on the normal sample from other samples. The default value is 0.
  • --filter_clustered_read_position: filters variants that are clustered at the start or end of sequencing reads.
  • --filter_strand_bias: filters variants that show evidence of strand bias.
  • --bam_output OUTPUT_BAM: output a BAM file containing modified reads after the local reassembly done by the variant calling. This option should only be used in conjunction with a small bed file for troubleshooting purposes.
  • --trim_soft_clip: determines whether soft clipped bases in the reads should be excluded from the variant calling.

7.1.2.25. TNhaplotyper2 ALGORITHM

The TNhaplotyper2 algorithm performs the somatic variant calling on the tumor-normal matched pair or the tumor and panel of normal data, using a Haplotype based algorithm.

The input to the TNhaplotyper2 algorithm is one or multiple BAM files; its output is a VCF file that will be used when filtering the results in TNfilter. TNhaplotyper2 will also output an additional file with the same output file name and a .stats extension that contains statistics that can help the filtering.

The TNhaplotyper2 algorithm requires the following ALGO_OPTION:

  • --tumor_sample SAMPLE_NAME: name of the SM tag in the BAM file for the tumor sample.

Depending on the mode in which it is run, the TNhaplotyper2 algorithm may require the following ALGO_OPTION:

  • --normal_sample SAMPLE_NAME: name of the SM tag in the BAM file for the normal sample.
  • --pon PANEL_OF_NORMAL_VCF: location of the file containing the variants detected in the Panel of Normal analysis that will be used to remove false positives. Only one file is supported.

The TNhaplotyper2 algorithm accepts the following optional ALGO_OPTION:

  • --min_base_qual QUALITY: determines the filtering quality of the bases used in variant calling. Any base with quality less than QUALITY will be ignored. The default value is 10.
  • --prune_factor FACTOR: minimum pruning factor in local assembly; paths with fewer supporting kmers than FACTOR will be pruned from the graph. Setting the prune factor to 0 will turn on adaptive pruning. The default value is 0.
  • --pcr_indel_model MODEL: PCR indel model used to weed out false positive indels more or less aggressively. The possible MODELs are NONE (used for PCR-free samples), HOSTILE, AGGRESSIVE and CONSERVATIVE, in order of decreasing aggressiveness. The default value is CONSERVATIVE.
  • --min_init_tumor_lod NUMBER: minimum tumor log odds in the initial pass calling variants. The default value is 2.0.
  • --min_tumor_lod NUMBER: minimum tumor log odds in the final call of variants. The default value is 3.0.
  • --min_normal_lod NUMBER: minimum normal log odds used to check that the tumor variant is not a normal variant. The default value is 2.2.
  • --germline_vcf VCF: location of the VCF containing the population allele frequency.
  • --default_af AF: determines the allele frequency value for alleles not found in the germline vcf. The default value is 1E-6 when running tumor-normal mode, and 5E-8 when running without a matched normal in tumor-only mode.
  • --max_germline_af AF: determines the maximum germline allele frequency in tumor-only mode. The default value is 0.01.
  • --call_pon_sites: determines whether to call candidate variants even if they are present in the Panel of Normal input.
  • --callable_depth DEPTH: determines the minimum depth for a site to be considered callable for the additional .stats file containing statistics that can help the filtering.
  • --given GIVEN_VCF: perform variant calling using the variants provided in the GIVEN_VCF. In addition to looking for variants in discovery mode, the calling will evaluate the locus and alleles provided in the file, but only if the variant has the FILTER column set to PASS or ".".
  • --bam_output OUTPUT_BAM: output a BAM file containing modified reads after the local reassembly done by the variant calling. This option should only be used in conjunction with a small bed file for troubleshooting purposes.
  • --trim_soft_clip: determines whether soft clipped bases in the reads should be excluded from the variant calling.
  • --call_germline_sites: determines whether to call candidate variants if they are present in the germline sites.
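
A possible tumor-normal TNhaplotyper2 call (the sample names must match the SM tags in the BAM read group headers; the germline allele frequency resource is a placeholder):

sentieon driver -t 16 -r reference.fasta -i tumor_deduped.bam -i normal_deduped.bam \
  --algo TNhaplotyper2 --tumor_sample TUMOR_SM --normal_sample NORMAL_SM \
  --germline_vcf population_af.vcf.gz tn_candidates.vcf.gz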

7.1.2.26. TNfilter ALGORITHM

The TNfilter algorithm performs filtering on the output of TNhaplotyper2.

The input to the TNfilter algorithm is a VCF containing candidate variants to be filtered; its output is the VCF containing the filtered variants; TNfilter will output an additional file with the same output file name and .stats extension that contains statistics about the filtering.

The TNfilter algorithm requires the following ALGO_OPTION:

  • -v CANDIDATE_VCF: the location and file name of the file containing the unfiltered variants produced by TNhaplotyper2. You can use VCF files compressed with bgzip and indexed. In addition to this file, the binary requires the file containing statistics named CANDIDATE_VCF.stats created by TNhaplotyper2.
  • --tumor_sample SAMPLE_NAME: name of the SM tag in the BAM file for the tumor sample.

Depending on the mode it is run, the TNfilter algorithm may require the following ALGO_OPTION:

  • --normal_sample SAMPLE_NAME: name of the SM tag in the BAM file for the normal sample.

The TNfilter algorithm accepts the following optional ALGO_OPTION:

  • --orientation_priors PRIORS: the location and file name of the file containing the orientation bias information produced by OrientationBias. This option affects the Orientation Bias Filter.
  • --contamination FILE: the location and file name of the file containing the contamination information produced by ContaminationModel. This option affects the Contamination Filter.
  • --tumor_segments SEGMENTS: the location and file name of the file containing the tumor segments information produced by ContaminationModel. This option affects the Germline Risk Artifact Filter.
  • --threshold_strategy STRATEGY: determines the strategy that should be applied to optimize the posterior probability threshold. The possible values are: f_score to prioritize F-Score, precision to prioritize reducing false positives, and constant. The default value is f_score.
  • --f_score_beta SCORE: when using --threshold_strategy f_score, determines the relative weight of recall to precision that will be the goal of the filtering. The default value is 1 for equal weight of precision and recall.
  • --max_fp_rate FP_RATE: when using --threshold_strategy precision, determines the maximum expected rate of false positive calls. The default value is 0.05.
  • --threshold THRESHOLD: when using --threshold_strategy constant, determines the posterior probability threshold.
  • --min_median_base_qual QUALITY: determines the minimum median base quality of alt reads. This option affects the Median Base Quality Filter. The default value is 20.
  • --max_event_count COUNT: determines the maximum number of events allowed in the active region. If there are more than COUNT events in the active region, the variants will be filtered with the Clustered Events Filter. The default value is 2.
  • --unique_alt_reads COUNT: determines the minimum number of unique ALT reads that need to support the variant. This option affects the Duplicate Read Filter. The default value is 0.
  • --max_mfrl_diff VALUE: determines the maximum difference between the median fragment lengths of the ALT and REF reads. This option affects the Median Fragment Length Difference Filter. The default value is 10000.
  • --max_haplotype_distance DISTANCE: determines the maximum distance of two phased variants within the same haplotype. If a variant is filtered, all phased variants within the DISTANCE will also be filtered out, but variants beyond the DISTANCE will not be affected. This option affects the Haplotype Filter. The default value is 100.
  • --min_tumor_af AF: determines the minimum tumor allele fraction. This option affects the Low Allele Fraction Filter. The default value is 0.
  • --min_median_map_qual QUALITY: determines the minimum median mapping quality of the reads supporting the variant. This argument affects the Median Mapping Quality Filter. The default value is 30.
  • --long_indel_length LENGTH: determines whether INDELs will use the reference mapping quality; INDELs longer than LENGTH will use reference mapping quality. This argument affects the Median Mapping Quality Filter. The default value is 5.
  • --max_alt_count COUNT: determines the maximum number of ALT alleles at a site. This option affects the Multi-Allelic Filter. The default value is 1.
  • --max_n_ratio RATIO: determines the maximum ratio of N-bases to ALT bases. This option affects the N-base Ratio Filter. The default value is 1.
  • --normal_p_value VALUE: determines the P-value threshold for the detection of normal artifacts. This option affects the Normal Artifact Filter. The default is 0.001.
  • --min_median_pos DISTANCE: determines the minimum median distance from the variant to the end of the reads. This option affects the Median Read Position Filter. The default value is 1.
  • --min_slippage_length LENGTH: determines the minimum length of REF bases in an STR for it to be likely to have polymerase slippage. This option affects the Polymerase Slippage Filter. The default value is 8.
  • --slippage_rate RATE: determines the frequency of polymerase slippage in areas likely to have polymerase slippage. This option affects the Polymerase Slippage Filter. The default value is 0.1.
  • --min_alt_reads_per_strand COUNT: determines the minimum number of ALT reads required on each strand. This option affects the Hard Strand Bias Filter. The default value is 0.

In addition to the filters shown above, TNfilter may apply the following filters:

  • Model-based strand bias artifact Filter, for sites likely to be an artifact of strand bias.
  • Panel of Normal Filter, for sites present in the Panel of Normals.
  • Tumor Evidence Filter, for sites with weak evidence of a variant.
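
As an illustration, a TNfilter command that filters the output of the TNhaplotyper2 sketch above could look like the following; all file and sample names are placeholders, and the contamination, tumor segments and orientation priors inputs are only used when the corresponding tools have been run:

sentieon driver -r reference.fasta --algo TNfilter \
  --tumor_sample tumor_sample_name --normal_sample normal_sample_name \
  -v output-tnhap2.vcf.gz --contamination contamination.txt \
  --tumor_segments tumor_segments.txt --orientation_priors orientation_priors.tsv \
  output-tnfilter.vcf.gz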

7.1.2.27. OrientationBias ALGORITHM

The OrientationBias algorithm estimates any possible orientation bias present in the sequencing data.

The input to the OrientationBias algorithm is one or multiple BAM files; its output is a file containing the orientation bias information that will be used when filtering the results of TNhaplotyper2 in TNfilter.

The OrientationBias algorithm requires the following ALGO_OPTION:

  • --tumor_sample SAMPLE_NAME: name of the SM tag in the BAM file for the tumor sample.

The OrientationBias algorithm accepts the following optional ALGO_OPTION:

  • --min_base_qual QUALITY: determines the filtering quality of the bases used in variant calling. Any base with quality less than QUALITY will be ignored. The default value is 20.
  • --min_median_map_qual QUALITY: determines the minimum median mapping quality. Sites where the median mapping quality of the reads is below QUALITY will be ignored. The default value is 50.
  • --max_depth DEPTH: determines the depth threshold for grouping. Sites with depth higher than DEPTH will be grouped together. The default value is 200.
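
For illustration, an OrientationBias command could look like the following sketch; the file and sample names are placeholders, and the output file can be passed to TNfilter via the --orientation_priors option:

sentieon driver -r reference.fasta -i tumor_normal.bam \
  --algo OrientationBias --tumor_sample tumor_sample_name \
  orientation_priors.tsv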

7.1.2.28. ContaminationModel ALGORITHM

The ContaminationModel algorithm estimates the cross-sample contamination and tumor segmentation to be used in the TNseq pipeline.

The input to the ContaminationModel algorithm is one or multiple BAM files; its output is a file containing the contamination information that can be used when filtering the results of TNhaplotyper2 in TNfilter. In addition the tool may output a .segments file containing the tumor segments information that can help the filtering.

The ContaminationModel algorithm requires the following ALGO_OPTION:

  • --tumor_sample SAMPLE_NAME: name of the SM tag in the BAM file for the tumor sample.
  • --vcf VCF: location of the VCF containing the population allele frequency.

Depending on the mode it is run, the ContaminationModel algorithm may require the following ALGO_OPTION:

  • --normal_sample SAMPLE_NAME: name of the SM tag in the BAM file for the normal sample.

The ContaminationModel algorithm accepts the following optional ALGO_OPTION:

  • --tumor_segments CONTAMINATION.segments: the location and file name of the optional output file containing the tumor segments information; this information will be used in TNfilter to filter calls that are likely Germline Risk Artifacts.
  • --min_map_qual QUALITY: determines the filtering quality of the reads used. Any reads with mapping quality less than QUALITY will be filtered out. The default value is 50.
  • --min_af AF: determines the minimum value of the population allele frequency. The default value is 0.01.
  • --max_af AF: determines the maximum value of the population allele frequency. The default value is 0.2.
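
For illustration, a ContaminationModel command for a tumor-normal pair could look like the following sketch; the file and sample names are placeholders, and the contamination and segments outputs can be passed to TNfilter via the --contamination and --tumor_segments options:

sentieon driver -r reference.fasta -i tumor_normal.bam \
  --algo ContaminationModel --tumor_sample tumor_sample_name \
  --normal_sample normal_sample_name --vcf population_af.vcf.gz \
  --tumor_segments tumor_segments.txt contamination.txt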

7.1.2.29. TNscope ALGORITHM

The TNscope algorithm performs the somatic variant calling on the tumor-normal matched pair or the tumor only data, using a Haplotyper algorithm.

The input to the TNscope algorithm is a BAM file; its output is a VCF file.

The TNscope algorithm requires the following ALGO_OPTION:

  • --tumor_sample SAMPLE_NAME: name of the SM tag in the BAM file for the tumor sample.

The TNscope algorithm accepts the following optional ALGO_OPTION:

  • --normal_sample SAMPLE_NAME: name of the SM tag in the BAM file for the normal sample. When doing tumor-only somatic calling, this argument is not required.
  • --cosmic COSMIC_VCF: location of the Catalogue of Somatic Mutations in Cancer (COSMIC) VCF file used to create the panel of normal file. Only one file is supported.
  • --pon PANEL_OF_NORMAL_VCF: location of the file containing the variants detected in the Panel of Normal analysis that will be used to remove false positives. Only one file is supported. This file is the same as the one used with TNhaplotyper.
  • --dbsnp dbSNP_FILE: location of the Single Nucleotide Polymorphism database (dbSNP). The variants in the dbSNP will be more likely to be marked as germline as they require more evidence of absence in the normal. Only one file is supported.
  • --min_base_qual QUALITY: determines the filtering quality of the bases used in variant calling. Any base with quality less than QUALITY will be ignored. The default value is 15.
  • --prune_factor FACTOR: minimum pruning factor in local assembly; paths with fewer supporting kmers than FACTOR will be pruned from the graph. Setting the prune factor to 0 will turn on adaptive pruning. The default value is 2.
  • --pcr_indel_model MODEL: PCR indel model used to weed out false positive indels more or less aggressively. The possible MODELs are: NONE (used for PCR free samples), and HOSTILE, AGGRESSIVE and CONSERVATIVE, in order of decreasing aggressiveness. The default value is CONSERVATIVE.
  • --phasing [1/0]: flag to enable or disable phasing in the output. The default value is 1 (on).
  • --min_init_tumor_lod NUMBER: minimum tumor log odds in the initial pass calling variants. The default value is 4.
  • --min_init_normal_lod NUMBER: minimum normal log odds in the initial pass calling variants. The default value is 0.5.
  • --min_tumor_lod NUMBER: minimum tumor log odds in the final call of variants. The default value is 6.3.
  • --min_normal_lod NUMBER: minimum normal log odds used to check that the tumor variant is not a normal variant. The default value is 2.2.
  • --min_dbsnp_normal_lod NUMBER: minimum log odds for calling normal non-variant at dbsnp sites. The default value is 5.5.
  • --tumor_contamination_frac NUMBER: estimation of the contamination fraction on the tumor sample from other samples. The default value is 0.
  • --normal_contamination_frac NUMBER: estimation of the contamination fraction on the normal sample from other samples. The default value is 0.
  • --given GIVEN_VCF: perform variant calling using only the variants provided in the GIVEN_VCF. The calling will only evaluate the locus and alleles provided in the file, and only if the variant has the FILTER column as PASS or ..
  • --bam_output OUTPUT_BAM: output a BAM file containing modified reads after the local reassembly done by the variant calling. This option should only be used in conjunction with a small bed file for troubleshooting purposes.
  • --disable_detector DETECTOR: disable the variant calling for specific detectors: use 'sv' as DETECTOR to prevent calling of structural variants, and use 'snv_indel' as DETECTOR to prevent calling of small variants.
  • --trim_soft_clip: determines whether soft clipped bases in the reads should be excluded from the variant calling.
  • --trim_primer AMPLICON_TABLE: determines whether to trim primers based on the information provided in the AMPLICON_TABLE. The AMPLICON_TABLE is a BED file containing the location of the primers in the amplicon sequencing as a tab-delimited file with eight columns: contig, amplicon_start, amplicon_end, name (ignored), score (ignored), strand (ignored), insert_start, insert_end; the tool will trim bases from amplicon reads between amplicon_start and insert_start, and between amplicon_end and insert_end.
  • --trim_adaptor=[0|1]: determines whether to trim off adaptor bases from the input reads. The default is to trim the adapter bases. Set --trim_adaptor=0 to turn it off.
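
As an illustration, a minimal TNscope command for a tumor-normal pair could look like the following sketch; the thread count, file and sample names are placeholders:

sentieon driver -t 16 -r reference.fasta -i tumor_normal.bam \
  --algo TNscope --tumor_sample tumor_sample_name \
  --normal_sample normal_sample_name --dbsnp dbsnp.vcf.gz \
  output-tnscope.vcf.gz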

7.1.2.30. RNASplitReadsAtJunction ALGORITHM

The RNASplitReadsAtJunction algorithm performs the splitting of reads into exon segments by getting rid of Ns but maintaining grouping information, and hard-clipping any sequences overhanging into the intron regions.

The input to the RNASplitReadsAtJunction algorithm is a BAM file; its output is a BAM file.

The RNASplitReadsAtJunction algorithm requires the following ALGO_OPTION:

  • --reassign_mapq IN_QUAL:OUT_QUAL: the algorithm will reassign mapping qualities from IN_QUAL to OUT_QUAL. This argument is required because STAR assigns a quality of 255 to good alignments instead of the expected default score of 60.

The RNASplitReadsAtJunction algorithm accepts the following optional ALGO_OPTION:

  • --ignore_overhang: determines whether to ignore and not fix the overhanging sections of the reads.
  • --overhang_max_bases NUMBER: determines the maximum number of bases allowed in a hard-clipped overhang, so that if there are more bases in the overhang, the overhang will not be hard-clipped. The default value is 40.
  • --overhang_max_mismatches NUMBER: determines the maximum number of mismatches allowed in a non-hard-clipped overhang, so that the complete overhang will be hard-clipped if the number of mismatches is too high. The default value is 1.
  • --cram_write_options compressor=[gzip|bzip2|lzma|rans]: CRAM output compression options. compressor=gzip is default if not defined.
  • --cram_write_options version=[2.1|3.0]: CRAM output version options. version=3.0 is default if not defined.
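
For illustration, an RNASplitReadsAtJunction command on a STAR-aligned BAM file could look like the following sketch; the file names are placeholders, and the mapping quality reassignment of 255:60 corresponds to the STAR behavior described above:

sentieon driver -t 16 -r reference.fasta -i star_aligned.bam \
  --algo RNASplitReadsAtJunction --reassign_mapq 255:60 split_reads.bam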

7.1.2.31. ContaminationAssessment ALGORITHM

The ContaminationAssessment algorithm assesses the contamination present in a sample BAM file; the output of this algorithm can be used as the value of the arguments contamination_frac, normal_contamination_frac and tumor_contamination_frac in the TNseq and TNscope tools.

The input to the ContaminationAssessment algorithm is a BAM file; its output is a text file.

The ContaminationAssessment algorithm requires the following ALGO_OPTION:

  • --pop_vcf VCF_FILE: the location of the VCF file containing the allele frequency information for the specific population of the sample.
  • --genotype_vcf VCF_FILE: the location of the VCF file containing the DNAseq variants reported for the individual; to calculate the contamination in the tumor sample, you should use the DNAseq variants reported for the normal sample. You can create this file by using Haplotyper or Genotyper on the sample bam.

The ContaminationAssessment algorithm accepts the following optional ALGO_OPTION:

  • --type ASSESS_TYPE: determines the type for the estimate. The possible values are SAMPLE, READGROUP and META to assess the contamination by sample, by lane, or in aggregate across all the reads. Multiple instances of the option are allowed that will assess the contamination at multiple levels. The default value is META.
  • --min_base_qual QUALITY: determines the filtering quality of the bases used in the contamination assessment. Any base with quality less than QUALITY will be ignored. The default value is 20.
  • --min_map_qual QUALITY: determines the filtering quality of the reads used in the contamination assessment. Any read with quality less than QUALITY will be ignored. The default value is 20.
  • --min_basecount NUMBER: determines the minimum number of bases that need to be present at a locus before the contamination is assessed. The default value is 500.
  • --trim_thresh NUMBER: threshold that will be used to trim sites; if the probability of the contamination ratio being larger than 0.5 is larger than the threshold, the site will not be included in the contamination assessment. The default value is 0.95.
  • --trim_frac NUMBER: determines the maximum fraction of sites that may be trimmed based on the trim threshold. The default value is 0.01.
  • --precision NUMBER: determines the precision of the output percentage number. The default value is 0.1.
  • --base_report FILE: location and filename of the output file that will contain an extended report about the processed data.
  • --population POPULATION_NAME: a population to use to determine the baseline allele frequency of the sample. The default value is CEU.
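
For illustration, a ContaminationAssessment command for a tumor sample could look like the following sketch; the file names are placeholders, and the genotype VCF would typically come from running Haplotyper on the matched normal sample:

sentieon driver -t 16 -r reference.fasta -i tumor.bam \
  --algo ContaminationAssessment --pop_vcf population_af.vcf.gz \
  --genotype_vcf normal_variants.vcf.gz contamination_report.txt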

7.1.2.32. TNModelApply ALGORITHM

The TNModelApply algorithm applies a Machine Learning model on the results of TNscope to help with variant filtration. This algorithm is only supported in the Linux version of the Sentieon Genomics software.

The TNModelApply algorithm has no input at the driver level; its output is a VCF file.

The TNModelApply algorithm requires the following ALGO_OPTION:

  • -v INPUT: location of the VCF file from the TNscope variant calling; you can use a VCF file compressed with bgzip and indexed.
  • -m MODEL_FILE: location of the file containing the Machine Learning model.

The TNModelApply algorithm modifies the input VCF file by adding the MLrejected FILTER to the variants; since the FILTER is added, you may want to remove any FILTERs already present in the input VCF, as they may no longer be relevant. You can use bcftools (https://samtools.github.io/bcftools/bcftools.html) for that purpose:

$BCF/bcftools annotate -x "^FILTER/MLrejected,FILTER/PASS" -O z \
  -o $OUTPUT.vcf.gz $INPUT.vcf.gz

7.1.2.33. DNAscope ALGORITHM

The DNAscope algorithm performs an improved version of Haplotype variant calling.

The input to the DNAscope algorithm is a BAM file; its output is a VCF file.

The DNAscope algorithm accepts the following optional ALGO_OPTION:

  • --annotation 'ANNOTATION_LIST': determines additional annotations that will be added to the output VCF. Use a comma separated list to enable or disable annotations. Include 'none' to remove the default annotations; prefix annotations with the exclamation point (!) to disable the specific annotation. See Section 8.3 for supported annotations.
  • -d dbSNP_FILE: location of the Single Nucleotide Polymorphism database (dbSNP) used to label known variants. Only one file is supported.
  • --var_type VARIANT_TYPE: determine which variant types will be called; VARIANT_TYPE is a comma separated list of the following possible values:
    • SNP to call Single Nucleotide Polymorphism. This is included in the default behavior.
    • INDEL to call insertion-deletions. This is included in the default behavior.
    • BND to call break-end information required for the structural variant caller.
  • --call_conf CONFIDENCE: determine the threshold of variant quality to call a variant. Variants with quality less than CONFIDENCE will be removed.
  • --emit_conf CONFIDENCE: determine the threshold of variant quality to emit a variant. Variants with quality less than CONFIDENCE will not be added to the output VCF file.
  • --emit_mode MODE: determines what calls will be emitted. Possible values for mode are:
    • variant: emit calls only at confident variant sites. This is the default behavior.
    • confident: emit calls at confident variant sites or confident reference sites.
    • all: emit all calls, regardless of their confidence.
    • gvcf: emits additional information required for joint calling. This option is required if you want to perform joint calling using the GVCFtyper algorithm.
  • --gq_bands LIST_OF_BANDS: determines the bands that will be used to compress variants of similar genotype quality (GQ) that will be emitted as a single VCF record in the GVCF output file. The LIST_OF_BANDS is a comma-separated list of bands where each band is defined by START-END/STEP. The default value is 1-60,60-99/10,99.
  • --min_base_qual QUALITY: determines the filtering quality of the bases used in variant calling. Any base with quality less than QUALITY will be ignored. The default value is 10.
  • --pcr_indel_model MODEL: PCR indel model used to weed out false positive indels more or less aggressively. The possible MODELs are: NONE (used for PCR free samples), and HOSTILE, AGGRESSIVE and CONSERVATIVE, in order of decreasing aggressiveness. The default value is CONSERVATIVE.
  • --phasing [1/0]: flag to enable or disable phasing in the output. The default value is 1 (on). Phasing is only calculated for diploid samples.
  • --ploidy PLOIDY: determines the ploidy number of the sample being processed. The default value is 2.
  • --prune_factor FACTOR: minimum pruning factor in local assembly; paths with fewer supporting kmers than FACTOR will be pruned from the graph. The default value is 2.
  • --trim_soft_clip: determines whether soft clipped bases in the reads should be excluded from the variant calling. This argument is only recommended to process RNA reads.
  • --given GIVEN_VCF: perform variant calling using only the variants provided in the GIVEN_VCF. The calling will only evaluate the locus and alleles provided in the file, and only if the variant has the FILTER column as PASS or .. This option cannot be used in conjunction with --emit_mode gvcf.
  • --bam_output OUTPUT_BAM: output a BAM file containing modified reads after the local reassembly done by the variant calling. This option should only be used in conjunction with a small bed file for troubleshooting purposes.
  • --filter_chimeric_reads: determines whether chimeric reads will be used when calling variants. The default is to include chimeric reads only if the var_type BND is set.
  • --trim_adaptor: determines whether to trim off adaptor bases from the input reads. The default is to NOT trim the adapter bases as that could affect the region around SVs.
  • --model MODEL_FILE: the location of the machine learning model file that will be used with the DNAModelApply tool; the model will be used to determine the settings used in variant calling.
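
As an illustration, a basic germline DNAscope command could look like the following sketch; the file names are placeholders, and the --model option is only needed when the output will be processed with DNAModelApply:

sentieon driver -t 16 -r reference.fasta -i deduped.bam \
  --algo DNAscope -d dbsnp.vcf.gz [--model MODEL_FILE] output-dnascope.vcf.gz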

7.1.2.34. DNAModelApply ALGORITHM

The DNAModelApply algorithm performs the second step of variant calling using DNAscope. This algorithm is only supported in the Linux version of the Sentieon Genomics software.

The DNAModelApply algorithm has no input at the driver level; its output is a VCF containing the variants.

The DNAModelApply algorithm requires the following ALGO_OPTION:

  • -v INPUT: location of the VCF file from the DNAscope variant calling algorithm performed with an input model file determining the correct settings. You can use VCF files compressed with bgzip and indexed.
  • -m MODEL_FILE: the location of the machine learning model file; this file should be the same as the one used in the DNAscope command to generate the input VCF.

The DNAModelApply algorithm does not support any optional ALGO_OPTION.
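
For illustration, the two-step DNAscope plus DNAModelApply workflow could look like the following sketch; the file names are placeholders, and MODEL_FILE must be the same file in both commands:

sentieon driver -t 16 -r reference.fasta -i deduped.bam \
  --algo DNAscope --model MODEL_FILE tmp-dnascope.vcf.gz
sentieon driver -t 16 -r reference.fasta --algo DNAModelApply \
  -v tmp-dnascope.vcf.gz -m MODEL_FILE output-dnascope.vcf.gz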

The DNAModelApply algorithm modifies the input VCF file by adding the MLrejected FILTER to the variants; since the FILTER is added, you may want to remove any FILTERs already present in the input VCF, as they may no longer be relevant. You can use bcftools (https://samtools.github.io/bcftools/bcftools.html) for that purpose:

$BCF/bcftools annotate -x "^FILTER/MLrejected,FILTER/PASS" -O z \
  -o $OUTPUT.vcf.gz $INPUT.vcf.gz

7.1.2.35. SVSolver ALGORITHM

The SVSolver algorithm performs the structural variant calling of a sample, provided that the sample has been previously processed using the DNAscope algorithm with the option --var_type bnd.

The SVSolver algorithm has no input at the driver level; its output is a VCF containing the called structural variants.

The SVSolver algorithm requires the following ALGO_OPTION:

  • -v INPUT: location of the VCF file from the DNAscope variant calling algorithm performed on a sample using the option --var_type bnd. You can use VCF files compressed with bgzip and indexed.

The SVSolver algorithm does not support any optional ALGO_OPTION.
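
For illustration, an SVSolver command could look like the following sketch; the file names are placeholders, and the input VCF must come from a DNAscope run that included --var_type bnd:

sentieon driver -t 16 -r reference.fasta --algo SVSolver \
  -v dnascope_bnd.vcf.gz output-sv.vcf.gz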

7.1.2.36. VariantPhaser ALGORITHM

The VariantPhaser algorithm performs read-based phasing of variants from a VCF by using read information from a BAM file containing long reads.

The input to the VariantPhaser algorithm is a BAM file and a VCF file; its output is the VCF file after phasing.

The VariantPhaser algorithm requires the following ALGO_OPTION:

  • -v INPUT: location of the VCF file containing the variants to be phased. You can use VCF files compressed with bgzip and indexed.
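
For illustration, a VariantPhaser command on long-read data could look like the following sketch; the file names are placeholders:

sentieon driver -t 16 -r reference.fasta -i longread_sample.bam \
  --algo VariantPhaser -v called_variants.vcf.gz phased_variants.vcf.gz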

7.1.3. DRIVER read_filter options

The --read_filter argument of the DRIVER binary allows you to filter or transform reads from the input BAM file before performing the calculation. It is possible to use multiple --read_filter arguments in the same command line, in which case they will be executed sequentially, so the order is important.

The syntax for the argument is: --read_filter FILTER,OPTION=VALUE,OPTION=VALUE,…

7.1.3.1. QualCalFilter read_filter

The QualCalFilter read filter is used to transform reads and perform base quality score recalibration while modifying the information contained in the recalibration table.

The QualCalFilter read filter requires one of the following OPTION:

  • table=TABLE_FILEPATH: the location of the recalibration table that will be used as the basis to perform the base quality score recalibration.
  • use_oq=[true/false]: determines whether to use the original base quality scores contained in the OQ tag in the BAM file. This option cannot be used in conjunction with the table option and is used to undo base quality score recalibration by setting the base quality scores of the output to the ones contained in the OQ tag in the input BAM file. Typically this option will be used before a second QualCalFilter read filter to first undo a possible recalibration done on the input BAM file.

The QualCalFilter read filter accepts the following optional OPTION:

  • prior=PRIOR: determines the global bias for all the base quality scores.
  • min_qual=QUAL: determines the quality threshold to perform recalibration; bases with quality scores less than QUAL will not be recalibrated.
  • levels=LEVEL1/LEVEL2/…: determines the static quantization levels of the base quality scores.
  • indel=[true/false]: determines whether to add the base quality scores for INDELs into the BAM tags.
  • keep_oq=[true/false]: determines whether to keep the original pre-recalibration base quality scores by storing them in the OQ tag of the BAM file.
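
As an illustration of chaining read filters, the following sketch first undoes a previous recalibration using the OQ tag and then applies a new recalibration table before calling variants; the file names are placeholders, and Haplotyper is used here only as an example downstream algorithm:

sentieon driver -t 16 -r reference.fasta -i recalibrated.bam \
  --read_filter QualCalFilter,use_oq=true \
  --read_filter QualCalFilter,table=new_recal_data.table,keep_oq=true \
  --algo Haplotyper output-hc.vcf.gz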

7.1.3.2. OverclippingFilter read_filter

The OverclippingFilter read filter is used to filter reads depending on their soft clipping characteristics.

The OverclippingFilter read filter accepts the following optional OPTION:

  • min_align_len=LENGTH: filter reads where the number of bases that are not soft clipped is less than LENGTH.
  • count_both_ends=[true/false]: if set to true, only filter reads where both ends of the read are soft clipped, so that reads with soft-clipping on one end only will not be filtered regardless of their non soft-clipped length. The default value is true.

7.2. BWA binary

The BWA binary performs alignment of DNA-seq data.

In Sentieon® version 202112, a bug in the original bwa mem that resulted in an incorrect MAPQ for a small number of alignments was fixed. A compatibility environment variable ksw_compat=1 is provided; setting this environment variable will cause Sentieon® bwa mem to behave the same way as described in http://bio-bwa.sourceforge.net/bwa.shtml, producing the incorrect MAPQ for some alignments. The default behavior implements the correct MAPQ calculation for all alignments and provides a speedup on some newer hardware.

The BWA binary has two modes of interest, “mem” mode to align FASTQ files against a reference FASTA file, and “shm” mode to load the FASTA index file in memory to be shared among multiple BWA processes running in the same server.

7.2.1. BWA mem syntax

You can run the following command to align a single-ended FASTQ1 file or a pair-ended set of 2 FASTQ files against the FASTA reference; the mapped reads are written to stdout, so that they can be piped into util sort:

<SENTIEON_FOLDER>/bin/sentieon bwa mem OPTIONS FASTA FASTQ1 [FASTQ2]

The arguments (OPTIONS) for this command include:

  • -t NUMBER_THREADS: number of computing threads that will be used by the software to run parallel processes. The argument is optional; if omitted the bwa binary will use 1 thread.
  • -p: determines whether the first input FASTQ file contains interleaved pair-ended reads. If this argument is used, only use a single FASTQ input, as the second FASTQ2 file will be ignored.
  • -M: determines whether to mark split reads as secondary.
  • -R READGROUP_STRING: Read Group header line that all reads will be attached to. The recommended READGROUP_STRING is @RG\tID:$readgroup\tSM:$sample\tPL:$platform\tPU:$platform_unit
    • $readgroup is a unique ID that identifies the reads.
    • $sample is the name of the sample the reads belong to.
    • $platform is the sequencing technology, typically ILLUMINA.
    • $platform_unit is the sequencing element that performed the sequencing.
  • -K CHUNK_SIZE: determines the size of the group of reads that will be mapped at the same time. If this argument is not set, the results will depend on the number of threads used.
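
For illustration, a paired-end alignment piped directly into util sort could look like the following sketch; the thread count, chunk size, read group fields and file names are placeholders:

<SENTIEON_FOLDER>/bin/sentieon bwa mem -t 32 -K 10000000 -M \
  -R '@RG\tID:rg1\tSM:sample1\tPL:ILLUMINA\tPU:unit1' \
  reference.fasta sample_1.fastq.gz sample_2.fastq.gz | \
  <SENTIEON_FOLDER>/bin/sentieon util sort -t 32 -o sorted.bam --sam2bam -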

7.2.2. BWA shm syntax

You can run the following command to load the FASTA index file in memory:

<SENTIEON_FOLDER>/bin/sentieon bwa shm FASTA

You can run the following command to list the FASTA index files stored in memory:

<SENTIEON_FOLDER>/bin/sentieon bwa shm -l

You can run the following command to remove all FASTA index files stored in memory, thus freeing memory when no longer necessary:

<SENTIEON_FOLDER>/bin/sentieon bwa shm -d

The arguments (OPTIONS) for this command include:

  • -t NUMBER_THREADS: number of computing threads that will be used by the software to run parallel processes. The argument is optional; if omitted the bwa binary will use 1 thread.
  • -f FILE: location of a temporary file that will be used to reduce peak memory usage.

7.2.3. Controlling memory usage in BWA

By default BWA will use about 24 GB in a Linux system and 8 GB in a Mac system. You can control the memory usage via the bwt_max_mem environment variable, which can be used to enhance the speed performance by using more memory, or limit the memory usage at the expense of speed performance. For example, you will get faster alignment by adding the following to your scripts:

export bwt_max_mem=50G

Bear in mind that the number you use in the bwt_max_mem environment variable is not a hard limit, but an estimate of the memory BWA will use. There is a minimum amount of memory required for a specific reference; if BWA memory usage never drops below a certain value, that value is the minimum required memory, and setting bwt_max_mem to a smaller value will not reduce the memory usage of the BWA mem jobs.

7.2.4. Using an existing BAM file as input

If you do not have access to the FASTQ inputs, but only have an already aligned and sorted BAM file, you can use it as input and redo the alignment by running samtools:

samtools collate -@ 32 -Ou INPUT_BAM tmp- | samtools fastq -@ 32 -s \
/dev/null -0 /dev/null - | <SENTIEON_FOLDER>/bin/sentieon bwa mem -t 32 -R \
'@RG\tID:id\tLB:lib\tSM:sample\tPL:ILLUMINA' -M -K 1000000 -p $ref /dev/stdin \
| <SENTIEON_FOLDER>/bin/sentieon util sort -t 32 -o OUTPUT_BAM --sam2bam -

Alternatively, you could first create the FASTQ files and then process them as you would normally do:

samtools collate -n -@ 32 -uO INPUT_BAM tmp- | samtools fastq -@ 32 \
-s >(gzip -c > single.fastq.gz) -0 >(gzip -c > unpaired.fastq.gz) \
-1 >(gzip -c > output_1.fastq.gz) -2 >(gzip -c > output_2.fastq.gz) -

If you do this, you may encounter an abnormal memory usage in BWA; if that is the case, you can follow the instructions in "BWA uses an abnormal amount of memory when using FASTQ files created from a BAM file".

7.3. STAR binary

The STAR binary performs alignment of RNA-seq data and will behave the same way as the tool described in https://github.com/alexdobin/STAR.

7.3.1. Using compressed FASTQ.gz input files

Using compressed FASTQ input files requires the use of the STAR option --readFilesCommand COMMAND to determine what external program will be used to decompress the input files. When using the Sentieon® implementation of STAR together with Sentieon® util sort, the efficient processing of the inputs could cause the decompression to become a bottleneck, so it is important to use a fast decompression method.
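
For illustration, a sketch of such a command is shown below, assuming the STAR binary is invoked through the sentieon wrapper; the index directory and file names are placeholders, and zcat can be replaced with any sufficiently fast decompression program (for example pigz -dc):

<SENTIEON_FOLDER>/bin/sentieon STAR --runThreadN 32 --genomeDir STAR_INDEX_DIR \
  --readFilesIn sample_1.fastq.gz sample_2.fastq.gz --readFilesCommand zcat \
  [other STAR options]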

7.4. minimap2 binary

The minimap2 binary performs alignment of PacBio or Oxford Nanopore genomic reads data and will behave the same way as the tool described in https://github.com/lh3/minimap2.
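
For illustration, a sketch of aligning PacBio HiFi reads and piping the output into util sort is shown below; the preset, thread count, read group and file names are placeholders:

<SENTIEON_FOLDER>/bin/sentieon minimap2 -ax map-hifi -t 32 \
  -R '@RG\tID:rg1\tSM:sample1' reference.fasta hifi_reads.fastq.gz | \
  <SENTIEON_FOLDER>/bin/sentieon util sort -t 32 -o sorted.bam --sam2bam -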

7.5. UTIL binary

UTIL is the binary used to run some utility functions. This binary is mainly used to process the raw reads output from BWA.

7.5.1. UTIL syntax

The general syntax of the UTIL binary is:

sentieon util MODE [OPTIONS]

The supported modes (MODE) for this command are:

  • index: build the index for a BAM file. The following command will generate a bai BAM index file at the same location as the input file:

    sentieon util index INPUT.bam
    
  • vcfindex: build the index for a VCF file. The following command will generate an idx VCF index file at the same location as the input file:

    sentieon util vcfindex INPUT.vcf
    
  • sort: sort a BAM file. The optional arguments (OPTIONS) for the UTIL command using the sort MODE include:

    • -t NUMBER_THREADS: number of computing threads that will be used by the software to run parallel processes. The default is as many threads as available in the server.
    • -r REFERENCE: location of the reference FASTA file. This argument is required if you are using a CRAM output file, otherwise it is optional.
    • -i INPUT: location of the input file.
    • -o OUTPUT: the location and filename of the output file.
    • --temp_dir DIRECTORY: determines where the temporary files will be stored. The default is the folder where the command is run ($PWD).
    • --cram_read_options decode_md=0: CRAM input option to turn off the NM/MD tag in the input CRAM.
    • --cram_write_options compressor=[gzip|bzip2|lzma|rans]: CRAM output compression options. compressor=gzip is default if not defined.
    • --cram_write_options version=[2.1|3.0]: CRAM output version options. version=3.0 is default if not defined.
    • --bam_compression COMPRESSION_LEVEL[0-9]: gzip compression level for the output BAM file. The default value is 6.
    • --sam2bam: indicates that the input will be in the form of an uncompressed SAM file, that needs to be converted to BAM. If this option is not used, the input should have been converted to BAM format from the BWA output using samtools.
    • --block_size BLOCK_SIZE: size of the block to be used for sorting.
    • --umi_post_process: indicates that the input comes from next-generation sequence data containing molecular barcode information (also called unique molecular indices or UMIs) processed by the sentieon umi consensus tool. This option instructs the tool to perform the necessary post-processing for this kind of data.
    • --trim_primer AMPLICON_TABLE,clip=CLIPPING: determines whether to mark primers as soft/hard clips, depending on whether the value of CLIPPING is soft or hard. When using option --trim_primer AMPLICON_TABLE,clip=soft you will need to make sure to use the --trim_soft_clip option in any subsequent variant caller command so that the soft-clipped primers are ignored. The primers are determined based on the information provided in the AMPLICON_TABLE. The AMPLICON_TABLE is a BED file containing the location of the primers in the amplicon sequencing as a tab-delimited file with eight columns: contig, amplicon_start, amplicon_end, name (ignored), score (ignored), strand (ignored), insert_start, insert_end; the tool will trim bases from amplicon reads between amplicon_start and insert_start, and between amplicon_end and insert_end.

    The following command will sort an uncompressed SAM input and convert it to a coordinate-sorted BAM file:

    sentieon util sort -t NUMBER_THREADS --sam2bam -i INPUT.sam -o OUTPUT.bam
    
  • vcfconvert: compress and decompress VCF and GVCF files.

    The following command will compress and index the input file:

    sentieon util vcfconvert INPUT.vcf OUTPUT.vcf.gz
    

    The following command will decompress a non-indexed vcf file generated with gzip and then compress and index the file. When using this command make sure that the INPUT and OUTPUT files are not the same:

    sentieon util vcfconvert INPUT.gz OUTPUT.vcf.gz
    
  • stream: perform base quality correction in streaming mode. The optional arguments (OPTIONS) for the UTIL command using the stream MODE include:

    • -r REFERENCE: location of the reference FASTA file. This argument is required if you are using a CRAM output file, otherwise it is optional.
    • -i INPUT: location of the input file. By default the binary will use stdin as the input.
    • -q RECAL_TABLE: location of the recalibration table.
    • -t NUMBER_THREADS: number of computing threads that will be used by the software to run parallel processes. The default value is 1.
    • -o OUTPUT: the location and filename of the output file. By default the binary will output to stdout, to be able to stream the results.
    • --output_format FORMAT: determines the format of the output stream. The possible FORMAT values are BAM or CRAM. The default is BAM.
    • --output_index_file OUTPUT_INDEX: determines where the corresponding index file will be created. If this option is omitted, the tool will not generate an index file.
    • --read_filter FILTER,OPTION=VALUE,OPTION=VALUE: perform a filter or transformation of reads prior to the application of the algorithm. Please refer to Section 7.1.3 in the driver usage to get additional information on the available filters and their functionality.
    • --cram_read_options decode_md=0: CRAM input option to turn off the NM/MD tag in the input CRAM.
    • --cram_write_options compressor=[gzip|bzip2|lzma|rans]: CRAM output compression options. compressor=gzip is default if not defined.
    • --cram_write_options version=[2.1|3.0]: CRAM output version options. version=3.0 is default if not defined.
    • --bam_compression COMPRESSION_LEVEL[0-9]: gzip compression level for the output BAM file. The default value is 6.

    The following command will apply the recalibration and output the corresponding recalibrated BAM file to stdout:

    sentieon util stream -i INPUT.bam -q RECAL_TABLE \
      --output_index_file OUTPUT.bam.bai -o -
    

    The above command will not generate an index file unless the --output_index_file option is included.

7.6. UMI binary

UMI is the binary used to process reads containing UMI sequences.

7.6.1. UMI syntax

The general syntax of the UMI binary is:

sentieon umi MODE [OPTIONS]

The supported modes (MODE) for this command are:

  • extract: pre-process FASTQ files containing reads with UMI sequences. The syntax of the extract MODE is:

    sentieon umi extract [OPTIONS] read_structure fastq1 [fastq2] [fastq3]
    

    where:

    • read_structure is the logical structure of the reads. It consists of a collection of integer+character pairs describing the #bases+type; the type can be M for molecular barcode, T for template and S for skip. The read structure consists of comma separated groups, where each group will be read from the corresponding input FASTQ (first group from first FASTQ, second group from second FASTQ...)
    • fastq1/2/3 are the FASTQ files. Up to 3 input FASTQ files are supported to allow the use case when the UMI sequence is already in a separate FASTQ file.

    The optional arguments (OPTIONS) for the UMI binary using the extract MODE include:

    • -o OUTPUT: the location and filename of the output file. If omitted, the output will be written to stdout.
    • -d: if present, the extraction will be done in duplex mode.
    • --umi_tag TAG: the logic UMI tag. The default value is XR.
  • consensus: combines reads with the same barcode into a consensus read, outputting a new FASTQ file containing the consensus reads. The syntax of the consensus MODE is:

    sentieon umi consensus [OPTIONS] -o OUTPUT
    

    where:

    • -o OUTPUT is the location and filename of the output file.

    The optional arguments (OPTIONS) for the UMI binary using the consensus MODE include:

    • -i INPUT is the location and filename of the input file. By default the binary will use stdin as the input.
    • --input_format FORMAT: the format of the input FILE. The possible values are SAM or BAM. The default is SAM.
    • --umi_tag TAG: the logic UMI tag. The default value is XR.
    • --copy_tags TAGS: the tags to be copied from the input file to the output. The default is XR,XZ,RX,MI,BI,BD.
    • --read_name_prefix PREFIX: the prefix of the consensus reads. The default value is UMI-.
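
As an illustration, a sketch of a UMI pre-processing pipeline that extracts the UMI sequences, aligns the reads and creates consensus reads is shown below; the read structure, read group and file names are placeholders, and the bwa mem -C option (which appends the FASTQ comment carrying the UMI tag to the alignments, as in upstream bwa) is assumed to be available:

<SENTIEON_FOLDER>/bin/sentieon umi extract 3M2S+T,3M2S+T \
  sample_1.fastq.gz sample_2.fastq.gz | \
  <SENTIEON_FOLDER>/bin/sentieon bwa mem -p -C -t 32 \
  -R '@RG\tID:rg1\tSM:sample1\tPL:ILLUMINA' reference.fasta - | \
  <SENTIEON_FOLDER>/bin/sentieon umi consensus -o consensus.fastq.gz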

7.7. PLOT script

PLOT is a script used to create plots of the results of the metrics and recalibration stages. The plots are stored in a PDF file.

7.7.1. PLOT syntax

The general syntax of the PLOT script is:

sentieon plot STAGE -o OUTPUT_FILE INPUTS [OPTIONS]

The supported modes (STAGE) for this command are:

  • GCBias: generate PDF file from the metrics results of GCBias.
  • QualDistribution: generate PDF file from the metrics results of QualDistribution.
  • InsertSizeMetricAlgo: generate PDF file from the metrics results of InsertSizeMetricAlgo.
  • MeanQualityByCycle: generate PDF file from the metrics results of MeanQualityByCycle.
  • QualCal: generate PDF file from the BQSR QualCal tool.
  • VarCal: generate PDF file from the VQSR VarCal tool.

7.7.1.1. PLOT results of GCBias STAGE

The INPUTS to generate the plots from the GCBias stage are:

  • GC_METRIC_TXT: where GC_METRIC_TXT is the output file of the GCBias algorithm from Section 7.1.2.7.

The plotting of the GCBias metrics STAGE does not accept any OPTIONS.

7.7.1.2. PLOT results of MeanQualityByCycle STAGE

The INPUTS to generate the plots from the MeanQualityByCycle stage are:

  • MQ_METRIC_TXT: where MQ_METRIC_TXT is the output file of the MeanQualityByCycle algorithm from Section 7.1.2.5.

The plotting of the MeanQualityByCycle metrics STAGE does not accept any OPTIONS.

7.7.1.3. PLOT results of QualDistribution STAGE

The INPUTS to generate the plots from the QualDistribution stage are:

  • QD_METRIC_TXT: where QD_METRIC_TXT is the output file of the QualDistribution algorithm from Section 7.1.2.6.

The plotting of the QualDistribution metrics STAGE does not accept any OPTIONS.

7.7.1.4. PLOT results of InsertSizeMetricAlgo STAGE

The INPUTS to generate the plots from the InsertSizeMetricAlgo stage are:

  • IS_METRIC_TXT: where IS_METRIC_TXT is the output file of the InsertSizeMetricAlgo algorithm from Section 7.1.2.10.

The plotting of the InsertSizeMetricAlgo metrics STAGE does not accept any OPTIONS.

7.7.1.5. PLOT results of QualCal STAGE

The INPUTS to generate the plots from the bqsr stage are:

  • RECAL_RESULT.CSV: the output csv file of the QualCal algorithm from Section 7.1.2.4.

The plotting of the bqsr STAGE does not accept any OPTIONS.

7.7.1.6. PLOT results of VarCal STAGE

The INPUT to generate the plots from the vqsr stage is:

  • PLOT_FILE: a file created by the VarCal algorithm from Section 7.1.2.21 containing the data required to create the report.

The plotting of the vqsr STAGE accepts the following OPTIONS:

  • tranches_file=TRANCHES_FILE: location of the file containing the partition of the call sets into quality tranches, generated by the VarCal algorithm from Section 7.1.2.21.
  • target_titv=TITV_THRES: expected TiTv number for the species; it is used to calculate the True Positive and False Positive numbers in the plot.
  • min_fp_rate=MIN_RATE: minimum False Positive number; it is used to calculate the True Positive and False Positive numbers in the plot.
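
For illustration, a vqsr plotting command could look like the following sketch; the file names and option values are placeholders:

sentieon plot VarCal -o vqsr-report.pdf vqsr_plot_file.txt \
  tranches_file=vqsr_tranches.txt target_titv=2.15 min_fp_rate=0.001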

7.8. LICSRVR binary

LICSRVR is the binary used to run the license server to facilitate dynamic license assignment and record license utilization within a cluster.

7.8.1. LICSRVR syntax

The general syntax of the LICSRVR binary is:

<SENTIEON_FOLDER>/bin/sentieon licsrvr [--start|--stop] [--log LOG_FILE] LICENSE_FILE

The following inputs are optional for the command:

  • LOG_FILE: location and filename of the output file containing the log of the server.
  • LICENSE_FILE: location of the server license file.

After the license server is operational, the client applications can request license tokens from the server by setting the SENTIEON_LICENSE environment variable to the server address in the form of HOST:PORT.
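
For example, assuming a license server running on host licsrvr_host and listening on port 8990 (both placeholders), client jobs could set:

export SENTIEON_LICENSE=licsrvr_host:8990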

The licsrvr binary supports the following additional modes:

  • --version: will report the software package version the binary belongs to.

  • --dump: will report the current status of the license server, including the number of available licenses.

    <SENTIEON_FOLDER>/bin/sentieon licsrvr --dump LICENSE_FILE
    
  • --dump=update: if the license information has been updated and automatically pulled by the license server, this mode will dump the updated license information to stdout. The following command will report the updated license information the license server has, if there has been any change:

    <SENTIEON_FOLDER>/bin/sentieon licsrvr --dump=update LICENSE_FILE
    

7.9. LICCLNT binary

LICCLNT is the binary used to test the license server functionality to help determine whether the license server is operational, and how many licenses of the different algorithms are available.

7.9.1. LICCLNT syntax

The LICCLNT binary has two modes, one to ping the license server and one to check the available licenses for specific algorithms.

You can run the following command to check if the license server is operational:

<SENTIEON_FOLDER>/bin/sentieon licclnt ping --server HOST:PORT

The command will return 0 if the server is operational.

You can run the following command to check how many licenses are available:

<SENTIEON_FOLDER>/bin/sentieon licclnt query --server HOST:PORT FEATURE

The command will return the number of licenses that are available for the specific license feature, which can be used for managing your jobs: before submitting a job on a certain number of threads, you can check if there are enough licenses for those threads, preventing the tool from being idle while waiting for licenses.