8. Detailed usage of the tools

8.1. DRIVER binary

DRIVER is the binary used to execute all stages of the bioinformatics pipeline. A single call to the driver binary can run multiple algorithms; for example, the metrics stage is implemented as a single driver call that runs several metrics algorithms.

8.1.1. DRIVER syntax

The general syntax of the DRIVER binary is:

sentieon driver OPTIONS --algo ALGORITHM ALGO_OPTION OUTPUT \
[--algo ALGORITHM2 ALGO_OPTION2 OUTPUT2]

In the case of running multiple algorithms in the same driver call, the OPTIONS are shared among all the algorithms.

The arguments (OPTIONS) for this command include:

  • -t NUMBER_THREADS: number of computing threads that will be used by the software to run parallel processes. The argument is optional; if omitted, the driver tool will use as many threads as the server has.
  • -r REFERENCE: location of the reference FASTA file. This argument is required in all algorithms except LocusCollector and Dedup.
  • -i INPUT_FILE: location of the BAM input file. This argument can be used multiple times to have the software use multiple input files; the result is the same as if the input BAM files had been merged into one BAM file. This argument is required in all algorithms except for CollectVCMetrics, GVCFtyper, DNAModelApply, SVSolver, TNModelApply, TNModelTrain, VarCal and ApplyVarCal.
  • -q QUALITY_RECALIBRATION_TABLE: location of the quality recalibration table output from the BQSR stage that will be used as an input during the BQSR stage and the Variant Calling stage. This argument can be used multiple times to have the software use multiple input calibration tables; each calibration table will apply a recalibration to BAM files by matching the readgroup of the reads.
  • --interval INTERVAL: interval in the reference that will be used in the software. This argument can be used multiple times to perform the calculation on the union of all the intervals. INTERVAL can be specified as:
    • CONTIG[:START-END]: calculation will be done only on the corresponding contig. START and END are optional numbers to further reduce the interval; both are 1-based coordinates. You can input a comma-separated list of multiple contigs.
    • BED_FILE: location of the BED file containing the intervals.
    • PICARD_INTERVAL_FILE: location of the file containing the intervals, following the Picard interval standard.
  • --interval_padding PADDING_SIZE: adds PADDING_SIZE bases padding to the edges of the input intervals. The default is 0.
  • --help: option to display help. The option can be used together with the --algo ALGORITHM to display help for the specific algorithm.
  • --read_filter FILTER,OPTION=VALUE,OPTION=VALUE: perform a filter or transformation of reads prior to the application of the algorithm. Please refer to Section 8.1.3 to get additional information on the available filters and their functionality.
  • --temp_dir DIRECTORY: determines where the temporary files will be stored. The default is the folder where the command is run ($PWD).
  • --skip_no_coor: determines whether to skip unmapped reads.
  • --cram_read_options decode_md=0: CRAM input option to turn off the NM/MD tag in the input CRAM.
  • --replace_rg ORIG_RG="NEW_RG_STRING": modifies the @RG Read group tag of the next BAM input file to update the ORIG_RG with the new information; the NEW_RG_STRING needs to be a valid and complete string. This argument can be used multiple times, and each will affect the next -i BAM input. You can check Section 9.8 for a detailed example of its usage.
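
For illustration, the following is a minimal sketch of a single driver call that shares the OPTIONS (-t, -r, -i) across several metrics algorithms; the thread count and file names such as reference.fasta and sample_sorted.bam are placeholders:

sentieon driver -t 16 -r reference.fasta -i sample_sorted.bam \
  --algo MeanQualityByCycle mq_metrics.txt \
  --algo QualDistribution qd_metrics.txt \
  --algo GCBias --summary gc_summary.txt gc_metrics.txt \
  --algo AlignmentStat aln_metrics.txt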

The supported algorithms (ALGORITHM) for this command are:

  • LocusCollector, Dedup: used in the remove duplicates stage.
  • Realigner: used in the indel realignment stage.
  • QualCal: used in the base quality recalibration stage.
  • MeanQualityByCycle, QualDistribution, GCBias, AlignmentStat, InsertSizeMetricAlgo, HsMetricAlgo, CoverageMetrics, BaseDistributionByCycle, QualityYield, WgsMetricsAlgo, SequenceArtifactMetricsAlgo: used to calculate QC metrics.
  • Genotyper, Haplotyper: used for germline variant calling analysis.
  • DNAscope and DNAModelApply: used in DNAscope germline variant calling analysis.
  • DNAscope and SVSolver: used in the DNAscope germline structural variant analysis.
  • ReadWriter: used to output the BAM file after base quality recalibration or merge multiple BAM files.
  • VarCal, ApplyVarCal: used in the VQSR stage.
  • GVCFtyper: used for germline joint variant calling analysis.
  • TNsnv, TNhaplotyper, TNhaplotyper2: used for somatic tumor-normal or tumor only analysis.
  • TNscope, TNModelApply: used for tumor-normal somatic and structural variant analysis.
  • RNASplitReadsAtJunction: used in the stage to split RNA reads at junctions.
  • ContaminationAssessment: used to identify cross-sample contamination from BAM files.
  • CollectVCMetrics: used to calculate post-variant calling metrics.

8.1.2. DRIVER ALGORITHM syntax

8.1.2.1. LocusCollector ALGORITHM

The LocusCollector algorithm collects read information that will be used for removing duplicate reads.

The input to the LocusCollector algorithm is a BAM file; its output is the score file indicating which reads are likely duplicates.

The LocusCollector algorithm requires the following ALGO_OPTION:

  • --fun SCORE: scoring function to use. Possible values for SCORE are:
    • score_info: calculates the score of a read pair for the Dedup algorithm. This is the default score function.

8.1.2.2. Dedup ALGORITHM

The Dedup algorithm performs the removal of duplicate reads.

The input to the Dedup algorithm is a BAM file; its output is the BAM file after removing duplicate reads.

The Dedup algorithm requires the following ALGO_OPTION:

  • --score_info LOCUS_COLLECTOR_OUTPUT: location of the output file of the LocusCollector command call.

The Dedup algorithm accepts the following optional ALGO_OPTION:

  • --rmdup: set this option to remove duplicate reads in the output BAM file. If this option is not set, duplicate reads will be marked as such with a flag, but not removed from the BAM file.
  • --cram_write_options compressor=[gzip|bzip2|lzma|rans]: CRAM output compression options. compressor=gzip is default if not defined.
  • --cram_write_options version=[2.1|3.0]: CRAM output version options. version=2.1 is default if not defined.
  • --metrics METRICS_FILE: location and filename of the output file containing the metrics data from the deduping stage.
  • --optical_dup_pix_dist DISTANCE: determine the maximum distance between two duplicate reads for them to be considered optical duplicates. The default is 100.
  • --bam_compression COMPRESSION_LEVEL[0-9]: gzip compression level for the output BAM file. The default value is 6.
  • --output_dup_read_name: when using this option, the output of the command will not be a BAM file with marked/removed duplicates, but a list of read names for reads that were marked as duplicate.
  • --dup_read_name DUPLICATE_READ_NAME_FILE: when using this input all reads contained in the DUPLICATE_READ_NAME_FILE will be marked as duplicate, regardless of whether they are primary or non-primary.
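
As an illustration, duplicate marking is typically performed as two driver calls, first LocusCollector and then Dedup; a minimal sketch with placeholder file names (sample_sorted.bam, score.txt, dedup_metrics.txt, deduped.bam):

sentieon driver -t 16 -i sample_sorted.bam \
  --algo LocusCollector --fun score_info score.txt
sentieon driver -t 16 -i sample_sorted.bam \
  --algo Dedup --rmdup --score_info score.txt \
  --metrics dedup_metrics.txt deduped.bam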

8.1.2.3. Realigner ALGORITHM

The Realigner algorithm performs the indel realignment.

The input to the Realigner algorithm is a BAM file; its output is the BAM file after realignment.

The Realigner algorithm accepts the following optional ALGO_OPTION:

  • -k KNOWN_SITES: location of the VCF file used as a set of known sites. The known sites will be used to help identify likely sites where the realignment is necessary; only indel variants in the file will be used. You can include multiple collections of known sites by specifying multiple files and repeating the -k KNOWN_SITES option.
  • --interval_list INTERVAL: interval in the reference that will be used in the calculation of the realign targets. Only a single input INTERVAL will be considered: if you repeat the --interval_list option, only the INTERVAL in the last one will be considered. INTERVAL can be specified as:
    • BED_FILE: location of the BED file containing the intervals.
    • PICARD_INTERVAL_FILE: location of the file containing the intervals, following the Picard interval standard.
  • --bam_compression COMPRESSION_LEVEL[0-9]: gzip compression level for the output BAM file. The default value is 6.
  • --cram_write_options compressor=[gzip|bzip2|lzma|rans]: CRAM output compression options. compressor=gzip is default if not defined.
  • --cram_write_options version=[2.1|3.0]: CRAM output version options. version=2.1 is default if not defined.
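
For illustration, a minimal Realigner sketch with placeholder file names (deduped.bam, known_indels.vcf.gz, realigned.bam):

sentieon driver -t 16 -r reference.fasta -i deduped.bam \
  --algo Realigner -k known_indels.vcf.gz realigned.bam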

8.1.2.4. QualCal ALGORITHM

The QualCal algorithm calculates the recalibration table necessary to do the BQSR. The QualCal algorithm can also apply the recalibration and create the data required to generate a report.

The recalibration math depends on the platform (PL) tag of the ReadGroup; the QualCal algorithm supports the following platforms: ILLUMINA, ION_TORRENT, LS454, PACBIO, and COMPLETE_GENOMICS. Support for sequencing data from the SOLID platform is not currently implemented.

The input to the QualCal algorithm is a BAM file; its output is a recalibration table or the csv file containing the data required to create a report.

The QualCal algorithm accepts the following optional ALGO_OPTION:

  • -k KNOWN_SITES: location of the VCF file used as a set of known sites. The known sites will be used to make sure that known locations do not get artificially low quality scores by misidentifying true variants as errors in the sequencing methodology. You can include multiple collections of known sites by specifying multiple files and repeating the -k KNOWN_SITES option. We strongly recommend using as many known sites as possible, as otherwise the recalibration will consider variant sites to be sequencing errors.
  • --plot: indicates whether the command is being used to generate the data required to create a report.
  • --cycle_val_max: maximum allowed cycle value for the cycle covariate.
  • --before RECAL_TABLE: location of the previously calculated recalibration table; it will be used to apply the recalibration.
  • --after RECAL_TABLE.POST: location of the previously calculated results of applying the recalibration table; it will be used to calculate the data required to create a report.
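
For illustration, a minimal BQSR sketch with placeholder file names: the first call calculates the recalibration table, the second applies it (via the driver -q option) while computing the post-recalibration table, and the third, a sketch of typical usage, uses --plot together with --before and --after to produce the csv data for the report:

sentieon driver -t 16 -r reference.fasta -i realigned.bam \
  --algo QualCal -k known_sites.vcf.gz recal_data.table
sentieon driver -t 16 -r reference.fasta -i realigned.bam -q recal_data.table \
  --algo QualCal -k known_sites.vcf.gz recal_data.table.post
sentieon driver -t 16 --algo QualCal --plot \
  --before recal_data.table --after recal_data.table.post recal_report.csv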

8.1.2.5. MeanQualityByCycle ALGORITHM

The MeanQualityByCycle algorithm calculates the mean base quality score for each sequencing cycle.

The input to the MeanQualityByCycle algorithm is a BAM file; its output is the metrics data.

The MeanQualityByCycle algorithm does not accept any ALGO_OPTION.

8.1.2.6. QualDistribution ALGORITHM

The QualDistribution algorithm calculates the number of bases with a specific base quality score.

The input to the QualDistribution algorithm is a BAM file; its output is the metrics data.

The QualDistribution algorithm does not accept any ALGO_OPTION.

8.1.2.7. GCBias ALGORITHM

The GCBias algorithm calculates the GC bias in the reference and the sample.

The input to the GCBias algorithm is a BAM file; its output is the metrics data.

The GCBias algorithm accepts the following optional ALGO_OPTION:

  • --summary SUMMARY_FILE: location and filename of the output file summarizing the GC Bias metrics.
  • --accum_level LEVEL: determines the accumulation levels. The possible values of LEVEL are ALL_READS, SAMPLE, LIBRARY, READ_GROUP. The default is ALL_READS.
  • --also_ignore_duplicates: determines whether the output metrics will be calculated using unique non duplicated reads.

8.1.2.8. HsMetricAlgo ALGORITHM

The HsMetricAlgo algorithm calculates the Hybrid Selection specific metrics for the sample and the AT/GC dropout metrics for the reference.

The input to the HsMetricAlgo algorithm is a BAM file; its output is the metrics data.

The HsMetricAlgo algorithm requires the following ALGO_OPTION:

  • --targets_list TARGETS_FILE: location and filename of the interval list input file that contains the locations of the targets.
  • --baits_list TARGETS_FILE: location and filename of the interval list input file that contains the locations of the baits used.
  • --clip_overlapping_reads: determines whether to clip overlapping reads during the calculation.
  • --min_map_qual QUALITY: determines the filtering quality of the reads used. Any reads with mapping quality less than QUALITY will be filtered out.
  • --min_base_qual QUALITY: determines the filtering quality of the bases used in variant calling. Any base with quality less than QUALITY will be ignored.
  • --coverage_cap COVERAGE: determines the maximum coverage limit used in the histogram.
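
For illustration, a minimal HsMetricAlgo sketch with placeholder file names (targets.interval_list, baits.interval_list, hs_metrics.txt):

sentieon driver -t 16 -r reference.fasta -i deduped.bam \
  --algo HsMetricAlgo --targets_list targets.interval_list \
  --baits_list baits.interval_list hs_metrics.txt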

8.1.2.9. AlignmentStat ALGORITHM

The AlignmentStat algorithm calculates statistics about the alignment of the reads.

The input to the AlignmentStat algorithm is a BAM file; its output is the metrics data.

The AlignmentStat algorithm accepts the following optional ALGO_OPTION:

  • --adapter_seq SEQUENCE_LIST: the sequence of the adapters used in the sequencing, provided as a comma separated list. The default value is the list of default Illumina adapters.

8.1.2.10. InsertSizeMetricAlgo ALGORITHM

The InsertSizeMetricAlgo algorithm calculates the statistical distribution of insert sizes.

The input to the InsertSizeMetricAlgo algorithm is a BAM file; its output is the metrics data.

The InsertSizeMetricAlgo algorithm does not accept any ALGO_OPTION.

8.1.2.11. CoverageMetrics ALGORITHM

The CoverageMetrics algorithm calculates the depth coverage of the BAM file. The coverage is aggregated by interval if the --interval option is included at the driver level; if no --interval option is included, the aggregation per interval will be done per contig for all bases in the reference and the per locus coverage output file will be about 60GB. The coverage is aggregated by gene if a RefSeq file is included.

The input to the CoverageMetrics algorithm is a BAM file; its outputs are files containing the metrics data organized by partition, aggregation and output. Possible outputs are:

  • summary: contains the depth data.
  • statistics: contains the histogram of loci with specific depth.
  • cumulative_coverage_counts: contains the histogram of loci with depth larger than x.
  • cumulative_coverage_proportions: contains the normalized histogram of loci with depth larger than x.

Examples of output files when the output name is OUTPUT:

  • OUTPUT: the per locus coverage with no partition.
  • OUTPUT.sample_summary: the summary for PARTITION_GROUP sample, aggregated over all bases.
  • OUTPUT.library_interval_statistics: the statistics for PARTITION_GROUP library, aggregated by interval.

The CoverageMetrics algorithm accepts the following optional ALGO_OPTION:

  • --partition PARTITION_GROUP: determines how to partition the data. Possible values are readgroup or a comma separated combination of the RG attributes, namely sample, platform, library, center. The default value is sample. You can include multiple partition groups by repeating the --partition option, and each output file will be created once per --partition option.
  • --gene_list REFSEQ_FILE: location of the RefSeq file used to aggregate the results of the CoverageMetrics algorithm to the gene level.
  • Filtering options:
    • --min_map_qual and --max_map_qual MAP_QUALITY: determine the mapping quality range of the reads used. Any read with mapping quality outside this range will be filtered out.
    • --min_base_qual and --max_base_qual QUALITY: determine the base quality range of the bases used. Any base with quality outside this range will be ignored.
    • --cov_thresh THRESHOLD: adds the percentage of bases in the aggregation that have coverage larger than the threshold. You can include multiple thresholds by repeating the --cov_thresh argument.
  • Omit output options:
    • --omit_base_output: skip the output of the per locus coverage with no partition. This option can be used when you do not use intervals to save space.
    • --omit_sample_stat: skip the output of summary results aggregated over all bases (_summary)
    • --omit_locus_stat: skip the output of all histogram files (both _cumulative_coverage_counts and _cumulative_coverage_proportions).
    • --omit_interval_stat: skip the output of all interval statistics files (_interval_statistics).
  • --count_type TYPE: determines how to deal with overlapping reads from the same fragment. Possible options are:
    • 0: to count overlapping reads even if they come from the same fragment. This is the default value.
    • 1: to count overlapping reads
    • 2: to count overlapping reads only if the reads in the fragment have consistent bases.
  • --print_base_counts: include the counts of each base type (A, G, C, T, N, D) in the output per locus coverage with no partition.
  • --include_ref_N: include the coverage data in loci where the reference genome is set to N.
  • --ignore_del_sites: ignore the coverage data in loci where there are deletions.
  • --include_del: this argument will interact with other options as follows:
    • if --ignore_del_sites is off, deletions will be counted as depth.
    • if --print_base_counts is on, the number of 'D' will be included.
  • --histogram_scale [log/linear], --histogram_low MIN_DEPTH, --histogram_high MAX_DEPTH, --histogram_bin_count NUM_BINS: determine the scale type, depth range and number of bins for the histograms. The default values are log, 1, 500 and 499, respectively.
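
For illustration, a minimal CoverageMetrics sketch restricted to a BED file of target intervals; file names are placeholders, and the output files will use the coverage_output prefix:

sentieon driver -t 16 -r reference.fasta -i deduped.bam --interval targets.bed \
  --algo CoverageMetrics --partition sample --cov_thresh 20 coverage_output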

8.1.2.12. CollectVCMetrics ALGORITHM

The CollectVCMetrics algorithm collects metrics related to the variants present in the input VCF.

The input to the CollectVCMetrics algorithm is a VCF file and a DBSNP file; its output is a pair of files containing information about the variants from the VCF file.

The CollectVCMetrics algorithm requires the following ALGO_OPTION:

  • -d dbSNP_FILE: location of the Single Nucleotide Polymorphism database (dbSNP). Only one file is supported.
  • -v INPUT: location of the VCF file on which the metrics will be calculated. Only one file is supported.
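
For illustration, a minimal CollectVCMetrics sketch with placeholder file names (dbsnp.vcf.gz, calls.vcf.gz); the output pair of files will be created from the vc_metrics argument:

sentieon driver -t 16 -r reference.fasta --algo CollectVCMetrics \
  -d dbsnp.vcf.gz -v calls.vcf.gz vc_metrics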

8.1.2.13. BaseDistributionByCycle

The BaseDistributionByCycle algorithm calculates the nucleotide distribution per sequencer cycle.

The input to the BaseDistributionByCycle algorithm is a BAM file; its output is the metrics data.

The BaseDistributionByCycle algorithm accepts the following optional ALGO_OPTION:

  • --aligned_reads_only: determines whether to calculate the base distribution over aligned reads only.
  • --pf_reads_only: determines whether to calculate the base distribution over PF reads only.

8.1.2.14. QualityYield

The QualityYield algorithm collects metrics related to reads that pass quality thresholds and Illumina-specific filters.

The input to the QualityYield algorithm is a BAM file; its output is the metrics data.

The QualityYield algorithm accepts the following optional ALGO_OPTION:

  • --include_secondary: determines whether to include bases from secondary alignments in the calculation.
  • --include_supplementary: determines whether to include bases from supplementary alignments in the calculation.

8.1.2.15. WgsMetricsAlgo

The WgsMetricsAlgo algorithm collects metrics related to the coverage and performance of whole genome sequencing (WGS) experiments.

The input to the WgsMetricsAlgo algorithm is a BAM file; its output is the metrics data.

The WgsMetricsAlgo algorithm accepts the following optional ALGO_OPTION:

  • --min_map_qual QUALITY: determines the filtering quality of the reads used in the calculation. Any read with quality less than QUALITY will be ignored. The default value is 20.
  • --min_base_qual QUALITY: determines the filtering quality of the bases used in the calculation. Any base with quality less than QUALITY will be ignored. The default value is 20.
  • --coverage_cap COVERAGE: determines the maximum coverage limit for the histogram. Any position with coverage higher than COVERAGE will have its coverage set to COVERAGE.
  • --sample_size SIZE: determines the Sample Size used for the Theoretical Het Sensitivity sampling. The default value is 10000.
  • --include_unpaired: determines whether to count unpaired reads and paired reads with one end unmapped.
  • --base_qual_histogram: determines whether to report the base quality histogram.

8.1.2.16. SequenceArtifactMetricsAlgo

The SequenceArtifactMetricsAlgo algorithm collects metrics that quantify single-base sequencing artifacts and OxoG artifacts.

The input to the SequenceArtifactMetricsAlgo algorithm is a BAM file; its output is the metrics data.

The SequenceArtifactMetricsAlgo algorithm accepts the following optional ALGO_OPTION:

  • --dbsnp FILE: location of the Single Nucleotide Polymorphism database (dbSNP) used to exclude regions around known polymorphisms. Only one file is supported.
  • --min_map_qual QUALITY: determines the filtering quality of the reads used in the calculation. Any read with quality less than QUALITY will be ignored. The default value is 30.
  • --min_base_qual QUALITY: determines the filtering quality of the bases used in the calculation. Any base with quality less than QUALITY will be ignored. The default value is 20.
  • --include_unpaired: determines whether to count unpaired reads and paired reads with one end unmapped.
  • --include_duplicates: determines whether to count duplicated reads.
  • --include_non_pf_reads: determines whether to count non-PF reads.
  • --min_insert_size ISIZE: determines the filtering insert size of the reads used in the calculation. Any read with insert size less than ISIZE will be ignored. The default value is 60.
  • --max_insert_size ISIZE: determines the filtering insert size of the reads used in the calculation. Any read with insert size larger than ISIZE will be ignored. The default value is 600.
  • --tandem_reads: determines whether the mate pairs are being sequenced from the same strand.
  • --context_size SIZE: determines the number of context bases to include on each side. The default value is 1.

8.1.2.17. Genotyper ALGORITHM

The Genotyper algorithm performs the Unified Genotyper variant calling.

The input to the Genotyper algorithm is a BAM file; its output is a VCF file.

The Genotyper algorithm accepts the following optional ALGO_OPTION:

  • --annotation 'ANNOTATION_LIST': determines additional annotations that will be added to the output VCF. Use a comma separated list to enable or disable annotations. Include 'none' to remove the default annotations; prefix annotations with the exclamation point (!) to disable the specific annotation. See Section 9.3 for supported annotations.
  • -d dbSNP_FILE: location of the Single Nucleotide Polymorphism database (dbSNP) used to label known variants. Only one file is supported.
  • --var_type VARIANT_TYPE: determine which variant types will be called; possible values for VARIANT_TYPE are:
    • SNP to call only Single Nucleotide Polymorphism. This is the default behavior.
    • INDEL to call only insertion-deletions.
    • both to call both SNPs and INDELs.
  • --call_conf CONFIDENCE: determine the threshold of variant quality to call a variant. Variants with quality less than CONFIDENCE will be removed.
  • --emit_conf CONFIDENCE: determine the threshold of variant quality to emit a variant. Variants with quality less than CONFIDENCE will not be added to the output VCF file.
  • --emit_mode MODE: determines what calls will be emitted. Possible values for MODE are:
    • variant: emit calls only at confident variant sites. This is the default behavior.
    • confident: emit calls at confident variant sites or confident reference sites.
    • all: emit all calls, regardless of their confidence.
  • --min_base_qual QUALITY: determines the filtering quality of the bases used in variant calling. Any base with quality less than QUALITY will be ignored. The default value is 17.
  • --ploidy PLOIDY: determines the ploidy number of the sample being processed. The default is 2.
  • --given GIVEN_VCF: perform variant calling using only the variants provided in the GIVEN_VCF. The calling will only evaluate the locus and alleles provided in the file.
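
For illustration, a minimal Genotyper sketch calling both SNPs and INDELs, with placeholder file names:

sentieon driver -t 16 -r reference.fasta -i deduped.bam -q recal_data.table \
  --algo Genotyper -d dbsnp.vcf.gz --var_type both output_genotyper.vcf.gz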

8.1.2.18. Haplotyper ALGORITHM

The Haplotyper algorithm performs the Haplotype variant calling.

The input to the Haplotyper algorithm is a BAM file; its output is a VCF file.

The Haplotyper algorithm accepts the following optional ALGO_OPTION:

  • --annotation 'ANNOTATION_LIST': determines additional annotations that will be added to the output VCF. Use a comma separated list to enable or disable annotations. Include 'none' to remove the default annotations; prefix annotations with the exclamation point (!) to disable the specific annotation. See Section 9.3 for supported annotations.
  • -d dbSNP_FILE: location of the Single Nucleotide Polymorphism database (dbSNP) used to label known variants. Only one file is supported.
  • --call_conf CONFIDENCE: determine the threshold of variant quality to call a variant. Variants with quality less than CONFIDENCE will be removed. This option is ignored when the --emit_mode is gvcf.
  • --emit_conf CONFIDENCE: determine the threshold of variant quality to emit a variant. Variants with quality less than CONFIDENCE will not be added to the output VCF file. This option is ignored when the --emit_mode is gvcf.
  • --emit_mode MODE: determines what calls will be emitted. Possible values for mode are:
    • variant: emit calls only at confident variant sites. This is the default behavior.
    • confident: emit calls at confident variant sites or confident reference sites.
    • all: emit all calls, regardless of their confidence.
    • gvcf: emits additional information required for joint calling. This option is required if you want to perform joint calling using the GVCFtyper algorithm.
  • --gq_bands LIST_OF_BANDS: determines the bands that will be used to compress variants of similar genotype quality (GQ) that will be emitted as a single VCF record in the GVCF output file. The LIST_OF_BANDS is a comma-separated list of bands where each band is defined by START-END/STEP. The default is 1-60,60-99/10,99.
  • --min_base_qual QUALITY: determines the filtering quality of the bases used in variant calling. Any base with quality less than QUALITY will be ignored. The default value is 10.
  • --pcr_indel_model MODEL: PCR indel model used to weed out false positive indels more or less aggressively. The possible MODELs are NONE (used for PCR-free samples), HOSTILE, AGGRESSIVE and CONSERVATIVE, in order of decreasing aggressiveness. The default value is CONSERVATIVE.
  • --phasing [1/0]: flag to enable or disable phasing in the output when using emit_mode GVCF. The default value is 1 (on) and this flag has no impact when using an emit_mode other than GVCF. Phasing is only calculated for diploid samples.
  • --ploidy PLOIDY: determines the ploidy number of the sample being processed. The default is 2.
  • --prune_factor FACTOR: minimum pruning factor in local assembly; paths with fewer supporting kmers than FACTOR will be pruned from the graph. The default value is 2.
  • --trim_soft_clip: determines whether soft clipped bases in the reads should be excluded from the variant calling. This argument is only recommended to process RNA reads.
  • --given GIVEN_VCF: perform variant calling using only the variants provided in the GIVEN_VCF. The calling will only evaluate the locus and alleles provided in the file. This option cannot be used in conjunction with --emit_mode gvcf.
  • --bam_output OUTPUT_BAM: output a BAM file containing modified reads after the local reassembly done by the variant calling. This option should only be used in conjunction with a small bed file for troubleshooting purposes.
  • --snp_heterozygosity HETEROZIGOSITY: determines the expected heterozygosity value used to compute prior likelihoods in SNP calling.
  • --indel_heterozygosity HETEROZIGOSITY: determines the expected heterozygosity value used to compute prior likelihoods in INDEL calling.
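
For illustration, a minimal Haplotyper sketch in GVCF emit mode, suitable as input to GVCFtyper; file names are placeholders:

sentieon driver -t 16 -r reference.fasta -i deduped.bam -q recal_data.table \
  --algo Haplotyper -d dbsnp.vcf.gz --emit_mode gvcf sample.g.vcf.gz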

8.1.2.19. ReadWriter ALGORITHM

The ReadWriter algorithm outputs the result of applying the Base Quality Score Recalibration to a file.

The ReadWriter algorithm can also merge BAM files, and/or convert them into CRAM files.

The input to the ReadWriter algorithm is one or multiple BAM files and one or multiple recalibration tables; its output is the BAM file after recalibration. If the output file extension is CRAM, a CRAM file will be created. If multiple input files were used, the output file will be the result of merging all the files.

The ReadWriter algorithm accepts the following optional ALGO_OPTION:

  • --bam_compression COMPRESSION_LEVEL[0-9]: gzip compression level for the output BAM file. The default value is 6.
  • --cram_write_options compressor=[gzip|bzip2|lzma|rans]: CRAM output compression options. compressor=gzip is default if not defined.
  • --cram_write_options version=[2.1|3.0]: CRAM output version options. version=2.1 is default if not defined.
  • --read_flag_mask pass=[all|FLAG_LIST_pass],drop=[FLAG_LIST_drop]: read filtering settings. FLAG_LIST is a '+' separated list of multiple Flag numbers or case insensitive Flag names. The default read_flag_mask is "pass=all,drop=0x0", as the ReadWriter algorithm allows all reads. As an example, to write a BAM file that does not contain either duplicated reads or unmapped reads, you can use --read_flag_mask pass=all,drop=DUP+UNMAP. The table below is a reminder of the flag definitions and their naming convention.
Flag #  Flag name      Description
0x1     PAIRED         paired-end (or multiple-segment) sequencing technology
0x2     PROPER_PAIR    each segment properly aligned according to the aligner
0x4     UNMAP          segment unmapped
0x8     MUNMAP         next segment in the template unmapped
0x10    REVERSE        SEQ is reverse complemented
0x20    MREVERSE       SEQ of the next segment in the template is reversed
0x40    READ1          the first segment in the template
0x80    READ2          the last segment in the template
0x100   SECONDARY      secondary alignment
0x200   QCFAIL         not passing quality controls
0x400   DUP            PCR or optical duplicate
0x800   SUPPLEMENTARY  supplementary alignment

We recommend running the ReadWriter algorithm at the same command call as one of the variant calls, to reduce overhead.
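
For illustration, a minimal sketch combining ReadWriter with Haplotyper in a single driver call, applying the recalibration table to both; file names are placeholders:

sentieon driver -t 16 -r reference.fasta -i deduped.bam -q recal_data.table \
  --algo ReadWriter recalibrated.bam \
  --algo Haplotyper -d dbsnp.vcf.gz output.vcf.gz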

8.1.2.20. GVCFtyper ALGORITHM

The GVCFtyper algorithm performs the joint variant calling of multiple samples, provided that each single sample has been previously processed using the Haplotyper algorithm with the option --emit_mode gvcf.

The GVCFtyper algorithm has no input in the driver level; its output is a VCF containing the joint called variants for all samples.

The GVCFtyper algorithm requires the following ALGO_OPTION:

  • -v INPUT: location of the GVCF file from the variant calling algorithm performed on a single sample using the extra option --emit_mode gvcf. You can include GVCF files from multiple samples by specifying multiple files and repeating the -v INPUT option. You can use VCF files compressed with bgzip and indexed.
  • Alternatively, you can input a list of GVCF files at the end of the command after the output file. Thus, the following 2 commands are interchangeable:
sentieon driver -r REFERENCE --algo GVCFtyper \
  -v s1_VARIANT_GVCF -v s2_VARIANT_GVCF -v s3_VARIANT_GVCF VARIANT_VCF
sentieon driver -r REFERENCE --algo GVCFtyper \
  VARIANT_VCF s1_VARIANT_GVCF s2_VARIANT_GVCF s3_VARIANT_GVCF
  • You can read in the GVCF file list from a text file using the following commands:
gvcf_argument=""
while read -r line; do
 gvcf_argument=$gvcf_argument" -v $line"
done < "list_of_gvcfs"
sentieon driver -r REFERENCE --algo GVCFtyper $gvcf_argument output-joint.vcf
  • You can also read in the GVCF file list from a text file using the following command leveraging the stdin pipe (-):
cat list_of_gvcfs | sentieon driver -r REFERENCE --algo GVCFtyper output-joint.vcf -
  • You could input all files from a specific folder using the following command:
sentieon driver -r REFERENCE --algo GVCFtyper output-joint.vcf sample*.g.vcf

The GVCFtyper algorithm accepts the following optional ALGO_OPTION:

  • --annotation 'ANNOTATION_LIST': determines additional annotations that will be added to the output VCF. Use a comma separated list to enable or disable annotations. Include 'none' to remove the default annotations; prefix annotations with the exclamation point (!) to disable the specific annotation. See Section 9.3 for supported annotations.
  • -d dbSNP_FILE: location of the Single Nucleotide Polymorphism database (dbSNP) used to label known variants. Only one file is supported.
  • --call_conf CONFIDENCE: determine the threshold of variant quality to call a variant. Variants with quality less than CONFIDENCE will be removed. The default value is 30.
  • --emit_conf CONFIDENCE: determine the threshold of variant quality to emit a variant. Variants with quality less than CONFIDENCE will not be added to the output VCF file. The default value is 30.
  • --emit_mode MODE: determines what calls will be emitted. Possible values for mode are:
    • variant: emit calls only at confident variant sites. This is the default behavior.
    • confident: emit calls at confident variant sites or confident reference sites.
    • all: emit all calls, regardless of their confidence.
  • --max_alt_alleles NUMBER: Maximum number of alternate alleles. The default value is 100.
  • --snp_heterozygosity HETEROZIGOSITY: determines the expected heterozygosity value used to compute prior likelihoods in SNP calling.
  • --indel_heterozygosity HETEROZIGOSITY: determines the expected heterozygosity value used to compute prior likelihoods in INDEL calling.

8.1.2.21. VarCal ALGORITHM

The VarCal algorithm calculates the Variant Quality Score Recalibration (VQSR). VQSR assigns a well-calibrated probability score to individual variant calls, to enable more accurate control in determining the most likely variants. For that, VQSR uses highly confident known sites to build a recalibration model and determine the probability that called sites are true. For more information about the algorithm, you can check http://gatkforums.broadinstitute.org/discussion/39/variant-quality-score-recalibration-vqsr. For information on the recommended resources to use in VQSR, you can check https://www.broadinstitute.org/gatk/guide/article?id=1259.

The VarCal algorithm has no input in the driver level; its output is a recalibration file containing additional annotations related to the VQSR.

The VarCal algorithm requires the following ALGO_OPTION:

  • -v INPUT: location of the VCF file from the variant calling algorithm; you can use a VCF file compressed with bgzip and indexed.

  • --tranches_file TRANCHES_FILE: location and filename of the file containing the partition of the call sets into quality tranches.

  • --resource RESOURCE_FILE --resource_param PARAM: location of the VCF file used as a training/truth resource in VQSR, followed by parameters determining how the file will be used. You can include multiple collections by specifying multiple files and repeating the --resource dbSNP_FILE --resource_param PARAM option. The PARAM argument follows the syntax:

    LABEL,known=IS_KNOWN,training=IS_TRAIN,truth=IS_TRUTH,prior=PRIOR
    
    • LABEL is a descriptive name for the resource.
    • IS_KNOWN can be true or false, and determines whether the sites contained in the resource will be used to stratify output metrics.
    • IS_TRAIN can be true or false, and determines whether the sites contained in the resource will be used for training the recalibration model.
    • IS_TRUTH can be true or false, and determines whether the sites contained in the resource will be considered true sites.
    • PRIOR is a value that reflects your confidence in how reliable the resource is as a truth set.

The VarCal algorithm accepts the following optional ALGO_OPTION:

  • --srand RANDOM_SEED: determines the seed to use in the random number generation. You can set RANDOM_SEED to 0 and the software will use the random seed from your computer. In order to generate a deterministic result, you should use a non-zero RANDOM_SEED and set the NUMBER_THREADS to 1.
  • --annotation ANNOTATION: determine annotation that will be used during the recalibration. You can include multiple annotations in the optimization by repeating the --annotation ANNOTATION option. You can use all annotations present in the original variant call file.
  • --var_type VARIANT_TYPE: determine which variant types will be recalibrated; possible values for VARIANT_TYPE are:
    • SNP to recalibrate only Single Nucleotide Polymorphism. This is the default behavior.
    • INDEL to recalibrate only insertion-deletions.
    • (do not use) BOTH to recalibrate both SNPs and INDELs. This setting SHOULD NOT be used, as VQSR should be performed independently for SNPs and INDELs.
  • --tranche TRANCH_THRESHOLD: normalized quality threshold for each tranche; the TRANCH_THRESHOLD number is a number between 0 and 100. Multiple instances of the option are allowed; each threshold will create a corresponding tranche. The default values are 90, 99, 99.9 and 100.
  • --max_gaussians MAX_GAUSS: determines the maximum number of Gaussians that will be used for the positive recalibration model. The default value is 8 for SNP and 4 for INDEL.
  • --max_neg_gaussians MAX_GAUSS: determines the maximum number of Gaussians that will be used for the negative recalibration model. The default value is 2.
  • --max_iter MAX_ITERATIONS: determines the maximum number of iterations for the Expectation Maximization (EM) optimization. The default value is 150.
  • --max_mq MAPQ: indicates the maximum MQ in your data, which will be used to perform a logit jitter transform of the MQ to make the distribution closer to a Gaussian.
  • --aggregate_data AGGREGATE_VCF: location of an additional VCF file containing variants called from other similar samples; these additional data will increase the effective sample size for the statistical model calibration. Multiple instances of the option are allowed.
  • --plot_file PLOT_FILE: location of the temporary file containing the necessary data to generate the reports from the VarCal algorithm.
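
For illustration, a minimal VarCal sketch for the SNP model; the resource files, LABELs, prior values and annotations are placeholders chosen for the example and should be replaced with the recommended VQSR resources:

sentieon driver -t 16 -r reference.fasta --algo VarCal -v variants.vcf.gz \
  --var_type SNP --tranches_file snp.tranches \
  --resource hapmap.vcf.gz \
  --resource_param hapmap,known=false,training=true,truth=true,prior=15.0 \
  --resource dbsnp.vcf.gz \
  --resource_param dbsnp,known=true,training=false,truth=false,prior=2.0 \
  --annotation QD --annotation MQ --annotation FS snp.recal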

8.1.2.22. ApplyVarCal ALGORITHM

The ApplyVarCal algorithm combines the output information from the VQSR with the original variant information.

The ApplyVarCal algorithm has no input in the driver level; its output is a copy of the original VCF containing additional annotations from the VQSR.

The ApplyVarCal algorithm requires the following ALGO_OPTION:

  • -v INPUT: location of the VCF file from the variant calling algorithm. It should be the same as the one used in the VarCal algorithm; you can use a VCF file compressed with bgzip and indexed.
  • --recal VARIANT_RECAL_DATA: location of the VCF file output from the VarCal algorithm.
  • --tranches_file TRANCHES_FILE: location of the tranches file output from the VarCal algorithm.
  • --var_type VARIANT_TYPE: determine which variant types will be recalibrated. This option should be consistent with the one used in the VarCal algorithm.

Alternatively, you can use the option --vqsr_model to input a comma-separated list of the required information for multiple VQSR models; this option allows you to apply both a SNP and INDEL mode in a single command line. The syntax of the option is:

--vqsr_model var_type=VARIANT_TYPE,\
              recal=VARIANT_RECAL_DATA,\
              tranches_file=TRANCHES_FILE,\
              sensitivity=SENSITIVITY

The ApplyVarCal algorithm accepts the following optional ALGO_OPTION:

  • --sensitivity SENSITIVITY: determine the sensitivity to the available truth sites; only tranches with threshold larger than the sensitivity will be included in the recalibration. We recommend you use a sensitivity number that is included in the tranche threshold list of VarCal algorithm; this will reduce rounding issues. The default value is NULL, so that no tranches filtering is applied, and only the LOW_VQSLOD filter is applied.
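
For illustration, a minimal ApplyVarCal sketch consistent with the VarCal example above; file names are placeholders:

sentieon driver -t 16 -r reference.fasta --algo ApplyVarCal -v variants.vcf.gz \
  --var_type SNP --recal snp.recal --tranches_file snp.tranches \
  --sensitivity 99.0 recalibrated_snp.vcf.gz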

8.1.2.23. TNsnv ALGORITHM

The TNsnv algorithm performs the somatic variant calling on the tumor-normal matched pair or the tumor and panel of normal data, using a Genotyper algorithm.

The input to the TNsnv algorithm is a BAM file; its output is a VCF file.

The TNsnv algorithm requires the following ALGO_OPTION:

  • --tumor_sample SAMPLE_NAME: name of the SM tag in the BAM file for the tumor sample.

Depending on the mode in which it is run, the TNsnv algorithm may require the following ALGO_OPTION:

  • --normal_sample SAMPLE_NAME: name of the SM tag in the BAM file for the normal sample.
  • --detect_pon: indicates that you are using the TNsnv algorithm to create a VCF file that will be part of a panel of normal.
  • --cosmic COSMIC_VCF: location of the Catalogue of Somatic Mutations in Cancer (COSMIC) VCF file used to create the panel of normal file. Only one file is supported.
  • --pon PANEL_OF_NORMAL_VCF: location of the file containing the variants detected in the Panel of Normal analysis that will be used to remove false positives. Only one file is supported.

The TNsnv algorithm accepts the following optional ALGO_OPTION:

  • --dbsnp dbSNP_FILE: location of the Single Nucleotide Polymorphism database (dbSNP). The variants in the dbSNP will be more likely to be marked as germline as they require more evidence of absence in the normal. Only one file is supported.
  • --call_stats_out CALL_STATS_FILE: location and filename of the file containing the call stats information from the somatic variant calling.
  • --stdcov_out COVERAGE_FILE: location and filename of the wiggle file containing the standard coverage.
  • --tumor_depth_out TUMOR_DEPTH_FILE: location and filename of the wiggle file containing the depth of the tumor sample reads.
  • --normal_depth_out NORMAL_DEPTH_FILE: location and filename of the wiggle file containing the depth of the normal sample reads.
  • --power_out POWER_FILE: location and filename of the power file.
  • --min_base_qual QUALITY: determines the filtering quality of the bases used in variant calling. Any base with quality less than QUALITY will be ignored. The default value is 5.
  • --min_init_tumor_lod NUMBER: minimum tumor log odds in the initial pass calling variants. The default value is 4.
  • --min_tumor_lod NUMBER: minimum tumor log odds in the final call of variants. The default value is 6.3.
  • --min_normal_lod NUMBER: minimum normal log odds used to check that the tumor variant is not a normal variant. The default value is 2.2.
  • --contamination_frac NUMBER: estimation of the contamination fraction from other samples. The default value is 0.02.
  • --min_cell_mutation_frac NUMBER: minimum fraction of cells which have mutation. The default value is 0.
  • --min_strand_bias_lod NUMBER: minimum log odds for calling strand bias. The default value is 2.
  • --min_strand_bias_power NUMBER: minimum power for calling strand bias. The default value is 0.9.
  • --min_dbsnp_normal_lod NUMBER: minimum log odds for calling normal non-variant at dbsnp sites. The default value is 5.5.
  • --min_normal_allele_frac NUMBER: minimum allele fraction to be considered in normal; this parameter is useful when the normal sample is contaminated with the tumor sample. The default value is 0.
  • --min_tumor_allele_frac NUMBER: minimum allelic fraction in tumor sample. The default value is 0.005.
  • --max_indel NUMBER: maximum nearby indel events that are allowed. The default value is 3.
  • --max_read_clip_frac NUMBER: maximum fraction of soft/hard clipped bases in a read. The default value is 0.3.
  • --max_mapq0_frac NUMBER: maximum fraction of reads whose mapping quality is 0, used to determine poorly mapped regions. The default value is 0.5.
  • --min_pir_median NUMBER: minimum read position median. The default value is 10.
  • --min_pir_mad NUMBER: minimum read position median absolute deviation. The default value is 3.
  • --max_alt_mapq NUMBER: maximum value of alt allele mapping quality score. The default value is 20.
  • --max_normal_alt_cnt NUMBER: maximum alt alleles count in normal pileup. The default value is 2.
  • --max_normal_alt_qsum NUMBER: maximum quality score sum of alt allele in normal pileup. The default value is 20.
  • --max_normal_alt_frac NUMBER: maximum fraction of alt allele in normal pileup. The default value is 0.03.
  • --power_allele_frac NUMBER: allele fraction used in power calculations. The default value is 0.3.
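
For illustration, a minimal tumor-normal TNsnv sketch; SM tag names and file names are placeholders:

sentieon driver -t 16 -r reference.fasta -i tumor_deduped.bam -i normal_deduped.bam \
  --algo TNsnv --tumor_sample tumor_sample_name --normal_sample normal_sample_name \
  --dbsnp dbsnp.vcf.gz --call_stats_out call_stats.txt output_tnsnv.vcf.gz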

8.1.2.24. TNhaplotyper ALGORITHM

The TNhaplotyper algorithm performs the somatic variant calling on the tumor-normal matched pair or the tumor and panel of normal data, using a Haplotyper algorithm.

The input to the TNhaplotyper algorithm is a BAM file; its output is a VCF file.

The TNhaplotyper algorithm requires the following ALGO_OPTION:

  • --tumor_sample SAMPLE_NAME: name of the SM tag in the BAM file for the tumor sample.

Depending on the mode in which it is run, the TNhaplotyper algorithm may require the following ALGO_OPTION:

  • --normal_sample SAMPLE_NAME: name of the SM tag in the BAM file for the normal sample.
  • --detect_pon: indicates that you are using the TNhaplotyper algorithm to create a VCF file that will be part of a panel of normal.
  • --cosmic COSMIC_VCF: location of the Catalogue of Somatic Mutations in Cancer (COSMIC) VCF file used to create the panel of normal file. Only one file is supported.
  • --pon PANEL_OF_NORMAL_VCF: location of the file containing the variants detected in the Panel of Normal analysis that will be used to remove false positives. Only one file is supported.

The TNhaplotyper algorithm accepts the following optional ALGO_OPTION:

  • --dbsnp dbSNP_FILE: location of the Single Nucleotide Polymorphism database (dbSNP). The variants in the dbSNP will be more likely to be marked as germline as they require more evidence of absence in the normal. Only one file is supported.
  • --min_base_qual QUALITY: determines the filtering quality of the bases used in variant calling. Any base with quality less than QUALITY will be ignored. The default value is 10.
  • --prune_factor FACTOR: minimum pruning factor in local assembly; paths with fewer supporting kmers than FACTOR will be pruned from the graph. The default value is 2.
  • --pcr_indel_model MODEL: PCR indel model used to weed out false positive indels more or less aggressively. The possible MODELs are NONE (used for PCR-free samples), HOSTILE, AGGRESSIVE and CONSERVATIVE, in order of decreasing aggressiveness. The default value is HOSTILE.
  • --phasing [1/0]: flag to enable or disable phasing in the output.
  • --min_init_tumor_lod NUMBER: minimum tumor log odds in the initial pass calling variants. The default value is 4.
  • --min_init_normal_lod NUMBER: minimum normal log odds in the initial pass calling variants. The default value is 0.5.
  • --min_tumor_lod NUMBER: minimum tumor log odds in the final call of variants. The default value is 6.3.
  • --min_normal_lod NUMBER: minimum normal log odds used to check that the tumor variant is not a normal variant. The default value is 2.2.
  • --min_strand_bias_lod NUMBER: minimum log odds for calling strand bias. The default value is 2.
  • --min_strand_bias_power NUMBER: minimum power for calling strand bias. The default value is 0.9.
  • --min_pir_median NUMBER: minimum read position median. The default value is 10.
  • --min_pir_mad NUMBER: minimum read position median absolute deviation. The default value is 3.
  • --max_normal_alt_cnt NUMBER: maximum alt alleles count in normal pileup. The default value is 2.
  • --max_normal_alt_qsum NUMBER: maximum quality score sum of alt allele in normal pileup. The default value is 20.
  • --max_normal_alt_frac NUMBER: maximum fraction of alt allele in normal pileup. The default value is 0.03.
  • --tumor_contamination_frac NUMBER: estimation of the contamination fraction on the tumor sample from other samples. The default value is 0.
  • --normal_contamination_frac NUMBER: estimation of the contamination fraction on the normal sample from other samples. The default value is 0.
  • --filter_clustered_read_position: filters variants that are clustered at the start or end of sequencing reads.
  • --filter_strand_bias: filters variants that show evidence of strand bias.
  • --bam_output OUTPUT_BAM: output a BAM file containing modified reads after the local reassembly done by the variant calling. This option should only be used in conjunction with a small bed file for troubleshooting purposes.
  • --trim_soft_clip: determines whether soft clipped bases in the reads should be excluded from the variant calling.

8.1.2.25. TNhaplotyper2 ALGORITHM

The TNhaplotyper2 algorithm performs the somatic variant calling on the tumor-normal matched pair or the tumor and panel of normal data, using a Haplotyper algorithm.

The input to the TNhaplotyper2 algorithm is a BAM file; its output is a VCF file.

The TNhaplotyper2 algorithm requires the following ALGO_OPTION:

  • --tumor_sample SAMPLE_NAME: name of the SM tag in the BAM file for the tumor sample.

Depending on the mode in which it is run, the TNhaplotyper2 algorithm may require the following ALGO_OPTION:

  • --normal_sample SAMPLE_NAME: name of the SM tag in the BAM file for the normal sample.
  • --pon PANEL_OF_NORMAL_VCF: location of the file containing the variants detected in the Panel of Normal analysis that will be used to remove false positives. Only one file is supported.

The TNhaplotyper2 algorithm accepts the following optional ALGO_OPTION:

  • --min_base_qual QUALITY: determines the filtering quality of the bases used in variant calling. Any base with quality less than QUALITY will be ignored. The default value is 10.
  • --prune_factor FACTOR: minimum pruning factor in local assembly; paths with fewer supporting kmers than FACTOR will be pruned from the graph. The default value is 2.
  • --pcr_indel_model MODEL: PCR indel model used to weed out false positive indels more or less aggressively. The possible MODELs are NONE (used for PCR-free samples), HOSTILE, AGGRESSIVE and CONSERVATIVE, in order of decreasing aggressiveness. The default value is CONSERVATIVE.
  • --min_init_tumor_lod NUMBER: minimum tumor log odds in the initial pass calling variants. The default value is 2.0.
  • --min_tumor_lod NUMBER: minimum tumor log odds in the final call of variants. The default value is 3.0.
  • --min_normal_lod NUMBER: minimum normal log odds used to check that the tumor variant is not a normal variant. The default value is 2.2.
  • --tumor_contamination_frac NUMBER: estimation of the contamination fraction on the tumor sample from other samples. The default value is 0.
  • --normal_contamination_frac NUMBER: estimation of the contamination fraction on the normal sample from other samples. The default value is 0.
  • --germline_vcf VCF: location of the VCF containing the population allele frequency.
  • --default_af AF: determines the allele frequency (AF) value for alleles not found in the germline VCF. The default value is 0.001.
  • --max_germline_af AF: determines the maximum germline allele frequency in tumor-only model. The default value is 0.01.
  • --call_pon_sites: determines whether to call candidate variants even if they are present in the Panel of Normal input.
  • --bam_output OUTPUT_BAM: output a BAM file containing modified reads after the local reassembly done by the variant calling. This option should only be used in conjunction with a small bed file for troubleshooting purposes.
  • --trim_soft_clip: determines whether soft clipped bases in the reads should be excluded from the variant calling.
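
For illustration, a minimal tumor-only TNhaplotyper2 sketch using a panel of normal and a population germline resource; file names and sample names are placeholders:

sentieon driver -t 16 -r reference.fasta -i tumor_deduped.bam \
  --algo TNhaplotyper2 --tumor_sample tumor_sample_name \
  --pon panel_of_normal.vcf.gz --germline_vcf population_af.vcf.gz \
  output_tnhaplotyper2.vcf.gz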

8.1.2.26. TNscope ALGORITHM

The TNscope algorithm performs the somatic variant calling on the tumor-normal matched pair or the tumor only data, using a Haplotyper algorithm.

The input to the TNscope algorithm is a BAM file; its output is a VCF file.

The TNscope algorithm requires the following ALGO_OPTION:

  • --tumor_sample SAMPLE_NAME: name of the SM tag in the BAM file for the tumor sample.

The TNscope algorithm accepts the following optional ALGO_OPTION:

  • --normal_sample SAMPLE_NAME: name of the SM tag in the BAM file for the normal sample. When doing tumor-only somatic calling, this argument is not required.
  • --cosmic COSMIC_VCF: location of the Catalogue of Somatic Mutations in Cancer (COSMIC) VCF file used to create the panel of normal file. Only one file is supported.
  • --pon PANEL_OF_NORMAL_VCF: location of the file containing the variants detected in the Panel of Normal analysis that will be used to remove false positives. Only one file is supported. This file is the same as the one used with TNhaplotyper.
  • --dbsnp dbSNP_FILE: location of the Single Nucleotide Polymorphism database (dbSNP). The variants in the dbSNP will be more likely to be marked as germline as they require more evidence of absence in the normal. Only one file is supported.
  • --min_base_qual QUALITY: determines the filtering quality of the bases used in variant calling. Any base with quality less than QUALITY will be ignored. The default value is 15.
  • --prune_factor FACTOR: minimum pruning factor in local assembly; paths with fewer supporting kmers than FACTOR will be pruned from the graph. The default value is 2.
  • --pcr_indel_model MODEL: PCR indel model used to weed out false positive indels more or less aggressively. The possible MODELs are NONE (used for PCR-free samples), HOSTILE, AGGRESSIVE and CONSERVATIVE, in order of decreasing aggressiveness. The default value is HOSTILE.
  • --phasing [1/0]: flag to enable or disable phasing in the output.
  • --min_init_tumor_lod NUMBER: minimum tumor log odds in the initial pass calling variants. The default value is 4.
  • --min_init_normal_lod NUMBER: minimum normal log odds in the initial pass calling variants. The default value is 0.5.
  • --min_tumor_lod NUMBER: minimum tumor log odds in the final call of variants. The default value is 6.3.
  • --min_normal_lod NUMBER: minimum normal log odds used to check that the tumor variant is not a normal variant. The default value is 2.2.
  • --tumor_contamination_frac NUMBER: estimation of the contamination fraction on the tumor sample from other samples. The default value is 0.
  • --normal_contamination_frac NUMBER: estimation of the contamination fraction on the normal sample from other samples. The default value is 0.
  • --given GIVEN_VCF: perform variant calling using only the variants provided in the GIVEN_VCF. The calling will only evaluate the locus and alleles provided in the file.
  • --bam_output OUTPUT_BAM: output a BAM file containing modified reads after the local reassembly done by the variant calling. This option should only be used in conjunction with a small bed file for troubleshooting purposes.
  • --disable_detector DETECTOR: disable the variant calling for specific detectors: use 'sv' as DETECTOR to prevent calling of structural variants, and use 'snv_indel' as DETECTOR to prevent calling of small variants.
  • --trim_soft_clip: determines whether soft clipped bases in the reads should be excluded from the variant calling.
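
For illustration, a minimal tumor-normal TNscope sketch that disables structural variant calling; file names and sample names are placeholders:

sentieon driver -t 16 -r reference.fasta -i tumor_deduped.bam -i normal_deduped.bam \
  --algo TNscope --tumor_sample tumor_sample_name --normal_sample normal_sample_name \
  --dbsnp dbsnp.vcf.gz --disable_detector sv output_tnscope.vcf.gz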

8.1.2.27. RNASplitReadsAtJunction ALGORITHM

The RNASplitReadsAtJunction algorithm splits reads into exon segments, removing the Ns spanning the splice junctions while maintaining grouping information, and hard-clips any sequences overhanging into the intron regions.

The input to the RNASplitReadsAtJunction algorithm is a BAM file; its output is a BAM file.

The RNASplitReadsAtJunction algorithm requires the following ALGO_OPTION:

  • --reassign_mapq IN_QUAL:OUT_QUAL: the algorithm will reassign mapping qualities from IN_QUAL to OUT_QUAL. This argument is required because STAR assigns a quality of 255 to good alignments instead of the expected default score of 60.

The RNASplitReadsAtJunction algorithm accepts the following optional ALGO_OPTION:

  • --ignore_overhang: determines whether to ignore and not fix the overhanging sections of the reads.
  • --overhang_max_bases NUMBER: determines the maximum number of bases allowed in a hard-clipped overhang, so that if there are more bases in the overhang, the overhang will not be hard-clipped. The default value is 40.
  • --overhang_max_mismatches NUMBER: determines the maximum number of mismatches allowed in a non-hard-clipped overhang, so that the complete overhang will be hard-clipped if the number of mismatches is too high. The default value is 1.
  • --cram_write_options compressor=[gzip|bzip2|lzma|rans]: CRAM output compression options. compressor=gzip is default if not defined.
  • --cram_write_options version=[2.1|3.0]: CRAM output version options. version=2.1 is default if not defined.
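
For illustration, a minimal RNASplitReadsAtJunction sketch for STAR-aligned RNA reads; file names are placeholders:

sentieon driver -t 16 -r reference.fasta -i rna_deduped.bam \
  --algo RNASplitReadsAtJunction --reassign_mapq 255:60 split.bam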

8.1.2.28. ContaminationAssessment ALGORITHM

The ContaminationAssessment algorithm assesses the contamination present in a sample BAM file; the output of this algorithm can be used as the value of the contamination_frac, normal_contamination_frac and tumor_contamination_frac arguments in the TNseq and TNscope tools.

The input to the ContaminationAssessment algorithm is a BAM file; its output is a text file.

The ContaminationAssessment algorithm requires the following ALGO_OPTION:

  • --pop_vcf VCF_FILE: the location of the VCF file containing the allele frequency information for the specific population of the sample.
  • --genotype_vcf VCF_FILE: the location of the VCF file containing the DNAseq variants reported for the individual; to calculate the contamination in the tumor sample, you should use the DNAseq variants reported for the normal sample. You can create this file by using Haplotyper or Genotyper on the sample BAM file.

The ContaminationAssessment algorithm accepts the following optional ALGO_OPTION:

  • --type ASSESS_TYPE: determines the type for the estimate. The possible values are SAMPLE, READGROUP and META to assess the contamination by sample, by lane, or in aggregate across all the reads. Multiple instances of the option are allowed, to assess the contamination at multiple levels. The default value is META.
  • --min_base_qual QUALITY: determines the filtering quality of the bases used in the contamination assessment. Any base with quality less than QUALITY will be ignored. The default value is 20.
  • --min_map_qual QUALITY: determines the filtering quality of the reads used in the contamination assessment. Any read with quality less than QUALITY will be ignored. The default value is 20.
  • --min_basecount NUMBER: determines the minimum number of bases that need to be present at a locus before the contamination is assessed. The default value is 500.
  • --trim_thresh NUMBER: threshold that will be used to trim sites; if the probability of the contamination ratio being larger than 0.5 is larger than the threshold, the site will not be included in the contamination assessment. The default value is 0.95.
  • --trim_frac NUMBER: determines the maximum fraction of sites that may be trimmed based on the trim threshold. The default value is 0.01.
  • --precision NUMBER: determines the precision of the output percentage. The default value is 0.1.
  • --base_report FILE: location and filename of the output file that will contain an extended report about the processed data.
  • --population POPULATION_NAME: a population to use to determine the baseline allele frequency of the sample. The default value is CEU.
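
For example, the following driver call assesses the contamination of a tumor BAM file using the DNAseq variants of the matched normal sample (a sketch; the VCF and output file names are placeholders):

sentieon driver -t NUMBER_THREADS -r REFERENCE -i TUMOR_SORTED_DEDUPED.bam \
  --algo ContaminationAssessment --pop_vcf POPULATION_AF.vcf \
  --genotype_vcf NORMAL_DNASEQ.vcf CONTAMINATION.txt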

8.1.2.29. TNModelApply ALGORITHM

The TNModelApply algorithm applies a Machine Learning model on the results of TNscope to help with variant filtration. This algorithm is only supported in the Linux version of the Sentieon Genomics software.

The TNModelApply algorithm has no input at the driver level; its output is a VCF file.

The TNModelApply algorithm requires the following ALGO_OPTION:

  • -v INPUT: location of the VCF file from the TNscope variant calling; you can use a VCF file compressed with bgzip and indexed.
  • -m MODEL_FILE: location of the file containing the Machine Learning model.
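
A driver call applying the model could look like the following sketch, where the input VCF, model file and output file names are placeholders:

sentieon driver -t NUMBER_THREADS -r REFERENCE --algo TNModelApply \
  -v TNSCOPE_TMP.vcf.gz -m TNSCOPE.model TNSCOPE_FILTERED.vcf.gz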

The TNModelApply algorithm modifies the input VCF file by adding the MLrejected FILTER to the variants; since the FILTER is added, you may want to remove any FILTERs already present in the input VCF, as they may no longer be relevant. You can use bcftools (https://samtools.github.io/bcftools/bcftools.html) for that purpose:

$BCF/bcftools annotate -x "^FILTER/MLrejected,FILTER/PASS" -O z \
  -o $OUTPUT.vcf.gz $INPUT.vcf.gz

8.1.2.30. DNAscope ALGORITHM

The DNAscope algorithm performs an improved version of Haplotype variant calling.

The input to the DNAscope algorithm is a BAM file; its output is a VCF file.

The DNAscope algorithm accepts the following optional ALGO_OPTION:

  • --annotation 'ANNOTATION_LIST': determines additional annotations that will be added to the output VCF. Use a comma separated list to enable or disable annotations. Include 'none' to remove the default annotations; prefix annotations with the exclamation point (!) to disable the specific annotation. See Section 9.3 for supported annotations.
  • -d dbSNP_FILE: location of the Single Nucleotide Polymorphism database (dbSNP) used to label known variants. Only one file is supported.
  • --var_type VARIANT_TYPE: determines which variant types will be called; VARIANT_TYPE is a comma separated list of the following possible values:
    • SNP to call Single Nucleotide Polymorphism. This is included in the default behavior.
    • INDEL to call insertion-deletions. This is included in the default behavior.
    • BND to call break-end information required for the structural variant caller.
  • --call_conf CONFIDENCE: determines the threshold of variant quality to call a variant. Variants with quality less than CONFIDENCE will be removed.
  • --emit_conf CONFIDENCE: determines the threshold of variant quality to emit a variant. Variants with quality less than CONFIDENCE will not be added to the output VCF file.
  • --emit_mode MODE: determines what calls will be emitted. Possible values for mode are:
    • variant: emit calls only at confident variant sites. This is the default behavior.
    • confident: emit calls at confident variant sites or confident reference sites.
    • all: emit all calls, regardless of their confidence.
    • gvcf: emits additional information required for joint calling. This option is required if you want to perform joint calling using the GVCFtyper algorithm.
  • --gq_bands LIST_OF_BANDS: determines the bands that will be used to compress variants of similar genotype quality (GQ) that will be emitted as a single VCF record in the GVCF output file. The LIST_OF_BANDS is a comma-separated list of bands where each band is defined by START-END/STEP. The default is 1-60,60-99/10,99.
  • --min_base_qual QUALITY: determines the filtering quality of the bases used in variant calling. Any base with quality less than QUALITY will be ignored. The default value is 10.
  • --pcr_indel_model MODEL: PCR indel model used to weed out false positive indels more or less aggressively. The possible MODELs are: NONE (used for PCR free samples), and HOSTILE, AGGRESSIVE and CONSERVATIVE, in order of decreasing aggressiveness. The default value is CONSERVATIVE.
  • --phasing [1/0]: flag to enable or disable phasing in the output when using emit_mode gvcf. The default value is 1 (on); this flag has no impact when using an emit_mode other than gvcf. Phasing is only calculated for diploid samples.
  • --ploidy PLOIDY: determines the ploidy number of the sample being processed. The default is 2.
  • --prune_factor FACTOR: minimum pruning factor in local assembly; paths with fewer supporting kmers than FACTOR will be pruned from the graph. The default value is 2.
  • --trim_soft_clip: determines whether soft clipped bases in the reads should be excluded from the variant calling.
  • --given GIVEN_VCF: perform variant calling using only the variants provided in the GIVEN_VCF. The calling will only evaluate the locus and alleles provided in the file. This option cannot be used in conjunction with --emit_mode gvcf.
  • --bam_output OUTPUT_BAM: output a BAM file containing modified reads after the local reassembly done by the variant calling. This option should only be used in conjunction with a small bed file for troubleshooting purposes.
  • --filter_chimeric_reads: determines whether chimeric reads will be used when calling variants. The default is to include chimeric reads only if var_type BND is set.
  • --snp_heterozygosity HETEROZYGOSITY: determines the expected heterozygosity value used to compute prior likelihoods in SNP calling.
  • --indel_heterozygosity HETEROZYGOSITY: determines the expected heterozygosity value used to compute prior likelihoods in INDEL calling.
  • --model MODEL_FILE: the location of the machine learning model file that will be used with the DNAModelApply tool; the model will be used to determine the settings used in variant calling.
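
For example, the following driver call calls SNPs, INDELs and break-end information on a sample (a sketch; the dbSNP, input and output file names are placeholders):

sentieon driver -t NUMBER_THREADS -r REFERENCE -i SORTED_DEDUPED.bam \
  --algo DNAscope -d dbSNP.vcf --var_type SNP,INDEL,BND OUTPUT.vcf.gz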

8.1.2.31. DNAModelApply ALGORITHM

The DNAModelApply algorithm performs the second step of variant calling using DNAscope. This algorithm is only supported in the Linux version of the Sentieon Genomics software.

The DNAModelApply algorithm has no input at the driver level; its output is a VCF file containing the variants.

The DNAModelApply algorithm requires the following ALGO_OPTION:

  • -v INPUT: location of the VCF file from the DNAscope variant calling algorithm performed with an input model file determining the correct settings. You can use VCF files compressed with bgzip and indexed.
  • -m MODEL_FILE: the location of the machine learning model file; this file should be the same as the one used in the DNAscope command to generate the input VCF.

The DNAModelApply algorithm does not support any optional ALGO_OPTION.
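
A typical command could look like the following sketch; the input VCF, model file and output file names are placeholders:

sentieon driver -t NUMBER_THREADS -r REFERENCE --algo DNAModelApply \
  -v DNASCOPE_TMP.vcf.gz -m DNASCOPE.model DNASCOPE_FILTERED.vcf.gz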

The DNAModelApply algorithm modifies the input VCF file by adding the MLrejected FILTER to the variants; since the FILTER is added, you may want to remove any FILTERs already present in the input VCF, as they may no longer be relevant. You can use bcftools (https://samtools.github.io/bcftools/bcftools.html) for that purpose:

$BCF/bcftools annotate -x "^FILTER/MLrejected,FILTER/PASS" -O z \
  -o $OUTPUT.vcf.gz $INPUT.vcf.gz

8.1.2.32. SVSolver ALGORITHM

The SVSolver algorithm performs the structural variant calling of a sample, provided that the sample has been previously processed using the DNAscope algorithm with the option --var_type bnd.

The SVSolver algorithm has no input at the driver level; its output is a VCF file containing the called structural variants.

The SVSolver algorithm requires the following ALGO_OPTION:

  • -v INPUT: location of the VCF file from the DNAscope variant calling algorithm performed on a sample using the option --var_type bnd. You can use VCF files compressed with bgzip and indexed.

The SVSolver algorithm does not support any optional ALGO_OPTION.
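
For example, the following driver call solves the structural variants from a DNAscope run that used --var_type bnd (a sketch; the file names are placeholders):

sentieon driver -t NUMBER_THREADS -r REFERENCE --algo SVSolver \
  -v DNASCOPE_BND_TMP.vcf.gz STRUCTURAL_VARIANTS.vcf.gz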

8.1.3. DRIVER read_filter options

The --read_filter argument of the DRIVER binary allows you to filter or transform reads from the input BAM file before performing the calculation.

The syntax for the argument is: --read_filter FILTER,OPTION=VALUE,OPTION=VALUE,…

8.1.3.1. QualCalFilter read_filter

The QualCalFilter read filter is used to transform reads and perform base quality score recalibration while modifying the information contained in the recalibration table.

The QualCalFilter read filter requires one of the following OPTION:

  • table=TABLE_FILEPATH: the location of the recalibration table that will be used as the basis to perform the base quality score recalibration.
  • use_oq=[true/false]: determines whether to use the original base quality scores contained in the OQ tag in the BAM file. This option cannot be used in conjunction with the table option.

The QualCalFilter read filter accepts the following optional OPTION:

  • prior=PRIOR: determines the global bias for all the base quality scores.
  • min_qual=QUAL: determines the quality threshold to perform recalibration; bases with quality scores less than QUAL will not be recalibrated.
  • levels=LEVEL1/LEVEL2/…: determines the static quantization levels of the base quality scores.
  • indel=[true/false]: determines whether to add the base quality scores for INDELs into the BAM tags.
  • keep_oq=[true/false]: determines whether to keep the original (pre-recalibration) base quality scores in the OQ tag of the BAM file.
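
For example, the following driver call recalibrates base quality scores on the fly from an existing recalibration table, keeps the original scores in the OQ tag and quantizes the output qualities before running DNAscope (a sketch; the option values are illustrative and the file names are placeholders):

sentieon driver -t NUMBER_THREADS -r REFERENCE -i INPUT.bam \
  --read_filter QualCalFilter,table=RECAL_DATA.table,keep_oq=true,levels=10/20/30 \
  --algo DNAscope OUTPUT.vcf.gz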

8.1.3.2. MapQualFilter read_filter

The MapQualFilter read filter is used to filter reads by mapping quality.

The MapQualFilter read filter accepts the following optional OPTION:

  • min_map_qual=QUALITY: filter out reads with mapping quality less than QUALITY.

8.1.3.3. OverclippingFilter read_filter

The OverclippingFilter read filter is used to filter reads depending on their soft clipping characteristics.

The OverclippingFilter read filter accepts the following optional OPTION:

  • min_align_len=LENGTH: filter reads where the number of bases that are not soft clipped is less than LENGTH.
  • count_both_ends=[true/false]: if set to true, only filter reads where both ends of the read are soft clipped; reads with soft clipping on one end only will not be filtered, regardless of their non-soft-clipped length.
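
For example, the following driver call removes reads whose non-soft-clipped length is below 30 bases before running DNAscope (a sketch; the threshold is illustrative and the file names are placeholders):

sentieon driver -t NUMBER_THREADS -r REFERENCE -i INPUT.bam \
  --read_filter OverclippingFilter,min_align_len=30,count_both_ends=false \
  --algo DNAscope OUTPUT.vcf.gz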

8.2. UTIL binary

UTIL is the binary used to run some utility functions. This binary is mainly used to process the raw reads output from BWA.

8.2.1. UTIL syntax

The general syntax of the UTIL binary is:

sentieon util MODE [OPTIONS]

The supported modes (MODE) for this command are:

  • index: build the index for a BAM file. The following command will generate a bai BAM index file at the same location as the input file:

    sentieon util index INPUT.bam
    
  • vcfindex: build the index for a VCF file. The following command will generate an idx VCF index file at the same location as the input file:

    sentieon util vcfindex INPUT.vcf
    
  • sort: sort a BAM file. The optional arguments (OPTIONS) for the UTIL command using the sort MODE include:

    • -t NUMBER_THREADS: number of computing threads that will be used by the software to run parallel processes. The default is as many threads as available in the server.
    • -r REFERENCE: location of the reference FASTA file. This argument
      is required if you are using a CRAM output file, otherwise it is optional.
    • -i INPUT: location of the input file.
    • -o OUTPUT: the location and filename of the output file.
    • --temp_dir DIRECTORY: determines where the temporary files will be stored. The default is the folder where the command is run ($PWD).
    • --cram_read_options decode_md=0: CRAM input option to turn off the NM/MD tag in the input CRAM.
    • --cram_write_options compressor=[gzip|bzip2|lzma|rans]: CRAM output compression options. compressor=gzip is default if not defined.
    • --cram_write_options version=[2.1|3.0]: CRAM output version options. version=2.1 is default if not defined.
    • --bam_compression COMPRESSION_LEVEL[0-9]: gzip compression level for the output BAM file. The default value is 6.
    • --skip_no_coor: determines whether to skip unmapped reads.
    • --sam2bam: indicates that the input will be in the form of an uncompressed SAM file, that needs to be converted to BAM. If this option is not used, the input should have been converted to BAM format from the BWA output using samtools.
    • --block_size BLOCK_SIZE: size of the block to be used for sorting.

    The following command will convert an uncompressed SAM input into a sorted BAM file:

    sentieon util sort -t NUMBER_THREADS --sam2bam -i INPUT.sam -o OUTPUT.bam
    
  • vcfconvert: compress and decompress VCF and GVCF files.

    The following command will compress and index the input file:

    sentieon util vcfconvert INPUT.vcf OUTPUT.vcf.gz
    

    The following command will decompress a non-indexed vcf file generated with gzip and then compress and index the file. When using this command make sure that the INPUT and OUTPUT files are not the same:

    sentieon util vcfconvert INPUT.gz OUTPUT.vcf.gz
    
  • stream: perform base quality correction in streaming mode. The optional arguments (OPTIONS) for the UTIL command using the stream MODE include:

    • -t NUMBER_THREADS: number of computing threads that will be used by the software to run parallel processes. The default value is 1.
    • --output_format FORMAT: determines the format of the output stream. The possible FORMAT values are BAM or CRAM. The default is BAM.
    • --output_index_file OUTPUT_INDEX: determines where the corresponding index file will be created.

    The following command will apply the recalibration and output the corresponding recalibrated BAM file to stdout:

    sentieon util stream -i INPUT.bam -q RECAL_TABLE \
      --output_index_file OUTPUT.bam.bai -o -
    

The above command will not generate an index file unless the option --output_index_file is included.

8.3. TNHAPFILTER script

TNHAPFILTER is a script used to filter the results of the TNhaplotyper2 algorithm.

8.3.1. TNHAPFILTER syntax

The general syntax of the TNHAPFILTER script is:

sentieon tnhapfilter --tumor_sample TUMOR_SAMPLE_NAME -v TMP_OUT_TN_VCF
  OUT_TN_VCF

The following inputs are required for the command:

  • TUMOR_SAMPLE_NAME: sample name used for the tumor sample in the Map reads to reference stage.
  • TMP_OUT_TN_VCF: the location and file name of the file containing the unfiltered variants produced by TNhaplotyper2.
  • OUT_TN_VCF: the location and file name of the output file containing the variants.

The arguments (OPTIONS) for this command include:

  • -v INPUT_VCF: location of the VCF file containing the unfiltered variants produced by TNhaplotyper2. You can use VCF files compressed with bgzip and indexed.
  • --tumor_sample SAMPLE_NAME: name of the SM tag in the BAM file for the tumor sample.
  • --normal_sample SAMPLE_NAME: name of the SM tag in the BAM file for the normal sample, if present.
  • --min_tumor_lod NUMBER: minimum tumor LOD. The default is 5.3.
  • --max_germline_prob NUMBER: maximum germline probability. The default is 0.025.
  • --max_normal_art_lod NUMBER: maximum normal artifact LOD. The default is 0.0.
  • --max_alt_cnt NUMBER: maximum ALT allele count. The default is 1.
  • --max_event_cnt NUMBER: maximum events in region. The default is 2.
  • --min_median_base_qual NUMBER: minimum median base quality. The default is 20.
  • --min_median_mapq NUMBER: minimum median mapping quality. The default is 30.
  • --max_diff_fraglen NUMBER: maximum median difference in fragment length. The default is 10000.
  • --min_pir_median NUMBER: minimum median read position. The default is 5.
  • --max_strand_prob NUMBER: maximum strand artifact probability. The default is 0.99.
  • --min_strand_af NUMBER: minimum strand artifact allele fraction. The default is 0.01.
  • --contamination NUMBER: contamination fraction to filter; if NUMBER is greater than 0, the tool will try to remove a fraction of the reads supporting each ALT. The default is 0.0.
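
For example, the following call filters the TNhaplotyper2 output of a tumor-normal pair with a stricter tumor LOD threshold (a sketch; the sample names and file names are placeholders and the threshold is illustrative):

sentieon tnhapfilter --tumor_sample TUMOR_SM --normal_sample NORMAL_SM \
  --min_tumor_lod 6.0 -v TMP_OUT_TN_VCF OUT_TN_VCF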

If you are using Python 2.6.x, you may get the following error when running tnhapfilter: ImportError: No module named argparse. If that is the case, you will need to install the argparse module into your Python installation; you can do this by running pip install argparse or the equivalent command for your package manager.

8.4. PLOT script

PLOT is a script used to create plots of the results of the metrics and recalibration stages. The plots are stored in a PDF file.

8.4.1. PLOT syntax

The general syntax of the PLOT script is:

sentieon plot STAGE -o OUTPUT_FILE INPUTS [OPTIONS]

The supported modes (STAGE) for this command are:

  • GCBias: generate PDF file from the metrics results of GCBias.
  • QualDistribution: generate PDF file from the metrics results of QualDistribution.
  • InsertSizeMetricAlgo: generate PDF file from the metrics results of InsertSizeMetricAlgo.
  • MeanQualityByCycle: generate PDF file from the metrics results of MeanQualityByCycle.
  • QualCal: generate PDF file from the BQSR QualCal tool.
  • VarCal: generate PDF file from the VQSR VarCal tool.

8.4.1.1. PLOT results of GCBias STAGE

The INPUTS to generate the plots from the GCBias stage are:

  • GC_METRIC_TXT: where GC_METRIC_TXT is the output file of the GCBias algorithm from Section 8.1.2.7.

The plotting of the GCBias metrics STAGE does not accept any OPTIONS.
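
For example, the following call creates the PDF report from the GCBias metrics output (a sketch; the file names are placeholders):

sentieon plot GCBias -o GC_METRIC.pdf GC_METRIC_TXT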

8.4.1.2. PLOT results of MeanQualityByCycle STAGE

The INPUTS to generate the plots from the MeanQualityByCycle stage are:

  • MQ_METRIC_TXT: where MQ_METRIC_TXT is the output file of the MeanQualityByCycle algorithm from Section 8.1.2.5.

The plotting of the MeanQualityByCycle metrics STAGE does not accept any OPTIONS.

8.4.1.3. PLOT results of QualDistribution STAGE

The INPUTS to generate the plots from the QualDistribution stage are:

  • QD_METRIC_TXT: where QD_METRIC_TXT is the output file of the QualDistribution algorithm from Section 8.1.2.6.

The plotting of the QualDistribution metrics STAGE does not accept any OPTIONS.

8.4.1.4. PLOT results of InsertSizeMetricAlgo STAGE

The INPUTS to generate the plots from the InsertSizeMetricAlgo stage are:

  • IS_METRIC_TXT: where IS_METRIC_TXT is the output file of the InsertSizeMetricAlgo algorithm from Section 8.1.2.10.

The plotting of the InsertSizeMetricAlgo metrics STAGE does not accept any OPTIONS.

8.4.1.5. PLOT results of QualCal STAGE

The INPUTS to generate the plots from the BQSR stage are:

  • RECAL_RESULT.CSV: the output csv file of the QualCal algorithm from Section 8.1.2.4.

The plotting of the BQSR STAGE does not accept any OPTIONS.

8.4.1.6. PLOT results of VarCal STAGE

The INPUTS to generate the plots from the VQSR stage are:

  • PLOT_FILE: a file created by the VarCal algorithm from Section 8.1.2.21 containing the data required to create the report.

The plotting of the VQSR STAGE accepts the following OPTIONS:

  • tranches_file=TRANCHES_FILE: location of the file containing the partition of the call sets into quality tranches, generated by the VarCal algorithm from Section 8.1.2.21.
  • target_titv=TITV_THRES: expected TiTv number for the species; it is used to calculate the True Positive and False Positive numbers in the plot.
  • min_fp_rate=MIN_RATE: minimum False Positive rate; it is used to calculate the True Positive and False Positive numbers in the plot.
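
For example, the following call creates the VQSR report (a sketch; the TiTv and false positive values are illustrative and the file names are placeholders):

sentieon plot VarCal -o VQSR_REPORT.pdf PLOT_FILE \
  tranches_file=TRANCHES_FILE target_titv=2.15 min_fp_rate=0.001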

8.5. LICSRVR binary

LICSRVR is the binary used to run the license server to facilitate dynamic license assignment and record license utilization within a cluster.

8.5.1. LICSRVR syntax

The general syntax of the LICSRVR binary is:

<SENTIEON_FOLDER>/bin/sentieon licsrvr [--start|--stop] [--log LOG_FILE] LICENSE_FILE

The following inputs are optional for the command:

  • LOG_FILE: location and filename of the output file containing the log of the server.
  • LICENSE_FILE: location of the server license file.

After the license server is operational, the client applications can request license tokens from the server by setting the SENTIEON_LICENSE environment variable to the server address in the form of HOST:PORT.
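
For example, the following commands start the license server and point the client tools to it (a sketch; the log file name and server address are placeholders):

<SENTIEON_FOLDER>/bin/sentieon licsrvr --start --log licsrvr.log LICENSE_FILE
export SENTIEON_LICENSE=HOST:PORT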

The licsrvr binary supports the following additional modes:

  • --version: will report the software package version the binary belongs to.

  • --dump: will report the current status of the license server, including the number of available licenses.

    <SENTIEON_FOLDER>/bin/sentieon licsrvr --dump LICENSE_FILE
    
  • --dump=update: if the license information has been updated and automatically pulled by the license server, this mode will dump the updated license information to stdout. The following command will report the updated license information the license server has, if there has been any change:

    <SENTIEON_FOLDER>/bin/sentieon licsrvr --dump=update LICENSE_FILE
    

8.6. LICCLNT binary

LICCLNT is the binary used to test the license server functionality to help determine whether the license server is operational, and how many licenses of the different algorithms are available.

8.6.1. LICCLNT syntax

The LICCLNT binary has two modes, one to ping the license server and one to check the available licenses for specific algorithms.

You can run the following command to check if the license server is operational:

<SENTIEON_FOLDER>/bin/sentieon licclnt ping --server HOST:PORT

The command will return 0 if the server is operational.

You can run the following command to check how many licenses are available for a given feature:

<SENTIEON_FOLDER>/bin/sentieon licclnt query --server HOST:PORT FEATURE

The command will return the number of licenses available for the specified license feature. You can use this to manage your jobs: before submitting a job that will use a certain number of threads, check whether enough licenses are available for those threads, so the tool is not left idle waiting for licenses.
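
For example, the following shell snippet checks the available license count before submitting a job (a sketch that assumes the query result is printed to standard output; the feature name, thread count and driver command are placeholders):

NT=32
AVAIL=$(<SENTIEON_FOLDER>/bin/sentieon licclnt query --server HOST:PORT FEATURE)
if [ "$AVAIL" -ge "$NT" ]; then
  # enough licenses for the requested number of threads: submit the job
  <SENTIEON_FOLDER>/bin/sentieon driver -t $NT -r REFERENCE -i INPUT.bam \
    --algo DNAscope OUTPUT.vcf.gz
else
  echo "only $AVAIL licenses available for FEATURE, deferring job submission"
fi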

8.7. BWA binary

The BWA binary has two modes of interest, “mem” mode to align FASTQ files against a reference FASTA file, and “shm” mode to load the FASTA index file in memory to be shared among multiple BWA processes running in the same server.

8.7.1. BWA mem syntax

You can run the following command to align a single-ended FASTQ1 file or a pair-ended set of 2 FASTQ files against the FASTA reference; the mapped reads are written to stdout so that they can be piped into util sort:

<SENTIEON_FOLDER>/bin/sentieon bwa mem OPTIONS FASTA FASTQ1 [FASTQ2]

The arguments (OPTIONS) for this command include:

  • -t NUMBER_THREADS: number of computing threads that will be used by the software to run parallel processes. The argument is optional; if omitted the bwa binary will use 1 thread.
  • -p: determines whether the first input FASTQ file contains interleaved pair-ended reads. If this argument is used, only use a single FASTQ input, as the second FASTQ2 file will be ignored.
  • -M: determines whether to mark split reads as secondary.
  • -R READGROUP_STRING: Read Group header line that all reads will be attached to. The recommended READGROUP_STRING is @RG\tID:$readgroup\tSM:$sample\tPL:$platform\tPU:$platform_unit
    • $readgroup is a unique ID that identifies the reads.
    • $sample is the name of the sample the reads belong to.
    • $platform is the sequencing technology, typically ILLUMINA.
    • $platform_unit is the sequencing element that performed the sequencing.
  • -K CHUNK_SIZE: determines the size of the group of reads that will be mapped at the same time. If this argument is not set, the results will depend on the number of threads used.
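
For example, the following command aligns a pair of FASTQ files and pipes the result into util sort to produce a sorted BAM file (a sketch; the read group values, thread count and file names are placeholders):

<SENTIEON_FOLDER>/bin/sentieon bwa mem -t NUMBER_THREADS \
  -R '@RG\tID:readgroup\tSM:sample\tPL:ILLUMINA\tPU:unit' -M -K 1000000 \
  FASTA FASTQ1 FASTQ2 | \
  <SENTIEON_FOLDER>/bin/sentieon util sort -t NUMBER_THREADS -o SORTED.bam --sam2bam -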

8.7.2. BWA shm syntax

You can run the following command to load the FASTA index file in memory:

<SENTIEON_FOLDER>/bin/sentieon bwa shm FASTA

You can run the following command to list the FASTA index files stored in memory:

<SENTIEON_FOLDER>/bin/sentieon bwa shm -l

You can run the following command to remove all FASTA index files stored in memory, thus freeing the memory when it is no longer needed:

<SENTIEON_FOLDER>/bin/sentieon bwa shm -d

The arguments (OPTIONS) for this command include:

  • -t NUMBER_THREADS: number of computing threads that will be used by the software to run parallel processes. The argument is optional; if omitted the bwa binary will use 1 thread.
  • -f FILE: location of a temporary file that will be used to reduce peak memory usage.

8.7.3. Controlling memory usage in BWA

By default BWA will use about 24 GB in a Linux system and 8 GB in a Mac system. You can control the memory usage via the bwt_max_mem environment variable, which can be used to limit the memory usage at the expense of speed performance, or enhance the speed performance by using more memory. For example, you will get faster alignment by adding the following to your scripts:

export bwt_max_mem=50G

Bear in mind that the number you use in the bwt_max_mem environment variable is not a hard limit, but an estimate of the memory used by BWA; as such, it does not guarantee that BWA will use less memory than the provided value.

8.7.4. Using an existing BAM file as input

If you do not have access to the FASTQ inputs, but only have an already aligned and sorted BAM file, you can use it as input and redo the alignment by running samtools:

samtools collate -@ 32 -Ou INPUT_BAM tmp- | samtools fastq -@ 32 -s \
/dev/null -0 /dev/null - | <SENTIEON_FOLDER>/bin/sentieon bwa mem -t 32 -R \
'@RG\tID:id\tLB:lib\tSM:sample\tPL:ILLUMINA' -M -K 1000000 -p $ref /dev/stdin \
| <SENTIEON_FOLDER>/bin/sentieon util sort -t 32 -o OUTPUT_BAM --sam2bam -

Alternatively, you could first create the FASTQ files and then process them as you would normally do:

samtools collate -n -@ 32 -uO INPUT_BAM tmp- | samtools fastq -@ 32 \
-s >(gzip -c > single.fastq.gz) -0 >(gzip -c > unpaired.fastq.gz) \
-1 >(gzip -c > output_1.fastq.gz) -2 >(gzip -c > output_2.fastq.gz) -

If you do this, you may encounter abnormal memory usage in BWA; if that is the case, you can follow the instructions in the section BWA uses an abnormal amount of memory when using FASTQ files created from a BAM file.