3. Typical usage for DNAscope

The Sentieon® Genomics software includes an improved algorithm to perform the variant calling step of germline DNA analysis. The pipeline used for DNAscope is similar to the DNAseq® one described in Typical usage for DNAseq®, but with differences in both alignment and variant calling. DNAscope accepts model files to improve processing speed and accuracy, and it can perform structural variant calling in addition to calling SNPs and small indels. DNAscope is recommended for datasets sequenced from human or other mammalian samples.

3.1. General

In this bioinformatics pipeline you will need the following inputs:

  • The FASTA file containing the nucleotide sequence of the reference genome corresponding to the sample you will analyze.
  • One or multiple FASTQ files containing the nucleotide sequence of the sample to be analyzed. These files contain the raw reads from the DNA sequencing. The software supports inputting FASTQ files compressed using GZIP. The software only supports files containing quality scores in Sanger format (Phred+33).
  • A sequencing platform specific machine learning model file that can be accessed from https://github.com/Sentieon/sentieon-models.
  • (Optional) A BED file containing variant calling intervals. Recommended for whole-exome or targeted sequencing data.
  • (Optional) The Single Nucleotide Polymorphism database (dbSNP) data that you want to include in the pipeline. The data is used in the form of a VCF file; you can use a VCF file compressed with bgzip and indexed.

The following steps compose the typical bioinformatics pipeline for DNAscope:

  1. Map reads to reference: This step aligns the reads contained in the FASTQ files to the reference genome contained in the FASTA file, so that each read can be placed in its genomic context.
  2. Calculate data metrics: This step produces a statistical summary of the quality of the data and of the pipeline's analysis.
  3. Remove or mark duplicates: This step detects reads indicating that the same DNA molecule was sequenced several times. These duplicates are not informative and should not be counted as additional evidence.
  4. Variant calling using DNAscope with a machine learning model: This step identifies the sites where your data displays variation relative to the reference genome, and calculates genotypes for each sample at that site.

3.2. Step by step usage

3.2.1. Map reads to reference

A single command efficiently performs the alignment using BWA, then creates and sorts the BAM file using Sentieon® software:

(sentieon bwa mem -R '@RG\tID:GROUP_NAME\tSM:SAMPLE_NAME\tPL:PLATFORM' \
  -t NUMBER_THREADS -x DNASCOPE_MODEL/bwa.model REFERENCE SAMPLE [SAMPLE2] \
  || echo -n 'error' ) \
  | sentieon util sort -r REFERENCE -o SORTED_BAM -t NUMBER_THREADS --sam2bam -i -

Inputs and options for BWA are described in its manual.

Compared to the BWA usage described in Typical usage for DNAseq®, the DNASCOPE_MODEL is added with the argument -x DNASCOPE_MODEL/bwa.model.

The following inputs are required for the command:

  • GROUP_NAME: Readgroup identifier that will be added to the readgroup header line. The RG:ID needs to be unique among all the datasets that you plan on using, which is important when working with multiple input files as described in Section 8.2, or when performing a Tumor-Normal analysis as described in Section 5.
  • SAMPLE_NAME: name of the sample that will be added to the readgroup header line.
  • PLATFORM: name of the sequencing platform used to sequence the DNA. Possible options are ILLUMINA when the FASTQ files have been produced in an Illumina™ machine; ELEMENT when the FASTQ files have been produced in an Element Biosciences™ machine; DNBSEQ when the FASTQ files have been produced in an MGI™ machine; ULTIMA when the FASTQ files have been produced in an Ultima Genomics™ machine.
  • NUMBER_THREADS: the number of computer threads that will be used in the calculation. We recommend that the number does not exceed the number of computing cores available in your system. We recommend that you use the same number of threads for both BWA and for the util binary.
  • REFERENCE: the location of the reference FASTA file. You should make sure that all additional reference data specified in Table 2.1 is available in the same location with consistent naming.
  • SAMPLE: the location of the sample FASTQ file. If the data comes from paired-end sequencing, you will also need to input SAMPLE2 as the FASTQ file containing the corresponding mate reads.
  • SORTED_BAM: the location and filename of the sorted mapped BAM output file. A corresponding index file (.bai) will be created.
  • DNASCOPE_MODEL: the location of the DNAscope model bundle. The model will be used to determine the settings used in alignment and variant calling.
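
As an illustration of the readgroup inputs above, the following is a minimal sketch of a shell helper that assembles the -R argument and rejects a PLATFORM value that is not in the list above. The function name make_read_group and the example values lane1 and sampleA are hypothetical and are not part of the Sentieon® software.

```shell
# make_read_group is a hypothetical helper, not part of the Sentieon software.
# Usage: make_read_group GROUP_NAME SAMPLE_NAME PLATFORM
make_read_group() {
    rg_id=$1
    rg_sample=$2
    rg_platform=$3
    # Accept only the platform names listed in this section.
    case "$rg_platform" in
        ILLUMINA|ELEMENT|DNBSEQ|ULTIMA) ;;
        *) echo "unsupported PLATFORM: $rg_platform" >&2; return 1 ;;
    esac
    # Print a literal backslash followed by t, as in the command above;
    # the separators are not converted to tab characters here.
    printf '@RG\\tID:%s\\tSM:%s\\tPL:%s\n' "$rg_id" "$rg_sample" "$rg_platform"
}

# Example with hypothetical values.
make_read_group lane1 sampleA ILLUMINA
```

The printed string can then be passed as the value of -R, for example with -R "$(make_read_group lane1 sampleA ILLUMINA)".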

BWA will produce slightly different results depending on the number of threads used in the command. This is because BWA computes the insert size distribution on a chunk of reads whose size depends on the number of threads. To guarantee that the results are independent of the number of threads used, you should fix the chunk size in bases using option -K 10000000.
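
The following minimal sketch shows where the -K option fits among the BWA options used in the command above. The helper name bwa_mem_options is hypothetical; it only prints the options and does not run the alignment.

```shell
# bwa_mem_options is a hypothetical helper, not part of the Sentieon software.
# It prints the thread, chunk size and model options for "sentieon bwa mem".
# Usage: bwa_mem_options NUMBER_THREADS DNASCOPE_MODEL
bwa_mem_options() {
    threads=$1
    model_dir=$2
    # -K fixes the number of input bases per chunk, so the chunk size no
    # longer depends on the value given to -t.
    printf '%s\n' "-t $threads -K 10000000 -x $model_dir/bwa.model"
}

# Print the options for two different thread counts; only -t differs.
bwa_mem_options 8 DNASCOPE_MODEL
bwa_mem_options 32 DNASCOPE_MODEL
```

The printed options replace the -t and -x arguments in the mapping command; the sketch assumes that the model path contains no spaces.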

For the metrics collection and duplicate removal stages, please refer to Section 2.2 for detailed usage instructions.

3.2.2. Germline variant calling with a machine learning model

It is recommended to use DNAscope with a machine learning model to perform variant calling with higher accuracy by improving the candidate detection and filtering.

Sentieon® can provide you with a sequencing platform specific model trained using a subset of the data from the GiAB truth-set found at https://github.com/genome-in-a-bottle. The models were created by processing reference samples HG001-HG007 through a pipeline consisting of Sentieon® BWA-mem alignment and Sentieon® deduplication, and using the variant calling results to calibrate a model to fit the truth-set.

In addition, Sentieon® can assist you in the creation of models using your own data, which will calibrate the model to the specifics of your sequencing and bioinformatics processing.

3.2.2.1. Using a machine learning model with DNAscope

Two individual commands are run to call variants and to apply the machine learning model. The input BAM file should come from a pipeline where only alignment and deduplication have been performed (no BQSR or indel realignment), to match the model creation methodology.

PCRFREE=true #PCRFREE=true means the sample is PCRFree, change it to false for PCR samples.
if [ "$PCRFREE" = true ] ; then
    sentieon driver -t NUMBER_THREADS -r REFERENCE -i DEDUPED_BAM \
    [--interval INTERVAL_FILE] --algo DNAscope [-d dbSNP] \
    --pcr_indel_model none --model DNASCOPE_MODEL/dnascope.model \
    TMP_VARIANT_VCF
else
    sentieon driver -t NUMBER_THREADS -r REFERENCE -i DEDUPED_BAM \
    [--interval INTERVAL_FILE] --algo DNAscope [-d dbSNP] \
    --model DNASCOPE_MODEL/dnascope.model TMP_VARIANT_VCF
fi
sentieon driver -t NUMBER_THREADS -r REFERENCE --algo DNAModelApply \
  --model DNASCOPE_MODEL/dnascope.model -v TMP_VARIANT_VCF VARIANT_VCF

Reminder

It is important to add the option --pcr_indel_model none when running DNAscope if the data you are using is PCR-free.

Depending on whether PCR is involved, DNAscope uses different priors for finding significant INDEL variants; the prior is controlled by the --pcr_indel_model option. The default --pcr_indel_model setting is for PCR samples, so it is important to set --pcr_indel_model none for PCR-free samples.
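
The choice between the two DNAscope commands above depends only on this option. As a minimal sketch, the option can be produced by a small helper so that the rest of the command is written once. The helper name pcr_indel_args is hypothetical.

```shell
# pcr_indel_args is a hypothetical helper, not part of the Sentieon software.
# It prints the extra DNAscope option needed for PCR-free samples and
# prints nothing for PCR samples, which keep the default setting.
# Usage: pcr_indel_args true|false
pcr_indel_args() {
    if [ "$1" = true ]; then
        printf '%s\n' "--pcr_indel_model none"
    fi
}

PCRFREE=true   # change to false for PCR samples
pcr_indel_args "$PCRFREE"
```

The output can be placed after --algo DNAscope in the command, for example as $(pcr_indel_args "$PCRFREE"), left unquoted so that the option and its value are passed as two separate arguments.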

The following inputs are required for the command:

  • NUMBER_THREADS: the number of computer threads that will be used in the calculation. We recommend that the number does not exceed the number of computing cores available in your system.
  • REFERENCE: the location of the reference FASTA file. You should make sure that the reference is the same as the one used in the mapping stage.
  • DEDUPED_BAM: the location of the input BAM file.
  • TMP_VARIANT_VCF: the location and filename of the variant calling output of DNAscope. This is a temporary file.
  • VARIANT_VCF: the location and filename of the variant calling output. A corresponding index file will be created. The tool will output a compressed file by using .gz extension.
  • DNASCOPE_MODEL: the location of the machine learning model file. In the DNAscope command the model will be used to determine the settings used in variant calling.

Reminder

When running DNAscope with a machine learning model, most of the advanced settings are determined by the model itself. Setting specific values for options other than --pcr_indel_model could negatively affect the results.

The following inputs are optional for the command:

  • INTERVAL_FILE: the location of the BED file.
  • dbSNP: the location of the Single Nucleotide Polymorphism database (dbSNP) that will be used to label known variants. You can only use one dbSNP file.

3.2.2.2. Using DNAscope to produce GVCF output files

From version 202112.04 onwards, DNAscope accepts a model to produce variant calls in the Genomic VCF (GVCF) format. The GVCF format contains additional information on sites that are homozygous for the reference allele in the sample being processed. A recently trained DNAscope model is required; using a DNAscope model trained with Sentieon version 202112.01 or earlier will result in an error.
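
The version requirement above can be checked in a script before the GVCF commands are run. The following is a minimal sketch; version_at_least is a hypothetical helper, and it assumes that the release string has already been obtained and has the form YYYYMM or YYYYMM.NN, as in 202112.04.

```shell
# version_at_least is a hypothetical helper, not part of the Sentieon software.
# It succeeds when release HAVE is the same as or newer than release NEED.
# Usage: version_at_least HAVE NEED
version_at_least() {
    have=$1
    need=$2
    # Part before the dot, for example 202112.
    have_major=${have%%.*}
    need_major=${need%%.*}
    # Part after the dot, or 0 when the string has no dot.
    case "$have" in *.*) have_minor=${have#*.} ;; *) have_minor=0 ;; esac
    case "$need" in *.*) need_minor=${need#*.} ;; *) need_minor=0 ;; esac
    if [ "$have_major" -ne "$need_major" ]; then
        [ "$have_major" -gt "$need_major" ]
    else
        [ "$have_minor" -ge "$need_minor" ]
    fi
}

if version_at_least 202112.06 202112.04; then
    echo "GVCF output with a DNAscope model is available"
fi
```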

Two individual commands are run to call variants and to apply the machine learning model. The input BAM file should come from a pipeline where only alignment and deduplication have been performed, to match the model creation methodology.

PCRFREE=true #PCRFREE=true means the sample is PCRFree, change it to false for PCR samples.
if [ "$PCRFREE" = true ] ; then
    sentieon driver -t NUMBER_THREADS -r REFERENCE -i DEDUPED_BAM \
    [--interval INTERVAL_FILE] --algo DNAscope [-d dbSNP] \
    --pcr_indel_model none --model DNASCOPE_MODEL/dnascope.model \
    --emit_mode gvcf TMP_VARIANT_GVCF
else
    sentieon driver -t NUMBER_THREADS -r REFERENCE -i DEDUPED_BAM \
    [--interval INTERVAL_FILE] --algo DNAscope [-d dbSNP] \
    --model DNASCOPE_MODEL/dnascope.model --emit_mode gvcf TMP_VARIANT_GVCF
fi
sentieon driver -t NUMBER_THREADS -r REFERENCE --algo DNAModelApply \
  --model DNASCOPE_MODEL/dnascope.model -v TMP_VARIANT_GVCF VARIANT_GVCF

Reminder

It is important to add the option --pcr_indel_model none when running DNAscope if the data you are using is PCR-free.

Depending on whether PCR is involved, DNAscope uses different priors for finding significant INDEL variants; the prior is controlled by the --pcr_indel_model option. The default --pcr_indel_model setting is for PCR samples, so it is important to set --pcr_indel_model none for PCR-free samples.

The following inputs are required for the command:

  • NUMBER_THREADS: the number of computer threads that will be used in the calculation. We recommend that the number does not exceed the number of computing cores available in your system.
  • REFERENCE: the location of the reference FASTA file. You should make sure that the reference is the same as the one used in the mapping stage.
  • DEDUPED_BAM: the location of the input BAM file.
  • TMP_VARIANT_GVCF: the location and filename of the GVCF output of DNAscope. This is a temporary file.
  • VARIANT_GVCF: the location and filename of the GVCF output. A corresponding index file will be created. The tool will output a compressed file by using .gz extension.
  • DNASCOPE_MODEL: the location of the machine learning model file. In the DNAscope command the model will be used to determine the settings used in variant calling.

The following inputs are optional for the command:

  • INTERVAL_FILE: the location of the BED file.
  • dbSNP: the location of the Single Nucleotide Polymorphism database (dbSNP) that will be used to label known variants. You can only use one dbSNP file.

3.2.2.3. Genotyping GVCF files generated by DNAscope

The GVCF output file can be genotyped either individually or jointly with GVCFs from other samples, using the GVCFtyper algo in Sentieon version 202112.06 and later. The output is a single-sample or multi-sample VCF.

sentieon driver -r REFERENCE --algo GVCFtyper \
  -v s1_VARIANT_GVCF -v s2_VARIANT_GVCF -v s3_VARIANT_GVCF VARIANT_VCF

Please check the Sentieon manual for additional details about the GVCFtyper algo, https://support.sentieon.com/manual/usages/general/#gvcftyper-algorithm.

Reminder

GVCFtyper can be used to genotype DNAscope GVCFs from multiple sequencing platforms into a single multi-sample VCF.

GVCFtyper does not support joint genotyping of DNAscope GVCFs produced with a machine learning model together with GVCFs produced by DNAscope without a machine learning model, or together with GVCFs produced by other tools.
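
When many samples are genotyped jointly, the repeated -v arguments of the GVCFtyper command can be generated from a list of files. The following is a minimal sketch; the helper name gvcftyper_inputs and the file names are hypothetical, and file names containing spaces are not handled.

```shell
# gvcftyper_inputs is a hypothetical helper, not part of the Sentieon software.
# It prints one "-v FILE" pair for every GVCF file given as an argument.
# Usage: gvcftyper_inputs GVCF [GVCF ...]
gvcftyper_inputs() {
    inputs=""
    for gvcf in "$@"; do
        # Add a separating space only when something was already collected.
        inputs="${inputs:+$inputs }-v $gvcf"
    done
    printf '%s\n' "$inputs"
}

# Example with hypothetical file names.
gvcftyper_inputs s1.g.vcf.gz s2.g.vcf.gz s3.g.vcf.gz
```

The printed arguments take the place of the -v options in the command, for example as $(gvcftyper_inputs s1.g.vcf.gz s2.g.vcf.gz s3.g.vcf.gz), left unquoted.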

3.2.3. Structural variant calling

In order to perform structural variant calling you need to add the option to output break-end (BND) information to the DNAscope command; this is done by enabling the bnd (break-end) variant type. Two individual commands are run to perform structural variant calling.

sentieon driver -t NUMBER_THREADS -r REFERENCE -i DEDUPED_BAM \
  --algo DNAscope --var_type bnd \
  [-d dbSNP] TMP_VARIANT_VCF
sentieon driver -t NUMBER_THREADS -r REFERENCE --algo SVSolver  \
  -v TMP_VARIANT_VCF STRUCTURAL_VARIANT_VCF

The following inputs are required for the command:

  • NUMBER_THREADS: the number of computer threads that will be used in the calculation. We recommend that the number does not exceed the number of computing cores available in your system.
  • REFERENCE: the location of the reference FASTA file. You should make sure that the reference is the same as the one used in the mapping stage.
  • DEDUPED_BAM: the location of the input BAM file.
  • TMP_VARIANT_VCF: the location and filename of the variant calling output file from DNAscope, which includes the BND information. This is a temporary file when calling structural variants.
  • STRUCTURAL_VARIANT_VCF: the location and filename of the variant calling output file containing the structural variants. A corresponding index file will be created. The tool will output a compressed file by using .gz extension.

The following inputs are optional for the command:

  • dbSNP: the location of the Single Nucleotide Polymorphism database (dbSNP) that will be used to label known variants. You can only use one dbSNP file.

Note that structural variant calling is incompatible with the DNAscope model input file. In particular, the --model option should be avoided when --var_type bnd is set.
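
A script that assembles the DNAscope arguments can guard against this combination before the command is run. The following is a minimal sketch; check_sv_args is a hypothetical helper that only inspects the argument list it is given and does not call the Sentieon® software.

```shell
# check_sv_args is a hypothetical helper, not part of the Sentieon software.
# It fails when the arguments contain both "--var_type bnd" and "--model".
# Usage: check_sv_args ARG [ARG ...]
check_sv_args() {
    has_bnd=false
    has_model=false
    prev=""
    for arg in "$@"; do
        if [ "$arg" = "--model" ]; then
            has_model=true
        fi
        # The value that follows --var_type; both spellings are treated
        # as a match here to be conservative.
        if [ "$prev" = "--var_type" ]; then
            case "$arg" in
                bnd|BND) has_bnd=true ;;
            esac
        fi
        prev=$arg
    done
    if [ "$has_bnd" = true ] && [ "$has_model" = true ]; then
        echo "--model must not be combined with --var_type bnd" >&2
        return 1
    fi
}

# Example: the structural variant arguments from this section pass the check.
check_sv_args --algo DNAscope --var_type bnd TMP_VARIANT_VCF && echo "arguments OK"
```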