4. Typical usage for DNAscope

The Sentieon Genomics software includes an improved algorithm, DNAscope, for the variant calling step of germline DNA analysis. The DNAscope pipeline is similar to the DNAseq pipeline described in Typical usage for DNAseq. In addition to calling SNPs and small indels, DNAscope can perform structural variant calling.

4.1. General

In this bioinformatics pipeline you will need the following inputs:

  • The FASTA file containing the nucleotide sequence of the reference genome corresponding to the sample you will analyze.
  • One or multiple FASTQ files containing the nucleotide sequence of the sample to be analyzed. These files contain the raw reads from the DNA sequencing. The software supports inputting FASTQ files compressed using GZIP. The software only supports files containing quality scores in Sanger format (Phred+33).
  • (Optional) The Single Nucleotide Polymorphism database (dbSNP) data that you want to include in the pipeline. The data is used in the form of a VCF file; you can use a VCF file compressed with bgzip and indexed.
  • (Optional) Multiple collections of known sites that you want to include in the pipeline. The data is used in the form of a VCF file; you can use a VCF file compressed with bgzip and indexed.
  • (For SNP and small indel calling) A machine learning model file.
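
Because only Phred+33 quality scores are supported, it can be useful to sanity-check a FASTQ file before starting the pipeline. The following is a minimal sketch of such a check; the function name looks_like_phred33 is hypothetical and is not part of the Sentieon software. It relies on the fact that quality characters below ';' (ASCII 59) do not occur in the +64 encodings, so finding one is evidence of Phred+33. A failure is inconclusive rather than proof of another encoding, since a Phred+33 file containing only high-quality bases would also fail the test.

```shell
# Hypothetical helper, not a Sentieon tool.
# Succeeds if the first 10000 reads of FASTQ file $1 contain at least one
# quality character in the ASCII range 33-58 ('!' to ':'), which only
# occurs in Phred+33 encoded files. Plain and gzip-compressed input are
# both accepted, chosen by file extension.
looks_like_phred33() {
  case "$1" in
    *.gz) gzip -dc "$1" ;;
    *)    cat "$1" ;;
  esac |
    head -n 40000 |            # 4 lines per read, so 10000 reads
    awk 'NR % 4 == 0' |        # keep only the quality lines
    LC_ALL=C grep -q '[!-:]'   # any character with ASCII code 33-58
}
```

For example, `looks_like_phred33 sample_1.fastq.gz || echo "encoding unclear"` prints a warning when no low-valued quality character is found; sample_1.fastq.gz is a placeholder filename.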

The following steps compose the typical bioinformatics pipeline for germline analysis of a sample:

  1. Pre-process the sample using a DNAseq pipeline like the one introduced in Section 3, with the following stages:
    1. Map reads to reference.
    2. Calculate data metrics.
    3. Remove duplicates.
    4. (optional) Indel realignment.
    5. Base quality score recalibration (BQSR).
  2. Variant calling using DNAscope: This step identifies the sites where your data displays variation relative to the reference genome, and calculates genotypes for each sample at that site.

4.2. Step by step usage

For the mapping, duplicate removal, indel realignment and base quality score recalibration stages, please refer to Section 3.2 for detailed usage instructions.

4.2.1. Germline variant calling

A single command calls variants and, at the same time, applies the base quality score recalibration calculated earlier. The input BAM file depends on whether the Indel Realignment step was performed: use the realigned BAM if it was, and the deduplicated BAM otherwise.

sentieon driver -t NUMBER_THREADS -r REFERENCE -i REALIGNED_BAM/DEDUP_BAM \
  -q RECAL_DATA.TABLE --algo DNAscope [-d dbSNP] VARIANT_VCF

Alternatively, you can use a previously recalibrated BAM:

sentieon driver -t NUMBER_THREADS -r REFERENCE -i RECALIBRATED_BAM \
    --algo DNAscope [-d dbSNP] VARIANT_VCF

Using the recalibrated BAM, or using the BAM before recalibration together with the recalibration data table, gives identical results. However, do not combine the recalibration data table with an already recalibrated BAM: that would apply the recalibration twice and lead to incorrect results.
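
One way to avoid applying the recalibration twice in a script is to make the input mode explicit. The sketch below is only an illustration: the function dnascope_cmd, its mode names, and the NT and REF variables are hypothetical and not part of the Sentieon software. The function prints the command rather than running it, and refuses to pair a recalibration table with a recalibrated BAM.

```shell
# Hypothetical helper: prints the DNAscope command for one of the two
# supported input modes, so -q is never combined with a recalibrated BAM.
# Usage: dnascope_cmd MODE BAM OUT_VCF [RECAL_TABLE]
#   MODE=table         BAM before recalibration, RECAL_TABLE passed via -q
#   MODE=recalibrated  BAM with the recalibration already applied, no -q
# NT (thread count) and REF (reference FASTA) are read from the environment.
dnascope_cmd() {
  mode=$1 bam=$2 out=$3 table=${4:-}
  case "$mode" in
    table)
      if [ -z "$table" ]; then
        echo "error: table mode needs a recalibration table" >&2
        return 1
      fi
      echo "sentieon driver -t ${NT:-1} -r ${REF:-REFERENCE} -i $bam -q $table --algo DNAscope $out"
      ;;
    recalibrated)
      if [ -n "$table" ]; then
        echo "error: do not pass a recalibration table with a recalibrated BAM" >&2
        return 1
      fi
      echo "sentieon driver -t ${NT:-1} -r ${REF:-REFERENCE} -i $bam --algo DNAscope $out"
      ;;
    *)
      echo "error: unknown mode '$mode'" >&2
      return 1
      ;;
  esac
}
```

Printing the command first makes it easy to review before execution; once it looks right, it can be run by piping the output to sh.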

The following inputs are required for the command:

  • NUMBER_THREADS: the number of computer threads that will be used in the calculation. We recommend that the number does not exceed the number of computing cores available in your system.
  • REFERENCE: the location of the reference FASTA file. You should make sure that the reference is the same as the one used in the mapping stage.
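  • REALIGNED_BAM/DEDUP_BAM: the location where the previous realignment stage stored the result or, if Indel Realignment was skipped, where the duplicate removal stage stored the result.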
  • RECAL_DATA.TABLE: the location where the previous BQSR stage stored the result.
  • RECALIBRATED_BAM: the location of the recalibrated BAM file.
  • VARIANT_VCF: the location and filename of the variant calling output file. A corresponding index file (.idx) will be created. If the filename uses the .gz extension, the tool writes a compressed file.

The following inputs are optional for the command:

  • dbSNP: the location of the Single Nucleotide Polymorphism database (dbSNP) that will be used to label known variants. You can only use one dbSNP file.
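
The thread recommendation and the optional dbSNP argument can be handled in a small script. The two helper functions below are hypothetical and serve only as a sketch: detect_threads reports the number of online cores (falling back to 1 when it cannot be determined), and dnascope_algo_opts adds -d only when a dbSNP file is given, placing it after --algo DNAscope as in the commands above.

```shell
# Hypothetical helper: print the number of online CPU cores, or 1 if the
# value cannot be determined or is not a positive integer.
detect_threads() {
  n=$(getconf _NPROCESSORS_ONLN 2>/dev/null) || n=1
  case "$n" in
    ''|*[!0-9]*) n=1 ;;   # empty or non-numeric output
  esac
  [ "$n" -ge 1 ] || n=1
  echo "$n"
}

# Hypothetical helper: print the algorithm part of the command line.
# $1 is the dbSNP VCF; when it is empty or missing, -d is left out.
dnascope_algo_opts() {
  if [ -n "${1:-}" ]; then
    echo "--algo DNAscope -d $1"
  else
    echo "--algo DNAscope"
  fi
}

# Example use (placeholder filenames, shown as a comment so that nothing
# runs when this file is sourced):
#   sentieon driver -t "$(detect_threads)" -r ref.fa -i recal.bam \
#     $(dnascope_algo_opts "$DBSNP") variants.vcf.gz
```

The unquoted command substitution in the example relies on word splitting, so it assumes the dbSNP path contains no spaces.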

4.2.2. Germline variant calling with a machine learning model (Beta)

Please refer to the application note at https://support.sentieon.com/appnotes/dnascope_ml for details about using a machine learning model with DNAscope to improve accuracy.

4.2.3. Structural variant calling

In order to perform structural variant calling you need to add the option to output break-end (BND) information to the DNAscope command; this is done by enabling the bnd (break-end) variant type. Two individual commands are run to perform structural variant calling while additionally applying the BQSR calculated before. The input BAM file to the first command depends on whether the Indel Realignment step was performed.

sentieon driver -t NUMBER_THREADS -r REFERENCE -i REALIGNED_BAM/DEDUP_BAM \
  -q RECAL_DATA.TABLE --algo DNAscope --var_type bnd \
  [-d dbSNP] TMP_VARIANT_VCF
sentieon driver -t NUMBER_THREADS -r REFERENCE --algo SVSolver  \
  -v TMP_VARIANT_VCF STRUCTURAL_VARIANT_VCF

The following inputs are required for the commands, in addition to the input BAM file and RECAL_DATA.TABLE described in Section 4.2.1:

  • NUMBER_THREADS: the number of computer threads that will be used in the calculation. We recommend that the number does not exceed the number of computing cores available in your system.
  • REFERENCE: the location of the reference FASTA file. You should make sure that the reference is the same as the one used in the mapping stage.
  • TMP_VARIANT_VCF: the location and filename of the variant calling output file from DNAscope, which includes the BND information. This is a temporary file when calling structural variants.
  • STRUCTURAL_VARIANT_VCF: the location and filename of the variant calling output file containing the structural variants. A corresponding index file (.idx) will be created. If the filename uses the .gz extension, the tool writes a compressed file.
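
Since SVSolver consumes the file written by DNAscope, the two commands are naturally run as a unit. The following sketch wraps them in a function that stops if the first step fails. The function name call_structural_variants and the SENTIEON variable (used to override the path of the binary) are assumptions made for illustration; the options passed to sentieon are the ones shown in the commands above, without the optional -d dbSNP argument.

```shell
# Hypothetical wrapper, not part of the Sentieon software.
# Usage: call_structural_variants NT REF BAM RECAL_TABLE TMP_VCF SV_VCF
# Set SENTIEON to the path of the sentieon binary if it is not on PATH.
call_structural_variants() {
  nt=$1 ref=$2 bam=$3 table=$4 tmp_vcf=$5 sv_vcf=$6

  # Step 1: DNAscope with break-end (BND) output enabled.
  "${SENTIEON:-sentieon}" driver -t "$nt" -r "$ref" -i "$bam" -q "$table" \
    --algo DNAscope --var_type bnd "$tmp_vcf" || return 1

  # Step 2: SVSolver turns the BND records into structural variant calls.
  "${SENTIEON:-sentieon}" driver -t "$nt" -r "$ref" \
    --algo SVSolver -v "$tmp_vcf" "$sv_vcf"
}
```

A call such as `call_structural_variants 16 ref.fa realigned.bam recal_data.table tmp_bnd.vcf.gz sv.vcf.gz` uses placeholder filenames; the temporary VCF can be deleted once the second step has finished successfully.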