3. Typical usage for DNAscope
The Sentieon® Genomics software includes DNAscope, an improved algorithm for the variant calling step of germline DNA analysis. The DNAscope pipeline is similar to the DNAseq® pipeline described in Typical usage for DNAseq®. In addition to calling SNPs and small indels, DNAscope can perform structural variant calling.
3.1. General
In this bioinformatics pipeline you will need the following inputs:
- The FASTA file containing the nucleotide sequence of the reference genome corresponding to the sample you will analyze.
- One or multiple FASTQ files containing the nucleotide sequence of the sample to be analyzed. These files contain the raw reads from the DNA sequencing. The software supports inputting FASTQ files compressed using GZIP. The software only supports files containing quality scores in Sanger format (Phred+33).
- (Optional) The Single Nucleotide Polymorphism database (dbSNP) data that you want to include in the pipeline. The data is used in the form of a VCF file; you can use a VCF file compressed with bgzip and indexed.
- (For SNP and small indel calling) A machine learning model file.
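Because the software only accepts quality scores in Sanger format (Phred+33), it can help to sanity-check a FASTQ file before mapping. The following is a minimal sketch and not part of the Sentieon® tools: the function name `guess_fastq_offset`, the 1000-read sample size, and the thresholds are illustrative assumptions based on the common heuristic that any quality character below ASCII 59 implies Phred+33, while a minimum of 64 or more combined with characters above ASCII 74 suggests Phred+64.

```shell
# Hypothetical helper (not a Sentieon command): guess the quality-score
# offset of a FASTQ file from its first 1000 reads.
# Assumes 4-line FASTQ records; input may be plain or gzip-compressed.
guess_fastq_offset() {
    # With GNU gzip, "-cdf" decompresses gzip input and passes plain text through.
    gzip -cdf "$1" | head -n 4000 | awk '
        BEGIN {
            # Build a character -> ASCII code lookup table.
            for (i = 33; i <= 126; i++) ord[sprintf("%c", i)] = i
            min = 127; max = 0
        }
        NR % 4 == 0 {
            # Every fourth line holds the quality string.
            n = length($0)
            for (j = 1; j <= n; j++) {
                ch = substr($0, j, 1)
                if (ch in ord) {
                    if (ord[ch] < min) min = ord[ch]
                    if (ord[ch] > max) max = ord[ch]
                }
            }
        }
        END {
            if (max == 0)                    print "unknown"
            else if (min < 59)               print "phred33"
            else if (min >= 64 && max > 74)  print "phred64"
            else                             print "unknown"
        }'
}
```

A result of `phred33` is consistent with the supported format; `phred64` suggests the file should be converted before use, and `unknown` means the sampled reads were not conclusive.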
The following steps compose the typical bioinformatics pipeline for germline analysis with DNAscope:
- Pre-process the sample using a DNAseq® pipeline like the one introduced in Section 2, with the following stages:
  - Map reads to reference.
  - Calculate data metrics.
  - Remove duplicates.
  - (Optional for SV calling) Base quality score recalibration (BQSR).
- Variant calling using DNAscope with a machine learning model: This step identifies the sites where your data displays variation relative to the reference genome, and calculates genotypes for each sample at that site.
3.2. Step by step usage
For the mapping and duplicate removal stages, please refer to Section 2.2 for detailed usage instructions.
3.2.1. Germline variant calling with a machine learning model
We recommend running DNAscope with a machine learning model: the model improves candidate detection and filtering, which results in more accurate variant calls.
Sentieon® can provide you with a model trained using a subset of the data from the GiAB truth-set found in https://github.com/genome-in-a-bottle. The model was created by processing samples HG001 and HG005 through a pipeline consisting of Sentieon® BWA-mem alignment and Sentieon® deduplication, and using the variant calling results to calibrate a model to fit the truth-set.
In addition, Sentieon® can assist you in creating models from your own data, which will calibrate the model to the specifics of your sequencing and bioinformatics processing.
3.2.1.1. Using a machine learning model with DNAscope
Two individual commands are run to call variants and to apply the machine learning model. The input BAM file should come from a pipeline where only alignment and deduplication have been performed, to match the model creation methodology.
PCRFREE=true  # true means the sample is PCR-free; change it to false for PCR samples.
if [ "$PCRFREE" = true ]; then
    sentieon driver -t NUMBER_THREADS -r REFERENCE -i DEDUPED_BAM \
        --algo DNAscope [ -d dbSNP ] --pcr_indel_model none --model ML_MODEL TMP_VARIANT_VCF
else
    sentieon driver -t NUMBER_THREADS -r REFERENCE -i DEDUPED_BAM \
        --algo DNAscope [ -d dbSNP ] --model ML_MODEL TMP_VARIANT_VCF
fi
sentieon driver -t NUMBER_THREADS -r REFERENCE --algo DNAModelApply \
    --model ML_MODEL -v TMP_VARIANT_VCF VARIANT_VCF
Reminder
It is important to add the option --pcr_indel_model none when running DNAscope if the data you are using is PCR-free. Depending on whether PCR is involved, DNAscope uses different priors for finding significant INDEL variants, and the --pcr_indel_model option controls which priors are used. The default --pcr_indel_model setting is for PCR samples.
The following inputs are required for the command:
- NUMBER_THREADS: the number of computer threads that will be used in the calculation. We recommend that the number does not exceed the number of computing cores available in your system.
- REFERENCE: the location of the reference FASTA file. You should make sure that the reference is the same as the one used in the mapping stage.
- DEDUPED_BAM: the location of the input BAM file.
- TMP_VARIANT_VCF: the location and filename of the variant calling output of DNAscope. This is a temporary file.
- VARIANT_VCF: the location and filename of the variant calling output. A corresponding index file will be created. The tool will output a compressed file if the filename uses the .gz extension.
- ML_MODEL: the location of the machine learning model file. In the DNAscope command the model will be used to determine the settings used in variant calling.
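The recommendation for NUMBER_THREADS can be applied automatically. The following is a minimal sketch under stated assumptions: the function name `cap_threads` is hypothetical, and `getconf _NPROCESSORS_ONLN` is assumed to be available (it is on Linux and macOS, but it is not guaranteed everywhere, so the sketch falls back to 1).

```shell
# Hypothetical helper: cap a requested thread count at the number of
# online cores reported by the system.
cap_threads() {
    requested=$1
    # Assumption: getconf supports _NPROCESSORS_ONLN; fall back to 1 core if not.
    cores=$(getconf _NPROCESSORS_ONLN 2>/dev/null || echo 1)
    if [ "$requested" -gt "$cores" ]; then
        echo "$cores"
    else
        echo "$requested"
    fi
}

# Example: ask for 64 threads, but never use more than the machine has.
NUMBER_THREADS=$(cap_threads 64)
echo "Using $NUMBER_THREADS threads"
```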
Reminder
When running DNAscope with an ML model, most of the advanced settings are determined by the model itself. Setting specific values for options other than --pcr_indel_model could negatively affect the results.
The following inputs are optional for the command:
- dbSNP: the location of the Single Nucleotide Polymorphism database (dbSNP) that will be used to label known variants. You can only use one dbSNP file.
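The PCR-free switch and the optional dbSNP input can be folded into a small wrapper so that the two command variants are not maintained by hand. The sketch below is an illustration only: the function name `dnascope_model_cmds`, its argument order, and its dry-run behavior (it prints the commands instead of running them) are assumptions. The sentieon options are limited to those shown in the commands of this section.

```shell
# Hypothetical wrapper: print the two commands for DNAscope with a model.
# Arguments: THREADS REFERENCE DEDUPED_BAM ML_MODEL TMP_VCF VARIANT_VCF PCRFREE [dbSNP]
# The body runs in a subshell so its variables do not leak to the caller.
dnascope_model_cmds() (
    threads=$1 ref=$2 bam=$3 model=$4 tmp_vcf=$5 out_vcf=$6 pcrfree=$7 dbsnp=${8:-}

    # Step 1: variant calling with the model.
    set -- sentieon driver -t "$threads" -r "$ref" -i "$bam" --algo DNAscope
    if [ -n "$dbsnp" ]; then
        set -- "$@" -d "$dbsnp"
    fi
    if [ "$pcrfree" = true ]; then
        # Only PCR-free samples get the non-default INDEL priors.
        set -- "$@" --pcr_indel_model none
    fi
    set -- "$@" --model "$model" "$tmp_vcf"
    echo "$*"

    # Step 2: apply the machine learning model to the temporary VCF.
    echo "sentieon driver -t $threads -r $ref --algo DNAModelApply" \
         "--model $model -v $tmp_vcf $out_vcf"
)

# Example: PCR-free sample, no dbSNP file.
dnascope_model_cmds 8 ref.fa dedup.bam model.bin tmp.vcf.gz out.vcf.gz true
```

Printing the commands first makes it easy to review them. To execute step 1 directly, `echo "$*"` could be replaced with `"$@"`, which also preserves paths that contain spaces.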
3.2.1.2. Limitations of the machine learning model
When using DNAscope with a machine learning model, most of the options for DNAscope should not be used, even though the software will not give an error if the options are present. In particular, the following options should be avoided:
- --emit_mode GVCF: The use of DNAscope with a machine learning model is incompatible with the generation of GVCF outputs.
- --var_type BND: The use of DNAscope with a machine learning model is incompatible with structural variant calling.
In addition, using an input BAM file created by a pipeline with additional stages, such as INDEL realignment or BQSR, will likely degrade performance, because the model was fitted to data processed with alignment and deduplication only.
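Since the software does not report an error when these options are present, a wrapper script can check for them before launching DNAscope with a model. The following is a minimal sketch: the function name `check_model_options` is hypothetical, it only recognizes the two options listed above when the value is given as a separate argument, and comparing the value case-insensitively is a defensive assumption rather than documented behavior.

```shell
# Hypothetical guard: scan a DNAscope option list for options that should
# not be combined with a machine learning model.
# Exit status is 0 if none are found and 1 otherwise.
check_model_options() (
    status=0
    prev=""
    for arg in "$@"; do
        # Lower-case the value so that GVCF/gvcf and BND/bnd are treated alike.
        value=$(printf '%s' "$arg" | tr '[:upper:]' '[:lower:]')
        case "$prev $value" in
            "--emit_mode gvcf")
                echo "warning: --emit_mode GVCF is incompatible with a model" >&2
                status=1 ;;
            "--var_type bnd")
                echo "warning: --var_type BND is incompatible with a model" >&2
                status=1 ;;
        esac
        prev=$arg
    done
    # The function body is a subshell, so this only ends the function.
    exit "$status"
)

# Example: this option list passes the check.
check_model_options --pcr_indel_model none && echo "options look compatible"
```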
3.2.2. Structural variant calling
In order to perform structural variant calling you need to add the option to output break-end (BND) information to the DNAscope command; this is done by enabling the bnd (break-end) variant type. Two individual commands are run to perform structural variant calling; if you performed BQSR on the input BAM file, it is possible to input the recalibration table to the first command.
sentieon driver -t NUMBER_THREADS -r REFERENCE -i DEDUPED_BAM \
    [-q RECAL_DATA.TABLE] --algo DNAscope --var_type bnd \
    [-d dbSNP] TMP_VARIANT_VCF
sentieon driver -t NUMBER_THREADS -r REFERENCE --algo SVSolver \
    -v TMP_VARIANT_VCF STRUCTURAL_VARIANT_VCF
The following inputs are required for the command:
- NUMBER_THREADS: the number of computer threads that will be used in the calculation. We recommend that the number does not exceed the number of computing cores available in your system.
- REFERENCE: the location of the reference FASTA file. You should make sure that the reference is the same as the one used in the mapping stage.
- DEDUPED_BAM: the location of the input BAM file.
- TMP_VARIANT_VCF: the location and filename of the variant calling output file from DNAscope, which includes the BND information. This is a temporary file when calling structural variants.
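- STRUCTURAL_VARIANT_VCF: the location and filename of the variant calling output file containing the structural variants. A corresponding index file will be created. The tool will output a compressed file if the filename uses the .gz extension.
The following inputs are optional for the command:
- RECAL_DATA.TABLE: the location of the recalibration table, if you performed BQSR on the input BAM file.
- dbSNP: the location of the Single Nucleotide Polymorphism database (dbSNP) that will be used to label known variants. You can only use one dbSNP file.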
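The two optional inputs sit at different positions in the first command: the recalibration table is given before --algo, while the dbSNP file is given after it. A small wrapper can keep that placement consistent. The sketch below is an illustration only: the function name `sv_calling_cmds`, its argument order, and its dry-run behavior (it prints the commands instead of running them) are assumptions, and the sentieon options are limited to those shown in the commands of this section.

```shell
# Hypothetical wrapper: print the two structural variant calling commands.
# Arguments: THREADS REFERENCE DEDUPED_BAM TMP_VCF SV_VCF [RECAL_TABLE] [dbSNP]
# Pass an empty string for RECAL_TABLE to skip it while still giving dbSNP.
sv_calling_cmds() (
    threads=$1 ref=$2 bam=$3 tmp_vcf=$4 sv_vcf=$5 recal=${6:-} dbsnp=${7:-}

    # Step 1: DNAscope with the break-end (bnd) variant type enabled.
    set -- sentieon driver -t "$threads" -r "$ref" -i "$bam"
    if [ -n "$recal" ]; then
        # The recalibration table goes before --algo, as in the command above.
        set -- "$@" -q "$recal"
    fi
    set -- "$@" --algo DNAscope --var_type bnd
    if [ -n "$dbsnp" ]; then
        set -- "$@" -d "$dbsnp"
    fi
    set -- "$@" "$tmp_vcf"
    echo "$*"

    # Step 2: SVSolver turns the break-end information into structural variants.
    echo "sentieon driver -t $threads -r $ref --algo SVSolver -v $tmp_vcf $sv_vcf"
)

# Example: BQSR was performed, no dbSNP file.
sv_calling_cmds 8 ref.fa dedup.bam tmp_bnd.vcf.gz sv.vcf.gz recal_data.table
```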