5. Typical usage for TNscope
A typical use of the Sentieon Genomics software is to run a bioinformatics pipeline for tumor-only or tumor-normal analysis, using the TNscope algorithms to detect somatic variants and structural variants. Fig. 5.1 illustrates such a pipeline.
5.1. General
In this bioinformatics pipeline you will need the following inputs:
- The FASTA file containing the nucleotide sequence of the reference genome corresponding to the sample you will analyze.
- Two sets of FASTQ files containing the nucleotide sequence of the sample to be analyzed, one for the tumor sample and one for the matched normal sample. These files contain the raw reads from the DNA sequencing. The software supports inputting FASTQ files compressed using GZIP. The software only supports files containing quality scores in Sanger format (Phred+33).
- (Optional) The Single Nucleotide Polymorphism database (dbSNP) data that you want to include in the pipeline. The data is used in the form of a VCF file.
- (Optional) Multiple collections of known sites that you want to include in the pipeline. The data is used in the form of a VCF file.
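Because only Sanger-format (Phred+33) quality scores are supported, it can help to inspect a FASTQ file before starting. The following is a minimal heuristic sketch and not part of the Sentieon software: the function names `min_quality_code` and `looks_like_phred33` are hypothetical, and it assumes four-line FASTQ records. It relies on the fact that quality characters with an ASCII code below 59 occur only in Phred+33 encoding. A file whose lowest code is 59 or higher is ambiguous, so a failure here is not proof of an unsupported format.

```shell
# Print the lowest ASCII code found on the quality lines of FASTQ text
# read from standard input (assumes 4-line records: header, bases, '+', qualities).
min_quality_code() {
    awk 'NR % 4 == 0' |          # keep only the quality line of each record
        tr -d '\n' |             # drop newlines so they are not counted
        od -An -v -tu1 |         # dump every byte as an unsigned decimal
        tr -s ' ' '\n' |         # one number per line
        grep -v '^$' |
        sort -n |
        head -n 1
}

# Succeed if the input contains at least one quality character that can
# only appear in Phred+33 encoding (ASCII code below 59).
looks_like_phred33() {
    min=$(min_quality_code)
    [ -n "$min" ] && [ "$min" -lt 59 ]
}
```

For a GZIP-compressed file, the text can be supplied with `gzip -dc FILE.fastq.gz | looks_like_phred33`.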
The following steps compose the typical bioinformatics pipeline for a tumor-normal matched pair:
- Independently pre-process the Tumor and Normal samples using a DNAseq pipeline like the one introduced in Section 2, with the following stages:
- Map reads to reference; you need to make sure that the SM sample tag is different between the tumor and the normal samples, as you will need it as an argument in the somatic variant calling. You also need to make sure that the RG:ID for both samples is different and unique.
- Calculate data metrics.
- Remove duplicates.
- Base quality score recalibration (BQSR).
- Somatic variant calling: this step identifies the sites where the cancer genome data displays somatic variations relative to the normal genome, and calculates genotypes at those sites.
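The requirement that the SM tag and the RG:ID differ between the tumor and normal samples can be checked before variant calling. The following is a minimal sketch, assuming the header text of each BAM file has already been saved to a file by some other tool (for example `samtools view -H`, which is not covered in this manual). The function names `rg_values` and `rg_disjoint` are hypothetical.

```shell
# Print the unique values of one @RG tag (for example SM or ID) found in
# SAM header text read from standard input.
rg_values() {
    awk -F '\t' -v tag="$1" '
        $1 == "@RG" {
            for (i = 2; i <= NF; i++)
                if (index($i, tag ":") == 1)
                    print substr($i, length(tag) + 2)
        }' | sort -u
}

# Succeed only if the two header files share no value for the given tag.
# Arguments: TAG TUMOR_HEADER_FILE NORMAL_HEADER_FILE (hypothetical inputs).
rg_disjoint() {
    shared=$(
        { rg_values "$1" < "$2"; rg_values "$1" < "$3"; } | sort | uniq -d
    )
    [ -z "$shared" ]
}
```

Running `rg_disjoint SM tumor_header.txt normal_header.txt` and `rg_disjoint ID tumor_header.txt normal_header.txt` should both succeed for a correctly prepared matched pair.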
5.2. Step by step usage
For the mapping, duplicate removal, and base quality score recalibration stages, please refer to Section 2.2 for detailed usage instructions.
5.2.1. Somatic variant discovery with or without matched normal sample
A single command is run to call variants on the tumor-normal matched pair.
sentieon driver -t NUMBER_THREADS -r REFERENCE \
-i TUMOR_DEDUPED_BAM -q TUMOR_RECAL_DATA.TABLE \
-i NORMAL_DEDUPED_BAM -q NORMAL_RECAL_DATA.TABLE \
--algo TNscope \
--tumor_sample TUMOR_SAMPLE_NAME --normal_sample NORMAL_SAMPLE_NAME \
[--dbsnp DBSNP] OUT_TN_VCF
If you do not have a matched normal sample, you can omit the normal sample BAM file, recalibration table, and sample name, as they are not required. However, without a normal sample, germline variants will be present in the output file:
sentieon driver -t NUMBER_THREADS -r REFERENCE \
-i TUMOR_DEDUPED_BAM -q TUMOR_RECAL_DATA.TABLE \
--algo TNscope --tumor_sample TUMOR_SAMPLE_NAME \
[--dbsnp DBSNP] OUT_TN_VCF
To filter out germline variants when you have no matched normal sample, you can replace the matched normal sample with a generic Panel of Normal file, generated using the same methodology as the one for TNhaplotyper2 described in Section 4.2.4.
A single command is run to call variants on the tumor-only sample.
sentieon driver -t NUMBER_THREADS -r REFERENCE \
-i TUMOR_DEDUPED_BAM -q TUMOR_RECAL_DATA.TABLE \
--algo TNscope --tumor_sample TUMOR_SAMPLE_NAME \
--pon PANEL_OF_NORMAL_VCF [--dbsnp DBSNP] OUT_TN_VCF
The following inputs are required for the command:
- NUMBER_THREADS: the number of computer threads that will be used in the calculation. We recommend that the number does not exceed the number of computing cores available in your system.
- REFERENCE: the location of the reference FASTA file. You should make sure that the reference is the same as the one used in the mapping stage.
- TUMOR_DEDUPED_BAM: the location of the pre-processed BAM file after deduplication for the TUMOR sample.
- TUMOR_RECAL_DATA.TABLE: the location where the BQSR stage for the TUMOR sample stored the result.
- NORMAL_DEDUPED_BAM: the location of the pre-processed BAM file after deduplication for the NORMAL sample. Only needed for tumor-normal analysis.
- NORMAL_RECAL_DATA.TABLE: the location where the BQSR stage for the NORMAL sample stored the result. Only needed for tumor-normal analysis.
- TUMOR_SAMPLE_NAME: the sample name used for the tumor sample in the Map reads to reference stage.
- OUT_TN_VCF: the location and file name of the output file containing the variants.
The following inputs are optional for the command:
- NORMAL_SAMPLE_NAME: the sample name used for the normal sample in the Map reads to reference stage.
- DBSNP: the location of the Single Nucleotide Polymorphism database (dbSNP). Variants present in dbSNP are more likely to be marked as germline, because they require more evidence of absence in the normal sample. You can only use one dbSNP file.
- PANEL_OF_NORMAL_VCF: the location and name of the Panel of Normal VCF file.
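The three command forms above differ only in which optional inputs are supplied. The following is a minimal dry-run sketch that assembles the command line from shell variables named after the placeholders in this section and prints it without running it. The function name `tnscope_cmd` is hypothetical, the variable names use `_TABLE` instead of `.TABLE` because shell variable names cannot contain a dot, and all file names are illustrative. The sketch does not validate combinations (this section shows `--pon` only in the tumor-only form) and does not handle paths containing spaces.

```shell
# Print (do not run) a TNscope command line built from the placeholders in
# this section. Optional inputs are added only when their variable is set
# to a non-empty value.
tnscope_cmd() {
    cmd="sentieon driver -t $NUMBER_THREADS -r $REFERENCE"
    cmd="$cmd -i $TUMOR_DEDUPED_BAM -q $TUMOR_RECAL_DATA_TABLE"
    if [ -n "${NORMAL_DEDUPED_BAM:-}" ]; then
        cmd="$cmd -i $NORMAL_DEDUPED_BAM -q $NORMAL_RECAL_DATA_TABLE"
    fi
    cmd="$cmd --algo TNscope --tumor_sample $TUMOR_SAMPLE_NAME"
    if [ -n "${NORMAL_SAMPLE_NAME:-}" ]; then
        cmd="$cmd --normal_sample $NORMAL_SAMPLE_NAME"
    fi
    if [ -n "${PANEL_OF_NORMAL_VCF:-}" ]; then
        cmd="$cmd --pon $PANEL_OF_NORMAL_VCF"
    fi
    if [ -n "${DBSNP:-}" ]; then
        cmd="$cmd --dbsnp $DBSNP"
    fi
    printf '%s\n' "$cmd $OUT_TN_VCF"
}

# Hypothetical values for a tumor-only run with a Panel of Normal VCF.
NUMBER_THREADS=16
REFERENCE=reference.fasta
TUMOR_DEDUPED_BAM=tumor_deduped.bam
TUMOR_RECAL_DATA_TABLE=tumor_recal_data.table
TUMOR_SAMPLE_NAME=tumor_sample
PANEL_OF_NORMAL_VCF=panel_of_normal.vcf
OUT_TN_VCF=output_tnscope.vcf.gz
tnscope_cmd
```

Printing the command first makes it easy to review which inputs will be used before committing compute time to the real run.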
5.2.2. Generating a Panel of Normal VCF file
In order to generate your own Panel of Normal VCF file to use with TNscope, you will need to run the following command on every normal sample you want to use in the panel:
sentieon driver -t NUMBER_THREADS -r REFERENCE -i NORMAL_RECALIBRATED_BAM \
--algo TNscope --tumor_sample NORMAL_SAMPLE_NAME OUT_NORMAL_VCF
The following inputs are required for the command:
- NUMBER_THREADS: the number of computer threads that will be used in the calculation. We recommend that the number does not exceed the number of computing cores available in your system.
- REFERENCE: the location of the reference FASTA file. You should make sure that the reference is the same as the one used in the mapping stage.
- NORMAL_RECALIBRATED_BAM: the location of the pre-processed BAM file for the normal sample after the Sentieon ReadWriter stage.
- NORMAL_SAMPLE_NAME: the sample name used for the normal sample in the Map reads to reference stage.
- OUT_NORMAL_VCF: the location and name of the output VCF file containing the relevant variants for the input normal sample.
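Since the command must be run once per normal sample, it is convenient to generate the command lines from a sample list. The following is a minimal dry-run sketch: the function name `pon_commands` and the two-column input format (sample name, then BAM path) are hypothetical. The `.vcf.gz` output names are an assumption chosen so that the files match a `*.vcf.gz` glob in a later merge step; confirm for your software version whether output with that extension is compressed and indexed automatically.

```shell
# Print one TNscope command per normal sample. Reads "SAMPLE_NAME BAM_PATH"
# pairs from standard input; nothing is executed.
# Arguments: NUMBER_THREADS REFERENCE OUTPUT_DIRECTORY
pon_commands() {
    threads="$1"
    reference="$2"
    outdir="$3"
    while read -r sample bam; do
        [ -n "$sample" ] || continue    # skip blank lines
        printf 'sentieon driver -t %s -r %s -i %s --algo TNscope --tumor_sample %s %s/%s.vcf.gz\n' \
            "$threads" "$reference" "$bam" "$sample" "$outdir" "$sample"
    done
}
```

For example, `pon_commands 16 reference.fasta normal_vcfs < normals.txt` prints the commands for every sample listed in a hypothetical `normals.txt`, which can then be reviewed and run.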
After you have generated all VCF files you want to include in the panel, you need to merge them into a single Panel of Normal VCF. You can use bcftools for that purpose:
BCF=/path_to_bcftools
export BCFTOOLS_PLUGINS=$BCF/plugins
DIR=/path_to_normal_vcf_file
$BCF/bcftools merge -m all -f PASS,. --force-samples $DIR/*.vcf.gz |\
$BCF/bcftools plugin fill-AN-AC |\
$BCF/bcftools filter -i 'SUM(AC)>1' > panel_of_normal.vcf
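`bcftools merge` expects its input VCF files to be bgzip-compressed and indexed, so it is worth confirming that every file matched by `$DIR/*.vcf.gz` has an index before running the merge. The following is a minimal sketch of such a check; the function name `missing_index` is hypothetical, and it only tests for the presence of a `.tbi` or `.csi` file next to each VCF, not for its validity.

```shell
# List the .vcf.gz files in a directory that have neither a .tbi nor a
# .csi index file next to them. Prints nothing when all are indexed.
missing_index() {
    dir="$1"
    for vcf in "$dir"/*.vcf.gz; do
        [ -e "$vcf" ] || continue       # the glob matched nothing
        if [ ! -e "$vcf.tbi" ] && [ ! -e "$vcf.csi" ]; then
            printf '%s\n' "$vcf"
        fi
    done
}
```

An empty result from `missing_index "$DIR"` suggests the merge inputs are ready. Any file it lists needs an index first.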
5.2.3. Somatic variant discovery with a matched normal sample using a machine learning model (Beta)
Please see the application note at https://support.sentieon.com/appnotes/tnscope_ml for details about using a machine learning model with TNscope to improve accuracy.