Description of output files and fields

Introduction

This document describes the output files of Sentieon® TNsnv, TNhaplotyper, TNhaplotyper2 and TNscope® algorithms and the meaning of the fields in those files. You can use the information in this document to better understand the files produced by Sentieon® tumor-normal variant calling software.

TNsnv

Introduction

An example command with TNsnv is as follows:

sentieon driver -t NUMBER_THREADS -r REFERENCE.FASTA \
   -i NORMAL_RECALED.BAM -i TUMOR_RECALED.BAM \
   --interval INTERVAL \
   --algo TNsnv --dbsnp DBSNP.VCF \
   --tumor_sample TUMOR_SM --normal_sample NORMAL_SM \
   --call_stats_out CALL_STATS_OUTPUT.TXT \
   --stdcov_out STD_COVERAGE.TXT \
   --q20cov_out Q20_COVERAGE.TXT \
   --power_out POWER.TXT --tumor_depth_out TUMOR_DP.TXT \
   --normal_depth_out NORMAL_DP.TXT OUTPUT.VCF

This command line produces the following required output files:

  • OUTPUT.VCF

In addition, the following optional output files are produced:

  • CALL_STATS_OUTPUT.TXT
  • STD_COVERAGE.TXT
  • Q20_COVERAGE.TXT
  • POWER.TXT
  • TUMOR_DP.TXT
  • NORMAL_DP.TXT

The OUTPUT.VCF of TNsnv contains only limited output information. Users who want more detailed output should examine the CALL_STATS_OUTPUT.TXT file.

OUTPUT.VCF

The OUTPUT.VCF file conforms to the VCF 4.2 specification. More information on the VCF format can be found at https://samtools.github.io/hts-specs/VCFv4.2.pdf. The INFO field annotations are described in detail below.

INFO annotation Description
DB The variant is present in the VCF file supplied with the --dbsnp option
MQ0 Total number of reads with Mapping Quality equal to 0
SOMATIC The variant occurs uniquely in the sample supplied with the --tumor_sample option
VT Variant type, can be SNP, INS or DEL

TNsnv also populates the FILTER field of the output VCF file. Variants are filtered using TNsnv internal quality filters. More information on the applied filters can be found in the failure_reasons row of the column table in the CALL_STATS_OUTPUT.TXT section (section 2.3).

FILTER Description
PASS The variant passes TNsnv internal quality filters
REJECT The variant fails TNsnv internal quality filters
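
Because every TNsnv record carries either PASS or REJECT, a common first step is to keep only the passing records. The following is a minimal sketch, assuming an uncompressed OUTPUT.VCF with the standard tab-separated VCF columns, where FILTER is column 7. The helper name pass_only is hypothetical and is not part of the Sentieon® software.

```shell
# Keep the VCF header lines and only the records whose FILTER column (7) is PASS.
# pass_only is a hypothetical helper name used for illustration.
pass_only() {
    awk -F'\t' '/^#/ || $7 == "PASS"' "$@"
}

# Usage (file names are placeholders):
#   pass_only OUTPUT.VCF > OUTPUT.pass.vcf
```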

Standard genotype fields are defined by the format specification. However, TNsnv also outputs the following non-standard fields.

GENOTYPE field Description
BQ Average base quality of bases supporting the alternate alleles
FA Fraction of reads supporting the alternate allele
SS Status of the variant. Not currently implemented, always set to 2
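
Genotype fields such as FA are stored per sample, in the order given by the FORMAT column. The sketch below shows one way to pull a single genotype field for one sample out of an uncompressed VCF. It assumes the standard VCF layout (FORMAT in column 9, sample columns from column 10 onward, values separated by colons). The helper name format_value is hypothetical.

```shell
# Print CHROM, POS and one genotype field for a named sample.
# format_value is a hypothetical helper; arguments: SAMPLE KEY [VCF]
format_value() {
    sample=$1; key=$2; shift 2
    awk -F'\t' -v OFS='\t' -v sample="$sample" -v key="$key" '
    /^##/ { next }
    /^#CHROM/ {
        # Locate the column that holds the requested sample.
        for (i = 10; i <= NF; i++) if ($i == sample) scol = i
        next
    }
    scol {
        # FORMAT lists the keys in the same order as the values in each sample column.
        n = split($9, keys, ":")
        split($scol, vals, ":")
        for (i = 1; i <= n; i++) if (keys[i] == key) print $1, $2, vals[i]
    }' "$@"
}

# Usage (sample and file names are placeholders):
#   format_value TUMOR_SM FA OUTPUT.VCF
```

If the sample name is not found in the header, the function prints nothing.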

CALL_STATS_OUTPUT.TXT

The CALL_STATS_OUTPUT.TXT file is a tab-separated text file with the following columns for each candidate variant. The core statistic of the software is t_lod_fstar, which measures the support for the mutation relative to the expected level of sequencing noise at the candidate site.

Column Description
Contig The contig (chromosome) with the candidate
Position The genomic coordinate of the candidate along the contig
Context The sequence 3bp to either side of the candidate
Ref_allele The reference allele at the candidate site
Alt_allele The alternate allele at the candidate site
Tumor_name The name of the tumor sample with the candidate mutation
Normal_name The name of the paired normal sample
Score Variant score. Not currently implemented, always set to 0.0
Dbsnp_site The variant is present in the VCF file supplied with the --dbsnp option (DBSNP) or is novel (NOVEL)
Covered The site has sufficient read coverage to detect a variant with a 0.3 allele fraction at 80% power
Power The product of tumor power and normal power, described below.
Tumor_power The power to detect a mutation at a 0.3 allele fraction at the observed sequencing depth in the tumor sample
Normal_power The power to detect a germline mutation at this site taking into account the presence of the site in dbSNP at the observed sequencing depth in the normal sample
Normal_power_nsp The power to detect a germline mutation in the normal sample given that the mutation is not in dbSNP
Normal_power_wsp The power to detect a germline mutation in the normal sample given that the mutation is in dbSNP
Total_reads Total number of reads in both the tumor and normal samples at this site
Map_Q0_reads Total number of reads in both the tumor and normal samples with mapping quality 0 at this site
Init_t_lod Log odds of the likelihood that the candidate mutation is real over the likelihood that the candidate mutation is a sequencing error before any read-based filters are applied
t_lod_fstar Log odds of the likelihood that the candidate mutation is real over the likelihood that the candidate mutation is a sequencing error
t_lod_fstar_forward t_lod_fstar calculated using only reads on the forward strand
t_lod_fstar_reverse t_lod_fstar calculated using only reads on the reverse strand
tumor_f Estimated allele fraction of the candidate mutation in the tumor sample
Contaminant_fraction Estimate of contamination of normal cells in the tumor sample
Contaminant_lod Log odds of the likelihood that the candidate is contamination over the likelihood that the candidate is a sequencing error
t_q20_count Count of the number of reads in the tumor sample with a base quality of at least 20
t_ref_count Number of reads supporting the reference allele in the tumor sample
t_alt_count Number of reads supporting the alternate allele in the tumor sample
t_ref_sum Sum of the quality scores of the bases supporting the reference allele in the tumor sample
t_alt_sum Sum of the quality scores of the bases supporting the alternate allele in the tumor sample
t_ref_max_mapq The maximum mapping quality of tumor reads supporting the reference allele
t_alt_max_mapq The maximum mapping quality of tumor reads supporting the alternate allele
t_ins_count The number of reads in the tumor sample that have an insertion in the surrounding five bases
t_del_count The number of reads in the tumor sample that have a deletion in the surrounding five bases
Normal_best_gt The most likely genotype of the normal sample
Init_n_lod Log odds of the likelihood that the normal sample is reference over the normal sample having the variant before any read-based filters are applied
normal_f Estimated allele fraction of the candidate mutation in the normal sample
n_q20_count Count of the number of reads in the normal sample with a base quality of at least 20
n_ref_count Number of reads supporting the reference allele in the normal sample
n_alt_count Number of reads supporting the alternate allele in the normal sample
n_ref_sum Sum of the quality scores of the bases supporting the reference allele in the normal sample
n_alt_sum Sum of the quality scores of the bases supporting the alternate allele in the normal sample
power_to_detect_positive_strand_artifact The power to detect strand bias to the positive strand at the given sequencing depth
power_to_detect_negative_strand_artifact The power to detect strand bias to the negative strand at the given sequencing depth
strand_bias_counts A vector of counts for the tumor sample in the order of (tumor_ref_pos, tumor_ref_neg, tumor_alt_pos, tumor_alt_neg) where ref and alt specify the reference and alternate alleles and pos and neg specify the positive and negative strands. The numbers do not match those in earlier columns due to differential filtering
tumor_alt_fpir_median Median position along forward strand reads for bases supporting the alternate allele in the tumor sample
tumor_alt_fpir_mad Median absolute deviation of the positions along forward strand reads for bases supporting the alternate allele in the tumor sample
tumor_alt_rpir_median Median position along reverse strand reads for bases supporting the alternate allele in the tumor sample
tumor_alt_rpir_mad Median absolute deviation of the positions along reverse strand reads for bases supporting the alternate allele in the tumor sample
observed_in_normals_count The number of reads supporting the candidate mutation in the normal sample
failure_reasons Reasons for rejecting the candidate somatic mutation. Possibilities include: (1) alt_allele_in_normal - The alternate allele has significant support in the normal sample. (2) clustered_read_position - The alternate allele is not distributed evenly over the length of the read. (3) fstar_tumor_lod - The candidate does not have significant support above noise. (4) germline_risk - There is evidence for the mutation in the normal sample at a dbSNP site. (5) nearby_gap_events - Insertion and deletion events were identified at the locus. (6) normal_lod - There is evidence for the mutation in the normal sample. (7) poor_mapping_region_alternate_allele_mapq - Low mapping quality for the alternate allele. (8) poor_mapping_region_mapq0 - Too many reads with a mapping quality of 0 at the locus. (9) possible_contamination - Possible contamination of the normal sample with tumor. (10) strand_artifact - The mutation is likely a strand bias artifact. (11) triallelic_site - The site is not biallelic.
judgement The candidate is a true somatic variant (KEEP) or the candidate is not a likely somatic variant (REJECT).
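
Since the file is tab-separated with one column per statistic, the accepted candidates can be listed by looking columns up by name rather than by position. The sketch below assumes that the first line not starting with "#" is a header row containing the column names from the table above. Names are matched case-insensitively because the exact capitalization in the file is not confirmed here. The helper name keep_calls is hypothetical.

```shell
# List candidates whose judgement is KEEP, with a few key statistics.
# keep_calls is a hypothetical helper; columns are located by header name.
keep_calls() {
    awk -F'\t' -v OFS='\t' '
    /^#/ { next }                     # skip any leading comment lines
    !seen_header {
        # Map each lower-cased column name to its index.
        for (i = 1; i <= NF; i++) col[tolower($i)] = i
        seen_header = 1
        next
    }
    $col["judgement"] == "KEEP" {
        print $col["contig"], $col["position"], $col["ref_allele"],
              $col["alt_allele"], $col["tumor_f"], $col["t_lod_fstar"]
    }' "$@"
}

# Usage (file name is a placeholder):
#   keep_calls CALL_STATS_OUTPUT.TXT
```

Looking columns up by name keeps the helper working even if the column order differs between software versions.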

STD_COVERAGE.TXT

A WIGGLE format file describing whether there is sufficient coverage to detect somatic variants at a 0.3 allele fraction in the tumor with 80% power. 1 indicates that the coverage at the locus passes this threshold, 0 otherwise.

Q20_COVERAGE.TXT

A WIGGLE format file describing whether there is sufficient coverage to detect somatic variants at a 0.3 allele fraction in the tumor with 80% power examining only bases with a quality of greater than 20. 1 indicates that the coverage at the locus passes this threshold, 0 otherwise.

POWER.TXT

A WIGGLE format file describing the power to detect a somatic variant at the observed coverage in the tumor and normal samples.

TUMOR_DP.TXT

A WIGGLE format file describing the observed sequence read depth in the tumor sample.

NORMAL_DP.TXT

A WIGGLE format file describing the observed sequence read depth in the normal sample.
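
The WIGGLE files above can be summarized with a single average. For STD_COVERAGE.TXT and Q20_COVERAGE.TXT the mean of the 0/1 values is the fraction of listed loci that pass the coverage threshold, and for TUMOR_DP.TXT and NORMAL_DP.TXT it is the mean depth. The sketch below assumes the general WIGGLE layout, where declaration lines start with track, browser, fixedStep or variableStep and the data value is the last field on each remaining line. The exact layout written by the software is not described in this document, and the helper name wig_mean is hypothetical.

```shell
# Average the data values of a WIGGLE file.
# wig_mean is a hypothetical helper; the value is taken from the last field,
# which covers both fixedStep ("value") and variableStep ("position value") lines.
wig_mean() {
    awk '
    /^(track|browser|fixedStep|variableStep|#)/ { next }   # declaration and comment lines
    NF == 0 { next }                                        # blank lines
    { sum += $NF; n++ }
    END { if (n) printf "%.4f\n", sum / n }' "$@"
}

# Usage (file names are placeholders):
#   wig_mean STD_COVERAGE.TXT    # fraction of listed loci with sufficient coverage
#   wig_mean TUMOR_DP.TXT        # mean tumor depth over listed loci
```

The average covers only the loci that appear in the file, so positions that were never written are not counted as zero.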

TNhaplotyper

Introduction

An example command with TNhaplotyper is as follows:

sentieon driver -t NUMBER_THREADS -r REFERENCE.FASTA \
   -i NORMAL_RECALED.BAM -i TUMOR_RECALED.BAM \
   --interval INTERVAL \
   --algo TNhaplotyper --dbsnp DBSNP.VCF \
   --tumor_sample TUMOR_SM --normal_sample NORMAL_SM \
   OUTPUT.VCF

This command line produces the following required output files:

  • OUTPUT.VCF

OUTPUT.VCF

The OUTPUT.VCF file conforms to the VCF 4.2 specification. More information on the VCF format can be found at https://samtools.github.io/hts-specs/VCFv4.2.pdf. The INFO field annotations are described in detail below.

The core statistics of the software are TLOD, which is a measure of the support for the mutation relative to the expected level of sequencing noise at the candidate site, and NLOD, which is a measure of the odds that the mutation is absent from the normal sample.

INFO annotation Description
DB The variant is present in the VCF file supplied with the --dbsnp option
ECNT Number of candidate variants in the active region, typically the number of candidate variants in the +/- 50 to 300 bp region
HCNT Number of haplotypes observed in the active region after assembly of the sequence reads
MAX_ED Maximum edit distance between the observed haplotypes in the active region
MIN_ED Minimum edit distance between the observed haplotypes in the active region
NLOD Log odds that the variant is not present in the normal sample (confidence that the variant is not a germline variant)
PON Number of times the variant is observed in the panel of normal samples
RPA The number of times the repeat is present for each allele for an indel within a short tandem repeat
RU The sequence of the repeated nucleotides for an indel within a short tandem repeat
STR The variant is an expansion or contraction of a short tandem repeat
TLOD Log odds that the variant is present in the tumor sample relative to the expected noise
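
As TLOD and NLOD are the core statistics, it is often convenient to extract them into a flat table for review. The sketch below assumes an uncompressed VCF with the INFO column in position 8 and entries separated by semicolons, as defined by the VCF specification. The helper name tlod_nlod is hypothetical.

```shell
# Print CHROM, POS, REF, ALT, TLOD and NLOD for every record.
# tlod_nlod is a hypothetical helper; "." is printed when a key is absent.
tlod_nlod() {
    awk -F'\t' -v OFS='\t' '
    /^#/ { next }
    {
        tlod = nlod = "."
        n = split($8, kv, ";")
        for (i = 1; i <= n; i++) {
            if (kv[i] ~ /^TLOD=/) tlod = substr(kv[i], 6)
            else if (kv[i] ~ /^NLOD=/) nlod = substr(kv[i], 6)
        }
        print $1, $2, $4, $5, tlod, nlod
    }' "$@"
}

# Usage (file name is a placeholder):
#   tlod_nlod OUTPUT.VCF
```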

TNhaplotyper also populates the FILTER field for the variants.

FILTER Description
PASS The variant is confidently a somatic mutation
alt_allele_in_normal The alternate allele is present in the paired normal sample and is unlikely to be a somatic variant
clustered_events Multiple events are present on the same haplotype as the variant which is indicative of a false-positive call
germline_risk There is evidence that the variant is present in the normal sample given that the variant is present in the supplied dbSNP VCF and not present in the supplied COSMIC VCF
homologous_mapping_event More than three events are present at this locus in the tumor, which is indicative of a false-positive call
low_t_alt_frac The variant is filtered due to a low alternate allele fraction in the tumor sample
multi_event_alt_allele_in_normal Multiple events are present in the tumor sample and the alternate allele appears in the normal sample
panel_of_normals The mutation is present in at least two samples in the panel of normals.
str_contraction The mutation is a contraction of a short tandem repeat
t_lod_fstar The mutation does not have significant support above noise
triallelic_site The mutation occurs at a triallelic site

Standard genotype fields are defined by the format specification. However, TNhaplotyper also outputs the following non-standard fields.

GENOTYPE Description
AF Fraction of reads supporting the alternate allele
ALT_F1R2 The number of reads in the F1R2 orientation supporting the alternate allele
ALT_F2R1 The number of reads in the F2R1 orientation supporting the alternate allele
FOXOG The fraction of alt reads indicating OxoG error. OxoG error is induced by DNA oxidation during library preparation and is a frequent source of false-positive calls. See PMID: 23303777.
PGT Physical phasing haplotype information describing how the alternate alleles are phased in relation to one another
PID Physical phasing ID information, connecting records within a phasing group by using unique IDs within a given sample, but not across samples
QSS Sum of base quality scores for each allele
REF_F1R2 The number of reads in the F1R2 orientation supporting the reference allele
REF_F2R1 The number of reads in the F2R1 orientation supporting the reference allele

TNhaplotyper2

Introduction

An example command with TNhaplotyper2 is as follows:

sentieon driver -t NUMBER_THREADS -r REFERENCE.FASTA \
   -i NORMAL_RECALED.BAM -i TUMOR_RECALED.BAM \
   --algo TNhaplotyper2 --tumor_sample TUMOR_SM \
   --normal_sample NORMAL_SM \
   TMP.VCF \
   --algo OrientationBias --tumor_sample TUMOR_SM \
   ORIENTATION_DATA \
   --algo ContaminationModel --tumor_sample TUMOR_SM \
   --normal_sample NORMAL_SM \
   --vcf GERMLINE_RESOURCE \
   --tumor_segments CONTAMINATION_DATA.segments \
   CONTAMINATION_DATA

sentieon driver -r REFERENCE.FASTA \
   --algo TNfilter --tumor_sample TUMOR_SM \
   --normal_sample NORMAL_SM \
   -v TMP.VCF \
   --contamination CONTAMINATION_DATA \
   --tumor_segments CONTAMINATION_DATA.segments \
   --orientation_priors ORIENTATION_DATA \
   OUTPUT.VCF

These commands produce the following required output file:

  • OUTPUT.VCF

OUTPUT.VCF

The OUTPUT.VCF file conforms to the VCF 4.2 specification. More information on the VCF format can be found at https://samtools.github.io/hts-specs/VCFv4.2.pdf. The INFO field annotations are described in detail below.

The core statistics of the software are TLOD, which is a measure of the support for the mutation relative to the expected level of sequencing noise at the candidate site, and NLOD, which is a measure of the odds that the mutation is absent from the normal sample.

INFO annotation Description
AS_FilterStatus The filter status of each allele, with alleles separated by the pipe character
AS_SB_TABLE Forward and reverse read counts for each allele, with alleles separated by the pipe character
AS_UNIQ_ALT_READ_COUNT The number of reads with unique start and mate-end positions for each alternate allele
CONTQ Phred-scaled probability that the variant alleles are not due to contamination
DP Approximate read depth
ECNT Number of candidate variants in the active region, typically the number of candidate variants in the +/- 50 to 300 bp region
GERMQ The phred-scaled posterior probability that the alternate allele(s) are not germline variants
MBQ Median base quality of each allele
MFRL Median fragment length of each allele
MMQ Median mapping quality of each allele
MPOS Median distance from the end of the read for each alternate allele
NALOD Negative log 10 odds of the variant being an artifact in the normal sample with the same allele fraction as the tumor sample for each alternate allele
NCount Count of N-bases in the read pileup
NLOD Log odds that the variant is not present in the normal sample (confidence that the variant is not a germline variant) for each alternate allele
OCM Number of reads supporting the alternate allele whose original alignment does not match the current contig
PON The variant is found in the panel of normal samples
POPAF Population allele frequency of the alternate alleles
ROQ Phred-scaled probability that the variant alleles are not due to a read orientation artifact
RPA The number of times the repeat is present for each allele for an indel within a short tandem repeat
RU The sequence of the repeated nucleotides for an indel within a short tandem repeat
SEQQ Phred-scaled probability that the variant alleles are not due to sequencing error
STR The variant is an expansion or contraction of a short tandem repeat
STRANDQ Phred-scaled probability of a strand-bias artifact
STRQ Phred-scaled probability that the alternate alleles are errors due to polymerase slippage
TLOD Log odds that the variant is present in the tumor sample relative to the expected noise
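
Several of the annotations above (CONTQ, GERMQ, ROQ, SEQQ, STRANDQ, STRQ) are Phred-scaled. The sketch below converts such a score back to a probability, assuming the usual Phred convention Q = -10 * log10(p), so that p = 10^(-Q/10). This document does not spell out the formula or which event p refers to for each annotation, so treat both as assumptions to check against the software documentation. The helper name phred_to_prob is hypothetical.

```shell
# Convert a Phred-scaled score Q to a probability p = 10^(-Q/10).
# phred_to_prob is a hypothetical helper; it assumes the standard Phred convention.
phred_to_prob() {
    awk -v q="$1" 'BEGIN { printf "%.6g\n", 10 ^ (-q / 10) }'
}

# Usage:
#   phred_to_prob 30    # a score of 30 corresponds to p = 0.001
```

Under this convention every 10 points on the Phred scale changes the probability by a factor of ten.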

TNfilter also populates the FILTER field for the variants.

FILTER Description
PASS The site contains at least one allele that passes all filters
FAIL All variant alleles are filtered, but for different reasons
base_qual The median base quality of bases supporting the alternate allele is too low
clustered_events Multiple events are present on the same haplotype as the variant which is indicative of a false-positive call
contamination The alternate allele is present due to contamination
duplicate The alternate allele is overrepresented by apparent sequencing duplicates
fragment A large difference is observed in the median fragment length for reads supporting the reference and alternate alleles
germline There is evidence that the variant is germline
haplotype Variant is on the same haplotype as other filtered variants
low_allele_frac The variant allele fraction is below the threshold
map_qual The median mapping quality of reads supporting the alternate allele is too low
multiallelic The mutation occurs at a multiallelic site
n_ratio Too many 'N' bases at the site
normal_artifact The variant is likely an artifact in the normal sample
orientation The variant is likely an artifact due to orientation bias
panel_of_normals The site is present in the panel of normals
position The allele is close to the ends of the reads
slippage The variant is likely an artifact due to polymerase slippage
strand_bias Evidence for the alternate allele comes from only one read direction
strict_strand Evidence for the alternate allele is not significant in both directions
weak_evidence The mutation does not have significant support above noise

Standard genotype fields are defined by the format specification. However, TNhaplotyper2 also outputs the following non-standard fields.

GENOTYPE Description
AF Fraction of reads supporting the alternate allele
AD Allelic depths for the reference and alternate alleles
DP Approximate read depth
F1R2 The number of reads in the F1R2 orientation supporting each allele
F2R1 The number of reads in the F2R1 orientation supporting each allele
PGT Physical phasing haplotype information describing how the alternate alleles are phased in relation to one another
PID Physical phasing ID information, connecting records within a phasing group by using unique IDs within a given sample, but not across samples
PS Phasing set; typically the position of the first variant in the set
SB The forward and reverse read counts for the reference and alternate alleles

TNscope®

Introduction

An example command with TNscope® is as follows:

sentieon driver -t NUMBER_THREADS -r REFERENCE.FASTA \
  -i NORMAL_RECALED.BAM -i TUMOR_RECALED.BAM \
  --interval INTERVAL \
  --algo TNscope --tumor_sample TUMOR_SM \
  --normal_sample NORMAL_SM --dbsnp DBSNP.VCF OUTPUT.VCF

This command line produces the following required output files:

  • OUTPUT.VCF

OUTPUT.VCF

The OUTPUT.VCF file conforms to the VCF 4.2 specification. More information on the VCF format can be found at https://samtools.github.io/hts-specs/VCFv4.2.pdf. The INFO field annotations are described in detail below.

The core statistics of the software are TLOD, which is a measure of the support for the mutation relative to the expected level of sequencing noise at the candidate site, and NLOD, which is a measure of the odds that the mutation is absent from the normal sample.

INFO annotation Description
CIEND The confidence interval around the END position for imprecise structural variants
CIPOS Confidence interval around POS for imprecise structural variants
DB The variant is present in the VCF file supplied with the --dbsnp option
DPR Average depth in the region surrounding the variant (+/-1bp)
ECNT Number of candidate variants in the active region, typically the number of candidate variants in the +/- 50 to 300 bp region
END The end position of the structural variant
FS Phred-scaled p-value using Fisher's exact test to detect strand bias
HCNT The number of haplotypes observed in the active region after assembly of the sequence reads
IMPRECISE The breakpoints of the structural variant are not precisely known
MATEID Breakend mate
MAX_ED Maximum edit distance between the observed haplotypes in the active region
MIN_ED Minimum edit distance between the observed haplotypes in the active region
NLOD Log odds that the variant is not present in the normal sample (confidence that the variant is not a germline variant)
NLODF Log odds that the variant is not present in the normal sample (confidence that the variant is not a germline variant) given the allele fraction in the tumor sample
PON Number of times the variant is observed in the panel of normal samples
PV The p-value from a Fisher's exact test of the number of reads supporting the reference and alternate alleles in the tumor and normal samples
PV2 The p-value from a Fisher's exact test of the number of reads supporting the reference and alternate alleles in the tumor and normal samples using only high-confidence reads
RPA The number of times the repeat is present for each allele for an indel within a short tandem repeat
RU The sequence of the repeated nucleotides for an indel within a short tandem repeat
SOMATIC The variant occurs uniquely in the sample supplied with the --tumor_sample option
SOR Symmetric Odds Ratio of 2x2 contingency table to detect strand bias
STR The variant is an expansion or contraction of a short tandem repeat
SVLEN The difference in length between REF and ALT alleles of structural variants
SVTYPE The type of structural variant
TLOD Log odds that the variant is present in the tumor sample relative to expected noise
VAF The variant allele frequency. The fraction of reads supporting the alternate allele in the tumor sample.
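
The structural variant annotations (SVTYPE, END, SVLEN) appear only on some records, alongside flags such as IMPRECISE that carry no value. The sketch below lists the records that have an SVTYPE annotation together with their END and SVLEN values. It assumes an uncompressed VCF with semicolon-separated INFO entries in column 8, and the helper name list_svs is hypothetical.

```shell
# Print CHROM, POS, SVTYPE, END and SVLEN for records annotated with SVTYPE.
# list_svs is a hypothetical helper; "." is printed when END or SVLEN is absent.
list_svs() {
    awk -F'\t' -v OFS='\t' '
    /^#/ { next }
    {
        svtype = ""; svend = "."; svlen = "."
        n = split($8, kv, ";")
        for (i = 1; i <= n; i++) {
            eq = index(kv[i], "=")
            if (eq == 0) continue              # flag without a value, e.g. IMPRECISE
            k = substr(kv[i], 1, eq - 1)
            v = substr(kv[i], eq + 1)
            if (k == "SVTYPE") svtype = v
            else if (k == "END") svend = v
            else if (k == "SVLEN") svlen = v
        }
        if (svtype != "") print $1, $2, svtype, svend, svlen
    }' "$@"
}

# Usage (file name is a placeholder):
#   list_svs OUTPUT.VCF
```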

TNscope® also populates the FILTER field for the variants.

FILTER Description
PASS The variant is confidently a somatic mutation
alt_allele_in_normal The alternate allele is present in the paired normal sample and is unlikely to be a somatic variant
clustered_events Multiple events are present on the same haplotype as the variant which is indicative of a false-positive call
germline_risk There is evidence that the variant is present in the normal sample given that the variant is present in the supplied dbSNP VCF and not present in the supplied COSMIC VCF
homologous_mapping_event More than three events are present at this locus in the tumor, which is indicative of a false-positive call
low_t_alt_frac The variant is filtered due to a low alternate allele fraction in the tumor sample
multi_event_alt_allele_in_normal Multiple events are present in the tumor sample and the alternate allele appears in the normal sample
panel_of_normals The mutation is observed in at least two samples in the panel of normals
str_contraction The mutation is a contraction of a short tandem repeat
t_lod_fstar The mutation does not have significant support above noise
triallelic_site The mutation occurs at a triallelic site

Standard genotype fields are defined by the format specification. However, TNscope® also outputs the following non-standard fields.

GENOTYPE field Description
AF Fraction of reads supporting the alternate allele
AFDP Read depth used to calculate AF
AFDPLOWMQ Read depth used to calculate AF including reads with low mapping quality
AFLOWMQ Allele fraction of the event in the tumor including low mapq reads
ALT_F1R2 The number of reads in the F1R2 orientation supporting the alternate allele
ALT_F2R1 The number of reads in the F2R1 orientation supporting the alternate allele
ALTHC Depth of reads supporting the highest confidence alternate allele
ALTHCLOWMQ Depth of reads supporting the highest confidence alternate allele including reads with low mapping quality
BaseQRankSumPS Z-score from Wilcoxon rank sum test of Alt vs. Ref base qualities per sample
ClippingRankSumPS Z-score from Wilcoxon rank sum test of Alt vs. Ref number of hard clipped bases per sample
DPHC Depth of high-confidence reads supporting the reference or alternate allele
DPHCLOWMQ Depth of high-confidence reads supporting the reference or alternate allele including reads with low mapping quality
FOXOG The fraction of alt reads indicating OxoG error. OxoG error is induced by DNA oxidation during library preparation and is a frequent source of false-positive calls. See PMID: 23303777.
MQRankSumPS Z-score from Wilcoxon rank sum test of Alt vs. Ref read mapping qualities per sample
NBQPS Mean Neighboring Base Quality, including 5bp on both sides per sample
PGT Physical phasing haplotype information, describing how the alternate alleles are phased in relation to one another
PID Physical phasing ID information, connecting records within a phasing group by using unique IDs within a given sample, but not across samples
QSS Sum of base quality scores for each allele
ReadPosEndDistPS Z-score from Wilcoxon rank sum test of mean distance from either end of read per sample
ReadPosRankSumPS Z-score from Wilcoxon rank sum test of Alt vs. Ref read position bias per sample
REF_F1R2 The number of reads in the F1R2 orientation supporting the reference allele
REF_F2R1 The number of reads in the F2R1 orientation supporting the reference allele