Troubleshooting

Preparing reference file for use

If your reference FASTA file has not been pre-processed, so that the data specified in Table 2.1 is not available to the software, you will need to process it as explained in https://www.broadinstitute.org/gatk/guide/article?id=2798.

You will need to do the following steps:

  1. Generate the BWA index using BWA. This will create the “.fasta.amb”, “.fasta.ann”, “.fasta.bwt”, “.fasta.pac” and “.fasta.sa” files.

    sentieon bwa index reference.fasta
    
  2. Generate the FASTA file index using samtools. This will create the “.fasta.fai” file.

    samtools faidx reference.fasta
    
  3. Generate the sequence dictionary using Picard. This will create the “.dict” file.

    java -jar picard.jar CreateSequenceDictionary REFERENCE=reference.fasta \
        OUTPUT=reference.dict
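
After running the three steps above, it can help to confirm that every companion file exists before starting a pipeline. The following is a minimal sketch, not part of the Sentieon® software: the function name check_reference_files is hypothetical, and it assumes the reference name ends in ".fasta" so that the dictionary is named as in the Picard command.

```shell
# Hypothetical helper: report which companion files of a reference FASTA are
# missing or empty. The suffixes follow the three preparation steps; the .dict
# name assumes the ".fasta" extension is replaced, as in the Picard command.
check_reference_files() {
    ref=$1
    missing=0
    for f in "$ref.amb" "$ref.ann" "$ref.bwt" "$ref.pac" "$ref.sa" \
             "$ref.fai" "${ref%.fasta}.dict"; do
        if [ ! -s "$f" ]; then
            echo "missing: $f"
            missing=1
        fi
    done
    return "$missing"
}
```

A call such as check_reference_files reference.fasta prints one line per missing file and returns a non-zero status if any are absent.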

Preparing RefSeq file for use

RefSeq files are used to aggregate the results of the CoverageMetrics algorithm to the gene level.

In order to use RefSeq files downloaded from the UCSC Genome Browser, they need to be sorted by chromosome and locus. To perform the sorting you will need to do the following steps:

  1. Strip the header from the file

    grep -v "^#" FILE.refSeq > FILE.refSeq.headerless
    grep -e "^#" FILE.refSeq > FILE.refSeq.header
    
  2. Sort the loci first using Unix sort.

    sort -k 5 -n FILE.refSeq.headerless > FILE.refSeq.presorted
    
  3. Use GATK sortByRef.pl (available from https://raw.githubusercontent.com/broadgsa/gatk/3.4/public/perl/sortByRef.pl) using the FASTA index fai to sort by chromosome.

    perl sortByRef.pl --k 3 FILE.refSeq.presorted FASTA.fai --tmp ~/tmp \
        > FILE_sorted_headerless.refSeq
    
  4. Put the header back into the file.

    cat FILE.refSeq.header > FILE_sorted.refSeq
    cat FILE_sorted_headerless.refSeq >> FILE_sorted.refSeq
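
The result of the sorting steps can be checked before the file is used. The following is a minimal sketch under stated assumptions: the function name check_refseq_sorted is hypothetical, and the file is assumed to be tab-separated with the chromosome in column 3 and the position in column 5, mirroring the --k 3 and -k 5 options used above.

```shell
# Hypothetical check: confirm a RefSeq file follows the contig order of the
# FASTA index and, within a contig, is ordered by the position column.
# Assumes tab-separated columns: contig in column 3, position in column 5.
check_refseq_sorted() {
    refseq=$1
    fai=$2
    awk -F '\t' '
        NR == FNR { rank[$1] = FNR; next }   # first file: contig order from the .fai
        /^#/ { next }                        # skip header lines
        !($3 in rank) { print "unknown contig: " $3; bad = 1; exit }
        {
            r = rank[$3]; p = $5 + 0
            if (r < last_r || (r == last_r && p < last_p)) {
                print "out of order at line " FNR; bad = 1; exit
            }
            last_r = r; last_p = p
        }
        END { exit bad }
    ' "$fai" "$refseq"
}
```

For example, check_refseq_sorted FILE_sorted.refSeq FASTA.fai returns a non-zero status and names the first offending line if the order is wrong.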
    

Common usage problems

Following is a list of symptoms of common problems as well as solutions for them.

Driver or Util fails with Error: can not open file (xxx) in mode(r), Too many open files

The root cause for this error is that the limit on concurrently open files is set too low on your system.

You can solve this error by raising the limit reported by ulimit -n. On a Linux-based system:

  1. Check the limit of maximum number of open files in your system, by running the following command:

    ulimit -n
    
  2. Set a higher limit by editing file /etc/security/limits.conf as root, and add the following 2 lines:

    * soft nofile 16384
    * hard nofile 16384
    
  3. If your system is running Ubuntu, you also need to add this line to your shell profile ~/.bashrc:

    ulimit -n 16384
    
  4. You need to log out of your system and log back in for the changes to take effect. After logging in, check that the change was applied correctly by running the following command:

    ulimit -n
    
  5. The command should return 16384.
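
Because limits.conf sets both a soft and a hard limit, it can be useful to check both values in the shell that will launch the software. The following is a minimal sketch; the function name check_nofile_limit is hypothetical.

```shell
# Hypothetical helper: succeed only if both the soft and the hard open-file
# limits of the current shell are at least the requested value.
check_nofile_limit() {
    want=$1
    soft=$(ulimit -S -n)
    hard=$(ulimit -H -n)
    for cur in "$soft" "$hard"; do
        if [ "$cur" != "unlimited" ] && [ "$cur" -lt "$want" ]; then
            echo "open-file limit too low: soft=$soft hard=$hard (want >= $want)"
            return 1
        fi
    done
    echo "open-file limit ok: soft=$soft hard=$hard"
}
```

Running check_nofile_limit 16384 after logging back in reports both limits and fails if either is below the value configured above.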

Driver fails with error: Contig XXX from vcf/bam is not present in the reference, or error Contig XXX has different size in vcf/bam than in the reference

The root cause for this error is that the input VCF or BAM file is incompatible with the reference FASTA file. Either there are contigs in the file not present in the reference, or the contigs have different sizes. This is most likely caused by using VCF or BAM files processed with a different reference.
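
To find out which contigs are affected, the @SQ lines of a BAM header can be compared with the FASTA index. The following is a minimal sketch: the function name compare_header_to_fai is hypothetical, and it reads the header as text on standard input so that it does not depend on a particular tool.

```shell
# Hypothetical helper: read a SAM header on stdin and report @SQ contigs that
# are absent from the FASTA index or whose length differs from it.
compare_header_to_fai() {
    fai=$1
    awk -F '\t' '
        NR == FNR { len[$1] = $2; next }     # first input: contig lengths from the .fai
        $1 == "@SQ" {
            name = ""; ln = ""
            for (i = 2; i <= NF; i++) {
                if ($i ~ /^SN:/) name = substr($i, 4)
                if ($i ~ /^LN:/) ln = substr($i, 4)
            }
            if (!(name in len)) { print "not in reference: " name; bad = 1 }
            else if (len[name] + 0 != ln + 0) {
                print "size differs: " name " (" ln " vs " len[name] ")"; bad = 1
            }
        }
        END { exit bad }
    ' "$fai" -
}
```

A typical invocation would be samtools view -H input.bam | compare_header_to_fai reference.fasta.fai, which prints nothing and returns zero when the header and the reference agree.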

Driver reports warning: Contigs in the vcf file XXX do not match any contigs in the reference

The root cause for this warning is that the input VCF file is incompatible with the reference FASTA file, and the contigs in the file are not present in the reference. This is most likely caused by using VCF files from a different reference.
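
One way to see the mismatch is to list the contig names used by the VCF records that do not appear in the FASTA index. The following is a minimal sketch: the function name vcf_contigs_missing_from_fai is hypothetical, and it expects an uncompressed VCF on standard input.

```shell
# Hypothetical helper: read an uncompressed VCF on stdin and print each contig
# name used by its records that is not listed in the FASTA index.
vcf_contigs_missing_from_fai() {
    fai=$1
    awk -F '\t' '
        NR == FNR { known[$1] = 1; next }    # first input: contig names from the .fai
        /^#/ { next }                        # skip VCF meta and header lines
        !($1 in known) && !seen[$1]++ { print $1 }
    ' "$fai" -
}
```

For a compressed file, something like gzip -dc INPUT.vcf.gz | vcf_contigs_missing_from_fai reference.fasta.fai can be used. A common pattern in the output is a naming difference such as "1" in the VCF versus "chr1" in the reference.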

License message: No more license available for Sentieon…

This message is produced when you request to run the Sentieon® software with more threads than your license currently allows. This happens when the commands you are running concurrently collectively request more threads than the number of cores your license supports.

The Sentieon® commands will be idle while waiting for free licenses, but the commands will not fail.

Driver fails with error: Readgroup XX is present in multiple BAM files with different attributes

This error is produced when you input two different BAM files containing readgroups with the same ID but different attributes, for instance when the tumor and normal sample BAM files used in TNseq® or TNscope® both have an RG ID of “1”.

Before you are able to use the BAM files you will need to edit them to make the RG ID unique, for instance by adding the SM name to the RG ID. You can check Section 8.7 for an example of a work-around for this issue.

Alternatively, you can use the samtools addreplacerg functionality to modify the RG ID of the input BAM files and make them unique:

# Add the new RG and update all reads in the BAM file
RGtag=$(samtools view -H "$INPUT_BAM" | grep "^@RG" | sed "s|ID:$ORIG_RGID|ID:$NEW_RGID|g")
samtools addreplacerg -r "$RGtag" -o "$TMP_BAM" "$INPUT_BAM"
# Reheader the BAM to remove the original RG that is no longer used
samtools view -H "$TMP_BAM" | grep -v "^@RG.*$ORIG_RGID" \
    | samtools reheader - "$TMP_BAM" > "$OUTPUT_BAM"
rm "$TMP_BAM"
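
Before editing any file, it can help to confirm which RG IDs actually collide. The following is a minimal sketch: the function names rg_ids and shared_rg_ids are hypothetical, and both work on header text previously saved with a command such as samtools view -H.

```shell
# Hypothetical helper: print the ID of every @RG line read from stdin,
# sorted and de-duplicated.
rg_ids() {
    awk -F '\t' '$1 == "@RG" {
        for (i = 2; i <= NF; i++) if ($i ~ /^ID:/) print substr($i, 4)
    }' | sort -u
}

# Hypothetical helper: print the RG IDs that occur in both header files.
shared_rg_ids() {
    a=$(mktemp)
    b=$(mktemp)
    rg_ids < "$1" > "$a"
    rg_ids < "$2" > "$b"
    comm -12 "$a" "$b"
    rm -f "$a" "$b"
}
```

With the headers saved as tumor.header.sam and normal.header.sam (example names), shared_rg_ids tumor.header.sam normal.header.sam lists the IDs that are candidates for renaming.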

Driver reports warning: none of the QualCal tables is applicable to the input BAM files

This warning means that none of the information in the recalibration table input file can be applied to the input BAM file, which is likely due to using a recalibration table that does not correspond to the BAM file.

This warning can also be produced when the input BAM file to QualCal does not have the correct fields in its RG header lines. For instance, this could happen if the PL tag of the RG is set to something other than ILLUMINA; in this case, you will need to modify the BAM header to add the missing fields or correct the wrong ones, for which you can use the samtools reheader functionality.
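
The header edit described above can be sketched as a small filter over the header text. This is an illustration only: the function name set_rg_platform is hypothetical, and it assumes that every read group in the file should receive the same PL value.

```shell
# Hypothetical filter: rewrite a SAM header read on stdin so that every @RG
# line carries the given PL value, replacing an existing PL tag or appending
# one when it is missing. All other header lines pass through unchanged.
set_rg_platform() {
    awk -F '\t' -v OFS='\t' -v pl="$1" '
        $1 == "@RG" {
            found = 0
            for (i = 2; i <= NF; i++)
                if ($i ~ /^PL:/) { $i = "PL:" pl; found = 1 }
            if (!found) $(NF + 1) = "PL:" pl
        }
        { print }
    '
}
```

The corrected header could then be applied along these lines, with the file names as placeholders:

    samtools view -H input.bam | set_rg_platform ILLUMINA > header.sam
    samtools reheader header.sam input.bam > output.bam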

BWA uses an abnormal amount of memory when using FASTQ files created from a BAM file

When you use FASTQ files created by converting an already sorted BAM file, it may happen that all the unmapped reads are grouped together at the end of the FASTQ inputs. In that case, BWA may use an abnormal amount of memory at the end of the alignment because poorly mapped or unmappable reads require additional memory.

In order to reduce the abnormal memory usage, you should first re-sort the BAM file to make sure that the unmapped reads are not grouped together. You can use samtools to do that:

samtools sort -n -@ 32 input.bam | samtools fastq -@ 32 \
-s >(gzip -c > single.fastq.gz) -0 >(gzip -c > unpaired.fastq.gz) \
-1 >(gzip -c > output_1.fastq.gz) -2 >(gzip -c > output_2.fastq.gz) -

BWA fails with the error: Killed

This error is produced when BWA receives a SIGKILL signal from the operating system. If the system is low on available memory, SIGKILL may have been sent by the kernel's Out Of Memory (OOM) manager. You can check the kernel logs on your system to confirm the SIGKILL signal was sent by the OOM manager.

To resolve this error, you can reduce BWA's memory usage with the bwt_max_mem environment variable. You can refer to Controlling memory usage in BWA for more information.
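
The kernel-log check mentioned above can be sketched as a filter over the log text. This is a rough illustration: the function name oom_events is hypothetical, the default process name bwa is an assumption, and the patterns reflect common Linux kernel message wording that may differ between kernel versions.

```shell
# Hypothetical filter: keep kernel-log lines that look like OOM-killer activity
# and mention the given process name (default: bwa). The patterns are
# assumptions about typical kernel messages and may need adjusting.
oom_events() {
    name=${1:-bwa}
    grep -iE 'out of memory|oom-kill|oom_reaper' | grep -i -- "$name"
}
```

On most Linux systems the log can be supplied with dmesg -T | oom_events or journalctl -k | oom_events; reading the kernel log may require elevated privileges.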

KPNS - Known Problems No Solutions

Lack of support for gzipped vcf files not compressed with bgzip

Normal gzipped files do not allow for random or indexed access to the information contained in them; only files compressed with bgzip are indexable. As such, the Sentieon® software does not support gzipped VCF files as input. In order to use these files you will need to uncompress them using gunzip and either use them uncompressed or recompress them with bgzip. Alternatively, you can use util vcfconvert to recompress and index the files.

sentieon util vcfconvert INPUT.vcf.gz OUTPUT.vcf.gz
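
If it is unclear whether an existing .vcf.gz file was written by bgzip or by plain gzip, the first bytes of the file can be inspected. The following is a minimal sketch: the function name is_bgzf is hypothetical, and it assumes the "BC" extra subfield that marks a bgzip block is the first subfield in the gzip header, which is how bgzip-compressed files are normally laid out.

```shell
# Hypothetical helper: succeed if the file starts with a bgzip (BGZF) block.
# A BGZF block is a gzip member with the FEXTRA flag set (bytes 1f 8b 08 04)
# whose extra field carries the subfield identifier "BC" (bytes 42 43).
is_bgzf() {
    hex=$(od -A n -t x1 -N 16 "$1" | tr -d ' \n')
    magic=$(printf '%s' "$hex" | cut -c1-8)      # ID1, ID2, CM, FLG
    subid=$(printf '%s' "$hex" | cut -c25-28)    # first extra-subfield identifier
    [ "$magic" = "1f8b0804" ] && [ "$subid" = "4243" ]
}
```

For example, is_bgzf INPUT.vcf.gz || echo "recompress before use" flags files that need the conversion shown above.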

Lack of support for gzipped fasta files

Currently the software does not support gzipped FASTA files as input. You need to gunzip the files before using them.

FASTQ files required to have SANGER quality format

If your FASTQ files have been encoded with Illumina™ sequencing technology earlier than version 1.8, the read quality scores will not be in SANGER format, which may produce unexpected results. The Sentieon® Genomics software will not detect that you are using the unsupported format.
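
Since the software does not detect the encoding, a quick heuristic check on the quality lines can be run beforehand. The following is a minimal sketch, not a definitive test: the function name guess_fastq_encoding is hypothetical, it assumes uncompressed four-line FASTQ records on standard input, and it relies on the fact that quality characters with an ASCII code below 59 can only occur in SANGER (Phred+33) encoding.

```shell
# Hypothetical heuristic: inspect the quality line of each 4-line FASTQ record
# on stdin and report the encoding suggested by the lowest quality character.
guess_fastq_encoding() {
    awk '
        BEGIN { for (i = 33; i <= 126; i++) ord[sprintf("%c", i)] = i; min = 127 }
        NR % 4 == 0 {                        # every fourth line holds the qualities
            n = length($0)
            for (i = 1; i <= n; i++) {
                ch = substr($0, i, 1)
                if ((ch in ord) && ord[ch] < min) min = ord[ch]
            }
        }
        END {
            if (min == 127) print "unknown"
            else if (min < 59) print "phred33"
            else print "possibly-phred64"
        }
    '
}
```

A sample of reads is usually enough, for instance gzip -dc sample_1.fastq.gz | head -n 40000 | guess_fastq_encoding, where the file name is a placeholder. A result of possibly-phred64 is only a hint, because SANGER-encoded data with uniformly high qualities produces the same answer.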

Driver fails with error: ImportError: No module named argparse

This error is produced when running tnhapfilter in an environment where the Python version is 2.6.x and the argparse module is not present. You will need to install the argparse module into your Python installation; you can do this by running pip install argparse or by using whichever other package manager you prefer.