BWA binary

The BWA binary performs alignment of DNA-seq data.

Sentieon® version 202112 fixes a bug in the original bwa mem that resulted in an incorrect MAPQ for a small number of alignments. A compatibility environment variable, ksw_compat=1, is provided; setting it causes Sentieon® bwa mem to behave as described in http://bio-bwa.sourceforge.net/bwa.shtml, producing an incorrect MAPQ for some alignments. The default behavior implements the correct MAPQ calculation for all alignments and provides a speedup on some newer hardware.
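As a minimal sketch, assuming a POSIX shell, the compatibility mode could be switched on for every bwa mem command started later in the same script by exporting the variable once:

```shell
# Opt in to the legacy (original bwa mem) MAPQ behavior for all
# bwa mem commands launched from this shell after this line.
export ksw_compat=1
```

Leaving the variable unset keeps the default, corrected MAPQ calculation.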

The BWA binary has two modes of interest: “mem” mode, which aligns FASTQ files against a reference FASTA file, and “shm” mode, which loads the FASTA index file into memory so that it can be shared among multiple BWA processes running on the same server.

BWA mem syntax

You can run the following command to align a single-ended FASTQ1 file or a pair-ended set of 2 FASTQ files against the FASTA reference. The command writes the mapped reads to stdout, to be piped into util sort:

<SENTIEON_FOLDER>/bin/sentieon bwa mem OPTIONS FASTA FASTQ1 [FASTQ2]

The arguments (OPTIONS) for this command include:

  • -t NUMBER_THREADS: number of computing threads that will be used by the software to run parallel processes. The argument is optional; if omitted the bwa binary will use 1 thread.

  • -p: determines whether the first input FASTQ file contains interleaved pair-ended reads. If this argument is used, only use a single FASTQ input, as the second FASTQ2 file will be ignored.

  • -M: determines whether to mark split reads as secondary.

  • -R READGROUP_STRING: Read Group header line that all reads will be attached to. The recommended READGROUP_STRING is @RG\tID:$readgroup\tSM:$sample\tPL:$platform\tPU:$platform_unit

    • $readgroup is a unique ID that identifies the reads.

    • $sample is the name of the sample the reads belong to.

    • $platform is the sequencing technology, typically ILLUMINA.

    • $platform_unit is the sequencing element that performed the sequencing.

  • -K CHUNK_SIZE: determines the size of the group of reads that will be mapped at the same time. If this argument is not set, the results will depend on the number of threads used.
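The options above can be combined into a single alignment command. The following is a sketch only: the install location, the sample metadata values, the thread count and the helper name align_pair are all hypothetical placeholders, and the -K value and the util sort invocation are copied from the BAM realignment example elsewhere in this section.

```shell
#!/bin/sh
# Placeholder values; replace them with your own paths and sample metadata.
SENTIEON_FOLDER="${SENTIEON_FOLDER:-/path/to/sentieon}"   # hypothetical install location
readgroup="rg1"
sample="sample1"
platform="ILLUMINA"
platform_unit="unit1"
nthreads=32

# Inside double quotes the shell keeps "\t" as a literal backslash followed
# by "t", which is the form used by the recommended READGROUP_STRING.
rg="@RG\tID:${readgroup}\tSM:${sample}\tPL:${platform}\tPU:${platform_unit}"

# align_pair FASTA FASTQ1 FASTQ2 OUTPUT_BAM  (hypothetical helper)
# Aligns a pair of FASTQ files and pipes the mapped reads into util sort.
align_pair() {
    "$SENTIEON_FOLDER/bin/sentieon" bwa mem -t "$nthreads" -M -K 1000000 \
        -R "$rg" "$1" "$2" "$3" |
    "$SENTIEON_FOLDER/bin/sentieon" util sort -t "$nthreads" -o "$4" --sam2bam -
}
```

The helper would then be called as, for example, align_pair reference.fasta sample_1.fastq.gz sample_2.fastq.gz sorted.bam, where all four file names are illustrative. Setting -K explicitly keeps the results independent of the number of threads.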

BWA shm syntax

You can run the following command to load the FASTA index file in memory:

<SENTIEON_FOLDER>/bin/sentieon bwa shm FASTA

You can run the following command to list the FASTA index files stored in memory:

<SENTIEON_FOLDER>/bin/sentieon bwa shm -l

You can run the following command to remove all FASTA index files stored in memory, freeing that memory when it is no longer needed:

<SENTIEON_FOLDER>/bin/sentieon bwa shm -d

The arguments (OPTIONS) for this command include:

  • -t NUMBER_THREADS: number of computing threads that will be used by the software to run parallel processes. The argument is optional; if omitted the bwa binary will use 1 thread.

  • -f FILE: location of a temporary file that will be used to reduce peak memory usage.
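The three shm commands are typically used together around a batch of alignment jobs. The sketch below shows one possible way to do that; the helper name with_shared_index and the install location are hypothetical, and only the shm invocations already shown in this section are used.

```shell
#!/bin/sh
SENTIEON_FOLDER="${SENTIEON_FOLDER:-/path/to/sentieon}"   # hypothetical install location

# with_shared_index FASTA COMMAND [ARGS...]  (hypothetical helper)
# Loads the FASTA index into memory, runs COMMAND, then frees the memory.
with_shared_index() {
    fasta=$1
    shift
    # Load the index; give up if loading fails.
    "$SENTIEON_FOLDER/bin/sentieon" bwa shm "$fasta" || return 1
    # List the indices that are now held in memory.
    "$SENTIEON_FOLDER/bin/sentieon" bwa shm -l
    # Run the caller's alignment jobs and remember their exit status.
    "$@"
    status=$?
    # Remove all indices from memory, even if the jobs failed.
    "$SENTIEON_FOLDER/bin/sentieon" bwa shm -d
    return "$status"
}
```

Because bwa shm -d removes all indices stored in memory, a wrapper like this is only appropriate when no other user of the server depends on a loaded index.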

Controlling memory usage in BWA

By default, BWA uses about 24 GB on a Linux system and 8 GB on a Mac system. You can control memory usage via the bwt_max_mem environment variable, either to improve speed by using more memory or to limit memory usage at the expense of speed. For example, you will get faster alignment by adding the following to your scripts:

export bwt_max_mem=50G

Bear in mind that the number you use in the bwt_max_mem environment variable is not a hard limit, but an estimate of the memory used by BWA. If BWA memory usage never drops below a certain value, that value is the minimum memory required for the specific reference; setting bwt_max_mem to a value smaller than this minimum will not change the memory usage of the BWA mem job.
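If only one job should use a different memory setting, standard shell scoping can be used instead of a global export. The sketch below illustrates the mechanism with a child shell standing in for the bwa mem command; the variable names seen_by_child and seen_by_parent are illustrative only.

```shell
#!/bin/sh
# Start from a clean state so the demonstration is unambiguous.
unset bwt_max_mem

# A VAR=value prefix applies to that one command only. Here a child shell
# stands in for the bwa mem command and reports the value it receives.
seen_by_child=$(bwt_max_mem=50G sh -c 'printf %s "$bwt_max_mem"')

# The calling shell is unchanged, so later jobs fall back to the default.
seen_by_parent=${bwt_max_mem:-unset}

echo "child: $seen_by_child, parent: $seen_by_parent"
# prints "child: 50G, parent: unset"
```

In a real script the same prefix would be placed directly in front of the bwa mem command line.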

Using an existing BAM file as input

If you do not have access to the FASTQ inputs, but only have an already aligned and sorted BAM file, you can use it as input and redo the alignment by running samtools:

samtools collate -@ 32 -Ou INPUT_BAM tmp- | samtools fastq -@ 32 -s \
/dev/null -0 /dev/null - | <SENTIEON_FOLDER>/bin/sentieon bwa mem -t 32 -R \
'@RG\tID:id\tLB:lib\tSM:sample\tPL:ILLUMINA' -M -K 1000000 -p $ref /dev/stdin \
| <SENTIEON_FOLDER>/bin/sentieon util sort -t 32 -o OUTPUT_BAM --sam2bam -

Alternatively, you could first create the FASTQ files and then process them as you would normally do:

samtools collate -@ 32 -uO INPUT_BAM tmp- | samtools fastq -@ 32 \
-s >(gzip -c > single.fastq.gz) -0 >(gzip -c > unpaired.fastq.gz) \
-1 >(gzip -c > output_1.fastq.gz) -2 >(gzip -c > output_2.fastq.gz) -

If you do this, you may encounter abnormal memory usage in BWA; if that is the case, follow the instructions in “BWA uses an abnormal amount of memory when using FASTQ files created from a BAM file”.