7. Typical usage for RNA variant calling¶
One of the typical uses of Sentieon Genomics software is to perform the bioinformatics pipeline for RNA analysis recommended in the Broad institute best practices described in http://gatkforums.broadinstitute.org/wdl/discussion/3892/the-gatk-best-practices-for-variant-calling-on-rnaseq-in-full-detail.
7.1. General¶
In this bioinformatics pipeline you will need the following inputs:
- The FASTA file containing the nucleotide sequence of the reference genome corresponding to the sample you will analyze.
- A BAM file containing the information for the sample to be analyzed. These files contain the reads from the RNA sequencing aligned using STAR.
- (Optional) The Single Nucleotide Polymorphism database (dbSNP) data that you want to include in the pipeline. The data is used in the form of a VCF file; you can use a VCF file compressed with bgzip and indexed.
- (Optional) Multiple collections of known sites that you want to include in the pipeline. The data is used in the form of a VCF file; you can use a VCF file compressed with bgzip and indexed.
The following steps compose the typical bioinformatics pipeline:
- Remove duplicates: This step detects reads indicative that the same RNA molecules were sequenced several times. These duplicates are not informative and should not be counted as additional evidence.
- Split reads at junction: This step splits the RNA reads into exon segments by getting rid of Ns while maintaining grouping information, and hard-clips any sequences overhanging into the intron regions. Additionally, the step will reassign the mapping qualities from STAR to be consistent with what is expected in subsequent steps by converting from quality 255 to 60.
- Indel realignment: This step performs a local realignment around indels. This step is necessary as reads mapped on the edges of indels often get mapped with mismatching bases that are mapping artifacts.
- Base quality score recalibration (BQSR): This step modifies the quality scores assigned to individual read bases of the sequence read data. This action removes experimental biases caused by the sequencing methodology.
- Variant calling: This step identifies the sites where your data displays variation relative to the reference genome, and calculates genotypes for each sample at that site.
7.2. Step by step usage¶
The Remove duplicates stage, Indel realignment stage and Base quality score recalibration (BQSR) stages have identical usage to the DNA seq stages; for these stages, you can refer to Section 3.2 for detailed usage instructions.
7.2.1. Split reads at junction¶
A single command is run to perform the splitting of reads into exon segments and reassigning the mapping qualities from STAR.
sentieon driver -t NUMBER_THREADS -r REFERENCE -i DEDUP_BAM \
--algo RNASplitReadsAtJunction --reassign_mapq 255:60 SPLIT_BAM
The following inputs are required for the command:
- NUMBER_THREADS: the number of computer threads that will be used in the calculation. We recommend that the number does not exceed the number of computing cores available in your system.
- REFERENCE: the location of the reference FASTA file. You should make sure that the reference is the same as the one used in the mapping stage.
- DEDUP_BAM: the location where the previous deduping stage stored the result.
- SPLIT_BAM: the location and filename of the BAM output file containing the split reads. A corresponding index file (.bai) will be created.
7.2.2. RNA Variant calling¶
A single command is run to call variants and additionally apply the BQSR
calculated before. The RNA variant calling should be done using the
Haplotyper algorithm; additionally you should use the option
--trim_soft_clip
and a lower minimum phred-scaled confidence threshold
than for DNAseq variant calling; you should set call_conf
to 20 and
emit_conf
to 20 instead of the default of 30.
sentieon driver -t NUMBER_THREADS -r REFERENCE -i REALIGNED_BAM \
-q RECAL_DATA.TABLE --algo Haplotyper --trim_soft_clip \
--call_conf 20 --emit_conf 20 [-d dbSNP] VARIANT_VCF
The following inputs are required for the command:
- NUMBER_THREADS: the number of computer threads that will be used in the calculation. We recommend that the number does not exceed the number of computing cores available in your system.
- REFERENCE: the location of the reference FASTA file. You should make sure that the reference is the same as the one used in the mapping stage.
- REALIGNED_BAM: the location where the previous realignment stage stored the result.
- RECAL_DATA.TABLE: the location where the previous BQSR stage stored the result.
- VARIANT_VCF: the location and filename of the variant calling output file. A corresponding index file (.idx) will be created. The tool will output a compressed file by using .gz extension.
The following inputs are optional for the command:
- dbSNP: the location of the Single Nucleotide Polymorphism database (dbSNP) that will be used to label known variants. You can only use one dbSNP file.