6. Typical usage for RNA variant calling¶
One of the typical uses of Sentieon® Genomics software is to perform the bioinformatics pipeline for RNA analysis recommended in the Broad institute best practices described in https://gatk.broadinstitute.org/hc/en-us/articles/360035531192-RNAseq-short-variant-discovery-SNPs-Indels-.
6.1. General¶
In this bioinformatics pipeline you will need the following inputs:
- The FASTA file containing the nucleotide sequence of the reference genome corresponding to the sample you will analyze.
- One or multiple FASTQ files containing the nucleotide sequence of the sample to be analyzed. These files contain the raw reads from the DNA sequencing. The software supports inputting FASTQ files compressed using GZIP. The software only supports files containing quality scores in Sanger format (Phred+33).
- (Optional) The Single Nucleotide Polymorphism database (dbSNP) data that you want to include in the pipeline. The data is used in the form of a VCF file; you can use a VCF file compressed with bgzip and indexed.
- (Optional) Multiple collections of known sites that you want to include in the pipeline. The data is used in the form of a VCF file; you can use a VCF file compressed with bgzip and indexed.
The following steps compose the typical bioinformatics pipeline:
- Map reads to reference: This step aligns the reads contained in the FASTQ files to map to a reference genome contained in the FASTA file. This step ensures that the data can be placed in context.
- Remove duplicates: This step detects reads indicative that the same RNA molecules were sequenced several times. These duplicates are not informative and should not be counted as additional evidence.
- Split reads at junction: This step splits the RNA reads into exon segments by getting rid of Ns while maintaining grouping information, and hard-clips any sequences overhanging into the intron regions. Additionally, the step will reassign the mapping qualities from STAR to be consistent with what is expected in subsequent steps by converting from quality 255 to 60.
- (Optional) Base quality score recalibration (BQSR): This step modifies the quality scores assigned to individual read bases of the sequence read data. This action removes experimental biases caused by the sequencing methodology.
- Variant calling: This step identifies the sites where your data displays variation relative to the reference genome, and calculates genotypes for each sample at that site.
6.2. Step by step usage¶
6.2.1. Map reads to reference¶
A single command is run to efficiently perform the alignment using STAR, and the creation of the BAM file and the sorting using Sentieon® software:
sentieon STAR --runThreadN NUMBER_THREADS --genomeDir STAR_REFERENCE \
--readFilesIn SAMPLE SAMPLE2 --readFilesCommand "zcat" \
--outStd BAM_Unsorted --outSAMtype BAM Unsorted --outBAMcompression 0 \
--outSAMattrRGline ID:GROUP_NAME SM:SAMPLE_NAME PL:PLATFORM \
--twopassMode Basic --twopass1readsN -1 --sjdbOverhang READ_LENGTH_MINUS_1 \
| sentieon util sort -r REFERENCE -o SORTED_BAM -t NUMBER_THREADS -i -
Inputs and options for STAR are described in its manual.
The following inputs are required for the command:
- NUMBER_THREADS: the number of computer threads that will be used in the calculation. We recommend that the number does not exceed the number of computing cores available in your system. We recommend that you use the same number of threads for both STAR and for the util binary.
- STAR_REFERENCE: the location of the reference FASTA file STAR genomeDir. You should make sure that all the necessary reference data required by STAR is available in the same location with consistent naming.
- REFERENCE: the location of the reference FASTA file. You should make sure that the FASTA file and corresponding FAI index file match the build and naming convention of the one in the STAR_REFERENCE genomeDir.
- SAMPLE: the location of the sample FASTQ file. If the data comes from pair ended sequencing technology, you will also need to input SAMPLE2 as the corresponding pair sample FASTQ file. You will need to make sure that the program used in option --readFilesCommand matches the compression status of the input FASTQ, i.e.: if the FASTQ files are compressed, you should use zcat, while if the FASTQ files are not compressed, you should use cat.
- GROUP_NAME: Readgroup identifier that will be added to the
readgroup header line. The
RG:ID
needs to be unique among all the datasets that you plan on using. - SAMPLE_NAME: name of the sample that will be added to the readgroup header line.
- PLATFORM: name of the sequencing platform used to sequence the DNA. Possible options are ILLUMINA when the fastq files have been produced in an IlluminaTM machine; IONTORRENT when the fastq files have been produced in a Life TechnologiesTM Ion-TorrentTM machine; ELEMENT when the fastq files have been produced in an Element BiosciencesTM machine; DNBSEQ when the fastq files have been produced in a MGITM machine; ULTIMA when the fastq files have been produced in an Ultima GenomicsTM machine.
- READ_LENGTH_MINUS_1: the read length of your input data, minus 1.
- SORTED_BAM: the location and filename of the sorted mapped BAM output file. A corresponding index file (.bai) will be created.
6.2.2. Remove or mark duplicates¶
The Remove duplicates stage has identical usage to the DNAseq stage; for this stage, you can refer to Section 2.2 for detailed usage instructions.
6.2.3. Split reads at junction¶
A single command is run to perform the splitting of reads into exon segments and reassigning the mapping qualities from STAR.
sentieon driver -t NUMBER_THREADS -r REFERENCE -i DEDUPED_BAM \
--algo RNASplitReadsAtJunction --reassign_mapq 255:60 SPLIT_BAM
The following inputs are required for the command:
- NUMBER_THREADS: the number of computer threads that will be used in the calculation. We recommend that the number does not exceed the number of computing cores available in your system.
- REFERENCE: the location of the reference FASTA file. You should make sure that the reference is the same as the one used in the mapping stage.
- DEDUPED_BAM: the location where the previous deduping stage stored the result.
- SPLIT_BAM: the location and filename of the BAM output file containing the split reads. A corresponding index file (.bai) will be created.
6.2.4. Base quality score recalibration (BQSR; optional)¶
The Base quality score recalibration (BQSR) stage has identical usage to the DNAseq stage; for this stage, you can refer to Section 2.2 for detailed usage instructions.
6.2.5. RNA Variant calling¶
A single command is run to call variants and additionally apply the BQSR
calculated before. The RNA variant calling can be done using either the
Haplotyper algorithm or the DNAscope algorithm. For the command you should use
the option --trim_soft_clip
and a lower minimum phred-scaled confidence
threshold than for DNAseq® variant calling, which means you should set call_conf
to
20 and emit_conf
to 20 instead of the default of 30.
sentieon driver -t NUMBER_THREADS -r REFERENCE -i SPLIT_BAM \
[-q RECAL_DATA.TABLE] --algo Haplotyper --trim_soft_clip \
--call_conf 20 --emit_conf 20 [-d dbSNP] VARIANT_VCF
If you want to use DNAscope for the calling, the command would be:
sentieon driver -t NUMBER_THREADS -r REFERENCE -i SPLIT_BAM \
[-q RECAL_DATA.TABLE] --algo DNAscope --trim_soft_clip \
--call_conf 20 --emit_conf 20 [-d dbSNP] VARIANT_VCF
The following inputs are required for the command:
- NUMBER_THREADS: the number of computer threads that will be used in the calculation. We recommend that the number does not exceed the number of computing cores available in your system.
- REFERENCE: the location of the reference FASTA file. You should make sure that the reference is the same as the one used in the mapping stage.
- SPLIT_BAM: the location where the previous RNASplitReadsAtJunction stage stored the result.
- VARIANT_VCF: the location and filename of the variant calling output file. A corresponding index file will be created. The tool will output a compressed file by using .gz extension.
The following inputs are optional for the command:
- RECAL_DATA.TABLE: the location where the previous BQSR stage stored the result.
- dbSNP: the location of the Single Nucleotide Polymorphism database (dbSNP) that will be used to label known variants. You can only use one dbSNP file.