Welcome to the Sentieon® Software Quick Start Guide. This guide is meant to help you get started quickly. It covers setting up the environment and running your first Sentieon® job, from BWA alignment to variant calling. For more details, please see the Software Manual.
What do you need to get started?
To get started using Sentieon® software, you will need the following:
- Hardware requirements: A Linux server with the following configuration:
- Linux running one of the following distributions or higher: RedHat/CentOS 6.5, Debian 7.7, OpenSUSE-13.2, or Ubuntu-14.04.
- 16GB of memory for small panel or whole exome or 64GB for whole genome.
- (Recommended) High-speed SSD drives are preferred for ideal I/O performance to get maximum CPU utilization.
- Software requirements:
- Python 2.6.x, Python 2.7.x, or Python 3.x is required. You can check the Python version by running the following:
python --version
- Sentieon® software release package:
Download the package from the link provided by the technical support at Sentieon.
Decompress the package by running the following command, where VERSION is the version you are using, for example 202308.03:
tar xvzf sentieon-genomics-VERSION.tar.gz
- License requirements: Please see the Appendix for more details on how to set up your license. IT support may be needed.
- Environment requirements:
- If Python 2.6.x, Python 2.7.x, or Python 3.x is not the default Python version, set the following environment variable.
export SENTIEON_PYTHON=Python_location
If you are using a localhost license file, set the following environment variable, where LICENSE_DIR is where the license file is located, and LICENSE_FILE.lic is the license file name.
export SENTIEON_LICENSE=LICENSE_DIR/LICENSE_FILE.lic
If you are using a license server, set the following environment variable, where LICSRVR_HOST and LICSRVR_PORT are the hostname and port of the license server. Please see the Appendix for more details.
export SENTIEON_LICENSE=LICSRVR_HOST:LICSRVR_PORT
For convenience, set the binary path as shown below, where PATH_TO_SENTIEON_BINARY_DIRECTORY is where the Sentieon® binary is installed.
export SENTIEON_INSTALL_DIR=PATH_TO_SENTIEON_BINARY_DIRECTORY
For improved performance when using NFS storage, set the SENTIEON_TMPDIR environment variable to point to fast local scratch storage.
export SENTIEON_TMPDIR=/tmp
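The settings above can be sanity-checked before launching a job. The following is a minimal sketch, not part of the Sentieon® package: `check_sentieon_env` is a hypothetical helper name, and the only layout it assumes is the `bin/sentieon` executable under the installation directory, as used in the Appendix commands.

```shell
#!/bin/sh
# Sketch: verify the environment variables described above.
# check_sentieon_env is a hypothetical helper, not a Sentieon command.
check_sentieon_env() {
    status=0

    # The installation directory should contain the bin/sentieon executable.
    if [ ! -x "${SENTIEON_INSTALL_DIR:-}/bin/sentieon" ]; then
        echo "SENTIEON_INSTALL_DIR does not contain an executable bin/sentieon" >&2
        status=1
    fi

    # The license is either a file path or HOST:PORT; here we only check it is set.
    if [ -z "${SENTIEON_LICENSE:-}" ]; then
        echo "SENTIEON_LICENSE is not set" >&2
        status=1
    fi

    # The temporary directory (default /tmp in this sketch) must be writable.
    tmp=${SENTIEON_TMPDIR:-/tmp}
    if [ ! -d "$tmp" ] || [ ! -w "$tmp" ]; then
        echo "temporary directory is not writable: $tmp" >&2
        status=1
    fi

    return "$status"
}
```

Calling `check_sentieon_env && echo "environment looks OK"` after the `export` lines reports each problem on standard error and returns a non-zero status if any check fails.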
Start your first Sentieon® DNAscope job
Sentieon® Inc. provides a quick start package that includes a sample script and data to help you quickly test the installation and to diagnose potential problems.
The quick start package includes data for a single chromosome: both the sequence data of a sample and the reference materials. The job script runs the Sentieon DNAscope pipeline on a set of paired-end Illumina FASTQ files:
- BWA: Map reads to the reference.
- Metrics and LocusCollector: Collect read statistics.
- Dedup: Remove duplicate reads.
- Variant calling: DNAscope variant calling.
Note
DNAscope is only recommended for use with samples from diploid organisms. For other samples, please use DNAseq.
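For orientation, the four stages can be sketched as a shell function. This is an illustration under stated assumptions, not the contents of the shipped script: the function name and argument order are hypothetical, the flags are recalled from the Sentieon® manual and should be checked against `sentieon_quickstart.sh`, the separate model application step is an assumption, and `sentieon` is assumed to resolve on `PATH`.

```shell
#!/bin/sh
# Sketch only: stage order follows the list above. Flags are assumptions recalled
# from the Sentieon manual; the shipped sentieon_quickstart.sh is the reference.
# Assumes `sentieon` is on PATH (for example PATH=$SENTIEON_INSTALL_DIR/bin:$PATH).
quickstart_pipeline() {
    fasta=$1; fastq_1=$2; fastq_2=$3; model=$4   # hypothetical argument order
    nt=${nt:-16}
    rg="@RG\tID:${group:-read_group_name}\tSM:${sample:-sample_name}\tPL:${platform:-ILLUMINA}"

    # 1. BWA: map reads to the reference and write a coordinate-sorted BAM.
    sentieon bwa mem -R "$rg" -t "$nt" "$fasta" "$fastq_1" "$fastq_2" |
        sentieon util sort -r "$fasta" -o sorted.bam -t "$nt" --sam2bam -i - || return 1

    # 2. Metrics and LocusCollector: collect read statistics and duplicate scores.
    sentieon driver -r "$fasta" -t "$nt" -i sorted.bam \
        --algo AlignmentStat aln_metrics.txt \
        --algo InsertSizeMetricAlgo is_metrics.txt || return 1
    sentieon driver -t "$nt" -i sorted.bam \
        --algo LocusCollector --fun score_info score.txt || return 1

    # 3. Dedup: remove duplicate reads using the scores from LocusCollector.
    sentieon driver -t "$nt" -i sorted.bam \
        --algo Dedup --rmdup --score_info score.txt deduped.bam || return 1

    # 4. Variant calling: DNAscope, then apply the model to produce the final VCF.
    sentieon driver -r "$fasta" -t "$nt" -i deduped.bam \
        --algo DNAscope --model "$model" tmp.vcf.gz || return 1
    sentieon driver -r "$fasta" -t "$nt" \
        --algo DNAModelApply --model "$model" -v tmp.vcf.gz dnascope.vcf.gz
}
```

The function stops at the first failing stage, so a non-zero return status points to the stage whose output file is missing.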
Run the quick start package
To get started, copy the downloaded quick start package to a new directory, and unpack it by running the following:
tar xzvf sentieon_quickstart.tar.gz
Here is what is included in the package:
- sentieon_quickstart.sh: the sample shell script that drives the entire pipeline.
- reference: a directory that contains human genome reference files and database files of known SNP sites.
- models: a directory that contains DNAscope model files.
- FASTQ files: sample sequence files.
Before running the script, make sure that the environment variables are set as described above, including the license and the path to the installation directory.
Then open your favorite editor to edit the user settings in sentieon_quickstart.sh.
# Update with the location of the Sentieon software package
SENTIEON_INSTALL_DIR=/home/release/sentieon-genomics-202308.03

# Update with the location of temporary fast storage and uncomment
#SENTIEON_TMPDIR=/tmp

# It is important to assign meaningful names in actual cases.
# It is particularly important to assign different read group names.
sample="sample_name"
group="read_group_name"
platform="ILLUMINA"

# Other settings
nt=16 #number of threads to use in computation

# Is the data prepared with a PCR free library prep
PCRFREE=true
Note
In the user settings of sentieon_quickstart.sh:
- It is important to assign meaningful names in actual cases.
- It is particularly important to assign different read group names.
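To illustrate why the names matter, the sketch below builds a SAM `@RG` header line from the three settings. `read_group_line` is a hypothetical helper, and passing its output to the aligner's `-R` option is an assumption about how the script uses these values. `ID`, `SM`, and `PL` are standard SAM read group tags.

```shell
#!/bin/sh
# Sketch: build a SAM @RG header line from the user settings.
# read_group_line is a hypothetical helper, not part of the quick start script.
read_group_line() {
    # $1 = read group name, $2 = sample name, $3 = sequencing platform
    # The tabs are emitted as the two characters "\t", the form bwa mem -R accepts.
    printf '@RG\\tID:%s\\tSM:%s\\tPL:%s\n' "$1" "$2" "$3"
}

# Two lanes of the same sample share SM but must differ in ID.
read_group_line "sample_name_lane1" "sample_name" "ILLUMINA"
read_group_line "sample_name_lane2" "sample_name" "ILLUMINA"
```

Reads that carry the same read group ID are treated as one sequencing run by downstream tools, so reusing a name across lanes or samples blurs that distinction.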
To get the number of CPU cores, you can run nproc as shown below.
nproc
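The value for `nt` can also be derived in a script instead of being typed in. A small sketch, assuming a fallback to `getconf` on systems without `nproc`. `detect_threads` is a hypothetical helper name.

```shell
#!/bin/sh
# Sketch: choose a thread count for the nt setting.
# Tries nproc first, then getconf, and finally falls back to a single thread.
detect_threads() {
    n=$(nproc 2>/dev/null) || n=$(getconf _NPROCESSORS_ONLN 2>/dev/null) || n=1
    echo "${n:-1}"
}

nt=$(detect_threads)
echo "nt=$nt"
```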
To better understand the rest of the sentieon_quickstart.sh script, please read the comments in each section and the corresponding chapters in the manual.
Now, launch the script by simply running sentieon_quickstart.sh, and watch the result unfold. The entire run takes about 3 to 5 minutes on a typical Linux server. Actual time varies depending on the computation environment.
sh sentieon_quickstart.sh &
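Because the job runs in the background, it can help to keep its output in a log file and check the exit status when it finishes. A minimal sketch. `run_logged` and the log file name are hypothetical.

```shell
#!/bin/sh
# Sketch: run a script in the background, capture its output, and return its status.
# run_logged is a hypothetical helper: run_logged SCRIPT LOG_FILE
run_logged() {
    script=$1; log=$2
    sh "$script" > "$log" 2>&1 &
    pid=$!
    echo "started $script as PID $pid, logging to $log" >&2
    wait "$pid"    # the function returns the script's exit status
}

# Usage (run from the directory that holds the quick start package):
# run_logged sentieon_quickstart.sh quickstart.log && echo "pipeline finished"
```

While the job is running, `tail -f quickstart.log` in another terminal shows its progress.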
Understanding the results
Below is a list of the output files and what each one contains. For more details, please refer to the documentation.
- Quick start test output files
File name | Description |
---|---|
sorted.bam | Coordinate-sorted BAM file after alignment with Sentieon® BWA mem. |
score.txt | Duplicate read data file. |
aln_metrics.txt | Alignment and general statistics for the paired sequence reads. |
gc_summary.txt | GC bias statistics summary. |
gc_metrics.txt, gc-report.pdf | GC bias statistics data file and report PDF. |
qd_metrics.txt, qd-report.pdf | Base quality score distribution data file and report PDF. |
mq_metrics.txt, mq-report.pdf | Cycle-dependence of the mean quality score data file and report PDF. |
is_metrics.txt, is-report.pdf | Insert size distribution data file and report PDF. |
deduped.bam | Output BAM file of Dedup stage, with duplicated reads removed. |
dnascope.vcf.gz | Output VCF file of DNAscope variant calling. |
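A quick way to confirm that the run completed is to check that the main files from the table exist and are non-empty. The sketch below uses only file names listed above and skips the PDF reports. `check_outputs` is a hypothetical helper name.

```shell
#!/bin/sh
# Sketch: confirm that the main output files from the table exist and are non-empty.
# check_outputs is a hypothetical helper: check_outputs [DIRECTORY]
check_outputs() {
    dir=${1:-.}
    missing=0
    for f in sorted.bam score.txt aln_metrics.txt gc_summary.txt gc_metrics.txt \
             qd_metrics.txt mq_metrics.txt is_metrics.txt deduped.bam dnascope.vcf.gz; do
        if [ ! -s "$dir/$f" ]; then
            echo "missing or empty: $f"
            missing=$((missing + 1))
        fi
    done
    # Succeed only when every file was found.
    [ "$missing" -eq 0 ]
}
```

Running `check_outputs .` in the output directory prints one line per missing file and returns a non-zero status if anything is absent.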
Description of the Sentieon® tools
The table below shows the different Sentieon® products and tools and their purpose. It is also noted if a tool implements functionality equivalent to an existing GATK pipeline tool.
Sentieon® product | Sentieon® tool | Typical use | Equivalent GATK pipeline tool |
---|---|---|---|
Sentieon® BWA | Sentieon® BWA | Read alignment and mapping | BWA |
DNAscope | DNAscope | Improved germline SNV/Indel/SV calling | |
DNAseq® | Genotyper | Germline SNV/Indel calling, non haplotype based | UnifiedGenotyper |
DNAseq® | Haplotyper | Germline SNV/Indel calling | HaplotypeCaller |
DNAseq® | GVCFtyper | Joint calling of cohorts, demonstrated up to 200,000 samples | GenotypeGVCFs |
DNAseq® | VarCal | Calculate Variant Quality Score Recalibration | VariantRecalibrator |
DNAseq® | ApplyVarCal | Apply Variant Quality Score Recalibration | ApplyRecalibration |
RNAseq | RNASplitReadsAtJunction | RNA SNV/Indel calling | SplitNCigarReads |
RNAseq | Haplotyper | RNA SNV/Indel calling | HaplotypeCaller |
TNseq® | TNsnv | Somatic SNV calling, non haplotype based | MuTect |
TNseq® | TNhaplotyper | Somatic SNV/Indel calling | MuTect2 |
TNseq® | TNhaplotyper2 + TNfilter | Somatic SNV/Indel calling | Mutect2 and FilterMutectCalls from GATK4 |
TNscope® | TNscope® | Improved somatic SNV/Indel/SV calling | |
General tools | Dedup and LocusCollector | Perform deduplication | Picard MarkDuplicates |
General tools | Realigner | Perform Indel realignment for non-haplotype based callers | RealignerTargetCreator and IndelRealigner |
General tools | QualCal | Perform Base Quality Score Recalibration | BaseRecalibrator, AnalyzeCovariates |
General tools | ReadWriter | Create BAM files | PrintReads |
General tools | AlignmentStat | QC metrics | Picard CollectAlignmentSummaryMetrics |
General tools | BaseDistributionByCycle | QC metrics | Picard CollectBaseDistributionByCycle |
General tools | CollectVCMetrics | QC metrics | Picard CollectVariantCallingMetrics |
General tools | ContaminationAssessment | QC metrics | ContEst |
General tools | CoverageMetrics | QC metrics | DepthOfCoverage |
General tools | GCBias | QC metrics | Picard CollectGcBiasMetrics |
General tools | HsMetricAlgo | QC metrics | Picard CollectHsMetrics |
General tools | InsertSizeMetricAlgo | QC metrics | Picard CollectInsertSizeMetrics |
General tools | MeanQualityByCycle | QC metrics | Picard MeanQualityByCycle |
General tools | QualDistribution | QC metrics | Picard QualityScoreDistribution |
General tools | QualityYield | QC metrics | Picard CollectQualityYieldMetrics |
General tools | SequenceArtifactMetricsAlgo | QC metrics | Picard CollectSequencingArtifactMetrics, ConvertSequencingArtifactToOxoG |
General tools | WgsMetricsAlgo | QC metrics | Picard CollectWgsMetrics |
Appendix - Set up license
Sentieon® software is license-controlled. You must set up the license properly in order to run the software.
We provide two types of licenses:
- Single machine evaluation license: this license is used for evaluating the Sentieon® software on a single machine. It allows new users to get started quickly without requiring help from the IT department. To use this license, the computer where you plan to run the Sentieon® software requires external Internet access.
- Cluster license: this license is used in a cluster environment. With this license, a lightweight floating license server process runs on one node in the cluster and serves licenses through TCP to all other nodes that have a network connection to the license server. The license server runs on a special non-computing node on the cluster periphery that has unrestricted access to the outside world through HTTPS, and it serves licenses to the rest of the nodes in the cluster by listening on a specific TCP port that needs to be open within the cluster.
Setting up a single machine evaluation license
To use the single machine evaluation license, the computing node needs to have access to the Internet. This allows Sentieon® software to validate the license.
To use a single machine evaluation license, follow the steps below:
- Copy the license file to the computing node. For example, the license file LICENSE_FILE.lic is now located at LICENSE_DIR.
- Set the environment variable as shown below:
export SENTIEON_LICENSE=LICENSE_DIR/LICENSE_FILE.lic
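A typo in the path is an easy mistake at this step, so it can be worth checking that the file is readable before exporting the variable. A minimal sketch. `use_license_file` is a hypothetical helper, and it does not validate the license itself.

```shell
#!/bin/sh
# Sketch: export SENTIEON_LICENSE only if the license file is readable.
# use_license_file is a hypothetical helper: use_license_file PATH_TO_LICENSE_FILE
use_license_file() {
    if [ -r "$1" ]; then
        export SENTIEON_LICENSE="$1"
    else
        echo "license file not found or not readable: $1" >&2
        return 1
    fi
}

# Usage, with the placeholders from the step above:
# use_license_file LICENSE_DIR/LICENSE_FILE.lic
```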
Setting up license server
As shown in Fig. 1, the license server requires the following:
- The license server should have access to the Internet to perform license validation.
- The computing nodes should have access to the license server via a host name, LICSRVR_HOST.
- The machine the license server is running on has an open port for the license service to listen on, and the computing nodes have access to that port. Here we assume the available port is LICSRVR_PORT.
You may need IT support to get LICSRVR_HOST:LICSRVR_PORT, and confirm that the above requirements are met.
Note
If the license server is behind a firewall and separated from the computing nodes through a NAT, the license server's hostname/IP visible to the nodes may be different from its actual hostname/IP. If this is the case, you will need to bind the license server to the actual IP address, while the compute nodes request licenses from the IP address after NAT. Please contact Sentieon support for more details.
Follow these steps to obtain the license file, then set up and test the license server:
- Send the following information to Sentieon® to receive the license file:
- FQDN (fully qualified domain name) LICSRVR_HOST of the designated machine to run license service.
- The designated port LICSRVR_PORT for the license service.
Copy the received license file to the license server LICSRVR_HOST. We assume the license file is located in LICENSE_PATH/LICENSE_FILE. Run the following command on the license server to start the license server process:
<SENTIEON_INSTALL_DIR>/bin/sentieon licsrvr --start --log LOG_FILE LICENSE_PATH/LICENSE_FILE
Alternatively, you can follow the instructions in section 8.8 - Running the license server (LICSRVR) as a system service in the Sentieon® Genomics Manual, to configure and start the license server as a system daemon.
Go to the Sentieon® installation directory. Run the following command on the license server to confirm the license server is up and running.
<SENTIEON_INSTALL_DIR>/bin/sentieon licclnt ping -s LICSRVR_HOST:LICSRVR_PORT
If the command returns without an error message, the license server is up and running.
Log in to one of the computing nodes, go to the Sentieon® installation directory, and run the above command again:
<SENTIEON_INSTALL_DIR>/bin/sentieon licclnt ping -s LICSRVR_HOST:LICSRVR_PORT
If the command returns without an error message, the computing node can now access the license server, too.
Set up the following environment variable and you are good to go.
export SENTIEON_LICENSE=LICSRVR_HOST:LICSRVR_PORT
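When the license server has just been started, or when a job script runs on a freshly booted node, the first ping may fail. The sketch below retries a command a few times before giving up. `retry` is a hypothetical helper, and the attempt count and delay in the usage line are arbitrary example values. The `licclnt ping` command is the one shown in the steps above.

```shell
#!/bin/sh
# Sketch: retry a command until it succeeds or the attempts run out.
# retry is a hypothetical helper: retry ATTEMPTS DELAY_SECONDS COMMAND [ARGS...]
retry() {
    attempts=$1; delay=$2; shift 2
    i=1
    while ! "$@"; do
        if [ "$i" -ge "$attempts" ]; then
            echo "giving up after $i attempts: $*" >&2
            return 1
        fi
        i=$((i + 1))
        sleep "$delay"
    done
    return 0
}

# Usage (requires the Sentieon package and a reachable license server):
# retry 5 10 "$SENTIEON_INSTALL_DIR/bin/sentieon" licclnt ping -s "$SENTIEON_LICENSE"
```

Placing the check at the top of a job script makes a license connectivity problem fail fast, before any compute time is spent.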