Using a Sentieon® AMI from the AWS Marketplace¶
Introduction¶
This document describes how to use the Sentieon®-provided AMI to run the Sentieon® genomics tools on Amazon AWS. Instances started from the AMI are pre-configured with the software needed to run DNA sequencing analysis pipelines and process genomic data in the cloud. If you have any additional questions, please contact Sentieon® technical support at support@sentieon.com.
Using Sentieon® on AWS via the Sentieon® Marketplace AMI¶
License information¶
The AMI follows a "bring your own license" model. To help you get started, the Sentieon® AMI ships with a fully-featured trial license valid for 14 days after the first time you launch the AMI from your AWS account. To continue using the Sentieon® tools after the trial period ends, please contact info@sentieon.com and provide your name, email, and AWS accountId.
Contents of the AMI¶
The AMI contains the Sentieon® genomics tools as well as some sample scripts to help you process DNA sequencing data and perform germline and somatic variant calling. The Sentieon® genomics tools are located in the folder indicated by the SENTIEON_INSTALL_DIR environment variable. The sample scripts are located in the scripts subfolder of the home folder of the user sentieon, and allow you to run pre-configured FASTQ-to-VCF variant calling pipelines.
Launching an instance using the AWS Marketplace¶
Subscribe to the Sentieon® Marketplace AMI¶
To get started with the Sentieon® AWS Marketplace AMI, you will first need to subscribe to the Sentieon® AMI; the screenshots contained in this document use version 201808.07 as an example, but you should use the latest version available in the marketplace. To subscribe to the AMI, visit https://aws.amazon.com/marketplace/pp/B07WJ7CZBX in an internet browser and click "Continue to Subscribe". It may take a few minutes for the subscription to become activated.
The Sentieon® Genomics Marketplace Page
Configure the Sentieon® Marketplace AMI¶
After completing the subscription process, click "Continue to Configuration". On the configuration page, select the software version that you would like to use and the region in which to run the Sentieon® AMI, and then click "Continue to Launch".
The AMI configuration page
Launch an instance with the Sentieon® AMI¶
On the launch page, choose the appropriate options for the instance type, VPC, subnet, and security group. For most applications, we recommend running the Sentieon® tools on c5d.9xlarge instances, as the fast SSD storage attached to these instances gives the best performance. The Sentieon® AMI requires outbound TCP communication with aws.sentieon.com for license validation, and both the VPC and the security group must allow this outbound communication for the AMI to function properly.
The AMI launch page
Choose a key pair that will allow you to connect to the instance once it is launched. If necessary, new key pairs may be created or added to AWS EC2 in the AWS EC2 console, https://console.aws.amazon.com/ec2/v2/home#KeyPairs. When you have finished configuring the instance, click "Launch" to start it.
Monitor the running instance¶
After launch the following page will appear. Click the link to the EC2 console to monitor your deployed instance.
The launch page after successfully launching an instance
Logging into the instance¶
The AMI uses the 'sentieon' username, so after launching the instance you can connect to it by running:
ssh -i <path_to_my_ssh_key> sentieon@<Instance_name_or_IP>
Initial configuration of the AMI¶
You will need to perform some initial configuration the first time the AWS instance is started; these steps are only required once per instance.
- Mount the attached SSD:
sudo mkfs /dev/nvme1n1
mkdir ssd
sudo mount /dev/nvme1n1 ssd
sudo chown -R sentieon ssd
- Set up your AWS credentials:
aws configure
- (Not necessary for short-term evaluations) Set an environment variable telling the software what your licenseKey is. The licenseKey will be provided when you contact info@sentieon.com to request a license.
export SENTIEON_LICENSE_KEY=XXXX
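An export typed at the prompt only lasts for the current shell session, so the license key has to be set again after each new login. The following is a minimal sketch of one way to persist it, assuming a login shell that reads ~/.bashrc. The helper name persist_export is hypothetical and is not part of the AMI.

```shell
# persist_export NAME VALUE [FILE]
# Append "export NAME='VALUE'" to FILE (default: ~/.bashrc) unless FILE
# already exports NAME. Hypothetical helper, not shipped with the AMI.
# Assumes VALUE contains no single quotes.
persist_export() (
    name="$1"
    value="$2"
    file="${3:-$HOME/.bashrc}"
    touch "$file" || exit 1
    # Leave the file untouched if the variable is already exported there.
    if grep -q "^export ${name}=" "$file"; then
        exit 0
    fi
    printf "export %s='%s'\n" "$name" "$value" >> "$file"
)
```

Calling persist_export SENTIEON_LICENSE_KEY XXXX once, with XXXX replaced by the key you received, adds the line to ~/.bashrc; running it a second time changes nothing.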
Running germline pipelines in the AMI¶
The DNAseq® germline variant calling pipeline consists of the following stages: alignment with Sentieon® BWA, deduplication, Base Quality Score Recalibration, and Haplotyper germline variant calling to produce both a VCF and a GVCF output. To run the germline variant calling pipeline, do the following:
1. Modify the variables FASTQ_1, FASTQ_2, SAMPLE, and GROUP in
the file ~/scripts/pipeline-aws-DNAseq.sh to point to the location of the
FASTQ input files and to set the SM sample name and RGID read group
name.
2. Modify the variable BUNDLE_FILE in the file
~/scripts/pipeline-aws-DNAseq.sh to indicate the reference bundle files you want
to use in the pipeline.
3. Modify the variable OUTPUT_BUCKET in the file
~/scripts/pipeline-aws-DNAseq.sh to point to the s3 bucket where you want to
store the results.
4. Run the script with command:
nohup bash ~/scripts/pipeline-aws-DNAseq.sh &
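The variable edits in steps 1 to 3 can also be scripted instead of being made by hand in an editor. The sketch below assumes that the sample script defines each variable on its own line beginning with NAME=, which has not been checked against the shipped scripts. The helper set_script_var is hypothetical.

```shell
# set_script_var FILE NAME VALUE
# Rewrite a top-level NAME=... assignment in FILE to NAME="VALUE".
# Assumes the script defines each variable on its own line starting with
# NAME=, and that VALUE contains none of the characters | & \ or ".
set_script_var() (
    file="$1"
    name="$2"
    value="$3"
    # Fail early if the variable is not assigned in the file.
    grep -q "^${name}=" "$file" || { echo "$name not found in $file" >&2; exit 1; }
    tmp="$(mktemp)" || exit 1
    # Edit into a temporary file, then copy back so FILE keeps its permissions.
    sed "s|^${name}=.*|${name}=\"${value}\"|" "$file" > "$tmp" && cat "$tmp" > "$file"
    status=$?
    rm -f "$tmp"
    exit "$status"
)
```

For example, set_script_var ~/scripts/pipeline-aws-DNAseq.sh SAMPLE sample1 would set the sample name, where sample1 is a placeholder. The same helper applies to the other pipeline scripts, since they use the same kind of variables.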
Running somatic pipelines in the AMI¶
The TNseq® somatic variant calling pipeline for a tumor-normal pair consists of the following stages: for both the tumor and normal samples, alignment with Sentieon® BWA, deduplication, and Base Quality Score Recalibration; co-realignment of the tumor and normal samples together; and TNsnv and TNhaplotyper somatic variant calling to produce a VCF output. To run the TNseq® somatic variant calling pipeline, do the following:
1. Modify the variables TUMOR_FASTQ_1, TUMOR_FASTQ_2, NORMAL_FASTQ_1,
NORMAL_FASTQ_2, SAMPLE, and GROUP in the file
~/scripts/pipeline-aws-TNseq.sh to point to the location of the FASTQ input
files and to set the SM sample name and RGID read group name.
2. Modify the variable BUNDLE_FILE in the file
~/scripts/pipeline-aws-TNseq.sh to indicate the reference bundle files you
want to use in the pipeline.
3. Modify the variable OUTPUT_BUCKET in the file
~/scripts/pipeline-aws-TNseq.sh to point to the s3 bucket where you want to
store the results.
4. Run the script with command:
nohup bash ~/scripts/pipeline-aws-TNseq.sh &
Running TNscope® somatic pipelines in the AMI¶
The TNscope® somatic variant calling pipeline for a tumor-normal pair consists of the following stages: for both the tumor and normal samples, alignment with Sentieon® BWA, deduplication, and Base Quality Score Recalibration; then TNscope® somatic variant calling to produce a VCF output. To run the TNscope® somatic variant calling pipeline, do the following:
1. Modify the variables TUMOR_FASTQ_1, TUMOR_FASTQ_2, NORMAL_FASTQ_1,
NORMAL_FASTQ_2, SAMPLE, and GROUP in the file
~/scripts/pipeline-aws-TNscope.sh to point to the location of the FASTQ input
files and to set the SM sample name and RGID read group name.
2. Modify the variable BUNDLE_FILE in the file
~/scripts/pipeline-aws-TNscope.sh to indicate the reference bundle files you
want to use in the pipeline.
3. Modify the variable OUTPUT_BUCKET in the file
~/scripts/pipeline-aws-TNscope.sh to point to the s3 bucket where you want to
store the results.
4. Run the script with command:
nohup bash ~/scripts/pipeline-aws-TNscope.sh &
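When started from a terminal as shown in the pipeline sections above, nohup appends the script output to a file named nohup.out in the current directory, so runs launched from the same directory share one log. The sketch below shows one way to give each run its own log file and to check later whether it is still running. Both helper names are hypothetical.

```shell
# start_pipeline SCRIPT LOG
# Run SCRIPT with bash under nohup, send stdout and stderr to LOG, and
# print the PID of the background process. Hypothetical helper.
start_pipeline() {
    nohup bash "$1" > "$2" 2>&1 &
    echo "$!"
}

# is_running PID
# Succeed while a process with this PID still exists.
is_running() {
    kill -0 "$1" 2>/dev/null
}
```

A run could then be started with pid=$(start_pipeline ~/scripts/pipeline-aws-TNscope.sh ~/tnscope.log), where the log path is a placeholder. Progress can be followed with tail -f on the log file, and is_running "$pid" reports whether the pipeline is still active.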
Additional information¶
Reference bundle files¶
Sentieon® supplies some pre-packaged tar files containing the reference genome and other commonly used reference files:
- b37: s3://sentieon-bundle/aws/b37decoy.tar.gz
- hg19: s3://sentieon-bundle/aws/hg19.tar.gz
- hg38: s3://sentieon-bundle/aws/hg38decoy.tar.gz
These files can be used with the DNAseq®, TNseq®, or TNscope® pipelines or with a
custom pipeline. Alternatively, the DNAseq®, TNseq® and TNscope® pipelines contain
the variables FASTA, BEDFILE, DBSNP, and KNOWN_SITES, which can
be used in place of a bundle file.
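To avoid retyping the bundle locations listed above, they can be wrapped in a small lookup function. This is only a convenience sketch: the function name bundle_uri is hypothetical, and the three locations are copied from the list above.

```shell
# bundle_uri NAME
# Print the S3 location of the Sentieon reference bundle for b37, hg19,
# or hg38. Returns non-zero for any other name. Hypothetical helper.
bundle_uri() {
    case "$1" in
        b37)  echo "s3://sentieon-bundle/aws/b37decoy.tar.gz" ;;
        hg19) echo "s3://sentieon-bundle/aws/hg19.tar.gz" ;;
        hg38) echo "s3://sentieon-bundle/aws/hg38decoy.tar.gz" ;;
        *)    echo "unknown reference: $1" >&2; return 1 ;;
    esac
}
```

On an instance where aws configure has been run, aws s3 cp "$(bundle_uri hg38)" . would download the hg38 bundle to the current directory, assuming your credentials are allowed to read that bucket.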
It is also possible to build a custom reference bundle. The reference bundles
consist of data files along with a manifest.json file. The manifest.json
file contains the following JSON keys that are recognized by the prepackaged
pipelines:
- fasta: The reference genome FASTA file.
- rmdecoy_bed: A BED file of intervals to use when calling variants.
- dbsnp: A dbSNP VCF file of known variants to use during variant calling.
- known_sites: A semicolon-separated list of known sites to use during base quality score recalibration.
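As an illustration of the four keys described above, the sketch below writes a manifest.json into a bundle directory. Every file name in it is a placeholder, and treating the values as paths relative to the bundle root is an assumption; compare against the manifest.json inside one of the pre-packaged bundles before relying on this layout. The helper write_manifest is hypothetical.

```shell
# write_manifest DIR
# Create DIR/manifest.json with the four keys recognized by the pipelines.
# All file names below are placeholders; treating the values as paths
# relative to the bundle root is an assumption.
write_manifest() (
    dir="$1"
    mkdir -p "$dir" || exit 1
    cat > "$dir/manifest.json" <<'EOF'
{
  "fasta": "my_reference.fa",
  "rmdecoy_bed": "my_intervals.bed",
  "dbsnp": "my_dbsnp.vcf.gz",
  "known_sites": "my_known_indels.vcf.gz;my_known_snps.vcf.gz"
}
EOF
)
```

After copying the data files into the same directory, a command such as tar -czf my_bundle.tar.gz -C my_bundle . packs it into a single archive. Whether the pipelines expect the files at the archive root is likewise an assumption.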
License control¶
The Sentieon® software tools are controlled by a license, and outbound TCP communication between the instance running the Sentieon® AMI and aws.sentieon.com is necessary for license control. The Sentieon® Marketplace AMI uses AWS EC2 instance identity documents to verify the instance identity; see https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/instance-identity-documents.html.
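To see what an instance identity document contains, including the accountId field, it can be read from the EC2 instance metadata service from inside a running instance. The sketch below uses the standard IMDSv2 token flow with curl. It only displays the document and makes no claim about how the Sentieon® software itself retrieves or checks it. The function name is hypothetical.

```shell
# show_identity_document
# Print this instance's identity document (JSON) using IMDSv2.
# Only works from inside an EC2 instance. Hypothetical helper.
show_identity_document() (
    base="http://169.254.169.254/latest"
    # Request a short-lived session token, then use it to read the document.
    token="$(curl -sS -X PUT "$base/api/token" \
        -H "X-aws-ec2-metadata-token-ttl-seconds: 60")" || exit 1
    curl -sS -H "X-aws-ec2-metadata-token: $token" \
        "$base/dynamic/instance-identity/document"
)
```

Running show_identity_document on the instance prints a JSON object whose fields include accountId, instanceId, and region.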
Getting help¶
Additional information on the Sentieon® software can be found at https://www.sentieon.com/support/. To contact Sentieon® technical support, please send an email to support@sentieon.com.