Using a Sentieon® AMI from the AWS Marketplace

Introduction

This document describes how to use the Sentieon®-provided AMI to run the Sentieon® genomics tools on Amazon AWS. Instances launched from the AMI are pre-configured with the software needed to run DNA sequencing analysis pipelines and process genomic data in the cloud. If you have any additional questions, please contact Sentieon® technical support at support@sentieon.com.

Using Sentieon® on AWS via the Sentieon® Marketplace AMI

License information

The AMI follows a "bring your own license" model. It also ships with a fully featured trial license that is valid for 14 days from the first time you launch the AMI from your AWS account. To continue using the Sentieon® tools after the trial period ends, please contact info@sentieon.com and provide your name, email, and AWS accountId.

Contents of the AMI

The AMI contains the Sentieon® genomics tools as well as sample scripts to help you process DNA sequencing data and perform germline and somatic variant calling. The Sentieon® genomics tools are located in the folder indicated by the SENTIEON_INSTALL_DIR environment variable. The sample scripts are located in the scripts subfolder of the home folder of the user sentieon, and allow you to run pre-configured FASTQ-to-VCF variant calling pipelines.
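
As a quick orientation after logging in, the layout described above can be inspected from the shell. The sketch below is illustrative only: the function name show_sentieon_layout is hypothetical, and the bin subfolder inside the install directory is an assumption that this document does not state.

```shell
#!/usr/bin/env bash
# Hypothetical helper: list the Sentieon tools and the sample scripts.
# Assumption: the executables sit in a bin/ subfolder of SENTIEON_INSTALL_DIR.
show_sentieon_layout() {
    local scripts_dir="${1:-$HOME/scripts}"
    if [ -z "${SENTIEON_INSTALL_DIR:-}" ]; then
        echo "SENTIEON_INSTALL_DIR is not set" >&2
        return 1
    fi
    echo "Install directory: $SENTIEON_INSTALL_DIR"
    ls "$SENTIEON_INSTALL_DIR/bin"
    echo "Sample scripts: $scripts_dir"
    ls "$scripts_dir"
}
```

Called with no arguments on the instance, the function lists ~/scripts for the current user.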

Launching an instance using the AWS Marketplace

Subscribe to the Sentieon® Marketplace AMI

To get started with the Sentieon® AWS Marketplace AMI, you will first need to subscribe to the Sentieon® AMI; the screenshots contained in this document use version 201808.07 as an example, but you should use the latest version available in the marketplace. To subscribe to the AMI, visit https://aws.amazon.com/marketplace/pp/B07WJ7CZBX in an internet browser and click "Continue to Subscribe". It may take a few minutes for the subscription to become activated.

../_images/ami-fig1.png

The Sentieon® Genomics Marketplace Page

Configure the Sentieon® Marketplace AMI

After completing the subscription process, click "Continue to Configuration". On the configuration page, select the software version that you would like to use and the region to run the Sentieon® AMI and then click "Continue to Launch".

../_images/ami-fig2.png

The AMI configuration page

Launch an instance with the Sentieon® AMI

On the launch page, choose the appropriate options for the instance type, VPC, subnet, and security group. For most applications, c5d.9xlarge instances are recommended, because the fast SSD storage attached to these instances gives the best performance. The Sentieon® AMI requires outbound TCP communication with aws.sentieon.com for license validation, and this outbound communication must be allowed by both the VPC and the security group for the AMI to function properly.

../_images/ami-fig3.png

The AMI launch page
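
Once an instance is running, the outbound TCP requirement can be checked from inside it. The following is a minimal sketch, not a Sentieon® tool: check_tcp is a hypothetical helper, and because this document does not state which port the license service uses, the port is left as a required argument that you would need to confirm with Sentieon® support.

```shell
#!/usr/bin/env bash
# Hypothetical helper: succeed if a TCP connection to host:port opens
# within 5 seconds. Uses bash's /dev/tcp redirection, so it needs bash.
check_tcp() {
    local host="$1" port="$2"
    timeout 5 bash -c "exec 3<>/dev/tcp/$host/$port" 2>/dev/null
}

# Example (LICENSE_PORT is a placeholder, not a documented value):
#   check_tcp aws.sentieon.com "$LICENSE_PORT" && echo "reachable"
```

A failure here points at the VPC routing or the security group egress rules rather than at the Sentieon® software itself.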

Choose a key pair that will allow you to connect to the instance once it is launched. If necessary, new key pairs may be created or added in the AWS EC2 console, https://console.aws.amazon.com/ec2/v2/home#KeyPairs. When you are finished with instance configuration, click "Launch" to start the instance.
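
As an alternative to the console, a key pair can also be created with the AWS CLI. The sketch below assumes the AWS CLI is installed and configured on your local machine; the function name create_key_pair and the key name in the example are hypothetical.

```shell
#!/usr/bin/env bash
# Hypothetical helper: create an EC2 key pair and save the private key
# to a file that only the owner can read.
create_key_pair() {
    local key_name="$1" key_file="$2"
    # umask in a subshell so the key is never group- or world-readable.
    ( umask 077
      aws ec2 create-key-pair --key-name "$key_name" \
          --query 'KeyMaterial' --output text > "$key_file" ) || return 1
    chmod 400 "$key_file"
}

# Example (hypothetical names):
#   create_key_pair my-sentieon-key ~/.ssh/my-sentieon-key.pem
```

The resulting file is what you would later pass to ssh with the -i option.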

Monitor the running instance

After launch, the following page will appear. Click the link to the EC2 console to monitor your deployed instance.

../_images/ami-fig4.png

The launch page after successfully launching an instance
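
The instance can also be monitored from the command line instead of the console. This is a sketch under the assumption that the AWS CLI is configured for the same account and region; wait_and_show_instance is a hypothetical helper and the instance ID in the example is a placeholder.

```shell
#!/usr/bin/env bash
# Hypothetical helper: block until the instance is running, then print
# its ID, state, and public IP address as tab-separated text.
wait_and_show_instance() {
    local instance_id="$1"
    aws ec2 wait instance-running --instance-ids "$instance_id" || return 1
    aws ec2 describe-instances --instance-ids "$instance_id" \
        --query 'Reservations[].Instances[].[InstanceId,State.Name,PublicIpAddress]' \
        --output text
}

# Example (placeholder ID):
#   wait_and_show_instance i-0123456789abcdef0
```

The public IP address printed here is the value to use when connecting with ssh.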

Logging into the instance

The AMI uses the 'sentieon' username, so after launching the instance you can connect to it by running:

ssh -i <path_to_my_ssh_key> sentieon@<Instance_name_or_IP>

Initial configuration of the AMI

You will need to perform some initial configuration the first time the AWS instance is started; these steps are only required once per instance.

  1. Mount the attached SSD:
sudo mkfs /dev/nvme1n1
mkdir ssd
sudo mount /dev/nvme1n1 ssd
sudo chown -R sentieon ssd
  2. Set up your AWS credentials:
aws configure
  3. (Not necessary for short-term evaluations) Set an environment variable that tells the software your licenseKey. The licenseKey will be provided when you contact info@sentieon.com to request a license.
export SENTIEON_LICENSE_KEY=XXXX
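
An export typed at the prompt only affects the current shell session. One way to keep the license key across logins is to append the export line to the shell startup file. The sketch below assumes the sentieon user runs bash and reads ~/.bashrc; persist_license_key is a hypothetical helper and XXXX stands for your own key.

```shell
#!/usr/bin/env bash
# Hypothetical helper: append the license key export to a startup file,
# but only if no such line is present yet (safe to run more than once).
persist_license_key() {
    local key="$1" rc_file="${2:-$HOME/.bashrc}"
    if grep -q '^export SENTIEON_LICENSE_KEY=' "$rc_file" 2>/dev/null; then
        echo "SENTIEON_LICENSE_KEY already set in $rc_file"
        return 0
    fi
    printf 'export SENTIEON_LICENSE_KEY=%s\n' "$key" >> "$rc_file"
}

# Example:
#   persist_license_key XXXX
```

New login shells pick the value up automatically; the current session still needs the export from step 3 or a fresh login.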

Running germline pipelines in the AMI

The DNAseq® germline variant calling pipeline consists of the following stages: alignment with Sentieon® BWA, deduplication, Base Quality Score Recalibration, and Haplotyper germline variant calling, producing both a VCF and a GVCF output. To run the germline variant calling pipeline, do the following:

1. Modify the variables FASTQ_1, FASTQ_2, SAMPLE, and GROUP in the file ~/scripts/pipeline-aws-DNAseq.sh to point to the FASTQ input files and to set the SM sample name and RGID read group name.

2. Modify the variable BUNDLE_FILE in the file ~/scripts/pipeline-aws-DNAseq.sh to indicate the reference bundle file you want to use in the pipeline.

3. Modify the variable OUTPUT_BUCKET in the file ~/scripts/pipeline-aws-DNAseq.sh to point to the S3 bucket where you want to store the results.

4. Run the script with command: nohup bash ~/scripts/pipeline-aws-DNAseq.sh &
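
The variables can be edited in any text editor. For repeated runs, the edits could also be scripted, as in the sketch below. It assumes, without confirmation from this document, that each variable is assigned at the start of a line in the form NAME=value; set_pipeline_var is a hypothetical helper, and the bucket and file names in the example are placeholders.

```shell
#!/usr/bin/env bash
# Hypothetical helper: rewrite a NAME=... assignment in a pipeline script.
# Assumption: the assignment starts at the beginning of a line.
# Limitation: the value must not contain '|' or '&' (sed special here).
# Uses GNU sed's in-place option, as found on Linux.
set_pipeline_var() {
    local file="$1" name="$2" value="$3"
    if ! grep -q "^${name}=" "$file"; then
        echo "$name not found in $file" >&2
        return 1
    fi
    sed -i "s|^${name}=.*|${name}=\"${value}\"|" "$file"
}

# Example (placeholder values):
#   f=~/scripts/pipeline-aws-DNAseq.sh
#   set_pipeline_var "$f" FASTQ_1 s3://my-bucket/sample_R1.fastq.gz
#   set_pipeline_var "$f" FASTQ_2 s3://my-bucket/sample_R2.fastq.gz
#   set_pipeline_var "$f" OUTPUT_BUCKET s3://my-bucket/results
```

It is worth opening the script once to check how the variables are actually written before relying on a helper like this.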

Running somatic pipelines in the AMI

The TNseq® somatic variant calling pipeline for a tumor-normal pair consists of the following stages: for both the tumor and normal samples, alignment with Sentieon® BWA, deduplication, and Base Quality Score Recalibration; co-realignment of the tumor and normal samples together; and TNsnv and TNhaplotyper somatic variant calling to produce a VCF output. To run the TNseq® somatic variant calling pipeline, do the following:

1. Modify the variables TUMOR_FASTQ_1, TUMOR_FASTQ_2, NORMAL_FASTQ_1, NORMAL_FASTQ_2, SAMPLE, and GROUP in the file ~/scripts/pipeline-aws-TNseq.sh to point to the FASTQ input files and to set the SM sample name and RGID read group name.

2. Modify the variable BUNDLE_FILE in the file ~/scripts/pipeline-aws-TNseq.sh to indicate the reference bundle file you want to use in the pipeline.

3. Modify the variable OUTPUT_BUCKET in the file ~/scripts/pipeline-aws-TNseq.sh to point to the S3 bucket where you want to store the results.

4. Run the script with command: nohup bash ~/scripts/pipeline-aws-TNseq.sh &
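
When nohup is started from a terminal without redirection, it appends the output to nohup.out in the current directory, so consecutive runs share one log. The sketch below shows one possible way to give each run its own log file and keep the process ID; run_in_background is a hypothetical helper, and it works the same way for any of the pipeline scripts.

```shell
#!/usr/bin/env bash
# Hypothetical helper: run a script with nohup in the background, send
# stdout and stderr to the given log file, and print the process ID.
run_in_background() {
    local script="$1" log_file="$2"
    nohup bash "$script" > "$log_file" 2>&1 &
    echo "$!"
}

# Example:
#   pid=$(run_in_background ~/scripts/pipeline-aws-TNseq.sh ~/tnseq.log)
#   tail -f ~/tnseq.log          # follow progress
#   kill -0 "$pid" && echo "still running"
```

Because the job is started with nohup, it keeps running after you log out of the ssh session.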

Running TNscope® somatic pipelines in the AMI

The TNscope® somatic variant calling pipeline for a tumor-normal pair consists of the following stages: for both the tumor and normal samples, alignment with Sentieon® BWA, deduplication, and Base Quality Score Recalibration; followed by TNscope® somatic variant calling to produce a VCF output. To run the TNscope® somatic variant calling pipeline, do the following:

1. Modify the variables TUMOR_FASTQ_1, TUMOR_FASTQ_2, NORMAL_FASTQ_1, NORMAL_FASTQ_2, SAMPLE, and GROUP in the file ~/scripts/pipeline-aws-TNscope.sh to point to the FASTQ input files and to set the SM sample name and RGID read group name.

2. Modify the variable BUNDLE_FILE in the file ~/scripts/pipeline-aws-TNscope.sh to indicate the reference bundle file you want to use in the pipeline.

3. Modify the variable OUTPUT_BUCKET in the file ~/scripts/pipeline-aws-TNscope.sh to point to the S3 bucket where you want to store the results.

4. Run the script with command: nohup bash ~/scripts/pipeline-aws-TNscope.sh &
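
After a pipeline finishes, the results should appear under the location set in OUTPUT_BUCKET. A simple way to check, assuming the AWS CLI credentials configured earlier can read that bucket, is to list its contents. The helper name list_results and the bucket in the example are hypothetical.

```shell
#!/usr/bin/env bash
# Hypothetical helper: list every object under an S3 location with
# human-readable sizes and a summary of the total count and size.
list_results() {
    local location="$1"
    aws s3 ls "$location" --recursive --human-readable --summarize
}

# Example (placeholder bucket):
#   list_results s3://my-bucket/results/
```

The exact file names written by the pipeline scripts are not listed in this document, so the output is best compared against the script itself.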

Additional information

Reference bundle files

Sentieon® supplies some pre-packaged tar files containing the reference genome and other commonly used reference files:

  • b37: s3://sentieon-bundle/aws/b37decoy.tar.gz
  • hg19: s3://sentieon-bundle/aws/hg19.tar.gz
  • hg38: s3://sentieon-bundle/aws/hg38decoy.tar.gz

These files can be used with the DNAseq®, TNseq®, or TNscope® pipelines or with a custom pipeline. Alternatively, the DNAseq®, TNseq®, and TNscope® pipeline scripts contain the variables FASTA, BEDFILE, DBSNP, and KNOWN_SITES, which can be set in place of a bundle file.
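
For a custom pipeline, a bundle can be downloaded and unpacked onto the mounted SSD. The following is a sketch only: fetch_bundle is a hypothetical helper, the destination folder is a placeholder, and it assumes that your configured AWS credentials are allowed to read the bundle location.

```shell
#!/usr/bin/env bash
# Hypothetical helper: stream a .tar.gz bundle from S3 and unpack it
# into a destination folder without keeping the archive on disk.
fetch_bundle() {
    local bundle_uri="$1" dest="$2"
    mkdir -p "$dest" || return 1
    # "aws s3 cp <uri> -" writes the object to standard output.
    aws s3 cp "$bundle_uri" - | tar -xzf - -C "$dest"
}

# Example (destination is a placeholder):
#   fetch_bundle s3://sentieon-bundle/aws/hg38decoy.tar.gz ~/ssd/reference
```

Streaming avoids needing free space for both the archive and its extracted contents at the same time.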

It is also possible to build a custom reference bundle. The reference bundles consist of data files along with a manifest.json file. The manifest.json file contains the following JSON keys that are recognized by the prepackaged pipelines:

  • fasta - The reference genome FASTA file.
  • rmdecoy_bed - A BED file of intervals to use when calling variants.
  • dbsnp - A dbSNP VCF file of known variants to use during variant calling.
  • known_sites - A semicolon-separated list of known sites files to use during base quality score recalibration.
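
To make the keys above concrete, the sketch below writes an example manifest.json. Only the four key names come from this document; the file names are hypothetical, and both the use of plain file names relative to the bundle and the exact encoding of the known_sites list are assumptions. Inspecting one of the supplied bundles (for example with tar -tzf) is the safest way to confirm the expected layout.

```shell
#!/usr/bin/env bash
# Hypothetical helper: write an example manifest.json into a folder.
# All file names below are placeholders, not files shipped by Sentieon.
write_manifest() {
    local bundle_dir="$1"
    cat > "$bundle_dir/manifest.json" <<'EOF'
{
  "fasta": "my_reference.fa",
  "rmdecoy_bed": "my_intervals.bed",
  "dbsnp": "my_dbsnp.vcf.gz",
  "known_sites": "my_known_indels.vcf.gz;my_known_snps.vcf.gz"
}
EOF
}

# Example:
#   mkdir -p ~/ssd/my_bundle && write_manifest ~/ssd/my_bundle
```

The data files named in the manifest would then be placed alongside it before the folder is packaged.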

License control

The Sentieon® software tools are controlled by a license, and outbound TCP communication between the instance running the Sentieon® AMI and aws.sentieon.com is necessary for license control. The Sentieon® Marketplace AMI uses AWS EC2 instance identity documents to verify the instance identity; see https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/instance-identity-documents.html.
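
To see what an instance identity document contains, it can be fetched from the EC2 instance metadata service from inside the instance. This sketch uses the standard IMDSv2 token flow and is not part of the Sentieon® tools; get_identity_document is a hypothetical helper, and how the AMI itself reads the document is not described here.

```shell
#!/usr/bin/env bash
# Hypothetical helper: print the instance identity document (JSON)
# using an IMDSv2 session token. Only works from inside an EC2 instance.
get_identity_document() {
    local base="http://169.254.169.254/latest"
    local token
    token=$(curl -sf -X PUT "$base/api/token" \
        -H "X-aws-ec2-metadata-token-ttl-seconds: 60") || return 1
    curl -sf -H "X-aws-ec2-metadata-token: $token" \
        "$base/dynamic/instance-identity/document"
}

# Example:
#   get_identity_document | grep accountId
```

The accountId field in the document is the same AWS account identifier that Sentieon® asks for when you request a license.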

Getting help

Additional information on the Sentieon® software can be found at https://www.sentieon.com/support/. To contact Sentieon® technical support, please send an email to support@sentieon.com.