Python API

Introduction

This document describes how to use the Python API functionality of the Sentieon® tools to create your own algorithms that use the underlying modules implemented by Sentieon®. Using this functionality, you can implement custom logic while relying on the Sentieon® high-performance tools.

The following concepts will be used in this document:

  • Algorithm: the overall logic for processing the data across the complete genome. The algorithm will execute the engine logic after splitting the genome into fragments.

  • Engine: the logic used to process each of the fragments of the genome.

  • Fragment: the genome is split into fragments, with the algorithm using the engine to process each fragment independently, potentially in parallel. The default fragment size is 1 Gbase.

  • Step: each fragment is further subdivided into steps, and the engine processes each step sequentially. The default step size is 1000 bases.

[Figure: python_api-fig1.png]

Python recipe usage

To use a Python recipe script to drive the Sentieon® software, use the following command:

sentieon driver OPTIONS --algo Python -- RECIPE_PY RecipeAlgoClassName [ARGS]

In the above command, RECIPE_PY is the recipe file; this Python file should contain a class named RecipeAlgoClassName defining the logic of the recipe.

The standard driver options can be used; please refer to the manual for more information about these options:

  • -r REFERENCE

  • -i BAM

  • -t NUMBER_THREADS

  • --temp_dir TMPDIR

The tools can be run either in traversal mode, where the genome is processed in a prescribed pattern with the engines processing fragments in steps, or in non-traversal mode, where there is no engine and the processing can be done arbitrarily. Traversal mode is the default; the option --passthru indicates that you want to run the tools in non-traversal mode.
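
For illustration, with placeholder file names, the two modes are invoked as follows; the arguments after the recipe class name are defined by the recipe itself:

#traversal mode (default): engines process the fragments in steps
sentieon driver -r ref.fa -i sample.bam --algo Python -- recipe.py Algo out.txt
#non-traversal mode: no engine is created; the run() method does the processing
sentieon driver -r ref.fa --passthru --algo Python -- recipe.py Algo [ARGS]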

Python recipe code structure

The Python file implementing the algorithms needs to include a class for each algorithm you want to run. The algorithm name given on the command line identifies which class from the Python script should be executed. The algorithm class should contain the following methods:

  • __init__(self, ctx, *args): algorithm constructor.

    • ctx: a sentieon.Context object, used to pass all the necessary information between the algorithm, the engine and the Sentieon® tools. Please see the next section and Appendix II for more details on the sentieon.Context object.

    • args: a list of command line arguments.

  • open(self): method executed at the beginning of the algorithm, typically used to set up the environment.

  • close(self): method executed at the conclusion of the algorithm, typically used to close open files and do cleanup.

Depending on the mode, additional methods are required:

  • Non-traversal mode:

    • run(self): method executed after open() and before close().

  • Traversal mode:

    • engine(self, ctx): method to create an engine. An engine is created for each thread used. The ctx is a sentieon.Context object from the algorithm.

    • accumulate(self, result): method to accumulate the results of each fragment, as returned by the Engine.stop() function. The accumulation queue will be automatically ordered by the genomic coordinates.

If run in traversal mode, the python file implementing the algorithm needs to include an engine class that will perform the calculation for each fragment. The engine class will only be accessed by the algorithm class. The engine class should contain the following methods:

  • __init__(self, ctx): engine class constructor, where the ctx is a sentieon.Context object from the algorithm.

  • start(self): method executed when starting the processing of the fragment.

  • stop(self): method executed when processing of the fragment has completed. The return value of this method will be sent to the algorithm class accumulate() method as the second argument.

  • step(self, start, end): method executed on each step of a fragment.

Please note that an engine instance is only instantiated once for each thread and therefore survives across many fragments. To this end, proper per-fragment variable initialization and cleanup need to be done in start(self) and stop(self).
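
A minimal traversal-mode skeleton, with placeholder class names, is shown below; note how the per-fragment counter is reset in start() rather than in __init__(), since the engine instance is reused:

import sentieon

class MyAlgo(object):
  def __init__(self, ctx, *args):
    self.total = 0            #state accumulated across all fragments

  def open(self):
    pass                      #setup, e.g. open output files

  def close(self):
    print("total reads: %d" % self.total)  #cleanup and final reporting

  def engine(self, ctx):
    return MyEngine(ctx)      #one engine instance is created per thread

  def accumulate(self, result):
    #results arrive ordered by genomic coordinates
    self.total += result

class MyEngine(object):
  def __init__(self, ctx):
    self.ctx = ctx

  def start(self):
    self.count = 0            #reset the per-fragment state

  def step(self, start, end):
    self.count += len(self.ctx.fetch_reads(start, end))

  def stop(self):
    return self.count         #passed to MyAlgo.accumulate()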

The sentieon.Context object

The sentieon.Context object used in the python recipe code implements the API mechanism to use the Sentieon® functionality in the recipe. More details about this object can be found in Appendix II below.

The attributes for the sentieon.Context object are:

  • fragment: a list of 3 numbers (chromosome, start, end), identifying the current fragment; this attribute is read-only. This attribute is only available when used from the engine class.

  • contig: a list of 3 numbers (chromosome, 0, length), identifying the current contig; this attribute is read-only. This attribute is only available when used from the engine class.

  • read_groups: a list of all read groups; this attribute is read-only.

The methods available for the sentieon.Context object are:

  • fetch_ref(start, end): fetch the reference sequence at [start, end]. The result is a string. When using this method from the algorithm class, the argument list is fetch_ref(chromosome, start, end), since there is no current fragment to provide the contig.

  • fetch_reads(start, end, [filter]): fetch the reads at [start, end) with an optional filter (a set of SAM flag bits). When using this method from the algorithm class, the argument list is fetch_reads(chromosome, start, end, [filter]), since there is no current fragment to provide the contig. The result is a list of Read objects defined in Appendix II below.

  • tempfile(base, suffix_length): return a temporary filename with the given base name and suffix length.
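
As a sketch of how these might be combined inside an engine step (the flag value 0x400, the SAM duplicate flag, is chosen for illustration, assuming the filter is used to exclude matching reads):

def step(self, start, end):
  chrom = self.ctx.fragment[0]              #contig of the current fragment
  ref = self.ctx.fetch_ref(start, end)      #reference bases as a string
  #fetch the reads overlapping the step, filtering on SAM flag 0x400
  reads = self.ctx.fetch_reads(start, end, 0x400)
  tmpf = self.ctx.tempfile('part.txt', 4)   #scratch file for partial output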

Recipe example

The following Python code outputs a file containing the length, number of reads, coverage and depth of each fragment, plus a summary file with the averages across the processed genome. You can execute the code by running:

sentieon driver -i IN_BAM -r FASTA --algo Python -- recipe.py Algo OUT_TXT

The recipe file recipe.py contains:
import os
import sentieon
import sys

class Algo(object):
  def __init__(self, ctx, *args):
    #initialize the output filename
    self.outf = args[0]
    #initialize the dictionary containing the running totals
    self.results = {'num_reads':0, 'coverage':0, 'depth':0, 'length':0}

  def open(self):
    #open the output files for writing
    self.fp = open(self.outf, 'w')
    self.fp_sum = open(os.path.splitext(self.outf)[0] + '.summary' + \
      os.path.splitext(self.outf)[1], 'w')
    #write the header
    self.fp.write("Fragment\tLength\tNumber of reads\tCoverage\tDepth\n")

  def close(self):
    #write the average results in the summary file
    self.fp_sum.write("--------------------------------------------\n")
    self.fp_sum.write("Summary of depth average\n")
    self.fp_sum.write("Number of reads = " + str(self.results['num_reads']) + "\n")
    self.fp_sum.write("Genome length processed = " \
      + str(self.results['length']) + "\n")
    if self.results['length'] != 0:
      self.fp_sum.write("Average coverage = " \
        + str(float(self.results['coverage']) / self.results['length']) + "\n")
      self.fp_sum.write("Average depth = " \
        + str(float(self.results['depth']) / self.results['length']) + "\n")
    #close the output file
    self.fp.close()
    self.fp_sum.close()

  def engine(self, ctx):
    #create an engine instance (one is created per thread)
    return Engine(self, ctx)

  def accumulate(self, results):
    #get the results from the temporary files, one per fragment
    with open(results) as fp:
      for line in fp:
        #append the temporary results to the output
        self.fp.write(line)
        #update the running totals used to compute the averages
        self.results['num_reads'] += int(line.split("\t")[2])
        self.results['coverage'] += int(line.split("\t")[3])
        self.results['depth'] += int(line.split("\t")[4])
        self.results['length'] += int(line.split("\t")[1])
    #remove temporary file
    os.unlink(results)


class Engine(object):
  def __init__(self, algo, ctx):
    self.ctx = ctx

  def start(self):
    #create a temporary filename based on tmp.txt with a 4-character suffix
    self.tmpf = self.ctx.tempfile('tmp.txt', 4)
    #open the temporary file for writing
    self.fp = open(self.tmpf, 'w')

  def stop(self):
    #close the file and return it to the algo
    self.fp.close()
    return self.tmpf

  def step(self, start, end):
    #get the chromosome
    chr = self.ctx.fragment[0]
    #get the reference and check how many are non N bases
    ref = self.ctx.fetch_ref(start, end)
    length = end-start-ref.count('N')
    #get all reads in the fragment
    reads = self.ctx.fetch_reads(start, end)
    #calculate the depth at each locus of the fragment by iterating through all reads
    depths = [0] * (end-start)
    if len(reads) != 0:
      for read in reads:
        blks = read.cigar_blocks
        for a in blks:  # (op, sloc, eloc, soff, eoff)
          if a[4] <= a[3]:
            continue  #skip blocks that consume no read bases
          for loc in range(a[1], a[2]):
            if loc >= start and loc < end:
              #add 1 to the depth at each locus the read covers
              depths[loc-start] += 1
    #use the faster built-in coverage method as an alternative calculation
    coverage = self.ctx.coverage(start, end, reads)
    #write the results to the temporary file
    self.fp.write('%s:%d-%d\t%d\t%d\t%d\t%d\n' % (chr, start, end,
      length, len(reads), sum(coverage), sum(depths)))

Appendix I. Parallel processing with Python API

Due to the poor multithreading performance of Python, it is recommended to execute the Python commands in a single thread. To still obtain the best performance from the Sentieon® Python API, it is recommended to process the genome in shards using the built-in distributed processing capabilities of the Sentieon® tools. Please refer to the Distributed mode App note from Sentieon® for more details.

In order to process the genome in shards, you may need to implement merging of the partial results into the final output in non-traversal mode. To implement non-traversal mode, the run() method needs to be present in the algorithm class.

class Algo(object):
  def __init__(self, ctx, *args):
    #initialize the output filename
    if ctx.mode == 'passthru':
      if args[0] == '--merge':
        self.outf = args[1]
        #the remaining arguments are the partial files to merge
        self.merge_file_list = list(args[2:])
      else:
        raise SystemExit('passthru mode requires the --merge option')
    else:
      self.outf = args[0]
    #initialize dictionary containing the average results
    self.results = {'num_reads':0, 'coverage':0, 'depth':0, 'length':0}
  # ...
  def run(self):
    #merge the outputs as provided in the arg list
    for filename in self.merge_file_list:
      with open(filename) as fp:
        for line in fp:
          #skip the per-file header and accumulate everything else
          if not line.startswith("Fragment"):
            self.fp.write(line)
            self.results['num_reads'] += int(line.split("\t")[2])
            self.results['coverage'] += int(line.split("\t")[3])
            self.results['depth'] += int(line.split("\t")[4])
            self.results['length'] += int(line.split("\t")[1])

Additionally, you will need a shell script like the one below to process each of the shards in parallel before merging.

#!/bin/sh
BAM=$1
FASTA=$2
BIN=$3
TMP=tmp/job$$
mkdir -p $TMP
#set the maximum number of concurrent jobs at the number of available threads
NPROC=$(nproc)
#set the shard size
STEP=$((100*1000*1000))
#determine the shards using the determine_shards.sh script from the
#distributed mode documentation, then run the python recipe on each
#shard in a single thread, with up to $NPROC processes in parallel
sh determine_shards.sh $BAM $STEP | xargs -P $NPROC -I % sh -c -f \
  "$BIN/bin/sentieon driver -t 1 --shard % --temp_dir tmp -r $FASTA \
  -i $BAM --algo Python -- recipe.py Algo $TMP/%.out > $TMP/%.err 2>&1"
#determine the part files to be merged
parts=
for shard in $(sh determine_shards.sh $BAM $STEP); do
  parts="$parts $TMP/$shard.out"
done
#merge the results; you may need a boundary-aware merge in the python recipe
$BIN/bin/sentieon driver --temp_dir tmp -r $FASTA --passthru --algo Python \
  -- recipe.py Algo --merge output_merge $parts
rm -fr $TMP
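
Assuming the script above is saved as run_shards.sh and the Sentieon® software is installed under /opt/sentieon (both names chosen for illustration), the invocation would be:

sh run_shards.sh sample.bam reference.fasta /opt/sentieon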

Appendix II. More details on Context object

The sentieon.Context object has the following attributes, depending on the class:

  • mode (both classes): the operation mode. The possible values are traversal, traversal-sharded and non-traversal.

  • extension (both classes): a tuple that specifies the left and right extension (in bp) to the read cache.

  • fragment (engine class only): a list of 3 numbers (chromosome, start, end), identifying the current fragment; this attribute is read-only.

  • contig (engine class only): a list of 3 numbers (chromosome, 0, length), identifying the current contig; this attribute is read-only.

  • read_groups (both classes): a list of all read groups; this attribute is read-only.

The sentieon.Context object has the following methods, depending on the class:

  • advance(start) (engine class): advance the trailing edge of the read cache sliding window to start.

  • fetch_ref(start, end) (engine class): fetch the reference sequence at [start, end]. The result is a string.

  • fetch_ref(chr, start, end) (algorithm class): fetch the reference sequence at [start, end] on contig chr. The result is a string.

  • fetch_reads(start, end, [filter]) (engine class): fetch the reads at [start, end) with an optional filter; the filter is a set of SAM flag bits. The result is a list of Read objects.

  • fetch_reads(chr, start, end, [filter]) (algorithm class): fetch the reads at [start, end) on contig chr with an optional filter; the filter is a set of SAM flag bits. The result is a list of Read objects.

  • coverage(start, end, reads) (both classes): return the coverage from a list of reads. The result is a list of the depth at each locus in [start, end).

  • tempfile(base, suffix_length) (both classes): create a temporary file name with the given base name and suffix length.
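
The following sketch highlights the difference between the engine-class and algorithm-class call signatures; chrom is a placeholder for a contig of interest:

#inside an engine method: the contig is implied by the current fragment
ref = self.ctx.fetch_ref(start, end)
#inside an algorithm method: the contig must be passed explicitly
ref = self.ctx.fetch_ref(chrom, start, end)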

The result of the fetch_reads() method is a list of Read objects, with the following attributes and methods:

  • name: the read name.

  • group: a dictionary containing the read group.

  • bases: the base sequence string of the read.

  • quals: the base quality scores as a byte array.

  • flags: the read flags.

  • cigar: the read CIGAR string.

  • cigar_blocks: the list of CIGAR blocks, each a tuple (op, sloc, eloc, soff, eoff).

  • chr: the read contig.

  • start: the read alignment start.

  • end: the read alignment end.

  • mate_chr: the contig of the mate read.

  • mate_start: the alignment start of the mate read.

  • insert_size: the insert size.

  • map_qual: the mapping quality score.

  • aux([tag]): a method returning a list of (tag, val) tuples, or the value of tag if tag is specified.
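
As an illustration of these fields, the sketch below counts the aligned bases of well-mapped reads inside an engine step; the mapping quality threshold of 20 and the NM tag lookup are arbitrary choices made for the example:

bases = 0
for read in self.ctx.fetch_reads(start, end):
  if read.flags & 0x4 or read.map_qual < 20:
    continue            #skip unmapped (SAM flag 0x4) or poorly mapped reads
  nm = read.aux('NM')   #edit distance tag, shown for illustration
  for op, sloc, eloc, soff, eoff in read.cigar_blocks:
    if eoff <= soff:
      continue          #the block consumes no read bases
    lo, hi = max(sloc, start), min(eloc, end)
    if hi > lo:
      bases += hi - lo  #count only the bases aligned inside [start, end)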