Python API¶
Introduction¶
This document describes how to use the Python API functionality of the Sentieon® tools to create your own algorithms using the underlying modules implemented by Sentieon®. With this functionality you can implement custom logic while relying on the high-performance Sentieon® tools.
The following concepts will be used in this document:
- **Algorithm**: the overall logic for processing the data across the complete genome. The algorithm executes the engine logic after splitting the genome into fragments.
- **Engine**: the logic used to process each of the fragments of the genome.
- **Fragment**: the genome is split into fragments, with the algorithm using the engine to process each fragment independently, potentially in parallel. The default fragment size is 1 GBase.
- **Step**: each fragment is further sub-divided into steps, and the engine processes the steps sequentially. The default step size is 1,000 bases.
![../_images/python_api-fig1.png](../_images/python_api-fig1.png)
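The relationship between these concepts can be sketched in plain Python. The mock below is only an illustration of the traversal pattern (all names are hypothetical; the real traversal is performed natively by the Sentieon® driver):

```python
# Illustrative mock of the traversal pattern: the driver splits each
# contig into fragments, and each engine walks its fragment in steps.
FRAGMENT_SIZE = 10 ** 9   # default fragment size: 1 GBase
STEP_SIZE = 1000          # default step size: 1,000 bases

def traverse(contig_length, fragment_size=FRAGMENT_SIZE, step_size=STEP_SIZE):
    """Yield (fragment_start, fragment_end, steps) for one contig."""
    for fstart in range(0, contig_length, fragment_size):
        fend = min(fstart + fragment_size, contig_length)
        steps = [(s, min(s + step_size, fend))
                 for s in range(fstart, fend, step_size)]
        yield fstart, fend, steps

# a 2,500-base toy contig with a 1,000-base fragment size and 400-base steps
layout = list(traverse(2500, fragment_size=1000, step_size=400))
```

In traversal mode, each fragment's steps are handed to an engine sequentially, while different fragments may be processed by different engines in parallel.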
Python recipe usage¶
In order to use a python recipe script to drive the Sentieon® software, you will use the following command:
sentieon driver OPTIONS --algo Python -- RECIPE_PY RecipeAlgoClassName [ARGS]
In the above command, the recipe file is called RECIPE_PY. The python file should contain a class named RecipeAlgoClassName defining the logic of the recipe.
The standard driver options can be used; please refer to the manual for more information about these options:
-r REFERENCE
-i BAM
-t NUMBER_THREADS
--temp_dir TMPDIR
…
The tools can be run either in traversal mode, where the genome is processed in a prescribed pattern with the engines processing fragments in steps, or in non-traversal mode, where there is no engine and the processing can be done arbitrarily. Traversal mode is the default; use the option --passthru to run the tools in non-traversal mode.
Python recipe code structure¶
The python file implementing a certain algorithm needs to include a class for each algorithm you want to run. The algorithm name will identify which algorithm from the python script should be executed. The algorithm class should contain the following methods:
- `__init__(self, ctx, *args)`: algorithm constructor.
  - `ctx`: a `sentieon.Context` object, used to pass all the necessary information between the algorithm, the engine, and the Sentieon® tools. Please see the next section and Appendix II for more details on the `sentieon.Context` object.
  - `args`: a list of command line arguments.
- `open(self)`: method executed at the beginning of the algorithm, typically used to set up the environment.
- `close(self)`: method executed at the conclusion of the algorithm, typically used to close open files and do cleanup.
Depending on the mode, additional methods are required:
Non-traversal mode:
- `run(self)`: method executed after `open()` and before `close()`.
Traversal mode:
- `engine(self, ctx)`: method to create an engine. An engine is created for each thread used. The `ctx` is a `sentieon.Context` object from the algorithm.
- `accumulate(self, result)`: method to accumulate the results of each fragment, as returned by the `Engine.stop()` method. The accumulation queue is automatically ordered by genomic coordinate.
If run in traversal mode, the python file implementing the algorithm needs to include an engine class that will perform the calculation for each fragment. The engine class will only be accessed by the algorithm class. The engine class should contain the following methods:
- `__init__(self, ctx)`: engine class constructor, where `ctx` is a `sentieon.Context` object from the algorithm.
- `start(self)`: method executed when starting the processing of a fragment.
- `stop(self)`: method executed when processing of the fragment has completed. The return value of this method is sent to the algorithm class `accumulate()` method as the second argument.
- `step(self, start, end)`: method executed on each step of a fragment.
Please note that an engine instance is only instantiated once for each thread, so it survives across many fragments. Proper variable initialization and cleanup must therefore be done in `start(self)` and `stop(self)` for each fragment.
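Putting the required methods together, a minimal traversal-mode recipe skeleton might look like the following. The class names and the counting logic are placeholders for illustration; in a real recipe the `ctx` objects are supplied by the Sentieon® driver:

```python
class MyAlgo(object):
    def __init__(self, ctx, *args):
        self.ctx = ctx          # sentieon.Context supplied by the driver
        self.args = list(args)  # command line arguments after the class name
        self.total = 0

    def open(self):
        pass                    # set up output files here

    def close(self):
        pass                    # flush and close outputs here

    def engine(self, ctx):
        return MyEngine(ctx)    # one engine instance per thread

    def accumulate(self, result):
        # results arrive ordered by genomic coordinate
        self.total += result

class MyEngine(object):
    def __init__(self, ctx):
        self.ctx = ctx

    def start(self):
        self.count = 0          # re-initialize per fragment

    def step(self, start, end):
        self.count += end - start

    def stop(self):
        return self.count       # passed to MyAlgo.accumulate()
```

Because the skeleton only uses the documented call pattern, it can be exercised offline by calling the methods in driver order: `open()`, then `engine()`/`start()`/`step()`/`stop()` per fragment, `accumulate()` per result, and finally `close()`.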
The sentieon.Context object¶
The `sentieon.Context` object used in the Python recipe code implements the API mechanism to use the Sentieon® functionality in the recipe. More details about this object can be found in Appendix II below.
The attributes for the `sentieon.Context` object are:
- `fragment`: a list of 3 numbers (chromosome, start, end), identifying the current fragment; this attribute is read-only and only available when used from the engine class.
- `contig`: a list of 3 numbers (chromosome, 0, length), identifying the current contig; this attribute is read-only and only available when used from the engine class.
- `read_groups`: a list of all read groups; this attribute is read-only.
The methods available for the `sentieon.Context` object are:
- `fetch_ref(start, end)`: fetch the reference sequence at [start, end]. The result is a string. When using this method from the algorithm class, the argument list is `fetch_ref(chromosome, start, end)`, since the chromosome contig is not known.
- `fetch_reads(start, end, [filter])`: fetch the reads at [start, end) with an optional filter (SAM flag bits). When using this method from the algorithm class, the argument list is `fetch_reads(chromosome, start, end, [filter])`, since the chromosome contig is not known. The result is a list of Read objects defined in Appendix II below.
- `tempfile(base, suffix_length)`: return a temporary filename with the given base name and suffix length.
Recipe example¶
The following python code will output a file containing the number of reads and average depth in each fragment, and the average depth across the processed genome. You can execute the code by running:
sentieon driver -i IN_BAM -r FASTA --algo Python -- recipe.py Algo OUT_TXT
```python
import os
import sentieon
import sys

class Algo(object):
    def __init__(self, ctx, *args):
        #initialize the output filename
        self.outf = args[0]
        #initialize dictionary containing the average results
        self.results = {'num_reads': 0, 'coverage': 0, 'depth': 0, 'length': 0}

    def open(self):
        #open the output files for writing
        self.fp = open(self.outf, 'w')
        self.fp_sum = open(os.path.splitext(self.outf)[0] + '.summary' + \
            os.path.splitext(self.outf)[1], 'w')
        #write the header
        self.fp.write("Fragment\tLength\tNumber of reads\tCoverage\tDepth\n")

    def close(self):
        #write the average results in the summary file
        self.fp_sum.write("--------------------------------------------\n")
        self.fp_sum.write("Summary of depth average\n")
        self.fp_sum.write("Number of reads = " + str(self.results['num_reads']) + "\n")
        self.fp_sum.write("Genome length processed = " \
            + str(self.results['length']) + "\n")
        if self.results['length'] != 0:
            self.fp_sum.write("Average coverage = " \
                + str(self.results['coverage'] / self.results['length']) + "\n")
            self.fp_sum.write("Average depth = " \
                + str(self.results['depth'] / self.results['length']) + "\n")
        #close the output files
        self.fp.close()
        self.fp_sum.close()

    def engine(self, ctx):
        #create an engine to process each fragment
        return Engine(self, ctx)

    def accumulate(self, result):
        #get the results from the temporary file, one per fragment
        with open(result) as fp:
            for line in fp:
                #append the per-fragment results to the output
                self.fp.write(line)
                #update the running totals used to calculate the averages
                self.results['num_reads'] += int(line.split("\t")[2])
                self.results['coverage'] += int(line.split("\t")[3])
                self.results['depth'] += int(line.split("\t")[4])
                self.results['length'] += int(line.split("\t")[1])
        #remove the temporary file
        os.unlink(result)

class Engine(object):
    def __init__(self, algo, ctx):
        self.ctx = ctx

    def start(self):
        #set temporary filename to be in the form of tmp_xxxx.txt
        self.tmpf = self.ctx.tempfile('tmp.txt', 4)
        #open the temporary file
        self.fp = open(self.tmpf, 'w')

    def stop(self):
        #close the file and return its name to the algo
        self.fp.close()
        return self.tmpf

    def step(self, start, end):
        #get the chromosome
        chr = self.ctx.fragment[0]
        #get the reference and count the non-N bases
        ref = self.ctx.fetch_ref(start, end)
        length = end - start - ref.count('N')
        #get all reads in the step
        reads = self.ctx.fetch_reads(start, end)
        #calculate the depth at each locus by iterating through all reads
        depths = [0] * (end - start)
        for read in reads:
            blks = read.cigar_blocks
            for a in blks:  # (op, sloc, eloc, soff, eoff)
                if a[4] <= a[3]:
                    continue
                for loc in xrange(a[1], a[2]):
                    if loc >= start and loc < end:
                        #add 1 to the depth at each locus the read covers
                        depths[loc - start] += 1
        #use the faster built-in method to calculate the coverage
        coverage = self.ctx.coverage(start, end, reads)
        #print results onto the temporary file
        print >> self.fp, '%s:%d-%d\t%d\t%d\t%d\t%d' % (chr, start, end, \
            length, len(reads), sum(coverage), sum(depths))
```
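The per-locus depth loop in `step()` above can be isolated into a pure function for offline testing. This sketch assumes, as the recipe does, that each cigar block tuple carries reference coordinates in fields 1–2 and read offsets in fields 3–4; the mock read data is hypothetical:

```python
def depths_from_blocks(block_lists, start, end):
    """Compute per-locus depth in [start, end) from reads' cigar blocks.

    block_lists: one list of (op, sloc, eloc, soff, eoff) tuples per read,
    mirroring the Read.cigar_blocks attribute used in the recipe.
    """
    depths = [0] * (end - start)
    for blks in block_lists:
        for op, sloc, eloc, soff, eoff in blks:
            if eoff <= soff:
                continue  # block consumes no read bases (e.g. a deletion)
            for loc in range(max(sloc, start), min(eloc, end)):
                depths[loc - start] += 1
    return depths

# two mock reads covering [10, 14) and [12, 16) of the reference
reads = [[('M', 10, 14, 0, 4)], [('M', 12, 16, 0, 4)]]
```

The `eoff <= soff` guard is the same check the recipe performs with `a[4] <= a[3]`, skipping blocks that do not consume read bases.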
Appendix I. Parallel processing with Python API¶
Due to Python's limited multithreading performance, it is recommended to execute the Python commands in a single thread; however, to obtain the best performance from the Sentieon® Python API it is also recommended to process the genome in shards by using the built-in distributed processing capabilities of the Sentieon® tools. Please refer to the Distributed mode App note from Sentieon® for more details.
In order to process the genome in shards, you may need to implement merging of the partial results for the final output in non-traversal mode. To implement non-traversal mode, the method run() needs to be present in the algorithm class.
```python
class Algo(object):
    def __init__(self, ctx, *args):
        #initialize the output filename
        if ctx.mode == 'passthru':
            if args[0] == '--merge':
                self.outf = args[1]
                self.merge_file_list = []
                for arg in args[2:]:
                    self.merge_file_list.append(arg)
            else:
                print 'passthru mode requires the --merge option'
                exit()
        else:
            self.outf = args[0]
        #initialize dictionary containing the average results
        self.results = {'num_reads': 0, 'coverage': 0, 'depth': 0, 'length': 0}

    # ...

    def run(self):
        #merge the outputs as provided in the arg list
        for filename in self.merge_file_list:
            with open(filename) as fp:
                for line in fp:
                    #skip the per-shard header line
                    if not line.startswith("Fragment"):
                        self.fp.write(line)
                        self.results['num_reads'] += int(line.split("\t")[2])
                        self.results['coverage'] += int(line.split("\t")[3])
                        self.results['depth'] += int(line.split("\t")[4])
                        self.results['length'] += int(line.split("\t")[1])
```
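The accumulation logic inside `run()` can also be factored into a small testable helper. The field layout below follows the tab-separated columns written by the recipe (Fragment, Length, Number of reads, Coverage, Depth); the sample shard lines are made-up data for illustration:

```python
def merge_lines(lines):
    """Sum the numeric columns of per-shard result lines, skipping headers."""
    totals = {'num_reads': 0, 'coverage': 0, 'depth': 0, 'length': 0}
    for line in lines:
        if line.startswith("Fragment"):
            continue  # per-shard header line
        fields = line.rstrip("\n").split("\t")
        totals['length'] += int(fields[1])
        totals['num_reads'] += int(fields[2])
        totals['coverage'] += int(fields[3])
        totals['depth'] += int(fields[4])
    return totals

# mock contents of one shard output file
shard = ["Fragment\tLength\tNumber of reads\tCoverage\tDepth\n",
         "chr1:0-1000\t900\t50\t4500\t4400\n",
         "chr1:1000-2000\t1000\t60\t5200\t5100\n"]
```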
Additionally you will need to use the shell script below to process each of the shards in parallel before the merging.
```sh
#!/bin/sh
BAM=$1
FASTA=$2
BIN=$3
TMP=tmp/job$$
mkdir -p $TMP
#set the maximum number of concurrent jobs to the number of available threads
NPROC=$(nproc)
#set the shard size
STEP=$((100*1000*1000))
#get the shard list using the script from the distributed mode documentation
# SHARD_LIST=$(sh determine_shards.sh $BAM $STEP)
#run the python recipe on each shard in a single thread, with up
# to $(nproc) processes in parallel
sh determine_shards.sh $BAM $STEP | xargs -P $NPROC -I % sh -c -f \
    "$BIN/bin/sentieon driver -t 1 --shard % --temp_dir tmp -r $FASTA \
    -i $BAM --algo Python -- recipe.py Algo $TMP/%.out > $TMP/%.err 2>&1"
#determine the part files to be merged
parts=
for shard in $(sh determine_shards.sh $BAM $STEP); do
    parts="$parts $TMP/$shard.out"
done
#merge the results; you may need to do a boundary-aware merge using the python recipe
$BIN/bin/sentieon driver --temp_dir tmp -r $FASTA --passthru --algo Python \
    -- recipe.py Algo --merge output_merge $parts
rm -fr $TMP
```
Appendix II. More details on Context object¶
The `sentieon.Context` object has the following attributes, depending on the class:

| attribute | class | meaning |
|---|---|---|
| `mode` | both | The operation mode. The possible values are: traversal, traversal-sharded and non-traversal. |
| `extension` | both | A tuple that specifies the left and right extension (in bp) to the read cache. |
| `fragment` | engine | A list of 3 numbers (chromosome, start, end), identifying the current fragment; this attribute is read-only. |
| `contig` | engine | A list of 3 numbers (chromosome, 0, length), identifying the current contig; this attribute is read-only. |
| `read_groups` | both | A list of all read groups; this attribute is read-only. |
The `sentieon.Context` object has the following methods, depending on the class:

| method | class | meaning |
|---|---|---|
| `advance(start)` | engine | Method to advance the trailing edge of the read cache sliding window to start. |
| `fetch_ref(start, end)` | engine | Method to fetch the reference sequence at [start, end]. The result is a string. |
| `fetch_ref(chr, start, end)` | algo | Method to fetch the reference sequence at [start, end] on chr. |
| `fetch_reads(start, end, [filter])` | engine | Method to fetch the reads at [start, end) with an optional filter; the filter is a set of SAM flag bits. |
| `fetch_reads(chr, start, end, [filter])` | algo | Method to fetch the reads at [start, end) on chr with an optional filter; the filter is a set of SAM flag bits. |
| `coverage(start, end, reads)` | both | Method to return the coverage from a list of reads. The result is a list of the depth at each locus in [start, end). |
| `tempfile(base, suffix_length)` | both | Method to create a temporary file name with the given base name and suffix length. |
The result of the fetch_reads() method is a list of Read objects, with the following attributes and methods:

| attribute / method | meaning |
|---|---|
| `name` | The read name. |
| `group` | A dictionary containing the read group. |
| `bases` | The base sequence string for the read. |
| `quals` | The base quality scores as a byte array. |
| `flags` | The read flags. |
| `cigar` | The read cigar string. |
| `cigar_blocks` | The list of cigar blocks, each a tuple (op, sloc, eloc, soff, eoff). |
| `chr` | The read contig. |
| `start` | The read alignment start. |
| `end` | The read alignment end. |
| `mate_chr` | The contig of the mate read. |
| `mate_start` | The alignment start of the mate read. |
| `insert_size` | The insert size. |
| `map_qual` | The mapping quality score. |
| `aux([tag])` | A method returning a list of (tag, val) tuples, or the tag value if tag is specified. |
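As an illustration of the `cigar_blocks` convention, the following hypothetical converter builds (op, sloc, eloc, soff, eoff) tuples from a CIGAR string. It assumes, as the recipe example does, that sloc/eloc are reference coordinates and soff/eoff are read offsets; it is a sketch, not the Sentieon® parser:

```python
import re

# reference-consuming and read-consuming CIGAR operations (SAM spec)
_REF_OPS = set('MDN=X')
_READ_OPS = set('MIS=X')

def cigar_to_blocks(cigar, aln_start):
    """Build (op, sloc, eloc, soff, eoff) tuples from a CIGAR string.

    Mirrors the tuple layout used by Read.cigar_blocks in the recipe
    example, under the coordinate assumptions stated above.
    """
    blocks = []
    ref = aln_start  # current reference coordinate
    off = 0          # current offset within the read
    for count, op in re.findall(r'(\d+)([MIDNSHP=X])', cigar):
        n = int(count)
        ref_end = ref + n if op in _REF_OPS else ref
        off_end = off + n if op in _READ_OPS else off
        blocks.append((op, ref, ref_end, off, off_end))
        ref, off = ref_end, off_end
    return blocks

blocks = cigar_to_blocks('4M2D3M', 100)
```

Note that a deletion block ends with eoff equal to soff, which is exactly the case the recipe example skips with its `a[4] <= a[3]` check.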