Performing set subtraction on HTSeq GenomicIntervals
HTSeq is a wonderfully useful Python library for analyzing high-throughput sequencing data. I’ve been using it in recent weeks to load genome alignments reads from SAM and BAM files, then intersect ranges specified in GFF files with these alignments to determine read depth at every position in each range.
As my analysis progressed, I wished to perform a set subtraction operation between sets of GenomicIntervals, which are data structures representing data one would typically find in a BED or GFF file—i.e., they include chromosome, start position, end position, and strand. Given a set of intervals representing regions of enriched read depth within my genome alignment, I wanted to subtract a second set of intervals representing annotated genes from the reference genome, leaving me with only enriched regions not associated with annotated genes.
There exist numerous means of subtracting one set of intervals from another. One particularly alluring option uses interval trees uses interval trees, which permit efficient manipulation of interval sets. Alas, I found the code sufficiently difficult to understand that I opted to implement a different solution. For fun, I implemented the interval tree structure described in the aforelinked lecture, but it only allows for determination of overlap between two interval sets. I did not extend it to support the set subtraction operation that was my original motivation, opting instead for a simpler approach.
The algorithm I implemented is largely on that in the interval package. This method is, alas, computationally naive relative to ones using interval trees. Nevertheless, though it implements a brute-force approach, it runs amply fast for my use, in which 8800 annotated gene intervals are subtracted from 550 read-enriched intervals. The only moderately interesting portion is the subtraction method:
def __sub__(self, other): results = GenomicIntervalSet() for chrom in self._intervals.keys(): minuhend = self._intervals[chrom] subtrahend = other._intervals[chrom] create_gi = lambda start, end: GenomicInterval(chrom, start, end, '.') chr_result = minuhend for j in subtrahend: temp =  for i in chr_result: if IntervalCmp.overlaps(i, j): if IntervalCmp.contains(i, j): temp.append( create_gi(i.start, j.start) ) temp.append( create_gi(j.end, i.end) ) elif IntervalCmp.contains(j, i): # Do nothing, as all of i is consumed by j. pass elif IntervalCmp.precedes(j, i): temp.append( create_gi(j.end, i.end) ) elif IntervalCmp.follows(j, i): temp.append( create_gi(i.start, j.start) ) else: temp.append(i) chr_result = temp for interval in chr_result: # Copy intervals to eliminate strand (which would otherwise be # preserved when no intervals are in subtrahend), and to ensure that # new intervals are returned to user, rather than (some) references to # ones that he provided as input. interval = create_gi(interval.start, interval.end) results.add(interval) return results
Thanks to Wikipedia, I learned was that, in A - B, A is termed the minuhend, while B is called the subtrahend. In my code, I initially set the result of the subtraction to be the entire minuhend. Then, I take the first interval in the subtrahend, determining any overlap it has with the tentative result and removing it. I repeat this for each of the remaining subtrahend intervals, yielding the final result.
I hope that this code may be of some use to other HTSeq users dealing with sets of GenomicIntervals. My implementation could, if necessary, be easily extended to support other set operations such as union and intersection.