A common task encountered in bioinformatics is the need to process a sequence bit-by-bit, sometimes with overlapping regions. I have provided an example of a very simple; easy to extend; and stand-alone python iterator that returns a single defined window of any python string object per iteration to allow simple, intuitive handling of sliding window tasks. The code is simple to understand, and does not depend on other packages.
Consider the task of scanning a sequence of length l for a pattern representing a transcription factor binding site (TFBS) or other feature-of-interest. You must start at the first position in the sequence (i = 1), retrieve a chunk of the a specific length (k) and test it for the probability that it embodies the feature’s described qualities. You must then return to the original sequence and retrieve a chunk of length k with a position off-set of one DNA letter forward into the sequence (i = 2). This process is repeated until the last chunk of length k is encountered at position i = l–k.
This type of process is called a sliding window. For example, consider the sequence “ATCGATGCTA”. It has an l of 10. If the feature that we are interested in has been described to usually be 5 bp long, we would define our k (window size) to be 5. Our step size (how far to advance the chunk’s starting position each time) will generally be 1 for this type of problem, but one might conceive reasons to use larger step sizes for different purposes. I have diagrammed the result of the sliding window procedure for our hypothetical sequence below.
ATCGATGCTA ---------- ATCGA TCGAT CGATG GATGC ATGCT TGCTA
This is a perfect job for a python iterator, and the code that I am providing here is basically that (technically it returns a generator). Its a simple Python function that when initiated with a sequence, a window size, and a step size returns an object that behaves as if it is a list that emits each defined sequence chunk when called in a for loop. This way you can use something like the following code to access each chunk and do with it what you will.
chunks = slidingWindow(sequence,winSize,step) for chunk in chunks: whatYouWill(chunk)
This is very similar to the technique we used to parse FastQ files but uses the yield statement to convert a simple function into a generator factory (thanks to Jake and gasstationwithoutpumps for suggesting this trick that had slipped by me!). This is a very “Pythonic” solution to these types of problems.
The source code follows and is in the git repo as well.
def slidingWindow(sequence,winSize,step=1): """Returns a generator that will iterate through the defined chunks of input sequence. Input sequence must be iterable.""" # Verify the inputs try: it = iter(sequence) except TypeError: raise Exception("**ERROR** sequence must be iterable.") if not ((type(winSize) == type(0)) and (type(step) == type(0))): raise Exception("**ERROR** type(winSize) and type(step) must be int.") if step > winSize: raise Exception("**ERROR** step must not be larger than winSize.") if winSize > len(sequence): raise Exception("**ERROR** winSize must not be larger than sequence length.") # Pre-compute number of chunks to emit numOfChunks = ((len(sequence)-winSize)/step)+1 # Do the work for i in range(0,numOfChunks*step,step): yield sequence[i:i+winSize]
This and other python code lives in a gitHub repository. To get the latest version of the repo, see the README file in the project’s home directory: http://github.com/xguse/scipherPyProj. This function lives in the “basicDefs” module.
I am not claiming that this code is perfect or even better than average. It works well for me and I think it may be useful to others, but as with all code you simply copy off of a webpage: you use it at your own risk. That said, I am always eager to learn better python. So if any other master python-monks feel like commenting on where I could improve, I am always happy to hear about it!
A true iterator version:
# This is my original solution. It is a true iterator class. It is included here only for completeness. I have replaced it in the repo with the much clearer code above. class SlidingWindow(object): """Returns iterator that will emit chunks of size 'winSize' each time self.next() is called.""" def __init__(self,sequence,winSize,step=1): """Returns iterator that will emit chunks of size 'winSize' and 'step' forward in the seq each time self.next() is called.""" # verification code if not type(sequence) == type(''): raise Exception("**ERROR** type(sequence) must be str.") if not ((type(winSize) == type(0)) and (type(step) == type(0))): raise Exception("**ERROR** type(winSize) and type(step) must be int.") if step > winSize: raise Exception("**ERROR** step must not be larger than winSize.") if winSize > len(sequence): raise Exception("**ERROR** winSize must not be larger than sequence length.") self._seq = sequence self._step = step self._start = 0 self._stop = winSize def __iter__(self): return self def next(self): """Returns next window chunk or ends iteration if the sequence has ended.""" try: assert self._stop <= len(self._seq), "Not True!" chunk = self._seq[self._start:self._stop] self._start += self._step self._stop += self._step return chunk except AssertionError: raise StopIteration