A common task in bioinformatics is processing a sequence piece by piece, sometimes with overlapping regions. I have provided an example of a simple, easy-to-extend, stand-alone Python iterator that returns a single window of a defined size from any Python string object per iteration, which allows intuitive handling of sliding window tasks. The code is easy to understand and does not depend on other packages.
Consider the task of scanning a sequence of length l for a pattern representing a transcription factor binding site (TFBS) or other feature of interest. You start at the first position in the sequence (i = 1), retrieve a chunk of a specific length (k), and test it for the probability that it embodies the feature’s described qualities. You then return to the original sequence and retrieve a chunk of length k offset one DNA letter forward into the sequence (i = 2). This process is repeated until the last chunk of length k is reached at position i = l - k + 1.
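The start positions are easy to get off by one, so here is a minimal sketch of the arithmetic in Python. The helper name `window_starts` is hypothetical and used only for illustration. It lists the 1-based start positions of every full-length chunk; in Python's 0-based slicing, position i corresponds to offset `i - 1`.

```python
def window_starts(l, k):
    """Return the 1-based start positions of every length-k chunk
    in a sequence of length l (hypothetical helper for illustration)."""
    if k < 1 or k > l:
        # No full-length chunk fits in the sequence.
        return []
    # The last full chunk starts at l - k + 1 (1-based).
    return list(range(1, l - k + 2))


starts = window_starts(10, 5)
print(starts)       # [1, 2, 3, 4, 5, 6]
print(len(starts))  # 6
```

A sequence of length 10 scanned with chunks of length 5 therefore yields l - k + 1 = 6 chunks.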
This type of process is called a sliding window. For example, consider the sequence “ATCGATGCTA”. It has an l of 10. If the feature that we are interested in has been described as usually being 5 bp long, we would set our k (window size) to 5. Our step size (how far to advance the chunk’s starting position each time) will generally be 1 for this type of problem, but larger step sizes can be useful for other purposes. I have diagrammed the result of the sliding window procedure for our hypothetical sequence below.
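To reproduce the windows for this example by hand, a minimal generator sketch follows. It is an assumption-laden illustration and not necessarily the iterator this post provides: the name `sliding_windows` and its `k` and `step` parameters are hypothetical, and start offsets are 0-based, as is usual in Python.

```python
def sliding_windows(sequence, k, step=1):
    """Yield (start, window) pairs over a string.

    `start` is a 0-based offset; only full-length windows are produced.
    Hypothetical sketch for illustration only.
    """
    if k < 1 or step < 1:
        raise ValueError("k and step must be positive integers")
    # Stop at the last offset where a window of length k still fits.
    for start in range(0, len(sequence) - k + 1, step):
        yield start, sequence[start:start + k]


for start, window in sliding_windows("ATCGATGCTA", k=5):
    # Indent each window by its offset so the overlap is visible.
    print(" " * start + window)
```

With the default step of 1 this prints six windows, from ATCGA through TGCTA. Passing `step=5` instead yields only the two non-overlapping windows ATCGA and TGCTA.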