Simple Sliding Window Iterator in Python

Python_Icon

A common task encountered in bioinformatics is the need to process a sequence bit-by-bit, sometimes with overlapping regions.  I have provided an example of a very simple; easy to extend; and stand-alone python iterator that returns a single defined window of any python string object per iteration to allow simple, intuitive handling of sliding window tasks.  The code is simple to understand, and does not depend on other packages.

Consider the task of scanning a sequence of length l for a pattern representing a transcription factor binding site (TFBS) or other feature-of-interest.  You must start at the first position in the sequence (i = 1), retrieve a chunk of the a specific length (k) and test it for the probability that it embodies the feature’s described qualities.  You must then return to the original sequence and retrieve a chunk of length k with a position off-set of one DNA letter forward into the sequence (i = 2).  This process is repeated until the last chunk of length k is encountered at position i = lk.

This type of process is called a sliding window.  For example, consider the sequence “ATCGATGCTA”.  It has an l of 10.  If the feature that we are interested in has been described to usually be 5 bp long, we would define our k (window size) to be 5.  Our step size (how far to advance the chunk’s starting position each time) will generally be 1 for this type of problem, but one might conceive reasons to use larger step sizes for different purposes.  I have diagrammed the result of the sliding window procedure for our hypothetical sequence below.

ATCGATGCTA
----------
ATCGA
 TCGAT
  CGATG
   GATGC
    ATGCT
     TGCTA

This is a perfect job for a python iterator, and the code that I am providing here is basically that (technically it returns a generator).  Its a simple Python function that when initiated with a sequence, a window size, and a step size returns an object that behaves as if it is a list that emits each defined sequence chunk when called in a for loop.  This way you can use something like the following code to access each chunk and do with it what you will.

chunks = slidingWindow(sequence,winSize,step)
for chunk in chunks:
    whatYouWill(chunk)

This is very similar to the technique we used to parse FastQ files but uses the yield statement to convert a simple function into a generator factory (thanks to Jake and gasstationwithoutpumps for suggesting this trick that had slipped by me!).  This is a very “Pythonic” solution to these types of problems.

The source code follows and is in the git repo as well.

def slidingWindow(sequence,winSize,step=1):
    """Returns a generator that will iterate through
    the defined chunks of input sequence.  Input sequence
    must be iterable."""

    # Verify the inputs
    try: it = iter(sequence)
    except TypeError:
        raise Exception("**ERROR** sequence must be iterable.")
    if not ((type(winSize) == type(0)) and (type(step) == type(0))):
        raise Exception("**ERROR** type(winSize) and type(step) must be int.")
    if step > winSize:
        raise Exception("**ERROR** step must not be larger than winSize.")
    if winSize > len(sequence):
        raise Exception("**ERROR** winSize must not be larger than sequence length.")

    # Pre-compute number of chunks to emit
    numOfChunks = ((len(sequence)-winSize)/step)+1

    # Do the work
    for i in range(0,numOfChunks*step,step):
        yield sequence[i:i+winSize]

GitHub Repo:

This and other python code lives in a gitHub repository. To get the latest version of the repo, see the README file in the project’s home directory: http://github.com/xguse/scipherPyProj.  This function lives in the “basicDefs” module.

Standard Disclaimer:

I am not claiming that this code is perfect or even better than average.  It works well for me and I think it may be useful to others, but as with all code you simply copy off of a webpage: you use it at your own risk.  That said, I am always eager to learn better python.  So if any other master python-monks feel like commenting on where I could improve, I am always happy to hear about it!

Happy Coding!

A true iterator version:

# This is my original solution.  It is a true iterator class.  It is included here only for completeness. I have replaced it in the repo with the much clearer code above.
class SlidingWindow(object):
    """Returns iterator that will emit chunks of size 'winSize' each time self.next()
    is called."""
    def __init__(self,sequence,winSize,step=1):
        """Returns iterator that will emit chunks of size 'winSize' and 'step' forward in
        the seq each time self.next() is called."""

        # verification code
        if not type(sequence) == type(''):
            raise Exception("**ERROR** type(sequence) must be str.")
        if not ((type(winSize) == type(0)) and (type(step) == type(0))):
            raise Exception("**ERROR** type(winSize) and type(step) must be int.")
        if step > winSize:
            raise Exception("**ERROR** step must not be larger than winSize.")
        if winSize > len(sequence):
            raise Exception("**ERROR** winSize must not be larger than sequence length.")
        self._seq = sequence
        self._step = step
        self._start = 0
        self._stop = winSize

    def __iter__(self):
        return self

    def next(self):
        """Returns next window chunk or ends iteration if the sequence has ended."""
        try:
            assert self._stop <= len(self._seq), "Not True!"
            chunk = self._seq[self._start:self._stop]
            self._start += self._step
            self._stop  += self._step
            return chunk
        except AssertionError:
            raise StopIteration
Advertisements

25 thoughts on “Simple Sliding Window Iterator in Python

  1. Jake Biesinger

    Are you trying for backward compatibility with earlier versions of python? Why not a single function that *yields* values?

    Reply
    1. gasstationwithoutpumps

      I had the same immediate response. Write the loop as you normally would, but replace the body with “yield chunk”

      The code ends up much clearer and in only one piece, rather than needing “class” “__init__”, “__iter__”, and “__next__”. Class-based iterators are fine if you have a data structure which needs an iterator added to it, but if you are just iterating over an existing object like a string, a generator function is simpler and clearer.

      Reply
    1. gasstationwithoutpumps

      Definitely better, though I’d be tempted to compute the range more exactly, rather than testing length on each iteration. That is probably my old C habits kicking in, though, as the increase in speed would be tiny and the loss of clarity (and hence chance of error) could be large.

      Reply
  2. Jake Biesinger

    I also think precomputing the range would be a good idea.

    One interesting change would be to make this work for any iterable, not just strings. The yielded value would be a list that sliced the input iterable. If the input were a string, you’d probably want a string back so maybe that wouldn’t be worth the special case.

    Reply
  3. xguse Post author

    Once again: thanks for the input. Code and repo updated. There is something altogether awesome and depressing when your function spends more lines validating inputs than doing its job… Thanks again y’all!

    Jake: I am of the opinion that its up to the user to make sure that they get out what they expect. Its worth a little ‘point-of-use’ validation to not have to write an ALMOST identical function for a slightly diff purpose. I had actually meant to remove the string requirement but forgot. Thanks for pointing that out.

    Reply
  4. gasstationwithoutpumps

    OK, more nit-picking: Have you thought about what happens when winSize or step are negative or zero? Does it do something reasonable? Can it be made to? Add these behaviors to the docstring, or add error messages for unaccepted values. Can you fix the error messages (using “ERROR: …”.format(…) ) to echo the illegal values or types when a problem is detected?

    Reply
  5. Eric

    Hi,

    Thank you for posting this code. I think there is a mistake though in how it computes the ranges. If I get this right, the line that reads like ‘for i in range(0,numOfChunks,step):’ should in fact be ‘for i in range(0,numOfChunks*step,step):’. Try with bigger values of step with your version and you will get far too few sequences.

    Apologies if I am mistaken, but by fixing it like I suggest, I got a much more likely output (sequence length = 2000, window size = 40 and step = 50 would otherwise give me only one sequence).

    Cheers!

    Eric

    Reply
    1. xguse Post author

      Eric: actually, if you were to input those example values you would raise an exception bc I enforce that the step must not be larger than winSize. This was done under the assumption that one would not want to leave any part of the sequence unexamined and a step larger than winSize will result in jumping over parts of the sequence.

      If this is what you want you could probably do what you suggest to achieve it. But you would need to remove this check code:

      if step > winSize:
      raise Exception("**ERROR** step must not be larger than winSize.")

      Reply
      1. xguse Post author

        Eric,

        I was just looking at this again and you are completely right. I misunderstood your point it seems. I thought you were talking about stepSizes larger than windowSize. I have changed the code to reflex your comments. THANK YOU!

        Gus

  6. Jenny

    Hey, just wanted to say thank you for this brilliant explanation of sliding window analysis! I am using it for my dissertation project and this was the best and most comprehensive explanation I could find!
    🙂

    Reply
  7. agforsyth

    There are many rolling window iterators out there that operate in linear time — yours does not, since you slice once for every iteration. Also, you test if the object passed is an iterable, but they you use it like a sequence — meaning your version only works on tuples, lists, and the like (with support __getitem__) rather than all iterables. There is one without these problems in an old version of the itertools docs, and many on Stack Overflow, including http://stackoverflow.com/questions/7113724/iterator-with-memory/7113802#7113802

    Reply
    1. xguse Post author

      Hello agforsyth!

      First let me apologize if two replies show up here. WordPress seems to have messed up posting the first.
      Thanks for coming by! And thanks for your comments. I will be improving the code soon; I hope.

      About enforcing iterables but coding an implementation that does not support all iter-types: I was trying to address user suggestions to support generalization. The issue you point out was absolutly an oversight on my part; thanks for the heads up. I probably should not have added it to the code at all, however, bc the purpose of this code was to be a easy-to-understand and simple-to-use example of how sliding windows are usually employed in *bioinformatics sequence analysis*. In these cases, the sequence is virtually guaranteed to be a string.

      Your comments on the time-space of my code is interesting but a bit either misleading or wrong IMO, at least when compared with the code that you imply is better. Out of curiously, I obtained and timed the code that you linked to at StackOverflow (how *friendly* of you to provide an undisclosed endorsement and link to your own code from my blog btw). I tested both methods across six orders of magnitude of N = len(sequence) and my code is reliably 1.5 times faster than yours at all lengths. This implies that both methods occupy the same “Big O” time-space yet mine is still faster.

      Memory will be another matter. I admit that my code will eventually raise MemErrors, but so far, its practical use has shown me no problems here. I WILL be attempting to fix this too if it does not significantly obscure the code, since this blog is aimed at folks who may not understand the intricacies of python yet.

      I would welcome elaboration of your time-space comments, however, as I am always looking to learn more!

      Anyway, thanks for dropping by and I appreciate the comments!

      Reply
      1. jakebiesinger

        Why would your code raise memory errors? As long as you’re not keeping those slices around once they’re processed, you should be fine right?

      2. xguse Post author

        Yeah… I was thinking if the window size got out of control, but the windows will always be smaller than the sequence so that claim was poorly thought out.

  8. agforsyth

    See http://wiki.python.org/moin/TimeComplexity for more info about Time complexity. Baslically, list slicing is O(k), where k is the length of the slice — meaning that the bigger the “window” the slower your code. Testing at different sequence lengths won’t make any difference. Consider using a buffer or memoryview instead of string slicing — creating those are constant time operation. I had no intention of “endorsing” my code, it just took me less time to provide a link to that than to look up another easy-to-understand implementation that supports any iterable.

    Reply
    1. xguse Post author

      Please forgive my continued confusion. You say that my code does not operate in linear time bc I slice in each iteration and that slicing is O(n) which is basically the definition of linear time. I am MOSTLY a biologist who has done his best to bone up on enough computational science to try to avoid MAJOR time/Mem problems but do not have as firm a grasp on BigO stuff as I thought it seems.

      So while you say that slicing is linear {O(n) when n=winSize}, you say my code is NOT linear because I slice every iteration. Would you mind proposing a time that you claim it IS then? Are you thinking O(iter^n) or something?

      Reply
      1. agforsyth

        Think about nested loops. If you have an outer loop with n-iterations and an inner loop with k-iterations, then your complexity is O(n*k). In this case, slicing the window is the inner loop, and iterating over the list is the outer loop.

      2. xguse Post author

        Got it. Thanks for your patience! I am not sure I will change the code for this since the current form mirrors well the explanation in the text, but I will certainly be looking into your comments for my own project code from now on!

    2. jakebiesinger

      I guess the better benchmark would be to have increased window lengths, not increased sequence lengths.

      Reply
      1. xguse Post author

        Yes I got that. However, in MY personal use case it doesnt really matter since the window size will never really grow beyond 10.

        Just for fun I did a quick %timeit in ipython anyway varying the window size and mine (slidingWindow) still came out better than the stackOverflow one (window)….

        In [33]: %timeit [x for x in slidingWindow(range(1000000),10)]
        1 loops, best of 3: 282 ms per loop

        In [34]: %timeit [x for x in window(range(1000000),10)]
        1 loops, best of 3: 424 ms per loop
        ————–
        In [35]: %timeit [x for x in slidingWindow(range(1000000),100)]
        1 loops, best of 3: 576 ms per loop

        In [36]: %timeit [x for x in window(range(1000000),100)]
        1 loops, best of 3: 1.27 s per loop
        ————–
        In [37]: %timeit [x for x in slidingWindow(range(1000000),1000)]
        1 loops, best of 3: 4.27 s per loop

        In [38]: %timeit [x for x in window(range(1000000),1000)]
        1 loops, best of 3: 9.29 s per loop

      2. jakebiesinger

        Surprising. How about if you leave off the tuple conversion (which is O(k) and probably worse than your string slicing) and just return the deque at each iteration?

        Orsince you expect a string back, do

        yield ”.join(d)

        which is again O(k) but might be better than tuple creation.

      3. xguse Post author

        Yeah, the tuple conversion in the window() def was a big problem. Dropping that gives:

        In [2]: %timeit [x for x in slidingWindow(range(1000000),1000)]
        1 loops, best of 3: 4.44 s per loop

        In [3]: %timeit [x for x in window(range(1000000),1000)]
        10 loops, best of 3: 140 ms per loop

  9. Evan

    Hi! Just came across your post and it sounds close to what I’m looking for; however, I’m looking to iterate a sliding 2D 3×3 numpy array of ints over a larger 2D 256×256 numpy array of ints, and to perform a calculation on that 3×3 area at each iteration. Basically, to do image processing with a mask. Do you have any ideas how your example could be adapted to this use case?

    Reply

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out / Change )

Twitter picture

You are commenting using your Twitter account. Log Out / Change )

Facebook photo

You are commenting using your Facebook account. Log Out / Change )

Google+ photo

You are commenting using your Google+ account. Log Out / Change )

Connecting to %s