Tag Archives: bioinformatics

Simple Sliding Window Iterator in Python

A common task in bioinformatics is processing a sequence bit by bit, sometimes with overlapping regions.  I have provided an example of a simple, easy-to-extend, stand-alone Python iterator that returns a single defined window of any Python string object per iteration, allowing intuitive handling of sliding window tasks.  The code is easy to understand and does not depend on other packages.

Consider the task of scanning a sequence of length l for a pattern representing a transcription factor binding site (TFBS) or other feature of interest.  You start at the first position in the sequence (i = 1), retrieve a chunk of a specific length (k), and test it for the probability that it matches the feature’s described qualities.  You then return to the original sequence and retrieve a chunk of length k with a position offset of one DNA letter forward into the sequence (i = 2).  This process is repeated until the last chunk of length k is reached at position i = l - k + 1.

This type of process is called a sliding window.  For example, consider the sequence “ATCGATGCTA”.  It has an l of 10.  If the feature that we are interested in is usually described as 5 bp long, we would set our k (window size) to 5.  Our step size (how far to advance the chunk’s starting position each time) will generally be 1 for this type of problem, but larger step sizes can make sense for other purposes.  I have diagrammed the result of the sliding window procedure for our hypothetical sequence below.
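The procedure described above can be sketched as a small generator. This is a minimal illustration, not the iterator from the full post: the name `sliding_window` and its parameters `window_size` and `step` are assumptions made for this example. Note that Python indexes from 0, so the start positions run from 0 to l - k rather than from 1 to l - k + 1.

```python
def sliding_window(sequence, window_size, step=1):
    """Yield (start, chunk) pairs for each full-length window.

    `start` is the 0-based offset of the chunk within `sequence`.
    Hypothetical helper written for illustration only.
    """
    if window_size < 1 or step < 1:
        raise ValueError("window_size and step must be positive integers")
    # The last valid 0-based start is len(sequence) - window_size.
    # If the sequence is shorter than the window, nothing is yielded.
    for start in range(0, len(sequence) - window_size + 1, step):
        yield start, sequence[start:start + window_size]


if __name__ == "__main__":
    # The example sequence from the text: l = 10, k = 5, step = 1.
    for start, chunk in sliding_window("ATCGATGCTA", 5):
        # Indent each chunk by its offset so the overlap is visible.
        print(" " * start + chunk)
```

With l = 10 and k = 5 this produces l - k + 1 = 6 windows, from "ATCGA" to "TGCTA". Because only full-length windows are yielded, a larger step size may leave the tail of the sequence uncovered; whether that matters depends on the task.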

Simple Python FastQ Parser

UPDATED (Sun Feb 19 14:56:28 PST 2012)

High-throughput sequencing (HTS) is rapidly advancing our ability to understand how the genome responds to its environment.  It also presents a challenge to those tasked with analyzing the results.  The files produced can be massive enough to overwhelm a modest computer’s available memory.  The simplest way around this problem is to work with only a small part of the file at a time.  I have provided an example of a simple, easy-to-extend, stand-alone Python parser that returns a single FASTQ record at a time, providing memory-efficient access to these often massive files.  It is also small, easy to understand, and does not depend on other packages.
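The one-record-at-a-time idea can be sketched as a generator over an open file handle. This is a minimal illustration under stated assumptions, not the parser from the full post: the name `read_fastq` is hypothetical, and the sketch assumes the common four-line record layout (header starting with "@", sequence, separator starting with "+", quality string), so it will not handle sequences wrapped across multiple lines.

```python
import io


def read_fastq(handle):
    """Yield (name, sequence, quality) tuples, one record at a time.

    Assumes four-line FASTQ records; hypothetical helper for illustration.
    Only the current record is held in memory.
    """
    while True:
        header = handle.readline()
        if not header:
            return  # end of file
        if not header.strip():
            continue  # tolerate blank lines between or after records
        header = header.rstrip()
        sequence = handle.readline().rstrip()
        separator = handle.readline().rstrip()
        quality = handle.readline().rstrip()
        if not header.startswith("@") or not separator.startswith("+"):
            raise ValueError("malformed FASTQ record near: " + header)
        if len(sequence) != len(quality):
            raise ValueError("sequence/quality length mismatch in: " + header)
        yield header[1:], sequence, quality


if __name__ == "__main__":
    # An in-memory stand-in for a real file opened with open(path).
    example = io.StringIO("@read1\nATCGA\n+\nIIIII\n@read2\nGGCTA\n+\n!!!!!\n")
    for name, sequence, quality in read_fastq(example):
        print(name, sequence, quality)
```

Because the function is a generator, the caller can loop over a file of any size while memory use stays bounded by the size of a single record.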
