UPDATED (Sun Feb 19 14:56:28 PST 2012)
High-throughput sequencing (HTS) is rapidly advancing our ability to understand how the genome responds to its environment. It also presents a challenge to those tasked with analyzing the results: the massive files produced can overwhelm a modest computer’s store of available memory. The simplest way around this problem is to work with only a small part of the file at a time. Here I provide a very simple, easy-to-extend, stand-alone Python parser that returns a single fastQ record at a time, giving memory-efficient access to these commonly massive files. It is also small, simple to understand, and does not depend on other packages.
The fastQ-formatted file is a very common output file from HTS. It consists of many fastQ ‘records’ that are each 4 lines long and provide some header information along with the sequence of a single short read and a string of coded quality scores that correspond to the probability that each position’s listed nucleotide was called incorrectly. An example record is shown below.
@Required_read_ID_header
GATTTGGGGTTCAAAGCAGTATC
+Optional_header
!''*((((***+))%%%++)(%%
The lines are defined as follows:
- This line starts with a symbol (usually the ‘@’ character) followed by a required header string that identifies the individual read.
- The string giving the deduced sequence of this read.
- Another symbol (usually the ‘+’ character) followed by an optional additional header string. The symbol is required but the header is not.
- A string of the same length as the sequence in line 2 that codes the quality score for each corresponding position in the sequence string. (This is usually some variation of the Phred score.)
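To make the quality line concrete, here is a minimal sketch of decoding it, assuming the common Sanger/Illumina 1.8+ convention where each character encodes its Phred score plus an ASCII offset of 33. The helper name decode_qualities is mine, not part of the parser below.

```python
def decode_qualities(qual_str, offset=33):
    """Convert a fastQ quality string to a list of integer Phred scores.

    Assumes the Phred+33 (Sanger) encoding; older Illumina files may
    use an offset of 64 instead.
    """
    return [ord(ch) - offset for ch in qual_str]

# Each score Q relates to the error probability P by Q = -10 * log10(P).
# '!' is ASCII 33, so it decodes to Phred score 0 (maximum uncertainty).
scores = decode_qualities("!''*(")
```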
My code returns a parser object that will iterate through a fastQ file, returning a single parsed read record each time. The user can then handle this information however they like, one record at a time. It also provides some simple validation of the data, so that if the fastQ file is defective somewhere deep in its contents the user will be notified of the approximate location of the problem and the script will terminate, preventing erroneous information from propagating to down-stream analysis steps.
parser = ParseFastQ(filePath)  # optional arg: headerSymbols allows changing the header symbols
for record in parser:
    # ... do something with record ...
Since the parser is an iterator, the above will gracefully exit the loop when the end of file (EOF) is reached, unless the EOF occurs in the midst of an incomplete record; in that case, an error will be raised. The convenience method “getNextReadSeq()” simply returns the read sequence alone. You can extend this parser by adding more such methods to customize it for your purposes. Once this method reaches the EOF it will return the None object, so any script that uses it will need to check for that None return.
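The None-at-EOF behavior described above implies a particular calling pattern. Here is a sketch of that loop; FakeParser is a hypothetical stand-in for the real parser so the example is self-contained.

```python
class FakeParser(object):
    """Stub that mimics getNextReadSeq(): yields sequences, then None at EOF."""
    def __init__(self, seqs):
        self._seqs = iter(seqs)

    def getNextReadSeq(self):
        # Mirrors the real method: next read sequence, or None once exhausted.
        return next(self._seqs, None)

parser = FakeParser(["GATTTGGGG", "TTCAAAGCA"])
while True:
    seq = parser.getNextReadSeq()
    if seq is None:   # EOF sentinel: stop instead of processing None
        break
    print(seq)
```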
Finally, I am not claiming that it is perfect or even better than average. It works well for me and I think it may be useful to others, but as with all code you simply copy off of a webpage: you use it at your own risk. That said, I am always eager to learn better python. So if any other master python-monks feel like commenting on where I could improve, I am always happy to hear about it!
import gzip

class ParseFastQ(object):
    """Returns a read-by-read fastQ parser analogous to file.readline()."""

    def __init__(self, filePath, headerSymbols=['@', '+']):
        """Returns a read-by-read fastQ parser analogous to file.readline().
        Example: parser.next()
        -OR-
        It's an iterator so you can do:
        for rec in parser:
            ... do something with rec ...
        rec is tuple: (seqHeader,seqStr,qualHeader,qualStr)
        """
        if filePath.endswith('.gz'):
            self._file = gzip.open(filePath)
        else:
            self._file = open(filePath, 'rU')
        self._currentLineNumber = 0
        self._hdSyms = headerSymbols

    def __iter__(self):
        return self

    def next(self):
        """Reads in next element, parses, and does minimal verification.
        Returns: tuple: (seqHeader,seqStr,qualHeader,qualStr)"""
        # ++++ Get Next Four Lines ++++
        elemList = []
        for i in range(4):
            line = self._file.readline()
            self._currentLineNumber += 1  # increment file position
            if line:
                elemList.append(line.strip('\n'))
            else:
                elemList.append(None)

        # ++++ Check Lines For Expected Form ++++
        trues = [bool(x) for x in elemList].count(True)
        nones = elemList.count(None)
        # -- Check for acceptable end of file --
        if nones == 4:
            raise StopIteration
        # -- Make sure we got 4 full lines of data --
        assert trues == 4,\
            "** ERROR: It looks like I encountered a premature EOF or empty line.\n\
            Please check FastQ file near line number %s (plus or minus ~4 lines) and try again**" % (self._currentLineNumber)
        # -- Make sure we are in the correct "register" --
        assert elemList[0].startswith(self._hdSyms[0]),\
            "** ERROR: The 1st line in fastq element does not start with '%s'.\n\
            Please check FastQ file near line number %s (plus or minus ~4 lines) and try again**" % (self._hdSyms[0], self._currentLineNumber)
        assert elemList[2].startswith(self._hdSyms[1]),\
            "** ERROR: The 3rd line in fastq element does not start with '%s'.\n\
            Please check FastQ file near line number %s (plus or minus ~4 lines) and try again**" % (self._hdSyms[1], self._currentLineNumber)
        # -- Make sure the seq line and qual line have equal lengths --
        assert len(elemList[1]) == len(elemList[3]),\
            "** ERROR: The length of Sequence data and Quality data of the last record aren't equal.\n\
            Please check FastQ file near line number %s (plus or minus ~4 lines) and try again**" % (self._currentLineNumber)

        # ++++ Return fastQ data as tuple ++++
        return tuple(elemList)

    def getNextReadSeq(self):
        """Convenience method: returns only the readSeq of the next record,
        or None once the EOF is reached."""
        try:
            record = self.next()
            return record[1]
        except StopIteration:
            return None
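As an aside, the same four-lines-at-a-time grouping can also be sketched with itertools.islice. This is an alternative minimal approach under the same assumptions (one record per four lines, hard failure on an incomplete record), not a replacement for the class above; the function name fastq_records is mine.

```python
from itertools import islice
from io import StringIO

def fastq_records(handle):
    """Yield (header, seq, qual_header, qual) tuples four lines at a time."""
    while True:
        chunk = [line.rstrip("\n") for line in islice(handle, 4)]
        if not chunk:
            return  # clean EOF: no lines left
        if len(chunk) < 4:
            raise ValueError("premature EOF: incomplete fastQ record")
        yield tuple(chunk)

# Demo on an in-memory file standing in for a real fastQ file.
fake = StringIO("@r1\nGATTA\n+\n!!!!!\n")
for header, seq, qual_header, qual in fastq_records(fake):
    print(header, seq)
```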
More Python code lives in a GitHub repository. To get the latest version of the repo, see the README file in the project’s home directory: http://github.com/xguse/scipherPyProj.