UPDATED (Sun Feb 19 14:56:28 PST 2012)
High-throughput sequencing (HTS) is rapidly advancing our ability to understand how the genome responds to its environment. It also presents a challenge to those tasked with analyzing the results: the massive files produced can overwhelm a modest computer’s store of available memory. The simplest way around this problem is to work with only a small part of the file at a time. Below I provide an example of a very simple, easy-to-extend, stand-alone python parser that returns a single fastQ record at a time, giving memory-efficient access to these commonly massive files. It is also small, simple to understand, and does not depend on other packages.
The fastQ-formatted file is a very common output of HTS. It consists of many fastQ ‘records’ that are each 4 lines long and provide some header information along with the sequence of a single short read and a string of coded quality scores that correspond to the probability that each position’s listed nucleotide was called incorrectly. An example record is shown below.
@Required_read_ID_header
GATTTGGGGTTCAAAGCAGTATC
+Optional_header
!''*((((***+))%%%++)(%%
The lines are defined as follows:
- This line starts with a symbol (usually the ‘@’ character) followed by a required header string that identifies the individual read.
- The string giving the deduced sequence of this read.
- Another symbol (usually the ‘+’ character) followed by an optional additional header string. The symbol is required but the header is not.
- A string of the same length as the sequence in line 2 that encodes the quality score for each corresponding position in the sequence string. (This is usually some variation of the Phred score.)
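To make the quality line concrete: each character encodes an integer Phred score as an offset ASCII value, and a score maps to an error probability via P = 10^(-Q/10). Here is a minimal sketch (not part of the parser below), assuming the common Phred+33 (“Sanger”) encoding; other pipelines, such as older Illumina runs, used an offset of 64.

```python
def qual_to_error_probs(qual_str, offset=33):
    """Convert a fastQ quality string to per-base error probabilities."""
    probs = []
    for char in qual_str:
        phred = ord(char) - offset           # recover the integer Phred score
        probs.append(10 ** (-phred / 10.0))  # P(error) = 10^(-Q/10)
    return probs

print(qual_to_error_probs("I!"))  # 'I' is Q40 -> 0.0001; '!' is Q0 -> 1.0
```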
My code returns a parser object that iterates through a fastQ file, returning a single parsed read record each time. The user can then handle this information however they like, one record at a time. The parser also performs some simple validation of the data, so that if the fastQ file is defective somewhere deep in its contents, the user is notified of the approximate location of the problem and the script terminates rather than propagating erroneous information to downstream analysis steps.
Usage:
parser = ParseFastQ(filePath)  # optional arg: headerSymbols allows changing the header symbols
for record in parser:
    ... do something with record ...
Since the parser is an iterator, the loop above will exit gracefully when the end of file (EOF) is reached, unless the EOF occurs in the midst of an incomplete record; in that case an error will be raised. The convenience method “getNextReadSeq()” simply returns the read sequence alone. You can extend this parser by adding more such methods to customize it for your purposes. Once getNextReadSeq() reaches the EOF it will return the None object, so scripts that use this method will need to expect that.
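The None-sentinel loop just described can be sketched like this; the tiny stand-in class below is only there to keep the example self-contained (a real script would call ParseFastQ(filePath).getNextReadSeq() instead):

```python
from io import StringIO

class TinyReader(object):
    """Stand-in mimicking getNextReadSeq(): one sequence per call, None at EOF."""
    def __init__(self, handle):
        self._file = handle

    def getNextReadSeq(self):
        lines = [self._file.readline().strip("\n") for _ in range(4)]
        if not any(lines):
            return None   # EOF sentinel, as described in the text
        return lines[1]   # return only the sequence line

parser = TinyReader(StringIO("@r1\nGATTACA\n+\nIIIIIII\n"))
seq = parser.getNextReadSeq()
while seq is not None:    # expect the None object and stop cleanly
    print(seq)            # prints: GATTACA
    seq = parser.getNextReadSeq()
```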
Finally, I am not claiming that it is perfect or even better than average. It works well for me and I think it may be useful to others, but as with all code you simply copy off of a webpage: you use it at your own risk. That said, I am always eager to learn better python. So if any other master python-monks feel like commenting on where I could improve, I am always happy to hear about it!
Happy Coding!
import gzip

class ParseFastQ(object):
    """Returns a read-by-read fastQ parser analogous to file.readline()"""

    def __init__(self, filePath, headerSymbols=['@', '+']):
        """Returns a read-by-read fastQ parser analogous to file.readline().
        Example: parser.next()
        -OR-
        It's an iterator so you can do:
            for rec in parser:
                ... do something with rec ...
        rec is tuple: (seqHeader,seqStr,qualHeader,qualStr)
        """
        if filePath.endswith('.gz'):
            self._file = gzip.open(filePath)
        else:
            self._file = open(filePath, 'rU')
        self._currentLineNumber = 0
        self._hdSyms = headerSymbols

    def __iter__(self):
        return self

    def next(self):
        """Reads in next element, parses, and does minimal verification.
        Returns: tuple: (seqHeader,seqStr,qualHeader,qualStr)"""
        # ++++ Get Next Four Lines ++++
        elemList = []
        for i in range(4):
            line = self._file.readline()
            self._currentLineNumber += 1  # increment file position
            if line:
                elemList.append(line.strip('\n'))
            else:
                elemList.append(None)

        # ++++ Check Lines For Expected Form ++++
        trues = [bool(x) for x in elemList].count(True)
        nones = elemList.count(None)
        # -- Check for acceptable end of file --
        if nones == 4:
            raise StopIteration
        # -- Make sure we got 4 full lines of data --
        assert trues == 4, \
            "** ERROR: It looks like I encountered a premature EOF or empty line.\n" \
            "Please check FastQ file near line number %s (plus or minus ~4 lines) and try again**" % (self._currentLineNumber)
        # -- Make sure we are in the correct "register" --
        assert elemList[0].startswith(self._hdSyms[0]), \
            "** ERROR: The 1st line in fastq element does not start with '%s'.\n" \
            "Please check FastQ file near line number %s (plus or minus ~4 lines) and try again**" % (self._hdSyms[0], self._currentLineNumber)
        assert elemList[2].startswith(self._hdSyms[1]), \
            "** ERROR: The 3rd line in fastq element does not start with '%s'.\n" \
            "Please check FastQ file near line number %s (plus or minus ~4 lines) and try again**" % (self._hdSyms[1], self._currentLineNumber)
        # -- Make sure the seq line and qual line have equal lengths --
        assert len(elemList[1]) == len(elemList[3]), \
            "** ERROR: The length of Sequence data and Quality data of the last record aren't equal.\n" \
            "Please check FastQ file near line number %s (plus or minus ~4 lines) and try again**" % (self._currentLineNumber)

        # ++++ Return fastQ data as tuple ++++
        return tuple(elemList)

    def getNextReadSeq(self):
        """Convenience method: returns only the sequence string of the next
        record, or the None object once the EOF is reached."""
        try:
            record = self.next()
            return record[1]
        except StopIteration:
            return None
GitHub Repo:
Download the latest version of this code from GitHub’s Gist site here.
More of my python code lives in a GitHub repository. To get the latest version of the repo, see the README file in the project’s home directory: http://github.com/xguse/scipherPyProj.
Nice one mate, TYVM.
Not sure if you check for this because I didn’t read the code very carefully, but double-check that you handle the case where the quality string starts with a “@” or “+” character. Not sure if “+” is much of an issue, but I have run into the “@” issue a few times, and some parsers just die when that happens. What I do is double-check that my qual string isn’t empty; if it is empty and I come across a “@” before I expect to see the next ID, I assume that line is the qual string.
Thanks for your “heads up”, John. I believe that I handle this by requiring that the file be formatted with no empty lines. When this assumption is enforced, the data have a known structure, so I actually gather and parse all four lines of a fastQ record as a single ordered entity. This way, I only have to enforce that lines 1 and 3 have the expected header symbols. The parser itself is agnostic to the contents of the quality strings.
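To illustrate the point about positional parsing (a toy demo of my own, not from the parser above): a record whose quality string begins with ‘@’ parses cleanly, because all four lines are consumed as one ordered block and only lines 1 and 3 are checked for header symbols.

```python
from io import StringIO

# A record whose quality string starts with '@' -- 7 bases, 7 quality chars.
fastq = StringIO("@read_1\nGATTACA\n+\n@IIIIII\n")

# Consume all four lines as one ordered block, as the parser does.
record = tuple(fastq.readline().strip("\n") for _ in range(4))
print(record[3])  # the '@...' line is still just quality data: @IIIIII
```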
Gus
long overdue: Thanks a lot for this parser!
Thomas
Pingback: Simple Sliding Window Iterator in Python « scipher
Just had to face this issue, and went with a more succinct version:
from itertools import ifilter, islice

def readFastq(fastqfile):
    fastqiter = ifilter(lambda l: l, fastqfile)  # skip blank lines
    fastqiter = (l.strip('\n') for l in fastqiter)  # strip trailing newlines
    while True:
        values = list(islice(fastqiter, 4))
        if len(values) == 4:
            header1, seq, header2, qual = values
        elif len(values) == 0:
            raise StopIteration
        else:
            raise EOFError("Failed to parse four lines from fastq file!")
        if header1.startswith('@') and header2.startswith('+'):
            yield header1[1:], seq, qual
        else:
            raise ValueError("Invalid header lines: %s and %s" % (header1, header2))
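For anyone running this today: itertools.ifilter was removed in Python 3 (the built-in filter replaced it), and raising StopIteration inside a generator has been an error since PEP 479. A hedged Python 3 rendering of the same generator follows, with stripping done before filtering (lines read from a file keep their trailing ‘\n’, so a “blank” line is "\n" and would otherwise never test false):

```python
from itertools import islice
from io import StringIO

def read_fastq(fastqfile):
    fastqiter = (l.strip("\n") for l in fastqfile)  # strip trailing newlines
    fastqiter = filter(None, fastqiter)             # then skip blank lines
    while True:
        values = list(islice(fastqiter, 4))
        if len(values) == 4:
            header1, seq, header2, qual = values
        elif len(values) == 0:
            return  # graceful EOF (replaces `raise StopIteration`, per PEP 479)
        else:
            raise EOFError("Failed to parse four lines from fastq file!")
        if header1.startswith("@") and header2.startswith("+"):
            yield header1[1:], seq, qual
        else:
            raise ValueError("Invalid header lines: %s and %s" % (header1, header2))

demo = StringIO("@r1\nACGT\n+\nIIII\n\n")  # trailing blank line is skipped
print(list(read_fastq(demo)))  # -> [('r1', 'ACGT', 'IIII')]
```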
I checked the code and it was not working for fastq files with sequences and quality scores spread over multiple lines. It also did not check whether the sequence and quality strings are equal in length. However, I wrote a simple script in pure python to do all of this.
First: Thanks for coming by and I am glad you seem to have solved your needs!
I am afraid I am not quite clear on what you say is broken in my code. “fastq with sequences and quality in multiple lines” could alternately describe all and none of the existing fastq file formats. Can you give me more details?
Regarding the check that sequence and quality are equal in length: thank you, and the code has been updated to do this.
Hey thanks for the reply to my comment.
Sorry for the unclear comment. When I say the fastq with sequences and quality scores in multiple lines, I mean like the example below:
@GRYJ9W001AWBAN
TCAGACGAGTGCGTCCAGTAATTAAGCCAAGGCGCATAGGATATA
CAGAAGTTAGTGGAACGCACTCGTCTGAGCGGGCTGGCAAGCGCA
+
FFFFFFFFFFDDA:::==55555@@BBBBBBBDAAAAA:::=BBA
???>>A===8889:99:22///34
Here the nucleotide sequence and quality scores are in more than one line.
I ran your updated code and it is still not working for this type of fastq file.
I am glad that you got your data sorted, but I do not think that this format is typical. As the whole point of this code is to be a SIMPLE parser that lets people build on this frame when they need more, I do not plan to change the code to accommodate all possible FastQ variations. I based this parser on the standard FastQ definition. Alterations to accommodate a user’s specific alternative needs are left as an exercise for the reader.
I have to say that I have NEVER seen such FastQ records. Why are they so disorganized? What source technology produces this output, and why can’t it manage to keep the data to a single line? If this is a Samtools ‘consensus’ output, it is a very specialized and non-standard formatting compared to the vast majority of FastQ files. My code also does not break apart PE records that have been joined into a single record. It is meant as a simple tool to get people started and a frame on which to customize.
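For the curious, here is one hedged sketch of that reader exercise (my own, not part of the parser above): it joins wrapped sequence lines until the ‘+’ separator appears, then joins wrapped quality lines until their combined length matches the sequence, following the layout shown in the comment above.

```python
from io import StringIO

def read_wrapped_fastq(handle):
    """Yield (header, seq, qual) from a fastQ whose seq/qual lines wrap."""
    header = handle.readline().strip("\n")
    while header:
        # Gather wrapped sequence lines until the '+' separator line.
        seq_parts = []
        line = handle.readline().strip("\n")
        while line and not line.startswith("+"):
            seq_parts.append(line)
            line = handle.readline().strip("\n")
        seq = "".join(seq_parts)
        # Gather wrapped quality lines until they cover the sequence.
        qual = ""
        while len(qual) < len(seq):
            chunk = handle.readline().strip("\n")
            if not chunk:
                raise EOFError("Ran out of quality data for %s" % header)
            qual += chunk
        yield header, seq, qual
        header = handle.readline().strip("\n")

demo = StringIO("@GRYJ9W001AWBAN\nACGTA\nCGT\n+\nIIIII\nIII\n")
for h, s, q in read_wrapped_fastq(demo):
    print(h, s, q)  # prints: @GRYJ9W001AWBAN ACGTACGT IIIIIIII
```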
Thanks again for reading and for your comments!
Could you also extend this article with an example explaining how to use this code, for newbies? Thanks in advance.