Simple Python FastQ Parser

Python_Icon

UPDATED (Sun Feb 19 14:56:28 PST 2012)

High-throughput sequencing (HTS) is rapidly advancing our ability to understand how the genome responds to its environment.  It also presents a challenge to those tasked with analyzing the results.  Massive files can be produced that can overwhelm a modest computer’s store of available memory.  The simplest way around this problem is to only work with a small part of the file at a time.  I have provided an example of a very simple; easy to extend; and stand-alone python parser that returns a single fastQ record at a time to provide memory efficient access to these commonly massive files.  It is also small, simple to understand, and does not depend on other packages.

The fastQ formated file is a very common output file from HTS .  It consists of many fastQ ‘records’ that are each 4 lines long and provide some header information along with the sequence of a single short read and a string of coded quality scores that correspond to probability that each position’s listed nucleotide was called incorrectly.  An example record is shown below.

@Required_read_ID_header
GATTTGGGGTTCAAAGCAGTATC
+Optional_header
!''*((((***+))%%%++)(%%

The lines are defined as follows:

  1. This line starts with a symbol (usually the ‘@’character) followed by a required header string that identifies the individual read.
  2. The string giving the deduced sequence of this read.
  3. Another symbol (usually the ‘+’ character) followed by an optional additional header string.  The Symbol is required but the header is not.
  4. A string of the same length as the sequence in line 3 that codes the quality score for each corresponding position in the sequence string. (This is usually some variation of the Phred Score)

My code returns a parser object that will iterate through a fastQ file, returning a single parsed read record each time.  The user can then attend to this information however they like, one record at a time.  It also provides some simple validation of the data so that if the fastQ file is defective somewhere deep in its contents the user will be notified to the relative location of the problem and the script will terminate to prevent propagating erroneous information to down-stream analysis steps.

Usage:

parser = ParseFastQ(filePath)  # optional arg: headerSymbols allows changing the header symbols
for record in parser:
    ... do something with record ...

Since the parser is an iterator, the above will gracefully exit the loop when the end of file (EOF) is reached unless the EOF occurs in the midst of an incomplete record.  In that case, an error will be raised.  The convenience method “getNextReadSeq()” simply returns the read sequence alone.  You can extend this parser by adding more such methods to customize it for your purposes.  Once this method reaches the EOF it will return the None object. You will need to write your scripts to expect this if you use this method.

Finally, I am not claiming that it is perfect or even better than average.  It works well for me and I think it may be useful to others, but as with all code you simply copy off of a webpage: you use it at your own risk.  That said, I am always eager to learn better python.  So if any other master python-monks feel like commenting on where I could improve, I am always happy to hear about it!

Happy Coding!

class ParseFastQ(object):
    """Returns a read-by-read fastQ parser analogous to file.readline()"""
    def __init__(self,filePath,headerSymbols=['@','+']):
        """Returns a read-by-read fastQ parser analogous to file.readline().
        Exmpl: parser.next()
        -OR-
        Its an iterator so you can do:
        for rec in parser:
            ... do something with rec ...

        rec is tuple: (seqHeader,seqStr,qualHeader,qualStr)
        """
        if filePath.endswith('.gz'):
            self._file = gzip.open(filePath)
        else:
            self._file = open(filePath, 'rU')
        self._currentLineNumber = 0
        self._hdSyms = headerSymbols
        
    def __iter__(self):
        return self
    
    def next(self):
        """Reads in next element, parses, and does minimal verification.
        Returns: tuple: (seqHeader,seqStr,qualHeader,qualStr)"""
        # ++++ Get Next Four Lines ++++
        elemList = []
        for i in range(4):
            line = self._file.readline()
            self._currentLineNumber += 1 ## increment file position
            if line:
                elemList.append(line.strip('\n'))
            else: 
                elemList.append(None)
        
        # ++++ Check Lines For Expected Form ++++
        trues = [bool(x) for x in elemList].count(True)
        nones = elemList.count(None)
        # -- Check for acceptable end of file --
        if nones == 4:
            raise StopIteration
        # -- Make sure we got 4 full lines of data --
        assert trues == 4,\
               "** ERROR: It looks like I encountered a premature EOF or empty line.\n\
               Please check FastQ file near line number %s (plus or minus ~4 lines) and try again**" % (self._currentLineNumber)
        # -- Make sure we are in the correct "register" --
        assert elemList[0].startswith(self._hdSyms[0]),\
               "** ERROR: The 1st line in fastq element does not start with '%s'.\n\
               Please check FastQ file near line number %s (plus or minus ~4 lines) and try again**" % (self._hdSyms[0],self._currentLineNumber) 
        assert elemList[2].startswith(self._hdSyms[1]),\
               "** ERROR: The 3rd line in fastq element does not start with '%s'.\n\
               Please check FastQ file near line number %s (plus or minus ~4 lines) and try again**" % (self._hdSyms[1],self._currentLineNumber) 
        # -- Make sure the seq line and qual line have equal lengths --
        assert len(elemList[1]) == len(elemList[3]), "** ERROR: The length of Sequence data and Quality data of the last record aren't equal.\n\
               Please check FastQ file near line number %s (plus or minus ~4 lines) and try again**" % (self._currentLineNumber) 
        
        # ++++ Return fatsQ data as tuple ++++
        return tuple(elemList)

GitHub Repo:

Download the latest version of this code from GitHub’s Gist site here.

Also more python code lives in a gitHub repository. To get the latest version of the repo, see the README file in the project’s home directory: http://github.com/xguse/scipherPyProj.

13 thoughts on “Simple Python FastQ Parser

  1. John St. John

    Not sure if you check for this because I didn’t read the code very carefully. Double check that you handle the case where the quality string starts with a “@” or “+” character. Not sure if “+” is much of an issue, but I have run into the “@” issue a few times and some parsers just die when that happens. What I do is to double check that my qual string isn’t empty and assume that it is a qual string if it is empty and I come across a @ before I expect to see the next id.

    Reply
    1. xguse Post author

      Thanks for your “heads up”, John. I believe that I handle this by requiring that the file be formatted with no empty lines. When this assumption is enforced, the data have a known structure so I actually gather and parse all four lines of a fastQ as a single ordered entity. This way, I only have to enforce that the line 1 and 3 have the expected header symbols. This parser itself is agnostic to the contents of the quality strings.

      Gus

      Reply
  2. Pingback: Simple Sliding Window Iterator in Python « scipher

  3. Jake Biesinger

    Just had to face this issue, and went with a more succinct version:

    from itertools import ifilter, islice
    def readFastq(fastqfile):
    fastqiter = ifilter(lambda l: l, fastqfile) # skip blank lines
    fastqiter = (l.strip(‘\n’) for l in fastqiter) # strip trailing newlines
    while True:
    values = list(islice(fastqiter, 4))
    if len(values) == 4:
    header1,seq,header2,qual = values
    elif len(values) == 0:
    raise StopIteration
    else:
    raise EOFError(“Failed to parse four lines from fastq file!”)

    if header1.startswith(‘@’) and header2.startswith(‘+’):
    yield header1[1:], seq, qual
    else:
    raise ValueError(“Invalid header lines: %s and %s” % (header1, header2))

    Reply
    1. Jake Biesinger

      Cursed formatting…. make that:

      from itertools import ifilter, islice

      def readFastq(fastqfile):
      fastqiter = ifilter(lambda l: l, fastqfile) # skip blank lines
      fastqiter = (l.strip('\n') for l in fastqiter) # strip trailing newlines
      while True:
      values = list(islice(fastqiter, 4))
      if len(values) == 4:
      header1,seq,header2,qual = values
      elif len(values) == 0:
      raise StopIteration
      else:
      raise EOFError("Failed to parse four lines from fastq file!")

      if header1.startswith('@') and header2.startswith('+'):
      yield header1[1:], seq, qual
      else:
      raise ValueError("Invalid header lines: %s and %s" % (header1, header2))

      Reply
      1. Jake Biesinger

        one more try?

        
        from itertools import ifilter, islice
        
        def readFastq(fastqfile):
            fastqiter = ifilter(lambda l: l, fastqfile)  # skip blank lines
            fastqiter = (l.strip('\n') for l in fastqiter)  # strip trailing newlines
            while True:
                values = list(islice(fastqiter, 4))
                if len(values) == 4:
                    header1,seq,header2,qual = values
                elif len(values) == 0:
                    raise StopIteration
                else:
                    raise EOFError("Failed to parse four lines from fastq file!")
        
                if header1.startswith('@') and header2.startswith('+'):
                    yield header1[1:], seq, qual
                else:
                    raise ValueError("Invalid header lines: %s and %s" % (header1, header2))
        
  4. ram

    I checked the code and it was not working for fastq with sequences and quality in multiple lines. Also it did not check whether the sequence and quality are equal in length. However I wrote a simple script in pure python to perform all.

    Reply
    1. xguse Post author

      First: Thanks for coming by and I am glad you seem to have solved your needs!

      I am afraid I am not quite clear on what you say is broken in my code.

      not working for fastq with sequences and quality in multiple lines

      could alternately describe all and none of the existing fastq file formats. Can you give me more details?

      Regarding:

      it did not check whether the sequence and quality are equal in length.

      Thank you and the code has been updated to do this.

      Reply
  5. ram

    Hey thanks for the reply to my comment.
    Sorry for the unclear comment. When I say the fastq with sequences and quality scores in multiple lines, I mean like the example below:
    @GRYJ9W001AWBAN
    TCAGACGAGTGCGTCCAGTAATTAAGCCAAGGCGCATAGGATATA
    CAGAAGTTAGTGGAACGCACTCGTCTGAGCGGGCTGGCAAGCGCA
    +
    FFFFFFFFFFDDA:::==55555@@BBBBBBBDAAAAA:::=BBA
    ???>>A===8889:99:22///34

    Here the nucleotide sequence and quality scores are in more than one line.
    I run your updated code and still not running for this type of fastq file.

    Reply
    1. xguse Post author

      I am glad that you got your data sorted but I do not think that this format is typical. And as the whole point of this code is to be a SIMPLE parser and allow people to build complexity on this frame if they need more complexity, I do not plan to change the code to accommodate all possible FastQ variations. I based this parser on the standard FastQ definition. Random alterations to accommodate the user’s specific alternative needs is left as an exercise for the reader.

      I have to say that I have NEVER seen such FastQ records. Why are they so disorganized? What source technology produces this output and why cant it manage to keep the data to a single line? If this is a Samtools ‘consensus’ output, it is a very specialized and non-standard formatting compared to the vast majority of FastQ formatting. My code also does not break PE records that have been joined into a single record. It is meant as a simple tool to get people started and a frame on which to customize.

      Thanks again for reading and for your comments!

      Reply
  6. Bioinfo

    May you also extend this article by an example to explain how to use this code for newbies. Thanks in advance.

    Reply

Leave a comment