C++ Class: FastQFile

From Genome Analysis Wiki
Revision as of 13:34, 6 May 2010 by Mktrost

Reading/Validating FastQ Files In Your Program

The FastQFile class allows a user to easily read/validate a fastq file.

Public Class Methods

Method Name Description
FastQFile::FastQFile(int minReadLength = 10, int maxReportedErrors = 20) Initializes the FastQFile class. minReadLength sets the minimum length a base sequence must have to be considered valid. maxReportedErrors sets the maximum number of errors that are reported in detail before further error messages are suppressed.
void FastQFile::disableMessages() Disables cout messages. This does not include the Base Composition, which is turned on/off by a parameter to validateFastQFile.
void FastQFile::enableMessages() Enables cout messages. This does not include the Base Composition, which is turned on/off by a parameter to validateFastQFile.
void FastQFile::setQuitAfterErrorNum(int quitAfterErrorNum) Set the number of errors after which to quit reading/validating a file. (Defaults to -1)

-1 indicates to not quit until the entire file has been read/validated.

0 indicates to quit without reading/validating anything.

FastQStatus FastQFile::openFile(const char* fileName, BaseAsciiMap::SPACE_TYPE spaceType = BaseAsciiMap::UNKNOWN) Opens the specified fastq file with the specified base space type. The space type defaults to UNKNOWN, in which case it is determined from the first character of the sequence.

Returns FastQStatus to indicate if the open was successful or not.

FastQStatus FastQFile::closeFile() Close the opened fastq file.

Returns FastQStatus to indicate if the close was successful or not.

bool FastQFile::isOpen() Returns true if a fastq file is open, false if one is not open.
bool FastQFile::isEof() Returns true if it is the end of the file, false if not.
bool FastQFile::keepReadingFile() Returns true if the file should continue to be read. Returns false on EOF or on a read error (corrupt gzip files never indicate EOF).

Use this method instead of FastQFile::isEof() in loops that continually read sequences.

FastQStatus FastQFile::validateFastQFile(const String &filename, bool printBaseComp, BaseAsciiMap::SPACE_TYPE spaceType) Validates the specified fastq file using the specified SpaceType, printing the base composition if specified by printBaseComp.

Returns the fastq validation status: FASTQ_SUCCESS if the fastq file was validated successfully.

Error messages for invalid sequences are printed to cout until maxReportedErrors is reached.

FastQStatus FastQFile::readFastQSequence() Reads the next fastq sequence from the file.

Returns FASTQ_SUCCESS if the sequence was successfully read and was valid. Otherwise the failure status is returned.

Error messages for invalid sequences are printed to cout until maxReportedErrors is reached.

BaseAsciiMap::SPACE_TYPE FastQFile::getSpaceType() Returns the SpaceType for the file.
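
The values accepted by setQuitAfterErrorNum can be read as a simple stopping rule. The following is a minimal sketch of that rule only, not the class's actual implementation. The helper name shouldKeepReading and its parameters are hypothetical and are not part of FastQFile. It assumes that a positive limit stops reading once that many errors have been seen.

```cpp
#include <cassert>

// Hypothetical helper, NOT part of FastQFile: decides whether reading should
// continue, given the errors seen so far and the configured limit.
bool shouldKeepReading(int errorsSoFar, int quitAfterErrorNum)
{
    assert(errorsSoFar >= 0);

    // A negative limit (the documented default is -1) means never quit early.
    if (quitAfterErrorNum < 0)
    {
        return true;
    }

    // Otherwise stop once the error count reaches the limit. With a limit of
    // 0 this is false before anything is read, matching the description above.
    // Assumption: a positive limit N stops after the Nth error.
    return errorsSoFar < quitAfterErrorNum;
}
```

Under this reading, a limit of 0 stops before the first sequence is read, and a limit of -1 never stops early regardless of how many errors accumulate.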

Public Class Attributes

These attributes allow access to the contents of the last read fastq sequence. They are public String variables in order to avoid copying the strings.

Attribute Description
myRawSequence String containing the raw sequence for the last fastq sequence that was read.
mySequenceIdLine String containing the sequence identifier line for the last fastq sequence that was read.
mySequenceIdentifier String containing the sequence identifier for the last fastq sequence that was read.
myPlusLine String containing the plus line for the last fastq sequence that was read.
myQualityString String containing the quality string for the last fastq sequence that was read.

Public Class Enums

enum FastQStatus
Enum Value Description
FASTQ_SUCCESS method finished successfully.
FASTQ_FAILURE method failed to complete successfully.
FASTQ_INVALID sequence was invalid.
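
A caller typically has to decide which of these statuses allow the last read sequence to be processed. The sketch below illustrates one way to express that decision. The enum here is a stand-in that only mirrors the three documented values so the example is self-contained, and the helper canProcess is hypothetical, not part of the class.

```cpp
#include <cassert>

// Stand-in enum mirroring the three documented values. In real code the
// enum comes from the FastQFile header; this copy is an assumption made
// only to keep the sketch self-contained.
enum FastQStatus { FASTQ_SUCCESS, FASTQ_FAILURE, FASTQ_INVALID };

// Hypothetical helper: decides whether the sequence that was just read may
// be processed. acceptInvalid selects whether invalid sequences are also
// processed or only fully valid ones.
bool canProcess(FastQStatus status, bool acceptInvalid)
{
    switch (status)
    {
        case FASTQ_SUCCESS:
            return true;            // valid sequence, always safe to use
        case FASTQ_INVALID:
            return acceptInvalid;   // read, but failed validation
        case FASTQ_FAILURE:
            return false;           // the read itself failed, never process
    }
    assert(false && "unexpected FastQStatus value");
    return false;
}
```

Keeping FASTQ_FAILURE separate from FASTQ_INVALID matters because only the latter leaves a complete, if questionable, sequence to work with.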

Usage Example

Generally this class is used to read a fastq file and perform operations on the sequences. Here is an example of how this would be done.

   FastQFile fastQFile;
   String filename = <your filename>;
   // Open the fastqfile with the default UNKNOWN space type which will determine the 
   // base type from the first character in the sequence.
   if(fastQFile.openFile(filename.c_str()) != FastQFile::FASTQ_SUCCESS)
   {
      // Failed to open the specified file.
      // Report the error and exit (handled by error).
      error("Failed to open file: %s", filename.c_str());
      return (<your return info to indicate failure>);
   }
   // Keep reading the file until there are no more fastq sequences to process.
   while (fastQFile.keepReadingFile())
   {
      // Read one sequence. This call will read all the lines for 
      // one sequence.
      /////////////////////////////////////////////////////////////////
      // NOTE: It is up to you whether to process only on success:
      //    if(fastQFile.readFastQSequence() == FastQFile::FASTQ_SUCCESS)
      // or on both FASTQ_SUCCESS and FASTQ_INVALID: 
      //    if(fastQFile.readFastQSequence() != FastQFile::FASTQ_FAILURE)
      // Do NOT try to process on a FASTQ_FAILURE
      /////////////////////////////////////////////////////////////////
      if(fastQFile.readFastQSequence() == FastQFile::FASTQ_SUCCESS)
      {
         // The sequence is valid.
         <Your Processing Here>
         // For example if you want to print the lines of the sequence:
         printf("The Sequence ID Line is: %s\n", fastQFile.mySequenceIdLine.c_str());
         printf("The Sequence ID is: %s\n", fastQFile.mySequenceIdentifier.c_str());
         printf("The Sequence Line is: %s\n", fastQFile.myRawSequence.c_str());
         printf("The Plus Line is: %s\n", fastQFile.myPlusLine.c_str());
         printf("The Quality String Line is: %s\n", fastQFile.myQualityString.c_str());
      }
   }
   // Finished processing all of the sequences in the file.
   // Close the input file.
   fastQFile.closeFile();
   return(<your return info>); // It is up to you to determine your return.

Validation Criteria Used For Reading a Sequence

FastQ Validation Criteria

Reading Sequence Assumptions

  • The Sequence Identifier is separated from an optional comment by a space (" ").
  • No validation is required on the optional comment field of the Sequence Identifier Line.
  • The Sequence Identifier and the '+' Lines cannot wrap lines. They are each completely contained on one line.
  • Raw Sequences and Quality Strings may wrap lines.
  • All lines are part of the Raw Sequence Line until a line that starts with a '+' is discovered.
  • All lines are considered part of the quality string until at least the length of the associated raw sequence is reached (or the end of the file is reached). This is because '@' is a valid quality character, so it does not necessarily indicate the start of a Sequence Identifier Line.
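
The assumptions above can be sketched as a small standalone reader. This is an illustration of the rules only, not the FastQFile implementation: the FastqRecord struct, its field names, and the readRecord function are all hypothetical, and the sketch performs no validation beyond what is needed to find record boundaries. It also does not handle empty raw sequences or Windows line endings.

```cpp
#include <cassert>
#include <istream>
#include <sstream>
#include <string>

// Hypothetical record type; the field names are illustrative only.
struct FastqRecord
{
    std::string identifier;  // text after '@' up to the first space
    std::string comment;     // optional text after the first space
    std::string sequence;    // raw sequence with wrapped lines joined
    std::string plusLine;    // the '+' line, kept as read
    std::string quality;     // quality string with wrapped lines joined
};

// Reads one record from `in` following the assumptions listed above.
// Returns false at end of input or if the record is structurally incomplete.
bool readRecord(std::istream& in, FastqRecord& rec)
{
    rec = FastqRecord();
    std::string line;

    // The identifier line occupies exactly one line and starts with '@'.
    if (!std::getline(in, line) || line.empty() || line[0] != '@')
    {
        return false;
    }
    // The identifier ends at the first space; the rest is the comment.
    const std::string::size_type space = line.find(' ');
    if (space == std::string::npos)
    {
        rec.identifier = line.substr(1);
    }
    else
    {
        rec.identifier = line.substr(1, space - 1);
        rec.comment = line.substr(space + 1);
    }

    // Every line up to the first one starting with '+' is raw sequence.
    bool foundPlus = false;
    while (std::getline(in, line))
    {
        if (!line.empty() && line[0] == '+')
        {
            rec.plusLine = line;
            foundPlus = true;
            break;
        }
        rec.sequence += line;
    }
    if (!foundPlus)
    {
        return false;
    }

    // Quality lines are consumed until the quality string is at least as
    // long as the raw sequence. A leading '@' is therefore treated as a
    // quality character here, not as the start of the next record.
    while (rec.quality.size() < rec.sequence.size() && std::getline(in, line))
    {
        rec.quality += line;
    }
    // The loop ends either with enough quality characters or a failed read.
    assert(rec.quality.size() >= rec.sequence.size() || !in);

    return rec.quality.size() >= rec.sequence.size();
}
```

For example, feeding the reader a std::istringstream holding a record whose quality string is wrapped onto a second line that begins with '@' yields a single record, because the length rule is applied before any line is interpreted as a new Sequence Identifier Line.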


Additional Features


Additional Wishlist - Not Implemented

  • Add an option that would reject raw sequence and quality strings that wrap over multiple lines. It would only allow 1 line per raw sequence/quality string.
  • Maybe report 2 types of information to the user: ERROR (critical failure) and WARNING (tolerable errors).