LibStatGen: FASTQ

From Genome Analysis Wiki
Revision as of 17:18, 30 March 2010 by Goncalo (talk | contribs)

Where to find the fastQFile Library and the FastQValidator

The fastQFile and FastQValidator code can be downloaded at: http://www.sph.umich.edu/csg/mktrost/fastQFile/

How to Use the fastQFile Library

  • Library Name: libfqf.a
  • Additional Libraries Needed: libcsg/libcsg.a thirdParty/samtools/libbam.a
    • Note: When you include the libraries, make sure you include them in this order:
<path to base pipeline directory>/fastQFile/libfqf.a <path to base pipeline directory>/libcsg/libcsg.a <path to base pipeline directory>/thirdParty/samtools/libbam.a
  • Include Files: FastQFile.h
  • Class Name: FastQFile
    • Constructor Parameters:
      • int minReadLength - The minimum length that a base sequence must be for it to be valid.
      • int maxReportedErrors - The maximum number of errors that should be reported in detail before suppressing the errors.
  • disableMessages : Disables cout messages. This does not include the Base Composition. Those are turned on/off based on a parameter to validateFastQFile.
    • Parameters: NONE
    • Return Value: NONE
  • enableMessages : Enables cout messages. This does not include the Base Composition. Those are turned on/off based on a parameter to validateFastQFile.
    • Parameters: NONE
    • Return Value: NONE
  • openFile : Open a FastQ File
    • Parameters:
      • String filename - fastq file to be opened.
      • BaseAsciiMap::SPACE_TYPE spaceType:
        • BASE_SPACE - Bases only (A,C,G,T,N)
        • COLOR_SPACE - Color space only (0,1,2,3,.)
        • UNKNOWN - Base the decision on the first raw sequence character (default).
    • Return Value
      • FastQStatus: FASTQ_SUCCESS if successfully opened, FASTQ_FAILURE if not.
  • closeFile : Close a FastQ File
    • Parameters: NONE
    • Return Value
      • FastQStatus: FASTQ_SUCCESS if successfully closed, FASTQ_FAILURE if not.
  • isOpen : Determine if a FastQ File is open
    • Parameters: NONE
    • Return Value
      • bool: true if a file is open, false if not.
  • validateFastQFile : Validate an entire FastQ File
    • Parameters:
      • String filename - fastq file to be validated.
      • bool printBaseComp - whether or not to print the base composition (true = print; false = don't print)
      • BaseAsciiMap::SPACE_TYPE spaceType:
        • BASE_SPACE - Bases only;
        • COLOR_SPACE - Color space only;
        • UNKNOWN - Base the decision on the first raw sequence character (default).
    • Return Value
      • bool: true if there were no errors in the file, false otherwise.
    • Notes:
      • Invalid information is printed to cout until maxReportedErrors is hit.
  • readFastQSequence : Read a FastQ Sequence From the File
    • Parameters: NONE
    • Return Value
      • int: FASTQ_SUCCESS if successfully read and valid, FASTQ_FAILURE if not successfully read, FASTQ_INVALID if the sequence was invalid.
    • Notes:
      • Invalid information is printed to cout until maxReportedErrors is hit.
  • getSpaceType : Get the Space Type for the File
    • Parameters: NONE
    • Return Value
      • BaseAsciiMap::SPACE_TYPE:
        • COLOR_SPACE if the file is color space (0,1,2,3,.)
        • BASE_SPACE if the file is base space (A,C,G,T,N)
        • UNKNOWN if it has yet to be determined.
  • Access last read Sequence Lines
    • Public String Variables to avoid having to copy the strings:
      • mySequenceIdLine
      • mySequenceIdentifier
      • myRawSequence
      • myPlusLine
      • myQualityString
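
The UNKNOWN space type accepted by openFile and validateFastQFile defers the decision to the first raw sequence character. The following is only a sketch of how such a decision could be made from the character sets listed above; the enum and function names are hypothetical and are not taken from BaseAsciiMap, and the library's actual rule may differ.

```cpp
// Hypothetical stand-in for BaseAsciiMap::SPACE_TYPE (an assumption for this sketch).
enum class SpaceType { Unknown, BaseSpace, ColorSpace };

// Guess the space type from the first character of the first raw sequence.
// Character sets follow the lists above: A,C,G,T,N for base space and
// 0,1,2,3,. for color space. Anything else leaves the type undetermined.
SpaceType detectSpaceType(char firstChar)
{
    switch (firstChar)
    {
        case '0': case '1': case '2': case '3': case '.':
            return SpaceType::ColorSpace;
        case 'A': case 'C': case 'G': case 'T': case 'N':
        case 'a': case 'c': case 'g': case 't': case 'n':
            return SpaceType::BaseSpace;
        default:
            return SpaceType::Unknown;
    }
}
```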


Libraries & Classes

  • libfqf.a
    • BaseCount - wrapper around an array that has one index per base and an extra index for a total count of all bases. This class is used to keep a count of the number of times each index has occurred. It can print a percentage of the occurrence of each base against the total number of bases.
    • BaseComposition - class that tracks the composition of base by read location.
    • FastQFile - class that reads/validates a fastq file.
  • libcsg.a
    • String (StringBasics) - String class for string operations
    • BaseAsciiMap - Class for determining if a character is a valid base.
    • InputFile - Class for opening and operating on a file.
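
To make the BaseCount description concrete, here is a minimal sketch of an array with one slot per base plus one extra slot for the running total. The class name BaseCountSketch and all of its members are hypothetical; this is not the code in libfqf.a, only an illustration of the idea.

```cpp
#include <array>

// Hypothetical sketch of a per-base counter; not the library's BaseCount class.
class BaseCountSketch
{
public:
    // Count one occurrence of a base. Returns false (and counts nothing)
    // if the character is not one of A, C, G, T, N (either case).
    bool increment(char base)
    {
        int index = indexOf(base);
        if (index < 0) return false;
        ++myCounts[index];
        ++myCounts[TOTAL];
        return true;
    }

    // Total number of bases counted so far (the "extra index").
    unsigned long total() const { return myCounts[TOTAL]; }

    // Percentage of all counted bases that were the given base.
    double percent(char base) const
    {
        int index = indexOf(base);
        if (index < 0 || myCounts[TOTAL] == 0) return 0.0;
        return 100.0 * myCounts[index] / myCounts[TOTAL];
    }

private:
    enum { A_INDEX, C_INDEX, G_INDEX, T_INDEX, N_INDEX, TOTAL, ARRAY_SIZE };

    // Map a base character to its array slot, or -1 if it is not a base.
    static int indexOf(char base)
    {
        switch (base)
        {
            case 'A': case 'a': return A_INDEX;
            case 'C': case 'c': return C_INDEX;
            case 'G': case 'g': return G_INDEX;
            case 'T': case 't': return T_INDEX;
            case 'N': case 'n': return N_INDEX;
            default: return -1;
        }
    }

    std::array<unsigned long, ARRAY_SIZE> myCounts{};
};
```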


Library Output

When a sequence is read, an error message is output for each failed validation criterion until maxReportedErrors messages have been reported. For example:

ERROR on Line 25: The sequence identifier line was too short.
ERROR on Line 29: First line of a sequence does not begin wtih @
ERROR on Line 33: No Sequence Identifier specified before the comment.


FastQValidator

The FastQValidator was built using the FastQFile class. More details on that program are available on the FastQValidator page.


Use of the FastQFile class

Generally this class is used to read a fastq file and perform operations on the sequences. Here is an example of how this would be done.

   FastQFile fastQFile;
   String filename = <your filename>;
   // Open the fastqfile with the default UNKNOWN space type which will determine the 
   // base type from the first character in the sequence.
   if(fastQFile.openFile(filename) != FastQFile::FASTQ_SUCCESS)
   {
      // Failed to open the specified file.
      // Report the error and exit (handled by error).
      error("Failed to open file: %s", filename.c_str());
      return (<your return info to indicate failure>);
   }
   // Keep reading the file until there are no more fastq sequences to process.
   while (!fastQFile.isEof())
   {
      // Read one sequence. This call will read all the lines for 
      // one sequence.
      /////////////////////////////////////////////////////////////////
      // NOTE: It is up to you if you want to process only for success:
      //    if(fastQFile.readFastQSequence() == FastQFile::FASTQ_SUCCESS)
      // or for FASTQ_SUCCESS and FASTQ_INVALID: 
      //    if(fastQFile.readFastQSequence() != FastQFile::FASTQ_FAILURE)
      // Do NOT try to process on a FASTQ_FAILURE
      /////////////////////////////////////////////////////////////////
      if(fastQFile.readFastQSequence() == FastQFile::FASTQ_SUCCESS)
      {
         // The sequence is valid.
         <Your Processing Here>
         // For example if you want to print the lines of the sequence:
         printf("The Sequence ID Line is: %s\n", fastQFile.mySequenceIdLine.c_str());
         printf("The Sequence ID is: %s\n", fastQFile.mySequenceIdentifier.c_str());
         printf("The Sequence Line is: %s\n", fastQFile.myRawSequence.c_str());
         printf("The Plus Line is: %s\n", fastQFile.myPlusLine.c_str());
         printf("The Quality String Line is: %s\n", fastQFile.myQualityString.c_str());
      }
   }
   // Finished processing all of the sequences in the file.
   // Close the input file.
   fastQFile.closeFile();
   return(<your return info>); // It is up to you to determine your return.

Validation Criteria Used For Reading a Sequence

Sequence Identifier Line
Validation Criteria | Error Message
Line is at least 2 characters long ('@' and at least 1 for the sequence identifier). | ERROR on Line <current line #>: The sequence identifier line was too short.
Line starts with an '@'. | ERROR on Line <current line #>: First line of a sequence does not begin wtih @
Line does not contain a space between the '@' and the first sequence identifier (which must be at least 1 character). | ERROR on Line <current line #>: No Sequence Identifier specified before the comment.
Every entry in the file should have a unique identifier. | ERROR on Line <current line #>: Repeated Sequence Identifier: <identifier> at Lines <previous line #> <current line #>
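
The four Sequence Identifier Line checks can be sketched as follows. This is not the library's implementation; IdLineCheck and IdLineValidator are hypothetical names, and the check order simply follows the order in which the criteria are listed above. The sketch returns a result code rather than printing the error messages.

```cpp
#include <string>
#include <unordered_map>

// Hypothetical result codes; these names are not part of the FastQFile API.
enum class IdLineCheck
{
    Ok, TooShort, MissingAt, MissingIdentifier, RepeatedIdentifier
};

// Hypothetical helper that remembers every identifier it has accepted,
// mapped to the line number on which it first appeared.
class IdLineValidator
{
public:
    IdLineCheck check(const std::string& line, int lineNumber)
    {
        // Needs '@' plus at least one identifier character.
        if (line.size() < 2) return IdLineCheck::TooShort;
        if (line[0] != '@') return IdLineCheck::MissingAt;
        // A space right after '@' means the comment starts before any identifier.
        if (line[1] == ' ') return IdLineCheck::MissingIdentifier;

        // The identifier runs from just after '@' to the first space (or end of line).
        std::string::size_type space = line.find(' ');
        std::string id = (space == std::string::npos)
                             ? line.substr(1)
                             : line.substr(1, space - 1);

        // emplace() leaves the map unchanged if the identifier was already seen.
        if (!mySeen.emplace(id, lineNumber).second)
            return IdLineCheck::RepeatedIdentifier;
        return IdLineCheck::Ok;
    }

private:
    std::unordered_map<std::string, int> mySeen;
};
```

Storing the first line number alongside each identifier is what would allow a message such as the "Repeated Sequence Identifier" error to report both the previous and the current line.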


Raw Sequence Line
Validation Criteria | Error Message
A base sequence should have non-zero length. | ERROR on Line <current line #>: Raw Sequence is shorter than the min read length: 0 < <config min read length>
All characters in the base sequence must be in the allowable set specified via configuration (Base Only: A C T G N a c t g n; Color Space Only: 0 1 2 3 . (period)). | ERROR on Line <current line #>: Invalid character ('<invalid char>') in base sequence.
Reads should be of a configurable minimum length since many mappers will get into trouble with very short reads. If the raw sequence spans lines, the sum of the lengths of all lines is validated, not each individual line. | ERROR on Line <current line #>: Raw Sequence is shorter than the min read length: <read length> < <config min read length>
Each line of a raw sequence should have at least 1 character (not be blank). | ERROR on Line <current line #>: Looking for continuation of Raw Sequence or '+' instead found a blank line, assuming it was part of Raw Sequence.
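
As an illustration of the character-set and minimum-length criteria, the sketch below checks a raw sequence that has already been joined across any wrapped lines. The names SpaceType, findInvalidBase and meetsMinReadLength are hypothetical and do not come from the library.

```cpp
#include <cstddef>
#include <string>

// Hypothetical stand-in for the base/color space setting.
enum class SpaceType { BaseSpace, ColorSpace };

// Returns the position of the first character that is not in the allowed
// set for the given space type, or std::string::npos if all are allowed.
std::string::size_type findInvalidBase(const std::string& rawSequence,
                                       SpaceType spaceType)
{
    const char* allowed =
        (spaceType == SpaceType::BaseSpace) ? "ACTGNactgn" : "0123.";
    return rawSequence.find_first_not_of(allowed);
}

// The length check applies to the whole (joined) raw sequence,
// not to each individual line of a wrapped sequence.
bool meetsMinReadLength(const std::string& rawSequence,
                        std::size_t minReadLength)
{
    return rawSequence.size() >= minReadLength;
}
```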


Plus Line
Validation Criteria | Error Message
Must exist for every sequence. | ERROR on Line <current line #>: Reached the end of the file without a '+' line.
If the optional sequence identifier is specified, it must equal the one on the Sequence Identifier Line. | ERROR on Line <current line #>: Sequence Identifier on '+' line does not equal the one on the '@' line.


Quality String Line
Validation Criteria | Error Message
A quality string should be present for every base sequence. | ERROR on Line <current line #>: Quality string length (<quality length>) does not equal raw sequence length (<raw sequence length>)
Paired quality and base sequences should be of the same length. | ERROR on Line <current line #>: Quality string length (<quality length>) does not equal raw sequence length (<raw sequence length>)
Valid quality values should all have ASCII codes > 32. | ERROR on Line <current line #>: Invalid character ('<invalid char>') in quality string.
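
The quality string criteria reduce to a length comparison and a per-character range check. A minimal sketch, using the hypothetical helper names qualityLengthMatches and findInvalidQuality (neither is part of the library):

```cpp
#include <cstddef>
#include <string>

// True when the quality string has exactly one character per base.
// A missing quality string has length 0, so it fails for any non-empty read.
bool qualityLengthMatches(const std::string& qualityString,
                          const std::string& rawSequence)
{
    return qualityString.size() == rawSequence.size();
}

// Returns the position of the first character whose code is not > 32
// (space and control characters), or std::string::npos if none is found.
std::string::size_type findInvalidQuality(const std::string& qualityString)
{
    for (std::size_t i = 0; i < qualityString.size(); ++i)
    {
        // Cast so that characters above 127 are not treated as negative.
        if (static_cast<unsigned char>(qualityString[i]) <= 32)
            return i;
    }
    return std::string::npos;
}
```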


Reading Sequence Assumptions

  • The Sequence Identifier is separated from an optional comment by a space (" ").
  • No validation is required on the optional comment field of the Sequence Identifier Line.
  • The Sequence Identifier and the '+' Lines cannot wrap lines. They are each completely contained on one line.
  • Raw Sequences and Quality Strings may wrap lines.
  • All lines are part of the Raw Sequence Line until a line that starts with a '+' is discovered.
  • All lines are considered part of the quality string until at least the length of the associated raw sequence is hit (or the end of the file is reached). This is due to the fact that '@' is a valid quality character, so does not necessarily indicate the start of a Sequence Identifier Line.


Additional Features


Additional Wishlist - Not Implemented

  • Add an option that would reject raw sequence and quality strings that wrap over multiple lines. It would only allow 1 line per raw sequence/quality string.
  • Maybe report 2 types of information to the user: ERROR (critical failure) and WARNING (tolerable errors).