Difference between revisions of "C++ Class: FastQFile"

From Genome Analysis Wiki
 
(6 intermediate revisions by one other user not shown)
Line 1: Line 1:
 +
[[Category:C++]]
 +
[[Category:libStatGen]]
 +
[[Category:libStatGen FASTQ]]
 +
 
== Reading/Validating FastQ Files In Your Program ==
 
== Reading/Validating FastQ Files In Your Program ==
 
The '''FastQFile''' class allows a user to easily read/validate a fastq file.
 
The '''FastQFile''' class allows a user to easily read/validate a fastq file.
  
=== Public Class Methods ===
+
Documentation on this class is available at: http://csg.sph.umich.edu//mktrost/doxygen/current/classFastQFile.html
 
 
{| style="margin: 1em 1em 1em 0; background-color: #f9f9f9; border: 1px #aaa solid; border-collapse: collapse;" border="1"
 
|-style="background: #f2f2f2; text-align: center;"  '''FastQFile Class Methods'''
 
! Method Name !!  Description
 
|-
 
| <code>FastQFile::FastQFile(int minReadLength = 10, int maxReportedErrors = 20)</code>
 
| Initializes the FastQFile class. <code>minReadLength</code> sets the minimum length a base sequence must have to be considered valid. <code>maxReportedErrors</code> sets the maximum number of errors that are reported in detail before further errors are suppressed.
 
|-
 
| <code> void FastQFile::disableMessages()</code>
 
| Disables cout messages.  This does not include the Base Composition.  Those are turned on/off based on a parameter to validateFastQFile.
 
|-
 
| <code>void FastQFile::enableMessages()</code>
 
|Enables cout messages.  This does not include the Base Composition.  Those are turned on/off based on a parameter to validateFastQFile.
 
|-
 
| <code>void FastQFile::setQuitAfterErrorNum(int quitAfterErrorNum)</code>
 
| Set the number of errors after which to quit reading/validating a file. (Defaults to -1)
 
-1 indicates to not quit until the entire file has been read/validated.
 
 
 
0 indicates to quit without reading/validating anything.
 
|-
 
| <code>FastQStatus FastQFile::openFile(const char* fileName, BaseAsciiMap::SPACE_TYPE spaceType = BaseAsciiMap::UNKNOWN)</code>
 
| Open the specified fastq file with the specified base space type, which defaults to UNKNOWN (base the decision on the first character of the sequence).
 
 
 
Returns FastQStatus to indicate if the open was successful or not.
 
|-
 
| <code>FastQStatus FastQFile::closeFile()</code>
 
| Close the opened fastq file.
 
Returns FastQStatus to indicate if the close was successful or not.
 
|-
 
| <code>bool FastQFile::isOpen()</code>
 
| Returns true if a fastq file is open, false if one is not open.
 
|-
 
| <code>bool FastQFile::isEof()</code>
 
| Returns true if it is the end of the file, false if not.
 
|-
 
| <code>FastQStatus FastQFile::validateFastQFile(const String &filename, bool printBaseComp, BaseAsciiMap::SPACE_TYPE spaceType)</code>
 
| Validates the specified fastq file using the specified [[C++ Class: BaseAsciiMap#Public Class Enums|SpaceType]], printing the base composition if specified by printBaseComp.
 
Returns the fastq validation status: FASTQ_SUCCESS on a successfully validated fastq file.
 
 
 
Error messages for invalid sequences are printed to cout until maxReportedErrors is reached.
 
|-
 
| <code>FastQStatus FastQFile::readFastQSequence()</code>
 
| Read the next fastq sequence from the file.
 
Returns FASTQ_SUCCESS if it was successfully read and was valid.  Otherwise the failure status is returned.
 
 
 
Error messages for invalid sequences are printed to cout until maxReportedErrors is reached.
 
|-
 
| <code>BaseAsciiMap::SPACE_TYPE FastQFile::getSpaceType()</code>
 
| Return the [[C++ Class: BaseAsciiMap#Public Class Enums|SpaceType]] for the File
 
|-
 
|}
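
The interaction between <code>maxReportedErrors</code> and <code>setQuitAfterErrorNum</code> described in the table can be illustrated with a small standalone sketch. The <code>ErrorLimits</code> struct and its member functions are hypothetical names invented for this illustration. They are not part of libStatGen, and the real class may track errors differently.

```cpp
#include <cassert>

// Hypothetical bookkeeping for this sketch only; not the libStatGen implementation.
struct ErrorLimits {
    int maxReportedErrors = 20;  // errors printed in detail before suppression
    int quitAfterErrorNum = -1;  // -1: never quit early, 0: quit before reading
    int numErrors = 0;           // errors seen so far

    // True while reading/validating should continue.
    bool keepGoing() const {
        return quitAfterErrorNum < 0 || numErrors < quitAfterErrorNum;
    }

    // Counts one error; returns true if it should still be printed in detail.
    bool recordError() {
        ++numErrors;
        return numErrors <= maxReportedErrors;
    }
};
```

With the defaults, every error is counted and reading never stops early. With <code>quitAfterErrorNum</code> set to 0, <code>keepGoing()</code> is false before anything is read, which matches the behavior described for 0 in the table.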
 
 
 
=== Public Class Attributes ===
 
These attributes allow access to the contents of the last read fastq sequence.  They are public <code>String</code> variables in order to avoid copying the strings.
 
 
 
{| style="margin: 1em 1em 1em 0; background-color: #f9f9f9; border: 1px #aaa solid; border-collapse: collapse;" border="1"
 
|-style="background: #f2f2f2; text-align: center;"
 
! Attribute !!  Description
 
|-
 
| <code>myRawSequence</code>
 
| <code>String</code> containing the raw sequence for the last fastq sequence that was read.
 
|-
 
| <code>mySequenceIdLine</code>
 
| <code>String</code> containing the sequence identifier line for the last fastq sequence that was read.
 
|-
 
| <code>mySequenceIdentifier</code>
 
| <code>String</code> containing the sequence identifier for the last fastq sequence that was read.
 
|-
 
| <code>myPlusLine</code>
 
| <code>String</code> containing the plus line for the last fastq sequence that was read.
 
|-
 
| <code>myQualityString</code>
 
| <code>String</code> containing the quality string for the last fastq sequence that was read.
 
|}
 
 
 
=== Public Class Enums ===
 
{| style="margin: 1em 1em 1em 0; background-color: #f9f9f9; border: 1px #aaa solid; border-collapse: collapse;" border="1"
 
|-style="background: #f2f2f2; text-align: center;"
 
! colspan="2"| enum FastQStatus
 
|-
 
! Enum Value !!  Description
 
|-
 
| FASTQ_SUCCESS
 
| method finished successfully.
 
|-
 
| FASTQ_FAILURE
 
| method failed to complete successfully.
 
|-
 
| FASTQ_INVALID
 
| sequence was invalid.
 
|}
 
  
 
=== Usage Example ===
 
=== Usage Example ===
Line 112: Line 25:
 
   }
 
   }
 
   // Keep reading the file until there are no more fastq sequences to process.
 
   // Keep reading the file until there are no more fastq sequences to process.
   while (!fastQFile.isEof())
+
   while (fastQFile.keepReadingFile())
 
   {
 
   {
 
       // Read one sequence. This call will read all the lines for  
 
       // Read one sequence. This call will read all the lines for  
Line 142: Line 55:
  
 
== Validation Criteria Used For Reading a Sequence ==
 
== Validation Criteria Used For Reading a Sequence ==
{| class="wikitable" style="width:100%"
+
[[FastQ Validation Criteria]]
|+ style="font-size:150%" |'''Sequence Identifier Line'''
 
!  width="50%"|Validation Criteria
 
!  width="50%"|Error Message
 
|-
 
|  Line is at least 2 characters long ('@' and at least 1 for the sequence identifier)
 
|  ERROR on Line <current line #>: The sequence identifier line was too short.
 
|-
 
|  Line starts with an '@'
 
|  ERROR on Line <current line #>: First line of a sequence does not begin with @
 
|-
 
|  Line does not contain a space between the '@' and the first sequence identifier (which must be at least 1 character).
 
|  ERROR on Line <current line #>: No Sequence Identifier specified before the comment.
 
|-
 
|  Every entry in the file should have a unique identifier.
 
|  ERROR on Line <current line #>: Repeated Sequence Identifier: <identifier> at Lines <previous line #> <current line #>
 
|}
 
 
 
 
 
{| class="wikitable" style="width:100%" border="1"
 
|+ style="font-size:150%"|'''Raw Sequence Line'''
 
!  width="50%"|Validation Criteria
 
!  width="50%"|Error Message
 
|-
 
|  A base sequence should have non-zero length.
 
|  ERROR on Line <current line #>: Raw Sequence is shorter than the min read length: 0 < <config min read length>
 
|-
 
|  All characters in the base sequence must be in the allowable set specified via configuration.
 
* Base Only: A C T G N a c t g n
 
* Color Space Only: 0 1 2 3 .(period)
 
|  ERROR on Line <current line #>: Invalid character ('<invalid char>') in base sequence.
 
|-
 
|  Reads should be of a configurable minimum length since many mappers will get into trouble with very short reads.
 
* If the raw sequence spans lines, the sum of the lengths of all lines are validated, not each individual line.
 
|  ERROR on Line <current line #>: Raw Sequence is shorter than the min read length: <read length> < <config min read length>
 
|-
 
|  Each Line of a Raw Sequence should have at least 1 character (not be blank).
 
|  ERROR on Line <current line #>: Looking for continuation of Raw Sequence or '+' instead found a blank line, assuming it was part of Raw Sequence.
 
|}
 
 
 
 
 
{| class="wikitable" style="width:100%" border="1"
 
|+ style="font-size:150%"|'''Plus Line'''
 
!  width="50%"|Validation Criteria
 
!  width="50%"|Error Message
 
|-
 
|  Must exist for every sequence.
 
|  ERROR on Line <current line #>: Reached the end of the file without a '+' line.
 
|-
 
|  If the optional sequence identifier is specified, it must equal the one on the Sequence Identifier Line.
 
|  ERROR on Line <current line #>: Sequence Identifier on '+' line does not equal the one on the '@' line.
 
|}
 
 
 
 
 
{| class="wikitable" style="width:100%" border="1"
 
|+ style="font-size:150%"|'''Quality String Line'''
 
!  width="50%"|Validation Criteria
 
!  width="50%"|Error Message
 
|-
 
|  A quality string should be present for every base sequence.
 
|  ERROR on Line <current line #>: Quality string length (<quality length>) does not equal raw sequence length (<raw sequence length>)
 
|-
 
|  Paired quality and base sequences should be of the same length.
 
|  ERROR on Line <current line #>: Quality string length (<quality length>) does not equal raw sequence length (<raw sequence length>)
 
|-
 
|  Valid quality values should all have ASCII codes &gt; 32.
 
|  ERROR on Line <current line #>: Invalid character ('<invalid char>') in quality string.
 
|}
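
The raw sequence and quality string criteria in the tables above can be sketched as a standalone check. This is an illustration under stated assumptions, not the libStatGen code: <code>checkSequence</code> is a hypothetical helper, it handles base space only (not color space), and it returns the message text without the "ERROR on Line" prefix.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical helper for this sketch; not part of libStatGen.
// Applies the base-space raw sequence and quality string criteria and
// returns one message per violation (empty vector means valid).
std::vector<std::string> checkSequence(const std::string& rawSequence,
                                       const std::string& quality,
                                       std::size_t minReadLength) {
    std::vector<std::string> errors;
    const std::string allowedBases = "ACTGNactgn";

    // Reads must reach the configured minimum length.
    if (rawSequence.size() < minReadLength) {
        errors.push_back("Raw Sequence is shorter than the min read length: " +
                         std::to_string(rawSequence.size()) + " < " +
                         std::to_string(minReadLength));
    }
    // Every base must be in the allowable set.
    for (char base : rawSequence) {
        if (allowedBases.find(base) == std::string::npos) {
            errors.push_back(std::string("Invalid character ('") + base +
                             "') in base sequence.");
        }
    }
    // Quality and base sequences must have the same length.
    if (quality.size() != rawSequence.size()) {
        errors.push_back("Quality string length (" +
                         std::to_string(quality.size()) +
                         ") does not equal raw sequence length (" +
                         std::to_string(rawSequence.size()) + ")");
    }
    // Valid quality characters have ASCII codes greater than 32.
    for (char q : quality) {
        if (static_cast<unsigned char>(q) <= 32) {
            errors.push_back(std::string("Invalid character ('") + q +
                             "') in quality string.");
        }
    }
    return errors;
}
```

For example, <code>checkSequence("ACXT", "II I", 10)</code> reports three problems: the read is too short, 'X' is not an allowed base, and the space is not a valid quality character.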
 
 
 
  
 
== Reading Sequence Assumptions ==
 
== Reading Sequence Assumptions ==
Line 224: Line 69:
 
*Consumes gzipped and uncompressed text files transparently.
 
*Consumes gzipped and uncompressed text files transparently.
 
*Prints error messages for errors up to the configurable maximum number of reportable errors.
 
*Prints error messages for errors up to the configurable maximum number of reportable errors.
*[[FastQ Validator|Standalone Program for Validating a FastQ File]]
+
*[[FastQValidator|Standalone Program for Validating a FastQ File]]
 
 
  
 
== Additional Wishlist - Not Implemented ==
 
== Additional Wishlist - Not Implemented ==
 
*Add an option that would reject raw sequence and quality strings that wrap over multiple lines.  It would only allow 1 line per raw sequence/quality string.
 
*Add an option that would reject raw sequence and quality strings that wrap over multiple lines.  It would only allow 1 line per raw sequence/quality string.
 
*Maybe report 2 types of information to the user: ERROR (critical failure) and WARNING (tolerable errors).
 
*Maybe report 2 types of information to the user: ERROR (critical failure) and WARNING (tolerable errors).

Latest revision as of 10:58, 2 February 2017


Reading/Validating FastQ Files In Your Program

The FastQFile class allows a user to easily read/validate a fastq file.

Documentation on this class is available at: http://csg.sph.umich.edu//mktrost/doxygen/current/classFastQFile.html

Usage Example

Generally this class is used to read a fastq file and perform operations on the sequences. Here is an example of how this would be done.

   FastQFile fastQFile;
   String filename = <your filename>;
   // Open the fastqfile with the default UNKNOWN space type which will determine the 
   // base type from the first character in the sequence.
   if(fastQFile.openFile(filename) != FastQFile::FASTQ_SUCCESS)
   {
      // Failed to open the specified file.
      // Report the error and exit (handled by error).
      error("Failed to open file: %s", filename.c_str());
      return (<your return info to indicate failure>);
   }
   // Keep reading the file until there are no more fastq sequences to process.
   while (fastQFile.keepReadingFile())
   {
      // Read one sequence. This call will read all the lines for 
      // one sequence.
      /////////////////////////////////////////////////////////////////
      // NOTE: It is up to you if you want to process only for success:
      //    if(readFastQSequence() == FASTQ_SUCCESS)
      // or for FASTQ_SUCCESS and FASTQ_INVALID: 
      //    if(readFastQSequence() != FASTQ_FAILURE)
      // Do NOT try to process on a FASTQ_FAILURE
      /////////////////////////////////////////////////////////////////
      if(fastQFile.readFastQSequence() == FastQFile::FASTQ_SUCCESS)
      {
         // The sequence is valid.
         <Your Processing Here>
         // For example if you want to print the lines of the sequence:
         printf("The Sequence ID Line is: %s\n", fastQFile.mySequenceIdLine.c_str());
         printf("The Sequence ID is: %s\n", fastQFile.mySequenceIdentifier.c_str());
         printf("The Sequence Line is: %s\n", fastQFile.myRawSequence.c_str());
         printf("The Plus Line is: %s\n", fastQFile.myPlusLine.c_str());
         printf("The Quality String Line is: %s\n", fastQFile.myQualityString.c_str());
      }
   }
   // Finished processing all of the sequences in the file.
   // Close the input file.
   fastQFile.closeFile();
   return(<your return info>); // It is up to you to determine your return.

Validation Criteria Used For Reading a Sequence

FastQ Validation Criteria

Reading Sequence Assumptions

  • The Sequence Identifier is separated from an optional comment by a space (" ").
  • No validation is required on the optional comment field of the Sequence Identifier Line.
  • The Sequence Identifier and the '+' Lines cannot wrap lines. They are each completely contained on one line.
  • Raw Sequences and Quality Strings may wrap lines.
  • All lines are part of the Raw Sequence Line until a line that starts with a '+' is discovered.
  • All lines are considered part of the quality string until at least the length of the associated raw sequence is hit (or the end of the file is reached). This is due to the fact that '@' is a valid quality character, so does not necessarily indicate the start of a Sequence Identifier Line.
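
The line-wrapping assumptions above can be sketched as a small standalone reader. This is only an illustration of the rules, not the FastQFile implementation: <code>FastqRecord</code> and <code>readRecord</code> are hypothetical names, and the sketch performs no validation beyond finding the record boundaries.

```cpp
#include <cassert>
#include <istream>
#include <sstream>
#include <string>

// Hypothetical record type for this sketch; not the libStatGen FastQFile class.
struct FastqRecord {
    std::string idLine;
    std::string rawSequence;
    std::string plusLine;
    std::string quality;
};

// Reads one record following the assumptions listed above.
// Returns false if the stream ends before a complete record is read.
bool readRecord(std::istream& in, FastqRecord& rec) {
    rec = FastqRecord();

    // The sequence identifier line never wraps.
    if (!std::getline(in, rec.idLine)) return false;

    // Every line belongs to the raw sequence until one starts with '+'.
    std::string line;
    bool foundPlus = false;
    while (std::getline(in, line)) {
        if (!line.empty() && line[0] == '+') {
            rec.plusLine = line;
            foundPlus = true;
            break;
        }
        rec.rawSequence += line;
    }
    if (!foundPlus) return false;

    // Quality lines are consumed until they cover the raw sequence length,
    // so a quality line starting with '@' is not mistaken for a new record.
    while (rec.quality.size() < rec.rawSequence.size() &&
           std::getline(in, line)) {
        rec.quality += line;
    }
    return rec.quality.size() >= rec.rawSequence.size();
}
```

Given a record whose quality string wraps onto two lines, "@III" and "IIII", after an eight-base raw sequence, the reader joins them into one eight-character quality string instead of treating "@III" as the next Sequence Identifier Line.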


Additional Features

  • Consumes gzipped and uncompressed text files transparently.
  • Prints error messages for errors up to the configurable maximum number of reportable errors.
  • Standalone Program for Validating a FastQ File

Additional Wishlist - Not Implemented

  • Add an option that would reject raw sequence and quality strings that wrap over multiple lines. It would only allow 1 line per raw sequence/quality string.
  • Maybe report 2 types of information to the user: ERROR (critical failure) and WARNING (tolerable errors).