C++ Class: FastQFile

From Genome Analysis Wiki
Latest revision as of 10:58, 2 February 2017


Reading/Validating FastQ Files In Your Program

The FastQFile class reads and validates FASTQ files from within your own program.

Documentation on this class is available at: http://csg.sph.umich.edu//mktrost/doxygen/current/classFastQFile.html

Usage Example

This class is typically used to read a FASTQ file one sequence at a time and perform operations on each sequence. The following example shows the usual pattern.

   // Requires the libStatGen header: #include "FastQFile.h"
   FastQFile fastQFile;
   String filename = <your filename>;
   // Open the FASTQ file with the default UNKNOWN space type, so the base
   // type is determined from the first character in the sequence.
   if(fastQFile.openFile(filename) != FastQFile::FASTQ_SUCCESS)
   {
      // Failed to open the specified file.
      // Report the error and exit (handled by error).
      error("Failed to open file: %s", filename.c_str());
      return (<your return info to indicate failure>);
   }
   // Keep reading the file until there are no more FASTQ sequences to process.
   while (fastQFile.keepReadingFile())
   {
      // Read one sequence. This call reads all the lines for one sequence.
      /////////////////////////////////////////////////////////////////
      // NOTE: It is up to you whether to process only on success:
      //    if(readFastQSequence() == FASTQ_SUCCESS)
      // or on both FASTQ_SUCCESS and FASTQ_INVALID:
      //    if(readFastQSequence() != FASTQ_FAILURE)
      // Do NOT try to process on a FASTQ_FAILURE.
      /////////////////////////////////////////////////////////////////
      if(fastQFile.readFastQSequence() == FastQFile::FASTQ_SUCCESS)
      {
         // The sequence is valid.
         <Your Processing Here>
         // For example, to print the lines of the sequence:
         printf("The Sequence ID Line is: %s\n", fastQFile.mySequenceIdLine.c_str());
         printf("The Sequence ID is: %s\n", fastQFile.mySequenceIdentifier.c_str());
         printf("The Sequence Line is: %s\n", fastQFile.myRawSequence.c_str());
         printf("The Plus Line is: %s\n", fastQFile.myPlusLine.c_str());
         printf("The Quality String Line is: %s\n", fastQFile.myQualityString.c_str());
      }
   }
   // Finished processing all of the sequences in the file.
   // Close the input file.
   fastQFile.closeFile();
   return(<your return info>); // It is up to you to determine your return.

Validation Criteria Used For Reading a Sequence

FastQ Validation Criteria

Reading Sequence Assumptions

  • The Sequence Identifier is separated from an optional comment by a space (" ").
  • No validation is required on the optional comment field of the Sequence Identifier Line.
  • The Sequence Identifier and the '+' Lines cannot wrap lines. They are each completely contained on one line.
  • Raw Sequences and Quality Strings may wrap lines.
  • All lines are part of the Raw Sequence Line until a line that starts with a '+' is discovered.
  • All lines are considered part of the quality string until at least the length of the associated raw sequence is reached (or the end of the file is reached). This is because '@' is a valid quality character, so it does not necessarily indicate the start of a Sequence Identifier Line.


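Additional Features

  • Consumes gzipped and uncompressed text files transparently.
  • Prints error messages for errors up to the configurable maximum number of reportable errors.
  • Standalone Program for Validating a FastQ File (see FastQValidator)
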
Additional Wishlist - Not Implemented

  • Add an option that would reject raw sequence and quality strings that wrap over multiple lines. It would only allow 1 line per raw sequence/quality string.
  • Maybe report 2 types of information to the user: ERROR (critical failure) and WARNING (tolerable errors).
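
Neither wishlist item is implemented in the library. As a purely hypothetical sketch of how the two ideas could fit together, the standalone code below flags a record whose raw sequence or quality string wraps, and reports it as either an error or a warning depending on an option. The names Severity, Issue, checkLineWrapping, and rejectWrapped are assumptions made for this illustration and are not part of FastQFile.

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical severity levels mirroring the ERROR/WARNING split above.
enum class Severity { Warning, Error };

struct Issue {
    Severity severity;
    std::string message;
};

// recordLines: the physical lines of one record, in file order.
// rejectWrapped: hypothetical option; when true, wrapping is reported as an
// Error, otherwise only as a Warning.
std::vector<Issue> checkLineWrapping(const std::vector<std::string>& recordLines,
                                     bool rejectWrapped) {
    std::vector<Issue> issues;
    // An unwrapped record has exactly four lines:
    // identifier, raw sequence, '+', quality string.
    const std::size_t unwrappedLineCount = 4;
    if (recordLines.size() < unwrappedLineCount) {
        issues.push_back({Severity::Error, "record is incomplete"});
    } else if (recordLines.size() > unwrappedLineCount) {
        issues.push_back({rejectWrapped ? Severity::Error : Severity::Warning,
                          "raw sequence or quality string wraps over multiple lines"});
    }
    return issues;
}
```

A caller could then treat any Error as a critical failure and merely log Warnings, which matches the distinction the second wishlist item describes.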