Changes

From Genome Analysis Wiki
7,448 bytes removed, 10:58, 2 February 2017
Line 1: Line 1:  +
[[Category:C++]]
 +
[[Category:libStatGen]]
 +
[[Category:libStatGen FASTQ]]
 +
 
== Reading/Validating FastQ Files In Your Program ==
 
 
The '''FastQFile''' class allows a user to easily read/validate a fastq file.
 
   −
=== Public Class Methods ===
+
Documentation on this class is available at: http://csg.sph.umich.edu//mktrost/doxygen/current/classFastQFile.html
 
  −
{| style="margin: 1em 1em 1em 0; background-color: #f9f9f9; border: 1px #aaa solid; border-collapse: collapse;" border="1"
  −
|-style="background: #f2f2f2; text-align: center;"  '''FastQFile Class Methods'''
  −
! Method Name !!  Description
  −
|-
  −
| <code>FastQFile::FastQFile(int minReadLength = 10, int maxReportedErrors = 20)</code>
  −
| Initializes the FastQFile class setting the minimum length that a base sequence must be for it to be valid to minReadLength and setting the maximum number of errors that should be reported in detail before suppressing the errors to maxReportedErrors.
  −
|-
  −
| <code> void FastQFile::disableMessages()</code>
  −
| Disables cout messages.  This does not include the Base Composition output, which is turned on/off by a parameter to validateFastQFile.
  −
|-
  −
| <code>void FastQFile::enableMessages()</code>
  −
| Enables cout messages.  This does not include the Base Composition output, which is turned on/off by a parameter to validateFastQFile.
  −
|-
  −
| <code>void FastQFile::setQuitAfterErrorNum(int quitAfterErrorNum)</code>
  −
| Set the number of errors after which to quit reading/validating a file. (Defaults to -1)
  −
-1 indicates to not quit until the entire file has been read/validated.
  −
 
  −
0 indicates to quit without reading/validating anything.
  −
|-
  −
| <code>FastQStatus FastQFile::openFile(const char* fileName, BaseAsciiMap::SPACE_TYPE spaceType = BaseAsciiMap::UNKNOWN)</code>
  −
| Open the specified fastq file with the specified base space type, which defaults to UNKNOWN (base the decision on the first character of the sequence).
  −
 
  −
Returns FastQStatus to indicate if the open was successful or not.
  −
|-
  −
| <code>FastQStatus FastQFile::closeFile()</code>
  −
| Close the opened fastq file.
  −
Returns FastQStatus to indicate if the close was successful or not.
  −
|-
  −
| <code>bool FastQFile::isOpen()</code>
  −
| Returns true if a fastq file is open, false if one is not open.
  −
|-
  −
| <code>bool FastQFile::isEof()</code>
  −
| Returns true if it is the end of the file, false if not.
  −
|-
  −
| <code>FastQStatus FastQFile::validateFastQFile(const String &filename, bool printBaseComp, BaseAsciiMap::SPACE_TYPE spaceType)</code>
  −
| Validates the specified fastq file using the specified [[C++ Class: BaseAsciiMap#Public Class Enums|SpaceType]], printing the base composition if specified by printBaseComp.
  −
Returns the fastq validation status: FASTQ_SUCCESS on a successfully validated fastq file.
  −
 
  −
Error messages for invalid sequences are printed to cout until maxReportedErrors is reached.
  −
|-
  −
| <code>FastQStatus FastQFile::readFastQSequence()</code>
  −
| Read the next fastq sequence from the file.
  −
Returns FASTQ_SUCCESS if it was successfully read and was valid.  Otherwise the failure status is returned.
  −
 
  −
Error messages for invalid sequences are printed to cout until maxReportedErrors is reached.
  −
|-
  −
| <code>BaseAsciiMap::SPACE_TYPE FastQFile::getSpaceType()</code>
  −
| Return the [[C++ Class: BaseAsciiMap#Public Class Enums|SpaceType]] for the File
  −
|-
  −
|}
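The interaction between <code>maxReportedErrors</code> and <code>quitAfterErrorNum</code> described in the table can be sketched as below. This is a minimal standalone illustration, not the libStatGen implementation: the <code>ErrorPolicy</code> class and its method names are hypothetical, and it assumes that exactly <code>maxReportedErrors</code> messages are printed before further ones are suppressed.

```cpp
#include <iostream>
#include <string>

// Hypothetical helper (not part of libStatGen) that mirrors the documented
// meaning of maxReportedErrors and quitAfterErrorNum.
class ErrorPolicy
{
public:
    ErrorPolicy(int maxReportedErrors = 20, int quitAfterErrorNum = -1)
        : myMaxReportedErrors(maxReportedErrors),
          myQuitAfterErrorNum(quitAfterErrorNum),
          myNumErrors(0)
    {
    }

    // Count one error.  Returns true if the message was printed,
    // false if it was suppressed because the reporting limit was reached.
    bool reportError(const std::string& message)
    {
        ++myNumErrors;
        if (myNumErrors <= myMaxReportedErrors)
        {
            std::cout << message << std::endl;
            return true;
        }
        return false;
    }

    // -1 (any negative value here): never quit early.
    //  0: quit before reading anything.
    //  N: quit once N errors have been counted.
    bool keepReading() const
    {
        return (myQuitAfterErrorNum < 0) || (myNumErrors < myQuitAfterErrorNum);
    }

    int numErrors() const { return myNumErrors; }

private:
    int myMaxReportedErrors;
    int myQuitAfterErrorNum;
    int myNumErrors;
};
```

A caller would check <code>keepReading()</code> before each sequence and call <code>reportError()</code> for every validation failure; errors keep being counted even after their messages are suppressed.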
  −
 
  −
=== Public Class Attributes ===
  −
These attributes allow access to the contents of the last read fastq sequence.  They are public <code>String</code> variables in order to avoid copying the strings.
  −
 
  −
{| style="margin: 1em 1em 1em 0; background-color: #f9f9f9; border: 1px #aaa solid; border-collapse: collapse;" border="1"
  −
|-style="background: #f2f2f2; text-align: center;"
  −
! Attribute !!  Description
  −
|-
  −
| <code>myRawSequence</code>
  −
| <code>String</code> containing the raw sequence for the last fastq sequence that was read.
  −
|-
  −
| <code>mySequenceIdLine</code>
  −
| <code>String</code> containing the sequence identifier line for the last fastq sequence that was read.
  −
|-
  −
| <code>mySequenceIdentifier</code>
  −
| <code>String</code> containing the sequence identifier for the last fastq sequence that was read.
  −
|-
  −
| <code>myPlusLine</code>
  −
| <code>String</code> containing the plus line for the last fastq sequence that was read.
  −
|-
  −
| <code>myQualityString</code>
  −
| <code>String</code> containing the quality string for the last fastq sequence that was read.
  −
|}
  −
 
  −
=== Public Class Enums ===
  −
{| style="margin: 1em 1em 1em 0; background-color: #f9f9f9; border: 1px #aaa solid; border-collapse: collapse;" border="1"
  −
|-style="background: #f2f2f2; text-align: center;"
  −
! colspan="2"| enum FastQStatus
  −
|-
  −
! Enum Value !!  Description
  −
|-
  −
| FASTQ_SUCCESS
  −
| method finished successfully.
  −
|-
  −
| FASTQ_FAILURE
  −
| method failed to complete successfully.
  −
|-
  −
| FASTQ_INVALID
  −
| sequence was invalid.
  −
|}
      
=== Usage Example ===
 
Line 112: Line 25:  
   }
 
 
   // Keep reading the file until there are no more fastq sequences to process.
 
   while (!fastQFile.isEof())
+
   while (fastQFile.keepReadingFile())
 
   {
 
 
       // Read one sequence. This call will read all the lines for  
 
Line 142: Line 55:     
== Validation Criteria Used For Reading a Sequence ==
 
{| class="wikitable" style="width:100%"
+
[[FastQ Validation Criteria]]
|+ style="font-size:150%" |'''Sequence Identifier Line'''
  −
!  width="50%"|Validation Criteria
  −
!  width="50%"|Error Message
  −
|-
  −
|  Line is at least 2 characters long ('@' and at least 1 for the sequence identifier)
  −
|  ERROR on Line <current line #>: The sequence identifier line was too short.
  −
|-
  −
|  Line starts with an '@'
  −
|  ERROR on Line <current line #>: First line of a sequence does not begin with @
  −
|-
  −
|  Line does not contain a space between the '@' and the first sequence identifier (which must be at least 1 character).
  −
|  ERROR on Line <current line #>: No Sequence Identifier specified before the comment.
  −
|-
  −
|  Every entry in the file should have a unique identifier.
  −
|  ERROR on Line <current line #>: Repeated Sequence Identifier: <identifier> at Lines <previous line #> <current line #>
  −
|}
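The sequence identifier checks in the table above can be sketched as a standalone function. This is an illustration only, not the libStatGen code: the enum, the function name, and the order of the checks are assumptions, and the identifier is taken to be the text between the '@' and the first space.

```cpp
#include <map>
#include <string>

// Hypothetical result codes for the four criteria in the table.
enum IdLineStatus
{
    ID_OK,
    ID_TOO_SHORT,          // fewer than 2 characters
    ID_NO_AT,              // does not start with '@'
    ID_MISSING_IDENTIFIER, // space directly after '@'
    ID_REPEATED            // identifier already seen earlier in the file
};

// Checks one sequence identifier line.  'seen' maps each identifier to the
// line number where it first appeared, so repeats can be detected.
IdLineStatus checkSequenceIdLine(const std::string& line,
                                 int lineNum,
                                 std::map<std::string, int>& seen)
{
    if (line.size() < 2)
    {
        return ID_TOO_SHORT;
    }
    if (line[0] != '@')
    {
        return ID_NO_AT;
    }
    if (line[1] == ' ')
    {
        return ID_MISSING_IDENTIFIER;
    }

    // Identifier: everything after '@' up to the first space (or end of line).
    std::string::size_type end = line.find(' ');
    std::string identifier = (end == std::string::npos)
        ? line.substr(1)
        : line.substr(1, end - 1);

    if (seen.find(identifier) != seen.end())
    {
        return ID_REPEATED;
    }
    seen[identifier] = lineNum;
    return ID_OK;
}
```

Keeping the first line number in the map is what would let a caller print both the previous and the current line number in the "Repeated Sequence Identifier" message.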
  −
 
  −
 
  −
{| class="wikitable" style="width:100%" border="1"
  −
|+ style="font-size:150%"|'''Raw Sequence Line'''
  −
!  width="50%"|Validation Criteria
  −
!  width="50%"|Error Message
  −
|-
  −
|  A base sequence should have non-zero length.
  −
|  ERROR on Line <current line #>: Raw Sequence is shorter than the min read length: 0 < <config min read length>
  −
|-
  −
|  All characters in the base sequence must be in the allowable set specified via configuration.
  −
* Base Only: A C T G N a c t g n
  −
* Color Space Only: 0 1 2 3 .(period)
  −
|  ERROR on Line <current line #>: Invalid character ('<invalid char>') in base sequence.
  −
|-
  −
|  Reads should be of a configurable minimum length since many mappers will get into trouble with very short reads.
  −
* If the raw sequence spans lines, the sum of the lengths of all lines is validated, not each individual line.
  −
|  ERROR on Line <current line #>: Raw Sequence is shorter than the min read length: <read length> < <config min read length>
  −
|-
  −
|  Each Line of a Raw Sequence should have at least 1 character (not be blank).
  −
|  ERROR on Line <current line #>: Looking for continuation of Raw Sequence or '+' instead found a blank line, assuming it was part of Raw Sequence.
  −
|}
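A minimal sketch of the raw sequence checks, assuming the allowable character sets listed in the table. The enum and function names are hypothetical and are not taken from libStatGen.

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical stand-in for the base/color space setting.
enum SpaceType
{
    BASE_SPACE,
    COLOR_SPACE
};

// Returns the position of the first character that is not allowed for the
// given space type, or std::string::npos if every character is allowed.
std::size_t findInvalidRawChar(const std::string& sequence, SpaceType spaceType)
{
    const char* allowed = (spaceType == BASE_SPACE) ? "ACTGNactgn" : "0123.";
    return sequence.find_first_not_of(allowed);
}

// The minimum read length applies to the whole raw sequence, so the lengths
// of all lines that make up the sequence are summed before comparing.
bool meetsMinReadLength(const std::vector<std::string>& sequenceLines,
                        std::size_t minReadLength)
{
    std::size_t total = 0;
    for (std::size_t i = 0; i < sequenceLines.size(); ++i)
    {
        total += sequenceLines[i].size();
    }
    return total >= minReadLength;
}
```

With these helpers, a sequence split over two 5-character lines passes a minimum read length of 10 even though neither line would pass on its own.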
  −
 
  −
 
  −
{| class="wikitable" style="width:100%" border="1"
  −
|+ style="font-size:150%"|'''Plus Line'''
  −
!  width="50%"|Validation Criteria
  −
!  width="50%"|Error Message
  −
|-
  −
|  Must exist for every sequence.
  −
|  ERROR on Line <current line #>: Reached the end of the file without a '+' line.
  −
|-
  −
|  If the optional sequence identifier is specified, it must equal the one on the Sequence Identifier Line.
  −
|  ERROR on Line <current line #>: Sequence Identifier on '+' line does not equal the one on the '@' line.
  −
|}
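The plus line rule can be sketched as follows. This is a hedged illustration, not the library's code: the function name is hypothetical, and it assumes that the optional identifier is the text after the '+' up to the first space and that it is compared with the identifier parsed from the '@' line.

```cpp
#include <string>

// Returns true if plusLine is a valid plus line for a sequence whose
// identifier (from the '@' line) is sequenceIdentifier.
bool plusLineMatches(const std::string& plusLine,
                     const std::string& sequenceIdentifier)
{
    // The line must exist and must start with '+'.
    if (plusLine.empty() || plusLine[0] != '+')
    {
        return false;
    }

    // A bare "+" carries no identifier, so there is nothing to compare.
    if (plusLine.size() == 1)
    {
        return true;
    }

    // Optional identifier: text after '+' up to the first space.
    std::string::size_type end = plusLine.find(' ');
    std::string plusIdentifier = (end == std::string::npos)
        ? plusLine.substr(1)
        : plusLine.substr(1, end - 1);

    return plusIdentifier == sequenceIdentifier;
}
```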
  −
 
  −
 
  −
{| class="wikitable" style="width:100%" border="1"
  −
|+ style="font-size:150%"|'''Quality String Line'''
  −
!  width="50%"|Validation Criteria
  −
!  width="50%"|Error Message
  −
|-
  −
|  A quality string should be present for every base sequence.
  −
|  ERROR on Line <current line #>: Quality string length (<quality length>) does not equal raw sequence length (<raw sequence length>)
  −
|-
  −
|  Paired quality and base sequences should be of the same length.
  −
|  ERROR on Line <current line #>: Quality string length (<quality length>) does not equal raw sequence length (<raw sequence length>)
  −
|-
  −
|  Valid quality values should all have ASCII codes &gt; 32.
  −
|  ERROR on Line <current line #>: Invalid character ('<invalid char>') in quality string.
  −
|}
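The quality string criteria reduce to a character check and a length comparison. The sketch below is illustrative only; the enum, the function name, and the order in which the two checks run are assumptions rather than libStatGen behavior.

```cpp
#include <cstddef>
#include <string>

// Hypothetical result codes for the quality string checks.
enum QualityStatus
{
    QUAL_OK,
    QUAL_LENGTH_MISMATCH, // quality length differs from raw sequence length
    QUAL_INVALID_CHAR     // a character with ASCII code <= 32 was found
};

QualityStatus checkQualityString(const std::string& qualityString,
                                 const std::string& rawSequence)
{
    // Every quality character must have an ASCII code greater than 32,
    // which excludes the space character and control characters.
    for (std::size_t i = 0; i < qualityString.size(); ++i)
    {
        if (static_cast<unsigned char>(qualityString[i]) <= 32)
        {
            return QUAL_INVALID_CHAR;
        }
    }

    // A missing quality string has length 0, so it is caught by the same
    // comparison as any other length mismatch.
    if (qualityString.size() != rawSequence.size())
    {
        return QUAL_LENGTH_MISMATCH;
    }
    return QUAL_OK;
}
```

This matches the table in that a missing quality string and a quality string of the wrong length produce the same error.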
  −
 
      
== Reading Sequence Assumptions ==
 
Line 224: Line 69:  
*Consumes gzipped and uncompressed text files transparently.
 
 
*Prints error messages for errors up to the configurable maximum number of reportable errors.
 
*[[FastQ Validator|Standalone Program for Validating a FastQ File]]
+
*[[FastQValidator|Standalone Program for Validating a FastQ File]]
 
      
== Additional Wishlist - Not Implemented ==
 
 
*Add an option that would reject raw sequence and quality strings that wrap over multiple lines.  It would only allow 1 line per raw sequence/quality string.
 
 
*Maybe report 2 types of information to the user: ERROR (critical failure) and WARNING (tolerable errors).
 