Line 1: |
Line 1: |
| + | [[Category:C++]] |
| + | [[Category:libStatGen]] |
| + | [[Category:libStatGen FASTQ]] |
| + | |
| == Reading/Validating FastQ Files In Your Program == | | == Reading/Validating FastQ Files In Your Program == |
| The '''FastQFile''' class allows a user to easily read/validate a fastq file. | | The '''FastQFile''' class allows a user to easily read/validate a fastq file. |
| | | |
− | === Public Class Methods ===
| + | Documentation on this class is available at: http://csg.sph.umich.edu//mktrost/doxygen/current/classFastQFile.html |
− | | |
− | {| style="margin: 1em 1em 1em 0; background-color: #f9f9f9; border: 1px #aaa solid; border-collapse: collapse;" border="1"
| |
− | |-style="background: #f2f2f2; text-align: center;" '''FastQFile Class Methods'''
| |
− | ! Method Name !! Description
| |
− | |-
| |
− | | <code>FastQFile::FastQFile(int minReadLength = 10, int maxReportedErrors = 20)</code>
| |
− | | Initializes the FastQFile class. minReadLength sets the minimum length a base sequence must have to be valid. maxReportedErrors sets the maximum number of errors that are reported in detail before further errors are suppressed.
| |
− | |-
| |
− | | <code> void FastQFile::disableMessages()</code>
| |
− | | Disables cout messages. This does not include the Base Composition. Those are turned on/off based on a parameter to validateFastQFile.
| |
− | |-
| |
− | | <code>void FastQFile::enableMessages()</code>
| |
− | |Enables cout messages. This does not include the Base Composition. Those are turned on/off based on a parameter to validateFastQFile.
| |
− | |-
| |
− | | <code>void FastQFile::setQuitAfterErrorNum(int quitAfterErrorNum)</code>
| |
− | | Set the number of errors after which to quit reading/validating a file. (Defaults to -1)
| |
− | -1 indicates to not quit until the entire file has been read/validated.
| |
− | | |
− | 0 indicates to quit without reading/validating anything.
| |
− | |-
| |
− | | <code>FastQStatus FastQFile::openFile(const char* fileName, BaseAsciiMap::SPACE_TYPE spaceType = BaseAsciiMap::UNKNOWN)</code>
| |
− | | Open the specified fastq file with the specified base space type, which defaults to UNKNOWN (base the decision on the first character of the sequence).
| |
− | | |
− | Returns FastQStatus to indicate if the open was successful or not.
| |
− | |-
| |
− | | <code>FastQStatus FastQFile::closeFile()</code>
| |
− | | Close the opened fastq file.
| |
− | Returns FastQStatus to indicate if the close was successful or not.
| |
− | |-
| |
− | | <code>bool FastQFile::isOpen()</code>
| |
− | | Returns true if a fastq file is open, false if one is not open.
| |
− | |-
| |
− | | <code>bool FastQFile::isEof()</code>
| |
− | | Returns true if it is the end of the file, false if not.
| |
− | |-
| |
− | | <code>FastQStatus FastQFile::validateFastQFile(const String &filename, bool printBaseComp, BaseAsciiMap::SPACE_TYPE spaceType)</code>
| |
− | | Validates the specified fastq file using the specified [[C++ Class: BaseAsciiMap#Public Class Enums|SpaceType]], printing the base composition if specified by printBaseComp.
| |
− | Returns the fastq validation status: FASTQ_SUCCESS on a successfully validated fastq file.
| |
− | | |
− | Error messages for invalid sequences are printed to cout until maxReportedErrors is reached.
| |
− | |-
| |
− | | <code>FastQStatus FastQFile::readFastQSequence()</code>
| |
− | | Read the next fastq sequence from the file
| |
− | Returns FASTQ_SUCCESS if it was successfully read and was valid. Otherwise the failure status is returned.
| |
− | | |
− | Error messages for invalid sequences are printed to cout until maxReportedErrors is reached.
| |
− | |-
| |
− | | <code>BaseAsciiMap::SPACE_TYPE FastQFile::getSpaceType()</code>
| |
− | | Return the [[C++ Class: BaseAsciiMap#Public Class Enums|SpaceType]] for the File
| |
− | |-
| |
− | |}
| |
− | | |
− | === Public Class Attributes ===
| |
− | These attributes allow access to the contents of the last read fastq sequence. They are public <code>String</code> variables in order to avoid copying the strings.
| |
− | | |
− | {| style="margin: 1em 1em 1em 0; background-color: #f9f9f9; border: 1px #aaa solid; border-collapse: collapse;" border="1"
| |
− | |-style="background: #f2f2f2; text-align: center;"
| |
− | ! Attribute !! Description
| |
− | |-
| |
− | | <code>myRawSequence</code>
| |
− | | <code>String</code> containing the raw sequence for the last fastq sequence that was read.
| |
− | |-
| |
− | | <code>mySequenceIdLine</code>
| |
− | | <code>String</code> containing the sequence identifier line for the last fastq sequence that was read.
| |
− | |-
| |
− | | <code>mySequenceIdentifier</code>
| |
− | | <code>String</code> containing the sequence identifier for the last fastq sequence that was read.
| |
− | |-
| |
− | | <code>myPlusLine</code>
| |
− | | <code>String</code> containing the plus line for the last fastq sequence that was read.
| |
− | |-
| |
− | | <code>myQualityString</code>
| |
− | | <code>String</code> containing the quality string for the last fastq sequence that was read.
| |
− | |}
| |
− | | |
− | === Public Class Enums ===
| |
− | {| style="margin: 1em 1em 1em 0; background-color: #f9f9f9; border: 1px #aaa solid; border-collapse: collapse;" border="1"
| |
− | |-style="background: #f2f2f2; text-align: center;"
| |
− | ! colspan="2"| enum FastQStatus
| |
− | |-
| |
− | ! Enum Value !! Description
| |
− | |-
| |
− | | FASTQ_SUCCESS
| |
− | | method finished successfully.
| |
− | |-
| |
− | | FASTQ_FAILURE
| |
− | | method failed to complete successfully.
| |
− | |-
| |
− | | FASTQ_INVALID
| |
− | | sequence was invalid.
| |
− | |}
| |
| | | |
| === Usage Example === | | === Usage Example === |
Line 112: |
Line 25: |
| } | | } |
| // Keep reading the file until there are no more fastq sequences to process. | | // Keep reading the file until there are no more fastq sequences to process. |
− | while (!fastQFile.isEof()) | + | while (fastQFile.keepReadingFile()) |
| { | | { |
| // Read one sequence. This call will read all the lines for | | // Read one sequence. This call will read all the lines for |
Line 142: |
Line 55: |
| | | |
| == Validation Criteria Used For Reading a Sequence == | | == Validation Criteria Used For Reading a Sequence == |
− | {| class="wikitable" style="width:100%"
| + | [[FastQ Validation Criteria]] |
− | |+ style="font-size:150%" |'''Sequence Identifier Line'''
| |
− | ! width="50%"|Validation Criteria
| |
− | ! width="50%"|Error Message
| |
− | |-
| |
− | | Line is at least 2 characters long ('@' and at least 1 for the sequence identifier)
| |
− | | ERROR on Line <current line #>: The sequence identifier line was too short.
| |
− | |-
| |
− | | Line starts with an '@'
| |
− | | ERROR on Line <current line #>: First line of a sequence does not begin with @
| |
− | |-
| |
− | | Line does not contain a space between the '@' and the first sequence identifier (which must be at least 1 character).
| |
− | | ERROR on Line <current line #>: No Sequence Identifier specified before the comment.
| |
− | |-
| |
− | | Every entry in the file should have a unique identifier.
| |
− | | ERROR on Line <current line #>: Repeated Sequence Identifier: <identifier> at Lines <previous line #> <current line #>
| |
− | |}
| |
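The sequence identifier line checks listed above can be sketched in plain C++ as follows. This is a minimal illustration and not the libStatGen implementation: <code>checkSequenceIdLine</code> and <code>extractIdentifier</code> are hypothetical helper names, the returned messages are shortened paraphrases of the table entries, and the uniqueness check across the whole file is left out.

```cpp
#include <cstddef>
#include <string>

// Hypothetical helper (not part of libStatGen).
// Returns an empty string if the line passes the three per-line checks,
// otherwise a short description of the first check that failed.
std::string checkSequenceIdLine(const std::string& line)
{
    // Needs '@' plus at least one identifier character.
    if (line.size() < 2)
        return "The sequence identifier line was too short.";
    // Must start with '@'.
    if (line[0] != '@')
        return "First line of a sequence does not begin with @";
    // A space right after '@' means the identifier itself is missing.
    if (line[1] == ' ')
        return "No Sequence Identifier specified before the comment.";
    return "";
}

// Hypothetical helper: returns the identifier, i.e. the text between '@'
// and the first space (or the end of the line if there is no comment).
// Assumes checkSequenceIdLine(line) returned an empty string.
std::string extractIdentifier(const std::string& line)
{
    const std::size_t end = line.find(' ');
    if (end == std::string::npos)
        return line.substr(1);
    return line.substr(1, end - 1);
}
```

To cover the uniqueness criterion, a caller could store each value returned by <code>extractIdentifier</code> in a set and report a repeat when insertion fails.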
− | | |
− | | |
− | {| class="wikitable" style="width:100%" border="1"
| |
− | |+ style="font-size:150%"|'''Raw Sequence Line'''
| |
− | ! width="50%"|Validation Criteria
| |
− | ! width="50%"|Error Message
| |
− | |-
| |
− | | A base sequence should have non-zero length.
| |
− | | ERROR on Line <current line #>: Raw Sequence is shorter than the min read length: 0 < <config min read length>
| |
− | |-
| |
− | | All characters in the base sequence must be in the allowable set specified via configuration.
| |
− | * Base Only: A C T G N a c t g n
| |
− | * Color Space Only: 0 1 2 3 .(period)
| |
− | | ERROR on Line <current line #>: Invalid character ('<invalid char>') in base sequence.
| |
− | |-
| |
− | | Reads should be of a configurable minimum length since many mappers will get into trouble with very short reads.
| |
− | * If the raw sequence spans lines, the sum of the lengths of all lines is validated, not each individual line.
| |
− | | ERROR on Line <current line #>: Raw Sequence is shorter than the min read length: <read length> < <config min read length>
| |
− | |-
| |
− | | Each Line of a Raw Sequence should have at least 1 character (not be blank).
| |
− | | ERROR on Line <current line #>: Looking for continuation of Raw Sequence or '+' instead found a blank line, assuming it was part of Raw Sequence.
| |
− | |}
| |
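The raw sequence criteria above (allowed character set and minimum read length) might be checked along these lines. This is a sketch under stated assumptions: <code>SpaceType</code>, <code>firstInvalidBase</code>, and <code>meetsMinReadLength</code> are hypothetical names used only for illustration, with <code>SpaceType</code> standing in for the library's own space type enum, and the sequence is assumed to have been joined into one string already if it spanned several lines.

```cpp
#include <cstddef>
#include <string>

// Hypothetical stand-in for the library's space type setting.
enum class SpaceType { Base, Color };

// Hypothetical helper: returns the index of the first character that is not
// in the allowed set for the given space type, or std::string::npos if every
// character is allowed.
std::size_t firstInvalidBase(const std::string& sequence, SpaceType type)
{
    // Allowed sets taken from the criteria table above.
    const std::string allowed =
        (type == SpaceType::Base) ? "ACTGNactgn" : "0123.";
    return sequence.find_first_not_of(allowed);
}

// Hypothetical helper: true if the (already joined) raw sequence is at least
// minReadLength characters long.
bool meetsMinReadLength(const std::string& sequence, std::size_t minReadLength)
{
    return sequence.size() >= minReadLength;
}
```

Returning the index of the offending character, rather than a plain true/false, makes it straightforward to include the invalid character in an error message.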
− | | |
− | | |
− | {| class="wikitable" style="width:100%" border="1"
| |
− | |+ style="font-size:150%"|'''Plus Line'''
| |
− | ! width="50%"|Validation Criteria
| |
− | ! width="50%"|Error Message
| |
− | |-
| |
− | | Must exist for every sequence.
| |
− | | ERROR on Line <current line #>: Reached the end of the file without a '+' line.
| |
− | |-
| |
− | | If the optional sequence identifier is specified, it must equal the one on the Sequence Identifier Line.
| |
− | | ERROR on Line <current line #>: Sequence Identifier on '+' line does not equal the one on the '@' line.
| |
− | |}
| |
− | | |
− | | |
− | {| class="wikitable" style="width:100%" border="1"
| |
− | |+ style="font-size:150%"|'''Quality String Line'''
| |
− | ! width="50%"|Validation Criteria
| |
− | ! width="50%"|Error Message
| |
− | |-
| |
− | | A quality string should be present for every base sequence.
| |
− | | ERROR on Line <current line #>: Quality string length (<quality length>) does not equal raw sequence length (<raw sequence length>)
| |
− | |-
| |
− | | Paired quality and base sequences should be of the same length.
| |
− | | ERROR on Line <current line #>: Quality string length (<quality length>) does not equal raw sequence length (<raw sequence length>)
| |
− | |-
| |
− | | Valid quality values should all have ASCII codes > 32.
| |
− | | ERROR on Line <current line #>: Invalid character ('<invalid char>') in quality string.
| |
− | |}
| |
− | | |
| | | |
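The quality string criteria above reduce to two simple checks, sketched here in plain C++. The function names <code>qualityMatchesSequence</code> and <code>qualityCharsValid</code> are hypothetical and are not part of libStatGen; both strings are assumed to have been joined already if they spanned several lines.

```cpp
#include <algorithm>
#include <string>

// Hypothetical helper: a quality string must be present and have the same
// length as its raw sequence. An absent quality string has length 0 and
// therefore fails this check for any non-empty sequence.
bool qualityMatchesSequence(const std::string& quality,
                            const std::string& rawSequence)
{
    return quality.size() == rawSequence.size();
}

// Hypothetical helper: every quality character must have an ASCII code
// greater than 32, so spaces and control characters are rejected.
bool qualityCharsValid(const std::string& quality)
{
    return std::all_of(quality.begin(), quality.end(),
                       [](unsigned char c) { return c > 32; });
}
```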
| == Reading Sequence Assumptions == | | == Reading Sequence Assumptions == |
Line 224: |
Line 69: |
| *Consumes gzipped and uncompressed text files transparently. | | *Consumes gzipped and uncompressed text files transparently. |
| *Prints error messages for errors up to the configurable maximum number of reportable errors. | | *Prints error messages for errors up to the configurable maximum number of reportable errors. |
− | *[[FastQ Validator|Standalone Program for Validating a FastQ File]] | + | *[[FastQValidator|Standalone Program for Validating a FastQ File]] |
− | | |
| | | |
| == Additional Wishlist - Not Implemented == | | == Additional Wishlist - Not Implemented == |
| *Add an option that would reject raw sequence and quality strings that wrap over multiple lines. It would only allow 1 line per raw sequence/quality string. | | *Add an option that would reject raw sequence and quality strings that wrap over multiple lines. It would only allow 1 line per raw sequence/quality string. |
| *Maybe report 2 types of information to the user: ERROR (critical failure) and WARNING (tolerable errors). | | *Maybe report 2 types of information to the user: ERROR (critical failure) and WARNING (tolerable errors). |