Genome Analysis Wiki β

Difference between revisions of "LibStatGen: FASTQ"

(Classes in the FASTQ Portion of Library)
 
(36 intermediate revisions by 2 users not shown)
== How to Use the fastQFile Library ==
+
[[Category:C++]]
*'''Library Name:''' libfqf.a
+
[[Category:libStatGen]]
*'''Include Files:''' FastQFile.h, StringBasics.h (for String parameter)
+
[[Category:libStatGen FASTQ]]
*'''Class Name:''' FastQFile
 
** Constructor Parameters:
 
*** int minReadLength - The minimum length that a base sequence must be for it to be valid.
 
*** int maxReportedErrors - The maximum number of errors that should be reported in detail before suppressing the errors.
 
*'''Open a FastQ File:''' openFile
 
** Parameters:
 
*** String filename - fastq file to be opened.
 
*** String baseType - Raw sequence type. Enter:
 
****  "A"/"C"/"G"/"T"/"N"  - Bases only;
 
****  "0"/"1"/"2"/"3"/"."  - Color space only;
 
****  ""                  - Base Decision on the first Raw Sequence Character (Default).
 
****  All other characters - Bases & Color space
 
** Return Value
 
*** FastQStatus: FASTQ_SUCCESS if successfully opened, FASTQ_FAILURE if not.
 
*'''Close a FastQ File:''' closeFile
 
** Parameters: NONE
 
** Return Value
 
*** FastQStatus: FASTQ_SUCCESS if successfully closed, FASTQ_FAILURE if not.
 
*'''Determine if a FastQ File is open Method:''' isOpen
 
** Parameters: NONE
 
** Return Value
 
*** bool: true if a file is open, false if not.
 
*'''Validate a FastQ File:''' validateFastQFile
 
** Parameters:
 
*** String filename - fastq file to be validated.
 
*** String baseType - Raw sequence type. Enter:
 
****  "A"/"C"/"G"/"T"/"N"  - Bases only;
 
****  "0"/"1"/"2"/"3"/"."  - Color space only;
 
****  ""                  - Base Decision on the first Raw Sequence Character (Default).
 
****  All other characters - Bases & Color space
 
** Return Value
 
*** bool: true if there were no errors in the file, false otherwise.
 
*'''Read a FastQ Sequence From the File:''' readFastQSequence
 
** Parameters: NONE
 
** Return Value
 
*** int: FASTQ_SUCCESS if successfully read and valid, FASTQ_FAILURE if not successfully read, FASTQ_INVALID if the sequence was invalid.
 
*'''Get the Space Type for the File:''' getSpaceType
 
** Parameters: NONE
 
** Return Value
 
*** BaseAsciiMap::SPACETYPE: COLOR_SPACE if the file is color space (0,1,2,3,.), BASE_SPACE if the file is base space (A,C,G,T,N), BOTH_SPACE if the file is both (0,1,2,3,.,A,C,G,T,N), or UNKNOWN if it has yet to be determined.
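
The baseType rule described for openFile and validateFastQFile can be illustrated with a small standalone sketch. It does not use the library: the enum and the function name are hypothetical stand-ins for BaseAsciiMap::SPACETYPE, and it assumes only the first character of baseType matters and that only the uppercase letters listed above select base space.

```cpp
#include <string>

// Hypothetical stand-in for BaseAsciiMap::SPACETYPE (not the library's type).
enum class SpaceType { Unknown, BaseSpace, ColorSpace, BothSpace };

// Maps a baseType argument to a space type using its first character.
// Assumption: only the first character is inspected.
SpaceType spaceTypeFromBaseType(const std::string& baseType) {
    if (baseType.empty()) {
        // "" means: decide later from the first raw sequence character.
        return SpaceType::Unknown;
    }
    switch (baseType[0]) {
        case 'A': case 'C': case 'G': case 'T': case 'N':
            return SpaceType::BaseSpace;
        case '0': case '1': case '2': case '3': case '.':
            return SpaceType::ColorSpace;
        default:
            // All other characters: bases and color space are both accepted.
            return SpaceType::BothSpace;
    }
}
```

For example, under these assumptions <code>spaceTypeFromBaseType("Z")</code> selects both spaces, which matches the <code>-b Z</code> usage of the validator.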
 
  
 +
== Where to find the fastqFile Library and the FastQValidator ==
  
== Validation Criteria ==
+
The fastQ Library is now a part of [[C++ Library: libStatGen]].
{| class="wikitable" style="width:100%" border="1"
 
|+ style="font-size:150%"|'''Sequence Identifier Line'''
 
!  width="50%"|Validation Criteria
 
!  width="50%"|Error Message
 
|-
 
|  Line is at least 2 characters long ('@' and at least 1 for the sequence identifier)
 
|  ERROR on Line <current line #>: The sequence identifier line was too short.
 
|-
 
|  Line starts with an '@'
 
|  ERROR on Line <current line #>: First line of a sequence does not begin wtih @
 
|-
 
|  Line does not contain a space between the '@' and the first sequence identifier (which must be at least 1 character).
 
|  ERROR on Line <current line #>: No Sequence Identifier specified before the comment.
 
|-
 
|  Every entry in the file should have a unique identifier.
 
|  ERROR on Line <current line #>: Repeated Sequence Identifier: <identifier> at Lines <previous line #> <current line #>
 
|}
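
The four Sequence Identifier Line criteria above can be sketched as a single check. This is an illustration only, not the FastQFile implementation: the enum, the function name, and the <code>seen</code> set used to track earlier identifiers are all hypothetical, and the order of the checks is an assumption.

```cpp
#include <set>
#include <string>

// Hypothetical result codes, one per criterion in the table above.
enum class IdError { None, TooShort, NoAtSign, NoIdentifier, Repeated };

// Checks one sequence identifier line.
// `seen` holds the identifiers of earlier records (illustrative bookkeeping).
IdError checkIdentifierLine(const std::string& line, std::set<std::string>& seen) {
    if (line.size() < 2) return IdError::TooShort;      // '@' plus at least 1 character
    if (line[0] != '@') return IdError::NoAtSign;       // must start with '@'
    if (line[1] == ' ') return IdError::NoIdentifier;   // comment with no identifier
    // The identifier runs from just after '@' to the first space (or end of line).
    const std::string id = line.substr(1, line.find(' ') - 1);
    // insert() reports whether the identifier was new.
    if (!seen.insert(id).second) return IdError::Repeated;
    return IdError::None;
}
```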
 
  
 +
The FastQValidator is documented at [[FastQValidator]].
  
{| class="wikitable" style="width:100%" border="1"
+
== FASTQ Library Component for Reading and Validating FastQFiles ==
|+ style="font-size:150%"|'''Raw Sequence Line'''
+
The software reads and validates fastq files in both compressed and uncompressed formats.
!  width="50%"|Validation Criteria
 
!  width="50%"|Error Message
 
|-
 
|  A base sequence should have non-zero length.
 
|  ERROR on Line <current line #>: Raw Sequence is shorter than the min read length: 0 < <config min read length>
 
|-
 
|  All characters in the base sequence must be in the allowable set specified via configuration.
 
* Base Only: A C T G N a c t g n
 
* Color Space Only: 0 1 2 3 .(period)
 
* Base or Color Space: A C T G N a c t g n 0 1 2 3 .(period)
 
|  ERROR on Line <current line #>: Invalid character ('<invalid char>') in base sequence.
 
|-
 
|  Reads should be of a configurable minimum length since many mappers will get into trouble with very short reads.
 
* If the raw sequence spans lines, the sum of the lengths of all lines is validated, not each individual line.
 
|  ERROR on Line <current line #>: Raw Sequence is shorter than the min read length: <read length> < <config min read length>
 
|-
 
|  Each Line of a Raw Sequence should have at least 1 character (not be blank).
 
|  ERROR on Line <current line #>: Looking for continuation of Raw Sequence or '+' instead found a blank line, assuming it was part of Raw Sequence.
 
|}
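
As a rough illustration of the Raw Sequence Line criteria above, the sketch below checks the allowed character sets, rejects blank lines, and compares the summed length of all lines against the minimum read length. The names are hypothetical and the code is independent of the library.

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Which characters are allowed, mirroring the three sets in the table above.
enum class SpaceType { BaseOnly, ColorOnly, Both };

bool isAllowed(char c, SpaceType type) {
    const std::string bases = "ACTGNactgn";
    const std::string colors = "0123.";
    const bool isBase = bases.find(c) != std::string::npos;
    const bool isColor = colors.find(c) != std::string::npos;
    switch (type) {
        case SpaceType::BaseOnly:  return isBase;
        case SpaceType::ColorOnly: return isColor;
        case SpaceType::Both:      return isBase || isColor;
    }
    return false;
}

// Validates a raw sequence that may span several lines.
// The minimum length applies to the sum of all lines, not to each line.
bool rawSequenceIsValid(const std::vector<std::string>& lines,
                        SpaceType type, std::size_t minReadLength) {
    std::size_t total = 0;
    for (const std::string& line : lines) {
        if (line.empty()) return false;               // blank lines are errors
        for (char c : line) {
            if (!isAllowed(c, type)) return false;    // invalid character
        }
        total += line.size();
    }
    return total >= minReadLength;
}
```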
 
  
 +
The FASTQ component of the library is found in libStatGen/fastq/.
  
{| class="wikitable" style="width:100%" border="1"
+
See https://github.com/statgen/libStatGen/commits/master/fastq for a list of the most recent updates to the development version of the FASTQ portion of the library.
|+ style="font-size:150%"|'''Plus Line'''
 
!  width="50%"|Validation Criteria
 
!  width="50%"|Error Message
 
|-
 
|  Must exist for every sequence.
 
|  ERROR on Line <current line #>: Reached the end of the file without a '+' line.
 
|-
 
|  If the optional sequence identifier is specified, it must equal the one on the Sequence Identifier Line.
 
|  ERROR on Line <current line #>: Sequence Identifier on '+' line does not equal the one on the '@' line.
 
|}
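
The Plus Line identifier rule can be sketched as follows. The helper names are hypothetical, and the sketch assumes that only the identifier (the text up to the first space) is compared, not the optional comment.

```cpp
#include <string>

// Returns the text after the first character, up to the first space.
// Assumes the line starts with its marker character ('@' or '+').
std::string identifierOf(const std::string& line) {
    if (line.empty()) return "";
    return line.substr(1, line.find(' ') - 1);
}

// True if plusLine is a '+' line whose optional identifier, when present,
// equals the identifier on the '@' line.
bool plusLineMatches(const std::string& identifierLine, const std::string& plusLine) {
    if (plusLine.empty() || plusLine[0] != '+') return false;
    const std::string id = identifierOf(plusLine);
    return id.empty() || id == identifierOf(identifierLine);
}
```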
 
  
 +
For the old change log, see: [[C++ Library: FASTQ Change Log]]
  
{| class="wikitable" style="width:100%" border="1"
+
=== Classes in the FASTQ Portion of Library ===
|+ style="font-size:150%"|'''Quality String Line'''
+
{| style="margin: 1em 1em 1em 0; background-color: #f9f9f9; border: 1px #aaa solid; border-collapse: collapse;" border="1"
! width="50%"|Validation Criteria
+
|-style="background: #f2f2f2; text-align: center;"
! width="50%"|Error Message
+
! Class Name !Description
 
|-
 
|-
| A quality string should be present for every base sequence.
+
| <code>[[C++ Class: FastQFile|FastQFile]]</code>
|  ERROR on Line <current line #>: Quality string length (<quality length>) does not equal raw sequence length (<raw sequence length>)
+
| Class used for reading/validating a fastq file.
 
|-
 
|-
| Paired quality and base sequences should be of the same length.
+
| <code>[http://csg.sph.umich.edu//mktrost/doxygen/current/classBaseCount.html BaseCount]</code>
| ERROR on Line <current line #>: Quality string length (<quality length>) does not equal raw sequence length (<raw sequence length>)
+
| Wrapper around an array that has one index per base and an extra index for a total count of all bases. This class is used to keep a count of the number of times each index has occurred. It can print a percentage of the occurrence of each base against the total number of bases.
 
|-
 
|-
| Valid quality values should all have ASCII codes &gt; 32.
+
| <code>[http://csg.sph.umich.edu//mktrost/doxygen/current/classBaseComposition.html BaseComposition]</code>
| ERROR on Line <current line #>: Invalid character ('<invalid char>') in quality string.
+
| Class that tracks the composition of base by read location.
 +
|-
 +
| <code>[http://csg.sph.umich.edu//mktrost/doxygen/current/classFastQStatus.html FastQStatus]</code>
 +
| Status for FastQ operations.
 
|}
 
|}
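
The Quality String Line criteria (equal length to the raw sequence, and every character with an ASCII code greater than 32) can be sketched as below. The names are hypothetical, and checking the length before the characters is an assumption made for the illustration.

```cpp
#include <string>

// Hypothetical result codes for the quality string checks.
enum class QualityError { None, LengthMismatch, InvalidCharacter };

// Checks a complete (already joined) quality string against its raw sequence.
QualityError checkQuality(const std::string& rawSequence, const std::string& quality) {
    if (quality.size() != rawSequence.size()) {
        return QualityError::LengthMismatch;
    }
    for (unsigned char c : quality) {
        // Valid quality characters have ASCII codes greater than 32 (space).
        if (c <= 32) return QualityError::InvalidCharacter;
    }
    return QualityError::None;
}
```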
  
 
+
== FASTQ Output ==  
== Additional Features ==
+
When a sequence is read, error messages for the first maxReportedErrors are output for failed [[C++ Class: FastQFile#Validation Criteria Used For Reading a Sequence|Validation Criteria]].
*Base composition is tracked and reported by position.
+
For example:
*Consumes gzipped and uncompressed text files transparently (see libcsg/InputFile.h).
 
*Prints error messages for errors up to the configurable maximum number of reportable errors.  A summary of the total number of errors is also printed.
 
*Prints the total number of lines processed as well as the total number of sequences processed.
 
 
 
 
 
== Assumptions ==
 
*The Sequence Identifier is separated from an optional comment by a " ".
 
*No validation is required on the optional comment field of the Sequence Identifier Line.
 
*The Sequence Identifier and the '+' Lines cannot wrap lines.  They are each completely contained on one line.
 
*Raw Sequences and Quality Strings may wrap lines.
 
*All lines are part of the Raw Sequence Line until a line that starts with a '+' is discovered.
 
*All lines are considered part of the quality string until at least the length of the associated raw sequence is hit (or the end of the file is reached).  This is due to the fact that '@' is a valid quality character, so does not necessarily indicate the start of a Sequence Identifier Line.
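
The line-wrapping assumptions above imply a particular way of splitting a file into records. The following is a minimal sketch of that splitting logic over a standard input stream. The struct and function names are hypothetical, no validation is performed, and it is not the library's reader.

```cpp
#include <istream>
#include <sstream>
#include <string>

// One FASTQ record with wrapped lines already joined (illustrative type).
struct FastqRecord {
    std::string identifierLine;
    std::string rawSequence;
    std::string plusLine;
    std::string quality;
};

// Reads one record. Returns false at end of input or if no '+' line is found.
bool readRecord(std::istream& in, FastqRecord& rec) {
    rec = FastqRecord{};
    if (!std::getline(in, rec.identifierLine)) return false;
    std::string line;
    bool sawPlus = false;
    // Every line belongs to the raw sequence until a line starts with '+'.
    while (std::getline(in, line)) {
        if (!line.empty() && line[0] == '+') {
            rec.plusLine = line;
            sawPlus = true;
            break;
        }
        rec.rawSequence += line;
    }
    if (!sawPlus) return false;
    // Quality lines are consumed until the quality string is at least as long
    // as the raw sequence, because '@' is itself a valid quality character.
    while (rec.quality.size() < rec.rawSequence.size() && std::getline(in, line)) {
        rec.quality += line;
    }
    return true;
}
```

Using the length of the raw sequence as the stopping condition is what allows a quality line that begins with '@' to be read as quality data rather than as the start of the next record.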
 
 
 
 
 
== Additional Wishlist - Not Implemented ==
 
*To reduce memory usage, implement a two-pass algorithm that stores only a key for each sequence name (rather than complete sequence names) in memory (suggest a pair of options -1 -> one pass, high memory use, -2 -> two pass lower memory use, default is -1).
 
*Add an option that would reject raw sequence and quality strings that wrap over multiple lines.  It would only allow 1 line per raw sequence/quality string.
 
*Maybe report 2 types of information to the user: ERROR (critical failure) and WARNING (tolerable errors).
 
 
 
 
 
== Possible Issues ==
 
*  For color space, there is no specification covering the following:
 
# The lengths of the read and the quality string may be the same or differ by 1 (depending on whether the primer base has a corresponding quality value).
 
# Missing values are usually represented by "." or sometimes left as a blank " ".
 
# Tag names for paired end reads may be the same (e.g. MAQ actually enforces that), and may be in the same file (e.g. BFAST requires paired reads in the same file).
 
 
 
 
 
== How to Use the fastQValidator Executable ==
 
'''Required Parameters:'''
 
        -f  :  FastQ filename with path to be processed.
 
 
 
'''Optional Parameters:'''
 
        -l  :  Minimum allowed read length (Defaults to 10).
 
        -e  :  Maximum number of errors to display before suppressing them (Defaults to 20).
 
        -b  :  Raw sequence type: "A"/"C"/"G"/"T"/"N"  - Bases only;
 
                                  "0"/"1"/"2"/"3"/"."  - Color space only;
 
                                  ""                  - Base Decision on the first Raw Sequence Character (Default)
 
                                  All other characters - Bases & Color space
 
 
 
'''Testing only Parameters:'''
 
        -t  :  If "ReadOnly" is specified, the fastq will be read but not processed.  This may be used for determining read time.
 
'''Usage:'''
 
        ./fastQValidator -f <fileName> -l <minReadLen> -e <maxReportedErrors> -b <rawSeqType>
 
 
 
'''Examples:'''
 
        ../fastQValidator -f testFile.txt
 
        ../fastQValidator -f testFile.txt -l 10 -b A -e 100
 
        ./fastQValidator -f test/testFile.txt -l 10 -b Z -e 100
 
        time ./fastQValidator -f test/testFile.txt -t ReadOnly
 
 
 
 
 
== FastQ Validator Output ==
 
When running the fastQValidator Executable, the output starts with a summary of the parameters:
 
The following parameters are in effect:
 
              FastQ File Name :    testFile.txt (-fname)
 
              Min Read Length :              10 (-l9999)
 
          Max Reported Errors :            100 (-e9999)
 
                      BaseType :              A (-bname)
 
                      TestMode :                (-tname)
 
 
 
Both the Executable and the Library output the following:
 
*Error messages for the first configurable number of errors:
 
 
  ERROR on Line 25: The sequence identifier line was too short.
 
  ERROR on Line 25: The sequence identifier line was too short.
 
  ERROR on Line 29: First line of a sequence does not begin wtih @
 
  ERROR on Line 29: First line of a sequence does not begin wtih @
 
  ERROR on Line 33: No Sequence Identifier specified before the comment.
 
  ERROR on Line 33: No Sequence Identifier specified before the comment.
*Base Composition Percentages by Index:
 
  
Base Composition Statistics:
+
== FastQValidator ==
Read Index %A %C %G %T %N Total Reads At Index
+
The [[FastQValidator]] was built using the FastQFile class. More details on that program are at the supplied link.
        0  100.00    0.00    0.00    0.00    0.00 20
 
        1    5.00  95.00    0.00    0.00    0.00 20
 
        2    5.00    0.00    5.00  90.00    0.00 20
 
*Summary of the number of lines, sequences, and errors:
 
Finished processing testFile.txt with 92 lines containing 20 sequences.
 
  There were a total of 17 errors.
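
The Base Composition Statistics shown above (percentages of A, C, G, T and N at each read index) can be computed along the following lines. This sketch is not the library's BaseCount or BaseComposition code: all names are hypothetical, and it assumes that lowercase bases count with their uppercase forms and that other characters are skipped.

```cpp
#include <array>
#include <cctype>
#include <cstddef>
#include <string>
#include <vector>

// Counts at one read index for A, C, G, T, N (in that order) plus a total.
struct IndexCounts {
    std::array<std::size_t, 5> counts{};
    std::size_t total = 0;
};

// Maps a base character to its slot, or -1 if it is not a base.
int baseSlot(char c) {
    switch (std::toupper(static_cast<unsigned char>(c))) {
        case 'A': return 0;
        case 'C': return 1;
        case 'G': return 2;
        case 'T': return 3;
        case 'N': return 4;
        default:  return -1;
    }
}

// Tallies base counts by read index across all reads.
std::vector<IndexCounts> tallyByIndex(const std::vector<std::string>& reads) {
    std::vector<IndexCounts> table;
    for (const std::string& read : reads) {
        if (read.size() > table.size()) table.resize(read.size());
        for (std::size_t i = 0; i < read.size(); ++i) {
            const int slot = baseSlot(read[i]);
            if (slot < 0) continue;  // non-base characters are skipped
            ++table[i].counts[static_cast<std::size_t>(slot)];
            ++table[i].total;
        }
    }
    return table;
}

// Percentage of one base at one index, or 0 if nothing was counted there.
double percentAt(const IndexCounts& entry, std::size_t slot) {
    if (entry.total == 0) return 0.0;
    return 100.0 * static_cast<double>(entry.counts[slot]) /
           static_cast<double>(entry.total);
}
```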
 

Latest revision as of 10:49, 2 February 2017


Where to find the fastqFile Library and the FastQValidator

The fastQ Library is now a part of C++ Library: libStatGen.

The FastQValidator is documented at FastQValidator.

FASTQ Library Component for Reading and Validating FastQFiles

The software reads and validates fastq files in both compressed and uncompressed formats.

The FASTQ component of the library is found in libStatGen/fastq/.

See https://github.com/statgen/libStatGen/commits/master/fastq for a list of the most recent updates to the development version of the FASTQ portion of the library.

For the old change log, see: C++ Library: FASTQ Change Log

Classes in the FASTQ Portion of Library

{| class="wikitable"
! Class Name !! Description
|-
| FastQFile || Class used for reading/validating a fastq file.
|-
| BaseCount || Wrapper around an array that has one index per base and an extra index for a total count of all bases. This class is used to keep a count of the number of times each index has occurred. It can print a percentage of the occurrence of each base against the total number of bases.
|-
| BaseComposition || Class that tracks the composition of base by read location.
|-
| FastQStatus || Status for FastQ operations.
|}

FASTQ Output

When a sequence is read, error messages for the first maxReportedErrors are output for failed Validation Criteria. For example:

ERROR on Line 25: The sequence identifier line was too short.
ERROR on Line 29: First line of a sequence does not begin wtih @
ERROR on Line 33: No Sequence Identifier specified before the comment.

FastQValidator

The FastQValidator was built using the FastQFile class. More details on that program are at the supplied link.