Difference between revisions of "LibStatGen: FASTQ"

From Genome Analysis Wiki

Revision as of 14:42, 4 February 2010

Validation Criteria

Sequence Identifier Line

Validation Criteria | Error Message
Line is at least 2 characters long ('@' and at least 1 for the sequence identifier) | ERROR on Line <current line #>: The sequence identifier line was too short.
Line starts with an '@' | ERROR on Line <current line #>: First line of a sequence does not begin with @
Line does not contain a space between the '@' and the first sequence identifier (which must be at least 1 character). | ERROR on Line <current line #>: No Sequence Identifier specified before the comment.
Every entry in the file should have a unique identifier. | ERROR on Line <current line #>: Repeated Sequence Identifier: <identifier> at Lines <previous line #> <current line #>
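
The checks in the table above can be sketched in Python as follows. This is a minimal illustration, not the libStatGen implementation: the function name check_identifier_line, the seen dictionary, and the choice to stop at the first failed check are all assumptions made for the example.

```python
def check_identifier_line(line, line_no, seen):
    """Return the error messages for one Sequence Identifier Line.

    `seen` (hypothetical helper state) maps each identifier already read
    to the line number where it first appeared.
    """
    prefix = f"ERROR on Line {line_no}: "
    if len(line) < 2:
        return [prefix + "The sequence identifier line was too short."]
    if not line.startswith("@"):
        return [prefix + "First line of a sequence does not begin with @"]
    if line[1] == " ":
        return [prefix + "No Sequence Identifier specified before the comment."]
    # The identifier runs from after the '@' up to the first space;
    # anything after that space is the optional comment.
    identifier = line[1:].split(" ", 1)[0]
    if identifier in seen:
        return [prefix + f"Repeated Sequence Identifier: {identifier} "
                         f"at Lines {seen[identifier]} {line_no}"]
    seen[identifier] = line_no
    return []
```

Passing the same seen dictionary to every call is what lets the sketch detect repeated identifiers across the whole file.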

Raw Sequence Line

Validation Criteria | Error Message
A base sequence should have non-zero length. | ERROR on Line <current line #>: Raw Sequence is shorter than the min read length: 0 < <config min read length>
All characters in the base sequence must be in the allowable set specified via configuration (sets listed below). | ERROR on Line <current line #>: Invalid character ('<invalid char>') in base sequence.
  • Base Only: A C T G N a c t g n
  • Color Space Only: 0 1 2 3 .(period)
  • Base or Color Space: A C T G N a c t g n 0 1 2 3 .(period)
Reads must meet a configurable minimum length, since many mappers have trouble with very short reads. | ERROR on Line <current line #>: Raw Sequence is shorter than the min read length: <read length> < <config min read length>
  • If the raw sequence spans lines, the sum of the lengths of all lines is validated, not each individual line.
Each Line of a Raw Sequence should have at least 1 character (not be blank). | ERROR on Line <current line #>: Looking for continuation of Raw Sequence or '+' instead found a blank line, assuming it was part of Raw Sequence.
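
As a rough sketch of the raw sequence checks above (not the library's own code), the following Python function validates the characters against the configured set and compares the summed length of all lines with the minimum read length. The names ALLOWED and check_raw_sequence are hypothetical, and reporting the length error on the last line of the sequence is an assumption.

```python
# Allowable character sets, keyed by the raw sequence type.
ALLOWED = {
    "B": set("ACTGNactgn"),        # base only
    "C": set("0123."),             # color space only
    "BC": set("ACTGNactgn0123."),  # base or color space
}

def check_raw_sequence(lines, first_line_no, seq_type="B", min_read_len=10):
    """Return error messages for a raw sequence given as one or more lines."""
    errors = []
    allowed = ALLOWED[seq_type]
    total = 0
    for offset, line in enumerate(lines):
        line_no = first_line_no + offset
        if line == "":
            errors.append(
                f"ERROR on Line {line_no}: Looking for continuation of Raw "
                "Sequence or '+' instead found a blank line, assuming it was "
                "part of Raw Sequence.")
        for ch in line:
            if ch not in allowed:
                errors.append(f"ERROR on Line {line_no}: "
                              f"Invalid character ('{ch}') in base sequence.")
        total += len(line)
    # The minimum length applies to the whole sequence, not to each line.
    if total < min_read_len:
        last_line_no = first_line_no + max(len(lines), 1) - 1
        errors.append(f"ERROR on Line {last_line_no}: Raw Sequence is shorter "
                      f"than the min read length: {total} < {min_read_len}")
    return errors
```

For example, check_raw_sequence(["ACGTA", "CGTAC"], 2) passes the length check because the two lines together contain 10 characters.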

Plus Line

Validation Criteria | Error Message
Must exist for every sequence. | ERROR on Line <current line #>: Reached the end of the file without a '+' line.
If the optional sequence identifier is specified, it must equal the one on the Sequence Identifier Line. | ERROR on Line <current line #>: Sequence Identifier on '+' line does not equal the one on the '@' line.
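
A small Python sketch of the two Plus Line checks follows. The function name check_plus_line is hypothetical, None is used here to signal that the end of the file was reached, and treating the text after '+' up to the first space as the optional identifier is an assumption.

```python
def check_plus_line(plus_line, line_no, identifier):
    """Return error messages for a '+' line.

    `plus_line` is None when the file ended before a '+' line was found;
    `identifier` is the one read from the Sequence Identifier Line.
    """
    prefix = f"ERROR on Line {line_no}: "
    if plus_line is None:
        return [prefix + "Reached the end of the file without a '+' line."]
    # Assumption: the optional identifier ends at the first space.
    optional_id = plus_line[1:].split(" ", 1)[0]
    if optional_id and optional_id != identifier:
        return [prefix + "Sequence Identifier on '+' line does not equal "
                         "the one on the '@' line."]
    return []
```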

Quality String Line

Validation Criteria | Error Message
A quality string should be present for every base sequence. | ERROR on Line <current line #>: Quality string length (<quality length>) does not equal raw sequence length (<raw sequence length>)
Paired quality and base sequences should be of the same length. | ERROR on Line <current line #>: Quality string length (<quality length>) does not equal raw sequence length (<raw sequence length>)
Valid quality values should all have ASCII codes > 32. | ERROR on Line <current line #>: Invalid character ('<invalid char>') in quality string.
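
The quality string checks can be illustrated with the short Python sketch below. The function name check_quality and the order in which the errors are reported are assumptions. Because a missing quality string has length 0, it produces the same length-mismatch message as a string of the wrong length, which matches the table above.

```python
def check_quality(quality, raw_sequence, line_no):
    """Return error messages for a quality string paired with a raw sequence."""
    errors = []
    prefix = f"ERROR on Line {line_no}: "
    for ch in quality:
        # Valid quality characters have ASCII codes strictly greater than 32.
        if ord(ch) <= 32:
            errors.append(prefix +
                          f"Invalid character ('{ch}') in quality string.")
    if len(quality) != len(raw_sequence):
        errors.append(prefix +
                      f"Quality string length ({len(quality)}) does not equal "
                      f"raw sequence length ({len(raw_sequence)})")
    return errors
```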

Additional Features

  • Base composition is tracked and reported by position.
  • Consumes gzipped and uncompressed text files transparently (see libcsg/InputFile.h).
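
Both features can be illustrated in Python. This sketch does not use libcsg/InputFile.h: open_text and base_composition are hypothetical names, detecting gzip input by its two magic bytes is one common approach rather than necessarily the library's, and folding lowercase bases into uppercase is an assumption.

```python
import gzip
from collections import Counter

def open_text(path):
    """Open a plain or gzipped text file, chosen by the gzip magic bytes."""
    with open(path, "rb") as handle:
        magic = handle.read(2)
    if magic == b"\x1f\x8b":
        return gzip.open(path, "rt")
    return open(path, "rt")

def base_composition(sequences):
    """Return a list with one Counter per read position.

    counts[i][base] is the number of reads that have `base` at position i.
    """
    counts = []
    for seq in sequences:
        for pos, base in enumerate(seq):
            if pos == len(counts):
                counts.append(Counter())
            counts[pos][base.upper()] += 1  # assumption: case is ignored
    return counts
```

For the reads "ACG" and "AT", position 0 counts two A's, position 1 counts one C and one T, and position 2 counts one G.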

Additional Wishlist - Not Implemented

  • To reduce memory usage, implement a two-pass algorithm that keeps only a key for each sequence name in memory, rather than the complete name. Suggested options: -1 for one pass with high memory use (the default) and -2 for two passes with lower memory use.
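
Since this feature is not implemented, the following Python sketch shows only one possible reading of the idea. The first pass keeps a fixed-size 8-byte hash key per name and notes which keys repeat; the second pass stores full names only for those suspect keys, so hash collisions cannot produce false reports. The names key_of and find_duplicates_two_pass, the choice of BLAKE2b, and the use of record indexes in place of line numbers are all assumptions.

```python
import hashlib

def key_of(name):
    """Return a compact 8-byte key for a sequence name."""
    return hashlib.blake2b(name.encode(), digest_size=8).digest()

def find_duplicates_two_pass(read_names):
    """Find repeated names while holding few full names in memory.

    `read_names` is a zero-argument callable that returns a fresh iterator
    over the names, so the input can be scanned twice.
    Returns (name, first_index, repeat_index) tuples.
    """
    # Pass 1: remember only the keys, and which keys occurred more than once.
    seen, suspects = set(), set()
    for name in read_names():
        key = key_of(name)
        if key in seen:
            suspects.add(key)
        else:
            seen.add(key)
    del seen
    # Pass 2: store full names only for suspect keys and confirm duplicates.
    first_index, duplicates = {}, []
    for index, name in enumerate(read_names()):
        if key_of(name) in suspects:
            if name in first_index:
                duplicates.append((name, first_index[name], index))
            else:
                first_index[name] = index
    return duplicates
```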


Assumptions

  • The Sequence Identifier is separated from an optional comment by a space (" ").
  • No validation is required on the optional comment field of the Sequence Identifier Line.
  • The Sequence Identifier and the '+' Lines cannot wrap lines. They are each completely contained on one line.
  • Raw Sequences and Quality Strings may wrap lines.
  • All lines are part of the Raw Sequence Line until a line that starts with a '+' is discovered.
  • All lines are considered part of the quality string until at least the length of the associated raw sequence is hit (or the end of the file is reached). This is due to the fact that '@' is a valid quality character, so does not necessarily indicate the start of a Sequence Identifier Line.
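
The assumptions above imply a simple way to split a file into records, sketched here in Python. The generator read_records is a hypothetical name, the input is assumed to be lines with their newlines already stripped, and no validation is performed.

```python
def read_records(lines):
    """Yield (identifier_line, raw_sequence, plus_line, quality) tuples.

    `plus_line` is None if the input ends before a '+' line is found.
    """
    it = iter(lines)
    for id_line in it:
        # Every line belongs to the raw sequence until one starts with '+'.
        raw_parts, plus_line = [], None
        for line in it:
            if line.startswith("+"):
                plus_line = line
                break
            raw_parts.append(line)
        raw = "".join(raw_parts)
        # Quality lines are consumed until they cover the raw sequence length
        # (or the input ends), so a leading '@' here is not a new record.
        quality = ""
        while len(quality) < len(raw):
            line = next(it, None)
            if line is None:
                break
            quality += line
        yield id_line, raw, plus_line, quality
```

With the input lines "@r1", "ACGT", "ACGT", "+", "@III", "IIII", the sketch yields a single record whose quality string is "@IIIIIII", because the line beginning with '@' is read while the quality string is still shorter than the 8-base raw sequence.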

How to Use the fastQValidator Executable

Required Parameters:

       -f  :  FastQ filename with path to be processed.

Optional Parameters:

       -l  :  Minimum allowed read length (Defaults to 10).
       -e  :  Maximum number of errors to display before suppressing them (Defaults to 20).
       -b  :  Raw sequence type:  B - ACTGN only (Default)
                                  C - 0123. only
                                 BC - ACTGN or 0123.

Testing only Parameters:

       -t  :  If "ReadOnly" is specified, the fastq will be read but not processed.  This may be used for determining read time.

Usage:

       ./fastQValidator -f <fileName> -l <minReadLen> -e <maxReportedErrors> -b <rawSeqType>

Examples:

       ../fastQValidator -f testFile.txt
       ../fastQValidator -f testFile.txt -l 10 -b BC -e 100
       ./fastQValidator -f test/testFile.txt -l 10 -b BC -e 100
       time ./fastQValidator -f test/testFile.txt -t ReadOnly
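
For scripted use, the command line can be assembled in Python and run with the standard subprocess module. This is only a convenience sketch: build_command and run_validator are hypothetical helpers, the executable is assumed to be at ./fastQValidator, and nothing is assumed about the program's output or exit code.

```python
import subprocess

def build_command(fastq_path, min_read_len=None, max_errors=None,
                  seq_type=None, executable="./fastQValidator"):
    """Build the argument list using the -f, -l, -e and -b options."""
    cmd = [executable, "-f", fastq_path]
    if min_read_len is not None:
        cmd += ["-l", str(min_read_len)]
    if max_errors is not None:
        cmd += ["-e", str(max_errors)]
    if seq_type is not None:
        if seq_type not in ("B", "C", "BC"):
            raise ValueError("seq_type must be 'B', 'C' or 'BC'")
        cmd += ["-b", seq_type]
    return cmd

def run_validator(fastq_path, **options):
    """Run the validator and return the CompletedProcess with captured text."""
    return subprocess.run(build_command(fastq_path, **options),
                          capture_output=True, text=True)
```

Omitted options are left off the command line, so the program's own defaults apply.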

FastQ Validator Output

Coming Soon