Difference between revisions of "FastQ Validation Criteria"

From Genome Analysis Wiki

Revision as of 10:32, 15 April 2010

FastQ Sequence Validation Criteria

The following validation criteria are used by the FastQFile class and the FastQ Validator program when reading a FastQ sequence.


Sequence Identifier Line
Validation Criteria | Error Message
Line is at least 2 characters long ('@' and at least 1 for the sequence identifier). | ERROR on Line <current line #>: The sequence identifier line was too short.
Line starts with an '@'. | ERROR on Line <current line #>: First line of a sequence does not begin with @
Line does not contain a space between the '@' and the first sequence identifier (which must be at least 1 character). | ERROR on Line <current line #>: No Sequence Identifier specified before the comment.
Every entry in the file should have a unique identifier. | ERROR on Line <current line #>: Repeated Sequence Identifier: <identifier> at Lines <previous line #> <current line #>
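
The identifier-line checks above can be sketched in a few lines of Python. This is only an illustration of the criteria, not the FastQFile implementation: the function name check_identifier_line and the seen dictionary (identifier mapped to the line number where it first appeared) are hypothetical, and it is an assumption that the identifier ends at the first space.

```python
def check_identifier_line(line, line_no, seen):
    """Return the error messages for one '@' line (hypothetical helper).

    seen maps each identifier to the line number where it first appeared.
    """
    prefix = f"ERROR on Line {line_no}: "
    if len(line) < 2:
        return [prefix + "The sequence identifier line was too short."]
    if not line.startswith("@"):
        return [prefix + "First line of a sequence does not begin with @"]
    if line[1] == " ":
        return [prefix + "No Sequence Identifier specified before the comment."]
    # Assumption: the identifier runs from after '@' up to the first space.
    identifier = line[1:].split(" ", 1)[0]
    if identifier in seen:
        return [prefix + f"Repeated Sequence Identifier: {identifier} "
                f"at Lines {seen[identifier]} {line_no}"]
    seen[identifier] = line_no
    return []
```

A caller would keep one seen dictionary per file and pass it to every call, so that a repeated identifier can be reported together with the line of its first occurrence.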


Raw Sequence Line
Validation Criteria | Error Message
A base sequence should have non-zero length. | ERROR on Line <current line #>: Raw Sequence is shorter than the min read length: 0 < <config min read length>
All characters in the base sequence must be in the allowable set specified via configuration. Base Only: A C T G N a c t g n. Color Space Only: 0 1 2 3 . (period); color space files must start with a 1-character primer base. | ERROR on Line <current line #>: Invalid character ('<invalid char>') in base sequence.
Reads should be of a configurable minimum length, since many mappers have trouble with very short reads. If the raw sequence spans lines, the sum of the lengths of all lines is validated, not each individual line. | ERROR on Line <current line #>: Raw Sequence is shorter than the min read length: <read length> < <config min read length>
Each line of a raw sequence should have at least 1 character (not be blank). | ERROR on Line <current line #>: Looking for continuation of Raw Sequence or '+' instead found a blank line, assuming it was part of Raw Sequence.
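
As a rough sketch of the raw sequence checks, the Python below validates a read that may span several lines. It is not the FastQFile code. The name check_raw_sequence and its parameters are hypothetical, and two details are assumptions not stated in the criteria: the minimum-length error is reported on the last raw sequence line, and in color space only the first character of the read is treated as the primer base and checked against the base set.

```python
BASE_CHARS = set("ACTGNactgn")
COLOR_CHARS = set("0123.")

def check_raw_sequence(lines, first_line_no, min_read_length=1, color_space=False):
    """Return the error messages for the raw sequence lines of one read."""
    errors = []
    total = 0  # summed length over all lines of this read
    for offset, line in enumerate(lines):
        line_no = first_line_no + offset
        if line == "":
            errors.append(
                f"ERROR on Line {line_no}: Looking for continuation of Raw "
                "Sequence or '+' instead found a blank line, assuming it was "
                "part of Raw Sequence.")
            continue
        for pos, ch in enumerate(line):
            # Assumption: the primer base is the very first character of the read.
            is_primer = color_space and total + pos == 0
            allowed = BASE_CHARS if (not color_space or is_primer) else COLOR_CHARS
            if ch not in allowed:
                errors.append(
                    f"ERROR on Line {line_no}: Invalid character ('{ch}') "
                    "in base sequence.")
        total += len(line)
    if total < min_read_length:
        # Assumption: report the length error on the last raw sequence line.
        last_line_no = first_line_no + max(len(lines) - 1, 0)
        errors.append(
            f"ERROR on Line {last_line_no}: Raw Sequence is shorter than the "
            f"min read length: {total} < {min_read_length}")
    return errors
```

Because the length is accumulated across lines, a read split as "AC" and "G" counts as length 3 and is compared once against the configured minimum.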


Plus Line
Validation Criteria | Error Message
Must exist for every sequence. | ERROR on Line <current line #>: Reached the end of the file without a '+' line.
If the optional sequence identifier is specified, it must equal the one on the Sequence Identifier Line. | ERROR on Line <current line #>: Sequence Identifier on '+' line does not equal the one on the '@' line.
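
The two plus line checks might look like the following in Python. This is a sketch under assumptions, not the validator's code: check_plus_line is a hypothetical name, plus_line is None when the file ended before a '+' line was found, and only the first space-delimited token after '+' is compared with the identifier from the '@' line.

```python
def check_plus_line(plus_line, identifier, line_no):
    """Return the error messages for the '+' line of one read.

    plus_line is None if the end of the file was reached first.
    """
    prefix = f"ERROR on Line {line_no}: "
    if plus_line is None:
        return [prefix + "Reached the end of the file without a '+' line."]
    # Assumption: the optional identifier is the first token after '+'.
    repeated = plus_line[1:].split(" ", 1)[0]
    if repeated and repeated != identifier:
        return [prefix + "Sequence Identifier on '+' line does not equal "
                "the one on the '@' line."]
    return []
```

A bare "+" passes, since the repeated identifier is optional.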


Quality String Line
Validation Criteria | Error Message
A quality string should be present for every base sequence. | ERROR on Line <current line #>: Quality string length (<quality length>) does not equal raw sequence length (<raw sequence length>)
Paired quality and base sequences should be of the same length. | ERROR on Line <current line #>: Quality string length (<quality length>) does not equal raw sequence length (<raw sequence length>)
Valid quality values should all have ASCII codes > 32. | ERROR on Line <current line #>: Invalid character ('<invalid char>') in quality string.
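
The quality string checks reduce to a length comparison and a per-character ASCII test, as in this minimal Python sketch. The function name check_quality and its parameters are hypothetical. A missing quality string shows up as a length of 0, which is why the first two criteria share one error message.

```python
def check_quality(quality, raw_sequence_length, line_no):
    """Return the error messages for the quality string of one read."""
    prefix = f"ERROR on Line {line_no}: "
    errors = []
    for ch in quality:
        # Valid quality characters have ASCII codes strictly greater than 32,
        # so the space character (32) and control characters are rejected.
        if ord(ch) <= 32:
            errors.append(prefix + f"Invalid character ('{ch}') in quality string.")
    if len(quality) != raw_sequence_length:
        errors.append(
            prefix + f"Quality string length ({len(quality)}) does not equal "
            f"raw sequence length ({raw_sequence_length})")
    return errors
```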