Changes

From Genome Analysis Wiki
1,581 bytes added, 14:34, 4 February 2010
no edit summary
Line 5: Line 5:  
!  Validation Criteria
 
!  Validation Criteria
 
!  Error Message
 
!  Error Message
 +
|-
 +
|  Line is at least 2 characters long ('@' and at least 1 for the sequence identifier)
 +
|  ERROR on Line <current line #>:
 +
|-
 +
|  Line starts with an '@'
 +
|  ERROR on Line <current line #>:
 +
|-
 +
|  Line does not contain a space between the '@' and the first sequence identifier (which must be at least 1 character).
 +
|  ERROR on Line <current line #>:
 
|-
 
|-
 
|  Every entry in the file should have a unique identifier.
 
|  Every entry in the file should have a unique identifier.
Line 25: Line 34:  
|  ERROR on Line <current line #>:  
 
|  ERROR on Line <current line #>:  
 
|-
 
|-
|  Reads should be of a minimum length; many mappers will get into trouble with very short reads.
+
|  Reads should be of a configurable minimum length since many mappers will get into trouble with very short reads.
 +
* If the raw sequence spans multiple lines, the sum of the lengths of all its lines is validated, not the length of each individual line.
 +
|  ERROR on Line <current line #>:
 +
|-
 +
|  Each Line of a Raw Sequence should have at least 1 character (not be blank).
 
|  ERROR on Line <current line #>:  
 
|  ERROR on Line <current line #>:  
 
|}
 
|}
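
The raw sequence checks in the table above can be sketched in Python as follows. This is only an illustration, not the fastQValidator implementation: the function name <code>check_raw_sequence</code>, the default <code>min_length</code> of 10, and the wording of the messages are all assumptions made for the example.

```python
def check_raw_sequence(lines, min_length=10):
    """Return a list of problems found in the lines of one raw sequence.

    `lines` holds the raw sequence lines of a single record, already
    stripped of line terminators. The default `min_length` is a
    hypothetical value; the page only says the minimum is configurable.
    """
    problems = []

    # Each line of a raw sequence must contain at least 1 character.
    for offset, line in enumerate(lines):
        if len(line) == 0:
            problems.append(f"raw sequence line {offset + 1} is blank")

    # The minimum length applies to the whole sequence, summed over all
    # of its lines, not to each individual line.
    total = sum(len(line) for line in lines)
    if total < min_length:
        problems.append(
            f"raw sequence length {total} is below the minimum {min_length}"
        )
    return problems
```

With these assumptions, a 10-base read split over two lines passes a minimum of 10 even though neither line reaches 10 on its own.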
    
=== Plus Line ===
 
=== Plus Line ===
 +
{| class="wikitable" border="1"
 +
|-
 +
!  Validation Criteria
 +
!  Error Message
 +
|-
 +
|  Must exist for every sequence.
 +
|  ERROR on Line <current line #>:
 +
|-
 +
|  If the optional sequence identifier is specified, it must equal the one on the Sequence Identifier Line.
 +
|  ERROR on Line <current line #>:
 +
|}
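
A minimal Python sketch of the two Plus Line checks above, under stated assumptions: the function name <code>check_plus_line</code> and the message wording are hypothetical, and the sketch assumes the optional identifier on the '+' line ends at the first space, which the page does not specify.

```python
def check_plus_line(plus_line, sequence_identifier):
    """Return a list of problems found on the '+' line of one record.

    `sequence_identifier` is the identifier taken from the Sequence
    Identifier Line, without the leading '@' and without any comment.
    """
    problems = []

    # A '+' line must exist for every sequence.
    if not plus_line.startswith("+"):
        problems.append("expected a line starting with '+'")
        return problems

    # The identifier after '+' is optional. Assumption: it ends at the
    # first space, mirroring the Sequence Identifier Line.
    optional_id = plus_line[1:].split(" ", 1)[0]
    if optional_id and optional_id != sequence_identifier:
        problems.append(
            f"'+' line identifier {optional_id!r} does not match "
            f"{sequence_identifier!r}"
        )
    return problems
```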
    
=== Quality String Line ===
 
=== Quality String Line ===
=== Raw Sequence Line ===
   
{| class="wikitable" border="1"
 
{| class="wikitable" border="1"
 
|-
 
|-
Line 54: Line 77:  
== Additional Wishlist - Not Implemented ==
 
== Additional Wishlist - Not Implemented ==
 
*To reduce memory usage, implement a two-pass algorithm that stores only a key for each sequence name (rather than complete sequence names) in memory (suggest a pair of options -1 -> one pass, high memory use, -2 -> two pass lower memory use, default is -1).
 
*To reduce memory usage, implement a two-pass algorithm that stores only a key for each sequence name (rather than complete sequence names) in memory (suggest a pair of options -1 -> one pass, high memory use, -2 -> two pass lower memory use, default is -1).
         
== Assumptions ==
 
== Assumptions ==
 +
*The Sequence Identifier is separated from an optional comment by a " " (space).
 +
*No validation is required on the optional comment field of the Sequence Identifier Line.
 +
*The Sequence Identifier and the '+' Lines cannot wrap lines.  They are each completely contained on one line.
 +
*Raw Sequences and Quality Strings may wrap lines.
 +
*All lines after the Sequence Identifier Line are treated as part of the Raw Sequence until a line that starts with a '+' is found.
 +
*All lines after the '+' Line are treated as part of the quality string until their combined length is at least the length of the associated raw sequence (or the end of the file is reached).  This is necessary because '@' is a valid quality character, so a leading '@' does not necessarily indicate the start of a Sequence Identifier Line.
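
The assumptions above imply a particular way of splitting a file into records. The following Python sketch shows one way to do that. It is not the fastQValidator source: the function name <code>read_record</code>, its return shape, and the error wording are hypothetical, and the input is assumed to be a list of lines already stripped of line terminators.

```python
def read_record(lines, start):
    """Parse one FASTQ record beginning at lines[start].

    Returns (identifier, raw_sequence, quality_string, next_index),
    where next_index is the index of the first line after the record.
    """
    header = lines[start]
    # The identifier line needs '@' plus at least 1 identifier character,
    # with no space between them.
    if len(header) < 2 or not header.startswith("@") or header[1] == " ":
        raise ValueError(f"line {start + 1}: bad sequence identifier line")
    # The identifier is separated from an optional comment by a space.
    identifier = header[1:].split(" ", 1)[0]

    # Every line up to the first one starting with '+' is raw sequence.
    i = start + 1
    sequence_parts = []
    while i < len(lines) and not lines[i].startswith("+"):
        sequence_parts.append(lines[i])
        i += 1
    if i == len(lines):
        raise ValueError(f"line {start + 1}: record has no '+' line")
    raw_sequence = "".join(sequence_parts)
    i += 1  # skip the '+' line, which never wraps

    # Quality lines are consumed by length, not by looking for '@',
    # because '@' is itself a valid quality character.
    quality_parts = []
    quality_length = 0
    while i < len(lines) and quality_length < len(raw_sequence):
        quality_parts.append(lines[i])
        quality_length += len(lines[i])
        i += 1
    return identifier, raw_sequence, "".join(quality_parts), i
```

Counting quality characters instead of scanning for the next '@' is what allows a quality line such as <code>@III</code> to be read as quality data and not mistaken for the start of a new record.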
    
== How to Use the fastQValidator Executable ==
 
== How to Use the fastQValidator Executable ==
