Changes

From Genome Analysis Wiki
984 bytes added, 14:59, 4 February 2010
no edit summary
Line 70: Line 70:  
|  ERROR on Line <current line #>: Invalid character ('<invalid char>') in quality string.
 
 
|}
 
|}
 +
    
== Additional Features ==
 
 
*Base composition is reported and tracked by position.
 
*Consumes gzipped and uncompressed text files transparently (see libcsg/InputFile.h).
 
 +
*Prints an error message for each error, up to a configurable maximum number of reportable errors. The total number of errors is also printed in a summary.
 +
*Prints the total number of lines processed as well as the total number of sequences processed.
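The error-reporting behaviour described above can be sketched as follows. This is a minimal illustration in Python, not the fastQValidator source: the class name `ErrorReporter`, the `max_reported` parameter, its default value, and the wording of the summary line are all assumptions made for the example. Only the "ERROR on Line" message prefix is taken from the error table.

```python
class ErrorReporter:
    """Counts every error but prints only the first max_reported of them."""

    def __init__(self, max_reported=20):
        # Hypothetical default; the real tool's default is not stated here.
        self.max_reported = max_reported
        self.total_errors = 0
        self.printed = []

    def report(self, line_number, message):
        """Record one error and print it if the cap has not been reached."""
        self.total_errors += 1
        if self.total_errors <= self.max_reported:
            text = f"ERROR on Line {line_number}: {message}"
            self.printed.append(text)
            print(text)

    def summary(self, lines_processed, sequences_processed):
        """Build a summary line (illustrative wording, not the tool's output)."""
        return (f"Finished with {self.total_errors} errors; "
                f"{lines_processed} lines and "
                f"{sequences_processed} sequences processed.")
```

Counting continues after the cap is reached, so the summary still reflects every error even when most messages are suppressed.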
 +
    
== Assumptions ==
 
Line 82: Line 86:  
*All lines are part of the Raw Sequence Line until a line that starts with a '+' is discovered.
 
 
*All lines are considered part of the quality string until it is at least as long as the associated raw sequence (or the end of the file is reached). This is because '@' is a valid quality character, so a leading '@' does not necessarily indicate the start of a Sequence Identifier Line.
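The two assumptions above about multi-line records can be sketched as a small reader. This is an illustrative Python sketch, not the validator's implementation: the function name `read_records` is hypothetical, and the sketch performs no validation (it does not check that the identifier line starts with '@', nor the characters in either string).

```python
def read_records(lines):
    """Yield (identifier, sequence, quality) tuples from an iterable of lines."""
    it = iter(line.rstrip("\r\n") for line in lines)
    for header in it:
        if not header:
            continue  # skip blank lines between records
        identifier = header[1:]  # drop the leading '@' (not checked here)

        # Everything up to the first line starting with '+' is raw sequence.
        seq_parts = []
        for line in it:
            if line.startswith("+"):
                break
            seq_parts.append(line)
        sequence = "".join(seq_parts)

        # Keep reading quality lines until the quality string is at least
        # as long as the sequence, or the input ends. A line starting with
        # '@' inside this loop is quality data, not a new record.
        quality = ""
        while len(quality) < len(sequence):
            line = next(it, None)
            if line is None:
                break
            quality += line
        yield identifier, sequence, quality
```

Deciding by length rather than by the leading character is what lets a wrapped quality line that begins with '@' be read correctly.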
 +
    
== Additional Wishlist - Not Implemented ==
 
 
*To reduce memory usage, implement a two-pass algorithm that stores only a key for each sequence name (rather than the complete name) in memory. Suggested options: -1 for one pass with high memory use, -2 for two passes with lower memory use; the default would be -1.
 +
*Add an option to reject raw sequences and quality strings that wrap over multiple lines, allowing only one line per raw sequence or quality string.
 +
*Maybe report 2 types of information to the user: ERROR (critical failure) and WARNING (tolerable errors).
 +
 +
 +
== Possible Issues ==
 +
*  For color space, there is no specification for:
 +
# The lengths of the read and quality string may be the same or differ by 1 (depending on whether the primer base has a corresponding quality value).
 +
# Missing values are usually represented by "." or sometimes left as a blank " ".
 +
# Tag names for paired-end reads may be the same (e.g. MAQ actually enforces that), and the reads may be in the same file (e.g. BFAST requires paired reads in the same file).
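As a rough illustration of the first two color space issues, a tolerant check might look like the Python sketch below. The function names and the `MISSING_COLOR_MARKERS` constant are hypothetical, and the sketch assumes that when the lengths differ it is the quality string that is one shorter (because the primer base has no quality value); that direction is an assumption for the example.

```python
# Characters the list above mentions as stand-ins for a missing color value.
MISSING_COLOR_MARKERS = {".", " "}


def color_space_lengths_compatible(read, quality):
    """Return True if the quality is as long as the read or one shorter.

    Assumes the read includes the primer base and the quality string may
    omit a value for it (an assumption, not a specification).
    """
    return len(read) - len(quality) in (0, 1)


def count_missing_colors(read):
    """Count positions in the read that hold a missing-value marker."""
    return sum(1 for ch in read if ch in MISSING_COLOR_MARKERS)
```

A validator could use the first function in place of a strict equal-length test when running in color space mode, and the second to report reads with missing values.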
 +
    
== How to Use the fastQValidator Executable ==
 
