Changes

From Genome Analysis Wiki
1,886 bytes added ,  13:52, 22 February 2010
no edit summary
Line 1: Line 1:  
== Status  ==
 
== Status  ==
   −
The [http://en.wikipedia.org/wiki/FASTQ_format FastQ] Validator is on our [[Todo List]].  
+
The initial version of a [http://en.wikipedia.org/wiki/FASTQ_format FastQ] Validator is complete.  
   −
An initial version of the [[FastQFile]] has been completed which includes validation methods.
      
== Valid FastQ File Requirements  ==
 
== Valid FastQ File Requirements  ==
   −
A valid fastQ file should meet the following requirements:
+
A valid fastQ file meets the validation criteria specified in [[FastQFile]].
   −
*A base sequence should have non-zero length.
     −
*A quality string should be present for every base sequence.
+
== Additional Features ==
   −
*Paired quality and base sequences should be of the same length.
+
*Base composition reported and tracked by position.
 +
*Supports base space and color space files.
 +
*Consumes gzipped and uncompressed text files transparently.
 +
*Prints error messages for errors up to the configurable maximum number of reportable errors.
 +
*Prints a summary of the total number of errors.
 +
*Prints the total number of lines processed as well as the total number of sequences processed.  
   −
*Valid quality values should all have ASCII codes > 32.
     −
*Valid bases should be ACTG or N, unless ambiguous bases are explicitly allowed by the application consuming the file. Lower case characters are allowed.
+
== Additional Wishlist - Not Implemented ==
   −
*Every entry in the file should have a unique identifier.
+
There are a series of optional capabilities a FastQ Validator could implement. Among those:  
 
  −
*Reads should be of a minimum length; many mappers will get into trouble with very short reads.
  −
 
  −
*Base composition should be reported and tracked by position.
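The requirements listed in the earlier revision can be sketched as a per-record check. This is only an illustrative Python sketch, not the [[FastQFile]] implementation: the function name <code>validate_record</code>, its arguments, and the error strings are all hypothetical, and the default minimum read length of 10 is taken from the <code>-l</code> option described on this page.

```python
# Hypothetical sketch of the listed requirements; not the FastQFile code.
VALID_BASES = set("ACGTN")  # base space only; lower case is accepted below


def validate_record(identifier, sequence, quality, seen_ids, min_read_length=10):
    """Return a list of error strings for one FASTQ record (empty if valid)."""
    errors = []

    # A base sequence should have non-zero length and meet the minimum length.
    if len(sequence) == 0:
        errors.append("base sequence has zero length")
    elif len(sequence) < min_read_length:
        errors.append("read is shorter than the minimum read length")

    # A quality string should be present and match the sequence length.
    if len(quality) == 0:
        errors.append("quality string is missing")
    elif len(quality) != len(sequence):
        errors.append("quality string and base sequence differ in length")

    # Valid quality values should all have ASCII codes > 32.
    if any(ord(q) <= 32 for q in quality):
        errors.append("quality string contains a character with ASCII code <= 32")

    # Valid bases are A, C, T, G or N; lower case characters are allowed.
    if any(base.upper() not in VALID_BASES for base in sequence):
        errors.append("base sequence contains an invalid character")

    # Every entry in the file should have a unique identifier.
    if identifier in seen_ids:
        errors.append("sequence identifier is not unique")
    else:
        seen_ids.add(identifier)

    return errors
```

The caller keeps one <code>seen_ids</code> set for the whole file, so repeated identifiers are caught across records.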
  −
 
  −
== Additional Wishlist  ==
  −
 
  −
There are a series of optional capabilities a FastQ Validator should implement. Among those:  
  −
 
  −
*Consume gzipped and uncompressed text files transparently (see libcsg/InputFile.h).
      
*To reduce memory usage, implement a two-pass algorithm that stores only a key for each sequence name (rather than complete sequence names) in memory (suggest a pair of options -1 -> one pass, high memory use, -2 -> two pass lower memory use, default is -1).
 
*To reduce memory usage, implement a two-pass algorithm that stores only a key for each sequence name (rather than complete sequence names) in memory (suggest a pair of options -1 -> one pass, high memory use, -2 -> two pass lower memory use, default is -1).
   −
*Support color space files, where valid base sequences include the characters 0, 1, 2, 3, '.' (period) in addition to A, C, T, G and N (some csfastq sequence lines start with a primer base).
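The two-pass idea in the wishlist can be illustrated with a short Python sketch. It is an assumption about how such a check might work, not a description of the validator: the helper names <code>name_key</code> and <code>find_duplicate_names</code> are hypothetical, an 8-byte BLAKE2 digest stands in for "a key for each sequence name", and the suggested <code>-1</code>/<code>-2</code> options are not modelled.

```python
import hashlib
from collections import Counter


def name_key(name, digest_size=8):
    """Short fixed-size key for a sequence name (hypothetical helper)."""
    return hashlib.blake2b(name.encode("utf-8"), digest_size=digest_size).digest()


def find_duplicate_names(open_names):
    """Two-pass duplicate detection.

    open_names is a callable returning a fresh iterator over the sequence
    names, so the input can be read twice.
    """
    # Pass 1: keep only a small key per name and count how often each occurs.
    key_counts = Counter(name_key(name) for name in open_names())
    suspect_keys = {key for key, count in key_counts.items() if count > 1}

    # Pass 2: keep full names only for keys seen more than once, which also
    # separates true duplicates from hash collisions.
    seen, duplicates = set(), set()
    for name in open_names():
        if name_key(name) in suspect_keys:
            if name in seen:
                duplicates.add(name)
            else:
                seen.add(name)
    return duplicates


names = ["read1", "read2", "read1", "read3"]
print(find_duplicate_names(lambda: iter(names)))
```

Full names are held in memory only for suspect keys, so a file with few repeated names needs little more than the key table.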
      
== Discussion ==
 
== Discussion ==
Line 44: Line 35:     
* It may be useful to report 2 types of information to the user: ERROR (critical failure) and WARNING (tolerable errors).
 
* It may be useful to report 2 types of information to the user: ERROR (critical failure) and WARNING (tolerable errors).
 +
 +
 +
 +
== How to Use the fastQValidator Executable ==
 +
'''Required Parameters:'''
 +
        -f  :  FastQ filename with path to be processed.
 +
 +
'''Optional Parameters:'''
 +
        -l  :  Minimum allowed read length (Defaults to 10).
 +
        -e  :  Maximum number of errors to display before suppressing them (Defaults to 20).
 +
        -b  :  Raw sequence type: "A"/"C"/"G"/"T"/"N"  - Bases only;
 +
                                  "0"/"1"/"2"/"3"/"."  - Color space only;
 +
                                  ""                  - Base Decision on the first Raw Sequence Character (Default)
 +
                                  All other characters - Bases & Color space
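The <code>-b</code> rules above can be read as a small decision function. The Python sketch below is one interpretation of the parameter description, not the validator's source: the function name and the return values <code>"base"</code>, <code>"color"</code> and <code>"both"</code> are hypothetical. Two details are assumptions: the first raw sequence character is compared case-insensitively, and a first character that is neither a base nor a color space character falls back to accepting both.

```python
BASE_CHARS = set("ACGTN")
COLOR_CHARS = set("0123.")


def resolve_sequence_type(b_option="", first_raw_char=""):
    """Map the -b value to "base", "color" or "both" (hypothetical names)."""
    if b_option == "":
        # Default: decide on the first raw sequence character.
        # Assumption: lower-case bases count as bases here.
        if first_raw_char.upper() in BASE_CHARS:
            return "base"
        if first_raw_char in COLOR_CHARS:
            return "color"
        # Assumption: anything else accepts both alphabets.
        return "both"
    if b_option in BASE_CHARS:
        return "base"   # "A"/"C"/"G"/"T"/"N": bases only
    if b_option in COLOR_CHARS:
        return "color"  # "0"/"1"/"2"/"3"/".": color space only
    return "both"       # all other characters: bases and color space
```

For instance, <code>-b Z</code> falls into the last branch and accepts both alphabets.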
 +
 +
'''Testing only Parameters:'''
 +
        -t  :  If "ReadOnly" is specified, the fastq will be read but not processed.  This may be used for determining read time.
 +
'''Usage:'''
 +
        ./fastQValidator -f <fileName> -l <minReadLen> -e <maxReportedErrors> -b <rawSeqType>
 +
 +
'''Examples:'''
 +
        ../fastQValidator -f testFile.txt
 +
        ../fastQValidator -f testFile.txt -l 10 -b A -e 100
 +
        ./fastQValidator -f test/testFile.txt -l 10 -b Z -e 100
 +
        time ./fastQValidator -f test/testFile.txt -t ReadOnly
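When the validator is driven from a script, the command line can be assembled programmatically. The following Python sketch only builds the argument list from the flags documented above (<code>-f</code>, <code>-l</code>, <code>-b</code>, <code>-e</code>, <code>-t</code>). The function <code>build_command</code> and its parameter names are hypothetical, and the executable path is assumed to be <code>./fastQValidator</code>.

```python
import shlex


def build_command(file_name, min_read_len=None, max_errors=None,
                  raw_seq_type=None, test_mode=None,
                  executable="./fastQValidator"):
    """Build the fastQValidator argument list; unset options are omitted."""
    cmd = [executable, "-f", file_name]
    if min_read_len is not None:
        cmd += ["-l", str(min_read_len)]
    if raw_seq_type is not None:
        cmd += ["-b", raw_seq_type]
    if max_errors is not None:
        cmd += ["-e", str(max_errors)]
    if test_mode is not None:
        cmd += ["-t", test_mode]
    return cmd


# Print the command as a shell-quoted string for inspection.
print(shlex.join(build_command("test/testFile.txt", min_read_len=10,
                               max_errors=100, raw_seq_type="Z")))
```

The returned list can be passed directly to <code>subprocess.run</code>, which avoids shell quoting problems with file names that contain spaces.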
 +
 +
 +
== FastQ Validator Output ==
 +
When running the fastQValidator Executable, the output starts with a summary of the parameters:
 +
The following parameters are in effect:
 +
              FastQ File Name :    testFile.txt (-fname)
 +
              Min Read Length :              10 (-l9999)
 +
          Max Reported Errors :            100 (-e9999)
 +
                      BaseType :              A (-bname)
 +
                      TestMode :                (-tname)
 +
 +
Both the Executable and the Library output the following:
 +
*Error messages for errors, up to the configurable maximum number of reported errors:
 +
ERROR on Line 25: The sequence identifier line was too short.
 +
ERROR on Line 29: First line of a sequence does not begin wtih @
 +
ERROR on Line 33: No Sequence Identifier specified before the comment.
 +
*Base Composition Percentages by Index:
 +
 +
Base Composition Statistics:
 +
Read Index %A %C %G %T %N Total Reads At Index
 +
        0  100.00    0.00    0.00    0.00    0.00 20
 +
        1    5.00  95.00    0.00    0.00    0.00 20
 +
        2    5.00    0.00    5.00  90.00    0.00 20
 +
*Summary of the number of lines, sequences, and errors:
 +
Finished processing testFile.txt with 92 lines containing 20 sequences.
 +
There were a total of 17 errors.
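The base composition table shown above can be reproduced in spirit with a few lines of Python. This is a sketch of the calculation only, under stated assumptions, and not the validator's implementation: <code>base_composition</code> is a hypothetical name, lower-case bases are folded to upper case, and the total at each index counts every read that is long enough to have a character there.

```python
from collections import Counter


def base_composition(sequences):
    """Return (index, {base: percent}, total_reads_at_index) per read index."""
    counts = []  # counts[i] tallies the characters seen at read index i
    for seq in sequences:
        for i, base in enumerate(seq.upper()):
            if i == len(counts):
                counts.append(Counter())
            counts[i][base] += 1

    rows = []
    for i, counter in enumerate(counts):
        total = sum(counter.values())
        percents = {b: 100.0 * counter[b] / total for b in "ACGTN"}
        rows.append((i, percents, total))
    return rows


for index, percents, total in base_composition(["ACG", "AC", "AT"]):
    cells = " ".join(f"{percents[b]:7.2f}" for b in "ACGTN")
    print(f"{index:5d} {cells} {total}")
```

Because reads can differ in length, the total at later indexes can be smaller than the number of sequences in the file.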
