Changes

From Genome Analysis Wiki

FastQValidator

1,886 bytes added, 13:52, 22 February 2010
== Status ==
The initial version of a [http://en.wikipedia.org/wiki/FASTQ_format FastQ] Validator is complete.
An initial version of the [[FastQFile]] has been completed which includes validation methods.
== Valid FastQ File Requirements ==
A valid fastQ file meets the validation criteria specified in [[FastQFile]]. The requirements listed for a valid file are:
*A base sequence should have non-zero length.
*A quality string should be present for every base sequence.
*Paired quality and base sequences should be the same length.
*Valid quality values should all have ASCII codes > 32.
*Valid bases should be ACTG or N, unless ambiguous bases are explicitly allowed by the application consuming the file. Lower case characters are allowed.
*Every entry in the file should have a unique identifier.
*Reads should be of a minimum length; many mappers will get into trouble with very short reads.
== Additional Features ==
*Base composition reported and tracked by position.
*Supports base space and color space files.
*Consumes gzipped and uncompressed text files transparently (see libcsg/InputFile.h).
*Prints error messages for errors up to the configurable maximum number of reportable errors.
*Prints a summary of the total number of errors.
*Prints the total number of lines processed as well as the total number of sequences processed.
== Additional Wishlist - Not Implemented ==
There is a series of optional capabilities a FastQ Validator could implement. Among those:
*To reduce memory usage, implement a two-pass algorithm that stores only a key for each sequence name (rather than complete sequence names) in memory (suggest a pair of options -1 -> one pass, high memory use, -2 -> two pass lower memory use, default is -1).
*Support color space files, where valid base sequences include the characters 0, 1, 2, 3, '.' (period) in addition to A, C, T, G and N (some csfastq sequence lines start with a primer base).
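The two-pass idea in the wishlist could be sketched roughly as follows. This is an illustration in Python, not the validator's implementation: the function names and the choice of an 8-byte BLAKE2 digest as the per-name key are assumptions. Pass one keeps only a fixed-size key per sequence name; pass two re-reads the file and compares full names only for keys seen more than once, so a hash collision is not reported as a duplicate.

```python
import hashlib
from collections import Counter, defaultdict


def name_key(name):
    """Fixed-size key for a sequence name (8-byte digest; an assumed choice)."""
    return hashlib.blake2b(name.encode("utf-8"), digest_size=8).digest()


def sequence_names(lines):
    """Yield the identifier of each record, assuming strict 4-line records."""
    for i, line in enumerate(lines):
        if i % 4 == 0:
            # Identifier: text after the leading '@' up to the first whitespace.
            parts = line.rstrip("\n")[1:].split(None, 1)
            yield parts[0] if parts else ""


def find_duplicate_names(read_lines):
    """read_lines is a callable returning a fresh iterator over the file's lines."""
    # Pass 1: count keys only, never storing the full names.
    key_counts = Counter(name_key(n) for n in sequence_names(read_lines()))
    suspects = {k for k, c in key_counts.items() if c > 1}
    # Pass 2: keep full names only for keys that occurred more than once.
    seen = defaultdict(Counter)
    for name in sequence_names(read_lines()):
        key = name_key(name)
        if key in suspects:
            seen[key][name] += 1
    return sorted(n for names in seen.values() for n, c in names.items() if c > 1)
```

For a file on disk, `read_lines` could be `lambda: open(path)`, or `lambda: gzip.open(path, "rt")` for a gzipped file; either way the file is read twice in exchange for the smaller memory footprint.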
== Discussion ==
* It may be useful to report 2 types of information to the user: ERROR (critical failure) and WARNING (tolerable errors).
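One way to picture the ERROR/WARNING split is a checker that tags each finding with a severity. The sketch below is in Python and handles base-space records only. Which checks count as ERROR and which as WARNING is purely an assumption made for illustration, since this page does not say which failures would be tolerable. The checks themselves come from the Valid FastQ File Requirements section, and every message except the first is made up.

```python
VALID_BASES = set("ACGTNacgtn")


def check_record(seq_id_line, bases, quality, min_read_len=10):
    """Return a list of (severity, message) pairs for one record."""
    findings = []
    if not seq_id_line.startswith("@"):
        findings.append(("ERROR", "First line of a sequence does not begin with @"))
    if len(bases) == 0:
        findings.append(("ERROR", "Base sequence has zero length"))
    if len(quality) != len(bases):
        findings.append(("ERROR", "Quality string and base sequence differ in length"))
    if any(ord(q) <= 32 for q in quality):
        findings.append(("ERROR", "Quality value with ASCII code <= 32"))
    if any(b not in VALID_BASES for b in bases):
        findings.append(("ERROR", "Invalid base character"))
    # Assumption: a short read is tolerable, so it is only a WARNING here.
    if 0 < len(bases) < min_read_len:
        findings.append(("WARNING", "Read is shorter than the minimum read length"))
    return findings


if __name__ == "__main__":
    for severity, message in check_record("r1", "ACGX", "II "):
        print(f"{severity}: {message}")
```

A caller could then stop on the first ERROR but merely count WARNINGs, which is the behaviour the discussion point seems to be after.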
 
 
 
== How to Use the fastQValidator Executable ==
'''Required Parameters:'''
-f : FastQ filename with path to be processed.
 
'''Optional Parameters:'''
-l : Minimum allowed read length (Defaults to 10).
-e : Maximum number of errors to display before suppressing them (Defaults to 20).
-b : Raw sequence type: "A"/"C"/"G"/"T"/"N" - Bases only;
"0"/"1"/"2"/"3"/"." - Color space only;
"" - Decision based on the first raw sequence character (Default);
All other characters - Bases & Color space
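
The -b rules above can be read as a small decision function. The following Python sketch is only one interpretation of that description, not the executable's code: the function name and the returned labels are hypothetical, matching is taken literally (upper case only), and the handling of an unrecognised first character in the default case is an assumption.

```python
BASE_CHARS = set("ACGTN")
COLOR_CHARS = set("0123.")


def raw_sequence_type(option="", first_raw_char=""):
    """Map a -b value to 'bases', 'color_space' or 'both' (hypothetical labels)."""
    # With the default "", decide on the first raw sequence character instead.
    key = option[:1] if option else first_raw_char[:1]
    if key in BASE_CHARS:
        return "bases"
    if key in COLOR_CHARS:
        return "color_space"
    # All other characters: accept both bases and color space.
    return "both"


if __name__ == "__main__":
    print(raw_sequence_type("A"))       # bases
    print(raw_sequence_type("Z"))       # both
    print(raw_sequence_type("", "2"))   # color_space
```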
 
'''Testing only Parameters:'''
-t : If "ReadOnly" is specified, the fastq will be read but not processed. This may be used for determining read time.
'''Usage:'''
./fastQValidator -f <fileName> -l <minReadLen> -e <maxReportedErrors> -b <rawSeqType>
 
'''Examples:'''
../fastQValidator -f testFile.txt
../fastQValidator -f testFile.txt -l 10 -b A -e 100
./fastQValidator -f test/testFile.txt -l 10 -b Z -e 100
time ./fastQValidator -f test/testFile.txt -t ReadOnly
 
 
== FastQ Validator Output ==
When running the fastQValidator Executable, the output starts with a summary of the parameters:
The following parameters are in effect:
FastQ File Name : testFile.txt (-fname)
Min Read Length : 10 (-l9999)
Max Reported Errors : 100 (-e9999)
BaseType : A (-bname)
TestMode : (-tname)
 
Both the executable and the library output the following:
*Error messages for the first configurable number of errors:
ERROR on Line 25: The sequence identifier line was too short.
ERROR on Line 29: First line of a sequence does not begin with @
ERROR on Line 33: No Sequence Identifier specified before the comment.
*Base Composition Percentages by Index:
 
Base Composition Statistics:
Read Index  %A      %C      %G      %T      %N      Total Reads At Index
0           100.00  0.00    0.00    0.00    0.00    20
1           5.00    95.00   0.00    0.00    0.00    20
2           5.00    0.00    5.00    90.00   0.00    20
*Summary of the number of lines, sequences, and errors:
Finished processing testFile.txt with 92 lines containing 20 sequences.
There were a total of 17 errors.
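
The base composition table above can be reproduced in spirit with a few lines of Python. This is a minimal sketch under assumptions, not the validator's code: the function name is hypothetical, reads are plain base-space strings, lower case is folded to upper case, and the percentage denominator is every character seen at that index.

```python
from collections import Counter


def base_composition(reads):
    """Return one (index, {base: percent}, total_reads_at_index) row per read index."""
    counts = []
    for read in reads:
        for i, base in enumerate(read.upper()):
            if i == len(counts):
                counts.append(Counter())  # first read long enough to reach index i
            counts[i][base] += 1
    rows = []
    for i, counter in enumerate(counts):
        total = sum(counter.values())
        percents = {b: 100.0 * counter[b] / total for b in "ACGTN"}
        rows.append((i, percents, total))
    return rows


if __name__ == "__main__":
    print("Read Index  %A      %C      %G      %T      %N      Total Reads At Index")
    for index, pct, total in base_composition(["ACT", "AC", "AG", "AC"]):
        cells = "".join(f"{pct[b]:<8.2f}" for b in "ACGTN")
        print(f"{index:<12}{cells}{total}")
```

Because shorter reads stop contributing at their last base, the total in the final column can shrink as the index grows.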
