== Status ==
The initial version of a [http://en.wikipedia.org/wiki/FASTQ_format FastQ] Validator is complete.
== Valid FastQ File Requirements ==
A valid FastQ file meets the validation criteria specified in [[FastQFile]]. The validator provides the following:
*Reports base composition, tracked by position.
*Supports base space and color space files.
*Consumes gzipped and uncompressed text files transparently.
*Prints error messages for errors up to the configurable maximum number of reportable errors.
*Prints a summary of the total number of errors.
*Prints the total number of lines processed as well as the total number of sequences processed.
*To reduce memory usage, implement a two-pass algorithm that stores only a key for each sequence name (rather than complete sequence names) in memory (suggest a pair of options -1 -> one pass, high memory use, -2 -> two pass lower memory use, default is -1).
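The two-pass idea in the last item could look roughly like the sketch below. It is only an illustration, not the validator's implementation: the function names, the 8-byte BLAKE2 digest used as the key, and the use of the keys to detect repeated sequence names are all assumptions made for this example. Pass one keeps only a short key per sequence name; pass two keeps full names only for keys that occurred more than once, so a key collision cannot produce a false report.

```python
import hashlib
from collections import Counter


def name_key(name: str) -> bytes:
    """Hypothetical helper: a fixed-size 8-byte key for a sequence name."""
    return hashlib.blake2b(name.encode("utf-8"), digest_size=8).digest()


def find_repeated_names(read_names):
    """Return the set of sequence names that occur more than once.

    read_names is a zero-argument callable that returns a fresh iterator
    over the sequence names, so the input can be scanned twice.
    """
    # Pass 1: count short keys only; full names are never stored here.
    key_counts = Counter(name_key(name) for name in read_names())
    suspect_keys = {key for key, count in key_counts.items() if count > 1}

    # Pass 2: store full names only when their key was seen more than once.
    seen = set()
    repeated = set()
    for name in read_names():
        if name_key(name) in suspect_keys:
            if name in seen:
                repeated.add(name)
            else:
                seen.add(name)
    return repeated


if __name__ == "__main__":
    names = ["read1", "read2", "read1", "read3"]
    print(find_repeated_names(lambda: iter(names)))
```

A one-pass variant would simply keep every full name in a set, which is faster but uses more memory for long names; that is the trade-off the suggested -1 and -2 options describe.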
== Discussion ==
* It may be useful to report 2 types of information to the user: ERROR (critical failure) and WARNING (tolerable errors).
== How to Use the fastQValidator Executable ==
'''Required Parameters:'''
-f : FastQ filename with path to be processed.
'''Optional Parameters:'''
-l : Minimum allowed read length (Defaults to 10).
-e : Maximum number of errors to display before suppressing them (Defaults to 20).
-b : Raw sequence type: "A"/"C"/"G"/"T"/"N" - Bases only;
"0"/"1"/"2"/"3"/"." - Color space only;
"" - Base Decision on the first Raw Sequence Character (Default)
All other characters - Bases & Color space
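The rules for the -b option can be summarized in a short sketch. This is not the validator's code: the function name, the returned labels, and the handling of a first raw sequence character that is neither a base nor a color space character are assumptions for illustration, and the comparison is case-sensitive here because the option values are documented in upper case.

```python
BASE_CHARS = set("ACGTN")
COLOR_CHARS = set("0123.")


def raw_sequence_mode(b_option: str, first_raw_char: str = "") -> str:
    """Map a -b value to "base", "color", or "both" (hypothetical labels).

    When b_option is empty, the decision is based on the first character
    of the first raw sequence, as described for the default setting.
    """
    selector = b_option if b_option != "" else first_raw_char
    if selector in BASE_CHARS:
        return "base"    # bases only
    if selector in COLOR_CHARS:
        return "color"   # color space only
    # Any other value allows both bases and color space.
    # Assumption: the same fallback applies when the first raw
    # sequence character is not recognized in the default case.
    return "both"


if __name__ == "__main__":
    for option in ["A", "0", "Z", ""]:
        print(repr(option), "->", raw_sequence_mode(option, first_raw_char="G"))
```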
'''Testing only Parameters:'''
-t : If "ReadOnly" is specified, the fastq will be read but not processed. This may be used for determining read time.
'''Usage:'''
./fastQValidator -f <fileName> -l <minReadLen> -e <maxReportedErrors> -b <rawSeqType>
'''Examples:'''
../fastQValidator -f testFile.txt
../fastQValidator -f testFile.txt -l 10 -b A -e 100
./fastQValidator -f test/testFile.txt -l 10 -b Z -e 100
time ./fastQValidator -f test/testFile.txt -t ReadOnly
== FastQ Validator Output ==
When running the fastQValidator Executable, the output starts with a summary of the parameters:
The following parameters are in effect:
FastQ File Name : testFile.txt (-fname)
Min Read Length : 10 (-l9999)
Max Reported Errors : 100 (-e9999)
BaseType : A (-bname)
TestMode : (-tname)
Both the Executable and the Library output the following:
*Error messages for the first configurable number of errors:
ERROR on Line 25: The sequence identifier line was too short.
ERROR on Line 29: First line of a sequence does not begin with @
ERROR on Line 33: No Sequence Identifier specified before the comment.
*Base Composition Percentages by Index:
Base Composition Statistics:
Read Index %A %C %G %T %N Total Reads At Index
0 100.00 0.00 0.00 0.00 0.00 20
1 5.00 95.00 0.00 0.00 0.00 20
2 5.00 0.00 5.00 90.00 0.00 20
*Summary of the number of lines, sequences, and errors:
Finished processing testFile.txt with 92 lines containing 20 sequences.
There were a total of 17 errors.
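The per-index base composition table shown above can be reproduced with a small sketch. This is an illustration under stated assumptions, not the validator's own code: the function name and the returned row layout are hypothetical, only base space reads are handled, and any character other than A, C, G, T, or N still counts toward the total for its index.

```python
from collections import Counter


def base_composition(sequences):
    """Return one row per read index: (index, percentages, total_reads).

    percentages is a list in the order A, C, G, T, N.
    """
    counts = []  # counts[i] tallies the characters seen at read index i
    for seq in sequences:
        for i, base in enumerate(seq.upper()):
            if i == len(counts):
                counts.append(Counter())
            counts[i][base] += 1

    rows = []
    for i, tally in enumerate(counts):
        total = sum(tally.values())
        percentages = [100.0 * tally[b] / total for b in "ACGTN"]
        rows.append((i, percentages, total))
    return rows


if __name__ == "__main__":
    print("Read Index\t%A\t%C\t%G\t%T\t%N\tTotal Reads At Index")
    for index, pct, total in base_composition(["ACT", "ACT", "ACG", "CC"]):
        cells = "\t".join(f"{p:.2f}" for p in pct)
        print(f"{index}\t{cells}\t{total}")
```

Because reads can differ in length, the total at each index is counted separately, which is why the table carries a "Total Reads At Index" column.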