Changes

FastQValidator

Revision as of 13:52, 22 February 2010

1,886 bytes added , 13:52, 22 February 2010

no edit summary

Line 1: Line 1:

== Status ==

−

The [http://en.wikipedia.org/wiki/FASTQ_format FastQ] Validator is ~~on our [[Todo List]]~~.

+

The initial version of a [http://en.wikipedia.org/wiki/FASTQ_format FastQ] Validator is complete.

−

~~An initial version of the [[FastQFile]] has been completed which includes validation methods.~~

== Valid FastQ File Requirements ==

−

A valid fastQ file ~~should meet~~ the ~~following requirements:~~

+

A valid fastQ file meets the validation criteria specified in [[FastQFile]].

−

*A base sequence should have non-zero length.

−

*A quality string should be present for every base sequence.

+

== Additional Features ==

−

*~~Paired quality~~ and base ~~sequences should be~~ of the ~~same length~~.

+

*Base composition reported and tracked by position.

+

*Supports base space and color space files.

+

*Consumes gzipped and uncompressed text files transparently.

+

*Prints error messages for errors up to the configurable maximum number of reportable errors.

+

*Prints a summary of the total number of errors.

+

*Prints the total number of lines processed as well as the total number of sequences processed.

−

*Valid quality values should all have ASCII codes > 32.

−

*Valid bases should be ACTG or N, unless ambiguous bases are explicitly allowed by the application consuming the file. Lower case characters are allowed.
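The per-record criteria listed in the earlier revision of this section (non-empty base sequence, a quality string of matching length, quality ASCII codes above 32, bases limited to A, C, G, T, or N in either case, and a minimum read length) can be sketched as follows. This is a minimal Python illustration, not the validator's own implementation: the function name <code>check_record</code> and the message strings are hypothetical, and the default minimum length of 10 is taken from the <code>-l</code> default documented on this page.

```python
VALID_BASES = set("ACGTN")  # ambiguous bases are not allowed in this sketch


def check_record(sequence, quality, min_read_length=10):
    """Return a list of problems found in one FastQ record (empty if none).

    Hypothetical helper: the messages are illustrative and do not match
    the validator's real error text.
    """
    problems = []
    if len(sequence) == 0:
        problems.append("base sequence is empty")
    elif len(sequence) < min_read_length:
        problems.append("read is shorter than the minimum length")
    if len(quality) == 0:
        problems.append("quality string is missing")
    elif len(quality) != len(sequence):
        problems.append("quality and base sequences differ in length")
    # Valid quality values must have ASCII codes greater than 32.
    if any(ord(q) <= 32 for q in quality):
        problems.append("quality value with ASCII code <= 32")
    # Lower case bases are allowed, so compare in upper case.
    if any(b.upper() not in VALID_BASES for b in sequence):
        problems.append("invalid base character")
    return problems
```

A record that passes every check yields an empty list, so a caller can count errors by summing the lengths of the returned lists.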

+

== Additional Wishlist - Not Implemented ==

−

*Every entry in the file should have a unique identifier.

+

There are a series of optional capabilities a FastQ Validator could implement. Among those:

−

*Reads should be of a minimum length; many mappers will get into trouble with very short reads.

−

*Base composition should be reported and tracked by position.

−

~~== Additional Wishlist ==~~

−

There are a series of optional capabilities a FastQ Validator ~~should~~ implement. Among those:

−

*Consume gzipped and uncompressed text files transparently (see libcsg/InputFile.h).

*To reduce memory usage, implement a two-pass algorithm that stores only a key for each sequence name (rather than complete sequence names) in memory (suggest a pair of options -1 -> one pass, high memory use, -2 -> two pass lower memory use, default is -1).

−

*Support color space files, where valid base sequences include the characters 0, 1, 2, 3, '.' (period) in addition to A, C, T, G and N (some csfastq sequence lines start with a primer base).
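One way to read the two-pass idea in the wishlist is sketched below in Python. It is an assumption about how such a scheme could work, not a description of the validator: the first pass keeps only a short hash key per sequence name, and the second pass stores full names only for keys seen more than once. The names <code>name_key</code> and <code>find_duplicate_names</code>, the 8-byte key size, and the use of BLAKE2b are all hypothetical choices.

```python
import hashlib
from collections import Counter


def name_key(name):
    """Return an 8-byte digest that stands in for the full sequence name."""
    return hashlib.blake2b(name.encode("utf-8"), digest_size=8).digest()


def find_duplicate_names(open_names):
    """Find repeated sequence names using two passes over the input.

    open_names is a zero-argument callable that returns a fresh iterator
    over the sequence names (hypothetical interface for this sketch).
    """
    # Pass 1: count short keys only, never storing complete names.
    key_counts = Counter(name_key(name) for name in open_names())
    suspects = {key for key, count in key_counts.items() if count > 1}

    # Pass 2: keep full names only for keys that occurred more than once,
    # which separates true duplicates from hash collisions.
    seen, duplicates = set(), set()
    for name in open_names():
        if name_key(name) in suspects:
            if name in seen:
                duplicates.add(name)
            else:
                seen.add(name)
    return duplicates
```

The trade-off is the one the wishlist item names: the file is read twice, but memory holds one short key per sequence rather than every complete name.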

== Discussion ==

Line 44: Line 35:

* It may be useful to report 2 types of information to the user: ERROR (critical failure) and WARNING (tolerable errors).

+

== How to Use the fastQValidator Executable ==

+

'''Required Parameters:'''

+

-f : FastQ filename with path to be processed.

+

'''Optional Parameters:'''

+

-l : Minimum allowed read length (Defaults to 10).

+

-e : Maximum number of errors to display before suppressing them (Defaults to 20).

+

-b : Raw sequence type: "A"/"C"/"G"/"T"/"N" - Bases only;

+

"0"/"1"/"2"/"3"/"." - Color space only;

+

"" - Base Decision on the first Raw Sequence Character (Default)

+

All other characters - Bases & Color space
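The <code>-b</code> rules above can be restated as a small decision function. The Python sketch below is only an illustration of the documented behavior: the name <code>allowed_characters</code> is hypothetical, and the handling of an unrecognized first raw sequence character in the default case (accept both alphabets) is an assumption, since this page does not specify it.

```python
BASES = frozenset("ACGTN")
COLORS = frozenset("0123.")


def allowed_characters(option, first_raw_char):
    """Return the set of characters accepted in raw sequences.

    option is the value given to -b ("" when the flag is omitted);
    first_raw_char is the first character of the first raw sequence.
    """
    # With no option, decide from the first raw sequence character.
    probe = (option or first_raw_char)[:1].upper()
    if probe in BASES:
        return BASES        # bases only
    if probe in COLORS:
        return COLORS       # color space only
    return BASES | COLORS   # any other character: bases and color space
```

For example, <code>allowed_characters("Z", "A")</code> returns both alphabets, matching the <code>-b Z</code> usage example on this page.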

+

'''Testing only Parameters:'''

+

-t : If "ReadOnly" is specified, the fastq will be read but not processed. This may be used for determining read time.

+

'''Usage:'''

+

./fastQValidator -f <fileName> -l <minReadLen> -e <maxReportedErrors> -b <rawSeqType>

+

'''Examples:'''

+

../fastQValidator -f testFile.txt

+

../fastQValidator -f testFile.txt -l 10 -b A -e 100

+

./fastQValidator -f test/testFile.txt -l 10 -b Z -e 100

+

time ./fastQValidator -f test/testFile.txt -t ReadOnly

+

== FastQ Validator Output ==

+

When running the fastQValidator Executable, the output starts with a summary of the parameters:

+

The following parameters are in effect:

+

FastQ File Name : testFile.txt (-fname)

+

Min Read Length : 10 (-l9999)

+

Max Reported Errors : 100 (-e9999)

+

BaseType : A (-bname)

+

TestMode : (-tname)

+

Both the Executable and the Library output the following:

+

*Error messages, up to the configurable maximum number of reported errors:

+

ERROR on Line 25: The sequence identifier line was too short.

+

ERROR on Line 29: First line of a sequence does not begin wtih @

+

ERROR on Line 33: No Sequence Identifier specified before the comment.

+

*Base Composition Percentages by Index:

+

Base Composition Statistics:

+

Read Index  %A      %C      %G      %T      %N      Total Reads At Index

+

0           100.00  0.00    0.00    0.00    0.00    20

+

1           5.00    95.00   0.00    0.00    0.00    20

+

2           5.00    0.00    5.00    90.00   0.00    20

+

*Summary of the number of lines, sequences, and errors:

+

Finished processing testFile.txt with 92 lines containing 20 sequences.

+

There were a total of 17 errors.
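A script that wraps the validator may want the totals from the closing summary. The Python sketch below parses the two summary lines exactly as they appear in the sample output on this page; it assumes that wording is stable, which this page does not guarantee, and the name <code>parse_summary</code> is hypothetical.

```python
import re

# Patterns copied from the sample summary lines shown on this page.
SUMMARY_RE = re.compile(
    r"Finished processing (?P<file>.+) with (?P<lines>\d+) lines "
    r"containing (?P<sequences>\d+) sequences\."
)
ERRORS_RE = re.compile(r"There were a total of (?P<errors>\d+) errors\.")


def parse_summary(output):
    """Extract file name, line, sequence, and error totals from the output.

    Returns None when either summary line is missing.
    """
    summary = SUMMARY_RE.search(output)
    errors = ERRORS_RE.search(output)
    if summary is None or errors is None:
        return None
    return {
        "file": summary.group("file"),
        "lines": int(summary.group("lines")),
        "sequences": int(summary.group("sequences")),
        "errors": int(errors.group("errors")),
    }
```

Returning <code>None</code> rather than raising keeps the helper usable on partial output, for example when a run was interrupted before the summary was printed.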

Mktrost

Administrators

3,045

edits
