Changes

LibStatGen: FASTQ (view source)

Revision as of 13:52, 22 February 2010

3,533 bytes removed, 13:52, 22 February 2010

no edit summary

Line 9: Line 9:

** Parameters:

*** String filename - fastq file to be opened.

−

*** ~~String baseType- Raw sequence type enter~~:

+

*** BaseAsciiMap::SPACE_TYPE spaceType:

−

**** ~~"A"/"C"/"G"/"T"/"N"~~ - Bases only;

+

**** BASE_SPACE - Bases only (A,C,G,T,N)

−

**** ~~"0"/"1"/"2"/"3"/"."~~ - Color space only;

+

**** COLOR_SPACE - Color space only (0,1,2,3,.)

−

**** "" - Base Decision on the first Raw Sequence Character (Default).

+

**** UNKNOWN - Base Decision on the first Raw Sequence Character (Default).

−

**** All other characters - Bases & Color space

** Return Value

*** FastQStatus: FASTQ_SUCCESS if successfully opened, FASTQ_FAILURE if not.
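The UNKNOWN default described above, where the space type is decided from the first raw sequence character, can be sketched as follows. This is a minimal illustration and not the library's code: `SpaceType`, `isBaseChar`, `isColorChar`, and `resolveSpaceType` are hypothetical names, and the sketch assumes the first character alone decides the type (it does not treat a leading color-space primer base specially).

```cpp
#include <cassert>
#include <string>

// Hypothetical stand-in for BaseAsciiMap::SPACE_TYPE (not the library's definition).
enum class SpaceType { UNKNOWN, BASE_SPACE, COLOR_SPACE };

// True if c is a base-space character: A C G T N in either case.
bool isBaseChar(char c) {
    return std::string("ACGTNacgtn").find(c) != std::string::npos;
}

// True if c is a color-space character: 0 1 2 3 or '.'.
bool isColorChar(char c) {
    return std::string("0123.").find(c) != std::string::npos;
}

// If the caller asked for a specific space type, keep it. Otherwise decide
// from the first character of the raw sequence; stay UNKNOWN if the sequence
// is empty or starts with a character from neither alphabet.
SpaceType resolveSpaceType(SpaceType requested, const std::string& rawSequence) {
    if (requested != SpaceType::UNKNOWN || rawSequence.empty()) {
        return requested;
    }
    const char first = rawSequence[0];
    if (isBaseChar(first)) return SpaceType::BASE_SPACE;
    if (isColorChar(first)) return SpaceType::COLOR_SPACE;
    return SpaceType::UNKNOWN;
}
```

Once the type is resolved, the same two predicates could be used to flag characters that do not belong to the chosen alphabet.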

Line 24: Line 23:

** Return Value

*** bool: true if a file is open, false if not.

−

*'''Validate a FastQ File:''' validateFastQFile

+

*'''Validate an entire FastQ File:''' validateFastQFile

** Parameters:

*** String filename - fastq file to be validated.

−

*** ~~String baseType- Raw sequence type enter~~:

+

*** BaseAsciiMap::SPACE_TYPE spaceType:

−

**** ~~"A"/"C"/"G"/"T"/"N"~~ - Bases only;

+

**** BASE_SPACE - Bases only;

−

**** ~~"0"/"1"/"2"/"3"/"."~~ - Color space only;

+

**** COLOR_SPACE - Color space only;

−

**** "" - Base Decision on the first Raw Sequence Character (Default).

+

**** UNKNOWN - Base Decision on the first Raw Sequence Character (Default).

−

**** All other characters - Bases & Color space

** Return Value

*** bool: true if there were no errors in the file, false otherwise.

Line 45: Line 43:

** Parameters: NONE

** Return Value

−

*** BaseAsciiMap::SPACE_TYPE: COLOR_SPACE if the file is color space (0,1,2,3,.), BASE_SPACE if the file is base space (A,C,G,T,N)~~, BOTH_SPACE if the file is both (0,1,2,3,.,A,C,G,T,N), or~~ UNKNOWN if it has yet to be determined.

+

*** BaseAsciiMap::SPACE_TYPE:

+

**** COLOR_SPACE if the file is color space (0,1,2,3,.)

+

**** BASE_SPACE if the file is base space (A,C,G,T,N)

+

**** UNKNOWN if it has yet to be determined.

*'''Access last read Sequence Lines'''

** Public String Variables to avoid having to copy the strings:

Line 55: Line 56:

−

== Validation Criteria ==

+

== Validation Criteria Used for Reporting Errors as a Sequence is Read ==

{| class="wikitable" style="width:100%" border="1"

|+ style="font-size:150%"|'''Sequence Identifier Line'''

Line 86: Line 87:

* Base Only: A C T G N a c t g n

* Color Space Only: 0 1 2 3 .(period)

−

* Base or Color Space: A C T G N a c t g n 0 1 2 3 .(period)

| ERROR on Line <current line #>: Invalid character ('<invalid char>') in base sequence.

|-

Line 127: Line 127:

−

== ~~Additional Features~~ ==

+

== Reading Sequence Assumptions ==

−

*Base composition is reported and tracked by position.

−

*Consumes gzipped and uncompressed text files transparently (see libcsg/InputFile.h).

−

*Prints error messages for errors up to the configurable maximum number of reportable errors. A summary of the total number of errors is also printed.

−

*Prints the total number of lines processed as well as the total number of sequences processed.

−

== Assumptions ==

*The Sequence Identifier is separated from the optional comment by a space (" ").

*No validation is required on the optional comment field of the Sequence Identifier Line.

Line 141: Line 134:

*All lines are part of the Raw Sequence Line until a line that starts with a '+' is discovered.

*All lines are considered part of the quality string until at least the length of the associated raw sequence is reached (or the end of the file is reached). This is because '@' is a valid quality character, so it does not necessarily indicate the start of a Sequence Identifier Line.
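The two reading assumptions above (raw sequence lines continue until a line starting with '+', and quality lines continue until the quality string is at least as long as the raw sequence) can be sketched as a small record reader. This is an illustration under stated assumptions, not the library's implementation: `FastqRecord` and `readRecord` are hypothetical names, and the sketch does no character validation or error reporting.

```cpp
#include <cassert>
#include <istream>
#include <sstream>
#include <string>

// Hypothetical container for one sequence; field names are illustrative only.
struct FastqRecord {
    std::string identifierLine;  // the '@' line, including any comment
    std::string rawSequence;     // all raw sequence lines joined together
    std::string plusLine;        // the '+' line
    std::string quality;         // all quality lines joined together
};

// Reads one record from the stream. Returns false at end of input, or if
// the record does not start with '@' or has no '+' line.
bool readRecord(std::istream& in, FastqRecord& rec) {
    rec = FastqRecord();
    if (!std::getline(in, rec.identifierLine)) return false;
    if (rec.identifierLine.empty() || rec.identifierLine[0] != '@') return false;

    // Every line belongs to the raw sequence until one starts with '+'.
    std::string line;
    bool sawPlus = false;
    while (std::getline(in, line)) {
        if (!line.empty() && line[0] == '+') {
            rec.plusLine = line;
            sawPlus = true;
            break;
        }
        rec.rawSequence += line;
    }
    if (!sawPlus) return false;

    // Keep reading quality lines until the quality string is at least as long
    // as the raw sequence (or the input ends). A leading '@' here is treated
    // as a quality character, not as the start of a new record.
    while (rec.quality.size() < rec.rawSequence.size() && std::getline(in, line)) {
        rec.quality += line;
    }
    return true;
}
```

Using the length of the raw sequence as the stopping condition is what lets a quality line beginning with '@' be read correctly instead of being mistaken for the next Sequence Identifier Line.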

+

== Additional Features ==

+

*Consumes gzipped and uncompressed text files transparently.

+

*Prints error messages for errors up to the configurable maximum number of reportable errors.
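The cap on reported errors can be sketched as a counter that keeps counting every error but stops printing once the configured maximum is reached. `ErrorReporter` and its members are hypothetical names used only for illustration; the message format follows the "ERROR on Line <current line #>: ..." pattern shown in the validation tables, and writing to standard error is an assumption of this sketch.

```cpp
#include <cassert>
#include <iostream>
#include <string>

// Hypothetical helper: counts all errors, prints only the first maxReported.
class ErrorReporter {
public:
    explicit ErrorReporter(int maxReported) : myMaxReported(maxReported) {}

    // Records one error. Returns true if the message was printed,
    // false if it was suppressed because the cap was already reached.
    bool report(int lineNumber, const std::string& message) {
        ++myTotal;
        if (myTotal > myMaxReported) {
            return false;
        }
        std::cerr << "ERROR on Line " << lineNumber << ": " << message << "\n";
        return true;
    }

    // Total number of errors seen, including suppressed ones.
    int total() const { return myTotal; }

private:
    int myMaxReported;
    int myTotal = 0;
};
```

Keeping the total separate from the printed count means a summary of all errors remains available even after messages are suppressed.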

== Additional Wishlist - Not Implemented ==

−

*To reduce memory usage, implement a two-pass algorithm that stores only a key for each sequence name (rather than complete sequence names) in memory (suggest a pair of options -1 -> one pass, high memory use, -2 -> two pass lower memory use, default is -1).

*Add an option that would reject raw sequence and quality strings that wrap over multiple lines. It would only allow 1 line per raw sequence/quality string.

−

*Maybe report 2 types of information to the user: ERROR (critical failure) and WARNING (tolerable errors).

+

*Maybe report 2 types of information to the user: ERROR (critical failure) and WARNING (tolerable errors).

−

~~== Possible Issues ==~~

−

* For color space, there is no specification for:

−

~~# The length of read and quality string may be the same or differs by 1 (depending on whether the primer base has a corresponding quality value).~~

−

~~# Missing values are usually presented by "." or sometimes left as a blank " ".~~

−

~~# Tag names for paired end reads may be the same (e.g. MAQ actually enforces that), and may be in the same file (e.g. BFAST require paired reads in the same file)~~

−

~~== How to Use the fastQValidator Executable ==~~

−

~~'''Required Parameters:'''~~

−

~~-f : FastQ filename with path to be processed.~~

−

~~'''Optional Parameters:'''~~

−

~~-l : Minimum allowed read length (Defaults to 10).~~

−

~~-e : Maximum number of errors to display before suppressing them (Defaults to 20).~~

−

~~-b : Raw sequence type: "A"/"C"/"G"/"T"/"N" - Bases only;~~

−

~~"0"/"1"/"2"/"3"/"." - Color space only;~~

−

~~"" - Base Decision on the first Raw Sequence Character (Default)~~

−

~~All other characters - Bases & Color space~~

−

~~'''Testing only Parameters:'''~~

−

~~-t : If "ReadOnly" is specified, the fastq will be read but not processed. This may be used for determining read time.~~

−

~~'''Usage:'''~~

−

~~./fastQValidator -f <fileName> -l <minReadLen> -e <maxReportedErrors> -b <rawSeqType>~~

−

~~'''Examples:'''~~

−

~~../fastQValidator -f testFile.txt~~

−

~~../fastQValidator -f testFile.txt -l 10 -b A -e 100~~

−

~~./fastQValidator -f test/testFile.txt -l 10 -b Z -e 100~~

−

~~time ./fastQValidator -f test/testFile.txt -t ReadOnly~~

−

~~== FastQ Validator Output ==~~

−

~~When running the fastQValidator Executable, the output starts with a summary of the parameters:~~

−

~~The following parameters are in effect:~~

−

~~FastQ File Name : testFile.txt (-fname)~~

−

~~Min Read Length : 10 (-l9999)~~

−

~~Max Reported Errors : 100 (-e9999)~~

−

~~BaseType : A (-bname)~~

−

~~TestMode : (-tname)~~

−

~~Both the Executable and the Library output the following:~~

−

*Error messages for the first configurable number of errors:

−

~~ERROR on Line 25: The sequence identifier line was too short.~~

−

~~ERROR on Line 29: First line of a sequence does not begin wtih @~~

−

~~ERROR on Line 33: No Sequence Identifier specified before the comment.~~

−

*Base Composition Percentages by Index:

−

~~Base Composition Statistics:~~

−

~~Read Index %A %C %G %T %N Total Reads At Index~~

−

~~0 100.00 0.00 0.00 0.00 0.00 20~~

−

~~1 5.00 95.00 0.00 0.00 0.00 20~~

−

~~2 5.00 0.00 5.00 90.00 0.00 20~~

−

*Summary of the number of lines, sequences, and errors:

−

~~Finished processing testFile.txt with 92 lines containing 20 sequences.~~

−

~~There were a total of 17 errors~~.

Mktrost

Administrators

3,045

edits
