Difference between revisions of "SAM Validation Criteria"
From Genome Analysis Wiki
Revision as of 12:01, 18 March 2010
SAM Header Validation Rules
TODO
SAM Alignment Validation
Validation Criteria | Implemented (SAM) | Implemented (BAM) | Tested (SAM) | Tested (BAM) |
---|---|---|---|---|
QNAME.Length() > 0 and <= 254 | ||||
QNAME does not contain [ \t\n\r] | ||||
FLAG is an integer [0-9]+ | ||||
FLAG < 2048 (I think) or [0, (2^16)-1] | ||||
RNAME does not contain [ \t\n\r@=] | ||||
POS is an integer [0-9]+ | ||||
POS is [0, (2^29)-1] | ||||
MAPQ is an integer [0-9]+ | ||||
MAPQ is [0, (2^8)-1] | ||||
CIGAR ([0-9]+[MIDNSHP])+|\* | ||||
MRNM does not contain [ \t\n\r@] ('=' means it is the same as RNAME) | ||||
If SQ is in the header RNAME & MRNM (if not “=”) must be in SQ. | ||||
MPOS is an integer [0-9]+ | ||||
MPOS is [0, (2^29)-1] | ||||
ISIZE is an integer -?[0-9]+ | ||||
ISIZE is [-(2^29), 2^29] | ||||
SEQ is [acgtnACGTN.=]+|\* | ||||
If SEQ is * then QUAL is * | ||||
QUAL is [!-~]+|\* → decimal 33-126; "*" is decimal 42, which lies within 33-126 (for BAM, each value is in [0, 93]) | ||||
If QUAL is not “*” it is the same length as SEQ. | ||||
TAG is [A-Z][A-Z0-9] | ||||
A TAG only appears once per alignment | ||||
VTYPE is [AifZH] for SAM and [AcCsSiIfZH] for BAM | ||||
VALUE does NOT contain [\t\n\r] | ||||
For VTYPE = “A”, VALUE is a printable character | ||||
For VTYPE = “i”, VALUE is a signed 32-bit integer. | ||||
For VTYPE = “f”, VALUE is a single-precision float. | ||||
For VTYPE = “Z”, VALUE is a printable string. | ||||
For VTYPE = “H”, VALUE is a Hex string. |
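
The mandatory-field criteria in the table can be sketched in Python as follows. This is a minimal illustration, not the validator's actual implementation: the function and constant names are hypothetical, the FLAG range uses the wider of the two candidates listed above, and the header-dependent SQ check is left out.

```python
import re

# Patterns transcribed from the criteria table; all names are illustrative.
TEXT_RULES = {
    "QNAME": re.compile(r"[^ \t\n\r]{1,254}"),
    "RNAME": re.compile(r"[^ \t\n\r@=]+"),
    "CIGAR": re.compile(r"([0-9]+[MIDNSHP])+|\*"),
    "MRNM": re.compile(r"[^ \t\n\r@]+"),  # "=" is allowed: same as RNAME
    "SEQ": re.compile(r"[acgtnACGTN.=]+|\*"),
    "QUAL": re.compile(r"[!-~]+"),  # "*" (decimal 42) is already in this range
}

UINT_RE = re.compile(r"[0-9]+")
INT_RE = re.compile(r"-?[0-9]+")

# field name -> (digit pattern, minimum, maximum)
INT_RULES = {
    "FLAG": (UINT_RE, 0, 2**16 - 1),  # assumption: the wider candidate range
    "POS": (UINT_RE, 0, 2**29 - 1),
    "MAPQ": (UINT_RE, 0, 2**8 - 1),
    "MPOS": (UINT_RE, 0, 2**29 - 1),
    "ISIZE": (INT_RE, -(2**29), 2**29),
}

FIELD_NAMES = ["QNAME", "FLAG", "RNAME", "POS", "MAPQ", "CIGAR",
               "MRNM", "MPOS", "ISIZE", "SEQ", "QUAL"]


def validate_mandatory_fields(line):
    """Return a list of error strings for one SAM alignment line."""
    fields = line.rstrip("\n").split("\t")
    if len(fields) < len(FIELD_NAMES):
        return [f"expected at least 11 fields, found {len(fields)}"]
    rec = dict(zip(FIELD_NAMES, fields))  # extra optional fields are ignored
    errors = []
    for name, pattern in TEXT_RULES.items():
        if not pattern.fullmatch(rec[name]):
            errors.append(f"{name} has invalid content: {rec[name]!r}")
    for name, (pattern, lo, hi) in INT_RULES.items():
        text = rec[name]
        if not pattern.fullmatch(text):
            errors.append(f"{name} is not an integer: {text!r}")
        elif not lo <= int(text) <= hi:
            errors.append(f"{name} is outside [{lo}, {hi}]: {text}")
    if rec["SEQ"] == "*" and rec["QUAL"] != "*":
        errors.append("SEQ is * but QUAL is not")
    if rec["QUAL"] != "*" and len(rec["QUAL"]) != len(rec["SEQ"]):
        errors.append("QUAL and SEQ differ in length")
    return errors
```

Returning a list of messages rather than stopping at the first failure lets a single pass report every criterion that a record violates.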
NOTE: There are other TAG Validations that can be done. They will come later.
NOTE: There may be other BAM Validations that can be done. They will come later.
Consider validating the CIGAR string against the read length.
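
One way such a check might look, as a sketch only: it assumes that the M, I and S operations are the ones that consume bases of the read, and that the check is skipped when either CIGAR or SEQ is "*". The function names are hypothetical.

```python
import re

CIGAR_OP_RE = re.compile(r"([0-9]+)([MIDNSHP])")
# Assumption: only these operations consume bases of the read sequence.
QUERY_CONSUMING = {"M", "I", "S"}


def cigar_query_length(cigar):
    """Sum the lengths of the CIGAR operations that consume read bases."""
    return sum(int(count) for count, op in CIGAR_OP_RE.findall(cigar)
               if op in QUERY_CONSUMING)


def cigar_matches_seq(cigar, seq):
    """True if the CIGAR-implied read length equals len(SEQ), or if either is '*'."""
    if cigar == "*" or seq == "*":
        return True  # nothing to compare against
    return cigar_query_length(cigar) == len(seq)
```

For example, `3S10M2I5D4M1H` implies a read of 3 + 10 + 2 + 4 = 19 bases, since D and H do not add to the stored sequence.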
SAM Questions
- Comment says: “If the mapping position of the query is not available, RNAME and CIGAR are set as “*”, and POS and MAPQ as 0.” Is it all or nothing? Can some be set to “*”/0 but not all?
- Same question for MRNM = “*” and MPOS & ISIZE = 0
- Comment says: “The name of a pair/read is required to be unique in the SAM file, but one pair/read may appear multiple times in different alignment records, representing multiple or split hits.” Is there anything here that needs to be validated?