SAM Validation Criteria

From Genome Analysis Wiki

Revision as of 14:44, 22 March 2010

SAM Header Validation Rules

TODO


SAM Alignment Validation

SAM Alignment Record
Validation Criteria Implemented Tested
SAM BAM SAM BAM
QNAME.Length() > 0 and <= 254
QNAME does not contain [ \t\n\r]
FLAG is an integer [0-9]+
FLAG is [0, (2^16)-1] (< 2048 if only bits 0x1 through 0x400 are set)
RNAME does not contain [ \t\n\r@=]
POS is an integer [0-9]+
POS is [0, (2^29)-1]
MAPQ is an integer [0-9]+
MAPQ is [0, (2^8)-1]
CIGAR ([0-9]+[MIDNSHP])+|\*
MRNM does not contain [ \t\n\r@] ('=' means it is the same as RNAME)
If SQ is in the header, RNAME and MRNM (if not “=”) must be in SQ.
MPOS is an integer [0-9]+
MPOS is [0, (2^29)-1]
ISIZE is an integer -?[0-9]+
ISIZE is [-(2^29), 2^29]
SEQ is [acgtnACGTN.=]+|\*
If SEQ is * then QUAL is *
QUAL is [!-~]+|\* → dec 33 – 126; “*” is dec 42, which is already in 33 – 126 (for BAM, values are in [0, 93])
If QUAL is not “*” it is the same length as SEQ.
TAG is [A-Z][A-Z0-9]
A TAG only appears once per alignment
VTYPE is [AifZH] for SAM and [AcCsSiIfZH] for BAM
VALUE does NOT contain [\t\n\r]
For VTYPE = “A”, VALUE is a printable character
For VTYPE = “i”, VALUE is a signed 32-bit integer.
For VTYPE = “f”, VALUE is a single-precision float.
For VTYPE = “Z”, VALUE is a printable string.
For VTYPE = “H”, VALUE is a Hex string.
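
The criteria above for the eleven mandatory fields could be checked along the following lines. This is only a sketch: the function name validate_alignment, the helper int_in_range and the returned labels are made up for illustration, and exempting “*” from the SQ lookup is an assumption (see the questions at the end of this page).

```python
import re

# Patterns transcribed from the criteria table; all names here are illustrative.
UINT_RE = re.compile(r"[0-9]+")
INT_RE = re.compile(r"-?[0-9]+")
QNAME_RE = re.compile(r"[^ \t\n\r]{1,254}")
RNAME_RE = re.compile(r"[^ \t\n\r@=]+")
MRNM_RE = re.compile(r"[^ \t\n\r@]+")
CIGAR_RE = re.compile(r"([0-9]+[MIDNSHP])+|\*")
SEQ_RE = re.compile(r"[acgtnACGTN.=]+|\*")
QUAL_RE = re.compile(r"[!-~]+")  # "*" (decimal 42) is already inside this range


def int_in_range(text, pattern, low, high):
    """True if text matches the integer pattern and lies in [low, high]."""
    return pattern.fullmatch(text) is not None and low <= int(text) <= high


def validate_alignment(line, sq_names=None):
    """Return the labels of the criteria that one SAM alignment line fails."""
    fields = line.rstrip("\n").split("\t")
    if len(fields) < 11:
        return ["fewer than 11 mandatory fields"]
    qname, flag, rname, pos, mapq, cigar, mrnm, mpos, isize, seq, qual = fields[:11]

    checks = [
        ("QNAME", QNAME_RE.fullmatch(qname) is not None),
        ("FLAG", int_in_range(flag, UINT_RE, 0, 2**16 - 1)),
        ("RNAME", RNAME_RE.fullmatch(rname) is not None),
        ("POS", int_in_range(pos, UINT_RE, 0, 2**29 - 1)),
        ("MAPQ", int_in_range(mapq, UINT_RE, 0, 2**8 - 1)),
        ("CIGAR", CIGAR_RE.fullmatch(cigar) is not None),
        ("MRNM", MRNM_RE.fullmatch(mrnm) is not None),
        ("MPOS", int_in_range(mpos, UINT_RE, 0, 2**29 - 1)),
        ("ISIZE", int_in_range(isize, INT_RE, -(2**29), 2**29)),
        ("SEQ", SEQ_RE.fullmatch(seq) is not None),
        ("QUAL", QUAL_RE.fullmatch(qual) is not None),
        ("SEQ/QUAL both *", seq != "*" or qual == "*"),
        ("QUAL length", qual == "*" or len(qual) == len(seq)),
    ]
    if sq_names is not None:
        # Assumption: "*" (no reference) is exempt from the SQ lookup.
        checks.append(("RNAME in SQ", rname == "*" or rname in sq_names))
        checks.append(("MRNM in SQ", mrnm in ("=", "*") or mrnm in sq_names))
    return [name for name, ok in checks if not ok]
```

Pass sq_names as the set of SN values from the header when SQ lines are present, and leave it as None otherwise. An empty list means the line passed every check in the sketch.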

NOTE: There are other TAG Validations that can be done. They will come later.
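
The TAG, VTYPE and VALUE criteria already listed in the table could be sketched for SAM as follows. The table only says "printable", "Hex string" and so on, so the exact character classes and the float pattern below are assumptions, as are all function and variable names.

```python
import math
import re
import struct

TAG_RE = re.compile(r"[A-Z][A-Z0-9]")
INT_RE = re.compile(r"-?[0-9]+")
FLOAT_RE = re.compile(r"[-+]?[0-9]*\.?[0-9]+([eE][-+]?[0-9]+)?")  # assumed syntax
HEX_RE = re.compile(r"[0-9A-Fa-f]+")       # assumed: either letter case
PRINTABLE_CHAR_RE = re.compile(r"[!-~]")   # assumed: ASCII 33 to 126
PRINTABLE_STR_RE = re.compile(r"[ !-~]+")  # assumed: space is also allowed


def fits_float32(text):
    """True if text parses as a finite number that fits in single precision."""
    if FLOAT_RE.fullmatch(text) is None:
        return False
    value = float(text)
    if not math.isfinite(value):
        return False
    try:
        struct.pack("<f", value)  # raises OverflowError when out of range
    except OverflowError:
        return False
    return True


# One checker per SAM VTYPE; BAM-only types (cCsSI) are not covered here.
VALUE_CHECKS = {
    "A": lambda v: PRINTABLE_CHAR_RE.fullmatch(v) is not None,
    "i": lambda v: INT_RE.fullmatch(v) is not None and -(2**31) <= int(v) <= 2**31 - 1,
    "f": fits_float32,
    "Z": lambda v: PRINTABLE_STR_RE.fullmatch(v) is not None,
    "H": lambda v: HEX_RE.fullmatch(v) is not None,
}


def validate_optional_fields(fields):
    """Return error strings for a list of TAG:VTYPE:VALUE fields."""
    errors = []
    seen = set()
    for field in fields:
        parts = field.split(":", 2)
        if len(parts) != 3:
            errors.append(f"{field}: not TAG:VTYPE:VALUE")
            continue
        tag, vtype, value = parts
        if TAG_RE.fullmatch(tag) is None:
            errors.append(f"{tag}: bad TAG")
        if tag in seen:
            errors.append(f"{tag}: duplicate TAG")
        seen.add(tag)
        checker = VALUE_CHECKS.get(vtype)
        if checker is None:
            errors.append(f"{tag}: bad VTYPE {vtype}")
        elif any(c in value for c in "\t\n\r") or not checker(value):
            errors.append(f"{tag}: bad VALUE for VTYPE {vtype}")
    return errors
```

The input is everything after the eleventh column of an alignment line, already split on tabs.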

NOTE: There may be other BAM Validations that can be done. They will come later.

Consider validating the CIGAR string against the read length.
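
A possible shape for that check, as a sketch only: among the operations M, I, D, N, S, H and P, the ones that consume bases of SEQ are M, I and S, so their lengths should add up to the length of SEQ. The function names are hypothetical, and skipping the check when either field is “*” is an assumption.

```python
import re

CIGAR_OP_RE = re.compile(r"([0-9]+)([MIDNSHP])")
QUERY_CONSUMING = frozenset("MIS")  # operations that use up bases of SEQ


def cigar_query_length(cigar):
    """Sum the lengths of the CIGAR operations that consume SEQ."""
    return sum(
        int(length)
        for length, op in CIGAR_OP_RE.findall(cigar)
        if op in QUERY_CONSUMING
    )


def cigar_matches_seq(cigar, seq):
    """True if CIGAR and SEQ agree on the read length (or either is "*")."""
    if cigar == "*" or seq == "*":
        return True  # assumption: nothing to compare
    return cigar_query_length(cigar) == len(seq)
```

This assumes the CIGAR string has already passed the pattern check in the table, since findall silently skips anything it does not recognise.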


Other Read Validation

SAM Alignment Record
Validation Criteria Implemented Tested
SAM BAM SAM BAM
If the SO (sort order) tag is set in the header, fail if the file is not sorted accordingly.
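
One way this check might look, assuming SO takes the values “coordinate” or “queryname” and that records arrive as hypothetical (QNAME, RNAME, POS) tuples. Ordering references by their position in the SQ list, placing “*” last, and comparing query names as plain strings are all assumptions of this sketch.

```python
def is_sorted(records, sort_order, sq_names):
    """Check (qname, rname, pos) tuples against the declared sort order."""
    if sort_order not in ("coordinate", "queryname"):
        return True  # unsorted or unknown: nothing to enforce

    # Rank references by header order; unknown names and "*" sort last.
    rank = {name: index for index, name in enumerate(sq_names)}

    def sort_key(record):
        qname, rname, pos = record
        if sort_order == "queryname":
            return (qname,)
        return (rank.get(rname, len(rank)), pos)

    previous = None
    for record in records:
        current = sort_key(record)
        if previous is not None and current < previous:
            return False  # first out-of-order record fails the file
        previous = current
    return True
```

Because it only remembers the previous key, the check can run over a file in a single pass without holding the records in memory.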


SAM Questions

  • Comment says: “If the mapping position of the query is not available, RNAME and CIGAR are set as “*”, and POS and MAPQ as 0.” Is it all or nothing? Can some be set to “*”/0 but not all?
    • Same question for MRNM = “*” and MPOS & ISIZE = 0
  • Comment says: “The name of a pair/read is required to be unique in the SAM file, but one pair/read may appear multiple times in different alignment records, representing multiple or split hits.” Is there anything here that needs to be validated?