SAM Validation Criteria

From Genome Analysis Wiki
Revision as of 14:44, 17 March 2010 by Mktrost (talk | contribs)

SAM Header Validation Rules

TODO

SAM Alignment Validation

SAM Alignment Record
Validation Criteria Implemented Tested
QNAME.Length() > 0 and <= 254
QNAME does not contain [ \t\n\r]
FLAG is an integer [0-9]+
FLAG is [0, (2^16)-1] (possibly FLAG < 2048 if only the currently defined flag bits may be set; unconfirmed)
RNAME does not contain [ \t\n\r@=]
POS is an integer [0-9]+
POS is [0, (2^29)-1]
MAPQ is an integer [0-9]+
MAPQ is [0, (2^8)-1]
CIGAR is ([0-9]+[MIDNSHP])+|\*
MRNM does not contain [ \t\n\r@] ('=' means it is the same as RNAME)
If SQ is in the header, RNAME and MRNM (if not “=”) must be in SQ.
MPOS is an integer [0-9]+
MPOS is [0, (2^29)-1]
ISIZE is an integer -?[0-9]+
ISIZE is [-(2^29), 2^29]
SEQ is [acgtnACGTN.=]+|\*
If SEQ is * then QUAL is *
QUAL is [!-~]+|\* → each character is ASCII decimal 33 to 126; “*” is decimal 42, which is already within 33 to 126 (for BAM, each quality value is in [0, 93])
If QUAL is not “*” it is the same length as SEQ.
TAG is [A-Z][A-Z0-9]
A TAG only appears once per alignment
VTYPE is [AifZH] for SAM and [AcCsSiIfZH] for BAM
VALUE does NOT contain [\t\n\r]
For VTYPE = “A”, VALUE is a printable character
For VTYPE = “i”, VALUE is a signed 32-bit integer.
For VTYPE = “f”, VALUE is a single-precision float.
For VTYPE = “Z”, VALUE is a printable string.
For VTYPE = “H”, VALUE is a Hex string.
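
The criteria for the eleven mandatory fields (QNAME through QUAL) can be sketched as a single function. This is a minimal illustration rather than the validator's actual implementation: the function name, the error labels, and the sq_names parameter are all hypothetical, and exempting "*" from the SQ lookup is an assumption that the table does not state.

```python
import re

# Patterns transcribed from the criteria table; all names here are illustrative.
WS = re.compile(r"[ \t\n\r]")
RNAME_BAD = re.compile(r"[ \t\n\r@=]")
MRNM_BAD = re.compile(r"[ \t\n\r@]")
UINT = re.compile(r"[0-9]+")
INT = re.compile(r"-?[0-9]+")
CIGAR = re.compile(r"([0-9]+[MIDNSHP])+|\*")
SEQ = re.compile(r"[acgtnACGTN.=]+|\*")
QUAL = re.compile(r"[!-~]+|\*")


def in_range(text, pattern, lo, hi):
    """True if text fully matches pattern and lo <= int(text) <= hi."""
    return pattern.fullmatch(text) is not None and lo <= int(text) <= hi


def validate_alignment(line, sq_names=None):
    """Return the labels of the failed checks for one SAM alignment line."""
    fields = line.rstrip("\r\n").split("\t")
    if len(fields) < 11:
        return ["FIELD_COUNT"]
    qname, flag, rname, pos, mapq, cigar, mrnm, mpos, isize, seq, qual = fields[:11]
    errors = []
    if not (0 < len(qname) <= 254) or WS.search(qname):
        errors.append("QNAME")
    if not in_range(flag, UINT, 0, 2**16 - 1):
        errors.append("FLAG")
    if RNAME_BAD.search(rname):
        errors.append("RNAME")
    if not in_range(pos, UINT, 0, 2**29 - 1):
        errors.append("POS")
    if not in_range(mapq, UINT, 0, 2**8 - 1):
        errors.append("MAPQ")
    if not CIGAR.fullmatch(cigar):
        errors.append("CIGAR")
    if MRNM_BAD.search(mrnm):
        errors.append("MRNM")
    if not in_range(mpos, UINT, 0, 2**29 - 1):
        errors.append("MPOS")
    if not in_range(isize, INT, -(2**29), 2**29):
        errors.append("ISIZE")
    if not SEQ.fullmatch(seq):
        errors.append("SEQ")
    if not QUAL.fullmatch(qual):
        errors.append("QUAL")
    if seq == "*" and qual != "*":
        errors.append("QUAL_WITHOUT_SEQ")
    if qual != "*" and len(qual) != len(seq):
        errors.append("QUAL_LENGTH")
    if sq_names is not None:
        # Assumption: "*" (no reference) is exempt from the SQ lookup.
        for label, name in (("RNAME_NOT_IN_SQ", rname), ("MRNM_NOT_IN_SQ", mrnm)):
            if name not in ("*", "=") and name not in sq_names:
                errors.append(label)
    return errors
```

Passing sq_names=None corresponds to a file with no SQ lines in the header, in which case the RNAME/MRNM lookup is skipped. Returning every failed check, instead of stopping at the first, makes it easier to fill in the Implemented and Tested columns one criterion at a time.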

NOTE: There are other TAG Validations that can be done. They will come later.

NOTE: There are other BAM Validations that can be done. They will come later.

SAM Questions

  • Comment says: “If the mapping position of the query is not available, RNAME and CIGAR are set as “*”, and POS and MAPQ as 0.” Is it all or nothing? Can some be set to “*”/0 but not all?
    • Same question for MRNM = “*” and MPOS & ISIZE = 0
  • Comment says: “The name of a pair/read is required to be unique in the SAM file, but one pair/read may appear multiple times in different alignment records, representing multiple or split hits.” Is there anything here that needs to be validated?
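
Until the first question is answered, a validator could at least make the "some but not all" case visible. The sketch below takes one possible reading, which is purely an assumption: RNAME = "*" is treated as the trigger, and the function (a hypothetical name) reports which of POS, MAPQ, and CIGAR are not set to their "unavailable" value. It does not decide whether such a record is an error.

```python
def unmapped_inconsistencies(rname, pos, mapq, cigar):
    """List the fields that disagree with RNAME == "*".

    Assumption: only RNAME == "*" triggers the check. MAPQ or POS equal to 0
    alone is not treated as a sign that the position is unavailable.
    """
    if rname != "*":
        return []
    mismatches = []
    if pos != 0:
        mismatches.append("POS")
    if mapq != 0:
        mismatches.append("MAPQ")
    if cigar != "*":
        mismatches.append("CIGAR")
    return mismatches


# Example: RNAME is "*" but POS and CIGAR still carry values.
print(unmapped_inconsistencies("*", 100, 0, "4M"))
```

The same shape would apply to the MRNM question, with MRNM = "*" as the trigger and MPOS and ISIZE expected to be 0.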