Difference between revisions of "SAM Validation Criteria"

From Genome Analysis Wiki

Revision as of 12:01, 18 March 2010

SAM Header Validation Rules

TODO

SAM Alignment Validation

SAM Alignment Record
Validation Criteria | Implemented (SAM, BAM) | Tested (SAM, BAM)
QNAME.Length() > 0 and <= 254
QNAME does not contain [ \t\n\r]
FLAG is an integer [0-9]+
FLAG is [0, (2^16)-1] (only bits up to 0x400 are defined, so FLAG should be < 2048)
RNAME does not contain [ \t\n\r@=]
POS is an integer [0-9]+
POS is [0, (2^29)-1]
MAPQ is an integer [0-9]+
MAPQ is [0, (2^8)-1]
CIGAR ([0-9]+[MIDNSHP])+|\*
MRNM does not contain [ \t\n\r@] ('=' means it is the same as RNAME)
If SQ is in the header, RNAME & MRNM (if not “=”) must be in SQ.
MPOS is an integer [0-9]+
MPOS is [0, (2^29)-1]
ISIZE is an integer -?[0-9]+
ISIZE is [-(2^29), 2^29]
SEQ is [acgtnACGTN.=]+|\*
If SEQ is * then QUAL is *
QUAL is [!-~]+|\* → dec 33 to 126; “*” is dec 42, which is itself within 33-126 (for BAM, it is between [0,93])
If QUAL is not “*” it is the same length as SEQ.
TAG is [A-Z][A-Z0-9]
A TAG only appears once per alignment
VTYPE is [AifZH] for SAM and [AcCsSiIfZH] for BAM
VALUE does NOT contain [\t\n\r]
For VTYPE = “A”, VALUE is a printable character
For VTYPE = “i”, VALUE is a signed 32-bit integer.
For VTYPE = “f”, VALUE is a single-precision float.
For VTYPE = “Z”, VALUE is a printable string.
For VTYPE = “H”, VALUE is a Hex string.
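
The mandatory-field criteria in the table above can be sketched as a single check over one tab-separated alignment line. This is a minimal illustration, not the wiki's implementation: the function name `validate_alignment`, the `sq_names` parameter, and the failure labels are hypothetical, FLAG is checked against the 16-bit bound only, and the sketch assumes that an RNAME or MRNM of "*" is exempt from the SQ lookup.

```python
import re

# Patterns copied from the criteria table; all names here are illustrative.
WHITESPACE = re.compile(r"[ \t\n\r]")
RNAME_BAD = re.compile(r"[ \t\n\r@=]")
MRNM_BAD = re.compile(r"[ \t\n\r@]")
UINT = re.compile(r"[0-9]+")
INT = re.compile(r"-?[0-9]+")
CIGAR = re.compile(r"([0-9]+[MIDNSHP])+|\*")
SEQ = re.compile(r"[acgtnACGTN.=]+|\*")
QUAL = re.compile(r"[!-~]+|\*")


def in_range(text, pattern, low, high):
    """True if text fully matches pattern and its integer value is in [low, high]."""
    return pattern.fullmatch(text) is not None and low <= int(text) <= high


def validate_alignment(line, sq_names=None):
    """Return labels of the criteria that one SAM alignment line fails."""
    fields = line.rstrip("\n").split("\t")
    if len(fields) < 11:
        return ["field count"]
    qname, flag, rname, pos, mapq, cigar, mrnm, mpos, isize, seq, qual = fields[:11]
    failed = []
    if not (0 < len(qname) <= 254) or WHITESPACE.search(qname):
        failed.append("QNAME")
    if not in_range(flag, UINT, 0, 2**16 - 1):
        failed.append("FLAG")
    if RNAME_BAD.search(rname):
        failed.append("RNAME")
    if not in_range(pos, UINT, 0, 2**29 - 1):
        failed.append("POS")
    if not in_range(mapq, UINT, 0, 2**8 - 1):
        failed.append("MAPQ")
    if not CIGAR.fullmatch(cigar):
        failed.append("CIGAR")
    if MRNM_BAD.search(mrnm):
        failed.append("MRNM")
    if not in_range(mpos, UINT, 0, 2**29 - 1):
        failed.append("MPOS")
    if not in_range(isize, INT, -(2**29), 2**29):
        failed.append("ISIZE")
    if not SEQ.fullmatch(seq):
        failed.append("SEQ")
    if not QUAL.fullmatch(qual):
        failed.append("QUAL")
    if seq == "*" and qual != "*":
        failed.append("QUAL without SEQ")
    if qual != "*" and seq != "*" and len(qual) != len(seq):
        failed.append("QUAL length")
    if sq_names is not None:
        # Assumption: "*" (no reference) is exempt from the SQ lookup.
        for label, name in (("RNAME", rname), ("MRNM", mrnm)):
            if name not in ("=", "*") and name not in sq_names:
                failed.append(label + " not in SQ")
    return failed
```

Passing `sq_names=None` skips the SQ lookup, which matches the "If SQ is in the header" condition; a caller that has parsed the header would pass the set of SQ sequence names instead.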

NOTE: There are other TAG Validations that can be done. They will come later.
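
The TAG, VTYPE, and VALUE criteria already listed in the table can be sketched for the SAM text form as follows. The function names and error strings are hypothetical, and a few details are assumptions rather than rules from this page: the float syntax is taken to be a plain decimal with an optional exponent, "Z" values are taken to allow spaces, and the single-precision check only rejects values too large to store as a 32-bit float.

```python
import math
import re
import struct

TAG = re.compile(r"[A-Z][A-Z0-9]")
SAM_VTYPES = set("AifZH")
CONTROL = re.compile(r"[\t\n\r]")
INT = re.compile(r"[-+]?[0-9]+")
# Assumed float syntax: decimal digits with an optional exponent.
FLOAT = re.compile(r"[-+]?[0-9]*\.?[0-9]+([eE][-+]?[0-9]+)?")
HEX = re.compile(r"[0-9A-Fa-f]+")


def fits_single_precision(text):
    """True if the value is finite and small enough for a 32-bit float."""
    number = float(text)
    if not math.isfinite(number):
        return False
    try:
        struct.pack("<f", number)
    except OverflowError:
        return False
    return True


def value_ok(vtype, value):
    """Check VALUE against the rule for its VTYPE."""
    if vtype == "A":
        return re.fullmatch(r"[!-~]", value) is not None
    if vtype == "i":
        return INT.fullmatch(value) is not None and -(2**31) <= int(value) <= 2**31 - 1
    if vtype == "f":
        return FLOAT.fullmatch(value) is not None and fits_single_precision(value)
    if vtype == "Z":
        return re.fullmatch(r"[ -~]+", value) is not None
    if vtype == "H":
        return HEX.fullmatch(value) is not None
    return False


def validate_optional_fields(fields):
    """Return error strings for a list of TAG:VTYPE:VALUE fields of one alignment."""
    errors = []
    seen = set()
    for field in fields:
        parts = field.split(":", 2)  # VALUE itself may contain ':'
        if len(parts) != 3:
            errors.append(f"{field}: expected TAG:VTYPE:VALUE")
            continue
        tag, vtype, value = parts
        if not TAG.fullmatch(tag):
            errors.append(f"{tag}: bad TAG")
        if tag in seen:
            errors.append(f"{tag}: duplicate TAG")
        seen.add(tag)
        if vtype not in SAM_VTYPES:
            errors.append(f"{tag}: bad VTYPE")
        elif CONTROL.search(value) or not value_ok(vtype, value):
            errors.append(f"{tag}: bad VALUE for type {vtype}")
    return errors
```

For a full alignment line, the input would be the tab-separated fields after the eleventh mandatory one.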

NOTE: There may be other BAM Validations that can be done. They will come later.

Consider validating the CIGAR string against the read length.
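
One way the CIGAR check against the read length might look: sum the lengths of the operations that consume read bases (M, I, and S among the operators listed in the table) and compare the total with the length of SEQ. The function names are hypothetical, and the sketch assumes the check is skipped when either CIGAR or SEQ is "*".

```python
import re

CIGAR_OP = re.compile(r"([0-9]+)([MIDNSHP])")
# Operations that consume bases of the read: match, insertion, soft clip.
QUERY_CONSUMING = set("MIS")


def cigar_query_length(cigar):
    """Sum the lengths of the CIGAR operations that consume read bases."""
    return sum(int(count) for count, op in CIGAR_OP.findall(cigar)
               if op in QUERY_CONSUMING)


def cigar_matches_seq(cigar, seq):
    """True if the CIGAR-implied read length equals len(SEQ), or either is '*'."""
    if cigar == "*" or seq == "*":
        return True  # assumption: nothing to compare
    return cigar_query_length(cigar) == len(seq)
```

Deletions (D), skipped regions (N), hard clips (H), and padding (P) are left out of the sum because those bases are not present in SEQ.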

SAM Questions

  • Comment says: “If the mapping position of the query is not available, RNAME and CIGAR are set as “*”, and POS and MAPQ as 0.” Is it all or nothing? Can some be set to “*”/0 but not all?
    • Same question for MRNM = “*” and MPOS & ISIZE = 0
  • Comment says: “The name of a pair/read is required to be unique in the SAM file, but one pair/read may appear multiple times in different alignment records, representing multiple or split hits.” Is there anything here that needs to be validated?
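
Until the first question is answered, a validator could at least report which of the four fields hold their "not available" value, leaving the all-or-nothing decision to the caller. The sketch below is only an illustration; the function names and the three category labels are hypothetical.

```python
def placeholder_fields(rname, pos, mapq, cigar):
    """Return the sorted names of the fields that hold the 'not available' value."""
    checks = {
        "RNAME": rname == "*",
        "CIGAR": cigar == "*",
        "POS": pos == "0",
        # MAPQ 0 also occurs on mapped reads, so on its own it is a weak signal.
        "MAPQ": mapq == "0",
    }
    return sorted(name for name, is_placeholder in checks.items() if is_placeholder)


def placeholder_state(rname, pos, mapq, cigar):
    """Classify a record as 'none', 'some', or 'all' placeholders set."""
    count = len(placeholder_fields(rname, pos, mapq, cigar))
    if count == 0:
        return "none"
    return "all" if count == 4 else "some"
```

Records classified as "some" are the ones the question is about; whether they should be reported as errors depends on the answer. The same pattern would apply to MRNM, MPOS, and ISIZE.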