SAM Validation Criteria
From Genome Analysis Wiki
SAM Header Validation Rules
SAM Alignment Validation
|QNAME.Length() > 0 and <= 254|
|QNAME does not contain [ \t\n\r]|
|FLAG is an integer [0-9]+|
|FLAG is [0, (2^16)-1] (possibly < 2048 if only the currently defined flag bits may be set; to be confirmed)|
|RNAME does not contain [ \t\n\r@=]|
|POS is an integer [0-9]+|
|POS is [0, (2^29)-1]|
|MAPQ is an integer [0-9]+|
|MAPQ is [0, (2^8)-1]|
|MRNM does not contain [ \t\n\r@] ('=' means it is the same as RNAME)|
|If SQ is in the header, RNAME & MRNM (if not “=”) must be in SQ.|
|MPOS is an integer [0-9]+|
|MPOS is [0, (2^29)-1]|
|ISIZE is an integer -?[0-9]+|
|ISIZE is [-(2^29), 2^29]|
|SEQ is [acgtnACGTN.=]+|\*|
|If SEQ is * then QUAL is *|
|QUAL is [!-~]+|\* → dec 33 – 126, or dec 42 for “*” (which is itself in 33-126) (for BAM, it is between [0,93])|
|If QUAL is not “*” it is the same length as SEQ.|
|TAG is [A-Z][A-Z0-9]|
|A TAG only appears once per alignment|
|VTYPE is [AifZH] for SAM and [AcCsSiIfZH] for BAM|
|VALUE does NOT contain [\t\n\r]|
|For VTYPE = “A”, VALUE is a printable character|
|For VTYPE = “i”, VALUE is a signed 32-bit integer.|
|For VTYPE = “f”, VALUE is a single-precision float.|
|For VTYPE = “Z”, VALUE is a printable string.|
|For VTYPE = “H”, VALUE is a Hex string.|
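The mandatory-field rules in the table above can be sketched in Python as follows. This is only an illustration under stated assumptions, not the validator's actual implementation: the function name `validate_mandatory_fields`, the `sq_names` parameter, and the error messages are hypothetical, and the exemption of “*” from the SQ lookup is an assumption that the table does not state.

```python
import re

# Patterns and ranges transcribed from the table above (older SAM field names).
QNAME_BAD = re.compile(r"[ \t\n\r]")
RNAME_BAD = re.compile(r"[ \t\n\r@=]")
MRNM_BAD = re.compile(r"[ \t\n\r@]")
UNSIGNED = re.compile(r"[0-9]+")
SIGNED = re.compile(r"-?[0-9]+")
SEQ_OK = re.compile(r"[acgtnACGTN.=]+|\*")
QUAL_OK = re.compile(r"[!-~]+|\*")


def _check_int(name, text, pattern, low, high, errors):
    """Append an error unless text matches pattern and lies in [low, high]."""
    if not pattern.fullmatch(text):
        errors.append(f"{name} is not an integer: {text!r}")
    elif not low <= int(text) <= high:
        errors.append(f"{name} out of range [{low}, {high}]: {text}")


def validate_mandatory_fields(line, sq_names=None):
    """Return a list of error strings for one tab-separated alignment line.

    sq_names is an optional set of sequence names taken from the header SQ lines.
    """
    fields = line.rstrip("\r\n").split("\t")
    if len(fields) < 11:
        return [f"expected at least 11 fields, found {len(fields)}"]
    # CIGAR is unpacked but not checked here; the table has no CIGAR rule.
    qname, flag, rname, pos, mapq, _cigar, mrnm, mpos, isize, seq, qual = fields[:11]
    errors = []

    if not 0 < len(qname) <= 254:
        errors.append("QNAME length must be in [1, 254]")
    if QNAME_BAD.search(qname):
        errors.append("QNAME contains whitespace")
    _check_int("FLAG", flag, UNSIGNED, 0, 2**16 - 1, errors)
    if RNAME_BAD.search(rname):
        errors.append("RNAME contains a forbidden character")
    _check_int("POS", pos, UNSIGNED, 0, 2**29 - 1, errors)
    _check_int("MAPQ", mapq, UNSIGNED, 0, 2**8 - 1, errors)
    if MRNM_BAD.search(mrnm):
        errors.append("MRNM contains a forbidden character")
    _check_int("MPOS", mpos, UNSIGNED, 0, 2**29 - 1, errors)
    _check_int("ISIZE", isize, SIGNED, -(2**29), 2**29, errors)

    if sq_names is not None:
        # Assumption: "*" (not available) is exempt from the SQ lookup.
        if rname != "*" and rname not in sq_names:
            errors.append(f"RNAME {rname!r} is not in the header SQ lines")
        if mrnm not in ("=", "*") and mrnm not in sq_names:
            errors.append(f"MRNM {mrnm!r} is not in the header SQ lines")

    if not SEQ_OK.fullmatch(seq):
        errors.append("SEQ contains an invalid character")
    if not QUAL_OK.fullmatch(qual):
        errors.append("QUAL contains an invalid character")
    if seq == "*" and qual != "*":
        errors.append("QUAL must be * when SEQ is *")
    if qual != "*" and len(qual) != len(seq):
        errors.append("QUAL and SEQ differ in length")
    return errors
```

Returning a list of messages rather than stopping at the first failure lets a caller report every problem found on a line in one pass.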
NOTE: There are other TAG Validations that can be done. They will come later.
NOTE: There may be other BAM Validations that can be done. They will come later.
Consider validating the CIGAR string against the read length.
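One possible form of such a check, offered only as a sketch: sum the lengths of the CIGAR operations that consume read bases and compare the total with the length of SEQ. The function names are hypothetical. The operation set includes `=` and `X`, which later versions of the SAM specification define, and the choice to skip the check when either CIGAR or SEQ is “*” is an assumption.

```python
import re

CIGAR_OK = re.compile(r"([0-9]+[MIDNSHP=X])+")
CIGAR_OP = re.compile(r"([0-9]+)([MIDNSHP=X])")
# Operations that consume bases of the read (query).
QUERY_CONSUMING = set("MIS=X")


def cigar_query_length(cigar):
    """Return the number of read bases the CIGAR string accounts for.

    Raises ValueError if the string is not a well-formed CIGAR.
    """
    if not CIGAR_OK.fullmatch(cigar):
        raise ValueError(f"malformed CIGAR: {cigar!r}")
    return sum(
        int(length)
        for length, op in CIGAR_OP.findall(cigar)
        if op in QUERY_CONSUMING
    )


def cigar_matches_seq(cigar, seq):
    """True when the lengths agree, or when the check cannot be applied."""
    if cigar == "*" or seq == "*":
        return True  # assumption: nothing to compare when either is unavailable
    return cigar_query_length(cigar) == len(seq)
```

Deletions (`D`), skipped regions (`N`), hard clips (`H`), and padding (`P`) do not consume read bases, so they are left out of the sum.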
- Comment says: “If the mapping position of the query is not available, RNAME and CIGAR are set as ‘*’, and POS and MAPQ as 0.” Is it all or nothing, or can some of these fields be set to “*”/0 while the others are not?
- Same question for MRNM = “*” and MPOS & ISIZE = 0.
- Comment says: “The name of a pair/read is required to be unique in the SAM file, but one pair/read may appear multiple times in different alignment records, representing multiple or split hits.” Is there anything here that needs to be validated?
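Until the first two questions are answered, a validator could simply report which fields hold their “not available” value and leave the policy to the caller. The sketch below does that; it does not settle whether a mixed state is an error, and the function names are hypothetical.

```python
def unavailable_mapping_fields(rname, cigar, pos, mapq):
    """Return the names of the fields set to their 'not available' value."""
    checks = {
        "RNAME": rname == "*",
        "CIGAR": cigar == "*",
        "POS": pos == 0,
        # MAPQ 0 can also occur on a mapped read, so on its own it is weak evidence.
        "MAPQ": mapq == 0,
    }
    return [name for name, unset in checks.items() if unset]


def unavailable_mate_fields(mrnm, mpos, isize):
    """Same idea for the mate fields MRNM, MPOS, and ISIZE."""
    checks = {"MRNM": mrnm == "*", "MPOS": mpos == 0, "ISIZE": isize == 0}
    return [name for name, unset in checks.items() if unset]
```

A caller could treat an empty list or a full list as consistent and flag anything in between for review, once the intended rule is known.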