Line 1: |
Line 1: |
| + | '''NOTE: Not all validation criteria are listed here, and not all of the listed criteria have been implemented. (Implemented checks are marked green.)''' |
| + | |
| === SAM Header Validation Rules === | | === SAM Header Validation Rules === |
| TODO | | TODO |
| + | {| class="wikitable" style="width:100%" border="1" |
| + | |+ style="font-size:150%"|'''SAM Header''' |
| + | ! rowspan='2' width="60%"|Validation Criteria |
| + | ! colspan="2" width="20%"|Implemented |
| + | ! colspan="2" width="20%"|Tested |
| + | |- |
| + | ! width="10%"|SAM |
| + | ! width="10%"|BAM |
| + | ! width="10%"|SAM |
| + | ! width="10%"|BAM |
| + | |- |
| + | | All Required Fields are set |
| + | |style="background-color:red;"| |
| + | |style="background-color:red;"| |
| + | |style="background-color:red;"| |
| + | |style="background-color:red;"| |
| + | |- |
| + | | If HD line is there, VN is also there. |
| + | |style="background-color:red;"| |
| + | |style="background-color:red;"| |
| + | |style="background-color:red;"| |
| + | |style="background-color:red;"| |
| + | |- |
| + | | HD/VN is not in valid format /^[0-9]+\.[0-9]+$/ |
| + | |style="background-color:red;"| |
| + | |style="background-color:red;"| |
| + | |style="background-color:red;"| |
| + | |style="background-color:red;"| |
| + | |- |
| + | | HD/SO is a valid value (unsorted, queryname, coordinate) |
| + | |style="background-color:red;"| |
| + | |style="background-color:red;"| |
| + | |style="background-color:red;"| |
| + | |style="background-color:red;"| |
| + | |- |
| + | | SQ/SN all SQ lines have a unique SN field |
| + | |style="background-color:red;"| |
| + | |style="background-color:red;"| |
| + | |style="background-color:red;"| |
| + | |style="background-color:red;"| |
| + | |- |
| + | | SQ/LN is in the range [1, (2^29) -1] |
| + | |style="background-color:red;"| |
| + | |style="background-color:red;"| |
| + | |style="background-color:red;"| |
| + | |style="background-color:red;"| |
| + | |- |
| + | | SQ/LN is not a number |
| + | |style="background-color:red;"| |
| + | |style="background-color:red;"| |
| + | |style="background-color:red;"| |
| + | |style="background-color:red;"| |
| + | |- |
| + | | RG/ID all RG lines have a unique ID field |
| + | |style="background-color:red;"| |
| + | |style="background-color:red;"| |
| + | |style="background-color:red;"| |
| + | |style="background-color:red;"| |
| + | |- |
| + | | RG/PL is a valid value (ILLUMINA, SOLID, LS454, HELICOS, PACBIO) |
| + | |style="background-color:red;"| |
| + | |style="background-color:red;"| |
| + | |style="background-color:red;"| |
| + | |style="background-color:red;"| |
| + | |- |
| + | | Header has X lines or fewer, or at most a maximum number of SQ lines (a file with an excessive number of header lines caused a problem once) |
| + | |style="background-color:red;"| |
| + | |style="background-color:red;"| |
| + | |style="background-color:red;"| |
| + | |style="background-color:red;"| |
| + | |} |
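The header criteria above can be sketched as a single pass over the header lines. This is a minimal illustration in Python, not the BamFile library's API: the function name `validate_header`, the error strings, and the choice to treat a missing SN or ID as an error are all assumptions made for the example.

```python
import re

VN_RE = re.compile(r"[0-9]+\.[0-9]+")
UINT_RE = re.compile(r"[0-9]+")
VALID_SO = {"unsorted", "queryname", "coordinate"}
VALID_PL = {"ILLUMINA", "SOLID", "LS454", "HELICOS", "PACBIO"}
MAX_LN = 2**29 - 1


def validate_header(lines):
    """Return a list of error strings for tab-delimited SAM header lines.

    Hypothetical helper: names and messages are illustrative only.
    """
    errors = []
    seen_sn, seen_rg_id = set(), set()
    for line in lines:
        parts = line.rstrip("\n").split("\t")
        record = parts[0].lstrip("@")
        # Each field looks like "KEY:value"; split on the first colon only.
        tags = dict(p.split(":", 1) for p in parts[1:] if ":" in p)
        if record == "HD":
            vn = tags.get("VN")
            if vn is None:
                errors.append("HD line is missing VN")
            elif not VN_RE.fullmatch(vn):
                errors.append("HD/VN is not in a valid format")
            if "SO" in tags and tags["SO"] not in VALID_SO:
                errors.append("HD/SO is not a valid value")
        elif record == "SQ":
            sn = tags.get("SN")
            if sn is None:
                errors.append("SQ line is missing SN")
            elif sn in seen_sn:
                errors.append(f"duplicate SQ/SN: {sn}")
            seen_sn.add(sn)
            ln = tags.get("LN", "")
            if not UINT_RE.fullmatch(ln):
                errors.append("SQ/LN is not a number")
            elif not 1 <= int(ln) <= MAX_LN:
                errors.append("SQ/LN is out of range")
        elif record == "RG":
            rg_id = tags.get("ID")
            if rg_id is None:
                errors.append("RG line is missing ID")
            elif rg_id in seen_rg_id:
                errors.append(f"duplicate RG/ID: {rg_id}")
            seen_rg_id.add(rg_id)
            if "PL" in tags and tags["PL"] not in VALID_PL:
                errors.append("RG/PL is not a valid value")
    return errors
```

Collecting every error instead of stopping at the first one makes it easier to report all problems in a file in one run; a real validator might stop after a configurable number of errors.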
| | | |
| === SAM Alignment Validation === | | === SAM Alignment Validation === |
| {| class="wikitable" style="width:100%" border="1" | | {| class="wikitable" style="width:100%" border="1" |
| |+ style="font-size:150%"|'''SAM Alignment Record''' | | |+ style="font-size:150%"|'''SAM Alignment Record''' |
− | ! width="70%"|Validation Criteria | + | ! rowspan='2' width="60%"|Validation Criteria |
− | ! width="15%"|Implemented | + | ! colspan="2" width="20%"|Implemented |
− | ! width="15%"|Tested | + | ! colspan="2" width="20%"|Tested |
| + | |- |
| + | ! width="10%"|SAM |
| + | ! width="10%"|BAM |
| + | ! width="10%"|SAM |
| + | ! width="10%"|BAM |
| |- | | |- |
| | QNAME.Length() > 0 and <= 254 | | | QNAME.Length() > 0 and <= 254 |
− | | | + | |style="background-color:green;"| |
− | | | + | |style="background-color:green;"| |
| + | |style="background-color:green;"| |
| + | |style="background-color:red;"| |
| |- | | |- |
− | | QNAME does not contain [ \t\n\r] | + | | QNAME is valid: [!-?A-~] (printable characters minus space and '@') '''This is a new regular expression''' |
− | | | + | |style="background-color:red;"| |
− | | | + | |style="background-color:red;"| |
| + | |style="background-color:red;"| |
| + | |style="background-color:red;"| |
| |- | | |- |
| | FLAG is an integer [0-9]+ | | | FLAG is an integer [0-9]+ |
− | | | + | |style="background-color:green;"| |
− | | | + | |style="background-color:grey;"| N/A: just interpret the bits as an int. |
| + | |style="background-color:green;"| |
| + | |style="background-color:grey;"| N/A: just interpret the bits as an int. |
| |- | | |- |
− | | FLAG < 2048 (I think) or [0, (2^16)-1] | + | | FLAG is [0, (2^16)-1] |
− | | | + | |style="background-color:green;"| Parse Error since it will be written into a 16 bit field. |
− | | | + | |style="background-color:grey;"| N/A: only a 16 bit field |
| + | |style="background-color:green;"| |
| + | |style="background-color:grey;"| N/A: only a 16 bit field |
| |- | | |- |
| | RNAME does not contain [ \t\n\r@=] | | | RNAME does not contain [ \t\n\r@=] |
− | | | + | |style="background-color:green;"| |
− | | | + | |style="background-color:green;"| |
| + | |style="background-color:green;"| |
| + | |style="background-color:red;"| |
| + | |- |
| + | | RNAME is found in an SQ header record if there are any SQs in the header. |
| + | |style="background-color:green;"| |
| + | |style="background-color:green;"| |
| + | |style="background-color:green;"| |
| + | |style="background-color:red;"| |
| + | |- |
| + | | Reference Name length does not match specified length. |
| + | |style="background-color:grey;"| N/A: reference name length is in BAM format only |
| + | |style="background-color:red;"| |
| + | |style="background-color:grey;"| N/A: reference name length is in BAM format only |
| + | |style="background-color:red;"| |
| + | |- |
| + | | Reference ID is in range of the number of references |
| + | |style="background-color:grey;"| N/A: rID is in BAM format only |
| + | |style="background-color:red;"| |
| + | |style="background-color:grey;"| N/A: rID is in BAM format only |
| + | |style="background-color:red;"| |
| |- | | |- |
| | POS is an integer [0-9]+ | | | POS is an integer [0-9]+ |
− | | | + | |style="background-color:green;"| |
− | | | + | |style="background-color:grey;"| N/A: just interpret the bits as an int. |
| + | |style="background-color:green;"| |
| + | |style="background-color:grey;"| N/A: just interpret the bits as an int. |
| |- | | |- |
| | POS is [0, (2^29)-1] | | | POS is [0, (2^29)-1] |
− | | | + | |style="background-color:green;"| Parse Error if it can't fit in the 32 bit field, other out of range is a validation error. |
− | | | + | |style="background-color:red;"| |
| + | |style="background-color:green;"| |
| + | |style="background-color:red;"| |
| |- | | |- |
| | MAPQ is an integer [0-9]+ | | | MAPQ is an integer [0-9]+ |
− | | | + | |style="background-color:green;"| |
− | | | + | |style="background-color:grey;"| N/A: just interpret the bits as an int. |
| + | |style="background-color:green;"| |
| + | |style="background-color:grey;"| N/A: just interpret the bits as an int. |
| |- | | |- |
| | MAPQ is [0, (2^8)-1] | | | MAPQ is [0, (2^8)-1] |
− | | | + | |style="background-color:green;"| Parse Error since it will be written into an 8 bit field. |
− | | | + | |style="background-color:grey;"| N/A: only an 8 bit field |
| + | |style="background-color:green;"| |
| + | |style="background-color:grey;"| N/A: only an 8 bit field |
| |- | | |- |
| | <nowiki>CIGAR ([0-9]+[MIDNSHP])+|\*</nowiki> | | | <nowiki>CIGAR ([0-9]+[MIDNSHP])+|\*</nowiki> |
− | | | + | |style="background-color:red;"| |
− | | | + | |style="background-color:red;"| |
| + | |style="background-color:red;"| |
| + | |style="background-color:red;"| |
| + | |- |
| + | | CIGAR string matches the length of SEQ if both are not "*" |
| + | |style="background-color:green;"| |
| + | |style="background-color:green;"| |
| + | |style="background-color:green;"| |
| + | |style="background-color:red;"| |
| |- | | |- |
| | MRNM does not contain [ \t\n\r@] ('=' means it is the same as RNAME) | | | MRNM does not contain [ \t\n\r@] ('=' means it is the same as RNAME) |
− | | | + | |style="background-color:red;"| |
− | | | + | |style="background-color:red;"| |
| + | |style="background-color:red;"| |
| + | |style="background-color:red;"| |
| |- | | |- |
| | If SQ is in the header RNAME & MRNM (if not “=”) must be in SQ. | | | If SQ is in the header RNAME & MRNM (if not “=”) must be in SQ. |
− | | | + | |style="background-color:red;"| |
− | | | + | |style="background-color:red;"| |
| + | |style="background-color:red;"| |
| + | |style="background-color:red;"| |
| |- | | |- |
| | MPOS is an integer [0-9]+ | | | MPOS is an integer [0-9]+ |
− | | | + | |style="background-color:green;"| |
− | | | + | |style="background-color:red;"| |
| + | |style="background-color:red;"| |
| + | |style="background-color:red;"| |
| |- | | |- |
| | MPOS is [0, (2^29)-1] | | | MPOS is [0, (2^29)-1] |
− | | | + | |style="background-color:red;"| |
− | | | + | |style="background-color:red;"| |
| + | |style="background-color:red;"| |
| + | |style="background-color:red;"| |
| |- | | |- |
| | ISIZE is an integer -?[0-9]+ | | | ISIZE is an integer -?[0-9]+ |
− | | | + | |style="background-color:green;"| |
− | | | + | |style="background-color:red;"| |
| + | |style="background-color:red;"| |
| + | |style="background-color:red;"| |
| |- | | |- |
| | ISIZE is [-(2^29), 2^29] | | | ISIZE is [-(2^29), 2^29] |
− | | | + | |style="background-color:red;"| |
− | | | + | |style="background-color:red;"| |
| + | |style="background-color:red;"| |
| + | |style="background-color:red;"| |
| |- | | |- |
| | <nowiki>SEQ is [acgtnACGTN.=]+|\*</nowiki> | | | <nowiki>SEQ is [acgtnACGTN.=]+|\*</nowiki> |
− | | | + | |style="background-color:red;"| |
− | | | + | |style="background-color:red;"| |
| + | |style="background-color:red;"| |
| + | |style="background-color:red;"| |
| |- | | |- |
| | If SEQ is * then QUAL is * | | | If SEQ is * then QUAL is * |
− | | | + | |style="background-color:red;"| |
− | | | + | |style="background-color:red;"| |
| + | |style="background-color:red;"| |
| + | |style="background-color:red;"| |
| |- | | |- |
| | <nowiki>QUAL is [!-~]+|\* → dec 33–126, or "*" (dec 42, which is within 33–126); for BAM, it is in [0, 93]</nowiki> | | | <nowiki>QUAL is [!-~]+|\* → dec 33–126, or "*" (dec 42, which is within 33–126); for BAM, it is in [0, 93]</nowiki> |
− | | | + | |style="background-color:red;"| |
− | | | + | |style="background-color:red;"| |
| + | |style="background-color:red;"| |
| + | |style="background-color:red;"| |
| |- | | |- |
− | | If QUAL is not “*” it is the same length as SEQ. | + | | If QUAL and SEQ are not “*” they are the same length. |
− | | | + | |style="background-color:green;"| |
− | | | + | |style="background-color:red;"| |
| + | |style="background-color:green;"| |
| + | |style="background-color:red;"| |
| |- | | |- |
| | TAG is [A-Z][A-Z0-9] | | | TAG is [A-Z][A-Z0-9] |
− | | | + | |style="background-color:red;"| |
− | | | + | |style="background-color:red;"| |
| + | |style="background-color:red;"| |
| + | |style="background-color:red;"| |
| |- | | |- |
| | A TAG only appears once per alignment | | | A TAG only appears once per alignment |
− | | | + | |style="background-color:red;"| |
− | | | + | |style="background-color:red;"| |
| + | |style="background-color:red;"| |
| + | |style="background-color:red;"| |
| |- | | |- |
− | | VTYPE is [AifZH] for SAM and [AcCsSiIfZH] | + | | VTYPE is [AifZH] for SAM and [AcCsSiIfZH] for BAM |
− | | | + | |style="background-color:red;"| |
− | | | + | |style="background-color:red;"| |
| + | |style="background-color:red;"| |
| + | |style="background-color:red;"| |
| |- | | |- |
| | VALUE does NOT contain [\t\n\r] | | | VALUE does NOT contain [\t\n\r] |
− | | | + | |style="background-color:red;"| |
− | | | + | |style="background-color:red;"| |
| + | |style="background-color:red;"| |
| + | |style="background-color:red;"| |
| |- | | |- |
| | For VTYPE = “A”, VALUE is a printable character | | | For VTYPE = “A”, VALUE is a printable character |
− | | | + | |style="background-color:red;"| |
− | | | + | |style="background-color:red;"| |
| + | |style="background-color:red;"| |
| + | |style="background-color:red;"| |
| |- | | |- |
| | For VTYPE = “i”, VALUE is a signed 32-bit integer. | | | For VTYPE = “i”, VALUE is a signed 32-bit integer. |
− | | | + | |style="background-color:red;"| |
− | | | + | |style="background-color:red;"| |
| + | |style="background-color:red;"| |
| + | |style="background-color:red;"| |
| |- | | |- |
| | For VTYPE = “f”, VALUE is a single-precision float. | | | For VTYPE = “f”, VALUE is a single-precision float. |
− | | | + | |style="background-color:red;"| |
− | | | + | |style="background-color:red;"| |
| + | |style="background-color:red;"| |
| + | |style="background-color:red;"| |
| |- | | |- |
| | For VTYPE = “Z”, VALUE is a printable string. | | | For VTYPE = “Z”, VALUE is a printable string. |
− | | | + | |style="background-color:red;"| |
− | | | + | |style="background-color:red;"| |
| + | |style="background-color:red;"| |
| + | |style="background-color:red;"| |
| |- | | |- |
| | For VTYPE = “H”, VALUE is a Hex string. | | | For VTYPE = “H”, VALUE is a Hex string. |
− | | | + | |style="background-color:red;"| |
− | | | + | |style="background-color:red;"| |
| + | |style="background-color:red;"| |
| + | |style="background-color:red;"| |
| + | |- |
| + | | For TAG = E2, length should be the same as the Read Length |
| + | |style="background-color:red;"| |
| + | |style="background-color:red;"| |
| + | |style="background-color:red;"| |
| + | |style="background-color:red;"| |
| + | |- |
| + | | For TAG = E2, each base should be different from the read base (unless 'N') |
| + | |style="background-color:red;"| |
| + | |style="background-color:red;"| |
| + | |style="background-color:red;"| |
| + | |style="background-color:red;"| |
| + | |- |
| + | | For TAG = U2, length should be the same as the Read Length |
| + | |style="background-color:red;"| |
| + | |style="background-color:red;"| |
| + | |style="background-color:red;"| |
| + | |style="background-color:red;"| |
| + | |- |
| |} | | |} |
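Several of the mandatory-field criteria in the table reduce to a regular expression plus a range check. The following Python sketch shows one way to apply them to a single tab-delimited SAM line. It is an illustration only: `validate_alignment`, `check_int`, and the error strings are hypothetical names, and the checks that depend on the header (RNAME and MRNM being present in SQ) are left out.

```python
import re

CIGAR_RE = re.compile(r"([0-9]+[MIDNSHP])+|\*")
SEQ_RE = re.compile(r"[acgtnACGTN.=]+|\*")
QUAL_RE = re.compile(r"[!-~]+|\*")


def check_int(name, text, low, high, errors, pattern=r"[0-9]+"):
    """Append an error if text is not an integer within [low, high]."""
    if not re.fullmatch(pattern, text):
        errors.append(f"{name} is not an integer")
    elif not low <= int(text) <= high:
        errors.append(f"{name} is out of range")


def validate_alignment(line):
    """Return a list of error strings for one SAM alignment line."""
    fields = line.rstrip("\n").split("\t")
    if len(fields) < 11:
        return ["fewer than 11 mandatory fields"]
    (qname, flag, rname, pos, mapq, cigar,
     mrnm, mpos, isize, seq, qual) = fields[:11]
    errors = []
    if not 0 < len(qname) <= 254:
        errors.append("QNAME length is not in [1, 254]")
    check_int("FLAG", flag, 0, 2**16 - 1, errors)
    if re.search(r"[ \t\n\r@=]", rname):
        errors.append("RNAME contains an invalid character")
    check_int("POS", pos, 0, 2**29 - 1, errors)
    check_int("MAPQ", mapq, 0, 2**8 - 1, errors)
    if not CIGAR_RE.fullmatch(cigar):
        errors.append("CIGAR is not valid")
    if re.search(r"[ \t\n\r@]", mrnm):
        errors.append("MRNM contains an invalid character")
    check_int("MPOS", mpos, 0, 2**29 - 1, errors)
    check_int("ISIZE", isize, -(2**29), 2**29, errors, pattern=r"-?[0-9]+")
    if not SEQ_RE.fullmatch(seq):
        errors.append("SEQ is not valid")
    if not QUAL_RE.fullmatch(qual):
        errors.append("QUAL is not valid")
    if seq == "*" and qual != "*":
        errors.append("SEQ is * but QUAL is not")
    if seq != "*" and qual != "*" and len(seq) != len(qual):
        errors.append("SEQ and QUAL lengths differ")
    return errors
```

For BAM input the integer-format checks do not apply, as the table notes, because the values are read directly from fixed-width binary fields.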
| | | |
| NOTE: There are other TAG Validations that can be done. They will come later. | | NOTE: There are other TAG Validations that can be done. They will come later. |
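As a rough illustration of the TAG, VTYPE, and VALUE criteria listed in the table, the Python sketch below checks the optional `TAG:VTYPE:VALUE` fields of one SAM record. The function name `validate_tags` and the error strings are hypothetical. The exact patterns used for "printable" (`[!-~]` for type A, `[ -~]*` for type Z) and for the float syntax are assumptions, and the single-precision range of type f is not checked.

```python
import re

TAG_RE = re.compile(r"[A-Z][A-Z0-9]")
SAM_VTYPES = set("AifZH")
# Assumed value patterns per VTYPE (SAM text format only).
VALUE_RES = {
    "A": re.compile(r"[!-~]"),
    "i": re.compile(r"-?[0-9]+"),
    "f": re.compile(r"[-+]?[0-9]*\.?[0-9]+([eE][-+]?[0-9]+)?"),
    "Z": re.compile(r"[ -~]*"),
    "H": re.compile(r"[0-9A-Fa-f]+"),
}


def validate_tags(optional_fields):
    """Return a list of error strings for a record's optional fields."""
    errors = []
    seen = set()
    for field in optional_fields:
        parts = field.split(":", 2)
        if len(parts) != 3:
            errors.append(f"malformed optional field: {field}")
            continue
        tag, vtype, value = parts
        if not TAG_RE.fullmatch(tag):
            errors.append(f"invalid TAG: {tag}")
        if tag in seen:
            errors.append(f"TAG appears more than once: {tag}")
        seen.add(tag)
        if vtype not in SAM_VTYPES:
            errors.append(f"invalid VTYPE for {tag}: {vtype}")
            continue
        if not VALUE_RES[vtype].fullmatch(value):
            errors.append(f"invalid VALUE for {tag}:{vtype}")
        elif vtype == "i" and not -(2**31) <= int(value) <= 2**31 - 1:
            errors.append(f"VALUE for {tag} does not fit in 32 bits")
    return errors
```

Because a field that contains a tab, newline, or carriage return fails every value pattern above, the "VALUE does NOT contain [\t\n\r]" criterion is covered by the per-type checks in this sketch.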
| | | |
− | NOTE: There are other BAM Validations that can be done. They will come later. | + | NOTE: There may be other BAM Validations that can be done. They will come later. |
| + | |
| + | Consider also validating the CIGAR string against the read length. |
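One possible way to compare a CIGAR string with the read length is to sum the lengths of the operations that consume read bases. In the SAM format these are M, I, and S; the operations D, N, H, and P do not consume bases of SEQ. The helper names below are hypothetical, and the sketch assumes the CIGAR string has already passed the format check.

```python
import re

CIGAR_OP_RE = re.compile(r"([0-9]+)([MIDNSHP])")
QUERY_CONSUMING = set("MIS")  # operations that consume bases of SEQ


def cigar_query_length(cigar):
    """Sum the lengths of the CIGAR operations that consume read bases."""
    return sum(int(count)
               for count, op in CIGAR_OP_RE.findall(cigar)
               if op in QUERY_CONSUMING)


def cigar_matches_seq(cigar, seq):
    """True if the CIGAR and SEQ lengths agree, or if either is '*'."""
    if cigar == "*" or seq == "*":
        return True
    return cigar_query_length(cigar) == len(seq)
```

For example, `3S5M2D4M1I` describes 3 + 5 + 4 + 1 = 13 read bases, because the 2-base deletion does not consume any of the read.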
| + | |
| + | == Other Read Validation == |
| + | |
| + | {| class="wikitable" style="width:100%" border="1" |
| + | |+ style="font-size:150%"|'''SAM Alignment Record''' |
| + | ! rowspan='2' width="60%"|Validation Criteria |
| + | ! colspan="2" width="20%"|Implemented |
| + | ! colspan="2" width="20%"|Tested |
| + | |- |
| + | ! width="10%"|SAM |
| + | ! width="10%"|BAM |
| + | ! width="10%"|SAM |
| + | ! width="10%"|BAM |
| + | |- |
| + | | Records are in the expected sort order, if a sort-order check is requested (based either on the HD/SO value or on a user-specified order: coordinate or query name). |
| + | |style="background-color:green;"| |
| + | |style="background-color:green;"| |
| + | |style="background-color:green;"| |
| + | |style="background-color:green;"| |
| + | |} |
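A sort-order check only needs to compare each record with the one before it. The Python sketch below assumes records are available as `(qname, rname, pos)` tuples and that coordinate order means "by position of the reference in the header, then by POS, with unknown or unmapped references last". It also assumes query-name order is a plain string comparison, which may differ from what a particular sorting tool produces. The function name `is_sorted` is hypothetical.

```python
def is_sorted(records, order, reference_names=()):
    """Check that (qname, rname, pos) tuples follow the given sort order.

    order is "unsorted", "queryname", or "coordinate"; reference_names
    lists the SQ/SN values in header order (used for coordinate order).
    """
    if order == "unsorted":
        return True
    if order not in ("queryname", "coordinate"):
        raise ValueError(f"unknown sort order: {order}")
    ref_index = {name: i for i, name in enumerate(reference_names)}

    def sort_key(record):
        qname, rname, pos = record
        if order == "queryname":
            return qname
        # References missing from the header (such as "*") sort last.
        return (ref_index.get(rname, len(ref_index)), pos)

    previous = None
    for record in records:
        current = sort_key(record)
        if previous is not None and current < previous:
            return False
        previous = current
    return True
```

Only the previous key is kept, so the check runs in a single pass and does not need to hold the file in memory.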
| + | |
| | | |
| ===SAM Questions=== | | ===SAM Questions=== |
Line 134: |
Line 336: |
| **Same question for MRNM = “*” and MPOS & ISIZE = 0 | | **Same question for MRNM = “*” and MPOS & ISIZE = 0 |
| *Comment says: “The name of a pair/read is required to be unique in the SAM file, but one pair/read may appear multiple times in different alignment records, representing multiple or split hits.” - Is there anything here that needs to be validated??? | | *Comment says: “The name of a pair/read is required to be unique in the SAM file, but one pair/read may appear multiple times in different alignment records, representing multiple or split hits.” - Is there anything here that needs to be validated??? |
| + | |
| + | == BamFile Classes == |
| + | [[C++ Library: libbam|BamFile]] |