SAM: Filtering Reads
Two filters have been requested for updating a SAM/BAM file.
A single program should run both filters in the same run and produce an updated SAM/BAM file.
- Sum the base qualities of all bases that mismatch between the query and the reference. If the sum exceeds a configurable threshold, mark the read as unmapped or filtered.
- Calculate the percentage of mismatches from both ends of the read. When the percentage exceeds a threshold, clip the ends.
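The second filter is not fully specified here, so the following is only one possible reading of it, sketched in Python. It assumes each read base has already been flagged as mismatch or not, and it clips the longest run from each end that finishes on a mismatch and has a mismatch fraction above the threshold. The names `clip_lengths`, `mismatch_flags` and `max_fraction`, and the sample values, are hypothetical.

```python
def clip_lengths(mismatch_flags, max_fraction):
    """Return (left_clip, right_clip) as numbers of bases to clip.

    Assumption: a run is clipped when it ends on a mismatch and its
    mismatch fraction, counted from the read end, exceeds max_fraction.
    """
    def bad_run(flags):
        clip = mismatches = 0
        for count, is_mismatch in enumerate(flags, start=1):
            mismatches += is_mismatch
            if is_mismatch and mismatches / count > max_fraction:
                clip = count          # extend the clip to this mismatch
        return clip

    left = bad_run(mismatch_flags)
    # Scan from the right end; never clip more bases than remain.
    right = min(bad_run(mismatch_flags[::-1]), len(mismatch_flags) - left)
    return left, right


# Hypothetical read of 10 bases with mismatches at indexes 0, 2 and 9.
flags = [True, False, True, False, False, False, False, False, False, True]
print(clip_lengths(flags, 0.5))   # (3, 1)
```

Whether insertions, deletions and existing clips count toward the percentage is left open in this sketch.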
Example
Reference: ACTGAACCTTGGAAACTGCCGGGGACT
Read: TTTACTGACTGAAACCATT
Qual: >>9>6>6+4;>>453;>>;
CIGAR: 3S4M10N4M3I2M4D3M
POS: 5
This means it aligns:
Reference:    ACTGAACCTTGGAAACTG   CCGGGGACT
Read:      TTTACTG          ACTGAAACC    ATT
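The alignment above follows from walking the CIGAR string. The Python sketch below shows one way to do that. The function name `render_alignment` is hypothetical, and the reference string is assumed to begin at POS.

```python
import re

CIGAR_RE = re.compile(r"(\d+)([MIDNSHP=X])")


def render_alignment(reference, read, cigar):
    """Return (reference_row, read_row) padded so aligned bases share a column."""
    ref_row, read_row = [], []
    ref_i = read_i = 0
    for length, op in CIGAR_RE.findall(cigar):
        n = int(length)
        if op in "M=X":        # match/mismatch: consumes reference and read
            ref_row.append(reference[ref_i:ref_i + n])
            read_row.append(read[read_i:read_i + n])
            ref_i += n
            read_i += n
        elif op in "IS":       # insertion / soft clip: consumes read only
            ref_row.append(" " * n)
            read_row.append(read[read_i:read_i + n])
            read_i += n
        elif op in "DN":       # deletion / skip: consumes reference only
            ref_row.append(reference[ref_i:ref_i + n])
            read_row.append(" " * n)
            ref_i += n
        # H and P consume neither sequence and are not drawn.
    return "".join(ref_row), "".join(read_row)


ref_row, read_row = render_alignment(
    "ACTGAACCTTGGAAACTGCCGGGGACT", "TTTACTGACTGAAACCATT", "3S4M10N4M3I2M4D3M")
print("Reference:", ref_row)
print("Read:     ", read_row)
```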
Adding the position:
RefPos:                5  6  7  8  9 10 11 12 13 14 15 16 17 18 19 20 21 22          23 24 25 26 27 28 29 30 31
Reference:             A  C  T  G  A  A  C  C  T  T  G  G  A  A  A  C  T  G           C  C  G  G  G  G  A  C  T
Read:         T  T  T  A  C  T  G                                A  C  T  G  A  A  A  C  C              A  T  T
Qual:         >  >  9  >  6  >  6                                +  4  ;  >  >  4  5  3  ;              >  >  ;
Adding the offsets:
RefPos:                5  6  7  8  9 10 11 12 13 14 15 16 17 18 19 20 21 22          23 24 25 26 27 28 29 30 31
refOffset:             0  1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17          18 19 20 21 22 23 24 25 26
Reference:             A  C  T  G  A  A  C  C  T  T  G  G  A  A  A  C  T  G           C  C  G  G  G  G  A  C  T
Read:         T  T  T  A  C  T  G                                A  C  T  G  A  A  A  C  C              A  T  T
queryIndex:   0  1  2  3  4  5  6                                7  8  9 10 11 12 13 14 15             16 17 18
Qual:         >  >  9  >  6  >  6                                +  4  ;  >  >  4  5  3  ;              >  >  ;
QUESTIONS:
- For the mismatch count, should insertions/deletions/skips/clips be included?
- Deletions/skips have no quality score, but insertions and clips do. Should those qualities be added to the sum?
Sum Quality (just mismatches) - only queryIndex 17 (read T vs. reference C) - Quality: > = 62-33 = 29
Sum Quality (mismatches & insertions) - >45> = (62-33) + (52-33) + (53-33) + (62-33) = 29+19+20+29 = 97
Sum Quality (mismatches & insertions & soft clips) - >>9>45> = (62-33) + (62-33) + (57-33) + (62-33) + (52-33) + (53-33) + (62-33) = 29+29+24+29+19+20+29 = 179
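The three sums above can be reproduced with a short Python sketch. It assumes qualities are ASCII characters offset by 33, as in the arithmetic above, and that the reference string begins at POS. The function `sum_quality` and its two flags are hypothetical names standing in for the options raised in the questions.

```python
import re


def sum_quality(reference, read, qual, cigar,
                count_insertions=False, count_soft_clips=False):
    """Sum the base qualities of mismatches, optionally adding insertions and soft clips."""
    total = 0
    ref_i = read_i = 0
    for length, op in re.findall(r"(\d+)([MIDNSHP=X])", cigar):
        n = int(length)
        if op in "M=X":
            for k in range(n):
                if read[read_i + k] != reference[ref_i + k]:
                    total += ord(qual[read_i + k]) - 33
            ref_i += n
            read_i += n
        elif op in "IS":
            wanted = count_insertions if op == "I" else count_soft_clips
            if wanted:
                total += sum(ord(c) - 33 for c in qual[read_i:read_i + n])
            read_i += n
        elif op in "DN":       # no read bases, so no quality to add
            ref_i += n
    return total


args = ("ACTGAACCTTGGAAACTGCCGGGGACT", "TTTACTGACTGAAACCATT",
        ">>9>6>6+4;>>453;>>;", "3S4M10N4M3I2M4D3M")
print(sum_quality(*args))                                                # 29
print(sum_quality(*args, count_insertions=True))                         # 97
print(sum_quality(*args, count_insertions=True, count_soft_clips=True))  # 179
```

A read would then be marked as unmapped or filtered when the chosen sum exceeds the configured threshold.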
The results of a call to getRefOffset for each value passed in (where NA stands for INDEX_NA):
queryIndex:  0  1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18 19 (and any value over 19)
Return:     NA NA NA  0  1  2  3 14 15 16 17 NA NA NA 18 19 24 25 26 NA
The results of a call to getQueryIndex for each value passed in (where NA stands for INDEX_NA):
refOffset:  0  1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 (and any value over 27)
Return:     3  4  5  6 NA NA NA NA NA NA NA NA NA NA  7  8  9 10 14 15 NA NA NA NA 16 17 18 NA
The results of a call to getRefPosition passing in start position 5 (where NA stands for INDEX_NA):
queryIndex:  0  1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18 19 (and any value over 19)
Return:     NA NA NA  5  6  7  8 19 20 21 22 NA NA NA 23 24 29 30 31 NA
The results of a call to getQueryIndex using refPosition and start position 5 (where NA stands for INDEX_NA):
refPosition:  5  6  7  8  9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 (and any value over 32)
Return:       3  4  5  6 NA NA NA NA NA NA NA NA NA NA  7  8  9 10 14 15 NA NA NA NA 16 17 18 NA
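The four tables can be derived by expanding the CIGAR string into two lookup lists. The Python sketch below is an illustration only, not the actual implementation of getRefOffset, getQueryIndex or getRefPosition. The snake_case function names are hypothetical stand-ins, and the value -1 for INDEX_NA is an assumption.

```python
import re

INDEX_NA = -1  # assumption: the text only names INDEX_NA, not its value


def build_maps(cigar):
    """Return (query_to_ref, ref_to_query) lists indexed by queryIndex / refOffset."""
    query_to_ref, ref_to_query = [], []
    for length, op in re.findall(r"(\d+)([MIDNSHP=X])", cigar):
        n = int(length)
        if op in "M=X":                      # aligned: map in both directions
            for _ in range(n):
                query_to_ref.append(len(ref_to_query))
                ref_to_query.append(len(query_to_ref) - 1)
        elif op in "IS":                     # read bases with no reference base
            query_to_ref.extend([INDEX_NA] * n)
        elif op in "DN":                     # reference bases with no read base
            ref_to_query.extend([INDEX_NA] * n)
    return query_to_ref, ref_to_query


def lookup(table, index):
    """Return table[index], or INDEX_NA when the index is out of range."""
    return table[index] if 0 <= index < len(table) else INDEX_NA


def get_ref_position(query_to_ref, query_index, start):
    offset = lookup(query_to_ref, query_index)
    return INDEX_NA if offset == INDEX_NA else start + offset


def get_query_index_by_position(ref_to_query, ref_position, start):
    return lookup(ref_to_query, ref_position - start)


q2r, r2q = build_maps("3S4M10N4M3I2M4D3M")
print([lookup(q2r, q) for q in range(20)])               # getRefOffset row
print([lookup(r2q, r) for r in range(28)])               # getQueryIndex row
print([get_ref_position(q2r, q, 5) for q in range(20)])  # getRefPosition row
```

With -1 read as NA, the printed lists match the Return rows above.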