SAM: Filtering Reads

From Genome Analysis Wiki

Two filters have been requested for updating a SAM/BAM file.

A single program should apply both filters in the same run and write an updated SAM/BAM file.

  1. Sum the base qualities of all bases that mismatch between the query and the reference. If the sum exceeds a configurable threshold, mark the read as unmapped or filtered.
  2. Calculate the percentage of mismatches from each end of the read. If it exceeds a threshold, clip the affected end.
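
The two filters can be sketched roughly as follows. The page does not define how the mismatch percentage at an end is measured, so this sketch assumes one possible reading: scan inward from an end and clip the longest stretch whose mismatch fraction is over the threshold. The function names, the per-base `mismatch_flags` input, and the toy data are all hypothetical.

```python
def exceeds_mismatch_quality(mismatch_quals, max_quality_sum):
    """Filter 1: True when the summed mismatch quality is over the threshold."""
    return sum(mismatch_quals) > max_quality_sum


def clip_length(mismatch_flags, max_mismatch_fraction):
    """Filter 2, one end (assumed rule): longest prefix whose mismatch
    fraction is over the threshold, or 0 if no prefix qualifies."""
    clip = 0
    mismatches = 0
    for bases_seen, is_mismatch in enumerate(mismatch_flags, start=1):
        mismatches += int(is_mismatch)
        if mismatches / bases_seen > max_mismatch_fraction:
            clip = bases_seen
    return clip


def clip_both_ends(mismatch_flags, max_mismatch_fraction):
    """Return (bases to clip at the start, bases to clip at the end)."""
    left = clip_length(mismatch_flags, max_mismatch_fraction)
    right = clip_length(mismatch_flags[::-1], max_mismatch_fraction)
    return left, right


# Toy input (hypothetical): 10 bases with mismatches at indexes 0 and 8.
flags = [True, False, False, False, False, False, False, False, True, False]
print(exceeds_mismatch_quality([29, 24], 50))
print(clip_both_ends(flags, 0.4))
```

Under this rule the two clip lengths are computed independently, so a real implementation would also need to decide what to do when they overlap on a heavily mismatched read.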

Example

Reference: ACTGAACCTTGGAAACTGCCGGGGACT
Read: TTTACTGACTGAAACCATT
Qual: >>9>6>6+4;>>453;>>;
CIGAR: 3S4M10N4M3I2M4D3M
POS: 5

This means the read aligns as follows (the Reference shown starts at POS 5):

Reference:    ACTGAACCTTGGAAACTG   CCGGGGACT
Read:      TTTACTG          ACTGAAACC    ATT
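
The layout above can be reproduced by walking the CIGAR string. The following is a minimal sketch, assuming the standard SAM meaning of each operation (M/=/X consume both sequences, I/S consume only the read, D/N consume only the reference) and that `reference` begins at POS. The helper names are hypothetical.

```python
import re


def parse_cigar(cigar):
    """Split a CIGAR string into (length, op) pairs, e.g. '3S4M' -> [(3, 'S'), (4, 'M')]."""
    return [(int(n), op) for n, op in re.findall(r"(\d+)([MIDNSHP=X])", cigar)]


def gapped_alignment(reference, read, cigar):
    """Return (reference_row, read_row), space-padded so aligned bases share a column."""
    ref_row, read_row = [], []
    ref_i = read_i = 0
    for length, op in parse_cigar(cigar):
        if op in "M=X":      # consumes reference and read
            ref_row.append(reference[ref_i:ref_i + length])
            read_row.append(read[read_i:read_i + length])
            ref_i += length
            read_i += length
        elif op in "IS":     # consumes read only
            ref_row.append(" " * length)
            read_row.append(read[read_i:read_i + length])
            read_i += length
        elif op in "DN":     # consumes reference only
            ref_row.append(reference[ref_i:ref_i + length])
            read_row.append(" " * length)
            ref_i += length
        # H and P consume neither sequence
    return "".join(ref_row), "".join(read_row)


ref_row, read_row = gapped_alignment(
    "ACTGAACCTTGGAAACTGCCGGGGACT", "TTTACTGACTGAAACCATT", "3S4M10N4M3I2M4D3M"
)
print("Reference: " + ref_row)
print("Read:      " + read_row)
```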

Adding the position:

RefPos:              5  6  7  8  9 10 11 12 13 14 15 16 17 18 19 20 21 22          23 24 25 26 27 28 29 30 31
Reference:           A  C  T  G  A  A  C  C  T  T  G  G  A  A  A  C  T  G           C  C  G  G  G  G  A  C  T
Read:       T  T  T  A  C  T  G                                A  C  T  G  A  A  A  C  C              A  T  T
Qual:       >  >  9  >  6  >  6                                +  4  ;  >  >  4  5  3  ;              >  >  ;

Adding the offsets:

RefPos:              5  6  7  8  9 10 11 12 13 14 15 16 17 18 19 20 21 22          23 24 25 26 27 28 29 30 31
refOffset:           0  1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17          18 19 20 21 22 23 24 25 26
Reference:           A  C  T  G  A  A  C  C  T  T  G  G  A  A  A  C  T  G           C  C  G  G  G  G  A  C  T
Read:       T  T  T  A  C  T  G                                A  C  T  G  A  A  A  C  C              A  T  T
queryIndex: 0  1  2  3  4  5  6                                7  8  9 10 11 12 13 14 15             16 17 18
Qual:       >  >  9  >  6  >  6                                +  4  ;  >  >  4  5  3  ;              >  >  ;


QUESTIONS:

  • For the mismatch count
    • Should insertions, deletions, skips, and clips be included?
    • Deletions and skips have no quality score, but insertions and clips do. Should those qualities be added to the sum?

Sum Quality (just mismatches) - only queryIndex 17 (read T, reference C) - Quality: > = 62-33 = 29

Sum Quality (mismatches & insertions) - >45> = (62-33) + (52-33) + (53-33) + (62-33) = 29+19+20+29 = 97

Sum Quality (mismatches & insertions & soft clips) - >>9>45> = (62-33) + (62-33) + (57-33) + (62-33) + (52-33) + (53-33) + (62-33) = 29+29+24+29+19+20+29 = 179
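
The three sums above can be checked with a short script. This is a sketch only, assuming Phred+33 quality encoding (as in the arithmetic above) and a reference string that begins at POS. The function name and the dictionary keys are hypothetical.

```python
import re

PHRED_OFFSET = 33  # assumption: qualities are ASCII characters offset by 33


def quality_sums(reference, read, qual, cigar):
    """Sum base qualities separately for mismatches, insertions, and soft clips."""
    sums = {"mismatch": 0, "insertion": 0, "soft_clip": 0}
    ref_i = read_i = 0
    for count, op in re.findall(r"(\d+)([MIDNSHP=X])", cigar):
        length = int(count)
        if op in "M=X":
            # Aligned bases: add the quality only where read and reference differ.
            for k in range(length):
                if read[read_i + k] != reference[ref_i + k]:
                    sums["mismatch"] += ord(qual[read_i + k]) - PHRED_OFFSET
            ref_i += length
            read_i += length
        elif op in "IS":
            # Insertions and soft clips have qualities but no reference base.
            key = "insertion" if op == "I" else "soft_clip"
            sums[key] += sum(ord(c) - PHRED_OFFSET for c in qual[read_i:read_i + length])
            read_i += length
        elif op in "DN":
            # Deletions and skips have no read bases, so no qualities.
            ref_i += length
    return sums


sums = quality_sums(
    "ACTGAACCTTGGAAACTGCCGGGGACT",
    "TTTACTGACTGAAACCATT",
    ">>9>6>6+4;>>453;>>;",
    "3S4M10N4M3I2M4D3M",
)
print(sums["mismatch"])
print(sums["mismatch"] + sums["insertion"])
print(sums["mismatch"] + sums["insertion"] + sums["soft_clip"])
```

Keeping the three categories separate leaves the open questions above as a choice made at the threshold check rather than inside the alignment walk.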


The results of a call to getRefOffset for each value passed in (where NA stands for INDEX_NA):

queryIndex: 0  1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18 19 (and any value over 19)
Return:    NA NA NA  0  1  2  3 14 15 16 17 NA NA NA 18 19 24 25 26 NA
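
A query-index-to-reference-offset lookup that produces this table might be sketched as below. It is a Python stand-in named after the call above, not the actual implementation. The value -1 for `INDEX_NA` is an assumption, since the page only gives the name.

```python
import re

INDEX_NA = -1  # assumption: the page names INDEX_NA but does not give its value


def build_query_to_ref(cigar):
    """Entry i is the reference offset aligned to query index i, or INDEX_NA."""
    query_to_ref = []
    ref_offset = 0
    for count, op in re.findall(r"(\d+)([MIDNSHP=X])", cigar):
        length = int(count)
        if op in "M=X":      # aligned bases map one-to-one
            query_to_ref.extend(range(ref_offset, ref_offset + length))
            ref_offset += length
        elif op in "IS":     # read bases with no reference counterpart
            query_to_ref.extend([INDEX_NA] * length)
        elif op in "DN":     # reference bases with no read counterpart
            ref_offset += length
    return query_to_ref


def get_ref_offset(cigar, query_index):
    """Reference offset for a query index; INDEX_NA if unaligned or out of range."""
    table = build_query_to_ref(cigar)
    if 0 <= query_index < len(table):
        return table[query_index]
    return INDEX_NA


cigar = "3S4M10N4M3I2M4D3M"
print([get_ref_offset(cigar, i) for i in range(20)])
```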

The results of a call to getQueryIndex for each value passed in (where NA stands for INDEX_NA):

refOffset:  0  1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 (and any value over 27)
Return:     3  4  5  6 NA NA NA NA NA NA NA NA NA NA  7  8  9 10 14 15 NA NA NA NA 16 17 18 NA
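
The reverse lookup, from reference offset to query index, can be sketched the same way. This is again a hypothetical Python stand-in, and the value -1 for `INDEX_NA` is an assumption.

```python
import re

INDEX_NA = -1  # assumption: the page names INDEX_NA but does not give its value


def build_ref_to_query(cigar):
    """Entry r is the query index aligned to reference offset r, or INDEX_NA."""
    ref_to_query = []
    query_index = 0
    for count, op in re.findall(r"(\d+)([MIDNSHP=X])", cigar):
        length = int(count)
        if op in "M=X":      # aligned bases map one-to-one
            ref_to_query.extend(range(query_index, query_index + length))
            query_index += length
        elif op in "IS":     # read-only bases advance the query index
            query_index += length
        elif op in "DN":     # deleted or skipped reference bases have no query index
            ref_to_query.extend([INDEX_NA] * length)
    return ref_to_query


def get_query_index(cigar, ref_offset):
    """Query index for a reference offset; INDEX_NA if unaligned or out of range."""
    table = build_ref_to_query(cigar)
    if 0 <= ref_offset < len(table):
        return table[ref_offset]
    return INDEX_NA


cigar = "3S4M10N4M3I2M4D3M"
print([get_query_index(cigar, r) for r in range(28)])
```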

The results of a call to getRefPosition passing in start position 5 (where NA stands for INDEX_NA):

queryIndex: 0  1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18 19 (and any value over 19)
Return:    NA NA NA  5  6  7  8 19 20 21 22 NA NA NA 23 24 29 30 31 NA

The results of a call to getQueryIndex using refPosition and start position 5 (where NA stands for INDEX_NA):

refPosition:5  6  7  8  9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 (and any value over 32)
Return:     3  4  5  6 NA NA NA NA NA NA NA NA NA NA  7  8  9 10 14 15 NA NA NA NA 16 17 18 NA
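
The two position-based calls differ from the offset-based ones only by the start position: a reference position is the start position plus the reference offset. A minimal sketch follows, with hypothetical Python function names and -1 assumed for `INDEX_NA`.

```python
import re

INDEX_NA = -1  # assumption: the page names INDEX_NA but does not give its value


def build_maps(cigar):
    """Return (query_to_ref, ref_to_query) offset tables for a CIGAR string."""
    query_to_ref, ref_to_query = [], []
    for count, op in re.findall(r"(\d+)([MIDNSHP=X])", cigar):
        length = int(count)
        if op in "M=X":
            # The current list lengths are the next query index and reference offset.
            q0, r0 = len(query_to_ref), len(ref_to_query)
            query_to_ref.extend(range(r0, r0 + length))
            ref_to_query.extend(range(q0, q0 + length))
        elif op in "IS":
            query_to_ref.extend([INDEX_NA] * length)
        elif op in "DN":
            ref_to_query.extend([INDEX_NA] * length)
    return query_to_ref, ref_to_query


def get_ref_position(cigar, query_index, start):
    """Reference position for a query index, or INDEX_NA."""
    query_to_ref, _ = build_maps(cigar)
    if not 0 <= query_index < len(query_to_ref):
        return INDEX_NA
    offset = query_to_ref[query_index]
    return INDEX_NA if offset == INDEX_NA else start + offset


def get_query_index_at_position(cigar, ref_position, start):
    """Query index for a reference position, or INDEX_NA."""
    _, ref_to_query = build_maps(cigar)
    offset = ref_position - start
    if 0 <= offset < len(ref_to_query):
        return ref_to_query[offset]
    return INDEX_NA


cigar = "3S4M10N4M3I2M4D3M"
print([get_ref_position(cigar, i, 5) for i in range(20)])
print([get_query_index_at_position(cigar, p, 5) for p in range(5, 33)])
```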