BamUtil: filter

From Genome Analysis Wiki
Revision as of 14:18, 2 September 2011 by Mktrost (talk | contribs) (moved BamUtil: Filter to BamUtil: filter: Rename to lowercase name)
Jump to navigationJump to search


Overview of the filter function of bamUtil

The filter option on the bamUtil executable writes the alignments, filtering them by clipping ends with too high of a mismatch percentage and by marking reads unmapped if the quality of mismatches is too high.

The following modifications may occur in an alignment:

  • CIGAR updated to reflect clips
  • POS updated to reflect a new CIGAR if clipping occurs at the front of a read
  • FLAG updated to reflect a read is unmapped if it is below the quality of mismatches is too high, or clipping would cause an entire read to be clipped.

The POS and FLAG fields of an alignment are reflected in the mate's alignment. Thus, the mate also needs to be updated.

Also, if the file was sorted, and a POS was changed, the file may no longer be sorted.

NOTE: This program does NOT update the mate or resort the file.

Fixing the mate/resorting

In order to update the mate, samtools fixmate must be run.

In order to reorder the file, samtools sort must be run.

Notes about the samtools programs:

  • samtools fixmate requires the file to be sorted by query name.
  • samtools sort cannot write to pipes.

Steps

  1. Run this program and pipe it into samtools sort by query name
    • ./bam filter --in <your InputFile> --refFile <your reference file> --out -.bam <any other options> | samtools sort -n - tempQuerySort
  2. Run samtools fixmate and pipe it into samtools sort by position
    •  samtools fixmate tempQuerySort.bam - | samtools sort - finalResult

Example

~/pipeFilter/bam/bam filter --in ../../originalBamFile.bam --refFile ~/data/human.g1k.v37.fa --out -.bam | samtools sort -n - tempQuerySort; samtools fixmate tempQuerySort.bam - | samtools sort - newResult


Parameters

    Required Parameters:
        --in       : the SAM/BAM file to be read
        --refFile  : the reference file
        --out      : the SAM/BAM file to write to
    Optional Parameters:
        --noeof             : do not expect an EOF block on a bam file.
        --qualityThreshold  : maximum sum of the mismatch qualities before marking
                              a read unmapped. (Defaults to 60)
        --defaultQualityInt : quality value to use for mismatches that do not have a quality
                              (Defaults to 20)
        --mismatchThreshold : decimal value indicating the maximum ration of mismatches to
                              matches and mismatches allowed before clipping from the ends
                              (Defaults to .10)
        --params            : print the parameter settings

Usage

./bam filter --in <inputFilename>  --refFile <referenceFilename>  --out <outputFilename> [--noeof] [--qualityThreshold <qualThresh>] [--defaultQualityInt <defaultQual>] [--mismatchThreshold <mismatchThresh>] [--params]

Return Value

  • 0: all records are successfully read and written.
  • non-0: at least one record was not successfully read or written.

Example Output


open and prefetch reference genome /home/mktrost/data/human.g1k.v37.fa: done.
Number of Reads Clipped by Filtering: 704578
Number of Reads Filtered Due to MismatchThreshold: 0
Number of Reads Filtered Due to QualityThreshold: 13064
[bam_sort_core] merging from 3 files...
[bam_sort_core] merging from 3 files...


FAQ

This section contains information about what the filters mean, how they work, etc.

What counts as a match or a mismatch?

Only bases with a cigar values of M, X, or = can be counted as matches or mismatches. Cigar I, D, N, S, P, and H are NOT matches or mismatches.
If either the reference or the read are 'N' or 'n', it is NOT a match or mismatch.
It is a match if the reference and the read have the same base (both A/a, both C/c, both G/g, or both T/t). If they are different, non-'N', bases it is a mismatch.

What type of clip is used?

When a read is clipped due to exceeding the mismatch threshold, a softclip is used (Cigar S).

What should the quality count as if the quality string is '*'?

That is the purpose of the defaultQualityInt parameter.

Is the quality threshold or the clipping checked first?

First the mismatch threshold is checked so clipping is performed. Then the quality threshold is checked on the non-clipped mismatches. That way mismatches that are clipped do not go into the sum of the mismatch qualities.

How is the quality threshold checked?

Easy, just loop through an alignment finding mismatches. Add up the quality of the mismatches. If they sum is greater than the quality threshold, mark the read as unmapped.

How is the mismatch threshold checked?

This is requires a bit more logic...and thus gets its own section: Mismatch Threshold.

Mismatch Threshold

Mismatch threshold is:

But is that total number of mismatches in the entire alignment?

In processing, it is just the numbers for the bases that have been read so far.

For mismatch threshold, the logic goes in 1 match/mismatch base at a time from each end of the read and checks its current processing against the threshold.

If the first base is a mismatch, it is

This base will be clipped since 1 is > than the mismatch threshold (assuming that wasn't set to 1).

If the first base is a match, it is

At this point, this base will not be clipped since it is not greater than the mismatch threshold.

When a clip occurs, NumMatches & NumMismatches are reset to 0 (those bases are now clipped and do not count as a match or a mismatch).

To try to minimize the number of bases that are clipped, the logic keeps the NumMatches + NumMismatches when reading from the front and NumMatches + NumMismatches when reading from the back within 1 of each other.

If the mismatch threshold is 10%, it means that none of the 1st or last 10 bases of the updated read will be a mismatch.

If one of the 1st 10 bases is a mismatch, it will be clipped.

If one of the next 10 bases from that clip is a mismatch, it will also be clipped.


For example consider the following strings of matches & mismatches (M indicates match, X indicates mismatch)

MMMXXXMXXM

Use a mismatch threshold of .50 The logic will read alternating, with '-' indicating which bases were processed from the front, '|' reads processed from the back, and '+' reads processed in both directions. '*' indicates clipped bases.

M M M X X X M X X M
-

Match was found from the front, so basesFromFront = 1, basesFromBack = 0, so now read from the back.

M M M X X X M X X M
-                 |

Match was found from the back, so basesFromFront = 1, basesFromBack = 1, so read again from the back.

M M M X X X M X X M
-               | |

From back, 1/2 = .5, so is not over the .5 threshold; basesFromFront = 1, basesFromBack = 2, so read from the front.

M M M X X X M X X M
- -             | |

From front, 0/2 < .5, so is not over the .5 threshold; basesFromFront = 2, basesFromBack = 2, so read again from the front.

M M M X X X M X X M
- - -           | |

From front, 0/3 < .5, so is not over the .5 threshold; basesFromFront = 3, basesFromBack = 2, so read from the back.

M M M X X X M X X M
- - -         | | |

From back, 2/3 > .5, so over the .5 threshold; need to clip...basesFromFront = 3, basesFromBack = 0, so read from the back.

M M M X X X M X X M
- - -       | * * *

From back, 0/1 < .5, so is not over the .5 threshold; basesFromFront = 3, basesFromBack = 1, so read from the back.

M M M X X X M X X M
- - -     | | * * *

From back, 1/2 = .5, so is not over the .5 threshold; basesFromFront = 3, basesFromBack = 2, so read from the back.

M M M X X X M X X M
- - -   | | | * * *

From back, 2/3 > .5, so over the .5 threshold; need to clip...basesFromFront = 3, basesFromBack = 0, so read from the back.

M M M X X X M X X M
- - - | * * * * * *

From back, 1/1 > .5, so over the .5 threshold; need to clip...basesFromFront = 3, basesFromBack = 0, so read from the back.

M M M X X X M X X M
- - + * * * * * * *

From back, 0/1 < .5, so is not over the .5 threshold; basesFromFront = 3, basesFromBack = 1, so read from the back.

M M M X X X M X X M
- + + * * * * * * *

From back, 0/2 < .5, so is not over the .5 threshold; basesFromFront = 3, basesFromBack = 2, so read from the back.

M M M X X X M X X M
+ + + * * * * * * *

From back, 0/3 < .5, so is not over the .5 threshold; basesFromFront = 3, basesFromBack = 3, so read from the back.

In this example, there is no more to read from the front or the back. The new Cigar is 3M7S