Difference between revisions of "BamUtil: filter"

From Genome Analysis Wiki
Jump to navigationJump to search
 
(11 intermediate revisions by the same user not shown)
Line 1: Line 1:
[[Category:libbam]]
+
[[Category:BamUtil|filter]]
 +
[[Category:BAM Software]]
 +
[[Category:Software]]
  
== filter ==
+
= Overview of the <code>filter</code> function of <code>bamUtil</code> =
 
+
The <code>filter</code> option on the [[bamUtil]] executable writes the alignments, filtering them by clipping ends with too high of a mismatch percentage and by marking reads unmapped if the quality of mismatches is too high.
The <code>filter</code> option on the [[Bam|bam executable]] writes the alignments filtering them by clipping ends with too high of a mismatch percentage and by marking reads unmapped if the quality of mismatches is too high.
 
  
 
The following modifications may occur in an alignment:
 
The following modifications may occur in an alignment:
Line 10: Line 11:
 
* FLAG updated to reflect a read is unmapped if it is below the quality of mismatches is too high, or clipping would cause an entire read to be clipped.
 
* FLAG updated to reflect a read is unmapped if it is below the quality of mismatches is too high, or clipping would cause an entire read to be clipped.
  
=== NOTES ===
+
The POS and FLAG fields of an alignment are reflected in the mate's alignment.  Thus, the mate also needs to be updated.
The POS and FLAG fields of an alignment are reflected in the mate's alignment.  Thus, when the mate also needs to be updated.
 
  
 
Also, if the file was sorted, and a POS was changed, the file may no longer be sorted.
 
Also, if the file was sorted, and a POS was changed, the file may no longer be sorted.
  
'''NOTE: This program does NOT update the mate or resort the file.'''
+
'''NOTE: This program does NOT update the mate or re-sort the file.'''
  
 +
== Fixing the mate/resorting ==
 
In order to update the mate, samtools fixmate must be run.  
 
In order to update the mate, samtools fixmate must be run.  
  
Line 25: Line 26:
 
* samtools sort cannot write to pipes.
 
* samtools sort cannot write to pipes.
  
Steps:
+
===Steps===
 
# Run this program and pipe it into samtools sort by query name
 
# Run this program and pipe it into samtools sort by query name
 
#* <pre>./bam filter --in <your InputFile> --refFile <your reference file> --out -.bam <any other options> | samtools sort -n - tempQuerySort</pre>
 
#* <pre>./bam filter --in <your InputFile> --refFile <your reference file> --out -.bam <any other options> | samtools sort -n - tempQuerySort</pre>
Line 31: Line 32:
 
#* <pre> samtools fixmate tempQuerySort.bam - | samtools sort - finalResult</pre>
 
#* <pre> samtools fixmate tempQuerySort.bam - | samtools sort - finalResult</pre>
  
For Example:
+
===Example===
 
  ~/pipeFilter/bam/bam filter --in ../../originalBamFile.bam --refFile ~/data/human.g1k.v37.fa --out -.bam | samtools sort -n - tempQuerySort; samtools fixmate tempQuerySort.bam - | samtools sort - newResult
 
  ~/pipeFilter/bam/bam filter --in ../../originalBamFile.bam --refFile ~/data/human.g1k.v37.fa --out -.bam | samtools sort -n - tempQuerySort; samtools fixmate tempQuerySort.bam - | samtools sort - newResult
  
  
=== Parameters ===
+
= Usage =
 +
 
 +
./bam filter --in <inputFilename>  --refFile <referenceFilename>  --out <outputFilename> [--noeof] [--qualityThreshold <qualThresh>] [--defaultQualityInt <defaultQual>] [--mismatchThreshold <mismatchThresh>] [--params]
 +
 
 +
= Parameters =
 
<pre>
 
<pre>
Required Parameters:
+
    Required Parameters:
--in      : the SAM/BAM file to be read
+
        --in      : the SAM/BAM file to be read
--refFile  : the reference file
+
        --refFile  : the reference file
--out      : the SAM/BAM file to write to
+
        --out      : the SAM/BAM file to write to
Optional Parameters:
+
    Optional Parameters:
--noeof            : do not expect an EOF block on a bam file.
+
        --noeof            : do not expect an EOF block on a bam file.
--qualityThreshold  : maximum sum of the mismatch qualities before marking
+
        --qualityThreshold  : maximum sum of the mismatch qualities before marking
                      a read unmapped. (Defaults to 60)
+
                              a read unmapped. (Defaults to 60)
--defaultQualityInt : quality value to use for mismatches that do not have a quality
+
        --defaultQualityInt : quality value to use for mismatches that do not have a quality
                      (Defaults to 20)
+
                              (Defaults to 20)
--mismatchThreshold : decimal value indicating the maximum ration of mismatches to
+
        --mismatchThreshold : decimal value indicating the maximum ratio of mismatches to
                      matches and mismatches allowed before clipping from the ends
+
                              matches and mismatches allowed before clipping from the ends
                      (Defaults to .10)
+
                              (Defaults to .10)
 +
        --params            : print the parameter settings
 +
</pre>
 +
{{PhoneHomeParamDesc}}
 +
 
 +
== Required Parameters==
 +
{{InBAMInputFile}}
 +
{{refFile}}
 +
{{OutBAMOutputFile}}
 +
 
 +
== Optional Parameters ==
 +
{{noeofBGZFParameter}}
 +
=== Quality Threshold (<code>--qualityThreshold</code>) ===
 +
In <code>filter</code>, when the sum of the mismatching base qualities is higher than the <code>--qualityThreshold</code>, the read is marked as unmapped.  The default threshold is 60.
 +
 
 +
=== Default Quality (<code>--defaultQualityInt</code>) ===
 +
<code>filter</code> filters reads based on the sum of the base qualities of mismatches.  Some reads, however, do not have base qualities.  Use <code>--defaultQualityInt</code> to specify the base qualities to use for mismatches that do not have quality values.  The default is 20.
  
</pre>
+
=== Mismatch Threshold (<code>--mismatchThreshold</code>) ===
 +
<code>filter</code> clips the ends of reads if the ratio of mismatches to matches and mismatches is higher than the decimal parameter, <code>--mismatchThreshold</code>.  The default is .10.
 +
 
 +
{{paramsParameter}}
  
=== Usage ===
+
{{PhoneHomeParameters}}
  
./bam filter --in <inputFilename>  --refFile <referenceFilename>  --out <outputFilename> [--noeof] [--qualityThreshold <qualThresh>] [--defaultQualityInt <defaultQual>] [--mismatchThreshold <mismatchThresh>]
 
  
=== Return Value ===
+
= Return Value =
 
*    0: all records are successfully read and written.
 
*    0: all records are successfully read and written.
 
* non-0: at least one record was not successfully read or written.
 
* non-0: at least one record was not successfully read or written.
  
=== Example Output ===
+
= Example Output =
 
<pre>
 
<pre>
The following parameters are available.  Ones with "[]" are in effect:
 
 
Input Parameters
 
--in [../../originalBamFile.bam],
 
--out [-.bam], --refFile [/home/mktrost/data/human.g1k.v37.fa], --noeof,
 
--qualityThreshold [60], --defaultQualityInt [20], --mismatchThreshold [0.10]
 
  
 
open and prefetch reference genome /home/mktrost/data/human.g1k.v37.fa: done.
 
open and prefetch reference genome /home/mktrost/data/human.g1k.v37.fa: done.
Line 79: Line 96:
  
  
=== In Case You Are Curious ===
+
= FAQ =
 
This section contains information about what the filters mean, how they work, etc.
 
This section contains information about what the filters mean, how they work, etc.
  
Line 86: Line 103:
 
:If either the reference or the read are 'N' or 'n', it is NOT a match or mismatch.
 
:If either the reference or the read are 'N' or 'n', it is NOT a match or mismatch.
 
:It is a match if the reference and the read have the same base (both A/a, both C/c, both G/g, or both T/t).  If they are different, non-'N', bases it is a mismatch.
 
:It is a match if the reference and the read have the same base (both A/a, both C/c, both G/g, or both T/t).  If they are different, non-'N', bases it is a mismatch.
 +
 +
'''What type of clip is used?'''
 +
:When a read is clipped due to exceeding the mismatch threshold, a softclip is used (Cigar S).
  
 
'''What should the quality count as if the quality string is '*'?'''
 
'''What should the quality count as if the quality string is '*'?'''
 
:That is the purpose of the <code>defaultQualityInt</code> parameter.
 
:That is the purpose of the <code>defaultQualityInt</code> parameter.
 +
 +
'''Is the quality threshold or the clipping checked first?'''
 +
:First the mismatch threshold is checked so clipping is performed.  Then the quality threshold is checked on the non-clipped mismatches.  That way mismatches that are clipped do not go into the sum of the mismatch qualities.
 +
 +
'''How is the quality threshold checked?'''
 +
:Easy, just loop through an alignment finding mismatches.  Add up the quality of the mismatches.  If they sum is greater than the quality threshold, mark the read as unmapped.
 +
 +
'''How is the mismatch threshold checked?'''
 +
:This is requires a bit more logic...and thus gets its own section: [[Bam Executable: Filter#Mismatch Threshold|Mismatch Threshold]].
 +
 +
= Mismatch Threshold =
 +
 +
Mismatch threshold is:
 +
:<math>NumMismatches \over NumMismatches + NumMatches</math>
 +
 +
But is that total number of mismatches in the entire alignment?
 +
:In processing, it is just the numbers for the bases that have been read so far.
 +
 +
For mismatch threshold, the logic goes in 1 match/mismatch base at a time from each end of the read and checks its current processing against the threshold.
 +
 +
If the first base is a mismatch, it is
 +
:<math>{1 \over 1 + 0} = {1 \over 1} = 1</math>
 +
This base will be clipped since 1 is > than the mismatch threshold (assuming that wasn't set to 1).
 +
 +
If the first base is a match, it is
 +
:<math>{0 \over 1 + 0} = {0 \over 1} = 0</math>
 +
At this point, this base will not be clipped since it is not greater than the mismatch threshold.
 +
 +
When a clip occurs, NumMatches & NumMismatches are reset to 0 (those bases are now clipped and do not count as a match or a mismatch).
 +
 +
To try to minimize the number of bases that are clipped, the logic keeps the NumMatches + NumMismatches when reading from the front and NumMatches + NumMismatches when reading from the back within 1 of each other.
 +
 +
If the mismatch threshold is 10%, it means that none of the 1st or last 10 bases of the updated read will be a mismatch.
 +
 +
If one of the 1st 10 bases is a mismatch, it will be clipped.
 +
 +
If one of the next 10 bases from that clip is a mismatch, it will also be clipped.
 +
 +
 +
For example consider the following strings of matches & mismatches (M indicates match, X indicates mismatch)
 +
MMMXXXMXXM
 +
Use a mismatch threshold of .50
 +
The logic will read alternating, with '-' indicating which bases were processed from the front, '|' reads processed from the back, and '+' reads processed in both directions. '*' indicates clipped bases.
 +
M M M X X X M X X M
 +
-
 +
 +
Match was found from the front, so basesFromFront = 1, basesFromBack = 0, so now read from the back.
 +
 +
M M M X X X M X X M
 +
-                |
 +
Match was found from the back, so basesFromFront = 1, basesFromBack = 1, so read again from the back.
 +
M M M X X X M X X M
 +
-              | |
 +
From back, 1/2 = .5, so is not over the .5 threshold;  basesFromFront = 1, basesFromBack = 2, so read from the front.
 +
M M M X X X M X X M
 +
- -            | |
 +
From front, 0/2 < .5, so is not over the .5 threshold;  basesFromFront = 2, basesFromBack = 2, so read again from the front.
 +
M M M X X X M X X M
 +
- - -          | |
 +
From front, 0/3 < .5, so is not over the .5 threshold;  basesFromFront = 3, basesFromBack = 2, so read from the back.
 +
M M M X X X M X X M
 +
- - -        | | |
 +
From back, 2/3 > .5, so over the .5 threshold;  need to clip...basesFromFront = 3, basesFromBack = 0, so read from the back.
 +
M M M X X X M X X M
 +
- - -      | * * *
 +
From back, 0/1 < .5, so is not over the .5 threshold;  basesFromFront = 3, basesFromBack = 1, so read from the back.
 +
M M M X X X M X X M
 +
- - -    | | * * *
 +
From back, 1/2 = .5, so is not over the .5 threshold;  basesFromFront = 3, basesFromBack = 2, so read from the back.
 +
M M M X X X M X X M
 +
- - -  | | | * * *
 +
From back, 2/3 > .5, so over the .5 threshold;  need to clip...basesFromFront = 3, basesFromBack = 0, so read from the back.
 +
M M M X X X M X X M
 +
- - - | * * * * * *
 +
From back, 1/1 > .5, so over the .5 threshold;  need to clip...basesFromFront = 3, basesFromBack = 0, so read from the back.
 +
M M M X X X M X X M
 +
- - + * * * * * * *
 +
From back, 0/1 < .5, so is not over the .5 threshold;  basesFromFront = 3, basesFromBack = 1, so read from the back.
 +
M M M X X X M X X M
 +
- + + * * * * * * *
 +
From back, 0/2 < .5, so is not over the .5 threshold;  basesFromFront = 3, basesFromBack = 2, so read from the back.
 +
M M M X X X M X X M
 +
+ + + * * * * * * *
 +
From back, 0/3 < .5, so is not over the .5 threshold;  basesFromFront = 3, basesFromBack = 3, so read from the back.
 +
 +
In this example, there is no more to read from the front or the back.
 +
The new Cigar is 3M7S

Latest revision as of 14:07, 6 January 2014


Overview of the filter function of bamUtil

The filter option on the bamUtil executable writes the alignments, filtering them by clipping ends with too high of a mismatch percentage and by marking reads unmapped if the quality of mismatches is too high.

The following modifications may occur in an alignment:

  • CIGAR updated to reflect clips
  • POS updated to reflect a new CIGAR if clipping occurs at the front of a read
  • FLAG updated to reflect a read is unmapped if it is below the quality of mismatches is too high, or clipping would cause an entire read to be clipped.

The POS and FLAG fields of an alignment are reflected in the mate's alignment. Thus, the mate also needs to be updated.

Also, if the file was sorted, and a POS was changed, the file may no longer be sorted.

NOTE: This program does NOT update the mate or re-sort the file.

Fixing the mate/resorting

In order to update the mate, samtools fixmate must be run.

In order to reorder the file, samtools sort must be run.

Notes about the samtools programs:

  • samtools fixmate requires the file to be sorted by query name.
  • samtools sort cannot write to pipes.

Steps

  1. Run this program and pipe it into samtools sort by query name
    • ./bam filter --in <your InputFile> --refFile <your reference file> --out -.bam <any other options> | samtools sort -n - tempQuerySort
  2. Run samtools fixmate and pipe it into samtools sort by position
    •  samtools fixmate tempQuerySort.bam - | samtools sort - finalResult

Example

~/pipeFilter/bam/bam filter --in ../../originalBamFile.bam --refFile ~/data/human.g1k.v37.fa --out -.bam | samtools sort -n - tempQuerySort; samtools fixmate tempQuerySort.bam - | samtools sort - newResult


Usage

./bam filter --in <inputFilename>  --refFile <referenceFilename>  --out <outputFilename> [--noeof] [--qualityThreshold <qualThresh>] [--defaultQualityInt <defaultQual>] [--mismatchThreshold <mismatchThresh>] [--params]

Parameters

    Required Parameters:
        --in       : the SAM/BAM file to be read
        --refFile  : the reference file
        --out      : the SAM/BAM file to write to
    Optional Parameters:
        --noeof             : do not expect an EOF block on a bam file.
        --qualityThreshold  : maximum sum of the mismatch qualities before marking
                              a read unmapped. (Defaults to 60)
        --defaultQualityInt : quality value to use for mismatches that do not have a quality
                              (Defaults to 20)
        --mismatchThreshold : decimal value indicating the maximum ratio of mismatches to
                              matches and mismatches allowed before clipping from the ends
                              (Defaults to .10)
        --params            : print the parameter settings
	PhoneHome:
		--noPhoneHome       : disable PhoneHome (default enabled)
		--phoneHomeThinning : adjust the PhoneHome thinning parameter (default 50)

Required Parameters

Input File (--in)

Use --in followed by your file name to specify the SAM/BAM input file.

The program automatically determines if your input file is SAM/BAM/uncompressed BAM without any input other than a filename from the user, unless your input file is stdin.

A - is used to indicate to read from stdin and the extension is used to determine the file type (no extension indicates SAM).

SAM/BAM/Uncompressed BAM from file --in yourFileName
SAM from stdin --in -
BAM from stdin --in -.bam
Uncompressed BAM from stdin --in -.ubam


Note: Uncompressed BAM is compressed using compression level-0 (so it is not an entirely uncompressed file). This matches the samtools implementation so pipes between our tools and samtools are supported.

Reference File (--refFile)

Use --refFile followed by the reference file name to specify the reference sequence file.

Output File (--out)

Use --out followed by your file name to specify the SAM/BAM output file.

The file extension is used to determine whether to write SAM/BAM/uncompressed BAM. A - is used to indicate stdout and the extension for file type (no extension is SAM).

SAM to file --out yourFileName.sam
BAM to file --out yourFileName.bam
Uncompressed BAM to file --out yourFileName.ubam
SAM to stdout --out -
BAM to stdout --out -.bam
Uncompressed BAM to stdout --out -.ubam


Note: Uncompressed BAM is compressed using compression level-0 (so it is not an entirely uncompressed file). This matches the samtools implementation so pipes between our tools and samtools are supported.

Optional Parameters

Do not require BGZF EOF block (--noeof)

Use --noeof if you do not expect a trailing eof block in your bgzf file.

By default, the trailing empty block is expected and checked for.

Quality Threshold (--qualityThreshold)

In filter, when the sum of the mismatching base qualities is higher than the --qualityThreshold, the read is marked as unmapped. The default threshold is 60.

Default Quality (--defaultQualityInt)

filter filters reads based on the sum of the base qualities of mismatches. Some reads, however, do not have base qualities. Use --defaultQualityInt to specify the base qualities to use for mismatches that do not have quality values. The default is 20.

Mismatch Threshold (--mismatchThreshold)

filter clips the ends of reads if the ratio of mismatches to matches and mismatches is higher than the decimal parameter, --mismatchThreshold. The default is .10.

Print the Program Parameters (--params)

Use --params to print the parameters for your program to stderr.

PhoneHome Parameters

See PhoneHome for more information on how PhoneHome works and what it does.

Turn off PhoneHome (--noPhoneHome)

Use the --noPhoneHome option to completely disable PhoneHome. PhoneHome is enabled by default based on the thinning parameter.

Adjust the Frequency of PhoneHome (--phoneHomeThinning)

Use --phoneHomeThinning to modify the percentage of the time that PhoneHome will run (0-100).

  • By default, --phoneHomeThinning is set to 50, running 50% of the time.
  • PhoneHome will only occur if the run's random number modulo 100 is less than the --phoneHomeThinning value.
  • N/A if --noPhoneHome is set.


Return Value

  • 0: all records are successfully read and written.
  • non-0: at least one record was not successfully read or written.

Example Output


open and prefetch reference genome /home/mktrost/data/human.g1k.v37.fa: done.
Number of Reads Clipped by Filtering: 704578
Number of Reads Filtered Due to MismatchThreshold: 0
Number of Reads Filtered Due to QualityThreshold: 13064
[bam_sort_core] merging from 3 files...
[bam_sort_core] merging from 3 files...


FAQ

This section contains information about what the filters mean, how they work, etc.

What counts as a match or a mismatch?

Only bases with a cigar values of M, X, or = can be counted as matches or mismatches. Cigar I, D, N, S, P, and H are NOT matches or mismatches.
If either the reference or the read are 'N' or 'n', it is NOT a match or mismatch.
It is a match if the reference and the read have the same base (both A/a, both C/c, both G/g, or both T/t). If they are different, non-'N', bases it is a mismatch.

What type of clip is used?

When a read is clipped due to exceeding the mismatch threshold, a softclip is used (Cigar S).

What should the quality count as if the quality string is '*'?

That is the purpose of the defaultQualityInt parameter.

Is the quality threshold or the clipping checked first?

First the mismatch threshold is checked so clipping is performed. Then the quality threshold is checked on the non-clipped mismatches. That way mismatches that are clipped do not go into the sum of the mismatch qualities.

How is the quality threshold checked?

Easy, just loop through an alignment finding mismatches. Add up the quality of the mismatches. If they sum is greater than the quality threshold, mark the read as unmapped.

How is the mismatch threshold checked?

This is requires a bit more logic...and thus gets its own section: Mismatch Threshold.

Mismatch Threshold

Mismatch threshold is:

But is that total number of mismatches in the entire alignment?

In processing, it is just the numbers for the bases that have been read so far.

For mismatch threshold, the logic goes in 1 match/mismatch base at a time from each end of the read and checks its current processing against the threshold.

If the first base is a mismatch, it is

This base will be clipped since 1 is > than the mismatch threshold (assuming that wasn't set to 1).

If the first base is a match, it is

At this point, this base will not be clipped since it is not greater than the mismatch threshold.

When a clip occurs, NumMatches & NumMismatches are reset to 0 (those bases are now clipped and do not count as a match or a mismatch).

To try to minimize the number of bases that are clipped, the logic keeps the NumMatches + NumMismatches when reading from the front and NumMatches + NumMismatches when reading from the back within 1 of each other.

If the mismatch threshold is 10%, it means that none of the 1st or last 10 bases of the updated read will be a mismatch.

If one of the 1st 10 bases is a mismatch, it will be clipped.

If one of the next 10 bases from that clip is a mismatch, it will also be clipped.


For example consider the following strings of matches & mismatches (M indicates match, X indicates mismatch)

MMMXXXMXXM

Use a mismatch threshold of .50 The logic will read alternating, with '-' indicating which bases were processed from the front, '|' reads processed from the back, and '+' reads processed in both directions. '*' indicates clipped bases.

M M M X X X M X X M
-

Match was found from the front, so basesFromFront = 1, basesFromBack = 0, so now read from the back.

M M M X X X M X X M
-                 |

Match was found from the back, so basesFromFront = 1, basesFromBack = 1, so read again from the back.

M M M X X X M X X M
-               | |

From back, 1/2 = .5, so is not over the .5 threshold; basesFromFront = 1, basesFromBack = 2, so read from the front.

M M M X X X M X X M
- -             | |

From front, 0/2 < .5, so is not over the .5 threshold; basesFromFront = 2, basesFromBack = 2, so read again from the front.

M M M X X X M X X M
- - -           | |

From front, 0/3 < .5, so is not over the .5 threshold; basesFromFront = 3, basesFromBack = 2, so read from the back.

M M M X X X M X X M
- - -         | | |

From back, 2/3 > .5, so over the .5 threshold; need to clip...basesFromFront = 3, basesFromBack = 0, so read from the back.

M M M X X X M X X M
- - -       | * * *

From back, 0/1 < .5, so is not over the .5 threshold; basesFromFront = 3, basesFromBack = 1, so read from the back.

M M M X X X M X X M
- - -     | | * * *

From back, 1/2 = .5, so is not over the .5 threshold; basesFromFront = 3, basesFromBack = 2, so read from the back.

M M M X X X M X X M
- - -   | | | * * *

From back, 2/3 > .5, so over the .5 threshold; need to clip...basesFromFront = 3, basesFromBack = 0, so read from the back.

M M M X X X M X X M
- - - | * * * * * *

From back, 1/1 > .5, so over the .5 threshold; need to clip...basesFromFront = 3, basesFromBack = 0, so read from the back.

M M M X X X M X X M
- - + * * * * * * *

From back, 0/1 < .5, so is not over the .5 threshold; basesFromFront = 3, basesFromBack = 1, so read from the back.

M M M X X X M X X M
- + + * * * * * * *

From back, 0/2 < .5, so is not over the .5 threshold; basesFromFront = 3, basesFromBack = 2, so read from the back.

M M M X X X M X X M
+ + + * * * * * * *

From back, 0/3 < .5, so is not over the .5 threshold; basesFromFront = 3, basesFromBack = 3, so read from the back.

In this example, there is no more to read from the front or the back. The new Cigar is 3M7S