Changes

From Genome Analysis Wiki
Jump to navigationJump to search
8,642 bytes added ,  14:07, 6 January 2014
no edit summary
Line 1: Line 1: −
[[Category:libbam]]
+
[[Category:BamUtil|filter]]
 +
[[Category:BAM Software]]
 +
[[Category:Software]]
   −
== filter ==
+
= Overview of the <code>filter</code> function of <code>bamUtil</code> =
 +
The <code>filter</code> option on the [[bamUtil]] executable writes the alignments, filtering them by clipping ends with too high of a mismatch percentage and by marking reads unmapped if the quality of mismatches is too high.
   −
The <code>filter</code> option on the [[Bam|bam executable]] writes the alignments filtering them by clipping ends with too high of a mismatch percentage and by marking reads unmapped if the quality of mismatches is too high.
+
The following modifications may occur in an alignment:
 +
* CIGAR updated to reflect clips
 +
* POS updated to reflect a new CIGAR if clipping occurs at the front of a read
 +
* FLAG updated to reflect a read is unmapped if it is below the quality of mismatches is too high, or clipping would cause an entire read to be clipped.
    +
The POS and FLAG fields of an alignment are reflected in the mate's alignment.  Thus, the mate also needs to be updated.
   −
=== Parameters ===
+
Also, if the file was sorted, and a POS was changed, the file may no longer be sorted.
 +
 
 +
'''NOTE: This program does NOT update the mate or re-sort the file.'''
 +
 
 +
== Fixing the mate/resorting ==
 +
In order to update the mate, samtools fixmate must be run.
 +
 
 +
In order to reorder the file, samtools sort must be run.
 +
 
 +
Notes about the samtools programs:
 +
* samtools fixmate requires the file to be sorted by query name.
 +
* samtools sort cannot write to pipes.
 +
 
 +
===Steps===
 +
# Run this program and pipe it into samtools sort by query name
 +
#* <pre>./bam filter --in <your InputFile> --refFile <your reference file> --out -.bam <any other options> | samtools sort -n - tempQuerySort</pre>
 +
# Run samtools fixmate and pipe it into samtools sort by position
 +
#* <pre> samtools fixmate tempQuerySort.bam - | samtools sort - finalResult</pre>
 +
 
 +
===Example===
 +
~/pipeFilter/bam/bam filter --in ../../originalBamFile.bam --refFile ~/data/human.g1k.v37.fa --out -.bam | samtools sort -n - tempQuerySort; samtools fixmate tempQuerySort.bam - | samtools sort - newResult
 +
 
 +
 
 +
= Usage =
 +
 
 +
./bam filter --in <inputFilename>  --refFile <referenceFilename>  --out <outputFilename> [--noeof] [--qualityThreshold <qualThresh>] [--defaultQualityInt <defaultQual>] [--mismatchThreshold <mismatchThresh>] [--params]
 +
 
 +
= Parameters =
 
<pre>
 
<pre>
 
     Required Parameters:
 
     Required Parameters:
 
         --in      : the SAM/BAM file to be read
 
         --in      : the SAM/BAM file to be read
 
         --refFile  : the reference file
 
         --refFile  : the reference file
 +
        --out      : the SAM/BAM file to write to
 
     Optional Parameters:
 
     Optional Parameters:
         --noeof : do not expect an EOF block on a bam file.
+
         --noeof             : do not expect an EOF block on a bam file.
 +
        --qualityThreshold  : maximum sum of the mismatch qualities before marking
 +
                              a read unmapped. (Defaults to 60)
 +
        --defaultQualityInt : quality value to use for mismatches that do not have a quality
 +
                              (Defaults to 20)
 +
        --mismatchThreshold : decimal value indicating the maximum ratio of mismatches to
 +
                              matches and mismatches allowed before clipping from the ends
 +
                              (Defaults to .10)
 +
        --params            : print the parameter settings
 
</pre>
 
</pre>
 +
{{PhoneHomeParamDesc}}
 +
 +
== Required Parameters==
 +
{{InBAMInputFile}}
 +
{{refFile}}
 +
{{OutBAMOutputFile}}
 +
 +
== Optional Parameters ==
 +
{{noeofBGZFParameter}}
 +
=== Quality Threshold (<code>--qualityThreshold</code>) ===
 +
In <code>filter</code>, when the sum of the mismatching base qualities is higher than the <code>--qualityThreshold</code>, the read is marked as unmapped.  The default threshold is 60.
 +
 +
=== Default Quality (<code>--defaultQualityInt</code>) ===
 +
<code>filter</code> filters reads based on the sum of the base qualities of mismatches.  Some reads, however, do not have base qualities.  Use <code>--defaultQualityInt</code> to specify the base qualities to use for mismatches that do not have quality values.  The default is 20.
 +
 +
=== Mismatch Threshold (<code>--mismatchThreshold</code>) ===
 +
<code>filter</code> clips the ends of reads if the ratio of mismatches to matches and mismatches is higher than the decimal parameter, <code>--mismatchThreshold</code>.  The default is .10.
 +
 +
{{paramsParameter}}
   −
=== Usage ===
+
{{PhoneHomeParameters}}
   −
    ./bam filter --in <inputFilename>  --refFile <referenceFilename> [--noeof]
     −
=== Return Value ===
+
= Return Value =
 
*    0: all records are successfully read and written.
 
*    0: all records are successfully read and written.
 
* non-0: at least one record was not successfully read or written.
 
* non-0: at least one record was not successfully read or written.
   −
=== Example Output ===
+
= Example Output =
 
<pre>
 
<pre>
The following parameters are available.  Ones with "[]" are in effect:
     −
Input Parameters
+
open and prefetch reference genome /home/mktrost/data/human.g1k.v37.fa: done.
--in [testFiles/testFilter.sam], --out [-],
+
Number of Reads Clipped by Filtering: 704578
  --refFile [testFiles/chr1_partial.fa], --noeof, --qualityThreshold [30],
+
Number of Reads Filtered Due to MismatchThreshold: 0
  --defaultQualityInt [20], --mismatchThreshold [0.49]
+
Number of Reads Filtered Due to QualityThreshold: 13064
 +
[bam_sort_core] merging from 3 files...
 +
[bam_sort_core] merging from 3 files...
 +
</pre>
 +
 
 +
 
 +
= FAQ =
 +
This section contains information about what the filters mean, how they work, etc.
 +
 
 +
'''What counts as a match or a mismatch?'''
 +
:Only bases with a cigar values of M, X, or = can be counted as matches or mismatches. Cigar I, D, N, S, P, and H are NOT matches or mismatches.
 +
:If either the reference or the read are 'N' or 'n', it is NOT a match or mismatch.
 +
:It is a match if the reference and the read have the same base (both A/a, both C/c, both G/g, or both T/t). If they are different, non-'N', bases it is a mismatch.
 +
 
 +
'''What type of clip is used?'''
 +
:When a read is clipped due to exceeding the mismatch threshold, a softclip is used (Cigar S).
 +
 
 +
'''What should the quality count as if the quality string is '*'?'''
 +
:That is the purpose of the <code>defaultQualityInt</code> parameter.
 +
 
 +
'''Is the quality threshold or the clipping checked first?'''
 +
:First the mismatch threshold is checked so clipping is performed. Then the quality threshold is checked on the non-clipped mismatches.  That way mismatches that are clipped do not go into the sum of the mismatch qualities.
 +
 
 +
'''How is the quality threshold checked?'''
 +
:Easy, just loop through an alignment finding mismatches.  Add up the quality of the mismatches.  If they sum is greater than the quality threshold, mark the read as unmapped.
 +
 
 +
'''How is the mismatch threshold checked?'''
 +
:This is requires a bit more logic...and thus gets its own section: [[Bam Executable: Filter#Mismatch Threshold|Mismatch Threshold]].
 +
 
 +
= Mismatch Threshold =
 +
 
 +
Mismatch threshold is:
 +
:<math>NumMismatches \over NumMismatches + NumMatches</math>
 +
 
 +
But is that total number of mismatches in the entire alignment?
 +
:In processing, it is just the numbers for the bases that have been read so far.
 +
 
 +
For mismatch threshold, the logic goes in 1 match/mismatch base at a time from each end of the read and checks its current processing against the threshold.
 +
 
 +
If the first base is a mismatch, it is
 +
:<math>{1 \over 1 + 0} = {1 \over 1} = 1</math>
 +
This base will be clipped since 1 is > than the mismatch threshold (assuming that wasn't set to 1).
 +
 
 +
If the first base is a match, it is
 +
:<math>{0 \over 1 + 0} = {0 \over 1} = 0</math>
 +
At this point, this base will not be clipped since it is not greater than the mismatch threshold.
 +
 
 +
When a clip occurs, NumMatches & NumMismatches are reset to 0 (those bases are now clipped and do not count as a match or a mismatch).
 +
 
 +
To try to minimize the number of bases that are clipped, the logic keeps the NumMatches + NumMismatches when reading from the front and NumMatches + NumMismatches when reading from the back within 1 of each other.
 +
 
 +
If the mismatch threshold is 10%, it means that none of the 1st or last 10 bases of the updated read will be a mismatch.
 +
 
 +
If one of the 1st 10 bases is a mismatch, it will be clipped.
 +
 
 +
If one of the next 10 bases from that clip is a mismatch, it will also be clipped.
 +
 
   −
open and prefetch reference genome testFiles/chr1_partial.fa: done.
+
For example consider the following strings of matches & mismatches (M indicates match, X indicates mismatch)
Number of Reads Clipped by Filtering: 8
+
MMMXXXMXXM
Number of Reads Filtered Due to MismatchThreshold: 1
+
Use a mismatch threshold of .50
Number of Reads Filtered Due to QualityThreshold: 2
+
The logic will read alternating, with '-' indicating which bases were processed from the front, '|' reads processed from the back, and '+' reads processed in both directions. '*' indicates clipped bases.
</pre>
+
M M M X X X M X X M
 +
-
 +
 
 +
Match was found from the front, so basesFromFront = 1, basesFromBack = 0, so now read from the back.
 +
 
 +
M M M X X X M X X M
 +
-                |
 +
Match was found from the back, so basesFromFront = 1, basesFromBack = 1, so read again from the back.
 +
M M M X X X M X X M
 +
-              | |
 +
From back, 1/2 = .5, so is not over the .5 threshold;  basesFromFront = 1, basesFromBack = 2, so read from the front.
 +
M M M X X X M X X M
 +
- -            | |
 +
From front, 0/2 < .5, so is not over the .5 threshold;  basesFromFront = 2, basesFromBack = 2, so read again from the front.
 +
M M M X X X M X X M
 +
- - -          | |
 +
From front, 0/3 < .5, so is not over the .5 threshold;  basesFromFront = 3, basesFromBack = 2, so read from the back.
 +
M M M X X X M X X M
 +
- - -        | | |
 +
From back, 2/3 > .5, so over the .5 threshold;  need to clip...basesFromFront = 3, basesFromBack = 0, so read from the back.
 +
M M M X X X M X X M
 +
- - -      | * * *
 +
From back, 0/1 < .5, so is not over the .5 threshold;  basesFromFront = 3, basesFromBack = 1, so read from the back.
 +
M M M X X X M X X M
 +
- - -    | | * * *
 +
From back, 1/2 = .5, so is not over the .5 threshold;  basesFromFront = 3, basesFromBack = 2, so read from the back.
 +
M M M X X X M X X M
 +
- - -  | | | * * *
 +
From back, 2/3 > .5, so over the .5 threshold;  need to clip...basesFromFront = 3, basesFromBack = 0, so read from the back.
 +
M M M X X X M X X M
 +
- - - | * * * * * *
 +
From back, 1/1 > .5, so over the .5 threshold;  need to clip...basesFromFront = 3, basesFromBack = 0, so read from the back.
 +
M M M X X X M X X M
 +
- - + * * * * * * *
 +
From back, 0/1 < .5, so is not over the .5 threshold;  basesFromFront = 3, basesFromBack = 1, so read from the back.
 +
M M M X X X M X X M
 +
- + + * * * * * * *
 +
From back, 0/2 < .5, so is not over the .5 threshold;  basesFromFront = 3, basesFromBack = 2, so read from the back.
 +
M M M X X X M X X M
 +
+ + + * * * * * * *
 +
From back, 0/3 < .5, so is not over the .5 threshold;  basesFromFront = 3, basesFromBack = 3, so read from the back.
 +
 
 +
In this example, there is no more to read from the front or the back.
 +
The new Cigar is 3M7S

Navigation menu