Difference between revisions of "MutationFilter"
From Genome Analysis Wiki
Latest revision as of 01:06, 16 April 2013
Introduction
- The tool mutfilter generates diagnostic statistics from a sequence alignment and flags alignment artifacts according to user-provided criteria.
- It takes as input a SAM/BAM file (through --bam) and a BED-like file (through --bed) and prints its output to the screen.
- Additional filtering options can be provided, and mutfilter will generate a filtering flag for each filtering option that is set. See details below.
Usage
- Typing mutfilter without any other options will display the following message:
The following parameters are available. Ones with "[]" are in effect:
  Input Files : --bam [], --bed []
  Cycle Bias (CB) : --mut_median_cycles2ends [-1]
  Strand Bias (SB) : --SB_OR [-1.0e+00]
  Nearby Indels (INDEL) : --indel_winsize [30], --indel_cnt [-1], --indel_pct [-1.0e+00]
  Head Clip (HC) : --mut_clip_cnt [-1], --mut_clip_pct [-1.0e+00]
  Other Alleles (OA) : --other_allele_cnt [-1], --other_allele_pct [-1.0e+00]
  Allelic Balance (AB) : --mut_base_cnt [-1], --mut_base_pct [-1.0e+00]
  Low Map Quality (LMQ) : --low_mapq_cutoff [-1.0e+00], --low_mapq_cnt [-1], --low_mapq_pct [-1.0e+00], --mut_low_mapq_cnt [-1], --mut_low_mapq_pct [-1.0e+00]
  Low Base Quality (LBQ) : --low_baseq_cutoff [-1.0e+00], --low_baseq_cnt [-1], --low_baseq_pct [-1.0e+00], --mut_low_baseq_cnt [-1], --mut_low_baseq_pct [-1.0e+00]
  Improper Paired (IP) : --improper_paried_cnt [-1], --improper_paried_pct [-1.0e+00], --mut_improper_paried_cnt [-1], --mut_improper_paried_pct [-1.0e+00]
NOTE: A filter whose parameter is negative is NOT in effect! Filters with 'mut_' in their name apply to statistics calculated for MUTANT alleles only!
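The rule that a negative parameter leaves a filter switched off can be sketched as a small Python helper that assembles a mutfilter command line and simply omits any threshold that is negative. The helper name `build_command` is hypothetical and not part of mutfilter; only the flag names are taken from the usage message above, and the helper does not run the tool.

```python
def build_command(bam, bed, **thresholds):
    """Return an argv list for mutfilter, skipping disabled (negative) filters.

    Hypothetical helper: keyword names are mutfilter option names without
    the leading "--", values are the numeric thresholds.
    """
    argv = ["mutfilter", "--bam", bam, "--bed", bed]
    for name, value in thresholds.items():
        if value < 0:
            # Negative means "not in effect", so the option is left out.
            continue
        argv += ["--" + name, str(value)]
    return argv


command = build_command(
    "in.bam", "in.bed",
    indel_winsize=20,
    indel_cnt=3,
    SB_OR=-1.0,  # disabled, so it is dropped from the command
)
print(" ".join(command))
```

The resulting list can be passed to `subprocess.run` if mutfilter is installed; that step is left out here so the sketch stays runnable without the tool.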
Some examples
1. Filter based on nearby Indels (INDEL): filtered if there are >=3 reads with Indels in a window of 20bp up- and down-stream of the mutation candidate
mutfilter --bam in.bam --bed in.bed --indel_winsize 20 --indel_cnt 3
2. Filter based on cycle bias (CB): filtered if the median distance of the mutant allele to the nearest end of the read is >=5
mutfilter --bam in.bam --bed in.bed --indel_winsize 20 --indel_cnt 3 --mut_median_cycles2ends 5
3. Filter based on read clipping (CL): filtered if the percentage of reads carrying the mutant allele that are clipped at the head of the read is >=20%
mutfilter --bam in.bam --bed in.bed --indel_winsize 20 --indel_cnt 3 --mut_median_cycles2ends 5 --mut_clip_pct 20
4. Filter based on Allelic Balance (AB): filtered if the percentage of the reads carrying the mutant allele is <30%
mutfilter --bam in.bam --bed in.bed --indel_winsize 20 --indel_cnt 3 --mut_median_cycles2ends 5 --mut_clip_pct 20 --mut_base_pct 30
5. Filter based on low Map Quality (LMQ): filtered if the percentage of reads with low Map quality (defined as map quality below 10 via --low_mapq_cutoff) is >=15%
mutfilter --bam in.bam --bed in.bed --indel_winsize 20 --indel_cnt 3 --mut_median_cycles2ends 5 --low_mapq_cutoff 10 --low_mapq_pct 15
6. Filter based on low Map Quality (LMQ): filtered if the percentage of reads carrying the mutant allele with low Map quality (defined as map quality below 10 via --low_mapq_cutoff) is >=15%
mutfilter --bam in.bam --bed in.bed --indel_winsize 20 --indel_cnt 3 --mut_median_cycles2ends 5 --low_mapq_cutoff 10 --mut_low_mapq_pct 15
7. Filter based on low Base Quality (LBQ): filtered if the percentage of reads carrying the mutant allele with low Base Quality (defined as base quality below 30 via --low_baseq_cutoff) is >=50%
mutfilter --bam in.bam --bed in.bed --indel_winsize 20 --indel_cnt 3 --mut_median_cycles2ends 5 --low_baseq_cutoff 30 --mut_low_baseq_pct 50
8. Filter based on improper pairing (IP): filtered if the percentage of improperly paired reads is >=10%
mutfilter --bam in.bam --bed in.bed --indel_winsize 20 --indel_cnt 3 --mut_median_cycles2ends 5 --improper_paired_pct 10
9. Filter based on improper pairing (IP): filtered if the percentage of improperly paired reads carrying the mutant allele is >=5%
mutfilter --bam in.bam --bed in.bed --indel_winsize 20 --indel_cnt 3 --mut_median_cycles2ends 5 --mut_improper_paired_pct 5
10. Filter based on strand bias (SB): filtered if the odds ratio of the strand bias is >=2
mutfilter --bam in.bam --bed in.bed --indel_winsize 20 --indel_cnt 3 --mut_median_cycles2ends 5 --SB_OR 2
You can combine the filters in any way, and mutfilter will report in the FILTER column which criteria each mutation candidate failed.
- NOTE1: The filtering criteria only change the FILTER column, which indicates which criteria a candidate fails. The output statistics are not affected by these criteria.
- NOTE2: Most of the filtering can also be done post hoc from the output statistics, so the thresholds can be tuned afterwards without rerunning mutfilter.
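Post-hoc filtering of the kind NOTE2 describes might look like the Python sketch below. It reads mutfilter output into dictionaries keyed by the header names and re-derives two flags from the statistics. Several things here are assumptions rather than documented behaviour: that the output is whitespace-separated, that the INDEL rule counts nINS + nDEL, and that the mutant low-base-quality percentage is nLBQ_mut divided by nMut. The function names are hypothetical, and the sample row is the one shown in the Output section.

```python
import io

# Header and example row copied from the Output section of this page.
EXAMPLE = """\
#CHR START END REF MUT FILTER nRef nMut nOA MP:MM:RP:RM nINS nDEL nClip_mut Cycle_mut SB nIP nIP_mut nLMQ nLMQ_mut nLBQ nLBQ_mut
11 190324 190324 A C INDEL,AB,BQM 0 5 0 2:3:0:0 0 1 0 20 . 0 0 0 0 5 5
"""


def read_rows(handle):
    """Yield one dict per data line, keyed by the column names in the header."""
    header = None
    for line in handle:
        fields = line.split()
        if not fields:
            continue
        if header is None:
            # The first non-empty line is the header; drop the leading '#'.
            header = [fields[0].lstrip("#")] + fields[1:]
            continue
        yield dict(zip(header, fields))


def posthoc_flags(row, indel_cnt=3, mut_low_baseq_pct=50.0):
    """Recompute two flags from the statistics (assumed rules, see lead-in)."""
    flags = []
    if int(row["nINS"]) + int(row["nDEL"]) >= indel_cnt:
        flags.append("INDEL")
    n_mut = int(row["nMut"])
    if n_mut > 0 and 100.0 * int(row["nLBQ_mut"]) / n_mut >= mut_low_baseq_pct:
        flags.append("LBQ")
    return flags


rows = list(read_rows(io.StringIO(EXAMPLE)))
for row in rows:
    print(row["CHR"], row["START"], posthoc_flags(row, indel_cnt=1))
```

Because the statistics do not depend on the thresholds used in the original run, the same output file can be re-flagged with different cutoffs by changing only the arguments to `posthoc_flags`.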
Input
- SAM/BAM (http://samtools.sourceforge.net/) file
- BED-like file: It requires 5 columns, CHR, START, END, REF, MUTANT, in that order. It also requires that START=END, since it deals with single sites.
1 10000 10000 A C
1 20000 20000 G T
2 15000 15000 A G
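The two requirements on the BED-like file (exactly 5 columns, and START equal to END) can be checked before running mutfilter. The Python sketch below is one way to do that; the function name `parse_sites` is hypothetical, and it assumes the columns are separated by whitespace.

```python
def parse_sites(lines):
    """Validate BED-like lines and return (chrom, position, ref, mutant) tuples."""
    sites = []
    for number, line in enumerate(lines, start=1):
        fields = line.split()
        if not fields:
            continue  # ignore blank lines
        if len(fields) != 5:
            raise ValueError(f"line {number}: expected 5 columns, got {len(fields)}")
        chrom, start, end, ref, mutant = fields
        if int(start) != int(end):
            raise ValueError(f"line {number}: START ({start}) must equal END ({end})")
        sites.append((chrom, int(start), ref, mutant))
    return sites


example = [
    "1 10000 10000 A C",
    "1 20000 20000 G T",
    "2 15000 15000 A G",
]
for site in parse_sites(example):
    print(site)
```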
Output
#CHR START END REF MUT FILTER nRef nMut nOA MP:MM:RP:RM nINS nDEL nClip_mut Cycle_mut SB nIP nIP_mut nLMQ nLMQ_mut nLBQ nLBQ_mut
11 190324 190324 A C INDEL,AB,BQM 0 5 0 2:3:0:0 0 1 0 20 . 0 0 0 0 5 5
The first 5 columns are the same as in the input BED-like file. The remaining columns are statistics on which filtering can be based. The meaning of each column is as follows:
- FILTER: indicates the filtering criteria a candidate failed, separated by commas. A . is used if not set.
- nRef: # of REF alleles
- nMut: # of MUT alleles
- nOA: # of other alleles different from REF and MUT alleles.
- MP:MM:RP:RM: #Mut on plus (forward) strand : #Mut on minus (reverse) strand : #Ref on plus (forward) strand : #Ref on minus (reverse) strand
- nINS: # of reads with insertions in a window of the size specified by --indel_winsize
- nDEL: # of reads with deletions in the same window as nINS
- nClip_mut: # of reads carrying the mutant allele that are clipped at the head of the read
- Cycle_mut: the median distance of the mutant allele to the nearest end of a read
- SB: strand bias odds ratio, which can be derived from MP:MM:RP:RM. A . is used if not applicable (e.g. all are on the same strand).
- nIP: # of improperly paired reads
- nIP_mut: # of improperly paired reads carrying the mutant allele
- nLMQ: # of reads with low mapping quality
- nLMQ_mut: # of low mapping quality reads that carry the mutant allele
- nLBQ: # of low quality bases
- nLBQ_mut: # of low quality mutant bases.
- meanMQ_pass: average of MapQ of reads that have MapQ > low_mapq_cutoff.
- mut_meanMQ_pass: average of MapQ of reads that have MapQ > low_mapq_cutoff AND contain the mutant allele.
- meanBQ_pass: average of BaseQ of bases that have BaseQ > low_baseq_cutoff.
- mut_meanBQ_pass: average of BaseQ of mutant alleles that have BaseQ > low_baseq_cutoff.
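The SB column is described above as an odds ratio that can be derived from MP:MM:RP:RM. The page does not give the formula, so the Python sketch below assumes the ordinary 2x2 odds ratio, (MP x RM) / (MM x RP), and returns None where mutfilter would print a dot. The tool's own definition may differ, for example by reporting the larger of the ratio and its inverse. The function name is hypothetical.

```python
def strand_bias_or(counts):
    """Odds ratio from an 'MP:MM:RP:RM' string, or None if it is undefined.

    Assumed formula: (MP / MM) / (RP / RM) = (MP * RM) / (MM * RP).
    """
    mp, mm, rp, rm = (int(value) for value in counts.split(":"))
    denominator = mm * rp
    if denominator == 0:
        # Undefined, e.g. when no REF reads cover the site.
        return None
    return (mp * rm) / denominator


print(strand_bias_or("2:3:0:0"))  # the example row above: no REF reads
print(strand_bias_or("4:2:3:3"))
```

For the example row, both REF counts are zero, so the ratio is undefined, which is consistent with the dot shown in its SB column.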
Download
Source code can be downloaded here (Media:Mutfilter.tar.gz).