BamUtil: filter

From Genome Analysis Wiki
Revision as of 15:58, 30 September 2010 by Mktrost (talk | contribs)


filter

The filter option on the bam executable writes out the alignments after filtering them: it clips read ends whose mismatch percentage is too high, and it marks reads unmapped if the quality of their mismatches is too high.

The following modifications may occur in an alignment:

  • CIGAR updated to reflect clips
  • POS updated to reflect a new CIGAR if clipping occurs at the front of a read
  • FLAG updated to mark a read as unmapped if the quality of its mismatches is too high, or if clipping would cause the entire read to be clipped.
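
The relationship between a front clip, the CIGAR, and POS can be illustrated with a short Python sketch. It is not BamUtil's implementation: the function name soft_clip_front is hypothetical, it assumes the clip is recorded as a soft clip (S), and it uses the SAM convention for which CIGAR operations consume reference and read bases. Returning None stands for the case where the whole read would be clipped and the read would instead be flagged unmapped.

```python
import re

REF_CONSUMING = set("MDN=X")    # operations that advance along the reference
QUERY_CONSUMING = set("MIS=X")  # operations that advance along the read

def soft_clip_front(pos, cigar, n_clip):
    """Soft-clip the first n_clip read bases (hypothetical helper).

    Returns (new_pos, new_cigar), or None if no aligned bases would remain.
    """
    if n_clip <= 0:
        return pos, cigar
    ops = [(int(n), op) for n, op in re.findall(r"(\d+)([MIDNSHP=X])", cigar)]
    remaining, ref_shift, kept = n_clip, 0, []
    for length, op in ops:
        if remaining == 0:
            kept.append((length, op))
        elif op in QUERY_CONSUMING:
            used = min(length, remaining)
            remaining -= used
            if op in REF_CONSUMING:
                ref_shift += used          # clipped aligned bases move POS forward
            if length > used:
                kept.append((length - used, op))
        elif op in REF_CONSUMING:
            ref_shift += length            # deletion/skip swallowed by the clip
        # H and P operations inside the clipped region are dropped in this sketch
    # A deletion or skip cannot start the alignment, so fold it into POS.
    while kept and kept[0][1] in "DN":
        ref_shift += kept.pop(0)[0]
    if remaining > 0 or not any(op in "M=X" for _, op in kept):
        return None
    new_cigar = f"{n_clip}S" + "".join(f"{n}{op}" for n, op in kept)
    return pos + ref_shift, new_cigar
```

For instance, clipping 3 bases from the front of a read at POS 100 with CIGAR 10M gives POS 103 and CIGAR 3S7M. Clipping at the end of a read would change the CIGAR only, which is why POS is updated only for front clips.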

NOTES

The POS and FLAG fields of an alignment are also reflected in its mate's alignment. Thus, when either field changes, the mate also needs to be updated.

Also, if the file was sorted, and a POS was changed, the file may no longer be sorted.

NOTE: This program does NOT update the mate or resort the file.

In order to update the mate, samtools fixmate must be run.

In order to reorder the file, samtools sort must be run.

Notes about the samtools programs:

  • samtools fixmate requires the file to be sorted by query name.
  • samtools sort cannot write to pipes.

Steps:

  1. Run this program and pipe it into samtools sort by query name
    • ./bam filter --in <your InputFile> --refFile <your reference file> --out -.bam <any other options> | samtools sort -n - tempQuerySort
  2. Run samtools fixmate and pipe it into samtools sort by position
    •  samtools fixmate tempQuerySort.bam - | samtools sort - finalResult

For Example:

~/pipeFilter/bam/bam filter --in ../../originalBamFile.bam --refFile ~/data/human.g1k.v37.fa --out -.bam | samtools sort -n - tempQuerySort; samtools fixmate tempQuerySort.bam - | samtools sort - newResult


Parameters

	Required Parameters:
		--in       : the SAM/BAM file to be read
		--refFile  : the reference file
		--out      : the SAM/BAM file to write to
	Optional Parameters:
		--noeof             : do not expect an EOF block on a bam file.
		--qualityThreshold  : maximum sum of the mismatch qualities before marking
		                      a read unmapped. (Defaults to 60)
		--defaultQualityInt : quality value to use for mismatches that do not have a quality
		                      (Defaults to 20)
		--mismatchThreshold : decimal value indicating the maximum ratio of mismatches to
		                      (matches + mismatches) allowed before clipping from the ends
		                      (Defaults to .10)
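
The two thresholds can be read as the following Python sketch. It is an interpretation of the parameter descriptions above, not BamUtil's code: the function names are hypothetical, a missing quality is represented here by None, and the strict "greater than" comparison is an assumption, since the text does not say whether a value equal to the threshold passes.

```python
def mismatch_fraction(num_matches, num_mismatches):
    """Ratio of mismatches to (matches + mismatches), as --mismatchThreshold describes."""
    total = num_matches + num_mismatches
    return num_mismatches / total if total else 0.0

def exceeds_quality_threshold(mismatch_quals, quality_threshold=60, default_quality=20):
    """Sum the qualities of the mismatching bases and compare to the threshold.

    A quality of None is replaced by default_quality (--defaultQualityInt).
    Assumption: the read is marked unmapped only when the sum is strictly greater.
    """
    total = sum(default_quality if q is None else q for q in mismatch_quals)
    return total > quality_threshold
```

With the defaults, two mismatches of quality 30 and 31 sum to 61 and would exceed the threshold of 60, while three mismatches without qualities sum to exactly 60 and, under the strict comparison assumed here, would not.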

Usage

./bam filter --in <inputFilename>  --refFile <referenceFilename>  --out <outputFilename> [--noeof] [--qualityThreshold <qualThresh>] [--defaultQualityInt <defaultQual>] [--mismatchThreshold <mismatchThresh>]
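
When the command is launched from a script, the usage line above maps onto an argument list. The Python helper below is a minimal sketch: build_filter_argv and its keyword names are hypothetical, and only the flags documented on this page are emitted. Optional flags are added only when a value is supplied, so the program's own defaults stay in effect otherwise.

```python
def build_filter_argv(in_file, ref_file, out_file, noeof=False,
                      quality_threshold=None, default_quality_int=None,
                      mismatch_threshold=None, bam="./bam"):
    """Assemble the argument list for `bam filter` (hypothetical helper)."""
    argv = [bam, "filter", "--in", in_file, "--refFile", ref_file, "--out", out_file]
    if noeof:
        argv.append("--noeof")
    if quality_threshold is not None:
        argv += ["--qualityThreshold", str(quality_threshold)]
    if default_quality_int is not None:
        argv += ["--defaultQualityInt", str(default_quality_int)]
    if mismatch_threshold is not None:
        argv += ["--mismatchThreshold", str(mismatch_threshold)]
    return argv
```

The resulting list can be passed to subprocess.run or joined into a shell command line.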

Return Value

  • 0: all records are successfully read and written.
  • non-0: at least one record was not successfully read or written.
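
A calling script can act on the return value as in the Python sketch below. The function name filter_succeeded is hypothetical, and the commands run here are stand-ins (the Python interpreter exiting with a chosen status) so that the sketch runs without bam installed.

```python
import subprocess
import sys

def filter_succeeded(argv):
    """Run a command and report whether it returned 0 (hypothetical helper)."""
    return subprocess.run(argv).returncode == 0

# Stand-in commands: replace with the real `bam filter` argument list.
ok = filter_succeeded([sys.executable, "-c", "raise SystemExit(0)"])
failed = filter_succeeded([sys.executable, "-c", "raise SystemExit(1)"])
```

Note that in a shell pipeline such as the one in the steps above, the shell normally reports the status of the last command in the pipe, so a failure of the filter step itself may need to be checked separately.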

Example Output

The following parameters are available.  Ones with "[]" are in effect:

Input Parameters
 --in [../../originalBamFile.bam],
 --out [-.bam], --refFile [/home/mktrost/data/human.g1k.v37.fa], --noeof,
 --qualityThreshold [60], --defaultQualityInt [20], --mismatchThreshold [0.10]

open and prefetch reference genome /home/mktrost/data/human.g1k.v37.fa: done.
Number of Reads Clipped by Filtering: 704578
Number of Reads Filtered Due to MismatchThreshold: 0
Number of Reads Filtered Due to QualityThreshold: 13064
[bam_sort_core] merging from 3 files...
[bam_sort_core] merging from 3 files...
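
The summary counts in output like the above can be collected with a small parser. This Python sketch assumes the "Number of Reads ...: N" line format shown in the example and that the log text has already been captured. The function name parse_filter_summary is hypothetical.

```python
import re

# Matches lines such as "Number of Reads Clipped by Filtering: 704578".
COUNT_LINE = re.compile(r"^Number of Reads (.+?): (\d+)$")

def parse_filter_summary(text):
    """Return a dict mapping each summary label to its integer count."""
    counts = {}
    for line in text.splitlines():
        match = COUNT_LINE.match(line.strip())
        if match:
            counts[match.group(1)] = int(match.group(2))
    return counts

example_log = """\
Number of Reads Clipped by Filtering: 704578
Number of Reads Filtered Due to MismatchThreshold: 0
Number of Reads Filtered Due to QualityThreshold: 13064
[bam_sort_core] merging from 3 files...
"""
summary = parse_filter_summary(example_log)
```

Lines that do not match the pattern, such as the samtools messages, are ignored.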