Difference between revisions of "BamUtil: diff"

From Genome Analysis Wiki

Revision as of 14:44, 16 April 2012


Overview of the diff function of bamUtil

The diff option on the bamUtil executable prints the difference between two coordinate sorted SAM/BAM files. This can be used to compare the outputs of running a SAM/BAM through different tools/versions of tools.

The diff tool compares records that have the same Read Name and Fragment (from the flag). If a matching ReadName & Fragment is not found, the record is considered to be different.
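As a minimal sketch of the matching key described above, the following Python function pairs the read name with fragment information taken from the flag. It assumes that "Fragment" corresponds to the first/last segment bits (0x40 and 0x80) of the SAM flag field; the name `fragment_key` is hypothetical and this is not bamUtil's own code.

```python
# Assumption: "Fragment" is derived from the SAM flag bits
# 0x40 (first segment) and 0x80 (last segment).
FIRST_SEGMENT = 0x40
LAST_SEGMENT = 0x80


def fragment_key(read_name, flag):
    """Return a hashable key identifying a read name and its fragment."""
    is_first = bool(flag & FIRST_SEGMENT)
    is_last = bool(flag & LAST_SEGMENT)
    return (read_name, is_first, is_last)


# Two mates of the same pair share a read name but get different keys.
mate1 = fragment_key("read1", 99)   # 99 has bit 0x40 set
mate2 = fragment_key("read1", 147)  # 147 has bit 0x80 set
```

Under this assumption, two records are candidates for comparison only when their keys are equal.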

diff assumes the files are coordinate sorted and uses this assumption for determining how long to store a record before determining that the other file does not contain a matching ReadName/Fragment. If the files are not coordinate sorted, this logic does not work.
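The idea of holding a record only for a limited window can be illustrated with a simplified sketch. This is not bamUtil's implementation: it assumes a single chromosome, unique keys within each file, and only a position window (the hypothetical `pos_diff` argument stands in for --posDiff; the --recPoolSize limit is ignored). The function name `windowed_match` is also hypothetical.

```python
import heapq


def windowed_match(recs1, recs2, pos_diff=100000):
    """Match records from two position-sorted iterables of (pos, key).

    Returns (matched, only1, only2), where matched holds
    (key, pos_in_file1, pos_in_file2) tuples.
    """
    tagged1 = ((pos, 1, key) for pos, key in recs1)
    tagged2 = ((pos, 2, key) for pos, key in recs2)
    # key -> position; dict insertion order follows position order
    pending = {1: {}, 2: {}}
    matched = []
    only = {1: [], 2: []}

    # Walk both files together in position order.
    for pos, src, key in heapq.merge(tagged1, tagged2):
        # Give up on stored records that are too far behind to match.
        for side in (1, 2):
            waiting = pending[side]
            while waiting:
                oldest_key = next(iter(waiting))
                if pos - waiting[oldest_key] <= pos_diff:
                    break
                del waiting[oldest_key]
                only[side].append(oldest_key)

        other = 2 if src == 1 else 1
        if key in pending[other]:
            other_pos = pending[other].pop(key)
            pos1, pos2 = (pos, other_pos) if src == 1 else (other_pos, pos)
            matched.append((key, pos1, pos2))
        else:
            pending[src][key] = pos

    # Anything still stored at the end has no partner.
    for side in (1, 2):
        only[side].extend(pending[side])
    return matched, only[1], only[2]
```

Because eviction relies on positions never decreasing, feeding unsorted input to a scheme like this would discard records before their partners arrive, which is why the coordinate-sorted assumption matters.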

By default, just the chromosome/position and cigar are compared for each record.

Options are available to additionally compare:

  • all fields
  • flags
  • mapping quality
  • mate chromosome/position
  • insert size
  • sequence
  • base quality
  • specified tags
  • all tags

Options are also available to turn off:

  • position comparison
  • cigar comparison

Parameters

	Required Parameters:
		--in1         : first coordinate sorted SAM/BAM file to be diffed
		--in2         : second coordinate sorted SAM/BAM file to be diffed
	Optional Parameters:
		--out         : output filename, use .sam/.bam extension to output in SAM/BAM format instead of diff format.
		                In SAM/BAM format there will be 3 output files:
		                    1) the specified name with record diffs
		                    2) specified name with _only_<in1>.sam/bam with records only in the in1 file
		                    3) specified name with _only_<in2>.sam/bam with records only in the in2 file
		--all         : diff all the SAM/BAM fields.
		--flag        : diff the flags.
		--mapQual     : diff the mapping qualities.
		--mate        : diff the mate chrom/pos.
		--isize       : diff the insert sizes.
		--seq         : diff the sequence bases.
		--baseQual    : diff the base qualities.
		--tags        : diff the specified Tags formatted as Tag:Type;Tag:Type;Tag:Type...
		--everyTag    : diff all the Tags
		--noCigar     : do not diff the cigars.
		--noPos       : do not diff the positions.
		--onlyDiffs   : only print the fields that are different; otherwise, for any diff, all the fields that are compared are printed.
		--recPoolSize : number of records to allow to be stored at a time, default value: 1000000
		--posDiff     : max base pair difference between possibly matching records, default value: 100000
		--noeof       : do not expect an EOF block on a bam file.
		--params      : print the parameter settings

Usage

./bam diff --in1 <inputFile> --in2 <inputFile> [--out <outputFile>] [--all] [--flag] [--mapQual] [--mate] [--isize] [--seq] [--baseQual] [--tags <Tag:Type[;Tag:Type]*>] [--everyTag] [--noCigar] [--noPos] [--onlyDiffs] [--recPoolSize <int>] [--posDiff <int>] [--noeof] [--params]
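When the tool is driven from a script, the usage line above can be assembled programmatically. The sketch below is a hypothetical helper (`build_bam_diff_command` is not part of bamUtil) that uses only the options listed on this page and rejects anything else.

```python
# Switches taken from the Parameters section; all of them take no value.
SWITCHES = {
    "all", "flag", "mapQual", "mate", "isize", "seq", "baseQual",
    "everyTag", "noCigar", "noPos", "onlyDiffs", "noeof", "params",
}


def build_bam_diff_command(in1, in2, out=None, switches=(), tags=None,
                           bam="./bam"):
    """Return an argument list for a 'bam diff' invocation.

    tags is an optional sequence of (tag, type) pairs, joined as
    Tag:Type;Tag:Type for the --tags option.
    """
    cmd = [bam, "diff", "--in1", in1, "--in2", in2]
    if out is not None:
        cmd += ["--out", out]
    for name in switches:
        if name not in SWITCHES:
            raise ValueError(f"unknown switch: --{name}")
        cmd.append(f"--{name}")
    if tags:
        cmd += ["--tags", ";".join(f"{tag}:{typ}" for tag, typ in tags)]
    return cmd


cmd = build_bam_diff_command(
    "testFiles/testDiff1.sam", "testFiles/testDiff2.sam",
    out="results/diffOrderSam.log",
    switches=["seq", "baseQual", "onlyDiffs"],
    tags=[("OP", "i"), ("MD", "Z")],
    bam="../bin/bam",
)
```

The resulting list can be handed to `subprocess.run(cmd)`; passing a list rather than a shell string avoids having to quote the semicolons in the --tags value.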

Return Value

  • 0: all records are successfully read and written.
  • non-0: an error occurred processing the parameters or reading one of the files.

Output Format

2 Output Formats:

  1. Diff Format
  2. SAM/BAM Format

Diff Format

There are 2 types of differences.

  • ReadName/Fragment combo is in one file, but not in the other file within the window set by recPoolSize & posDiff
  • ReadName/Fragment combo is in both files, but at least one of the specified fields to diff is different

Each difference output consists of 2 or 3 lines. If the record only appears in one of the files, the diff is 2 lines; if it appears in both files, the diff is 3 lines.

The first line of the difference output is just the read name.

The 2nd and 3rd line (if present) begin with either a '<' or a '>'. If the record is from the first file (--in1), it begins with a '<'. If the record is from the 2nd file (--in2), it begins with a '>'.

The 2nd line is the flag followed by the diff'd fields from one of the records.

The 3rd line (if a matching record was found) is the flag followed by the diff'd fields from the matching record.

The diff'd record lines are tab separated, and are in the following order if --onlyDiffs is not specified.

After April 16, 2012:

  • '<' or '>'
  • flag
  • chrom:pos (chromosome name ':' 1 based position) - if --noPos is not specified
  • cigar - if --noCigar is not specified
  • mapping quality - if --mapQual or --all is specified
  • mate chrom:pos (chromosome name ':' 1 based position) - if --mate or --all is specified
  • insert size - if --isize or --all is specified
  • sequence - if --seq or --all is specified
  • base quality - if --baseQual or --all is specified
  • tag:type:value - for each tag:type specified in --tags, or for every tag if --all or --everyTag is specified
  • ...
  • tag:type:value
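The column order in the list above can be written down as a small lookup, which is handy when labelling the tab-separated fields of a record line. This is a sketch that follows the list literally; the function name `diff_field_order` and its keyword arguments are hypothetical, and it only covers explicitly named tags (not --everyTag) and output produced without --onlyDiffs.

```python
def diff_field_order(no_pos=False, no_cigar=False, all_fields=False,
                     map_qual=False, mate=False, isize=False,
                     seq=False, base_qual=False, tags=()):
    """Return the expected column names of a diff record line, in order."""
    fields = ["marker", "flag"]
    if not no_pos:
        fields.append("chrom:pos")
    if not no_cigar:
        fields.append("cigar")
    if map_qual or all_fields:
        fields.append("mapping quality")
    if mate or all_fields:
        fields.append("mate chrom:pos")
    if isize or all_fields:
        fields.append("insert size")
    if seq or all_fields:
        fields.append("sequence")
    if base_qual or all_fields:
        fields.append("base quality")
    # One column per requested Tag:Type, in the order given.
    fields.extend(tags)
    return fields


names = diff_field_order(seq=True, base_qual=True, tags=["OP:i", "MD:Z"])
line = ">\ta1\t1:70\t3S1M1S\tACGTN\t;46>>\tOP:i:75\tMD:Z:30A0C5"
labelled = dict(zip(names, line.split("\t")))
```

Here `labelled` maps each column name to the corresponding value of the record line.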


Prior to April 16, 2012:

  • '<' or '>'
  • flag
  • chrom:pos (chromosome name ':' 1 based position) - if --noPos is not specified
  • cigar - if --noCigar is not specified
  • sequence - if --seq is specified
  • base quality - if --baseQual is specified
  • tag:type:value - for each tag:type specified in --tags
  • ...
  • tag:type:value

If --onlyDiffs is specified, only the fields that are specified and are different get printed in lines 2 & 3.

If all fields are diffed and --onlyDiffs is specified, it may be difficult to determine which field is different.

Example Output

Command:

../bin/bam diff --in1 testFiles/testDiff1.sam --in2 testFiles/testDiff2.sam --seq --baseQual --tags "OP:i;MD:Z" --onlyDiffs --out results/diffOrderSam.log

Output:

18:462+29M5I3M:F:295
<	a1	1:78
>	a1	1:74
1
>	a1	1:70	3S1M1S	ACGTN	;46>>	OP:i:75	MD:Z:30A0C5
2
>	a1	1:72	3S1M1S	ACGTN	;47>>	OP:i:75	MD:Z:30A0C5
ABC
>	cd	*:0	*	*	*
DEF
>	cd	*:0	*	*	*
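Output in this format can be grouped back into per-read entries with a few lines of Python. The sketch below assumes that record lines always start with '<' or '>' followed by a tab and that read names never do; the function names are hypothetical.

```python
def parse_diff_output(text):
    """Group diff-format output into entries of read name plus record lines."""
    entries = []
    for line in text.splitlines():
        if not line:
            continue
        if line[0] in "<>" and line[1:2] == "\t":
            if not entries:
                raise ValueError("record line found before any read name")
            marker, *fields = line.split("\t")
            entries[-1]["records"].append((marker, fields))
        else:
            # Any other line starts a new difference with this read name.
            entries.append({"read_name": line, "records": []})
    return entries


def classify(entry):
    """Describe which input file(s) the entry's records came from."""
    markers = [marker for marker, _ in entry["records"]]
    if markers == ["<"]:
        return "only in in1"
    if markers == [">"]:
        return "only in in2"
    return "in both, fields differ"


sample = (
    "18:462+29M5I3M:F:295\n"
    "<\ta1\t1:78\n"
    ">\ta1\t1:74\n"
    "1\n"
    ">\ta1\t1:70\t3S1M1S\tACGTN\t;46>>\tOP:i:75\tMD:Z:30A0C5\n"
)
entries = parse_diff_output(sample)
```

For the first two differences of the example output, `classify` reports the first read as present in both files with differing fields, and the second as present only in the --in2 file.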

SAM/BAM Format

Use a .sam/.bam extension on the --out filename to output in SAM/BAM format instead of diff format.

In SAM/BAM format there will be 3 output files:

  1. the specified name with record diffs
  2. specified name with _only_<in1>.sam/bam with records only in the in1 file
  3. specified name with _only_<in2>.sam/bam with records only in the in2 file

When a record is found in both input files, but a difference is found, the record from the first file is written with additional tags to indicate the values from the second file, using the following tags:

  • ZF - Flag
  • ZP - Chrom/Pos
  • ZC - Cigar
  • ZM - Mapping Quality
  • ZN - Mate chrom/pos
  • ZI - Insert Size
  • ZS - Sequence
  • ZQ - Base Quality
  • ZT - Tags
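To read these tags back from a SAM-format diff file without extra libraries, a record line can be split on tabs and its optional fields scanned. This is a sketch under assumptions: the tag types and the way each value is encoded are not specified on this page, so values are returned as opaque strings, and the sample line is made up for illustration.

```python
# Tag meanings as listed above.
DIFF_TAGS = {
    "ZF": "flag",
    "ZP": "chrom/pos",
    "ZC": "cigar",
    "ZM": "mapping quality",
    "ZN": "mate chrom/pos",
    "ZI": "insert size",
    "ZS": "sequence",
    "ZQ": "base quality",
    "ZT": "tags",
}


def second_file_values(sam_line):
    """Return {field name: raw value} for the diff tags on one SAM line."""
    columns = sam_line.rstrip("\n").split("\t")
    values = {}
    # Optional TAG:TYPE:VALUE fields follow the 11 mandatory SAM columns.
    for optional in columns[11:]:
        tag, _type, value = optional.split(":", 2)
        if tag in DIFF_TAGS:
            values[DIFF_TAGS[tag]] = value
    return values


# Hypothetical record: the tag types (Z) and values are assumptions.
example = "read1\t0\t1\t70\t30\t5M\t*\t0\t0\tACGTN\t;46>>\tZP:Z:1:74\tZC:Z:3S1M1S\tNM:i:0"
diffs = second_file_values(example)
```

Only fields that differed carry a tag, so the keys of the returned dictionary show which fields changed between the two files for that record.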