BamUtil: diff

From Genome Analysis Wiki
Revision as of 21:45, 29 January 2013 by Mktrost (talk | contribs)
Jump to navigationJump to search


Overview of the diff function of bamUtil

The diff option on the bamUtil executable prints the difference between two coordinate sorted SAM/BAM files. This can be used to compare the outputs of running a SAM/BAM through different tools/versions of tools.

The diff tool compares records that have the same Read Name and Fragment (from the flag). If a matching ReadName & Fragment is not found, the record is considered to be different.

diff assumes the files are coordinate sorted and uses this assumption for determining how long to store a record before determining that the other file does not contain a matching ReadName/Fragment. If the files are not coordinate sorted, this logic does not work.

By default, just the chromosome/position and cigar are compared for each record.

Note: The headers are not compared.

Options are available to compare:

  • all fields
  • flags
  • mapping quality
  • mate chromosome/position
  • insert size
  • sequence
  • base quality
  • specified tags
  • all tags
  • turn off position comparison
  • turn off cigar comparison

Parameters

	Required Parameters:
		--in1         : first coordinate sorted SAM/BAM file to be diffed
		--in2         : second coordinate sorted SAM/BAM file to be diffed
	Optional Parameters:
		--out         : output filename, use .bam extension to output in SAM/BAM format instead of diff format.
		                In SAM/BAM format there will be 3 output files:
		                    1) the specified name with record diffs
		                    2) specified name with _only_<in1>.sam/bam with records only in the in1 file
		                    3) specified name with _only_<in2>.sam/bam with records only in the in2 file
		--all         : diff all the SAM/BAM fields.
		--flag        : diff the flags.
		--mapQual     : diff the mapping qualities.
		--mate        : diff the mate chrom/pos.
		--isize       : diff the insert sizes.
		--seq         : diff the sequence bases.
		--baseQual    : diff the base qualities.
		--tags        : diff the specified Tags formatted as Tag:Type;Tag:Type;Tag:Type...
		--everyTag    : diff all the Tags
		--noCigar     : do not diff the the cigars.
		--noPos       : do not diff the positions.
		--onlyDiffs   : only print the fields that are different, otherwise for any diff all the fields that are compared are printed.
		--recPoolSize : number of records to allow to be stored at a time, default value: 1000000
		--posDiff     : max base pair difference between possibly matching records100000
		--noeof       : do not expect an EOF block on a bam file.
		--params      : print the parameter settings

input Files 1 & 2 (--in1 and --in2)

Use --in1 and --in2 followed by your file names to specify the SAM/BAM input files to compare.

The program automatically determines if your input files are SAM/BAM/uncompressed BAM unless your input file is stdin.

A - is used to indicate to read from stdin and the extension is used to determine the file type (no extension indicates SAM).

SAM/BAM/Uncompressed BAM from file --in1 yourFileName
SAM from stdin --in1 -
BAM from stdin --in1 -.bam
Uncompressed BAM from stdin --in1 -.ubam


Note: Uncompressed BAM is compressed using compression level-0 (so it is not an entirely uncompressed file). This matches the samtools implementation so pipes between our tools and samtools are supported.


Usage

./bam diff --in1 <inputFile> --in2 <inputFile> [--out <outputFile>] [--all] [--flag] [--mapQual] [--mate] [--isize] [--seq] [--baseQual] [--tags <Tag:Type[;Tag:Type]*>] [--everyTag] [--noCigar] [--noPos] [--onlyDiffs] [--recPoolSize <int>] [--posDiff <int>] [--noeof] [--params]


Return Value

  • 0: all records are successfully read and written.
  • non-0: an error occurred processing the parameters or reading one of the files.e

Output Format

2 Output Formats:

  1. Diff Format
  2. BAM Format

Diff Format

There are 2 types of differences.

  • ReadName/Fragment combo is in one file, but not in the other file within the window set by recPoolSize & posDiff
  • ReadName/Fragment combo is in both files, but at least one of the specified fields to diff is different

Each difference output consists of 2 or 3 lines. If the record only appears in one of the files, the diff is 2 lines, if it appears in both files, the diff is 3 lines.

The first line of the difference output is just the read name.

The 2nd and 3rd line (if present) begin with either a '<' or a '>'. If the record is from the first file (--in1), it begins with a '<'. If the record is from the 2nd file (--in2), it begins with a '>'.

The 2nd line is the flag followed by the diff'd fields from one of the records.

The 3rd line (if a matching record was found) is the flag followed by the diff'd fields from the matching record.

The diff'd record lines are tab separated, and are in the following order if --onlyDiffs is not specified: After April 16, 2012:

  • '<' or '>'
  • flag
  • chrom:pos (chromosome name ':' 1 based position) - if --noPos is not specified
  • cigar - if --noCigar is not specified
  • mapping quality - if --mapq or --all is specified
  • mate chrom:pos (chromosome name ':' 1 based position) - if --mate or --all is specified
  • insert size - if --isize or --all is specified
  • sequence - if --seq or --all is specified
  • base quality - if --baseQual or --all is specified
  • tag:type:value - for each tag:type specified in --tags or for every tag if --all or --everyTag specified
  • ...
  • tag:type:value


Prior to April 16, 2012:

  • '<' or '>'
  • flag
  • chrom:pos (chromosome name ':' 1 based position) - if --noPos is not specified
  • cigar - if --noCigar is not specified
  • sequence - if --seq is specified
  • base quality - if --baseQual is specified
  • tag:type:value - for each tag:type specified in --tags
  • ...
  • tag:type:value

If onlyDiffs is specified, only the fields that are specified and are different get printed in lines 2 & 3.

If all fields are diffed and --onlyDiffs is specified, it may be difficult to determine which field is different.

Example Output

Command:

../bin/bam diff --in1 testFiles/testDiff1.sam --in2 testFiles/testDiff2.sam --seq --baseQual --tags "OP:i;MD:Z" --onlyDiffs --out results/diffOrderSam.log

Output:

18:462+29M5I3M:F:295
<	a1	1:78
>	a1	1:74
1
>	a1	1:70	3S1M1S	ACGTN	;46>>	OP:i:75	MD:Z:30A0C5
2
>	a1	1:72	3S1M1S	ACGTN	;47>>	OP:i:75	MD:Z:30A0C5
ABC
>	cd	*:0	*	*	*
DEF
>	cd	*:0	*	*	*

SAM/Bam Format

use .sam/.bam extension to output in SAM/BAM format instead of diff format.

In SAM/BAM format there will be 3 output files:

  1. the specified name with record diffs
  2. specified name with _only_<in1>.sam/bam with records only in the in1 file
  3. specified name with _only_<in2>.sam/bam with records only in the in2 file

Records that are identical in the two files are not written in any of these output files.

When a record is found in both input files, but a difference is found, the record from the first file is written with additional tags to indicate the values from the second file, using the following tags:

  • ZF - Flag
  • ZP - Chromosome:1-based Position
  • ZC - Cigar
  • ZM - Mapping Quality
  • ZN - Chromosome:1-based Mate Position
  • ZI - Insert Size
  • ZS - Sequence
  • ZQ - Base Quality
  • ZT - Tags

If --onlyDiffs is not specified, all fields that were compared will be printed in the tags. If --onlyDiffs is specified, then only the differing compared fields will be printed in the tags.