BamUtil: diff

From Genome Analysis Wiki


Overview of the diff function of bamUtil

The diff option on the bamUtil executable prints the difference between two coordinate sorted SAM/BAM files. This can be used to compare the outputs of running a SAM/BAM through different tools/versions of tools.

The diff tool compares records that have the same Read Name and Fragment (from the flag). If a matching ReadName & Fragment is not found, the record is considered to be different.
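
The matching key described above can be sketched in a few lines. This is only an illustration: the flag bits 0x40 and 0x80 come from the SAM specification, while the function name `match_key` and the choice to represent the fragment as the masked flag bits are assumptions, not bamUtil's internal representation.

```python
# SAM flag bits defined by the SAM specification.
FLAG_FIRST_FRAGMENT = 0x40  # first segment in the template
FLAG_LAST_FRAGMENT = 0x80   # last segment in the template


def match_key(read_name, flag):
    """Return a (read name, fragment) key for pairing records across files.

    Assumption: the fragment is derived from the first/last bits of the
    flag; bamUtil may store it differently internally.
    """
    fragment = flag & (FLAG_FIRST_FRAGMENT | FLAG_LAST_FRAGMENT)
    return (read_name, fragment)


# Two mates share a read name but get distinct keys.
first_mate = match_key("read1", 99)    # flag 99 has bit 0x40 set
second_mate = match_key("read1", 147)  # flag 147 has bit 0x80 set
print(first_mate, second_mate)
```

A record in one file is then compared only against the record in the other file that has the same key.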

diff assumes the files are coordinate sorted and uses this assumption for determining how long to store a record before determining that the other file does not contain a matching ReadName/Fragment. If the files are not coordinate sorted, this logic does not work.

By default, just the chromosome/position and cigar are compared for each record.

Options are available to compare:

  • all fields
  • flags
  • mapping quality
  • mate chromosome/position
  • insert size
  • sequence
  • base quality
  • specified tags
  • all tags
  • turn off position comparison
  • turn off cigar comparison

Parameters

	Required Parameters:
		--in1         : first coordinate sorted SAM/BAM file to be diffed
		--in2         : second coordinate sorted SAM/BAM file to be diffed
	Optional Parameters:
		--out         : output filename, use .bam extension to output in SAM/BAM format instead of diff format.
		                In SAM/BAM format there will be 3 output files:
		                    1) the specified name with record diffs
		                    2) specified name with _only_<in1>.sam/bam with records only in the in1 file
		                    3) specified name with _only_<in2>.sam/bam with records only in the in2 file
		--all         : diff all the SAM/BAM fields.
		--flag        : diff the flags.
		--mapQual     : diff the mapping qualities.
		--mate        : diff the mate chrom/pos.
		--isize       : diff the insert sizes.
		--seq         : diff the sequence bases.
		--baseQual    : diff the base qualities.
		--tags        : diff the specified Tags formatted as Tag:Type;Tag:Type;Tag:Type...
		--everyTag    : diff all the Tags
		--noCigar     : do not diff the cigars.
		--noPos       : do not diff the positions.
		--onlyDiffs   : only print the fields that are different, otherwise for any diff all the fields that are compared are printed.
		--recPoolSize : number of records to allow to be stored at a time, default value: 1000000
		--posDiff     : max base pair difference between possibly matching records, default value: 100000
		--noeof       : do not expect an EOF block on a bam file.
		--params      : print the parameter settings

Usage

./bam diff --in1 <inputFile> --in2 <inputFile> [--out <outputFile>] [--all] [--flag] [--mapQual] [--mate] [--isize] [--seq] [--baseQual] [--tags <Tag:Type[;Tag:Type]*>] [--everyTag] [--noCigar] [--noPos] [--onlyDiffs] [--recPoolSize <int>] [--posDiff <int>] [--noeof] [--params]

Return Value

  • 0: all records are successfully read and written.
  • non-0: an error occurred processing the parameters or reading one of the files.
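
When diff is driven from a script, the usage line and return values above might be wrapped as follows. This is a minimal sketch: the helper names `build_diff_command` and `run_diff` are hypothetical, and the `./bam` path is taken from the usage line and may differ on your system.

```python
import subprocess


def build_diff_command(in1, in2, out=None, flags=(), tags=None, exe="./bam"):
    """Build the argument list for `bam diff` from the documented parameters.

    flags: option names without the leading dashes, e.g. ["seq", "onlyDiffs"].
    tags:  (Tag, Type) pairs, joined into the Tag:Type;Tag:Type form.
    """
    cmd = [exe, "diff", "--in1", in1, "--in2", in2]
    if out is not None:
        cmd += ["--out", out]
    if tags:
        cmd += ["--tags", ";".join(f"{tag}:{typ}" for tag, typ in tags)]
    cmd += [f"--{name}" for name in flags]
    return cmd


def run_diff(cmd):
    """Run the command and raise if bam reports a non-zero return value."""
    result = subprocess.run(cmd)
    if result.returncode != 0:
        raise RuntimeError(f"bam diff failed with return value {result.returncode}")


cmd = build_diff_command(
    "testFiles/testDiff1.sam",
    "testFiles/testDiff2.sam",
    out="results/diffOrderSam.log",
    flags=["seq", "baseQual", "onlyDiffs"],
    tags=[("OP", "i"), ("MD", "Z")],
)
print(" ".join(cmd))
```

Note that, per the list above, a return value of 0 reports that the records were read and written successfully; it is not described as indicating whether any differences were found, so the output file still has to be inspected.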

Output Format

2 Output Formats:

  1. Diff Format
  2. SAM/BAM Format

Diff Format

There are 2 types of differences.

  • ReadName/Fragment combo is in one file, but not in the other file within the window set by recPoolSize & posDiff
  • ReadName/Fragment combo is in both files, but at least one of the specified fields to diff is different

Each difference output consists of 2 or 3 lines. If the record only appears in one of the files, the diff is 2 lines; if it appears in both files, the diff is 3 lines.

The first line of the difference output is just the read name.

The 2nd and 3rd line (if present) begin with either a '<' or a '>'. If the record is from the first file (--in1), it begins with a '<'. If the record is from the 2nd file (--in2), it begins with a '>'.

The 2nd line is the flag followed by the diff'd fields from one of the records.

The 3rd line (if a matching record was found) is the flag followed by the diff'd fields from the matching record.

The diff'd record lines are tab separated, and are in the following order if --onlyDiffs is not specified:

After April 16, 2012:

  • '<' or '>'
  • flag
  • chrom:pos (chromosome name ':' 1 based position) - if --noPos is not specified
  • cigar - if --noCigar is not specified
  • mapping quality - if --mapQual or --all is specified
  • mate chrom:pos (chromosome name ':' 1 based position) - if --mate or --all is specified
  • insert size - if --isize or --all is specified
  • sequence - if --seq or --all is specified
  • base quality - if --baseQual or --all is specified
  • tag:type:value - for each tag:type specified in --tags or for every tag if --all or --everyTag specified
  • ...
  • tag:type:value


Prior to April 16, 2012:

  • '<' or '>'
  • flag
  • chrom:pos (chromosome name ':' 1 based position) - if --noPos is not specified
  • cigar - if --noCigar is not specified
  • sequence - if --seq is specified
  • base quality - if --baseQual is specified
  • tag:type:value - for each tag:type specified in --tags
  • ...
  • tag:type:value

If --onlyDiffs is specified, only the fields that are specified and are different get printed in lines 2 & 3.

If all fields are diffed and --onlyDiffs is specified, it may be difficult to determine which field is different.

Example Output

Command:

../bin/bam diff --in1 testFiles/testDiff1.sam --in2 testFiles/testDiff2.sam --seq --baseQual --tags "OP:i;MD:Z" --onlyDiffs --out results/diffOrderSam.log

Output:

18:462+29M5I3M:F:295
<	a1	1:78
>	a1	1:74
1
>	a1	1:70	3S1M1S	ACGTN	;46>>	OP:i:75	MD:Z:30A0C5
2
>	a1	1:72	3S1M1S	ACGTN	;47>>	OP:i:75	MD:Z:30A0C5
ABC
>	cd	*:0	*	*	*
DEF
>	cd	*:0	*	*	*
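
Output in this format can be grouped back into per-read entries with a small parser. The sketch below is an assumption-laden illustration, not part of bamUtil: it treats a line as a record line only if it starts with '<' or '>' followed by a tab, and treats every other non-empty line as a read name. The sample text is the first two entries of the output above.

```python
def parse_diff(text):
    """Group diff-format output into entries, one per read name line."""
    entries = []
    for line in text.splitlines():
        if not line.strip():
            continue
        if line.startswith(("<\t", ">\t")):
            if not entries:
                raise ValueError("record line found before any read name line")
            fields = line.split("\t")
            entries[-1]["records"].append({
                "file": "in1" if fields[0] == "<" else "in2",
                "flag": fields[1],      # kept as text, exactly as printed
                "fields": fields[2:],   # the diff'd fields, in printed order
            })
        else:
            entries.append({"read_name": line, "records": []})
    return entries


SAMPLE = (
    "18:462+29M5I3M:F:295\n"
    "<\ta1\t1:78\n"
    ">\ta1\t1:74\n"
    "1\n"
    ">\ta1\t1:70\t3S1M1S\tACGTN\t;46>>\tOP:i:75\tMD:Z:30A0C5\n"
)

for entry in parse_diff(SAMPLE):
    # Two record lines: found in both files; one record line: found in only one.
    where = "both files" if len(entry["records"]) == 2 else entry["records"][0]["file"] + " only"
    print(entry["read_name"], "->", where)
```

With --onlyDiffs the printed fields are not labeled, so the parser keeps them as an ordered list rather than guessing which field each value belongs to.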

SAM/BAM Format

Use a .sam or .bam extension on the --out filename to output in SAM/BAM format instead of diff format.

In SAM/BAM format there will be 3 output files:

  1. the specified name with record diffs
  2. specified name with _only_<in1>.sam/bam with records only in the in1 file
  3. specified name with _only_<in2>.sam/bam with records only in the in2 file

Records that are identical in the two files are not written in any of these output files.

When a record is found in both input files, but a difference is found, the record from the first file is written with additional tags to indicate the values from the second file, using the following tags:

  • ZF - Flag
  • ZP - Chromosome:1-based Position
  • ZC - Cigar
  • ZM - Mapping Quality
  • ZN - Chromosome:1-based Mate Position
  • ZI - Insert Size
  • ZS - Sequence
  • ZQ - Base Quality
  • ZT - Tags

If --onlyDiffs is not specified, all fields that were compared will be printed in the tags. If --onlyDiffs is specified, then only the differing compared fields will be printed in the tags.
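
For SAM text output, the second file's values can be pulled back out of these tags. The following is a rough sketch under stated assumptions: the function name `second_file_values` is hypothetical, the example record is made up for illustration, and the tag type character (`Z` here) is assumed rather than taken from bamUtil's actual output.

```python
# Tag names from the list above, mapped to the field they describe.
DIFF_TAGS = {
    "ZF": "flag",
    "ZP": "chromosome:position",
    "ZC": "cigar",
    "ZM": "mapping quality",
    "ZN": "mate chromosome:position",
    "ZI": "insert size",
    "ZS": "sequence",
    "ZQ": "base quality",
    "ZT": "tags",
}


def second_file_values(sam_line):
    """Return {field description: value recorded from the second file}."""
    columns = sam_line.rstrip("\n").split("\t")
    values = {}
    # Optional TAG:TYPE:VALUE columns follow the 11 mandatory SAM fields.
    for column in columns[11:]:
        tag, _type, value = column.split(":", 2)
        if tag in DIFF_TAGS:
            values[DIFF_TAGS[tag]] = value
    return values


# Hypothetical record: first-file values in the mandatory fields,
# second-file position and cigar in the ZP and ZC tags (types assumed).
line = "read1\t99\t1\t78\t30\t5M\t=\t200\t127\tACGTN\t;46>>\tZP:Z:1:74\tZC:Z:3S1M1S"
print(second_file_values(line))
```

Splitting each tag on the first two colons only keeps values such as `1:74` intact, since the ZP and ZN values themselves contain a colon.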