BamUtil: diff

From Genome Analysis Wiki
Revision as of 17:00, 6 January 2014 by Mktrost (talk | contribs) (→‎Optional Parmaeters)
(diff) ← Older revision | Latest revision (diff) | Newer revision → (diff)
Jump to navigationJump to search


Overview of the diff function of bamUtil

The diff option on the bamUtil executable prints the difference between two coordinate sorted SAM/BAM files. This can be used to compare the outputs of running a SAM/BAM through different tools/versions of tools.

The diff tool compares records that have the same Read Name and Fragment (from the flag). If a matching ReadName & Fragment is not found, the record is considered to be different.

diff assumes the files are coordinate sorted and uses this assumption for determining how long to store a record before determining that the other file does not contain a matching ReadName/Fragment. If the files are not coordinate sorted, this logic does not work.

By default, just the chromosome/position and cigar are compared for each record.

Note: The headers are not compared.

Options are available to compare:

  • all fields
  • flags
  • mapping quality
  • mate chromosome/position
  • insert size
  • sequence
  • base quality
  • specified tags
  • all tags
  • turn off position comparison
  • turn off cigar comparison


Usage

./bam diff --in1 <inputFile> --in2 <inputFile> [--out <outputFile>] [--all] [--flag] [--mapQual] [--mate] [--isize] [--seq] [--baseQual] [--tags <Tag:Type[;Tag:Type]*>] [--everyTag] [--noCigar] [--noPos] [--onlyDiffs] [--recPoolSize <int>] [--posDiff <int>] [--noeof] [--params]


Parameters

	Required Parameters:
		--in1         : first coordinate sorted SAM/BAM file to be diffed
		--in2         : second coordinate sorted SAM/BAM file to be diffed
	Optional Parameters:
		--out         : output filename, use .bam extension to output in SAM/BAM format instead of diff format.
		                In SAM/BAM format there will be 3 output files:
		                    1) the specified name with record diffs
		                    2) specified name with _only_<in1>.sam/bam with records only in the in1 file
		                    3) specified name with _only_<in2>.sam/bam with records only in the in2 file
		--all         : diff all the SAM/BAM fields.
		--flag        : diff the flags.
		--mapQual     : diff the mapping qualities.
		--mate        : diff the mate chrom/pos.
		--isize       : diff the insert sizes.
		--seq         : diff the sequence bases.
		--baseQual    : diff the base qualities.
		--tags        : diff the specified Tags formatted as Tag:Type;Tag:Type;Tag:Type...
		--everyTag    : diff all the Tags
		--noCigar     : do not diff the the cigars.
		--noPos       : do not diff the positions.
		--onlyDiffs   : only print the fields that are different, otherwise for any diff all the fields that are compared are printed.
		--recPoolSize : number of records to allow to be stored at a time, default value: 1000000
		                Set to -1 for unlimited number of records
		--posDiff     : max base pair difference between possibly matching records, default value: 100000
		--noeof       : do not expect an EOF block on a bam file.
		--params      : print the parameter settings
	PhoneHome:
		--noPhoneHome       : disable PhoneHome (default enabled)
		--phoneHomeThinning : adjust the PhoneHome thinning parameter (default 50)

Required Parameters

input Files 1 & 2 (--in1 and --in2)

Use --in1 and --in2 followed by your file names to specify the SAM/BAM input files to compare. They are both required.

The program automatically determines if your input files are SAM/BAM/uncompressed BAM unless your input file is stdin.

A - is used to indicate to read from stdin and the extension is used to determine the file type (no extension indicates SAM).

SAM/BAM/Uncompressed BAM from file --in1 yourFileName
SAM from stdin --in1 -
BAM from stdin --in1 -.bam
Uncompressed BAM from stdin --in1 -.ubam


Note: Uncompressed BAM is compressed using compression level-0 (so it is not an entirely uncompressed file). This matches the samtools implementation so pipes between our tools and samtools are supported.

Optional Parmaeters

output File (--out)

Use --out (optional) to specify the name of the output file.

It is output in Diff Format by default. Specify the filename with a .bam, .sam, .ubam extension to output in SAM/BAM Format.

Fields to Diff (--all, --flag, --mapQual, --mate, --isize, --seq, --baseQual, --tags, --everyTag, --noCigar, --noPos, )

By default only the chromosome/position and cigar are compared for each record.

SAM/BAM Record fields:

Field Name Flag to Enable Flag to Disable
Read Name used to match records between files
Flag Fragment bit used to match records between files
Flag other bits --flag
Reference (chrom) Name (on by default) --noPos
Position
Mapping Quality --mapQual
Cigar (on by default) --noCigar
Mate Reference (chrom) --mate
Mate Position
Insert Size --isize
Sequence --seq
Quality --baseQual

To diff all Tags, use --everyTag. To diff only certain tags, use --tags Tag1:Type1;Tag2:Type2;Tag3:Type3 specifying a semicolon separated list of tag/type pairs (separated by a colon).

OR use --all to diff all SAM/BAM record fields.

Only print different fields (--onlyDiffs)

Specify --onlyDiffs to only print the fields that are different, otherwise for any diff all the fields that are compared are printed. The read name is always printed.

Maximum Number of Records That Can be Allocated (--recPoolSize)

When comparing the files, matching reads may not have the same positions and thus may not be at the same location in the files. In this case, reads need to be stored until its match is found in the other file.

--recPoolSize is used to specify the number of records allowed to be allocated at one time by the program. Set it to -1 to allow unlimited records. Note: If the number of allocated records is large, it will use up a large amount of memory.

The default pool size is 1000000.

Records are released when the match is found in the other file or when the opposite file is [[Maximum Base Pair Difference Between Possibly Matching Records (--posDiff)|--posDiff]] number of positions past the position in the record.

When the Pool Size is exceeded, the oldest record in the file that has more records stored is released and treated as unique to that file. If the matching record is later found in the other file, it will also be treated as unique to its file. At the end of the run, a warning message is printed with the number of times the PoolSize was hit and records were forced to be released.

Maximum Base Pair Difference Between Possibly Matching Records (--posDiff)

In order to limit th number of records that are held onto while looking for matching records, a maximum difference in position between the matches is used. This value is defaulted to 100000 amd cam be modified using --posDiff. Any matching pairs that are further than --posDiff are treated as unique to their files.

Note: No warning message is printed about --posDiff affecting your output since the software doesn't know if the matching records don't exist or are just further away.

Do not require BGZF EOF block (--noeof)

Use --noeof if you do not expect a trailing eof block in your bgzf file.

By default, the trailing empty block is expected and checked for.

Print the Program Parameters (--params)

Use --params to print the parameters for your program to stderr.

PhoneHome Parameters

See PhoneHome for more information on how PhoneHome works and what it does.

Turn off PhoneHome (--noPhoneHome)

Use the --noPhoneHome option to completely disable PhoneHome. PhoneHome is enabled by default based on the thinning parameter.

Adjust the Frequency of PhoneHome (--phoneHomeThinning)

Use --phoneHomeThinning to modify the percentage of the time that PhoneHome will run (0-100).

  • By default, --phoneHomeThinning is set to 50, running 50% of the time.
  • PhoneHome will only occur if the run's random number modulo 100 is less than the --phoneHomeThinning value.
  • N/A if --noPhoneHome is set.

Return Value

  • 0: all records are successfully read and written.
  • non-0: an error occurred processing the parameters or reading one of the files.e

Output Format

2 Output Formats:

  1. Diff Format
  2. BAM Format

Diff Format

There are 2 types of differences.

  • ReadName/Fragment combo is in one file, but not in the other file within the window set by recPoolSize & posDiff
  • ReadName/Fragment combo is in both files, but at least one of the specified fields to diff is different

Each difference output consists of 2 or 3 lines. If the record only appears in one of the files, the diff is 2 lines, if it appears in both files, the diff is 3 lines.

The first line of the difference output is just the read name.

The 2nd and 3rd line (if present) begin with either a '<' or a '>'. If the record is from the first file (--in1), it begins with a '<'. If the record is from the 2nd file (--in2), it begins with a '>'.

The 2nd line is the flag followed by the diff'd fields from one of the records.

The 3rd line (if a matching record was found) is the flag followed by the diff'd fields from the matching record.

The diff'd record lines are tab separated, and are in the following order if --onlyDiffs is not specified: After April 16, 2012:

  • '<' or '>'
  • flag
  • chrom:pos (chromosome name ':' 1 based position) - if --noPos is not specified
  • cigar - if --noCigar is not specified
  • mapping quality - if --mapq or --all is specified
  • mate chrom:pos (chromosome name ':' 1 based position) - if --mate or --all is specified
  • insert size - if --isize or --all is specified
  • sequence - if --seq or --all is specified
  • base quality - if --baseQual or --all is specified
  • tag:type:value - for each tag:type specified in --tags or for every tag if --all or --everyTag specified
  • ...
  • tag:type:value


Prior to April 16, 2012:

  • '<' or '>'
  • flag
  • chrom:pos (chromosome name ':' 1 based position) - if --noPos is not specified
  • cigar - if --noCigar is not specified
  • sequence - if --seq is specified
  • base quality - if --baseQual is specified
  • tag:type:value - for each tag:type specified in --tags
  • ...
  • tag:type:value

If onlyDiffs is specified, only the fields that are specified and are different get printed in lines 2 & 3.

If all fields are diffed and --onlyDiffs is specified, it may be difficult to determine which field is different.

Example Output

Command:

../bin/bam diff --in1 testFiles/testDiff1.sam --in2 testFiles/testDiff2.sam --seq --baseQual --tags "OP:i;MD:Z" --onlyDiffs --out results/diffOrderSam.log

Output:

18:462+29M5I3M:F:295
<	a1	1:78
>	a1	1:74
1
>	a1	1:70	3S1M1S	ACGTN	;46>>	OP:i:75	MD:Z:30A0C5
2
>	a1	1:72	3S1M1S	ACGTN	;47>>	OP:i:75	MD:Z:30A0C5
ABC
>	cd	*:0	*	*	*
DEF
>	cd	*:0	*	*	*

SAM/BAM Format

use .sam/.bam extension to output in SAM/BAM format instead of diff format.

In SAM/BAM format there will be 3 output files:

  1. the specified name with record diffs
  2. specified name with _only_<in1>.sam/bam with records only in the in1 file
  3. specified name with _only_<in2>.sam/bam with records only in the in2 file

Records that are identical in the two files are not written in any of these output files.

When a record is found in both input files, but a difference is found, the record from the first file is written with additional tags to indicate the values from the second file, using the following tags:

  • ZF - Flag
  • ZP - Chromosome:1-based Position
  • ZC - Cigar
  • ZM - Mapping Quality
  • ZN - Chromosome:1-based Mate Position
  • ZI - Insert Size
  • ZS - Sequence
  • ZQ - Base Quality
  • ZT - Tags

If --onlyDiffs is not specified, all fields that were compared will be printed in the tags. If --onlyDiffs is specified, then only the differing compared fields will be printed in the tags.