BamUtil: diff
Revision as of 13:27, 21 May 2012
Overview of the diff function of bamUtil
The diff option on the bamUtil executable prints the differences between two coordinate-sorted SAM/BAM files. This can be used to compare the outputs of running a SAM/BAM file through different tools or different versions of a tool.
The diff tool compares records that have the same Read Name and Fragment (from the flag). If a matching ReadName & Fragment is not found, the record is considered to be different.
diff assumes the files are coordinate sorted and uses this assumption to decide how long to store a record before concluding that the other file does not contain a matching ReadName/Fragment. If the files are not coordinate sorted, this logic does not work.
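The reliance on coordinate sorting can be illustrated with a small Python sketch. This is not bamUtil's implementation: the function name match_sorted, the (read_name, fragment, position) tuples, and the single-chromosome simplification are all assumptions made for illustration, and the record pool limit (--recPoolSize) is not modelled. It only shows why a record can be declared unmatched once the other stream has moved more than posDiff bases past it.

```python
import heapq


def match_sorted(recs1, recs2, pos_diff=100000):
    """Pair records from two position-sorted streams (illustrative only).

    Each record is a hypothetical (read_name, fragment, position) tuple,
    and all records are assumed to lie on one chromosome.
    Returns (matched, only_in_1, only_in_2).
    """
    # Tag each record with its source (0 = first file, 1 = second file)
    # so both streams can be walked together in position order.
    tagged1 = ((pos, 0, (name, frag)) for name, frag, pos in recs1)
    tagged2 = ((pos, 1, (name, frag)) for name, frag, pos in recs2)

    pending = ({}, {})   # unmatched records seen so far, per file
    only = ([], [])      # records given up on, per file
    matched = []

    for pos, src, key in heapq.merge(tagged1, tagged2):
        # Because input is sorted, anything further back than pos_diff
        # can no longer find a partner within the allowed window.
        for side in (0, 1):
            for old_key, old_pos in list(pending[side].items()):
                if pos - old_pos > pos_diff:
                    del pending[side][old_key]
                    only[side].append(old_key)

        other = 1 - src
        if key in pending[other]:
            other_pos = pending[other].pop(key)
            pair = (pos, other_pos) if src == 0 else (other_pos, pos)
            matched.append((key, pair))
        else:
            pending[src][key] = pos

    # Whatever is still waiting at end of input never found a partner.
    for side in (0, 1):
        only[side].extend(pending[side])
    return matched, only[0], only[1]
```

With unsorted input the expiry step would discard records whose partner appears later in the other file, which is the failure the paragraph above describes.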
By default, just the chromosome/position and cigar are compared for each record.
Options are available to compare:
- all fields
- flags
- mapping quality
- mate chromosome/position
- insert size
- sequence
- base quality
- specified tags
- all tags
- turn off position comparison
- turn off cigar comparison
Parameters
    Required Parameters:
        --in1         : first coordinate sorted SAM/BAM file to be diffed
        --in2         : second coordinate sorted SAM/BAM file to be diffed
    Optional Parameters:
        --out         : output filename, use .bam extension to output in SAM/BAM format instead of diff format.
                        In SAM/BAM format there will be 3 output files:
                            1) the specified name with record diffs
                            2) specified name with _only_<in1>.sam/bam with records only in the in1 file
                            3) specified name with _only_<in2>.sam/bam with records only in the in2 file
        --all         : diff all the SAM/BAM fields.
        --flag        : diff the flags.
        --mapQual     : diff the mapping qualities.
        --mate        : diff the mate chrom/pos.
        --isize       : diff the insert sizes.
        --seq         : diff the sequence bases.
        --baseQual    : diff the base qualities.
        --tags        : diff the specified Tags formatted as Tag:Type;Tag:Type;Tag:Type...
        --everyTag    : diff all the Tags
        --noCigar     : do not diff the cigars.
        --noPos       : do not diff the positions.
        --onlyDiffs   : only print the fields that are different, otherwise for any diff all the fields that are compared are printed.
        --recPoolSize : number of records to allow to be stored at a time, default value: 1000000
        --posDiff     : max base pair difference between possibly matching records, default value: 100000
        --noeof       : do not expect an EOF block on a bam file.
        --params      : print the parameter settings
Usage
./bam diff --in1 <inputFile> --in2 <inputFile> [--out <outputFile>] [--all] [--flag] [--mapQual] [--mate] [--isize] [--seq] [--baseQual] [--tags <Tag:Type[;Tag:Type]*>] [--everyTag] [--noCigar] [--noPos] [--onlyDiffs] [--recPoolSize <int>] [--posDiff <int>] [--noeof] [--params]
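When the tool is driven from a script, the argument list can be assembled programmatically. The Python sketch below is one way to do that under stated assumptions: build_diff_command is a hypothetical helper, not part of bamUtil, and it only emits options that appear in the parameter list above. The --tags value is joined into the Tag:Type;Tag:Type form.

```python
def build_diff_command(in1, in2, out=None, flags=(), tags=None, bam="./bam"):
    """Return an argv list for 'bam diff' (hypothetical helper).

    flags: option names without the leading dashes, e.g. "seq", "onlyDiffs".
    tags:  (tag, type) pairs, joined as Tag:Type;Tag:Type for --tags.
    bam:   path to the bamUtil executable; "./bam" mirrors the usage line.
    """
    cmd = [bam, "diff", "--in1", in1, "--in2", in2]
    if out is not None:
        cmd += ["--out", out]
    cmd += [f"--{name}" for name in flags]
    if tags:
        cmd += ["--tags", ";".join(f"{tag}:{typ}" for tag, typ in tags)]
    return cmd
```

The resulting list can be passed to subprocess.run, which avoids shell quoting problems with the ';' in the tag list; a non-zero returncode then signals a parameter or file-reading error.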
Return Value
- 0: all records are successfully read and written.
- non-0: an error occurred processing the parameters or reading one of the files.
Output Format
2 Output Formats:
- Diff Format
- SAM/BAM Format
Diff Format
There are 2 types of differences.
- ReadName/Fragment combo is in one file, but not in the other file within the window set by recPoolSize & posDiff
- ReadName/Fragment combo is in both files, but at least one of the specified fields to diff is different
Each difference output consists of 2 or 3 lines. If the record appears in only one of the files, the diff is 2 lines; if it appears in both files, the diff is 3 lines.
The first line of the difference output is just the read name.
The 2nd and 3rd line (if present) begin with either a '<' or a '>'. If the record is from the first file (--in1), it begins with a '<'. If the record is from the 2nd file (--in2), it begins with a '>'.
The 2nd line is the flag followed by the diff'd fields from one of the records.
The 3rd line (if a matching record was found) is the flag followed by the diff'd fields from the matching record.
The diff'd record lines are tab separated and, if --onlyDiffs is not specified, are in the following order:
After April 16, 2012:
- '<' or '>'
- flag
- chrom:pos (chromosome name ':' 1 based position) - if --noPos is not specified
- cigar - if --noCigar is not specified
- mapping quality - if --mapQual or --all is specified
- mate chrom:pos (chromosome name ':' 1 based position) - if --mate or --all is specified
- insert size - if --isize or --all is specified
- sequence - if --seq or --all is specified
- base quality - if --baseQual or --all is specified
- tag:type:value - for each tag:type specified in --tags or for every tag if --all or --everyTag specified
- ...
- tag:type:value
Prior to April 16, 2012:
- '<' or '>'
- flag
- chrom:pos (chromosome name ':' 1 based position) - if --noPos is not specified
- cigar - if --noCigar is not specified
- sequence - if --seq is specified
- base quality - if --baseQual is specified
- tag:type:value - for each tag:type specified in --tags
- ...
- tag:type:value
If --onlyDiffs is specified, only the fields that are specified and are different get printed in lines 2 & 3.
If all fields are diffed and --onlyDiffs is specified, it may be difficult to determine which field is different.
Example Output
Command:
../bin/bam diff --in1 testFiles/testDiff1.sam --in2 testFiles/testDiff2.sam --seq --baseQual --tags "OP:i;MD:Z" --onlyDiffs --out results/diffOrderSam.log
Output:
18:462+29M5I3M:F:295
<	a1	1:78
>	a1	1:74
1
>	a1	1:70	3S1M1S	ACGTN	;46>>	OP:i:75	MD:Z:30A0C5
2
>	a1	1:72	3S1M1S	ACGTN	;47>>	OP:i:75	MD:Z:30A0C5
ABC
>	cd	*:0	*	*	*
DEF
>	cd	*:0	*	*	*
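Output in the diff format can be read back with a few lines of Python. The sketch below follows the description above: a line that starts with '<' or '>' is a record line belonging to the most recent read name, and any other line starts a new difference. The function name parse_diff and the dictionary layout are assumptions for illustration; the sketch also assumes that no read name begins with '<' or '>'.

```python
def parse_diff(text):
    """Group diff-format lines into one entry per read name (illustrative)."""
    entries = []
    for line in text.splitlines():
        if not line.strip():
            continue  # skip blank lines
        if line[0] in "<>":
            if not entries:
                raise ValueError("record line found before any read name")
            fields = line.split("\t")
            entries[-1]["records"].append({
                # '<' marks the --in1 file, '>' marks the --in2 file
                "source": "in1" if fields[0] == "<" else "in2",
                "flag": fields[1],
                "fields": fields[2:],  # the diff'd fields, in printed order
            })
        else:
            entries.append({"read_name": line, "records": []})
    return entries


sample = (
    "18:462+29M5I3M:F:295\n"
    "<\ta1\t1:78\n"
    ">\ta1\t1:74\n"
    "1\n"
    ">\ta1\t1:70\t3S1M1S\tACGTN\t;46>>\tOP:i:75\tMD:Z:30A0C5\n"
)
entries = parse_diff(sample)
```

Because --onlyDiffs changes which fields are printed, the "fields" list is left uninterpreted here; mapping positions to field names requires knowing which options were used for the run.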
SAM/BAM Format
Use a .sam/.bam extension on the output filename to output in SAM/BAM format instead of diff format.
In SAM/BAM format there will be 3 output files:
- the specified name with record diffs
- specified name with _only_<in1>.sam/bam with records only in the in1 file
- specified name with _only_<in2>.sam/bam with records only in the in2 file
When a record is found in both input files, but a difference is found, the record from the first file is written with additional tags to indicate the values from the second file, using the following tags:
- ZF - Flag
- ZP - Chromosome:1-based Position
- ZC - Cigar
- ZM - Mapping Quality
- ZN - Chromosome:1-based Mate Position
- ZI - Insert Size
- ZS - Sequence
- ZQ - Base Quality
- ZT - Tags
If --onlyDiffs is not specified, all fields that were compared will be printed in the tags. If --onlyDiffs is specified, then only the differing compared fields will be printed in the tags.
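For SAM text output, the second-file values can be pulled out of a record's optional fields with plain string handling. The Python sketch below is a minimal illustration: second_file_values and the sample line are hypothetical, and the tag type shown in the sample (Z) is an assumption, since the types bamUtil writes for these tags are not stated here. The parser itself ignores the type and returns every value as a string.

```python
# Names for the tags listed above; the dictionary itself is illustrative.
DIFF_TAGS = {
    "ZF": "flag",
    "ZP": "position",
    "ZC": "cigar",
    "ZM": "mapping_quality",
    "ZN": "mate_position",
    "ZI": "insert_size",
    "ZS": "sequence",
    "ZQ": "base_quality",
    "ZT": "tags",
}


def second_file_values(sam_line):
    """Return the second file's values stored in a diff record's tags."""
    columns = sam_line.rstrip("\n").split("\t")
    values = {}
    # SAM optional fields start at column 12 and look like TAG:TYPE:VALUE.
    for optional in columns[11:]:
        tag, _type, value = optional.split(":", 2)
        if tag in DIFF_TAGS:
            values[DIFF_TAGS[tag]] = value
    return values


# Hypothetical record: position and cigar differed in the second file.
line = "read1\t0\t1\t78\t30\t5M\t*\t0\t0\tACGTN\t;46>>\tZP:Z:1:74\tZC:Z:3S1M1S"
values = second_file_values(line)
```

Splitting on at most two colons keeps values such as the chromosome:position pair in ZP intact.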