Changes

From Genome Analysis Wiki
6,415 bytes added, 17:00, 6 January 2014
Line 11: Line 11:     
By default, just the chromosome/position and cigar are compared for each record.
 
By default, just the chromosome/position and cigar are compared for each record.
 +
 +
Note: The headers are not compared.
    
Options are available to compare:
 
Options are available to compare:
 +
* all fields
 +
* flags
 +
* mapping quality
 +
* mate chromosome/position
 +
* insert size
 
* sequence
 
* sequence
 
* base quality
 
* base quality
 
* specified tags
 
* specified tags
 +
* all tags
 
* turn off position comparison
 
* turn off position comparison
 
* turn off cigar comparison
 
* turn off cigar comparison
 +
 +
 +
= Usage =
 +
./bam diff --in1 <inputFile> --in2 <inputFile> [--out <outputFile>] [--all] [--flag] [--mapQual] [--mate] [--isize] [--seq] [--baseQual] [--tags <Tag:Type[;Tag:Type]*>] [--everyTag] [--noCigar] [--noPos] [--onlyDiffs] [--recPoolSize <int>] [--posDiff <int>] [--noeof] [--params]
 +
    
= Parameters =
 
= Parameters =
Line 26: Line 39:  
Optional Parameters:
 
Optional Parameters:
 
--out        : output filename, use .bam extension to output in SAM/BAM format instead of diff format.
 
--out        : output filename, use .bam extension to output in SAM/BAM format instead of diff format.
                In SAMBAM format there will be 3 output files:
+
                In SAM/BAM format there will be 3 output files:
 
                    1) the specified name with record diffs
 
                    1) the specified name with record diffs
 
                    2) specified name with _only_<in1>.sam/bam with records only in the in1 file
 
                    2) specified name with _only_<in1>.sam/bam with records only in the in1 file
 
                    3) specified name with _only_<in2>.sam/bam with records only in the in2 file
 
                    3) specified name with _only_<in2>.sam/bam with records only in the in2 file
 +
--all        : diff all the SAM/BAM fields.
 +
--flag        : diff the flags.
 +
--mapQual    : diff the mapping qualities.
 +
--mate        : diff the mate chrom/pos.
 +
--isize      : diff the insert sizes.
 
--seq        : diff the sequence bases.
 
--seq        : diff the sequence bases.
 
--baseQual    : diff the base qualities.
 
--baseQual    : diff the base qualities.
 
--tags        : diff the specified Tags formatted as Tag:Type;Tag:Type;Tag:Type...
 
--tags        : diff the specified Tags formatted as Tag:Type;Tag:Type;Tag:Type...
 +
--everyTag    : diff all the Tags
 
--noCigar    : do not diff the cigars.
 
--noCigar    : do not diff the cigars.
 
--noPos      : do not diff the positions.
 
--noPos      : do not diff the positions.
 
--onlyDiffs  : only print the fields that are different, otherwise for any diff all the fields that are compared are printed.
 
--onlyDiffs  : only print the fields that are different, otherwise for any diff all the fields that are compared are printed.
 
--recPoolSize : number of records to allow to be stored at a time, default value: 1000000
 
--recPoolSize : number of records to allow to be stored at a time, default value: 1000000
--posDiff    : max base pair difference between possibly matching records100000
+
                Set to -1 for unlimited number of records
 +
--posDiff    : max base pair difference between possibly matching records, default value: 100000
 
--noeof      : do not expect an EOF block on a bam file.
 
--noeof      : do not expect an EOF block on a bam file.
 
--params      : print the parameter settings
 
--params      : print the parameter settings
 
</pre>
 
</pre>
 +
{{PhoneHomeParamDesc}}
 +
 +
== Required Parameters ==
 +
 +
=== input Files 1 & 2 (<code>--in1</code> and <code>--in2</code>)  ===
 +
 +
Use <code>--in1</code> and <code>--in2</code> followed by your file names to specify the SAM/BAM input files to compare.  They are both required.
 +
 +
The program automatically determines if your input files are SAM/BAM/uncompressed BAM unless your input file is stdin.
 +
 +
A <code>-</code> indicates that the input is read from stdin, and the extension following it is used to determine the file type (no extension indicates SAM).
 +
 +
{|border="1" cellspacing="0" cellpadding="2"
 +
|SAM/BAM/Uncompressed BAM from file
 +
| <code>--in1 yourFileName</code>
 +
|-
 +
|SAM from stdin
 +
| <code>--in1 -</code>
 +
|-
 +
|BAM from stdin
 +
| <code>--in1 -.bam</code>
 +
|-
 +
|Uncompressed BAM from stdin
 +
| <code>--in1 -.ubam</code>
 +
|}
 +
 +
 +
Note: Uncompressed BAM is compressed using compression level-0 (so it is not an entirely uncompressed file).  This matches the <code>samtools</code> implementation so pipes between our tools and <code>samtools</code> are supported.
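The stdin conventions in the table above can be sketched as a small lookup. This is an illustrative Python helper, not part of bamUtil: the function name <code>stdin_file_type</code> is hypothetical, and it only covers the three stdin forms listed in the table.

```python
def stdin_file_type(arg):
    """Return the file type implied by a stdin-style --in1/--in2 value.

    Hypothetical helper that mirrors the table above; it is not bamUtil code.
    """
    if not arg.startswith("-"):
        raise ValueError("not a stdin argument: %r" % arg)
    extension = arg[1:]  # text after the leading '-'
    types = {
        "": "SAM",                    # --in1 -
        ".bam": "BAM",                # --in1 -.bam
        ".ubam": "Uncompressed BAM",  # --in1 -.ubam
    }
    if extension not in types:
        raise ValueError("extension not listed in the table: %r" % extension)
    return types[extension]


print(stdin_file_type("-.ubam"))
```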
 +
 +
== Optional Parameters ==
 +
=== output File (<code>--out</code>)  ===
 +
Use <code>--out</code> (optional) to specify the name of the output file.
 +
 +
It is output in [[Diff Format]] by default.  Specify the filename with a .bam, .sam, or .ubam extension to output in [[SAM/BAM Format]].
 +
 +
=== Fields to Diff (<code>--all</code>, <code>--flag</code>, <code>--mapQual</code>, <code>--mate</code>, <code>--isize</code>, <code>--seq</code>, <code>--baseQual</code>, <code>--tags</code>, <code>--everyTag</code>, <code>--noCigar</code>, <code>--noPos</code>) ===
 +
 +
By default only the chromosome/position and cigar are compared for each record.
 +
 +
SAM/BAM Record fields:
 +
{|border="1" cellspacing="0" cellpadding="2"
 +
! Field Name !! Flag to Enable !! Flag to Disable
 +
|-
 +
|Read Name ||colspan="2"|used to match records between files
 +
|-
 +
|Flag Fragment bit || colspan="2"|used to match records between files
 +
|-
 +
| Flag other bits || --flag ||
 +
|-
 +
| Reference (chrom) Name
 +
| rowspan="2" |  ''(on by default)'' || rowspan="2" |--noPos
 +
|-
 +
| Position
 +
|-
 +
|Mapping Quality || --mapQual ||
 +
|-
 +
| Cigar || ''(on by default)'' || --noCigar
 +
|-
 +
| Mate Reference (chrom) || rowspan="2" | --mate ||
 +
|-
 +
| Mate Position ||
 +
|-
 +
| Insert Size || --isize ||
 +
|-
 +
| Sequence || --seq ||
 +
|-
 +
| Quality || --baseQual ||
 +
|-
 +
|}
 +
 +
To diff all Tags, use <code>--everyTag</code>.  To diff only certain tags, use <code>--tags Tag1:Type1;Tag2:Type2;Tag3:Type3</code> specifying a semicolon separated list of tag/type pairs (separated by a colon).
 +
 +
'''OR''' use <code>--all</code> to diff all SAM/BAM record fields.
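As a rough sketch of the <code>--tags</code> format, the following Python snippet joins tag/type pairs with colons and semicolons and builds an argument list for the command. The helper names and the file names <code>a.bam</code> and <code>b.bam</code> are hypothetical; <code>NM:i</code> and <code>MD:Z</code> are just example tags. Note that when typing the command in a shell, the <code>--tags</code> value should be quoted, since an unquoted semicolon ends the command.

```python
def format_tags_option(tags):
    """Join (tag, type) pairs into the Tag:Type;Tag:Type form used by --tags."""
    parts = []
    for tag, tag_type in tags:
        if len(tag) != 2:
            raise ValueError("SAM tags are two characters long: %r" % tag)
        parts.append("%s:%s" % (tag, tag_type))
    return ";".join(parts)


def build_diff_command(in1, in2, tags):
    """Build an argument list (hypothetical helper; nothing is executed here)."""
    return ["./bam", "diff", "--in1", in1, "--in2", in2,
            "--tags", format_tags_option(tags)]


# File names below are placeholders for illustration.
command = build_diff_command("a.bam", "b.bam", [("NM", "i"), ("MD", "Z")])
print(" ".join(command))
```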
 +
 +
=== Only print different fields (<code>--onlyDiffs</code>)===
 +
 +
Specify <code>--onlyDiffs</code> to only print the fields that are different, otherwise for any diff all the fields that are compared are printed.  The read name is always printed.
 +
 +
=== Maximum Number of Records That Can be Allocated (<code>--recPoolSize</code>)===
 +
When comparing the files, matching reads may not have the same positions and thus may not be at the same location in the files.  In this case, a read needs to be stored until its match is found in the other file.
 +
 +
<code>--recPoolSize</code> is used to specify the number of records allowed to be allocated at one time by the program.  Set it to -1 to allow unlimited records.  Note: If the number of allocated records is large, it will use up a large amount of memory.
 +
 +
The default pool size is 1000000.
 +
 +
Records are released when the match is found in the other file or when the opposite file is [[Maximum Base Pair Difference Between Possibly Matching Records (<code>--posDiff</code>)|--posDiff]] number of positions past the position in the record.
   −
= Usage =
+
When the Pool Size is exceeded, the oldest record in the file that has more records stored is released and treated as unique to that file.  If the matching record is later found in the other file, it will also be treated as unique to its file.  At the end of the run, a warning message is printed with the number of times the PoolSize was hit and records were forced to be released.
  ./bam diff --in1 <inputFile> --in2 <inputFile> [--out <outputFile>] [--baseQual] [--tags <Tag:Type[;Tag:Type]*>] [--noCigar] [--noPos] [--onlyDiffs] [--recPoolSize <int>] [--posDiff <int>] [--noeof] [--params]
+
 
 +
=== Maximum Base Pair Difference Between Possibly Matching Records (<code>--posDiff</code>)===
 +
In order to limit the number of records that are held onto while looking for matching records, a maximum difference in position between the matches is used. This value defaults to 100000 and can be modified using <code>--posDiff</code>.  Any matching pairs that are further apart than <code>--posDiff</code> are treated as unique to their files.
 +
 
 +
Note: No warning message is printed about <code>--posDiff</code> affecting your output since the software doesn't know if the matching records don't exist or are just further away.
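The release rule described for <code>--posDiff</code> can be illustrated with a minimal Python sketch. This is not the bamUtil implementation: the function name is hypothetical, both positions are assumed to be on the same chromosome, and treating the limit as a strict "greater than" comparison is an assumption.

```python
def is_past_pos_diff(stored_pos, other_file_pos, pos_diff=100000):
    """Return True if the other file has moved more than pos_diff past a stored record.

    Assumes both positions are on the same chromosome; the strict '>' is a guess
    at the boundary behaviour, not something the documentation states.
    """
    return other_file_pos - stored_pos > pos_diff


# A record stored at position 1000 is kept while the other file is still nearby,
# and released as unique once the other file is too far ahead.
print(is_past_pos_diff(1000, 50000))
print(is_past_pos_diff(1000, 101001))
```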
 +
 
 +
{{noeofBGZFParameter}}
 +
{{paramsParameter}}
 +
 
 +
{{PhoneHomeParameters}}
    
= Return Value =
 
= Return Value =
Line 68: Line 183:     
The 3rd line (if a matching record was found) is the flag followed by the diff'd fields from the matching record.
 
The 3rd line (if a matching record was found) is the flag followed by the diff'd fields from the matching record.
 +
 +
The diff'd record lines are tab separated, and are in the following order if --onlyDiffs is not specified:
 +
After April 16, 2012:
 +
* '<' or '>'
 +
* flag
 +
* chrom:pos (chromosome name ':' 1 based position) - if --noPos is not specified
 +
* cigar - if --noCigar is not specified
 +
* mapping quality - if --mapQual or --all is specified
 +
* mate chrom:pos (chromosome name ':' 1 based position) - if --mate or --all is specified
 +
* insert size - if --isize or --all is specified
 +
* sequence - if --seq or --all is specified
 +
* base quality - if --baseQual or --all is specified
 +
* tag:type:value - for each tag:type specified in --tags or for every tag if --all or --everyTag specified
 +
* ...
 +
* tag:type:value
      −
The diff'd record lines are tab separated, and are in the following order if --onlyDiffs is not specified:
+
 
 +
Prior to April 16, 2012:
 
* '<' or '>'
 
* '<' or '>'
 
* flag
 
* flag
Line 82: Line 213:     
If <code>--onlyDiffs</code> is specified, only the fields that are specified and are different get printed in lines 2 & 3.
 
If <code>--onlyDiffs</code> is specified, only the fields that are specified and are different get printed in lines 2 & 3.
 +
 +
If all fields are diffed and <code>--onlyDiffs</code> is specified, it may be difficult to determine which field is different.
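When <code>--onlyDiffs</code> is not specified, the column order listed above (the post April 16, 2012 layout) is fixed by the options used, so a record line can be split into named fields. The Python sketch below assumes the <code>'<'</code> or <code>'>'</code> marker is its own tab-separated column; the column labels, function names, and sample line are invented for illustration.

```python
def expected_columns(no_pos=False, no_cigar=False, map_qual=False,
                     mate=False, isize=False, seq=False,
                     base_qual=False, all_fields=False):
    """List the fixed columns of a diff'd record line, in the documented order.

    Column labels are illustrative names, not identifiers used by bamUtil.
    """
    columns = ["side", "flag"]  # '<' or '>' marker, then the flag
    if not no_pos:
        columns.append("chrom:pos")
    if not no_cigar:
        columns.append("cigar")
    optional = [
        ("mapping quality", map_qual),
        ("mate chrom:pos", mate),
        ("insert size", isize),
        ("sequence", seq),
        ("base quality", base_qual),
    ]
    for name, enabled in optional:
        if enabled or all_fields:
            columns.append(name)
    return columns


def parse_diff_line(line, columns):
    """Split one tab-separated record line; trailing fields are tag:type:value."""
    fields = line.rstrip("\n").split("\t")
    record = dict(zip(columns, fields))
    record["tags"] = fields[len(columns):]
    return record


# Invented sample line, for illustration only.
sample = "<\t99\t1:1011\t23M\t60"
parsed = parse_diff_line(sample, expected_columns(map_qual=True))
print(parsed)
```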
    
=== Example Output ===
 
=== Example Output ===
Line 102: Line 235:  
</pre>
 
</pre>
   −
== SAM/Bam Format ==
+
== SAM/BAM Format ==
 
use .sam/.bam extension to output in SAM/BAM format instead of diff format.
 
use .sam/.bam extension to output in SAM/BAM format instead of diff format.
   Line 109: Line 242:  
# specified name with _only_<in1>.sam/bam with records only in the in1 file
 
# specified name with _only_<in1>.sam/bam with records only in the in1 file
 
# specified name with _only_<in2>.sam/bam with records only in the in2 file
 
# specified name with _only_<in2>.sam/bam with records only in the in2 file
 +
 +
Records that are identical in the two files are not written in any of these output files.
    
When a record is found in both input files, but a difference is found, the record from the first file is written with additional tags to indicate the values from the second file, using the following tags:
 
When a record is found in both input files, but a difference is found, the record from the first file is written with additional tags to indicate the values from the second file, using the following tags:
 
* ZF - Flag
 
* ZF - Flag
* ZP - Pos
+
* ZP - Chromosome:1-based Position
 
* ZC - Cigar
 
* ZC - Cigar
 +
* ZM - Mapping Quality
 +
* ZN - Chromosome:1-based Mate Position
 +
* ZI - Insert Size
 
* ZS - Sequence
 
* ZS - Sequence
 
* ZQ - Base Quality
 
* ZQ - Base Quality
 
* ZT - Tags
 
* ZT - Tags
 +
 +
If <code>--onlyDiffs</code> is not specified, all fields that were compared will be printed in the tags.  If <code>--onlyDiffs</code> is specified, then only the differing compared fields will be printed in the tags.
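To read the second file's values back out of a SAM-format diff record, the tags listed above can be picked out of the optional fields. The Python sketch below is illustrative only: the function name is hypothetical, the sample record is invented, and the tag types shown in it (<code>Z</code>) are an assumption, since the types are not stated here.

```python
DIFF_TAGS = {
    "ZF": "Flag",
    "ZP": "Chromosome:1-based Position",
    "ZC": "Cigar",
    "ZM": "Mapping Quality",
    "ZN": "Chromosome:1-based Mate Position",
    "ZI": "Insert Size",
    "ZS": "Sequence",
    "ZQ": "Base Quality",
    "ZT": "Tags",
}


def second_file_values(sam_line):
    """Collect the diff tags (values from the second file) from one SAM record."""
    fields = sam_line.rstrip("\n").split("\t")
    values = {}
    for optional_field in fields[11:]:  # optional fields follow the 11 mandatory SAM columns
        tag, _tag_type, value = optional_field.split(":", 2)
        if tag in DIFF_TAGS:
            values[DIFF_TAGS[tag]] = value
    return values


# Invented sample record; the tag types shown here are assumed.
sample = "read1\t99\t1\t1011\t60\t23M\t=\t1200\t212\t*\t*\tZP:Z:1:1015\tZC:Z:20M3S"
print(second_file_values(sample))
```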
