Changes

From Genome Analysis Wiki
17,218 bytes removed, 14:35, 2 September 2011
Split into multiple pages and add missing tools
Line 18: Line 18:  
(It will be available without libStatGen in case you already have a downloaded version of libStatGen that you want to use.)
 
(It will be available without libStatGen in case you already have a downloaded version of libStatGen that you want to use.)
   −
=== Using github ===
+
=== Releases ===
Releases are '''Coming Soon'''.
+
Release downloads are '''Coming Soon'''.
      Line 62: Line 62:     
The bam executable has the following functions.
 
The bam executable has the following functions.
* [[C++ Executable: bam#validate|validate - Read and Validate a SAM/BAM file]]
+
* [[BamUtil: validate|validate - Read and Validate a SAM/BAM file]]
 
* [[BamUtil: convert|convert - Read a SAM/BAM file and write as a SAM/BAM file (optionally converts between '=' & bases in the sequence)]]
 
* [[BamUtil: convert|convert - Read a SAM/BAM file and write as a SAM/BAM file (optionally converts between '=' & bases in the sequence)]]
* [[C++ Executable: bam#dumpHeader|dumpHeader - Print SAM/BAM header]]
+
* [[BamUtil: dumpHeader|dumpHeader - Print SAM/BAM header]]
* [[C++ Executable: bam#splitChromosome|splitChromosome - Split BAM by Chromosome]]
+
* [[BamUtil: splitChromosome|splitChromosome - Split BAM by Chromosome]]
* [[C++ Executable: bam#writeRegion|writeRegion - Write the alignments in the indexed BAM file that fall into the specified region]]
+
* [[BamUtil: writeRegion|writeRegion - Write the alignments in the indexed BAM file that fall into the specified region]]
* [[C++ Executable: bam#dumpRefInfo|dumpRefInfo - Print SAM/BAM Reference Information]]
+
* [[BamUtil: dumpRefInfo|dumpRefInfo - Print SAM/BAM Reference Information]]
* [[C++ Executable: bam#dumpIndex|dumpIndex - Dump a BAM index file into an easy to read text version]]
+
* [[BamUtil: dumpIndex|dumpIndex - Dump a BAM index file into an easy to read text version]]
* [[C++ Executable: bam#readIndexedBam|readIndexedBam - Read an indexed BAM file reference by reference id -1 to the max reference id and write it out as a SAM/BAM file]]
+
* [[BamUtil: readIndexedBam|readIndexedBam - Read an indexed BAM file reference by reference id -1 to the max reference id and write it out as a SAM/BAM file]]
* [[C++ Executable: bam#filter|filter - Filter reads by clipping ends with too high of a mismatch percentage and by marking reads unmapped if the quality of mismatches is too high]]
+
* [[BamUtil: filter|filter - Filter reads by clipping ends with too high of a mismatch percentage and by marking reads unmapped if the quality of mismatches is too high]]
* [[C++ Executable: bam#readReference|readReference - Print the reference string for the specified region]]
+
* [[BamUtil: readReference|readReference - Print the reference string for the specified region]]
* [[C++ Executable: bam#diff|diff - Print the diffs between 2 bams]]
+
* [[BamUtil: diff|diff - Print the diffs between 2 bams]]
 +
* [[BamUtil: stats|stats - Generate the specified statistics on a SAM/BAM file]]
 +
* [[BamUtil: revert|revert - Revert SAM/BAM replacing the specified fields with their previous values (if known).]]
 +
* [[BamUtil: squeeze|squeeze - Reduce file size by dropping OQ fields, duplicates, and specified tags, using '=' when a base matches the reference, and binning quality scores.]]
 +
* [[BamUtil: findCigars|findCigars - Output just the reads that contain any of the specified CIGAR operations.]]
    
This executable is built using [[C++ Library: libStatGen]].
 
This executable is built using [[C++ Library: libStatGen]].
    
Just running ./bam will print the Usage information for the bam executable.
 
Just running ./bam will print the Usage information for the bam executable.
  −
  −
== validate ==
  −
  −
The <code>validate</code> option on the bam executable reads and validates a SAM/BAM file.  This option is documented at: [[BamValidator]]
  −
  −
== dumpHeader ==
  −
The <code>dumpHeader</code> option on the bam executable prints the header of the specified SAM/BAM file to cout. 
  −
  −
=== Parameters ===
  −
<pre>
  −
    Required Parameters:
  −
filename : the sam/bam filename whose header should be printed.
  −
</pre>
  −
  −
=== Usage ===
  −
  −
./bam dumpHeader <inputFile>
  −
  −
=== Return Value ===
  −
*    0: the header was successfully read and printed.
  −
* non-0: the header was not successfully read or was not printed.  (Returns the SamStatus.)
  −
  −
  −
=== Example Output ===
  −
<pre>
  −
@SQ SN:1 LN:247249719
  −
@SQ SN:2 LN:242951149
  −
@SQ SN:3 LN:199501827
  −
</pre>
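As a rough illustration of how the @SQ lines printed by <code>dumpHeader</code> might be consumed downstream, the following Python sketch collects each sequence name and length into a dictionary. The function name <code>parse_sq_lines</code> is hypothetical and is not part of the bam executable; the sketch assumes whitespace-separated fields as in the example output above.

```python
def parse_sq_lines(header_text):
    """Map each @SQ sequence name (SN) to its length (LN)."""
    lengths = {}
    for line in header_text.splitlines():
        fields = line.split()
        if not fields or fields[0] != "@SQ":
            continue  # ignore anything that is not an @SQ header line
        # Each remaining field looks like "KEY:value".
        tags = dict(f.split(":", 1) for f in fields[1:] if ":" in f)
        if "SN" in tags and "LN" in tags:
            lengths[tags["SN"]] = int(tags["LN"])
    return lengths


example = "@SQ SN:1 LN:247249719\n@SQ SN:2 LN:242951149\n@SQ SN:3 LN:199501827"
for name, length in parse_sq_lines(example).items():
    print(name, length)
```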
  −
  −
  −
== splitChromosome ==
  −
  −
The <code>splitChromosome</code> option on the bam executable splits an indexed BAM file into multiple files based on the Chromosome (Reference Name). 
  −
  −
The files all have the same base name, but with an _# where # corresponds with the associated reference id from the BAM file.
  −
  −
=== Parameters ===
  −
<pre>
  −
    Required Parameters:
  −
        --in      : the BAM file to be split
  −
        --out      : the base filename for the SAM/BAM files to write into.  Does not include the extension.
  −
                    _N will be appended to the basename where N indicates the Chromosome.
  −
    Optional Parameters:
  −
        --noeof  : do not expect an EOF block on a bam file.
  −
        --bamIndex : the path/name of the bam index file
  −
                    (if not specified, uses the --in value + ".bai")
  −
        --bamout : write the output files in BAM format (default).
  −
        --samout : write the output files in SAM format.
  −
        --params : print the parameter settings
  −
</pre>
  −
  −
=== Usage ===
  −
  −
./bam splitChromosome --in <inputFilename>  --out <outputFileBaseName> [--bamIndex <bamIndexFile>] [--noeof] [--bamout|--samout] [--params]
  −
  −
  −
=== Return Value ===
  −
*    0: all records are successfully read and written.
  −
* non-0: at least one record was not successfully read or written.
  −
  −
=== Example Output ===
  −
<pre>
  −
Reference ID -1 has 2 records
  −
Reference ID 0 has 5 records
  −
Reference ID 1 has 2 records
  −
Reference ID 2 has 1 records
  −
Reference ID 3 has 0 records
  −
Reference ID 4 has 0 records
  −
Reference ID 5 has 0 records
  −
Reference ID 6 has 0 records
  −
Reference ID 7 has 0 records
  −
Reference ID 8 has 0 records
  −
Reference ID 9 has 0 records
  −
Reference ID 10 has 0 records
  −
Reference ID 11 has 0 records
  −
Reference ID 12 has 0 records
  −
Reference ID 13 has 0 records
  −
Reference ID 14 has 0 records
  −
Reference ID 15 has 0 records
  −
Reference ID 16 has 0 records
  −
Reference ID 17 has 0 records
  −
Reference ID 18 has 0 records
  −
Reference ID 19 has 0 records
  −
Reference ID 20 has 0 records
  −
Reference ID 21 has 0 records
  −
Reference ID 22 has 0 records
  −
Number of records = 10
  −
Returning: 0 (SUCCESS)
  −
</pre>
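The per-reference counts in the example output above should add up to the reported total. A minimal Python sketch of that check follows; the helper name <code>parse_split_summary</code> is hypothetical, and the line patterns are taken only from the example output shown here, so other versions of the tool may print something different.

```python
import re


def parse_split_summary(output_text):
    """Return ({reference_id: record_count}, reported_total) from the summary text."""
    per_ref = {}
    reported_total = None
    for line in output_text.splitlines():
        line = line.strip()
        m = re.match(r"Reference ID (-?\d+) has (\d+) records", line)
        if m:
            per_ref[int(m.group(1))] = int(m.group(2))
            continue
        m = re.match(r"Number of records = (\d+)", line)
        if m:
            reported_total = int(m.group(1))
    return per_ref, reported_total


sample = """Reference ID -1 has 2 records
Reference ID 0 has 5 records
Reference ID 1 has 2 records
Reference ID 2 has 1 records
Number of records = 10"""
counts, total = parse_split_summary(sample)
print(sum(counts.values()) == total)
```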
  −
  −
  −
== writeRegion ==
  −
  −
The <code>writeRegion</code> option on the bam executable writes the alignments in the indexed BAM file that fall into the specified region (reference id and start/end position).
  −
  −
=== Parameters ===
  −
<pre>
  −
    Required Parameters:
  −
        --in      : the BAM file to be read
  −
        --out      : the SAM/BAM file to write to
  −
    Optional Parameters:
  −
        --noeof  : do not expect an EOF block on a bam file.
  −
        --bamIndex : the path/name of the bam index file
  −
                    (if not specified, uses the --in value + ".bai")
  −
        --refName  : the BAM reference Name to read (either this or refID can be specified)
  −
        --refID    : the BAM reference ID to read (defaults to -1: unmapped)
  −
        --start    : inclusive 0-based start position (defaults to -1)
  −
        --end      : exclusive 0-based end position (defaults to -1: meaning til the end of the reference)
  −
        --params  : print the parameter settings
  −
</pre>
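The <code>--start</code>/<code>--end</code> convention above (0-based, inclusive start, exclusive end, with -1 meaning "until the end of the reference") can be sketched in Python as below. This is only an illustration: the function name is hypothetical, treating a start of -1 as "from the beginning" is an assumption, and so is the choice to test for overlap rather than full containment, since the text does not say exactly how "fall into the specified region" is decided.

```python
def overlaps_region(aln_start, aln_end, start=-1, end=-1):
    """Check whether the alignment [aln_start, aln_end) overlaps the region.

    All positions are 0-based; aln_end and end are exclusive.
    """
    # Assumption: a start of -1 means "from the beginning of the reference".
    region_start = 0 if start < 0 else start
    if aln_end <= region_start:
        return False  # alignment ends before the region begins
    # An end of -1 means the region runs to the end of the reference.
    if end >= 0 and aln_start >= end:
        return False  # alignment begins at or after the exclusive end
    return True


print(overlaps_region(10, 20, start=15, end=30))
print(overlaps_region(10, 20, start=20, end=30))
```

Because the end is exclusive, an alignment that starts exactly at <code>--end</code> is outside the region in this sketch.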
  −
  −
=== Usage ===
  −
  −
./bam writeRegion --in <inputFilename>  --out <outputFilename> [--bamIndex <bamIndexFile>] [--noeof] [--refName <reference Name> | --refID <reference ID>] [--start <0-based start pos>] [--end <0-based end position>] [--params]
  −
  −
=== Return Value ===
  −
*    0: all records are successfully read and written.
  −
* non-0: at least one record was not successfully read or written.
  −
  −
=== Example Output ===
  −
<pre>
  −
  −
Wrote t.sam with 2 records.
  −
</pre>
  −
  −
  −
== dumpRefInfo ==
  −
The <code>dumpRefInfo</code> option on the bam executable prints the SAM/BAM file's reference information.
  −
  −
=== Parameters ===
  −
<pre>
  −
    Required Parameters:
  −
        --in              : the SAM/BAM file to be read
  −
    Optional Parameters:
  −
        --noeof            : do not expect an EOF block on a bam file.
  −
        --printRecordRefs  : print the reference information for the records in the file (grouped by reference).
  −
        --params          : print the parameter settings
  −
</pre>
  −
  −
=== Usage ===
  −
./bam dumpRefInfo --in <inputFilename> [--noeof] [--printRecordRefs] [--params]
  −
  −
=== Return Value ===
  −
*    0: the file was processed successfully.
  −
* non-0: the file was not processed successfully.
  −
  −
  −
== dumpIndex ==
  −
The <code>dumpIndex</code> option on the bam executable prints a BAM index file in an easy-to-read format.
  −
  −
=== Parameters ===
  −
<pre>
  −
    Required Parameters:
  −
        --bamIndex : the path/name of the bam index file to display
  −
    Optional Parameters:
  −
        --refID    : the reference ID to read, defaults to print all
  −
        --summary  : only print a summary - 1 line per reference.
  −
        --params  : print the parameter settings
  −
</pre>
  −
  −
=== Usage ===
  −
./bam dumpIndex --bamIndex <bamIndexFile> [--refID <ref#>] [--summary] [--params]
  −
  −
=== Return Value ===
  −
*    0: the BAM index file was processed successfully.
  −
* non-0: the BAM index file was not processed successfully.
  −
  −
  −
== readIndexedBam ==
  −
The <code>readIndexedBam</code> option on the bam executable reads an indexed BAM file one reference id at a time, from reference id -1 up to the max reference id, and writes it out as a SAM/BAM file.
  −
  −
=== Parameters ===
  −
<pre>
  −
Required Parameters:
  −
inputFilename      - path/name of the input BAM file
  −
outputFile.sam/bam - path/name of the output file
  −
bamIndexFile      - path/name of the BAM index file
  −
</pre>
  −
  −
=== Usage ===
  −
./bam readIndexedBam <inputFilename> <outputFile.sam/bam> <bamIndexFile>
  −
  −
=== Return Value ===
  −
* 0
  −
  −
== filter ==
  −
  −
The <code>filter</code> option on the bam executable filters the reads in a SAM/BAM file.  This option is documented at: [[Bam Executable: Filter]]
  −
  −
== diff ==
  −
<span style="color:#D2691E">'''***Coming Soon***'''</span>
  −
  −
The <code>diff</code> option on the bam executable prints the difference between two coordinate sorted SAM/BAM files.  This can be used to compare the outputs of running a SAM/BAM through different tools/versions of tools.
  −
  −
The <code>diff</code> tool compares records that have the same Read Name and Fragment (from the flag).  If a matching ReadName & Fragment is not found, the record is considered to be different.
  −
  −
<code>diff</code> assumes the files are coordinate sorted and uses this assumption for determining how long to store a record before determining that the other file does not contain a matching ReadName/Fragment. If the files are not coordinate sorted, this logic does not work.
  −
  −
By default, just the chromosome/position and cigar are compared for each record.
  −
  −
Options are available to compare:
  −
* sequence
  −
* base quality
  −
* specified tags
  −
* turn off position comparison
  −
* turn off cigar comparison
  −
  −
=== Parameters ===
  −
<pre>
  −
Required Parameters:
  −
--in1        : first coordinate sorted SAM/BAM file to be diffed
  −
--in2        : second coordinate sorted SAM/BAM file to be diffed
  −
Optional Parameters:
  −
--out        : output filename, use .sam/.bam extension to output in SAM/BAM format instead of diff format.
  −
                In SAM/BAM format there will be 3 output files:
  −
                    1) the specified name with record diffs
  −
                    2) specified name with _only_<in1>.sam/bam with records only in the in1 file
  −
                    3) specified name with _only_<in2>.sam/bam with records only in the in2 file
  −
--seq        : diff the sequence bases.
  −
--baseQual    : diff the base qualities.
  −
--tags        : diff the specified Tags formatted as Tag:Type;Tag:Type;Tag:Type...
  −
--noCigar    : do not diff the cigars.
  −
--noPos      : do not diff the positions.
  −
--onlyDiffs  : only print the fields that are different, otherwise for any diff all the fields that are compared are printed.
  −
--recPoolSize : number of records to allow to be stored at a time, default value: 1000000
  −
--posDiff    : max base pair difference between possibly matching records, default value: 100000
  −
--noeof      : do not expect an EOF block on a bam file.
  −
--params      : print the parameter settings
  −
</pre>
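The <code>--tags</code> value is a semicolon-separated list of Tag:Type pairs. A small Python sketch of splitting such a value is shown below; <code>parse_tag_spec</code> is a hypothetical helper and is not how the bam executable itself parses the option. The length checks rely on the SAM convention of two-character tags and one-character types.

```python
def parse_tag_spec(spec):
    """Split 'Tag:Type;Tag:Type...' into a list of (tag, type) tuples."""
    pairs = []
    for item in spec.split(";"):
        if not item:
            continue  # tolerate a trailing ';'
        tag, sep, value_type = item.partition(":")
        # SAM optional fields use 2-character tags and 1-character types.
        if not sep or len(tag) != 2 or len(value_type) != 1:
            raise ValueError("expected Tag:Type, got %r" % item)
        pairs.append((tag, value_type))
    return pairs


print(parse_tag_spec("OP:i;MD:Z"))
```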
  −
  −
=== Usage ===
  −
./bam diff --in1 <inputFile> --in2 <inputFile> [--out <outputFile>] [--seq] [--baseQual] [--tags <Tag:Type[;Tag:Type]*>] [--noCigar] [--noPos] [--onlyDiffs] [--recPoolSize <int>] [--posDiff <int>] [--noeof] [--params]
  −
  −
=== Return Value ===
  −
* 0: all records are successfully read and written.
  −
* non-0: an error occurred processing the parameters or reading one of the files.
  −
  −
=== Output Format ===
  −
2 Output Formats:
  −
# Diff Format
  −
# BAM Format
  −
  −
==== Diff Format ====
  −
There are 2 types of differences.
  −
* ReadName/Fragment combo is in one file, but not in the other file within the window set by recPoolSize & posDiff
  −
* ReadName/Fragment combo is in both files, but at least one of the specified fields to diff is different
  −
  −
Each difference output consists of 2 or 3 lines.  If the record only appears in one of the files, the diff is 2 lines; if it appears in both files, the diff is 3 lines.
  −
  −
The first line of the difference output is just the read name.
  −
  −
The 2nd and 3rd line (if present) begin with either a '<' or a '>'.  If the record is from the first file (--in1), it begins with a '<'.  If the record is from the 2nd file (--in2), it begins with a '>'.
  −
  −
The 2nd line is the flag followed by the diff'd fields from one of the records.
  −
  −
The 3rd line (if a matching record was found) is the flag followed by the diff'd fields from the matching record.
  −
  −
  −
The diff'd record lines are tab separated, and are in the following order if --onlyDiffs is not specified:
  −
* '<' or '>'
  −
* flag
  −
* chrom:pos (chromosome name ':' 1 based position) - if --noPos is not specified
  −
* cigar - if --noCigar is not specified
  −
* sequence - if --seq is specified
  −
* base quality - if --baseQual is specified
  −
* tag:type:value - for each tag:type specified in --tags
  −
* ...
  −
* tag:type:value
  −
  −
If <code>onlyDiffs</code> is specified, only the fields that are specified and are different get printed in lines 2 & 3.
  −
  −
===== Example Output =====
  −
Command:
  −
../bin/bam diff --in1 testFiles/testDiff1.sam --in2 testFiles/testDiff2.sam --seq --baseQual --tags "OP:i;MD:Z" --onlyDiffs --out results/diffOrderSam.log
  −
  −
Output:
  −
<pre>
  −
18:462+29M5I3M:F:295
  −
< a1 1:78
  −
> a1 1:74
  −
1
  −
> a1 1:70 3S1M1S ACGTN ;46>> OP:i:75 MD:Z:30A0C5
  −
2
  −
> a1 1:72 3S1M1S ACGTN ;47>> OP:i:75 MD:Z:30A0C5
  −
ABC
  −
> cd *:0 * * *
  −
DEF
  −
> cd *:0 * * *
  −
</pre>
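To make the 2-or-3-line structure of the Diff Format concrete, here is a minimal Python sketch that groups the output into one entry per read name. It assumes tab-separated record lines, as described above, and that read names never begin with '<' or '>'; the function and field names are hypothetical.

```python
def parse_diff_output(text):
    """Group diff-format lines into entries: a read name plus 1 or 2 records."""
    entries = []
    current = None
    for line in text.splitlines():
        if not line.strip():
            continue
        if line[0] in "<>":
            if current is None:
                raise ValueError("record line found before any read name")
            fields = line.split("\t")
            current["records"].append({
                "file": 1 if fields[0] == "<" else 2,  # '<' is --in1, '>' is --in2
                "flag": fields[1],
                "fields": fields[2:],  # the diff'd fields, in the documented order
            })
        else:
            # Any other line starts a new difference and holds the read name.
            current = {"read_name": line, "records": []}
            entries.append(current)
    return entries


sample = "readA\n<\ta1\t1:78\n>\ta1\t1:74\nreadB\n>\ta1\t1:70\t3S1M1S\n"
for entry in parse_diff_output(sample):
    print(entry["read_name"], len(entry["records"]))
```

An entry with a single record corresponds to a ReadName/Fragment found in only one file; an entry with two records corresponds to a pair whose compared fields differ.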
  −
  −
==== SAM/BAM Format ====
  −
Use a .sam/.bam extension on the <code>--out</code> filename to output in SAM/BAM format instead of diff format.
  −
  −
In SAM/BAM format there will be 3 output files:
  −
# the specified name with record diffs
  −
# specified name with _only_<in1>.sam/bam with records only in the in1 file
  −
# specified name with _only_<in2>.sam/bam with records only in the in2 file
  −
  −
When a record is found in both input files, but a difference is found, the record from the first file is written with additional tags to indicate the values from the second file, using the following tags:
  −
* ZF - Flag
  −
* ZP - Pos
  −
* ZC - Cigar
  −
* ZS - Sequence
  −
* ZQ - Base Quality
  −
* ZT - Tags
  −
  −
== readReference ==
  −
The <code>readReference</code> option on the bam executable prints the specified region of the reference sequence in an easy to read format.
  −
  −
=== Parameters ===
  −
<pre>
  −
    Required Parameters:
  −
        --refFile  : the reference
  −
        --refName  : the SAM/BAM reference Name to read
  −
        --start    : inclusive 0-based start position (defaults to -1)
  −
    Required Length Parameter (one but not both needs to be specified):
  −
        --end      : exclusive 0-based end position (defaults to -1: meaning til the end of the reference)
  −
        --numBases : number of bases from start to display
  −
        --params  : print the parameter settings
  −
</pre>
  −
  −
=== Usage ===
  −
./bam readReference --refFile <referenceFilename> --refName <reference Name> --start <0 based start> --end <0 based end>|--numBases <number of bases> [--params]
  −
  −
=== Return Value ===
  −
*    0: the reference file was successfully read.
  −
* non-0: the reference file was not successfully read.
  −
  −
=== Example Output ===
  −
<pre>
  −
  −
</pre>
  −
  −
== stats ==
  −
The <code>stats</code> option on the bam executable generates the specified statistics on a SAM/BAM file.
  −
  −
=== Parameters ===
  −
<pre>
  −
Required Parameters:
  −
--in : the SAM/BAM file to calculate stats for
  −
Types of Statistics that can be generated:
  −
--basic      : Turn on basic statistic generation
  −
--qual        : Generate a count for each quality (displayed as non-phred quality)
  −
--phred      : Generate a count for each quality (displayed as phred quality)
  −
--baseQC      : Write per base statistics to the specified file.
  −
Optional Parameters:
  −
--maxNumReads : Maximum number of reads to process
  −
                Defaults to -1 to indicate all reads.
  −
--unmapped    : Only process unmapped reads (requires a bamIndex file)
  −
--bamIndex    : The path/name of the bam index file
  −
                (if required and not specified, uses the --in value + ".bai")
  −
--regionList  : File containing the region list chr<tab>start_pos<tab>end_pos.
  −
                Positions are 0 based and the end_pos is not included in the region.
  −
                Uses bamIndex.
  −
--minMapQual  : The minimum mapping quality for filtering reads in the baseQC stats.
  −
--dbsnp      : The dbSnp file of positions to exclude from baseQC analysis.
  −
--noeof      : Do not expect an EOF block on a bam file.
  −
--params      : Print the parameter settings
  −
</pre>
  −
  −
For all types of statistics, the bam file used is specified by <code>--in</code>.
  −
  −
The optional parameters are also used for all types of statistics.
  −
  −
Usage:
  −
<pre>
  −
./bam stats --in <inputFile> [--basic] [--qual] [--phred] [--baseQC <outputFileName>] [--maxNumReads <maxNum>] [--unmapped] [--bamIndex <bamIndexFile>] [--regionList <regFileName>] [--minMapQual <minMapQ>] [--dbsnp <dbsnpFile>] [--noeof] [--params]
  −
</pre>
  −
  −
  −
  −
=== Types of Statistics ===
  −
  −
==== Basic ====
  −
Prints summary statistics for the file:
  −
*TotalReads - # of reads that are in the file
  −
*MappedReads - # of reads marked mapped in the flag
  −
*PairedReads - # of reads marked paired in the flag
  −
*ProperPair - # of reads marked paired AND proper paired in the flag
  −
*DuplicateReads - # of reads marked duplicate in the flag
  −
*QCFailureReads - # of reads marked QC failure in the flag
  −
*MappingRate(%) - # of reads marked mapped in the flag / TotalReads
  −
*PairedReads(%) - # of reads marked paired in the flag / TotalReads
  −
*ProperPair(%) - # of reads marked paired AND proper paired in the flag / TotalReads
  −
*DupRate(%) - # of reads marked duplicate in the flag / TotalReads
  −
*QCFailRate(%) - # of reads marked QC failure in the flag / TotalReads
  −
*TotalBases - # of bases in all reads
  −
*BasesInMappedReads - # of bases in reads marked mapped in the flag
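
The counts listed above all derive from the SAM flag and the read length. The Python sketch below shows one way they could be tallied; it is not the bam executable's implementation. The flag bit values come from the SAM format specification, the function name <code>basic_stats</code> is hypothetical, and expressing the rates as percentages (multiplied by 100) is an assumption based on the "(%)" in the names.

```python
# SAM FLAG bits, as defined in the SAM format specification.
PAIRED = 0x1
PROPER_PAIR = 0x2
UNMAPPED = 0x4
QC_FAILURE = 0x200
DUPLICATE = 0x400


def basic_stats(reads):
    """Tally summary counts from an iterable of (flag, sequence_length) pairs."""
    stats = dict.fromkeys(
        ["TotalReads", "MappedReads", "PairedReads", "ProperPair",
         "DuplicateReads", "QCFailureReads", "TotalBases", "BasesInMappedReads"], 0)
    for flag, seq_len in reads:
        stats["TotalReads"] += 1
        stats["TotalBases"] += seq_len
        if not (flag & UNMAPPED):
            stats["MappedReads"] += 1
            stats["BasesInMappedReads"] += seq_len
        if flag & PAIRED:
            stats["PairedReads"] += 1
            if flag & PROPER_PAIR:  # paired AND proper paired
                stats["ProperPair"] += 1
        if flag & DUPLICATE:
            stats["DuplicateReads"] += 1
        if flag & QC_FAILURE:
            stats["QCFailureReads"] += 1
    total = stats["TotalReads"]
    rates = [("MappingRate(%)", "MappedReads"), ("PairedReads(%)", "PairedReads"),
             ("ProperPair(%)", "ProperPair"), ("DupRate(%)", "DuplicateReads"),
             ("QCFailRate(%)", "QCFailureReads")]
    for rate_name, count_name in rates:
        # Assumption: rates are reported as percentages of TotalReads.
        stats[rate_name] = 100.0 * stats[count_name] / total if total else 0.0
    return stats


print(basic_stats([(0x3, 100), (0x4, 50), (0x400, 100), (0x1, 100)]))
```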
  −
  −
  −
  −
==== Qual/Phred ====
  −
Prints a count of the number of times each quality value appears in the file.
  −
*<code>phred</code> Displays Quality as phred integers [0-93]
  −
*<code>qual</code>  Displays Quality as non-phred integers (phred + 33) [33-126]
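
The relationship between the two displays (non-phred value = phred + 33) can be sketched in Python as follows. The helper <code>quality_counts</code> is hypothetical; it assumes qualities arrive as SAM quality strings and skips the SAM placeholder <code>*</code> used when qualities are not stored.

```python
from collections import Counter


def quality_counts(quality_strings, phred=True):
    """Count how often each quality value occurs across the given strings."""
    counts = Counter()
    for qual in quality_strings:
        if qual == "*":
            continue  # SAM uses '*' when base qualities are not stored
        for ch in qual:
            value = ord(ch)  # non-phred display: the character code [33-126]
            counts[value - 33 if phred else value] += 1
    return counts


print(sorted(quality_counts([";46>>"]).items()))
print(sorted(quality_counts([";46>>"], phred=False).items()))
```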
  −
  −
  −
==== BaseQC ====
  −
'''This capability is coming soon, so these notes may be updated prior to it being completed...'''
  −
  −
Do we print stats for positions where the reference base is 'N'??  (any special note for those?  Qplot would not count them in the depth.)
  −
  −
The <code>baseQC</code> option generates the following statistics:
  −
  −
For each position, the following counts are incremented if:
  −
# a read spans the reference position (starts before or at this reference position and ends at or after this position)
  −
# regardless of duplicate/qc failure/unmapped/mapping quality
  −
# regardless of the CIGAR for this position (other than clips at the beginning/end which are not counted, but deletions and skips are counted)
  −
*TotalReads(e6) - # of reads that span this position.
  −
*DupRate(%) - # of reads marked duplicate in the flag / TotalReads
  −
*QCFailRate(%) - # of reads marked QC failure in the flag / TotalReads
  −
*PairedReads(%) - # of reads marked paired in the flag / TotalReads
  −
*ProperPaired(%) - # of reads marked paired AND proper paired in the flag / TotalReads
  −
*MappedBases(e9) - # of reads marked mapped in the flag
  −
*MappingRate(%) - # of reads marked mapped in the flag / TotalReads
  −
*ZeroMapQual(%) - # of reads marked mapped in the flag AND have a Mapping Quality of 0 / TotalReads
  −
*MapQual<10(%) - # of reads marked mapped in the flag AND have a Mapping Quality < 10 / TotalReads
  −
*MapRate_MQpass(%) - # of reads marked mapped in the flag AND have a Mapping Quality >= a minimum Mapping Quality / TotalReads
  −
  −
  −
For each position, the following counts are incremented if:
  −
# a read spans the reference position (starts before or at this reference position and ends at or after this position)
  −
# the read is NOT a duplicate, qc failure, unmapped, or mapped with a mapping quality less than the min
  −
# the CIGAR for this position is a M/=/X (match/mismatch)
  −
TBD - should it count if the read has a base of 'N'
  −
*Depth - # of reads. 
  −
*Q20Bases(e9) - # of bases at this position with a base quality (from the read) of Q20 or higher.
  −
*Q20BasesPct(%) - Q20Bases / Depth
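
Since this capability is still being defined, the following Python sketch is only a tentative reading of the Depth and Q20 rules listed above, not the tool's implementation. The function names are hypothetical, read positions are taken as inclusive start and end coordinates, and Q20BasesPct is assumed to be a percentage of Depth. The open questions about 'N' bases are left out.

```python
# SAM FLAG bits, as defined in the SAM format specification.
UNMAPPED = 0x4
QC_FAILURE = 0x200
DUPLICATE = 0x400


def spans(read_start, read_end, pos):
    """True if the read starts at or before pos and ends at or after it."""
    return read_start <= pos <= read_end


def counts_toward_depth(flag, mapq, cigar_op, min_mapq):
    """Apply the Depth filters: flag bits, mapping quality, and CIGAR operation."""
    if flag & (UNMAPPED | QC_FAILURE | DUPLICATE):
        return False
    if mapq < min_mapq:
        return False
    return cigar_op in ("M", "=", "X")  # match/mismatch operations only


def q20_summary(base_quals):
    """Return (Depth, Q20Bases, Q20BasesPct) for the qualities at one position."""
    depth = len(base_quals)
    q20 = sum(1 for q in base_quals if q >= 20)
    pct = 100.0 * q20 / depth if depth else 0.0
    return depth, q20, pct


print(counts_toward_depth(0, 30, "M", 20))
print(q20_summary([30, 10, 20, 19]))
```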
 
