BamUtil: indelDiscordance

From Genome Analysis Wiki
Jump to: navigation, search


Overview of the indelDiscordance function of bamUtil

The indelDiscordance option on the bamUtil looks at insertion/deletion discordance.

By default it looks only at the non-pseudoautosomal region of the X-Chromosome.


ASSUMPTIONS/RESTRICTIONS

  • Only works on one chromosome at a time
  • Repeats are single-base only
    • An unrepeated base has repeat count = 0 ('A')
    • A base repeated once has repeat count = 1 ('AA')
  • Skips reference positions with base 'N'

What is a Discordance

A position is considered to have a Deletion Discordance if at least 1 read has a match/mismatch AND at least 1 read has a deletion AND there are at least the "minimum depth" reads at this position.

A position is considered to have an Insertion Discordance if at least 1 read has an insertion following this position AND at least 1 read does not AND there are at least the "minimum depth" reads at this position.

Error Rate Algorithm

The weighted average error rate, weighted average deletion error rate, and weighted average insertion error rate are calculated for each repeat count.

The error rate is weighted by the depth, so the discordant counts are


For each Repeat Count in the File
  For each Depth at this Repeat Count in the file
     if (depth > MaxAllowedDepth)
        skip calculating error rate for this depth
     else
        // Note: numDiscordant is not equivalent to numDeleteDiscordant + numInsertDiscordant since a position could have both types of discordants
        numDiscordant = number of discordant positions with this repeat count and depth
        numDeleteDiscordant = number of deletion discordant positions with this repeat count and depth
        numInsertDiscordant = number of insertion discordant positions with this repeat count and depth
        count = number of positions with this repeat count and depth

        errorRate = 1 - (\tfrac{numDiscordant}{count})^{({\tfrac{1}{depth}})}
        sumErrorRates += errorRate * count * (depth-1)
        numErrorRates += count * (depth-1)
  Repeat Count Weighted Error Rate = sumErrorRates/numErrorRates


Usage

./bam indelDiscordance --in <inputFile> [--bamIndex <bamIndexFile] [--refFile <filename>] [--umRef] [--depth minDepth] [--minRepeatLen len] [--sumRepeatLen len] [--printPos] [--chrom <name>] [--start 0basedPos] [--end 0basedPos] [--noeof] [--params]

Parameters

	Required Parameters:
		--in : the SAM/BAM file to calculate indelDiscordance for
	Optional Parameters:
		--bamIndex     : The path/name of the bam index file
		                 (if required and not specified, uses the --in value + ".bai")
		--refFile      : reference file for determining repeat counts
		--umRef        : use the reference at the default UofM location, 		                 /data/local/ref/karma.ref/human.g1k.v37.umfa
		--depth        : min depth at which to report indel discordance, DEFAULT >= 2
		--minRepeatLen : min repeat length for printing repeat info, DEFAULT = 1
		--sumRepeatLen : all repeats this length and longer will be accumulated,
		                 DEFAULT = 5
		--avgDepthMult : max depth used is the average depth * this multiplier,
		                 DEFAULT = 3
		--printPos     : print details for each position
		--printCounts  : print counts of occurrances of each repeat count and of discordant cigars for each repeat count
		--sample       : output the specified sample name as part of the error rate/depth table
		--gender       : output the specified gender as part of the error rate/depth table
		--chrom        : chromosome name other than X
		--start        : use a 0-based inclusive start position other than the default, 2699520
		--end          : use a 0-based exclusive end position other than the default, 154931043
		--noeof        : Do not expect an EOF block on a bam file.
		--params       : Print the parameter settings


Input File (--in)

Use --in followed by your file name to specify the SAM/BAM input file.

The program automatically determines if your input file is SAM/BAM/uncompressed BAM without any input other than a filename from the user, unless your input file is stdin.

A - is used to indicate to read from stdin and the extension is used to determine the file type (no extension indicates SAM).

SAM/BAM/Uncompressed BAM from file --in yourFileName
SAM from stdin --in -
BAM from stdin --in -.bam
Uncompressed BAM from stdin --in -.ubam


Note: Uncompressed BAM is compressed using compression level-0 (so it is not an entirely uncompressed file). This matches the samtools implementation so pipes between our tools and samtools are supported.

Bam Index File (--bamIndex)

Use --bamIndex followed by your file name to specify the BAM index file to use for reading the BAM file.

If this file is required but not specified, it will use the input file name + ".bai".

Reference File (--refFile)

Use --refFile followed by the reference file name to specify the reference sequence file.

Do not require BGZF EOF block (--noeof)

Use --noeof if you do not expect a trailing eof block in your bgzf file.

By default, the trailing empty block is expected and checked for.

Print the Program Parameters (--params)

Use --params to print the parameters for your program to stderr.


Return Value

Output

All status messages are written to stderr.


Example Output