Difference between revisions of "BamGenotypeCheck"
Revision as of 04:30, 22 June 2010
genotypeIdCheck is a program that verifies whether the reads in a particular file match previously known genotypes for an individual (or group of individuals).
Usage
A key step in any genetic analysis is to verify that the data being generated match expectations. This program checks whether the reads in a BAM file match previously known genotypes for a specific sample.
Using a mathematical model that relates observed sequence reads to a hypothetical true genotype, genotypeIdCheck tries to decide whether the sequence reads match a particular individual or are more likely to be contaminated (that is, to include a small proportion of foreign DNA), derived from a closely related individual, or derived from a completely different individual.
Basic Usage Example
Here is a typical command line:
 genotypeIDcheck -r /data/local/ref/karma.ref/human.g1k.v37.fa \
     -k BAMfiles.txt -p test.ped -d test.dat -m test.map
Command Line Options
Input Files
 -r FASTA format genome reference
 -a allele frequency file
 -p pedigree file in MERLIN format
 -d data file in MERLIN format
 -m map file in MERLIN format
 -k a list of BAM files to check
 -c [int] stop after reading [int] filtered sequence reads
 -C [int] stop after reading [int] reads, filtered or not
Output Options
-v verbose output
Filtering
 -b [int] exclude bases with quality less than [int]
 -M [int] exclude reads with map quality less than [int]
 -F [int] set custom BAM flags filter (not implemented at the moment)
Other Options
 -e [float] set minimum base error to [float]
Principle of Operation
Each read group in a BAM file is evaluated independently. In a file with multiple read groups, problems are therefore flagged at the read group level, which is an advantage. However, it also means that the correct assignment of a read group with very little data may be hard to discern.
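As a rough illustration of evaluating read groups separately, the sketch below splits base observations by read group and counts how many observations each group contributes. The tuple layout, the function names and the marker names are assumptions made for this example; they are not taken from the genotypeIdCheck source.

```python
from collections import defaultdict


def split_by_read_group(observations):
    """Group (read_group, marker, base, error) tuples by read group.

    The tuple layout is a hypothetical representation chosen for this
    sketch, not the program's internal data structure.
    """
    groups = defaultdict(list)
    for read_group, marker, base, error in observations:
        groups[read_group].append((marker, base, error))
    return dict(groups)


def observations_per_group(groups):
    """Count observations per read group, so sparse groups are easy to spot."""
    return {read_group: len(obs) for read_group, obs in groups.items()}


# Hypothetical observations from two read groups.
example = [
    ("RG1", "rs1", "A", 0.01),
    ("RG2", "rs1", "G", 0.01),
    ("RG1", "rs2", "C", 0.01),
]
example_groups = split_by_read_group(example)
example_counts = observations_per_group(example_groups)
```

A read group with only a handful of overlapping bases gives a weak comparison, which is why its assignment can be hard to discern.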
For each aligned base that overlaps a known genotype, we calculate the probability that it was derived from a particular known genotype. This comparison considers only bases that overlap previously known genotypes and that meet the base quality and mapping quality thresholds.
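The per-base calculation can be sketched with a common textbook error model. This is an assumption for illustration, not necessarily the model genotypeIdCheck uses: a base is read correctly with probability 1 - e, errors are spread evenly over the three other bases, and each allele of a diploid genotype is sampled with probability 1/2. The function names are hypothetical; the thresholds and the error floor mirror the roles of the -b, -M and -e options.

```python
def phred_to_error(base_quality, min_error=0.0):
    """Convert a Phred-scaled quality to an error probability, floored at min_error."""
    return max(10.0 ** (-base_quality / 10.0), min_error)


def passes_filters(base_quality, map_quality, min_base_quality, min_map_quality):
    """Keep a base only if both its base quality and its read's map quality pass."""
    return base_quality >= min_base_quality and map_quality >= min_map_quality


def base_likelihood(base, genotype, error):
    """P(observed base | diploid genotype) under the assumed error model.

    genotype is a two-character string such as "AC".
    """
    total = 0.0
    for allele in genotype:
        # Correct read of this allele, or one of three equally likely errors.
        p = (1.0 - error) if base == allele else error / 3.0
        total += 0.5 * p
    return total
```

For example, with an error of 0.01 an observed `A` has likelihood 0.99 under genotype `AA` and roughly 0.497 under genotype `AC`.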
Each individual in a pedigree has a different combination of genotypes, and genotypeIdCheck will systematically search for the individual whose genotypes best match the observed read data.
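A minimal sketch of such a search, assuming each candidate is scored by summing log-likelihoods of the observed bases and the highest score wins. The dictionary layout, the marker and individual names, and the even three-way error model are assumptions for this example rather than details of the actual implementation.

```python
import math


def log_likelihood(observations, genotypes):
    """Sum log P(base | genotype) over observations at genotyped markers.

    observations: list of (marker, base, error) with error > 0.
    genotypes: dict mapping marker name to a two-allele string, e.g. "AG".
    """
    total = 0.0
    for marker, base, error in observations:
        genotype = genotypes.get(marker)
        if genotype is None:
            continue  # no known genotype at this marker, so nothing to compare
        p = sum(0.5 * ((1.0 - error) if base == allele else error / 3.0)
                for allele in genotype)
        total += math.log(p)
    return total


def best_match(observations, pedigree):
    """Return the individual with the highest log-likelihood, plus all scores."""
    scores = {name: log_likelihood(observations, genotypes)
              for name, genotypes in pedigree.items()}
    return max(scores, key=scores.get), scores


# Hypothetical pedigree genotypes and observed bases.
pedigree = {
    "father": {"rs1": "AA", "rs2": "CC"},
    "child": {"rs1": "AG", "rs2": "CT"},
}
observed = [("rs1", "G", 0.01), ("rs2", "T", 0.01), ("rs1", "A", 0.01)]
winner, all_scores = best_match(observed, pedigree)
```

Here the reads carry both `G` and `T`, which the hypothetical father's homozygous genotypes can only explain as errors, so the child scores higher. Working in log space avoids numerical underflow when many bases are combined.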
For more about the technical details, see the page Verifying Sample Identities - Implementation.