Changes

BamGenotypeCheck (view source)

Revision as of 04:30, 22 June 2010

1,288 bytes removed, 04:30, 22 June 2010

no edit summary

Line 1: Line 1: −

~~== Why genotypeIDcheck? ==~~

+

'''genotypeIdCheck''' is a program that verifies whether the reads in a particular file match previously known genotypes for an individual (or group of individuals).

−

~~When sequencing data arrives from the sequencer machines, several types of errors or contaminations can creep in.~~

+

== Usage ==

−

~~The most serious error that can occur~~ is ~~that the actual sample name for the sample gets swapped with a different sample. In this case, what we think is one person's sequencing~~ data ~~is actually another~~. ~~Although they are confidential,~~ in ~~some cases, they are intended to be part of~~ a ~~family, and such an error causes problems in later analysis~~.

+

A key step in any genetic analysis is to verify whether data being generated matches expectations. This program checks whether reads in a BAM file match previous genotypes for a specific sample.

−

~~Also, since we typically genotype the same samples along with running~~ a ~~full~~ sequence ~~of them~~, ~~we want~~ to ~~know that the genotyping information matches the~~ sequence~~, otherwise again, we will find errors in later analysis.~~

+

Using a mathematical model that relates observed sequence reads to a hypothetical true genotype, genotypeIdCheck tries to decide whether the sequence reads match a particular individual, or whether they are more likely to be contaminated (that is, to include a small proportion of foreign DNA), derived from a closely related individual, or derived from a completely different individual.

−

~~Another set of problems occurs when the actual biological sample is~~ contaminated ~~in some way. It could be the case, for example, that certain common laboratory contaminates will be included in the sequencing data - knowing approximately how much of such contaminants exist~~ (~~e.g. yeast) is useful.~~

−

~~So, the program genotypeIDcheck was written in order to discover if problems such as these exist, and if so, report them to the user.~~

−

~~To work, genotypeIDcheck needs~~ a~~) a list~~ of ~~sorted, calibrated BAM files to check, b~~) ~~a genome reference to check against~~, c) a ~~pedigree that describes the samples being checked.~~

−

~~On output, genotypeIDcheck shows, for each read group in each BAM file, the~~ individual ~~in the pedigree that comes statistically closest to matching the reads in the read group~~, ~~and the degree to how close that relationship is. In addition, marker coverage of three~~ different ~~classes of markers is shown for that~~ individual ~~as well~~.

== Basic Usage Example ==

−

Here is ~~an example of how genotypeIDcheck works~~:

+

Here is a typical command line:

genotypeIDcheck -r /data/local/ref/karma.ref/human.g1k.v37.fa \

Line 26: Line 18:

=== Input Files ===

−

-r ''~~KARMA~~ genome reference''

+

-r ''FASTA format genome reference''

−

-k ''~~a filename that contains a list of BAM files to check~~''

+

-a ''allele frequency file''

−

-a ''pedigree ~~Allele Frequency~~ file''

+

-p ''pedigree file in [[MERLIN format]]''

−

-p ''~~pedigree .ped~~ file''

+

-d ''data file in [[MERLIN format]]''

−

-d ''~~pedigree .dat~~ file''

+

-m ''map file in [[MERLIN format]]''

−

-m ''~~pedigree .map file~~''

+

-k ''a list of BAM files to check''

+

-c [int] ''stop after reading [int] filtered sequence reads''

+

-C [int] ''stop after reading [int] reads, filtered or not''

=== Output Options ===

−

~~-c [int] ''stop after reading [int] filtered sequence reads''~~

−

~~-C [int] ''stop after reading [int] reads, filtered or not''~~

-v ''verbose output''

=== Filtering ===

−

-b [int] ''exclude ~~marker positions~~ with ~~base~~ quality less than [int]''

+

-b [int] ''exclude bases with quality less than [int]''

−

-M [int] ''exclude ~~all~~ reads with map quality less than [int]''

+

-M [int] ''exclude reads with map quality less than [int]''

-F [int] ''set custom BAM flags filter (not implemented at the moment)''
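
The -b and -M thresholds can be read as a simple per-base filter. The following is a minimal Python sketch of that reading, not the program's actual code; the function name <code>passes_filters</code>, its parameters, and the example threshold values are assumptions made for illustration.

```python
def passes_filters(base_quality, map_quality, min_base_quality, min_map_quality):
    """Return True if a base should be kept under -b and -M style thresholds.

    Hypothetical helper: a base is excluded when its base quality is below
    min_base_quality, or when the mapping quality of its read is below
    min_map_quality.
    """
    return base_quality >= min_base_quality and map_quality >= min_map_quality


# Example with made-up thresholds (these are not defaults of the program).
observations = [(30, 60), (10, 60), (30, 5)]  # (base quality, map quality)
kept = [obs for obs in observations if passes_filters(obs[0], obs[1], 20, 10)]
```

In this sketch a value exactly equal to the threshold is kept, since the option text excludes only values "less than" the threshold.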

=== Other Options ===

−

-e [float] '' set minimum error ~~estimate~~ to [float]''

+

-e [float] ''set minimum base error to [float]''

−

== Principle of Operation ==

−

~~For computational and output purposes, we consider each~~ read group ~~sample~~ in ~~each~~ BAM file ~~to be distinct from all others~~.

+

Each read group in a BAM file is evaluated independently. This means that in a file with multiple read groups, problems will be flagged at the read group level (a plus). However, it also means that it might be hard to discern the correct assignment of read groups with very little data.
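
One way to picture per-read-group evaluation is to bucket the observed bases by read group before any comparison is made. This is a hedged Python sketch only; the tuple layout, the marker names, and the function name <code>group_by_read_group</code> are illustrative assumptions, not part of genotypeIdCheck.

```python
from collections import defaultdict


def group_by_read_group(observations):
    """Bucket (read_group, marker, base) tuples by their read group.

    Hypothetical helper: each bucket can then be compared against the
    known genotypes on its own, independently of the other buckets.
    """
    groups = defaultdict(list)
    for read_group, marker, base in observations:
        groups[read_group].append((marker, base))
    return dict(groups)


# Made-up observations from two read groups in the same file.
observations = [
    ("RG1", "rs1", "A"),
    ("RG2", "rs1", "G"),
    ("RG1", "rs2", "C"),
]
groups = group_by_read_group(observations)
```

A bucket that ends up with only a handful of observations carries little evidence, which is the situation the caveat about read groups with very little data describes.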

−

~~For each aligned sample, we calculate the probability~~ that ~~the sample is from each individual~~ in ~~the pedigree according to five different probabilities of being identical by descent. So from the base quality (again, assuming calibrated base qualities), the given base in the~~ read, the ~~marker corresponding reference base and genotype chip~~ read ~~data~~, ~~you can compute the probability of~~ that ~~individual being~~ the ~~one in the sample for the given probability~~ of ~~IBD~~.

−

~~What you actually want is~~ the ~~multiple of theses probabilities taken across all reads in~~ the ~~sample~~. This ~~becomes impractically small, so we instead sum the logs of~~ the ~~probabilities~~.

+

For each aligned base that overlaps a known genotype, we calculate the probability that it was derived from a particular known genotype. This comparison considers only bases that overlap previously known genotypes and that meet the base quality and mapping quality thresholds.
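
To make the per-base calculation concrete, here is a minimal Python sketch of one common way to model it. It is an assumption, not necessarily the model genotypeIdCheck uses: the base quality is treated as a Phred score, the error probability is floored at a minimum value (in the spirit of the -e option), each of the two alleles is taken to be sampled with probability 1/2, and an error is taken to produce any of the three other bases with equal probability. The names <code>phred_to_error</code> and <code>base_likelihood</code> and the floor value 0.01 are hypothetical.

```python
def phred_to_error(quality, min_error=0.01):
    """Convert a Phred-scaled quality to an error probability.

    The result is floored at min_error (an assumed value, not a documented
    default) so that no single base is treated as error-free.
    """
    return max(10 ** (-quality / 10.0), min_error)


def base_likelihood(base, genotype, quality, min_error=0.01):
    """Return P(observed base | genotype) under the simple model above.

    genotype is a two-character string such as "AG".
    """
    error = phred_to_error(quality, min_error)
    likelihood = 0.0
    for allele in genotype:
        # A matching allele is read correctly with probability 1 - error;
        # a non-matching allele yields this base only through an error.
        likelihood += 0.5 * ((1.0 - error) if base == allele else error / 3.0)
    return likelihood
```

Under these assumptions, an observed A at quality 20 is far more likely under genotype AA than under GG, with the heterozygote AG falling in between.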

−

~~After the sample is read, the sample~~ in ~~the~~ pedigree ~~that contains the highest log-sum of computed Pibd values is assumed to have the strongest relationship to that sample. The ID~~ of ~~the corresponding pedigree sample name is printed~~, the ~~sample name indicated in~~ the read ~~group is printed, and coverage information is also printed~~.

+

Each individual in a pedigree has a different combination of genotypes, and genotypeIdCheck will systematically search for the individual whose genotypes best match the observed read data.
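
The search over individuals can be sketched as follows. This is a hedged Python illustration rather than the program's implementation: it assumes a fixed per-base error rate, a simple two-allele sampling model, and summation of log probabilities across bases to avoid numerical underflow. The names <code>log_likelihood</code> and <code>best_match</code>, the marker names, and the example individuals are hypothetical.

```python
import math


def log_likelihood(observations, genotypes, error=0.01):
    """Sum log P(base | genotype) over the observed bases.

    observations: list of (marker, base) pairs.
    genotypes: dict mapping marker name to a two-character genotype.
    Bases at markers without a known genotype are skipped.
    """
    total = 0.0
    for marker, base in observations:
        genotype = genotypes.get(marker)
        if genotype is None:
            continue
        p = sum(0.5 * ((1.0 - error) if base == allele else error / 3.0)
                for allele in genotype)
        total += math.log(p)
    return total


def best_match(observations, pedigree):
    """Return the individual with the highest log-likelihood, plus all scores."""
    scores = {name: log_likelihood(observations, genotypes)
              for name, genotypes in pedigree.items()}
    return max(scores, key=scores.get), scores


# Made-up pedigree genotypes and observed bases.
pedigree = {
    "father": {"rs1": "AA", "rs2": "CC"},
    "child": {"rs1": "AG", "rs2": "CT"},
}
observations = [("rs1", "A"), ("rs1", "A"), ("rs2", "C"), ("rs2", "C")]
best, scores = best_match(observations, pedigree)
```

With only a few observations the scores of related individuals can be close, so the gap between the best and second-best score is as informative as the winner itself.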

−

For more about the ~~mathematical~~ details, see the page [[Verifying Sample Identities - Implementation]]

+

For more about the technical details, see the page [[Verifying Sample Identities - Implementation]].

== TODO ==

Goncalo
