Difference between revisions of "BamGenotypeCheck"

From Genome Analysis Wiki
Jump to navigationJump to search
m (moved BamIdentityCheck to GenotypeIDcheck: now matches the new program name)
(total rewrite - new program)
Line 1: Line 1:
 +
== Why genotypeIDcheck? ==
 +
 +
When sequencing data arrives from the sequencer machines, several types of errors or contaminations can creep in.
 +
 +
The most serious error that can occur is that the actual sample name for the sample gets swapped with a different sample.  In this case, what we think is one person's sequencing data is actually another.  Although they are confidential, in some cases, they are intended to be part of a family, and such an error causes problems in later analysis.
 +
 +
Also, since we typically genotype the same samples along with running a full sequence of them, we want to know that the genotyping information matches the sequence, otherwise again, we will find errors in later analysis.
 +
 +
Another set of problems occurs when the actual biological sample is contaminated in some way.  It could be the case, for example, that certain common laboratory contaminates will be included in the sequencing data - knowing approximately how much of such contaminants exist (e.g. yeast) is useful.
 +
 +
So, the program genotypeIDcheck was written in order to discover if problems such as these exist, and if so, report them to the user.
 +
 +
To work, genotypeIDcheck needs a) a list of sorted, calibrated BAM files to check, b) a genome reference to check against, c) a pedigree that describes the samples being checked.
 +
 +
On output, genotypeIDcheck shows, for each read group in each BAM file, the individual in the pedigree that comes statistically closest to matching the reads in the read group, and the degree to how close that relationship is.  In addition, marker coverage of three different classes of markers is shown for that individual as well.
 +
 
== Basic Usage Example ==
 
== Basic Usage Example ==
  
Here is an example of how laneCheck works:
+
Here is an example of how genotypeIDcheck works:
  
   lanecheck --referencegenome NCBI36.fa --dbSNPfile dbSNP.txt
+
   genotypeIDcheck -r /data/local/ref/karma.ref/human.g1k.v37.fa \
               --lanefile lane.lst --pedfile test.ped --datfile test.dat --mapfile test.map  
+
               -k BAMfiles.txt -p test.ped -d test.dat -m test.map
              --prefix result
 
  
 
== Command Line Options ==
 
== Command Line Options ==
Line 11: Line 26:
 
=== Input Files ===
 
=== Input Files ===
  
  --referencegenome ''referencegenome file''
+
  -''KARMA genome reference''
  --dbSNPfile      ''optional'' ''two-column'' ''dbsnp position file, will provide more accurate background mismatch rate if excluding dbSNP positions (e.g. 5 123456)''
+
  -''a filename that contains a list of BAM files to check''
  --lanefile        ''a list of lane file with path''
+
-''pedigree Allele Frequency file''
  --pedfile        ''genotype information of the samples for checking ''
+
  -''pedigree .ped file''
  --datfile        ''a companion data file for pedigree file (each row: M snpname, e.g. M rs1234)''
+
  -d ''pedigree .dat file''
  --mapfile        ''a companion data file for pedigree file (each row: chr snpname pos, e.g. 5 rs1234 56789)''
+
  -''pedigree .map file''
  
=== Basic Output Options ===
+
=== Output Options ===
  
  --prefix ''specify the prefix name of the output file''
+
  -c [int]  ''stop after reading [int] filtered sequence reads''
 +
-C [int]  ''stop after reading [int] reads, filtered or not''
 +
-''verbose output''
  
=== Filtering  ===
+
=== Filtering ===
  
  --minmapquality  ''reads with with mapquality falling below this threshold will be excluded''
+
  -b [int]  ''exclude marker positions with base quality less than [int]''
  --genocount      ''the maximum number of genotypes compared''  
+
  -M [int]  ''exclude all reads with map quality less than [int]''
  --verbose        ''print out detailed information for each hapmap position compared''
+
  -F [int] ''set custom BAM flags filter (not implemented at the moment)''
  --coverage        ''print out the proportion of markers in the map file covered by at least one read''
 
--countbysite    ''print out detailed mismatch counts for each base compared''
 
  
 
=== Other Options ===
 
=== Other Options ===
  
  --memorymap ''use memory map technique for efficient memory sharing of reference genome file''
+
  -e [float] '' set minimum error estimate to [float]''
 
  
== Principle of Operation: ==
 
  
The overall procedure is that the genotype identity checking program compares internal evidence from the sequence reads themselves to reference genotype information for a panel of candidate individuals. In the case of 1000 Genomes pilot data, these are HapMap genotypes from the same Coriell cell lines that are being sequenced. For each combination of [sequencing run x candidate individual] the program calculates the observed rate of mismatches at both "informative" and "background" locations and reports as "excess mismatch rate"
+
== Principle of Operation ==
  
            excess rate  =  (informative rate  -  background rate).
+
For computational and output purposes, we consider each read group sample in each BAM file to be distinct from all others.
  
"Informative" locations are those where the candidate individual is homozygous, according to the HapMap genotype information, and base calls are compared to the HapMap homozygous allele, rather than to the genome reference sequence. "Background" locations are all sites not known to be polymorphic and not recorded in dbSNP if provided.  A relative high background rate suggests possible problems in sample preparation or read mapping process. 
+
For each aligned sample, we calculate the probability that the sample is from each individual in the pedigree according to five different probabilities of being identical by descent.  So from the base quality (again, assuming calibrated base qualities), the given base in the read, the marker corresponding reference base and genotype chip read data, you can compute the probability of that individual being the one in the sample for the given probability of IBD.
  
<br>
+
What you actually want is the multiple of theses probabilities taken across all reads in the sample.  This becomes impractically small, so we instead sum the logs of the probabilities.
  
== TODO ==
+
After the sample is read, the sample in the pedigree that contains the highest log-sum of computed Pibd values is assumed to have the strongest relationship to that sample.  The ID of the corresponding pedigree sample name is printed, the sample name indicated in the read group is printed, and coverage information is also printed.
  
1. Separate the results by "Read group classifier".
+
For more about the mathematical details, see the page [[Verifying Sample Identities - Implementation]]
  
The mapped .bam file may contains sequence data from different instrument runs.&nbsp;The read identifiers often are dot or colon-separated strings of the form 'run_name&lt;sep&gt;read_number'. The 'run_name' may be either an SRR / ERR identifier or the sequencing center's own alpha-numeric internal run identifier. Allow users to input&nbsp;extended regular expression such as '\(^[^.:]+\)[.:].*'&nbsp;hich matches just the part of each read identifier that is common to all reads from one instrument run and which differs between instrument runs.
+
== TODO ==
 
 
2. Use model based approach to calculate probability of lane coming from the claimed individual in the index file given a pool of individuals. &nbsp;
 

Revision as of 11:06, 21 June 2010

Why genotypeIDcheck?

When sequencing data arrives from the sequencer machines, several types of errors or contaminations can creep in.

The most serious error that can occur is that the actual sample name for the sample gets swapped with a different sample. In this case, what we think is one person's sequencing data is actually another. Although they are confidential, in some cases, they are intended to be part of a family, and such an error causes problems in later analysis.

Also, since we typically genotype the same samples along with running a full sequence of them, we want to know that the genotyping information matches the sequence, otherwise again, we will find errors in later analysis.

Another set of problems occurs when the actual biological sample is contaminated in some way. It could be the case, for example, that certain common laboratory contaminates will be included in the sequencing data - knowing approximately how much of such contaminants exist (e.g. yeast) is useful.

So, the program genotypeIDcheck was written in order to discover if problems such as these exist, and if so, report them to the user.

To work, genotypeIDcheck needs a) a list of sorted, calibrated BAM files to check, b) a genome reference to check against, c) a pedigree that describes the samples being checked.

On output, genotypeIDcheck shows, for each read group in each BAM file, the individual in the pedigree that comes statistically closest to matching the reads in the read group, and the degree to how close that relationship is. In addition, marker coverage of three different classes of markers is shown for that individual as well.

Basic Usage Example

Here is an example of how genotypeIDcheck works:

  genotypeIDcheck  -r /data/local/ref/karma.ref/human.g1k.v37.fa \
             -k BAMfiles.txt -p test.ped -d test.dat -m test.map

Command Line Options

Input Files

-r  KARMA genome reference
-k  a filename that contains a list of BAM files to check
-a  pedigree Allele Frequency file
-p  pedigree .ped file
-d  pedigree .dat file
-m  pedigree .map file

Output Options

-c [int]  stop after reading [int] filtered sequence reads
-C [int]  stop after reading [int] reads, filtered or not
-v  verbose output

Filtering

-b [int]  exclude marker positions with base quality less than [int]
-M [int]  exclude all reads with map quality less than [int]
-F [int]  set custom BAM flags filter (not implemented at the moment)

Other Options

-e [float]  set minimum error estimate to [float]


Principle of Operation

For computational and output purposes, we consider each read group sample in each BAM file to be distinct from all others.

For each aligned sample, we calculate the probability that the sample is from each individual in the pedigree according to five different probabilities of being identical by descent. So from the base quality (again, assuming calibrated base qualities), the given base in the read, the marker corresponding reference base and genotype chip read data, you can compute the probability of that individual being the one in the sample for the given probability of IBD.

What you actually want is the multiple of theses probabilities taken across all reads in the sample. This becomes impractically small, so we instead sum the logs of the probabilities.

After the sample is read, the sample in the pedigree that contains the highest log-sum of computed Pibd values is assumed to have the strongest relationship to that sample. The ID of the corresponding pedigree sample name is printed, the sample name indicated in the read group is printed, and coverage information is also printed.

For more about the mathematical details, see the page Verifying Sample Identities - Implementation

TODO