Difference between revisions of "BamGenotypeCheck"

From Genome Analysis Wiki

Revision as of 22:56, 23 November 2009

LaneCheck


Basic Usage Example

Here is an example of how laneCheck works:

  lanecheck  --referencegenome NCBI36.fa --dbSNPfile dbSNP.txt \
            --lanefile lane.lst --pedfile test.ped --datfile test.dat --mapfile test.map --prefix result

Command Line Options

Input Files

--referencegenome reference genome file (e.g. NCBI36.fa)
--dbSNPfile       optional two-column dbSNP position file (each row: chr pos, e.g. 5 123456); excluding dbSNP positions gives a more accurate background mismatch rate
--lanefile        a list of lane files, with paths
--pedfile         genotype information for the samples being checked
--datfile         a companion data file for the pedigree file (each row: M snpname, e.g. M rs1234)
--mapfile         a companion map file for the pedigree file (each row: chr snpname pos, e.g. 5 rs1234 56789)
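
The row formats described above for the dat and map files can be checked before a run. The following is a minimal Python sketch, not part of laneCheck itself: the helper names read_dat and read_map are hypothetical, and the second marker (rs5678 at position 60001) is made up for illustration. It only assumes whitespace-separated rows of the form shown in the option list.

```python
from io import StringIO

def read_dat(handle):
    """Collect marker names from dat rows of the form 'M snpname'."""
    names = []
    for line in handle:
        fields = line.split()
        if len(fields) >= 2 and fields[0] == "M":
            names.append(fields[1])
    return names

def read_map(handle):
    """Build {snpname: (chr, pos)} from map rows of the form 'chr snpname pos'."""
    positions = {}
    for line in handle:
        fields = line.split()
        if len(fields) >= 3:
            positions[fields[1]] = (fields[0], int(fields[2]))
    return positions

# Illustrative file contents; real runs would open test.dat and test.map instead.
dat_text = "M rs1234\nM rs5678\n"
map_text = "5 rs1234 56789\n5 rs5678 60001\n"

markers = read_dat(StringIO(dat_text))
positions = read_map(StringIO(map_text))

# Markers named in the dat file that have no position in the map file.
missing = [name for name in markers if name not in positions]
print(markers, positions, missing)
```

A non-empty missing list would suggest that the dat and map files are out of step with each other.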

Basic Output Options

--prefix          prefix of the output file name

Filtering 

--minmapquality   reads with mapping quality below this threshold will be excluded
--genocount       the maximum number of genotypes compared 
--verbose         print out detailed information for each hapmap position compared
--coverage        print out the proportion of markers in the map file covered by at least one read 
--countbysite     print out detailed mismatch counts for each base compared

Other Options

--memorymap       memory-map the reference genome file so that it can be shared efficiently in memory

Principle of operation:

The genotype identity checking program compares internal evidence from the sequence reads themselves with reference genotype information for a panel of candidate individuals. For the 1000 Genomes pilot data, these are HapMap genotypes from the same Coriell cell lines that are being sequenced. For each combination of [sequencing run x candidate individual], the program calculates the observed mismatch rate at both "informative" and "background" locations and reports the difference as the "excess mismatch rate":

           excess rate  =  (informative rate  -  background rate).

"Informative" locations are those where the candidate individual is homozygous according to the HapMap genotype information; there, base calls are compared to the HapMap homozygous allele rather than to the genome reference sequence. "Background" locations are all sites that are not known to be polymorphic and, if a dbSNP file is provided, not recorded in dbSNP. A relatively high background rate suggests possible problems in sample preparation or in the read mapping process.

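The excess mismatch rate defined above is simple arithmetic on two pairs of counts. The Python sketch below illustrates it under stated assumptions: the function names, the candidate labels and all counts are invented for illustration and are not laneCheck output.

```python
def mismatch_rate(mismatches, bases):
    """Observed mismatch rate; NaN when no bases were compared."""
    return mismatches / bases if bases else float("nan")

def excess_rate(info_mismatches, info_bases, bg_mismatches, bg_bases):
    """excess rate = informative rate - background rate."""
    informative = mismatch_rate(info_mismatches, info_bases)
    background = mismatch_rate(bg_mismatches, bg_bases)
    return informative - background

# Hypothetical counts for one sequencing run against two candidate individuals:
# (mismatches, bases compared) at each candidate's homozygous HapMap sites.
informative_counts = {"candidate_A": (30, 1000), "candidate_B": (260, 1000)}
# Background counts do not depend on the candidate.
background_counts = (100, 10000)

excess = {
    name: excess_rate(m, n, *background_counts)
    for name, (m, n) in informative_counts.items()
}
for name, value in excess.items():
    print(f"{name}: excess rate = {value:.4f}")
```

With these made-up numbers, the run disagrees with candidate_B's homozygous genotypes far more often than the background rate would explain, while candidate_A stays close to background.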

TODO

1. Separate the results by "Read group classifier".

The mapped .bam file may contain sequence data from different instrument runs. The read identifiers are often dot- or colon-separated strings of the form 'run_name<sep>read_number'. The 'run_name' may be either an SRR / ERR identifier or the sequencing center's own alphanumeric internal run identifier. Allow users to supply an extended regular expression such as '\(^[^.:]+\)[.:].*', which matches just the part of each read identifier that is common to all reads from one instrument run and that differs between instrument runs.
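
As a rough illustration of this proposed classifier, the Python sketch below groups read identifiers by run name. It is an assumption-laden sketch, not an implemented laneCheck feature: the pattern is a translation of the expression above into Python's re syntax (unescaped parentheses for the capture group), and the read identifiers are made up.

```python
import re
from collections import Counter

# Python-syntax equivalent of the pattern quoted above: capture everything
# before the first '.' or ':' as the run name.
RUN_PATTERN = re.compile(r"^([^.:]+)[.:].*")

def run_name(read_id):
    """Return the run-name part of a read identifier, or None if it has no separator."""
    match = RUN_PATTERN.match(read_id)
    return match.group(1) if match else None

# Hypothetical read identifiers from a single .bam file.
read_ids = ["SRR000001.1", "SRR000001.2", "run42:7", "noseparator"]

# Number of reads per run; results could then be reported separately for each key.
reads_per_run = Counter(run_name(read_id) for read_id in read_ids)
print(reads_per_run)
```

Identifiers that do not match the pattern fall under the None key, so they can be reported rather than silently merged into another run.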


2. Use a model-based approach to calculate the probability that a lane comes from the claimed individual in the index file, given a pool of individuals.
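
One possible shape for such a model, sketched in Python, is shown here. This is not the method planned for laneCheck, which the page does not specify. It assumes a flat per-base error rate, a uniform prior over the pool, independent base calls, and that every candidate is genotyped at every observed site. The sample names, sites and bases are hypothetical.

```python
import math

def base_prob(base, genotype, error):
    """P(observed base | diploid genotype), with errors spread evenly over the 3 other bases."""
    return sum((1 - error) if base == allele else error / 3 for allele in genotype) / 2

def posterior(observations, genotypes, error=0.01):
    """Posterior probability of each individual in the pool, under a uniform prior.

    observations: list of (site, base) pairs taken from the lane's reads
    genotypes:    {individual: {site: two-letter genotype such as 'AA' or 'CT'}}
    """
    loglik = {
        person: sum(math.log(base_prob(base, calls[site], error))
                    for site, base in observations)
        for person, calls in genotypes.items()
    }
    top = max(loglik.values())  # subtract the maximum before exponentiating, for stability
    weights = {person: math.exp(value - top) for person, value in loglik.items()}
    total = sum(weights.values())
    return {person: weight / total for person, weight in weights.items()}

# Hypothetical pool of two individuals and four observed bases.
genotypes = {
    "sampleA": {"rs1": "AA", "rs2": "CC", "rs3": "GG"},
    "sampleB": {"rs1": "GG", "rs2": "CT", "rs3": "GG"},
}
observations = [("rs1", "A"), ("rs1", "A"), ("rs2", "C"), ("rs3", "G")]

probabilities = posterior(observations, genotypes)
for person, value in probabilities.items():
    print(f"{person}: {value:.6f}")
```

The posterior for the individual named in the index file could then be compared against a threshold, or against the best-scoring alternative in the pool.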