Changes

From Genome Analysis Wiki
178 bytes removed, 22:25, 23 November 2009
no edit summary
Line 5: Line 5:  
== Basic Usage Example ==
 
   −
Here is an example of how <code>glfTrio</code> works:
+
Here is an example of how laneCheck&nbsp;works:
   −
   lanecheck  --referencegenome /usr/cluster/share/karma/NCBI36.fa --dbSNPfile /home/1000G/data/GenomeSNP.dbsnp,  \
+
   lanecheck  --referencegenome NCBI36.fa --dbSNPfile dbSNP.txt \
            --lanefile lane.lst --pedfile test.ped --datfile test.dat --mapfile test.map --prefix result
+
            --lanefile lane.lst --pedfile test.ped --datfile test.dat --mapfile test.map --prefix result
 
   
 
   
   Line 15: Line 15:  
=== Input Files ===
 
   −
  --referencegenome referencegenome file
+
  --referencegenome ''referencegenome file''
  --dbSNPfile dbsnp position file  
+
  --dbSNPfile       ''optional'' ''dbsnp position file, will provide more accurate background mismatch rate if excluding dbSNP positions''
  --lanefile a list of lane file with path
+
  --lanefile       ''a list of lane file with path''
  --pedfile genotype information of the samples for checking  
+
  --pedfile         ''genotype information of the samples for checking ''
  --datfile a companion data file for pedigree file (each row: M snpname, e.g. M rs1234)
+
  --datfile         ''a companion data file for pedigree file (each row: M snpname, e.g. M rs1234)''
  --mapfile a companion data file for pedigree file (each row: chr snpname pos, e.g. 5 rs1234 56789)
+
  --mapfile         ''a companion data file for pedigree file (each row: chr snpname pos, e.g. 5 rs1234 56789)''
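The row layouts described for the companion files can be illustrated with a small reader. This is only a sketch based on the examples given above (`M rs1234` for the .dat file, `5 rs1234 56789` for the .map file); the function names are hypothetical and are not part of laneCheck, and real files may carry rows that this sketch simply skips.

```python
def parse_dat(lines):
    """Collect marker names from rows of the form 'M snpname'."""
    names = []
    for line in lines:
        fields = line.split()
        # Only rows whose first column is 'M' describe a marker.
        if len(fields) >= 2 and fields[0] == "M":
            names.append(fields[1])
    return names


def parse_map(lines):
    """Map each SNP name to (chromosome, position) from 'chr snpname pos' rows."""
    positions = {}
    for line in lines:
        fields = line.split()
        if len(fields) >= 3:
            chrom, name, pos = fields[:3]
            positions[name] = (chrom, int(pos))
    return positions


def missing_positions(dat_names, map_positions):
    """List markers named in the .dat rows that have no .map entry."""
    return [name for name in dat_names if name not in map_positions]


if __name__ == "__main__":
    dat_rows = ["M rs1234", "M rs5678"]
    map_rows = ["5 rs1234 56789"]
    print(missing_positions(parse_dat(dat_rows), parse_map(map_rows)))
```

A check like `missing_positions` is one way to catch a .dat and .map pair that have drifted out of sync before a run is started.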
 
   
 
   
    
=== Basic Output Options ===
 
   −
  --prefix specify the name of the output file
+
  --prefix ''specify the prefix name of the output file''
    
=== Filtering&nbsp; ===
 
   −
  --minmapquality reads with with mapquality falling below this threshold will be excluded
+
  --minmapquality   ''reads with with mapquality falling below this threshold will be excluded''
  --genocount the maximum number of genotypes compared  
+
  --genocount       ''the maximum number of genotypes compared''
  --verbose generate detailed  
+
  --verbose         ''generate detailed''
 
  --coverage  
 
  --countbysite generate'' detailed statistic for each base compared
+
  --countbysite     ''generate detailed statistic for each base compared''
 
   
 
   
    
=== Other Options ===
 
   −
  --memorymap use memory map technique for efficient memory sharing
+
  --memorymap ''use memory map technique for efficient memory sharing of reference genome file
 +
''
   −
=== X Chromosome Variant Calling ===
+
== Principle of operation: ==
 
  −
  --xChr ''chromosomeName''          Label for the 'X' chromosome in the GLF file
  −
--xStart ''sexChromosomeStart''    Start of the non-pseudo-autosomal portion of the X (2,709,521 bp in build 36)
  −
--xStop ''sexChromosomeEnd''      End of the non-pseudo-autosomal portion of the X (154,584,237 bp in build 36)
  −
 
  −
Principle of operation:
      
The genotype identity checking program compares internal evidence from the sequence reads themselves to reference genotype information for a panel of candidate individuals. In the case of 1000 Genomes pilot data, these are HapMap genotypes from the same Coriell cell lines that are being sequenced. For each combination of [sequencing run x candidate individual], the program calculates the observed rate of mismatches at both "informative" and "background" locations and reports the result as an "excess mismatch rate".
Line 53: Line 48:     
"Informative" locations are those where the candidate individual is homozygous, according to the HapMap genotype information, and base calls are compared to the HapMap homozygous allele, rather than to the genome reference sequence. "Background" locations are all sites not known to be polymorphic and not recorded in dbSNP.
  −
abc
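The comparison of "informative" and "background" mismatch rates can be sketched as follows. The page does not give the exact formula, so this sketch assumes that the excess mismatch rate is the informative-site rate minus the background rate, and that the best-matching candidate is the one with the smallest excess. All names here are hypothetical and are not laneCheck's own code.

```python
from dataclasses import dataclass


@dataclass
class SiteCounts:
    """Tallies of compared and mismatching base calls for one class of sites."""
    compared: int
    mismatched: int

    def rate(self) -> float:
        # A lane with no coverage at these sites contributes a rate of zero.
        return self.mismatched / self.compared if self.compared else 0.0


def excess_mismatch_rate(informative: SiteCounts, background: SiteCounts) -> float:
    """Assumed definition: informative-site rate minus background rate."""
    return informative.rate() - background.rate()


def best_candidate(excess_by_candidate: dict[str, float]) -> str:
    """Return the candidate whose genotypes disagree least with the reads."""
    return min(excess_by_candidate, key=excess_by_candidate.get)


if __name__ == "__main__":
    # Made-up counts for one sequencing run against two candidate individuals.
    background = SiteCounts(compared=100_000, mismatched=1_000)
    excess = {
        "candidate_A": excess_mismatch_rate(SiteCounts(1_000, 12), background),
        "candidate_B": excess_mismatch_rate(SiteCounts(1_000, 220), background),
    }
    print(best_candidate(excess))
```

Under this assumption, a run that truly comes from a candidate should show an excess close to zero, because mismatches at that candidate's homozygous sites should then arise only from the ordinary sequencing error seen at background sites.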
      
<br>
 
<br>
   −
== '''TODO''' ==
+
== TODO ==
    
Frequently, we will want to run lane checking on a mapped .bam file that already contains sequence data from many different instrument runs merged together. They are merged because the sequencing center said they all belong to the same individual. In Pilot 3, this was true for all of the Baylor LS 454 sequencing data. In this case, the read identifier in column 1 of the .sam file carries information about which sequencing run each read belongs to, as well as information that uniquely identifies that read within its run. The read identifiers are often dot- or colon-separated strings of the form 'run_name&lt;sep&gt;read_number'. The 'run_name' may be either an SRR / ERR identifier or the sequencing center's own alphanumeric internal run identifier.
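Splitting merged reads back into their runs could look roughly like the sketch below. It assumes that the run name is everything before the last dot or colon in the read identifier, which will not hold for every center's naming scheme. The function names and the example identifiers are hypothetical.

```python
import re
from collections import Counter

# Greedy match: the run name is everything before the LAST '.' or ':'.
_READ_ID = re.compile(r"^(?P<run>.+)[.:](?P<read>[^.:]+)$")


def split_read_id(read_id: str) -> tuple[str, str]:
    """Split 'run_name<sep>read_number' into (run_name, read_number)."""
    match = _READ_ID.match(read_id)
    if match is None:
        raise ValueError(f"no run name found in read identifier: {read_id!r}")
    return match.group("run"), match.group("read")


def reads_per_run(sam_lines) -> Counter:
    """Count alignment records per run, using column 1 of each SAM line."""
    counts = Counter()
    for line in sam_lines:
        # Skip blank lines and SAM header lines, which start with '@'.
        if not line.strip() or line.startswith("@"):
            continue
        read_id = line.split("\t", 1)[0]
        run_name, _ = split_read_id(read_id)
        counts[run_name] += 1
    return counts


if __name__ == "__main__":
    example = [
        "@HD\tVN:1.0",
        "SRR001234.1\t0\tchr1",
        "SRR001234.2\t0\tchr1",
        "ERR000111.9\t16\tchr2",
    ]
    print(reads_per_run(example))
```

Grouping by run name in this way would let each run be checked separately, even though the .bam file presents the runs as a single merged sample.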