Difference between revisions of "BamGenotypeCheck"

From Genome Analysis Wiki
Jump to navigationJump to search
Line 7: Line 7:
 
Here is an example of how <code>glfTrio</code> works:
 
Here is an example of how <code>glfTrio</code> works:
  
   lanecheck -f NA19239.chrom20.SLX.glf -m NA19238.chrom20.SLX.glf -c NA19240.chrom20.SLX.glf \
+
   lanecheck     Input Files : --referencegenome [/usr/cluster/share/karma/NCBI36.fa], \
        --father NA19239 --mother NA19238 --child NA19240 \
+
                  --dbSNPfile [/home/1000G/data/GenomeSNP.dbsnp],  \
        --minMapQuality 30 --minTotalDepth 0 --maxTotalDepth 1000 \
+
                  --lanefile [], --pedfile [], --datfile [], --mapfile [], \
        -b YRI.chrom20.SLX.vcf &gt; YRI.chrom20.SLX.log
+
                  --mapquality [-1], --genocount [1000000], --nonrefhomo, \
 +
                  --verbose, --coverage, --countbysite, --memorymap \
 +
  Output Files : --prefix []
 +
  
 
== Command Line Options ==
 
== Command Line Options ==
Line 16: Line 19:
 
=== Input Files ===
 
=== Input Files ===
  
  -f ''genotype likelihood file''    Father's [[GLF|GLF]]-format genotype likelihood file
+
-f ''genotype likelihood file''    Father's [[GLF|GLF]]-format genotype likelihood file
 
  -m ''genotype likelihood file''    Mother's [[GLF|GLF]]-format genotype likelihood file
 
  -m ''genotype likelihood file''    Mother's [[GLF|GLF]]-format genotype likelihood file
 
  -c ''genotype likelihood file''    Child's [[GLF|GLF]]-format genotype likelihood file
 
  -c ''genotype likelihood file''    Child's [[GLF|GLF]]-format genotype likelihood file
Line 31: Line 34:
 
   --minMapQuality ''threshold''      Positions where the root-means squared mapping quality falls below this threshold will be excluded.
 
   --minMapQuality ''threshold''      Positions where the root-means squared mapping quality falls below this threshold will be excluded.
 
  --strict                      When the map quality is interpreted ''strictly'', all three trio individuals must exceed ''minMapQuality''  
 
  --strict                      When the map quality is interpreted ''strictly'', all three trio individuals must exceed ''minMapQuality''  
                              before a call is made. Without the --strict option, reads for individuals below the threshold are ignored.
+
                            before a call is made. Without the --strict option, reads for individuals below the threshold are ignored.
  
 
   --minTotalDepth ''threshold''          Positions where the read depth falls below this threshold will be excluded.
 
   --minTotalDepth ''threshold''          Positions where the read depth falls below this threshold will be excluded.
Line 61: Line 64:
  
 
== '''TODO''' ==
 
== '''TODO''' ==
 +
 
Frequently, we will want to run lane checking on a mapped .bam file which already contains sequence data from many different instrument runs merged together. They are merged because the sequencing center said they all belong to the same individual. In Pilot 3, this was true for all of the Baylor LS 454 sequencing data. In this case, the read identifier in column 1 of the .sam file carries information about which sequencing run each read belongs to, as well as information that uniquely identifies that read within its run. The read identifiers often are dot or colon-separated strings of the form 'run_name&lt;sep&gt;read_number'. The 'run_name' may be either an SRR / ERR identifier or the sequencing center's own alpha-numeric internal run identifier.
 
Frequently, we will want to run lane checking on a mapped .bam file which already contains sequence data from many different instrument runs merged together. They are merged because the sequencing center said they all belong to the same individual. In Pilot 3, this was true for all of the Baylor LS 454 sequencing data. In this case, the read identifier in column 1 of the .sam file carries information about which sequencing run each read belongs to, as well as information that uniquely identifies that read within its run. The read identifiers often are dot or colon-separated strings of the form 'run_name&lt;sep&gt;read_number'. The 'run_name' may be either an SRR / ERR identifier or the sequencing center's own alpha-numeric internal run identifier.
  
 
The "Read group classifier" is an extended regular expression such as '\(^[^.:]+\)[.:].*' which matches just the part of each read identifier that is common to all reads from one instrument run and which differs between instrument runs. The regular expression is passed into the lane checking program as an ascii string. The program keeps track of all distinct values it has seen for the matched portion, and must keep a separate tally of matches and mismatches for each combination of [read group x candidate individual]. By itself, the matched portion of each read identifier does not fully specify which original .fastq file a read came from. The 'bitwise flag' value in column 2 of the .sam format has the remaining information. This is able to distinguish between the 'left end', 'right end' and 'single end' reads which come from each Illumina paired-end sequencing run. The Baylor LS 454 data were all single end reads, so I did not have to deal with this complication.<br>
 
The "Read group classifier" is an extended regular expression such as '\(^[^.:]+\)[.:].*' which matches just the part of each read identifier that is common to all reads from one instrument run and which differs between instrument runs. The regular expression is passed into the lane checking program as an ascii string. The program keeps track of all distinct values it has seen for the matched portion, and must keep a separate tally of matches and mismatches for each combination of [read group x candidate individual]. By itself, the matched portion of each read identifier does not fully specify which original .fastq file a read came from. The 'bitwise flag' value in column 2 of the .sam format has the remaining information. This is able to distinguish between the 'left end', 'right end' and 'single end' reads which come from each Illumina paired-end sequencing run. The Baylor LS 454 data were all single end reads, so I did not have to deal with this complication.<br>
 +
 +
<br>

Revision as of 17:46, 22 November 2009

LaneCheck


Basic Usage Example

Here is an example of how glfTrio works:

  lanecheck     Input Files : --referencegenome [/usr/cluster/share/karma/NCBI36.fa], \
                 --dbSNPfile [/home/1000G/data/GenomeSNP.dbsnp],  \
                 --lanefile [], --pedfile [], --datfile [], --mapfile [], \
                 --mapquality [-1], --genocount [1000000], --nonrefhomo, \
                 --verbose, --coverage, --countbysite, --memorymap \
  Output Files : --prefix []

Command Line Options

Input Files

-f genotype likelihood file    Father's GLF-format genotype likelihood file
-m genotype likelihood file    Mother's GLF-format genotype likelihood file
-c genotype likelihood file    Child's GLF-format genotype likelihood file

Basic Output Options

 -b baseCallFile                Specifies the name of the output VCF-format base call file
-p threshold                   The threshold for base calling. Base calls will be made when their posterior likelihood exceeds threshold
--reference                    Positions called as homozygous reference will be included in the output.  
--verbose                      Print debug information to the screen

Filtering According to Depth and Map Quality

 --minMapQuality threshold      Positions where the root-means squared mapping quality falls below this threshold will be excluded.
--strict                       When the map quality is interpreted strictly, all three trio individuals must exceed minMapQuality 
                            before a call is made. Without the --strict option, reads for individuals below the threshold are ignored.
 --minTotalDepth threshold           Positions where the read depth falls below this threshold will be excluded.
--maxTotalDepth threshold           Positions where the read depth exceeds this threshold will be excluded.

Sample Labels

 --father fatherLabel           Specifies a label for the male parent, which will be included in the output VCF file
--mother motherLabel           Specifies a label for the female parent, which will be included in the output VCF file
--child childLabel             Specifies a label for the child, which will be included in the output VCF file

X Chromosome Variant Calling

 --xChr chromosomeName          Label for the 'X' chromosome in the GLF file
--xStart sexChromosomeStart    Start of the non-pseudo-autosomal portion of the X (2,709,521 bp in build 36)
--xStop sexChromosomeEnd       End of the non-pseudo-autosomal portion of the X (154,584,237 bp in build 36)

Principle of operation:

The overall procedure is that the genotype identity checking program compares internal evidence from the sequence reads themselves to reference genotype information for a panel of candidate individuals. In the case of 1000 Genomes pilot data, these are HapMap genotypes from the same Coriell cell lines that are being sequenced. For each combination of [sequencing run x candidate individual] the program calculates the observed rate of mismatches at both "informative" and "background" locations and reports as "excess mismatch rate"

           excess rate  =  (informative rate  -  background rate).

"Informative" locations are those where the candidate individual is homozygous, according to the HapMap genotype information, and base calls are compared to the HapMap homozygous allele, rather than to the genome reference sequence. "Background" locations are all sites not known to be polymorphic and not recorded in dbSNP.

abc


TODO

Frequently, we will want to run lane checking on a mapped .bam file which already contains sequence data from many different instrument runs merged together. They are merged because the sequencing center said they all belong to the same individual. In Pilot 3, this was true for all of the Baylor LS 454 sequencing data. In this case, the read identifier in column 1 of the .sam file carries information about which sequencing run each read belongs to, as well as information that uniquely identifies that read within its run. The read identifiers often are dot or colon-separated strings of the form 'run_name<sep>read_number'. The 'run_name' may be either an SRR / ERR identifier or the sequencing center's own alpha-numeric internal run identifier.

The "Read group classifier" is an extended regular expression such as '\(^[^.:]+\)[.:].*' which matches just the part of each read identifier that is common to all reads from one instrument run and which differs between instrument runs. The regular expression is passed into the lane checking program as an ascii string. The program keeps track of all distinct values it has seen for the matched portion, and must keep a separate tally of matches and mismatches for each combination of [read group x candidate individual]. By itself, the matched portion of each read identifier does not fully specify which original .fastq file a read came from. The 'bitwise flag' value in column 2 of the .sam format has the remaining information. This is able to distinguish between the 'left end', 'right end' and 'single end' reads which come from each Illumina paired-end sequencing run. The Baylor LS 454 data were all single end reads, so I did not have to deal with this complication.