Changes

From Genome Analysis Wiki
Jump to navigationJump to search
42 bytes added ,  10:46, 25 May 2010
no edit summary
Line 114: Line 114:  
=== Does --mle overwrite input genotypes?  ===
 
=== Does --mle overwrite input genotypes?  ===
   −
Yes, but not often. The --mle option outputs the most likely genotype configuration taking into account observed genotypes and integration over the most similar reference haplotypes. The original genotypes will be changed only if the underlying reference haplotypes strongly contradict the input genotype.
+
Yes, but not often. The --mle option outputs the most likely genotype configuration taking into account observed genotypes and integration over the most similar reference haplotypes. The original genotypes will be changed only if the underlying reference haplotypes strongly contradict the input genotype.  
 +
 
 +
 
    
=== How do I get imputation quality estimates?  ===
 
=== How do I get imputation quality estimates?  ===
Line 126: Line 128:  
A better approach is to mask a small proportion of SNPs (vs. genotypes in the above simple approach). One can generate a mask.dat from the original .dat file by simply changing the flag of a subset of markers from M to S2 without duplicating the .ped file. Post-imputation, one can use   [http://www.sph.umich.edu/csg/ylwtx/CalcMatch.1.0.5.tgz CalcMatch ]and [http://www.sph.umich.edu/csg/ylwtx/doseR2.tgz doseR2.pl ]to estimate genotypic/allelic error rate and correlation respectively. Both programs can be downloaded from [http://www.sph.umich.edu/csg/ylwtx/software.html http://www.sph.umich.edu/csg/ylwtx/software.html].  
 
A better approach is to mask a small proportion of SNPs (vs. genotypes in the above simple approach). One can generate a mask.dat from the original .dat file by simply changing the flag of a subset of markers from M to S2 without duplicating the .ped file. Post-imputation, one can use   [http://www.sph.umich.edu/csg/ylwtx/CalcMatch.1.0.5.tgz CalcMatch ]and [http://www.sph.umich.edu/csg/ylwtx/doseR2.tgz doseR2.pl ]to estimate genotypic/allelic error rate and correlation respectively. Both programs can be downloaded from [http://www.sph.umich.edu/csg/ylwtx/software.html http://www.sph.umich.edu/csg/ylwtx/software.html].  
   −
'''Warning''': Imputation involving masked datasets should be performed separately for imputation quality estimation. For production, one should use all available information.
+
'''Warning''': Imputation involving masked datasets should be performed separately for imputation quality estimation. For production, one should use all available information.  
   −
<br>
+
<br>  
    
=== '''Shall I apply QC before or after imputation? If so, how? '''  ===
 
=== '''Shall I apply QC before or after imputation? If so, how? '''  ===
   −
We strongly recommend QC both before and after imputation. Before imputation, we recommend the standard battery of QC filters including HWE, MAF (recommended cutoff is 1% for genotyping-based GWAS), completeness, Mendelian inconsistency etc. Post-imputation, we recommend Rsq 0.3 (which removes &gt;70% of poorly-imputed SNPs at the cost of &lt;0.5% well-imputed SNPs) and MAF of 1%.
+
We strongly recommend QC both before and after imputation. Before imputation, we recommend the standard battery of QC filters including HWE, MAF (recommended cutoff is 1% for genotyping-based GWAS), completeness, Mendelian inconsistency etc. Post-imputation, we recommend Rsq 0.3 (which removes &gt;70% of poorly-imputed SNPs at the cost of &lt;0.5% well-imputed SNPs) and MAF of 1%.  
    
=== '''How do I get reference files for an region of interest? '''  ===
 
=== '''How do I get reference files for an region of interest? '''  ===
   −
1. For HapMapII format, download haplotypes from http://www.sph.umich.edu/csg/ylwtx/HapMapForMach.tgz
+
1. For HapMapII format, download haplotypes from http://www.sph.umich.edu/csg/ylwtx/HapMapForMach.tgz 2. For MACH format, you can do the following:  
2. For MACH format, you can do the following:  
+
 
* First, find the first and last SNP in the region you are interested in. Say "rsFIRST" and "rsLAST", defined according to position.  
+
*First, find the first and last SNP in the region you are interested in. Say "rsFIRST" and "rsLAST", defined according to position.  
* Then:  
+
*Then:
 +
 
 
   @ first = `grep -n rsFIRST orig.snps | cut -f1 -d ':'`
 
   @ first = `grep -n rsFIRST orig.snps | cut -f1 -d ':'`
 
   @ last = `grep -n rsLAST orig.snps | cut -f1 -d ':'`
 
   @ last = `grep -n rsLAST orig.snps | cut -f1 -d ':'`
* Finally (assuming the third field contains the actual haplotypes, where alleles are separated by whitespace):  
+
 
 +
*Finally (assuming the third field contains the actual haplotypes, where alleles are separated by whitespace):
 +
 
 
   awk '{print $3}' orig.hap | cut -c${first}-${last} &gt; region.hap
 
   awk '{print $3}' orig.hap | cut -c${first}-${last} &gt; region.hap
    
=== '''Do I have to sort the pedigree file by physical positions? '''  ===
 
=== '''Do I have to sort the pedigree file by physical positions? '''  ===
   −
If you use external reference, you do not have to as long as the external reference is in correct order. **HOWEVER**, you will probably avoid problems by including markers in the pedigree file sorted in chromosome order.
+
If you use external reference, you do not have to as long as the external reference is in correct order. **HOWEVER**, you will probably avoid problems by including markers in the pedigree file sorted in chromosome order.  
 +
 
 +
 
    
=== '''What if I specified --states R where R exceeds the maximum possible (2*number diploid individuals - 2 + number_haplotypes)? '''  ===
 
=== '''What if I specified --states R where R exceeds the maximum possible (2*number diploid individuals - 2 + number_haplotypes)? '''  ===
   −
Mach will automatically change this to the maximum possible value.
+
Mach will automatically change this to the maximum possible value.  
    
=== '''How is AL1 defined? Which allele dosage is .dose/.mldose counting?'''  ===
 
=== '''How is AL1 defined? Which allele dosage is .dose/.mldose counting?'''  ===
   −
AL1 is an arbitrary allele. Typically, it is the first allele read in the reference haplotypes (file fed to -h or --haps). The earliest versions (prior to April 2007) of mach counted the expected number copies of AL2 and more recent versions count the number of AL1. One can find out which allele is counted following the steps below.
+
AL1 is an arbitrary allele. Typically, it is the first allele read in the reference haplotypes (file fed to -h or --haps). The earliest versions (prior to April 2007) of mach counted the expected number copies of AL2 and more recent versions count the number of AL1. One can find out which allele is counted following the steps below.  
    
#. First, find the two alleles for one of the markers in your data
 
#. First, find the two alleles for one of the markers in your data
Line 163: Line 170:  
  SNP      Al1 Al2 Freq1  MAF    Quality  Rsq  
 
  SNP      Al1 Al2 Freq1  MAF    Quality  Rsq  
 
  rs885550 2  4  0.9840  0.0160  0.9682  0.992
 
  rs885550 2  4  0.9840  0.0160  0.9682  0.992
</source>
+
</source>  
    
#. Second, check the dosage for a few individuals at this SNP.
 
#. Second, check the dosage for a few individuals at this SNP.
Line 172: Line 179:  
  1.000
 
  1.000
 
  0.078
 
  0.078
</source>
+
</source>  
    
#. Finally, compare these dosages to genotypes.
 
#. Finally, compare these dosages to genotypes.
Line 181: Line 188:  
  2/4
 
  2/4
 
  4/4
 
  4/4
</source>
+
</source>  
   −
In this example, you can see that the first individual has a high dosage count (1.962) and most likely genotype 2/2. The last individual has a low dosage count and most likely genotype 4/4. Thus, the output corresponds to version of Mach released after April 2007, which should tally allele 1 counts.
+
In this example, you can see that the first individual has a high dosage count (1.962) and most likely genotype 2/2. The last individual has a low dosage count and most likely genotype 4/4. Thus, the output corresponds to version of Mach released after April 2007, which should tally allele 1 counts.
 +
 
 +
Note that, on the example above, .mldose could be replaced with .dose and .mlgeno could be replaced with .geno.
 +
 
 +
Based on the three files above, we've confirmed that dosage is the number of AL1 copies: you will only to check for one informative case (i.e, dosage values close to 0 or 2) since it's consistent across all individuals and all SNPs.  
   −
Note that, on the example above, .mldose could be replaced with .dose and .mlgeno could be replaced with .geno.
     −
Based on the three files above, we've confirmed that dosage is the number of AL1 copies: you will only to check for one informative case (i.e, dosage values close to 0 or 2) since it's consistent across all individuals and all SNPs.
      
=== '''Can I used unphased reference?'''  ===
 
=== '''Can I used unphased reference?'''  ===
   −
A: Yes. You could create pedigree (.ped) and data files (.dat) that include both reference panel and sample genotypes or request that MaCH merge apppropriate files on the fly.
+
A: Yes. You could create pedigree (.ped) and data files (.dat) that include both reference panel and sample genotypes or request that MaCH merge apppropriate files on the fly.  
    
For example, if you have:  
 
For example, if you have:  
   −
'''reference.dat'''
+
'''reference.dat'''  
 +
 
 
  M SNP1
 
  M SNP1
 
  M SNP2
 
  M SNP2
Line 202: Line 212:  
  M SNP5
 
  M SNP5
   −
'''reference.ped'''
+
'''reference.ped'''  
 +
 
 
  REF1 REF1 0 0 1 A/C C/C G/G G/A A/A
 
  REF1 REF1 0 0 1 A/C C/C G/G G/A A/A
   −
'''sample.dat'''
+
'''sample.dat'''  
 +
 
 
  M SNP1
 
  M SNP1
 
  M SNP4  
 
  M SNP4  
   −
'''sample.ped'''
+
'''sample.ped'''  
 +
 
 
  1 1 0 0 1 A/A G/G
 
  1 1 0 0 1 A/A G/G
   −
Your could create a combined data set as:
+
Your could create a combined data set as:  
 +
 
 +
'''comb.dat'''
   −
'''comb.dat'''
   
  M SNP1
 
  M SNP1
 
  M SNP2
 
  M SNP2
Line 221: Line 235:  
  M SNP5
 
  M SNP5
   −
'''comb.ped'''
+
'''comb.ped''' REF1 REF1 0 0 1 A/C C/C G/G G/A A/A  
REF1 REF1 0 0 1 A/C C/C G/G G/A A/A
+
 
 
   1    1 0 0 1 A/A ./. ./. G/G ./.  
 
   1    1 0 0 1 A/A ./. ./. G/G ./.  
   −
Equivalently, you could write -d reference.dat,sample.dat -p reference.ped,sample.ped on the command line and MACH would merge both files ''on-the-fly''.
+
Equivalently, you could write -d reference.dat,sample.dat -p reference.ped,sample.ped on the command line and MACH would merge both files ''on-the-fly''.  
 +
 
 +
 
    
=== '''How long does imputation take?'''  ===
 
=== '''How long does imputation take?'''  ===
Line 231: Line 247:  
A: The following factors/parameters affect computational time:  
 
A: The following factors/parameters affect computational time:  
   −
#m, # of genotyped markers (number of markers in .dat file)<br>
+
#m, # of genotyped markers (number of markers in .dat file)<br>  
#n, # of individuals<br>
+
#n, # of individuals<br>  
#h, # of reference haplotypes (determined by --greedy or states, by default, h = 2*number diploid individuals - 2 + number_haplotypes)<br>
+
#h, # of reference haplotypes (determined by --greedy or states, by default, h = 2*number diploid individuals - 2 + number_haplotypes)<br>  
 
#r, # of rounds (-r or --rounds, --mle corresponds to 1-2 rounds)
 
#r, # of rounds (-r or --rounds, --mle corresponds to 1-2 rounds)
    +
<br>
    +
&nbsp;&nbsp;&nbsp; Computational time increases linearly with m, n, r and quadratically with h. On our Xeon 3.0GHz machine, imputation with m=25K, n=250, h=120, and r=100 takes ~20 hours. <br>
   −
&nbsp;&nbsp;&nbsp; Computational time increases linearly with m, n, r and quadratically with h. On our Xeon 3.0GHz machine, imputation with m=25K, n=250, h=120, and r=100 takes ~20 hours. <br>
+
<br>  
 
  −
 
  −
 
  −
<br>
      
=== '''More questions?'''  ===
 
=== '''More questions?'''  ===
212

edits

Navigation menu