Changes

From Genome Analysis Wiki
11,531 bytes added, 07:42, 6 August 2013
Line 5: Line 5:     
This page describes characteristics of variants for proposed exome chip genotyping arrays. Illumina and Affymetrix are currently implementing arrays based on the design principles described here and the variant lists are available to any others who would like to design similar arrays and commit to making them broadly available to the scientific community.
 
This page describes characteristics of variants for proposed exome chip genotyping arrays. Illumina and Affymetrix are currently implementing arrays based on the design principles described here and the variant lists are available to any others who would like to design similar arrays and commit to making them broadly available to the scientific community.
 +
 +
= Online Content =
 +
 +
* [[Media:2011.09_-_Exome_Chip.pdf|Slides describing exome chip design]]
 +
* [http://genome.sph.umich.edu/wiki/Exome_Chip_Design Description of exome chip design process and goals]
 +
* [ftp://share.sph.umich.edu/exomeChip/ FTP site with list of variants selected for design]
 +
* Excel format files with list of [ftp://share.sph.umich.edu/exomeChip/ProposedContent/codingContent/nonsynonymous.csv nonsynonymous], [ftp://share.sph.umich.edu/exomeChip/ProposedContent/codingContent/splice.csv splice] and [ftp://share.sph.umich.edu/exomeChip/ProposedContent/codingContent/stop.csv stop] variants
    
= Key Credit =
 
= Key Credit =
Line 28: Line 35:  
Key members of this group included:
 
Key members of this group included:
   −
* '''Benjamin Neale'''
+
* '''Benjamin Neale''' (Massachusetts General Hospital)
 +
* Kyle Gaulton (University of Oxford)
 
* Goo Jun (University of Michigan)
 
* Goo Jun (University of Michigan)
 
* Hyun Min Kang (University of Michigan)
 
* Hyun Min Kang (University of Michigan)
 
* Shaun Purcell (Mount Sinai School of Medicine / Broad Institute)
 
* Shaun Purcell (Mount Sinai School of Medicine / Broad Institute)
 +
* Manuel Rivas (University of Oxford)
 
* Josh Smith (University of Washington)
 
* Josh Smith (University of Washington)
 +
 +
== Content Contributors ==
 +
 +
 +
{| class="sortable wikitable" border="1" cellspacing="0"
 +
|- bgcolor="lightblue"
 +
|'''Contributor'''
 +
|'''Enrichment'''
 +
|'''Major Ethnicity'''
 +
|'''N'''    
 +
|'''Contact'''
 +
|-
 +
|NHLBI Exome Sequencing Project (5 tranches) 
 +
|Cardiovascular Traits, Lung Traits, Obesity
 +
|European, African American 
 +
|4260
 +
|D. Nickerson, D. Altshuler and S. Rich
 +
|-
 +
|Autism (2 tranches)
 +
|Autism
 +
|European
 +
|1778
 +
|M. Daly and R. Gibbs
 +
|-
 +
|GO T2D (2 tranches)
 +
|Type 2 diabetes
 +
|European
 +
|1618
 +
|D. Altshuler, M. Boehnke, M. McCarthy
 +
|-
 +
|1000 Genomes Project (2 tranches)
 +
|Random Sample
 +
|Diverse
 +
|1128
 +
|H. M. Kang
 +
|-
 +
|Sweden Schizophrenia Study
 +
|Schizophrenia
 +
|European
 +
|525
 +
|S. Purcell and P. Sklar
 +
|-
 +
|SardiNIA
 +
|Random Sample
 +
|European
 +
|508
 +
|F. Cucca and G. Abecasis
 +
|-
 +
|Sanger / CoLaus
 +
|Overweight, Diabetes, Fasting Glucose
 +
|European
 +
|456
 +
|I. Barroso
 +
|-
 +
|Cancer Genome Atlas
 +
|Cancer
 +
|European
 +
|422
 +
|S. Gabriel
 +
|-
 +
|T2D Genes
 +
|Type 2 diabetes
 +
|Hispanic
 +
|362
 +
|J. Blangero and G. Abecasis
 +
|-
 +
|Cancer Cohort Study
 +
|Cancer
 +
|Chinese
 +
|327
 +
|W. Zheng
 +
|-
 +
|Pfizer – MGH – Broad
 +
|Type 2 diabetes extremes of risk
 +
|European
 +
|182
 +
|D. Altshuler
 +
|-
 +
|Lipid Extremes
 +
|Lipid Extremes
 +
|European
 +
|131
 +
|S. Kathiresan and D. Rader
 +
|-
 +
|Int’l HIV Controllers Study
 +
|HIV Controllers
 +
|European
 +
|121
 +
|P. De Bakker
 +
|-
 +
|SAEC DILI (merged w/Autism tranches)
 +
|Augmentin DILI
 +
|European
 +
|117
 +
|M. Daly and A. Holden
 +
|-
 +
|I2B2 - Major Depression
 +
|Major Depression, Major Depressive Disorder
 +
|European
 +
|50
 +
|R. Perlis and J. Smoller
 +
|-
 +
|BMI Extremes
 +
|BMI Extremes
 +
|European
 +
|46
 +
|J. Hirschhorn
 +
|}
    
= Chip Goals =
 
= Chip Goals =
Line 44: Line 161:  
== Non-synonymous Variants ==
 
== Non-synonymous Variants ==
   −
We tallied 1,107,05 nonsynonymous variants seen at least once across ~12,000 sequenced samples. Among the non-synonymous variants, the majority were seen only once (646,888) or twice (163,044), as expected. Of the remaining variants (297,119), a total of 260,054 were seen in at least 2 datasets and are considered as candidates for inclusion in exome SNP arrays. The transition transversion ratio of this class of variants was 2.0 in the full set of variants and 2.49 in the set of variants that were seen at least three times and in two or more studies.
+
We tallied 1,107,051 nonsynonymous variants seen at least once across ~12,000 sequenced samples. Among the non-synonymous variants, the majority were seen only once (646,888) or twice (163,044), as expected. Of the remaining variants (297,119), a total of 260,054 were seen in at least 2 datasets and are considered as candidates for inclusion in exome SNP arrays. The transition transversion ratio of this class of variants was 2.0 in the full set of variants and 2.49 in the set of variants that were seen at least three times and in two or more studies.
    
The set of variants selected for array design is estimated to include 97-98% of the nonsynonymous variants detected in an average genome through exome sequencing.
 
The set of variants selected for array design is estimated to include 97-98% of the nonsynonymous variants detected in an average genome through exome sequencing.
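
The tallies quoted in this section can be checked with a few lines of arithmetic. The Python sketch below reproduces the count of variants seen three or more times from the figures in the text, and shows one way the selection rule (seen at least three times and in at least two datasets) might be written. The `is_candidate` helper and its argument names are hypothetical and are not taken from the original design pipeline.

```python
# Counts quoted in the text of this section.
TOTAL_NONSYNONYMOUS = 1_107_051
SEEN_ONCE = 646_888
SEEN_TWICE = 163_044
CANDIDATES = 260_054

# Variants seen three or more times across all samples.
remaining = TOTAL_NONSYNONYMOUS - SEEN_ONCE - SEEN_TWICE


def is_candidate(times_seen: int, datasets_seen: int) -> bool:
    """Hypothetical selection rule: at least 3 observations in at least 2 datasets."""
    return times_seen >= 3 and datasets_seen >= 2


if __name__ == "__main__":
    print(f"seen three or more times: {remaining:,}")
    print(f"share of those kept as candidates: {CANDIDATES / remaining:.1%}")
```

The subtraction agrees with the 297,119 remaining variants reported above.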
Line 92: Line 209:  
== Functionally interesting variants ==
 
== Functionally interesting variants ==
   −
Defined by the ESP groups for specific interest
+
A small set of SNPs of high interest to groups participating in the NHLBI Exome Sequencing Project was included.
    
== Random set of synonymous variants (as comparator) ==
 
== Random set of synonymous variants (as comparator) ==
   −
A set of 5,000 synonymous variants was sampled at random.
+
A set of 5,000 synonymous variants was sampled at random. It is anticipated that these markers might be useful as a comparator (for example, for genomic control based analyses) when interpreting the results of assays for coding variants.
 
  −
Among these, 4,355 passed Illumina assay design.
      
== Fingerprint SNPs ==
 
== Fingerprint SNPs ==
    
A set of 274 SNPs currently used as fingerprint SNPs at the University of Washington and Broad Institute were included. These SNPs are shared among several major genotyping platforms and facilitate sample tracking.
 
A set of 274 SNPs currently used as fingerprint SNPs at the University of Washington and Broad Institute were included. These SNPs are shared among several major genotyping platforms and facilitate sample tracking.
  −
Among these 274 SNPs, assays could be designed for 259 SNPs.
      
== Mitochondrial SNPs ==
 
== Mitochondrial SNPs ==
   −
A set of XXX coding variants in the mitochondrial, drawn from the 1000 Genomes Project.
+
A set of 246 coding variants in the mitochondria, drawn from the 1000 Genomes Project.
    
== Chromosome Y SNPs ==
 
== Chromosome Y SNPs ==
   −
A set of 180 Y chromosome SNPs contributed by *** at Sanger, based on analyses of 1000 Genomes data and European haplogroups.
+
A set of 188 Y chromosome SNPs contributed by Jim Wilson at Sanger, based on analyses of 1000 Genomes data and European haplogroups.
    
== HLA tag SNPs ==
 
== HLA tag SNPs ==
    
A set of 2,536 HLA tag SNPs selected by Paul de Bakker.  
 
A set of 2,536 HLA tag SNPs selected by Paul de Bakker.  
  −
Among these, 2,459 could be designed into Illumina Assays.
      
Paul De Bakker's HLA tag SNPs are listed here:
 
Paul De Bakker's HLA tag SNPs are listed here:
 
http://www.broadinstitute.org/~debakker/hla_tags_exome.txt
 
http://www.broadinstitute.org/~debakker/hla_tags_exome.txt
 +
 +
= Second Generation Arrays =
 +
 +
A second generation of exome arrays will be available in 2013, from both Illumina and Affymetrix. In addition to the original exome array content, each of these includes a grid of SNPs across the genome that facilitates analysis of common variants when a suitably large reference panel is available. These grids were selected both to ensure good coverage of the genome and to ensure that assays could be manufactured inexpensively, taking into account proprietary platform specific constraints.
 +
 +
To evaluate the accuracy of these grids, we have used data from the Go T2D project (led by David Altshuler, Mike Boehnke and Mark McCarthy). The dataset includes ~2,650 individuals that have been whole genome sequenced (depth ~4x) and whole exome sequenced (depth ~80-100x). We focused on chr20 and 600 samples from the UK and, for each of these in turn, tried to impute missing genotypes using the remaining sequenced individuals as a reference panel. The results show both the Affymetrix and Illumina arrays are expected to provide excellent coverage of the genome provided that a suitably large reference panel is available.
 +
 +
The specific numbers are that:
 +
 +
* For variants with MAF >1%, the average r<sup>2</sup> correlation between imputed and true genotypes will be 0.8879 (Affymetrix) and 0.8590 (Illumina).
 +
* For variants with MAF >5%, the average r<sup>2</sup> correlation between imputed and true genotypes will be 0.9458 (Affymetrix) and 0.9256 (Illumina).
 +
 +
The fraction of variants imputed with r<sup>2</sup> > 0.80 will be:
 +
 +
* For variants with MAF >1%, 82.1% (Affymetrix) and 76.5% (Illumina)
 +
* For variants with MAF >5%, 94.6% (Affymetrix) and 89.2% (Illumina)
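
The two summaries listed above (average r<sup>2</sup> and the fraction of variants with r<sup>2</sup> > 0.80) could be computed along the following lines. This is a minimal sketch, not the evaluation code used here: it assumes r<sup>2</sup> is the squared Pearson correlation between imputed dosages and sequence-based genotypes coded as 0, 1 or 2, and all function names are illustrative.

```python
def r_squared(imputed, truth):
    """Squared Pearson correlation between imputed dosages and true genotypes."""
    n = len(truth)
    mean_i = sum(imputed) / n
    mean_t = sum(truth) / n
    cov = sum((i - mean_i) * (t - mean_t) for i, t in zip(imputed, truth))
    var_i = sum((i - mean_i) ** 2 for i in imputed)
    var_t = sum((t - mean_t) ** 2 for t in truth)
    if var_i == 0 or var_t == 0:
        return 0.0  # no variation in this sample: correlation undefined, treated as 0
    return cov * cov / (var_i * var_t)


def minor_allele_frequency(truth):
    """MAF from genotypes coded as 0, 1 or 2 copies of the alternate allele."""
    freq = sum(truth) / (2 * len(truth))
    return min(freq, 1 - freq)


def summarize(variants, maf_threshold, r2_cutoff=0.80):
    """Return (mean r2, fraction with r2 > cutoff) for variants with MAF > threshold.

    `variants` is a list of (imputed_dosages, true_genotypes) pairs.
    """
    scores = [
        r_squared(imputed, truth)
        for imputed, truth in variants
        if minor_allele_frequency(truth) > maf_threshold
    ]
    if not scores:
        return None
    mean_r2 = sum(scores) / len(scores)
    well_imputed = sum(s > r2_cutoff for s in scores) / len(scores)
    return mean_r2, well_imputed
```

Calling `summarize(variants, 0.01)` and `summarize(variants, 0.05)` would give the two MAF strata reported above, given per-variant dosages and genotypes for the masked samples.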
 +
 +
The evaluation is based on imputation of ~600 UK samples that have been whole genome and whole exome sequenced (comparing imputed genotypes and the sequence-based calls) and using a panel of 2,650 sequenced individuals from the Go T2D Project (Altshuler, Boehnke, McCarthy) as a reference.
 +
 +
All evaluations used [[Minimac]] and were carried out by [mailto:cfuchsb@umich.edu Christian Fuchsberger].
    
= Illumina Exome Arrays =
 
= Illumina Exome Arrays =
   −
== Coding Variants ==
+
== Coding Variants ==
   −
Design criteria included a requirement for an assay design score >= 0.50, a primer that didn't overlap a nearby variant with minor allele count >100, a primer that didn't map with 0, 1 or 2 mismatches to other genomic locations. Assay design failures appeared to be largely independent of frequency.
+
Design criteria included a requirement for an assay design score &gt;= 0.50, a primer that didn't overlap a nearby variant with minor allele count &gt;100, a primer that didn't map with 0, 1 or 2 mismatches to other genomic locations. Whenever possible, we favored assays that extended into exons (rather than into introns) so as to maximize the utility of these arrays when applied to RNA samples. Assay design failures appeared to be largely independent of frequency.  
    
In the Illumina platform, 243,094 of the original set of 275,165 coding variants (non-synonymous, stop and splice) passed assay design criteria. We expect that 80-90% of the variants that pass design criteria will ultimately be included in genotyping arrays.
 
In the Illumina platform, 243,094 of the original set of 275,165 coding variants (non-synonymous, stop and splice) passed assay design criteria. We expect that 80-90% of the variants that pass design criteria will ultimately be included in genotyping arrays.
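
The three design criteria can be read as a simple filter over candidate assays. The sketch below is an illustration only: the `AssayCandidate` record and its field names are hypothetical, and the actual Illumina design scoring is not reproduced here.

```python
from dataclasses import dataclass


@dataclass
class AssayCandidate:
    """Hypothetical record for one candidate assay; field names are illustrative."""
    design_score: float
    max_overlapping_minor_allele_count: int  # highest MAC among variants under the primer
    off_target_hits: int  # other genomic locations matched with 0, 1 or 2 mismatches


def passes_design(candidate: AssayCandidate) -> bool:
    """Apply the three criteria described in the text."""
    return (
        candidate.design_score >= 0.50
        and candidate.max_overlapping_minor_allele_count <= 100
        and candidate.off_target_hits == 0
    )


# Counts quoted in the text: coding variants attempted and passing design.
ATTEMPTED = 275_165
PASSED = 243_094

if __name__ == "__main__":
    print(f"coding variants passing design: {PASSED / ATTEMPTED:.1%}")
```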
 +
 +
== Assay Design Rates  ==
 +
 +
{| cellpadding="2" cellspacing="0" border="1" summary="Summarizes the Number of SNPs in Each Category That Passed Assay Design. Note that Categories Overlap."
 +
|+ '''Illumina Assay Design Summary'''
 +
|-
 +
! bgcolor="lightblue" scope="col" | SNP Set
 +
! bgcolor="lightblue" scope="col" align = "center" | Number of <br>&nbsp;&nbsp;Candidates&nbsp;&nbsp;
 +
! bgcolor="lightblue" scope="col" align = "center" | Number of <br>&nbsp;&nbsp;Successful Designs&nbsp;&nbsp;
 +
! bgcolor="lightblue" scope="col" align = "center" | Additional Notes
 +
|-
 +
! bgcolor="lightgray" scope="row" | Coding Content
 +
| align="right" | 275,165
 +
| align="right" | 243,094
 +
| &nbsp;&nbsp;An additional set of 8,242 SNPs that were unique to the 1000 Genomes Project and populations under-represented in the design was added.
 +
|-
 +
! bgcolor="lightgray" scope="row" | GWAS Tag SNPs
 +
| align="right" | 5,763
 +
| align="right" | 5,325
 +
|
 +
|-
 +
! bgcolor="lightgray" scope="row" | Grid of Common Variants
 +
| align="right" | 5,710
 +
| align="right" | 5,286
 +
|
 +
|-
 +
! bgcolor="lightgray" scope="row" | Randomly Selected Synonymous SNPs
 +
| align="right" | 5,000
 +
| align="right" | 4,651
 +
| &nbsp;&nbsp;For 1,000 SNPs, assays were generated on both strands in order to facilitate QC efforts and future development of methods for genotyping of rare variants.
 +
|-
 +
! bgcolor="lightgray" scope="row" | AIM - African Ancestry
 +
| align="right" | 3,388
 +
| align="right" | 3,241
 +
|
 +
|-
 +
! bgcolor="lightgray" scope="row" | AIM - Native American Ancestry
 +
| align="right" | 1,000
 +
| align="right" | 998
 +
|
 +
|-
 +
! bgcolor="lightgray" scope="row" | HLA Tags
 +
| align="right" | 2,536
 +
| align="right" | 2,459
 +
|
 +
|-
 +
! bgcolor="lightgray" scope="row" | ESP Requests
 +
| align="right" | 1,003
 +
| align="right" | 843
 +
|
 +
|-
 +
! bgcolor="lightgray" scope="row" | Fingerprint SNPs
 +
| align="right" | 285
 +
| align="right" | 259
 +
|
 +
|-
 +
! bgcolor="lightgray" scope="row" | MicroRNA Target Sites
 +
| align="right" | 285
 +
| align="right" | 270
 +
|
 +
|-
 +
! bgcolor="lightgray" scope="row" | Mitochondrial Variants
 +
| align="right" | 246
 +
| align="right" | 246
 +
|
 +
|-
 +
! bgcolor="lightgray" scope="row" | Chromosome Y
 +
| align="right" | 188
 +
| align="right" | 128
 +
|
 +
|-
 +
! bgcolor="lightgray" scope="row" | Indels
 +
| align="right" | 181
 +
| align="right" | 181
 +
|}
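
Design success rates per category follow directly from the table. The short Python sketch below copies the counts from the Illumina summary and ranks the categories by success rate; the variable and function names are illustrative. Because the categories overlap, the counts should not be summed into a grand total.

```python
# (candidates, successful designs) per category, copied from the Illumina table.
ILLUMINA_DESIGN = {
    "Coding Content": (275_165, 243_094),
    "GWAS Tag SNPs": (5_763, 5_325),
    "Grid of Common Variants": (5_710, 5_286),
    "Randomly Selected Synonymous SNPs": (5_000, 4_651),
    "AIM - African Ancestry": (3_388, 3_241),
    "AIM - Native American Ancestry": (1_000, 998),
    "HLA Tags": (2_536, 2_459),
    "ESP Requests": (1_003, 843),
    "Fingerprint SNPs": (285, 259),
    "MicroRNA Target Sites": (285, 270),
    "Mitochondrial Variants": (246, 246),
    "Chromosome Y": (188, 128),
    "Indels": (181, 181),
}


def design_rate(candidates: int, successes: int) -> float:
    """Fraction of candidate SNPs for which an assay could be designed."""
    return successes / candidates


def ranked_by_rate(table):
    """Category names ordered from lowest to highest design rate."""
    return sorted(table, key=lambda name: design_rate(*table[name]))


if __name__ == "__main__":
    for name in ranked_by_rate(ILLUMINA_DESIGN):
        candidates, successes = ILLUMINA_DESIGN[name]
        print(f"{name:35s} {design_rate(candidates, successes):6.1%}")
```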
 +
 +
&nbsp;
 +
 +
 +
== Sites to be Careful About ==
 +
 +
Peter Chines, working with Francis Collins, provided a list of 333 exome chip variant sites that should be treated with caution. The sites include variants for which the SNP probe differs from the expected reference genome sequence, could not be mapped back to the reference, mapped to multiple places, or where neither allele matches the reference genome.
 +
 +
A plain text [ftp://share.sph.umich.edu/exomeChip/IlluminaDesigns/cautiousSites/cautiousSite.sorted.sites list] of these sites and corresponding [ftp://share.sph.umich.edu/exomeChip/IlluminaDesigns/cautiousSites/cautiousSite.sorted.README descriptions] are available.
 +
 +
= Affymetrix Exome Arrays =
 +
 +
== Coding Variants: Design Criteria ==
 +
Probe sequences were a priori excluded if there was an adjacent polymorphism within 5bp of the target variant or if the cumulative genome-frequency count of each 16-mer in the probe exceeded 300.
 +
The array was wet-lab validated against HapMap 270 and ~1000 Genomes Sample Collections.
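
The two exclusion rules can be sketched as follows. This is one possible reading, stated as an assumption: the "cumulative genome-frequency count" is taken to be the sum, over every 16-mer in the probe, of that 16-mer's number of occurrences in the genome. The function names and the `genome_kmer_counts` lookup are hypothetical and do not describe the Affymetrix design software.

```python
K = 16
MAX_CUMULATIVE_KMER_COUNT = 300
MAX_DISTANCE_TO_POLYMORPHISM = 5  # in base pairs


def kmers(sequence, k=K):
    """All overlapping k-mers of a probe sequence."""
    return [sequence[i:i + k] for i in range(len(sequence) - k + 1)]


def exclude_probe(probe, target_pos, variant_positions, genome_kmer_counts):
    """Return True if the probe would be excluded under either rule.

    variant_positions: positions of known polymorphisms near the target.
    genome_kmer_counts: mapping from 16-mer to its occurrence count in the genome.
    """
    adjacent = any(
        pos != target_pos and abs(pos - target_pos) <= MAX_DISTANCE_TO_POLYMORPHISM
        for pos in variant_positions
    )
    # Assumed interpretation: sum of genome counts over every 16-mer in the probe.
    cumulative = sum(genome_kmer_counts.get(kmer, 0) for kmer in kmers(probe))
    return adjacent or cumulative > MAX_CUMULATIVE_KMER_COUNT
```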
 +
 +
{| cellpadding="2" cellspacing="1" border="0" summary="Summarizes the Number of SNPs in Each Category that were attempted and those that passed wet lab validation. Note that Categories Overlap."
 +
|+ '''Affymetrix Assay Design Summary'''
 +
|-
 +
! bgcolor="lightblue" scope="col" |  Categories
 +
! bgcolor="lightblue" scope="col"  align="right"| Candidates
 +
! bgcolor="lightblue" scope="col" align="center" | &nbsp; # wet lab validated  <br>&nbsp;&nbsp;& working on Axiom
 +
! bgcolor="lightblue" scope="col"  | Comments
 +
|-
 +
! scope="row" align="left" |  Non-synonymous Coding SNPs<br>&nbsp;/splice & stop
 +
| align="right" |259,976<br>&nbsp;/19,672
 +
| align="right" | 247,546<br>&nbsp;/17,066
 +
| &nbsp;Includes  16K additional non-synonymous coding variants from <br> &nbsp; the  Axiom Genomic Database.
 +
|-
 +
! bgcolor="lightgray" scope="row" align="left" |  GWAS
 +
| bgcolor="lightgray" scope="row"  align="right" | 5,542
 +
| bgcolor="lightgray" scope="row" align="right" | 5,053
 +
| bgcolor="lightgray" scope="row" |
 +
|-
 +
! scope="row" align="left"  | Grid
 +
| align="right" | 5,719
 +
| align="right" | 5,478
 +
|
 +
|-
 +
! bgcolor="lightgray" scope="row" align="left" | Synonymous cSNPs
 +
| bgcolor="lightgray" scope="row"  align="right" | 5,000
 +
| bgcolor="lightgray" scope="row"  align="right" | 4,367
 +
| bgcolor="lightgray" scope="row"  align="right" |
 +
|-
 +
!  scope="row" align="left" |  AIMs  (Eur/African Ancestry)
 +
| align="right" | 3,388
 +
| align="right" | 3,283
 +
|
 +
|-
 +
! bgcolor="lightgray" scope="row" align="left" |  AIMs  (Native American Ancestry)
 +
| bgcolor="lightgray" scope="row"  align="right" | 1,000
 +
| bgcolor="lightgray" scope="row"  align="right" | 962
 +
| bgcolor="lightgray" scope="row"  align="right" |
 +
|-
 +
! scope="row" align="left"  |  AIMs  (Other)
 +
| align="right" | 271
 +
| align="right" | 271
 +
|  &nbsp;Includes  supplemental AIMs from the Latin American Cancer <br> &nbsp; Epidemiology  (LACE) Consortium. 
 +
|-
 +
! bgcolor="lightgray" scope="row" align="left" |  HLA
 +
| bgcolor="lightgray" scope="row"  align="right" | 2,536
 +
| bgcolor="lightgray" scope="row"  align="right"| 2,262
 +
| bgcolor="lightgray" scope="row"  align="right" |
 +
|-
 +
! scope="row" align="left"  |  ESP
 +
| align="right" | 1,003
 +
| align="right" | 952
 +
|
 +
|-
 +
! bgcolor="lightgray" scope="row" align="left" | Fingerprint
 +
| bgcolor="lightgray" scope="row"  align="right" | 285
 +
| bgcolor="lightgray" scope="row"  align="right" | 268
 +
| bgcolor="lightgray" scope="row"  align="right" |
 +
|-
 +
! scope="row" align="left"  |  miRNA
 +
| align="right" | 285
 +
| align="right" | 250
 +
|
 +
|-
 +
! bgcolor="lightgray" scope="row"  align="left" |  Mitochondrial DNA
 +
| bgcolor="lightgray" scope="row"  align="right"| 246
 +
| bgcolor="lightgray" scope="row"  align="right"| 207
 +
| bgcolor="lightgray" scope="row"  align="right" |
 +
|-
 +
! scope="row" align="left" |  Chromosome Y
 +
| align="right" | 232
 +
| align="right" |  161
 +
|
 +
|-
 +
! bgcolor="lightgray" scope="row" align="left" |  Indels
 +
| bgcolor="lightgray" scope="row"  align="right" | 56,095
 +
| bgcolor="lightgray" scope="row"  align="right" | 35,137
 +
| bgcolor="lightgray" scope="row"  | &nbsp; Includes biallelic indels from the draft Phase 1 1000 Genomes Project and  previously validated indels in the Axiom Genomic Database;  <br> &nbsp; indel size ranges from 1-138bp. 
 +
|-
 +
! scope="row" |
 +
| align="right" |
 +
| align="right" |
 +
|
 +
|-
 +
! bgcolor="lightblue" scope="row" border = "1"| Total Number Target Variants
 +
| bgcolor="lightblue"  align="right" | 369,656
 +
| bgcolor="lightblue"  align="right" | 318,983
 +
| bgcolor="lightblue" |
 +
|}