Exome Chip Design
This page describes characteristics of variants for proposed exome chip genotyping arrays. Illumina and Affymetrix are currently implementing arrays based on the design principles described here and the variant lists are available to any others who would like to design similar arrays and commit to making them broadly available to the scientific community.
- Slides describing exome chip design
- Description of exome chip design process and goals
- FTP site with list of variants selected for design
- Excel format files with list of nonsynonymous, splice and stop variants
We thank all the individuals who generously shared sequence data and marker lists prior to publication. Without their dedication to great science, this effort would not have been possible.
Coordination of the Exome Chip effort was the responsibility of the following individuals (in alphabetical order):
- Goncalo Abecasis (University of Michigan)
- David Altshuler (Broad Institute)
- Michael Boehnke (University of Michigan)
- Mark Daly (Massachusetts General Hospital)
- Mark McCarthy (University of Oxford)
- Debbie Nickerson (University of Washington)
- Steve Rich (University of Virginia)
A small group of individuals, led by Mark Daly and Goncalo Abecasis, was responsible for content selection.
Key members of this group included:
- Benjamin Neale (Massachusetts General Hospital)
- Kyle Gaulton (University of Oxford)
- Goo Jun (University of Michigan)
- Hyun Min Kang (University of Michigan)
- Shaun Purcell (Mount Sinai School of Medicine / Broad Institute)
- Manuel Rivas (University of Oxford)
- Josh Smith (University of Washington)
|Dataset||Traits||Ancestry||Samples||Contributors|
|NHLBI Exome Sequencing Project (5 tranches)||Cardiovascular Traits, Lung Traits, Obesity||European, African American||4260||D. Nickerson, D. Altshuler and S. Rich|
|Autism (2 tranches)||Autism||European||1778||M. Daly and R. Gibbs|
|GO T2D (2 tranches)||Type 2 diabetes||European||1618||D. Altshuler, M. Boehnke, M. McCarthy|
|1000 Genomes Project (2 tranches)||Random Sample||Diverse||1128||H. M. Kang|
|Sweden Schizophrenia Study||Schizophrenia||European||525||S. Purcell and P. Sklar|
|SardiNIA||Random Sample||European||508||F. Cucca and G. Abecasis|
|Sanger / CoLaus||Overweight, Diabetes, Fasting Glucose||European||456||I. Barroso|
|Cancer Genome Atlas||Cancer||European||422||S. Gabriel|
|T2D Genes||Type 2 diabetes||Hispanic||362||J. Blangero and G. Abecasis|
|Cancer Cohort Study||Cancer||Chinese||327||W. Zheng|
|Pfizer – MGH – Broad||Type 2 diabetes extremes of risk||European||182||D. Altshuler|
|Lipid Extremes||Lipid Extremes||European||131||S. Kathiresan and D. Rader|
|Int’l HIV Controllers Study||HIV Controllers||European||121||P. De Bakker|
|SAEC DILI (merged w/Autism tranches)||Augmentin DILI||European||117||M. Daly and A. Holden|
|I2B2 - Major Depression||Major Depression, Major Depressive Disorder||European||50||R. Perlis and J. Smoller|
|BMI Extremes||BMI Extremes||European||46||J. Hirschhorn|
The goal of this array is to enable an intermediate experiment between current genotyping arrays, which focus on relatively common variants, and exome sequencing of very large numbers of samples, which will enable examination of coding variants down to singletons. The array aims to include coding variants seen several times in existing sequence datasets. Toward this end, we have assembled information on ~12,000 sequenced genomes and exomes and catalogued, for each variant that potentially affects protein structure, the total number of times it was seen and the total number of datasets that included the variant. Our working definition of a variant that has been seen "several times" focuses on nonsynonymous variants seen at least 3 times across at least 2 datasets. A more lenient criterion was used for splice and nonsense variants.
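The inclusion rule can be written down as a small filter. The following Python sketch is illustrative only: the record layout and names (`VariantTally`, `is_candidate`) are assumptions, and the thresholds for splice and stop variants (seen 2 or more times and in 2 or more studies) are the ones stated for those classes further down this page.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class VariantTally:
    """Counts for one variant, pooled across the contributing datasets."""
    variant_id: str      # hypothetical identifier, e.g. "chr1:12345:A:G"
    category: str        # "nonsynonymous", "splice" or "stop"
    times_seen: int      # total number of times the variant was seen
    datasets_seen: int   # number of datasets that included the variant


# (minimum times seen, minimum datasets) per variant category.
THRESHOLDS = {
    "nonsynonymous": (3, 2),
    "splice": (2, 2),
    "stop": (2, 2),
}


def is_candidate(v: VariantTally) -> bool:
    """Return True if the variant meets the inclusion rule for its category."""
    min_times, min_datasets = THRESHOLDS[v.category]
    return v.times_seen >= min_times and v.datasets_seen >= min_datasets
```

Under this sketch, a nonsynonymous variant seen five times but in a single dataset is rejected, while a splice variant seen twice in two datasets is kept.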
Variants that Alter Protein Function
In the genome of an average individual (as represented by the exome sequenced individuals contributed for chip design), we expect to see ~8,000 - 10,000 nonsynonymous variants, ~200 - 300 splice variants and ~80 - 100 stop altering variants.
We tallied 1,107,051 nonsynonymous variants seen at least once across ~12,000 sequenced samples. Among these nonsynonymous variants, the majority were seen only once (646,888) or twice (163,044), as expected. Of the remaining variants (297,119), a total of 260,054 were seen in at least 2 datasets and are considered as candidates for inclusion in exome SNP arrays. The transition/transversion ratio of this class of variants was 2.0 in the full set of variants and 2.49 in the set of variants that were seen at least three times and in two or more studies.
The set of variants selected for array design is estimated to include 97-98% of the nonsynonymous variants detected in an average genome through exome sequencing.
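The transition/transversion ratio quoted above compares the number of purine-to-purine or pyrimidine-to-pyrimidine changes (transitions) with the number of purine-to-pyrimidine changes (transversions). A minimal Python sketch of the calculation is shown here; the function names and the input format (a list of reference/alternate allele pairs) are assumptions made for illustration, not the pipeline used for the chip design.

```python
PURINES = frozenset("AG")
PYRIMIDINES = frozenset("CT")


def is_transition(ref: str, alt: str) -> bool:
    """True for A<->G or C<->T changes."""
    pair = {ref.upper(), alt.upper()}
    return len(pair) == 2 and (pair <= PURINES or pair <= PYRIMIDINES)


def is_transversion(ref: str, alt: str) -> bool:
    """True when one allele is a purine and the other a pyrimidine."""
    ref, alt = ref.upper(), alt.upper()
    return (ref in PURINES and alt in PYRIMIDINES) or (
        ref in PYRIMIDINES and alt in PURINES
    )


def ti_tv_ratio(snvs) -> float:
    """Ti/Tv ratio for an iterable of (ref, alt) single-nucleotide changes."""
    snvs = list(snvs)
    ti = sum(1 for ref, alt in snvs if is_transition(ref, alt))
    tv = sum(1 for ref, alt in snvs if is_transversion(ref, alt))
    if tv == 0:
        raise ValueError("no transversions observed; ratio is undefined")
    return ti / tv
```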
We tallied 44,529 splice variants seen at least once across ~12,000 sequenced samples. Among these splice variants, the majority were seen only once (27,265). Of the remaining variants (17,264), a total of 12,662 were seen in at least 2 datasets and are considered as candidates for inclusion in exome SNP arrays. The transition/transversion ratio of this class of variants was 2.13 for all variants and 2.94 for variants that met criteria for inclusion in these arrays (being seen 2 or more times and in 2 or more studies).
We estimate the candidate list of variants includes 94-95% of the splice altering variants detected in an average genome through exome sequencing.
Stop Altering Variants
We tallied 31,003 stop altering variants (stop gains or losses) seen at least once across ~12,000 sequenced samples. Among these stop altering variants, the majority were seen only once (20,637). Of the remaining variants (10,366), a total of 7,137 were seen in at least 2 datasets and are considered as candidates for inclusion in exome SNP arrays. The transition/transversion ratio of this class of variants was 1.68 for all variants and 2.29 for variants that met criteria for inclusion in these arrays (being seen 2 or more times and in 2 or more studies).
We estimate the candidate list of variants includes 94-95% of the stop altering variants detected in an average genome through exome sequencing.
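The tallies for the three variant classes can be checked with a few lines of arithmetic. The Python sketch below uses only the counts reported above; the dictionary layout and the name `summarize` are illustrative assumptions. It makes one point explicit: the candidates are roughly a quarter of all distinct variants observed, even though they are estimated to capture well over 90% of the protein altering variants found in an average genome.

```python
# Counts copied from the text above. "too_rare" is the number of variants
# seen too few times to qualify (once or twice for nonsynonymous variants,
# once for splice and stop altering variants).
TALLIES = {
    "nonsynonymous": {"total": 1_107_051, "too_rare": 646_888 + 163_044, "candidates": 260_054},
    "splice": {"total": 44_529, "too_rare": 27_265, "candidates": 12_662},
    "stop": {"total": 31_003, "too_rare": 20_637, "candidates": 7_137},
}


def summarize(tally: dict) -> dict:
    """Derive the 'remaining' count and the candidate fractions for one class."""
    remaining = tally["total"] - tally["too_rare"]
    return {
        "remaining": remaining,
        "fraction_of_all_variants": tally["candidates"] / tally["total"],
        "fraction_of_remaining": tally["candidates"] / remaining,
    }


if __name__ == "__main__":
    for name, tally in TALLIES.items():
        s = summarize(tally)
        print(f"{name}: remaining={s['remaining']:,} "
              f"candidates/all={s['fraction_of_all_variants']:.3f} "
              f"candidates/remaining={s['fraction_of_remaining']:.3f}")
```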
List of Candidate Variants
The full list of candidate variants is available by anonymous ftp. Each variant is annotated with the number of times it was seen, the number of studies in which it was seen, and its impact on protein sequence.
In addition to the core content of protein altering variants, described above, we aggregated information on several classes of interesting variation that will help in the interpretation of the results of exome sequencing experiments.
Tags for Previously Described GWAS Hits
We collected all SNPs reported as associated in the NHGRI list, as of August 16, 2011. This list was augmented with unpublished hits from consortia working on diabetes, blood lipids, blood pressure, lung function, myocardial infarction, anthropometric traits, psychiatric traits, Crohn's disease and age related macular degeneration, resulting in a total of 5,542 GWAS hit SNPs. We were inclusive in our SNP selection criteria, including all SNPs in the NHGRI list whether or not they reached the standard p < 5x10^-8 criterion for genomewide significance.
Ancestry Informative Markers
African American vs European Ancestry
We selected a grid of 3,388 markers (distributed approximately one per megabase, across the autosomes and the X chromosome) that showed strong differentiation between African- and European-ancestry samples sequenced by the 1000 Genomes Project. Markers previously genotyped on the Illumina Omni 2.5M array were favored and markers with A/T or G/C alleles were avoided.
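A/T and G/C SNPs are commonly avoided because the two alleles are complements of each other, so the strand of a genotype call cannot be inferred from the alleles alone. The Python sketch below shows one way to build such a grid: drop strand-ambiguous markers, then keep the best-scoring marker in each 1 Mb bin. The `Marker` record and its `score` field (standing in for a differentiation measure) are hypothetical, and the preference for markers already on the Illumina Omni 2.5M array is not modeled.

```python
from dataclasses import dataclass

# Allele pairs that read the same on both strands.
STRAND_AMBIGUOUS = {frozenset("AT"), frozenset("CG")}


@dataclass(frozen=True)
class Marker:
    name: str
    chrom: str
    pos: int        # base-pair position
    allele1: str
    allele2: str
    score: float    # hypothetical differentiation score; higher is better


def is_strand_ambiguous(m: Marker) -> bool:
    """True for A/T and G/C SNPs."""
    return frozenset((m.allele1.upper(), m.allele2.upper())) in STRAND_AMBIGUOUS


def pick_grid(markers, bin_size: int = 1_000_000):
    """Keep the highest-scoring unambiguous marker in each (chrom, bin)."""
    best = {}
    for m in markers:
        if is_strand_ambiguous(m):
            continue
        key = (m.chrom, m.pos // bin_size)
        if key not in best or m.score > best[key].score:
            best[key] = m
    return sorted(best.values(), key=lambda m: (m.chrom, m.pos))
```

Calling `pick_grid(markers, bin_size=500_000)` would give the denser spacing of one marker per 500 kb.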
Native American vs European Ancestry
A grid of 1,000 markers was selected to be informative for Native American vs. European ancestry. These AIMs were selected to be in low linkage disequilibrium with one another (defined as r2 <= 0.1 in Native American populations, to be conservative) and widely separated (by requiring that they be at least 250 kb from other European vs. Native American ancestry AIMs). SNPs with significant within-continent heterogeneity were excluded.
These markers were previously genotyped in three samples of European ancestry (consisting of CEU and TSI samples and a population of Spaniards) and six samples of Native Americans (Mayan, Nahuan, Zapoteca, Tepehuano, Quechuan and Aymaran).
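The spacing and linkage disequilibrium constraints can be enforced with a simple greedy pass. The Python sketch below is one possible approach, not necessarily the procedure that was used: candidates are visited from most to least informative, and a candidate is kept only if it is at least 250 kb from, and has r2 <= 0.1 with, every marker already kept on the same chromosome. The `Aim` record, its `informativeness` score and the `r2` lookup function are all hypothetical.

```python
from typing import Callable, Iterable, List, NamedTuple


class Aim(NamedTuple):
    name: str
    chrom: str
    pos: int                 # base-pair position
    informativeness: float   # hypothetical ranking score; higher is better


def select_aims(
    candidates: Iterable[Aim],
    r2: Callable[[str, str], float],
    min_distance: int = 250_000,
    max_r2: float = 0.1,
) -> List[Aim]:
    """Greedily keep AIMs that are far apart and in low LD with each other."""
    selected: List[Aim] = []
    for cand in sorted(candidates, key=lambda a: a.informativeness, reverse=True):
        compatible = all(
            cand.chrom != s.chrom  # markers on different chromosomes never conflict here
            or (
                abs(cand.pos - s.pos) >= min_distance
                and r2(cand.name, s.name) <= max_r2
            )
            for s in selected
        )
        if compatible:
            selected.append(cand)
    return selected
```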
Scaffold for Identity by Descent
We selected a grid of 5,710 markers (distributed approximately one per 500 kb across the autosomes and the X chromosome) that showed little differentiation between African-, European- and Asian-ancestry samples sequenced by the 1000 Genomes Project and had allele frequency close to 0.50. Markers previously genotyped on the Illumina Omni 2.5M array were favored and markers with A/T or G/C alleles were avoided.
These markers can be used to identify stretches of identity by descent between apparently unrelated individuals or to support linkage-based analyses in appropriate family samples.
Functionally interesting variants
A small set of SNPs of high interest to groups participating in the NHLBI Exome Sequencing Project was included.
Random set of synonymous variants (as comparator)
A set of 5,000 synonymous variants was sampled at random. It is anticipated that these markers might be useful as a comparator (for example, for genomic control based analyses) when interpreting the results of assays for coding variants.
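As an illustration of a genomic control based analysis, the inflation factor lambda can be estimated from association test statistics at the synonymous comparator SNPs and then used to deflate statistics at coding variants. The Python sketch below assumes 1-degree-of-freedom chi-square statistics as input; the function names are hypothetical and this is a minimal outline rather than a recommended analysis plan.

```python
from statistics import median

# Median of the chi-square distribution with 1 degree of freedom.
CHI2_1DF_MEDIAN = 0.4549364


def genomic_control_lambda(null_chi2_stats) -> float:
    """Inflation factor estimated from statistics at presumed-null markers."""
    return median(null_chi2_stats) / CHI2_1DF_MEDIAN


def adjust_statistic(chi2_stat: float, lam: float) -> float:
    """Deflate a test statistic; lambda values below 1 are treated as 1."""
    return chi2_stat / max(lam, 1.0)
```

Here `null_chi2_stats` would hold the statistics observed at the synonymous SNPs, and `adjust_statistic` would be applied to each coding variant of interest.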
A set of 274 SNPs currently used as fingerprint SNPs at the University of Washington and Broad Institute were included. These SNPs are shared among several major genotyping platforms and facilitate sample tracking.
A set of 246 coding variants in the mitochondria, drawn from the 1000 Genomes Project.
Chromosome Y SNPs
A set of 188 Y chromosome SNPs contributed by Jim Wilson at Sanger, based on analyses of 1000 Genomes data and European haplogroups.
HLA tag SNPs
A set of 2,536 HLA tag SNPs selected by Paul de Bakker.
Paul De Bakker's HLA tag SNPs are listed here: http://www.broadinstitute.org/~debakker/hla_tags_exome.txt
Second Generation Arrays
A second generation of exome arrays will be available in 2013, from both Illumina and Affymetrix. In addition to the original exome array content, each of these includes a grid of SNPs across the genome that facilitates analysis of common variants when a suitably large reference panel is available. These grids were selected both to ensure good coverage of the genome and to ensure that assays could be manufactured inexpensively, taking into account proprietary platform specific constraints.
To evaluate the accuracy of these grids, we have used data from the GO T2D project (led by David Altshuler, Mike Boehnke and Mark McCarthy). The dataset includes ~2,650 individuals who have been whole genome sequenced (depth ~4x) and whole exome sequenced (depth ~80-100x). We focused on chr20 and 600 samples from the UK and, for each of these samples in turn, tried to impute missing genotypes using the remaining sequenced individuals as a reference panel. The results show that both the Affymetrix and Illumina arrays are expected to provide excellent coverage of the genome provided that a suitably large reference panel is available.
The specific numbers are that:
- For variants with MAF >1%, the average r2 correlation between imputed and true genotypes will be 0.8879 (Affymetrix) and 0.8590 (Illumina).
- For variants with MAF >5%, the average r2 correlation between imputed and true genotypes will be 0.9458 (Affymetrix) and 0.9256 (Illumina).
The fraction of variants imputed with r2 > 0.80 will be:
- For variants with MAF >1%, 82.1% (Affymetrix) and 76.5% (Illumina)
- For variants with MAF >5%, 94.6% (Affymetrix) and 89.2% (Illumina)
The evaluation is based on imputation of ~600 UK samples that have been whole genome and whole exome sequenced (comparing imputed genotypes with the sequence-based calls) and using a panel of 2,650 sequenced individuals from the GO T2D project (Altshuler, Boehnke, McCarthy) as a reference.
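The two summaries reported above (average r2 and the fraction of variants with r2 > 0.80) can be computed per variant and then aggregated. The Python sketch below assumes that, for each variant, imputed dosages and sequence-based genotypes are available as equal-length lists coded 0 to 2; the function names are illustrative and this is not the evaluation code used to produce the numbers above.

```python
import math


def squared_correlation(imputed, truth) -> float:
    """Squared Pearson correlation between imputed dosages and true genotypes."""
    n = len(imputed)
    if n != len(truth) or n < 2:
        raise ValueError("need two equal-length sequences with at least 2 samples")
    mean_i = sum(imputed) / n
    mean_t = sum(truth) / n
    s_ii = sum((x - mean_i) ** 2 for x in imputed)
    s_tt = sum((y - mean_t) ** 2 for y in truth)
    s_it = sum((x - mean_i) * (y - mean_t) for x, y in zip(imputed, truth))
    if s_ii == 0 or s_tt == 0:
        return math.nan  # undefined when either vector shows no variation
    return (s_it * s_it) / (s_ii * s_tt)


def summarize_r2(per_variant_r2, threshold: float = 0.80):
    """Return (mean r2, fraction of variants with r2 above the threshold)."""
    values = [v for v in per_variant_r2 if not math.isnan(v)]
    if not values:
        raise ValueError("no variants with a defined r2")
    mean_r2 = sum(values) / len(values)
    fraction_above = sum(v > threshold for v in values) / len(values)
    return mean_r2, fraction_above
```

To reproduce the MAF-stratified summaries, the per-variant values would first be restricted to variants with MAF >1% or MAF >5% before calling `summarize_r2`.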
Illumina Exome Arrays
Design criteria included a requirement for an assay design score >= 0.50, a primer that didn't overlap a nearby variant with minor allele count >100, and a primer that didn't map with 0, 1 or 2 mismatches to other genomic locations. Whenever possible, we favored assays that extended into exons (rather than into introns) so as to maximize the utility of these arrays when applied to RNA samples. Assay design failures appeared to be largely independent of frequency.
In the Illumina platform, 243,094 of the original set of 275,165 coding variants (non-synonymous, stop and splice) passed assay design criteria. We expect that 80-90% of the variants that pass design criteria will ultimately be included in genotyping arrays.
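The three design criteria listed above combine into a single pass/fail check. The Python sketch below is a hedged illustration: the `AssayCandidate` record and its field names are hypothetical, and the design score and primer mapping results are assumed to come from upstream tools that are not shown.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class AssayCandidate:
    design_score: float
    # Largest minor allele count among known variants under the primer (0 if none).
    max_overlapping_variant_mac: int
    # Number of other genomic locations the primer maps to with 0, 1 or 2 mismatches.
    off_target_matches: int


def passes_design(c: AssayCandidate) -> bool:
    """Apply the three design criteria described in the text."""
    return (
        c.design_score >= 0.50
        and c.max_overlapping_variant_mac <= 100
        and c.off_target_matches == 0
    )
```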
Assay Design Rates
|SNP Set||Number of Candidate Variants||Number Passing Assay Design||Notes|
|Coding Content||275,165||243,094||An additional set of 8,242 SNPs that were unique to the 1000 Genomes Project and populations under-represented in the design was added.|
|GWAS Tag SNPs||5,763||5,325|
|Grid of Common Variants||5,710||5,286|
|Randomly Selected Synonymous SNPs||5,000||4,651||For 1,000 SNPs, assays were generated on both strands in order to facilitate QC efforts and future development of methods for genotyping of rare variants.|
|AIM - African Ancestry||3,388||3,241|
|AIM - Native American Ancestry||1,000||998|
|MicroRNA Target Sites||285||270|
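Assay design pass rates follow directly from the table. The short Python sketch below copies the table's two count columns, read here as variants submitted and variants passing assay design (an interpretation based on the text above), and computes the rate for each SNP set; the names `DESIGN_TABLE` and `pass_rates` are illustrative.

```python
# (submitted for design, passed design) per SNP set, copied from the table above.
DESIGN_TABLE = {
    "Coding Content": (275_165, 243_094),
    "GWAS Tag SNPs": (5_763, 5_325),
    "Grid of Common Variants": (5_710, 5_286),
    "Randomly Selected Synonymous SNPs": (5_000, 4_651),
    "AIM - African Ancestry": (3_388, 3_241),
    "AIM - Native American Ancestry": (1_000, 998),
    "MicroRNA Target Sites": (285, 270),
}


def pass_rates(table: dict) -> dict:
    """Fraction of submitted variants that passed design, per SNP set."""
    return {name: passed / submitted for name, (submitted, passed) in table.items()}


if __name__ == "__main__":
    for name, rate in pass_rates(DESIGN_TABLE).items():
        print(f"{name}: {rate:.1%}")
```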
Sites to be Careful About
Peter Chines, working with Francis Collins, provided a list of 333 exome chip variant sites that should be treated with caution. The sites include variants for which the SNP probe differs from the expected reference genome sequence, could not be mapped back to the reference, mapped to multiple places, or where neither allele matches the reference genome.
Affymetrix Exome Arrays
Coding Variants: Design Criteria
Probe sequences were a priori excluded if there was an adjacent polymorphism within 5bp of the target variant or if the cumulative genome-frequency count of each 16-mer in the probe exceeded 300. The array was wet-lab validated against HapMap 270 and ~1000 Genomes Sample Collections.
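The two exclusion rules can be sketched in Python as follows. This is an interpretation, not the vendor's implementation: "cumulative genome-frequency count" is read here as the sum, over every 16-mer in the probe, of the number of times that 16-mer occurs in the genome. The lookup table `genome_kmer_counts` and all function names are hypothetical.

```python
def kmers(seq: str, k: int = 16):
    """All overlapping k-mers of a sequence, in order."""
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]


def cumulative_kmer_count(probe: str, genome_kmer_counts: dict, k: int = 16) -> int:
    """Sum of genome-wide occurrence counts over every k-mer in the probe."""
    return sum(genome_kmer_counts.get(km, 0) for km in kmers(probe.upper(), k))


def has_adjacent_polymorphism(target_pos: int, variant_positions, window: int = 5) -> bool:
    """True if another known variant lies within `window` bp of the target."""
    return any(0 < abs(p - target_pos) <= window for p in variant_positions)


def probe_excluded(
    probe: str,
    target_pos: int,
    variant_positions,
    genome_kmer_counts: dict,
    max_cumulative_count: int = 300,
) -> bool:
    """Apply both exclusion rules described in the text."""
    return (
        has_adjacent_polymorphism(target_pos, variant_positions)
        or cumulative_kmer_count(probe, genome_kmer_counts) > max_cumulative_count
    )
```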
|Categories||Candidates||# wet lab validated & working on Axiom||Notes|
|Non-synonymous Coding SNPs / splice & stop|| || ||Includes 16K additional non-synonymous coding variants from the Axiom Genomic Database.|
|AIMs (Eur/African Ancestry)||3,388||3,283|
|AIMs (Native American Ancestry)||1,000||962|
|AIMs (Other)||271||271||Includes supplemental AIMs from the Latin American Cancer Epidemiology (LACE) Consortium.|
|Indels||56,095||35,137||Includes biallelic indels from the draft Phase 1 1000 Genomes Project and previously validated indels in the Axiom Genomic Database; indel size ranges from 1-138bp.|
|Total Number Target Variants||369,656||318,983|