Exome Chip Design

From Genome Analysis Wiki
Preliminary - Work in Progress

This page describes characteristics of variants for proposed exome chip genotyping arrays. Illumina and Affymetrix are currently implementing arrays based on the design principles described here and the variant lists are available to any others who would like to design similar arrays and commit to making them broadly available to the scientific community.

Online Content

Key Credit

We thank all the individuals who generously shared sequence data and marker lists prior to publication. Without their dedication to great science, this effort would not have been possible.

Coordination

Coordination of the Exome Chip effort was the responsibility of the following individuals (in alphabetical order):

  • Goncalo Abecasis (University of Michigan)
  • David Altshuler (Broad Institute)
  • Michael Boehnke (University of Michigan)
  • Mark Daly (Massachusetts General Hospital)
  • Mark McCarthy (University of Oxford)
  • Debbie Nickerson (University of Washington)
  • Steve Rich (University of Virginia)

Grunt Work

A small group of individuals, led by Mark Daly and Goncalo Abecasis, was responsible for content selection.

Key members of this group included:

  • Benjamin Neale (Massachusetts General Hospital)
  • Kyle Gaulton (University of Oxford)
  • Goo Jun (University of Michigan)
  • Hyun Min Kang (University of Michigan)
  • Shaun Purcell (Mount Sinai School of Medicine / Broad Institute)
  • Manuel Rivas (University of Oxford)
  • Josh Smith (University of Washington)

Content Contributors

Contributor | Enrichment | Major Ethnicity | N | Contact
NHLBI Exome Sequencing Project (5 tranches) | Cardiovascular Traits, Lung Traits, Obesity | European, African American | 4260 | D. Nickerson, D. Altshuler and S. Rich
Autism (2 tranches) | Autism | European | 1778 | M. Daly and R. Gibbs
GO T2D (2 tranches) | Type 2 diabetes | European | 1618 | D. Altshuler, M. Boehnke, M. McCarthy
1000 Genomes Project (2 tranches) | Random Sample | Diverse | 1128 | H. M. Kang
Sweden Schizophrenia Study | Schizophrenia | European | 525 | S. Purcell and P. Sklar
SardiNIA | Random Sample | European | 508 | F. Cucca and G. Abecasis
Sanger / CoLaus | Overweight, Diabetes, Fasting Glucose | European | 456 | I. Barroso
Cancer Genome Atlas | Cancer | European | 422 | S. Gabriel
T2D Genes | Type 2 diabetes | Hispanic | 362 | J. Blangero and G. Abecasis
Cancer Cohort Study | Cancer | Chinese | 327 | W. Zheng
Pfizer – MGH – Broad | Type 2 diabetes extremes of risk | European | 182 | D. Altshuler
Lipid Extremes | Lipid Extremes | European | 131 | S. Kathiresan and D. Rader
Int’l HIV Controllers Study | HIV Controllers | European | 121 | P. De Bakker
SAEC DILI (merged w/Autism tranches) | Augmentin DILI | European | 117 | M. Daly and A. Holden
I2B2 - Major Depression | Major Depression, Major Depressive Disorder | European | 50 | R. Perlis and J. Smoller
BMI Extremes | BMI Extremes | European | 46 | J. Hirschhorn

Chip Goals

The goal of this array is to enable an intermediate experiment between current genotyping arrays, which focus on relatively common variants, and exome sequencing of very large numbers of samples, which will enable examination of coding variants down to singletons. The array aims to include coding variants seen several times in existing sequence datasets. Towards this end, we have assembled information on ~12,000 sequenced genomes and exomes and catalogued, for each variant that potentially affects protein structure, the total number of times it was seen and the total number of datasets that included the variant. Our working definition of a variant that has been seen "several times" focuses on non-synonymous variants seen at least 3 times across at least 2 datasets. A more lenient criterion was used for splice and nonsense variants.
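The inclusion rule can be sketched as a simple filter. This is a minimal illustration, not the pipeline that was actually used: the class name, field names and the `THRESHOLDS` table are hypothetical, and the lenient thresholds for splice and stop variants (seen 2 or more times in 2 or more studies) are taken from the class-specific sections of this page.

```python
from dataclasses import dataclass


@dataclass
class VariantTally:
    """Hypothetical per-variant summary; field names are illustrative."""
    effect: str          # "nonsynonymous", "splice" or "stop"
    times_seen: int      # total observations across all datasets
    datasets_seen: int   # number of datasets that contain the variant


# (minimum times seen, minimum datasets) for each variant class.
THRESHOLDS = {
    "nonsynonymous": (3, 2),
    "splice": (2, 2),
    "stop": (2, 2),
}


def is_candidate(variant: VariantTally) -> bool:
    """Return True if the variant meets the thresholds for its class."""
    min_times, min_datasets = THRESHOLDS[variant.effect]
    return (variant.times_seen >= min_times
            and variant.datasets_seen >= min_datasets)
```

For example, a non-synonymous variant seen twice in two datasets is rejected, while a splice variant with the same counts is kept.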

Variants that Alter Protein Function

In the genome of an average individual (as represented by the exome sequenced individuals contributed for chip design), we expect to see ~8,000 - 10,000 nonsynonymous variants, ~200 - 300 splice variants and ~80 - 100 stop altering variants.

Non-synonymous Variants

We tallied 1,107,051 nonsynonymous variants seen at least once across ~12,000 sequenced samples. Among the non-synonymous variants, the majority were seen only once (646,888) or twice (163,044), as expected. Of the remaining variants (297,119), a total of 260,054 were seen in at least 2 datasets and are considered as candidates for inclusion in exome SNP arrays. The transition/transversion ratio of this class of variants was 2.0 in the full set of variants and 2.49 in the set of variants that were seen at least three times and in two or more studies.

The set of variants selected for array design is estimated to include 97-98% of the nonsynonymous variants detected in an average genome through exome sequencing.
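For readers unfamiliar with the statistic, the transition/transversion (Ti/Tv) ratio can be computed from a list of single nucleotide substitutions as in the sketch below. The function names and the input format (pairs of reference and alternate alleles) are assumptions made for illustration.

```python
PURINES = {"A", "G"}
PYRIMIDINES = {"C", "T"}


def is_transition(ref, alt):
    """True for purine<->purine or pyrimidine<->pyrimidine changes."""
    ref, alt = ref.upper(), alt.upper()
    if ref == alt or not {ref, alt} <= (PURINES | PYRIMIDINES):
        raise ValueError(f"not a single nucleotide substitution: {ref}>{alt}")
    return {ref, alt} == PURINES or {ref, alt} == PYRIMIDINES


def ti_tv_ratio(snvs):
    """snvs: list of (ref, alt) pairs. Returns transitions / transversions."""
    transitions = sum(1 for ref, alt in snvs if is_transition(ref, alt))
    transversions = len(snvs) - transitions
    if transversions == 0:
        raise ValueError("no transversions; ratio is undefined")
    return transitions / transversions
```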

Splice Variants

We tallied 44,529 splice variants seen at least once across ~12,000 sequenced samples. Among these splice variants, the majority were seen only once (27,265). Of the remaining variants (17,264), a total of 12,662 were seen in at least 2 datasets and are considered as candidates for inclusion in exome SNP arrays. The transition/transversion ratio of this class of variants was 2.13 for all variants and 2.94 for variants that met criteria for inclusion in these arrays (being seen 2 or more times and in 2 or more studies).

We estimate the candidate list of variants includes 94-95% of the splice altering variants detected in an average genome through exome sequencing.

Stop Altering Variants

We tallied 31,003 stop altering variants (stop gains or losses) seen at least once across ~12,000 sequenced samples. Among these stop altering variants, the majority were seen only once (20,637). Of the remaining variants (10,366), a total of 7,137 were seen in at least 2 datasets and are considered as candidates for inclusion in exome SNP arrays. The transition/transversion ratio of this class of variants was 1.68 for all variants and 2.29 for variants that met criteria for inclusion in these arrays (being seen 2 or more times and in 2 or more studies).

We estimate the candidate list of variants includes 94-95% of the stop altering variants detected in an average genome through exome sequencing.
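The tallies in the three sections above can be cross-checked with simple arithmetic. The sketch below recomputes the "remaining" count for each class and the fraction of all observed variants that became candidates. The numbers are copied from the text; the dictionary layout and function name are illustrative only.

```python
# Counts copied from the text above; the layout is illustrative.
TALLIES = {
    "nonsynonymous": {
        "total": 1_107_051,
        "excluded_by_count": 646_888 + 163_044,  # seen once or twice
        "reported_remaining": 297_119,
        "candidates": 260_054,
    },
    "splice": {
        "total": 44_529,
        "excluded_by_count": 27_265,             # seen once
        "reported_remaining": 17_264,
        "candidates": 12_662,
    },
    "stop": {
        "total": 31_003,
        "excluded_by_count": 20_637,             # seen once
        "reported_remaining": 10_366,
        "candidates": 7_137,
    },
}


def check_tallies(tallies):
    """Recompute remaining counts and candidate fractions per class."""
    rows = {}
    for name, t in tallies.items():
        remaining = t["total"] - t["excluded_by_count"]
        rows[name] = {
            "remaining": remaining,
            "matches_text": remaining == t["reported_remaining"],
            "candidate_fraction": t["candidates"] / t["total"],
        }
    return rows


if __name__ == "__main__":
    for name, row in check_tallies(TALLIES).items():
        print(name, row["remaining"], row["matches_text"],
              f"{row['candidate_fraction']:.1%}")
```

In each class roughly a quarter of the variants observed at least once become candidates, which is consistent with most observed variants being singletons.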

List of Candidate Variants

The full list of candidate variants is available by anonymous ftp. Each variant is annotated with the number of times it was seen, the number of studies in which it was seen, and its impact on protein sequence.

Additional Content

In addition to the core content of protein altering variants, described above, we aggregated information on several classes of interesting variation that will help in the interpretation of the results of exome sequencing experiments.

Tags for Previously Described GWAS Hits

We collected all SNPs reported as associated in the NHGRI list, as of August 16, 2011. This list was augmented with unpublished hits from consortia working on diabetes, blood lipids, blood pressure, lung function, myocardial infarction, anthropometric traits, psychiatric traits, Crohn's disease and age related macular degeneration, resulting in a total of 5,542 GWAS hit SNPs. We were inclusive in our SNP selection criteria, including all SNPs in the NHGRI list whether or not they reached the standard p < 5x10^-8 threshold for genome-wide significance.

Ancestry Informative Markers

African American vs European Ancestry

We selected a grid of 3,388 markers (distributed approximately one per megabase, across the autosomes and the X chromosome) that showed strong differentiation between African- and European-ancestry samples sequenced by the 1000 Genomes Project. Markers previously genotyped on the Illumina Omni 2.5M array were favored and markers with A/T or G/C alleles were avoided.
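One way to build such a grid is sketched below, under several assumptions that are not stated on this page: differentiation is summarized as an absolute allele frequency difference (`delta`), "favored" is read as a strict preference for markers already on the Omni 2.5M array, and exactly one marker is kept per 1 Mb bin. The marker dictionaries and function names are hypothetical. A/T and G/C markers are skipped because their alleles read the same on both strands, which makes strand errors hard to detect.

```python
AMBIGUOUS_PAIRS = ({"A", "T"}, {"C", "G"})


def is_strand_ambiguous(allele1, allele2):
    """A/T and G/C SNPs look identical on the forward and reverse strand."""
    return {allele1.upper(), allele2.upper()} in AMBIGUOUS_PAIRS


def pick_grid(markers, bin_size=1_000_000):
    """Keep the best-ranked unambiguous marker in each (chrom, bin).

    markers: iterable of dicts with keys "chrom", "pos", "alleles",
    "delta" (frequency difference) and "on_omni" (bool).
    """
    best = {}
    for m in markers:
        if is_strand_ambiguous(*m["alleles"]):
            continue
        key = (m["chrom"], m["pos"] // bin_size)
        # Tuples compare left to right: array membership first, then delta.
        rank = (m["on_omni"], m["delta"])
        if key not in best or rank > best[key][0]:
            best[key] = (rank, m)
    chosen = [m for _, m in best.values()]
    return sorted(chosen, key=lambda m: (m["chrom"], m["pos"]))
```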

Native American vs European Ancestry

A grid of 1,000 markers was selected to be informative for Native American vs. European ancestry. These AIMs were selected to be in low linkage disequilibrium with one another (defined as r2 <= 0.1 in Native American populations, to be conservative) and widely separated (each at least 250 kb from any other European vs Native American ancestry AIM). SNPs with significant within-continent heterogeneity were excluded.

These markers were previously genotyped in three samples of European ancestry (consisting of CEU and TSI samples and a population of Spaniards) and six samples of Native Americans (Mayan, Nahuan, Zapoteca, Tepehuano, Quechuan and Aymaran).
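The spacing and linkage disequilibrium constraints can be enforced with a greedy pass over candidates ranked by informativeness. The sketch below is one plausible procedure, not necessarily the one used for this list: the tuple layout, the `informativeness` score and the `r2` lookup function are all assumptions.

```python
def select_aims(candidates, r2, min_gap=250_000, max_r2=0.1):
    """Greedy selection of well-separated, low-LD markers.

    candidates: list of (marker_id, chrom, pos, informativeness).
    r2: callable (id_a, id_b) -> LD between two markers, assumed to be
        measured in the Native American reference samples.
    """
    kept = []
    # Visit the most informative markers first.
    for cand in sorted(candidates, key=lambda c: c[3], reverse=True):
        cand_id, chrom, pos, _ = cand
        compatible = all(
            k_chrom != chrom
            or (abs(k_pos - pos) >= min_gap and r2(cand_id, k_id) <= max_r2)
            for k_id, k_chrom, k_pos, _ in kept
        )
        if compatible:
            kept.append(cand)
    return kept
```

Because earlier picks block later ones, the result depends on the ranking; a different informativeness measure would yield a different grid.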

Scaffold for Identity by Descent

We selected a grid of 5,710 markers (distributed approximately one per 500 kb across the autosomes and the X chromosome) that showed little differentiation between African-, European- and Asian-ancestry samples sequenced by the 1000 Genomes Project and had allele frequency close to 0.50. Markers previously genotyped on the Illumina Omni 2.5M array were favored and markers with A/T or G/C alleles were avoided.

These markers can be used to identify stretches of identity by descent between apparently unrelated individuals or to support linkage-based analyses in appropriate family samples.
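A marker is most useful for this scaffold when its allele frequency is near 0.50 in every population. The score below is a hypothetical way to rank markers within a 500 kb window: it adds the spread of frequencies across populations to the distance of the mean frequency from 0.5, so lower is better. The weighting of the two terms is an assumption, as are the function names and the population labels in the usage note.

```python
def scaffold_score(freqs):
    """freqs: dict mapping population label -> allele frequency.

    Returns spread across populations plus distance of the mean from 0.5.
    """
    values = list(freqs.values())
    spread = max(values) - min(values)
    distance_from_half = abs(sum(values) / len(values) - 0.5)
    return spread + distance_from_half


def pick_scaffold_marker(window_markers):
    """window_markers: dict mapping marker id -> freqs dict for one window."""
    return min(window_markers, key=lambda m: scaffold_score(window_markers[m]))
```

For instance, a marker with frequencies of 0.48, 0.52 and 0.50 in three populations scores far better than one with 0.10, 0.60 and 0.30.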

Functionally interesting variants

A small set of SNPs of high interest to groups participating in the NHLBI Exome Sequencing Project was included.

Random set of synonymous variants (as comparator)

A set of 5,000 synonymous variants was sampled at random. It is anticipated that these markers might be useful as a comparator (for example, for genomic control based analyses) when interpreting the results of assays for coding variants.
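As an illustration of a genomic control based analysis, the sketch below estimates the inflation factor lambda from 1-df chi-square statistics at the synonymous comparator markers and uses it to deflate statistics at coding variants. This is the standard genomic control recipe applied under the assumption that the synonymous markers behave as null markers; the function names are hypothetical.

```python
from statistics import median

# Median of a chi-square distribution with 1 degree of freedom.
CHI2_1DF_MEDIAN = 0.4549364


def genomic_control_lambda(null_chisq):
    """Inflation factor estimated from statistics at presumed-null markers."""
    return median(null_chisq) / CHI2_1DF_MEDIAN


def adjust_statistics(test_chisq, lam):
    """Divide test statistics by lambda; never inflate them (lambda < 1)."""
    lam = max(lam, 1.0)
    return [x / lam for x in test_chisq]
```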

Fingerprint SNPs

A set of 274 SNPs currently used as fingerprint SNPs at the University of Washington and Broad Institute was included. These SNPs are shared among several major genotyping platforms and facilitate sample tracking.

Mitochondrial SNPs

A set of 246 coding variants in the mitochondria, drawn from the 1000 Genomes Project.

Chromosome Y SNPs

A set of 188 Y chromosome SNPs contributed by Jim Wilson at Sanger, based on analyses of 1000 Genomes data and European haplogroups.

HLA tag SNPs

A set of 2,536 HLA tag SNPs selected by Paul de Bakker. The SNPs are listed here: http://www.broadinstitute.org/~debakker/hla_tags_exome.txt

Second Generation Arrays

A second generation of exome arrays will be available in 2013, from both Illumina and Affymetrix. In addition to the original exome array content, each of these includes a grid of SNPs across the genome that facilitates analysis of common variants when a suitably large reference panel is available. These grids were selected both to ensure good coverage of the genome and to ensure that assays could be manufactured inexpensively, taking into account proprietary platform specific constraints.

To evaluate the accuracy of these grids, we have used data from the GO T2D project (led by David Altshuler, Mike Boehnke and Mark McCarthy). The dataset includes ~2,650 individuals who have been whole genome sequenced (depth ~4x) and whole exome sequenced (depth ~80-100x). We focused on chr20 and 600 samples from the UK and, for each of these in turn, tried to impute missing genotypes using the remaining sequenced individuals as a reference panel. The results show both the Affymetrix and Illumina arrays are expected to provide excellent coverage of the genome provided that a suitably large reference panel is available.

The specific numbers are that:

  • For variants with MAF >1%, the average r2 correlation between imputed and true genotypes will be 0.8879 (Affymetrix) and 0.8590 (Illumina).
  • For variants with MAF >5%, the average r2 correlation between imputed and true genotypes will be 0.9458 (Affymetrix) and 0.9256 (Illumina).

The fraction of variants imputed with r2 > 0.80 will be:

  • For variants with MAF >1%, 82.1% (Affymetrix) and 76.5% (Illumina)
  • For variants with MAF >5%, 94.6% (Affymetrix) and 89.2% (Illumina)
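The two summaries above (average r2 and fraction of variants with r2 > 0.80) can be computed as in the following sketch. It assumes that r2 is the squared Pearson correlation between sequence-based genotypes coded 0/1/2 and imputed allele dosages; whether the original evaluation used dosages or best-guess genotypes is not stated here. The function names are hypothetical.

```python
import math


def imputation_r2(true_genotypes, imputed_dosages):
    """Squared Pearson correlation for one variant across samples."""
    n = len(true_genotypes)
    if n != len(imputed_dosages) or n < 2:
        raise ValueError("need two equal-length vectors with >= 2 samples")
    mean_t = sum(true_genotypes) / n
    mean_d = sum(imputed_dosages) / n
    cov = sum((t - mean_t) * (d - mean_d)
              for t, d in zip(true_genotypes, imputed_dosages))
    var_t = sum((t - mean_t) ** 2 for t in true_genotypes)
    var_d = sum((d - mean_d) ** 2 for d in imputed_dosages)
    if var_t == 0 or var_d == 0:
        return math.nan  # undefined when either vector is constant
    return cov * cov / (var_t * var_d)


def summarize_r2(r2_values, threshold=0.80):
    """Return (mean r2, fraction of variants with r2 above threshold)."""
    usable = [r for r in r2_values if not math.isnan(r)]
    if not usable:
        raise ValueError("no variants with a defined r2")
    mean_r2 = sum(usable) / len(usable)
    fraction_above = sum(r > threshold for r in usable) / len(usable)
    return mean_r2, fraction_above
```

In practice the per-variant values would be grouped by minor allele frequency (for example MAF >1% and MAF >5%) before calling `summarize_r2`.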

The evaluation is based on imputation of ~600 UK samples that have been whole genome and whole exome sequenced (comparing imputed genotypes and the sequence-based calls) and using a panel of 2,650 sequenced individuals from the GO T2D project (Altshuler, Boehnke, McCarthy) as a reference.

All evaluations used Minimac and were carried out by Christian Fuchsberger.

Illumina Exome Arrays

Coding Variants

Design criteria included a requirement for an assay design score >= 0.50, a primer that didn't overlap a nearby variant with minor allele count >100, and a primer that didn't map with 0, 1 or 2 mismatches to other genomic locations. Whenever possible, we favored assays that extended into exons (rather than into introns) so as to maximize the utility of these arrays when applied to RNA samples. Assay design failures appeared to be largely independent of frequency.
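These criteria amount to a per-assay filter followed by a preference among the assays that pass. The sketch below is illustrative only: the `AssayDesign` fields are hypothetical summaries of information a design tool would supply, and breaking ties by design score is an assumption.

```python
from dataclasses import dataclass


@dataclass
class AssayDesign:
    """Hypothetical summary of one candidate assay for a variant."""
    design_score: float
    max_overlapping_mac: int   # largest minor allele count under the primer
    offtarget_hits: int        # other genomic matches with 0-2 mismatches
    extends_into_exon: bool


def passes_design(assay: AssayDesign) -> bool:
    """Apply the three hard criteria described above."""
    return (assay.design_score >= 0.50
            and assay.max_overlapping_mac <= 100
            and assay.offtarget_hits == 0)


def choose_assay(options):
    """Among passing assays prefer exon-extending ones, then higher score."""
    passing = [a for a in options if passes_design(a)]
    if not passing:
        return None
    return max(passing, key=lambda a: (a.extends_into_exon, a.design_score))
```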

In the Illumina platform, 243,094 of the original set of 275,165 coding variants (non-synonymous, stop and splice) passed assay design criteria. We expect that 80-90% of the variants that pass design criteria will ultimately be included in genotyping arrays.
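The figures in this paragraph translate into a design pass rate and an expected range of coding variants on the final array. The sketch below only restates that arithmetic; the constant and function names are illustrative.

```python
# Counts copied from the paragraph above.
CANDIDATES = 275_165
PASSED_DESIGN = 243_094


def design_pass_rate(passed, candidates):
    """Fraction of candidate variants with a successful assay design."""
    return passed / candidates


def expected_on_array(passed, low=0.80, high=0.90):
    """Range implied by 80-90% of designed assays reaching the array."""
    return round(passed * low), round(passed * high)


if __name__ == "__main__":
    print(f"{design_pass_rate(PASSED_DESIGN, CANDIDATES):.1%}")
    print(expected_on_array(PASSED_DESIGN))
```

About 88% of coding candidates passed design, which implies roughly 194,000 to 219,000 coding variants on the manufactured array if the 80-90% expectation holds.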

Assay Design Rates

Illumina Assay Design Summary
SNP Set | Number of Candidates | Number of Successful Designs | Additional Notes
Coding Content | 275,165 | 243,094 | An additional set of 8,242 SNPs that were unique to the 1000 Genomes Project and populations under-represented in the design was added.
GWAS Tag SNPs | 5,763 | 5,325 |
Grid of Common Variants | 5,710 | 5,286 |
Randomly Selected Synonymous SNPs | 5,000 | 4,651 | For 1,000 SNPs, assays were generated on both strands in order to facilitate QC efforts and future development of methods for genotyping of rare variants.
AIM - African Ancestry | 3,388 | 3,241 |
AIM - Native American Ancestry | 1,000 | 998 |
HLA Tags | 2,536 | 2,459 |
ESP Requests | 1,003 | 843 |
Fingerprint SNPs | 285 | 259 |
MicroRNA Target Sites | 285 | 270 |
Mitochondrial Variants | 246 | 246 |
Chromosome Y | 188 | 128 |
Indels | 181 | 181 |

Sites to be Careful About

Peter Chines, working with Francis Collins, provided a list of 333 exome chip variant sites that should be treated with caution. The sites include variants for which the SNP probe differs from the expected reference genome sequence, could not be mapped back to the reference, mapped to multiple places, or where neither allele matches the reference genome.

A plain text list of these sites and corresponding descriptions is available.

Affymetrix Exome Arrays

Coding Variants: Design Criteria

Probe sequences were a priori excluded if there was an adjacent polymorphism within 5bp of the target variant or if the cumulative genome-frequency count of each 16-mer in the probe exceeded 300. The array was wet-lab validated against HapMap 270 and ~1000 Genomes Sample Collections.
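The two exclusion rules can be sketched as follows. This is one reading of the criteria, not Affymetrix's actual implementation: "cumulative genome-frequency count" is interpreted as the sum, over every 16-mer in the probe, of the number of times that 16-mer occurs in the genome, and `kmer_genome_counts` is a hypothetical precomputed lookup table.

```python
def probe_kmers(probe, k=16):
    """All overlapping k-mers in the probe sequence."""
    probe = probe.upper()
    return [probe[i:i + k] for i in range(len(probe) - k + 1)]


def exclude_probe(probe, kmer_genome_counts, nearest_polymorphism_bp=None,
                  max_cumulative_count=300, min_clear_bp=5):
    """Return True if the probe should be dropped before array design.

    nearest_polymorphism_bp: distance in bp from the target variant to the
    closest known polymorphism, or None if there is none nearby.
    """
    if (nearest_polymorphism_bp is not None
            and nearest_polymorphism_bp <= min_clear_bp):
        return True
    cumulative = sum(kmer_genome_counts.get(kmer, 0)
                     for kmer in probe_kmers(probe))
    return cumulative > max_cumulative_count
```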

Affymetrix Assay Design Summary
Categories | Candidates | # wet lab validated & working on Axiom | Comments
Non-synonymous Coding SNPs / splice & stop | 259,976 / 19,672 | 247,546 / 17,066 | Includes 16K additional non-synonymous coding variants from the Axiom Genomic Database.
GWAS | 5,542 | 5,053 |
Grid | 5,719 | 5,478 |
Synonymous cSNPs | 5,000 | 4,367 |
AIMs (Eur/African Ancestry) | 3,388 | 3,283 |
AIMs (Native American Ancestry) | 1,000 | 962 |
AIMs (Other) | 271 | 271 | Includes supplemental AIMs from the Latin American Cancer Epidemiology (LACE) Consortium.
HLA | 2,536 | 2,262 |
ESP | 1,003 | 952 |
Fingerprint | 285 | 268 |
miRNA | 285 | 250 |
Mitochondrial DNA | 246 | 207 |
Chromosome Y | 232 | 161 |
Indels | 56,095 | 35,137 | Includes biallelic indels from the draft Phase 1 1000 Genomes Project and previously validated indels in the Axiom Genomic Database; indel size ranges from 1-138bp.
Total Number Target Variants | 369,656 | 318,983 |