Exome Chip Design

From Genome Analysis Wiki
Preliminary - Work in Progress

This page describes characteristics of variants for proposed exome chip genotyping arrays. Illumina and Affymetrix are currently implementing arrays based on the design principles described here and the variant lists are available to any others who would like to design similar arrays and commit to making them broadly available to the scientific community.

Key Credit

We thank all the individuals who generously shared sequence data and marker lists prior to publication. Without their dedication to great science, this effort would not have been possible.

Coordination

Coordination of the Exome Chip effort was the responsibility of the following individuals (in alphabetical order):

  • Goncalo Abecasis (University of Michigan)
  • David Altshuler (Broad Institute)
  • Michael Boehnke (University of Michigan)
  • Mark Daly (Massachusetts General Hospital)
  • Mark McCarthy (University of Oxford)
  • Debbie Nickerson (University of Washington)
  • Steve Rich (University of Virginia)

Grunt Work

A small group of individuals, led by Mark Daly and Goncalo Abecasis, was responsible for content selection.

Key members of this group included:

  • Benjamin Neale
  • Goo Jun (University of Michigan)
  • Hyun Min Kang (University of Michigan)
  • Shaun Purcell (Mount Sinai School of Medicine / Broad Institute)
  • Josh Smith (University of Washington)

Chip Goals

The goal of this array is to enable an intermediate experiment between current genotyping arrays, which focus on relatively common variants, and exome sequencing of very large numbers of samples, which will enable examination of coding variants down to singletons. The array aims to include coding variants seen several times in existing sequence datasets. Towards this end, we have assembled information on ~12,000 sequenced genomes and exomes and catalogued, for each variant that potentially affects protein structure, the total number of times it was seen and the total number of datasets that included the variant. Our working definition of a variant that has been seen "several times" focuses on non-synonymous variants seen at least 3 times across at least 2 datasets. A more lenient criterion was used for splice and nonsense variants.
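As an illustration only, the inclusion rule can be sketched as a small filter. The class and function names below are hypothetical and not part of the actual selection pipeline. The thresholds for splice and stop altering variants (seen 2 or more times, in 2 or more studies) are the ones quoted in the sections further down this page.

```python
from dataclasses import dataclass


@dataclass
class VariantTally:
    """Hypothetical record: how often a variant was seen across the datasets."""
    variant_class: str   # "nonsynonymous", "splice", or "stop"
    times_seen: int      # total number of times the variant was observed
    datasets_seen: int   # number of datasets that included the variant


# Minimum (times seen, datasets) per variant class, as quoted on this page.
THRESHOLDS = {
    "nonsynonymous": (3, 2),
    "splice": (2, 2),
    "stop": (2, 2),
}


def is_candidate(v: VariantTally) -> bool:
    """Return True if the variant meets the 'seen several times' rule."""
    min_times, min_datasets = THRESHOLDS[v.variant_class]
    return v.times_seen >= min_times and v.datasets_seen >= min_datasets
```

Under this sketch, a nonsynonymous variant seen twice in two datasets would be rejected, while a splice variant with the same tallies would be kept.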

Variants that Alter Protein Function

In the genome of an average individual (as represented by the exome sequenced individuals contributed for chip design), we expect to see ~8,000 - 10,000 nonsynonymous variants, ~200 - 300 splice variants and ~80 - 100 stop altering variants.

Non-synonymous Variants

We tallied 1,107,051 nonsynonymous variants seen at least once across ~12,000 sequenced samples. Among these variants, the majority were seen only once (646,888) or twice (163,044), as expected. Of the remaining variants (297,119), a total of 260,054 were seen in at least 2 datasets and are considered candidates for inclusion in exome SNP arrays. The transition/transversion ratio of this class of variants was 2.0 in the full set of variants and 2.49 in the set of variants that were seen at least three times and in two or more studies.
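For readers unfamiliar with the transition/transversion ratio quoted here, the following minimal sketch shows how it could be computed from a list of reference/alternate allele pairs. The function names are illustrative and the input format is an assumption. This is not the code used to produce the figures on this page.

```python
PURINES = {"A", "G"}
PYRIMIDINES = {"C", "T"}


def is_transition(ref: str, alt: str) -> bool:
    """A transition swaps purine for purine (A<->G) or pyrimidine for pyrimidine (C<->T)."""
    pair = {ref.upper(), alt.upper()}
    return len(pair) == 2 and (pair <= PURINES or pair <= PYRIMIDINES)


def ti_tv_ratio(snps):
    """Compute transitions / transversions for an iterable of (ref, alt) SNP alleles."""
    ti = tv = 0
    for ref, alt in snps:
        if ref.upper() == alt.upper():
            continue  # not a substitution; ignore
        if is_transition(ref, alt):
            ti += 1
        else:
            tv += 1
    if tv == 0:
        raise ValueError("no transversions observed; ratio is undefined")
    return ti / tv
```

A higher ratio in the filtered set, as reported above, is commonly read as a sign that the filtered set contains fewer sequencing artifacts.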

The set of variants selected for array design is estimated to include 97-98% of the nonsynonymous variants detected in an average genome through exome sequencing.

Splice Variants

We tallied 44,529 splice variants seen at least once across ~12,000 sequenced samples. Among these splice variants, the majority were seen only once (27,265). Of the remaining variants (17,264), a total of 12,662 were seen in at least 2 datasets and are considered candidates for inclusion in exome SNP arrays. The transition/transversion ratio of this class of variants was 2.13 for all variants and 2.94 for variants that met criteria for inclusion in these arrays (being seen 2 or more times and in 2 or more studies).

We estimate the candidate list of variants includes 94-95% of the splice altering variants detected in an average genome through exome sequencing.

Stop Altering Variants

We tallied 31,003 stop altering variants (stop gains or losses) seen at least once across ~12,000 sequenced samples. Among these stop altering variants, the majority were seen only once (20,637). Of the remaining variants (10,366), a total of 7,137 were seen in at least 2 datasets and are considered candidates for inclusion in exome SNP arrays. The transition/transversion ratio of this class of variants was 1.68 for all variants and 2.29 for variants that met criteria for inclusion in these arrays (being seen 2 or more times and in 2 or more studies).

We estimate the candidate list of variants includes 94-95% of the stop altering variants detected in an average genome through exome sequencing.

List of Candidate Variants

The full list of candidate variants is available by anonymous ftp. Each variant is annotated with the number of times it was seen, the number of studies in which it was seen, and its impact on protein sequence.

Additional Content

In addition to the core content of protein altering variants, described above, we aggregated information on several classes of interesting variation that will help in the interpretation of the results of exome sequencing experiments.

Tags for Previously Described GWAS Hits

We collected all SNPs reported as associated in the NHGRI list, as of August 16, 2011. This list was augmented with unpublished hits from consortia working on diabetes, blood lipids, blood pressure, lung function, myocardial infarction, anthropometric traits, psychiatric traits, Crohn's disease and age related macular degeneration, resulting in a total of 5,542 GWAS hit SNPs. We were inclusive in our SNP selection criteria, including all SNPs in the NHGRI list whether or not they reached the standard p < 5x10^-8 criterion for genomewide significance.

Ancestry Informative Markers

African American vs European Ancestry

We selected a grid of 3,388 markers (distributed approximately one per megabase, across the autosomes and the X chromosome) that showed strong differentiation between African- and European-ancestry samples sequenced by the 1000 Genomes Project. Markers previously genotyped on the Illumina Omni 2.5M array were favored and markers with A/T or G/C alleles were avoided.
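The grid selection described above could be sketched roughly as follows. This is an assumption-laden illustration, not the procedure actually used. The page does not state how differentiation was measured or how strongly Omni 2.5M markers were favored, so the absolute allele frequency difference, the `min_delta` cutoff, and the "array membership first" ranking are all hypothetical choices, as are the field names.

```python
# A/T and G/C SNPs read the same on both strands, which makes strand errors hard to detect.
AMBIGUOUS = {frozenset("AT"), frozenset("CG")}


def is_strand_ambiguous(a1: str, a2: str) -> bool:
    return frozenset((a1.upper(), a2.upper())) in AMBIGUOUS


def pick_grid(markers, bin_size=1_000_000, min_delta=0.4):
    """Pick at most one marker per (chromosome, bin).

    markers: iterable of dicts with keys chrom, pos, a1, a2,
             freq_afr, freq_eur, on_omni (all hypothetical field names).
    """
    best = {}
    for m in markers:
        if is_strand_ambiguous(m["a1"], m["a2"]):
            continue
        delta = abs(m["freq_afr"] - m["freq_eur"])
        if delta < min_delta:
            continue  # not differentiated enough (cutoff is an assumption)
        key = (m["chrom"], m["pos"] // bin_size)
        # Prefer markers already on the array, then the larger frequency difference.
        score = (bool(m["on_omni"]), delta)
        if key not in best or score > best[key][0]:
            best[key] = (score, m)
    return [m for _, m in best.values()]
```

With `bin_size` set to one megabase this yields roughly one marker per megabase wherever a suitable marker exists.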

Native American vs European Ancestry

A grid of 1,000 markers was selected to be informative for Native American vs. European ancestry. These AIMs were selected to be in low linkage disequilibrium with one another (defined as R^2 <= 0.1 in Native American populations, to be conservative) and widely separated (by requiring that they be at least 250 kb from other European vs. Native American ancestry AIMs). SNPs with significant within-continent heterogeneity were excluded.

These markers were previously genotyped in three samples of European ancestry (consisting of CEU and TSI samples and a population of Spaniards) and six samples of Native Americans (Mayan, Nahuan, Zapoteca, Tepehuano, Quechuan and Aymaran).
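One way to satisfy the spacing and linkage disequilibrium constraints described in this section is a greedy pass over markers ranked by informativeness. The sketch below is an assumption, since the page does not say which algorithm was used. The tuple layout, the `score` ranking, and the caller-supplied `r2` function are all hypothetical.

```python
def prune_aims(markers, r2, min_gap=250_000, max_r2=0.1):
    """Greedily keep the most informative markers that respect both constraints.

    markers: list of (name, chrom, pos, score) tuples; higher score = more informative.
    r2:      callable (name_a, name_b) -> pairwise LD, supplied by the caller.
    """
    kept = []
    for m in sorted(markers, key=lambda x: x[3], reverse=True):
        name, chrom, pos, _ = m
        compatible = all(
            k[1] != chrom  # markers on other chromosomes are not constrained here
            or (abs(k[2] - pos) >= min_gap and r2(k[0], name) <= max_r2)
            for k in kept
        )
        if compatible:
            kept.append(m)
    return kept
```

A greedy pass is simple but order dependent, so a different ranking could produce a different, equally valid marker set.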

Scaffold for Identity by Descent

We selected a grid of 5,710 markers (distributed approximately one per 500 kb across the autosomes and the X chromosome) that showed little differentiation between African-, European- and Asian-ancestry samples sequenced by the 1000 Genomes Project and had allele frequency close to 0.50. Markers previously genotyped on the Illumina Omni 2.5M array were favored and markers with A/T or G/C alleles were avoided.

These markers can be used to identify stretches of identity by descent between apparently unrelated individuals or to support linkage-based analyses in appropriate family samples.

Functionally interesting variants

A small set of SNPs of high interest to groups participating in the NHLBI Exome Sequencing Project was included.

Random set of synonymous variants (as comparator)

A set of 5,000 synonymous variants was sampled at random. It is anticipated that these markers might be useful as a comparator (for example, for genomic control based analyses) when interpreting the results of assays for coding variants.
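As a rough sketch of what a genomic control based analysis with these markers might look like, the inflation factor can be estimated from 1-df chi-square association statistics at the synonymous comparator SNPs and then applied to the coding variant statistics. This is the standard genomic control calculation. Using the synonymous set as the null panel is the assumption here, and the function names are illustrative.

```python
from statistics import median

# Median of a chi-square distribution with 1 degree of freedom.
CHI2_1DF_MEDIAN = 0.4549364


def genomic_control_lambda(null_chi2):
    """Estimate the inflation factor from statistics at presumed-null markers."""
    return median(null_chi2) / CHI2_1DF_MEDIAN


def adjust(test_chi2, lam):
    """Deflate test statistics by lambda; lambda below 1 is conventionally treated as 1."""
    lam = max(lam, 1.0)
    return [x / lam for x in test_chi2]
```

A lambda close to 1 at the synonymous SNPs would suggest little inflation from population stratification or technical artifacts.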

Fingerprint SNPs

A set of 274 SNPs currently used as fingerprint SNPs at the University of Washington and Broad Institute was included. These SNPs are shared among several major genotyping platforms and facilitate sample tracking.

Mitochondrial SNPs

A set of 246 coding variants in the mitochondria, drawn from the 1000 Genomes Project, was included.

Chromosome Y SNPs

A set of 188 Y chromosome SNPs contributed by Jim Wilson at Sanger, based on analyses of 1000 Genomes data and European haplogroups.

HLA tag SNPs

A set of 2,536 HLA tag SNPs selected by Paul de Bakker was included. The SNPs are listed here: http://www.broadinstitute.org/~debakker/hla_tags_exome.txt

Illumina Exome Arrays

Coding Variants

Design criteria included a requirement for an assay design score >= 0.50, a primer that didn't overlap a nearby variant with minor allele count >100, and a primer that didn't map with 0, 1 or 2 mismatches to other genomic locations. Assay design failures appeared to be largely independent of frequency.
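The three design criteria can be read as a simple conjunction. The sketch below restates them as a predicate over a hypothetical per-assay record. The field names are invented for illustration and do not correspond to any Illumina file format.

```python
from dataclasses import dataclass


@dataclass
class AssayDesign:
    """Hypothetical summary of one candidate assay."""
    design_score: float          # assay design score reported for the candidate
    max_overlapping_mac: int     # largest minor allele count of any variant under the primer (0 if none)
    off_target_hits: int         # other genomic locations the primer matches with 0, 1 or 2 mismatches


def passes_design(a: AssayDesign, min_score: float = 0.50, max_mac: int = 100) -> bool:
    """Apply the three criteria listed above; all must hold."""
    return (
        a.design_score >= min_score
        and a.max_overlapping_mac <= max_mac
        and a.off_target_hits == 0
    )
```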

In the Illumina platform, 243,094 of the original set of 275,165 coding variants (non-synonymous, stop and splice) passed assay design criteria. We expect that 80-90% of the variants that pass design criteria will ultimately be included in genotyping arrays.

Assay Design Rates

Illumina Assay Design Summary

SNP Set                             Number of Candidates  Number of Successful Designs  Additional Notes
Coding Content                      275,165               243,094                       An additional set of 8,242 SNPs that were unique to the 1000 Genomes Project and populations under-represented in the design was added.
GWAS Tag SNPs                       5,763                 5,325
Grid of Common Variants             5,710                 5,286
Randomly Selected Synonymous SNPs   5,000                 4,651                         For 1,000 SNPs, assays were generated on both strands in order to facilitate QC efforts and future development of methods for genotyping of rare variants.
AIM - African Ancestry              3,388                 3,241
AIM - Native American Ancestry      1,000                 998
HLA Tags                            2,536                 2,459
ESP Requests                        1,003                 843
Fingerprint SNPs                    285                   259
MicroRNA Target Sites               285                   270
Mitochondrial Variants              246                   246
Chromosome Y                        188                   128
Indels                              181                   181
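
The design success rate for each SNP set follows directly from the two count columns. The short sketch below recomputes the rates from the figures in the table, and also turns the "80-90% of variants that pass design criteria" expectation for coding content into a count range. Only the dictionary and variable names are invented. The numbers are copied from this page.

```python
# (candidates, successful designs) per SNP set, copied from the table above.
DESIGNS = {
    "Coding Content": (275165, 243094),
    "GWAS Tag SNPs": (5763, 5325),
    "Grid of Common Variants": (5710, 5286),
    "Randomly Selected Synonymous SNPs": (5000, 4651),
    "AIM - African Ancestry": (3388, 3241),
    "AIM - Native American Ancestry": (1000, 998),
    "HLA Tags": (2536, 2459),
    "ESP Requests": (1003, 843),
    "Fingerprint SNPs": (285, 259),
    "MicroRNA Target Sites": (285, 270),
    "Mitochondrial Variants": (246, 246),
    "Chromosome Y": (188, 128),
    "Indels": (181, 181),
}

# Fraction of candidates with a successful assay design, per SNP set.
rates = {name: ok / total for name, (total, ok) in DESIGNS.items()}

# Expected number of coding variants on the final array, if 80-90% of
# successful designs are ultimately included (integer arithmetic, rounded down).
coding_ok = DESIGNS["Coding Content"][1]
expected_low = coding_ok * 80 // 100
expected_high = coding_ok * 90 // 100

for name, rate in rates.items():
    print(f"{name:35s} {rate:6.1%}")
print(f"Expected coding variants on array: {expected_low:,} to {expected_high:,}")
```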

Affymetrix Exome Arrays

Information on assay design is not available at this point.