Exome Chip Design

From Genome Analysis Wiki
Preliminary - Work in Progress

This page describes characteristics of variants for proposed exome chip genotyping arrays. Illumina and Affymetrix are currently implementing arrays based on the design principles described here and the variant lists are available to any others who would like to design similar arrays and commit to making them broadly available to the scientific community.

Key Credit

We thank all the individuals who generously shared sequence data and marker lists prior to publication. Without their dedication to great science, this effort would not have been possible.

Coordination

Coordination of the Exome Chip effort was the responsibility of the following individuals (in alphabetical order):

  • Goncalo Abecasis (University of Michigan)
  • David Altshuler (Broad Institute)
  • Michael Boehnke (University of Michigan)
  • Mark Daly (Massachusetts General Hospital)
  • Mark McCarthy (University of Oxford)
  • Debbie Nickerson (University of Washington)
  • Steve Rich (University of Virginia)

Grunt Work

A small group of individuals, led by Mark Daly and Goncalo Abecasis, was responsible for content selection.

Key members of this group included:

  • Benjamin Neale
  • Goo Jun (University of Michigan)
  • Hyun Min Kang (University of Michigan)
  • Shaun Purcell (Mount Sinai School of Medicine / Broad Institute)
  • Josh Smith (University of Washington)

Chip Goals

The goal of this array is to enable an intermediate experiment between current genotyping arrays, which focus on relatively common variants, and exome sequencing of very large numbers of samples, which will enable examination of coding variants down to singletons. The array aims to include coding variants seen several times in existing sequence datasets. Towards this end, we have assembled information on ~12,000 sequenced genomes and exomes and catalogued, for each variant that potentially affects protein structure, the total number of times it was seen and the total number of datasets that included the variant. Our working definition of a variant that has been seen "several times" focuses on non-synonymous variants seen at least 3 times across at least 2 datasets. A more lenient criterion was used for splice and nonsense variants.
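
The inclusion rule above can be written as a small filter. This is a minimal sketch, not the pipeline actually used for content selection: the `VariantTally` record and its field names are hypothetical, and the lenient threshold (seen at least 2 times in at least 2 datasets) is taken from the splice variant section of this page and assumed here to apply to stop altering variants as well.

```python
from dataclasses import dataclass

# Functional classes that receive the more lenient threshold (assumed labels).
LENIENT_CLASSES = {"splice", "stop"}


@dataclass
class VariantTally:
    """Counts for one variant pooled across contributed datasets (hypothetical record)."""
    variant_id: str
    functional_class: str  # "nonsynonymous", "splice" or "stop"
    times_seen: int        # total observations across all datasets
    datasets_seen: int     # number of datasets that contain the variant


def is_candidate(v: VariantTally) -> bool:
    """Return True if the variant meets the 'seen several times' rule."""
    min_times = 2 if v.functional_class in LENIENT_CLASSES else 3
    return v.times_seen >= min_times and v.datasets_seen >= 2


# Toy tallies, invented for illustration only.
tallies = [
    VariantTally("var1", "nonsynonymous", 2, 2),  # seen too few times
    VariantTally("var2", "nonsynonymous", 3, 2),  # passes
    VariantTally("var3", "splice", 2, 2),         # passes under the lenient rule
    VariantTally("var4", "stop", 5, 1),           # only one dataset
]
candidates = [v.variant_id for v in tallies if is_candidate(v)]
```

Requiring a second dataset, in addition to a minimum count, guards against variants that recur only because of an artifact specific to one sequencing study.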

Variants that Alter Protein Function

In the genome of an average individual (as represented by the exome sequenced individuals contributed for chip design), we expect to see ~8,000 - 10,000 nonsynonymous variants, ~200 - 300 splice variants and ~80 - 100 stop altering variants.

Non-synonymous Variants

We tallied 1,107,051 nonsynonymous variants seen at least once across ~12,000 sequenced samples. Among the non-synonymous variants, the majority were seen only once (646,888) or twice (163,044), as expected. Of the remaining variants (297,119), a total of 260,054 were seen in at least 2 datasets and are considered as candidates for inclusion in exome SNP arrays. The transition/transversion ratio of this class of variants was 2.0 in the full set of variants and 2.49 in the set of variants that were seen at least three times and in two or more studies.

The set of variants selected for array design is estimated to include 97-98% of the nonsynonymous variants detected in an average genome through exome sequencing.

Splice Variants

We tallied 44,529 splice variants seen at least once across ~12,000 sequenced samples. Among these splice variants, the majority were seen only once (27,265). Of the remaining variants (17,264), a total of 12,662 were seen in at least 2 datasets and are considered as candidates for inclusion in exome SNP arrays. The transition/transversion ratio of this class of variants was 2.13 for all variants and 2.94 for variants that met criteria for inclusion in these arrays (being seen 2 or more times and in 2 or more studies).

We estimate the candidate list of variants includes 94-95% of the splice variants detected in an average genome through exome sequencing.

Stop Altering Variants

We tallied 30,508 stop altering variants (stop gains or losses) seen at least once across ~12,000 sequenced samples. Among these stop altering variants, the majority were seen only once (20,391). Of the remaining variants (10,117), a total of 7,029 were seen in at least 2 datasets and are considered as candidates for inclusion in exome SNP arrays. The transition/transversion ratio of this class of variants was XXX for variants in dbSNP 132 and XXX for non dbSNP variants.

These variants include %%% of the XXX stop altering variants detected in an average genome through exome sequencing.
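
The tallies reported for the three classes of coding variants can be cross-checked with a few lines of arithmetic. The sketch below only restates counts given on this page; the `summarize` helper is introduced here for illustration.

```python
def summarize(seen_once, remaining, candidates, seen_twice=0):
    """Return (total distinct variants, fraction that are array candidates)."""
    total = seen_once + seen_twice + remaining
    return total, candidates / total


# Counts as reported in the three sections above.
# A separate "seen twice" count is only given for nonsynonymous variants.
nonsyn_total, nonsyn_frac = summarize(646_888, 297_119, 260_054, seen_twice=163_044)
splice_total, splice_frac = summarize(27_265, 17_264, 12_662)
stop_total, stop_frac = summarize(20_391, 10_117, 7_029)

for label, total, frac in [
    ("nonsynonymous", nonsyn_total, nonsyn_frac),
    ("splice", splice_total, splice_frac),
    ("stop altering", stop_total, stop_frac),
]:
    print(f"{label:<15}{total:>10,}  {frac:.1%} candidates")
```

The per-class counts add up to totals of 1,107,051, 44,529 and 30,508 distinct variants. Only about a quarter of the distinct variants in each class qualify as candidates. This is compatible with the much higher per-genome coverage estimates, presumably because variants seen repeatedly account for most of the variants carried by any one individual.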

Additional Content

In addition to the core content of protein altering variants, described above, exome arrays can include several additional classes of interesting variation.

Tags for Previously Described GWAS Hits

We collected all SNPs reported as associated in the NHGRI list, as of August 16, 2011. This list was augmented with unpublished hits from consortia working on diabetes, blood lipids, blood pressure, lung function, myocardial infarction, anthropometric traits, psychiatric traits, Crohn's disease and age related macular degeneration, resulting in a total of 5,542 GWAS hit SNPs.

Among the GWAS hit SNPs, 5,325 could be designed into Illumina assays.

NOTE: Also have lists for myocardial infarction and lung function. Would disclosing that these lists are included among a large set of thousands of tag SNPs seem reasonable?? What to do?

NOTE2: I originally planned to filter this on p-value, requiring at least 5 x 10^-8. However, it seems that among the 5000+ hits on the NHGRI list, more than half don't meet this p-value threshold and also that some of the SNPs that don't meet the threshold are quite interesting (in a random inspection). Unless someone wants to debug how the NHGRI p-values were tabulated (combined after replication versus discovery sample, for example), I propose we just aim to tag everything.

Ancestry Informative Markers

African American vs European Ancestry

We selected a grid of 3,380 markers (distributed approximately one per megabase, across the autosomes and the X chromosome) that showed strong differentiation between African- and European-ancestry samples sequenced by the 1000 Genomes Project. Markers previously genotyped on the Illumina Omni 2.5M array were favored and markers with A/T or G/C alleles were avoided.

Among these markers, 2,470 resulted in successful assays.
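
A grid selection of this kind can be sketched as follows. This is an illustration under stated assumptions, not the procedure actually used: differentiation is measured here as the absolute allele frequency difference (the page does not name the statistic), the preference for Omni 2.5M markers is applied only as a tie-breaker, and the `Marker` record is hypothetical.

```python
from dataclasses import dataclass

# Allele pairs that read the same on both strands; these were avoided.
STRAND_AMBIGUOUS = {frozenset("AT"), frozenset("CG")}


@dataclass
class Marker:
    chrom: str
    pos: int          # base-pair position
    alleles: tuple    # for example ("A", "G")
    freq_afr: float   # allele frequency in African-ancestry samples
    freq_eur: float   # allele frequency in European-ancestry samples
    on_omni: bool     # already genotyped on the Omni 2.5M array


def pick_grid(markers, bin_size=1_000_000):
    """Keep the most differentiated unambiguous marker in each bin."""
    best = {}
    for m in markers:
        if frozenset(m.alleles) in STRAND_AMBIGUOUS:
            continue
        key = (m.chrom, m.pos // bin_size)
        # Rank by frequency difference first, then prefer Omni markers on ties.
        score = (abs(m.freq_afr - m.freq_eur), m.on_omni)
        if key not in best or score > best[key][0]:
            best[key] = (score, m)
    chosen = [m for _, m in best.values()]
    return sorted(chosen, key=lambda m: (m.chrom, m.pos))


# Toy markers, invented for illustration only.
markers = [
    Marker("1", 100_000, ("A", "G"), 0.9, 0.1, False),
    Marker("1", 200_000, ("A", "T"), 1.0, 0.0, True),   # strand ambiguous, skipped
    Marker("1", 300_000, ("C", "T"), 0.9, 0.1, True),   # ties with the first, Omni wins
    Marker("1", 1_500_000, ("C", "T"), 0.5, 0.4, False),
]
grid_positions = [m.pos for m in pick_grid(markers)]
```

A/T and G/C markers are skipped because their alleles cannot be told apart by strand, which complicates merging genotypes across platforms.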

Native American vs European Ancestry

A grid of 1,000 markers was selected to be informative for Native American vs. European ancestry. These AIMs were selected to be in low linkage disequilibrium with one another (defined as r^2 <= 0.1 in Native American populations, to be conservative) and widely separated (by requiring that they be at least 250 kb from other European vs. Native American ancestry AIMs). SNPs with significant within-continent heterogeneity were excluded.

The ancestral populations include CEU, TSI, and a population of Spaniards for European ancestry, and six populations of Native Americans: Mayan, Nahuan, Zapoteca, Tepehuano, Quechuan and Aymaran.

Among these SNPs, 998 could be designed into Illumina assays.
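
The spacing and linkage disequilibrium constraints described for these AIMs can be illustrated with a greedy pruning pass. The sketch below is one plausible way to apply them, not the method actually used: the greedy ordering by informativeness, the tuple layout and the `r2` callable (which would return r^2 between two markers in Native American samples) are all assumptions.

```python
def prune_aims(candidates, r2, min_gap=250_000, max_r2=0.1):
    """Greedy selection of AIMs that are well separated and in low LD.

    candidates: list of (chrom, pos, informativeness) tuples.
    r2: callable taking two candidates and returning their r^2.
    """
    kept = []
    # Consider the most informative markers first.
    for cand in sorted(candidates, key=lambda c: c[2], reverse=True):
        chrom, pos, _ = cand
        acceptable = True
        for other in kept:
            if other[0] != chrom:
                continue  # constraints are only checked within a chromosome
            if abs(other[1] - pos) < min_gap or r2(cand, other) > max_r2:
                acceptable = False
                break
        if acceptable:
            kept.append(cand)
    return sorted(kept, key=lambda c: (c[0], c[1]))


# Toy candidates and a stub LD function, invented for illustration only.
toy = [("1", 100_000, 0.9), ("1", 200_000, 0.8), ("1", 400_000, 0.7), ("2", 100_000, 0.5)]
no_ld = prune_aims(toy, r2=lambda a, b: 0.0)
high_ld = prune_aims(toy, r2=lambda a, b: 0.5)
```

With no LD between markers, only the candidate at 200 kb is dropped because it lies within 250 kb of a better marker. With high LD, the candidate at 400 kb is dropped as well.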

Scaffold for Identity by Descent

We selected a grid of 5,805 markers (distributed approximately one per 500 kb across the autosomes and the X chromosome) that showed little differentiation between African-, European- and Asian-ancestry samples sequenced by the 1000 Genomes Project and had allele frequency close to 0.50. Markers previously genotyped on the Illumina Omni 2.5M array were favored and markers with A/T or G/C alleles were avoided.

These markers can be used to identify stretches of identity by descent between apparently unrelated individuals or to support linkage-based analyses in appropriate family samples.

In this category, 3,369 markers passed assay design.
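
To illustrate how a scaffold like this might be used, the sketch below scans two individuals for runs of markers with no opposite homozygous genotypes, a simple first screen for candidate stretches of identity by descent. It is a deliberate simplification and an assumption of this example: dedicated methods also account for genotyping error, missing data, allele frequencies and genetic distance, and the `min_markers` default is far smaller than would be used in practice.

```python
def shared_segments(geno_a, geno_b, positions, min_markers=3):
    """Find runs of markers where two individuals share at least one allele.

    geno_a, geno_b: genotypes coded as 0, 1 or 2 copies of one allele.
    positions: base-pair positions in increasing order on a single chromosome.
    Returns a list of (start_pos, end_pos, n_markers) tuples.
    """
    segments = []
    run = []  # indices of markers in the current run
    for i, (a, b) in enumerate(zip(geno_a, geno_b)):
        if abs(a - b) == 2:
            # Opposite homozygotes: no allele is shared, so the run ends here.
            if len(run) >= min_markers:
                segments.append((positions[run[0]], positions[run[-1]], len(run)))
            run = []
        else:
            run.append(i)
    if len(run) >= min_markers:
        segments.append((positions[run[0]], positions[run[-1]], len(run)))
    return segments


# Toy genotypes on a 500 kb grid, invented for illustration only.
positions = [i * 500_000 for i in range(8)]
geno_a = [0, 1, 2, 2, 0, 1, 1, 2]
geno_b = [0, 0, 1, 0, 1, 1, 2, 2]
segments = shared_segments(geno_a, geno_b, positions)
```

Markers with allele frequency near 0.50 are the most useful for this screen, since opposite homozygotes are then most likely to appear where no haplotype is shared.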

Functionally interesting variants

Variants of specific interest, as defined by the ESP groups.

Random set of synonymous variants (as comparator)

A set of 5,000 synonymous variants was sampled at random.

Among these, 4,355 passed Illumina assay design.

Fingerprint SNPs

A set of 274 SNPs currently used as fingerprint SNPs at the University of Washington and Broad Institute were included. These SNPs are shared among several major genotyping platforms and facilitate sample tracking.

Among these 274 SNPs, assays could be designed for 259 SNPs.

Mitochondrial SNPs

A set of XXX coding variants in the mitochondrial genome, drawn from the 1000 Genomes Project.

Chromosome Y SNPs

A set of 180 Y chromosome SNPs contributed by *** at Sanger, based on analyses of 1000 Genomes data and European haplogroups.

HLA tag SNPs

A set of 2,536 HLA tag SNPs selected by Paul de Bakker.

Among these, 2,459 could be designed into Illumina assays.

Paul de Bakker's HLA tag SNPs are listed here: http://www.broadinstitute.org/~debakker/hla_tags_exome.txt

Illumina Exome Arrays

Coding Variants

Design criteria included a requirement for an assay design score >= 0.50, a primer that didn't overlap a nearby variant with minor allele count >100, and a primer that didn't map with 0, 1 or 2 mismatches to other genomic locations. Assay design failures appeared to be largely independent of frequency.
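
The three design criteria can be combined into a single predicate. This is a sketch only: the `AssayCandidate` record and its field names are hypothetical and do not correspond to any Illumina file format, and the inputs (design score, overlapping variants, primer mapping hits) are assumed to have been computed upstream.

```python
from dataclasses import dataclass


@dataclass
class AssayCandidate:
    """Hypothetical summary of one proposed assay."""
    variant_id: str
    design_score: float    # assay design score reported for the variant
    max_overlap_mac: int   # largest minor allele count of any variant under the primer (0 if none)
    near_match_hits: int   # other genomic locations matching the primer with 0, 1 or 2 mismatches


def passes_design(c, min_score=0.50, max_mac=100):
    """Apply the three design criteria described above."""
    return (
        c.design_score >= min_score
        and c.max_overlap_mac <= max_mac
        and c.near_match_hits == 0
    )


# Toy candidates, invented for illustration only.
assays = [
    AssayCandidate("var1", 0.92, 0, 0),    # passes
    AssayCandidate("var2", 0.45, 0, 0),    # design score too low
    AssayCandidate("var3", 0.80, 250, 0),  # primer overlaps a frequent variant
    AssayCandidate("var4", 0.80, 12, 3),   # primer is not unique in the genome
]
passing = [c.variant_id for c in assays if passes_design(c)]
```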

In the Illumina platform, 243,094 of the original set of 275,165 coding variants (non-synonymous, stop and splice) passed assay design criteria. We expect that 80-90% of the variants that pass design criteria will ultimately be included in genotyping arrays.
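
The design pass rates reported throughout this page can be compared side by side. The sketch below only restates counts given above and assumes that all of them refer to the same round of Illumina assay design; categories without reported counts (mitochondrial and chromosome Y SNPs) are omitted.

```python
# (category, submitted, passed) as reported in the sections above.
design_counts = [
    ("Coding variants", 275_165, 243_094),
    ("GWAS hit SNPs", 5_542, 5_325),
    ("AIMs, African American vs European", 3_380, 2_470),
    ("AIMs, Native American vs European", 1_000, 998),
    ("Identity by descent scaffold", 5_805, 3_369),
    ("Synonymous comparators", 5_000, 4_355),
    ("Fingerprint SNPs", 274, 259),
    ("HLA tag SNPs", 2_536, 2_459),
]

pass_rates = {name: passed / submitted for name, submitted, passed in design_counts}

for name, rate in pass_rates.items():
    print(f"{name:<38}{rate:6.1%}")

# Range implied by "80-90% of the variants that pass design criteria".
coding_passed = design_counts[0][2]
expected_on_array = (round(0.80 * coding_passed), round(0.90 * coding_passed))
```

For coding variants, the pass rate is about 88%, and the 80-90% expectation corresponds to roughly 194,000 to 219,000 variants on the final array.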