Exome Chip Design

From Genome Analysis Wiki

This page describes characteristics of the Illumina Exome Chip. The plan is that this page will be publicly visible.

Key Credit

We thank all the individuals who generously shared sequence data and marker lists prior to publication. Without their dedication to great science, this effort would not have been possible.

Coordination

Coordination of the Exome Chip effort was the responsibility of the following individuals (in alphabetical order):

  • Goncalo Abecasis (University of Michigan)
  • David Altshuler (Broad Institute)
  • Michael Boehnke (University of Michigan)
  • Mark Daly (Massachusetts General Hospital)
  • Rick Lifton (Yale University)
  • Mark McCarthy (University of Oxford)
  • Debbie Nickerson (University of Washington)
  • Steve Rich (University of Virginia)

Grunt Work

A small group of individuals, led by Mark Daly and Goncalo Abecasis, was responsible for content selection.

Key members of this group included:

  • Benjamin Neale
  • Goo Jun (University of Michigan)
  • Hyun Min Kang (University of Michigan)
  • Shaun Purcell (Mount Sinai School of Medicine / Broad Institute)
  • Josh Smith (University of Washington)

Chip Goals

The goal of this array is to enable an intermediate experiment between current genotyping arrays, which focus on relatively common variants, and exome sequencing of very large numbers of samples, which will enable examination of coding variants down to singletons. The array aims to include coding variants seen several times in existing sequence datasets. Towards this end, we have assembled information on ~12,000 sequenced genomes and exomes and catalogued, for each variant that potentially affects protein structure, the total number of times it was seen and the total number of datasets that included the variant. Our working definition of a variant that has been seen "several times" focuses on non-synonymous variants seen at least 3 times across at least 2 datasets. A more lenient criterion was used for splice and nonsense variants.
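The working definition above can be written as a small filter. This is a minimal sketch, not the pipeline actually used for the chip: the `VariantTally` record and its field names are hypothetical, and only the non-synonymous thresholds (at least 3 observations, at least 2 datasets) come from the text. The more lenient thresholds for splice and nonsense variants are not stated in this section, so they are left as parameters rather than guessed.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class VariantTally:
    """Hypothetical per-variant summary across all contributed datasets."""
    variant_id: str
    times_seen: int      # total number of times the variant was observed
    datasets_seen: int   # number of datasets that included the variant


def seen_several_times(v: VariantTally, min_times: int = 3, min_datasets: int = 2) -> bool:
    """Apply the 'seen several times' rule; defaults are the non-synonymous thresholds."""
    return v.times_seen >= min_times and v.datasets_seen >= min_datasets


tallies = [
    VariantTally("var1", times_seen=1, datasets_seen=1),   # singleton: excluded
    VariantTally("var2", times_seen=5, datasets_seen=1),   # only one dataset: excluded
    VariantTally("var3", times_seen=3, datasets_seen=2),   # meets both thresholds
]
candidates = [v.variant_id for v in tallies if seen_several_times(v)]
```

Passing different `min_times` and `min_datasets` values is one way a more lenient rule for other variant classes could be expressed.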

Variants that Alter Protein Function

In the genome of an average individual (as represented by the exome sequenced individuals contributed for chip design), we expect to see XXX nonsynonymous variants, XXX splice variants and XXX stop altering variants.

Non-synonymous Variants

We tallied 1,081,805 nonsynonymous and 613,106 variants seen at least once across ~12,000 sequenced samples. Among the non-synonymous variants, the majority were seen only once (634,787) or twice (157,261), as expected. Of the remaining variants (289,757), a total of 254,035 were seen in at least 2 datasets and are considered as candidates for inclusion in exome SNP arrays. The transition/transversion ratio of this class of variants was XXX for variants in dbSNP 132 and XXX for non-dbSNP variants.

These variants include %%% of the XXX nonsynonymous variants detected in an average genome through exome sequencing.

On the Illumina platform, 232,125 of these variants were assigned an assay design score >0.50 using a probe that didn't overlap nearby SNPs seen at least 100 times. These are considered candidates for manufacturing. These variants include %%% of the XXX nonsynonymous variants detected in an average genome through exome sequencing. The actual number that can be detected by genotyping arrays will depend on the fraction of successfully manufactured probes, expected to be 85-90%.
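The design filter and the expected manufacturing yield can be sketched as follows. The function names and the boolean `probe_overlaps_common_snp` flag are assumptions made for illustration; only the score threshold (>0.50), the candidate count (232,125) and the 85-90% success range come from the text. Integer percentages are used so the arithmetic is exact.

```python
def passes_design(design_score: float, probe_overlaps_common_snp: bool,
                  min_score: float = 0.50) -> bool:
    """Keep assays scoring above the threshold whose probe avoids nearby SNPs
    seen at least 100 times (overlap is assumed to be precomputed)."""
    return design_score > min_score and not probe_overlaps_common_snp


def expected_yield(n_candidates: int, low_pct: int = 85, high_pct: int = 90) -> tuple:
    """Return the (low, high) number of probes expected to manufacture successfully,
    rounded down to whole probes."""
    return (n_candidates * low_pct // 100, n_candidates * high_pct // 100)


low, high = expected_yield(232_125)
print(f"Expected working non-synonymous assays: {low:,} to {high:,}")
```

Under these assumptions, roughly 197,000 to 209,000 of the 232,125 designable non-synonymous variants would be expected to yield working assays.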

Splice Variants

We tallied 43,702 splice variants seen at least once across ~12,000 sequenced samples. Among these splice variants, the majority were seen only once (26,847). Of the remaining variants (16,855), a total of 12,459 were seen in at least 2 datasets and are considered as candidates for inclusion in exome SNP arrays. The transition/transversion ratio of this class of variants was XXX for variants in dbSNP 132 and XXX for non-dbSNP variants.

These variants include %%% of the XXX splice variants detected in an average genome through exome sequencing.

Stop Altering Variants

We tallied 30,508 stop altering variants (stop gains or losses) seen at least once across ~12,000 sequenced samples. Among these stop altering variants, the majority were seen only once (20,391). Of the remaining variants (10,117), a total of 7,029 were seen in at least 2 datasets and are considered as candidates for inclusion in exome SNP arrays. The transition/transversion ratio of this class of variants was XXX for variants in dbSNP 132 and XXX for non-dbSNP variants.

These variants include %%% of the XXX stop altering variants detected in an average genome through exome sequencing.
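The counts in the three subsections above can be cross-checked with a few lines of arithmetic. This is only a bookkeeping sketch: the dictionary layout and the `check_tallies` helper are illustrative, while every number is copied from the text. Note that "dropped" means seen once or twice for non-synonymous variants, but seen only once for splice and stop altering variants.

```python
# total: seen at least once; dropped: excluded for being seen too few times;
# remaining: left after that exclusion; candidates: also seen in at least 2 datasets.
TALLIES = {
    "nonsynonymous": dict(total=1_081_805, dropped=634_787 + 157_261,
                          remaining=289_757, candidates=254_035),
    "splice":        dict(total=43_702, dropped=26_847,
                          remaining=16_855, candidates=12_459),
    "stop altering": dict(total=30_508, dropped=20_391,
                          remaining=10_117, candidates=7_029),
}


def check_tallies(tallies: dict) -> int:
    """Verify total - dropped == remaining for each class; return total candidates."""
    for name, t in tallies.items():
        if t["total"] - t["dropped"] != t["remaining"]:
            raise ValueError(f"{name}: counts do not add up")
        if t["candidates"] > t["remaining"]:
            raise ValueError(f"{name}: more candidates than remaining variants")
    return sum(t["candidates"] for t in tallies.values())


total_candidates = check_tallies(TALLIES)
print(f"Protein-altering candidates across all classes: {total_candidates:,}")
```

The reported counts are internally consistent, and the three classes together contribute 273,523 candidate variants before assay design.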

Additional Content

In addition to the core content of protein altering variants, described above, exome arrays can include several additional classes of interesting variation.

Tags for Previously Described GWAS Hits

We collected all SNPs reported as associated in the NHGRI list, as of August 16, 2011. This list was augmented with unpublished hits from consortia working on diabetes, blood lipids, blood pressure, anthropometric traits, psychiatric traits, Crohn's disease and age-related macular degeneration.

NOTE: Also have lists for myocardial infarction and lung function. Would disclosing that these lists are included among a large set of thousands of tag SNPs seem reasonable? What to do?

NOTE2: I originally planned to filter this on p-value, requiring p-values of 5x10^-8 or smaller. However, it seems that among the 5000+ hits on the NHGRI list, more than half don't meet this p-value threshold and also that some of the SNPs that don't meet the threshold are quite interesting (in a random inspection). Unless someone wants to debug how the NHGRI p-values were tabulated (combined after replication versus discovery sample, for example), I propose we just aim to tag everything.

Ancestry Informative Markers

We selected a grid of 3,380 markers (distributed approximately one per megabase, across the autosomes and the X chromosome) that showed strong differentiation between African- and European-ancestry samples sequenced by the 1000 Genomes Project. Markers previously genotyped on the Illumina Omni 2.5M array were favored and markers with A/T or G/C alleles were avoided.
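A grid of this kind could be built roughly as sketched below. The text does not say how differentiation was measured or how strongly Omni 2.5M markers were favored, so several choices here are assumptions: differentiation is taken as the absolute allele frequency difference, `min_delta` is a hypothetical cutoff for "strong", and Omni markers win over non-Omni markers within a bin. The `Marker` record is also hypothetical.

```python
from dataclasses import dataclass
from typing import Dict, Iterable, Tuple

# Complementary allele pairs (A/T, G/C) whose strand cannot be resolved from the alleles.
STRAND_AMBIGUOUS = {frozenset("AT"), frozenset("CG")}


@dataclass(frozen=True)
class Marker:
    name: str
    chrom: str
    pos: int                   # base-pair position
    alleles: Tuple[str, str]
    freq_afr: float            # allele frequency in African-ancestry samples
    freq_eur: float            # allele frequency in European-ancestry samples
    on_omni: bool              # previously genotyped on the Omni 2.5M array


def pick_aim_grid(markers: Iterable[Marker], bin_size: int = 1_000_000,
                  min_delta: float = 0.5) -> Dict[Tuple[str, int], Marker]:
    """Pick at most one marker per (chromosome, bin) of bin_size base pairs."""
    def rank(m: Marker) -> Tuple[bool, float]:
        # Assumed preference: Omni markers first, then larger frequency difference.
        return (m.on_omni, abs(m.freq_afr - m.freq_eur))

    best: Dict[Tuple[str, int], Marker] = {}
    for m in markers:
        if frozenset(m.alleles) in STRAND_AMBIGUOUS:
            continue  # A/T and G/C markers were avoided
        if abs(m.freq_afr - m.freq_eur) < min_delta:
            continue  # not differentiated enough under the assumed cutoff
        key = (m.chrom, m.pos // bin_size)
        if key not in best or rank(m) > rank(best[key]):
            best[key] = m
    return best
```

With the default `bin_size` of one megabase, a genome-wide run over the autosomes and the X chromosome would yield a grid on the order of a few thousand markers, in line with the 3,380 reported.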

Native American vs European Ancestry

A grid of 1,000 markers was selected to be informative for Native American vs. European ancestry. These AIMs were selected to be in low linkage disequilibrium with one another (defined as r^2 <= 0.1 in Native American populations, to be conservative) and widely separated (at least 250 kb from other European vs. Native American ancestry AIMs). SNPs with significant within-continent heterogeneity were excluded.

The ancestral populations include CEU, TSI, and a population of Spaniards for European ancestry, and six populations of Native Americans: Mayan, Nahuan, Zapoteca, Tepehuano, Quechuan and Aymaran.

Among these SNPs, 998 could be designed into Illumina assays.
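The two spacing constraints described for these AIMs (pairwise LD of at most 0.1 and at least 250 kb between markers) can be enforced with a greedy pass. This is a sketch of one plausible procedure, not the documented one: the `Aim` record, the `informativeness` score used for ordering, and the `r2` callback that returns pairwise LD in Native American samples are all assumptions.

```python
from typing import Callable, Iterable, List, NamedTuple


class Aim(NamedTuple):
    name: str
    chrom: str
    pos: int                 # base-pair position
    informativeness: float   # hypothetical ancestry-informativeness score


def select_aims(candidates: Iterable[Aim], r2: Callable[[Aim, Aim], float],
                max_r2: float = 0.1, min_gap: int = 250_000,
                target: int = 1000) -> List[Aim]:
    """Greedily keep the most informative markers that satisfy both constraints
    against every marker already chosen on the same chromosome."""
    chosen: List[Aim] = []
    for c in sorted(candidates, key=lambda a: a.informativeness, reverse=True):
        if len(chosen) >= target:
            break
        compatible = all(
            k.chrom != c.chrom
            or (abs(k.pos - c.pos) >= min_gap and r2(k, c) <= max_r2)
            for k in chosen
        )
        if compatible:
            chosen.append(c)
    return chosen
```

A greedy pass favors the most informative markers but does not guarantee the largest possible set; that trade-off is a design choice of this sketch.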

Scaffold for Identity by Descent

We selected a grid of 5,805 markers (distributed approximately one per 500 kb across the autosomes and the X chromosome) that showed little differentiation between African-, European- and Asian-ancestry samples sequenced by the 1000 Genomes Project and had allele frequencies close to 0.50. Markers previously genotyped on the Illumina Omni 2.5M array were favored and markers with A/T or G/C alleles were avoided.

These markers can be used to identify stretches of identity by descent between apparently unrelated individuals or to support linkage-based analyses in appropriate family samples.
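The scaffold criteria (little differentiation across ancestries, allele frequency near 0.50, one marker per 500 kb) could be scored as sketched here. The `max_spread` cutoff for "little differentiation" and the use of the mean frequency across the three groups are assumptions, since the text gives no numeric definitions; the Omni preference and the A/T and G/C exclusion are omitted for brevity.

```python
from typing import Dict, Iterable, Optional, Tuple

Freqs = Tuple[float, float, float]  # (African, European, Asian) allele frequencies


def scaffold_penalty(freqs: Freqs, max_spread: float = 0.1) -> Optional[float]:
    """Return distance of the mean frequency from 0.50 (lower is better),
    or None if the marker is too differentiated under the assumed cutoff."""
    if max(freqs) - min(freqs) > max_spread:
        return None
    return abs(sum(freqs) / len(freqs) - 0.5)


def pick_scaffold(markers: Iterable[Tuple[str, int, Freqs]],
                  bin_size: int = 500_000) -> Dict[Tuple[str, int], int]:
    """Map each (chromosome, bin) to the position of its best-scoring marker."""
    best: Dict[Tuple[str, int], Tuple[float, int]] = {}
    for chrom, pos, freqs in markers:
        penalty = scaffold_penalty(freqs)
        if penalty is None:
            continue
        key = (chrom, pos // bin_size)
        if key not in best or penalty < best[key][0]:
            best[key] = (penalty, pos)
    return {key: pos for key, (_, pos) in best.items()}
```

Markers with frequency near 0.50 in every population are the most likely to be heterozygous, which is what makes them useful for detecting shared segments and for linkage analysis.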

Functionally interesting variants

Variants defined by the ESP groups as being of specific interest.

Random set of synonymous variants (as comparator)

A set of 5,000 synonymous variants was sampled at random.
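Drawing the comparator set amounts to sampling without replacement. The sketch below assumes the synonymous variants are available as a collection of identifiers; the function name, the fixed seed, and the sorting step (added so that the draw is reproducible) are illustrative choices, not details from the text.

```python
import random
from typing import Iterable, List


def sample_comparators(synonymous_ids: Iterable[str], k: int = 5000,
                       seed: int = 0) -> List[str]:
    """Draw k distinct synonymous variants uniformly at random."""
    pool = sorted(set(synonymous_ids))  # de-duplicate and fix the order
    if k > len(pool):
        raise ValueError(f"requested {k} variants but only {len(pool)} available")
    return random.Random(seed).sample(pool, k)
```

Using a dedicated `random.Random(seed)` instance keeps the draw repeatable without touching the global random state.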

Fingerprint SNPs

A set of 274 SNPs currently used as fingerprint SNPs at the University of Washington and Broad Institute were included. These SNPs are shared among several major genotyping platforms and facilitate sample tracking.

Among these 274 SNPs, assays could be designed for 259 SNPs.

Mitochondrial SNPs

A set of XXX coding variants in the mitochondrial genome, drawn from the 1000 Genomes Project.

Chromosome Y SNPs

A set of 180 Y chromosome SNPs contributed by *** at Sanger, based on analyses of 1000 Genomes data and European haplogroups.

HLA tag SNPs

A set of 2,536 HLA tag SNPs selected by Paul de Bakker.

Among these, 2,459 could be designed into Illumina assays.

Paul de Bakker's HLA tag SNPs are listed here: http://www.broadinstitute.org/~debakker/hla_tags_exome.txt