SardiNIA

From Genome Analysis Wiki

Project Leaders

  • David Schlessinger (National Institute on Aging, Baltimore)
  • Manuela Uda (National Research Center, Cagliari, Italy)
  • Goncalo Abecasis (University of Michigan, Ann Arbor)

Genomewide Association Study

We carried out an initial genomewide association study by genotyping 1405 samples with Affymetrix 500K SNP arrays. Then, because SardiNIA samples are closely related to each other, we were able to use these genotypes to impute the genomes of many close relatives who were genotyped with Affymetrix 10K SNP arrays. Our SardiNIA GWAS analyses thus typically include approximately 4300 genotyped or imputed individuals. To learn more about the approach, see Chen and Abecasis (2007) and Scuteri et al. (2007).

Planned Updates

Several ongoing experiments are expected to gradually improve our GWAS data. First, we expect to integrate Affymetrix 1M SNP chip genotypes into our analyses. Second, we plan to genotype all sampled individuals with the Metabochip, which includes 200,000 SNPs. This will enable us to fill in genotypes for relatives of those genotyped with denser arrays more accurately, because we will more precisely identify shared stretches of chromosome.

Medical Sequencing Project

We are sequencing the genomes of 1,000 individuals to learn about the genetics of blood lipid levels and personality.

Status as of June 2016

Locations of Files for Current Data Freeze of 3839 Samples

NOTICE: Late in the process, we found that two of the sample IDs (22855 and 22385) belong to the same individual, who should be identified as 22855. Each of the files below therefore contains 3840 sample IDs, but only 22855 should move on to later processes. In future data freezes with this data, these two sequencing sets should be merged into a single individual, 22855.
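As an illustration of the notice above, here is a minimal Python sketch that collapses the duplicate ID onto the one that should be kept before a sample list is passed to later steps. The function name `collapse_duplicates` and the in-memory list of IDs are hypothetical; only the two sample IDs come from this page, and the real sample list format may differ.

```python
# Hypothetical mapping: duplicate sample ID -> ID that should be kept.
# Only the two IDs themselves are taken from the notice above.
DUPLICATE_TO_KEPT = {"22385": "22855"}


def collapse_duplicates(sample_ids):
    """Map duplicate IDs to the kept ID and drop repeats, preserving order."""
    seen = set()
    kept = []
    for sample_id in sample_ids:
        canonical = DUPLICATE_TO_KEPT.get(sample_id, sample_id)
        if canonical not in seen:
            seen.add(canonical)
            kept.append(canonical)
    return kept


if __name__ == "__main__":
    # Toy list for illustration; the real files hold 3840 IDs.
    example_ids = ["22001", "22855", "22385", "23010"]
    print(collapse_duplicates(example_ids))
```

Applied to a list of 3840 IDs that contains both 22855 and 22385, this would leave 3839 unique individuals, matching the size of the data freeze.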

  • List of Sample Numbers
  • List of paths to BAMs used in this data freeze (Index file)
    • /net/sardinia/progenia/SardiNIA/VariantCalling_20150330/SNPCALL/FINAL_index_20150504.index
  • Pedigree (Not too helpful -- used for SNPCall)
    • All Samples: /net/sardinia/progenia/SardiNIA/VariantCalling_20150330/SNPCALL/FINAL_ped_20150510.ped
  • QC Summary
  • SNPCall Results (Produced with Gotcloud SNPCall and phased using Beagle4)
    • /net/sardinia/progenia/SardiNIA/VariantCalling_20150330/SNPCALL/beagle4/chr*/chr*.filtered.PASS.beagled.vcf.gz
  • IndelCall Results (Produced with Gotcloud Indel)
    • /net/sardinia/progenia/SardiNIA/VariantCalling_20150330/INDELCALL/indel/final/all.genotypes.sites.vcf.gz
    • SNPEff and VEP annotations of Indel types can be found here:
      • SNPEff: /net/sardinia/progenia/SardiNIA/VariantCalling_20150330/INDELCALL/snpEff/*
      • VEP: /net/sardinia/progenia/SardiNIA/VariantCalling_20150330/INDELCALL/VEP/*
    • We filtered Indels in several levels:
      1. AC must be 1 or greater -- eliminate Indels with AC=0
      2. the Indel should overlap with a VNTR region or overlap with another Indel. We used Adrian's annotate indels program to identify such overlaps. Results for Indels on all chromosomes can be found here:
        • /net/sardinia/progenia/SardiNIA/VariantCalling_20150330/INDELCALL/filterIndels/AnnotateIndels/All.annotated.sites.vcf.gz
        • /net/sardinia/progenia/SardiNIA/VariantCalling_20150330/INDELCALL/filterIndels/AnnotateIndels/Overlaps.txt
      3. at least 50% of the Indels should have informative AD field (we define "informative" to mean that the sample has U/(R+A+U)<0.50). Results for Indels on all chromosomes can be found here:
        • /net/sardinia/progenia/SardiNIA/VariantCalling_20150330/INDELCALL/filterIndels/filter2_output.txt
        • /net/sardinia/progenia/SardiNIA/VariantCalling_20150330/INDELCALL/filterIndels/INDEL_filtering_final.txt
      4. at least 50% of the Indels should have informative PL field (we define "informative" to mean that the PL field for the sample is anything BUT ././. or 0/0/0). Results for Indels on all chromosomes can be found here:
        • /net/sardinia/progenia/SardiNIA/VariantCalling_20150330/INDELCALL/filterIndels/filter1_output.txt
        • /net/sardinia/progenia/SardiNIA/VariantCalling_20150330/INDELCALL/filterIndels/INDEL_filtering_final.txt
      5. the Indel needs to have BF_LRE_LUD (a Bayes factor comparing a. related & HWE to b. unrelated & HWD) > -10. Higher BF_LRE_LUD should indicate a better Indel. We used Hyun's MiLK program to obtain BF_LRE_LUD values. Results for Indels on all chromosomes can be found here:
        • /net/sardinia/progenia/SardiNIA/VariantCalling_20150330/INDELCALL/filterIndels/MiLK_filtering/all.genotypes.milk.sites.vcf
      • Overall results from all of the above filtering can be found here:
        • /net/sardinia/progenia/SardiNIA/VariantCalling_20150330/INDELCALL/filterIndels/FINAL/FinalIndelFilteringStatistics.txt
      • VCFs of Indels after filtering can be found here:
        • /net/sardinia/progenia/SardiNIA/VariantCalling_20150330/INDELCALL/filterIndels/FINAL/all.genotypes.*.PASS.vcf.gz
        • /net/sardinia/progenia/SardiNIA/VariantCalling_20150330/INDELCALL/filterIndels/FINAL/all.genotypes.PASS.vcf.gz
  • Indel VCF merged with the Beagle-phased SNP VCF, then sorted, to get the following VCFs:
    • /net/sardinia/progenia/SardiNIA/VariantCalling_20150330/SNPINDEL/chr*.PASS.beagled.vcf.gz
  • Indel and SNP VCFs that have been combined AND BEAGLED AGAIN TOGETHER using Beagle4. These are the latest VCFs:
    • /net/sardinia/progenia/SardiNIA/VariantCalling_20150330/SNPINDEL/beagle4/beagle4_chr*/beagle4/chr*/chr*.filtered.PASS.beagled.vcf.gz
  • mtDNA Copy Number Summary
    • /net/sardinia/progenia/SardiNIA/VariantCalling_20150330/MTDNA_COPYNUMBER/Coverage_IncludingUnpairedReads/sardiniaCopyNumber_include.txt
    • Per-sample copy number results can be found: /net/sardinia/progenia/SardiNIA/VariantCalling_20150330/MTDNA_COPYNUMBER/Coverage_IncludingUnpairedReads/*.CopyNumber.noRand.500000.*.txt
  • Filtering Samples: Discrepancy Figures
    • We compared chip data from previous work to the sequencing data produced now in hopes of identifying which samples have reliable sequencing data. This comparison was done for each chromosome separately and then combined into an overall discrepancy figure. Only 3189 of the 3840 samples had
    • Per-chromosome discrepancy figures can be found:
      • /net/sardinia/progenia/SardiNIA/VariantCalling_20150330/filteringSamples/chr*/*.diff.discordance_matrix
    • Overall discrepancy can be found:
      • IN PROCESS
  • Sample Name Conversion
    • The CSCT samples had two different names: a numeric one and an alphanumeric one. The conversion key can be found here:
      • /net/sardinia/progenia/SardiNIA/VariantCalling_20150330/INDELCALL/SampleNameChanges.txt
    • The Sardinia samples had a Sardinia ID and a Sequencing Core ID. The conversion key can be found here:
      • IN PROCESS
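
The Indel filtering levels listed under IndelCall Results can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the code that produced the files above: the function names are hypothetical, the per-sample AD counts are assumed to arrive as (R, A, U) tuples, the PL field is treated as the literal string shown on this page, and the 50% thresholds are read as "at least 50% of samples at a given Indel". Level 2 (the VNTR/Indel overlap annotation) is left out because it depends on the output of a separate program.

```python
def ad_informative(r, a, u):
    """A sample's AD is informative when U / (R + A + U) < 0.50."""
    total = r + a + u
    if total == 0:
        return False  # assumption: a sample with no reads is not informative
    return u / total < 0.50


def pl_informative(pl):
    """A sample's PL is informative unless it is ././. or 0/0/0."""
    return pl not in ("././.", "0/0/0")


def fraction_true(flags):
    """Fraction of True values; 0.0 for an empty input."""
    flags = list(flags)
    return sum(flags) / len(flags) if flags else 0.0


def passes_filters(ac, bf_lre_lud, ad_by_sample, pl_by_sample):
    """Apply levels 1, 3, 4 and 5 to one Indel (level 2 is omitted here)."""
    if ac < 1:  # level 1: drop Indels with AC=0
        return False
    if fraction_true(ad_informative(*ad) for ad in ad_by_sample) < 0.50:
        return False  # level 3: too few informative AD fields
    if fraction_true(pl_informative(pl) for pl in pl_by_sample) < 0.50:
        return False  # level 4: too few informative PL fields
    return bf_lre_lud > -10  # level 5: Bayes factor threshold


if __name__ == "__main__":
    # Toy values for three samples at one Indel; not real data.
    ad = [(10, 5, 1), (0, 0, 4), (8, 0, 0)]
    pl = ["0/30/200", "././.", "255/20/0"]
    print(passes_filters(ac=3, bf_lre_lud=-2.5, ad_by_sample=ad, pl_by_sample=pl))
```

Keeping each level as a separate check mirrors the separate per-level output files listed above, so the number of Indels removed at each level could be tallied independently.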

What is Complete

Future Directions

Key References

If you are looking to learn about the project, I strongly recommend that you read the following papers:

  • Pilia G, Chen WM, Scuteri A, Orru M, Albai G, Dei M, Lai S, Usala G, Lai M, Loi P, Mameli C, Vacca L, Deiana M, Olla N, Masala M, Cao A, Najjar SS, Terracciano A, Nedorezov T, Sharov A, Zonderman AB, Abecasis GR, Costa P, Lakatta E and Schlessinger D (2006). Heritability of Cardiovascular and Personality Traits in 6,148 Sardinians. PLoS Genet 2:1207-1223 [Abstract and PDF]
  • Scuteri A, Sanna S, Chen WM, Uda M, Albai G, Strait J, Najjar S, Nagarajah R, Orru M, Usala G, Dei M, Lai S, Maschio A, Busonero F, Mulas A, Ehret GB, Fink AA, Weder A, Cooper R, Galan P, Chakravarti A, Schlessinger D, Cao A, Lakatta E and Abecasis GR (2007). Genome Wide Association Scan shows Genetic Variants in the FTO gene are Associated with Obesity Related Traits. PLoS Genet 3:1200-10 [Abstract and PDF]
  • Chen WM and Abecasis GR (2007). Family-based association tests for genomewide association scans. Am J Hum Genet 81:913-26 [Abstract and PDF]