
Project Leaders

  • David Schlessinger (National Institute on Aging, Baltimore)
  • Manuela Uda (National Research Center, Cagliari, Italy)
  • Goncalo Abecasis (University of Michigan, Ann Arbor)

Genomewide Association Study

We carried out an initial genomewide association scan by genotyping 1405 samples with Affymetrix 500K SNP arrays. Then, because SardiNIA samples are closely related to each other, we were able to use these genotypes to impute the genomes of many close relatives who were genotyped with Affymetrix 10K SNP arrays. Our SardiNIA GWAS analyses thus typically include approximately 4300 genotyped or imputed individuals. To learn more about the approach, see Chen and Abecasis (2007) and Scuteri et al. (2007).

Planned Updates

Several ongoing experiments are expected to gradually improve our GWAS data. First, we expect to integrate Affymetrix 1M SNP chip genotypes into our analyses. Second, we plan to genotype all sampled individuals with the Metabochip, which includes 200,000 SNPs. This will enable us to fill in genotypes for relatives of those genotyped with denser arrays more accurately, because we will more precisely identify shared stretches of chromosome.

Medical Sequencing Project

We are sequencing the genomes of 1,000 individuals to learn about the genetics of blood lipid levels and personality.

Status as of July 2018

Locations of Files for Current Data Freeze of 3839 Samples

NOTICE: Late in the process, we identified that two of the sample IDs (22855 and 22358) belong to the same individual, who should be identified as 22855. The files below therefore each contain 3840 sample IDs, but only 22855 should move on to later processes. In future data freezes of this data, the two sequencing sets should be merged into a single individual, 22855. For association analyses, 22358 has been removed.

  • List of Sample Numbers
    • The following file contains three columns: [SampleID used in these analyses] [ID supplied by CSCT or Sardinia or other project] [Sequencing core ID (if different)]:
      • /net/sardinia/progenia/SardiNIA/VariantCalling_20150330/sampleIDConversion.txt
  • List of paths to BAMs used in this data freeze (Index file)
    • /net/sardinia/progenia/SardiNIA/VariantCalling_20150330/SNPCALL/FINAL_index_20150504.index
    • 401 of these samples have some new BAM contribution since the previous data freeze... their BAMs can be found here: /net/sardinia/progenia/SardiNIA/VariantCalling_20150330/newsamples.index
  • Pedigree (Not too helpful -- used for SNPCall)
    • All Samples: /net/sardinia/progenia/SardiNIA/VariantCalling_20150330/SNPCALL/FINAL_ped_20150510.ped
    • Disjoint Trios: /net/sardinia/progenia/SardiNIA/VariantCalling_20150330/INDELCALL/filterIndels/MiLK_filtering/Pedigree_Fall15_DataFreeze_Triplets.ped
  • QC Summary
    • /net/sardinia/progenia/SardiNIA/VariantCalling_20150330/GeneratingQCDistributions/QCStats.txt
    • For a list of paths to all of the separate QPLOT files for each sample, see the file: /net/sardinia/progenia/SardiNIA/VariantCalling_20150330/GeneratingQCDistributions/QCFileListFinal.txt
  • SNPCall Results (Produced with Gotcloud SNPCall and phased using Beagle4)
    • /net/sardinia/progenia/SardiNIA/VariantCalling_20150330/SNPCALL/beagle4/chr*/chr*.filtered.PASS.beagled.vcf.gz
  • IndelCall Results (Produced with Gotcloud Indel)
    • /net/sardinia/progenia/SardiNIA/VariantCalling_20150330/INDELCALL/indel/final/all.genotypes.sites.vcf.gz
    • SNPEff and VEP declarations of Indel types can be found:
      • SNPEff: /net/sardinia/progenia/SardiNIA/VariantCalling_20150330/INDELCALL/snpEff/*
      • VEP: /net/sardinia/progenia/SardiNIA/VariantCalling_20150330/INDELCALL/VEP/*
    • We used an Indel filtering strategy composed of many levels.
      1. AC must be 1 or greater -- eliminate Indels with AC=0
      2. the Indel should overlap with a VNTR region or overlap with another Indel. We used Adrian's annotate indels program to identify such overlaps. Results for Indels on all chromosomes can be found here:
        • /net/sardinia/progenia/SardiNIA/VariantCalling_20150330/INDELCALL/filterIndels/AnnotateIndels/All.annotated.sites.vcf.gz
        • /net/sardinia/progenia/SardiNIA/VariantCalling_20150330/INDELCALL/filterIndels/AnnotateIndels/Overlaps.txt
      3. at least 50% of the Indels should have informative AD field (we define "informative" to mean that the sample has U/(R+A+U)<0.50). Results for Indels on all chromosomes can be found here:
        • /net/sardinia/progenia/SardiNIA/VariantCalling_20150330/INDELCALL/filterIndels/filter2_output.txt
        • /net/sardinia/progenia/SardiNIA/VariantCalling_20150330/INDELCALL/filterIndels/INDEL_filtering_final.txt
      4. at least 50% of the Indels should have informative PL field (we define "informative" to mean that the PL field for the sample is anything BUT ././. or 0/0/0). Results for Indels on all chromosomes can be found here:
        • /net/sardinia/progenia/SardiNIA/VariantCalling_20150330/INDELCALL/filterIndels/filter1_output.txt
        • /net/sardinia/progenia/SardiNIA/VariantCalling_20150330/INDELCALL/filterIndels/INDEL_filtering_final.txt
      5. the Indel needs to have BF_LRE_LUD (a Bayes factor comparing a. related & HWE to b. unrelated & HWD) > -10. Higher BF_LRE_LUD should indicate a better Indel. We used Hyun's MiLK program to obtain BF_LRE_LUD values. Results for Indels on all chromosomes can be found here:
        • /net/sardinia/progenia/SardiNIA/VariantCalling_20150330/INDELCALL/filterIndels/MiLK_filtering/all.genotypes.milk.sites.vcf
      • Overall results from all of the above filtering can be found here:
        • /net/sardinia/progenia/SardiNIA/VariantCalling_20150330/INDELCALL/filterIndels/FINAL/FinalIndelFilteringStatistics.txt
      • VCFs of Indels after filtering can be found here:
        • /net/sardinia/progenia/SardiNIA/VariantCalling_20150330/INDELCALL/filterIndels/FINAL/all.genotypes.*.PASS.vcf.gz
        • /net/sardinia/progenia/SardiNIA/VariantCalling_20150330/INDELCALL/filterIndels/FINAL/all.genotypes.PASS.vcf.gz
  • Indel VCF merged with the Beagle-phased SNP VCF, then sorted, to produce the following VCFs:
    • /net/sardinia/progenia/SardiNIA/VariantCalling_20150330/SNPINDEL/chr*.PASS.beagled.vcf.gz
  • Indel and SNP VCFs that have been combined and then run through Beagle4 again together. These are the latest VCFs:
    • /net/sardinia/progenia/SardiNIA/VariantCalling_20150330/SNPINDEL/beagle4/beagle4_chr*/beagle4/chr*/chr*.filtered.PASS.beagled.vcf.gz
  • mtDNA Copy Number Summary
    • /net/sardinia/progenia/SardiNIA/VariantCalling_20150330/MTDNA_COPYNUMBER/Coverage_IncludingUnpairedReads/sardiniaCopyNumber_include.txt
    • Per-sample copy number results can be found: /net/sardinia/progenia/SardiNIA/VariantCalling_20150330/MTDNA_COPYNUMBER/Coverage_IncludingUnpairedReads/*.CopyNumber.noRand.500000.*.txt
  • Filtering Samples: Discrepancy Figures
    • We compared chip data from previous work to the new sequencing data in the hope of identifying which samples have reliable sequencing data. This comparison was done for each chromosome separately and then combined into an overall discrepancy figure. Only 3189 of the 3840 samples (3188 of 3839 once the sample with two sets of data is excluded) had chip data.
    • By chromosome discrepancy figures for each sample can be found:
      • /net/sardinia/progenia/SardiNIA/VariantCalling_20150330/filteringSamples/chr*/*.diff.discordance_matrix
    • By chromosome discrepancy figures for all samples in one file can be found:
      • /net/sardinia/progenia/SardiNIA/VariantCalling_20150330/filteringSamples/chr*/all.discordance.summary.txt
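The indel filtering levels listed above under IndelCall Results can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the code that produced the files: it reads "at least 50%" as a fraction of samples at each indel site, takes AD as a (R, A, U) triple of counts, accepts PL strings separated by either `/` or `,`, and leaves out level 2 because that level depends on the output of the external annotation program. All function names are hypothetical.

```python
import re
from typing import List, Tuple


def ad_is_informative(ad: Tuple[int, int, int]) -> bool:
    """AD is informative when U / (R + A + U) < 0.50.

    Assumption: a sample with no counts at all is treated as uninformative.
    """
    r, a, u = ad
    total = r + a + u
    return total > 0 and u / total < 0.50


def pl_is_informative(pl: str) -> bool:
    """PL is informative when it is neither missing nor all zeros.

    Assumption: any '.' entry marks the whole field as missing.
    """
    parts = [p for p in re.split(r"[,/]", pl.strip()) if p != ""]
    if not parts or any(p == "." for p in parts):
        return False
    return any(int(p) != 0 for p in parts)


def fraction_true(flags: List[bool]) -> float:
    """Fraction of True values; 0.0 for an empty list."""
    return sum(flags) / len(flags) if flags else 0.0


def indel_passes(ac: int,
                 ads: List[Tuple[int, int, int]],
                 pls: List[str],
                 bf_lre_lud: float) -> bool:
    """Apply levels 1, 3, 4 and 5 to one indel site (level 2 is omitted)."""
    if ac < 1:                                                      # level 1
        return False
    if fraction_true([ad_is_informative(x) for x in ads]) < 0.50:   # level 3
        return False
    if fraction_true([pl_is_informative(x) for x in pls]) < 0.50:   # level 4
        return False
    return bf_lre_lud > -10                                         # level 5


# Made-up values for one site with four samples
example_ads = [(10, 5, 1), (0, 0, 4), (8, 0, 0), (3, 3, 10)]
example_pls = ["0,30,255", "././.", "0/0/0", "12,0,40"]
print(indel_passes(ac=3, ads=example_ads, pls=example_pls, bf_lre_lud=-3.2))
```

In practice the AD, PL, and BF_LRE_LUD values would be read from the VCFs listed above; the parsing step is left out here because the exact field layout in those files is not described on this page.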

What is Complete

  • SNP Call
    • 24,901,469 SNPs passed filters
      • 16,822,922 are in dbSNP (67.6%)
      • Known Ts/Tv: 2.24
      • Novel Ts/Tv: 1.95
  • InDel Call
    • 1,194,945 passed filters

Future Directions

  • NOTE: For sample filtering below, you will need to finish chromosome 1 for me. I have it currently running. Once it is done in a few days, you will need to run the command 'python' while in the '/net/sardinia/progenia/SardiNIA/VariantCalling_20150330/filteringSamples/chr1/' directory. --> Complete
  • Sample Filtering
    • We did not do any filtering of samples (based on dupRate, genome coverage, mapping rate, proper paired, mean depth, or any other QPLOT stats) prior to SNP and Indel calling. Because of this, we want to do this filtering now. 3,188 of the 3,839 samples have genome chip data from a few years ago. For these, we could look at the non-reference concordance between the chip genotypes and the sequencing genotypes and declare 'bad' samples to be those that fall below a certain threshold, such as 98% non-ref concordance. However, since the remaining 651 samples do not have chip data, this is not an option for them. Therefore, we decided on the following strategy instead:
      1. Calculate non-reference concordance for the 3,188 samples that have chip data.
      2. Create a prediction model using QPLOT statistics as predictors of non-reference concordance. Either do so on all of the 3,188 samples and look at R^2 (likely inflated from overfitting) or use cross-validation (test and training set) to give a measure of external predictive power.
      3. If reasonable predictive power/R^2, use the prediction model to estimate the non-reference concordance amongst the 651 samples that do not have chip data. Also use the prediction model to estimate the non-reference concordance among the 3,188 samples that do have chip data.
      4. Set a cut-off for 'good' versus 'bad' samples based on the estimated non-reference concordance and use it to filter samples.
    • NOTE: The by-chromosome counts of positions for each combination of chip and sequencing genotype (chip 0/0 and sequencing 0/0, chip 0/0 and sequencing 0/1, chip 0/0 and sequencing 1/1, chip 0/1 and sequencing 0/0, and so on) can be found in the files /net/sardinia/progenia/SardiNIA/VariantCalling_20150330/filteringSamples/chr*/all.discordance.summary.txt. These can be used to calculate overall non-reference concordance across all chromosomes.
  • Mitochondrial Depth Analysis
  • Telomere Length Analysis
    • Investigate the associations between telomere length (an indicator of aging) and variants. This is likely to be interesting in the Sardinian population because Sardinians have longer lifespans and the population includes many centenarians.
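For the sample filtering strategy above, step 1 (non-reference concordance per sample, combined across chromosomes) and step 4 (applying a cut-off) can be sketched as follows. The sketch assumes the per-chromosome counts have already been loaded into 3x3 tables indexed as [chip genotype][sequencing genotype] in the order 0/0, 0/1, 1/1; the layout of all.discordance.summary.txt is not described here, so no parser is shown. It also assumes one common definition of non-reference concordance: matching 0/1 and 1/1 calls divided by all positions where chip and sequencing are not both 0/0. The sample IDs and counts are made up, and the function names are hypothetical.

```python
from typing import Dict, List, Optional

Matrix = List[List[int]]  # 3x3: rows = chip genotype, columns = sequencing genotype


def sum_matrices(per_chromosome: List[Matrix]) -> Matrix:
    """Add per-chromosome count tables into one genome-wide table."""
    total = [[0, 0, 0] for _ in range(3)]
    for matrix in per_chromosome:
        for i in range(3):
            for j in range(3):
                total[i][j] += matrix[i][j]
    return total


def non_ref_concordance(counts: Matrix) -> Optional[float]:
    """Matching non-reference calls over all positions not 0/0 in both sources."""
    all_positions = sum(sum(row) for row in counts)
    non_ref_positions = all_positions - counts[0][0]
    if non_ref_positions == 0:
        return None  # undefined when there are no non-reference positions
    agreeing = counts[1][1] + counts[2][2]
    return agreeing / non_ref_positions


def flag_bad_samples(concordance: Dict[str, float], threshold: float = 0.98) -> List[str]:
    """Return IDs whose concordance falls below the threshold (98% is the example cut-off from the text)."""
    return sorted(s for s, value in concordance.items() if value < threshold)


# Made-up counts for one sample on two chromosomes
chr1 = [[100, 2, 0], [1, 40, 1], [0, 1, 20]]
chr2 = [[50, 0, 0], [0, 10, 0], [0, 0, 5]]
overall = sum_matrices([chr1, chr2])
nrc = non_ref_concordance(overall)
print(overall, nrc)

# Hypothetical sample IDs
print(flag_bad_samples({"sampleA": 0.99, "sampleB": nrc}))
```

The same `flag_bad_samples` call would apply unchanged to concordance values estimated by the prediction model in step 3, since it only needs one number per sample.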

New Updates

  • The duplicate person in the sequencing files was removed, and those files were used to impute the phased genotyping files.
  • Imputation was conducted using Minimac3. Both SNPs and indels were imputed.
  • Merged files: /net/wonderland/home/csidore/Epacts_vcfs/Michelle_panel/newpanel.chr*.vcf.gz

Key References

If you are looking to learn about the project, I strongly recommend that you read the following papers:

  • Pilia G, Chen WM, Scuteri A, Orru M, Albai G, Dei M, Lai S, Usala G, Lai M, Loi P, Mameli C, Vacca L, Deiana M, Olla N, Masala M, Cao A, Najjar SS, Terracciano A, Nedorezov T, Sharov A, Zonderman AB, Abecasis GR, Costa P, Lakatta E and Schlessinger D (2006). Heritability of Cardiovascular and Personality Traits in 6,148 Sardinians. PLoS Genet 2:1207-1223.
  • Scuteri A, Sanna S, Chen WM, Uda M, Albai G, Strait J, Najjar S, Nagarajah R, Orru M, Usala G, Dei M, Lai S, Maschio A, Busonero F, Mulas A, Ehret GB, Fink AA, Weder A, Cooper R, Galan P, Chakravarti A, Schlessinger D, Cao A, Lakatta E and Abecasis GR (2007). Genome Wide Association Scan shows Genetic Variants in the FTO gene are Associated with Obesity Related Traits. PLoS Genet 3:1200-10.
  • Chen WM and Abecasis GR (2007). Family-based association tests for genomewide association scans. Am J Hum Genet 81:913-26.