Changes

From Genome Analysis Wiki
8,643 bytes added ,  15:21, 4 April 2017
Line 17: Line 17:  
We are sequencing the genomes of 1,000 individuals to learn about the genetics of blood lipid levels and personality.
 
   −
== Status as of June 2016 ==
+
== Status as of July 2016 ==
   −
=== Locations of Files ===
+
=== Locations of Files for Current Data Freeze of 3839 Samples ===
   −
* List of Sample Numbers
+
NOTICE: Late in the process we found that two of the sample IDs (22855 and 22358) belong to the same individual, who should be identified as 22855. As a result, each of the files below contains 3840 sample IDs, but of these two IDs only 22855 should move on to later processing steps. In future data freezes of this data, the two sequencing sets should be merged into a single individual, 22855.
**
  −
* Recalibrated and deduped BAM Files (can be found in the following locations)
  −
** /net/sardinia/progenia/SardiNIA/Pula_Final/BAMS/by_sample/*.bam
  −
** /net/sardinia/progenia/SardiNIA/recal_20150210/*.recal.bam
  −
** /net/sardinia/progenia/SardiNIA/recal_20150330/*.recal.bam
  −
** /net/mrtoad/stuff.from.csgspare/gpistis/Deduped_recalibrated_BAMS/*.uniq.dedup.recal.bam
  −
** /net/sardinia/progenia/SardiNIA/bams/decoy_ref/by_sample/*.bam
  −
** /net/sardinia/progenia/SardiNIA/VariantCalling_20150330/RealigningCarlosSamples/Giorgio_Carlo_merged_20150412/*.CG.recal.merged.bam
  −
*** some BAM files were generated by Carlo and Giorgio separately for the same individual. These BAMs were merged, recalibrated, and deduped and can be found here
      +
* List of '''Sample Numbers'''
 +
** The following file contains three columns: [SampleID used in these analyses] [ID supplied by CSCT or Sardinia or other project] [Sequencing core ID (if different)]:
 +
*** /net/sardinia/progenia/SardiNIA/VariantCalling_20150330/sampleIDConversion.txt
    +
* List of paths to '''BAMs''' used in this data freeze (Index file)
 +
** /net/sardinia/progenia/SardiNIA/VariantCalling_20150330/SNPCALL/FINAL_index_20150504.index
 +
** 401 of these samples have new BAM contributions since the previous data freeze. Their BAMs can be found here: /net/sardinia/progenia/SardiNIA/VariantCalling_20150330/newsamples.index
 +
 +
* '''Pedigree''' (Not too helpful -- used for SNPCall)
 +
** All Samples: /net/sardinia/progenia/SardiNIA/VariantCalling_20150330/SNPCALL/FINAL_ped_20150510.ped
 +
** Disjoint Trios: /net/sardinia/progenia/SardiNIA/VariantCalling_20150330/INDELCALL/filterIndels/MiLK_filtering/Pedigree_Fall15_DataFreeze_Triplets.ped
 +
 +
* '''QC''' Summary
 +
** /net/sardinia/progenia/SardiNIA/VariantCalling_20150330/GeneratingQCDistributions/QCStats.txt
 +
** For a list of paths to all of the separate QPLOT files for each sample, see the file: /net/sardinia/progenia/SardiNIA/VariantCalling_20150330/GeneratingQCDistributions/QCFileListFinal.txt
 +
 +
* '''SNPCall''' Results (Produced with Gotcloud SNPCall and phased using Beagle4)
 +
** /net/sardinia/progenia/SardiNIA/VariantCalling_20150330/SNPCALL/beagle4/chr*/chr*.filtered.PASS.beagled.vcf.gz
 +
 +
* '''IndelCall''' Results (Produced with Gotcloud Indel)
 +
** /net/sardinia/progenia/SardiNIA/VariantCalling_20150330/INDELCALL/indel/final/all.genotypes.sites.vcf.gz
 +
** SNPEff and VEP annotations of Indel types can be found here:
 +
*** SNPEff: /net/sardinia/progenia/SardiNIA/VariantCalling_20150330/INDELCALL/snpEff/*
 +
*** VEP: /net/sardinia/progenia/SardiNIA/VariantCalling_20150330/INDELCALL/VEP/*
 +
** We used an '''Indel filtering strategy''' composed of many levels.
 +
**# AC must be 1 or greater -- eliminate Indels with AC=0
 +
**# the Indel should overlap with a VNTR region or overlap with another Indel. We used Adrian's annotate indels program to identify such overlaps. Results for Indels on all chromosomes can be found here:
 +
**#* /net/sardinia/progenia/SardiNIA/VariantCalling_20150330/INDELCALL/filterIndels/AnnotateIndels/All.annotated.sites.vcf.gz
 +
**#* /net/sardinia/progenia/SardiNIA/VariantCalling_20150330/INDELCALL/filterIndels/AnnotateIndels/Overlaps.txt
 +
**# at least 50% of the samples should have an informative AD field for the Indel (we define "informative" to mean that the sample has U/(R+A+U)<0.50). Results for Indels on all chromosomes can be found here:
 +
**#* /net/sardinia/progenia/SardiNIA/VariantCalling_20150330/INDELCALL/filterIndels/filter2_output.txt
 +
**#* /net/sardinia/progenia/SardiNIA/VariantCalling_20150330/INDELCALL/filterIndels/INDEL_filtering_final.txt
 +
**# at least 50% of the samples should have an informative PL field for the Indel (we define "informative" to mean that the PL field for the sample is anything BUT ././. or 0/0/0). Results for Indels on all chromosomes can be found here:
 +
**#* /net/sardinia/progenia/SardiNIA/VariantCalling_20150330/INDELCALL/filterIndels/filter1_output.txt
 +
**#* /net/sardinia/progenia/SardiNIA/VariantCalling_20150330/INDELCALL/filterIndels/INDEL_filtering_final.txt
 +
**# the Indel needs to have BF_LRE_LUD (a Bayes factor comparing a. related & HWE to b. unrelated & HWD) > -10. Higher BF_LRE_LUD should indicate a better Indel. We used Hyun's MiLK program to obtain BF_LRE_LUD values. Results for Indels on all chromosomes can be found here:
 +
**#* /net/sardinia/progenia/SardiNIA/VariantCalling_20150330/INDELCALL/filterIndels/MiLK_filtering/all.genotypes.milk.sites.vcf
 +
*** Overall results from all of the filtering steps above can be found here:
 +
**** /net/sardinia/progenia/SardiNIA/VariantCalling_20150330/INDELCALL/filterIndels/FINAL/FinalIndelFilteringStatistics.txt
 +
*** VCFs of Indels after filtering can be found here:
 +
**** /net/sardinia/progenia/SardiNIA/VariantCalling_20150330/INDELCALL/filterIndels/FINAL/all.genotypes.*.PASS.vcf.gz
 +
**** /net/sardinia/progenia/SardiNIA/VariantCalling_20150330/INDELCALL/filterIndels/FINAL/all.genotypes.PASS.vcf.gz
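The per-Indel checks in the filtering strategy above can be sketched in a few lines of Python. This is only an illustration, not the code that produced the files listed here. It makes several assumptions: the "at least 50%" criteria are read as a fraction of samples; a sample with no reads (R+A+U = 0) is treated as uninformative; the comma-separated PL spellings are accepted alongside the slash forms written above; and the function and argument names are hypothetical. The VNTR/Indel overlap step is left out because it depends on the external annotation program.

```python
# Sketch of the AC, AD, PL and BF_LRE_LUD checks for a single Indel.
# All names are hypothetical; thresholds are the ones quoted on this page.

# PL values treated as uninformative. The comma forms are an assumption.
UNINFORMATIVE_PL = {"././.", "0/0/0", ".,.,.", "0,0,0"}


def ad_informative(r, a, u):
    """AD is informative when U / (R + A + U) < 0.50."""
    total = r + a + u
    if total == 0:
        return False  # assumption: no reads means uninformative
    return u / total < 0.50


def pl_informative(pl):
    """PL is informative when it is anything but the missing/all-zero forms."""
    return pl not in UNINFORMATIVE_PL


def fraction_true(flags):
    """Fraction of True values; 0.0 for an empty input."""
    flags = list(flags)
    return sum(flags) / len(flags) if flags else 0.0


def passes_filters(ac, ad_by_sample, pl_by_sample, bf_lre_lud):
    """Return True when the Indel passes every check sketched here."""
    if ac < 1:
        return False  # eliminate Indels with AC=0
    if fraction_true(ad_informative(*ad) for ad in ad_by_sample) < 0.50:
        return False  # fewer than 50% of samples have informative AD
    if fraction_true(pl_informative(pl) for pl in pl_by_sample) < 0.50:
        return False  # fewer than 50% of samples have informative PL
    return bf_lre_lud > -10


# Toy input: four samples with (R, A, U) counts and PL strings.
example_ad = [(10, 5, 1), (8, 0, 0), (0, 0, 9), (4, 4, 2)]
example_pl = ["0/30/200", "././.", "12/0/80", "0/0/0"]
print(passes_filters(3, example_ad, example_pl, -2.5))
```

Keeping each criterion in its own small function makes it easy to change a threshold or a definition of "informative" without touching the others.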
 +
 +
* Merged the Indel VCF with the beagled SNP VCF, then sorted to get the following VCFs:
 +
** /net/sardinia/progenia/SardiNIA/VariantCalling_20150330/SNPINDEL/chr*.PASS.beagled.vcf.gz
 +
 +
* Indel and SNP VCFs that have been combined AND BEAGLED AGAIN TOGETHER using Beagle4. '''These are the latest VCFs''':
 +
** /net/sardinia/progenia/SardiNIA/VariantCalling_20150330/SNPINDEL/beagle4/beagle4_chr*/beagle4/chr*/chr*.filtered.PASS.beagled.vcf.gz
 +
 +
* '''mtDNA Copy Number''' Summary
 +
** /net/sardinia/progenia/SardiNIA/VariantCalling_20150330/MTDNA_COPYNUMBER/Coverage_IncludingUnpairedReads/sardiniaCopyNumber_include.txt
 +
** Per-sample copy number results can be found here: /net/sardinia/progenia/SardiNIA/VariantCalling_20150330/MTDNA_COPYNUMBER/Coverage_IncludingUnpairedReads/*.CopyNumber.noRand.500000.*.txt
 +
 +
* '''Filtering Samples''': Discrepancy Figures
 +
** We compared chip data from previous work with the newly produced sequencing data to identify which samples have reliable sequencing data. This comparison was done for each chromosome separately and then combined into an overall discrepancy figure. Only 3189 of the 3840 samples (3188 of 3839 if the duplicate ID for the individual with two sets of data is dropped) had chip data.
 +
** By chromosome discrepancy figures for each sample can be found:
 +
*** /net/sardinia/progenia/SardiNIA/VariantCalling_20150330/filteringSamples/chr*/*.diff.discordance_matrix
 +
** By chromosome discrepancy figures for all samples in one file can be found:
 +
*** /net/sardinia/progenia/SardiNIA/VariantCalling_20150330/filteringSamples/chr*/all.discordance.summary.txt
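As a rough illustration of how per-chromosome genotype counts could be combined into one overall figure for a sample, the Python sketch below sums the counts and computes non-reference concordance. The layout of the discordance files is not described on this page, so the sketch works on in-memory dictionaries keyed by (chip genotype, sequencing genotype). It also assumes a common definition of non-reference concordance: positions where both calls are 0/0 are ignored, and the remaining positions count as concordant when the two genotypes match. All names are hypothetical.

```python
from collections import Counter

# Key for positions where both chip and sequencing call homozygous reference.
REF_REF = ("0/0", "0/0")


def combine_chromosomes(per_chrom_counts):
    """Sum {(chip_gt, seq_gt): count} dictionaries across chromosomes."""
    total = Counter()
    for counts in per_chrom_counts:
        total.update(counts)  # Counter.update adds counts from a mapping
    return total


def non_ref_concordance(counts):
    """Fraction of matching genotypes among positions that are not 0/0 in both.

    Returns None when there are no such positions.
    """
    considered = {pair: n for pair, n in counts.items() if pair != REF_REF}
    denominator = sum(considered.values())
    if denominator == 0:
        return None
    agreeing = sum(n for (chip, seq), n in considered.items() if chip == seq)
    return agreeing / denominator


# Toy counts for one sample on two chromosomes (made-up numbers).
chr1 = {("0/0", "0/0"): 900, ("0/1", "0/1"): 50, ("1/1", "1/1"): 30,
        ("0/1", "0/0"): 2, ("0/0", "0/1"): 1, ("1/1", "0/1"): 1}
chr2 = {("0/0", "0/0"): 500, ("0/1", "0/1"): 20, ("0/1", "1/1"): 1}

overall = combine_chromosomes([chr1, chr2])
print(non_ref_concordance(overall))
```

Summing counts before dividing weights each chromosome by its number of compared positions, which is usually preferable to averaging per-chromosome ratios.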
    
=== What is Complete ===
 
 +
* SNP Call
 +
** 24,901,469 SNPs passed filters
 +
*** 16,822,922 are in dbSNP (67.6%)
 +
*** Known Ts/Tv: 2.24
 +
*** Novel Ts/Tv: 1.95
 +
* InDel Call
 +
** 1,194,945 passed filters
    
=== Future Directions ===
 
 +
* NOTE: For sample filtering below, you will need to finish chromosome 1 for me. I have it currently running. Once it is done in a few days, you will need to run the command 'python calculateConcordance_onefile.py' while in the '/net/sardinia/progenia/SardiNIA/VariantCalling_20150330/filteringSamples/chr1/' directory. <b>--> Complete </b>
 +
* '''Sample Filtering'''
 +
** We did not filter samples (on dupRate, genome coverage, mapping rate, proper paired, mean depth, or any other QPLOT statistic) before SNP and Indel calling, so we want to do this filtering now. 3,188 of 3,839 samples have genome chip data from a few years ago. For these, we could look at the non-reference concordance between the chip genotypes and the sequencing genotypes and declare 'bad' samples to be those that fall below a certain threshold, such as 98% non-ref concordance. However, the remaining 651 samples do not have chip data, so this is not an option for them. Therefore, we decided on the following strategy instead:
 +
**# Calculate non-reference concordance for the 3,188 samples that have chip data.
 +
**# Create a prediction model using QPLOT statistics as predictors of non-reference concordance. Either do so on all of the 3,188 samples and look at R^2 (likely inflated from overfitting) or use cross-validation (test and training set) to give a measure of external predictive power.
 +
**# If the predictive power/R^2 is reasonable, use the prediction model to estimate the non-reference concordance among the 651 samples that do not have chip data. Also use the prediction model to estimate the non-reference concordance among the 3,188 samples that do have chip data.
 +
**# Set a cut-off for 'good' versus 'bad' samples based on the estimated non-reference concordance and use it to filter samples.
 +
** NOTE: The per-chromosome counts of positions for each combination of chip and sequencing genotype (chip 0/0 and sequencing 0/0, chip 0/0 and sequencing 0/1, chip 0/0 and sequencing 1/1, chip 0/1 and sequencing 0/0, and so on) can be found in the files /net/sardinia/progenia/SardiNIA/VariantCalling_20150330/filteringSamples/chr*/all.discordance.summary.txt. These can be used to calculate overall non-reference concordance across all chromosomes.
 +
* '''Mitochondrial Depth Analysis'''
 +
* '''Telomere Length Analysis'''
 +
** Investigate the associations between telomere length (an indicator of aging) and variants. This is likely to be interesting in the Sardinian population because of the longer lifespans and the centenarians among Sardinians.
 +
* '''Phenotype Study'''
 +
** Likely will not yield much, because few samples have been added since Carlo's last data freeze (3,514 samples there) <b>--> GWASs on 120 Visit 1 traits Complete </b>
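The sample filtering strategy under Future Directions (fit a model on the samples with chip data, check its out-of-sample R^2, then apply a cut-off to predicted concordance) can be sketched as follows. This is a minimal sketch under stated assumptions, not the planned analysis: it uses a single predictor and ordinary least squares, whereas the plan calls for several QPLOT statistics; the choice of predictor, the function names, and the toy data are hypothetical; and the 0.98 cut-off is only the example threshold mentioned above.

```python
import random
from statistics import mean


def fit_line(xs, ys):
    """Least-squares slope and intercept for one predictor."""
    mx, my = mean(xs), mean(ys)
    sxx = sum((x - mx) ** 2 for x in xs)
    slope = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / sxx if sxx else 0.0
    return slope, my - slope * mx


def cross_validated_r2(xs, ys, k=5, seed=0):
    """k-fold cross-validation: returns (R^2 of held-out predictions, predictions)."""
    indices = list(range(len(xs)))
    random.Random(seed).shuffle(indices)
    folds = [indices[i::k] for i in range(k)]
    preds = [0.0] * len(xs)
    for fold in folds:
        held_out = set(fold)
        train = [i for i in indices if i not in held_out]
        slope, intercept = fit_line([xs[i] for i in train], [ys[i] for i in train])
        for i in fold:
            preds[i] = intercept + slope * xs[i]  # predict only unseen samples
    my = mean(ys)
    ss_res = sum((y - p) ** 2 for y, p in zip(ys, preds))
    ss_tot = sum((y - my) ** 2 for y in ys)
    r2 = 1 - ss_res / ss_tot if ss_tot else None
    return r2, preds


def flag_bad_samples(xs_no_chip, slope, intercept, cutoff=0.98):
    """True where predicted non-reference concordance falls below the cut-off."""
    return [intercept + slope * x < cutoff for x in xs_no_chip]


# Toy data: a made-up QC statistic and a perfectly linear concordance.
qc_stat = [i / 10 for i in range(50)]
concordance = [0.99 - 0.01 * x for x in qc_stat]

r2, _ = cross_validated_r2(qc_stat, concordance, k=5)
slope, intercept = fit_line(qc_stat, concordance)
print(r2, flag_bad_samples([0.2, 3.0], slope, intercept))
```

Because every prediction used for R^2 comes from a model that never saw that sample, the value is a measure of external predictive power rather than the in-sample fit.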
 +
 +
== New Updates ==
 +
* The duplicate person in the sequencing files was removed, and those files were used to impute the phased genotyping files.
 +
* Imputation was conducted using Minimac3. Both SNPs and indels were imputed.
 +
* Merged files: /net/wonderland/home/csidore/Epacts_vcfs/Michelle_panel/newpanel.chr*.vcf.gz
    
== Key References ==
 
