Changes

From Genome Analysis Wiki
1,811 bytes added, 15:21, 4 April 2017
Line 21: Line 21:  
=== Locations of Files for Current Data Freeze of 3839 Samples===
 
=== Locations of Files for Current Data Freeze of 3839 Samples===
   −
NOTICE: We identified late in the process that two of the samples (22855 and 22385)  were actually the same individual. They both should be the same individual 22855. Therefore, there are 3840 sample IDs in each of the files below, but only 22855 should move on to later processes. In future data freezes with this data, these two sequencing sets should be merged into a single 22855 individual.
+
NOTICE: We identified late in the process that two of the samples (22855 and 22358) were actually the same individual; both should be treated as individual 22855. Therefore, there are 3840 sample IDs in each of the files below, but only 22855 should move on to later processes. In future data freezes with this data, these two sequencing sets should be merged into a single individual, 22855.
    
* List of '''Sample Numbers'''
 
* List of '''Sample Numbers'''
Line 77: Line 77:     
* '''Filtering Samples''': Discrepancy Figures
 
* '''Filtering Samples''': Discrepancy Figures
** We compared chip data from previous work to the sequencing data produced now in hopes of identifying which samples have reliable sequencing data. This comparison was done for each chromosome separately and then combined into an overall discrepancy figure. Only 3189 of the 3840 samples (3188 of 3839 if you throw out sample with two sets of data) had  
+
** We compared chip data from previous work to the newly produced sequencing data to identify which samples have reliable sequencing data. This comparison was done for each chromosome separately and then combined into an overall discrepancy figure. Only 3189 of the 3840 samples (3188 of 3839 if you throw out the sample with two sets of data) had chip data.
** By chromosome discrepancy figures can be found:
+
** By-chromosome discrepancy figures for each sample can be found at:
 
*** /net/sardinia/progenia/SardiNIA/VariantCalling_20150330/filteringSamples/chr*/*.diff.discordance_matrix
 
*** /net/sardinia/progenia/SardiNIA/VariantCalling_20150330/filteringSamples/chr*/*.diff.discordance_matrix
 
+
** By-chromosome discrepancy figures for all samples in one file can be found at:
 +
*** /net/sardinia/progenia/SardiNIA/VariantCalling_20150330/filteringSamples/chr*/all.discordance.summary.txt
    
=== What is Complete ===
 
=== What is Complete ===
Line 92: Line 93:     
=== Future Directions ===
 
=== Future Directions ===
 +
* NOTE: For sample filtering below, you will need to finish chromosome 1 for me. I have it currently running. Once it is done in a few days, you will need to run the command 'python calculateConcordance_onefile.py' while in the '/net/sardinia/progenia/SardiNIA/VariantCalling_20150330/filteringSamples/chr1/' directory. <b>--> Complete </b>
 
* '''Sample Filtering'''
 
* '''Sample Filtering'''
 
** We did not do any filtering of samples (based on dupRate, genome coverage, mapping rate, proper paired, mean depth, or any other QPLOT stats) prior to SNP and Indel calling. Because of this, we want to do this filtering now. 3,188 of 3,839 samples have genome chip data from a few years ago. For these, we could look at the non-reference concordance between the chip genotypes and the sequencing genotypes and declare 'bad' samples to be those that fall below a certain threshold, such as 98% non-ref concordance. However, since the remaining 651 samples do not have chip data, this is not an option for them. Therefore, we decided on the following strategy instead:  
 
** We did not do any filtering of samples (based on dupRate, genome coverage, mapping rate, proper paired, mean depth, or any other QPLOT stats) prior to SNP and Indel calling. Because of this, we want to do this filtering now. 3,188 of 3,839 samples have genome chip data from a few years ago. For these, we could look at the non-reference concordance between the chip genotypes and the sequencing genotypes and declare 'bad' samples to be those that fall below a certain threshold, such as 98% non-ref concordance. However, since the remaining 651 samples do not have chip data, this is not an option for them. Therefore, we decided on the following strategy instead:  
Line 98: Line 100:  
**# If reasonable predictive power/R^2, use the prediction model to estimate the non-reference concordance amongst the 651 samples that do not have chip data. Also use the prediction model to estimate the non-reference concordance among the 3,188 samples that do have chip data.
 
**# If reasonable predictive power/R^2, use the prediction model to estimate the non-reference concordance amongst the 651 samples that do not have chip data. Also use the prediction model to estimate the non-reference concordance among the 3,188 samples that do have chip data.
 
**# Set a cut-off for 'good' versus 'bad' samples based on the estimated non-reference concordance and use it to filter samples.
 
**# Set a cut-off for 'good' versus 'bad' samples based on the estimated non-reference concordance and use it to filter samples.
 +
** NOTE: The per-chromosome counts of positions for each combination of chip and sequencing genotype (chip 0/0 and sequencing 0/0, chip 0/0 and sequencing 0/1, chip 0/0 and sequencing 1/1, chip 0/1 and sequencing 0/0, etc.) can be found in the files /net/sardinia/progenia/SardiNIA/VariantCalling_20150330/filteringSamples/chr*/all.discordance.summary.txt. These can be used to calculate overall non-reference concordance across all chromosomes.
 +
* '''Mitochondrial Depth Analysis'''
 +
* '''Telomere Length Analysis'''
 +
** Investigate the associations between telomere length (an indicator of aging) and variants. This is likely to be interesting in the Sardinian population because of the longer lifespans and the centenarians among Sardinians.
 +
* '''Phenotype Study'''
 +
** Likely will not yield much, because few samples have been added since Carlo's last data freeze (3,514 samples there). <b>--> GWASs on 120 Visit 1 traits Complete </b>
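The Sample Filtering note above says the per-chromosome genotype-pair counts can be combined into an overall non-reference concordance. The following is a minimal sketch of that arithmetic, not the method used in calculateConcordance_onefile.py. It assumes the counts have already been read into dictionaries keyed by (chip genotype, sequencing genotype); the layout of all.discordance.summary.txt is not described here, so no parser is shown. The function names are hypothetical, and the definition used (matching non-reference calls divided by all compared positions except those where both calls are 0/0) is one common convention and is an assumption. The 98% threshold is the example value mentioned above.

```python
from collections import Counter

HOM_REF_PAIR = ("0/0", "0/0")


def sum_counts(per_chrom_counts):
    """Add up (chip, sequencing) genotype-pair counts over all chromosomes."""
    total = Counter()
    for counts in per_chrom_counts:
        total.update(counts)  # Counter.update adds counts key by key
    return total


def non_ref_concordance(counts):
    """Assumed definition: matching non-reference calls divided by all
    compared positions except those where chip and sequencing are both 0/0."""
    denominator = sum(n for pair, n in counts.items() if pair != HOM_REF_PAIR)
    if denominator == 0:
        return None  # no informative positions for this sample
    matches = counts.get(("0/1", "0/1"), 0) + counts.get(("1/1", "1/1"), 0)
    return matches / denominator


def passes_threshold(concordance, threshold=0.98):
    """True if the sample meets the example 98% cut-off."""
    return concordance is not None and concordance >= threshold


# Made-up counts for one sample on two chromosomes (illustration only).
chr1 = {("0/0", "0/0"): 900, ("0/1", "0/1"): 60, ("1/1", "1/1"): 30,
        ("0/1", "0/0"): 5, ("0/0", "0/1"): 5}
chr2 = {("0/0", "0/0"): 800, ("0/1", "0/1"): 50, ("1/1", "1/1"): 40,
        ("1/1", "0/1"): 10}

overall = non_ref_concordance(sum_counts([chr1, chr2]))
print(overall, passes_threshold(overall))
```

Summing the counts first and dividing once weights each chromosome by its number of informative positions, which differs from averaging the per-chromosome concordance values.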
 +
 +
== New Updates ==
 +
* The duplicate person in the sequencing files was removed, and those files were used to impute the phased genotyping files.
 +
* Imputation was conducted using Minimac3. Both SNPs and indels were imputed.
 +
* Merged files: /net/wonderland/home/csidore/Epacts_vcfs/Michelle_panel/newpanel.chr*.vcf.gz
    
== Key References ==
 
== Key References ==
