Changes

From Genome Analysis Wiki
1,468 bytes added, 11:55, 6 July 2016
=== Future Directions ===
* Sample Filtering
** We did not filter samples (on dupRate, genome coverage, mapping rate, properly paired reads, mean depth, or any other QPLOT statistic) before SNP and indel calling, so we want to apply that filtering now. Of the 3,839 samples, 3,188 have genome chip data from a few years ago. For these, we could compute the non-reference concordance between the chip genotypes and the sequencing genotypes and declare a sample 'bad' if it falls below a threshold, such as 98% non-ref concordance. However, the remaining 651 samples have no chip data, so this is not an option for them. We therefore decided on the following strategy instead:
**# Calculate non-reference concordance for the 3,188 samples that have chip data.
**# Build a prediction model that uses QPLOT statistics as predictors of non-reference concordance. Either fit it on all 3,188 samples and look at R^2 (likely inflated by overfitting), or use cross-validation (separate training and test sets) to measure external predictive power.
**# If the model shows reasonable predictive power (R^2), use it to estimate the non-reference concordance for the 651 samples that do not have chip data. Also use it to estimate the non-reference concordance for the 3,188 samples that do have chip data.
**# Set a cut-off for 'good' versus 'bad' samples based on the estimated non-reference concordance and use it to filter samples.
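
The strategy above could be sketched roughly as follows. This is only an illustration under stated assumptions, not the pipeline that was actually run: genotypes are assumed to be coded as alternate-allele counts (0, 1, 2, with <code>None</code> for a missing call), non-reference concordance is assumed to mean the fraction of matching calls among sites where at least one of the two calls is non-reference, and the prediction model is assumed to be an ordinary least-squares regression on QPLOT statistics. All function names (<code>non_ref_concordance</code>, <code>fit_ols</code>, <code>cv_r2</code>, <code>flag_bad</code>) are hypothetical.

```python
import random

def non_ref_concordance(chip, seq):
    """Fraction of matching calls among sites where either call is non-reference.

    Assumption: genotypes are alt-allele counts (0, 1, 2); None is a missing call.
    Sites that are missing in either source, or hom-ref in both, are skipped.
    """
    match = total = 0
    for c, s in zip(chip, seq):
        if c is None or s is None or (c == 0 and s == 0):
            continue
        total += 1
        match += (c == s)
    return match / total if total else float("nan")

def fit_ols(X, y):
    """Least-squares fit with intercept (normal equations, Gauss-Jordan).

    Assumes the predictors are not collinear; a singular system will raise
    ZeroDivisionError.
    """
    rows = [[1.0] + list(x) for x in X]
    p = len(rows[0])
    # Augmented matrix [X'X | X'y]
    A = [[sum(r[i] * r[j] for r in rows) for j in range(p)]
         + [sum(r[i] * t for r, t in zip(rows, y))] for i in range(p)]
    for col in range(p):
        piv = max(range(col, p), key=lambda r: abs(A[r][col]))
        A[col], A[piv] = A[piv], A[col]
        for r in range(p):
            if r != col:
                f = A[r][col] / A[col][col]
                A[r] = [a - f * b for a, b in zip(A[r], A[col])]
    return [A[i][p] / A[i][i] for i in range(p)]

def predict(beta, x):
    """Predicted concordance for one sample's QPLOT statistics."""
    return beta[0] + sum(b * v for b, v in zip(beta[1:], x))

def cv_r2(X, y, k=5, seed=0):
    """R^2 computed from out-of-fold predictions in k-fold cross-validation."""
    idx = list(range(len(y)))
    random.Random(seed).shuffle(idx)
    preds = [0.0] * len(y)
    for fold in range(k):
        test = set(idx[fold::k])
        train = [i for i in idx if i not in test]
        beta = fit_ols([X[i] for i in train], [y[i] for i in train])
        for i in test:
            preds[i] = predict(beta, X[i])
    mean_y = sum(y) / len(y)
    ss_res = sum((a - b) ** 2 for a, b in zip(y, preds))
    ss_tot = sum((a - mean_y) ** 2 for a in y)
    return 1 - ss_res / ss_tot

def flag_bad(beta, qplot_rows, threshold=0.98):
    """True for samples whose estimated concordance falls below the cut-off."""
    return [predict(beta, x) < threshold for x in qplot_rows]
```

In this sketch, <code>cv_r2</code> would be run on the 3,188 samples with chip data; if its value looks reasonable, <code>fit_ols</code> on those same samples gives the coefficients that <code>flag_bad</code> applies to all 3,839 samples. The 0.98 default simply mirrors the 98% example threshold mentioned above and is not a recommended value.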
    
== Key References ==