Difference between revisions of "UMAKE-glfSingle"

From Genome Analysis Wiki

Latest revision as of 12:32, 22 January 2014

This is a modification of UMAKE to incorporate an individual-based variant caller in the pipeline.

The idea is to use glfSingle to generate a sample-specific VCF for each sample after pileup, and then to replace the glfMultiples step with a merging step. The merge produces a population VCF that looks the same as what glfMultiples would have output. Subsequent filtering and imputation steps can follow as usual.

Ingredients

  • Index file
    • Same as the original UMAKE index file
    • An example is at /net/wonderland/home/yancylo/bin/umake-glfSingle/umake-glfSingle.index
  • Configuration file
    • Same as the original UMAKE configuration file
    • An example is at /net/wonderland/home/yancylo/bin/umake-glfSingle/umake-glfSingle.conf
    • The glfSingle and merging steps are included by enabling these two steps:
 RUN_PILEUP = TRUE       # create GLF file from BAM, then individual VCF using glfSingle (existing pileups have to be redone)
 RUN_GLFMULTIPLES = TRUE # create unfiltered SNP calls: population VCF built by merging the glfSingle outputs
  • Perl script for generating Makefile
    • Found at /net/wonderland/home/yancylo/bin/umake-glfSingle/umake-glfSingle.pl
    • It is modified from umake.pl to call glfSingle and merge_glfS_vcf.py (which merges the single-sample VCFs)
      • These two programs are available at /net/wonderland/home/yancylo/bin/umake-glfSingle
    • To generate the Makefile for this new pipeline flow, run:
 perl /net/wonderland/home/yancylo/bin/umake-glfSingle/umake-glfSingle.pl --conf /net/wonderland/home/yancylo/bin/umake-glfSingle/umake-glfSingle.conf
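
Before generating the Makefile, it can be useful to confirm that both steps are enabled in the configuration file. The following is a minimal sketch, not part of UMAKE: it assumes the conf file uses the `KEY = VALUE # comment` layout shown above, and the helper names `parse_umake_conf` and `missing_steps` are hypothetical.

```python
def parse_umake_conf(text):
    """Parse 'KEY = VALUE  # comment' lines into a dict (assumed conf layout)."""
    settings = {}
    for line in text.splitlines():
        line = line.split("#", 1)[0].strip()  # drop trailing comments
        if not line or "=" not in line:
            continue
        key, value = line.split("=", 1)
        settings[key.strip()] = value.strip()
    return settings


def missing_steps(settings, required=("RUN_PILEUP", "RUN_GLFMULTIPLES")):
    """Return the required steps that are not set to TRUE."""
    return [key for key in required if settings.get(key, "").upper() != "TRUE"]


example_conf = """
RUN_PILEUP = TRUE        # GLF from BAM, then per-sample VCF
RUN_GLFMULTIPLES = FALSE # merging step is switched off here
"""
print(missing_steps(parse_umake_conf(example_conf)))  # ['RUN_GLFMULTIPLES']
```

An empty list means both steps needed by the glfSingle flow are switched on.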

Customization

  • To change paths to glfSingle and merge_glfS_vcf.py, go to the following lines of umake-glfSingle.pl:
    • line 1000 - my $cmd = "python [your-path]/merge_glfS_vcf.py --file-list $glfAlias --chr $chr --outfile $vcf > $vcf.log";
    • line 1073 - $cmd .= "\n\t".&getMosixCmd("[your-path]/glfSingle -g $smGlf -b $smVcf -l $allSMs[$i] --minMapQuality 0 --minDepth 1 --maxDepth 100000 --reference > $smVcf.log");
  • To apply the uniform Ts/Tv model to glfSingle, change the following line of umake-glfSingle.pl so that it calls glfSingle_ut instead of glfSingle and passes the --uniformTsTv option:
    • line 1073 - $cmd .= "\n\t".&getMosixCmd("/net/wonderland/home/yancylo/bin/umake-glfSingle/glfSingle_ut -g $smGlf -b $smVcf -l $allSMs[$i] --minMapQuality 0 --minDepth 1 --maxDepth 100000 --reference --uniformTsTv > $smVcf.log");
  • To delete the single-sample VCFs after merging, add the --d option to the command on line 1000 (run python merge_glfS_vcf.py to view the available options)
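
The customizations above all amount to changing the two command strings that the Perl script writes into the Makefile. As an illustration only, the sketch below assembles the same strings in Python using the options quoted on this page. The function names, the `bin_dir` parameter, and the example paths are hypothetical; the real commands are built inside umake-glfSingle.pl.

```python
import shlex


def glf_single_cmd(bin_dir, glf, vcf, sample, uniform_tstv=False):
    """Build the per-sample glfSingle command (options copied from this page)."""
    exe = "glfSingle_ut" if uniform_tstv else "glfSingle"
    args = [f"{bin_dir}/{exe}", "-g", glf, "-b", vcf, "-l", sample,
            "--minMapQuality", "0", "--minDepth", "1",
            "--maxDepth", "100000", "--reference"]
    if uniform_tstv:
        args.append("--uniformTsTv")
    return " ".join(shlex.quote(a) for a in args) + " > " + shlex.quote(vcf + ".log")


def merge_cmd(bin_dir, file_list, chrom, vcf, delete_single=False):
    """Build the merge command; --d is the delete option mentioned above."""
    args = ["python", f"{bin_dir}/merge_glfS_vcf.py", "--file-list", file_list,
            "--chr", chrom, "--outfile", vcf]
    if delete_single:
        args.append("--d")
    return " ".join(shlex.quote(a) for a in args) + " > " + shlex.quote(vcf + ".log")


# Hypothetical paths, for illustration only.
print(glf_single_cmd("/path/to/umake-glfSingle", "s1.glf", "s1.vcf", "S1", uniform_tstv=True))
print(merge_cmd("/path/to/umake-glfSingle", "glf.list", "20", "chr20.vcf", delete_single=True))
```

Keeping the install directory in a single parameter avoids editing two separate lines whenever the tools move.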

Remarks

  • The --reference option should be enabled for glfSingle so that it calls homozygous reference genotypes for each sample. This is necessary to distinguish between homref and missing genotypes during merging.
  • The merging program combines the individual-sample VCFs in small chunks of positions, so it does NOT create a memory issue even when merging large numbers of samples over big regions.
  • One potential concern with this pipeline is the number and size of additional files, since each sample now has its own set of VCFs (~2Gb for chr20 per sample).
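
The first two remarks can be illustrated with a toy merge. This is a sketch under simplifying assumptions and is not the logic of merge_glfS_vcf.py: each sample is reduced to a position-sorted list of (position, genotype) pairs, and the sketch streams one position at a time with `heapq.merge` rather than reading chunks of positions. It shows why per-sample homref calls matter: a sample with no record at a site can only be reported as missing.

```python
import heapq
from itertools import groupby

MISSING = "./."
HOMREF = "0/0"


def tagged(sample_index, records):
    """Attach the sample index to each (pos, genotype) record."""
    for pos, genotype in records:
        yield pos, sample_index, genotype


def merge_samples(per_sample_records):
    """Yield (pos, genotypes) for sites where at least one sample is non-reference."""
    n = len(per_sample_records)
    streams = [tagged(i, recs) for i, recs in enumerate(per_sample_records)]
    merged = heapq.merge(*streams)  # lazy: holds one pending record per sample
    for pos, group in groupby(merged, key=lambda rec: rec[0]):
        row = [MISSING] * n  # samples without a record stay missing
        for _, i, genotype in group:
            row[i] = genotype
        if any(g not in (HOMREF, MISSING) for g in row):
            yield pos, row


sample1 = [(100, "0/0"), (200, "0/1"), (300, "0/0")]
sample2 = [(200, "0/0"), (300, "1/1")]
for pos, row in merge_samples([sample1, sample2]):
    print(pos, row)
# 200 ['0/1', '0/0']
# 300 ['0/0', '1/1']
```

Because the inputs are consumed lazily, memory use in this sketch grows with the number of samples rather than with the size of the region.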

Contact

Please contact Yancy if you have any questions.