Changes

From Genome Analysis Wiki
Jump to navigationJump to search
Line 1: Line 1:  +
==Introduction==
 +
Main Workshop wiki page: [[SeqShop: December 2014]]
 +
 +
See [[Media:Seqshop cnv partb 2014 06.pdf|lecture slides]] for the lecture slides associated with this tutorial.
 +
 
== Goals of This Session ==
 
== Goals of This Session ==
* What we want to learn is calling large deletions using GenomeSTRiP implemented in [[GotCloud]] pipeline
+
* What we want to learn  
 
** How to prepare metadata for running GenomeSTRiP.
 
** How to prepare metadata for running GenomeSTRiP.
 
** How to perform variant discovery and filtering for large deletions
 
** How to perform variant discovery and filtering for large deletions
 
** How to perform genotyping for large deletions
 
** How to perform genotyping for large deletions
 
** How to perform variant discovery and filtering from third party sites.
 
** How to perform variant discovery and filtering from third party sites.
  −
Please refer to [[Media:Seqshop cnv partb 2014 06.pdf|Lecture slides]] for more general background.
      
== GenomeSTRiP ==
 
== GenomeSTRiP ==
Line 35: Line 38:  
* Setup an output directory
 
* Setup an output directory
 
** It will leave your output directory from the previous tutorial in tact.
 
** It will leave your output directory from the previous tutorial in tact.
  source /home/mktrost/seqshop/setup.txt
+
  source /net/seqshop-server/home/mktrost/seqshop/setup.txt
 
* You won't see any output after running <code>source</code>
 
* You won't see any output after running <code>source</code>
 
** It silently sets up your environment
 
** It silently sets up your environment
 
** If you want to view the detail of the setup, type
 
** If you want to view the detail of the setup, type
  less /home/mktrost/seqshop/setup.txt
+
  less /net/seqshop-server/home/mktrost/seqshop/setup.txt
 
and press 'q' to finish.
 
and press 'q' to finish.
   Line 50: Line 53:  
</div>
 
</div>
 
</div>
 
</div>
      
== Setup when running on your own outside of the SeqShop Workshop ==
 
== Setup when running on your own outside of the SeqShop Workshop ==
Line 78: Line 80:  
If you want to check if you still have the bam index file, run
 
If you want to check if you still have the bam index file, run
   −
  head ${OUT}/bam.index
+
  head ${OUT}/bam.list
    
<ul>
 
<ul>
Line 84: Line 86:  
<li>View Results</li>
 
<li>View Results</li>
 
<div class="mw-collapsible-content" style="width:800px">
 
<div class="mw-collapsible-content" style="width:800px">
  HG00641 ALL /net/seqshop-server/hmkang/out/bams/HG00641.recal.bam
+
  HG00641 /net/seqshop-server/home/hmkang/out/bams/HG00641.recal.bam
  HG00640 ALL /net/seqshop-server/hmkang/out/bams/HG00640.recal.bam
+
  HG00640 /net/seqshop-server/home/hmkang/out/bams/HG00640.recal.bam
  HG00551 ALL /net/seqshop-server/hmkang/out/bams/HG00551.recal.bam
+
  HG00551 /net/seqshop-server/home/hmkang/out/bams/HG00551.recal.bam
  HG00553 ALL /net/seqshop-server/hmkang/out/bams/HG00553.recal.bam
+
  HG00553 /net/seqshop-server/home/hmkang/out/bams/HG00553.recal.bam
  HG00554 ALL bams/HG00554.recal.bam
+
  HG00554 bams/HG00554.recal.bam
  HG00637 ALL bams/HG00637.recal.bam
+
  HG00637 bams/HG00637.recal.bam
  HG00638 ALL bams/HG00638.recal.bam
+
  HG00638 bams/HG00638.recal.bam
  HG00734 ALL bams/HG00734.recal.bam
+
  HG00734 bams/HG00734.recal.bam
  HG00736 ALL bams/HG00736.recal.bam
+
  HG00736 bams/HG00736.recal.bam
  HG00737 ALL bams/HG00737.recal.bam  
+
  HG00737 bams/HG00737.recal.bam
 
</div>
 
</div>
 
</div>
 
</div>
Line 100: Line 102:  
Also, make sure that you have only 62 samples (you did not append new files twice)
 
Also, make sure that you have only 62 samples (you did not append new files twice)
   −
  wc -l ${OUT}/bam.index
+
  wc -l ${OUT}/bam.list
    
Your expected output is similar to this.
 
Your expected output is similar to this.
   −
  62 /net/seqshop-server/hmkang/out/bam.index
+
  62 /net/seqshop-server/hmkang/out/bam.list
    
=== Reference Files ===
 
=== Reference Files ===
Line 126: Line 128:  
<li>View Results</li>
 
<li>View Results</li>
 
<div class="mw-collapsible-content" style="width:800px">
 
<div class="mw-collapsible-content" style="width:800px">
[[File:RefDir.png|500px]]
+
1000G_omni2.5.b37.sites.PASS.chr22.vcf.gz
</div>
+
1000G_omni2.5.b37.sites.PASS.chr22.vcf.gz.tbi
</div>
+
1000G.snps_indels.22.sites.bcf
</ul>
+
1000G.snps_indels.22.sites.bcf.csi
 
+
1kg.omni.chr22.36Mb.vcf.gz
 
+
1kg.pilot_release.merged.indels.sites.hg19.chr22.vcf
Additional Reference files required just for Structural Variation:
+
dbsnp.13147541variants.22.sites.bcf
  ls ${SS}/svtoolkit/ref
+
dbsnp.13147541variants.22.sites.bcf.csi
<ul>
+
dbsnp_135.b37.chr22.vcf.gz
<div class="mw-collapsible mw-collapsed" style="width:200px">
+
dbsnp_135.b37.chr22.vcf.gz.tbi
<li>View Results</li>
+
dbsnp.bcf.vcf
<div class="mw-collapsible-content" style="width:800px">
+
dbsnp.vcf.gz.vcf
  human_g1k_v37.chr22.mask.100.fasta  human_g1k_v37.chr22.mask.100.fasta.dict human_g1k_v37.chr22.mask.100.fasta.fai
+
gencode.cds.22.bed.gz
 +
  hapmap_3.3.b37.sites.chr22.vcf.gz
 +
hapmap_3.3.b37.sites.chr22.vcf.gz.tbi
 +
human.g1k.v37.chr22-bs.umfa
 +
human.g1k.v37.chr22.dict
 +
human.g1k.v37.chr22.fa
 +
human.g1k.v37.chr22.fa.amb
 +
human.g1k.v37.chr22.fa.ann
 +
human.g1k.v37.chr22.fa.bwt
 +
human.g1k.v37.chr22.fa.fai
 +
human.g1k.v37.chr22.fa.pac
 +
human.g1k.v37.chr22.fa.sa
 +
  human_g1k_v37.chr22.mask.100.fasta
 +
  human_g1k_v37.chr22.mask.100.fasta.fai
 +
human.g1k.v37.chr22.winsize100.gc
 +
  humgen_g1k_v37_ploidy.chr22.map
 +
indel.reference.txt
 +
mdust.22.bed.gz
 +
mills.208620indels.22.sites.bcf
 +
mills.208620indels.22.sites.bcf.csi
 +
mills_indels_hg19.22.sites.bcf
 
</div>
 
</div>
 
</div>
 
</div>
Line 145: Line 167:     
Parameters files required just for Structural Variation:
 
Parameters files required just for Structural Variation:
  ls ${SS}/svtoolkit/conf
+
  ls ${GC}/src/svtoolkit/conf
    
<ul>
 
<ul>
Line 151: Line 173:  
<li>View Results</li>
 
<li>View Results</li>
 
<div class="mw-collapsible-content" style="width:800px">
 
<div class="mw-collapsible-content" style="width:800px">
  genstrip_parameters.txt humgen_g1k_v37_ploidy.chr22.map  humgen_g1k_v37_ploidy.map
+
  genstrip_parameters.txt
 
</div>
 
</div>
 
</div>
 
</div>
Line 165: Line 187:     
Check out the GenomeStrip specific settings at the end of the configuration file
 
Check out the GenomeStrip specific settings at the end of the configuration file
  tail -n 8 ${SS}/gotcloud.conf
+
  tail -n 5 ${SS}/gotcloud.conf
    
<ul>
 
<ul>
Line 174: Line 196:  
  ## GenomeSTRIP
 
  ## GenomeSTRIP
 
  #############################
 
  #############################
GENOMESTRIP_OUT = $(OUT_DIR)/sv
+
  GENOMESTRIP_MASK_FASTA = $(REF_DIR)/human_g1k_v37.chr22.mask.100.fasta
GENOMESTRIP_SVTOOLKIT_DIR = svtoolkit
+
  GENOMESTRIP_PLOIDY_MAP = $(REF_DIR)/humgen_g1k_v37_ploidy.chr22.map
  GENOMESTRIP_MASK_FASTA = $(GENOMESTRIP_SVTOOLKIT_DIR)/ref/human_g1k_v37.chr22.mask.100.fasta
  −
  GENOMESTRIP_PLOIDY_MAP = $(GENOMESTRIP_SVTOOLKIT_DIR)/conf/humgen_g1k_v37_ploidy.chr22.map
  −
GENOMESTRIP_PARAM = $(GENOMESTRIP_SVTOOLKIT_DIR)/conf/genstrip_parameters.txt
   
</div>
 
</div>
 
</div>
 
</div>
Line 209: Line 228:  
In addition, if one wants to genotype structural variants from other structural variant caller, there is a step available.
 
In addition, if one wants to genotype structural variants from other structural variant caller, there is a step available.
 
* Third-party Genotyping and Filtering step : Perform genotyping on the variant sites specified by an input VCF, and also perform variant filtering.
 
* Third-party Genotyping and Filtering step : Perform genotyping on the variant sites specified by an input VCF, and also perform variant filtering.
 +
 +
== Command Line Usage of GenomeSTRiP pipeline ==
 +
 +
To see how to use GenomeSTRiP pipeline, type
 +
perl $GC/bin/genomestrip.pl
 +
 +
<div class="mw-collapsible mw-collapsed">
 +
''View Results''
 +
<div class="mw-collapsible-content">
 +
ERROR: One of command options among --run-metadata, --run-discovery, --run-genotype, --run-thirdparty must be specified
 +
ERROR: Missing required option, outdir
 +
Usage:
 +
    /net/seqshop-server/home/mktrost/seqshop/gotcloud/bin/genomestrip.pl
 +
    [options]
 +
 +
    Help Options:
 +
      -help                Print out brief help message [OFF]
 +
      -man                Print the full documentation in man page style [OFF]
 +
 +
    Command options:
 +
      -run-metadata        Create metadata [OFF]
 +
      -run-discovery      Run variant discovery and filtering. Can run with --run-metadata together [OFF]
 +
      -run-genotype        Run genotyping - requires to finish run-metadata and run-discovery [OFF]
 +
      -run-thirdparty      Run genotyping and filtering of third-party sites [OFF]
 +
 +
    Options for input/output data:
 +
      -gotcloudroot|gcroot STRGotCloud Root Directory []
 +
      -conf STR            GotCloud configuration files []
 +
      -outdir STR          Override's conf file's OUT_DIR.  Used as the genomestrip output directory unless --out or GENOMESTRIP_OUT is set []
 +
      -list STR            BAM list file containing ID and BAM path []
 +
      -out STR            Output directory which stores subdirectories such as metadata/, discovery/, genotypes/, thirdparty/ unless overriden individually []
 +
      -metadata STR        Output directory to store --run-metadata results. Default is [OUT]/metadata/ []
 +
      -discovery STR      Output directory to store --run-discovery results. Default is [OUT]/discovery/ []
 +
      -genotype STR        Output directory to store --run-genotype results. Default is [OUT]/genotype/ []
 +
      -thirdparty STR      Output directory to store --run-thirdparty results. Default is [OUT]/thirdparty/ []
 +
 +
    Advanced Options:
 +
      -tmp-dir STR        temporary directory to store temporary files. Default is [OUT]/tmp []
 +
      -gs-dir STR          GenomeSTRiP svtoolkit directory []
 +
      -param STR          GenomeSTRIP parameter file []
 +
      -ref STR            Reference FASTA file []
 +
      -mask STR            Reference mask FASTA file []
 +
      -ploidy-map STR      Ploidy map file []
 +
      -mosix-opt STR      MOSIX options []
 +
      -region STR          Region to focus on the variants []
 +
      -unit INT            Number of variants to be genotyped per parallel run [100]
 +
 +
    Additional Inputs:
 +
      -in-vcf STR          Input site VCF files used for --run-genotype or --run-thirdparty. For --run-thirdparty, this argument is required. For --run-genotype, default is [OUT]/discovery/discovery.vcf []
 +
      -pass-only          Genotype only PASS-filtered variants, default is OFF [OFF]
 +
      -skip-rc            Skip precomputing read count [OFF]
 +
      -base-prefix STR    Prefix of all files []
 +
      -bam-prefix STR      Prefix of BAM files []
 +
      -ref-prefix STR      Prefix of Reference FASTA files []
 +
      -no-phonehome        Skip phone home functionality [OFF]
 +
      -make-base-name STR  Specifies the basename for the makefile []
 +
      -verbose            Specifies that additional details are to be printed out [OFF]
 +
      -dry-run            Perform a dry-run that only produces Makefile but not run it [OFF]
 +
      -numjobs INT        Number of jobs to concurrently run [1]
 +
      -autosomes          Perform analysis only on autosomes [OFF]
 +
</div></div>
    
== Running GotCloud/GenomeSTRiP Metadata Pipeline ==
 
== Running GotCloud/GenomeSTRiP Metadata Pipeline ==
Line 215: Line 295:     
In principle, the metadata can be created from the input BAM files by running the following command
 
In principle, the metadata can be created from the input BAM files by running the following command
  perl ${SS}/svtoolkit/bin/genomestrip.pl -run-metadata --conf ${SS}/gotcloud.conf --numjobs 2 --base-prefix ${SS} --outdir ${OUT}
+
  perl ${GC}/bin/genomestrip.pl --run-metadata --conf ${SS}/gotcloud.conf --numjobs 12 --base-prefix ${SS} --outdir ${OUT}
   −
'''WAIT!!!!! DO NOT RUN THIS COMMAND, because it will take ~50 minutes to finish'''.
+
'''WAIT!!!!! DO NOT RUN THIS COMMAND, because it will take >1 hour to finish'''.
    
Instead, let's look what the output would have looked like.
 
Instead, let's look what the output would have looked like.
   −
  ls ${SS}/svtoolkit/metadata
+
  ls ${SS}/metadata
   −
  cpt  depth  depth.dat  gcprofile  gcprofiles.zip  genome_sizes.txt  isd  isd.dist.bin  spans  spans.dat
+
computerc.args.list
 +
  cpt
 +
depth
 +
  depth.args.list
 +
  depth.dat
 +
  gcprofile
 +
gcprofiles.list
 +
  gcprofiles.zip
 +
  genome_sizes.txt
 +
  isd
 +
isd.dist.args.list
 +
  isd.dist.bin
 +
rccache
 +
rccache.bin 
 +
rccache.bin.idx 
 +
rccache.list 
 +
rccache.merge 
 +
  spans
 +
spans.args.list
 +
  spans.dat
    
The directory contains metadata output and other intermediate files produced by "GenomeSTRiP SVProcess" step.  
 
The directory contains metadata output and other intermediate files produced by "GenomeSTRiP SVProcess" step.  
Line 234: Line 333:     
To discover large deletions from the 62 BAMs we are using for this workshop, you can run the following command
 
To discover large deletions from the 62 BAMs we are using for this workshop, you can run the following command
  time perl ${SS}/svtoolkit/bin/genomestrip.pl -run-discovery --metadata ${SS}/svtoolkit/metadata --conf ${SS}/gotcloud.conf --numjobs 2 --region 22:36000000-37000000 --base-prefix ${SS} --outdir ${OUT}
+
  perl ${GC}/bin/genomestrip.pl --run-discovery --metadata ${SS}/metadata --conf ${SS}/gotcloud.conf --numjobs 4 --conf ${SS}/gotcloud.conf --numjobs 2 --region 22:36000000-37000000 --base-prefix ${SS} --outdir ${OUT}
* <code>${SS}/svtoolkit/bin/genomestrip.pl -run-discovery</code> runs the GenomeSTRiP Discovery Pipeline
+
* <code>${GC}/bin/genomestrip.pl -run-discovery</code> runs the GenomeSTRiP Discovery Pipeline
* <code>--metadata ${SS}/svtoolkit/metadata</code> points to the pre-made metadata file as explained in the previous section, [[#Running GotCloud/GenomeSTRiP Metadata Pipeline|Running GotCloud/GenomeSTRiP Metadata Pipeline]].
+
* <code>--metadata ${SS}/metadata</code> points to the pre-made metadata file as explained in the previous section, [[#Running GotCloud/GenomeSTRiP Metadata Pipeline|Running GotCloud/GenomeSTRiP Metadata Pipeline]].
 
* <code>--conf ${SS}/gotcloud.conf</code> points to the configuration file to use.
 
* <code>--conf ${SS}/gotcloud.conf</code> points to the configuration file to use.
 
** The configuration for this test was downloaded with the seqshop input files (same as other tutorials).
 
** The configuration for this test was downloaded with the seqshop input files (same as other tutorials).
Line 274: Line 373:  
<div class="mw-collapsible-content" style="width:800px">
 
<div class="mw-collapsible-content" style="width:800px">
 
       7 COHERENCE;COVERAGE;DEPTH;DEPTHPVAL
 
       7 COHERENCE;COVERAGE;DEPTH;DEPTHPVAL
     17 COHERENCE;COVERAGE;DEPTH;DEPTHPVAL;PAIRSPERSAMPLE
+
     18 COHERENCE;COVERAGE;DEPTH;DEPTHPVAL;PAIRSPERSAMPLE
 
       3 COHERENCE;COVERAGE;DEPTH;PAIRSPERSAMPLE
 
       3 COHERENCE;COVERAGE;DEPTH;PAIRSPERSAMPLE
 
       2 COHERENCE;COVERAGE;DEPTHPVAL;PAIRSPERSAMPLE
 
       2 COHERENCE;COVERAGE;DEPTHPVAL;PAIRSPERSAMPLE
Line 284: Line 383:  
       2 COVERAGE;DEPTH;PAIRSPERSAMPLE
 
       2 COVERAGE;DEPTH;PAIRSPERSAMPLE
 
       4 COVERAGE;DEPTHPVAL
 
       4 COVERAGE;DEPTHPVAL
       5 COVERAGE;DEPTHPVAL;PAIRSPERSAMPLE
+
       4 COVERAGE;DEPTHPVAL;PAIRSPERSAMPLE
 
       5 COVERAGE;PAIRSPERSAMPLE
 
       5 COVERAGE;PAIRSPERSAMPLE
 
</div>
 
</div>
Line 303: Line 402:  
The discovery pipeline only performs discovery of variant sites with filtering. You will need to iterate BAMs again to perform genotyping.  
 
The discovery pipeline only performs discovery of variant sites with filtering. You will need to iterate BAMs again to perform genotyping.  
 
* If running on a small machine, you may want to reduce <code>--numjobs</code> from 4 to 1.
 
* If running on a small machine, you may want to reduce <code>--numjobs</code> from 4 to 1.
  time perl ${SS}/svtoolkit/bin/genomestrip.pl -run-genotype --metadata ${SS}/svtoolkit/metadata --conf ${SS}/gotcloud.conf --numjobs 4 --region 22:36000000-37000000 --base-prefix ${SS} --outdir ${OUT} --gcroot ${GC}
+
  perl ${GC}/bin/genomestrip.pl --run-genotype --metadata ${SS}/metadata --conf ${SS}/gotcloud.conf --numjobs 4 --base-prefix ${SS} --outdir ${OUT}
* The added <code>--gcroot ${GC}</code> option directs the pipeline to tabix/bgzip programs found within gotcloud.
      
This will take ~3 minutes to finish.
 
This will take ~3 minutes to finish.
Line 324: Line 422:  
You can take a 3rd-party site and genotype with GenomeSTRiP. Here we take a 1000 Genomes phase 1 sites and genotype them.
 
You can take a 3rd-party site and genotype with GenomeSTRiP. Here we take a 1000 Genomes phase 1 sites and genotype them.
 
* If running on a small machine, you may want to reduce <code>--numjobs</code> from 4 to 1.
 
* If running on a small machine, you may want to reduce <code>--numjobs</code> from 4 to 1.
  time perl ${SS}/svtoolkit/bin/genomestrip.pl -run-thirdparty --in-vcf ${SS}/ext/1kg.phase1.chr22.36Mb.sites.vcf --metadata ${SS}/svtoolkit/metadata --conf ${SS}/gotcloud.conf --region 22:36000000-37000000 --base-prefix ${SS} --outdir ${OUT} --gcroot ${GC} --numjobs 4
+
  perl ${GC}/bin/genomestrip.pl --run-thirdparty --in-vcf ${SS}/ext/1kg.phase1.chr22.36Mb.sites.vcf --metadata ${SS}/metadata --conf ${SS}/gotcloud.conf --region 22:36000000-37000000 --base-prefix ${SS} --outdir ${OUT} --numjobs 2
    
This will take ~1 minute to finish.
 
This will take ~1 minute to finish.
Line 351: Line 449:  
</div>
 
</div>
 
</div>
 
</div>
 +
 +
 +
== Return to Workshop Wiki Page ==
 +
Return to main workshop wiki page: [[SeqShop: December 2014]]

Navigation menu