Line 2: |
Line 2: |
| Main Workshop wiki page: [[SeqShop: December 2014]] | | Main Workshop wiki page: [[SeqShop: December 2014]] |
| | | |
− | See the [[Media:Seqshop cnv partb 2014 06.pdf|introductory slides]] for an intro to this tutorial. | + | See [[Media:Seqshop cnv partb 2014 06.pdf|lecture slides]] for the lecture slides associated with this tutorial. |
| | | |
| == Goals of This Session == | | == Goals of This Session == |
Line 24: |
Line 24: |
| == Setup in person at the SeqShop Workshop == | | == Setup in person at the SeqShop Workshop == |
| ''This section is specifically for the SeqShop Workshop computers.'' | | ''This section is specifically for the SeqShop Workshop computers.'' |
− | <div class="mw-collapsible" style="width:600px"> | + | <div class="mw-collapsible mw-collapsed" style="width:600px"> |
| ''If you are not running during the SeqShop Workshop, please skip this section.'' | | ''If you are not running during the SeqShop Workshop, please skip this section.'' |
| <div class="mw-collapsible-content"> | | <div class="mw-collapsible-content"> |
Line 38: |
Line 38: |
| * Setup an output directory | | * Setup an output directory |
| ** It will leave your output directory from the previous tutorial in tact. | | ** It will leave your output directory from the previous tutorial in tact. |
− | source /home/mktrost/seqshop/setup.txt | + | source /net/seqshop-server/home/mktrost/seqshop/setup.txt |
| * You won't see any output after running <code>source</code> | | * You won't see any output after running <code>source</code> |
| ** It silently sets up your environment | | ** It silently sets up your environment |
| ** If you want to view the detail of the setup, type | | ** If you want to view the detail of the setup, type |
− | less /home/mktrost/seqshop/setup.txt | + | less /net/seqshop-server/home/mktrost/seqshop/setup.txt |
| and press 'q' to finish. | | and press 'q' to finish. |
| | | |
Line 53: |
Line 53: |
| </div> | | </div> |
| </div> | | </div> |
− |
| |
| | | |
| == Setup when running on your own outside of the SeqShop Workshop == | | == Setup when running on your own outside of the SeqShop Workshop == |
Line 81: |
Line 80: |
| If you want to check if you still have the bam index file, run | | If you want to check if you still have the bam index file, run |
| | | |
− | head ${OUT}/bam.index | + | head ${OUT}/bam.list |
| | | |
| <ul> | | <ul> |
Line 87: |
Line 86: |
| <li>View Results</li> | | <li>View Results</li> |
| <div class="mw-collapsible-content" style="width:800px"> | | <div class="mw-collapsible-content" style="width:800px"> |
− | HG00641 ALL /net/seqshop-server/hmkang/out/bams/HG00641.recal.bam | + | HG00641 /net/seqshop-server/home/hmkang/out/bams/HG00641.recal.bam |
− | HG00640 ALL /net/seqshop-server/hmkang/out/bams/HG00640.recal.bam | + | HG00640 /net/seqshop-server/home/hmkang/out/bams/HG00640.recal.bam |
− | HG00551 ALL /net/seqshop-server/hmkang/out/bams/HG00551.recal.bam | + | HG00551 /net/seqshop-server/home/hmkang/out/bams/HG00551.recal.bam |
− | HG00553 ALL /net/seqshop-server/hmkang/out/bams/HG00553.recal.bam | + | HG00553 /net/seqshop-server/home/hmkang/out/bams/HG00553.recal.bam |
− | HG00554 ALL bams/HG00554.recal.bam | + | HG00554 bams/HG00554.recal.bam |
− | HG00637 ALL bams/HG00637.recal.bam | + | HG00637 bams/HG00637.recal.bam |
− | HG00638 ALL bams/HG00638.recal.bam | + | HG00638 bams/HG00638.recal.bam |
− | HG00734 ALL bams/HG00734.recal.bam | + | HG00734 bams/HG00734.recal.bam |
− | HG00736 ALL bams/HG00736.recal.bam | + | HG00736 bams/HG00736.recal.bam |
− | HG00737 ALL bams/HG00737.recal.bam | + | HG00737 bams/HG00737.recal.bam |
| </div> | | </div> |
| </div> | | </div> |
Line 103: |
Line 102: |
| Also, make sure that you have only 62 samples (you did not append new files twice) | | Also, make sure that you have only 62 samples (you did not append new files twice) |
| | | |
− | wc -l ${OUT}/bam.index | + | wc -l ${OUT}/bam.list |
| | | |
| Your expected output is similar to this. | | Your expected output is similar to this. |
| | | |
− | 62 /net/seqshop-server/hmkang/out/bam.index | + | 62 /net/seqshop-server/hmkang/out/bam.list |
| | | |
| === Reference Files === | | === Reference Files === |
Line 129: |
Line 128: |
| <li>View Results</li> | | <li>View Results</li> |
| <div class="mw-collapsible-content" style="width:800px"> | | <div class="mw-collapsible-content" style="width:800px"> |
− | [[File:RefDir.png|500px]]
| + | 1000G_omni2.5.b37.sites.PASS.chr22.vcf.gz |
− | </div>
| + | 1000G_omni2.5.b37.sites.PASS.chr22.vcf.gz.tbi |
− | </div>
| + | 1000G.snps_indels.22.sites.bcf |
− | </ul>
| + | 1000G.snps_indels.22.sites.bcf.csi |
− | | + | 1kg.omni.chr22.36Mb.vcf.gz |
− | | + | 1kg.pilot_release.merged.indels.sites.hg19.chr22.vcf |
− | Additional Reference files required just for Structural Variation:
| + | dbsnp.13147541variants.22.sites.bcf |
− | ls ${SS}/svtoolkit/ref | + | dbsnp.13147541variants.22.sites.bcf.csi |
− | <ul>
| + | dbsnp_135.b37.chr22.vcf.gz |
− | <div class="mw-collapsible mw-collapsed" style="width:200px">
| + | dbsnp_135.b37.chr22.vcf.gz.tbi |
− | <li>View Results</li>
| + | dbsnp.bcf.vcf |
− | <div class="mw-collapsible-content" style="width:800px">
| + | dbsnp.vcf.gz.vcf |
− | 1000G_omni2.5.b37.sites.PASS.chr22.vcf.gz dbsnp.13147541variants.22.sites.bcf gencode.cds.22.bed.gz human.g1k.v37.chr22.fa.amb human_g1k_v37.chr22.mask.100.fasta mills.208620indels.22.sites.bcf | + | gencode.cds.22.bed.gz |
− | 1000G_omni2.5.b37.sites.PASS.chr22.vcf.gz.tbi dbsnp.13147541variants.22.sites.bcf.csi hapmap_3.3.b37.sites.chr22.vcf.gz human.g1k.v37.chr22.fa.ann human_g1k_v37.chr22.mask.100.fasta.fai mills.208620indels.22.sites.bcf.csi
| + | hapmap_3.3.b37.sites.chr22.vcf.gz |
− | 1000G.snps_indels.22.sites.bcf dbsnp_135.b37.chr22.vcf.gz hapmap_3.3.b37.sites.chr22.vcf.gz.tbi human.g1k.v37.chr22.fa.bwt human.g1k.v37.chr22.winsize100.gc mills_indels_hg19.22.sites.bcf
| + | hapmap_3.3.b37.sites.chr22.vcf.gz.tbi |
− | 1000G.snps_indels.22.sites.bcf.csi dbsnp_135.b37.chr22.vcf.gz.tbi human.g1k.v37.chr22-bs.umfa human.g1k.v37.chr22.fa.fai humgen_g1k_v37_ploidy.chr22.map
| + | human.g1k.v37.chr22-bs.umfa |
− | 1kg.omni.chr22.36Mb.vcf.gz dbsnp.bcf.vcf human.g1k.v37.chr22.dict human.g1k.v37.chr22.fa.pac indel.reference.txt
| + | human.g1k.v37.chr22.dict |
− | 1kg.pilot_release.merged.indels.sites.hg19.chr22.vcf dbsnp.vcf.gz.vcf human.g1k.v37.chr22.fa human.g1k.v37.chr22.fa.sa mdust.22.bed.gz
| + | human.g1k.v37.chr22.fa |
| + | human.g1k.v37.chr22.fa.amb |
| + | human.g1k.v37.chr22.fa.ann |
| + | human.g1k.v37.chr22.fa.bwt |
| + | human.g1k.v37.chr22.fa.fai |
| + | human.g1k.v37.chr22.fa.pac |
| + | human.g1k.v37.chr22.fa.sa |
| + | human_g1k_v37.chr22.mask.100.fasta |
| + | human_g1k_v37.chr22.mask.100.fasta.fai |
| + | human.g1k.v37.chr22.winsize100.gc |
| + | humgen_g1k_v37_ploidy.chr22.map |
| + | indel.reference.txt |
| + | mdust.22.bed.gz |
| + | mills.208620indels.22.sites.bcf |
| + | mills.208620indels.22.sites.bcf.csi |
| + | mills_indels_hg19.22.sites.bcf |
| </div> | | </div> |
| </div> | | </div> |
Line 153: |
Line 167: |
| | | |
| Parameters files required just for Structural Variation: | | Parameters files required just for Structural Variation: |
− | ls ${SS}/svtoolkit/conf | + | ls ${GC}/src/svtoolkit/conf |
| | | |
| <ul> | | <ul> |
Line 159: |
Line 173: |
| <li>View Results</li> | | <li>View Results</li> |
| <div class="mw-collapsible-content" style="width:800px"> | | <div class="mw-collapsible-content" style="width:800px"> |
− | genstrip_parameters.txt humgen_g1k_v37_ploidy.chr22.map humgen_g1k_v37_ploidy.map | + | genstrip_parameters.txt |
| </div> | | </div> |
| </div> | | </div> |
Line 173: |
Line 187: |
| | | |
| Check out the GenomeStrip specific settings at the end of the configuration file | | Check out the GenomeStrip specific settings at the end of the configuration file |
− | tail -n 8 ${SS}/gotcloud.conf | + | tail -n 5 ${SS}/gotcloud.conf |
| | | |
| <ul> | | <ul> |
Line 182: |
Line 196: |
| ## GenomeSTRIP | | ## GenomeSTRIP |
| ############################# | | ############################# |
− | GENOMESTRIP_OUT = $(OUT_DIR)/sv
| + | GENOMESTRIP_MASK_FASTA = $(REF_DIR)/human_g1k_v37.chr22.mask.100.fasta |
− | GENOMESTRIP_SVTOOLKIT_DIR = svtoolkit
| + | GENOMESTRIP_PLOIDY_MAP = $(REF_DIR)/humgen_g1k_v37_ploidy.chr22.map |
− | GENOMESTRIP_MASK_FASTA = $(GENOMESTRIP_SVTOOLKIT_DIR)/ref/human_g1k_v37.chr22.mask.100.fasta | |
− | GENOMESTRIP_PLOIDY_MAP = $(GENOMESTRIP_SVTOOLKIT_DIR)/conf/humgen_g1k_v37_ploidy.chr22.map | |
− | GENOMESTRIP_PARAM = $(GENOMESTRIP_SVTOOLKIT_DIR)/conf/genstrip_parameters.txt
| |
| </div> | | </div> |
| </div> | | </div> |
Line 217: |
Line 228: |
| In addition, if one wants to genotype structural variants from other structural variant caller, there is a step available. | | In addition, if one wants to genotype structural variants from other structural variant caller, there is a step available. |
| * Third-party Genotyping and Filtering step : Perform genotyping on the variant sites specified by an input VCF, and also perform variant filtering. | | * Third-party Genotyping and Filtering step : Perform genotyping on the variant sites specified by an input VCF, and also perform variant filtering. |
| + | |
| + | == Command Line Usage of GenomeSTRiP pipeline == |
| + | |
| + | To see how to use GenomeSTRiP pipeline, type |
| + | perl $GC/bin/genomestrip.pl |
| + | |
| + | <div class="mw-collapsible mw-collapsed"> |
| + | ''View Results'' |
| + | <div class="mw-collapsible-content"> |
| + | ERROR: One of command options among --run-metadata, --run-discovery, --run-genotype, --run-thirdparty must be specified |
| + | ERROR: Missing required option, outdir |
| + | Usage: |
| + | /net/seqshop-server/home/mktrost/seqshop/gotcloud/bin/genomestrip.pl |
| + | [options] |
| + | |
| + | Help Options: |
| + | -help Print out brief help message [OFF] |
| + | -man Print the full documentation in man page style [OFF] |
| + | |
| + | Command options: |
| + | -run-metadata Create metadata [OFF] |
| + | -run-discovery Run variant discovery and filtering. Can run with --run-metadata together [OFF] |
| + | -run-genotype Run genotyping - requires to finish run-metadata and run-discovery [OFF] |
| + | -run-thirdparty Run genotyping and filtering of third-party sites [OFF] |
| + | |
| + | Options for input/output data: |
| + | -gotcloudroot|gcroot STRGotCloud Root Directory [] |
| + | -conf STR GotCloud configuration files [] |
| + | -outdir STR Override's conf file's OUT_DIR. Used as the genomestrip output directory unless --out or GENOMESTRIP_OUT is set [] |
| + | -list STR BAM list file containing ID and BAM path [] |
| + | -out STR Output directory which stores subdirectories such as metadata/, discovery/, genotypes/, thirdparty/ unless overriden individually [] |
| + | -metadata STR Output directory to store --run-metadata results. Default is [OUT]/metadata/ [] |
| + | -discovery STR Output directory to store --run-discovery results. Default is [OUT]/discovery/ [] |
| + | -genotype STR Output directory to store --run-genotype results. Default is [OUT]/genotype/ [] |
| + | -thirdparty STR Output directory to store --run-thirdparty results. Default is [OUT]/thirdparty/ [] |
| + | |
| + | Advanced Options: |
| + | -tmp-dir STR temporary directory to store temporary files. Default is [OUT]/tmp [] |
| + | -gs-dir STR GenomeSTRiP svtoolkit directory [] |
| + | -param STR GenomeSTRIP parameter file [] |
| + | -ref STR Reference FASTA file [] |
| + | -mask STR Reference mask FASTA file [] |
| + | -ploidy-map STR Ploidy map file [] |
| + | -mosix-opt STR MOSIX options [] |
| + | -region STR Region to focus on the variants [] |
| + | -unit INT Number of variants to be genotyped per parallel run [100] |
| + | |
| + | Additional Inputs: |
| + | -in-vcf STR Input site VCF files used for --run-genotype or --run-thirdparty. For --run-thirdparty, this argument is required. For --run-genotype, default is [OUT]/discovery/discovery.vcf [] |
| + | -pass-only Genotype only PASS-filtered variants, default is OFF [OFF] |
| + | -skip-rc Skip precomputing read count [OFF] |
| + | -base-prefix STR Prefix of all files [] |
| + | -bam-prefix STR Prefix of BAM files [] |
| + | -ref-prefix STR Prefix of Reference FASTA files [] |
| + | -no-phonehome Skip phone home functionality [OFF] |
| + | -make-base-name STR Specifies the basename for the makefile [] |
| + | -verbose Specifies that additional details are to be printed out [OFF] |
| + | -dry-run Perform a dry-run that only produces Makefile but not run it [OFF] |
| + | -numjobs INT Number of jobs to concurrently run [1] |
| + | -autosomes Perform analysis only on autosomes [OFF] |
| + | </div></div> |
| | | |
| == Running GotCloud/GenomeSTRiP Metadata Pipeline == | | == Running GotCloud/GenomeSTRiP Metadata Pipeline == |
Line 223: |
Line 295: |
| | | |
| In principle, the metadata can be created from the input BAM files by running the following command | | In principle, the metadata can be created from the input BAM files by running the following command |
− | perl ${SS}/svtoolkit/bin/genomestrip.pl -run-metadata --conf ${SS}/gotcloud.conf --numjobs 2 --base-prefix ${SS} --outdir ${OUT} | + | perl ${GC}/bin/genomestrip.pl --run-metadata --conf ${SS}/gotcloud.conf --numjobs 12 --base-prefix ${SS} --outdir ${OUT} |
| | | |
− | '''WAIT!!!!! DO NOT RUN THIS COMMAND, because it will take ~50 minutes to finish'''. | + | '''WAIT!!!!! DO NOT RUN THIS COMMAND, because it will take >1 hour to finish'''. |
| | | |
| Instead, let's look what the output would have looked like. | | Instead, let's look what the output would have looked like. |
| | | |
− | ls ${SS}/svtoolkit/metadata | + | ls ${SS}/metadata |
| | | |
− | cpt depth depth.dat gcprofile gcprofiles.zip genome_sizes.txt isd isd.dist.bin spans spans.dat | + | computerc.args.list |
| + | cpt |
| + | depth |
| + | depth.args.list |
| + | depth.dat |
| + | gcprofile |
| + | gcprofiles.list |
| + | gcprofiles.zip |
| + | genome_sizes.txt |
| + | isd |
| + | isd.dist.args.list |
| + | isd.dist.bin |
| + | rccache |
| + | rccache.bin |
| + | rccache.bin.idx |
| + | rccache.list |
| + | rccache.merge |
| + | spans |
| + | spans.args.list |
| + | spans.dat |
| | | |
| The directory contains metadata output and other intermediate files produced by "GenomeSTRiP SVProcess" step. | | The directory contains metadata output and other intermediate files produced by "GenomeSTRiP SVProcess" step. |
Line 242: |
Line 333: |
| | | |
| To discover large deletions from the 62 BAMs we are using for this workshop, you can run the following command | | To discover large deletions from the 62 BAMs we are using for this workshop, you can run the following command |
− | time perl ${SS}/svtoolkit/bin/genomestrip.pl -run-discovery --metadata ${SS}/svtoolkit/metadata --conf ${SS}/gotcloud.conf --numjobs 2 --region 22:36000000-37000000 --base-prefix ${SS} --outdir ${OUT} | + | perl ${GC}/bin/genomestrip.pl --run-discovery --metadata ${SS}/metadata --conf ${SS}/gotcloud.conf --numjobs 4 --conf ${SS}/gotcloud.conf --numjobs 2 --region 22:36000000-37000000 --base-prefix ${SS} --outdir ${OUT} |
− | * <code>${SS}/svtoolkit/bin/genomestrip.pl -run-discovery</code> runs the GenomeSTRiP Discovery Pipeline | + | * <code>${GC}/bin/genomestrip.pl -run-discovery</code> runs the GenomeSTRiP Discovery Pipeline |
− | * <code>--metadata ${SS}/svtoolkit/metadata</code> points to the pre-made metadata file as explained in the previous section, [[#Running GotCloud/GenomeSTRiP Metadata Pipeline|Running GotCloud/GenomeSTRiP Metadata Pipeline]]. | + | * <code>--metadata ${SS}/metadata</code> points to the pre-made metadata file as explained in the previous section, [[#Running GotCloud/GenomeSTRiP Metadata Pipeline|Running GotCloud/GenomeSTRiP Metadata Pipeline]]. |
| * <code>--conf ${SS}/gotcloud.conf</code> points to the configuration file to use. | | * <code>--conf ${SS}/gotcloud.conf</code> points to the configuration file to use. |
| ** The configuration for this test was downloaded with the seqshop input files (same as other tutorials). | | ** The configuration for this test was downloaded with the seqshop input files (same as other tutorials). |
Line 282: |
Line 373: |
| <div class="mw-collapsible-content" style="width:800px"> | | <div class="mw-collapsible-content" style="width:800px"> |
| 7 COHERENCE;COVERAGE;DEPTH;DEPTHPVAL | | 7 COHERENCE;COVERAGE;DEPTH;DEPTHPVAL |
− | 17 COHERENCE;COVERAGE;DEPTH;DEPTHPVAL;PAIRSPERSAMPLE | + | 18 COHERENCE;COVERAGE;DEPTH;DEPTHPVAL;PAIRSPERSAMPLE |
| 3 COHERENCE;COVERAGE;DEPTH;PAIRSPERSAMPLE | | 3 COHERENCE;COVERAGE;DEPTH;PAIRSPERSAMPLE |
| 2 COHERENCE;COVERAGE;DEPTHPVAL;PAIRSPERSAMPLE | | 2 COHERENCE;COVERAGE;DEPTHPVAL;PAIRSPERSAMPLE |
Line 292: |
Line 383: |
| 2 COVERAGE;DEPTH;PAIRSPERSAMPLE | | 2 COVERAGE;DEPTH;PAIRSPERSAMPLE |
| 4 COVERAGE;DEPTHPVAL | | 4 COVERAGE;DEPTHPVAL |
− | 5 COVERAGE;DEPTHPVAL;PAIRSPERSAMPLE | + | 4 COVERAGE;DEPTHPVAL;PAIRSPERSAMPLE |
| 5 COVERAGE;PAIRSPERSAMPLE | | 5 COVERAGE;PAIRSPERSAMPLE |
| </div> | | </div> |
Line 311: |
Line 402: |
| The discovery pipeline only performs discovery of variant sites with filtering. You will need to iterate BAMs again to perform genotyping. | | The discovery pipeline only performs discovery of variant sites with filtering. You will need to iterate BAMs again to perform genotyping. |
| * If running on a small machine, you may want to reduce <code>--numjobs</code> from 4 to 1. | | * If running on a small machine, you may want to reduce <code>--numjobs</code> from 4 to 1. |
− | time perl ${SS}/svtoolkit/bin/genomestrip.pl -run-genotype --metadata ${SS}/svtoolkit/metadata --conf ${SS}/gotcloud.conf --numjobs 4 --region 22:36000000-37000000 --base-prefix ${SS} --outdir ${OUT} --gcroot ${GC} | + | perl ${GC}/bin/genomestrip.pl --run-genotype --metadata ${SS}/metadata --conf ${SS}/gotcloud.conf --numjobs 4 --base-prefix ${SS} --outdir ${OUT} |
− | * The added <code>--gcroot ${GC}</code> option directs the pipeline to tabix/bgzip programs found within gotcloud.
| |
| | | |
| This will take ~3 minutes to finish. | | This will take ~3 minutes to finish. |
Line 332: |
Line 422: |
| You can take a 3rd-party site and genotype with GenomeSTRiP. Here we take a 1000 Genomes phase 1 sites and genotype them. | | You can take a 3rd-party site and genotype with GenomeSTRiP. Here we take a 1000 Genomes phase 1 sites and genotype them. |
| * If running on a small machine, you may want to reduce <code>--numjobs</code> from 4 to 1. | | * If running on a small machine, you may want to reduce <code>--numjobs</code> from 4 to 1. |
− | time perl ${SS}/svtoolkit/bin/genomestrip.pl -run-thirdparty --in-vcf ${SS}/ext/1kg.phase1.chr22.36Mb.sites.vcf --metadata ${SS}/svtoolkit/metadata --conf ${SS}/gotcloud.conf --region 22:36000000-37000000 --base-prefix ${SS} --outdir ${OUT} --gcroot ${GC} --numjobs 4 | + | perl ${GC}/bin/genomestrip.pl --run-thirdparty --in-vcf ${SS}/ext/1kg.phase1.chr22.36Mb.sites.vcf --metadata ${SS}/metadata --conf ${SS}/gotcloud.conf --region 22:36000000-37000000 --base-prefix ${SS} --outdir ${OUT} --numjobs 2 |
| | | |
| This will take ~1 minute to finish. | | This will take ~1 minute to finish. |
Line 359: |
Line 449: |
| </div> | | </div> |
| </div> | | </div> |
| + | |
| + | |
| + | == Return to Workshop Wiki Page == |
| + | Return to main workshop wiki page: [[SeqShop: December 2014]] |