What does a real SV look like?
Main Workshop wiki page: [[SeqShop: December 2014]]
See the [[Media:Seqshop cnv partb 2014 06.pdf|introductory lecture slides]] associated with this tutorial.
== Goals of This Session ==
== Setup in person at the SeqShop Workshop ==
''This section is specifically for the SeqShop Workshop computers.''
<div class="mw-collapsible mw-collapsed" style="width:600px">
''If you are not running during the SeqShop Workshop, please skip this section.''
<div class="mw-collapsible-content">
* Setup an output directory
** It will leave your output directory from the previous tutorial intact.
source /net/seqshop-server/home/mktrost/seqshop/setup.txt
* You won't see any output after running <code>source</code>
** It silently sets up your environment
** If you want to view the detail of the setup, type
less /net/seqshop-server/home/mktrost/seqshop/setup.txt
and press 'q' to finish.
== Setup when running on your own outside of the SeqShop Workshop ==
Parameter files required just for Structural Variation:
ls ${GC}/src/svtoolkit/conf
In addition, there is a step for genotyping structural variants that were discovered by another structural variant caller.
* Third-party Genotyping and Filtering step : Perform genotyping on the variant sites specified by an input VCF, and also perform variant filtering.
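The commands on this page rely on the shell variables <code>${GC}</code>, <code>${SS}</code> and <code>${OUT}</code>. As a minimal sketch, the following function reports any of them that are unset before you start. The name <code>check_seqshop_env</code> is made up for illustration and is not part of GotCloud.

```shell
# Report which of the tutorial's environment variables are unset or empty.
# GC, SS and OUT are the variable names used in the commands on this page;
# check_seqshop_env itself is a hypothetical helper, not a GotCloud tool.
check_seqshop_env() {
    missing=""
    [ -n "${GC:-}" ]  || missing="$missing GC"
    [ -n "${SS:-}" ]  || missing="$missing SS"
    [ -n "${OUT:-}" ] || missing="$missing OUT"
    if [ -n "$missing" ]; then
        echo "unset:$missing"
        return 1
    fi
    echo "ok"
}
```

Running <code>check_seqshop_env</code> prints <code>ok</code> when all three variables are set, and otherwise lists the ones that still need to be defined.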
== Command Line Usage of GenomeSTRiP pipeline ==
To see how to use the GenomeSTRiP pipeline, type
perl $GC/bin/
<div class="mw-collapsible mw-collapsed">
''View Results''
<div class="mw-collapsible-content">
ERROR: One of command options among --run-metadata, --run-discovery, --run-genotype, --run-thirdparty must be specified
ERROR: Missing required option, outdir
Help Options:
-help Print out brief help message [OFF]
-man Print the full documentation in man page style [OFF]
Command options:
-run-metadata Create metadata [OFF]
-run-discovery Run variant discovery and filtering. Can run with --run-metadata together [OFF]
-run-genotype Run genotyping - requires to finish run-metadata and run-discovery [OFF]
-run-thirdparty Run genotyping and filtering of third-party sites [OFF]
Options for input/output data:
-gotcloudroot|gcroot STR GotCloud Root Directory []
-conf STR GotCloud configuration files []
-outdir STR Override's conf file's OUT_DIR. Used as the genomestrip output directory unless --out or GENOMESTRIP_OUT is set []
-list STR BAM list file containing ID and BAM path []
-out STR Output directory which stores subdirectories such as metadata/, discovery/, genotypes/, thirdparty/ unless overriden individually []
-metadata STR Output directory to store --run-metadata results. Default is [OUT]/metadata/ []
-discovery STR Output directory to store --run-discovery results. Default is [OUT]/discovery/ []
-genotype STR Output directory to store --run-genotype results. Default is [OUT]/genotype/ []
-thirdparty STR Output directory to store --run-thirdparty results. Default is [OUT]/thirdparty/ []
Advanced Options:
-tmp-dir STR temporary directory to store temporary files. Default is [OUT]/tmp []
-gs-dir STR GenomeSTRiP svtoolkit directory []
-param STR GenomeSTRIP parameter file []
-ref STR Reference FASTA file []
-mask STR Reference mask FASTA file []
-ploidy-map STR Ploidy map file []
-mosix-opt STR MOSIX options []
-region STR Region to focus on the variants []
-unit INT Number of variants to be genotyped per parallel run [100]
Additional Inputs:
-in-vcf STR Input site VCF files used for --run-genotype or --run-thirdparty. For --run-thirdparty, this argument is required. For --run-genotype, default is [OUT]/discovery/discovery.vcf []
-pass-only Genotype only PASS-filtered variants, default is OFF [OFF]
-skip-rc Skip precomputing read count [OFF]
-base-prefix STR Prefix of all files []
-bam-prefix STR Prefix of BAM files []
-ref-prefix STR Prefix of Reference FASTA files []
-no-phonehome Skip phone home functionality [OFF]
-make-base-name STR Specifies the basename for the makefile []
-verbose Specifies that additional details are to be printed out [OFF]
-dry-run Perform a dry-run that only produces Makefile but not run it [OFF]
-numjobs INT Number of jobs to concurrently run [1]
-autosomes Perform analysis only on autosomes [OFF]
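The <code>-dry-run</code> option listed above is described as producing the Makefile without running it, which makes it a cheap way to inspect a step. The sketch below only prints a command line and executes nothing. <code>print_dry_run_cmd</code> and its arguments are hypothetical, the script path is left for you to supply because it depends on your installation, and the options are assumed to behave as the help text describes.

```shell
# Print (but do not run) a discovery command with --dry-run added.
# print_dry_run_cmd is a hypothetical helper for illustration only.
#   $1: path to the pipeline script (supply the one from your installation)
#   $2: GotCloud configuration file
#   $3: output directory
print_dry_run_cmd() {
    script="$1"
    conf="$2"
    outdir="$3"
    echo "perl $script --run-discovery --dry-run --conf $conf --outdir $outdir"
}
```

Reviewing the printed command before pasting it into the terminal helps catch a wrong configuration file or output directory early.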
== Running GotCloud/GenomeSTRiP Metadata Pipeline ==
In principle, the metadata can be created from the input BAM files by running the following command
perl ${SSGC}/svtoolkit/bin/ --run-metadata --conf ${SS}/gotcloud.conf --numjobs 2 12 --base-prefix ${SS} --outdir ${OUT}
'''WAIT!!!!! DO NOT RUN THIS COMMAND, because it will take about an hour to finish.'''
Instead, let's look at what the output would have looked like.
ls ${SS}/svtoolkit/metadata
computerc.args.list cpt depth depth.args.list depth.dat gcprofile gcprofiles.list genome_sizes.txt isd isd.dist.args.list isd.dist.bin rccache rccache.bin rccache.bin.idx rccache.list rccache.merge spans spans.args.list spans.dat
The directory contains the metadata output and other intermediate files produced by the "GenomeSTRiP SVProcess" step.
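Before pointing later steps at a metadata directory, it can be useful to confirm that it is populated. The following is a rough sketch: the file names are taken from the listing above, but which of them the later steps strictly require is an assumption, and <code>check_metadata_dir</code> is a hypothetical helper.

```shell
# Check that a metadata directory contains a few of the files listed above.
# The choice of files is an assumption of this sketch, not a documented
# requirement of the pipeline.
check_metadata_dir() {
    dir="$1"
    status=0
    for f in depth.dat spans.dat genome_sizes.txt rccache.bin; do
        if [ ! -e "$dir/$f" ]; then
            echo "missing: $f"
            status=1
        fi
    done
    return $status
}
```

For example, <code>check_metadata_dir ${SS}/svtoolkit/metadata</code> prints nothing and returns success when all four files are present.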
To discover large deletions from the 62 BAMs we are using for this workshop, you can run the following command
time perl ${SSGC}/svtoolkit/bin/ --run-discovery --metadata ${SS}/svtoolkit/metadata --conf ${SS}/gotcloud.conf --numjobs 4 --region 22:36000000-37000000 --base-prefix ${SS} --outdir ${OUT}
* <code>${SSGC}/svtoolkit/bin/ --run-discovery</code> runs the GenomeSTRiP Discovery Pipeline.
* <code>--metadata ${SS}/svtoolkit/metadata</code> points to the pre-made metadata, as explained in the previous section, [[#Running GotCloud/GenomeSTRiP Metadata Pipeline|Running GotCloud/GenomeSTRiP Metadata Pipeline]].
* <code>--conf ${SS}/gotcloud.conf</code> points to the configuration file to use.
** The configuration for this test was downloaded with the seqshop input files (same as other tutorials).
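The <code>--region</code> value used on this page, <code>22:36000000-37000000</code>, has the form chromosome, colon, start, dash, end. As a small illustration, the sketch below splits such a string with shell parameter expansion and reports the span as end minus start; <code>region_span</code> is a hypothetical helper name.

```shell
# Split a region string of the form CHROM:START-END and report its span.
# Prints: CHROM START END SPAN, where SPAN is END minus START.
region_span() {
    region="$1"
    chrom="${region%%:*}"    # text before the colon
    range="${region#*:}"     # text after the colon
    start="${range%%-*}"     # text before the dash
    end="${range#*-}"        # text after the dash
    echo "$chrom $start $end $((end - start))"
}

region_span 22:36000000-37000000
```

The region covers a 1 Mb window on chromosome 22, which is why the discovery step finishes quickly compared with a whole-genome run.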
<div class="mw-collapsible-content" style="width:800px">
The discovery pipeline only performs discovery of variant sites with filtering. You will need to iterate over the BAMs again to perform genotyping.
* If running on a small machine, you may want to reduce <code>--numjobs</code> from 4 to 1.
time perl ${SSGC}/svtoolkit/bin/ --run-genotype --metadata ${SS}/svtoolkit/metadata --conf ${SS}/gotcloud.conf --numjobs 4 --region 22:36000000-37000000 --base-prefix ${SS} --outdir ${OUT} --gcroot ${GC}
* The added <code>--gcroot ${GC}</code> option directs the pipeline to the tabix/bgzip programs found within GotCloud.
This will take ~3 minutes to finish.
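One way to choose a <code>--numjobs</code> value on an unfamiliar machine is to cap it at the number of available CPUs. The sketch below is only a suggestion: the rule "at most 4, never more than the CPU count" is an assumption made here, and <code>pick_numjobs</code> is a hypothetical helper.

```shell
# Choose a --numjobs value: at most 4 (the value used on this page),
# but never more than the number of CPUs passed in.
# This rule is an assumption of the sketch, not a pipeline requirement.
pick_numjobs() {
    cpus="$1"
    if [ "$cpus" -lt 4 ]; then
        echo "$cpus"
    else
        echo 4
    fi
}

# getconf _NPROCESSORS_ONLN reports the online CPU count on Linux and macOS;
# fall back to 1 if it is unavailable.
cpus=$(getconf _NPROCESSORS_ONLN 2>/dev/null || echo 1)
echo "--numjobs $(pick_numjobs "$cpus")"
```

The printed option can then be substituted for <code>--numjobs 4</code> in the commands on this page.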
You can take third-party sites and genotype them with GenomeSTRiP. Here we take 1000 Genomes phase 1 sites and genotype them.
* If running on a small machine, you may want to reduce <code>--numjobs</code> from 4 to 1.
time perl ${SSGC}/svtoolkit/bin/ --run-thirdparty --in-vcf ${SS}/ext/1kg.phase1.chr22.36Mb.sites.vcf --metadata ${SS}/svtoolkit/metadata --conf ${SS}/gotcloud.conf --region 22:36000000-37000000 --base-prefix ${SS} --outdir ${OUT} --gcroot ${GC} --numjobs 4
This will take ~1 minute to finish.
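To get a feel for how many third-party sites are being genotyped, you can count the records in the input site VCF. In an uncompressed VCF, header lines start with <code>#</code> and every other non-empty line is one site. A minimal sketch, with <code>count_vcf_sites</code> as a hypothetical helper name:

```shell
# Count site records in an uncompressed VCF file.
# Lines starting with '#' are header lines; blank lines are ignored.
count_vcf_sites() {
    awk '!/^#/ && NF { n++ } END { print n + 0 }' "$1"
}
```

For example, <code>count_vcf_sites ${SS}/ext/1kg.phase1.chr22.36Mb.sites.vcf</code> prints the number of sites in the file passed to <code>--in-vcf</code> above.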
== Starting SNP Call on your own Genome ==
Go to [[SeqShop: Calling Your Own Genome, December 2014]] so we can run SNP calling overnight.
