Changes

3,977 bytes added , 17:09, 17 December 2014

→‎What does a real SV look like?

Line 1: Line 1: +

==Introduction==

+

Main Workshop wiki page: [[SeqShop: December 2014]]

+

See [[Media:Seqshop cnv partb 2014 06.pdf|lecture slides]] for the lecture slides associated with this tutorial.

+

== Goals of This Session ==

−

* What we want to learn ~~is calling large deletions using GenomeSTRiP implemented in [[GotCloud]] pipeline~~

+

* What we want to learn

** How to prepare metadata for running GenomeSTRiP.

** How to perform variant discovery and filtering for large deletions

** How to perform genotyping for large deletions

** How to perform variant discovery and filtering from third party sites.

−

~~Please refer to [[Media:Seqshop cnv partb 2014 06.pdf|Lecture slides]] for more general background.~~

== GenomeSTRiP ==

Line 35: Line 38:

* Setup an output directory

** It will leave your output directory from the previous tutorial in tact.

−

source /home/mktrost/seqshop/setup.txt

+

source /net/seqshop-server/home/mktrost/seqshop/setup.txt

* You won't see any output after running <code>source</code>

** It silently sets up your environment

** If you want to view the detail of the setup, type

−

less /home/mktrost/seqshop/setup.txt

+

less /net/seqshop-server/home/mktrost/seqshop/setup.txt

and press 'q' to finish.

Line 50: Line 53:

</div>

−

== Setup when running on your own outside of the SeqShop Workshop ==

Line 78: Line 80:

If you want to check if you still have the bam index file, run

−

head ${OUT}/bam.~~index~~

+

head ${OUT}/bam.list

<ul>

Line 84: Line 86:

<li>View Results</li>

−

HG00641 ~~ALL~~ /net/seqshop-server/hmkang/out/bams/HG00641.recal.bam

+

HG00641 /net/seqshop-server/home/hmkang/out/bams/HG00641.recal.bam

−

HG00640 ~~ALL~~ /net/seqshop-server/hmkang/out/bams/HG00640.recal.bam

+

HG00640 /net/seqshop-server/home/hmkang/out/bams/HG00640.recal.bam

−

HG00551 ~~ALL~~ /net/seqshop-server/hmkang/out/bams/HG00551.recal.bam

+

HG00551 /net/seqshop-server/home/hmkang/out/bams/HG00551.recal.bam

−

HG00553 ~~ALL~~ /net/seqshop-server/hmkang/out/bams/HG00553.recal.bam

+

HG00553 /net/seqshop-server/home/hmkang/out/bams/HG00553.recal.bam

−

HG00554 ~~ALL~~ bams/HG00554.recal.bam

+

HG00554 bams/HG00554.recal.bam

−

HG00637 ~~ALL~~ bams/HG00637.recal.bam

+

HG00637 bams/HG00637.recal.bam

−

HG00638 ~~ALL~~ bams/HG00638.recal.bam

+

HG00638 bams/HG00638.recal.bam

−

HG00734 ~~ALL~~ bams/HG00734.recal.bam

+

HG00734 bams/HG00734.recal.bam

−

HG00736 ~~ALL~~ bams/HG00736.recal.bam

+

HG00736 bams/HG00736.recal.bam

−

HG00737 ~~ALL~~ bams/HG00737.recal.bam

+

HG00737 bams/HG00737.recal.bam

</div>

Line 100: Line 102:

Also, make sure that you have only 62 samples (you did not append new files twice)

−

wc -l ${OUT}/bam.~~index~~

+

wc -l ${OUT}/bam.list

Your expected output is similar to this.

−

62 /net/seqshop-server/hmkang/out/bam.~~index~~

+

62 /net/seqshop-server/hmkang/out/bam.list

=== Reference Files ===

Line 126: Line 128:

<li>View Results</li>

−

~~[[File:RefDir~~.~~png|500px]]~~

+

1000G_omni2.5.b37.sites.PASS.chr22.vcf.gz

−

~~</div>~~

+

1000G_omni2.5.b37.sites.PASS.chr22.vcf.gz.tbi

−

~~</div>~~

+

1000G.snps_indels.22.sites.bcf

−

~~</ul>~~

+

1000G.snps_indels.22.sites.bcf.csi

−

+

1kg.omni.chr22.36Mb.vcf.gz

−

+

1kg.pilot_release.merged.indels.sites.hg19.chr22.vcf

−

~~Additional Reference files required just for Structural Variation:~~

+

dbsnp.13147541variants.22.sites.bcf

−

~~ls ${SS}/svtoolkit/ref~~

+

dbsnp.13147541variants.22.sites.bcf.csi

−

~~<ul>~~

+

dbsnp_135.b37.chr22.vcf.gz

−

~~<div class="mw~~-~~collapsible mw-collapsed" style="width:200px">~~

+

dbsnp_135.b37.chr22.vcf.gz.tbi

−

~~<li>View Results</li>~~

+

dbsnp.bcf.vcf

−

~~<div class="mw-collapsible-content" style="width:800px">~~

+

dbsnp.vcf.gz.vcf

−

human_g1k_v37.chr22.mask.100.fasta human_g1k_v37.chr22.mask.100.fasta.~~dict~~ ~~human_g1k_v37~~.chr22.~~mask~~.~~100~~.~~fasta~~.~~fai~~

+

gencode.cds.22.bed.gz

+

hapmap_3.3.b37.sites.chr22.vcf.gz

+

hapmap_3.3.b37.sites.chr22.vcf.gz.tbi

+

human.g1k.v37.chr22-bs.umfa

+

human.g1k.v37.chr22.dict

+

human.g1k.v37.chr22.fa

+

human.g1k.v37.chr22.fa.amb

+

human.g1k.v37.chr22.fa.ann

+

human.g1k.v37.chr22.fa.bwt

+

human.g1k.v37.chr22.fa.fai

+

human.g1k.v37.chr22.fa.pac

+

human.g1k.v37.chr22.fa.sa

+

human_g1k_v37.chr22.mask.100.fasta

+

human_g1k_v37.chr22.mask.100.fasta.fai

+

human.g1k.v37.chr22.winsize100.gc

+

humgen_g1k_v37_ploidy.chr22.map

+

indel.reference.txt

+

mdust.22.bed.gz

+

mills.208620indels.22.sites.bcf

+

mills.208620indels.22.sites.bcf.csi

+

mills_indels_hg19.22.sites.bcf

</div>

Line 145: Line 167:

Parameters files required just for Structural Variation:

−

ls ${SS}/svtoolkit/conf

+

ls ${GC}/src/svtoolkit/conf

<ul>

Line 151: Line 173:

<li>View Results</li>

−

genstrip_parameters.txt ~~humgen_g1k_v37_ploidy.chr22.map humgen_g1k_v37_ploidy.map~~

+

genstrip_parameters.txt

</div>

Line 165: Line 187:

Check out the GenomeStrip specific settings at the end of the configuration file

−

tail -n 8 ${SS}/gotcloud.conf

+

tail -n 5 ${SS}/gotcloud.conf

<ul>

Line 174: Line 196:

## GenomeSTRIP

#############################

−

~~GENOMESTRIP_OUT = $(OUT_DIR)/sv~~

+

GENOMESTRIP_MASK_FASTA = $(REF_DIR)/human_g1k_v37.chr22.mask.100.fasta

−

~~GENOMESTRIP_SVTOOLKIT_DIR = svtoolkit~~

+

GENOMESTRIP_PLOIDY_MAP = $(REF_DIR)/humgen_g1k_v37_ploidy.chr22.map

−

GENOMESTRIP_MASK_FASTA = $(~~GENOMESTRIP_SVTOOLKIT_DIR~~)~~/ref~~/human_g1k_v37.chr22.mask.100.fasta

−

GENOMESTRIP_PLOIDY_MAP = $(~~GENOMESTRIP_SVTOOLKIT_DIR~~)~~/conf~~/humgen_g1k_v37_ploidy.chr22.map

−

~~GENOMESTRIP_PARAM = $(GENOMESTRIP_SVTOOLKIT_DIR)/conf/genstrip_parameters.txt~~

</div>

Line 209: Line 228:

In addition, if one wants to genotype structural variants from other structural variant caller, there is a step available.

* Third-party Genotyping and Filtering step : Perform genotyping on the variant sites specified by an input VCF, and also perform variant filtering.

+

== Command Line Usage of GenomeSTRiP pipeline ==

+

To see how to use GenomeSTRiP pipeline, type

+

perl $GC/bin/genomestrip.pl

+

+

''View Results''

+

+

ERROR: One of command options among --run-metadata, --run-discovery, --run-genotype, --run-thirdparty must be specified

+

ERROR: Missing required option, outdir

+

Usage:

+

/net/seqshop-server/home/mktrost/seqshop/gotcloud/bin/genomestrip.pl

+

[options]

+

Help Options:

+

-help Print out brief help message [OFF]

+

-man Print the full documentation in man page style [OFF]

+

Command options:

+

-run-metadata Create metadata [OFF]

+

-run-discovery Run variant discovery and filtering. Can run with --run-metadata together [OFF]

+

-run-genotype Run genotyping - requires to finish run-metadata and run-discovery [OFF]

+

-run-thirdparty Run genotyping and filtering of third-party sites [OFF]

+

Options for input/output data:

+

-gotcloudroot|gcroot STRGotCloud Root Directory []

+

-conf STR GotCloud configuration files []

+

-outdir STR Override's conf file's OUT_DIR. Used as the genomestrip output directory unless --out or GENOMESTRIP_OUT is set []

+

-list STR BAM list file containing ID and BAM path []

+

-out STR Output directory which stores subdirectories such as metadata/, discovery/, genotypes/, thirdparty/ unless overriden individually []

+

-metadata STR Output directory to store --run-metadata results. Default is [OUT]/metadata/ []

+

-discovery STR Output directory to store --run-discovery results. Default is [OUT]/discovery/ []

+

-genotype STR Output directory to store --run-genotype results. Default is [OUT]/genotype/ []

+

-thirdparty STR Output directory to store --run-thirdparty results. Default is [OUT]/thirdparty/ []

+

Advanced Options:

+

-tmp-dir STR temporary directory to store temporary files. Default is [OUT]/tmp []

+

-gs-dir STR GenomeSTRiP svtoolkit directory []

+

-param STR GenomeSTRIP parameter file []

+

-ref STR Reference FASTA file []

+

-mask STR Reference mask FASTA file []

+

-ploidy-map STR Ploidy map file []

+

-mosix-opt STR MOSIX options []

+

-region STR Region to focus on the variants []

+

-unit INT Number of variants to be genotyped per parallel run [100]

+

Additional Inputs:

+

-in-vcf STR Input site VCF files used for --run-genotype or --run-thirdparty. For --run-thirdparty, this argument is required. For --run-genotype, default is [OUT]/discovery/discovery.vcf []

+

-pass-only Genotype only PASS-filtered variants, default is OFF [OFF]

+

-skip-rc Skip precomputing read count [OFF]

+

-base-prefix STR Prefix of all files []

+

-bam-prefix STR Prefix of BAM files []

+

-ref-prefix STR Prefix of Reference FASTA files []

+

-no-phonehome Skip phone home functionality [OFF]

+

-make-base-name STR Specifies the basename for the makefile []

+

-verbose Specifies that additional details are to be printed out [OFF]

+

-dry-run Perform a dry-run that only produces Makefile but not run it [OFF]

+

-numjobs INT Number of jobs to concurrently run [1]

+

-autosomes Perform analysis only on autosomes [OFF]

+

</div></div>

== Running GotCloud/GenomeSTRiP Metadata Pipeline ==

Line 215: Line 295:

In principle, the metadata can be created from the input BAM files by running the following command

−

perl ${SS}~~/svtoolkit~~/bin/genomestrip.pl -run-metadata --conf ${SS}/gotcloud.conf --numjobs 2 --base-prefix ${SS} --outdir ${OUT}

+

perl ${GC}/bin/genomestrip.pl --run-metadata --conf ${SS}/gotcloud.conf --numjobs 12 --base-prefix ${SS} --outdir ${OUT}

−

'''WAIT!!!!! DO NOT RUN THIS COMMAND, because it will take ~~~50 minutes~~ to finish'''.

+

'''WAIT!!!!! DO NOT RUN THIS COMMAND, because it will take >1 hour to finish'''.

Instead, let's look what the output would have looked like.

−

ls ${SS}~~/svtoolkit~~/metadata

+

ls ${SS}/metadata

−

cpt depth depth.dat gcprofile gcprofiles.zip genome_sizes.txt isd isd.dist.bin spans spans.dat

+

computerc.args.list

+

cpt

+

depth

+

depth.args.list

+

depth.dat

+

gcprofile

+

gcprofiles.list

+

gcprofiles.zip

+

genome_sizes.txt

+

isd

+

isd.dist.args.list

+

isd.dist.bin

+

rccache

+

rccache.bin

+

rccache.bin.idx

+

rccache.list

+

rccache.merge

+

spans

+

spans.args.list

+

spans.dat

The directory contains metadata output and other intermediate files produced by "GenomeSTRiP SVProcess" step.

Line 234: Line 333:

To discover large deletions from the 62 BAMs we are using for this workshop, you can run the following command

−

~~time~~ perl ${SS}~~/svtoolkit~~/bin/genomestrip.pl -run-discovery --metadata ${SS}/~~svtoolkit~~/~~metadata~~ --conf ${SS}/gotcloud.conf --numjobs 2 --region 22:36000000-37000000 --base-prefix ${SS} --outdir ${OUT}

+

perl ${GC}/bin/genomestrip.pl --run-discovery --metadata ${SS}/metadata --conf ${SS}/gotcloud.conf --numjobs 4 --conf ${SS}/gotcloud.conf --numjobs 2 --region 22:36000000-37000000 --base-prefix ${SS} --outdir ${OUT}

−

* <code>${SS}~~/svtoolkit~~/bin/genomestrip.pl -run-discovery</code> runs the GenomeSTRiP Discovery Pipeline

+

* <code>${GC}/bin/genomestrip.pl -run-discovery</code> runs the GenomeSTRiP Discovery Pipeline

−

* <code>--metadata ${SS}~~/svtoolkit~~/metadata</code> points to the pre-made metadata file as explained in the previous section, [[#Running GotCloud/GenomeSTRiP Metadata Pipeline|Running GotCloud/GenomeSTRiP Metadata Pipeline]].

+

* <code>--metadata ${SS}/metadata</code> points to the pre-made metadata file as explained in the previous section, [[#Running GotCloud/GenomeSTRiP Metadata Pipeline|Running GotCloud/GenomeSTRiP Metadata Pipeline]].

* <code>--conf ${SS}/gotcloud.conf</code> points to the configuration file to use.

** The configuration for this test was downloaded with the seqshop input files (same as other tutorials).

Line 274: Line 373:

7 COHERENCE;COVERAGE;DEPTH;DEPTHPVAL

−

17 COHERENCE;COVERAGE;DEPTH;DEPTHPVAL;PAIRSPERSAMPLE

+

18 COHERENCE;COVERAGE;DEPTH;DEPTHPVAL;PAIRSPERSAMPLE

3 COHERENCE;COVERAGE;DEPTH;PAIRSPERSAMPLE

2 COHERENCE;COVERAGE;DEPTHPVAL;PAIRSPERSAMPLE

Line 284: Line 383:

2 COVERAGE;DEPTH;PAIRSPERSAMPLE

4 COVERAGE;DEPTHPVAL

−

5 COVERAGE;DEPTHPVAL;PAIRSPERSAMPLE

+

4 COVERAGE;DEPTHPVAL;PAIRSPERSAMPLE

5 COVERAGE;PAIRSPERSAMPLE

</div>

Line 303: Line 402:

The discovery pipeline only performs discovery of variant sites with filtering. You will need to iterate BAMs again to perform genotyping.

* If running on a small machine, you may want to reduce <code>--numjobs</code> from 4 to 1.

−

~~time~~ perl ${SS}~~/svtoolkit~~/bin/genomestrip.pl -run-genotype --metadata ${SS}~~/svtoolkit~~/metadata --conf ${SS}/gotcloud.conf --numjobs 4 ~~--region 22:36000000-37000000~~ --base-prefix ${SS} --outdir ${OUT} ~~--gcroot ${GC}~~

+

perl ${GC}/bin/genomestrip.pl --run-genotype --metadata ${SS}/metadata --conf ${SS}/gotcloud.conf --numjobs 4 --base-prefix ${SS} --outdir ${OUT}

−

* The added <code>--gcroot ${GC}</code> option directs the pipeline to tabix/bgzip programs found within gotcloud.

This will take ~3 minutes to finish.

Line 324: Line 422:

You can take a 3rd-party site and genotype with GenomeSTRiP. Here we take a 1000 Genomes phase 1 sites and genotype them.

* If running on a small machine, you may want to reduce <code>--numjobs</code> from 4 to 1.

−

~~time~~ perl ${SS}~~/svtoolkit~~/bin/genomestrip.pl -run-thirdparty --in-vcf ${SS}/ext/1kg.phase1.chr22.36Mb.sites.vcf --metadata ${SS}~~/svtoolkit~~/metadata --conf ${SS}/gotcloud.conf --region 22:36000000-37000000 --base-prefix ${SS} --outdir ${OUT~~} --gcroot ${GC~~} --numjobs 4

+

perl ${GC}/bin/genomestrip.pl --run-thirdparty --in-vcf ${SS}/ext/1kg.phase1.chr22.36Mb.sites.vcf --metadata ${SS}/metadata --conf ${SS}/gotcloud.conf --region 22:36000000-37000000 --base-prefix ${SS} --outdir ${OUT} --numjobs 2

This will take ~1 minute to finish.

Line 351: Line 449:

</div>

+

== Return to Workshop Wiki Page ==

+

Return to main workshop wiki page: [[SeqShop: December 2014]]

Mktrost

Administrators

3,045

edits

Changes

SeqShop: Analysis of Structural Variation Practical, December 2014 (view source)

Revision as of 17:09, 17 December 2014

Navigation menu

Page actions

Page actions

Personal tools

quick links

teaching

Navigation

Search

Tools