Line 1: |
Line 1: |
| + | '''Note:''' the latest version of this practical is available at: [[SeqShop: Analysis of Structural Variation Practical]] |
| + | * The ones here is the original one from the June workshop (updated to be run from elsewhere) |
| + | |
| + | |
| == Goals of This Session == | | == Goals of This Session == |
| * What we want to learn is calling large deletions using GenomeSTRiP implemented in [[GotCloud]] pipeline | | * What we want to learn is calling large deletions using GenomeSTRiP implemented in [[GotCloud]] pipeline |
Line 8: |
Line 12: |
| Please refer to [[Media:Seqshop cnv partb 2014 06.pdf|Lecture slides]] for more general background. | | Please refer to [[Media:Seqshop cnv partb 2014 06.pdf|Lecture slides]] for more general background. |
| | | |
| + | == GenomeSTRiP == |
| + | GenomeSTRiP was developed at the Broad Institute and at the McCarroll Lab at the Harvard Medical School Department of Genetics: http://www.broadinstitute.org/software/genomestrip/ |
| + | |
| + | If you use GenomeSTRiP for your research, please cite it: |
| + | Handsaker RE, Korn JM, Nemesh J, McCarroll SA |
| + | Discovery and genotyping of genome structural polymorphism by sequencing on a population scale. |
| + | Nature genetics 43, 269-276 (2011) |
| + | PMID: 21317889 |
| + | |
| + | GenomeStrip is currently included in with the seqshop example data under the svtoolkit directory. We have added the bin/ sub-directory to add a high level pipeline that will run genomestrip in the same framework as GotCloud. |
| | | |
| == Setup in person at the SeqShop Workshop == | | == Setup in person at the SeqShop Workshop == |
Line 47: |
Line 61: |
| ''If you are running during the SeqShop Workshop, please skip this section.'' | | ''If you are running during the SeqShop Workshop, please skip this section.'' |
| <div class="mw-collapsible-content"> | | <div class="mw-collapsible-content"> |
− | === Download the example data ===
| |
| | | |
− | === Setup your run environment ===
| + | This tutorial builds on the alignment tutorial, if you have not already, please first run that tutorial: [[SeqShop:_Sequence_Mapping_and_Assembly_Practical, June 2014|Alignment Tutorial]] |
| | | |
− | Environment variables will be used throughout the tutorial.
| + | It also uses the bam.index file created in the SnpCall Tutorial. If you have not yet run that tutorial, please follow the directions at: [[SeqShop:_Variant_Calling_and_Filtering_for_SNPs_Practical, June 2014#GotCloud_BAM_Index_File|GotCloud BAM Index File]] |
| | | |
− | We recommend that you setup these variables so you won't have to modify every command in the tutorial.
| |
| | | |
− | <div class="mw-collapsible mw-collapsed" style="width:500px">
| + | {{SeqShopRemoteEnv}} |
− | I'm using bash (replace the paths below with the appropriate paths):
| |
− | <div class="mw-collapsible-content">
| |
− | * Point to where you installed GotCloud
| |
− | *:<pre>export GC=/home/username/gotcloud</pre>
| |
− | * Point to where you installed the seqshop files
| |
− | *:<pre>export SS=/home/username/seqshop/</pre>
| |
− | * Point to where you want the output to go
| |
− | *:<pre>export OUT=/home/username/seqshop_output/</pre>
| |
| </div> | | </div> |
| </div> | | </div> |
− |
| |
− | <div class="mw-collapsible mw-collapsed" style="width:500px">
| |
− | I'm using tcsh (replace the paths below with the appropriate paths):
| |
− | <div class="mw-collapsible-content">
| |
− | * Point to where you installed GotCloud
| |
− | *:<pre>setenv GC /home/username/gotcloud</pre>
| |
− | * Point to where you installed the seqshop files
| |
− | *:<pre>setenv SS /home/username/seqshop/</pre>
| |
− | * Point to where you want the output to go
| |
− | *:<pre>setenv OUT /home/username/seqshop_output/</pre>
| |
− | </div>
| |
− | </div>
| |
− |
| |
− | </div>
| |
− | </div>
| |
− |
| |
| | | |
| == Examining GotCloud/GenomeSTRiP Input files == | | == Examining GotCloud/GenomeSTRiP Input files == |
Line 90: |
Line 78: |
| * BAMs->SVs rather than BAMs->SNPs | | * BAMs->SVs rather than BAMs->SNPs |
| | | |
− | If you want a reminder, of what they look like, here is a link to the previous tutorial : [[SeqShop:_Variant_Calling_and_Filtering_for_SNPs_Practical#Examining_GotCloud_SnpCall_Input_files|GotCloud SnpCall Input Files]] | + | If you want a reminder, of what they look like, here is a link to the previous tutorial : [[SeqShop:_Variant_Calling_and_Filtering_for_SNPs_Practical, June 2014#Examining_GotCloud_SnpCall_Input_files|GotCloud SnpCall Input Files]] |
| | | |
| If you want to check if you still have the bam index file, run | | If you want to check if you still have the bam index file, run |
Line 175: |
Line 163: |
| We will use the same configuration file we used for the GotCloud Align tutorial. | | We will use the same configuration file we used for the GotCloud Align tutorial. |
| | | |
− | See [[SeqShop:_Sequence_Mapping_and_Assembly_Practical#GotCloud Configuration File|SeqShop: Alignment: GotCloud Configuration File]] for more details | + | See [[SeqShop:_Sequence_Mapping_and_Assembly_Practica, June 2014l#GotCloud Configuration File|SeqShop: Alignment: GotCloud Configuration File]] for more details |
| * Note we want to limit snpcall to just chr22 so the configuration already has <code>CHRS = 22</code> (default was 1-22 & X). | | * Note we want to limit snpcall to just chr22 so the configuration already has <code>CHRS = 22</code> (default was 1-22 & X). |
| | | |
Line 181: |
Line 169: |
| | | |
| Check out the GenomeStrip specific settings at the end of the configuration file | | Check out the GenomeStrip specific settings at the end of the configuration file |
− | tail -n 7 ${SS}/gotcloud.conf | + | tail -n 8 ${SS}/gotcloud.conf |
| | | |
| <ul> | | <ul> |
Line 190: |
Line 178: |
| ## GenomeSTRIP | | ## GenomeSTRIP |
| ############################# | | ############################# |
− | GENOMESTRIP_SVTOOLKIT_DIR = $(GOTCLOUD_ROOT)/../reference/svtoolkit | + | GENOMESTRIP_OUT = $(OUT_DIR)/sv |
| + | GENOMESTRIP_SVTOOLKIT_DIR = svtoolkit |
| GENOMESTRIP_MASK_FASTA = $(GENOMESTRIP_SVTOOLKIT_DIR)/ref/human_g1k_v37.chr22.mask.100.fasta | | GENOMESTRIP_MASK_FASTA = $(GENOMESTRIP_SVTOOLKIT_DIR)/ref/human_g1k_v37.chr22.mask.100.fasta |
| GENOMESTRIP_PLOIDY_MAP = $(GENOMESTRIP_SVTOOLKIT_DIR)/conf/humgen_g1k_v37_ploidy.chr22.map | | GENOMESTRIP_PLOIDY_MAP = $(GENOMESTRIP_SVTOOLKIT_DIR)/conf/humgen_g1k_v37_ploidy.chr22.map |
Line 205: |
Line 194: |
| # Currently, GenomeSTRiP only allows calling large deletions, but duplicate calling pipeline is under way. | | # Currently, GenomeSTRiP only allows calling large deletions, but duplicate calling pipeline is under way. |
| | | |
− | === Why do we use GotCloud/GenomeSTRiP pipeline instead of directly using GenomeSTRiP itself? === | + | === Why do we use GotCloud/GenomeSTRiP pipeline? === |
− | # The main purpose of GotCloud pipelines is to provide a pipeline for users with limited knowledge and experience with high performance computing environment. | + | # The main purpose of GotCloud pipelines is to provide a pipeline for users with limited knowledge and experience with high performance computing environment. |
− | #* Although GenomeSTRiP provides a reasonably straightforward pipeline, it still requires a detailed understanding of GATK framework and the details of parameter. | + | #* GotCloud/GenomeSTRiP provide a simple interface consistent to alignment, SNP, and indel calling. |
− | #* GotCloud aims to provide more simpler way to run these procedure | + | #* GenomeSTRiP itself also provides a straightforward pipeline to use as standalone software |
| # GotCloud supports a variety of cluster environment that is not currently supported by GenomeSTRiP | | # GotCloud supports a variety of cluster environment that is not currently supported by GenomeSTRiP |
− | #* GenomeSTRiP is designed based on a framework called Qscript, which provide a nice support for LSF cluster system, but it does not support many other cluster enviroments such as MOSIX or SLURM we use locally. | + | #* GenomeSTRiP is designed based on a framework called Qscript, which provide a nice support for LSF cluster system |
| + | #* GotCloud support many additional cluster environments such as MOSIX or SLURM we use locally at Michigan. |
| # GotCloud also provide a fault-tolerant solution for large-scale jobs. | | # GotCloud also provide a fault-tolerant solution for large-scale jobs. |
| #* GotCloud automatically picks up jobs from the point where it failed. This allows easier and simpler run against potential technical glitches in the system. | | #* GotCloud automatically picks up jobs from the point where it failed. This allows easier and simpler run against potential technical glitches in the system. |
Line 229: |
Line 219: |
| | | |
| In principle, the metadata can be created from the input BAM files by running the following command | | In principle, the metadata can be created from the input BAM files by running the following command |
− | time perl $GC/bin/genomestrip.pl -run-metadata --out $OUT/sv --conf $IN/gotcloud.conf --numjobs 2 | + | perl ${SS}/svtoolkit/bin/genomestrip.pl -run-metadata --conf ${SS}/gotcloud.conf --numjobs 2 --base-prefix ${SS} --outdir ${OUT} |
| | | |
| '''WAIT!!!!! DO NOT RUN THIS COMMAND, because it will take ~50 minutes to finish'''. | | '''WAIT!!!!! DO NOT RUN THIS COMMAND, because it will take ~50 minutes to finish'''. |
Line 316: |
Line 306: |
| | | |
| The discovery pipeline only performs discovery of variant sites with filtering. You will need to iterate BAMs again to perform genotyping. | | The discovery pipeline only performs discovery of variant sites with filtering. You will need to iterate BAMs again to perform genotyping. |
− | | + | * If running on a small machine, you may want to reduce <code>--numjobs</code> from 4 to 1. |
− | time perl ${SS}/svtoolkit/bin/genomestrip.pl -run-genotype --metadata ${SS}/svtoolkit/metadata --conf ${SS}/gotcloud.conf --numjobs 4 --region 22:36000000-37000000 --base-prefix ${SS} --outdir ${OUT} | + | time perl ${SS}/svtoolkit/bin/genomestrip.pl -run-genotype --metadata ${SS}/svtoolkit/metadata --conf ${SS}/gotcloud.conf --numjobs 4 --region 22:36000000-37000000 --base-prefix ${SS} --outdir ${OUT} --gcroot ${GC} |
| + | * The added <code>--gcroot ${GC}</code> option directs the pipeline to tabix/bgzip programs found within gotcloud. |
| | | |
| This will take ~3 minutes to finish. | | This will take ~3 minutes to finish. |
Line 336: |
Line 327: |
| | | |
| You can take a 3rd-party site and genotype with GenomeSTRiP. Here we take a 1000 Genomes phase 1 sites and genotype them. | | You can take a 3rd-party site and genotype with GenomeSTRiP. Here we take a 1000 Genomes phase 1 sites and genotype them. |
− | | + | * If running on a small machine, you may want to reduce <code>--numjobs</code> from 4 to 1. |
− | time perl $GC/bin/genomestrip.pl -run-thirdparty --in-vcf $EXT/1kg.phase1.chr22.36Mb.sites.vcf --metadata $IN/metadata --out $OUT/sv --conf $IN/gotcloud.conf --region 22:36000000-37000000 --numjobs 4 | + | time perl ${SS}/svtoolkit/bin/genomestrip.pl -run-thirdparty --in-vcf ${SS}/ext/1kg.phase1.chr22.36Mb.sites.vcf --metadata ${SS}/svtoolkit/metadata --conf ${SS}/gotcloud.conf --region 22:36000000-37000000 --base-prefix ${SS} --outdir ${OUT} --gcroot ${GC} --numjobs 4 |
| | | |
| This will take ~1 minute to finish. | | This will take ~1 minute to finish. |