Line 1: |
Line 1: |
| + | '''Note:''' the latest version of this practical is available at: [[SeqShop: Analysis of Structural Variation Practical]] |
| + | * The ones here is the original one from the June workshop (updated to be run from elsewhere) |
| + | |
| + | |
| == Goals of This Session == | | == Goals of This Session == |
| * What we want to learn is calling large deletions using GenomeSTRiP implemented in [[GotCloud]] pipeline | | * What we want to learn is calling large deletions using GenomeSTRiP implemented in [[GotCloud]] pipeline |
Line 8: |
Line 12: |
| Please refer to [[Media:Seqshop cnv partb 2014 06.pdf|Lecture slides]] for more general background. | | Please refer to [[Media:Seqshop cnv partb 2014 06.pdf|Lecture slides]] for more general background. |
| | | |
| + | == GenomeSTRiP == |
| + | GenomeSTRiP was developed at the Broad Institute and at the McCarroll Lab at the Harvard Medical School Department of Genetics: http://www.broadinstitute.org/software/genomestrip/ |
| + | |
| + | If you use GenomeSTRiP for your research, please cite it: |
| + | Handsaker RE, Korn JM, Nemesh J, McCarroll SA |
| + | Discovery and genotyping of genome structural polymorphism by sequencing on a population scale. |
| + | Nature genetics 43, 269-276 (2011) |
| + | PMID: 21317889 |
| + | |
| + | GenomeStrip is currently included in with the seqshop example data under the svtoolkit directory. We have added the bin/ sub-directory to add a high level pipeline that will run genomestrip in the same framework as GotCloud. |
| | | |
| == Setup in person at the SeqShop Workshop == | | == Setup in person at the SeqShop Workshop == |
Line 47: |
Line 61: |
| ''If you are running during the SeqShop Workshop, please skip this section.'' | | ''If you are running during the SeqShop Workshop, please skip this section.'' |
| <div class="mw-collapsible-content"> | | <div class="mw-collapsible-content"> |
− | === Download the example data ===
| |
− |
| |
− | === Setup your run environment ===
| |
− |
| |
− | Environment variables will be used throughout the tutorial.
| |
| | | |
− | We recommend that you setup these variables so you won't have to modify every command in the tutorial.
| + | This tutorial builds on the alignment tutorial, if you have not already, please first run that tutorial: [[SeqShop:_Sequence_Mapping_and_Assembly_Practical, June 2014|Alignment Tutorial]] |
| | | |
− | <div class="mw-collapsible mw-collapsed" style="width:500px">
| + | It also uses the bam.index file created in the SnpCall Tutorial. If you have not yet run that tutorial, please follow the directions at: [[SeqShop:_Variant_Calling_and_Filtering_for_SNPs_Practical, June 2014#GotCloud_BAM_Index_File|GotCloud BAM Index File]] |
− | I'm using bash (replace the paths below with the appropriate paths):
| |
− | <div class="mw-collapsible-content">
| |
− | * Point to where you installed GotCloud
| |
− | *:<pre>export GC=/home/username/gotcloud</pre>
| |
− | * Point to where you installed the seqshop files
| |
− | *:<pre>export SS=/home/username/seqshop/</pre>
| |
− | * Point to where you want the output to go
| |
− | *:<pre>export OUT=/home/username/seqshop_output/</pre>
| |
− | </div>
| |
− | </div>
| |
| | | |
− | <div class="mw-collapsible mw-collapsed" style="width:500px">
| |
− | I'm using tcsh (replace the paths below with the appropriate paths):
| |
− | <div class="mw-collapsible-content">
| |
− | * Point to where you installed GotCloud
| |
− | *:<pre>setenv GC /home/username/gotcloud</pre>
| |
− | * Point to where you installed the seqshop files
| |
− | *:<pre>setenv SS /home/username/seqshop/</pre>
| |
− | * Point to where you want the output to go
| |
− | *:<pre>setenv OUT /home/username/seqshop_output/</pre>
| |
− | </div>
| |
− | </div>
| |
| | | |
| + | {{SeqShopRemoteEnv}} |
| </div> | | </div> |
| </div> | | </div> |
− |
| |
| | | |
| == Examining GotCloud/GenomeSTRiP Input files == | | == Examining GotCloud/GenomeSTRiP Input files == |
Line 90: |
Line 78: |
| * BAMs->SVs rather than BAMs->SNPs | | * BAMs->SVs rather than BAMs->SNPs |
| | | |
− | If you want a reminder, of what they look like, here is a link to the previous tutorial : [[SeqShop:_Variant_Calling_and_Filtering_for_SNPs_Practical#Examining_GotCloud_SnpCall_Input_files|GotCloud SnpCall Input Files]] | + | If you want a reminder, of what they look like, here is a link to the previous tutorial : [[SeqShop:_Variant_Calling_and_Filtering_for_SNPs_Practical, June 2014#Examining_GotCloud_SnpCall_Input_files|GotCloud SnpCall Input Files]] |
| | | |
| If you want to check if you still have the bam index file, run | | If you want to check if you still have the bam index file, run |
Line 147: |
Line 135: |
| </ul> | | </ul> |
| | | |
− | Reference files required just for Structural Variation: | + | |
| + | Additional Reference files required just for Structural Variation: |
| + | ls ${SS}/svtoolkit/ref |
| + | <ul> |
| + | <div class="mw-collapsible mw-collapsed" style="width:200px"> |
| + | <li>View Results</li> |
| + | <div class="mw-collapsible-content" style="width:800px"> |
| human_g1k_v37.chr22.mask.100.fasta human_g1k_v37.chr22.mask.100.fasta.dict human_g1k_v37.chr22.mask.100.fasta.fai | | human_g1k_v37.chr22.mask.100.fasta human_g1k_v37.chr22.mask.100.fasta.dict human_g1k_v37.chr22.mask.100.fasta.fai |
| + | </div> |
| + | </div> |
| + | </ul> |
| | | |
| | | |
| Parameters files required just for Structural Variation: | | Parameters files required just for Structural Variation: |
− | ls ${SS}/svtoolkit | + | ls ${SS}/svtoolkit/conf |
| | | |
| <ul> | | <ul> |
Line 166: |
Line 163: |
| We will use the same configuration file we used for the GotCloud Align tutorial. | | We will use the same configuration file we used for the GotCloud Align tutorial. |
| | | |
− | See [[SeqShop:_Sequence_Mapping_and_Assembly_Practical#GotCloud Configuration File|SeqShop: Alignment: GotCloud Configuration File]] for more details | + | See [[SeqShop:_Sequence_Mapping_and_Assembly_Practica, June 2014l#GotCloud Configuration File|SeqShop: Alignment: GotCloud Configuration File]] for more details |
| * Note we want to limit snpcall to just chr22 so the configuration already has <code>CHRS = 22</code> (default was 1-22 & X). | | * Note we want to limit snpcall to just chr22 so the configuration already has <code>CHRS = 22</code> (default was 1-22 & X). |
| | | |
Line 172: |
Line 169: |
| | | |
| Check out the GenomeStrip specific settings at the end of the configuration file | | Check out the GenomeStrip specific settings at the end of the configuration file |
− | tail -n 7 ${SS}/gotcloud.conf | + | tail -n 8 ${SS}/gotcloud.conf |
| | | |
| <ul> | | <ul> |
Line 181: |
Line 178: |
| ## GenomeSTRIP | | ## GenomeSTRIP |
| ############################# | | ############################# |
− | GENOMESTRIP_SVTOOLKIT_DIR = $(GOTCLOUD_ROOT)/../reference/svtoolkit | + | GENOMESTRIP_OUT = $(OUT_DIR)/sv |
| + | GENOMESTRIP_SVTOOLKIT_DIR = svtoolkit |
| GENOMESTRIP_MASK_FASTA = $(GENOMESTRIP_SVTOOLKIT_DIR)/ref/human_g1k_v37.chr22.mask.100.fasta | | GENOMESTRIP_MASK_FASTA = $(GENOMESTRIP_SVTOOLKIT_DIR)/ref/human_g1k_v37.chr22.mask.100.fasta |
| GENOMESTRIP_PLOIDY_MAP = $(GENOMESTRIP_SVTOOLKIT_DIR)/conf/humgen_g1k_v37_ploidy.chr22.map | | GENOMESTRIP_PLOIDY_MAP = $(GENOMESTRIP_SVTOOLKIT_DIR)/conf/humgen_g1k_v37_ploidy.chr22.map |
Line 196: |
Line 194: |
| # Currently, GenomeSTRiP only allows calling large deletions, but duplicate calling pipeline is under way. | | # Currently, GenomeSTRiP only allows calling large deletions, but duplicate calling pipeline is under way. |
| | | |
− | === Why do we use GotCloud/GenomeSTRiP pipeline instead of directly using GenomeSTRiP itself? === | + | === Why do we use GotCloud/GenomeSTRiP pipeline? === |
− | # The main purpose of GotCloud pipelines is to provide a pipeline for users with limited knowledge and experience with high performance computing environment. | + | # The main purpose of GotCloud pipelines is to provide a pipeline for users with limited knowledge and experience with high performance computing environment. |
− | #* Although GenomeSTRiP provides a reasonably straightforward pipeline, it still requires a detailed understanding of GATK framework and the details of parameter. | + | #* GotCloud/GenomeSTRiP provide a simple interface consistent to alignment, SNP, and indel calling. |
− | #* GotCloud aims to provide more simpler way to run these procedure | + | #* GenomeSTRiP itself also provides a straightforward pipeline to use as standalone software |
| # GotCloud supports a variety of cluster environment that is not currently supported by GenomeSTRiP | | # GotCloud supports a variety of cluster environment that is not currently supported by GenomeSTRiP |
− | #* GenomeSTRiP is designed based on a framework called Qscript, which provide a nice support for LSF cluster system, but it does not support many other cluster enviroments such as MOSIX or SLURM we use locally. | + | #* GenomeSTRiP is designed based on a framework called Qscript, which provide a nice support for LSF cluster system |
| + | #* GotCloud support many additional cluster environments such as MOSIX or SLURM we use locally at Michigan. |
| # GotCloud also provide a fault-tolerant solution for large-scale jobs. | | # GotCloud also provide a fault-tolerant solution for large-scale jobs. |
| #* GotCloud automatically picks up jobs from the point where it failed. This allows easier and simpler run against potential technical glitches in the system. | | #* GotCloud automatically picks up jobs from the point where it failed. This allows easier and simpler run against potential technical glitches in the system. |
Line 220: |
Line 219: |
| | | |
| In principle, the metadata can be created from the input BAM files by running the following command | | In principle, the metadata can be created from the input BAM files by running the following command |
− | time perl $GC/bin/genomestrip.pl -run-metadata --out $OUT/sv --conf $IN/gotcloud.conf --numjobs 2 | + | perl ${SS}/svtoolkit/bin/genomestrip.pl -run-metadata --conf ${SS}/gotcloud.conf --numjobs 2 --base-prefix ${SS} --outdir ${OUT} |
| | | |
| '''WAIT!!!!! DO NOT RUN THIS COMMAND, because it will take ~50 minutes to finish'''. | | '''WAIT!!!!! DO NOT RUN THIS COMMAND, because it will take ~50 minutes to finish'''. |
Line 239: |
Line 238: |
| | | |
| To discover large deletions from the 62 BAMs we are using for this workshop, you can run the following command | | To discover large deletions from the 62 BAMs we are using for this workshop, you can run the following command |
| + | time perl ${SS}/svtoolkit/bin/genomestrip.pl -run-discovery --metadata ${SS}/svtoolkit/metadata --conf ${SS}/gotcloud.conf --numjobs 2 --region 22:36000000-37000000 --base-prefix ${SS} --outdir ${OUT} |
| + | * <code>${SS}/svtoolkit/bin/genomestrip.pl -run-discovery</code> runs the GenomeSTRiP Discovery Pipeline |
| + | * <code>--metadata ${SS}/svtoolkit/metadata</code> points to the pre-made metadata file as explained in the previous section, [[#Running GotCloud/GenomeSTRiP Metadata Pipeline|Running GotCloud/GenomeSTRiP Metadata Pipeline]]. |
| + | * <code>--conf ${SS}/gotcloud.conf</code> points to the configuration file to use. |
| + | ** The configuration for this test was downloaded with the seqshop input files (same as other tutorials). |
| + | * <code>--numjobs</code> tells how many jobs to run in parallel |
| + | ** Depends on your system |
| + | * <code>--region 22:36000000-37000000</code> |
| + | ** The sample files are just a small region of chromosome 22, so to save time, we tell the pipeline to ignore the other regions |
| + | * <code>--base_prefix</code> tells the pipeline the prefix to append to relative paths. |
| + | ** The Configuration file cannot read environment variables, so we need to tell it the path to the input files, ${SS} |
| + | ** Alternatively, gotcloud.conf could be updated to specify the full paths |
| + | * <code>--out_dir</code> tells GotCloud where to write the output. |
| + | ** This could be specified in gotcloud.conf, but to allow you to use the ${OUT} to change the output location, it is specified on the command-line |
| + | ** Based on <code>gotcloud.conf</code>, the GenomeSTRiP output will go in <code>$(OUT_DIR)/sv</code> |
| | | |
− | time perl $GC/bin/genomestrip.pl -run-discovery --metadata $IN/metadata --out $OUT/sv --conf $IN/gotcloud.conf --region 22:36000000-37000000 --numjobs 2
| + | This will take ~2-3 minutes to finish. |
− | | |
− | This will take ~2 minutes to finish. | |
| | | |
| Let's see the final outputs produced. | | Let's see the final outputs produced. |
| | | |
− | less $OUT/sv/discovery/discovery.vcf | + | less ${OUT}/sv/discovery/discovery.vcf |
| | | |
| You will see output file that looks like this | | You will see output file that looks like this |
Line 294: |
Line 306: |
| | | |
| The discovery pipeline only performs discovery of variant sites with filtering. You will need to iterate BAMs again to perform genotyping. | | The discovery pipeline only performs discovery of variant sites with filtering. You will need to iterate BAMs again to perform genotyping. |
− | | + | * If running on a small machine, you may want to reduce <code>--numjobs</code> from 4 to 1. |
− | time perl $GC/bin/genomestrip.pl -run-genotype --metadata $IN/metadata --out $OUT/sv --conf $IN/gotcloud.conf --region 22:36000000-37000000 --numjobs 4 | + | time perl ${SS}/svtoolkit/bin/genomestrip.pl -run-genotype --metadata ${SS}/svtoolkit/metadata --conf ${SS}/gotcloud.conf --numjobs 4 --region 22:36000000-37000000 --base-prefix ${SS} --outdir ${OUT} --gcroot ${GC} |
| + | * The added <code>--gcroot ${GC}</code> option directs the pipeline to tabix/bgzip programs found within gotcloud. |
| | | |
| This will take ~3 minutes to finish. | | This will take ~3 minutes to finish. |
Line 314: |
Line 327: |
| | | |
| You can take a 3rd-party site and genotype with GenomeSTRiP. Here we take a 1000 Genomes phase 1 sites and genotype them. | | You can take a 3rd-party site and genotype with GenomeSTRiP. Here we take a 1000 Genomes phase 1 sites and genotype them. |
− | | + | * If running on a small machine, you may want to reduce <code>--numjobs</code> from 4 to 1. |
− | time perl $GC/bin/genomestrip.pl -run-thirdparty --in-vcf $EXT/1kg.phase1.chr22.36Mb.sites.vcf --metadata $IN/metadata --out $OUT/sv --conf $IN/gotcloud.conf --region 22:36000000-37000000 --numjobs 4 | + | time perl ${SS}/svtoolkit/bin/genomestrip.pl -run-thirdparty --in-vcf ${SS}/ext/1kg.phase1.chr22.36Mb.sites.vcf --metadata ${SS}/svtoolkit/metadata --conf ${SS}/gotcloud.conf --region 22:36000000-37000000 --base-prefix ${SS} --outdir ${OUT} --gcroot ${GC} --numjobs 4 |
| | | |
| This will take ~1 minute to finish. | | This will take ~1 minute to finish. |