Line 1: |
Line 1: |
| + | '''Note:''' the latest version of this practical is available at: [[SeqShop: Analysis of Structural Variation Practical]] |
| + | * This page is the original version from the June workshop (updated so that it can be run from elsewhere) |
| + | |
| + | |
| == Goals of This Session == | | == Goals of This Session == |
| * We want to learn how to call large deletions using GenomeSTRiP as implemented in the [[GotCloud]] pipeline | | * We want to learn how to call large deletions using GenomeSTRiP as implemented in the [[GotCloud]] pipeline |
Line 8: |
Line 12: |
| Please refer to [[Media:Seqshop cnv partb 2014 06.pdf|Lecture slides]] for more general background. | | Please refer to [[Media:Seqshop cnv partb 2014 06.pdf|Lecture slides]] for more general background. |
| | | |
| + | == GenomeSTRiP == |
| + | GenomeSTRiP was developed at the Broad Institute and at the McCarroll Lab at the Harvard Medical School Department of Genetics: http://www.broadinstitute.org/software/genomestrip/ |
| + | |
| + | If you use GenomeSTRiP for your research, please cite it: |
| + | Handsaker RE, Korn JM, Nemesh J, McCarroll SA |
| + | Discovery and genotyping of genome structural polymorphism by sequencing on a population scale. |
| + | Nature genetics 43, 269-276 (2011) |
| + | PMID: 21317889 |
| + | |
| + | GenomeSTRiP is currently included with the SeqShop example data under the svtoolkit directory. We have added a bin/ sub-directory containing a high-level pipeline that runs GenomeSTRiP in the same framework as GotCloud. |
| | | |
| == Setup in person at the SeqShop Workshop == | | == Setup in person at the SeqShop Workshop == |
Line 48: |
Line 62: |
| <div class="mw-collapsible-content"> | | <div class="mw-collapsible-content"> |
| | | |
− | This tutorial builds on the alignment tutorial, if you have not already, please first run that tutorial: [[SeqShop:_Sequence_Mapping_and_Assembly_Practical|Alignment Tutorial]] | + | This tutorial builds on the alignment tutorial, if you have not already, please first run that tutorial: [[SeqShop:_Sequence_Mapping_and_Assembly_Practical, June 2014|Alignment Tutorial]] |
| | | |
− | It also uses the bam.index file created in the SnpCall Tutorial. If you have not yet run that tutorial, please follow the directions at: [[SeqShop:_Variant_Calling_and_Filtering_for_SNPs_Practical#GotCloud_BAM_Index_File|GotCloud BAM Index File]] | + | It also uses the bam.index file created in the SnpCall Tutorial. If you have not yet run that tutorial, please follow the directions at: [[SeqShop:_Variant_Calling_and_Filtering_for_SNPs_Practical, June 2014#GotCloud_BAM_Index_File|GotCloud BAM Index File]] |
| | | |
| | | |
Line 64: |
Line 78: |
| * BAMs->SVs rather than BAMs->SNPs | | * BAMs->SVs rather than BAMs->SNPs |
| | | |
− | If you want a reminder, of what they look like, here is a link to the previous tutorial : [[SeqShop:_Variant_Calling_and_Filtering_for_SNPs_Practical#Examining_GotCloud_SnpCall_Input_files|GotCloud SnpCall Input Files]] | + | If you want a reminder, of what they look like, here is a link to the previous tutorial : [[SeqShop:_Variant_Calling_and_Filtering_for_SNPs_Practical, June 2014#Examining_GotCloud_SnpCall_Input_files|GotCloud SnpCall Input Files]] |
| | | |
| If you want to check if you still have the bam index file, run | | If you want to check if you still have the bam index file, run |
Line 149: |
Line 163: |
| We will use the same configuration file we used for the GotCloud Align tutorial. | | We will use the same configuration file we used for the GotCloud Align tutorial. |
| | | |
− | See [[SeqShop:_Sequence_Mapping_and_Assembly_Practical#GotCloud Configuration File|SeqShop: Alignment: GotCloud Configuration File]] for more details | + | See [[SeqShop:_Sequence_Mapping_and_Assembly_Practical, June 2014#GotCloud Configuration File|SeqShop: Alignment: GotCloud Configuration File]] for more details |
| * Note we want to limit snpcall to just chr22 so the configuration already has <code>CHRS = 22</code> (default was 1-22 & X). | | * Note we want to limit snpcall to just chr22 so the configuration already has <code>CHRS = 22</code> (default was 1-22 & X). |
| | | |
Line 180: |
Line 194: |
| # Currently, GenomeSTRiP only supports calling large deletions, but a duplication-calling pipeline is under way. | | # Currently, GenomeSTRiP only supports calling large deletions, but a duplication-calling pipeline is under way. |
| | | |
− | === Why do we use GotCloud/GenomeSTRiP pipeline instead of directly using GenomeSTRiP itself? === | + | === Why do we use GotCloud/GenomeSTRiP pipeline? === |
− | # The main purpose of GotCloud pipelines is to provide a pipeline for users with limited knowledge and experience with high performance computing environment. | + | # The main purpose of GotCloud pipelines is to provide a pipeline for users with limited knowledge of and experience with high-performance computing environments. |
− | #* Although GenomeSTRiP provides a reasonably straightforward pipeline, it still requires a detailed understanding of GATK framework and the details of parameter. | + | #* GotCloud/GenomeSTRiP provides a simple interface consistent with the alignment, SNP, and indel calling pipelines. |
− | #* GotCloud aims to provide more simpler way to run these procedure | + | #* GenomeSTRiP itself also provides a straightforward pipeline for use as standalone software. |
| # GotCloud supports a variety of cluster environments that are not currently supported by GenomeSTRiP. | | # GotCloud supports a variety of cluster environments that are not currently supported by GenomeSTRiP. |
− | #* GenomeSTRiP is designed based on a framework called Qscript, which provide a nice support for LSF cluster system, but it does not support many other cluster enviroments such as MOSIX or SLURM we use locally. | + | #* GenomeSTRiP is built on a framework called Qscript, which provides good support for the LSF cluster system. |
| + | #* GotCloud supports many additional cluster environments, such as MOSIX or SLURM, which we use locally at Michigan. |
| # GotCloud also provides a fault-tolerant solution for large-scale jobs. | | # GotCloud also provides a fault-tolerant solution for large-scale jobs. |
| #* GotCloud automatically resumes jobs from the point where they failed. This makes it easier to recover a run from technical glitches in the system. | | #* GotCloud automatically resumes jobs from the point where they failed. This makes it easier to recover a run from technical glitches in the system. |
Line 204: |
Line 219: |
| | | |
| In principle, the metadata can be created from the input BAM files by running the following command | | In principle, the metadata can be created from the input BAM files by running the following command |
− | #time perl ${SS}/svtoolkit/bin/genomestrip.pl -run-metadata --conf ${SS}/gotcloud.conf --numjobs 2 --base-prefix ${SS} --outdir ${OUT} | + | perl ${SS}/svtoolkit/bin/genomestrip.pl -run-metadata --conf ${SS}/gotcloud.conf --numjobs 2 --base-prefix ${SS} --outdir ${OUT} |
| | | |
| '''WAIT!!!!! DO NOT RUN THIS COMMAND, because it will take ~50 minutes to finish'''. | | '''WAIT!!!!! DO NOT RUN THIS COMMAND, because it will take ~50 minutes to finish'''. |
Line 291: |
Line 306: |
| | | |
| The discovery pipeline only performs discovery of variant sites with filtering. You will need to iterate over the BAMs again to perform genotyping. | | The discovery pipeline only performs discovery of variant sites with filtering. You will need to iterate over the BAMs again to perform genotyping. |
− | | + | * If running on a small machine, you may want to reduce <code>--numjobs</code> from 4 to 1. |
| time perl ${SS}/svtoolkit/bin/genomestrip.pl -run-genotype --metadata ${SS}/svtoolkit/metadata --conf ${SS}/gotcloud.conf --numjobs 4 --region 22:36000000-37000000 --base-prefix ${SS} --outdir ${OUT} --gcroot ${GC} | | time perl ${SS}/svtoolkit/bin/genomestrip.pl -run-genotype --metadata ${SS}/svtoolkit/metadata --conf ${SS}/gotcloud.conf --numjobs 4 --region 22:36000000-37000000 --base-prefix ${SS} --outdir ${OUT} --gcroot ${GC} |
| * The added <code>--gcroot ${GC}</code> option directs the pipeline to tabix/bgzip programs found within gotcloud. | | * The added <code>--gcroot ${GC}</code> option directs the pipeline to tabix/bgzip programs found within gotcloud. |
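The <code>--numjobs</code> advice above can be sketched as a small shell helper that picks a value from the number of available cores. This is only an illustration: the function name <code>pick_numjobs</code> and the cap of 4 are assumptions made here, not part of GotCloud or GenomeSTRiP.

```shell
#!/bin/sh
# Hypothetical helper (not part of GotCloud): choose a --numjobs value
# that never exceeds the core count and is capped at 4, the value used
# in the tutorial commands.
pick_numjobs() {
    cores="$1"
    cap=4
    if [ "$cores" -lt 1 ]; then
        echo 1
    elif [ "$cores" -lt "$cap" ]; then
        echo "$cores"
    else
        echo "$cap"
    fi
}

# Ask the system for the number of online processors; fall back to 1
# if the query is not supported.
CORES=$(getconf _NPROCESSORS_ONLN 2>/dev/null || echo 1)
NUMJOBS=$(pick_numjobs "$CORES")
echo "Using --numjobs ${NUMJOBS}"
```

The resulting <code>${NUMJOBS}</code> could then be passed in place of the literal 4 in the genotyping command, for example <code>--numjobs ${NUMJOBS}</code>.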
Line 312: |
Line 327: |
| | | |
| You can take third-party sites and genotype them with GenomeSTRiP. Here we take 1000 Genomes phase 1 sites and genotype them. | | You can take third-party sites and genotype them with GenomeSTRiP. Here we take 1000 Genomes phase 1 sites and genotype them. |
− | | + | * If running on a small machine, you may want to reduce <code>--numjobs</code> from 4 to 1. |
− | time perl ${SS}/svtoolkit/bin/genomestrip.pl -run-thirdparty --in-vcf ${SS}/ext/1kg.phase1.chr22.36Mb.sites.vcf --metadata ${SS}/svtoolkit/metadata --outdir ${OUT} --base-prefix ${SS} --conf ${SS}/gotcloud.conf --region 22:36000000-37000000 --numjobs 4 | + | time perl ${SS}/svtoolkit/bin/genomestrip.pl -run-thirdparty --in-vcf ${SS}/ext/1kg.phase1.chr22.36Mb.sites.vcf --metadata ${SS}/svtoolkit/metadata --conf ${SS}/gotcloud.conf --region 22:36000000-37000000 --base-prefix ${SS} --outdir ${OUT} --gcroot ${GC} --numjobs 4 |
| | | |
| This will take ~1 minute to finish. | | This will take ~1 minute to finish. |
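Before launching the third-party genotyping command, it may help to confirm that the inputs it names are present. The following is a minimal sketch, assuming a hypothetical helper called <code>check_inputs</code> (it is not provided by GotCloud or GenomeSTRiP) and assuming <code>${SS}</code> is already set as in the earlier setup steps.

```shell
#!/bin/sh
# Hypothetical helper (not part of GotCloud): report every path that
# does not exist and return non-zero if any is missing.
check_inputs() {
    missing=0
    for p in "$@"; do
        if [ ! -e "$p" ]; then
            echo "missing: $p" >&2
            missing=1
        fi
    done
    return "$missing"
}

# Example usage with the paths from the command above
# (commented out because it depends on SS being set):
# check_inputs "${SS}/ext/1kg.phase1.chr22.36Mb.sites.vcf" \
#              "${SS}/svtoolkit/metadata" \
#              "${SS}/gotcloud.conf" || echo "fix the paths before running"
```

The helper only tests for existence; it does not validate the contents of the VCF or the metadata directory.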