Changes

From Genome Analysis Wiki
341 bytes added, 12:35, 10 January 2014
no edit summary
Line 8: Line 8:  
GotCloud and this basic tutorial were presented at the [http://ibg.colorado.edu/dokuwiki/doku.php?id=workshop:2013:announcement 2013 IBG Workshop].  It was presented in two sessions.  On Wednesday an overview was presented with steps for running the tutorial data: [[Media:IBG2013GotCloud.pdf|IBG2013GotCloud.pdf]].  On Friday more detail on the input files and what goes into generating the input files was presented: [[Media:GotCloudIBGWorkshop2013Friday.pdf|GotCloudIBGWorkshop2013Friday.pdf]].
 
GotCloud and this basic tutorial were presented at the [http://ibg.colorado.edu/dokuwiki/doku.php?id=workshop:2013:announcement 2013 IBG Workshop].  It was presented in two sessions.  On Wednesday an overview was presented with steps for running the tutorial data: [[Media:IBG2013GotCloud.pdf|IBG2013GotCloud.pdf]].  On Friday more detail on the input files and what goes into generating the input files was presented: [[Media:GotCloudIBGWorkshop2013Friday.pdf|GotCloudIBGWorkshop2013Friday.pdf]].
   −
'''This tutorial is in the process of being updated for gotcloud version 1.06 (April 17. 2013).'''
+
'''This tutorial is in the process of being updated for gotcloud version 1.08 (July 30, 2013).'''
    
== STEP 1 : Setup GotCloud ==
 
== STEP 1 : Setup GotCloud ==
Line 15: Line 15:     
We will use 3 different directories for this tutorial:
 
We will use 3 different directories for this tutorial:
# path to the directory where gotcloud is installed, default ~/gotcloud/
+
# path to the directory where gotcloud is installed, default ~/gotcloud-latest/
 
# path to the directory where the example data is installed, default ~/gotcloudExample
 
# path to the directory where the example data is installed, default ~/gotcloudExample
 
# path to your output directory, default ~/gotcloudTutorialOut/
 
# path to your output directory, default ~/gotcloudTutorialOut/
Line 28: Line 28:     
Otherwise, you can install it in your own directory:
 
Otherwise, you can install it in your own directory:
# Change to the directory where you want gotcloud/ installed
+
# Change to the directory where you want gotcloud-latest/ installed
# Download the gotcloud tar from the ftp site.
+
# Download the gotcloud tar.
 
# Extract the tar
 
# Extract the tar
 
# Build (compile) the source
 
# Build (compile) the source
Line 35: Line 35:     
  cd ~
 
  cd ~
  wget ftp://share.sph.umich.edu/gotcloud/gotcloud_latest.tgz # Download
+
  wget https://github.com/statgen/gotcloud/archive/latest.tar.gz # Download
  tar xf gotcloud_latest.tgz     # Extracts into gotcloud/
+
  tar xf latest.tar.gz     # Extracts into gotcloud-latest/
  cd ~/gotcloud/src; make        # Build source
+
  cd gotcloud-latest/src; make        # Build source
  −
GotCloud requires the following tools to be installed.
  −
You can run ~/gotcloud/scripts/check_requirements.sh
  −
...TBD – put in required programs/tools.
  −
* java (java-common default-jre on ubuntu)
  −
* make (make on ubuntu)
  −
* libssl (libssl0.9.8 on ubuntu)
  −
* gcc 4.4 or newer
      
=== Step 1b: Install Example Dataset ===
 
=== Step 1b: Install Example Dataset ===
Our dataset consists of 60 individuals from Great Britain (GBR) sequenced by the 1000 Genomes Project. These individuals have been sequenced to an average depth of about 4x.
+
Our dataset consists of individuals from Great Britain (GBR) sequenced by the 1000 Genomes Project. These individuals have been sequenced to an average depth of about 4x.
    
To conserve time and disk-space, our analysis will focus on a small region on chromosome 20, 42900000 - 43200000.  
 
To conserve time and disk-space, our analysis will focus on a small region on chromosome 20, 42900000 - 43200000.  
   −
The tutorial will run the alignment pipeline on 2 of the individuals (HG00096, HG00100).  The fastqs used for this step are reduced to reads that fall into our target region.
+
The alignment pipeline in this tutorial will be run on 2 of the individuals (HG00096, HG00100).  The fastqs used for this step are reduced to reads that fall into our target region.
   −
The tutorial will then used previously aligned/mapped reads for the full 60 individuals to generate a list of polymorphic sites and estimate accurate genotypes at each of these sites.  
+
The snpcall and ldrefine pipelines will use previously aligned/mapped reads for 60 individuals to generate a list of polymorphic sites and estimate accurate genotypes at each of these sites.  
    
The example dataset we'll be using is available at: ftp://share.sph.umich.edu/gotcloud/gotcloudExample.tgz  
 
The example dataset we'll be using is available at: ftp://share.sph.umich.edu/gotcloud/gotcloudExample.tgz  
Line 64: Line 56:  
  cd ~
 
  cd ~
 
  wget ftp://share.sph.umich.edu/gotcloud/gotcloudExample_latest.tgz  # Download  
 
  wget ftp://share.sph.umich.edu/gotcloud/gotcloudExample_latest.tgz  # Download  
  tar xvf gotcloudExample_latest.tgz    # Extracts into gotcloudExample/
+
  tar xf gotcloudExample_latest.tgz    # Extracts into gotcloudExample/
    
== STEP 2 : Run GotCloud Alignment Pipeline ==  
 
== STEP 2 : Run GotCloud Alignment Pipeline ==  
Line 83: Line 75:     
Run the alignment pipeline (the example aligns 2 samples) :  
 
Run the alignment pipeline (the example aligns 2 samples) :  
  ~/gotcloud/gotcloud align --conf ~/gotcloudExample/[[#Alignment Configuration File|GBR2align.conf]] --outdir [[#Alignment Output Directory|~/gotcloudTutorialOut]]  
+
  cd ~
 +
gotcloud-latest/gotcloud align --conf gotcloudExample/[[#Alignment Configuration File|GBR2align.conf]] --outdir [[#Alignment Output Directory|gotcloudTutorialOut]] --baseprefix gotcloudExample
   −
Upon successful completion of the alignment pipeline (about 1-2 minutes), you will see the following message:  
+
Upon successful completion of the alignment pipeline (about 1-3 minutes), you will see the following message:  
  Processing finished in nn secs with no errors reported  
+
  Processing finished in n secs with no errors reported  
    
The final BAM files produced by the alignment pipeline are:  
 
The final BAM files produced by the alignment pipeline are:  
  ls ~/gotcloudTutorialOut/bams
+
  ls gotcloudTutorialOut/bams
 
In this directory you will see:
 
In this directory you will see:
 
* BAM (.bam) files - 1 per sample
 
* BAM (.bam) files - 1 per sample
Line 97: Line 90:  
** HG00096.recal.bam.bai  
 
** HG00096.recal.bam.bai  
 
** HG00100.recal.bam.bai  
 
** HG00100.recal.bam.bai  
* BAM checksum files (.md5) – 1 per sample
+
* Indicator files that the step completed successfully:
** HG00096.recal.bam.md5
  −
** HG00100.recal.bam.md5
  −
* Indicator flies that the step completed successfully:
   
** HG00096.recal.bam.done  
 
** HG00096.recal.bam.done  
 
** HG00100.recal.bam.done  
 
** HG00100.recal.bam.done  
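
The file listing above can be checked from the shell rather than by eye. The following is a minimal sketch, not part of GotCloud: the function name <code>check_align_outputs</code> is hypothetical, and the <code>.recal.bam</code> name is inferred from the index and indicator file names shown in the listing.

```shell
# Hypothetical helper (not shipped with GotCloud): report which of the
# expected alignment outputs are missing under <outdir>/bams.
# Assumption: each sample's BAM is named <sample>.recal.bam, inferred from
# the .bai and .done names in the tutorial listing.
check_align_outputs() {
    outdir="$1"
    missing=0
    for sample in HG00096 HG00100; do
        for suffix in recal.bam recal.bam.bai recal.bam.done; do
            f="$outdir/bams/$sample.$suffix"
            if [ ! -e "$f" ]; then
                echo "missing: $f"
                missing=$((missing + 1))
            fi
        done
    done
    # Succeed only when nothing was missing.
    [ "$missing" -eq 0 ]
}
```

For the tutorial defaults this could be called as <code>check_align_outputs ~/gotcloudTutorialOut && echo "alignment outputs present"</code>.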
    
The Quality Control (QC) files are:  
 
The Quality Control (QC) files are:  
  ls ~/gotcloudTutorialOut/QCFiles
+
  ls gotcloudTutorialOut/QCFiles
 
In this directory you will see:
 
In this directory you will see:
 
* VerifyBamID output files:
 
* VerifyBamID output files:
Line 133: Line 123:  
For information on the VerifyBamID output, see: [[Understanding VerifyBamID output]]  
 
For information on the VerifyBamID output, see: [[Understanding VerifyBamID output]]  
   −
For information on the QPLOT output, see: [[Understanding QPLOT output]]  
+
For information on the QPLOT output, see: [[Understanding QPLOT output]]
    
== STEP 3 : Run GotCloud Variant Calling Pipeline ==  
 
== STEP 3 : Run GotCloud Variant Calling Pipeline ==  
Line 151: Line 141:     
Run the variant calling pipeline:  
 
Run the variant calling pipeline:  
  ~/gotcloud/gotcloud snpcall --conf [[GBR60vc.conf]] --outdir ~/gotcloudTutorialOut --numjobs 2 --region 20:42900000-43200000  
+
  cd ~
 +
gotcloud-latest/gotcloud snpcall --conf gotcloudExample/[[GBR60vc.conf]] --outdir gotcloudTutorialOut --numjobs 2 --region 20:42900000-43200000 --baseprefix gotcloudExample
   −
Upon successful completion of the variant calling pipeline (about 3-4 minutes), you will see the following message:  
+
Upon successful completion of the variant calling pipeline (about 2-4 minutes), you will see the following message:  
 
   Commands finished in nnn secs with no errors reported  
 
   Commands finished in nnn secs with no errors reported  
    
On SNP Call success, the VCF files of interest are:  
 
On SNP Call success, the VCF files of interest are:  
  ls ~/gotcloudTutorialOut/vcfs/chr20/chr20.filtered*
+
  ls gotcloudTutorialOut/vcfs/chr20/chr20.filtered*
    
This gives you the following files:
 
This gives you the following files:
* '''chr20.filtered.vcf.gz ''' - vcf for whole chromosome after it has been run through filters and marked with PASS/FAIL including per sample genotypes
+
* '''chr20.filtered.vcf.gz ''' - vcf for whole chromosome after it has been run through hardfilters and SVM filters and marked with PASS/FAIL including per sample genotypes
 
* chr20.filtered.sites.vcf - vcf for whole chromosome after it has been run through filters and marked with PASS/FAIL without the per sample genotypes
 
* chr20.filtered.sites.vcf - vcf for whole chromosome after it has been run through filters and marked with PASS/FAIL without the per sample genotypes
* chr20.filtered.sites.vcf.log - log file
+
* chr20.filtered.sites.vcf.norm.log - log file
 
* chr20.filtered.sites.vcf.summary - summary of filters applied
 
* chr20.filtered.sites.vcf.summary - summary of filters applied
 
* chr20.filtered.vcf.gz.OK - indicator that the filtering completed successfully
 
* chr20.filtered.vcf.gz.OK - indicator that the filtering completed successfully
Line 173: Line 164:  
** chr20.merged.vcf - including per sample genotypes
 
** chr20.merged.vcf - including per sample genotypes
 
** chr20.merged.vcf.OK - indicator that the step completed successfully
 
** chr20.merged.vcf.OK - indicator that the step completed successfully
 +
* the hardfiltered (pre-svm filtered) variant calls:
 +
** chr20.hardfiltered.vcf.gz - vcf for whole chromosome after it has been run through hard filters
 +
** chr20.hardfiltered.sites.vcf - vcf for whole chromosome after it has been run through filters and marked with PASS/FAIL without the per sample genotypes
 +
** chr20.hardfiltered.sites.vcf.log - log file
 +
** chr20.hardfiltered.sites.vcf.summary - summary of filters applied
 +
** chr20.hardfiltered.vcf.gz.OK - indicator that the filtering completed successfully
 +
** chr20.hardfiltered.vcf.gz.tbi - index file for the vcf file
 
* 40000001.45000000 subdirectory contains the data for just that region.
 
* 40000001.45000000 subdirectory contains the data for just that region.
   Line 189: Line 187:  
Note: the tutorial does not produce a target directory, but if you run with targeted data, you may see that.
 
Note: the tutorial does not produce a target directory, but if you run with targeted data, you may see that.
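
Since the filtered sites file marks each variant in its FILTER column, a quick tally shows how many sites passed. This is a sketch under stated assumptions: <code>count_vcf_filters</code> is a hypothetical helper, and it relies only on the standard VCF layout (tab-separated, header lines starting with <code>#</code>, FILTER in the 7th column). The filter labels that GotCloud actually writes are not assumed here.

```shell
# Hypothetical helper: tally the values in the FILTER column (7th field)
# of an uncompressed VCF, skipping header lines that start with '#'.
# Prints one "<filter><TAB><count>" line per distinct value, sorted.
count_vcf_filters() {
    awk -F '\t' '!/^#/ { n[$7]++ } END { for (f in n) print f "\t" n[f] }' "$1" | sort
}
```

With the tutorial defaults, a call might look like <code>count_vcf_filters ~/gotcloudTutorialOut/vcfs/chr20/chr20.filtered.sites.vcf</code>.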
   −
== STEP 4 : Run Support Vector Machine (SVM) Pipeline ==
+
== STEP 4 : Run GotCloud Genotype Refinement Pipeline ==  
 
  −
== STEP 5 : Run GotCloud Genotype Refinement Pipeline ==  
   
The next step is to perform genotype refinement using linkage disequilibrium information using [http://faculty.washington.edu/browning/beagle/beagle.html Beagle] & [[ThunderVCF]].  
 
The next step is to perform genotype refinement using linkage disequilibrium information using [http://faculty.washington.edu/browning/beagle/beagle.html Beagle] & [[ThunderVCF]].  
    
Run the LD-aware genotype refinement pipeline:  
 
Run the LD-aware genotype refinement pipeline:  
  ~/gotcloud/gotcloud ldrefine --conf [[GBR60vc.conf]] --outdir ~/gotcloudTutorialOut --numjobs 2  
+
  cd ~
 +
gotcloud-latest/gotcloud ldrefine --conf gotcloudExample/[[GBR60vc.conf]] --outdir gotcloudTutorialOut --numjobs 2 --baseprefix gotcloudExample
   −
Upon successful completion of this pipeline, you will see the following message:  
+
Upon successful completion of this pipeline (about 3-10 minutes), you will see the following message:  
 
  Commands finished in nnn secs with no errors reported  
 
  Commands finished in nnn secs with no errors reported  
    
The output from the beagle step of the genotype refinement pipeline is found in:  
 
The output from the beagle step of the genotype refinement pipeline is found in:  
  ls ~/gotcloudTutorialOut/beagle/chr20/chr20.filtered.PASS.beagled.vcf.gz ~/gotcloudTutorialOut/beagle/chr20/chr20.filtered.PASS.beagled.vcf.gz.tbi  
+
  ls gotcloudTutorialOut/beagle/chr20/chr20.filtered.PASS.beagled.vcf.gz gotcloudTutorialOut/beagle/chr20/chr20.filtered.PASS.beagled.vcf.gz.tbi  
    
The output from the thunderVcf (final step) of the genotype refinement pipeline is found in:  
 
The output from the thunderVcf (final step) of the genotype refinement pipeline is found in:  
  ls ~/gotcloudTutorialOut/thunder/chr20/GBR/chr20.filtered.PASS.beagled.GBR.thunder.vcf.gz ~/gotcloudTutorialOut/thunder/chr20/GBR/chr20.filtered.PASS.beagled.GBR.thunder.vcf.gz.tbi  
+
  ls gotcloudTutorialOut/thunder/chr20/GBR/thunder/chr20.filtered.PASS.beagled.GBR.thunder.vcf.gz gotcloudTutorialOut/thunder/chr20/GBR/chr20.filtered.PASS.beagled.GBR.thunder.vcf.gz.tbi
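
Each refined VCF above is listed together with its <code>.tbi</code> index. As a rough check that no index is missing, the sketch below walks an output directory and prints every <code>.vcf.gz</code> without a matching <code>.tbi</code> beside it. The function name <code>find_unindexed_vcfs</code> is hypothetical, and the search is by file name only, so it does not depend on the exact subdirectory layout.

```shell
# Hypothetical helper (not part of GotCloud): print every .vcf.gz under
# the given directory that has no <file>.tbi index next to it.
find_unindexed_vcfs() {
    find "$1" -type f -name '*.vcf.gz' | while IFS= read -r vcf; do
        [ -e "$vcf.tbi" ] || echo "$vcf"
    done
}
```

Running <code>find_unindexed_vcfs ~/gotcloudTutorialOut</code> should print nothing if every compressed VCF has its index.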
 
  −
 
     −
== STEP 6 : Run GotCloud Association Analysis Pipeline (EPACTS) ==  
+
== STEP 5 : Run GotCloud Association Analysis Pipeline (EPACTS) ==  
    
We will assume that the EPACTS are installed in the following directory
 
We will assume that the EPACTS are installed in the following directory
Line 230: Line 225:  
= Frequently Asked Questions (FAQs) =
 
= Frequently Asked Questions (FAQs) =
   −
'''I ran the tutorai example successfully, how can I run it with my real sequence data?'''
+
'''I ran the tutorial example successfully, how can I run it with my real sequence data?'''
   −
Congratulations for your successful run of your [[GotCloud]] Tutorial. Please see [[#Tutorial Inputs]] section to prepare your own input files for your sequence data. You will need to specify the FASTQ files associated with its sample names as explained. In addition, you will need to download the full reference and resource file across whole genome (the Tutorial contains only chr20 portion to make it compact) See [[#Alignment Configuration File]] section for the detailed information. Also, please refer to the original documentation of [[GotCloud]] for more detailed guide on installation beyond the scope of tutorial
+
Congratulations for your successful run of your [[GotCloud]] Tutorial. Please see [[#Tutorial Inputs]] section to prepare your own input files for your sequence data. You will need to specify the FASTQ files associated with its sample names as explained. In addition, you will need to download the full reference and resource file across whole genome (the Tutorial contains only chr20 portion to make it compact) See [[#Alignment Configuration File]] section for the detailed information. Also, please refer to the original documentation of [[GotCloud]] for more detailed guide on installation beyond the scope of the tutorial.
    
= Input Files for GotCloud Tutorial =  
 
= Input Files for GotCloud Tutorial =  
