In this tutorial, we illustrate some of the essential steps in the analysis of next generation sequence data.

For a background on GotCloud and its sequence analysis pipelines, see [[GotCloud]].

While GotCloud can run on a cluster of machines or cloud instances, this tutorial is a small test that runs entirely on the machine where the commands are issued.

GotCloud and this basic tutorial were presented at the [http://ibg.colorado.edu/dokuwiki/doku.php?id=workshop:2013:announcement 2013 IBG Workshop] in two sessions.  On Wednesday, an overview was presented along with the steps for running the tutorial data: [[Media:IBG2013GotCloud.pdf|IBG2013GotCloud.pdf]].  On Friday, more detail on the input files and how they are generated was presented: [[Media:GotCloudIBGWorkshop2013Friday.pdf|GotCloudIBGWorkshop2013Friday.pdf]].

'''This tutorial is in the process of being updated for gotcloud version 1.08 (July 30, 2013).'''

== STEP 1 : Setup GotCloud ==

[[GotCloud]] has been developed and tested on Linux Ubuntu 12.10 and 12.04.2 LTS, but has not been tested on other Linux operating systems.  It is not available for Windows.  If you do not have your own set of machines to run on, GotCloud is also available for Ubuntu running on the Amazon Elastic Compute Cloud; see [[Amazon_Snapshot]] for more information.

We will use 3 different directories for this tutorial:
# path to the directory where gotcloud is installed, default ~/gotcloud-latest/
# path to the directory where the example data is installed, default ~/gotcloudExample
# path to your output directory, default ~/gotcloudTutorialOut/

If the directories specified above do not reflect the directories you would like to use, replace their occurrences in the instructions below with the appropriate paths.
 

=== Step 1a: Install GotCloud ===

In order to run this tutorial, you need to make sure you have GotCloud installed on your system.

Otherwise, you can install it in your own directory:
# Change to the directory where you want gotcloud-latest/ installed
# Download the gotcloud tar
# Extract the tar
# Build (compile) the source
#* Note: as the source builds, many messages will scroll through your terminal.  You may even see some warnings.  These messages are normal and expected.  As long as the build does not end with an error, you have successfully built the source.

  cd ~
  wget https://github.com/statgen/gotcloud/archive/latest.tar.gz  # Download
  tar xf latest.tar.gz  # Extracts into gotcloud-latest/
  cd gotcloud-latest/src; make         # Build source
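
To sanity-check the build, you can verify that the gotcloud driver script and the bin directory exist (a quick check only; the exact directory contents may vary by version):

  cd ~
  ls -l gotcloud-latest/gotcloud gotcloud-latest/bin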

=== Step 1b: Install Example Dataset ===

Our dataset consists of individuals from Great Britain (GBR) sequenced by the 1000 Genomes Project.  These individuals have been sequenced to an average depth of about 4x.

To conserve time and disk space, our analysis will focus on a small region of chromosome 20, 42900000 - 43200000.

The alignment pipeline in this tutorial will be run on 2 of the individuals (HG00096, HG00100).  The fastqs used for this step are reduced to reads that fall into our target region.

The snpcall and ldrefine pipelines will use previously aligned/mapped reads for 60 individuals to generate a list of polymorphic sites and estimate accurate genotypes at each of these sites.

The example dataset we'll be using is available at: ftp://share.sph.umich.edu/gotcloud/gotcloudExample_latest.tgz

# Change directory to where you want to install the Tutorial data
# Download the dataset tar from the ftp site
# Extract the tar

  cd ~
  wget ftp://share.sph.umich.edu/gotcloud/gotcloudExample_latest.tgz  # Download
  tar xf gotcloudExample_latest.tgz    # Extracts into gotcloudExample/
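
To verify that the example data extracted as expected, you can list the directory; among other inputs you should see the chromosome 20 reference files used by this tutorial (described under [[#Alignment Reference Files]]):

  ls gotcloudExample
  ls gotcloudExample/chr20Ref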
    
== STEP 2 : Run GotCloud Alignment Pipeline ==

Run the alignment pipeline (the example aligns 2 samples):
  cd ~
  gotcloud-latest/gotcloud align --conf gotcloudExample/[[#Alignment Configuration File|GBR2align.conf]] --outdir [[#Alignment Output Directory|gotcloudTutorialOut]] --baseprefix gotcloudExample

Upon successful completion of the alignment pipeline (about 1-3 minutes), you will see the following message:
  Processing finished in n secs with no errors reported

The final BAM files produced by the alignment pipeline are:
  ls gotcloudTutorialOut/bams
 
In this directory you will see:
* BAM (.bam) files - 1 per sample
** HG00096.recal.bam
** HG00100.recal.bam
* BAM index (.bai) files - 1 per sample
** HG00096.recal.bam.bai
** HG00100.recal.bam.bai
* Indicator files that the step completed successfully:
** HG00096.recal.bam.done
** HG00100.recal.bam.done
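
If you would like to look inside one of the finished BAMs, you can print its header, for example with samtools (samtools is not part of this tutorial, so this assumes a samtools binary is available on your PATH):

  samtools view -H gotcloudTutorialOut/bams/HG00096.recal.bam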
    
The Quality Control (QC) files are:
  ls gotcloudTutorialOut/QCFiles

In this directory you will see:
* VerifyBamID output files
* QPLOT output files

For information on the VerifyBamID output, see: [[Understanding VerifyBamID output]]

For information on the QPLOT output, see: [[Understanding QPLOT output]]
    
== STEP 3 : Run GotCloud Variant Calling Pipeline ==

Run the variant calling pipeline:
  cd ~
  gotcloud-latest/gotcloud snpcall --conf gotcloudExample/[[GBR60vc.conf]] --outdir gotcloudTutorialOut --numjobs 2 --region 20:42900000-43200000 --baseprefix gotcloudExample

Upon successful completion of the variant calling pipeline (about 2-4 minutes), you will see the following message:
  Commands finished in nnn secs with no errors reported

On SNP Call success, the VCF files of interest are:
  ls gotcloudTutorialOut/vcfs/chr20/chr20.filtered*
    
This gives you the following files:
* '''chr20.filtered.vcf.gz''' - vcf for the whole chromosome after it has been run through hard filters and SVM filters and marked with PASS/FAIL, including per sample genotypes
* chr20.filtered.sites.vcf - vcf for the whole chromosome after it has been run through the filters and marked with PASS/FAIL, without the per sample genotypes
* chr20.filtered.sites.vcf.norm.log - log file
* chr20.filtered.sites.vcf.summary - summary of filters applied
* chr20.filtered.vcf.gz.OK - indicator that the filtering completed successfully
* chr20.filtered.vcf.gz.tbi - index file for the vcf file

Also in the ~/gotcloudTutorialOut/vcfs/chr20 directory are intermediate files:
* the whole chromosome variant calls prior to any filtering:
** chr20.merged.sites.vcf - without per sample genotypes
** chr20.merged.stats.vcf
** chr20.merged.vcf - including per sample genotypes
** chr20.merged.vcf.OK - indicator that the step completed successfully
* the hardfiltered (pre-SVM-filtered) variant calls:
** chr20.hardfiltered.vcf.gz - vcf for the whole chromosome after it has been run through the hard filters
** chr20.hardfiltered.sites.vcf - vcf for the whole chromosome after it has been run through the hard filters and marked with PASS/FAIL, without the per sample genotypes
** chr20.hardfiltered.sites.vcf.log - log file
** chr20.hardfiltered.sites.vcf.summary - summary of filters applied
** chr20.hardfiltered.vcf.gz.OK - indicator that the filtering completed successfully
** chr20.hardfiltered.vcf.gz.tbi - index file for the vcf file
* 40000001.45000000 subdirectory contains the data for just that region.

The ~/gotcloudTutorialOut/split/chr20 folder contains a VCF with just the sites that pass the filters.
  ls ~/gotcloudTutorialOut/split/chr20/
* '''chr20.filtered.PASS.vcf.gz''' - vcf of just the sites that pass all filters
* chr20.filtered.PASS.split.1.vcf.gz - intermediate file
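
To take a quick look at the passing sites, you can page through the VCF with standard command-line tools (nothing GotCloud-specific; this just prints the column header line and the first few records):

  zcat ~/gotcloudTutorialOut/split/chr20/chr20.filtered.PASS.vcf.gz | grep -v "^##" | head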

== STEP 4 : Run GotCloud Genotype Refinement Pipeline ==

Run the LD-aware genotype refinement pipeline:
  cd ~
  gotcloud-latest/gotcloud ldrefine --conf gotcloudExample/[[GBR60vc.conf]] --outdir gotcloudTutorialOut --numjobs 2 --baseprefix gotcloudExample

Upon successful completion of this pipeline (about 3-10 minutes), you will see the following message:
  Commands finished in nnn secs with no errors reported

The output from the beagle step of the genotype refinement pipeline is found in:
  ls gotcloudTutorialOut/beagle/chr20/chr20.filtered.PASS.beagled.vcf.gz gotcloudTutorialOut/beagle/chr20/chr20.filtered.PASS.beagled.vcf.gz.tbi

The output from the thunderVcf (final step) of the genotype refinement pipeline is found in:
  ls gotcloudTutorialOut/thunder/chr20/GBR/thunder/chr20.filtered.PASS.beagled.GBR.thunder.vcf.gz gotcloudTutorialOut/thunder/chr20/GBR/chr20.filtered.PASS.beagled.GBR.thunder.vcf.gz.tbi

== STEP 5 : Run GotCloud Association Analysis Pipeline (EPACTS) ==

We will assume that EPACTS is installed in the following directory:
  setenv EPACTS /path/to/epacts
(If you need to install EPACTS, please refer to the documentation at [[EPACTS#Installation_Details]].)
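
The setenv syntax above is for csh/tcsh shells; if your shell is bash or sh, the equivalent is:

  export EPACTS=/path/to/epacts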

Run a single-variant association test:
  $EPACTS/epacts single --vcf ~/gotcloudTutorialOut/vcfs/chr20/chr20.filtered.vcf.gz --ped ~/gotcloudExample/test.GBR60.ped \
    --out ~/gotcloudTutorialOut/epacts --test q.linear --run 1 --top 1 --chr 20

Upon a successful run, you will see files starting with ~/gotcloudTutorialOut/epacts
  ls ~/gotcloudTutorialOut/epacts*

To see the top associated variants, you can run:
  less ~/gotcloudTutorialOut/epacts.epacts.top5000

To see the locus-zoom like plot, you can type the following command (assuming GNU gnuplot 4.2 or a higher version is installed):
  xpdf ~/gotcloudTutorialOut/epacts.zoom.20.42987877.pdf

Click [[Media:EPACTS TEST.zoom.20.42987877.pdf|Example LocusZoom PDF]] to see the expected output pdf.

= Frequently Asked Questions (FAQs) =

'''I ran the tutorial example successfully, how can I run it with my real sequence data?'''

Congratulations on your successful run of the [[GotCloud]] Tutorial.  Please see the [[#Input Files for GotCloud Tutorial]] section to prepare your own input files for your sequence data.  You will need to specify the FASTQ files associated with your sample names as explained there.  In addition, you will need to download the full reference and resource files for the whole genome (the tutorial contains only the chr20 portion to keep it compact); see the [[#Alignment Configuration File]] section for detailed information.  Also, please refer to the main [[GotCloud]] documentation for a more detailed installation guide beyond the scope of this tutorial.

= Input Files for GotCloud Tutorial =

This section describes the input files needed for the GotCloud tutorial.  You don't need to know these details to run the tutorial, but if you're interested in understanding the structure of the GotCloud pipelines or in running them with your own samples, this is a good starting point.
    
== Alignment Pipeline ==

=== List of Input Files needed for Alignment ===

The command-line inputs to the tutorial alignment pipeline are:
# [[#Alignment Configuration File|Configuration File (--conf)]]
# [[#Alignment Output Directory|Output Directory (--outdir)]]
#* Directory where the output should be placed.

Additional information required to run the alignment pipeline:
# [[#Alignment FASTQ Index File|Index file of FASTQs]]
# [[#Alignment Reference Files|Alignment Reference Files]]

For the tutorial, these values are specified in the configuration file.

=== Alignment Configuration File ===

This configuration file sets:
* [[#Alignment FASTQ Index File|INDEX_FILE]] - file containing the fastqs to be processed as well as the read group information for these fastqs
* Reference Information: see [[#Alignment Reference Files|Alignment Reference Files]] for more information
** AS - assembly value to put in the BAM

The index file and the chromosome 20 references used in this tutorial are included with the example data under the ~/gotcloudExample directory.  The tutorial uses chromosome-20-only references in order to speed up processing.

When running with your own data, you will need to update:
* the Index File, to contain the information for your own fastq files
** See [[#Alignment FASTQ Index File|Alignment FASTQ Index File]] for more information on the contents of the index file.
* the reference files, to be whole genome references
** If you are just running chromosome 20, you can use the tutorial references.
** Whole genome reference files can be downloaded from [[GotCloudReference]].
** See [[#Alignment Reference Files|Alignment Reference Files]] for more information.

Note: It is recommended that you use absolute paths (full path names, like “/home/mktrost/gotcloudReference” rather than just “gotcloudReference”).  This example does not use absolute paths in order to be flexible about where the data is installed, but using relative paths requires running the commands from the correct directory.
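
To illustrate, a minimal alignment configuration would combine the keys above with the reference keys described in the next section.  This is only a sketch, not the actual contents of GBR2align.conf; the index file name shown is hypothetical, and the reference file names are the chromosome 20 tutorial references:

  # hypothetical index file listing the fastqs and their read groups
  INDEX_FILE = GBR2.index
  ############
  # References (chromosome 20 tutorial references)
  REF_DIR = chr20Ref
  AS = NCBI37
  FA_REF = $(REF_DIR)/human_g1k_v37_chr20.fa
  DBSNP_VCF = $(REF_DIR)/dbsnp135_chr20.vcf.gz
  HM3_VCF = $(REF_DIR)/hapmap_3.3.b37.sites.chr20.vcf.gz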

=== Alignment Reference Files ===

Reference files are required for running both the alignment and variant calling pipelines.

The configuration keys for setting these are:
* FA_REF - genome sequence reference files (needed for both pipelines)
* DBSNP_VCF - dbSNP site VCF file (needed for both pipelines)
* HM3_VCF - HapMap site VCF file (needed for both pipelines)
* INDEL_PREFIX - indel sites file (needed for the variant calling pipeline)

The tutorial configuration file is set up to point to the required chromosome 20 reference files, which are included with the tutorial example data in ~/gotcloudExample/chr20Ref/.

If you are running more than just chromosome 20, you will need whole genome reference files, which can be downloaded from [[GotCloudReference]].

The configuration settings for these files are set up in the default configuration, so they do not need to be specified.  You just need to set REF_DIR in your configuration file to the path where you installed your reference files.

To learn more about the reference files that are required, see [[GotCloud: Reference Files]].
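
For example, if you installed the whole genome references in /home/mktrost/gotcloudReference, the only reference setting your configuration would need is (a sketch; substitute your own absolute path):

  REF_DIR = /home/mktrost/gotcloudReference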
      
=== Alignment Output Directory ===

This setting tells the alignment pipeline where to write the output and intermediate files.

The output directory will be created if it doesn't already exist and will contain the following directories/files:
* bams - directory containing the final bam/bai files
** HG00096.OK - indicates that this sample completed alignment processing
** HG00100.OK - indicates that this sample completed alignment processing
* failLogs - directory containing logs from steps that failed
** this directory is only created if an error is detected
* Makefiles - directory containing the makefiles with the commands for processing each sample
** biopipe_HG00096.Makefile - commands for processing sample HG00096
** biopipe_HG00100.Makefile - commands for processing sample HG00100
** biopipe_HG00096.Makefile.log - log file from running the associated Makefile
** biopipe_HG00100.Makefile.log - log file from running the associated Makefile
* QCFiles - directory containing the QC results
* tmp - directory containing temporary alignment files
** bwa.sai.t - contains temporary files for the 1st step of bwa, which generates sai files from the fastq files
*** fastq
**** *.done - indicator files that the step to generate the file completed
**** HG00096
** alignment.bwa - contains temporary files for the 2nd step of bwa, which generates BAM files
*** fastq
**** *.done - indicator files that the step to generate the file completed
** alignment.pol - contains temporary files for the polishBam step that cleans up BAM files
** alignment.dedup - contains temporary files for the deduping step
*** *.done - indicator files that the step to generate the file completed
*** *.metrics - metrics files for the deduping step
** alignment.recal - contains temporary files for the recalibration step
*** *.log - contains information about the recalibration step
*** *.qemp - contains the recalibration tables used for recalibrating each BAM file
         −
 
=== Alignment FASTQ Index File ===

There are four fastq files in {ROOT_DIR}/test/align/fastq/Sample_1 and four fastq files in {ROOT_DIR}/test/align/fastq/Sample_2, both in paired-end format.  Normally, we would need to build an index file for these files.  Conveniently, an index file (indexFile.txt) already exists for the automatic test samples.  It can be found in {ROOT_DIR}/test/align/, and contains the following information in tab-delimited format:
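
As an illustration, an index file along these lines maps each fastq pair to its sample and read group information.  This is a hedged sketch: the column layout follows the sequence index format documented at [[Mapping_Pipeline#Sequence_Index_File]], the paths and read group values are hypothetical rather than the actual contents of indexFile.txt, and the columns are shown space-aligned here although the real file uses tabs:

  MERGE_NAME  FASTQ1                            FASTQ2                            RGID   SAMPLE   LIBRARY  CENTER  PLATFORM
  Sample1     fastq/Sample_1/File1_R1.fastq.gz  fastq/Sample_1/File1_R2.fastq.gz  RGID1  Sample1  Lib1     UM      ILLUMINA
  Sample1     fastq/Sample_1/File2_R1.fastq.gz  fastq/Sample_1/File2_R2.fastq.gz  RGID2  Sample1  Lib1     UM      ILLUMINA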

(More information about: [[Mapping_Pipeline#Sequence_Index_File|the index file]].)
        Line 434: Line 395:     
== Variant Calling Pipeline ==

A configuration file (umake_test.conf) already exists in {ROOT_DIR}/test/umake/.  It contains the following information:

  CHRS = 20
  BAM_INDEX = GBR60bam.index
  ############
  # References
  REF_ROOT = chr20Ref
  #
  REF = $(REF_ROOT)/human_g1k_v37_chr20.fa
  INDEL_PREFIX = $(REF_ROOT)/1kg.pilot_release.merged.indels.sites.hg19
  DBSNP_VCF = $(REF_ROOT)/dbsnp135_chr20.vcf.gz
  HM3_VCF = $(REF_ROOT)/hapmap_3.3.b37.sites.chr20.vcf.gz
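
BAM_INDEX points at a file listing the BAMs to be processed.  As a hedged sketch (the exact layout and paths are illustrative, not the actual contents of GBR60bam.index), each line gives a sample ID, a population label (GBR here, which the ldrefine pipeline uses for the per-population thunder step), and the path to that sample's BAM:

  HG00096  GBR  bams/HG00096.recal.bam
  HG00100  GBR  bams/HG00100.recal.bam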

(More information about: [[Variant_Calling_Pipeline_(UMAKE)#Configuration_File|the configuration file]], [[Variant_Calling_Pipeline_(UMAKE)#Reference_Files|reference files]].)