Difference between revisions of "GotCloud"

From Genome Analysis Wiki
Latest revision as of 17:23, 11 September 2021

Genomes on the Cloud (GotCloud) Introduction

To handle the increasing volume of next-generation sequencing and genotyping data, we developed a set of software pipelines called Genomes on the Cloud (GotCloud).

GotCloud contains Mapping & Variant Calling Pipelines.

Key Features:

  • Connects sequence analysis tools together in an automated pipeline
    • Alignment, quality control, variant calling
  • Robust against unexpected system failure using GNU make
    • easy restart after failure
  • Massively parallel, can run hundreds of jobs
    • Splits large jobs into many pieces
    • Simplifies running on clusters
  • Scalable to tens of thousands of samples
  • Easy to use - Automates series of configurable steps
    • Users do not have to understand, configure, or know the many tools required to create high-quality results
  • Available on Amazon Web Services (AWS) Elastic Compute Cloud (EC2)
  • Run on local machines/clusters
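
The "easy restart after failure" feature comes from GNU make, which skips targets that are already up to date. The following is a minimal sketch of that idea in Python, not GotCloud's actual implementation: the function name `run_steps` and the `.OK` marker-file convention are assumptions made for illustration, and real make also compares file timestamps rather than only checking that a file exists.

```python
from pathlib import Path

def run_steps(steps, workdir):
    """Run each (name, action) step once, skipping steps already marked done.

    A step counts as done when its marker file '<name>.OK' exists, which
    loosely mirrors how make skips a target that is already up to date.
    Returns the names of the steps executed during this call.
    """
    workdir = Path(workdir)
    executed = []
    for name, action in steps:
        marker = workdir / f"{name}.OK"  # hypothetical marker naming
        if marker.exists():
            continue  # finished in an earlier run; do not repeat it
        action()  # if this raises, no marker is written for the step
        marker.touch()
        executed.append(name)
    return executed
```

If a step raises an exception, rerunning `run_steps` with the same working directory resumes at the failed step, because only the completed steps left marker files behind.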

GotCloud incorporates the alignment and variant calling pipelines that we have been using at UM into one easy-to-use, publicly available tool. GotCloud can run on a user's computer, on an instance in a compute cloud, or across a cluster of machines or instances, splitting the work among them.
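
One common way to split a large job into many pieces is to divide each chromosome into fixed-size regions that can be processed independently. The sketch below illustrates that general technique only; the function name, the chromosome lengths, and the 5 Mb chunk size are assumptions for the example and are not taken from GotCloud.

```python
def split_into_regions(chrom_lengths, chunk_size):
    """Yield (chrom, start, end) regions, 1-based and inclusive,
    each covering at most chunk_size bases."""
    for chrom, length in chrom_lengths.items():
        for start in range(1, length + 1, chunk_size):
            end = min(start + chunk_size - 1, length)
            yield chrom, start, end

# Illustrative lengths only, not real chromosome sizes.
regions = list(split_into_regions({"chr20": 12_000_000, "chr21": 5_000_000}, 5_000_000))
for region in regions:
    print(region)
```

Each region could then be handed to a separate job, and the per-region results merged at the end.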

[Figure: Gotcloud.puzzles.v2.png]


Getting Help with GotCloud

Please join the GotCloud Google Group (http://groups.google.com/group/GotCloud) to ask questions about, discuss, or comment on these pipelines.

Currently the "join" button appears to be missing. Click "NEW TOPIC", then select "Join this group". You can then cancel the message post (or post a message).

See GotCloud: FAQs if you have any questions. If your questions are not answered there, ask them in the GotCloud GitHub repository (https://github.com/statgen/gotcloud).

Sequence Analysis Background Information

There are many essential steps in the analysis of next-generation sequence data.

Next-generation sequence data analysis starts with FASTQ files, the typical format provided by your sequencing center, which contain the sequence and base quality information for your data.
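
As a rough illustration of what a FASTQ file holds, the sketch below reads four-line records and decodes the base qualities. It assumes unwrapped records and the Phred+33 quality encoding; the function names and the example read are made up for illustration and are not part of GotCloud.

```python
import io

def read_fastq(handle):
    """Yield (name, sequence, qualities) from a four-line-per-record FASTQ stream."""
    while True:
        header = handle.readline().rstrip("\n")
        if not header:
            return  # end of input
        seq = handle.readline().rstrip("\n")
        handle.readline()  # '+' separator line, ignored
        qual = handle.readline().rstrip("\n")
        if not header.startswith("@") or len(seq) != len(qual):
            raise ValueError(f"malformed FASTQ record: {header!r}")
        # Phred+33 (assumed): quality score = ASCII code minus 33
        yield header[1:], seq, [ord(c) - 33 for c in qual]

def error_probability(q):
    """Convert a Phred quality score into the probability that the base is wrong."""
    return 10 ** (-q / 10)

example = io.StringIO("@read1\nACGT\n+\nII5!\n")
records = list(read_fastq(example))
print(records)
```

A Phred score of 20 corresponds to an error probability of 0.01, and a score of 40 to 0.0001.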

The FASTQ files are processed by the alignment pipeline, which finds the most likely genomic location for each read and stores that information in a BAM (Binary Sequence Alignment/Map format) file. In addition to the sequence and base quality information contained in FASTQ files, a BAM file also contains the genomic location and some additional information about the mapping. As part of the alignment pipeline, the base qualities are adjusted to more accurately reflect the likelihood that each base is correct.

The alignment pipeline can be skipped if you already have deduplicated (deduped) and recalibrated BAM files. If you have BAM files that still need to be deduped and recalibrated, you can use our recabQC pipeline.
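
BAM is a binary format, but its text equivalent, SAM, shows the alignment information described above. The sketch below parses one tab-separated SAM record and checks the duplicate bit (0x400) in the FLAG field. The `Alignment` type, the function names, and the example read are assumptions for illustration; in practice a tool such as samtools would be used to read BAM files.

```python
from typing import NamedTuple

DUPLICATE_FLAG = 0x400  # SAM FLAG bit: PCR or optical duplicate

class Alignment(NamedTuple):
    read_name: str
    flag: int
    chrom: str
    pos: int    # 1-based leftmost mapping position
    mapq: int   # mapping quality
    cigar: str
    seq: str
    qual: str

def parse_sam_line(line):
    """Parse the mandatory fields of one SAM alignment line."""
    f = line.rstrip("\n").split("\t")
    if len(f) < 11:
        raise ValueError("a SAM record has 11 mandatory fields")
    return Alignment(f[0], int(f[1]), f[2], int(f[3]), int(f[4]), f[5], f[9], f[10])

def is_duplicate(aln):
    """Return True when the record is marked as a duplicate."""
    return bool(aln.flag & DUPLICATE_FLAG)

# Made-up record: flag 1024 sets only the duplicate bit.
aln = parse_sam_line("read1\t1024\tchr20\t100\t60\t4M\t*\t0\t0\tACGT\tIIII")
print(aln.chrom, aln.pos, is_duplicate(aln))
```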

The variant calling pipeline processes the deduped and recalibrated BAM files, whether produced by the alignment pipeline or supplied by you, and generates an initial list of polymorphic sites and genotypes stored in a VCF (Variant Call Format) file. The variant calling pipeline then filters the variants using both hard filters and a Support Vector Machine (SVM). It then uses haplotype information to refine these genotypes in an updated VCF file.
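
To make the idea of a hard filter concrete, the sketch below parses the fixed columns of one VCF data line and applies simple threshold checks. The thresholds (`min_qual`, `min_depth`), the function names, and the example lines are hypothetical and do not reflect the filters GotCloud actually applies.

```python
def parse_vcf_line(line):
    """Parse the first eight tab-separated columns of a VCF data line."""
    f = line.rstrip("\n").split("\t")
    info = {}
    for item in f[7].split(";"):
        key, _, value = item.partition("=")
        info[key] = value if value else True  # flag-style keys carry no value
    return {
        "chrom": f[0],
        "pos": int(f[1]),
        "ref": f[3],
        "alt": f[4],
        "qual": None if f[5] == "." else float(f[5]),
        "filter": f[6],
        "info": info,
    }

def passes_hard_filters(variant, min_qual=30.0, min_depth=10):
    """Keep a variant only if QUAL and INFO/DP meet the (hypothetical) thresholds."""
    qual = variant["qual"]
    depth = int(variant["info"].get("DP", 0))
    return qual is not None and qual >= min_qual and depth >= min_depth

low = parse_vcf_line("chr20\t14370\t.\tG\tA\t29\t.\tDP=14;AF=0.5")
high = parse_vcf_line("chr20\t17330\t.\tT\tA\t50\t.\tDP=14;AF=0.1")
print(passes_hard_filters(low), passes_hard_filters(high))
```

Hard filters like these use fixed cutoffs on individual site statistics, whereas the SVM step mentioned above combines several statistics into a single classification.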

After completing the GotCloud Variant Calling Pipeline, EPACTS (Efficient and Parallelizable Association Container Toolbox) can be used to perform statistical tests to identify genome-wide associations from sequence data.

[Figure: GotCloudDiagram.jpg]


Publication

If you use GotCloud, please cite our publication: Jun, Goo, et al. "An efficient and scalable analysis framework for variant extraction and refinement from population scale DNA sequence data." Genome Research (2015): gr-176552. http://genome.cshlp.org/content/early/2015/04/14/gr.176552.114.abstract

GotCloud Setup

You may run the GotCloud software in several modes:

  • On your own hardware running Ubuntu or Redhat/CentOS. See the instructions about installing the software below.
  • On Amazon Elastic Compute Cloud (EC2) running Ubuntu or Redhat/CentOS, if you do not have your own set of machines.
    • See GotCloud: Amazon for more information.
    • You can run on an EC2 cluster instance created by StarCluster.

GotCloud has been developed and tested on Linux Ubuntu 12.10 and 12.04.2 LTS and Red Hat 6.6. While it should work on other Linux systems, they have not yet been tested.

GotCloud on Amazon

You can take advantage of GotCloud AMIs when running on Amazon. The GotCloud AMI already includes GotCloud and default reference files.

See GotCloud: Amazon for instructions on using GotCloud on Amazon.

GotCloud Setup on Any Linux Machine

GotCloud Dependencies

GotCloud requires certain things to be installed in order to run:

  • perl - gotcloud is a perl script and it calls many other perl scripts
    • Zlib.pm - required for perl scripts to read compressed files.
  • make - GNU make is used to run the pipelines
  • java - required to run the Beagle step of the LD-aware genotype refinement
  • curses/ncurses (required for samtools)
    • On Ubuntu: sudo apt-get install libncurses5 libncurses5-dev
  • cmake (required for premo)
    • On Ubuntu: sudo apt-get install cmake

You can check whether your system has the required software installed by running:

[gotcloud_path]/scripts/check_requirements.sh
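
For a quick look at which command-line tools are on your PATH before running the script above, a small Python sketch such as the following may help. It is not a replacement for check_requirements.sh: the tool list is taken from the dependency list above, the function name is an assumption, and it does not check Perl modules such as Zlib.pm or the ncurses libraries.

```python
import shutil

# Executables named in the dependency list above.
REQUIRED_TOOLS = ["perl", "make", "java", "cmake"]

def find_missing_tools(tools=REQUIRED_TOOLS):
    """Return the tools that cannot be found on the current PATH."""
    return [tool for tool in tools if shutil.which(tool) is None]

missing = find_missing_tools()
if missing:
    print("Missing tools:", ", ".join(missing))
else:
    print("All listed tools were found on PATH.")
```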

Install GotCloud Software

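You can install GotCloud on your system as either of the following (see the linked pages for the appropriate instructions):

  • source release (see GotCloud: Source Releases) - contains the scripts and uncompiled source
  • binary release (see GotCloud: Binary Releases) - contains the scripts and pre-compiled binaries (no source)

GotCloud: Versions describes the changes added to each version.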


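Alternatively, if you are using Amazon EC2, you can use one of the following sets of instructions:

  • Create a machine instance based on the AMI we provide: Amazon Single Node
  • Create an EC2 cluster instance using StarCluster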

For more information on using GotCloud on Amazon, see: GotCloud: Amazon

For more information on Amazon Web Services, see: https://aws.amazon.com/

GotCloud Reference/Resource Files

In order to run GotCloud, you need to provide Genetic Reference and Resource Files.

These include information about the reference sequence and dbSNP positions.

See: GotCloud: Genetic Reference and Resource Files for information about the required files. It contains a description of the required files, information about generating your own versions, as well as a downloadable set of files.

  • When running on Amazon, a default set of reference files is included in the GotCloud AMI.

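Configure GotCloud

  • Configure GotCloud for your installation

Running GotCloud Software

  • Alignment Pipeline
    • Alignment Sub-Pipelines - for when you do not want to run the entire Alignment Pipeline
  • Variant Calling Pipeline
  • Indel Calling Pipeline
  • GenomeSTRiP Pipeline (Structural Variation)
  • MEI Calling Pipeline - Ask if you're interested

You can also create your own pipelines. Instructions are here:

  • GotCloud: Creating a New Pipeline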

GotCloud Demos

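GotCloud Demos (originally from our sequencing workshop):

  • SeqShop: Sequence Mapping and Assembly Practical
  • SeqShop: Variant Calling and Filtering for SNPs Practical
  • SeqShop: Variant Calling and Filtering for INDELs Practical
  • SeqShop: Analysis of Structural Variation Practical

GotCloud on Amazon Demo (snpcall & indel):

  • GotCloud: Amazon Demo

Deprecated: Tutorial: GotCloud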

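UMich Development/Release How-To Notes

  • Releasing GotCloud
  • Amazon EC2
    • Creating an AMI on EC2
    • Creating a Snapshot on EC2 (deprecated)
    • Mount S3 Volume
    • Notes on sequence data preparation in Amazon Storage
  • Upgrade Git Subtree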