Difference between revisions of "Amazon Snapshot"

From Genome Analysis Wiki

Latest revision as of 13:06, 20 May 2013


We no longer use a snapshot. You will very likely need quite a few packages installed so that you can compile your software, access the EC2 application data, or access data on S3, so it made more sense to provide this software in an AMI.


GotCloud is made available in various forms. It is distributed as conventional packages for Ubuntu and as compressed TAR files for other distributions. In addition, the source is available from GitHub. In Amazon Web Services the software is made available as an Amazon Machine Image (AMI).

The GotCloud software itself requires only a few packages on Ubuntu installations (java-common default-jre make libssl0.9.8). However, there are a number of things you may well want to do while getting your data ready for processing (access data on S3 storage, compile GotCloud or other software, or access the EC2 application data). For that reason, the GotCloud AMI already has the packages below installed on Ubuntu. If you need to run on some other distribution, you may need to install its equivalent packages.

 sudo apt-get install java-common default-jre make libssl0.9.8 
 sudo apt-get install libnet-amazon-ec2-perl
 sudo apt-get install make g++ libcurl4-openssl-dev libssl-dev libxml2-dev libfuse-dev
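
If you are not using the GotCloud AMI, it can help to see which of these packages are still absent before installing anything. The following is a minimal sketch, not part of GotCloud: `missing_packages` and `GOTCLOUD_PACKAGES` are names made up for this example, and it assumes a Debian/Ubuntu system where `dpkg -s` is available.

```shell
# Packages named in the apt-get commands above.
GOTCLOUD_PACKAGES="java-common default-jre make libssl0.9.8 libnet-amazon-ec2-perl g++ libcurl4-openssl-dev libssl-dev libxml2-dev libfuse-dev"

# Hypothetical helper: print each named package that "dpkg -s" does not find.
# If dpkg itself is absent, every package is reported.
missing_packages() {
  for pkg in "$@"; do
    if ! dpkg -s "$pkg" >/dev/null 2>&1; then
      echo "$pkg"
    fi
  done
}

# Example use (unquoted on purpose so the list splits into words):
#   missing_packages $GOTCLOUD_PACKAGES
```

Anything it prints can then be passed to `sudo apt-get install`.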

You will almost certainly need to fetch and install your own reference files - regardless of the details of the system you are using. Finally, you'll need access to your FASTQ files - either copied to the Amazon instance or perhaps accessible from S3 storage.

If the GotCloud instance is unacceptable for some reason, you may install the software and reference files wherever you'd like (read about this in Installing from a Debian package).

Your first task is to get an AWS account and keys so that you can use the AWS EC2 Console Dashboard (see https://console.aws.amazon.com/ec2/). From there you can launch instances prepared by others or create your own. We cannot assist with this step, but Amazon has plenty of documentation. Once you are at the AWS EC2 Console Dashboard, you're almost ready for GotCloud.


Your First Instance

You'll need to know some details when launching an instance:

  • Launch an Instance - use the GotCloud instance running 64-bit software.
  • Instance size (memory and number of processors). The pipeline software requires at least 4GB of memory (type m1.medium) and can use as many processors as are available.
  • Storage for the instance refers to the size of the root (/) partition. This can be quite small; as little as 8GB can work. If you intend to bring other files or programs to the instance, you may need to increase this to something a bit larger (e.g. 30GB).
  • Data Storage for the aligner or SNP caller (see below)
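
The same choices can also be made from the command line rather than the console. The sketch below only prints an AWS CLI `aws ec2 run-instances` command so you can review it before running it. The helper name `launch_command`, the AMI ID, and the key name are placeholders (this page does not give the GotCloud AMI ID), and the root device name `/dev/sda1` is an assumption that depends on the AMI.

```shell
# Hypothetical helper: print (do not run) a launch command for review.
#   $1 = AMI ID (placeholder; look up the real GotCloud AMI ID yourself)
#   $2 = name of your EC2 key pair
#   $3 = instance type, default m1.medium (the minimum named above)
#   $4 = root partition size in GB, default 8
launch_command() {
  local ami_id="$1"
  local key_name="$2"
  local instance_type="${3:-m1.medium}"
  local root_gb="${4:-8}"
  # /dev/sda1 as the root device is an assumption; check your AMI.
  local mapping="[{\"DeviceName\":\"/dev/sda1\",\"Ebs\":{\"VolumeSize\":$root_gb}}]"
  printf "aws ec2 run-instances --image-id %s --instance-type %s --key-name %s --block-device-mappings '%s'\n" \
    "$ami_id" "$instance_type" "$key_name" "$mapping"
}

# Example use:
#   launch_command ami-XXXXXXXX my-key m1.medium 30
```

Printing the command first is a deliberate choice: launching an instance costs money, so it is worth checking the instance type and root size before anything actually runs.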


Prepare Your Instance

You will also want additional storage volumes for:

  • Local Storage for the instance refers to the size of the root (/) partition. This can be quite small; as little as 8GB can work. If you intend to bring other files or programs to the instance, you may need to increase this to something a bit larger (e.g. 30GB).
  • Data Storage for the aligner or SNP caller will likely be far larger than the system you are creating.

You'll need to create EBS Volumes for the input and output of the aligner and SNP caller.

Prepare Your Storage

The data volumes can be quite large, so we recommend creating a separate volume for each of the following:

  • Your input FASTQ files for the aligner.

This may have been done for you by a vendor who put your FASTQ data in S3 storage. If so, your vendor will need to provide you with the details of how to access your FASTQ files. If your FASTQ files are not in S3 storage, you'll have to create a volume for them and copy your data into it. This can take a very long time.

  • The output of the aligner (BAM files)
  • The intermediate files of the SNP caller (GLF files)
  • The final output of the SNP caller (VCF files)
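
One possible layout for these four volumes is sketched below. It assumes each EBS volume has already been created and attached in the console, that the volumes were attached as /dev/sdf through /dev/sdi and appear inside the instance as /dev/xvdf through /dev/xvdi (common on Xen-based instances, but check with `lsblk`), and that each already holds a filesystem. The `/data/...` mount points and the helper name `volume_plan` are made up for this example. The function only prints the commands.

```shell
# Hypothetical layout: one attached EBS volume per data set.
# Each entry is "name:device-letter"; adjust the letters to match your instance.
VOLUME_LAYOUT="fastq:f bam:g glf:h vcf:i"

# Print (do not run) the mkdir and mount commands for each volume.
# A brand-new, empty volume needs a filesystem first (for example with mkfs);
# never do that on a volume that already holds data.
volume_plan() {
  for pair in $VOLUME_LAYOUT; do
    name="${pair%%:*}"     # text before the colon, e.g. fastq
    letter="${pair##*:}"   # text after the colon, e.g. f
    echo "sudo mkdir -p /data/$name"
    echo "sudo mount /dev/xvd$letter /data/$name"
  done
}

# Example use: review the plan, then run it.
#   volume_plan
#   volume_plan | sh
```

After mounting, `df -h` shows whether each volume is available at the expected path. Mounts made this way do not survive a restart, so they need to be repeated each time the instance is started.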