Difference between revisions of "Amazon Snapshot"

From Genome Analysis Wiki
 
(32 intermediate revisions by 2 users not shown)
Line 1: Line 1:
You may run the pipeline software on a single instance we have created for you in AWS.
+
Back to the beginning: [[GotCloud]]
You may also, of course, create your own instance and run it there.
+
 
 +
'''We no longer use a snapshot.''' It is very likely that you will need quite a few packages installed
 +
so that you can compile your software, access the EC2 application data or access data on S3.
 +
It seemed foolish not to make this software available in an AMI.
 +
 
 +
 
 +
 
 +
GotCloud is made available in various forms.
 +
It is distributed as conventional packages for Ubuntu and as compressed TAR files for others.
 +
In addition the source is available from GitHub.
 +
In Amazon Web Services the software is made available as an Amazon Machine Image (AMI).
 +
 
 +
The GotCloud software itself only requires a few packages to be installed for Ubuntu installations
 +
(java-common default-jre make libssl0.9.8).
 +
However, there are a number of things you may well want to do in getting your data
 +
ready for processing (access data on S3 storage, compile GotCloud or others, or
 +
access the EC2 application data).
 +
Assuming this is the case, the GotCloud AMI has installed these packages on Ubuntu.
 +
If you need to run on some other distribution, you may need to install their packages.
 +
 
 +
<code>
 +
  sudo apt-get install java-common default-jre make libssl0.9.8
 +
  sudo apt-get install libnet-amazon-ec2-perl
 +
  sudo apt-get install make g++ libcurl4-openssl-dev libssl-dev libxml2-dev libfuse-dev
 +
</code>
 +
 
 +
You will almost certainly need to fetch and install your own reference files - regardless
 +
of the details of the system you are using.
 +
Finally, you'll need access to your FASTQ files - either copied to the Amazon instance
 +
or perhaps accessible from S3 storage.
 +
 
 +
If the GotCloud instance is unacceptable for some reason, you may install the
 +
software and reference files wherever you'd like
 +
(read about this in [[Pipeline_Debian_Package|Installing from a Debian package]]).
  
 
Your first task is to get an AWS account and keys so that you can use the AWS EC2 Console Dashboard  
 
Your first task is to get an AWS account and keys so that you can use the AWS EC2 Console Dashboard  
Line 6: Line 39:
 
From here you can launch instances prepared by others or create your own.
 
From here you can launch instances prepared by others or create your own.
 
We cannot assist in this step - Amazon has plenty of documentation.
 
We cannot assist in this step - Amazon has plenty of documentation.
Once you are at the AWS EC2 Console Dashboard, you're ready to run the pipeline.
+
Once you are at the AWS EC2 Console Dashboard, you're almost ready for GotCloud.
  
  
'''Launch Your First Instance'''
+
'''Your First Instance'''
  
 
You'll need to know some details when launching an instance:
 
You'll need to know some details when launching an instance:
  
* '''What Instance''' to launch. You have several choices
+
* '''Launch an Instance''' - use the GotCloud instance running 64 bit software.
** ami-be59d78e which is an instance we have prepared based on ''Ubuntu Server 12.04.1 LTS''. It has all of our software installed.
 
** Some other instance. The instance must run 64 bit software and is either Ubuntu of any version or Redhat/CentOS 6.3. You will also need to install the Pipeline software which will require about 15 minutes.
 
 
 
* '''Instance size'''  (memory and number of processors). The pipeline software will require at least 8GB of memory (type m1.large) and can use as many processors as is available.
 
 
 
* '''Storage''' for the instance refers to the size for root (/) partition. This can be quite small, as little as 8GB should work. Of course if you intend to bring lots of other files/programs to the instance, you may want to increase this to something a bit larger (e.g. 30GB).
 
 
 
 
 
Prepare Your Instance
 
 
 
If you launched some other instance than the one prepared for our software, you will need to install
 
the Pipeline software. This is quite simple - see
 
  
 +
* '''Instance size'''  (memory and number of processors). The pipeline software will require at least 4GB of memory (''type m1.medium'') and can use as many processors as is available.
  
is perhaps the most difficult detail as it is controlled completely by the size of your data. As a general rule you will need three times the space required for your sequence data. For instance in the 1000 Genomes data, the data for one individual takes about 45GB. If you have 1000 Genomes data for nine individuals, you'll need about 1500GB of space (9x45x3 plus a little extra space).
+
* '''Storage''' for the instance refers to the size for root (/) partition. This can be quite small, as little as 8GB can work. Of course if you intend to bring other files/programs to the instance, you may need to increase this to something a bit larger (e.g. 30GB).
  
 +
* '''Data Storage''' for the aligner or SNP caller (see below)
  
  
 +
'''Prepare Your Instance'''
  
 +
You will also want additional storage volumes for:
  
 +
* '''Local Storage''' for the instance refers to the size for root (/) partition. This can be quite small, as little as 8GB can work. Of course if you intend to bring other files/programs to the instance, you may need to increase this to something a bit larger (e.g. 30GB).
  
 +
* '''Data Storage''' for the aligner or SNP caller will likely be far larger than the system you are creating.
 +
You'll need to create EBS Volumes for the input and output of the aligner and SNP caller.
  
 +
'''Prepare Your Storage'''
  
Testing the Installation
+
These can be quite substantial and because of that we recommend you create separate volumes like this:
  
We recommend that, at least the first time, you install the test packages so you can conveniently test the installation and make sure everything runs smoothly. The tests run within a few minutes and are self-checking, so unless you see obvious errors, you can be reasonably sure everything is set up properly. You only need to do this once, unless you have made significant changes to your Unix system.
+
* Your '''input FASTQ''' files for the aligner.
 +
This may have been done for you by some vendor when they put your FASTQ data on an S3 volume.
 +
If so, your vendor will need to provide you with the details of how to access your FASTQ files.
 +
If your FASTQ files are not in S3 storage, you'll have to create a volume for this and copy your data into it.
 +
This can take a very long time.
  
sudo dpkg -i debs/biopipe-test*_amd64.deb
+
* The '''output of the aligner''' (BAM files)
Unpacking biopipe-testalign (from .../biopipe-testalign_M.n_amd64.deb) ...
 
Selecting previously deselected package biopipe-testumake.
 
Unpacking biopipe-testumake (from .../biopipe-testumake_M.n_amd64.deb) ...
 
Setting up biopipe-testalign (M.n) ...
 
To test the pipeline, run:
 
 
  /usr/local/biopipe/bin/gen_biopipeline.pl --test ~/testalign
 
 
This will remove the contents of ~/testalign and then run
 
the aligner test case. The output is verified so you know if
 
anything failed or not.
 
 
Setting up biopipe-testumake (M.n) ...
 
To test umake, run:
 
 
  /usr/local/biopipe/bin/umake.pl --test ~/testumake
 
 
This will remove the contents of ~/testumake and then run
 
the umake test case. The output is verified so you know if
 
anything failed or not.
 
  
Login as a normal user (not as root) and do:
+
* The '''intermediate files of the SNP caller''' (GLF files)
  
#  Test the aligner (fast, about 3 minutes)
+
* The '''final output of the SNP caller''' (VCF files)
/usr/local/biopipe/bin/gen_biopipeline.pl --test ~/testalign
 
rm -rf ~/testalign              # If no error
 
 
#  Test umake  (longer, about 15 minutes)
 
/usr/local/biopipe/bin/umake.pl --test ~/testumake
 
rm -rf ~/testumake              # If no error
 

Latest revision as of 13:06, 20 May 2013

Back to the beginning: GotCloud

We no longer use a snapshot. It is very likely that you will need quite a few packages installed so that you can compile your software, access the EC2 application data or access data on S3. It seemed foolish not to make this software available in an AMI.


GotCloud is made available in various forms. It is distributed as conventional packages for Ubuntu and as compressed TAR files for others. In addition the source is available from GitHub. In Amazon Web Services the software is made available as an Amazon Machine Image (AMI).

The GotCloud software itself only requires a few packages to be installed for Ubuntu installations (java-common default-jre make libssl0.9.8). However, there are a number of things you may well want to do in getting your data ready for processing (access data on S3 storage, compile GotCloud or other software, or access the EC2 application data). For this reason, the GotCloud AMI has these packages installed on Ubuntu. If you need to run on some other distribution, you may need to install its equivalent packages.

 sudo apt-get install java-common default-jre make libssl0.9.8 
 sudo apt-get install libnet-amazon-ec2-perl
 sudo apt-get install make g++ libcurl4-openssl-dev libssl-dev libxml2-dev libfuse-dev

You will almost certainly need to fetch and install your own reference files - regardless of the details of the system you are using. Finally, you'll need access to your FASTQ files - either copied to the Amazon instance or perhaps accessible from S3 storage.

If the GotCloud instance is unacceptable for some reason, you may install the software and reference files wherever you'd like (read about this in Installing from a Debian package).

Your first task is to get an AWS account and keys so that you can use the AWS EC2 Console Dashboard (see https://console.aws.amazon.com/ec2/). From here you can launch instances prepared by others or create your own. We cannot assist in this step - Amazon has plenty of documentation. Once you are at the AWS EC2 Console Dashboard, you're almost ready for GotCloud.


Your First Instance

You'll need to know some details when launching an instance:

  • Launch an Instance - use the GotCloud instance running 64-bit software.
  • Instance size (memory and number of processors). The pipeline software will require at least 4GB of memory (type m1.medium) and can use as many processors as are available.
  • Storage for the instance refers to the size of the root (/) partition. This can be quite small; as little as 8GB can work. Of course, if you intend to bring other files/programs to the instance, you may need to increase this to something a bit larger (e.g. 30GB).
  • Data Storage for the aligner or SNP caller (see below)
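
Once an instance is running, the memory and processor figures above can be checked from a shell on the instance. The following is a minimal sketch, not part of GotCloud: the function names and the MIN_MEM_KB threshold are assumptions made for illustration. The threshold is set a little below 4GB because an instance type advertised at roughly 4GB reports somewhat less usable memory.

```shell
#!/bin/sh
# Minimal sketch: report memory and processor count on a Linux instance.
# MIN_MEM_KB is a hypothetical threshold (about 3.5GB, expressed in kB),
# chosen slightly below the 4GB guideline; adjust it to your needs.
MIN_MEM_KB=3500000

# has_enough_memory KB: succeeds when KB is at least MIN_MEM_KB.
has_enough_memory() {
    [ "$1" -ge "$MIN_MEM_KB" ]
}

# total_memory_kb: print the MemTotal value (in kB) from /proc/meminfo.
total_memory_kb() {
    awk '/^MemTotal:/ { print $2 }' /proc/meminfo
}

mem_kb=$(total_memory_kb)
cpus=$(nproc)

if has_enough_memory "$mem_kb"; then
    echo "OK: ${mem_kb} kB of memory, ${cpus} processors"
else
    echo "WARNING: ${mem_kb} kB of memory is below the ${MIN_MEM_KB} kB guideline"
fi
```

If the warning appears, stop the instance and choose a larger instance type before starting the pipeline.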


Prepare Your Instance

You will also want additional storage volumes for:

  • Local Storage for the instance refers to the size of the root (/) partition. This can be quite small; as little as 8GB can work. Of course, if you intend to bring other files/programs to the instance, you may need to increase this to something a bit larger (e.g. 30GB).
  • Data Storage for the aligner or SNP caller will likely be far larger than the system you are creating.

You'll need to create EBS Volumes for the input and output of the aligner and SNP caller.

Prepare Your Storage

The input and output files can be quite substantial, so we recommend you create a separate volume for each of the following:

  • Your input FASTQ files for the aligner.

This may already have been done for you by a vendor who put your FASTQ data in S3 storage. If so, your vendor will need to provide you with the details of how to access your FASTQ files. If your FASTQ files are not in S3 storage, you'll have to create a volume for this and copy your data into it. This can take a very long time.

  • The output of the aligner (BAM files)
  • The intermediate files of the SNP caller (GLF files)
  • The final output of the SNP caller (VCF files)
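
The volume layout above can be planned with a short dry-run script. This is only a sketch under stated assumptions: it assumes the AWS command line interface (aws) is installed, which this page does not mention, and the availability zone, instance ID, sizes, device names and mount points are all hypothetical. It prints the commands rather than running them, so they can be reviewed first.

```shell
#!/bin/sh
# Dry-run sketch: print (do not run) the commands for one EBS volume per data set.
# ZONE, INSTANCE_ID, sizes, devices and mount points are hypothetical examples.
ZONE="us-east-1a"
INSTANCE_ID="i-EXAMPLE"

# plan_volume NAME SIZE_GB DEVICE: print the steps to create, attach,
# format and mount one volume. VOLUME_ID is a placeholder for the ID that
# create-volume reports. On many Linux kernels a volume attached as
# /dev/sdf appears inside the instance as /dev/xvdf; adjust accordingly.
plan_volume() {
    name=$1
    size_gb=$2
    device=$3
    echo "aws ec2 create-volume --size $size_gb --availability-zone $ZONE"
    echo "aws ec2 attach-volume --volume-id VOLUME_ID --instance-id $INSTANCE_ID --device $device"
    echo "sudo mkfs -t ext4 $device"
    echo "sudo mkdir -p /mnt/$name"
    echo "sudo mount $device /mnt/$name"
}

plan_volume fastq 500 /dev/sdf   # input FASTQ files for the aligner
plan_volume bam   500 /dev/sdg   # output of the aligner
plan_volume glf   500 /dev/sdh   # intermediate files of the SNP caller
plan_volume vcf   100 /dev/sdi   # final output of the SNP caller
```

Keeping each data set on its own volume means a volume can be resized, detached, or deleted without touching the others. Note that mkfs erases whatever is on the device, so it should only be run on a newly created, empty volume.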