Amazon Storage

From Genome Analysis Wiki
Revision as of 09:28, 29 October 2012 by Terry Gliedt (talk | contribs)

Setting up your storage is perhaps the most difficult step, because it depends entirely on the size of your data. As a general rule you will need three times the space required for your sequence data. For instance, in the 1000 Genomes data, the data for one individual takes about 45GB. If you have 1000 Genomes data for nine individuals, you'll need about 1500GB of space (9 x 45GB x 3 = 1215GB, plus some extra space).
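
The rule of thumb above can be sketched as a small shell function. The function name required_gb and its arguments are illustrative only and are not part of the Pipeline software; the default multiplier of 3 is the "three times" rule from the text, and the result does not include the extra headroom.

```shell
# Hypothetical helper: estimate raw storage in GB.
# Usage: required_gb <individuals> <gb_per_individual> [multiplier]
required_gb() {
    individuals=$1
    gb_each=$2
    factor=${3:-3}    # "three times" rule of thumb from the text
    echo $(( individuals * gb_each * factor ))
}

# Nine individuals at about 45GB each
required_gb 9 45    # prints 1215
```

Round the result up when choosing volume sizes, so there is some space to spare.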

Making your data available to the Pipeline software can be accomplished in many ways. Here is a simple, straightforward organization you might want to use.

Create Volumes

  • Launch your instance and log in as explained in the AWS documentation.
  • Using the AWS EC2 Console Dashboard create one EBS volume (ELASTIC BLOCK STORE -> Volumes) for the sequence data (e.g. 500GB).
  • Using the Dashboard create another EBS volume for the output of the aligner step (e.g. another 500GB).
  • Using the Dashboard create another EBS volume for the output of the umake step (e.g. another 500GB).

Attach each volume to the instance you have just launched, giving each one a separate device: f, g and h (e.g. /dev/sdf, /dev/sdg and /dev/sdh). It can take a few minutes for the volumes to show up in your instance. Note: as of this writing, if you specify a device as sdf, it will actually show up as /dev/xvdf in the instance.
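
Because attached volumes take a while to appear, it can help to check for them from inside the instance. The sketch below assumes the sd-to-xvd renaming described in the note; the function name console_to_instance_dev is made up for this example, and on other instance types or kernels the device names may differ.

```shell
# Hypothetical helper: translate the device name chosen in the console
# (e.g. /dev/sdf) into the name expected inside the instance (/dev/xvdf).
console_to_instance_dev() {
    case "$1" in
        /dev/sd*) echo "/dev/xvd${1#/dev/sd}" ;;
        *)        echo "$1" ;;    # leave any other name unchanged
    esac
}

# Report whether each expected block device is visible yet
for dev in /dev/sdf /dev/sdg /dev/sdh; do
    inst=$(console_to_instance_dev "$dev")
    if [ -b "$inst" ]; then
        echo "$dev is visible as $inst"
    else
        echo "$dev is not visible yet (expected $inst)"
    fi
done
```

Rerun the loop until all three devices are reported as visible before formatting anything.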

We suggest you dedicate each volume to one particular kind of data:

  • /dev/xvdf for sequence data
  • /dev/xvdg for aligner output data
  • /dev/xvdh for umake output data

Prepare and Mount Volumes

The first time, you need to prepare the disks by formatting and mounting them. Be aware that formatting destroys any data on the volume, so be careful which volume you are working on.

sudo fdisk -l /dev/xvdf          # Do not continue until this works
   Disk /dev/xvdf: 536.9 GB, 536870912000 bytes
     [lines deleted]
   Disk /dev/xvdf doesn't contain a valid partition table   # This is OK

#   Device exists, good. Format it, destroying any data there, so be sure of the device name.
sudo mkfs -t ext4 -L seq /dev/xvdf
   mke2fs 1.42 (29-Nov-2011)
   Filesystem label=seq
     [lines deleted]
   Allocating group tables: done                            
   Writing inode tables: done                            
   Creating journal (32768 blocks): done
   Writing superblocks and filesystem accounting information: done     

#   Repeat these steps for the other volumes
sudo fdisk -l /dev/xvdg
sudo mkfs -t ext4 -L aligner /dev/xvdg

sudo fdisk -l /dev/xvdh
sudo mkfs -t ext4 -L umake /dev/xvdh
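
Since formatting the wrong device destroys its data, the repeated steps above could be wrapped in a guarded helper. This is only a sketch: format_volume and DRY_RUN are names invented for this example. It uses blkid to detect an existing filesystem, and by default it only prints the mkfs command rather than running it.

```shell
# Sketch only: DRY_RUN defaults to 1, so commands are printed, not executed.
DRY_RUN=${DRY_RUN:-1}

# Hypothetical helper: format one device as ext4 with the given label,
# refusing if the device is missing or blkid already finds a filesystem.
format_volume() {
    label=$1
    dev=$2
    if [ "$DRY_RUN" = 1 ]; then
        echo "sudo mkfs -t ext4 -L $label $dev"
        return 0
    fi
    if [ ! -b "$dev" ]; then
        echo "$dev is not a block device, skipping" >&2
        return 1
    fi
    if sudo blkid "$dev" >/dev/null 2>&1; then
        echo "$dev already has a filesystem, skipping" >&2
        return 1
    fi
    sudo mkfs -t ext4 -L "$label" "$dev"
}

format_volume seq     /dev/xvdf
format_volume aligner /dev/xvdg
format_volume umake   /dev/xvdh
```

Set DRY_RUN=0 only after checking that the printed commands name the devices you intend to format.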

Now mount the formatted volumes so you have storage available to run the Pipeline. This example puts all of the data in one place, the myseq directory under your HOME directory. You may, of course, use any paths you'd like.

mkdir -p ~/myseq/seq ~/myseq/aligner ~/myseq/umake

sudo mount -t ext4 /dev/xvdf ~/myseq/seq
sudo mount -t ext4 /dev/xvdg ~/myseq/aligner
sudo mount -t ext4 /dev/xvdh ~/myseq/umake

df -h ~/myseq/*
 Filesystem      Size  Used Avail Use% Mounted on
 /dev/xvdf       500G  7.5G  467G   2% /home/ubuntu/myseq/seq
 /dev/xvdg       500G  7.5G  467G   2% /home/ubuntu/myseq/aligner
 /dev/xvdh       500G  7.5G  467G   2% /home/ubuntu/myseq/umake
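
Volumes mounted with the mount command are not remounted automatically after a reboot. One common way to make them persistent is an /etc/fstab entry per volume. The sketch below only prints candidate entries. It assumes the labels given to mkfs above and the /home/ubuntu/myseq paths from the df output; the function name fstab_line is invented for this example. Review the lines before adding them to /etc/fstab yourself.

```shell
# Hypothetical helper: print one /etc/fstab line for a labelled ext4 volume.
# "nofail" lets the instance boot even if the volume is not attached.
fstab_line() {
    label=$1
    mount_point=$2
    printf 'LABEL=%s %s ext4 defaults,nofail 0 2\n' "$label" "$mount_point"
}

# Labels and paths follow the example above; adjust them to your layout
for name in seq aligner umake; do
    fstab_line "$name" "/home/ubuntu/myseq/$name"
done
```

Mounting by label rather than by device name means the entries still match if a volume is later attached under a different device letter.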