Amazon Storage

Back to the beginning at GotCloud

Setting up your storage is perhaps the most difficult step, as it depends entirely on the size of your data. As a general rule you will need three times the space required for your sequence data. For instance, in the 1000 Genomes data, the data for one individual takes about 45 GB. If you have 1000 Genomes data for nine individuals, you'll need about 1500 GB of space (9 x 45 x 3 plus a little extra space).
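Spelled out, the arithmetic behind that estimate is roughly:

 9 individuals x 45 GB of sequence data           =  405 GB
 x 3 (sequence + aligner output + umake output)   = 1215 GB
 + a little working headroom                      -> about 1500 GB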

Making your data available for the Pipeline software can be accomplished in many ways. Here is a simple, straightforward organization you might want to use.

Making Use of Instance Storage

Some instance types come with instance store volumes. By default they are not added to your instance; you need to set them up prior to launching your instance.

I found instructions at: http://docs.aws.amazon.com/AWSEC2/latest/UserGuide/InstanceStorage.html#InstanceStore_UsageScenarios

  • which referred me to: http://docs.aws.amazon.com/AWSEC2/latest/UserGuide/block-device-mapping-concepts.html#Using_OverridingAMIBDM

Make sure you add the instance store prior to launching.

  • After launching, only one of the two instance stores was mounted at /mnt.
  • To mount the other instance store, I followed the instructions in the "To make a volume available" section of: http://docs.aws.amazon.com/AWSEC2/latest/UserGuide/ec2-add-volume-to-instance.html
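For reference, here is a minimal command line sketch of adding instance store volumes before launch, assuming the AWS CLI is installed and configured. The AMI ID, instance type, and device names below are placeholders, and the console's block device mapping screen accomplishes the same thing.

# Map two instance store volumes at launch (all IDs and names below are placeholders).
aws ec2 run-instances --image-id ami-xxxxxxxx --instance-type m1.large \
    --block-device-mappings '[{"DeviceName":"/dev/sdb","VirtualName":"ephemeral0"},{"DeviceName":"/dev/sdc","VirtualName":"ephemeral1"}]'

# After launch, the second store typically needs a filesystem and a mount point
# (the device name is a guess; check lsblk or 'sudo fdisk -l' first).
sudo mkfs -t ext4 /dev/xvdc
sudo mkdir /mnt2
sudo mount /dev/xvdc /mnt2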


Create Volumes

  • Launch your instance and log in as explained in the AWS documentation.
  • Using the AWS EC2 Console Dashboard, create one EBS volume (ELASTIC BLOCK STORE -> Volumes) for the sequence data (e.g. 500GB).
  • Using the Dashboard, create another EBS volume for the output of the aligner step (e.g. another 500GB).
  • Using the Dashboard, create another EBS volume for the output of the umake step (e.g. another 500GB).

Attach each volume to the instance you have just launched, specifying a separate device for each: f, g and h (e.g. /dev/sdf, /dev/sdg and /dev/sdh). It will take a few minutes for the volumes to show up in your instance. Note: as of this writing, if you specify a device as sdf, it will actually show up as /dev/xvdf in the instance.
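If you prefer the command line to the console, the same create-and-attach steps might look roughly like this with the AWS CLI; this is only a sketch, and the availability zone, volume ID and instance ID are placeholders (the volume must be created in the same availability zone as the instance):

# Create a 500 GB EBS volume in the same availability zone as your instance (zone is a placeholder).
aws ec2 create-volume --size 500 --availability-zone us-east-1a

# Attach it as device f (volume and instance IDs are placeholders).
aws ec2 attach-volume --volume-id vol-xxxxxxxx --instance-id i-xxxxxxxx --device /dev/sdf

# Repeat with --device /dev/sdg and /dev/sdh for the aligner and umake volumes.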

We suggest you create a separate storage volume for each particular kind of data:

  • /dev/xvdf for sequence data (this might have already been done for you)
  • /dev/xvdg for aligner output data
  • /dev/xvdh for umake output data
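Before formatting anything, you can confirm that the devices are visible under the names above (a suggested check, not part of the original steps):

lsblk                            # /dev/xvdf, xvdg and xvdh should be listed; 'sudo fdisk -l' also works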


Prepare and Attach Volumes

The first time through you need to prepare the disks by formatting and mounting them. Realize this step destroys any data on the volume being formatted, so be careful which volume you are working on. If someone has already put your sequence data on an EBS volume, attach it as /dev/xvdf, but do not format it; format only /dev/xvdg and /dev/xvdh. Do not format a volume that already has your sequence data.
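A quick, non-destructive way to tell whether a device already holds a filesystem (and therefore must not be formatted) is to inspect it with file; this check is our suggestion, not part of the original steps:

sudo file -s /dev/xvdf           # "data" means no filesystem; an ext4 signature means it is already formatted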

sudo fdisk -l /dev/xvdf          # Do not continue until this works
 Disk /dev/xvdf: 536.9 GB, 536870912000 bytes
   [lines deleted]
 Disk /dev/xvdf doesn't contain a valid partition table   # This is OK

# Device exists, good. Format it, destroying any data there, so be sure of the device name and label.
sudo mkfs -t ext4 -L seq /dev/xvdf
 mke2fs 1.42 (29-Nov-2011)
 Filesystem label=seq
   [lines deleted]
 Allocating group tables: done
 Writing inode tables: done
 Creating journal (32768 blocks): done
 Writing superblocks and filesystem accounting information: done

# Repeat these steps for the other volumes
sudo fdisk -l /dev/xvdg
sudo mkfs -t ext4 -L aligner /dev/xvdg

sudo fdisk -l /dev/xvdh
sudo mkfs -t ext4 -L umake /dev/xvdh

Now mount the formatted volumes so you have storage available to run the Pipeline. This example puts all of the data in your HOME directory under one place, ~/myseq. You may, of course, use any paths you'd like.

mkdir -p ~/myseq/seq ~/myseq/aligner ~/myseq/umake

sudo mount -t ext4 /dev/xvdf ~/myseq/seq
sudo mount -t ext4 /dev/xvdg ~/myseq/aligner
sudo mount -t ext4 /dev/xvdh ~/myseq/umake

df -h ~/myseq/*
 Filesystem      Size  Used Avail Use% Mounted on
 /dev/xvdf       500G  7.5G  467G   2% /home/ubuntu/myseq/seq
 /dev/xvdg       500G  7.5G  467G   2% /home/ubuntu/myseq/aligner
 /dev/xvdh       500G  7.5G  467G   2% /home/ubuntu/myseq/umake
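Two follow-up details, offered as hedged suggestions rather than required steps: the freshly formatted filesystems are owned by root, so you may need to give your login account write permission before copying data into them, and the mounts above do not survive a reboot unless you record them in /etc/fstab. The LABEL= names below come from the -L options used with mkfs above, and ubuntu is the default login on the Ubuntu images this example output assumes.

sudo chown ubuntu:ubuntu ~/myseq/seq ~/myseq/aligner ~/myseq/umake    # adjust the user if you are not on an Ubuntu AMI

# Optional /etc/fstab entries so the volumes are remounted after a reboot
# (nofail avoids hanging the boot if a volume is missing):
#   LABEL=seq      /home/ubuntu/myseq/seq      ext4  defaults,nofail  0  2
#   LABEL=aligner  /home/ubuntu/myseq/aligner  ext4  defaults,nofail  0  2
#   LABEL=umake    /home/ubuntu/myseq/umake    ext4  defaults,nofail  0  2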


Getting Your Sequence Data

Now it's time to get your sequence data so you can run the Pipeline on it. If someone else has already put your sequence data on an EBS volume, consider yourself fortunate. Others will have to copy the data from wherever it is to ~/myseq/seq. You might do this with rsync, scp, ftp, sftp or Aspera (see http://asperasoft.com/).
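For example, a pull with rsync from another machine might look something like this (the host and path are placeholders):

rsync -avP remote.example.org:/path/to/sequence/ ~/myseq/seq/    # host and path are placeholders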

If your data is in an S3 Bucket, you will need to copy the data from the bucket to some local storage. We suggest you copy it to ~/myseq/seq. Copying from an S3 bucket is faster than refetching it over the Internet, but given the size of sequence data, it can still take quite a bit of time. We provide a script 'awssync.pl' to assist in copying your S3 bucket data.
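If the AWS command line tools happen to be installed, an aws s3 sync call along these lines may also work; this is only a hedged alternative sketch, not part of the Pipeline tooling, and the awssync.pl usage itself is shown in the next example:

# Alternative sketch using the AWS CLI; --no-sign-request allows anonymous access to the public bucket.
aws s3 sync --no-sign-request s3://1000genomes/data/HG01112 ~/myseq/seq/data/HG01112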

For example, if you are using data from the 1000 Genomes repository in AWS, you can copy individuals' data with awssync.pl like this:

cd ~/myseq
/usr/local/biopipe/bin/awssync.pl seq 1000genomes/data/HG01112 1000genomes/data/HG01113 ...
 Retreiving data from bucket '1000genomes/data/HG01112' into seq'
 Found 39 files and  3 directories
 Directories created
 Copying data/HG01112/sequence_read/SRR063071.filt.fastq.gz  (37496.31 KB)   14 secs
 Copying data/HG01112/sequence_read/SRR063073_2.filt.fastq.gz  (1960124.13 KB)   832 secs
   [lines deleted]  
 Copying data/HG01112/sequence_read/SRR063073.filt.fastq.gz  (39765.40 KB)   18 secs
 Files created
 Completed bucket 'seq/data/HG01112' in 378.38 min

 [lines deleted]

You probably will not be surprised at the times shown (about 45 GB in roughly 380 minutes). By now you know that getting your sequence data can take a long time. When this completes, you are finally ready to run the Pipeline software.

In one case we copied 387 GB of data from 1000 Genomes to our own EBS volume. This took about 60 hours (actually longer, because the copy failed and had to be restarted). For an m1.medium instance (any size of instance can be used for this step) this cost about $20 (Oct 2012). The cost for the 500 GB EBS volume where the 1000 Genomes data was copied is very low ($0.50/month).