Line 1: |
Line 1: |
− | Back to parent [http://genome.sph.umich.edu/wiki/Pipelines] | + | Back to the beginning at [[GotCloud]] |
| | | |
| Setting up your storage is perhaps the most difficult step as it is controlled completely by the size of your data. | | Setting up your storage is perhaps the most difficult step as it is controlled completely by the size of your data. |
| As a general rule you will need three times the space required for your sequence data. | | As a general rule you will need three times the space required for your sequence data. |
| For instance in the 1000 Genomes data, the data for one individual takes about 45G. | | For instance in the 1000 Genomes data, the data for one individual takes about 45G. |
− | If you have 1000 Genome data for nine individuals, you'll need about 1500GB of space (9x450x3 plus a little extra space). | + | If you have 1000 Genomes data for nine individuals, you'll need about 1500GB of space (9x45x3 = 1215GB, plus some extra space). |
| | | |
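The rule of thumb above can be sketched as a small shell calculation. This is only an illustration: the function name `estimate_gb` is made up for this example, and the factor of 3 is the general rule from the text, not a measured requirement.

```shell
#!/bin/sh
# Rough storage estimate: individuals x GB-per-individual x 3.
# estimate_gb is a hypothetical helper name; the factor of 3 is the
# rule of thumb from the text, so adjust it for your own data.
estimate_gb() {
    individuals=$1
    gb_each=$2
    echo $(( individuals * gb_each * 3 ))
}

# Nine 1000 Genomes individuals at about 45 GB each
estimate_gb 9 45   # prints 1215
```

Rounding the result up (here to about 1500GB) leaves some headroom for intermediate files.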
| Making your data available for the Pipeline software can be accomplished in many ways. | | Making your data available for the Pipeline software can be accomplished in many ways. |
| Here is a simple straightforward organization you might want to use. | | Here is a simple straightforward organization you might want to use. |
| | | |
− | '''Create Volumes''' | + | ===Making Use of Instance Storage=== |
| + | Some instance types provide instance store volumes. By default these are not attached to your instance, so you need to set them up before launching it. |
| + | |
| + | I found instructions at: http://docs.aws.amazon.com/AWSEC2/latest/UserGuide/InstanceStorage.html#InstanceStore_UsageScenarios |
| + | * referred me to: http://docs.aws.amazon.com/AWSEC2/latest/UserGuide/block-device-mapping-concepts.html#Using_OverridingAMIBDM |
| + | |
| + | '''Make sure you add the instance store prior to launching''' |
| + | * After launching, only one of the 2 instance stores was mounted at /mnt |
| + | ** To mount the other instance store, I followed the instructions in the '''To make a volume available''' section of: http://docs.aws.amazon.com/AWSEC2/latest/UserGuide/ec2-add-volume-to-instance.html |
| + | |
| + | |
| + | ===Create Volumes=== |
| | | |
| * Launch your instance and log in as explained in the AWS documentation. | | * Launch your instance and log in as explained in the AWS documentation. |
Line 23: |
Line 34: |
| We suggest you create storage volumes and use each one for each particular kind of data: | | We suggest you create storage volumes and use each one for each particular kind of data: |
| | | |
− | * /dev/xvdf for sequence data | + | * /dev/xvdf for sequence data. This might have already been done for you. |
| * /dev/xvdg for aligner output data | | * /dev/xvdg for aligner output data |
| * /dev/xvdh for umake output data | | * /dev/xvdh for umake output data |
| | | |
− | '''Prepare and Attach Volumes'''
| + | |
| + | ====Prepare and Attach Volumes==== |
| | | |
| The first time, you need to prepare the disks by formatting and mounting them. | | The first time, you need to prepare the disks by formatting and mounting them. |
| Be aware that this step destroys the data on each volume, so be careful which volume you are working on. | | Be aware that this step destroys the data on each volume, so be careful which volume you are working on. |
| + | If someone has already put your sequence data on an EBS volume, attach it as /dev/xvdf, |
| + | but do not format it; format only /dev/xvdg and /dev/xvdh. |
| + | '''Do not format a volume that already has your sequence data.''' |
| + | |
| + | <code> |
| + | '''sudo fdisk -l /dev/xvdf''' # Do not continue until this works |
| + | Disk /dev/xvdf: 536.9 GB, 536870912000 bytes |
| + | [lines deleted] |
| + | Disk /dev/xvdf doesn't contain a valid partition table # This is OK |
| + | |
| + | # Device exists, good. Format it, destroying any data there, so be sure of the device name and label. |
| + | '''sudo mkfs -t ext4 -L seq /dev/xvdf''' |
| + | mke2fs 1.42 (29-Nov-2011) |
| + | Filesystem label=seq |
| + | [lines deleted] |
| + | Allocating group tables: done |
| + | Writing inode tables: done |
| + | Creating journal (32768 blocks): done |
| + | Writing superblocks and filesystem accounting information: done |
| + | |
| + | # Repeat these steps for the other volumes |
| + | '''sudo fdisk -l /dev/xvdg''' |
| + | '''sudo mkfs -t ext4 -L aligner /dev/xvdg''' |
| + | |
| + | '''sudo fdisk -l /dev/xvdh''' |
| + | '''sudo mkfs -t ext4 -L umake /dev/xvdh''' |
| + | </code> |
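Because formatting the wrong volume destroys its data, a guard like the following may help. It is a minimal sketch, not part of the Pipeline software: the function name `format_if_blank` is hypothetical, and it relies on `blkid` reporting a filesystem type for the whole device, which is only a heuristic (for example, it will not notice a partitioned disk).

```shell
#!/bin/sh
# Format a volume only if no filesystem is detected on it.
# format_if_blank is a hypothetical helper, not part of the Pipeline software.
# blkid prints the filesystem type (e.g. ext4) when it finds one and
# prints nothing otherwise.
format_if_blank() {
    dev=$1
    label=$2
    fstype=$(sudo blkid -o value -s TYPE "$dev" || true)
    if [ -n "$fstype" ]; then
        echo "Refusing to format $dev: found existing $fstype filesystem" >&2
        return 1
    fi
    sudo mkfs -t ext4 -L "$label" "$dev"
}

# Example usage (uncomment to run; device names follow the text above):
# format_if_blank /dev/xvdg aligner
# format_if_blank /dev/xvdh umake
```

With this guard, a volume that already holds sequence data in a recognized filesystem is left untouched and the function returns a non-zero status.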
| + | |
| + | Now mount the formatted volumes so you have storage available to run the Pipeline. |
| + | This example puts all of the data in one place, ~/myseq, under your HOME directory. |
| + | You may, of course, use any paths you'd like. |
| | | |
| <code> | | <code> |
− | '''sudo fdisk -l /dev/xdvf''' # Do not continue until this works | + | '''mkdir -p ~/myseq/seq ~/myseq/aligner ~/myseq/umake''' |
− | Disk /dev/xvdf: 536.9 GB, 536870912000 bytes
| + | |
− | [lines deleted] | + | '''sudo mount -t ext4 /dev/xvdf ~/myseq/seq''' |
− | Disk /dev/xvdf doesn't contain a valid partition table # This is OK
| + | '''sudo mount -t ext4 /dev/xvdg ~/myseq/aligner''' |
| + | '''sudo mount -t ext4 /dev/xvdh ~/myseq/umake''' |
| + | |
| + | '''df -h ~/myseq/*''' |
| + | Filesystem Size Used Avail Use% Mounted on |
| + | /dev/xvdf 500G 7.5G 467G 2% /home/ubuntu/myseq/seq |
| + | /dev/xvdg 500G 7.5G 467G 2% /home/ubuntu/myseq/aligner |
| + | /dev/xvdh 500G 7.5G 467G 2% /home/ubuntu/myseq/umake |
| + | </code> |
| + | |
| + | |
| + | ===Getting Your Sequence Data=== |
| + | |
| + | Now it's time to get your sequence data so you can run the Pipeline on it. |
| + | If someone else has already put your sequence data on an EBS volume, consider yourself fortunate. |
| + | Others will have to copy the data from wherever it is to ~/myseq/seq. |
| + | You might do this with rsync, scp, ftp, sftp or aspera (see http://asperasoft.com/). |
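Long copies can fail partway through, so it may be worth wrapping the copy in a retry loop. The sketch below assumes rsync is your copy tool. The `retry` function is a hypothetical helper, and `user@example.org:/data/seq/` is a placeholder for wherever your data actually lives.

```shell
#!/bin/sh
# Retry a command until it succeeds or the attempt limit is reached.
# retry is a hypothetical helper written for this example.
retry() {
    max=$1
    shift
    n=1
    while ! "$@"; do
        if [ "$n" -ge "$max" ]; then
            echo "Giving up after $n attempts: $*" >&2
            return 1
        fi
        n=$(( n + 1 ))
        sleep 5
    done
}

# Example usage (uncomment to run). The host and source path are placeholders.
# --partial keeps partially transferred files so a rerun can reuse them.
# retry 5 rsync -av --partial user@example.org:/data/seq/ ~/myseq/seq/
```

Each failed attempt waits five seconds before trying again; the limit of five attempts is arbitrary.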
| | | |
− | # Device exists, good. Format it, destroying any data there, so be sure of the device name.
| + | If your data is in an '''S3 Bucket''', you will need to copy the data from the bucket to some local storage. |
− | '''sudo mkfs -t ext4 -L seq /dev/xvdf''' | + | We suggest you copy it to ~/myseq/seq. |
− | mke2fs 1.42 (29-Nov-2011)
| + | Copying from an S3 bucket is faster than refetching it over the Internet, but given the size |
− | Filesystem label=seq
| + | of sequence data, it still can take quite a bit of time. |
− | [lines deleted]
| + | We provide a script 'awssync.pl' to assist in copy your S3 bucket data. |
− | Allocating group tables: done
| |
− | Writing inode tables: done
| |
− | Creating journal (32768 blocks): done
| |
− | Writing superblocks and filesystem accounting information: done
| |
| | | |
− | * Repeat these steps for the other volumes
| + | For example, if you are using data from the 1000 Genomes repository in AWS, you can copy the |
− | '''sudo fdisk -l /dev/xdvg'''
| + | individuals' data like this: |
− | '''sudo mkfs -t ext4 -L aligner /dev/xvdg'''
| |
| | | |
− | '''sudo fdisk -l /dev/xdvh''' | + | <code> |
− | '''sudo mkfs -t ext4 -L umake /dev/xvdh''' | + | '''cd ~/myseq''' |
| + | '''/usr/local/biopipe/bin/awssync.pl seq 1000genomes/data/HG01112 1000genomes/data/HG01113 ...''' |
| + | Retreiving data from bucket '1000genomes/data/HG01112' into seq' |
| + | Found 39 files and 3 directories |
| + | Directories created |
| + | Copying data/HG01112/sequence_read/SRR063071.filt.fastq.gz (37496.31 KB) 14 secs |
| + | Copying data/HG01112/sequence_read/SRR063073_2.filt.fastq.gz (1960124.13 KB) 832 secs |
| + | [lines deleted] |
| + | Copying data/HG01112/sequence_read/SRR063073.filt.fastq.gz (39765.40 KB) 18 secs |
| + | Files created |
| + | Completed bucket 'seq/data/HG01112' in 378.38 min |
| + | |
| + | [lines deleted] |
| </code> | | </code> |
| + | |
| + | You probably will not be surprised at the times shown (45G in ~380 minutes). |
| + | By now you know that getting your sequence data can take a long time. |
| + | When this completes you are finally ready to run the Pipeline software. |
| + | |
| + | In one case we copied 387GB of data from 1000 Genomes to our own EBS volume. |
| + | This took about 60 hours (longer actually because the copy failed and had to be restarted). |
| + | For an m1.medium instance (any sized instance can be used for this step) this cost about $20 (Oct 2012). |
| + | The cost for the 500GB EBS volume where the 1000 Genomes data was copied is very low ($0.50/month). |
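The figures above imply a sustained rate of roughly 2 MB/s (45G in about 380 minutes, or 387GB in about 60 hours). As a rough planning aid, the following sketch estimates copy time from a data size and an assumed rate. The function name `estimate_hours` is hypothetical, 1 GB is taken as 1024 MB, and the rate you see will depend on your instance and network.

```shell
#!/bin/sh
# Estimate copy time in hours from a size in GB and a rate in MB/s.
# estimate_hours is a hypothetical helper; 1 GB is treated as 1024 MB.
estimate_hours() {
    awk -v gb="$1" -v rate="$2" 'BEGIN { printf "%.1f\n", gb * 1024 / rate / 3600 }'
}

# 387 GB at an assumed 1.8 MB/s
estimate_hours 387 1.8   # prints 61.2
```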