Making your data available for the Pipeline software can be accomplished in many ways.
Here is a simple, straightforward organization you might want to use.

'''Create Volumes'''

We suggest you create storage volumes and use each one for a particular kind of data:

* /dev/xvdf for sequence data (this might have already been done for you)
* /dev/xvdg for aligner output data
* /dev/xvdh for umake output data

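Creating and attaching the volumes can be done from the AWS console, or from the command line along these lines (a sketch only: the sizes, availability zone, volume IDs, and instance ID are placeholders, not values from this document — only the 500G umake volume size appears later in the df output):

```shell
# Create the EBS volumes (sizes and zone are placeholders; adjust to your needs).
aws ec2 create-volume --size 200 --availability-zone us-east-1a    # sequence data
aws ec2 create-volume --size 500 --availability-zone us-east-1a    # aligner output
aws ec2 create-volume --size 500 --availability-zone us-east-1a    # umake output

# Attach each volume to your instance under the suggested device names.
aws ec2 attach-volume --volume-id vol-xxxxxxxx --instance-id i-xxxxxxxx --device /dev/xvdf
aws ec2 attach-volume --volume-id vol-yyyyyyyy --instance-id i-xxxxxxxx --device /dev/xvdg
aws ec2 attach-volume --volume-id vol-zzzzzzzz --instance-id i-xxxxxxxx --device /dev/xvdh
```

The volume IDs come from the output of each create-volume call.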
'''Prepare and Attach Volumes'''

The first time, you need to prepare the disks by formatting and mounting them.
Realize this step destroys the data on each volume, so be careful which volume you are working on.
If someone has already put your sequence data on an EBS volume, attach it as /dev/xvdf,
but do not format it; only format /dev/xvdg and /dev/xvdh.
'''Do not format a volume that already has your sequence data.'''

<code>
...
/dev/xvdh 500G 7.5G 467G 2% /home/ubuntu/myseq/umake
</code>
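The formatting and mounting described above might look like the following (a sketch only: the device names match the layout suggested earlier, but the ext4 filesystem type and the ~/myseq/aligner directory name are assumptions, not taken from this document):

```shell
# Format ONLY the two empty volumes -- this destroys any data on them.
# Never run mkfs on /dev/xvdf if your sequence data is already on it.
sudo mkfs -t ext4 /dev/xvdg        # aligner output volume (filesystem type assumed)
sudo mkfs -t ext4 /dev/xvdh        # umake output volume (filesystem type assumed)

# Create the mount points under the suggested directory layout.
mkdir -p ~/myseq/seq ~/myseq/aligner ~/myseq/umake

# Mount each volume on its directory.
sudo mount /dev/xvdf ~/myseq/seq
sudo mount /dev/xvdg ~/myseq/aligner
sudo mount /dev/xvdh ~/myseq/umake

# Check the result; each volume should show up on its mount point.
df -h ~/myseq/seq ~/myseq/aligner ~/myseq/umake
```

To have the volumes remount automatically after a reboot, you would also add matching entries to /etc/fstab.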


'''Getting Your Sequence Data'''

Now it's time to get your sequence data so you can run the Pipeline on it.
If someone else has already put your sequence data on an EBS volume, consider yourself fortunate.
Others will have to copy the data from wherever it is to ~/myseq/seq.
You might do this with rsync, scp, ftp, sftp or aspera (see http://asperasoft.com/).
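For instance, a copy with rsync might look like this (the host name and remote path are placeholders, not real values from this document):

```shell
# Copy sequence files from a remote server into the sequence directory.
# -a preserves file attributes, -v is verbose, -P shows progress and allows
# an interrupted transfer to be resumed -- useful for data this large.
rsync -avP remote.example.org:/data/myproject/fastq/ ~/myseq/seq/
```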

If your data is in an '''S3 Bucket''', you will need to copy the data from the bucket to some local storage.
We suggest you copy it to ~/myseq/seq.
Copying from an S3 bucket is faster than refetching it over the Internet, but given the size
of sequence data, it still can take quite a bit of time.
We provide a script 'awssync.pl' to assist in copying your S3 bucket data.

For example, if you are using data from the 1000 Genomes repository in AWS, you can copy an
individual's data like this:

<code>
'''cd ~/myseq'''
'''/usr/local/biopipe/bin/awssync.pl seq 1000genomes/data/HG01112 1000genomes/data/HG01113 ...'''
Retrieving data from bucket '1000genomes/data/HG01112' into 'seq'
Found 39 files and 3 directories
Directories created
Copying data/HG01112/sequence_read/SRR063071.filt.fastq.gz (37496.31 KB) 14 secs
Copying data/HG01112/sequence_read/SRR063073_2.filt.fastq.gz (1960124.13 KB) 832 secs
[lines deleted]
Copying data/HG01112/sequence_read/SRR063073.filt.fastq.gz (39765.40 KB) 18 secs
Files created
Completed bucket 'seq/data/HG01112' in 378.38 min

[lines deleted]
</code>
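If awssync.pl is not available, a similar copy can probably be done with the standard AWS command line tools (a sketch; it assumes the 'aws' CLI is installed and that the 1000genomes bucket is publicly readable):

```shell
cd ~/myseq
# Recursively copy one individual's directory from the public bucket.
# --no-sign-request avoids needing AWS credentials for a public bucket.
aws s3 cp --recursive --no-sign-request \
    s3://1000genomes/data/HG01112 seq/data/HG01112
```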

You probably will not be surprised at the times shown (45G in ~380 minutes).
By now you know that getting your sequence data can take a long time.
When this completes you are finally ready to run the Pipeline software.
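As a rough sanity check on those numbers, the effective transfer rate works out to about 2 MB/sec (a back-of-the-envelope calculation, assuming 1G = 1024 MB):

```shell
# Transfer rate implied by the example above: 45G in about 380 minutes.
awk 'BEGIN { printf "%.2f MB/sec\n", (45 * 1024) / (380 * 60) }'
# prints: 2.02 MB/sec
```

At rates like that, plan for large transfers to run for hours.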