From Genome Analysis Wiki
10:02, 29 October 2012
Making your data available to the Pipeline software can be accomplished in many ways.
Here is a simple, straightforward organization you might want to use.

'''Create Volumes'''
We suggest you create storage volumes and use each one for a particular kind of data:

* /dev/xvdf for sequence data (this might have already been done for you)
* /dev/xvdg for aligner output data
* /dev/xvdh for umake output data
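
Creating and attaching those volumes might look like the following sketch using the AWS command-line tools (an assumption, not the wiki's documented procedure; the size, availability zone, instance ID, and volume IDs are placeholders you must replace). Commands are only echoed unless you set DRY_RUN=0:

```shell
#!/bin/sh
# Sketch: create and attach the three EBS volumes suggested above.
# ZONE, INSTANCE, the 500 GB size, and vol-PLACEHOLDER are hypothetical.
# Commands are only echoed by default; set DRY_RUN=0 to actually run them.
DRY_RUN=${DRY_RUN:-1}
run() { if [ "$DRY_RUN" = "1" ]; then echo "+ $*"; else "$@"; fi; }

ZONE=us-east-1a              # your instance's availability zone (placeholder)
INSTANCE=i-0123456789abcdef0 # your instance ID (placeholder)

for dev in /dev/xvdf /dev/xvdg /dev/xvdh; do
    # create-volume prints a volume ID; substitute it for vol-PLACEHOLDER
    run aws ec2 create-volume --size 500 --availability-zone "$ZONE"
    run aws ec2 attach-volume --volume-id vol-PLACEHOLDER \
        --instance-id "$INSTANCE" --device "$dev"
done
```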
'''Prepare and Attach Volumes'''
The first time, you need to prepare the disks by formatting and mounting them.
Realize this step destroys any data on the volume, so be careful which volume you are working on.
If someone has already put your sequence data on an EBS volume, attach it as /dev/xvdf,
but do not format it; format only /dev/xvdg and /dev/xvdh.
'''Do not format a volume that already has your sequence data.'''
<code>
   /dev/xvdh      500G  7.5G  467G  2% /home/ubuntu/myseq/umake
</code>
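
The format-and-mount steps themselves might look like this sketch (the ext4 filesystem type and the mount points under ~/myseq are assumptions based on the layout above). Commands are only echoed unless you set DRY_RUN=0:

```shell
#!/bin/sh
# Sketch: format and mount the two output volumes.  Do NOT include
# /dev/xvdf here if it already holds your sequence data; mount that one
# without formatting.  Commands are only echoed unless DRY_RUN=0.
DRY_RUN=${DRY_RUN:-1}
run() { if [ "$DRY_RUN" = "1" ]; then echo "+ $*"; else "$@"; fi; }

for pair in "/dev/xvdg $HOME/myseq/aligner" "/dev/xvdh $HOME/myseq/umake"; do
    set -- $pair                     # split into device and mount point
    run sudo mkfs -t ext4 "$1"       # WARNING: destroys all data on $1
    run mkdir -p "$2"
    run sudo mount "$1" "$2"
done
run df -h                            # should resemble the df output above
```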
'''Getting Your Sequence Data'''

Now it's time to get your sequence data so you can run the Pipeline on it.
If someone else has already put your sequence data on an EBS volume, consider yourself fortunate.
Others will have to copy the data from wherever it is to ~/myseq/seq.
You might do this with rsync, scp, ftp, sftp or aspera (see http://asperasoft.com/).
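
An rsync transfer, for instance, might be sketched as follows (the remote host and source path are hypothetical placeholders, not part of these instructions). The command is only echoed unless you set DRY_RUN=0:

```shell
#!/bin/sh
# Sketch: pull sequence data into ~/myseq/seq with rsync over ssh.
# user@remote.example.org and /data/project/seq/ are placeholders.
# The command is only echoed unless DRY_RUN=0.
DRY_RUN=${DRY_RUN:-1}
run() { if [ "$DRY_RUN" = "1" ]; then echo "+ $*"; else "$@"; fi; }

# -a preserves times/permissions, -z compresses on the wire, and
# --partial lets an interrupted transfer of these large files resume.
run rsync -az --partial --progress \
    user@remote.example.org:/data/project/seq/ "$HOME/myseq/seq/"
```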
If your data is in an '''S3 Bucket''', you will need to copy the data from the bucket to some local storage.
We suggest you copy it to ~/myseq/seq.
Copying from an S3 bucket is faster than refetching it over the Internet, but given the size
of sequence data, it still can take quite a bit of time.
We provide a script 'awssync.pl' to assist in copying your S3 bucket data.

For example, if you are using data from the 1000 Genomes repository in AWS, you can copy an
individual's data like this:
<code>
'''cd ~/myseq'''
'''/usr/local/biopipe/bin/awssync.pl seq 1000genomes/data/HG01112 1000genomes/data/HG01113 ...'''
  Retrieving data from bucket '1000genomes/data/HG01112' into 'seq'
  Found 39 files and 3 directories
  Directories created
  Copying data/HG01112/sequence_read/SRR063071.filt.fastq.gz  (37496.31 KB)  14 secs
  Copying data/HG01112/sequence_read/SRR063073_2.filt.fastq.gz  (1960124.13 KB)  832 secs
    [lines deleted]
  Copying data/HG01112/sequence_read/SRR063073.filt.fastq.gz  (39765.40 KB)  18 secs
  Files created
  Completed bucket 'seq/data/HG01112' in 378.38 min

  [lines deleted]
</code>

You probably will not be surprised at the times shown (45 GB in ~380 minutes).
By now you know that getting your sequence data can take a long time.
When this completes, you are finally ready to run the Pipeline software.
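
If awssync.pl is not available to you, a similar copy can be sketched with the standard AWS CLI (an assumption; the wiki's script may handle details the CLI does not). Commands are only echoed unless you set DRY_RUN=0:

```shell
#!/bin/sh
# Sketch: copy individuals' 1000 Genomes data from S3 with the AWS CLI,
# as an alternative to awssync.pl (not the wiki's documented method).
# Commands are only echoed unless DRY_RUN=0.
DRY_RUN=${DRY_RUN:-1}
run() { if [ "$DRY_RUN" = "1" ]; then echo "+ $*"; else "$@"; fi; }

for sample in HG01112 HG01113; do
    # --no-sign-request suffices because the 1000genomes bucket is public
    run aws s3 sync --no-sign-request \
        "s3://1000genomes/data/$sample" "$HOME/myseq/seq/data/$sample"
done
```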