Changes

From Genome Analysis Wiki
1,999 bytes removed, 13:06, 20 May 2013
no edit summary
Line 1: Line 1:  
Back to the beginning: [[GotCloud]]
 
Back to the beginning: [[GotCloud]]
 +
 +
'''We no longer use a snapshot.''' It is very likely that you will need quite a few packages installed
 +
so that you can compile your software, access the EC2 application data or access data on S3.
 +
It seemed foolish not to make this software available in an AMI.
 +
 +
    
GotCloud is made available in various forms.
 
GotCloud is made available in various forms.
Line 44: Line 50:  
* '''Instance size'''  (memory and number of processors). The pipeline software will require at least 4GB of memory (''type m1.medium'') and can use as many processors as are available.

* '''Instance size'''  (memory and number of processors). The pipeline software will require at least 4GB of memory (''type m1.medium'') and can use as many processors as are available.
   −
* '''Storage''' for the instance refers to the size for root (/) partition. This can be quite small, as little as 8GB can work. Of course if you intend to bring lots of other files/programs to the instance, you may want to increase this to something a bit larger (e.g. 30GB).
+
* '''Storage''' for the instance refers to the size for root (/) partition. This can be quite small, as little as 8GB can work. Of course if you intend to bring other files/programs to the instance, you may need to increase this to something a bit larger (e.g. 30GB).
 
  −
* '''Data Storage''' for the aligner or snpcaller will likely be far larger than the system you are creating.
  −
You'll need to create EBS Volumes for the input and output of the aligner and snpcaller.
  −
These can be quite substantial and because of that we recommend you create separate volumes like this:
  −
 
  −
* Your input FASTQ files for the aligner. This might have been done for you by some vendor when they put your FASTQ data on an S3 volume. If so, your vendor will need to provide you with the details of how to access your FASTQ files.
  −
 
  −
* The output of the aligner (BAM files)
     −
* The intermediate files of the SNP caller
+
* '''Data Storage''' for the aligner or SNP caller (see below)
      Line 61: Line 59:  
You will also want additional storage volumes for:
 
You will also want additional storage volumes for:
   −
* GotCloud software and reference files
+
* '''Local Storage''' for the instance refers to the size for root (/) partition. This can be quite small, as little as 8GB can work. Of course if you intend to bring other files/programs to the instance, you may need to increase this to something a bit larger (e.g. 30GB).
* Your data
  −
** Sequence data
  −
** Output of the aligner
  −
** Output of umake
     −
The '''first of these''' is a small volume based on a snapshot containing the GotCloud files you will need.
+
* '''Data Storage''' for the aligner or SNP caller will likely be far larger than the system you are creating.
We provide an AWS snapshot of a small volume which contains the aligner and umake software and reference files.
+
You'll need to create EBS Volumes for the input and output of the aligner and SNP caller.
Create an EBS volume based on our snapshot and then mount that volume on your instance.
  −
In the EC2 Management Console under ELASTIC BLOCK STORE, select Volumes -> Create Volume.
  −
In the prompt supply the size and Snapshot (based on the table below).
  −
You may take the defaults for the Volume Type and IOPS.
     −
The snapshot ID varies by zone and the release of the software. You can see the complete list of GotCloud snapshots:
+
'''Prepare Your Storage'''
   −
<code>
+
These can be quite substantial and because of that we recommend you create separate volumes like this:
  wget -qO -  share.sph.umich.edu:gotcloud/snapshots.txt
  −
  −
  #                          GotCloud SnapShot List
  −
  #
  −
  #  Create an EBS volume from these snapshots. Use the AWS console or
  −
  #  with an ec2-api-tools command:
  −
  #
  −
  #    ec2-create-volume -K ~/ec2/EC2-X509-private_key.pem \
  −
  #      -C ~/ec2/EC2-X509-cert.pem -s 40 \
  −
  #      --snapshot snap-14ea7632 --region us-west-2 -z us-west-2a
  −
  #
  −
  #                        Availability
  −
  #  Zone        Snapshot      Size
  −
  us-west-2a    snap-14ea7632  40GB
  −
</code>
  −
 
  −
This will create a device which you need to mount in your instance.
  −
This will create a device like /dev/sdf, which unfortunately actually translates to
  −
the device /dev/xvdf in your Linux instance. Once the volume is ready, mount it
  −
by logging into your instance with ssh and issuing the command:
  −
 
  −
<code>
  −
  sudo mkdir -p /gotcloud
  −
  sudo mount /dev/xvdf  /gotcloud    # or whatever device yours is
  −
  df -h
  −
</code>
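The mount commands above could be wrapped in a small helper script along these lines, so the volume can be remounted after the instance restarts. This is only a sketch: the DEVICE and MOUNT_POINT defaults are taken from the example above, while the DRY_RUN switch and the function names run and mount_volume are illustrative assumptions, not part of GotCloud.

```shell
#!/bin/sh
# Sketch of a remount helper (hypothetical, not shipped with GotCloud).
# DEVICE and MOUNT_POINT default to the values in the example above;
# adjust them to whatever device your volume is attached as.
DEVICE="${DEVICE:-/dev/xvdf}"
MOUNT_POINT="${MOUNT_POINT:-/gotcloud}"
# DRY_RUN=1 (the default in this sketch) only prints the commands.
# Set DRY_RUN=0 to actually run them.
DRY_RUN="${DRY_RUN:-1}"

# Print the command in dry-run mode, otherwise execute it.
run() {
    if [ "$DRY_RUN" = "1" ]; then
        echo "$*"
    else
        "$@"
    fi
}

mount_volume() {
    run sudo mkdir -p "$MOUNT_POINT"
    # When really mounting, skip the mount if the mount point is already in use.
    if [ "$DRY_RUN" != "1" ] && grep -q " $MOUNT_POINT " /proc/mounts; then
        echo "$MOUNT_POINT is already mounted"
        return 0
    fi
    run sudo mount "$DEVICE" "$MOUNT_POINT"
    run df -h "$MOUNT_POINT"
}

mount_volume
```

Run it once with the default dry-run setting to check the commands it would issue, then with DRY_RUN=0 to perform the mount.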
     −
This will make the GotCloud software available under the path /gotcloud/bin etc.
+
* Your '''input FASTQ''' files for the aligner.
Each time your instance is started, you'll need to mount this volume.
+
This may have been done for you by some vendor when they put your FASTQ data on an S3 volume.
You may want to create a small shell script to mount the device.
+
If so, your vendor will need to provide you with the details of how to access your FASTQ files.
 +
If your FASTQ files are not in S3 storage, you'll have to create a volume for this and copy your data into it.
 +
This can take a very long time.
   −
In '''Your Data''' the storage volumes will vary based on what data you have.
+
* The '''output of the aligner''' (BAM files)
The sequence data might already exist, provided by a vendor who created the sequence data.
  −
If not, you'll have to create a volume for this and copy your data into it.
  −
You'll have to mount volumes for all three types of data (sequence, aligner and umake).
     −
You should expect that the three data volumes will all need to be the same size. That is, if your sequence data is 300GB, then you'll need an additional 300GB for the aligner output and then another 300GB of storage for the umake output. We suggest you consider putting each set of data on its own volume.
+
* The '''intermediate files of the SNP caller''' (GLF files)
   −
You may also find that your sequence data is too large to be easily handled in one go,
+
* The '''final output of the SNP caller''' (VCF files)
so you might choose to only use the aligner/umake on part of your sequence data, capture the files
  −
of interest from umake, and then go back and rerun the software with the next bit of sequence data.
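The sizing rule described above (three data volumes, each as large as the sequence data) can be turned into a quick estimate. The following is a minimal sketch; the function name plan_volumes and the volume labels are made up for illustration, and the equal-size assumption is the rule of thumb from the text, not a measured requirement.

```shell
#!/bin/sh
# plan_volumes SEQ_GB: print one line per data volume plus a total,
# assuming each volume needs to be as large as the sequence data.
plan_volumes() {
    seq_gb="$1"
    total=0
    for name in sequence aligner_output umake_output; do
        echo "$name ${seq_gb}GB"
        total=$((total + seq_gb))
    done
    echo "total ${total}GB"
}

# The 300GB example from the text:
plan_volumes 300
```

If the total is more than you want to provision at once, the same function can be run on the size of a single batch of sequence data to size the volumes for processing the data in parts.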
 