StarCluster

From Genome Analysis Wiki
__TOC__
Back to the beginning: [[GotCloud]]
Back to [[GotCloud: Amazon]]
  
 
If you have access to your own cluster, your task will be much simpler.
Install the GotCloud software ([[GotCloud: Source Releases]])
and run it as described on the same pages.
  
For those who are not so lucky to have access to a cluster, Amazon Web Services (AWS) provides an alternative.
You may run the GotCloud software on a cluster created in AWS.
 
One tool that simplifies the creation of a cluster of AMIs (Amazon Machine Images) is
'''StarCluster''' (see http://star.mit.edu/cluster/).
  
The following shows an example of how you might use ''StarCluster'' to create an AWS cluster and set it up to run GotCloud.
There are many details to setting up StarCluster, and this is ''not'' intended to explain all of the many variations you might choose, but it should provide a working example.
  
== Getting Started With StarCluster ==
StarCluster provides extensive documentation (http://star.mit.edu/cluster/) with more information than we can cover here.

To install and set up StarCluster for the first time, you can follow the QuickStart instructions: http://star.mit.edu/cluster/docs/latest/quickstart.html
* Includes installation instructions
** See http://star.mit.edu/cluster/docs/latest/installation.html for more detailed StarCluster installation instructions if the QuickStart instructions are not enough (especially if running on Windows)
* Includes setting up a basic StarCluster configuration file
** You will need your AWS credentials to set up the configuration file
*** If you need help setting up your AWS credentials, see: [[AWS Credentials]]

You can skip actually starting the cluster in the QuickStart instructions if you want.

''' Troubleshooting: ''' When I tried this, the <code>starcluster start mycluster</code> step failed with an error similar to:
* http://star.mit.edu/cluster/mlarchives/2425.html
* So I followed the suggestions there and at https://github.com/jtriley/StarCluster/issues/455:
*# <pre>$ sudo pip uninstall boto</pre>
*# <pre>$ sudo easy_install boto==2.32.0</pre>
*#* I was having trouble with pip install, but found that easy_install worked
* I had to force-terminate mycluster after the failed start:
** <pre> starcluster terminate -f mycluster</pre>
* Then I was able to successfully start my cluster

''' Don't forget to terminate your cluster:'''
 starcluster terminate mycluster
  
== StarCluster and GotCloud ==

=== StarCluster Config Settings ===
By default, StarCluster expects a configuration file in <code>~/.starcluster/config</code>.
* StarCluster will create a model file for you

Ensure your StarCluster configuration file is set up for your usage.
* General AWS Settings:
 [aws info]
 aws_access_key_id = #your aws access key id here
 aws_secret_access_key = #your secret aws access key here
 aws_user_id = #your 12-digit aws user id here
* You should have set these in [[#Getting Started With StarCluster|Getting Started With StarCluster]] above (QuickStart guide and [[AWS Credentials]]).
  
* GotCloud Cluster Definition
** You may want to create a new cluster section for running GotCloud (or you can use smallcluster) in your configuration file: <code>~/.starcluster/config</code>
** You can call it anything you want, for example, <code>gccluster</code>
** Example:
**: <pre>[cluster gccluster]&#10;KEYNAME = mykey&#10;CLUSTER_SIZE = 4&#10;CLUSTER_USER = sgeadmin&#10;CLUSTER_SHELL = bash&#10;MASTER_IMAGE_ID = ami-6ae65e02&#10;NODE_IMAGE_ID = ami-3393a45a&#10;NODE_INSTANCE_TYPE = m3.large</pre>
*** Set <code>KEYNAME</code> to the key you want to use
*** Set <code>CLUSTER_SIZE</code> to the number of nodes you want to start up (this may be different from 4)
*** Set <code>CLUSTER_USER</code> to add additional users, like <code>sgeadmin</code>
*** Set <code>CLUSTER_SHELL</code> to define the shell you want to use, like <code>bash</code>
*** Set <code>MASTER_IMAGE_ID</code> to the latest GotCloud AMI, see: [[GotCloud: AMIs]]
**** Contains GotCloud, the reference, and the demo files in the /home/ubuntu/ directory that will be visible on all nodes in the cluster
**** Has a 30G volume, but only 6G available
*** Set <code>NODE_IMAGE_ID</code> to a StarCluster <code>ubuntu x86_64</code> AMI
**** Since each node does not need its own 30G volume containing GotCloud, the reference, and the demo files, we use a separate image for the nodes.
*** The nodes can just access the master's copy of GotCloud, the reference, and the Amazon demo
*** Set <code>NODE_INSTANCE_TYPE</code> to the type of instances you want to start in your cluster
**** See http://aws.amazon.com/ec2/pricing/ for instance descriptions and prices
**** We do not recommend running GotCloud on machines with less than 4GB of memory
** <code>CLUSTER_SIZE</code> * CPUs per <code>NODE_INSTANCE_TYPE</code> node = the number of jobs you can run concurrently in GotCloud

* Define Data Volumes
** By default, the GotCloud AMI contains about 5G of extra space that you can use
*** The /home/ubuntu/ directory is visible from all machines
**** Use /home/ubuntu/ for the output directory if it is <5G
**** This directory will be deleted when you terminate the cluster
** Create your own volumes and attach them to the GotCloud cluster
*** '''Instructions TBD'''
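Before starting anything, you can sanity-check your cluster section and work out the concurrent-job count with Python's standard <code>configparser</code>. This is just a sketch: it parses an inline copy of the example <code>gccluster</code> section above, and the vCPU table is an assumption you should verify against the EC2 instance descriptions (an <code>m3.large</code> instance has 2 vCPUs).

```python
import configparser

# A copy of the example cluster section above; normally you would
# read ~/.starcluster/config instead of this inline string.
CONFIG_TEXT = """
[cluster gccluster]
KEYNAME = mykey
CLUSTER_SIZE = 4
CLUSTER_USER = sgeadmin
CLUSTER_SHELL = bash
MASTER_IMAGE_ID = ami-6ae65e02
NODE_IMAGE_ID = ami-3393a45a
NODE_INSTANCE_TYPE = m3.large
"""

# Assumed vCPU counts per instance type -- check the EC2 pricing page.
VCPUS = {"m3.large": 2, "m3.xlarge": 4}

parser = configparser.ConfigParser()
parser.read_string(CONFIG_TEXT)
cluster = parser["cluster gccluster"]

nodes = cluster.getint("CLUSTER_SIZE")
cpus_per_node = VCPUS[cluster["NODE_INSTANCE_TYPE"]]

# CLUSTER_SIZE * CPUs per node = jobs you can run concurrently in GotCloud.
concurrent_jobs = nodes * cpus_per_node
print(concurrent_jobs)  # prints 8
```

With the 4-node <code>m3.large</code> example this yields 8 slots, which is why the demo commands later on this page use <code>--numjobs 8</code>.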
  
=== Starting the Cluster ===
# Start the cluster:
#* <pre>starcluster start -c gccluster mycluster</pre>
#** Alternatively, if you set the default template at the start of the configuration file (in the <code>[global]</code> section) to <code>DEFAULT_TEMPLATE=gccluster</code>, you can run:
#*** <pre>starcluster start mycluster</pre>
#* It will take a few minutes for the cluster to start
  
=== Copying Data to/from the Cluster ===
Copy data onto the cluster (command run from your local machine):
 starcluster put /path/to/local/file/or/dir /remote/path/

Pull data from the cluster onto your local machine (command run from your local machine):
 starcluster get /path/to/remote/file/or/dir /local/path/

'''Reminder: if you write your output to /home/ubuntu/, it will be deleted when you terminate the cluster'''

=== Running GotCloud on StarCluster ===
* If you have not already, logon to the cluster as ubuntu:
** <pre>starcluster sshmaster -u ubuntu mycluster</pre>
*** Type <code>yes</code> if the terminal asks if you want to continue connecting
* When running GotCloud:
** Set the cluster/batch type either in your configuration file or on the command line:
*** In configuration:
***: <pre>BATCH_TYPE = sgei</pre>
*** On the command line:
***: <pre>--batchtype sgei</pre>
** Set the number of jobs to run:
**: <pre>--numjobs #</pre>
*** Replace <code>#</code> with the number of concurrent jobs you want to run (probably <code>CLUSTER_SIZE</code> * CPUs per <code>NODE_INSTANCE_TYPE</code> node)
** Otherwise, run GotCloud as you normally would.
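Putting those two options together: a short Python sketch (a hypothetical helper, not part of GotCloud) that assembles the command line for the 4-node, 2-vCPU example cluster above, using the demo's <code>example/test.conf</code> and <code>output</code> paths.

```python
# Hypothetical helper: build a gotcloud command line for an SGE cluster.
def gotcloud_command(tool, conf, outdir, cluster_size, cpus_per_node):
    numjobs = cluster_size * cpus_per_node  # one job per available CPU
    return ["gotcloud", tool,
            "--conf", conf,
            "--outdir", outdir,
            "--numjobs", str(numjobs),
            "--batchtype", "sgei"]

cmd = gotcloud_command("snpcall", "example/test.conf", "output", 4, 2)
print(" ".join(cmd))
# prints: gotcloud snpcall --conf example/test.conf --outdir output --numjobs 8 --batchtype sgei
```

This is exactly the snpcall command used in the demo section below, so it is a convenient way to recompute <code>--numjobs</code> if you change the cluster size or instance type.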
  

To login to a specific non-master node, do:
 starcluster sshnode -u ubuntu mycluster node001

=== Monitoring Cluster Usage ===
* Monitor jobs in the queue
** <pre>qstat</pre>
** This will show you the currently running jobs and how they are spread across the nodes in your cluster
**:[[File:Qstat.png|800px]]
*** state descriptions:
**** <code>qw</code> : queued and waiting (not yet assigned to a node)
**** <code>r</code> : running
* View the Sun Grid Engine load
** <pre>qhost</pre>
**:[[File:Qhost.png|600px]]
*** ARCH : architecture
*** NCPU : number of CPUs
*** LOAD : current load
*** MEMTOT : total memory
*** MEMUSE : memory in use
*** SWAPTO : swap space
*** SWAPUS : swap space in use
* View the average load per node using:
** <pre>qstat -f</pre>
**:[[File:Qstatf.png|650px]]
*** The <code>load_avg</code> field contains the load average for each node
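If you want to watch the per-node load programmatically rather than eyeballing the screenshot, you can scrape the <code>load_avg</code> column out of the <code>qstat -f</code> text. This is only a sketch: the <code>SAMPLE</code> string below is a hand-written approximation of the output format, so check it against what your cluster actually prints.

```python
# Sketch: pull the per-node load_avg out of `qstat -f` output.
# SAMPLE approximates the format; on the cluster you would capture the
# real text, e.g. with subprocess.run(["qstat", "-f"], ...).
SAMPLE = """\
queuename                      qtype resv/used/tot. load_avg arch          states
--------------------------------------------------------------------------------
all.q@master                   BIP   0/2/2          1.02     linux-x64
all.q@node001                  BIP   0/2/2          0.98     linux-x64
"""

def load_averages(qstat_f_output):
    """Map hostname -> load_avg parsed from qstat -f text."""
    loads = {}
    for line in qstat_f_output.splitlines():
        fields = line.split()
        # Queue-instance lines look like "all.q@hostname ..."
        if fields and "@" in fields[0]:
            host = fields[0].split("@", 1)[1]
            try:
                loads[host] = float(fields[3])  # 4th column is load_avg
            except ValueError:
                loads[host] = None  # can be '-NA-' if a host is unreachable
    return loads

print(load_averages(SAMPLE))
```

A steadily high load on every node suggests your <code>--numjobs</code> setting is keeping the cluster busy; idle nodes suggest it is set too low.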
  

=== Terminate the Cluster ===
# Reminder: check whether you need to copy any data off of the cluster; anything left on it will be deleted upon termination
#* [[#Copying Data to/from the Cluster|Copying Data to/from the Cluster]]
# Terminate the cluster:
#* <pre>starcluster terminate mycluster</pre>
  

== Run GotCloud Demo Using StarCluster ==
# Create a new cluster section in your configuration file: <code>~/.starcluster/config</code>
#* Add the following to the end of the configuration file:
#*: <pre>[cluster gccluster]&#10;KEYNAME = mykey&#10;CLUSTER_SIZE = 4&#10;CLUSTER_USER = sgeadmin&#10;CLUSTER_SHELL = bash&#10;MASTER_IMAGE_ID = ami-6ae65e02&#10;NODE_IMAGE_ID = ami-3393a45a&#10;NODE_INSTANCE_TYPE = m3.large</pre>
# Start the cluster:
#* <pre>starcluster start -c gccluster mycluster</pre>
#** Alternatively, if you set the default template at the start of the configuration file (in the <code>[global]</code> section) to <code>DEFAULT_TEMPLATE=gccluster</code>, you can run:
#*** <pre>starcluster start mycluster</pre>
#* It will take a few minutes for the cluster to start
# Logon to the cluster as ubuntu:
#* <pre>starcluster sshmaster -u ubuntu mycluster</pre>
#** Type <code>yes</code> if the terminal asks if you want to continue connecting
  

{{GotCloud: Amazon Demo Setup|hdr=====}}

==== Run GotCloud SnpCall Demo ====
# Run GotCloud snpcall:
#* <pre>gotcloud snpcall --conf example/test.conf --outdir output --numjobs 8 --batchtype sgei</pre>
#** The ubuntu user is set up with the gotcloud program and tools in its path, so you can just type the program name and it will be found
#** There is enough space in /home/ubuntu to put the demo output
#*** /home/ubuntu is visible from all nodes in the cluster
#** This will take a few minutes to run.
#** GotCloud first generates a makefile, and then runs the makefile
#** After a while GotCloud snpcall will print some messages to the screen. This is expected and ok.
# See [[#Monitoring Cluster Usage|Monitoring Cluster Usage]] if you are interested in monitoring the cluster usage as GotCloud runs
# When complete, GotCloud snpcall will indicate success/failure
#* To look at the snpcall results, see: [[GotCloud:_Amazon_Demo#Examining_SnpCall_Output|GotCloud: Amazon Demo -> Examining SnpCall Output]]
  
==== Run GotCloud Indel Demo ====
# Run GotCloud indel:
#* <pre>gotcloud indel --conf example/test.conf --outdir output --numjobs 8 --batchtype sgei</pre>
#** The ubuntu user is set up with the gotcloud program and tools in its path, so you can just type the program name and it will be found
#** There is enough space in /home/ubuntu to put the demo output
#*** /home/ubuntu is visible from all nodes in the cluster
#** This will take a few minutes to run.
# See [[#Monitoring Cluster Usage|Monitoring Cluster Usage]] if you are interested in monitoring the cluster usage as GotCloud runs
# When complete, GotCloud indel will indicate success/failure
#* To look at the indel results, see: [[GotCloud:_Amazon_Demo#Examining_Indel_Output|GotCloud: Amazon Demo -> Examining Indel Output]]
  
==== Terminate the Demo Cluster ====
# Exit out of your master node:
#* <pre>exit</pre>
# Terminate the cluster:
#* Since this is just a demo, we don't have to worry about the data getting deleted upon termination
#* <pre>starcluster terminate mycluster</pre>
#** Answer <code>y</code> to the question <code>Terminate EBS cluster mycluster (y/n)? </code>
  
== Old Instructions ==
'''StarCluster Configuration Example'''

StarCluster creates a model configuration file in ~/.starcluster/config and you are instructed
to edit this and set the correct values for the variables.
Here is a highly simplified example of a config file that should work.
Please note there are many things you might want to choose, so craft the config file with care.
You'll need to specify nodes with 4GB of memory (type m1.medium) and make sure each node has access to the input and output data for the step being run.
 
  
<code>
####################################
## StarCluster Configuration File ##
####################################
[global]
DEFAULT_TEMPLATE=myexample

#############################################
## AWS Credentials Settings
#############################################
[aws info]
AWS_ACCESS_KEY_ID = AKImyexample8FHJJF2Q
AWS_SECRET_ACCESS_KEY = fthis_was_my_example_secretMqkMIkJjFCIGf
AWS_USER_ID=199998888709

AWS_REGION_NAME = us-east-1                # Choose your own region
AWS_REGION_HOST = ec2.us-east-1.amazonaws.com
AWS_S3_HOST = s3-us-east-1.amazonaws.com

###########################
## EC2 Keypairs
###########################
[key <font color='green'>east1_starcluster</font>]
KEY_LOCATION = ~/.ssh/AWS/east1_starcluster_key.rsa  # Same region

###########################################
## Define Cluster
##  starcluster start -c east1_starcluster  nameichose4cluster
###########################################
[cluster <font color='red'>myexample</font>]         # Name of this cluster definition
KEYNAME = <font color='green'>east1_starcluster</font>                # Name of keys I need
CLUSTER_SIZE = 4                            # Number of nodes
CLUSTER_SHELL = bash

# Choose the base AMI using  starcluster listpublic
#  (http://star.mit.edu/cluster/docs/0.93.3/faq.html)
NODE_IMAGE_ID = ami-765b3e1f
AVAILABILITY_ZONE = us-east-1               # Region again!
NODE_INSTANCE_TYPE = m1.medium              # 4G memory is the minimum for GotCloud

VOLUMES = <font color='orange'>gotcloud</font>, <font color='blue'>mydata</font>
[volume <font color='blue'>mydata</font>]
VOLUME_ID = vol-6e729657
MOUNT_PATH = /mydata

[volume <font color='orange'>gotcloud</font>]
VOLUME_ID = vol-56071570
MOUNT_PATH = /gotcloud
</code>
  
  
'''Create Your Cluster'''

<code>
  '''starcluster start -c <font color='red'>myexample</font> myseq-example'''
  StarCluster - (http://web.mit.edu/starcluster) (v. 0.93.3)
  Software Tools for Academics and Researchers (STAR)
  Please submit bug reports to starcluster@mit.edu

  >>> Validating cluster template settings...
  >>> Cluster template settings are valid
  >>> Starting cluster...
      [lines deleted]
  >>> Mounting EBS volume vol-32273514 on /gotcloud...
  >>> Mounting EBS volume vol-36788522 on /mydata...
      [lines deleted]
</code>
  

When this completes, you are ready to run the GotCloud software on your data.
Make sure you have defined and mounted volumes for your sequence data and the
output steps of the aligner and umake.
These volumes (as well as /gotcloud) should be available on each node.

<code>
  '''starcluster sshmaster myseq-example'''
  StarCluster - (http://web.mit.edu/starcluster) (v. 0.93.3)
  Software Tools for Academics and Researchers (STAR)
    [lines deleted]

  '''df -h'''
  '''ssh node001 df -h'''
</code>

If your data is visible on each node, you're ready to run the software as described
in [[GotCloud]].

Latest revision as of 10:49, 7 November 2014

Back to the beginning: GotCloud

Back to GotCloud: Amazon

If you have access to your own cluster, your task will be much simpler. Install the GotCloud software (GotCloud: Source Releases) and run it as described on the same pages.

For those who are not so lucky to have access to a cluster, Amazon Web Services (AWS) provides an alternative. You may run the gotcloud software on a cluster created in AWS. One tool that makes the creation of a cluster of AMIs (Amazon Machine Instances) is StarCluster (see http://star.mit.edu/cluster/).

The following shows an example of how you might use StarCluster to create an AWS cluster and set it up to run GotCloud. There are many details setting up starcluster and this is not intended to explain all of the many variations you might choose, but should provide you a working example.


Getting Started With StarCluster

StarCluster provides lots of documentation (http://star.mit.edu/cluster/) which will provide more information on it than we have here.

To install and setup StarCluster for the first time, you can follow the QuickStart instructions: http://star.mit.edu/cluster/docs/latest/quickstart.html

  • Includes installation instructions
  • Includes setting up a basic StarCluster configuration file
    • You will need your AWS Credentials to setup the configuration file

You can skip actually starting the cluster in the QuickStart instructions if you want.

Troubleshooting: When I tried this, the starcluster start mycluster step failed similar to:

Don't forget to terminate your cluster:

starcluster terminate mycluster


StarCluster and GotCloud

StarCluster Config Settings

By default, StarCluster expects a configuration file in ~/.starcluster/config.

  • StarCluster will create a model file for you

Ensure your StarCluster configuration file is set for your usage.

  • General AWS Settings:
[aws info]
aws_access_key_id = #your aws access key id here
aws_secret_access_key = #your secret aws access key here
aws_user_id = #your 12-digit aws user id here
  • GotCloud Cluster Definition
    • You may want to create a new cluster section for running GotCloud (or you can use smallcluster) in your configuration file: ~/.starcluster/config
    • You can call it anything you want, for example,
      gccluster
    • Example:
      [cluster gccluster]
      KEYNAME = mykey
      CLUSTER_SIZE = 4
      CLUSTER_USER = sgeadmin
      CLUSTER_SHELL = bash
      MASTER_IMAGE_ID = ami-6ae65e02
      NODE_IMAGE_ID = ami-3393a45a
      NODE_INSTANCE_TYPE = m3.large
      • Set KEYNAME to the key you want to use
      • Set CLUSTER_SIZE to the number of nodes you want to start up (this may be different from 4)
      • Set CLUSTER_USER to add additional users, like sgeadmin
      • Set CLUSTER_SHELL to define the shell you want to use, like bash
      • Set MASTER_IMAGE_ID to the latest GotCloud AMI, see: GotCloud: AMIs
        • Contains GotCloud, the reference, and the demo files in the /home/ubuntu/ directory that will be visible on all nodes in the cluster
        • Has a 30G volume, but only 6G available
      • Set NODE_IMAGE_ID to a StarCluster ubuntu x86_64 AMI
        • Since each node does not need its own 30G volume containing GotCloud, the reference, and the demo files, we use a separate image for the nodes.
        • The nodes can just access the master's copy of GotCloud, the reference, and the Amazon demo
      • Set NODE_INSTANCE_TYPE to the type of instances you want to start in your cluster
    • CLUSTER_SIZE multiplied by the number of CPUs in NODE_INSTANCE_TYPE gives the number of jobs you can run concurrently in GotCloud
  • Define Data Volumes
    • By default, the GotCloud AMI contains about 5G of extra space that you can use
      • /home/ubuntu/ directory is visible from all machines
        • Use /home/ubuntu/ for the output directory if it is <5G
        • This directory will be deleted when you terminate the cluster
    • Create your Own Volumes and attach them to the GotCloud cluster
      • Instructions TBD
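As a concrete illustration of the concurrency rule above: with the example settings (CLUSTER_SIZE = 4 and NODE_INSTANCE_TYPE = m3.large, which has 2 vCPUs per node), the maximum number of concurrent GotCloud jobs works out as follows. This is a shell sketch; the variable names are ours, not StarCluster's.

```shell
# CLUSTER_SIZE from the [cluster gccluster] section; m3.large has 2 vCPUs per node
CLUSTER_SIZE=4
CPUS_PER_NODE=2

# Maximum concurrent GotCloud jobs = nodes * CPUs per node
echo $(( CLUSTER_SIZE * CPUS_PER_NODE ))   # prints 8
```

This is the value you would later pass to GotCloud's --numjobs option.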

Starting the Cluster

  1. Start the cluster:
    • starcluster start -c gccluster mycluster
      • Alternatively, if you set the default template in the [global] section at the start of the configuration file to gccluster (DEFAULT_TEMPLATE=gccluster), you can run:
        • starcluster start mycluster
    • It will take a few minutes for the cluster to start
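The [global] change mentioned above is a one-line edit near the top of ~/.starcluster/config (section and key names as StarCluster expects them):

```ini
[global]
DEFAULT_TEMPLATE = gccluster
```

With this in place, starcluster start mycluster uses the gccluster template without needing the -c option.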


Copying Data to/from the Cluster

Copy data onto the cluster (command run from your local machine)

starcluster put mycluster /path/to/local/file/or/dir /remote/path/


Pull the data from the cluster onto your local machine (command run from your local machine)

starcluster get mycluster /path/to/remote/file/or/dir /local/path/

Reminder: if you write your output to /home/ubuntu/, it will be deleted when you terminate the cluster


Running GotCloud on StarCluster

  • If you have not already, logon to the cluster as ubuntu:
    • starcluster sshmaster -u ubuntu mycluster
      • Type yes if the terminal asks if you want to continue connecting
  • When running GotCloud:
    • Set the cluster/batch type in either configuration or on the command line:
      • In Configuration:
        BATCH_TYPE = sgei
      • On the command-line:
        --batchtype sgei
    • Set the number of jobs to run:
      --numjobs #
      • Replace # with the number of concurrent jobs you want to run (probably CLUSTER_SIZE * CPUs in NODE_INSTANCE_TYPE)
    • Otherwise, run GotCloud as you normally would.
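For example, the batch type above can be placed in your GotCloud configuration file instead of on the command line (a sketch; the file is whatever conf file you pass with --conf):

```ini
# In your GotCloud configuration file
BATCH_TYPE = sgei
```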


To login to a specific non-master node, do:

starcluster sshnode -u ubuntu mycluster node001

Monitoring Cluster Usage

  • Monitor jobs in the queue
    • qstat
    • This will show you the currently running jobs and how they are spread across the nodes in your cluster
      Qstat.png
      • state descriptions:
        • qw : queued and waiting (not yet assigned to a node)
        • r : running
  • View Sun Grid Engine Load
    • qhost
      Qhost.png
      • ARCH : architecture
      • NCPU : number of CPUs
      • LOAD : current load
      • MEMTOT : total memory
      • MEMUSE : memory in use
      • SWAPTO : swap space
      • SWAPUS : swap space in use
  • View the average load per node using:
    • qstat -f
      Qstatf.png
      • load_avg field contains the load average for each node


Terminate the Cluster

  1. Reminder: check whether you need to copy any data off of the cluster, since it will be deleted upon termination
  2. Terminate the cluster
    • starcluster terminate mycluster


Run GotCloud Demo Using StarCluster

  1. Create a new cluster section in your configuration file: ~/.starcluster/config
    • Add the following to the end of the configuration file:
      [cluster gccluster]
      KEYNAME = mykey
      CLUSTER_SIZE = 4
      CLUSTER_USER = sgeadmin
      CLUSTER_SHELL = bash
      MASTER_IMAGE_ID = ami-6ae65e02
      NODE_IMAGE_ID = ami-3393a45a
      NODE_INSTANCE_TYPE = m3.large
  2. Start the cluster:
    • starcluster start -c gccluster mycluster
      • Alternatively, if you set the default template in the [global] section at the start of the configuration file to gccluster (DEFAULT_TEMPLATE=gccluster), you can run:
        • starcluster start mycluster
    • It will take a few minutes for the cluster to start
  3. Logon to the cluster as ubuntu:
    • starcluster sshmaster -u ubuntu mycluster
      • Type yes if the terminal asks if you want to continue connecting

Examine the Setup

  1. After logging into the Amazon node as the ubuntu user, you should by default be in the ubuntu home directory: /home/ubuntu
    1. You can check this by doing:
      pwd
      • This should output: /home/ubuntu
    2. Take a look at the contents of the ubuntu user home directory
      ls
      • This should output 2 directories: example and gotcloud
        • The example directory contains the files for this demo
        • The gotcloud directory contains the gotcloud programs and pre-compiled source
    DemoHome.png
  2. Look at the example input files:
    ls example
    ExampleFiles.png
    1. bam.list contains the list of BAM files per sample
    2. bams is a subdirectory containing the BAM files for this demo
    3. test.bed contains the region we want to process in this demo
      • To make the demo run faster, we only process a small region of chromosome 22 (the APOL1 region); this file tells GotCloud that region
      BedContents.png
    4. test.conf contains the settings we want GotCloud to use for this run
      ConfContents.png
      • For the demo, we want to tell GotCloud:
        1. The list of bams to use: BAM_LIST = example/bam.list
        2. The region to process rather than the whole genome: UNIFORM_TARGET_BED = example/test.bed
        3. The chromosomes to process. The default chromosomes are 1-22 & X, but we only want to process chromosome 22: CHRS = 22
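Putting the three settings just described together, example/test.conf is expected to contain lines like these:

```ini
# List of BAM files per sample
BAM_LIST = example/bam.list
# Restrict processing to the demo region of chromosome 22
UNIFORM_TARGET_BED = example/test.bed
# Only process chromosome 22 (instead of the default 1-22 & X)
CHRS = 22
```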

Run GotCloud SnpCall Demo

  1. Run GotCloud snpcall
    • gotcloud snpcall --conf example/test.conf --outdir output --numjobs 8 --batchtype sgei
      • The ubuntu user is set up to have the gotcloud program and tools in its path, so you can just type the program name and it will be found
      • There is enough space in /home/ubuntu to put the Demo output
        • /home/ubuntu is visible from all nodes in the cluster
      • This will take a few minutes to run.
      • GotCloud first generates a makefile, and then runs the makefile
      • GotCloud snpcall will periodically print status messages to the screen; this is expected and OK.
  2. See Monitoring Cluster Usage if you are interested in monitoring the cluster usage as GotCloud runs
  3. When complete, GotCloud snpcall will indicate success/failure

Run GotCloud Indel Demo

  1. Run GotCloud indel
    • gotcloud indel --conf example/test.conf --outdir output --numjobs 8 --batchtype sgei
      • The ubuntu user is set up to have the gotcloud program and tools in its path, so you can just type the program name and it will be found
      • There is enough space in /home/ubuntu to put the Demo output
        • /home/ubuntu is visible from all nodes in the cluster
      • This will take a few minutes to run.
  2. See Monitoring Cluster Usage if you are interested in monitoring the cluster usage as GotCloud runs
  3. When complete, GotCloud indel will indicate success/failure

Terminate the Demo Cluster

  1. Exit out of your master node
    • exit
  2. Terminate the cluster
    • Since this is just a demo, we don't have to worry about the data getting deleted upon termination
    • starcluster terminate mycluster
      • Answer y to the question Terminate EBS cluster mycluster (y/n)?

Old Instructions

StarCluster Configuration Example

StarCluster creates a model configuration file in ~/.starcluster/config, and you are instructed to edit it and set the correct values for the variables. Here is a highly simplified example of a config file that should work. Note that there are many options you may want to adjust, so craft the config file with care. You'll need to specify nodes with 4GB of memory (type m1.medium) and make sure each node has access to the input and output data for the step being run.

####################################
## StarCluster Configuration File ##
####################################
[global]
DEFAULT_TEMPLATE=myexample

#############################################
## AWS Credentials Settings
#############################################
[aws info]
AWS_ACCESS_KEY_ID = AKImyexample8FHJJF2Q
AWS_SECRET_ACCESS_KEY = fthis_was_my_example_secretMqkMIkJjFCIGf
AWS_USER_ID=199998888709 

AWS_REGION_NAME = us-east-1                 # Choose your own region
AWS_REGION_HOST = ec2.us-east-1.amazonaws.com
AWS_S3_HOST = s3-us-east-1.amazonaws.com

###########################
## EC2 Keypairs
###########################
[key east1_starcluster]
KEY_LOCATION = ~/.ssh/AWS/east1_starcluster_key.rsa   # Same region

###########################################
## Define Cluster
##   starcluster start -c myexample  nameichose4cluster
###########################################
[cluster myexample]          # Name of this cluster definition
KEYNAME = east1_starcluster                 # Name of keys I need
CLUSTER_SIZE = 4                            # Number of nodes
CLUSTER_SHELL = bash

#  Choose the base AMI using   starcluster listpublic
#   (http://star.mit.edu/cluster/docs/0.93.3/faq.html)
NODE_IMAGE_ID = ami-765b3e1f
AVAILABILITY_ZONE = us-east-1a              # A zone in your chosen region
NODE_INSTANCE_TYPE = m1.medium              # 4G memory is the minimum for GotCloud

VOLUMES = gotcloud, mydata
[volume mydata]
VOLUME_ID = vol-6e729657
MOUNT_PATH = /mydata

[volume gotcloud]
VOLUME_ID = vol-56071570
MOUNT_PATH = /gotcloud


Create Your Cluster

 starcluster start -c myexample myseq-example
 StarCluster - (http://web.mit.edu/starcluster) (v. 0.93.3)
 Software Tools for Academics and Researchers (STAR)
 Please submit bug reports to starcluster@mit.edu

 >>> Validating cluster template settings...
 >>> Cluster template settings are valid
 >>> Starting cluster...
     [lines deleted]
 >>> Mounting EBS volume vol-32273514 on /gotcloud...
 >>> Mounting EBS volume vol-36788522 on /mydata...
     [lines deleted]

When this completes, you are ready to run the GotCloud software on your data. Make sure you have defined and mounted volumes for your sequence data and for the output of the aligner and umake steps. These volumes (as well as /gotcloud) should be available on each node.

 starcluster sshmaster myseq-example
 StarCluster - (http://web.mit.edu/starcluster) (v. 0.93.3)
 Software Tools for Academics and Researchers (STAR)
   [lines deleted]

 df -h
 ssh node001 df -h

If your data is visible on each node, you're ready to run the software as described in GotCloud.