Mount S3 Volume


Back to the beginning: GotCloud

We are still working out how to run directly against the 1000G (1000 Genomes) data in S3.

For now, we have found it more reliable to copy the data onto a volume and run on that.


Ref: http://thecrystalclouds.wordpress.com/2012/05/18/installation-and-setup-of-s3fs-on-amazon-web-services/ and http://code.google.com/p/s3fs/wiki/FuseOverAmazon

It seemed like such a good idea. Rather than copy the data from S3 storage, why not mount the S3 bucket and run the pipeline on it directly? After all, in theory, the aligner only reads the FASTQ files once and then writes its output to local disk, which is then the input to UMAKE.

Here's what happened:

 apt-get update
 apt-get upgrade
 apt-get install build-essential libcurl4-openssl-dev libxml2-dev libfuse-dev comerr-dev libfuse2 libidn11-dev  libkrb5-dev libldap2-dev libselinux1-dev libsepol1-dev pkg-config fuse-utils sshfs

 wget https://s3fs.googlecode.com/files/s3fs-r203.tar.gz
 tar xzvf s3fs-r203.tar.gz
 cd s3fs
 #  The Makefile does not work. The fix is to put the s3fs.cpp file right after g++
 g++ s3fs.cpp -ggdb -Wall $(shell pkg-config ...

 make                        # Should create an s3fs executable
 make install

 #  Configuration change for fuse
 vi /etc/fuse.conf
 user_allow_other            # Not a command: uncomment this line in /etc/fuse.conf
 
 mkdir /mnt/s3
 s3fs 1000genomes -o accessKeyId=AKourkey2Q -o secretAccessKey=ftoutsecretaccesskeyIGf -o use_cache=/tmp -o allow_other /mnt/s3
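
The fuse.conf change above can also be made without opening an editor. The following is only a sketch: the function name enable_user_allow_other is hypothetical (it is not part of s3fs or FUSE), and it assumes GNU sed, whose -i option edits the file in place.

```shell
# Hypothetical helper (not part of s3fs or FUSE): make sure the
# user_allow_other option is active in a fuse.conf-style file.
# Assumes GNU sed (for the in-place -i option).
enable_user_allow_other() {
    conf=$1
    # Uncomment the option if it is present but commented out.
    sed -i 's/^[[:space:]]*#[[:space:]]*user_allow_other[[:space:]]*$/user_allow_other/' "$conf"
    # Append the option if the file still does not contain it.
    grep -q '^user_allow_other$' "$conf" || echo 'user_allow_other' >> "$conf"
}
```

Run as root, this would be called as: enable_user_allow_other /etc/fuse.conf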

 # Did it work?
 cd /mnt
 ls /mnt/s3/data/HG01550     # Wow, look at all those files
 rsync -av s3/data/HG01550 .
   sending incremental file list
   HG01550/
   HG01550/alignment/
   HG01550/alignment/HG01550.chrom11.ILLUMINA.bwa.CLM.low_coverage.20111114.bam
   HG01550/alignment/HG01550.chrom11.ILLUMINA.bwa.CLM.low_coverage.20111114.bam.bai
   HG01550/alignment/HG01550.chrom11.ILLUMINA.bwa.CLM.low_coverage.20111114.bam.bas
   HG01550/alignment/HG01550.chrom20.ILLUMINA.bwa.CLM.low_coverage.20111114.bam
   HG01550/alignment/HG01550.chrom20.ILLUMINA.bwa.CLM.low_coverage.20111114.bam.bai
   HG01550/alignment/HG01550.chrom20.ILLUMINA.bwa.CLM.low_coverage.20111114.bam.bas
   HG01550/alignment/HG01550.mapped.ILLUMINA.bwa.CLM.low_coverage.20111114.bam
   rsync: read errors mapping "/mnt/s3/data/HG01550/alignment/HG01550.mapped.ILLUMINA.bwa.CLM.low_coverage.20111114.bam": Bad file descriptor (9)
   HG01550/alignment/HG01550.mapped.ILLUMINA.bwa.CLM.low_coverage.20111114.bam.bai
   rsync: read errors mapping "/mnt/s3/data/HG01550/alignment/HG01550.mapped.ILLUMINA.bwa.CLM.low_coverage.20111114.bam.bai": Bad file descriptor (9)
   [deleted lines]

After all that, and 80 minutes of waiting for the rsync to finish, we had almost no files.

It does not appear that you can mount an S3 bucket with s3fs and have it work with data of this size.
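
To put a number on how few files arrived, one option is to compare the file listing on the mount (listing did work) with what actually landed on local disk. This is a minimal sketch, not something we ran at the time; the function name missing_files is hypothetical, and it only checks that each file exists at the destination, not that its contents are complete.

```shell
# Hypothetical helper: print the files (relative paths) that exist under
# the source tree but are missing under the destination tree.
missing_files() {
    src=$1
    dst=$2
    # List regular files relative to the source directory, in sorted order.
    (cd "$src" && find . -type f) | sort | while IFS= read -r f; do
        # Report any file that has no counterpart at the destination.
        [ -f "$dst/$f" ] || printf '%s\n' "${f#./}"
    done
}
```

For the copy shown above, this would be called as: missing_files /mnt/s3/data/HG01550 /mnt/HG01550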