Difference between revisions of "UMAKE"


Revision as of 22:14, 4 July 2011


UMAKE is a software pipeline to detect SNPs and call their genotypes from a list of BAM files. The UMAKE pipeline has been successfully applied to detect SNPs in many large-scale next-generation sequencing studies.

Download UMAKE

To get a copy, go to the UMAKE Download page.

Build UMAKE

To build UMAKE, download the UMAKE package from the link above and run the following series of commands.

tar xzvf umake.r100.20110705.tar.gz
cd umake
make

UMAKE is designed to be portable. However, since development occurs only on Ubuntu 9.10 and later (x86 and x64 platforms), portability issues are likely on other platforms.

Currently we support UMAKE only on Ubuntu 9.10 and later on 64-bit processors. Perl (5.0 or higher) must be installed with the IO::File, IO::Zlib, and Getopt::Long modules.
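
The Perl module requirement above can be checked before building. The following is a minimal sketch, not part of the UMAKE package: the function name check_perl_modules is hypothetical, and it only assumes that perl is on the PATH and that the standard -M switch is used to load each module.

```shell
# Report which of the given Perl modules can be loaded.
# Prints "ok <module>" or "MISSING <module>" for each argument and
# returns non-zero if at least one module could not be loaded.
# (check_perl_modules is a hypothetical helper, not shipped with UMAKE.)
check_perl_modules() {
    missing_count=0
    for module in "$@"; do
        # "perl -MName -e 1" loads the module and then runs a no-op program
        if perl -M"$module" -e 1 >/dev/null 2>&1; then
            echo "ok $module"
        else
            echo "MISSING $module"
            missing_count=1
        fi
    done
    return $missing_count
}

# Usage:
#   check_perl_modules IO::File IO::Zlib Getopt::Long
```

Any module reported as MISSING would need to be installed before running umake.pl.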

Note that UMAKE requires external software packages to be copied to the $(UMAKE_HOME)/ext directory.

Basic Usage Example

Here is a typical command line:

perl $(UMAKE_HOME)/scripts/umake.pl --conf [conf.file]

An example configuration file can be found at examples/umake-example.conf. Users have to modify the configuration files to

The full UMAKE pipeline is partitioned into three steps: (1) SNP detection, (2) LD-aware genotype refinement using BEAGLE, and (3) MaCH/Thunder genotype refinement on top of the BEAGLE haplotypes. These three steps can be run with the same configuration file using the following options:

perl $(UMAKE_HOME)/scripts/umake.pl --conf [conf.file] --snpcall
perl $(UMAKE_HOME)/scripts/umake.pl --conf [conf.file] --beagle
perl $(UMAKE_HOME)/scripts/umake.pl --conf [conf.file] --thunder
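
The three invocations above can be chained so that a later step starts only if the earlier one succeeded. This is a minimal sketch under stated assumptions: the function name run_umake_pipeline is hypothetical, UMAKE_HOME is assumed to be set as a shell variable, and umake.pl is assumed to exit with a non-zero status when a step fails.

```shell
# Run the three UMAKE steps in order with one configuration file and stop
# at the first step that fails.
# Arguments: the configuration file, followed by the command used to start
# umake.pl (passed in so that the same function can be dry-run with a stub).
# (run_umake_pipeline is a hypothetical helper, not shipped with UMAKE.)
run_umake_pipeline() {
    conf_file=$1
    shift
    for step in snpcall beagle thunder; do
        echo "Running step: --$step" >&2
        if ! "$@" --conf "$conf_file" "--$step"; then
            echo "Step --$step failed; stopping." >&2
            return 1
        fi
    done
}

# Usage (assumes UMAKE_HOME is exported in the shell):
#   run_umake_pipeline my.conf perl "$UMAKE_HOME/scripts/umake.pl"
```

Passing the command as arguments keeps the sketch independent of where UMAKE is installed; whether umake.pl reports failures through its exit status should be confirmed against the actual script.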

Exercise with Example Resources

Example input files can be downloaded at UMAKE Download. These example resource files include sequence alignment files over 60 individuals from the 1000 Genomes Project, focusing on a 300kb region of chromosome 20. Note that the reference genome FASTA file has also been modified to contain chromosome 20 only.

Let UMAKE_HOME be the path to the UMAKE package and EXAMPLE_HOME be the path to the example resource files.

  1. First, modify the UMAKE_ROOT, INPUT_ROOT, and OUTPUT_ROOT parameters accordingly.
  2. Second, perform the SNP calling procedure using the following command:
perl $(UMAKE_HOME)/scripts/umake.pl --snpcall
  3. Third, run BEAGLE genotype refinement using the following command:
perl $(UMAKE_HOME)/scripts/umake.pl --beagle
  4. Finally, run BEAGLE/THUNDER genotype refinement using the following command:
perl $(UMAKE_HOME)/scripts/umake.pl --thunder

Preparing Your Own Input Files

UMAKE requires three types of input files: (1) a set of BAM files, (2) an index file, and (3) a configuration file.

  1. BAM files need to be duplicate-marked and base-quality recalibrated in order to obtain high-quality SNP calls.
  2. Each line of the index file represents one individual, in the following format. Note that multiple BAMs per individual may be provided.
[SAMPLE_ID]	[COMMA SEPARATED POPULATION LABELS]	[BAM_FILE1]	[BAM_FILE2]	...
  3. Additional input files may be provided, including pedigree files (PED format, to specify gender information in chrX calling) and target information (UCSC's BED format) for targeted or whole exome capture sequencing.
  4. The configuration file contains core information on run-time options, including the software binaries and command line arguments. Refer to the example configuration file for further information.
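
The index file layout described above lends itself to a quick sanity check before a run. The following is a minimal sketch, not part of UMAKE: the function name check_index_file is hypothetical, and it assumes the fields are separated by tabs, as in the format line above.

```shell
# Check that every non-empty line of an index file has at least three
# tab-separated fields: sample ID, population labels, and one or more BAMs.
# Prints one message per offending line and returns non-zero if any is found.
# (check_index_file is a hypothetical helper, not shipped with UMAKE.)
check_index_file() {
    awk -F '\t' '
        BEGIN  { bad = 0 }
        NF == 0 { next }    # ignore blank lines
        NF < 3 {
            printf "line %d: expected at least 3 fields, found %d\n", NR, NF
            bad = 1
        }
        END    { exit bad }
    ' "$1"
}

# Usage:
#   check_index_file my.index && echo "index file looks well-formed"
```

The check covers only the field count; it does not verify that the listed BAM files exist or that they have been duplicate-marked and recalibrated.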

Software Components

The UMAKE pipeline consists of the following software components (details TBA):

  • samtools-hybrid
  • glfMerge
  • glfMultiples
  • vcfPileup
  • infoCollector
  • vcfCooker
  • thunderVCF

Acknowledgements

UMAKE is the result of a collaborative effort by Hyun Min Kang, Goo Jun, Carlo Sidore, Paul Anderson, Mary Kate Trost, Wei Chen, Tom Blackwell, and Goncalo Abecasis. Please email Hyun Min Kang [hmkang@umich.edu] with any questions.