Changes

From Genome Analysis Wiki
Jump to navigationJump to search
11,448 bytes added ,  11:53, 2 February 2017
Line 1: Line 1:  
[[Category:Software|UMAKE]]
 
[[Category:Software|UMAKE]]
 +
 +
<!--        BANNER ACROSS TOP OF PAGE        -->
 +
{| style="width:100%; background:#ffb6c1; margin-top:1.2em; border:1px solid #ccc;" |
 +
| style="width:100%; text-align:center; color:#000;" |
 +
<div style="font-size:162%; border:none; margin:0; padding:.1em; color:#000;">'''NOTE: The UMAKE pipeline is now included in [[GotCloud]] package and is no longer maintained as a separate download.  Please see [[GotCloud]] instead.  The documentation on this page is outdated. Please visit [[GotCloud]] page for more details on the latest version'''</div>
 +
|}
 +
    
'''UMAKE''' is a software pipeline to detect SNPs and call their genotypes from a list of BAM files. '''UMAKE''' pipeline has been successfully applied in detecting SNPs from many large-scale next-generation sequencing studies.  
 
'''UMAKE''' is a software pipeline to detect SNPs and call their genotypes from a list of BAM files. '''UMAKE''' pipeline has been successfully applied in detecting SNPs from many large-scale next-generation sequencing studies.  
 +
 +
''Updated version of UMAKE including SVM filtering will be available very soon''
    
== Download UMAKE  ==
 
== Download UMAKE  ==
   −
To get a copy go to the [http://www.sph.umich.edu/csg/kang/umake/ UMAKE Download] download page.
+
To get a copy go to the [http://csg.sph.umich.edu//kang/umake/download UMAKE Download] download page.
    
== Build UMAKE  ==
 
== Build UMAKE  ==
    
To build UMAKE, download the UMAKE package from the link above and run the following series of commands.
 
To build UMAKE, download the UMAKE package from the link above and run the following series of commands.
  tar xzvf umake.v1.20110705.tar.gz
+
  tar xzvf umake.v1.0.1.20110706.tar.gz
 
  cd umake
 
  cd umake
 
  make
 
  make
Line 18: Line 27:  
Currently we support UMAKE only on Ubuntu 9.10 and later on 64-bit processors. perl (5.0 or higher) must be installed with IO::File, IO::Zlib, and Getopt::Long packages.
 
Currently we support UMAKE only on Ubuntu 9.10 and later on 64-bit processors. perl (5.0 or higher) must be installed with IO::File, IO::Zlib, and Getopt::Long packages.
   −
Note that UMAKE requires external software packages to be copied to <code>$(UMAKE_HOME)/ext/</code> directory  
+
Note that UMAKE requires external software packages to be copied to <code>UMAKE_HOME/ext/</code> directory  
* <code>bgzip</code> and <code>tabix</code> - To download, go to [http://sourceforge.net/projects/samtools/files/tabix/ TABIX Download]
+
* Create an "ext" folder under UMAKE_HOME (the path to the UMAKE package)
* <code>beagle</code> - To download, go to [http://faculty.washington.edu/browning/beagle/beagle.html#download BEAGLE Download]
+
 
 +
* <code>bgzip</code> and <code>tabix</code> - To download, go to [http://sourceforge.net/projects/samtools/files/tabix/ TABIX Download] (after compiling the source code, copy bgzip and tabix to the "ext" folder above)
 +
 
 +
* <code>beagle</code> - To download, go to [http://faculty.washington.edu/browning/beagle/beagle.html#download BEAGLE Download] (rename "beagle.jar" to "beagle.20101226.jar" and copy it to the "ext" folder)
 +
 
 +
* Copy the executables <code>bgzip</code> and <code>tabix</code> to /usr/cluster/bin/ OR you could replace "/usr/cluster/bin" with the complete path of the above "ext" folder at line 652 and 654 of umake.pl under UMAKE_HOME/scripts/
    
== Basic Usage Example ==
 
== Basic Usage Example ==
Line 38: Line 52:     
== Exercise with Example Resources ==
 
== Exercise with Example Resources ==
Example input files can be downloaded at [http://www.sph.umich.edu/csg/kang/umake/ UMAKE Download]. These example resource files includs sequence alignment files over 60 individuals from the 1000 Genomes project, focusing on 300kb region in chromosome 20. Note that the reference genome FASTA file has also been modified to use chromosome 20 only.
+
Example input files can be downloaded at [http://csg.sph.umich.edu//kang/umake/download UMAKE Download]. These example resource files includs sequence alignment files over 60 individuals from the 1000 Genomes project, focusing on 300kb region in chromosome 20. Note that the reference genome FASTA file has also been modified to use chromosome 20 only.
    
Let <code>UMAKE_HOME</code> be the path to the UMAKE package and <code>EXAMPLE_HOME</code> be the path to the example resource files.  
 
Let <code>UMAKE_HOME</code> be the path to the UMAKE package and <code>EXAMPLE_HOME</code> be the path to the example resource files.  
   −
* First, modify <code>UMAKE_ROOT, INPUT_ROOT, OUTPUT_ROOT</code> parameters accordingly.
+
* First, modify <code>UMAKE_ROOT, INPUT_ROOT, OUTPUT_ROOT</code> parameters accordingly.
* Second, perform SNP calling procedure using the following command
+
* You need to run all following commands under the <code>EXAMPLE_HOME</code> folder. After each perl script is done, run the two make commands printed from the perl script.
 +
* Second, perform SNP calling procedure using the following command  
 
  perl $(UMAKE_HOME)/scripts/umake.pl --snpcall
 
  perl $(UMAKE_HOME)/scripts/umake.pl --snpcall
 
* Third, run BEAGLE genotype refinement using the  
 
* Third, run BEAGLE genotype refinement using the  
Line 49: Line 64:  
* Finally, run BEAGLE/THUNDER genotype refinement using the  
 
* Finally, run BEAGLE/THUNDER genotype refinement using the  
 
  perl $(UMAKE_HOME)/scripts/umake.pl --thunder
 
  perl $(UMAKE_HOME)/scripts/umake.pl --thunder
* If SNP site is already defined, instead of --snpcall option, you can extract the genotype likelihood using the following command
+
* If using MOSIX nodes, change default MOSIX_PREFIX as following:
  perl $(UMAKE_HOME)/scripts/umake.pl --extract
+
  MOS_PREFIX = mosrun -E/tmp -t -i -m 2000    # PREFIX FOR MOSIX COMMAND (BLANK IF UNUSED)
+
 
 
== Preparing Your Own Input Files ==
 
== Preparing Your Own Input Files ==
   Line 58: Line 73:  
* BAM files need to be duplicate-marked and base-quality recalibrated in order to obtain high quality SNP calls.
 
* BAM files need to be duplicate-marked and base-quality recalibrated in order to obtain high quality SNP calls.
 
* Each line of Index file represents each individual under the following format. Note that multiple BAMs per individual may be provided.
 
* Each line of Index file represents each individual under the following format. Note that multiple BAMs per individual may be provided.
  [SAMPLE_ID] [COMMA SEPARATED POPULATION LABELS] [BAM_FILE1] [BAM_FILE2] ...
+
  [SAMPLE_ID]   [COMMA SEPARATED POPULATION LABELS] [BAM_FILE1] [BAM_FILE2] ...
 
* Additional input Files including Pedigree files (PED format) (to specify gender information in chrX calling), Target information (UCSC's BED format) in targeted or whole exome capture sequencing may be provided.
 
* Additional input Files including Pedigree files (PED format) (to specify gender information in chrX calling), Target information (UCSC's BED format) in targeted or whole exome capture sequencing may be provided.
 
* Configuration file contains core information of run-time options including the software binaries and command line arguments. Refer to the example configuration file for further information
 
* Configuration file contains core information of run-time options including the software binaries and command line arguments. Refer to the example configuration file for further information
 +
 +
== Configuration File ==
 +
 +
The example configuration file below illustrate how to configure the UMAKE configuration file. Here are a few highlights
 +
* Steps to run could be automatically set using --snpcall, --beagle, --thunder, --extract, or manually set by uncommenting options in STEPS_TO_RUN. Note that the steps to run should be in a consecutive order
 +
* To run on full genome data, the resource file should be download at [[ftp://share.sph.umich.edu/1000genomes/umake-resources/ | FTP Download of Full Resource Files]].
 +
* You need to uncomment the target-related configuration lines in order to run in whole genome data
 +
* FILTER_ARGS needs to be carefully calibrated in order to obtain a good set of filtered set of SNPs
 +
 +
##################################################################
 +
# UMAKE CONFIGURATION FILE
 +
# This configuration file contains run-time configuration of
 +
# UMAKE SNP calling pipeline
 +
###############################################################################
 +
## KEY ELEMENTS TO CONFIGURE : NEED TO MODIFY
 +
###############################################################################
 +
#UMAKE_ROOT = FULL_PATH_TO_UMAKE  ## e.g. /home/myid/code/umake
 +
#INPUT_ROOT = FULL_PATH_TO_CURRENT_DIR ## e.g. /home/myid/data/umake-examples
 +
#OUTPUT_ROOT = FULL_PATH_TO_OUTPUT_DIR ## e.g. /home/myid/data/umake-examples/out
 +
BAM_INDEX = $(INPUT_ROOT)/umake-example.index  # SAMPLE INDEX FILE (See documentation for detailed format)
 +
CHRS = 20              # List of chromosomes to call SNPs. For multiple chromosomes, separate by whitespace
 +
OUT_DIR = $(OUTPUT_ROOT)    # output directory
 +
OUT_PREFIX = umake-example  # prefix of output Makefile $(OUT_PREFIX).Makefile will be generated
 +
#PED_INDEX = $(INPUT_ROOT)/umake-example.ped    # SAMPLE PED FILE (required only for chrX calling)
 +
#
 +
###############################################################################
 +
## STEPS TO RUN : COMMENT OUT TO EXCLUDE CERTAIN STEPS
 +
##  --snpcall, --extract, --beagle, --thunder commands automatically set them
 +
###############################################################################
 +
#RUN_INDEX = TRUE        # create BAM index file
 +
#RUN_PILEUP = TRUE      # create GLF file from BAM
 +
#RUN_GLFMULTIPLES = TRUE # create unfiltered SNP calls
 +
#RUN_VCFPILEUP = TRUE    # create PVCF files using vcfPileup and run infoCollector
 +
#RUN_FILTER = TRUE      # filter SNPs using vcfCooker
 +
#RUN_SPLIT = TRUE        # split SNPs into chunks for genotype refinement
 +
#RUN_BEAGLE = TRUE  # BEAGLE - MUST SET AFTER FINISHING PREVIOUS STEPS
 +
#RUN_SUBSET = TRUE  # SUBSET FOR THUNDER - MAY BE SET WITH BEAGLE STEP TOGETHER
 +
#RUN_THUNDER = TRUE # THUNDER - MUST SET AFTER FINISHING PREVIOUS STEPS
 +
#
 +
###############################################################################
 +
## OPTIONS FOR GLFEXTRACT (GLFMULTIPLES, VCFPILEUP, FILTER MUST BE TURNED OFF)
 +
###############################################################################
 +
#RUN_EXTRACT = TRUE  # Instead of discovering SNPs, extract genotype liklihood in the site of VCF_EXTRACT
 +
#VCF_EXTRACT = # whole-genome (gzipped and tabixed) .vcf.gz file to extract the site information to genotype (such as 1000 Genomes site list)
 +
#
 +
###############################################################################
 +
## OPTIONS FOR EXOME/TARGETED SEQUENCING : COMMENT OUT IF WHOLE GENOME SEQUENCING
 +
###############################################################################
 +
WRITE_TARGET_LOCI = TRUE  # FOR TARGETED SEQUENCING ONLY -- Write loci file when performing pileup
 +
UNIFORM_TARGET_BED = $(INPUT_ROOT)/umake-example.bed # Targeted sequencing : When all individuals has the same target. Otherwise, comment it out
 +
OFFSET_OFF_TARGET = 50 # Extend target by given # of bases
 +
MULTIPLE_TARGET_MAP =  # Target per individual : Each line contains [SM_ID] [TARGET_BED]
 +
TARGET_DIR = target    # Directory to store target information
 +
SAMTOOLS_VIEW_TARGET_ONLY = TRUE # When performing samtools view, exclude off-target regions (may make command line too long)
 +
#
 +
###############################################################################
 +
## RESOURCE FILES : Download the full resources for full genome calling
 +
###############################################################################
 +
REF = $(INPUT_ROOT)/data/ref/human_g1k_v37_chr20.fa # Reference FASTA sequence. Note that the FASTA file in the example package is only chr20.
 +
INDEL_PREFIX = $(INPUT_ROOT)/data/indels/1kg.pilot_release.merged.indels.sites.hg19 # 1000 Genomes Pilot 1 indel VCF prefix
 +
DBSNP_PREFIX =  $(INPUT_ROOT)/data/dbsnp/dbsnp_129_b37.rod # dbSNP file prefix
 +
HM3_PREFIX =  $(INPUT_ROOT)/data/HapMap/hapmap3_r3_b37_fwd.consensus.qc.poly # HapMap3 polymorphic site prefix
 +
#
 +
###############################################################################
 +
## BINARIES
 +
###############################################################################
 +
SAMTOOLS_FOR_PILEUP = $(UMAKE_ROOT)/bin/samtools-hybrid # for samtools pileup
 +
SAMTOOLS_FOR_OTHERS = $(UMAKE_ROOT)/bin/samtools-hybrid # for samtools view and calmd
 +
GLFMERGE = $(UMAKE_ROOT)/bin/glfMerge # glfMerge when multiple BAMs exist per indvidual
 +
GLFMULTIPLES = $(UMAKE_ROOT)/bin/glfMultiples --minMapQuality 0 --minDepth 1 --maxDepth 10000000 --uniformTsTv --smartFilter # glfMultiples and options
 +
GLFEXTRACT = $(UMAKE_ROOT)/bin/glfExtract  # glfExtract for obtaining VCF for known sites
 +
VCFPILEUP = $(UMAKE_ROOT)/bin/vcfPileup    # vcfPileup to generate rich per-site information
 +
INFOCOLLECTOR = $(UMAKE_ROOT)/bin/infoCollector # create filtering statistics
 +
VCFMERGE = perl $(UMAKE_ROOT)/scripts/bams2vcfMerge.pl # merge multiple BAMs separated by chunk of genomes
 +
VCFCOOKER = $(UMAKE_ROOT)/bin/vcfCooker  # vcfCooker for filtering
 +
VCFSUMMARY = perl $(UMAKE_ROOT)/scripts/vcfSummary.pl # Get summary statistics of discovered site
 +
VCFSPLIT = perl $(UMAKE_ROOT)/scripts/vcfSplit.pl # split VCF into overlapping chunks for genotype refinement
 +
VCFPASTE = perl $(UMAKE_ROOT)/scripts/vcfPaste.pl # vcfPaste to generate filtered genotype VCF
 +
BEAGLE = java -Xmx4g -jar $(UMAKE_ROOT)/ext/beagle.20101226.jar seed=993478 gprobs=true niterations=50 lowmem=true # BEAGLE BINARY : NEED TO COPY BEAGLE TO $(UMAKE_ROOT)/ext DIRECTORY BEFORE RUNNING PIPELINE
 +
VCF2BEAGLE = perl $(UMAKE_ROOT)/scripts/vcf2Beagle.pl --PL # convert VCF (with PL tag) into beagle input
 +
BEAGLE2VCF = perl $(UMAKE_ROOT)/scripts/beagle2Vcf.pl # convert beagle output to VCF
 +
THUNDER = $(UMAKE_ROOT)/bin/thunderVCF -r 30 --phase --dosage --compact --inputPhased # MaCH/Thunder genotype refinement step
 +
LIGATEVCF = perl $(UMAKE_ROOT)/scripts/ligateVcf.pl # ligate multiple phased VCFs while resolving the phase between VCFs
 +
BGZIP = $(UMAKE_ROOT)/ext/bgzip  # NEED TO COPY BGZIP TO $(UMAKE_ROOT)/ext DIRECTORY BEFORE RUNNING PIPELINE
 +
TABIX = $(UMAKE_ROOT)/ext/tabix  # NEED TO COPY TABIX TO $(UMAKE_ROOT)/ext DIRECTORY BEFORE RUNNING PIPELINE
 +
#
 +
###############################################################################
 +
## ARGUMENT FOR FILTERING
 +
###############################################################################
 +
SAMTOOLS_VIEW_FILTER = -q 20 -F 0x0704 # samtools view filter (-q by MQ, -F by flag)
 +
FILTER_MAX_SAMPLE_DP = 20  # Max Depth per Sample (20x default) -- will generate FILTER_MAX_TOTAL_DP automatically
 +
FILTER_MIN_SAMPLE_DP = 0.5 # Min Depth per Sample (0.5x defaul) -- will generate FILTER_MIN_TOTAL_DP automatically
 +
FILTER_ARGS = --write-vcf --filter --maxDP $(FILTER_MAX_TOTAL_DP) --minDP $(FILTER_MIN_TOTAL_DP) --maxAB 70 --maxSTR 20 --minSTR -20 --winIndel 5 --maxSTZ 5 --minSTZ -5 --maxAOI 5 # arguments for filtering (refer to vcfCooker for details)
 +
#
 +
#############################################################################
 +
## RELATIVE DIRECTORY UNDER OUT_DIR
 +
#############################################################################
 +
BAM_GLF_DIR = glfs/bams  # BAM level GLF
 +
SM_GLF_DIR = glfs/samples # sample level GLF (after glfMerge if necessary)
 +
VCF_DIR = vcfs            # unfiltered and filtered VCF
 +
PVCF_DIR = pvcfs          # vcfPileup results
 +
SPLIT_DIR = split        # chunks split to multiple overlappingpieces
 +
BEAGLE_DIR = beagle      # beagle output
 +
THUNDER_DIR = thunder    # MaCH/thunder output
 +
GLF_INDEX = glfIndex.ped  # glfMultiples/glfExtract index file info
 +
#
 +
#############################################################################
 +
## OTHER OPTIONS
 +
#############################################################################
 +
UNIT_CHUNK = 5000000      # Chunk size of SNP calling : 5Mb is default
 +
LD_NSNPS = 10000          # Chunk size of genotype refinement : 10,000 SNPs
 +
LD_OVERLAP = 1000        # Overlapping # of SNPs between chinks : 1,000 SNPs
 +
RUN_INDEX_FORCE = FALSE  # Regenerate BAM index file even if it exists
 +
MERGE_BEFORE_FILTER = FALSE # Merge across the chromosome before filtering
 +
NOBAQ_SUBSTRINGS = SOLID  # Avoid BAQ if the BAM file contains the substring
 +
ASSERT_BAM_EXIST = FALSE  # Check if BAM file exists
 +
#
 +
#############################################################################
 +
## CLUSTER SETTING : CURRENTLY COMPATIBLE WITH MOSIX PLATFORM
 +
#############################################################################
 +
MOS_PREFIX =    # PREFIX FOR MOSIX COMMAND (BLANK IF UNUSED)
 +
MOS_NODES =      # COMMA-SEPARATED LIST OF NODES TO SUBMIT JOBS
 +
REMOTE_PREFIX =  # REMOTE_PREFIX : Set if cluster node see the directory differently (e.g. /net/mymachine/[original-dir])
 +
 +
    
== Software Components ==
 
== Software Components ==
Line 71: Line 211:  
* [[vcfCooker]]
 
* [[vcfCooker]]
 
* [[thunderVCF]]
 
* [[thunderVCF]]
 +
 +
 +
== Common Problems ==
 +
If you ran UMAKE without success, please double check these pre-requeists.
 +
 +
* Input region file (BED) should only contain chromosome 1, 2, ... 22, X, Y
 +
 +
* BAM index files (e.g. bams.index) should use absolute path for each BAM files
 +
 +
* Your working partition (where OUTPUT_ROOT specified) should support symbolic link. So Windows partition will not work.
 +
 +
* If UMAKE complains about resoures file (e.g. Cannot find HapMap SNPs or 1000 Genome indels), please check your UMAKE configuration file have correct set up resources files in the correct location. A common mistake is to use path HapMap/ while the resource folder is HapMap3/ .
    
== Acknowledgements ==
 
== Acknowledgements ==
   −
UMAKE is a result from collaborative effort by Hyun Min Kang, Goo Jun, Carlo Sidore, Yun Li, Paul Anderson, Mary Kate Trost, Wei Chen, Tom Blackwell, and Goncalo Abecasis. Please email to Hyun Min Kang [[mailto:hmkang@umich.edu| hmkang@umich.edu ]] for any questions.
+
UMAKE is a result from collaborative effort by Hyun Min Kang, Goo Jun, Carlo Sidore, Yun Li, Paul Anderson, Mary Kate Wing, Wei Chen, Tom Blackwell, and Goncalo Abecasis. Please email to Hyun Min Kang [[mailto:hmkang@umich.edu| hmkang@umich.edu ]] for any questions.
96

edits

Navigation menu