GotCloud: Variant Calling Pipeline

From Genome Analysis Wiki
Revision as of 00:50, 6 November 2012 by Mktrost (talk | contribs)


The Variant Calling Pipeline (UMAKE) takes recalibrated BAM files, detects SNPs, calls their genotypes, and produces VCF files.

Input Data:

  • Aligned/Processed/Recalibrated BAM files
  • Index file containing Sample IDs & BAM file names
  • Reference files
  • (Optional) Configuration file to override default options

BAM files

Index file

Reference Files

Configuration File

A default configuration file is loaded automatically. Users supply their own configuration file containing only the values that differ from the defaults.

Comments begin with a # and may follow a setting on the same line.

Format: KEY = value

where KEY is the item being set and value is its new value.
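The format above can be illustrated with a minimal Python sketch of a reader for KEY = value lines. This is not umake's own parser; it assumes a # comment runs to the end of the line and that a later setting overrides an earlier one. The file name bams.index is a hypothetical placeholder.

```python
def parse_config(text):
    """Parse KEY = value lines; '#' is assumed to start a comment that runs to end of line."""
    settings = {}
    for raw in text.splitlines():
        line = raw.split("#", 1)[0].strip()  # drop the comment and surrounding spaces
        if not line:
            continue  # blank or comment-only line
        if "=" not in line:
            raise ValueError(f"expected KEY = value, got: {raw!r}")
        key, value = line.split("=", 1)  # split on the first '=' only
        settings[key.strip()] = value.strip()
    return settings


# Hypothetical user config holding only the values that differ from the defaults.
example = """
# my settings
CHRS = 20 22            # space separated list of chromosomes
BAM_INDEX = bams.index  # hypothetical index file name
"""
print(parse_config(example))
```

Values are kept as plain strings, so a list such as CHRS still has to be split on whitespace by the caller.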


Required User Config File Settings

The following Config File Settings must be specified by the user:

  • CHRS = # space separated list of chromosomes you want
  • BAM_INDEX = # path to the Index File of BAMs

Required on Command-Line or in Config File

The following setting must be specified either on the command line or in the config file:

  • --outdir / OUTDIR = # path to desired output directory
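As a pre-flight check, the required settings listed above can be verified before the pipeline is started. The sketch below is only an illustration: it assumes the settings have already been read into a dict, and the function name and the outdir_arg parameter are hypothetical.

```python
# Settings that must appear in the user's config file.
REQUIRED_IN_CONFIG = ("CHRS", "BAM_INDEX")


def missing_settings(config, outdir_arg=None):
    """Return the names of required settings that are absent or empty."""
    missing = [key for key in REQUIRED_IN_CONFIG if not config.get(key)]
    # The output directory may come from --outdir or from OUTDIR in the config.
    if not (outdir_arg or config.get("OUTDIR")):
        missing.append("OUTDIR")
    return missing


config = {"CHRS": "20", "BAM_INDEX": "bams.index"}  # hypothetical values
print(missing_settings(config))                     # OUTDIR not given anywhere
print(missing_settings(config, outdir_arg="out"))   # OUTDIR given on the command line
```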

Targeted/Exome Sequencing Settings

If you are running Targeted/Exome Sequencing, specify:

  • Write loci file when performing pileup
    • WRITE_TARGET_LOCI = TRUE
  • Specify the directory to store target information, for example: targetDir
    • TARGET_DIR = targetDir

If all individuals have the same target:

  • Specify the single bed file, for example: target.bed
    • UNIFORM_TARGET_BED = target.bed

If not all individuals have the same target:

  • Specify the file containing the sample id -> bed map, for example: targetMap.txt
    • MULTIPLE_TARGET_MAP = targetMap.txt
      • Each line of the file contains [SM_ID] [TARGET_BED]
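The sample-to-BED map can be sanity-checked before a run. The following sketch assumes the two fields on each line are separated by whitespace (the page does not state the separator); the sample IDs and BED file names are made up for illustration.

```python
def parse_target_map(text):
    """Map each sample ID to its target BED file, one [SM_ID] [TARGET_BED] pair per line."""
    targets = {}
    for number, line in enumerate(text.splitlines(), start=1):
        fields = line.split()  # assumption: whitespace-separated fields
        if not fields:
            continue  # skip blank lines
        if len(fields) != 2:
            raise ValueError(f"line {number}: expected [SM_ID] [TARGET_BED]")
        sample_id, bed = fields
        targets[sample_id] = bed
    return targets


# Hypothetical contents of targetMap.txt.
example_map = "sampleA targetA.bed\nsampleB targetB.bed\n"
print(parse_target_map(example_map))
```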

Optional Settings:

  • Extend the target region by a given number of bases, for example: 50
    • OFFSET_OFF_TARGET = 50
  • Exclude off-target regions when using samtools view (may make command line too long)
    • SAMTOOLS_VIEW_TARGET_ONLY = TRUE
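To make the effect of OFFSET_OFF_TARGET concrete, here is one plausible reading sketched in Python: each BED interval is widened by the offset on both sides. Whether umake extends both ends, and how it treats chromosome boundaries, is an assumption here and is not confirmed by this page.

```python
def extend_interval(start, end, offset):
    """Widen a 0-based, half-open BED interval by `offset` bases on each side."""
    if offset < 0:
        raise ValueError("offset must be non-negative")
    # Clamp at 0; the chromosome end is not known here, so it is not clamped.
    return max(0, start - offset), end + offset


def extend_bed_line(line, offset):
    """Apply extend_interval to one tab-separated BED line, keeping any extra columns."""
    chrom, start, end, *rest = line.rstrip("\n").split("\t")
    new_start, new_end = extend_interval(int(start), int(end), offset)
    return "\t".join([chrom, str(new_start), str(new_end), *rest])


print(extend_bed_line("20\t100\t200", 50))
print(extend_bed_line("20\t20\t80\texon1", 50))
```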


Reference Files

  • Reference Sequence in fasta format.
    • REF = path/file.fa
  • Indel VCF File Prefix
    • INDEL_PREFIX = path/indels.sites.hg19
    • path/ contains indels.sites.hg19.chr20.vcf for each chromosome being processed
  • DBSNP File Prefix
    • DBSNP_PREFIX = path/dbsnp_135_b37.rod
    • path/ contains dbsnp_135_b37.rod.chr20.map for each chromosome being processed
  • HapMap3 polymorphic site prefix
    • HM3_PREFIX = path/hapmap3.qc.poly
    • path/ contains hapmap3.qc.poly.chr20.bim & hapmap3.qc.poly.chr20.frq for each chromosome being processed

Can be downloaded from: FTP Download of Full Resource Files

INDEL_PREFIX = $(UMAKE_ROOT)/ref/indels/1kg.pilot_release.merged.indels.sites.hg19 # 1000 Genomes Pilot 1 indel VCF prefix
DBSNP_PREFIX = $(UMAKE_ROOT)/ref/dbSNP/dbsnp_135_b37.rod # dbSNP file prefix
HM3_PREFIX = $(UMAKE_ROOT)/ref/HapMap3/hapmap3_r3_b37_fwd.consensus.qc.poly # HapMap3 polymorphic site prefix
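The per-chromosome naming described for the reference file prefixes can be sketched as follows. The suffixes come from the examples in the list above; the helper name is hypothetical, and the sketch assumes chromosome names are written without a leading "chr" (as in chr20 for chromosome 20).

```python
# File suffixes expected after "<prefix>.chr<N>" for each prefix setting.
SUFFIXES = {
    "INDEL_PREFIX": (".vcf",),
    "DBSNP_PREFIX": (".map",),
    "HM3_PREFIX": (".bim", ".frq"),
}


def expected_files(config, chroms):
    """List the per-chromosome reference files implied by the prefix settings."""
    files = []
    for key, suffixes in SUFFIXES.items():
        prefix = config.get(key)
        if prefix is None:
            continue  # setting not overridden in this config
        for chrom in chroms:
            for suffix in suffixes:
                files.append(f"{prefix}.chr{chrom}{suffix}")
    return files


config = {
    "INDEL_PREFIX": "path/indels.sites.hg19",
    "DBSNP_PREFIX": "path/dbsnp_135_b37.rod",
    "HM3_PREFIX": "path/hapmap3.qc.poly",
}
for name in expected_files(config, ["20"]):
    print(name)
```

Passing each returned name to os.path.exists before a run would catch a missing chromosome file early.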

Chromosome X Calling

  • PED_INDEX = pedfile.ped


Running

Running umake is straightforward:

cd ~/myseq
/usr/local/biopipe/bin/umake --conf myconf ???
make -f [out-prefix].Makefile -j [# parallel jobs]
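When the make step is driven from a script, the command line above can be assembled programmatically. This Python sketch only builds the argument list; the prefix "myout" and the job count are placeholder values, and the function name is hypothetical.

```python
def make_command(out_prefix, jobs):
    """Build the argument list for: make -f [out-prefix].Makefile -j [# parallel jobs]."""
    if jobs < 1:
        raise ValueError("jobs must be at least 1")
    return ["make", "-f", f"{out_prefix}.Makefile", "-j", str(jobs)]


# Placeholder values; substitute the prefix umake actually wrote.
command = make_command("myout", 4)
print(" ".join(command))
```

The list form can be handed directly to subprocess.run(command, check=True), which avoids shell quoting issues.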