GotCloud: Variant Calling Pipeline

From Genome Analysis Wiki
Revision as of 13:48, 6 November 2012 by Mktrost

The Variant Calling Pipeline (UMAKE) takes recalibrated BAM files, detects SNPs, calls their genotypes, and produces VCF files.

Input Data:

  • Aligned/Processed/Recalibrated BAM files
  • Index file containing Sample IDs & BAM file names
  • Reference files
  • (Optional) Configuration file to override default options

BAM files

The BAM files need to be duplicate-marked and base-quality recalibrated in order to obtain high quality SNP calls.

FASTQs can be converted to this type of BAM using the Mapping Pipeline.


Additional input files may also be provided: pedigree files (PED format), which specify gender information for chrX calling, and target information (UCSC's BED format) for targeted or whole exome capture sequencing. The configuration file contains the core run-time options, including the software binaries and command-line arguments. Refer to the example configuration file for further information.

Index File

Each line of the index file describes one individual in the following format. Note that multiple BAMs per individual may be provided.

[SAMPLE_ID]    [COMMA SEPARATED POPULATION LABELS] [BAM_FILE1] [BAM_FILE2] ...

Columns:

  1. sample id
  2. comma separated population labels
  3. BAM File 1
  4. BAM File 2 (if applicable)
  ...
  N+2. BAM File N
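
The column layout above can be sketched as a small parser. This is only an illustration, not part of the pipeline: the names IndexEntry, parse_index_line and parse_index are hypothetical, and the sketch assumes that columns are separated by whitespace and that BAM paths contain no spaces.

```python
from typing import NamedTuple


class IndexEntry(NamedTuple):
    """One individual from the index file (hypothetical helper type)."""
    sample_id: str
    populations: list[str]
    bam_files: list[str]


def parse_index_line(line: str) -> IndexEntry:
    """Split one index line into sample ID, population labels, and BAM paths."""
    fields = line.split()  # assumption: whitespace-separated columns
    if len(fields) < 3:
        raise ValueError(f"expected at least 3 columns, got {len(fields)}: {line!r}")
    sample_id, labels, *bams = fields
    return IndexEntry(sample_id, labels.split(","), bams)


def parse_index(text: str) -> list[IndexEntry]:
    """Parse every non-blank line of an index file's contents."""
    return [parse_index_line(ln) for ln in text.splitlines() if ln.strip()]
```

Each returned entry keeps all BAM files of one individual together, which mirrors the note that multiple BAMs per individual may be provided.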

Reference Files

Reference files are required for variant calling.

See Configuration Files: Reference Files for information on how to specify the reference files in the configuration.


Configuration File

A default configuration file is automatically loaded. Users must also supply their own configuration file, which only needs to set the values that differ from the defaults.

Comments begin with a #

Format: KEY = value

Where KEY is the item being set and value is its new value
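
As a rough illustration of this format, the sketch below reads KEY = value lines and layers user values over defaults. It is not the pipeline's own parser: parse_config is a hypothetical name, the sketch treats everything after a # as a comment (so a # inside a value would be cut off), and it does not expand references such as $(UMAKE_ROOT).

```python
def parse_config(text: str) -> dict[str, str]:
    """Parse KEY = value lines, ignoring blank lines and # comments."""
    settings: dict[str, str] = {}
    for raw in text.splitlines():
        line = raw.split("#", 1)[0].strip()  # drop everything after '#'
        if not line:
            continue
        if "=" not in line:
            raise ValueError(f"not a KEY = value line: {raw!r}")
        key, value = line.split("=", 1)
        settings[key.strip()] = value.strip()
    return settings


# Hypothetical values, only to show user settings overriding defaults.
defaults = parse_config("OFFSET_OFF_TARGET = 0\nWRITE_TARGET_LOCI = FALSE\n")
user = parse_config("WRITE_TARGET_LOCI = TRUE  # exome run\n")
effective = {**defaults, **user}
```

Merging the two dictionaries in that order means a key set by the user replaces the default, while untouched defaults remain in effect.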


Required User Config Files Settings

The following Config File Settings must be specified by the user:

  • CHRS = # space separated list of chromosomes you want
  • BAM_INDEX = # path to the Index File of BAMs

Required on Command-Line or in Config File

The following Command-Line or Config File Settings must be specified by the user:

  • --outdir/OUTDIR= # path to desired output directory
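
A pre-flight check for the required settings could look like the sketch below. The function name missing_settings and its arguments are hypothetical; it assumes the configuration has already been read into a dictionary of strings and that the output directory may come either from --outdir or from OUTDIR.

```python
from typing import Mapping, Optional


def missing_settings(config: Mapping[str, str],
                     cli_outdir: Optional[str] = None) -> list[str]:
    """Return the names of required settings that have no non-empty value."""
    missing = [key for key in ("CHRS", "BAM_INDEX")
               if not config.get(key, "").strip()]
    # The output directory may be given on the command line or in the config.
    if not (cli_outdir or config.get("OUTDIR", "").strip()):
        missing.append("OUTDIR")
    return missing
```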

Targeted/Exome Sequencing Settings

If you are running Targeted/Exome Sequencing, you should specify:

  • Write loci file when performing pileup
    • WRITE_TARGET_LOCI = TRUE
  • Specify the directory to store target information, for example: targetDir
    • TARGET_DIR = targetDir

If all individuals have the same target:

  • Specify the single bed file, for example: target.bed
    • UNIFORM_TARGET_BED = target.bed

If not all individuals have the same target:

  • Specify the file containing the sample id -> bed map, for example: targetMap.txt
    • MULTIPLE_TARGET_MAP = targetMap.txt
      • Each line of the file contains [SM_ID] [TARGET_BED]

Optional Settings:

  • Extend the target region by a given number of bases, for example: 50
    • OFFSET_OFF_TARGET = 50
  • Exclude off-target regions when using samtools view (may make command line too long)
    • SAMTOOLS_VIEW_TARGET_ONLY = TRUE
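
To make the idea of extending a target region concrete, the sketch below widens a single BED interval. It is an assumption about what the extension means, not a description of how the pipeline implements OFFSET_OFF_TARGET: it assumes the offset is applied on both sides, clamps the start at 0, and does not clamp the end to the chromosome length. The function name extend_interval is hypothetical.

```python
def extend_interval(start: int, end: int, offset: int) -> tuple[int, int]:
    """Widen a 0-based, half-open BED interval by `offset` bases on each side."""
    if offset < 0:
        raise ValueError("offset must be non-negative")
    # Assumption: symmetric extension, with the start never going below 0.
    return max(0, start - offset), end + offset
```

For example, under these assumptions an interval from 100 to 200 with an offset of 50 would cover 50 to 250.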


Reference Files

  • Reference Sequence in fasta format.
    • REF = path/file.fa
  • Indel VCF File Prefix
    • INDEL_PREFIX = path/indels.sites.hg19
    • path/ contains indels.sites.hg19.chr20.vcf for each chromosome being processed
  • DBSNP File Prefix
    • DBSNP_PREFIX = path/dbsnp_135_b37.rod
    • path/ contains dbsnp_135_b37.rod.chr20.map for each chromosome being processed
  • HapMap3 polymorphic site prefix
    • HM3_PREFIX = path/hapmap3.qc.poly
    • path/ contains hapmap3.qc.poly.chr20.bim & hapmap3.qc.poly.chr20.frq for each chromosome being processed

These reference files can be downloaded from: FTP Download of Full Resource Files

INDEL_PREFIX = $(UMAKE_ROOT)/ref/indels/1kg.pilot_release.merged.indels.sites.hg19  # 1000 Genomes Pilot 1 indel VCF prefix
DBSNP_PREFIX = $(UMAKE_ROOT)/ref/dbSNP/dbsnp_135_b37.rod  # dbSNP file prefix
HM3_PREFIX = $(UMAKE_ROOT)/ref/HapMap3/hapmap3_r3_b37_fwd.consensus.qc.poly  # HapMap3 polymorphic site prefix
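
The per-chromosome naming convention in this section can be sketched as a helper that lists the files expected for one chromosome and reports any that are absent. The function names are hypothetical, and the sketch assumes that chromosomes are named without the "chr" prefix (for example 20) and that any $(UMAKE_ROOT) reference in a prefix has already been replaced by a real path.

```python
import os


def expected_reference_files(chrom: str, indel_prefix: str,
                             dbsnp_prefix: str, hm3_prefix: str) -> list[str]:
    """List the per-chromosome files implied by the three prefixes."""
    return [
        f"{indel_prefix}.chr{chrom}.vcf",
        f"{dbsnp_prefix}.chr{chrom}.map",
        f"{hm3_prefix}.chr{chrom}.bim",
        f"{hm3_prefix}.chr{chrom}.frq",
    ]


def missing_reference_files(chroms: list[str], indel_prefix: str,
                            dbsnp_prefix: str, hm3_prefix: str) -> list[str]:
    """Return the expected files that do not exist on disk."""
    return [
        path
        for chrom in chroms
        for path in expected_reference_files(chrom, indel_prefix,
                                             dbsnp_prefix, hm3_prefix)
        if not os.path.isfile(path)
    ]
```

Running such a check over the chromosomes listed in CHRS before starting a run may catch a mistyped prefix early.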

Chromosome X Calling

  • PED_INDEX = pedfile.ped


Running

Running umake is straightforward:

cd ~/myseq
/usr/local/biopipe/bin/umake --conf myconf ???
make -f [out-prefix].Makefile -j [# parallel jobs]