Revision as of 09:43, 4 April 2013

Back to parent: GotCloud

The Variant Calling Pipeline (previously called 'UMAKE') makes genotype calls from recalibrated BAM files. These genotype calls are output into VCF (Variant Call Format) files.

Running the GotCloud Variant Calling Pipeline

The variant calling pipeline (umake) is run using gotcloud snpcall and gotcloud ldrefine. gotcloud is found under gotcloud/.

Running the Automatic Test

The automatic test runs the variant calling pipeline on a small testset and checks the results against expected results validating that GotCloud is installed correctly.

Run variant calling pipeline test:

gotcloud snpcall --test OUTPUT_DIR

where OUTPUT_DIR is the directory where you want to store the test results

If you see "Successfully ran the test case, congratulations!", then you are ready to align samples.

Overview of Variant Calling Pipeline Steps

Here is an overview of the Variant Calling Pipeline:

Input Data

Aligned/Processed/Recalibrated BAM files
Index file containing Sample IDs & BAM file names
Reference files
(Optional) Configuration file to override default options

BAM files

The BAM files need to be duplicate-marked and base-quality recalibrated in order to obtain high quality SNP calls. Generating these BAM files from original FASTQs is documented elsewhere as part of the Alignment Pipeline of gotCloud.

Index File

Each line of the index file represents each individual under the following format. Note that multiple BAMs per individual may be provided. Note that if all samples are from the same population, just specify "ALL" for the population label for each sample.

[SAMPLE_ID]    [COMMA SEPARATED POPULATION LABELS] [BAM_FILE1] [BAM_FILE2] ...

Columns:

sample id
comma separated population labels
BAM File 1 (preferable to have full paths to BAM files)
BAM File 2 (if applicable)

...

# BAM File N

Reference Files

The variant calling pipeline requires multiple reference files in order to work correctly.

Reference Sequence in fasta format.
- Configuration File Setting: REF = path/file.fa
Indel VCF File Prefix
- Configuration File Setting: INDEL_PREFIX = path/indels.sites.hg19
- path/ contains indels.sites.hg19.chr20.vcf for each chromosome being processed
DBSNP File vcf.gz file (must be indexed with tabix)
- Configuration File Setting: DBSNP_VCF = path/dbsnp_135.b37.vcf.gz
- path/ contains dbsnp_135_b37.rod.chr20.map for each chromosome being processed
HapMap3 polymorphic site vcf.gz file (must be indexed with tabix)
- Configuration File Setting: HM3_VCF = path/hapmap_3.3.b37.sites.vcf.gz

A set of reference files can be downloaded from: [| FTP Download of Full Resource Files]
Configuration File Example Reference Settings:

REF = path/file.fa
INDEL_PREFIX = path/indels.sites.hg19
DBSNP_VCF = path/dbsnp_135_b37.vcf.gz
HM3_VCF = path/hapmap_3.3.b37.sites.vcf.gz

Configuration File

Configuration file contains the run-time options including the software binaries and command line arguments.  A default configuration file is automatically loaded.  Users must specify their own configuration file specifying just the values different than the defaults.
Comments begin with a #
Format: KEY = value
Where KEY is the item being set and value is its new value

Required User Config Files Settings

The following Config File Settings must be specified by the user:

CHRS = space separated list of chromosomes you want
BAM_INDEX = path to the Index File of BAMs

Required on Command-Line or in Config File

The following Command-Line or Config File Settings must be specified by the user:

--outdir/OUT_DIR= path to desired output directoryTargeted/Exome Sequencing Settings

If you are running Targeted/Exome Sequencing, the user should specify:

Write loci file when performing pileup
WRITE_TARGET_LOCI = TRUE
Specify the output sub-directory to store target information, for example: targetDir
Should not be a full path as this will co under the OUT_DIR directory.
TARGET_DIR = targetDir

If all individuals have the same target:

Specify the single bed file, for example: target.bed
UNIFORM_TARGET_BED = target.bed

If not all individuals have the same target:

Specify the file containing the sample id -> bed map, for example: targetMap.txt
MULTIPLE_TARGET_MAP = targetMap.txt
Each line of the file contains [SM_ID] [TARGET_BED]

Optional Settings:

Extend the target region by a given number of bases, for example: 50
OFFSET_OFF_TARGET = 50

Configure Reference Files

See  Reference Files for information on how to specify the reference files.

Chromosome X Calling

Making calls on the X chromosome requires the user to specifty a PED file with sex information.

PED_INDEX = pedfile.pedExample Configuration File

Example configuration file where reference files happen to be stored in /path/reference, and bam index file in path/freeze5

CHRS = 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22
BAM_INDEX = /path/freeze5/freeze5.bam.index  ### The BAM index file described above
OUT_DIR = /path/freeze5/output               ### Directory in which to put all gotcloud output
REF = /path/reference/hs37d5.fa              ### Reference sequence
INDEL_PREFIX = /path/reference/1kg.pilot_release.merged.indels.sites.hg19   ### Known indel sites
HM3_VCF = /path/reference/hapmap3_r3_b37.sites.vcf.gz    ### HapMap variants (requires tabix index file in same directory)
DBSNP_VCF = /path/reference/dbsnp_135.b37.sites.vcf.gz   ### dbSNP variants (requires tabix index file in same directory)

Running

Running variant calling is straightforward:

gotcloud snpcall --conf vc.conf --numjobs 2
gotcloud ldrefine --conf vc.conf --numjobs 2


Replace vc.conf with the approprate path/name of the user's configuration file.
If OUT_DIR is not defined in the configuration file, add --outdir followed by the path to the user's desired output directory.
Update the value following --numjobs to the appropriate number of jobs to be run in parallel.

Running on a Cluster

To run on the Cluster, the following settings need to be added to the configuration file:
<--- Following may need revision --->
TODO: COMING SOON

SLEEP_MULT =     20
REMOTE_PREFIX =  # REMOTE_PREFIX : Set if cluster node see the directory differently (e.g. /net/mymachine/[original-dir])

<--- End: Following may need revision --->


Here's the same configuration file we used above but now made to run on a cluster computer with MOSIX.

== Example Configuration File ==
CHRS = 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22
BAM_INDEX = /path/freeze5/freeze5.bam.index 
OUT_DIR = /path/freeze5/output              
REF = /path/reference/hs37d5.fa             
INDEL_PREFIX = /path/reference/1kg.pilot_release.merged.indels.sites.hg19
HM3_VCF = /path/reference/hapmap3_r3_b37.sites.vcf.gz
DBSNP_VCF = /path/reference/dbsnp_135.b37.sites.vcf.gz
BATCH_TYPE = mosix             ### Specify MOSIX as the batch system
BATCH_OPTS = -j10,11,12,13     ### Specify available MOSIX compute nodes

Results

If there is a failure, you should see a message like:

make: *** [...] Error 1

Where ... is filled in with other text indicating what step failed.
On SNP Call success, you should see the following output sub-directories under your output directory:

glfs with a bams & samples subdirectory
pvcfs with a subdirectory per chromosome and then per region
split with a subdirectory per chromosome
vcfs with a subdirectory per chromosome
(optionally your target directory)

Under the vcf/chrXX directory, there should be:

chrXX.filtered.sites.vcf
chrXX.filtered.sites.vcf.log
chrXX.filtered.sites.vcf.summary
chrXX.filtered.vcf.gz
chrXX.filtered.vcf.gz.OK
chrXX.filtered.vcf.gz.tbi
chrXX.merged.sites.vcf
chrXX.merged.stats.vcf
chrXX.merged.vcf
chrXX.merged.vcf.OK

The .merged.vcf is the merged together versions of the separate regions in the same chromosome.
The filtered is the merged.vcf after it has been run through filters and is marked with PASS/FAIL.
Under the split/chrXX directory, there should be:

chrXX.filtered.PASS.split.[N].vcf.gz
chrXX.filtered.PASS.split.err
chrXX.filtered.PASS.split.vcflist
chrXX.filtered.PASS.gz
subset.OK

Difference between revisions of "GotCloud: Variant Calling Pipeline"

Revision as of 09:43, 4 April 2013

Contents

Running the GotCloud Variant Calling Pipeline

Running the Automatic Test

Overview of Variant Calling Pipeline Steps

Input Data

BAM files

Index File

Reference Files

Configuration File

Required User Config Files Settings

Required on Command-Line or in Config File

Targeted/Exome Sequencing Settings

Configure Reference Files

Chromosome X Calling

Example Configuration File

Running

Running on a Cluster

Results

Navigation menu

Page actions

Page actions

Personal tools

quick links

teaching

Navigation

Search

Tools