GotCloud: Variant Calling Pipeline
Revision as of 17:30, 21 March 2013
The Variant Calling Pipeline (previously called 'UMAKE') makes genotype calls from recalibrated BAM files. These genotype calls are output into VCF (Variant Call Format) files.
Running the GotCloud Variant Calling Pipeline
The variant calling pipeline (umake) is run using gotcloud snpcall and gotcloud ldrefine. gotcloud is found under gotcloud/.
Running the Automatic Test
The automatic test runs the variant calling pipeline on a small test set and compares the output against expected results to validate that GotCloud is installed correctly.
- Run variant calling pipeline test:
gotcloud snpcall --test OUTPUT_DIR
where OUTPUT_DIR is the directory where you want to store the test results
If you see "Successfully ran the test case, congratulations!", then you are ready to run variant calling on your samples.
Overview of Variant Calling Pipeline Steps
Here is an overview of the Variant Calling Pipeline:
Input Data
- Aligned/Processed/Recalibrated BAM files
- Index file containing Sample IDs & BAM file names
- Reference files
- (Optional) Configuration file to override default options
BAM files
The BAM files need to be duplicate-marked and base-quality recalibrated in order to obtain high quality SNP calls. Generating these BAM files from original FASTQs is documented elsewhere as part of the Alignment Pipeline of gotCloud.
Index File
Each line of the index file describes one individual in the following format. Note that multiple BAMs per individual may be provided.
[SAMPLE_ID] [COMMA SEPARATED POPULATION LABELS] [BAM_FILE1] [BAM_FILE2] ...
Columns:
- sample id
- comma separated population labels
- BAM File 1 (preferable to have full paths to BAM files)
- BAM File 2 (if applicable)
- ...
- BAM File N
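As a rough illustration of the column layout above, the following Python sketch splits one index line into its sample ID, population labels, and BAM paths. The helper name parse_index_line, the IndexEntry container, the sample values, and the choice to split on any whitespace are assumptions made for this example; GotCloud's own parser may be stricter about the delimiter.

```python
from typing import List, NamedTuple


class IndexEntry(NamedTuple):
    """One line of the BAM index file (hypothetical container for this sketch)."""
    sample_id: str
    populations: List[str]
    bam_files: List[str]


def parse_index_line(line: str) -> IndexEntry:
    """Split an index line into sample ID, population labels, and BAM paths."""
    # Assumption: columns are separated by whitespace (tabs or spaces).
    fields = line.split()
    if len(fields) < 3:
        raise ValueError("expected sample ID, population labels, and at least one BAM")
    sample_id, labels, *bam_files = fields
    return IndexEntry(sample_id, labels.split(","), bam_files)


# Made-up sample with two BAM files.
entry = parse_index_line(
    "Sample1 ALL,EUR /data/bams/Sample1.a.bam /data/bams/Sample1.b.bam"
)
print(entry.sample_id, entry.populations, entry.bam_files)
```

Using full paths for the BAM files, as recommended above, keeps the parsed entries valid regardless of the directory the pipeline is started from.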
Reference Files
The variant calling pipeline requires multiple reference files in order to work correctly.
- Reference Sequence in fasta format.
  - Configuration File Setting: REF = path/file.fa
- Indel VCF File Prefix
  - Configuration File Setting: INDEL_PREFIX = path/indels.sites.hg19
  - path/ contains indels.sites.hg19.chr20.vcf for each chromosome being processed
- DBSNP File Prefix
  - Configuration File Setting: DBSNP_PREFIX = path/dbsnp_135_b37.rod
  - path/ contains dbsnp_135_b37.rod.chr20.map for each chromosome being processed
- HapMap3 polymorphic site prefix
  - Configuration File Setting: HM3_PREFIX = path/hapmap3.qc.poly
  - path/ contains hapmap3.qc.poly.chr20.bim & hapmap3.qc.poly.chr20.frq for each chromosome being processed
A set of reference files can be downloaded from: FTP Download of Full Resource Files
Configuration File Example Reference Settings:
REF = path/file.fa
INDEL_PREFIX = path/indels.sites.hg19
DBSNP_PREFIX = path/dbsnp_135_b37.rod
HM3_PREFIX = path/hapmap3.qc.poly
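To make the prefix convention concrete, here is a small Python sketch that expands the reference settings into the per-chromosome file names described above. The function name expected_reference_files is hypothetical, and the sketch assumes chromosomes are named without the "chr" prefix (for example 20), with file names following the chr20 examples in this section.

```python
from typing import Dict, List

# File suffixes per prefix setting, taken from the chr20 examples in this section.
REFERENCE_SUFFIXES = {
    "INDEL_PREFIX": [".vcf"],
    "DBSNP_PREFIX": [".map"],
    "HM3_PREFIX": [".bim", ".frq"],
}


def expected_reference_files(settings: Dict[str, str], chromosomes: List[str]) -> List[str]:
    """List the reference files implied by the settings for the given chromosomes."""
    files = [settings["REF"]]
    for key, suffixes in REFERENCE_SUFFIXES.items():
        for chrom in chromosomes:
            for suffix in suffixes:
                # Assumption: names follow <prefix>.chr<chromosome><suffix>.
                files.append(f"{settings[key]}.chr{chrom}{suffix}")
    return files


reference_settings = {
    "REF": "path/file.fa",
    "INDEL_PREFIX": "path/indels.sites.hg19",
    "DBSNP_PREFIX": "path/dbsnp_135_b37.rod",
    "HM3_PREFIX": "path/hapmap3.qc.poly",
}
for name in expected_reference_files(reference_settings, ["20"]):
    print(name)
```

A list like this could be checked against the file system before starting a run, so that a missing per-chromosome file is noticed early.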
Configuration File
The configuration file contains the run-time options, including the software binaries and command-line arguments. A default configuration file is loaded automatically. Users must supply their own configuration file that specifies only the values that differ from the defaults.
Comments begin with a #
Format: KEY = value
Where KEY is the item being set and value is its new value
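The KEY = value format can be read with a few lines of Python. This is only a sketch of the format as described here, not GotCloud's actual parser: the function name parse_config, the example values, and the rule that a # starts a comment anywhere on a line are assumptions.

```python
from typing import Dict


def parse_config(text: str) -> Dict[str, str]:
    """Read KEY = value lines, skipping blank lines and # comments."""
    settings: Dict[str, str] = {}
    for raw_line in text.splitlines():
        # Assumption: everything from the first '#' onward is a comment.
        line = raw_line.split("#", 1)[0].strip()
        if not line:
            continue
        key, separator, value = line.partition("=")
        if not separator:
            raise ValueError(f"not a KEY = value line: {raw_line!r}")
        settings[key.strip()] = value.strip()
    return settings


# Hypothetical user configuration holding only non-default values.
example_conf = """\
# settings for a small run
CHRS = 20 21
BAM_INDEX = path/bams.index
OUT_DIR = path/out
"""
user_settings = parse_config(example_conf)
print(user_settings)
```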
Required User Config Files Settings
The following Config File Settings must be specified by the user:
- CHRS = space-separated list of the chromosomes you want to process
- BAM_INDEX = path to the Index File of BAMs
Required on Command-Line or in Config File
The following Command-Line or Config File Settings must be specified by the user:
- --outdir (command line) or OUT_DIR (config file) = path to desired output directory
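The required settings can be checked before launching a run. The sketch below assumes the configuration has already been read into a dictionary of strings; the function name missing_settings and the example values are hypothetical.

```python
from typing import Dict, List, Optional

# Settings that must appear in the user's configuration file.
REQUIRED_IN_CONFIG = ("CHRS", "BAM_INDEX")


def missing_settings(settings: Dict[str, str], outdir_option: Optional[str] = None) -> List[str]:
    """Return the required settings that are absent or empty."""
    missing = [key for key in REQUIRED_IN_CONFIG if not settings.get(key)]
    # The output directory may come from OUT_DIR or from --outdir.
    if not settings.get("OUT_DIR") and not outdir_option:
        missing.append("OUT_DIR (or --outdir)")
    return missing


complete = {"CHRS": "20", "BAM_INDEX": "path/bams.index"}
print(missing_settings(complete, outdir_option="path/out"))
print(missing_settings({"CHRS": "20"}))
```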
Targeted/Exome Sequencing Settings
If you are running targeted/exome sequencing, you should specify:
- Write loci file when performing pileup
  - WRITE_TARGET_LOCI = TRUE
- Specify the output sub-directory to store target information, for example: targetDir
  - Should not be a full path, as this will go under the OUT_DIR directory.
  - TARGET_DIR = targetDir
If all individuals have the same target:
- Specify the single bed file, for example: target.bed
  - UNIFORM_TARGET_BED = target.bed
If not all individuals have the same target:
- Specify the file containing the sample id -> bed map, for example: targetMap.txt
  - MULTIPLE_TARGET_MAP = targetMap.txt
  - Each line of the file contains [SM_ID] [TARGET_BED]
Optional Settings:
- Extend the target region by a given number of bases, for example: 50
  - OFFSET_OFF_TARGET = 50
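The targeted/exome settings above can be assembled programmatically. The following Python sketch is one possible way to do so; the function name targeted_settings and its arguments are hypothetical, and treating the uniform bed file and the target map as mutually exclusive is an assumption drawn from the two cases listed above.

```python
import posixpath
from typing import Dict, Optional


def targeted_settings(
    target_dir: str,
    uniform_bed: Optional[str] = None,
    target_map: Optional[str] = None,
    offset: Optional[int] = None,
) -> Dict[str, str]:
    """Build the config settings for a targeted/exome run."""
    if posixpath.isabs(target_dir):
        raise ValueError("TARGET_DIR must be relative; it is placed under OUT_DIR")
    # Assumption: exactly one of the two target descriptions is given.
    if (uniform_bed is None) == (target_map is None):
        raise ValueError("give exactly one of uniform_bed or target_map")
    settings = {"WRITE_TARGET_LOCI": "TRUE", "TARGET_DIR": target_dir}
    if uniform_bed is not None:
        settings["UNIFORM_TARGET_BED"] = uniform_bed
    else:
        settings["MULTIPLE_TARGET_MAP"] = target_map
    if offset is not None:
        settings["OFFSET_OFF_TARGET"] = str(offset)
    return settings


# Print the settings in KEY = value form, ready to paste into a config file.
for key, value in targeted_settings("targetDir", uniform_bed="target.bed", offset=50).items():
    print(f"{key} = {value}")
```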
Configure Reference Files
See Reference Files for information on how to specify the reference files.
Chromosome X Calling
- PED_INDEX = pedfile.ped
Running
Running variant calling is straightforward:
gotcloud snpcall --conf vc.conf --numjobs 2
gotcloud ldrefine --conf vc.conf --numjobs 2
Replace vc.conf with the appropriate path/name of your configuration file.
If OUT_DIR is not defined in the configuration file, add --outdir followed by the path to your desired output directory.
Update the value following --numjobs to the number of jobs to run in parallel.
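For scripted runs, the two commands can be generated rather than typed. The Python sketch below only builds the argument lists, using the options named in this section (--conf, --numjobs, --outdir); the function name build_commands is hypothetical.

```python
import shlex
from typing import List, Optional


def build_commands(conf: str, numjobs: int, outdir: Optional[str] = None) -> List[List[str]]:
    """Return the snpcall and ldrefine argument lists, in run order."""
    commands = []
    for step in ("snpcall", "ldrefine"):
        argv = ["gotcloud", step, "--conf", conf, "--numjobs", str(numjobs)]
        if outdir is not None:
            # Only needed when OUT_DIR is not set in the configuration file.
            argv += ["--outdir", outdir]
        commands.append(argv)
    return commands


for argv in build_commands("vc.conf", 2):
    print(shlex.join(argv))
```

To actually run the steps, each argument list could be passed to subprocess.run(argv, check=True), which raises an error at the first failing step so that ldrefine is not started after a failed snpcall.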
Running on a Cluster
To run on a cluster, add the following settings to the configuration file:
TODO: COMING SOON
SLEEP_MULT = 20
REMOTE_PREFIX = # REMOTE_PREFIX : Set if cluster nodes see the directory differently (e.g. /net/mymachine/[original-dir])
Results
If there is a failure, you should see a message like:
make: *** [...] Error 1
Where ... is filled in with other text indicating what step failed.
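When the pipeline output is captured to a log file, failure lines of this form can be found automatically. The Python sketch below assumes the message layout shown above; the function name find_make_failures and the sample log text (including the target path) are made up for illustration.

```python
import re
from typing import List, Tuple

# Matches lines such as: make: *** [some/target] Error 1
MAKE_ERROR = re.compile(r"^make: \*\*\* \[(?P<target>.*)\] Error (?P<code>\d+)$")


def find_make_failures(log_text: str) -> List[Tuple[str, int]]:
    """Return (failed target, exit code) pairs found in the log text."""
    failures = []
    for line in log_text.splitlines():
        match = MAKE_ERROR.match(line.strip())
        if match:
            failures.append((match.group("target"), int(match.group("code"))))
    return failures


# Hypothetical log excerpt; the target path is invented for this example.
sample_log = """\
Processing chr20...
make: *** [out/vcfs/chr20/chr20.merged.vcf.OK] Error 1
"""
failures = find_make_failures(sample_log)
print(failures)
```

The text inside the brackets names the step that failed, which is usually the quickest pointer to the relevant log or intermediate file.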
On SNP Call success, you should see the following output sub-directories under your output directory:
- glfs with a bams & samples subdirectory
- pvcfs with a subdirectory per chromosome and then per region
- split with a subdirectory per chromosome
- vcfs with a subdirectory per chromosome
- (optionally your target directory)
Under the vcfs/chrXX directory, there should be:
- chrXX.filtered.sites.vcf
- chrXX.filtered.sites.vcf.log
- chrXX.filtered.sites.vcf.summary
- chrXX.filtered.vcf.gz
- chrXX.filtered.vcf.gz.OK
- chrXX.filtered.vcf.gz.tbi
- chrXX.merged.sites.vcf
- chrXX.merged.stats.vcf
- chrXX.merged.vcf
- chrXX.merged.vcf.OK
The merged.vcf file combines the VCFs of the separate regions of a chromosome into a single file.
The filtered files are the merged.vcf after it has been run through the filters, with sites marked PASS/FAIL.
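The per-chromosome file list above lends itself to a quick completeness check after a run. The Python sketch below is one way to do that; the function name missing_vcf_outputs is hypothetical, and it assumes the files sit in a vcfs/chrXX directory under the output directory (matching the vcfs entry in the subdirectory list) with chromosomes named without the "chr" prefix.

```python
import tempfile
from pathlib import Path
from typing import List

# File name endings from the list above, after the "chrXX." part.
VCF_SUFFIXES = [
    "filtered.sites.vcf",
    "filtered.sites.vcf.log",
    "filtered.sites.vcf.summary",
    "filtered.vcf.gz",
    "filtered.vcf.gz.OK",
    "filtered.vcf.gz.tbi",
    "merged.sites.vcf",
    "merged.stats.vcf",
    "merged.vcf",
    "merged.vcf.OK",
]


def missing_vcf_outputs(out_dir: str, chrom: str) -> List[str]:
    """Return the expected per-chromosome files that are not present."""
    # Assumption: outputs live in <out_dir>/vcfs/chr<chrom>/.
    chrom_dir = Path(out_dir) / "vcfs" / f"chr{chrom}"
    expected = [f"chr{chrom}.{suffix}" for suffix in VCF_SUFFIXES]
    return [name for name in expected if not (chrom_dir / name).is_file()]


# Demonstration on a temporary directory holding only one of the files.
with tempfile.TemporaryDirectory() as demo_out_dir:
    demo_chrom_dir = Path(demo_out_dir) / "vcfs" / "chr20"
    demo_chrom_dir.mkdir(parents=True)
    (demo_chrom_dir / "chr20.merged.vcf").touch()
    demo_missing = missing_vcf_outputs(demo_out_dir, "20")
print(len(demo_missing), "expected files are missing")
```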
Under the split/chrXX directory, there should be:
- chrXX.filtered.PASS.split.[N].vcf.gz
- chrXX.filtered.PASS.split.err
- chrXX.filtered.PASS.split.vcflist
- chrXX.filtered.PASS.gz
- subset.OK