Line 1: |
Line 1: |
− | Back to the beginning [http://genome.sph.umich.edu/wiki/Pipelines]
| |
| | | |
− | The Variant Calling Pipeline (UMAKE) takes recalibrated BAM files and detects SNPs and calls their genotypes, producing VCF files.
| + | Back to parent: [[GotCloud]] |
| | | |
| + | The Variant Calling Pipeline (previously called 'UMAKE') makes genotype calls from recalibrated BAM files. These genotype calls are output into [http://www.1000genomes.org/wiki/Analysis/Variant%20Call%20Format/vcf-variant-call-format-version-41 VCF (Variant Call Format) files]. |
| + | |
| + | |
| + | == Running the GotCloud Variant Calling Pipeline == |
| + | |
| + | The variant calling pipeline (umake) is run using <code>gotcloud snpcall</code> and <code>gotcloud ldrefine</code>. |
| + | |
| + | ===Running the Automatic Test=== |
| + | |
| + | The automatic test runs the variant calling pipeline on a small test set and compares the output against expected results to validate that GotCloud is installed correctly. |
| + | |
| + | *Run <code>snpcall</code> pipeline test: |
| + | gotcloud snpcall --test OUTPUT_DIR |
| + | ** Where OUTPUT_DIR is the directory where you want to store the test results |
| + | ** If you see <code>Successfully ran the test case, congratulations!</code>, then you are ready to run snpcall on your own samples. |
| + | *Run <code>ldrefine</code> pipeline test: |
| + | gotcloud ldrefine --test OUTPUT_DIR |
| + | ** Where <code>OUTPUT_DIR</code> is the directory where you want to store the test results |
| + | ** If you see <code>Successfully ran the test case, congratulations!</code>, then you are ready to run ldrefine on your own samples. |
| + | |
| + | == Overview of Variant Calling Pipeline Steps == |
| Here is an overview of the Variant Calling Pipeline: | | Here is an overview of the Variant Calling Pipeline: |
| | | |
| [[File: umakeSteps.png]] | | [[File: umakeSteps.png]] |
| | | |
| + | |
| + | For more information on the filters applied during the Variant Calling Pipeline, see [[GotCloud: Filters]]. |
| | | |
| == Input Data== | | == Input Data== |
− | *Aligned/Processed/Recalibrated BAM files | + | * [[#BAM Files|Aligned/Processed/Recalibrated BAM files]] |
− | *Index file containing Sample IDs & BAM file names | + | * [[#BAM List File|BAM list file containing Sample IDs & BAM file names]] |
− | *Reference files | + | * [[#Reference Files|Reference files]] |
− | *(Optional) Configuration file to override default options | + | * (Optional) [[#Configuration File|Configuration file to override default options]] |
| | | |
− | === BAM files === | + | === BAM Files === |
− | The BAM files need to be duplicate-marked and base-quality recalibrated in order to obtain high quality SNP calls. | + | The BAM files need to be duplicate-marked and base-quality recalibrated in order to obtain high quality SNP calls. Generating these BAM files from original FASTQs is automatically done as part of the [[Alignment Pipeline]] of GotCloud. |
| | | |
− | FASTQs can be converted to this type of BAM using the [[Mapping Pipeline]].
| + | === BAM List File === |
− | | + | * Automatically created when running the GotCloud [[Alignment Pipeline]] |
− | === Index File ===
| + | * Each line of the BAM list file represents a single individual |
− | Each line of the index file represents each individual under the following format. Note that multiple BAMs per individual may be provided. | |
− | [SAMPLE_ID] [COMMA SEPARATED POPULATION LABELS] [BAM_FILE1] [BAM_FILE2] ...
| |
| | | |
| Columns: | | Columns: |
| # sample id | | # sample id |
− | # comma separated population labels | + | # comma separated population labels (optional column) |
− | # BAM File 1 | + | # BAM File 1 (preferable to have full paths to BAM files) |
− | # BAM File 2 (if applicable) | + | # BAM File 2 (if more than 1 BAM per sample) |
| :... | | :... |
| | | |
− | : # BAM File N | + | : # BAM File N (if more than 1 BAM per sample) |
| + | [SAMPLE_ID] [COMMA SEPARATED POPULATION LABELS] [BAM_FILE1] [BAM_FILE2] ... |
| + | or |
| + | [SAMPLE_ID] [BAM_FILE1] [BAM_FILE2] ... |
| | | |
− | === Reference Files ===
| + | * Notes: |
− | Reference files are required for doing Variant Calling.
| + | ** tab delimited |
| + | ** multiple BAMs per individual may be provided, but should all be on the same line of the list file |
| + | ** population label is optional - it will default to <code>ALL</code> |
| + | *** only used by Thunder (part of ldrefine pipeline) |
| + | *** if all samples are from the same population, population label can be skipped or you can just specify <code>ALL</code> for the population label for each sample. |
| | | |
− | * Reference Sequence in fasta format.
| + | The path to the BAM list file defaults to <code>outputDirectory/bam.list</code>. Override it with <code>--bamlist</code>, <code>--bam_list</code>, or <code>--list</code> on the command line, or by setting <code>BAM_LIST</code> in your configuration file to the path of the BAM list file. See [[#Required_Options|Required Options]] for more information. |
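If you need to build a BAM list file by hand rather than through the Alignment Pipeline, the format above can be produced with a few lines of Python. This is an illustrative sketch only: the function names and the sample data are hypothetical, and it simply follows the rules stated above (tab delimited, one line per individual, optional comma-separated population labels, one or more BAM paths).

```python
def bam_list_line(sample_id, bam_files, populations=None):
    """Format one BAM list line: sample id, optional labels, then BAM paths."""
    if not bam_files:
        raise ValueError("each sample needs at least one BAM file")
    fields = [sample_id]
    if populations:
        # The population column is optional; labels are comma separated.
        fields.append(",".join(populations))
    fields.extend(bam_files)
    return "\t".join(fields)


def write_bam_list(path, samples):
    """Write (sample_id, populations, bam_files) tuples, one line per sample."""
    with open(path, "w") as handle:
        for sample_id, populations, bam_files in samples:
            handle.write(bam_list_line(sample_id, bam_files, populations) + "\n")
```

Calling <code>write_bam_list("bam.list", [("Sample1", ["ALL"], ["/data/s1.bam"])])</code> writes a single tab-delimited line for <code>Sample1</code>; the sample name and path here are placeholders.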
− | ** Configuration File Setting: <code>REF = path/file.fa</code>
| |
− | * Indel VCF File Prefix
| |
− | ** Configuration File Setting: <code>INDEL_PREFIX = path/indels.sites.hg19</code>
| |
− | ** <code>path/</code> contains <code>indels.sites.hg19.chr20.vcf</code> for each chromosome being processed
| |
− | * DBSNP File Prefix
| |
− | ** Configuration File Setting: <code>DBSNP_PREFIX = path/dbsnp_135_b37.rod</code>
| |
− | ** <code>path/</code> contains <code>dbsnp_135_b37.rod.chr20.map</code> for each chromosome being processed
| |
− | * HapMap3 polymorphic site prefix
| |
− | ** Configuration File Setting: <code>HM3_PREFIX = path/hapmap3.qc.poly</code>
| |
− | ** <code>path/</code> contains <code>hapmap3.qc.poly.chr20.bim</code> & <code>hapmap3.qc.poly.chr20.frq</code> for each chromosome being processed
| |
− | | |
− | A set of reference files can be downloaded from: [[ftp://share.sph.umich.edu/1000genomes/umake-resources/ | FTP Download of Full Resource Files]]
| |
− | | |
− | Configuration File Example Reference Settings:
| |
− | REF = path/file.fa
| |
− | INDEL_PREFIX = path/indels.sites.hg19
| |
− | DBSNP_PREFIX = path/dbsnp_135_b37.rod
| |
− | HM3_PREFIX = path/hapmap3.qc.poly
| |
| | | |
| + | === Reference Files === |
| + | See [[GotCloud: Genetic Reference and Resource Files]] for detailed information about the multiple required reference files for the variant calling pipeline, including: |
| + | * How to obtain default references |
| + | * Configuration keys & default values |
| + | * How to generate your own references |
| + | * How to point GotCloud to your reference files |
| | | |
| + | Required Reference File Types: |
| + | * [[GotCloud: Genetic Reference and Resource Files#Reference fasta Files|Reference fasta Files]] |
| + | * [[GotCloud: Genetic Reference and Resource Files#DBSNP VCF Files|DBSNP VCF Files]] |
| + | * [[GotCloud: Genetic Reference and Resource Files#HapMap3 VCF Files|HapMap3 VCF Files]] |
| + | * [[GotCloud: Genetic Reference and Resource Files#OMNI VCF Files|OMNI VCF Files]] |
| + | * [[GotCloud: Genetic Reference and Resource Files#INDEL VCF File(s)|INDEL VCF File(s)]] |
| | | |
| === Configuration File === | | === Configuration File === |
− | Configuration file contains the run-time options including the software binaries and command line arguments. A default configuration file is automatically loaded. Users must specify their own configuration file specifying just the values different than the defaults. | + | {{:GotCloud: Configuration}} |
− | | |
− | Comments begin with a <code>#</code>
| |
| | | |
− | Format: KEY = value
| + | See [[#Variant Calling Command-line Options/Configuration Settings|Variant Calling Command-line Options/Configuration Settings]] for more information on Configuration options. |
| | | |
− | Where KEY is the item being set and value is its new value
| + | ==== Example Configuration File ==== |
| + | Example configuration file where the reference files are stored in <code>/path/reference</code> and the BAM list file is <code>/path/freeze5.bam.list</code>: |
| + | CHRS = 20 22 |
| + | BAM_LIST = /path/freeze5.bam.list |
| + | OUT_DIR = /path/freeze5/output |
| + | REF_DIR = /path/reference/ |
| + | REF = $(REF_DIR)/hs37d5.fa |
| + | INDEL_PREFIX = $(REF_DIR)/1kg.pilot_release.merged.indels.sites.hg19 |
| + | HM3_VCF = $(REF_DIR)/hapmap3_r3_b37.sites.vcf.gz |
| + | DBSNP_VCF = $(REF_DIR)/dbsnp_135.b37.sites.vcf.gz |
| | | |
− | ====Required User Config Files Settings====
| |
− | The following Config File Settings must be specified by the user:
| |
− | * CHRS = space separated list of chromosomes you want
| |
− | * BAM_INDEX = path to the Index File of BAMs
| |
| | | |
− | ====Required on Command-Line or in Config File==== | + | == Variant Calling Command-line Options/Configuration Settings == |
− | The following Command-Line or Config File Settings must be specified by the user:
| + | {{:GotCloud: Variant Calling Options}} |
− | * --outdir/OUTDIR= path to desired output directory
| |
| | | |
− | ====Targeted/Exome Sequencing Settings====
| |
− | If you are running Targeted/Exome Sequencing, the user should specify:
| |
− | * Write loci file when performing pileup
| |
− | ** WRITE_TARGET_LOCI = TRUE
| |
− | * Specify the output sub-directory to store target information, for example: targetDir
| |
− | ** Should not be a full path as this will go under the OUTDIR directory.
| |
− | ** TARGET_DIR = targetDir
| |
| | | |
− | If all individuals have the same target:
| + | == Use Cases & Recommended Settings == |
− | * Specify the single bed file, for example: target.bed
| + | === Single Sample Processing === |
− | ** UNIFORM_TARGET_BED = target.bed
| + | To run single sample processing we recommend adding the following settings to your configuration file: |
| + | UNIT_CHUNK = 20000000 |
| + | MODEL_GLFSINGLE = TRUE |
| + | MODEL_SKIP_DISCOVER = FALSE |
| + | MODEL_AF_PRIOR = TRUE |
| + | VCF_EXTRACT = $(REF_DIR)/snpOnly.vcf.gz |
| + | EXT = $(REF_DIR)/ALL.chrCHR.phase3.combined.sites.unfiltered.vcf.gz $(REF_DIR)/chrCHR.filtered.sites.vcf.gz |
| | | |
− | If not all individuals have the same target:
| + | Explanation of these settings: |
− | * Specify the file containing the sample id -> bed map, for example: targetMap.txt | + | * <code>UNIT_CHUNK</code> - since this is only 1 sample, process larger regions at a time than default |
− | ** MULTIPLE_TARGET_MAP = targetMap.txt | + | * <code>MODEL_GLFSINGLE</code> - single sample, so model glfsingle |
− | *** Each line of the file contains [SM_ID] [TARGET_BED] | + | * <code>MODEL_SKIP_DISCOVER</code> - do not skip the variant discovery step |
| + | * <code>MODEL_AF_PRIOR</code> - use AF prior for genotyping |
| + | * <code>VCF_EXTRACT</code> - VCF file to use for extracting the site information to genotype |
| + | ** This file is included in the latest reference release: [[GotCloud:_Genetic_Reference_and_Resource_Files#hs37d5-db142|hs37d5-db142]] |
| + | * <code>EXT</code> - VCF reference files to use for the external filtering |
| + | ** These files are included in the latest reference release: [[GotCloud:_Genetic_Reference_and_Resource_Files#hs37d5-db142|hs37d5-db142]] |
| | | |
− | Optional Settings:
| |
− | * Extend the target region by a given number of bases, for example: 50
| |
− | ** OFFSET_OFF_TARGET = 50
| |
− | * Exclude off-target regions when using samtools view (may make command line too long)
| |
− | ** SAMTOOLS_VIEW_TARGET_ONLY = TRUE
| |
− |
| |
− | ==== Configure Reference Files ====
| |
− | See [[#Reference Files| Reference Files]] for information on how to specify the reference files.
| |
− |
| |
− | ==== Chromosome X Calling ====
| |
− | * PED_INDEX = pedfile.ped
| |
| | | |
| | | |
| == Running == | | == Running == |
| | | |
− | Running umake is straightforward: | + | Running variant calling is straightforward: |
| | | |
| <code> | | <code> |
− | '''/usr/local/biopipe/bin/umake.pl --conf umake.conf --snpcall --numjobs 2 | + | '''gotcloud snpcall --conf vc.conf --numjobs 2 |
| + | '''gotcloud ldrefine --conf vc.conf --numjobs 2 |
| </code> | | </code> |
| | | |
− | Replace umake.conf with the appropriate path/name of the user's configuration file. | + | * Replace <code>vc.conf</code> with the path/name of the user's configuration file |
− | | + | ** If you are not overriding any defaults, you can alternatively specify <code>--list path/bam.list</code> replacing <code>path/bam.list</code> with the path/name of your BAM list file. |
− | If <code>OUTDIR</code> is not defined in the configuration file, add <code>--outdir</code> followed by the path to the user's desired output directory. | + | * Replace <code>2</code> following <code>--numjobs</code> with the number of jobs to be run in parallel |
− | | + | * If <code>OUT_DIR</code> is not defined in the configuration file, add <code>--outdir</code> followed by the path to the user's desired output directory. |
− | Update the value following <code>--numjobs</code> to the appropriate number of jobs that the user wants to run in parallel.
| |
− | | |
| | | |
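When the two steps are launched from a script, the command lines above can be assembled programmatically. The helper below is a hypothetical sketch that only builds the argument list; it uses the <code>--conf</code>, <code>--list</code>, <code>--numjobs</code>, and <code>--outdir</code> options described on this page and assumes <code>gotcloud</code> is on your <code>PATH</code> when the command is actually run.

```python
import shlex


def build_command(step, conf=None, bam_list=None, numjobs=1, outdir=None):
    """Build the argument list for `gotcloud snpcall` or `gotcloud ldrefine`."""
    if step not in ("snpcall", "ldrefine"):
        raise ValueError("step must be 'snpcall' or 'ldrefine'")
    command = ["gotcloud", step]
    if conf:
        command += ["--conf", conf]
    if bam_list:
        command += ["--list", bam_list]
    command += ["--numjobs", str(numjobs)]
    if outdir:
        # Only needed when OUT_DIR is not set in the configuration file.
        command += ["--outdir", outdir]
    return command


if __name__ == "__main__":
    # Print the two commands in the order they are run.
    for step in ("snpcall", "ldrefine"):
        print(shlex.join(build_command(step, conf="vc.conf", numjobs=2)))
```

The returned list can be passed directly to <code>subprocess.run</code>, which avoids shell quoting problems with paths that contain spaces.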
| === Running on a Cluster === | | === Running on a Cluster === |
− | To run on the Cluster, the following settings need to be added to the configuration file:
| + | See [[#Cluster Configuration|Cluster Configuration]] for information on how to configure GotCloud to run on a cluster. |
− | | |
− | SLEEP_MULT = 20
| |
− | MOS_PREFIX = # PREFIX FOR MOSIX COMMAND (BLANK IF UNUSED)
| |
− | MOS_NODES = # COMMA-SEPARATED LIST OF NODES TO SUBMIT JOBS
| |
− | REMOTE_PREFIX = # REMOTE_PREFIX : Set if cluster node see the directory differently (e.g. /net/mymachine/[original-dir])
| |
− | | |
− | Set the MOS_NODES to the appropriate node list.
| |
− | | |
− | Update MOS_PREFIX to the applicable prefix.
| |
− | * For MOSIX, use:
| |
− | MOS_PREFIX = mosrun -E/tmp -t -i
| |
| | | |
− | === Results ===
| + | == Results == |
| | | |
| If there is a failure, you should see a message like: | | If there is a failure, you should see a message like: |
Line 143: |
Line 147: |
| * glfs with a bams & samples subdirectory | | * glfs with a bams & samples subdirectory |
| * pvcfs with a subdirectory per chromosome and then per region | | * pvcfs with a subdirectory per chromosome and then per region |
− | * split with a subdirectory per chromosome | + | * '''split''' with a subdirectory per chromosome |
− | * vcfs with a subdirectory per chromosome | + | * '''vcfs''' with a subdirectory per chromosome |
| * (optionally your target directory) | | * (optionally your target directory) |
| | | |
− | Under the vcf/chrXX directory, there should be: | + | Under the '''vcfs/chrXX''' directory, there should be: |
| * chrXX.filtered.sites.vcf | | * chrXX.filtered.sites.vcf |
− | * chrXX.filtered.sites.vcf.log | + | * chrXX.filtered.sites.vcf.norm.log |
| * chrXX.filtered.sites.vcf.summary | | * chrXX.filtered.sites.vcf.summary |
− | * chrXX.filtered.vcf.gz | + | * '''chrXX.filtered.vcf.gz''' - final filtered variant call file |
| * chrXX.filtered.vcf.gz.OK | | * chrXX.filtered.vcf.gz.OK |
| * chrXX.filtered.vcf.gz.tbi | | * chrXX.filtered.vcf.gz.tbi |
| + | * chrXX.hardfiltered.sites.vcf |
| + | * chrXX.hardfiltered.sites.vcf.log |
| + | * chrXX.hardfiltered.sites.vcf.summary |
| + | * chrXX.hardfiltered.vcf.gz |
| + | * chrXX.hardfiltered.vcf.gz.OK |
| + | * chrXX.hardfiltered.vcf.gz.tbi |
| * chrXX.merged.sites.vcf | | * chrXX.merged.sites.vcf |
| * chrXX.merged.stats.vcf | | * chrXX.merged.stats.vcf |
Line 159: |
Line 169: |
| * chrXX.merged.vcf.OK | | * chrXX.merged.vcf.OK |
| | | |
− | Under the split/chrXX directory, there should be: | + | The .merged.vcf file combines the separate per-region results for a chromosome into one file. |
| + | |
| + | The filtered VCF is the merged.vcf after it has been run through the filters, with each variant marked PASS/FAIL. |
| + | |
| + | Under the '''split/chrXX''' directory, there should be: |
| * chrXX.filtered.PASS.split.[N].vcf.gz | | * chrXX.filtered.PASS.split.[N].vcf.gz |
− | * chr20.filtered.PASS.split.err | + | * chrXX.filtered.PASS.split.err |
− | * chr20.filtered.PASS.split.vcflist | + | * chrXX.filtered.PASS.split.vcflist |
− | * chr20.filtered.PASS.gz | + | * '''chrXX.filtered.PASS.gz''' - final variant call file with only PASS variants |
| * subset.OK | | * subset.OK |
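A quick way to confirm that a run completed is to check for the final files listed above. The following sketch is illustrative only: the function name is hypothetical, it checks just the final call files and their <code>.OK</code> markers rather than every intermediate file, and the name of the VCF directory is a parameter (defaulting to <code>vcfs</code>) so that you can match it to your own output directory.

```python
from pathlib import Path


def missing_outputs(out_dir, chrom, vcf_dir="vcfs"):
    """Return the expected final output files that do not exist yet."""
    chr_name = "chr{}".format(chrom)
    vcf_path = Path(out_dir) / vcf_dir / chr_name
    split_path = Path(out_dir) / "split" / chr_name
    expected = [
        vcf_path / (chr_name + ".filtered.vcf.gz"),
        vcf_path / (chr_name + ".filtered.vcf.gz.OK"),
        vcf_path / (chr_name + ".filtered.vcf.gz.tbi"),
        split_path / (chr_name + ".filtered.PASS.gz"),
        split_path / "subset.OK",
    ]
    return [path for path in expected if not path.exists()]
```

An empty list from <code>missing_outputs("/path/freeze5/output", 20)</code> would indicate that the final files for chromosome 20 are all present.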