Changes

2,301 bytes added , 18:54, 30 March 2015

Line 1: Line 1: −

~~Back to the beginning [http://genome.sph.umich.edu/wiki/Pipelines]~~

−

~~The Variant Calling Pipeline (UMAKE) takes recalibrated BAM files and detects SNPs and calls their genotypes, producing VCF files.~~

+

Back to parent: [[GotCloud]]

+

The Variant Calling Pipeline (previously called 'UMAKE') makes genotype calls from recalibrated BAM files. These genotype calls are output into [http://www.1000genomes.org/wiki/Analysis/Variant%20Call%20Format/vcf-variant-call-format-version-41 VCF (Variant Call Format) files].

+

== Running the GotCloud Variant Calling Pipeline ==

+

The variant calling pipeline (umake) is run using <code>gotcloud snpcall</code> and <code>gotcloud ldrefine</code>.

+

===Running the Automatic Test===

+

The automatic test runs the variant calling pipeline on a small test set and checks the results against the expected results, validating that GotCloud is installed correctly.

+

*Run <code>snpcall</code> pipeline test:

+

gotcloud snpcall --test OUTPUT_DIR

+

** Where <code>OUTPUT_DIR</code> is the directory where you want to store the test results

+

** If you see <code>Successfully ran the test case, congratulations!</code>, then you are ready to run snpcall on your own samples.

+

*Run <code>ldrefine</code> pipeline test:

+

gotcloud ldrefine --test OUTPUT_DIR

+

** Where <code>OUTPUT_DIR</code> is the directory where you want to store the test results

+

** If you see <code>Successfully ran the test case, congratulations!</code>, then you are ready to run ldrefine on your own samples.
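The check described above can be scripted. The following is a minimal sketch, not part of GotCloud: the helper names <code>self_test_passed</code> and <code>run_self_test</code> are hypothetical, and it assumes the success message may be printed on either standard output or standard error.

```python
import subprocess

# Success message quoted on this page.
SUCCESS_MESSAGE = "Successfully ran the test case, congratulations!"


def self_test_passed(output: str) -> bool:
    """Return True if the captured output contains the success message."""
    return SUCCESS_MESSAGE in output


def run_self_test(step: str, output_dir: str) -> bool:
    """Run `gotcloud <step> --test <output_dir>` and look for the success message.

    `step` is expected to be "snpcall" or "ldrefine".
    """
    result = subprocess.run(
        ["gotcloud", step, "--test", output_dir],
        capture_output=True,
        text=True,
    )
    # Assumption: the message may appear on either stream, so check both.
    return self_test_passed(result.stdout + result.stderr)
```

For example, <code>run_self_test("snpcall", "OUTPUT_DIR")</code> would return <code>True</code> only if the message was found; it requires <code>gotcloud</code> to be on the <code>PATH</code>.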

+

== Overview of Variant Calling Pipeline Steps ==

Here is an overview of the Variant Calling Pipeline:

[[File: umakeSteps.png]]

+

For more information on the filters applied during the Variant Calling Pipeline, see [[GotCloud: Filters]].

== Input Data==

−

*Aligned/Processed/Recalibrated BAM files

+

* [[#BAM Files|Aligned/Processed/Recalibrated BAM files]]

−

*~~Index~~ file containing Sample IDs & BAM file names

+

* [[#BAM List File|BAM list file containing Sample IDs & BAM file names]]

−

*Reference files

+

* [[#Reference Files|Reference files]]

−

*(Optional) Configuration file to override default options

+

* (Optional) [[#Configuration File|Configuration file to override default options]]

−

=== BAM ~~files~~ ===

+

=== BAM Files ===

−

The BAM files need to be duplicate-marked and base-quality recalibrated in order to obtain high quality SNP calls.

+

The BAM files need to be duplicate-marked and base-quality recalibrated in order to obtain high quality SNP calls. Generating these BAM files from original FASTQs is automatically done as part of the [[Alignment Pipeline]] of GotCloud.

−

~~FASTQs can be converted to this type of~~ BAM ~~using~~ the [[~~Mapping~~ Pipeline]].

+

=== BAM List File ===

−

+

* Automatically created when running the GotCloud [[Alignment Pipeline]]

−

~~=== Index File ===~~

+

* Each line of the BAM list file represents a single individual

−

Each line of the ~~index~~ file represents ~~each individual under the following format. Note that multiple BAMs per~~ individual ~~may be provided.~~

−

~~[SAMPLE_ID] [COMMA SEPARATED POPULATION LABELS] [BAM_FILE1] [BAM_FILE2] ...~~

Columns:

# sample id

−

# comma separated population labels

+

# comma separated population labels (optional column)

−

# BAM File 1

+

# BAM File 1 (preferable to have full paths to BAM files)

−

# BAM File 2 (if ~~applicable~~)

+

# BAM File 2 (if more than 1 BAM per sample)

:...

−

: # BAM File N

+

: # BAM File N (if more than 1 BAM per sample)

+

[SAMPLE_ID] [COMMA SEPARATED POPULATION LABELS] [BAM_FILE1] [BAM_FILE2] ...

+

or

+

[SAMPLE_ID] [BAM_FILE1] [BAM_FILE2] ...

−

~~=== Reference Files ===~~

+

* Notes:

−

~~Reference files~~ are ~~required~~ for ~~doing Variant Calling~~.

+

** tab delimited

+

** multiple BAMs per individual may be provided, but should all be on the same line of the list file

+

** population label is optional - it will default to <code>ALL</code>

+

*** only used by Thunder (part of ldrefine pipeline)

+

*** if all samples are from the same population, population label can be skipped or you can just specify <code>ALL</code> for the population label for each sample.

−

* Reference Sequence in fasta format.

+

The path to the BAM list file defaults to <code>outputDirectory/bam.list</code>. It can be overridden with <code>--bamlist</code>, <code>--bam_list</code>, or <code>--list</code> on the command line, or by setting <code>BAM_LIST</code> in your configuration file to the path of the BAM list file. See [[#Required_Options|Required Options]] for more information.
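To make the BAM list format concrete, here is a minimal sketch of a line parser. It is an illustration only, not GotCloud's own parser. The sample ID and paths in the usage line are made up, and the rule used to detect the optional population column (a second column that does not end in <code>.bam</code>) is an assumption.

```python
from typing import List, NamedTuple


class BamListEntry(NamedTuple):
    sample_id: str
    populations: List[str]
    bam_files: List[str]


def parse_bam_list_line(line: str) -> BamListEntry:
    """Parse one tab-delimited BAM list line into its columns."""
    fields = line.rstrip("\n").split("\t")
    if len(fields) < 2:
        raise ValueError(f"expected a sample ID and at least one BAM file: {line!r}")
    sample_id, rest = fields[0], fields[1:]
    # Assumption: a second column that does not end in ".bam" is the
    # optional comma-separated population-label column.
    if not rest[0].endswith(".bam"):
        populations, bam_files = rest[0].split(","), rest[1:]
    else:
        # Population label omitted: per this page it defaults to ALL.
        populations, bam_files = ["ALL"], rest
    if not bam_files:
        raise ValueError(f"no BAM files listed for sample {sample_id!r}")
    return BamListEntry(sample_id, populations, bam_files)


# Hypothetical sample with two population labels and two BAM files.
print(parse_bam_list_line("Sample1\tCEU,EUR\t/data/a.bam\t/data/b.bam"))
```

Keeping all BAM files for a sample on one line, as required above, is what lets a parser like this treat every remaining column as a BAM path.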

−

** Configuration File Setting: <code>REF = path/file~~.fa</code>~~

−

* Indel VCF File Prefix

−

** Configuration File Setting: <code>~~INDEL_PREFIX = path~~/~~indels.sites~~.~~hg19~~</code>

−

** <code>path/</code> contains <code>indels.sites.hg19.chr20.~~vcf</code> for each chromosome being processed~~

−

* DBSNP File Prefix

−

** Configuration File Setting: <code>~~DBSNP_PREFIX = path/dbsnp_135_b37.rod~~</code>

−

** <code>~~path/~~</code> ~~contains~~ <code>~~dbsnp_135_b37.rod.chr20.map~~</code> ~~for each chromosome being processed~~

−

* HapMap3 polymorphic site prefix

−

** Configuration File ~~Setting:~~ ~~<code>HM3_PREFIX = path/hapmap3.qc.poly</code>~~

−

** <code>path/</code> contains <code>hapmap3.qc.poly.chr20.bim</code> & <code>hapmap3.qc.poly.chr20.frq</code> for each chromosome being processed

−

~~A set of reference files can be downloaded from:~~ [[~~ftp://share.sph.umich.edu/1000genomes/umake-resources/~~ | ~~FTP Download of Full Resource Files~~]]

−

~~Configuration File Example Reference Settings:~~

−

~~REF = path/file.fa~~

−

~~INDEL_PREFIX = path/indels.sites.hg19~~

−

~~DBSNP_PREFIX = path/dbsnp_135_b37.rod~~

−

~~HM3_PREFIX = path/hapmap3.qc~~.~~poly~~

+

=== Reference Files ===

+

See [[GotCloud: Genetic Reference and Resource Files]] for detailed information about the multiple required reference files for the variant calling pipeline, including:

+

* How to obtain default references

+

* Configuration keys & default values

+

* How to generate your own references

+

* How to point GotCloud to your reference files

+

Required Reference File Types:

+

* [[GotCloud: Genetic Reference and Resource Files#Reference fasta Files|Reference fasta Files]]

+

* [[GotCloud: Genetic Reference and Resource Files#DBSNP VCF Files|DBSNP VCF Files]]

+

* [[GotCloud: Genetic Reference and Resource Files#HapMap3 VCF Files|HapMap3 VCF Files]]

+

* [[GotCloud: Genetic Reference and Resource Files#OMNI VCF Files|OMNI VCF Files]]

+

* [[GotCloud: Genetic Reference and Resource Files#INDEL VCF File(s)|INDEL VCF File(s)]]

=== Configuration File ===

−

Configuration file contains the run-time options including the software binaries and command line arguments. A default configuration file is automatically loaded. Users must specify their own configuration file specifying just the values different than the defaults.

+

−

~~Comments begin with a <code>#</code>~~

−

~~Format: KEY = value~~

+

See [[#Variant Calling Command-line Options/Configuration Settings|Variant Calling Command-line Options/Configuration Settings]] for more information on Configuration options.

−

~~Where KEY is the item being set~~ and ~~value is its new value~~

+

==== Example Configuration File ====

+

Example configuration file where the reference files are stored in <code>/path/reference</code> and the BAM list file is <code>/path/freeze5.bam.list</code>:

+

CHRS = 20 22

+

BAM_LIST = /path/freeze5.bam.list

+

OUT_DIR = /path/freeze5/output

+

REF_DIR = /path/reference/

+

REF = $(REF_DIR)/hs37d5.fa

+

INDEL_PREFIX = $(REF_DIR)/1kg.pilot_release.merged.indels.sites.hg19

+

HM3_VCF = $(REF_DIR)/hapmap3_r3_b37.sites.vcf.gz

+

DBSNP_VCF = $(REF_DIR)/dbsnp_135.b37.sites.vcf.gz
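The <code>KEY = value</code> layout and the <code>$(REF_DIR)</code> references in the example can be illustrated with a small reader. This is a sketch under stated assumptions, not GotCloud's configuration loader: it assumes <code>#</code> starts a comment (as the earlier revision of this page noted) and that a <code>$(KEY)</code> reference is replaced by the value of a key defined on an earlier line. The function name <code>parse_config</code> is hypothetical.

```python
import re
from typing import Dict


def parse_config(text: str) -> Dict[str, str]:
    """Parse KEY = value lines and expand $(KEY) references to earlier keys."""
    settings: Dict[str, str] = {}
    for raw in text.splitlines():
        line = raw.split("#", 1)[0].strip()  # drop comments and whitespace
        if not line:
            continue
        key, sep, value = line.partition("=")
        if not sep:
            raise ValueError(f"not a KEY = value line: {raw!r}")
        # Replace each $(NAME) with the value already recorded for NAME;
        # a reference to an undefined key raises KeyError.
        expanded = re.sub(
            r"\$\((\w+)\)", lambda m: settings[m.group(1)], value.strip()
        )
        settings[key.strip()] = expanded
    return settings


example = """\
CHRS = 20 22
REF_DIR = /path/reference/
REF = $(REF_DIR)/hs37d5.fa
"""
config = parse_config(example)
print(config["REF"])
```

Because <code>REF_DIR</code> in the example ends with a slash, the expanded <code>REF</code> contains a double slash, which file systems generally treat the same as a single one.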

−

~~====Required User Config Files Settings====~~

−

~~The following Config File Settings must be specified by the user:~~

−

* CHRS = space separated list of chromosomes you want

−

* BAM_INDEX = path to the Index File of BAMs

−

==~~==Required on~~ Command-~~Line or in Config File==~~==

+

== Variant Calling Command-line Options/Configuration Settings ==

−

~~The following Command-Line or Config File Settings must be specified by the user~~:

+

−

* --outdir/OUTDIR= path to desired output directory

−

~~====Targeted/Exome Sequencing Settings====~~

−

~~If you are running Targeted/Exome Sequencing, the user should specify:~~

−

* Write loci file when performing pileup

−

** WRITE_TARGET_LOCI = TRUE

−

* Specify the output sub-directory to store target information, for example: targetDir

−

** Should not be a full path as this will go under the OUTDIR directory.

−

** TARGET_DIR = targetDir

−

~~If all individuals have the same target:~~

+

== Use Cases & Recommended Settings ==

−

* Specify the ~~single bed~~ file~~, for example~~: ~~target~~.~~bed~~

+

=== Single Sample Processing ===

−

** UNIFORM_TARGET_BED = ~~target~~.~~bed~~

+

To run single sample processing we recommend adding the following settings to your configuration file:

+

UNIT_CHUNK = 20000000

+

MODEL_GLFSINGLE = TRUE

+

MODEL_SKIP_DISCOVER = FALSE

+

MODEL_AF_PRIOR = TRUE

+

VCF_EXTRACT = $(REF_DIR)/snpOnly.vcf.gz

+

EXT = $(REF_DIR)/ALL.chrCHR.phase3.combined.sites.unfiltered.vcf.gz $(REF_DIR)/chrCHR.filtered.sites.vcf.gz

−

If not ~~all individuals have~~ the ~~same target:~~

+

Explanation of these settings:

−

* ~~Specify the file containing the sample id~~ -> ~~bed map,~~ for ~~example: targetMap.txt~~

+

* <code>UNIT_CHUNK</code> - since this is only 1 sample, process larger regions at a time than default

−

** ~~MULTIPLE_TARGET_MAP = targetMap.txt~~

+

* <code>MODEL_GLFSINGLE</code> - single sample, so model glfsingle

−

*** ~~Each line of~~ the ~~file contains~~ [~~SM_ID~~] ~~[TARGET_BED~~]

+

* <code>MODEL_SKIP_DISCOVER</code> - do not skip the variant discovery step

+

* <code>MODEL_AF_PRIOR</code> - use AF prior for genotyping

+

* <code>VCF_EXTRACT</code> - VCF file to use for extracting the site information to genotype

+

** This file is included in the latest reference release: [[GotCloud:_Genetic_Reference_and_Resource_Files#hs37d5-db142|hs37d5-db142]]

+

* <code>EXT</code> - VCF reference files to use for the external filtering

+

** These files are included in the latest reference release: [[GotCloud:_Genetic_Reference_and_Resource_Files#hs37d5-db142|hs37d5-db142]]

−

~~Optional Settings:~~

−

* Extend the target region by a given number of bases, for example: 50

−

** OFFSET_OFF_TARGET = 50

−

* Exclude off-target regions when using samtools view (may make command line too long)

−

** SAMTOOLS_VIEW_TARGET_ONLY = TRUE

−

~~==== Configure Reference Files ====~~

−

~~See [[#Reference Files| Reference Files]] for information on how to specify the reference files.~~

−

~~==== Chromosome X Calling ====~~

−

* PED_INDEX = pedfile.ped

== Running ==

−

Running ~~umake~~ is straightforward:

+

Running variant calling is straightforward:

<code>

−

'''~~/usr/local/biopipe/bin/umake.pl~~ --conf ~~umake~~.conf --~~snpcall~~ --numjobs 2

+

'''gotcloud snpcall --conf vc.conf --numjobs 2

+

'''gotcloud ldrefine --conf vc.conf --numjobs 2

</code>

−

Replace ~~umake~~.conf with the ~~approprate~~ path/name of the user's configuration file.

+

* Replace <code>vc.conf</code> with the path/name of the user's configuration file

−

+

** If you are not overriding any defaults, you can alternatively specify <code>--list path/bam.list</code> replacing <code>path/bam.list</code> with the path/name of your BAM list file.

−

If <code>~~OUTDIR~~</code> is not defined in the configuration file, add <code>--outdir</code> followed by the path to the user's desired output directory.

+

* Replace <code>2</code> following <code>--numjobs</code> with the number of jobs to be run in parallel

−

+

* If <code>OUT_DIR</code> is not defined in the configuration file, add <code>--outdir</code> followed by the path to the user's desired output directory.
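The substitutions listed above can be wrapped in a small helper when the two steps are launched from a script. This is a minimal sketch: the function name <code>build_command</code> is hypothetical, and only the options shown on this page (<code>--conf</code>, <code>--numjobs</code>, <code>--outdir</code>) are used.

```python
from typing import List, Optional


def build_command(step: str, conf: str, numjobs: int = 2,
                  outdir: Optional[str] = None) -> List[str]:
    """Build the argument list for one variant calling step."""
    if step not in ("snpcall", "ldrefine"):
        raise ValueError(f"unknown step: {step!r}")
    cmd = ["gotcloud", step, "--conf", conf, "--numjobs", str(numjobs)]
    # Only needed when OUT_DIR is not set in the configuration file.
    if outdir is not None:
        cmd += ["--outdir", outdir]
    return cmd


# Print the two commands in the order they are run.
for step in ("snpcall", "ldrefine"):
    print(" ".join(build_command(step, "vc.conf")))
```

Each list can be passed to <code>subprocess.run(cmd, check=True)</code> to execute the step, so that <code>ldrefine</code> is not started if <code>snpcall</code> fails.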

−

~~Update the value following <code>--numjobs</code> to the appropriate number of jobs that the user wants to run in parallel.~~

−

=== Running on a Cluster ===

−

~~To run~~ on ~~the Cluster, the following settings need~~ to ~~be added~~ to ~~the configuration file:~~

+

See [[#Cluster Configuration|Cluster Configuration]] for information on how to configure GotCloud to run on a cluster.

−

~~SLEEP_MULT = 20~~

−

~~MOS_PREFIX = # PREFIX FOR MOSIX COMMAND (BLANK IF UNUSED)~~

−

~~MOS_NODES = # COMMA-SEPARATED LIST OF NODES TO SUBMIT JOBS~~

−

~~REMOTE_PREFIX = # REMOTE_PREFIX : Set if~~ cluster ~~node see the directory differently (e.g. /net/mymachine/[original-dir])~~

−

~~Set the MOS_NODES to the appropriate node list.~~

−

~~Update MOS_PREFIX to the applicable prefix~~.

−

* For MOSIX, use:

−

~~MOS_PREFIX = mosrun -E/tmp -t -i~~

−

=== Results ===

+

== Results ==

If there is a failure, you should see a message like:

Line 143: Line 147:

* glfs with a bams & samples subdirectory

* pvcfs with a subdirectory per chromosome and then per region

−

* split with a subdirectory per chromosome

+

* '''split''' with a subdirectory per chromosome

−

* vcfs with a subdirectory per chromosome

+

* '''vcfs''' with a subdirectory per chromosome

* (optionally your target directory)

−

Under the vcf/chrXX directory, there should be:

+

Under the '''vcf/chrXX''' directory, there should be:

* chrXX.filtered.sites.vcf

−

* chrXX.filtered.sites.vcf.log

+

* chrXX.filtered.sites.vcf.norm.log

* chrXX.filtered.sites.vcf.summary

−

* chrXX.filtered.vcf.gz

+

* '''chrXX.filtered.vcf.gz''' - final filtered variant call file

* chrXX.filtered.vcf.gz.OK

* chrXX.filtered.vcf.gz.tbi

+

* chrXX.hardfiltered.sites.vcf

+

* chrXX.hardfiltered.sites.vcf.log

+

* chrXX.hardfiltered.sites.vcf.summary

+

* chrXX.hardfiltered.vcf.gz

+

* chrXX.hardfiltered.vcf.gz.OK

+

* chrXX.hardfiltered.vcf.gz.tbi

* chrXX.merged.sites.vcf

* chrXX.merged.stats.vcf

Line 159: Line 169:

* chrXX.merged.vcf.OK

−

Under the split/chrXX directory, there should be:

+

The .merged.vcf file is the merged version of the separate regions of the same chromosome.

+

The filtered file is the merged.vcf after it has been run through the filters, with each variant marked PASS/FAIL.

+

Under the '''split/chrXX''' directory, there should be:

* chrXX.filtered.PASS.split.[N].vcf.gz

−

* ~~chr20~~.filtered.PASS.split.err

+

* chrXX.filtered.PASS.split.err

−

* ~~chr20~~.filtered.PASS.split.vcflist

+

* chrXX.filtered.PASS.split.vcflist

−

* ~~chr20~~.filtered.PASS.gz

+

* '''chrXX.filtered.PASS.gz''' - final variant call file with only PASS variants

* subset.OK
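After a run, the presence of the final filtered call set can be checked with a few lines of code. The sketch below is an illustration only: the function name <code>missing_outputs</code> is hypothetical, it checks just three of the files listed above, and because this page refers to the directory as both <code>vcfs</code> and <code>vcf</code>, the subdirectory name is a parameter.

```python
from pathlib import Path
from typing import List


def missing_outputs(out_dir: str, chrom: str, vcf_subdir: str = "vcfs") -> List[str]:
    """Return the expected final VCF files that are absent for one chromosome."""
    chrom_dir = Path(out_dir) / vcf_subdir / chrom
    expected = [
        f"{chrom}.filtered.vcf.gz",      # final filtered variant call file
        f"{chrom}.filtered.vcf.gz.OK",   # completion marker
        f"{chrom}.filtered.vcf.gz.tbi",  # index
    ]
    return [name for name in expected if not (chrom_dir / name).exists()]


# Hypothetical output directory taken from the example configuration file.
print(missing_outputs("/path/freeze5/output", "chr20"))
```

An empty list means all three files were found; any names returned point to the step that did not finish for that chromosome.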

Kleckner

87

edits

Changes

GotCloud: Variant Calling Pipeline (view source)

Revision as of 18:54, 30 March 2015
