Latest revision as of 12:00, 14 May 2015
Genetic Reference and Resource Files
In order to run GotCloud, you need to provide Genetic Reference and Resource Files.
You can generate your own files or use the set available for download.
- By default, GotCloud looks for the reference/resource files in the gotcloud.ref subdirectory within the base GotCloud directory.
- To look in a different directory, set your reference/resource file location by setting either of the following to that path:
  - REF_DIR in your configuration file
  - --ref_dir on the command-line
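
The two options above might look like the following minimal sketch. The path /data/gotcloud.ref is hypothetical, the "KEY = value" layout of the configuration file is an assumption suggested by the $(REF_DIR) defaults listed in the table, and the subcommand and --conf option in the commented command line are also assumptions; only REF_DIR and --ref_dir come from this page.

```shell
# Hypothetical custom location for the reference/resource files.
ref_dir="/data/gotcloud.ref"

# Option 1: set REF_DIR in a configuration file.
# The file is a temporary stand-in for your own configuration file, and the
# "KEY = value" layout is an assumption.
conf_file="$(mktemp)"
printf 'REF_DIR = %s\n' "$ref_dir" > "$conf_file"
cat "$conf_file"

# Option 2: pass the same path on the command line instead.
# The subcommand and --conf are assumptions; --ref_dir is the documented flag.
#   gotcloud align --conf "$conf_file" --ref_dir "$ref_dir"
```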
Description | Configuration Key | Default Value | Pipelines | Special Info |
---|---|---|---|---|
Reference fasta | REF | $(REF_DIR)/human.g1k.v37.fa | align, snpcall, indel | Additional Files Required |
DBSNP VCF File | DBSNP_VCF | $(REF_DIR)/dbsnp_135.b37.vcf.gz | align, snpcall | Must be tabixed |
HapMap3 VCF File | HM3_VCF | $(REF_DIR)/hapmap_3.3.b37.sites.vcf.gz | align, snpcall | Must be tabixed |
OMNI VCF File | OMNI_VCF | $(REF_DIR)/1000G_omni2.5.b37.sites.PASS.vcf.gz | snpcall | Must be tabixed |
INDEL VCF File(s) | INDEL_PREFIX | $(REF_DIR)/1kg.pilot_release.merged.indels.sites.hg19 | snpcall | .chr#.vcf extension will be appended |
INDEL VCF File(s) | INDEL_VCF | alternate configuration setting if all INDEL sites are in a single VCF rather than broken up by chromosome | snpcall | Must be tabixed |
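
Before starting a pipeline it can help to confirm that the expected files are present. The sketch below is not part of GotCloud: missing_default_refs is a hypothetical helper, and the file names are simply copied from the Default Value column above, so adjust them if your REF_DIR holds different builds. The per-chromosome INDEL files are left out because their names depend on the chromosomes being processed.

```shell
# Print the default resource files that are absent from the given directory.
# File names are copied from the Default Value column of the table above.
missing_default_refs() {
    dir="$1"
    for f in \
        human.g1k.v37.fa \
        dbsnp_135.b37.vcf.gz \
        hapmap_3.3.b37.sites.vcf.gz \
        1000G_omni2.5.b37.sites.PASS.vcf.gz
    do
        [ -e "$dir/$f" ] || echo "$f"
    done
    return 0
}
```

For example, running missing_default_refs gotcloud/gotcloud.ref prints nothing when all four files are in place.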
Reference fasta Files
Reference Sequence in fasta format
- Contains reference base at each reference position
Configuration Key | Default Value |
---|---|
REF | $(REF_DIR)/human.g1k.v37.fa |
Pipeline | Use |
---|---|
align | mapping to reference, recalibration, quality control |
snpcall | pileup & identify variants, summarize filtered variants |
indel | discovery, genotyping |
Additional files generated from the reference fasta
In addition to the fasta file, a few additional files generated from the fasta are required.
- Already included with default reference files
- If you are using your own reference files, you will need to be sure to create these files
- Expected to be at the same location as the reference file
- Be sure to create these additional files using the version of the tool being run by GotCloud (by default they are in the gotcloud/bin/ directory)
- In the commands below, replace ref.fa with the path/name of the reference fasta file
Pipeline | Step | Required Extensions | Command to Create | More Information |
---|---|---|---|---|
align, snpcall, indel | | .fai | bin/samtools faidx ref.fa | |
align, snpcall, indel | | -bs.umfa | If it does not already exist, GotCloud automatically creates this file in the same directory as the REF file | |
align | bwa mapping | .amb, .ann, .bwt, .pac, .sa | bin/bwa index ref.fa | http://bio-bwa.sourceforge.net/bwa.shtml |
align | qplot | .winsize100.gc | bin/qplot --reference ref.fa | NOTE: Ignore the error at the end of qplot that says "FATAL ERROR - No SAM/BAM files provided, stopped!" This error is due to using qplot only to generate a GC content file without also processing a BAM file. |
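
When you supply your own reference, a quick check can show which of the samtools and bwa index files still need to be created. The helper below, missing_ref_indexes, is a hypothetical sketch rather than a GotCloud tool. It assumes the extensions are appended to the full fasta file name (ref.fa.fai, ref.fa.bwt, and so on), which is how samtools faidx and bwa index name their output. It does not cover the -bs.umfa or .winsize100.gc files.

```shell
# Print the samtools/bwa index files that are absent for a reference fasta.
# Assumes each extension is appended to the full fasta name (ref.fa.fai, ...).
missing_ref_indexes() {
    ref="$1"
    for ext in fai amb ann bwt pac sa; do
        [ -e "$ref.$ext" ] || echo "$ref.$ext"
    done
    return 0
}

# Commands from the table above, run from the gotcloud directory:
#   bin/samtools faidx ref.fa   # creates ref.fa.fai
#   bin/bwa index ref.fa        # creates ref.fa.amb, .ann, .bwt, .pac, .sa
```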
DBSNP VCF File
VCF file containing known dbsnp variant positions
- Must be bgzip'd and tabix'd
Configuration Key | Default Value |
---|---|
DBSNP_VCF | $(REF_DIR)/dbsnp_142.b37.vcf.gz |
Pipeline | Use |
---|---|
align | recalibration (exclude known dbsnps when generating recalibration tables) & qplot |
snpcall | generating filtered VCF summary statistics |
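
The "bgzip'd and tabix'd" requirement can be checked with a small sketch like the one below. is_tabixed is a hypothetical helper; it only looks at the file name and the presence of a .tbi index next to it, so it does not verify that the compression was really done with bgzip. The file name my_dbsnp.vcf in the comments is illustrative, and bgzip and tabix are assumed to be on your PATH.

```shell
# Succeed when a VCF has a .gz name and a tabix index (.tbi) beside it.
is_tabixed() {
    vcf="$1"
    case "$vcf" in
        *.gz) [ -e "$vcf" ] && [ -e "$vcf.tbi" ] ;;
        *)    return 1 ;;
    esac
}

# Typical preparation of a plain-text VCF (file name is illustrative):
#   bgzip my_dbsnp.vcf             # writes my_dbsnp.vcf.gz
#   tabix -p vcf my_dbsnp.vcf.gz   # writes my_dbsnp.vcf.gz.tbi
```

The same check applies to the HapMap3 and OMNI VCF files described in the following sections.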
HapMap3 VCF File
HapMap3 Polymorphic Sites VCF File
- Must be bgzip'd and tabix'd
Configuration Key | Default Value |
---|---|
HM3_VCF | $(REF_DIR)/hapmap_3.3.b37.sites.vcf.gz |
Pipeline | Use |
---|---|
align | verifyBamID (contamination checking) |
snpcall | generating filtered VCF summary statistics & positive example sites for SVM filtering |
OMNI VCF File
VCF file containing OMNI positions
- Must be bgzip'd and tabix'd
Configuration Key | Default Value |
---|---|
OMNI_VCF | $(REF_DIR)/1000G_omni2.5.b37.sites.PASS.vcf.gz |
Pipeline | Use |
---|---|
snpcall | positive example sites for SVM filtering |
INDEL VCF File(s)
VCF file containing known INDEL positions
Configuration Key | Default Value |
---|---|
INDEL_PREFIX | $(REF_DIR)/1kg.pilot_release.merged.indels.sites.hg19 |
INDEL_VCF | alternate configuration setting if all INDEL sites are in a single VCF rather than broken up by chromosome |
Pipeline | Use |
---|---|
snpcall | used to filter variants that are too close to a known indel |
- Use INDEL_PREFIX if path/ contains a separate file for each chromosome in the format: indels.sites.hg19.chr#.vcf for each # chromosome being processed
- Use INDEL_VCF if you have all chromosomes in a single VCF file (it can be, but does not have to be, a gz file)
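
To illustrate how an INDEL_PREFIX value maps to per-chromosome files, the sketch below appends the .chr#.vcf extension for each chromosome given. indel_vcf_names is a hypothetical helper for illustration only; the prefix path/indels.sites.hg19 is the placeholder used above.

```shell
# Print the per-chromosome INDEL VCF name for each chromosome given.
indel_vcf_names() {
    prefix="$1"
    shift
    for chr in "$@"; do
        echo "$prefix.chr$chr.vcf"
    done
}

indel_vcf_names "path/indels.sites.hg19" 20 X
```

This prints path/indels.sites.hg19.chr20.vcf and path/indels.sites.hg19.chrX.vcf, one per line.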
Downloadable Reference and Resource Files
- When running on Amazon, a default set of reference files is included in the GotCloud AMI in the default REF_DIR
Installing Genetic Reference and Resource Files
Choose a destination for these files and install them as shown below. We'll assume you will use gotcloud/gotcloud.ref. Replace gotcloud with the path to where you installed gotcloud.

    cd gotcloud # path to where you installed gotcloud

If you use a path other than a gotcloud.ref subdirectory of gotcloud, note this path, as you will need to set either of the following to the installation path:
- REF_DIR in your configuration file
- --ref_dir on the command-line
Get & Install the Resource Files
GotCloud makes use of various reference and other genetic resource files. You are free to use your own files, of course, but we also are making the files we use available.
- Human reference 37, dbsnp 135:
    wget ftp://anonymous@share.sph.umich.edu/gotcloud/ref/h37-db135-v3.tgz
    tar xzf h37-db135-v3.tgz
    rm -f h37-db135-v3.tgz
- Human reference 37, dbsnp 142:
    wget ftp://anonymous@share.sph.umich.edu/gotcloud/ref/h37-db142-v1.tgz
    tar xzf h37-db142-v1.tgz
    rm -f h37-db142-v1.tgz
- Human reference 37 with decoy, dbsnp 142:
    wget ftp://anonymous@share.sph.umich.edu/gotcloud/ref/hs37d5-db142-v1.tgz
    tar xzf hs37d5-db142-v1.tgz
    rm -f hs37d5-db142-v1.tgz
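
Since the three bundles follow the same download, extract, and clean-up pattern, the steps can be wrapped in a small function. This is only a sketch: ref_bundle_url and install_ref_bundle are hypothetical names, and the URL layout is inferred from the three links listed above.

```shell
# Build the download URL for a named bundle on the server listed above.
ref_bundle_url() {
    echo "ftp://anonymous@share.sph.umich.edu/gotcloud/ref/$1.tgz"
}

# Download, unpack, and clean up one bundle in the current directory.
# Each step runs only if the previous one succeeded, so the archive is
# removed only after a successful extraction.
install_ref_bundle() {
    bundle="$1"
    wget "$(ref_bundle_url "$bundle")" &&
        tar xzf "$bundle.tgz" &&
        rm -f "$bundle.tgz"
}

# Example (not run here): install_ref_bundle hs37d5-db142-v1
```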