Difference between revisions of "GotCloud: Genetic Reference and Resource Files"

From Genome Analysis Wiki
 
(23 intermediate revisions by the same user not shown)
Line 1: Line 1:
= Genetic Reference and Resource Files =
+
== Genetic Reference and Resource Files ==
  
 
Back to parent: [[GotCloud]]
 
Back to parent: [[GotCloud]]
Line 6: Line 6:
  
 
You can generate your own files or use the set available for [[#Downloadable Reference and Resource Files|download]].
 
You can generate your own files or use the set available for [[#Downloadable Reference and Resource Files|download]].
 +
* By default, GotCloud looks for the reference/resource files in the <code>gotcloud.ref</code> subdirectory within the base GotCloud directory
 +
* To look in a different directory, set your reference/resource file location by setting either of the following to that path:
 +
** <code>REF_DIR</code> in your configuration file
 +
** <code>--ref_dir</code> on the command-line
  
 +
{| class="wikitable" style="margin: 1em 1em 1em 0; background-color: #f9f9f9; border: 1px #aaa solid; border-collapse: collapse;" border="1"
 +
! Description !! Configuration Key !! Default Value !! Pipelines !! Special Info
 +
|-
 +
| [[#Reference fasta Files| Reference fasta]] || REF || $(REF_DIR)/human.g1k.v37.fa
 +
| align, snpcall, indel || [[#Additional files generated from the reference fasta|Additional Files Required]]
 +
|-
 +
| [[#DBSNP VCF File|DBSNP VCF File]] || DBSNP_VCF || $(REF_DIR)/dbsnp_135.b37.vcf.gz
 +
| align, snpcall || Must be tabixed
 +
|-
 +
| [[#HapMap3 VCF File|HapMap3 VCF File]] || HM3_VCF || $(REF_DIR)/hapmap_3.3.b37.sites.vcf.gz
 +
| align, snpcall || Must be tabixed
 +
|-
 +
| [[#OMNI VCF File|OMNI VCF File]] || OMNI_VCF || $(REF_DIR)/1000G_omni2.5.b37.sites.PASS.vcf.gz
 +
| snpcall || Must be tabixed
 +
|-
 +
| rowspan="2"|[[#INDEL VCF File(s)|INDEL VCF File(s)]] || INDEL_PREFIX || $(REF_DIR)/1kg.pilot_release.merged.indels.sites.hg19 || rowspan="2"|snpcall || .chr#.vcf extension will be appended
 +
|-
 +
| INDEL_VCF || ''alternate configuration setting if all INDEL sites are in a single VCF rather than broken up by chromosome''||Must be tabixed
 +
|}
  
== Required Files ==
 
  
=== Human Reference Files ===
+
=== Reference fasta Files ===
 
Reference Sequence in fasta format
 
Reference Sequence in fasta format
 +
* Contains reference base at each reference position
 +
 
{| class="wikitable" style="margin: 1em 1em 1em 0; background-color: #f9f9f9; border: 1px #aaa solid; border-collapse: collapse;" border="1"
 
{| class="wikitable" style="margin: 1em 1em 1em 0; background-color: #f9f9f9; border: 1px #aaa solid; border-collapse: collapse;" border="1"
 
! Configuration Key !! Default Value
 
! Configuration Key !! Default Value
Line 31: Line 55:
 
* Already included with default reference files
 
* Already included with default reference files
 
* If you are using your own reference files, you will need to be sure to create these files
 
* If you are using your own reference files, you will need to be sure to create these files
 +
** Expected to be at the same location as the reference file
 
** Be sure to create these additional files using the version of tool being run by GotCloud (by default they are in the <code>gotcloud/bin/</code> directory)
 
** Be sure to create these additional files using the version of tool being run by GotCloud (by default they are in the <code>gotcloud/bin/</code> directory)
 
** In the commands below, replace <code>ref.fa</code> with the path/name of the reference fasta file
 
** In the commands below, replace <code>ref.fa</code> with the path/name of the reference fasta file
Line 36: Line 61:
 
! Pipeline !! Step !! Required Extensions !! Command to Create !! More Information
 
! Pipeline !! Step !! Required Extensions !! Command to Create !! More Information
 
|-
 
|-
| all || ||.fai || <code>bin/samtools faidx ref.fa</code>
+
| align, snpcall, indel || ||.fai || <code>bin/samtools faidx ref.fa</code>
 
|-
 
|-
| all || || -bs.umfa || automatically created in same directory as REF file by GotCloud
+
| align, snpcall, indel || || -bs.umfa || || If it does not already exist, GotCloud automatically creates this file in same directory as the REF file
 
|-
 
|-
 
| align || bwa mapping || .amb, .ann, .bwt, .pac, .sa || <code>bin/bwa index ref.fa</code> || http://bio-bwa.sourceforge.net/bwa.shtml  
 
| align || bwa mapping || .amb, .ann, .bwt, .pac, .sa || <code>bin/bwa index ref.fa</code> || http://bio-bwa.sourceforge.net/bwa.shtml  
Line 51: Line 76:
  
 
=== DBSNP VCF File ===
 
=== DBSNP VCF File ===
 +
VCF file containing known dbsnp variant positions
 +
* Must be bgzip'd and tabix'd
 +
 +
{| class="wikitable" style="margin: 1em 1em 1em 0; background-color: #f9f9f9; border: 1px #aaa solid; border-collapse: collapse;" border="1"
 +
! Configuration Key !! Default Value
 +
|-
 +
| DBSNP_VCF || $(REF_DIR)/dbsnp_142.b37.vcf.gz
 +
|}
 +
{| class="wikitable" style="margin: 1em 1em 1em 0; background-color: #f9f9f9; border: 1px #aaa solid; border-collapse: collapse;" border="1"
 +
! Pipeline !! Use
 +
|-
 +
| align || recalibration (exclude known dbsnps when generating recalibration tables) & qplot
 +
|-
 +
| snpcall || generating filtered VCF summary statistics
 +
|}
 +
 +
=== HapMap3 VCF File ===
 +
HapMap3 Polymorphic Sites VCF File
 +
* Must be bgzip'd and tabix'd
 +
 +
{| class="wikitable" style="margin: 1em 1em 1em 0; background-color: #f9f9f9; border: 1px #aaa solid; border-collapse: collapse;" border="1"
 +
! Configuration Key !! Default Value
 +
|-
 +
| HM3_VCF || $(REF_DIR)/hapmap_3.3.b37.sites.vcf.gz
 +
|}
 +
{| class="wikitable" style="margin: 1em 1em 1em 0; background-color: #f9f9f9; border: 1px #aaa solid; border-collapse: collapse;" border="1"
 +
! Pipeline !! Use
 +
|-
 +
| align || verifyBamID (contamination checking)
 +
|-
 +
| snpcall || generating filtered VCF summary statistics & positive example sites for SVM filtering
 +
|}
 +
 +
=== OMNI VCF File ===
 +
VCF file containing OMNI positions
 +
* Must be bgzip'd and tabix'd
  
=== HapMap3 Polymorphic Sites VCF File ===
+
{| class="wikitable" style="margin: 1em 1em 1em 0; background-color: #f9f9f9; border: 1px #aaa solid; border-collapse: collapse;" border="1"
 +
! Configuration Key !! Default Value
 +
|-
 +
| OMNI_VCF || $(REF_DIR)/1000G_omni2.5.b37.sites.PASS.vcf.gz
 +
|}
 +
{| class="wikitable" style="margin: 1em 1em 1em 0; background-color: #f9f9f9; border: 1px #aaa solid; border-collapse: collapse;" border="1"
 +
! Pipeline !! Use
 +
|-
 +
| snpcall || positive example sites for SVM filtering
 +
|}
  
 
=== INDEL VCF File(s) ===
 
=== INDEL VCF File(s) ===
 +
VCF file containing known INDEL positions
  
=== OMNI VCF File ===
+
{| class="wikitable" style="margin: 1em 1em 1em 0; background-color: #f9f9f9; border: 1px #aaa solid; border-collapse: collapse;" border="1"
 +
! Configuration Key !! Default Value
 +
|-
 +
| INDEL_PREFIX || $(REF_DIR)/1kg.pilot_release.merged.indels.sites.hg19
 +
|-
 +
| INDEL_VCF || ''alternate configuration setting if all INDEL sites are in a single VCF rather than broken up by chromosome''
 +
|}
 +
{| class="wikitable" style="margin: 1em 1em 1em 0; background-color: #f9f9f9; border: 1px #aaa solid; border-collapse: collapse;" border="1"
 +
! Pipeline !! Use
 +
|-
 +
| snpcall || used to filter variants that are too close to a known indel
 +
|}
 +
 
 +
* Use <code>INDEL_PREFIX</code> if <code>path/</code> contains a separate file for each chromosome in the format: <code>indels.sites.hg19.chr#.vcf</code> for each <code>#</code> chromosome being processed
 +
* Use <code>INDEL_VCF</code> if you have all chromosomes in a single VCF file (it can be, but does not have to be a gz file)
 +
 
 +
== Downloadable Reference and Resource Files ==
 +
* When running on Amazon, a default set of reference files are included in the GotCloud AMI in the default <code>REF_DIR</code>
  
= Downloadable Reference and Resource Files =
 
  
 
'''Installing Genetic Reference and Resource Files'''
 
'''Installing Genetic Reference and Resource Files'''
Choose a destination for these files and install them as shown below.  We'll assume you will use '''/usr/local/gotcloud.ref'''.  If you use a different directory, replace /usr/local/gotcloud.ref with your path.
+
 
 +
Choose a destination for these files and install them as shown below.  We'll assume you will use '''gotcloud/gotcloud.ref'''.  Replace <code>gotcloud</code> with the path to where you installed gotcloud.
  
 
<code>
 
<code>
  <b>mkdir -p /usr/local/gotcloud.ref</b>   # Where you want the files installed
+
  <b>cd gotcloud</b>   # path to where you installed gotcloud
<b>cd /usr/local/gotcloud.ref</b>
 
 
</code>
 
</code>
  
Note this path as you will need to set the variable '''REF_DIR''' in the configuration file for gotcloud.
+
If you use a path other than a gotcloud.ref subdirectory of gotcloud, note this path as you will need to set either of the following to the installation path:
 +
* <code>REF_DIR</code> in your configuration file
 +
* <code>--ref_dir</code> on the command-line
 +
 
  
 +
'''Get & Install the Resource Files'''
  
'''Get the Resource Files'''
+
GotCloud makes use of various reference and other genetic resource files.
The GotCloud Aligner and Umake makes use of various reference and other genetic resource files.
 
 
You are free to use your own files, of course, but we also are making the files we use available.
 
You are free to use your own files, of course, but we also are making the files we use available.
  
<code>
+
<ul>
#  The easiest way to get the data:
+
<li> <div id="h37-db135">Human reference 37, dbsnp 135:</div></li>
 
  <b>wget ftp://anonymous@share.sph.umich.edu/gotcloud/ref/h37-db135-v3.tgz</b>
 
  <b>wget ftp://anonymous@share.sph.umich.edu/gotcloud/ref/h37-db135-v3.tgz</b>
 
#  Another way:
 
<b>ftp share.sph.umich.edu</b>
 
Connected to share.sph.umich.edu.
 
220 (vsFTPd 2.3.5)
 
Name (share.sph.umich.edu:tpg): <b>anonymous</b>
 
230 Login successful.
 
Remote system type is UNIX.
 
Using binary mode to transfer files.
 
ftp> <b>prompt</b>
 
Interactive mode off.
 
ftp> <b>cd gotcloud</b>
 
250 Directory successfully changed.
 
ftp> <b>mget ref/h37-db135-v3.tgz</b>
 
ftp> <b>quit</b>
 
221 Goodbye.
 
</code>
 
 
'''Install the Resource Files'''
 
 
<code>
 
 
  <b>tar xzf h37-db135-v3.tgz</b>
 
  <b>tar xzf h37-db135-v3.tgz</b>
 
  <b>rm -f h37-db135-v3.tgz</b>
 
  <b>rm -f h37-db135-v3.tgz</b>
</code>
+
<li><div id="h37-db142">Human reference 37, dbsnp 142:</div></li>
 
+
<b>wget ftp://anonymous@share.sph.umich.edu/gotcloud/ref/h37-db142-v1.tgz</b>
= Using Your own Reference Files =
+
<b>tar xzf h37-db142-v1.tgz</b>
 +
<b>rm -f h37-db142-v1.tgz</b>
 +
<li><div id="hs37d5-db142">Human reference 37 with decoy, dbsnp 142:</div></li>
 +
<b>wget ftp://anonymous@share.sph.umich.edu/gotcloud/ref/hs37d5-db142-v1.tgz</b>
 +
<b>tar xzf hs37d5-db142-v1.tgz</b>
 +
<b>rm -f hs37d5-db142-v1.tgz</b>
 +
</ul>
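
The extract-and-clean-up steps listed above can be wrapped in a small helper. This is only a sketch: the function name `install_ref_bundle` and both of its arguments are hypothetical, it downloads nothing itself, and it makes no assumption about the directory layout inside the archive.

```shell
# Extract a downloaded reference bundle into the GotCloud directory and
# delete the archive afterwards, mirroring the tar/rm steps listed above.
# Both arguments are paths you supply; nothing is downloaded here.
install_ref_bundle() {
    bundle="$1"
    gotcloud_dir="$2"
    if [ ! -f "$bundle" ]; then
        echo "bundle not found: $bundle" >&2
        return 1
    fi
    # Stop at the first failing step so the archive is only removed
    # after a successful extraction.
    mkdir -p "$gotcloud_dir" &&
        tar xzf "$bundle" -C "$gotcloud_dir" &&
        rm -f "$bundle"
}

# Example (download first, as in the list above):
#   wget ftp://anonymous@share.sph.umich.edu/gotcloud/ref/hs37d5-db142-v1.tgz
#   install_ref_bundle hs37d5-db142-v1.tgz /path/to/gotcloud
```

Chaining the steps with `&&` keeps the archive on disk if extraction fails, so the download does not have to be repeated.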

Latest revision as of 12:00, 14 May 2015

Genetic Reference and Resource Files

Back to parent: GotCloud

To run GotCloud, you need to provide genetic reference and resource files.

You can generate your own files or use the set available for download.

  • By default, GotCloud looks for the reference/resource files in the gotcloud.ref subdirectory within the base GotCloud directory
  • To use a different directory, set either of the following to that path:
    • REF_DIR in your configuration file
    • --ref_dir on the command-line
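
As a rough illustration of the two options above, the sketch below writes a one-line configuration file and prints an equivalent command line. The directory `/data/my.gotcloud.ref`, the file name `my_gotcloud.conf`, the `KEY = value` layout, and the `align` subcommand with `--conf` are all assumptions made for the example; only `REF_DIR` and `--ref_dir` come from this page.

```shell
# Hypothetical directory holding your own reference/resource files.
MY_REF_DIR="/data/my.gotcloud.ref"

# Scratch directory so the sketch does not touch an existing configuration.
WORK_DIR="$(mktemp -d)"
CONF_FILE="$WORK_DIR/my_gotcloud.conf"

# Option 1: set REF_DIR in the configuration file.
# (The "KEY = value" layout is assumed here.)
printf 'REF_DIR = %s\n' "$MY_REF_DIR" > "$CONF_FILE"
cat "$CONF_FILE"

# Option 2: pass the path on the command line instead.
# Printed rather than executed; "align" and "--conf" are assumptions.
REF_DIR_CMD="gotcloud align --conf $CONF_FILE --ref_dir $MY_REF_DIR"
echo "$REF_DIR_CMD"
```

Either option alone should be enough; the sketch shows both only for comparison.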
{| class="wikitable"
! Description !! Configuration Key !! Default Value !! Pipelines !! Special Info
|-
| Reference fasta || REF || $(REF_DIR)/human.g1k.v37.fa || align, snpcall, indel || Additional Files Required
|-
| DBSNP VCF File || DBSNP_VCF || $(REF_DIR)/dbsnp_135.b37.vcf.gz || align, snpcall || Must be tabixed
|-
| HapMap3 VCF File || HM3_VCF || $(REF_DIR)/hapmap_3.3.b37.sites.vcf.gz || align, snpcall || Must be tabixed
|-
| OMNI VCF File || OMNI_VCF || $(REF_DIR)/1000G_omni2.5.b37.sites.PASS.vcf.gz || snpcall || Must be tabixed
|-
| rowspan="2"|INDEL VCF File(s) || INDEL_PREFIX || $(REF_DIR)/1kg.pilot_release.merged.indels.sites.hg19 || rowspan="2"|snpcall || .chr#.vcf extension will be appended
|-
| INDEL_VCF || ''alternate configuration setting if all INDEL sites are in a single VCF rather than broken up by chromosome'' || Must be tabixed
|}


Reference fasta Files

Reference Sequence in fasta format

  • Contains reference base at each reference position
{| class="wikitable"
! Configuration Key !! Default Value
|-
| REF || $(REF_DIR)/human.g1k.v37.fa
|}
{| class="wikitable"
! Pipeline !! Use
|-
| align || mapping to reference, recalibration, quality control
|-
| snpcall || pileup & identify variants, summarize filtered variants
|-
| indel || discovery, genotyping
|}

Additional files generated from the reference fasta

In addition to the fasta file, GotCloud requires a few additional files that are generated from it.

  • Already included with the default reference files
  • If you use your own reference files, you must create these files yourself
    • They are expected to be at the same location as the reference file
    • Create them using the version of each tool that GotCloud runs (by default, the tools are in the gotcloud/bin/ directory)
    • In the commands below, replace ref.fa with the path/name of the reference fasta file
{| class="wikitable"
! Pipeline !! Step !! Required Extensions !! Command to Create !! More Information
|-
| align, snpcall, indel || || .fai || bin/samtools faidx ref.fa
|-
| align, snpcall, indel || || -bs.umfa || || If it does not already exist, GotCloud automatically creates this file in same directory as the REF file
|-
| align || bwa mapping || .amb, .ann, .bwt, .pac, .sa || bin/bwa index ref.fa || http://bio-bwa.sourceforge.net/bwa.shtml
|-
| align || qplot || .winsize100.gc || bin/qplot --reference ref.fa || NOTE: Ignore the error at the end of qplot that says:
 FATAL ERROR - 
 No SAM/BAM files provided, stopped!
This error is due to using qplot to just generate a GC Content file and not also process a BAM file.

QPLOT: InputFiles
|}
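
Before starting a run with your own reference, it can help to confirm that the samtools and bwa companion files listed in the table above are in place. The helper below is a sketch under one assumption: that each extension is appended to the full fasta file name (for example `ref.fa.fai`, `ref.fa.amb`). The function name `check_ref_companions` is hypothetical, and the `-bs.umfa` and `.winsize100.gc` files are left out because this page does not spell out their exact names.

```shell
# Report which samtools/bwa companion files of a reference fasta are missing.
# Assumes each extension is appended to the full fasta file name
# (ref.fa.fai, ref.fa.amb, ...). Returns 0 only when all are present.
check_ref_companions() {
    ref="$1"
    missing=0
    for ext in fai amb ann bwt pac sa; do
        if [ ! -e "$ref.$ext" ]; then
            echo "missing: $ref.$ext"
            missing=1
        fi
    done
    if [ "$missing" -eq 0 ]; then
        echo "all companion files present for $ref"
    fi
    return "$missing"
}

# Example:
#   check_ref_companions /path/to/ref.fa ||
#       echo "create them with bin/samtools faidx and bin/bwa index"
```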

DBSNP VCF File

VCF file containing known dbsnp variant positions

  • Must be bgzip'd and tabix'd
{| class="wikitable"
! Configuration Key !! Default Value
|-
| DBSNP_VCF || $(REF_DIR)/dbsnp_142.b37.vcf.gz
|}
{| class="wikitable"
! Pipeline !! Use
|-
| align || recalibration (exclude known dbsnps when generating recalibration tables) & qplot
|-
| snpcall || generating filtered VCF summary statistics
|}
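
The "bgzip'd and tabix'd" requirement above can be checked by file name before a run. The sketch below only inspects names (a `.gz` suffix and a `.tbi` file next to the VCF) and prints the `bgzip` or `tabix` command to run when something is missing; it does not verify that the file was compressed with bgzip rather than plain gzip. The function name `check_vcf_indexed` is hypothetical.

```shell
# Check that a VCF is compressed and has a tabix index next to it.
# Prints the command to run when something is missing; runs nothing itself.
check_vcf_indexed() {
    vcf="$1"
    case "$vcf" in
        *.gz) ;;
        *)
            echo "not compressed; run: bgzip $vcf"
            return 1
            ;;
    esac
    # tabix writes its index as <file>.tbi next to the compressed VCF.
    if [ ! -e "$vcf.tbi" ]; then
        echo "no index; run: tabix -p vcf $vcf"
        return 1
    fi
    echo "ok: $vcf"
}

# Example:
#   check_vcf_indexed /path/to/gotcloud.ref/dbsnp_142.b37.vcf.gz
```

The same check applies to the HapMap3 and OMNI VCF files described below.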

HapMap3 VCF File

HapMap3 Polymorphic Sites VCF File

  • Must be bgzip'd and tabix'd
{| class="wikitable"
! Configuration Key !! Default Value
|-
| HM3_VCF || $(REF_DIR)/hapmap_3.3.b37.sites.vcf.gz
|}
{| class="wikitable"
! Pipeline !! Use
|-
| align || verifyBamID (contamination checking)
|-
| snpcall || generating filtered VCF summary statistics & positive example sites for SVM filtering
|}

OMNI VCF File

VCF file containing OMNI positions

  • Must be bgzip'd and tabix'd
{| class="wikitable"
! Configuration Key !! Default Value
|-
| OMNI_VCF || $(REF_DIR)/1000G_omni2.5.b37.sites.PASS.vcf.gz
|}
{| class="wikitable"
! Pipeline !! Use
|-
| snpcall || positive example sites for SVM filtering
|}

INDEL VCF File(s)

VCF file containing known INDEL positions

{| class="wikitable"
! Configuration Key !! Default Value
|-
| INDEL_PREFIX || $(REF_DIR)/1kg.pilot_release.merged.indels.sites.hg19
|-
| INDEL_VCF || ''alternate configuration setting if all INDEL sites are in a single VCF rather than broken up by chromosome''
|}
{| class="wikitable"
! Pipeline !! Use
|-
| snpcall || used to filter variants that are too close to a known indel
|}
  • Use INDEL_PREFIX if your directory contains a separate file for each chromosome, named in the format indels.sites.hg19.chr#.vcf, where # is each chromosome being processed
  • Use INDEL_VCF if you have all chromosomes in a single VCF file (the file can be, but does not have to be, a gz file)
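
To see which per-chromosome files an INDEL_PREFIX setting implies, the naming rule above (".chr#.vcf" appended to the prefix) can be expanded by hand. The helper below is a sketch; the function name `list_indel_files` is hypothetical, and the chromosome list is whatever you plan to process.

```shell
# List the per-chromosome files implied by an INDEL_PREFIX value,
# following the rule that ".chr#.vcf" is appended to the prefix.
# Usage: list_indel_files PREFIX CHR [CHR ...]
# Returns 0 only when every expected file exists.
list_indel_files() {
    prefix="$1"
    shift
    status=0
    for chr in "$@"; do
        file="$prefix.chr$chr.vcf"
        if [ -e "$file" ]; then
            echo "found:   $file"
        else
            echo "missing: $file"
            status=1
        fi
    done
    return "$status"
}

# Example:
#   list_indel_files /path/to/gotcloud.ref/1kg.pilot_release.merged.indels.sites.hg19 20 21 22
```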

Downloadable Reference and Resource Files

  • When running on Amazon, a default set of reference files is included in the GotCloud AMI in the default REF_DIR


Installing Genetic Reference and Resource Files

Choose a destination for these files and install them as shown below. We'll assume you will use gotcloud/gotcloud.ref. Replace gotcloud with the path to where you installed gotcloud.

cd gotcloud   # path to where you installed gotcloud

If you install the files anywhere other than the gotcloud.ref subdirectory of gotcloud, note the path, because you will need to set either of the following to it:

  • REF_DIR in your configuration file
  • --ref_dir on the command-line


Get & Install the Resource Files

GotCloud uses various reference and other genetic resource files. You are free to use your own files, but we also make the files we use available.