Line 1: |
Line 1: |
− | = Genetic Reference and Resource Files = | + | == Genetic Reference and Resource Files == |
| | | |
| Back to parent: [[GotCloud]] | | Back to parent: [[GotCloud]] |
Line 6: |
Line 6: |
| | | |
| You can generate your own files or use the set available for [[#Downloadable Reference and Resource Files|download]]. | | You can generate your own files or use the set available for [[#Downloadable Reference and Resource Files|download]]. |
| + | * By default, GotCloud looks for the reference/resource files in the <code>gotcloud.ref</code> subdirectory within the base GotCloud directory |
| + | * To use a different directory, set either of the following to that path: |
| + | ** <code>REF_DIR</code> in your configuration file |
| + | ** <code>--ref_dir</code> on the command-line |
| | | |
| + | {| class="wikitable" style="margin: 1em 1em 1em 0; background-color: #f9f9f9; border: 1px #aaa solid; border-collapse: collapse;" border="1" |
| + | ! Description !! Configuration Key !! Default Value !! Pipelines !! Special Info |
| + | |- |
| + | | [[#Reference fasta Files| Reference fasta]] || REF || $(REF_DIR)/human.g1k.v37.fa |
| + | | align, snpcall, indel || [[#Additional files generated from the reference fasta|Additional Files Required]] |
| + | |- |
| + | | [[#DBSNP VCF File|DBSNP VCF File]] || DBSNP_VCF || $(REF_DIR)/dbsnp_135.b37.vcf.gz |
| + | | align, snpcall || Must be tabixed |
| + | |- |
| + | | [[#HapMap3 VCF File|HapMap3 VCF File]] || HM3_VCF || $(REF_DIR)/hapmap_3.3.b37.sites.vcf.gz |
| + | | align, snpcall || Must be tabixed |
| + | |- |
| + | | [[#OMNI VCF File|OMNI VCF File]] || OMNI_VCF || $(REF_DIR)/1000G_omni2.5.b37.sites.PASS.vcf.gz |
| + | | snpcall || Must be tabixed |
| + | |- |
| + | | rowspan="2"|[[#INDEL VCF File(s)|INDEL VCF File(s)]] || INDEL_PREFIX || $(REF_DIR)/1kg.pilot_release.merged.indels.sites.hg19 || rowspan="2"|snpcall || .chr#.vcf extension will be appended |
| + | |- |
| + | | INDEL_VCF || ''alternate configuration setting if all INDEL sites are in a single VCF rather than broken up by chromosome''||Must be tabixed |
| + | |} |
| | | |
− | == Required Files ==
| |
| | | |
− | === Human Reference Files === | + | === Reference fasta Files === |
| + | Reference Sequence in fasta format |
| + | * Contains reference base at each reference position |
| | | |
− | === DBSNP VCF File === | + | {| class="wikitable" style="margin: 1em 1em 1em 0; background-color: #f9f9f9; border: 1px #aaa solid; border-collapse: collapse;" border="1" |
| + | ! Configuration Key !! Default Value |
| + | |- |
| + | | REF || $(REF_DIR)/human.g1k.v37.fa |
| + | |} |
| + | {| class="wikitable" style="margin: 1em 1em 1em 0; background-color: #f9f9f9; border: 1px #aaa solid; border-collapse: collapse;" border="1" |
| + | ! Pipeline !! Use |
| + | |- |
| + | | align || mapping to reference, recalibration, quality control |
| + | |- |
| + | | snpcall || pileup & identify variants, summarize filtered variants |
| + | |- |
| + | | indel || discovery, genotyping |
| + | |} |
| | | |
− | === === | + | ==== Additional files generated from the reference fasta ==== |
| + | In addition to the fasta file, a few additional files generated from it are required: |
| + | * Already included with default reference files |
| + | * If you are using your own reference files, you will need to be sure to create these files |
| + | ** Expected to be at the same location as the reference file |
| + | ** Be sure to create these additional files using the versions of the tools that GotCloud runs (by default they are in the <code>gotcloud/bin/</code> directory) |
| + | ** In the commands below, replace <code>ref.fa</code> with the path/name of the reference fasta file |
| + | {| class="wikitable" style="margin: 1em 1em 1em 0; background-color: #f9f9f9; border: 1px #aaa solid; border-collapse: collapse;" border="1" |
| + | ! Pipeline !! Step !! Required Extensions !! Command to Create !! More Information |
| + | |- |
| + | | align, snpcall, indel || ||.fai || <code>bin/samtools faidx ref.fa</code> |
| + | |- |
| + | | align, snpcall, indel || || -bs.umfa || || If it does not already exist, GotCloud automatically creates this file in the same directory as the REF file |
| + | |- |
| + | | align || bwa mapping || .amb, .ann, .bwt, .pac, .sa || <code>bin/bwa index ref.fa</code> || http://bio-bwa.sourceforge.net/bwa.shtml |
| + | |- |
| + | | align || qplot || .winsize100.gc || <code>bin/qplot --reference ref.fa</code> || NOTE: Ignore the error at the end of qplot that says: |
| + | <pre>FATAL ERROR - |
| + | No SAM/BAM files provided, stopped!</pre> |
| + | This error is due to using qplot to just generate a GC Content file and not also process a BAM file. |
| | | |
− | = Downloadable Reference and Resource Files =
| + | [[QPLOT#Input_files|QPLOT: InputFiles]] |
| + | |} |
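
As a rough way to check a custom reference before running GotCloud, the table above can be turned into a small shell helper. This is only a sketch: the function names are hypothetical, and the naming rules are assumptions (<code>.fai</code> and the bwa index extensions are taken to be appended to the full fasta name, while <code>-bs.umfa</code> and <code>.winsize100.gc</code> are taken to replace the <code>.fa</code> suffix).

```shell
#!/bin/sh
# Hypothetical helpers, not part of GotCloud.
# Assumption: .fai and the bwa index extensions are appended to the full fasta
# name; -bs.umfa and .winsize100.gc replace the trailing .fa.

# List the companion files expected next to a reference fasta.
expected_ref_files() {
    ref=$1
    base=${ref%.fa}
    for ext in fai amb ann bwt pac sa; do
        printf '%s.%s\n' "$ref" "$ext"
    done
    printf '%s-bs.umfa\n' "$base"
    printf '%s.winsize100.gc\n' "$base"
}

# Print every expected companion file that does not exist yet.
missing_ref_files() {
    expected_ref_files "$1" | while IFS= read -r f; do
        [ -e "$f" ] || printf '%s\n' "$f"
    done
}
```

Running <code>missing_ref_files path/ref.fa</code> prints one line per file that still needs to be generated. A missing <code>-bs.umfa</code> file is not a problem, since GotCloud creates it automatically.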
| | | |
− | '''Installing Genetic Reference and Resource Files'''
| + | === DBSNP VCF File === |
− | | + | VCF file containing known dbsnp variant positions |
− | | + | * Must be bgzip'd and tabix'd |
− | '''Get the Resource Files | |
− | '''
| |
− | The GotCloud Aligner and Umake makes use of various reference and other genetic resource files.
| |
− | You are free to use your own files, of course, but we also are making the files we use available.
| |
| | | |
− | <code>
| + | {| class="wikitable" style="margin: 1em 1em 1em 0; background-color: #f9f9f9; border: 1px #aaa solid; border-collapse: collapse;" border="1" |
− | # The easiest way to get the data:
| + | ! Configuration Key !! Default Value |
− | <b>cd /tmp</b>
| + | |- |
− | <b>wget ftp://anonymous@share.sph.umich.edu/gotcloud/ref/h37-db135-v2.tgz</b>
| + | | DBSNP_VCF || $(REF_DIR)/dbsnp_142.b37.vcf.gz |
| + | |} |
| + | {| class="wikitable" style="margin: 1em 1em 1em 0; background-color: #f9f9f9; border: 1px #aaa solid; border-collapse: collapse;" border="1" |
| + | ! Pipeline !! Use |
| + | |- |
| + | | align || recalibration (exclude known dbsnps when generating recalibration tables) & qplot |
| + | |- |
| + | | snpcall || generating filtered VCF summary statistics |
| + | |} |
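
The "bgzip'd and tabix'd" requirement can be checked with a short shell function. This is a sketch with a hypothetical function name; it assumes the tabix index sits beside the VCF with a <code>.tbi</code> extension, and it only checks file names and presence, not file contents.

```shell
#!/bin/sh
# Hypothetical helper, not part of GotCloud.
# Succeeds only if the file is named *.vcf.gz, exists, and has a .tbi index beside it.
vcf_is_indexed() {
    vcf=$1
    case $vcf in
        *.vcf.gz) ;;
        *) return 1 ;;
    esac
    [ -f "$vcf" ] && [ -f "$vcf.tbi" ]
}
```

If the check fails for an uncompressed VCF, <code>bgzip file.vcf</code> followed by <code>tabix -p vcf file.vcf.gz</code> produces the compressed file and its index.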
| | | |
− | # Another way:
| + | === HapMap3 VCF File === |
− | <b>cd /tmp</b>
| + | HapMap3 Polymorphic Sites VCF File |
− | <b>ftp share.sph.umich.edu</b>
| + | * Must be bgzip'd and tabix'd |
− | Connected to share.sph.umich.edu.
| |
− | 220 (vsFTPd 2.3.5)
| |
− | Name (share.sph.umich.edu:tpg): <b>anonymous</b>
| |
− | 230 Login successful.
| |
− | Remote system type is UNIX.
| |
− | Using binary mode to transfer files.
| |
− | ftp> <b>prompt</b>
| |
− | Interactive mode off.
| |
− | ftp> <b>cd gotcloud</b>
| |
− | 250 Directory successfully changed.
| |
− | ftp> <b>mget h37-db135.tar.gz</b>
| |
− | ftp> <b>quit</b>
| |
− | 221 Goodbye.
| |
− | </code>
| |
| | | |
− | '''Install the Resource Files'''
| + | {| class="wikitable" style="margin: 1em 1em 1em 0; background-color: #f9f9f9; border: 1px #aaa solid; border-collapse: collapse;" border="1" |
| + | ! Configuration Key !! Default Value |
| + | |- |
| + | | HM3_VCF || $(REF_DIR)/hapmap_3.3.b37.sites.vcf.gz |
| + | |} |
| + | {| class="wikitable" style="margin: 1em 1em 1em 0; background-color: #f9f9f9; border: 1px #aaa solid; border-collapse: collapse;" border="1" |
| + | ! Pipeline !! Use |
| + | |- |
| + | | align || verifyBamID (contamination checking) |
| + | |- |
| + | | snpcall || generating filtered VCF summary statistics & positive example sites for SVM filtering |
| + | |} |
| | | |
− | Choose a destination for these files and install them as shown below (we'll assume you will use '''/usr/local/gotcloud.ref''').
| + | === OMNI VCF File === |
| + | VCF file containing OMNI positions |
| + | * Must be bgzip'd and tabix'd |
| | | |
− | <code>
| + | {| class="wikitable" style="margin: 1em 1em 1em 0; background-color: #f9f9f9; border: 1px #aaa solid; border-collapse: collapse;" border="1" |
− | <b>mkdir -p /usr/local/gotcloud.ref</b> # Where you want the files installed
| + | ! Configuration Key !! Default Value |
− | <b>cd /usr/local/gotcloud.ref</b>
| + | |- |
− | <b>tar xzf h37-db135.tar.gz</b>
| + | | OMNI_VCF || $(REF_DIR)/1000G_omni2.5.b37.sites.PASS.vcf.gz |
− | <b>rm -f h37-db135.tar.gz</b>
| + | |} |
− | </code>
| + | {| class="wikitable" style="margin: 1em 1em 1em 0; background-color: #f9f9f9; border: 1px #aaa solid; border-collapse: collapse;" border="1" |
| + | ! Pipeline !! Use |
| + | |- |
| + | | snpcall || positive example sites for SVM filtering |
| + | |} |
| | | |
− | Note this path as you will need to set the variable '''REF_DIR''' in the configuration file for gotcloud.
| + | === INDEL VCF File(s) === |
| + | VCF file containing known INDEL positions |
| | | |
− | = Using Your own Reference Files = | + | {| class="wikitable" style="margin: 1em 1em 1em 0; background-color: #f9f9f9; border: 1px #aaa solid; border-collapse: collapse;" border="1" |
| + | ! Configuration Key !! Default Value |
| + | |- |
| + | | INDEL_PREFIX || $(REF_DIR)/1kg.pilot_release.merged.indels.sites.hg19 |
| + | |- |
| + | | INDEL_VCF || ''alternate configuration setting if all INDEL sites are in a single VCF rather than broken up by chromosome'' |
| + | |} |
| + | {| class="wikitable" style="margin: 1em 1em 1em 0; background-color: #f9f9f9; border: 1px #aaa solid; border-collapse: collapse;" border="1" |
| + | ! Pipeline !! Use |
| + | |- |
| + | | snpcall || used to filter variants that are too close to a known indel |
| + | |} |
| | | |
− | == Human Reference ==
| + | * Use <code>INDEL_PREFIX</code> if <code>path/</code> contains a separate file for each chromosome being processed, named in the format <code>indels.sites.hg19.chr#.vcf</code>, where <code>#</code> is the chromosome |
| + | * Use <code>INDEL_VCF</code> if you have all chromosomes in a single VCF file (it can be, but does not have to be, a gz file) |
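
To illustrate how an <code>INDEL_PREFIX</code> value maps to per-chromosome files, the sketch below appends <code>.chr#.vcf</code> to the prefix for each chromosome given. The function name is hypothetical and the example prefix is a placeholder.

```shell
#!/bin/sh
# Hypothetical helper, not part of GotCloud.
# Print the per-chromosome INDEL VCF path for each chromosome name given,
# by appending .chr#.vcf to the prefix.
indel_vcf_paths() {
    prefix=$1
    shift
    for chr in "$@"; do
        printf '%s.chr%s.vcf\n' "$prefix" "$chr"
    done
}
```

For example, <code>indel_vcf_paths path/indels.sites.hg19 20 21</code> prints the two file names expected for chromosomes 20 and 21.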
| | | |
− | === Generating BWA Reference Files === | + | == Downloadable Reference and Resource Files == |
− | Use "bwa index" to generate the human reference files with the required extensions:
| + | * When running on Amazon, a default set of reference files is included in the GotCloud AMI in the default <code>REF_DIR</code> |
− | * .amb
| |
− | * .ann
| |
− | * .bwt
| |
− | * .fai
| |
− | * .pac
| |
− | * .rbwt
| |
− | * .rpac
| |
− | * .rsa
| |
− | * .sa
| |
| | | |
− | See http://bio-bwa.sourceforge.net/bwa.shtml for more information about using "bwa index".
| |
| | | |
− | === Generating GC Content File ===
| + | '''Installing Genetic Reference and Resource Files''' |
− | The GC Content file is used by QPLOT. It is assumed to be at the same location as the reference file.
| |
| | | |
− | If the reference file is at path/ref.fa, the GC Content file is expected to be:path/ref.winsize100.gc
| + | Choose a destination for these files and install them as shown below. We'll assume you will use '''gotcloud/gotcloud.ref'''. Replace <code>gotcloud</code> with the path to where you installed gotcloud. |
| | | |
| + | <code> |
| + | <b>cd gotcloud</b> # path to where you installed gotcloud |
| + | </code> |
| | | |
− | To generate the GC content file, run qplot:
| + | If you install to a path other than the <code>gotcloud.ref</code> subdirectory of gotcloud, note that path, because you will need to set either of the following to it: |
− | GOTCLOUD_DIR/bin/qplot --reference reference.fa --winsize windowSize
| + | * <code>REF_DIR</code> in your configuration file |
− | * Replace reference.fa with the name of your human reference fasta file.
| + | * <code>--ref_dir</code> on the command-line |
− | * Replace windowSize with your desired window size, or leave out --winsize to use the default (100). | |
| | | |
− | NOTE: You will get an error at the end of qplot that says:
| |
− | <pre>
| |
− | FATAL ERROR -
| |
− | No SAM/BAM files provided, stopped!
| |
− | </pre>
| |
− | This error is due to using qplot to just generate a GC Content file and not also process a BAM file.
| |
| | | |
− | But it was successful as long as you see (where reference is the name of your reference file):
| + | '''Get & Install the Resource Files''' |
− | <pre>
| |
− | GC content file [ reference.winsize100.gc ] created.
| |
− | </pre>
| |
| | | |
| + | GotCloud makes use of various reference and other genetic resource files. |
| + | You are free to use your own files, but we also make the files we use available. |
| | | |
− | See [[QPLOT#Input_files|QPLOT: InputFiles]] for more information.
| + | <ul> |
| + | <li> <div id="h37-db135">Human reference 37, dbsnp 135:</div></li> |
| + | <b>wget ftp://anonymous@share.sph.umich.edu/gotcloud/ref/h37-db135-v3.tgz</b> |
| + | <b>tar xzf h37-db135-v3.tgz</b> |
| + | <b>rm -f h37-db135-v3.tgz</b> |
| + | <li><div id="h37-db142">Human reference 37, dbsnp 142:</div></li> |
| + | <b>wget ftp://anonymous@share.sph.umich.edu/gotcloud/ref/h37-db142-v1.tgz</b> |
| + | <b>tar xzf h37-db142-v1.tgz</b> |
| + | <b>rm -f h37-db142-v1.tgz</b> |
| + | <li><div id="hs37d5-db142">Human reference 37 with decoy, dbsnp 142:</div></li> |
| + | <b>wget ftp://anonymous@share.sph.umich.edu/gotcloud/ref/hs37d5-db142-v1.tgz</b> |
| + | <b>tar xzf hs37d5-db142-v1.tgz</b> |
| + | <b>rm -f hs37d5-db142-v1.tgz</b> |
| + | </ul> |