
From Genome Analysis Wiki
= GotCloud Reference Files =
Reference files are required for running both the alignment and variant calling pipelines.
* Genome sequence reference files (needed for both pipelines)
* DBSNP site VCF file (needed for both pipelines)
* HapMap site VCF file (needed for both pipelines)
* Indel sites file (needed for the variant calling pipeline)

The chromosome 20 reference files required for the tutorial are included with the tutorial example data in $GCDATA/chr20Ref/.

If you are running more than just chromosome 20, you will need whole-genome reference files, which can be downloaded from [[GotCloudReference]].

TODO
If you are using these reference files, you only need to set REF_DIR in your configuration file to the full path of the directory where they are installed.




TODO, maybe only provide info on how to use the DEFAULT reference.
Move detailed description to generic gotcloud documentation. This is too much info for the Tutorial page!!!!

'''Genome sequence reference file'''
* FA_REF in the configuration file
* human_g1k_v37_chr20* files in the Tutorial
* Specify with a .fa or .fa.gz extension
* Implies the existence of the following files at the same path, with the same name with the following extensions appended:
** .fai
*** fasta index file
** .amb, .ann, .bwt, .pac, .rbwt, .rpac, .rsa, .sa
*** for the BWA step of the alignment pipeline
*** can be generated using TBD
** .GCcontent
*** for the QPLOT step of the alignment pipeline
*** can be generated using TBD
* Implies the existence of the following files with the same basename, but different extension:
** .dict
*** can be generated using TBD
** -bs.umfa
*** can be generated using TBD
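
The naming rules above can be sketched as a small Python check. This is only an illustration, not part of GotCloud: the function names <code>expected_companions</code> and <code>missing_companions</code> are hypothetical, and stripping <code>.fa.gz</code> as a whole to get the basename is an assumption (the tutorial data only shows the <code>.fa</code> case).

```python
from pathlib import Path

# Extensions appended to the full FA_REF path (e.g. ref.fa -> ref.fa.fai).
APPENDED = [".fai", ".amb", ".ann", ".bwt", ".pac",
            ".rbwt", ".rpac", ".rsa", ".sa", ".GCcontent"]
# Suffixes that replace the .fa / .fa.gz extension (e.g. ref.fa -> ref.dict).
REPLACED = [".dict", "-bs.umfa"]


def expected_companions(fa_ref):
    """List the companion file paths implied by an FA_REF value."""
    fa_ref = str(fa_ref)
    for ext in (".fa.gz", ".fa"):
        if fa_ref.endswith(ext):
            # Assumption: for .fa.gz the whole ".fa.gz" is dropped.
            base = fa_ref[:-len(ext)]
            break
    else:
        raise ValueError(f"FA_REF must end in .fa or .fa.gz: {fa_ref}")
    return [fa_ref + s for s in APPENDED] + [base + s for s in REPLACED]


def missing_companions(fa_ref):
    """Return the implied companion files that are not present on disk."""
    return [p for p in expected_companions(fa_ref) if not Path(p).exists()]
```

Calling <code>missing_companions</code> on the FA_REF path before starting a pipeline run gives a list of files that still need to be generated or copied into place.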

'''DBSNP site VCF file'''
* DBSNP_VCF in the configuration file
* dbsnp135_chr20.vcf.gz* in the Tutorial
* Specify with a .vcf.gz extension
* Implies the existence of .vcf.gz.tbi, the vcf index file

'''HapMap site VCF file'''
* HM3_VCF in the configuration file
* hapmap_3.3.b37.sites.chr20.vcf.gz* in the Tutorial
* Specify with a .vcf.gz extension
* Implies the existence of .vcf.gz.tbi, the vcf index file
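
Both site VCF files (DBSNP_VCF and HM3_VCF) follow the same rule: a <code>.vcf.gz</code> file with a <code>.vcf.gz.tbi</code> index beside it. A minimal Python sketch of that check follows; the helper names <code>tabix_index_for</code> and <code>has_index</code> are hypothetical and not part of GotCloud.

```python
from pathlib import Path


def tabix_index_for(vcf_path):
    """Return the .tbi index path implied by a .vcf.gz site file."""
    vcf_path = str(vcf_path)
    if not vcf_path.endswith(".vcf.gz"):
        raise ValueError(f"expected a .vcf.gz file, got: {vcf_path}")
    return vcf_path + ".tbi"


def has_index(vcf_path):
    """True only if both the VCF and its .tbi index exist on disk."""
    return (Path(vcf_path).is_file()
            and Path(tabix_index_for(vcf_path)).is_file())
```

For example, <code>tabix_index_for("dbsnp135_chr20.vcf.gz")</code> yields the name <code>dbsnp135_chr20.vcf.gz.tbi</code>, which matches the tutorial file listing.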

'''Indel sites VCF file'''
* INDEL_PREFIX in the configuration file
* 1kg.pilot_release.merged.indels.sites.hg19 in the Tutorial
* This is a prefix, so it excludes the extension, but it implies the existence of a .chrXX.vcf.gz file for each chromosome
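
How the prefix expands into per-chromosome file names can be sketched in Python as below. The function <code>indel_vcfs</code> is a hypothetical helper, not a GotCloud API, and the set of chromosomes is left to the caller. The default suffix follows the <code>.chrXX.vcf.gz</code> pattern stated above; the tutorial file listing shows an uncompressed <code>.chr20.vcf</code>, so the suffix is a parameter.

```python
def indel_vcfs(indel_prefix, chromosomes, suffix=".vcf.gz"):
    """Map each chromosome name to the indel sites file implied by INDEL_PREFIX."""
    return {c: f"{indel_prefix}.chr{c}{suffix}" for c in chromosomes}


# Example: the tutorial prefix expanded for chromosome 20 only.
tutorial = indel_vcfs("1kg.pilot_release.merged.indels.sites.hg19", ["20"])
```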


Below is the list of chromosome 20 reference files required for the tutorial and included with the tutorial example data in $GCDATA/chr20Ref/:
<pre>
1kg.pilot_release.merged.indels.sites.hg19.chr20.vcf
dbsnp135_chr20.vcf.gz
dbsnp135_chr20.vcf.gz.tbi
hapmap_3.3.b37.sites.chr20.vcf.gz
hapmap_3.3.b37.sites.chr20.vcf.gz.tbi
human_g1k_v37_chr20-bs.umfa
human_g1k_v37_chr20.dict
human_g1k_v37_chr20.fa
human_g1k_v37_chr20.fa.amb
human_g1k_v37_chr20.fa.ann
human_g1k_v37_chr20.fa.bwt
human_g1k_v37_chr20.fa.fai
human_g1k_v37_chr20.fa.GCcontent
human_g1k_v37_chr20.fa.pac
human_g1k_v37_chr20.fa.rbwt
human_g1k_v37_chr20.fa.rpac
human_g1k_v37_chr20.fa.rsa
human_g1k_v37_chr20.fa.sa
</pre>
