GotCloud: Reference Files

From Genome Analysis Wiki
Revision as of 13:19, 6 October 2014 by Mktrost

GotCloud Reference Files

Reference files are required for running both the alignment and variant calling pipelines.

  • Genome sequence reference files (needed for both pipelines)
  • DBSNP site VCF file (needed for both pipelines)
  • HapMap site VCF file (needed for both pipelines)
  • Indel sites VCF file (needed for the variant calling pipeline)

The chromosome 20 reference files required for the tutorial are included with the tutorial example data in $GCDATA/chr20Ref/.

If you are running more than just chromosome 20, you will need whole-genome reference files, which can be downloaded from GotCloud: Genetic Reference and Resource Files.


If you are using these reference files, the only setting you need in your configuration file is REF_DIR, set to the full path of the directory where they are installed.
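
As a minimal sketch of the REF_DIR setting described above, the following Python helper builds the line to place in a configuration file. The helper name ref_dir_config_line and the "KEY = value" line syntax are assumptions made for illustration and are not taken from this page; check them against the GotCloud configuration documentation before relying on them.

```python
import os


def ref_dir_config_line(ref_dir):
    """Return a REF_DIR configuration line that uses a full (absolute) path."""
    # The page asks for the full path, so expand "~" and resolve relative paths.
    full_path = os.path.abspath(os.path.expanduser(ref_dir))
    # Assumed "KEY = value" syntax (hypothetical; verify for your GotCloud version).
    return "REF_DIR = " + full_path


if __name__ == "__main__":
    # Hypothetical install location, used only as an example.
    print(ref_dir_config_line("~/gotcloud/reference"))
```

Resolving the path up front avoids a configuration that only works from one working directory.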


Genome sequence reference file

  • FA_REF in the configuration file
  • human_g1k_v37_chr20* files in the Tutorial
  • Specify with a .fa or .fa.gz extension
  • Implies the existence of files at the same path and with the same name, with each of the following extensions appended:
    • .fai
      • fasta index file
    • .amb, .ann, .bwt, .pac, .rbwt, .rpac, .rsa, .sa
      • for the BWA step of the alignment pipeline
      • can be generated using TBD
    • .GCcontent
      • for the QPLOT step of the alignment pipeline
      • can be generated using TBD
  • Implies the existence of the following files with the same basename but a different suffix:
    • .dict
      • can be generated using TBD
    • -bs.umfa
      • can be generated using TBD
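
The naming rules above can be sketched in Python as a check that lists which companion files are missing for a given FA_REF value. This is an illustrative sketch, not part of GotCloud: the function names are hypothetical, the .GCcontent capitalization follows the tutorial file listing, and stripping both .fa and .fa.gz to obtain the basename is an assumption, since the page only shows the .fa case.

```python
import os

# Extensions appended to the full FA_REF path (ref.fa -> ref.fa.fai).
APPENDED_EXTENSIONS = [
    ".fai",                                      # fasta index
    ".amb", ".ann", ".bwt", ".pac",              # BWA step
    ".rbwt", ".rpac", ".rsa", ".sa",             # BWA step (continued)
    ".GCcontent",                                # QPLOT step
]
# Suffixes that replace the .fa/.fa.gz ending (ref.fa -> ref.dict).
BASENAME_SUFFIXES = [".dict", "-bs.umfa"]


def companion_files(fa_ref):
    """Return the companion file paths implied by an FA_REF path."""
    if fa_ref.endswith(".fa.gz"):
        base = fa_ref[:-len(".fa.gz")]   # assumption: basename drops .fa.gz
    elif fa_ref.endswith(".fa"):
        base = fa_ref[:-len(".fa")]
    else:
        raise ValueError("FA_REF must end in .fa or .fa.gz: " + fa_ref)
    appended = [fa_ref + ext for ext in APPENDED_EXTENSIONS]
    swapped = [base + suffix for suffix in BASENAME_SUFFIXES]
    return appended + swapped


def missing_companions(fa_ref):
    """Return the implied companion files that do not exist on disk."""
    return [path for path in companion_files(fa_ref) if not os.path.exists(path)]
```

For example, missing_companions("human_g1k_v37_chr20.fa") run inside the tutorial reference directory would be expected to return an empty list if all files are in place.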

DBSNP site VCF file

  • DBSNP_VCF in the configuration file
  • dbsnp135_chr20.vcf.gz* in the Tutorial
  • Specify with a .vcf.gz extension
  • Implies the existence of .vcf.gz.tbi, the vcf index file

HapMap site VCF file

  • HM3_VCF in the configuration file
  • hapmap_3.3.b37.sites.chr20.vcf.gz* in the Tutorial
  • Specify with a .vcf.gz extension
  • Implies the existence of .vcf.gz.tbi, the vcf index file
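
Both site VCF settings above (DBSNP_VCF and HM3_VCF) follow the same rule: a .vcf.gz path with a .vcf.gz.tbi index next to it. A minimal Python sketch of that check is shown here; the function name check_site_vcf is hypothetical, and the sketch only tests file names and existence, not whether the index is valid.

```python
import os


def check_site_vcf(vcf_path):
    """Return a list of problems for a site VCF path (empty if none found)."""
    problems = []
    # The setting must be given with a .vcf.gz extension.
    if not vcf_path.endswith(".vcf.gz"):
        problems.append("not a .vcf.gz path: " + vcf_path)
    if not os.path.exists(vcf_path):
        problems.append("missing VCF: " + vcf_path)
    # The index is expected at the same path with .tbi appended.
    index_path = vcf_path + ".tbi"
    if not os.path.exists(index_path):
        problems.append("missing index: " + index_path)
    return problems
```

Calling check_site_vcf on each of the two configured paths before starting a pipeline run gives an early, readable failure instead of an error partway through.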

Indel sites VCF file

  • INDEL_PREFIX in the configuration file
  • 1kg.pilot_release.merged.indels.sites.hg19 in the Tutorial
  • This is a prefix, so it excludes the extension, but it implies the existence of a .chrXX.vcf.gz file for each chromosome
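
The prefix rule above can be illustrated with a short Python sketch that expands INDEL_PREFIX into the file name implied for one chromosome. The function name indel_vcf_path is hypothetical, and accepting a leading "chr" on the chromosome argument is a convenience added here, not something this page specifies.

```python
def indel_vcf_path(indel_prefix, chrom):
    """Return the per-chromosome indel VCF path implied by INDEL_PREFIX."""
    chrom = str(chrom)
    # Accept "20" or "chr20"; the rule itself adds the "chr" part.
    if chrom.startswith("chr"):
        chrom = chrom[len("chr"):]
    return "{}.chr{}.vcf.gz".format(indel_prefix, chrom)


if __name__ == "__main__":
    prefix = "1kg.pilot_release.merged.indels.sites.hg19"
    print(indel_vcf_path(prefix, 20))
```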


Below is the list of chromosome 20 reference files required for the tutorial and included with the tutorial example data in $GCDATA/chr20Ref/:

1kg.pilot_release.merged.indels.sites.hg19.chr20.vcf 
dbsnp135_chr20.vcf.gz 
dbsnp135_chr20.vcf.gz.tbi 
hapmap_3.3.b37.sites.chr20.vcf.gz 
hapmap_3.3.b37.sites.chr20.vcf.gz.tbi 
human_g1k_v37_chr20-bs.umfa 
human_g1k_v37_chr20.dict 
human_g1k_v37_chr20.fa 
human_g1k_v37_chr20.fa.amb 
human_g1k_v37_chr20.fa.ann 
human_g1k_v37_chr20.fa.bwt 
human_g1k_v37_chr20.fa.fai 
human_g1k_v37_chr20.fa.GCcontent 
human_g1k_v37_chr20.fa.pac 
human_g1k_v37_chr20.fa.rbwt 
human_g1k_v37_chr20.fa.rpac 
human_g1k_v37_chr20.fa.rsa 
human_g1k_v37_chr20.fa.sa