Changes

From Genome Analysis Wiki
Line 32: Line 32:  
= Input Data=
 
= Input Data=
 
*Aligned/Processed/Recalibrated BAM files
 
*Aligned/Processed/Recalibrated BAM files
*Index file containing Sample IDs & BAM file names
+
*BAM list file containing Sample IDs & BAM file names
 
*Reference files
 
*Reference files
 
*(Optional) Configuration file to override default options
 
*(Optional) Configuration file to override default options
   −
== BAM files ==
+
== BAM Files ==
The BAM files need to be duplicate-marked and base-quality recalibrated in order to obtain high quality SNP calls. Generating these BAM files from original FASTQs is documented as part of the [[Alignment Pipeline]] of GotCloud.
+
The BAM files need to be duplicate-marked and base-quality recalibrated in order to obtain high-quality SNP calls. Generating these BAM files from original FASTQs is done automatically as part of the [[Alignment Pipeline]] of GotCloud.
   −
== Index File ==
+
== BAM List File ==
Each line of the index file represents each individual under the following format. Note that multiple BAMs per individual may be provided. Note that if all samples are from the same population, just specify "ALL" for the population label for each sample.
+
Each line of the BAM list file represents a single individual.
[SAMPLE_ID]    [COMMA SEPARATED POPULATION LABELS] [BAM_FILE1] [BAM_FILE2] ...
      
Columns:
 
Columns:
 
# sample id
 
# sample id
# comma separated population labels
+
# comma separated population labels (optional column)
 
# BAM File 1 (preferable to have full paths to BAM files)
 
# BAM File 1 (preferable to have full paths to BAM files)
# BAM File 2 (if applicable)
+
# BAM File 2 (if more than 1 BAM per sample)
 
:...
 
:...
   −
: # BAM File N
+
: # BAM File N (if more than 1 BAM per sample)
 +
[SAMPLE_ID]    [COMMA SEPARATED POPULATION LABELS] [BAM_FILE1] [BAM_FILE2] ...
 +
or
 +
[SAMPLE_ID] [BAM_FILE1] [BAM_FILE2] ...
 +
 
 +
* Notes:
 +
** tab delimited
 +
** multiple BAMs per individual may be provided, but should all be on the same line of the list file
 +
** population label is optional - it will default to <code>ALL</code>
 +
*** population is only used by Thunder (part of ldrefine pipeline)
 +
*** if all samples are from the same population, population label can be skipped or you can just specify <code>ALL</code> for the population label for each sample.
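
The BAM list layout described in the notes above can be illustrated with a minimal parsing sketch. This is not GotCloud's own parser: the names <code>BamListEntry</code> and <code>parse_bam_list_line</code> are hypothetical, and the rule used to detect the optional population column (a second field that does not end in <code>.bam</code>) is an assumption made only for this example.

```python
from typing import List, NamedTuple


class BamListEntry(NamedTuple):
    """One line of a BAM list file (hypothetical helper type)."""
    sample_id: str
    populations: List[str]
    bam_files: List[str]


def parse_bam_list_line(line: str) -> BamListEntry:
    """Parse one tab-delimited BAM list line.

    Accepts either of the two documented layouts:
        SAMPLE_ID <tab> POP1,POP2 <tab> BAM1 <tab> BAM2 ...
        SAMPLE_ID <tab> BAM1 <tab> BAM2 ...
    """
    # Split on tabs and drop empty fields left by repeated tabs.
    fields = [f for f in line.rstrip("\r\n").split("\t") if f]
    if len(fields) < 2:
        raise ValueError("expected a sample ID and at least one BAM file")

    sample_id, rest = fields[0], fields[1:]

    # Assumption for this sketch: if the second column does not look like
    # a BAM path and more columns follow, treat it as the population column.
    if len(rest) > 1 and not rest[0].lower().endswith(".bam"):
        populations = rest[0].split(",")
        bam_files = rest[1:]
    else:
        # Population label omitted: default to ALL, as the notes describe.
        populations = ["ALL"]
        bam_files = rest

    return BamListEntry(sample_id, populations, bam_files)
```

For example, <code>parse_bam_list_line("Sample1\tEUR\t/data/a.bam\t/data/b.bam")</code> yields one entry with two BAM files, while a line without a population column falls back to the <code>ALL</code> label.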
    
== Reference Files ==
 
== Reference Files ==
 
The variant calling pipeline requires multiple reference files in order to work correctly.  
 
The variant calling pipeline requires multiple reference files in order to work correctly.  
   −
* Reference Sequence in fasta format.
+
See [[GotCloud: Genetic Reference and Resource Files]] for detailed information about the required reference files, including:
** Configuration File Setting:  <code>REF = path/file.fa</code>
+
* How to obtain default references
* Indel VCF File Prefix
+
* Configuration keys & default values
** Configuration File Setting:  <code>INDEL_PREFIX = path/indels.sites.hg19</code>
+
* How to generate your own references
** <code>path/</code> contains <code>indels.sites.hg19.chr20.vcf</code> for each chromosome being processed
+
* How to point GotCloud to your reference files
** Alternatively, if you have all chromosomes in a single VCF file, you can specify (it can be, but does not have to be a gz file): <code>INDEL_VCF = path/indel.sites.hg19.vcf</code>
  −
* DBSNP File vcf.gz file (must be indexed with tabix)
  −
** Configuration File Setting:  <code>DBSNP_VCF = path/dbsnp_135.b37.vcf.gz</code>
  −
* HapMap3 polymorphic site vcf.gz file (must be indexed with tabix)
  −
** Configuration File Setting:  <code>HM3_VCF = path/hapmap_3.3.b37.sites.vcf.gz</code>
  −
 
  −
See [[GotCloud: Genetic Reference and Resource Files]] for more information on the reference files.
     −
See [[GotCloud:_Genetic_Reference_and_Resource_Files#Downloadable Reference and Resource Files| GotCloud Downloadable Reference and Resource Files]] for instructions on downloading a set of reference files.
+
Required Reference File Types:
 +
* [[GotCloud: Genetic Reference and Resource Files#Reference fasta Files|Reference fasta Files]]
 +
* [[GotCloud: Genetic Reference and Resource Files#DBSNP VCF Files|DBSNP VCF Files]]
 +
* [[GotCloud: Genetic Reference and Resource Files#HapMap3 VCF Files|HapMap3 VCF Files]]
 +
* [[GotCloud: Genetic Reference and Resource Files#OMNI VCF Files|OMNI VCF Files]]
 +
* [[GotCloud: Genetic Reference and Resource Files#INDEL VCF File(s)|INDEL VCF File(s)]]
   −
Configuration File Example Reference Settings:
  −
REF = path/file.fa
  −
INDEL_PREFIX = path/indels.sites.hg19  # or INDEL_VCF = path/indels.sites.hg19.vcf  if all chromosomes are in a single VCF
  −
DBSNP_VCF = path/dbsnp_135_b37.vcf.gz
  −
HM3_VCF = path/hapmap_3.3.b37.sites.vcf.gz
      
== Configuration File ==
 
== Configuration File ==
