Difference between revisions of "GotCloud: Alignment Sub-Pipelines"

From Genome Analysis Wiki
 
(35 intermediate revisions by the same user not shown)
Line 21: Line 21:
 
This sub-pipeline takes in a single, recalibrated BAM file, creates an index file for it, and performs quality control (running qplot and verifyBamID). It differs from *bamQC* in that it does not require that the user already have a .bai file for the recalibrated BAM file.  
 
  
== recab ==  
+
== recab ==
 +
*What it does:
 +
# merge BAMs for samples that have multiple BAMs
 +
# dedup and recalibrate
 +
# index the recalibrated BAM
  
 +
====Inputs====
 +
* Bam files (stored in a [[#BAM_LIST File for recab|BAM_LIST File]])
 +
* Reference files
 +
* (Optional) configuration file to override default options
  
== recabQC ==  
+
=====BAM_LIST File for recab=====
== bamQC ==  
+
* Each line of the BAM list file represents a single individual
== bamQC_createIndex ==  
 
  
 +
Columns:
 +
# sample id
 +
# comma separated population labels (optional column)
 +
# BAM File 1 (preferable to have full paths to BAM files)
 +
# BAM File 2 (if more than 1 BAM per sample)
 +
:...
  
 +
: # BAM File N (if more than 1 BAM per sample)
 +
[SAMPLE_ID]    [COMMA SEPARATED POPULATION LABELS] [BAM_FILE1] [BAM_FILE2] ...
 +
or
 +
[SAMPLE_ID] [BAM_FILE1] [BAM_FILE2] ...
  
 +
* Notes:
 +
** tab delimited
 +
** multiple BAMs per individual may be provided, but should all be on the same line of the list file
 +
** population label is optional - it will default to <code>ALL</code>
 +
*** only used by Thunder (part of ldrefine pipeline)
 +
*** if all samples are from the same population, population label can be skipped or you can just specify <code>ALL</code> for the population label for each sample.
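
The column layout above can be illustrated with a small parser. This is a minimal sketch, not GotCloud's own parsing code: the names <code>BamListEntry</code> and <code>parse_bam_list_line</code> are hypothetical, and the rule used to detect the optional population column (a second column that does not end in <code>.bam</code>) is an assumption made for this example.

```python
from typing import List, NamedTuple


class BamListEntry(NamedTuple):
    sample_id: str
    populations: List[str]
    bams: List[str]


def parse_bam_list_line(line: str) -> BamListEntry:
    """Parse one tab-delimited BAM_LIST line (recab/recabQC layout).

    Assumption (not taken from GotCloud): the second column is treated
    as the optional population column only if it does not end in ".bam".
    """
    fields = [f.strip() for f in line.rstrip("\n").split("\t") if f.strip()]
    if len(fields) < 2:
        raise ValueError(f"expected a sample id and at least one BAM: {line!r}")
    sample_id, rest = fields[0], fields[1:]
    if not rest[0].lower().endswith(".bam"):
        populations = rest[0].split(",")  # comma separated population labels
        bams = rest[1:]
    else:
        populations = ["ALL"]  # documented default population label
        bams = rest
    if not bams:
        raise ValueError(f"no BAM files listed for sample {sample_id!r}")
    return BamListEntry(sample_id, populations, bams)
```

For example, a line with only a sample id and one BAM path yields the population <code>ALL</code>, while a line with several BAM paths keeps all of them under the same sample.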
  
 +
====Outputs====
 +
Upon successful completion of the *recab* sub-pipeline, you should see the following files/subdirectories under the user specified output directory:
 +
*'''recab/mergedBams/'''
 +
** '''''*/SAMPLE.merged.bam'' - a merged BAM file'''
 +
** ''*/SAMPLE.merged.bam.log'' - merge log
 +
** ''*/SAMPLE.merged.bam.OK'' - temp file indicating the merge step completed successfully
  
 +
* '''recab/'''
 +
** '''''*/SAMPLE.recal.bam'' - a merged, recalibrated, and deduped BAM file'''
 +
** '''''*/SAMPLE.recal.bam.bai'' - an indexed version of the  merged, recalibrated, and deduped BAM file'''
 +
** ''*/SAMPLE.recal.bam.metrics'' - dedup & recalibration log
 +
** ''*/SAMPLE.recal.bam.qemp'' - recalibration tables
 +
** ''*/SAMPLE.recal.bam.done'' - temp file indicating the recalibration step completed successfully
 +
** ''*/SAMPLE.recal.bam.bai.done'' - temp file indicating the indexing step completed successfully
 +
You should see .done and .OK files for each SAMPLE in the index file. If you do not see the .done and .OK files, then your *recab* sub-pipeline failed.
  
 +
'''On success, the recab/ folder contains the final BAMs and bais.'''
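
The success check described above can be scripted. The following is a minimal sketch under stated assumptions: the function name <code>missing_recab_markers</code> is hypothetical, the <code>*</code> in each pattern stands for the intermediate subdirectory that the listing above leaves unnamed, and the <code>check_merge</code> switch exists only because this page does not say whether single-BAM samples also get a merge marker.

```python
import glob
import os
from typing import Dict, List


def missing_recab_markers(out_dir: str, samples: List[str],
                          check_merge: bool = True) -> Dict[str, List[str]]:
    """Return, per sample, the marker-file patterns that matched no file."""
    # Patterns mirror the output listing; "*" is the unnamed subdirectory.
    patterns = [
        os.path.join("recab", "*", "{s}.recal.bam.done"),
        os.path.join("recab", "*", "{s}.recal.bam.bai.done"),
    ]
    if check_merge:
        patterns.append(
            os.path.join("recab", "mergedBams", "*", "{s}.merged.bam.OK"))

    missing: Dict[str, List[str]] = {}
    for s in samples:
        absent = []
        for p in patterns:
            # Escape literal parts so odd characters are not read as wildcards.
            full = os.path.join(glob.escape(out_dir), p.format(s=glob.escape(s)))
            if not glob.glob(full):
                absent.append(p.format(s=s))
        if absent:
            missing[s] = absent
    return missing
```

An empty result means every listed sample has all of its markers; any sample that appears in the result is a candidate for rerunning.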
  
 +
===Command-Line and Configuration Options===
  
 +
*Required Options
 +
{| class="wikitable" style="margin: 1em 1em 1em 0; background-color: #f9f9f9; border: 1px #aaa solid; border-collapse: collapse;" border="1"
 +
! Command-line Flag !! Configuration Key !! Value Description !! Default Value
 +
|-
 +
| --list/--bam_list/--bamlist ''file'' || BAM_LIST || path to the [[#BAM_LIST File for recab|BAM_LIST File]] || $(OUT_DIR)/bam.list
 +
|-
 +
| --numjobs ''#'' || || number of jobs to run in parallel || 0 (generate Makefile of steps, but do not run)
 +
|}
  
 +
*Common Options
  
The automated test runs the alignment pipeline on a small set of test data and checks the results against expected results, validating that GotCloud is installed correctly.  
+
{| class="wikitable" style="margin: 1em 1em 1em 0; background-color: #f9f9f9; border: 1px #aaa solid; border-collapse: collapse;" border="1"
 +
! Command-line Flag !! Configuration Key !! Value Description !! Default Value
 +
|-
 +
| --outdir ''path'' || OUT_DIR || output directory ||
 +
|-
 +
| --conf ''file'' || || configuration file to use ||
 +
|-
 +
|  || REF_DIR || where the reference/resource files are stored || gotcloud.ref subdirectory within the base GotCloud directory
 +
|-
 +
| || REF || [[GotCloud: Genetic Reference and Resource Files#Reference fasta Files|Reference fasta Files]] || $(REF_DIR)/human.g1k.v37.fa
 +
|-
 +
| || DBSNP_VCF || [[GotCloud: Genetic Reference and Resource Files#DBSNP VCF Files|DBSNP VCF Files]] || $(REF_DIR)/dbsnp_135.b37.vcf.gz
 +
|}
  
*Run alignment pipeline test:
+
==== Example Configuration File ====
  gotcloud align --test OUTPUT_DIR
+
Example configuration file where the reference files happen to be stored in /path/reference and the BAM list file is /path/freeze5.bam.list
where OUTPUT_DIR is the directory where you want to store the test results
+
BAM_LIST = /path/freeze5.bam.list
 +
OUT_DIR = /path/freeze5/output
 +
  REF_DIR = /path/reference/
 +
REF = $(REF_DIR)/hs37d5.fa
 +
DBSNP_VCF = $(REF_DIR)/dbsnp_135.b37.sites.vcf.gz
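
To see how a value such as <code>$(REF_DIR)/hs37d5.fa</code> resolves, the sketch below reads <code>KEY = VALUE</code> lines and expands <code>$(KEY)</code> references by plain text substitution. It is an illustration only, not GotCloud's configuration parser: the function name <code>load_config</code>, the treatment of lines starting with <code>#</code> as comments, and the choice to leave unknown keys unexpanded are all assumptions of this example.

```python
import re
from typing import Dict

_REFERENCE = re.compile(r"\$\((\w+)\)")


def load_config(text: str) -> Dict[str, str]:
    """Parse KEY = VALUE lines and expand $(KEY) references."""
    raw: Dict[str, str] = {}
    for line in text.splitlines():
        line = line.strip()
        # Assumption: blank lines and "#" lines carry no settings.
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, value = line.split("=", 1)
        raw[key.strip()] = value.strip()

    def expand(value: str, depth: int = 0) -> str:
        if depth > 20:
            raise ValueError("too many nested $(KEY) references (cycle?)")

        def substitute(match: "re.Match[str]") -> str:
            key = match.group(1)
            if key not in raw:
                return match.group(0)  # leave unknown keys as written
            return expand(raw[key], depth + 1)

        return _REFERENCE.sub(substitute, value)

    return {key: expand(value) for key, value in raw.items()}
```

With the example file above, plain substitution gives <code>/path/reference//hs37d5.fa</code> for <code>REF</code>, because <code>REF_DIR</code> already ends in a slash; POSIX path resolution treats the doubled slash like a single one.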
  
If you see "Successfully ran the test case, congratulations!", then you are ready to align samples.
+
==== Example Command Line ====
 +
gotcloud pipe --name recab --numjobs <N>
  
== Input Data:==  
+
== recabQC ==
*Raw Sequence (FASTQ) files
+
*What it does:
*FASTQ List file mapping fastq pairs to sample (optional: Read Group information)
+
# merge BAMs for samples that have multiple BAMs
*Reference files
+
# dedup and recalibrate
*(Optional) Configuration file to override default options
+
# index the recalibrated BAM
 +
# qplot
 +
# verifyBamID
  
=== Raw Sequence (FASTQ) files ===
+
====Inputs====
 +
* Bam files (stored in a [[#BAM_LIST File for recabQC|BAM_LIST]] file)
 +
* Reference files
 +
* (Optional) configuration file to override default options
  
These are the FASTQ files that need to be mapped to BAM files.
+
=====BAM_LIST File for recabQC=====
 +
* Each line of the BAM list file represents a single individual
  
These files are specified in the [[#FASTQ List File|FASTQ List File]].  
+
Columns:
 +
# sample id
 +
# comma separated population labels (optional column)
 +
# BAM File 1 (preferable to have full paths to BAM files)
 +
# BAM File 2 (if more than 1 BAM per sample)
 +
:...
  
=== FASTQ List File ===
+
: # BAM File N (if more than 1 BAM per sample)
This file specifies the FASTQ files that need to be processed. It maps the FASTQ pairs to the associated Sample ID. Optionally, Read Group information for the FASTQ pairs can be specified. If the Read Group information is not specified, it is inferred.  
+
[SAMPLE_ID]    [COMMA SEPARATED POPULATION LABELS] [BAM_FILE1] [BAM_FILE2] ...
 +
or
 +
  [SAMPLE_ID] [BAM_FILE1] [BAM_FILE2] ...
  
This file is specified either via the command line parameter <code>--list</code> or via the configuration file setting <code>FASTQ_LIST</code>.
+
* Notes:
 +
** tab delimited
 +
** multiple BAMs per individual may be provided, but should all be on the same line of the list file
 +
** population label is optional - it will default to <code>ALL</code>
 +
*** only used by Thunder (part of ldrefine pipeline)
 +
*** if all samples are from the same population, population label can be skipped or you can just specify <code>ALL</code> for the population label for each sample.
  
The command-line setting takes precedence over the configuration file setting.  
+
====Outputs====
 +
Upon successful completion of the *recabQC* sub-pipeline, you should see the following files/subdirectories under the user specified output directory:
 +
*'''recab/mergedBams/''' - contains merge results
 +
** '''''*/SAMPLE.merged.bam'' - a merged BAM file'''
 +
** ''*/SAMPLE.merged.bam.log'' - merge log
 +
** ''*/SAMPLE.merged.bam.OK'' - temp file indicating the merge step completed successfully
  
The FASTQ list is a tab-delimited file that starts with a header line. The columns may be in any order.  
+
* '''recab/''' - contains recalibration results
 +
** '''''*/SAMPLE.recal.bam'' - a merged, recalibrated, and deduped BAM file'''
 +
** '''''*/SAMPLE.recal.bam.bai'' - an indexed version of the merged, recalibrated, and deduped BAM file'''
 +
** ''*/SAMPLE.recal.bam.metrics'' - dedup & recalibration log
 +
** ''*/SAMPLE.recal.bam.qemp'' - recalibration tables
 +
** ''*/SAMPLE.recal.bam.done'' - temp file indicating the recalibration step completed successfully
 +
** ''*/SAMPLE.recal.bam.bai.done'' - temp file indicating the indexing step completed successfully
  
Following the header line, there is one line per single-end read and one line per paired-end read (only 1 line per pair).  
+
* '''QCFiles/''' - contains quality control results
 +
** VerifyBamID Output - see [[VerifyBamID#A_guideline_to_interpret_output_files|VerifyBamID: A guideline to interpret output files]] for more information
 +
*** ''*/SAMPLE.genoCheck.depthRG'' - depth distribution of the sequence reads per read group
 +
*** ''*/SAMPLE.genoCheck.depthSM'' - depth distribution of the sequence reads per sample
 +
*** ''*/SAMPLE.genoCheck.err'' - log file
 +
*** ''*/SAMPLE.genoCheck.log'' - log file
 +
*** ''*/SAMPLE.genoCheck.OK'' - temp file indicating the VerifyBAMID step completed successfully
 +
*** ''*/SAMPLE.genoCheck.selfRG'' - per-readGroup statistics describing how well each lane matches to the annotated sample
 +
*** '''''*/SAMPLE.genoCheck.selfSM'' - main output file containing the contamination estimate; per-sample statistics describing how well the sample matches to the annotated sample'''
 +
**** Check the 'FREEMIX' column for the genotype-free estimate of contamination (0-1 scale; lower is better)
 +
**** If [FREEMIX] >= 0.03 and [FREELK1]-[FREELK0] is large, possible contamination
 +
** Qplot Output - see: [[QPLOT#Diagnose_sequencing_quality|QPLOT: Diagnose sequencing quality]] for more info on how to use QPLOT results
 +
*** ''*/SAMPLE.qplot.OK'' - temp file indicating the qplot step completed successfully
 +
*** '''''*/SAMPLE.qplot.R'' - Rscript that can be used to generate the pdf graphs'''
 +
*** '''''*/SAMPLE.qplot.stats'' - sample statistics'''
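
The contamination guideline for the selfSM file can be applied programmatically. The sketch below assumes a tab-delimited file whose header row names the <code>FREEMIX</code>, <code>FREELK1</code>, and <code>FREELK0</code> columns and whose first column identifies the sample. The function name <code>flag_contamination</code> is hypothetical. Because this page does not say what counts as a "large" likelihood difference or which sign it has, the caller must supply <code>min_lk_diff</code>, and the difference is compared by magnitude.

```python
import csv
from typing import Iterable, List, Tuple


def flag_contamination(lines: Iterable[str], *, min_lk_diff: float,
                       freemix_cutoff: float = 0.03
                       ) -> List[Tuple[str, float, float]]:
    """Return (sample, FREEMIX, |FREELK1 - FREELK0|) for suspect samples."""
    reader = csv.DictReader(lines, delimiter="\t")
    flagged = []
    for row in reader:
        sample = row[reader.fieldnames[0]]  # first column identifies the sample
        freemix = float(row["FREEMIX"])
        # Magnitude only: the sign convention is not stated on this page.
        lk_diff = abs(float(row["FREELK1"]) - float(row["FREELK0"]))
        if freemix >= freemix_cutoff and lk_diff >= min_lk_diff:
            flagged.append((sample, freemix, lk_diff))
    return flagged
```

Passing an open file handle for a <code>SAMPLE.genoCheck.selfSM</code> file as <code>lines</code> works, since the reader only needs an iterable of text lines.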
  
Required Column Names:
+
You should see .done and .OK files for each SAMPLE in the index file. If you do not see the .done and .OK files, then your *recabQC* sub-pipeline failed.
* MERGE_NAME - base name for the resulting BAM file for the sample (used to group multiple fastqs or fastq pairs into a single BAM)
 
** The SAMPLE column can be specified instead of MERGE_NAME. SAMPLE will be used for both the sample and the base name.
 
* FASTQ1 - name of the fastq or the first in the pair if paired-end. (Only 1 line per pair)
 
  
Optional Column Names:
+
'''On success, the recab/ folder contains the final BAMs and bais, while the QCFiles/ folder contains the quality control output'''
* FASTQ2 - name of the 2nd fastq in paired-end reads.  Specify '.' if the column exists, but this line is single-ended.
 
* RGID - Read Group ID for this entry
 
** If this field is not specified, the first line of the fastq will be used to determine the RG.
 
*** If the first line does not match the expected format for determining RG, incrementing numbers per fastq file will be used.
 
* SAMPLE - Sample Name for this entry
 
** If SAMPLE is not specified, MERGE_NAME will be used for the sample name
 
* LIBRARY - Library for this entry
 
** If LIBRARY is not specified, the sample name will be used
 
* CENTER - Center Name for this entry
 
** If CENTER is not specified, it will default to "unknown"
 
* PLATFORM - Platform for this entry
 
** If PLATFORM is not specified, it will default to ILLUMINA
 
  
The RGID, SAMPLE, LIBRARY, CENTER, and PLATFORM are used to populate the Read Group information for this entry. 
+
===Command-Line and Configuration Options===
  
MERGE_NAME FASTQ1 FASTQ2 RGID SAMPLE LIBRARY CENTER PLATFORM
+
*Required Options
Sample1 fastq/S1/F1_R1.fastq.gz fastq/S1/F1_R2.fastq.gz RGID1 SampleID1 Lib1 UM ILLUMINA
+
{| class="wikitable" style="margin: 1em 1em 1em 0; background-color: #f9f9f9; border: 1px #aaa solid; border-collapse: collapse;" border="1"
Sample1 fastq/S1/F2_R1.fastq.gz fastq/S1/F2_R2.fastq.gz RGID1a SampleID1 Lib1 UM ILLUMINA
+
! Command-line Flag !! Configuration Key !! Value Description !! Default Value
Sample2 fastq/S2/F1_R1.fastq.gz fastq/S2/F1_R2.fastq.gz RGID2 SampleID2 Lib2 UM ILLUMINA
+
|-
Sample2 fastq/S2/F2.fastq.gz . RGID2 SampleID2 Lib2 UM ILLUMINA
+
| --list/--bam_list/--bamlist ''file'' || BAM_LIST || path to the [[#BAM_LIST File for recabQC|BAM_LIST File]] || $(OUT_DIR)/bam.list
 +
|-
 +
| --numjobs ''#'' || || number of jobs to run in parallel || 0 (generate Makefile of steps, but do not run)
 +
|}
  
The <code>--fastq_prefix</code>/<code>FASTQ_PREFIX</code> setting can be used to specify a prefix to the FASTQ1/FASTQ2 file paths that should be applied before using the files.
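
The column defaults described for the FASTQ list can be summarized in code. This is a sketch of the documented defaults only, not the pipeline's implementation: <code>read_fastq_list</code> is a hypothetical name, the prefix is assumed to be joined to each FASTQ path as a directory, and a missing <code>RGID</code> is left as <code>None</code> rather than inferred from the FASTQ contents.

```python
import csv
import os
from typing import Dict, Iterable, List, Optional


def read_fastq_list(lines: Iterable[str],
                    fastq_prefix: str = "") -> List[Dict[str, Optional[str]]]:
    """Read a tab-delimited FASTQ list and apply the documented defaults."""

    def with_prefix(path: str) -> str:
        # Assumption: the prefix is joined to the path as a directory.
        return os.path.join(fastq_prefix, path) if fastq_prefix else path

    entries = []
    for row in csv.DictReader(lines, delimiter="\t"):
        # SAMPLE may stand in for MERGE_NAME.
        merge_name = row.get("MERGE_NAME") or row.get("SAMPLE")
        fastq1 = row.get("FASTQ1")
        if not merge_name or not fastq1:
            raise ValueError("each line needs MERGE_NAME (or SAMPLE) and FASTQ1")
        sample = row.get("SAMPLE") or merge_name
        fastq2 = row.get("FASTQ2")
        entries.append({
            "MERGE_NAME": merge_name,
            "FASTQ1": with_prefix(fastq1),
            # "." marks a single-ended line when the FASTQ2 column exists.
            "FASTQ2": None if fastq2 in (None, "", ".") else with_prefix(fastq2),
            "RGID": row.get("RGID") or None,  # inference is not modeled here
            "SAMPLE": sample,
            "LIBRARY": row.get("LIBRARY") or sample,
            "CENTER": row.get("CENTER") or "unknown",
            "PLATFORM": row.get("PLATFORM") or "ILLUMINA",
        })
    return entries
```

Because the header names the columns, the reader does not depend on column order, which matches the statement that columns may appear in any order.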
+
*Common Options
  
=== Reference Files ===
+
{| class="wikitable" style="margin: 1em 1em 1em 0; background-color: #f9f9f9; border: 1px #aaa solid; border-collapse: collapse;" border="1"
See [[GotCloud: Genetic Reference and Resource Files]] for detailed information about the multiple required reference files for the alignment pipeline, including:
+
! Command-line Flag !! Configuration Key !! Value Description !! Default Value
* How to obtain default references
+
|-
* Configuration keys & default values
+
| --outdir ''path'' || OUT_DIR || output directory ||
* How to generate your own references
+
|-
* How to point GotCloud to your reference files
+
| --conf ''file'' || || configuration file to use ||
 +
|-
 +
|  || REF_DIR || where the reference/resource files are stored || gotcloud.ref subdirectory within the base GotCloud directory
 +
|-
 +
| || REF || [[GotCloud: Genetic Reference and Resource Files#Reference fasta Files|Reference fasta Files]] || $(REF_DIR)/human.g1k.v37.fa
 +
|-
 +
| || DBSNP_VCF || [[GotCloud: Genetic Reference and Resource Files#DBSNP VCF File|DBSNP VCF Files]] || $(REF_DIR)/dbsnp_135.b37.vcf.gz
 +
|-
 +
| || HM3_VCF || [[GotCloud: Genetic Reference and Resource Files#HapMap3 VCF File|HapMap3 VCF Files]] || $(REF_DIR)/hapmap_3.3.b37.sites.vcf.gz
 +
|}
  
Required Reference File Types:
+
==== Example Configuration File ====
* [[GotCloud: Genetic Reference and Resource Files#Reference fasta Files|Reference fasta Files]]
+
Example configuration file where the reference files happen to be stored in /path/reference and the BAM list file is /path/freeze5.bam.list
* [[GotCloud: Genetic Reference and Resource Files#DBSNP VCF Files|DBSNP VCF Files]]
+
BAM_LIST = /path/freeze5.bam.list
* [[GotCloud: Genetic Reference and Resource Files#HapMap3 VCF Files|HapMap3 VCF Files]]
+
OUT_DIR = /path/freeze5/output
 +
REF_DIR = /path/reference/
 +
REF = $(REF_DIR)/hs37d5.fa
 +
DBSNP_VCF = $(REF_DIR)/dbsnp_135.b37.sites.vcf.gz
 +
HM3_VCF = $(REF_DIR)/hapmap3_r3_b37.sites.vcf.gz
  
=== Configuration File ===  
+
==== Example Command Line ====
{{:GotCloud: Configuration}}
+
gotcloud pipe --name recabQC --numjobs <N>
  
==== Recommended Settings ====
+
== bamQC ==
 +
*What it does:
 +
# qplot
 +
# verifyBamID
  
As of GotCloud version 1.16, the alignment pipeline uses <code>bwa mem</code> by default.  Prior to version 1.16, the default aligner was <code>bwa aln</code>. 
+
====Inputs====
 +
* Single merged, recalibrated, and deduped BAM file for each subject (stored in a [[#BAM_LIST File for bamQC|BAM_LIST File]])
 +
* BAI file for each subject
 +
* Reference files
 +
* (Optional) configuration file to override default options
  
You can override the defaults by setting in your configuration file:
+
=====BAM_LIST File for bamQC=====
* to use <code>bwa mem</code> (you do not need to set this in version 1.16 and later since it is the default)
+
* Each line of the BAM list file represents a single individual
MAP_TYPE = BWA_MEM
 
* to use <code>bwa aln</code> (you do not need to set this prior to version 1.16 since it is the default)
 
MAP_TYPE = BWA
 
  
==== Additional Required Settings ====
+
Columns:
 +
# sample id
 +
# comma separated population labels (optional column)
 +
# BAM File (preferable to have full path to BAM file)
 +
# BAI File (preferable to have full path to BAI file)
  
See [[#FASTQ List File|FASTQ List File]] for how to set the index file either via command line options or via configuration.
+
[SAMPLE_ID] [COMMA SEPARATED POPULATION LABELS] [BAM_FILE] [BAI_FILE]
 +
or
 +
[SAMPLE_ID] [BAM_FILE] [BAI_FILE]
  
==== Turning Off Optional Steps====
+
* Notes:
Quality Control steps can be disabled.  
+
** tab delimited
 +
** population label is optional - it will default to <code>ALL</code>
 +
*** only used by Thunder (part of ldrefine pipeline)
 +
*** if all samples are from the same population, population label can be skipped or you can just specify <code>ALL</code> for the population label for each sample.
  
To Disable QPLOT, remove qplot from the PER_MERGE_STEPS configuration by setting:  
+
====Outputs====
PER_MERGE_STEPS = verifyBamID index recab
+
Upon successful completion of the *bamQC* sub-pipeline, you should see the following files/subdirectories under the user specified output directory:
 +
 
 +
* '''QCFiles/''' - contains quality control results
 +
** VerifyBamID Output - see [[VerifyBamID#A_guideline_to_interpret_output_files|VerifyBamID: A guideline to interpret output files]] for more information
 +
*** ''*/SAMPLE.genoCheck.depthRG'' - depth distribution of the sequence reads per read group
 +
*** ''*/SAMPLE.genoCheck.depthSM'' - depth distribution of the sequence reads per sample
 +
*** ''*/SAMPLE.genoCheck.err'' - log file
 +
*** ''*/SAMPLE.genoCheck.log'' - log file
 +
*** ''*/SAMPLE.genoCheck.OK'' - temp file indicating the VerifyBAMID step completed successfully
 +
*** ''*/SAMPLE.genoCheck.selfRG'' - per-readGroup statistics describing how well each lane matches to the annotated sample
 +
*** '''''*/SAMPLE.genoCheck.selfSM'' - main output file containing the contamination estimate; per-sample statistics describing how well the sample matches to the annotated sample'''
 +
**** Check the 'FREEMIX' column for the genotype-free estimate of contamination (0-1 scale; lower is better)
 +
**** If [FREEMIX] >= 0.03 and [FREELK1]-[FREELK0] is large, possible contamination
 +
** Qplot Output - see: [[QPLOT#Diagnose_sequencing_quality|QPLOT: Diagnose sequencing quality]] for more info on how to use QPLOT results
 +
*** ''*/SAMPLE.qplot.OK'' - temp file indicating the qplot step completed successfully
 +
*** '''''*/SAMPLE.qplot.R'' - Rscript that can be used to generate the pdf graphs'''
 +
*** '''''*/SAMPLE.qplot.stats'' - sample statistics'''
  
 +
You should see .done and .OK files for each SAMPLE in the index file. If you do not see the .done and .OK files, then your *bamQC* sub-pipeline failed.
  
To Disable VerifyBamID, remove verifyBamID from the PER_MERGE_STEPS configuration by setting:
+
'''On success, the QCFiles/ folder contains the quality control output'''
PER_MERGE_STEPS = qplot index recab
 
  
==== Optional Configurable Settings ====
+
===Command-Line and Configuration Options===
You may want to adjust the amount of memory/threads that are used:
 
  
There are additional configurable settings, but these are the ones most likely to be adjusted.
+
*Required Options
 +
{| class="wikitable" style="margin: 1em 1em 1em 0; background-color: #f9f9f9; border: 1px #aaa solid; border-collapse: collapse;" border="1"
 +
! Command-line Flag !! Configuration Key !! Value Description !! Default Value
 +
|-
 +
| --list/--bam_list/--bamlist ''file'' || BAM_LIST || path to the [[#BAM_LIST File for bamQC|BAM_LIST File]] || $(OUT_DIR)/bam.list
 +
|-
 +
| --numjobs ''#'' || || number of jobs to run in parallel || 0 (generate Makefile of steps, but do not run)
 +
|}
  
* BWA_THREADS = -t N
+
*Common Options
** Fill in the N with the number of threads you want BWA to run with, default is 1
 
* SORT_MAX_MEM = 2000000000
 
** Maximum amount of memory used by samtools sort after running bwa
 
  
== Running the Alignment Pipeline ==
+
{| class="wikitable" style="margin: 1em 1em 1em 0; background-color: #f9f9f9; border: 1px #aaa solid; border-collapse: collapse;" border="1"
 +
! Command-line Flag !! Configuration Key !! Value Description !! Default Value
 +
|-
 +
| --outdir ''path'' || OUT_DIR || output directory ||
 +
|-
 +
| --conf ''file'' || || configuration file to use ||
 +
|-
 +
|  || REF_DIR || where the reference/resource files are stored || gotcloud.ref subdirectory within the base GotCloud directory
 +
|-
 +
| || REF || [[GotCloud: Genetic Reference and Resource Files#Reference fasta Files|Reference fasta Files]] || $(REF_DIR)/human.g1k.v37.fa
 +
|-
 +
| || DBSNP_VCF || [[GotCloud: Genetic Reference and Resource Files#DBSNP VCF File|DBSNP VCF Files]] || $(REF_DIR)/dbsnp_135.b37.vcf.gz
 +
|-
 +
| || HM3_VCF || [[GotCloud: Genetic Reference and Resource Files#HapMap3 VCF File|HapMap3 VCF Files]] || $(REF_DIR)/hapmap_3.3.b37.sites.vcf.gz
 +
|}
  
=== Command-Line Options ===  
+
==== Example Configuration File ====
* help - print usage
+
Example configuration file where the reference files happen to be stored in /path/reference and the BAM list file is /path/freeze5.bam.list
* test OUTPUT_DIR - run the test example placing the output in a user specified OUTPUT_DIR.  No other options are required.
+
BAM_LIST = /path/freeze5.bam.list
* outdir OUTPUT_DIR - directory for the output
+
OUT_DIR = /path/freeze5/output
** May also be specified via OUT_DIR in the configuration file  
+
REF_DIR = /path/reference/
** Required to be set either via command-line or configuration
+
REF = $(REF_DIR)/hs37d5.fa
* conf CONFIG_FILE - configuration file
+
DBSNP_VCF = $(REF_DIR)/dbsnp_135.b37.sites.vcf.gz
* list FASTQ_LIST_FILE_NAME  - name of the fastq list file  
+
HM3_VCF = $(REF_DIR)/hapmap3_r3_b37.sites.vcf.gz
** May also be specified via FASTQ_LIST in the configuration file
 
** Required to be set either via command-line or configuration
 
* ref_dir REFERENCE_DIR - value to set config key REF_DIR to, overriding other values, REF_DIR can then be used inside config files.
 
** May also be specified via REF_DIR in the configuration file
 
* ref_prefix REFERENCE_DIR - path to prepend to non-absolute REF paths.
 
** May also be specified via REF_PREFIX in the configuration file
 
* fastq_prefix FASTQ_PATH - prefix path to the fastq files specified in the FASTQ_LIST
 
** May also be specified via FASTQ_PREFIX in the configuration file
 
* base_prefix BASE_PATH - prefix path to prepend to fastq/ref files without absolute paths
 
** May also be specified via BASE_PREFIX in the configuration file
 
* keepTmp - Do not remove the temporary files (removed by default)  
 
** May also be specified via KEEP_TMP in the configuration file
 
* keepLog - Do not remove the intermediate log files (removed by default)  
 
** May also be specified via KEEP_LOG in the configuration file
 
* numjobs N - Replace N with the number of samples that should be processed in parallel
 
* threads N - Replace N with the number of targets in each makefile that should be run in parallel
 
* dryrun - Create the Makefile, but do not run it
 
* maxlocaljobs N - Replace N with the maximum number of jobs that can be run locally (no batchtype specified). Default is 10.
 
* batchtype TYPE - Tells GotCloud the specified batch type to send jobs to the client nodes
 
** May also be specified via BATCH_TYPE in the configuration file
 
** Can be: mosix, slurm, slurmi, pbs, sge, sgei
 
* batchopts OPTS - Tells GotCloud the options to pass onto the batch system
 
** May also be specified via BATCH_OPTS in the configuration file
 
* noPhoneHome - disable the phone home logic
 
* gotcloudroot DIR - Specifies an alternate path to other gotcloud files rather than using the path to the gotcloud/align.pl.
 
Note: Command-line options take priority over configuration file settings
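
The precedence rule in the note above can be pictured as three layers merged in order. The sketch below is illustrative only: <code>resolve_settings</code> is a hypothetical helper, and the keys and values in the usage example are placeholders rather than GotCloud defaults.

```python
from typing import Dict, Optional


def resolve_settings(defaults: Dict[str, str],
                     config: Dict[str, str],
                     cli: Dict[str, Optional[str]]) -> Dict[str, str]:
    """Merge settings so that command line > configuration file > defaults."""
    merged = dict(defaults)
    merged.update(config)  # configuration file overrides built-in defaults
    # Only options actually given on the command line (not None) override.
    merged.update({k: v for k, v in cli.items() if v is not None})
    return merged


settings = resolve_settings(
    defaults={"REF_DIR": "gotcloud.ref"},
    config={"OUT_DIR": "/conf/out", "REF_DIR": "/path/reference"},
    cli={"OUT_DIR": "/cli/out", "REF_DIR": None},
)
```

Here <code>OUT_DIR</code> ends up as the command-line value, while <code>REF_DIR</code> keeps the configuration file value because no command-line value was given.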
 
  
===Running the Alignment Pipeline===  
+
==== Example Command Line ====
Run <code>gotcloud align</code> with the appropriate command-line parameters.
+
gotcloud pipe --name bamQC --numjobs <N>
  
Example:  
+
== bamQC_createIndex ==
gotcloud align --conf config.txt --outdir output
+
*What it does:  
 +
# creates a BAI file for any BAM that is missing it
 +
# qplot
 +
# verifyBamID
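
Before choosing between *bamQC* and *bamQC_createIndex*, it can help to see which BAMs lack an index. This sketch assumes the index sits next to the BAM under the same name with <code>.bai</code> appended; the function name <code>bams_missing_index</code> is hypothetical.

```python
import os
from typing import Iterable, List


def bams_missing_index(bam_paths: Iterable[str]) -> List[str]:
    """Return the BAM paths that have no "<bam>.bai" file beside them."""
    return [bam for bam in bam_paths if not os.path.exists(bam + ".bai")]
```

If the returned list is empty, every BAM already has an index next to it.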
  
This step generates 1 Makefile per sample in the output/Makefiles/ directory and then automatically runs them.  The Makefiles contain all of the information to run each sample.
+
====Inputs====
 +
* Single merged, recalibrated, and deduped BAM file for each subject (stored in a [[#BAM_LIST File for bamQC_createIndex|BAM_LIST File]])
 +
* Reference files
 +
* (Optional) configuration file to override default options
  
If you only want to generate the makefiles and not run them, use the <code>--dryrun</code> option.  It will generate the Makefiles and print instructions for running the Makefiles.
+
=====BAM_LIST File for bamQC_createIndex=====
 +
* Each line of the BAM list file represents a single individual
  
Each Makefile is independent and can be run in parallel and across a cloud.
+
Columns:
 +
# sample id
 +
# comma separated population labels (optional column)
 +
# BAM File (preferable to have full paths to BAM files)
  
On success, you will see:
+
[SAMPLE_ID] [COMMA SEPARATED POPULATION LABELS] [BAM_FILE]
  Processing finished in nn secs with no errors reported
+
or
 +
  [SAMPLE_ID] [BAM_FILE]
  
If processing fails part way through, you can pick up where you left off by rerunning gotcloud or the make command.
+
* Notes:
 +
** tab delimited
 +
** population label is optional - it will default to <code>ALL</code>
 +
*** only used by Thunder (part of ldrefine pipeline)
 +
*** if all samples are from the same population, population label can be skipped or you can just specify <code>ALL</code> for the population label for each sample.
  
=== Alignment Pipeline Output ===
+
====Outputs====
Upon successful completion of the alignment pipeline, you should see the following files/ subdirectories under the user specified output directory:  
+
Upon successful completion of the *bamQC_createIndex* sub-pipeline, you should see the following files/subdirectories under the user specified output directory:
* '''bam.list''' - file containing sample->BAM mapping that can be used in other GotCloud pipelines
+
* A BAI file with the exact same path and name as the BAM file that was input, with *.bai on the end
* '''bams/''' - contains the final BAM and bai (BAM index) files
+
* '''QCFiles/''' - contains quality control results  
** '''*.recal.bam'''
 
** '''*.recal.bam.bai'''
 
** ''*.recal.bam.bai.done'' - temp file indicating this step completed successfully
 
** ''*.recal.bam.done'' - temp file indicating this step completed successfully
 
** *.recal.bam.metrics - dedup & recalibration log
 
** *.recal.bam.qemp - recalibration tables
 
* Makefiles/ - contains the Makefiles and logs used by GotCloud to run the alignment pipeline
 
* '''QCFiles/''' - contains quality control results if quality control is not disabled
 
 
** VerifyBamID Output - see [[VerifyBamID#A_guideline_to_interpret_output_files|VerifyBamID: A guideline to interpret output files]] for more information
 
*** *.genoCheck.depthRG - depth distribution of the sequence reads per read group
+
*** ''*/SAMPLE.genoCheck.depthRG'' - depth distribution of the sequence reads per read group
*** *.genoCheck.depthSM - depth distribution of the sequence reads per sample
+
*** ''*/SAMPLE.genoCheck.depthSM'' - depth distribution of the sequence reads per sample
*** ''*.genoCheck.done'' - temp file indicating this step completed successfully
+
*** ''*/SAMPLE.genoCheck.err'' - log file
*** *.genoCheck.selfRG - per-readGroup statistics describing how well each lane matches to the annotated sample
+
*** ''*/SAMPLE.genoCheck.log'' - log file
*** '''*.genoCheck.selfSM''' - main output file containing the contamination estimate; per-sample statistics describing how well the sample matches to the annotated sample
+
*** ''*/SAMPLE.genoCheck.OK'' - temp file indicating the VerifyBAMID step completed successfully
 +
*** ''*/SAMPLE.genoCheck.selfRG'' - per-readGroup statistics describing how well each lane matches to the annotated sample
 +
*** '''''*/SAMPLE.genoCheck.selfSM'' - main output file containing the contamination estimate; per-sample statistics describing how well the sample matches to the annotated sample'''
 
**** Check the 'FREEMIX' column for the genotype-free estimate of contamination (0-1 scale; lower is better)
 
 
**** If [FREEMIX] >= 0.03 and [FREELK1]-[FREELK0] is large, possible contamination
 
 
** Qplot Output - see: [[QPLOT#Diagnose_sequencing_quality|QPLOT: Diagnose sequencing quality]] for more info on how to use QPLOT results
 
*** ''*.qplot.done'' - temp file indicating this step completed successfully
+
*** ''*/SAMPLE.qplot.OK'' - temp file indicating the qplot step completed successfully
*** '''*.qplot.R''' - Rscript that can be used to generate the pdf graphs
+
*** '''''*/SAMPLE.qplot.R'' - Rscript that can be used to generate the pdf graphs'''
*** '''*.qplot.stats''' - sample statistics
+
*** '''''*/SAMPLE.qplot.stats'' - sample statistics'''
* tmp/ - contains intermediate files (most are deleted unless --keepTmp is specified)
+
 
Latest revision as of 21:37, 18 March 2015

Back to parent: GotCloud


List of Alignment Sub-Pipelines

recab

This sub-pipeline takes in a list of BAM files for each sample, merges the BAMs for samples that have multiple BAMs, dedups and recalibrates, and then indexes the recalibrated BAM.

recabQC

This sub-pipeline does everything that *recab* does (takes in a list of BAM files for each sample, merges the BAMs for samples that have multiple BAMs, dedups and recalibrates, and then indexes the recalibrated BAM), and then performs quality control by running qplot and verifyBamID.

bamQC

This sub-pipeline takes in a single, recalibrated BAM file and its index file (.bai) and performs quality control (running qplot and verifyBamID). It differs from *bamQC_createIndex* in that it requires that the user already have .bai files for the recalibrated BAM files.

bamQC_createIndex

This sub-pipeline takes in a single, recalibrated BAM file, creates an index file for it, and performs quality control (running qplot and verifyBamID). It differs from *bamQC* in that it does not require that the user already have a .bai file for the recalibrated BAM file.

recab

  • What it does:
  1. merge BAMs for samples that have multiple BAMs
  2. dedup and recalibrate
  3. index the recalibrated BAM

Inputs

  • Bam files (stored in a BAM_LIST File)
  • Reference files
  • (Optional) configuration file to override default options
BAM_LIST File for recab
  • Each line of the BAM list file represents a single individual

Columns:

  1. sample id
  2. comma separated population labels (optional column)
  3. BAM File 1 (preferable to have full paths to BAM files)
  4. BAM File 2 (if more than 1 BAM per sample)
  ...
  N. BAM File N (if more than 1 BAM per sample)
[SAMPLE_ID]    [COMMA SEPARATED POPULATION LABELS] [BAM_FILE1] [BAM_FILE2] ...

or

[SAMPLE_ID] [BAM_FILE1] [BAM_FILE2] ...
  • Notes:
    • tab delimited
    • multiple BAMs per individual may be provided, but should all be on the same line of the list file
    • population label is optional - it will default to ALL
      • only used by Thunder (part of ldrefine pipeline)
      • if all samples are from the same population, population label can be skipped or you can just specify ALL for the population label for each sample.
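
The column layout above can be illustrated with a minimal Python sketch that splits one line of a BAM list into its parts. This is not GotCloud's own parser; the function and type names are hypothetical, and the rule used to detect the optional population column (a second field that does not end in .bam) is an assumption made for illustration.

```python
from typing import List, NamedTuple


class BamListEntry(NamedTuple):
    sample_id: str
    populations: List[str]
    bams: List[str]


def parse_bam_list_line(line: str) -> BamListEntry:
    """Split one tab-delimited BAM list line into its columns."""
    fields = [f.strip() for f in line.rstrip("\n").split("\t") if f.strip()]
    if len(fields) < 2:
        raise ValueError("expected a sample id and at least one BAM file")
    sample_id, rest = fields[0], fields[1:]
    # Assumption: a second column that does not end in ".bam" is the
    # optional comma-separated population-label column.
    if not rest[0].lower().endswith(".bam"):
        populations, bams = rest[0].split(","), rest[1:]
    else:
        # The population label defaults to ALL when the column is omitted.
        populations, bams = ["ALL"], rest
    if not bams:
        raise ValueError(f"no BAM files listed for sample {sample_id}")
    return BamListEntry(sample_id, populations, bams)
```

Running each line of a list file through such a function before starting the pipeline is one way to catch a sample whose BAMs were split across several lines or separated by spaces instead of tabs.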

Outputs

Upon successful completion of the *recab* sub-pipeline, you should see the following files/subdirectories under the user specified output directory:

  • recab/mergedBams/
    • */SAMPLE.merged.bam - a merged BAM file
    • */SAMPLE.merged.bam.log - merge log
    • */SAMPLE.merged.bam.OK - temp file indicating the merge step completed successfully
  • recab/
    • */SAMPLE.recal.bam - a merged, recalibrated, and deduped BAM file
    • */SAMPLE.recal.bam.bai - the index for the merged, recalibrated, and deduped BAM file
    • */SAMPLE.recal.bam.metrics - dedup & recalibration log
    • */SAMPLE.recal.bam.qemp - recalibration tables
    • */SAMPLE.recal.bam.done - temp file indicating the recalibration step completed successfully
    • */SAMPLE.recal.bam.bai.done - temp file indicating the indexing step completed successfully

You should see .done and .OK files for each SAMPLE in the BAM list file. If you do not see the .done and .OK files, then your *recab* sub-pipeline failed.

On success, the recab/ folder contains the final BAMs and bais.
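
The completion check described above can be sketched in Python. This is only an illustration, not part of GotCloud: the function name is hypothetical, the check covers only the two .done markers, and the recursive search under recab/ is an assumption based on the `*/` prefix in the file listing.

```python
import glob
from pathlib import Path
from typing import Dict, List

# Marker suffixes taken from the output listing above.
RECAB_MARKERS = [".recal.bam.done", ".recal.bam.bai.done"]


def missing_recab_markers(out_dir: str, samples: List[str]) -> Dict[str, List[str]]:
    """Return, per sample, the marker files not found anywhere under OUT_DIR/recab."""
    recab_dir = Path(out_dir) / "recab"
    missing: Dict[str, List[str]] = {}
    for sample in samples:
        absent = []
        for suffix in RECAB_MARKERS:
            name = sample + suffix
            # Assumption: markers may sit in a subdirectory, so search recursively.
            found = recab_dir.is_dir() and any(recab_dir.rglob(glob.escape(name)))
            if not found:
                absent.append(name)
        if absent:
            missing[sample] = absent
    return missing
```

An empty result means every listed sample has both markers; any sample that appears in the result is worth re-running or inspecting in its log files.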

Command-Line and Configuration Options

  • Required Options
{| class="wikitable"
! Command-line Flag !! Configuration Key !! Value Description !! Default Value
|-
| --list/--bam_list/--bamlist file || BAM_LIST || path to the BAM_LIST File || $(OUT_DIR)/bam.list
|-
| --numjobs # || || number of jobs to run in parallel || 0 (generate Makefile of steps, but do not run)
|}
  • Common Options
{| class="wikitable"
! Command-line Flag !! Configuration Key !! Value Description !! Default Value
|-
| --outdir path || OUT_DIR || output directory ||
|-
| --conf file || || configuration file to use ||
|-
| || REF_DIR || where the reference/resource files are stored || gotcloud.ref subdirectory within the base GotCloud directory
|-
| || REF || Reference fasta Files || $(REF_DIR)/human.g1k.v37.fa
|-
| || DBSNP_VCF || DBSNP VCF Files || $(REF_DIR)/dbsnp_135.b37.vcf.gz
|}

Example Configuration File

Example configuration file where the reference files are stored in /path/reference and the BAM list file is /path/freeze5.bam.list:

BAM_LIST = /path/freeze5.bam.list
OUT_DIR = /path/freeze5/output
REF_DIR = /path/reference/
REF = $(REF_DIR)/hs37d5.fa
DBSNP_VCF = $(REF_DIR)/dbsnp_135.b37.sites.vcf.gz
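
To see what paths the `$(KEY)` references in such a file resolve to, the following Python sketch expands them. It is an illustration only and not GotCloud's configuration parser: the function name is hypothetical, and it assumes one `KEY = VALUE` pair per line with references written as `$(KEY)`.

```python
import re
from typing import Dict

_VAR = re.compile(r"\$\((\w+)\)")


def parse_config(text: str) -> Dict[str, str]:
    """Read KEY = VALUE lines and expand $(KEY) references between them."""
    raw: Dict[str, str] = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or "=" not in line:
            continue
        key, value = line.split("=", 1)
        raw[key.strip()] = value.strip()

    def expand(value: str, depth: int = 0) -> str:
        if depth > 10:
            raise ValueError("variable references nest too deeply")

        def repl(match: "re.Match[str]") -> str:
            name = match.group(1)
            # Unknown keys are left untouched rather than guessed.
            return expand(raw[name], depth + 1) if name in raw else match.group(0)

        return _VAR.sub(repl, value)

    return {key: expand(value) for key, value in raw.items()}
```

With the example above, REF expands to /path/reference//hs37d5.fa; the doubled slash comes from the trailing slash in REF_DIR and is harmless on POSIX file systems.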

Example Command Line

gotcloud pipe --name recab --numjobs <N>

recabQC

  • What it does:
  1. merge BAMs for samples that have multiple BAMs
  2. dedup and recalibrate
  3. index the recalibrated BAM
  4. qplot
  5. verifyBamID

Inputs

  • Bam files (stored in a BAM_LIST file)
  • Reference files
  • (Optional) configuration file to override default options
BAM_LIST File for recabQC
  • Each line of the BAM list file represents a single individual

Columns:

  1. sample id
  2. comma separated population labels (optional column)
  3. BAM File 1 (preferable to have full paths to BAM files)
  4. BAM File 2 (if more than 1 BAM per sample)
  ...
  N. BAM File N (if more than 1 BAM per sample)
[SAMPLE_ID]    [COMMA SEPARATED POPULATION LABELS] [BAM_FILE1] [BAM_FILE2] ...

or

[SAMPLE_ID] [BAM_FILE1] [BAM_FILE2] ...
  • Notes:
    • tab delimited
    • multiple BAMs per individual may be provided, but should all be on the same line of the list file
    • population label is optional - it will default to ALL
      • only used by Thunder (part of ldrefine pipeline)
      • if all samples are from the same population, population label can be skipped or you can just specify ALL for the population label for each sample.

Outputs

Upon successful completion of the *recabQC* sub-pipeline, you should see the following files/subdirectories under the user specified output directory:

  • recab/mergedBams/ - contains merge results
    • */SAMPLE.merged.bam - a merged BAM file
    • */SAMPLE.merged.bam.log - merge log
    • */SAMPLE.merged.bam.OK - temp file indicating the merge step completed successfully
  • recab/ - contains recalibration results
    • */SAMPLE.recal.bam - a merged, recalibrated, and deduped BAM file
    • */SAMPLE.recal.bam.bai - the index for the merged, recalibrated, and deduped BAM file
    • */SAMPLE.recal.bam.metrics - dedup & recalibration log
    • */SAMPLE.recal.bam.qemp - recalibration tables
    • */SAMPLE.recal.bam.done - temp file indicating the recalibration step completed successfully
    • */SAMPLE.recal.bam.bai.done - temp file indicating the indexing step completed successfully
  • QCFiles/ - contains quality control results
    • VerifyBamID Output - see VerifyBamID: A guideline to interpret output files for more information
      • */SAMPLE.genoCheck.depthRG - depth distribution of the sequence reads per read group
      • */SAMPLE.genoCheck.depthSM - depth distribution of the sequence reads per sample
      • */SAMPLE.genoCheck.err - log file
      • */SAMPLE.genoCheck.log - log file
      • */SAMPLE.genoCheck.OK - temp file indicating the VerifyBAMID step completed successfully
      • */SAMPLE.genoCheck.selfRG - per-readGroup statistics describing how well each lane matches to the annotated sample
      • */SAMPLE.genoCheck.selfSM - main output file containing the contamination estimate; per-sample statistics describing how well the sample matches to the annotated sample
        • Check the 'FREEMIX' column for the genotype-free estimate of contamination (on a 0-1 scale; lower is better)
        • If [FREEMIX] >= 0.03 and [FREELK1]-[FREELK0] is large, the sample is possibly contaminated
    • Qplot Output - see: QPLOT: Diagnose sequencing quality for more info on how to use QPLOT results
      • */SAMPLE.qplot.OK - temp file indicating the qplot step completed successfully
      • */SAMPLE.qplot.R - Rscript that can be used to generate the pdf graphs
      • */SAMPLE.qplot.stats - sample statistics
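
The FREEMIX check described above can be sketched in Python as follows. This is a hedged illustration rather than an official tool: the function name is hypothetical, the file is assumed to be tab-delimited with a header row containing FREEMIX, FREELK1, and FREELK0, and the first column is assumed to hold the sample identifier. Because the page does not define how large the likelihood difference must be, the sketch only reports it.

```python
import csv
from typing import Iterable, List, Tuple


def flag_contamination(
    selfsm_lines: Iterable[str], threshold: float = 0.03
) -> List[Tuple[str, float, float]]:
    """Return (sample, FREEMIX, FREELK1 - FREELK0) for rows at or above the threshold."""
    reader = csv.DictReader(selfsm_lines, delimiter="\t")
    flagged = []
    for row in reader:
        freemix = float(row["FREEMIX"])
        if freemix >= threshold:
            lk_diff = float(row["FREELK1"]) - float(row["FREELK0"])
            # Assumption: the first header column names the sample.
            sample = row[reader.fieldnames[0]]
            flagged.append((sample, freemix, lk_diff))
    return flagged
```

The function accepts any iterable of lines, so an open file handle for a SAMPLE.genoCheck.selfSM file can be passed directly.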

You should see .done and .OK files for each SAMPLE in the BAM list file. If you do not see the .done and .OK files, then your *recabQC* sub-pipeline failed.

On success, the recab/ folder contains the final BAMs and bais, while the QCFiles/ folder contains the quality control output

Command-Line and Configuration Options

  • Required Options
{| class="wikitable"
! Command-line Flag !! Configuration Key !! Value Description !! Default Value
|-
| --list/--bam_list/--bamlist file || BAM_LIST || path to the BAM_LIST File || $(OUT_DIR)/bam.list
|-
| --numjobs # || || number of jobs to run in parallel || 0 (generate Makefile of steps, but do not run)
|}
  • Common Options
{| class="wikitable"
! Command-line Flag !! Configuration Key !! Value Description !! Default Value
|-
| --outdir path || OUT_DIR || output directory ||
|-
| --conf file || || configuration file to use ||
|-
| || REF_DIR || where the reference/resource files are stored || gotcloud.ref subdirectory within the base GotCloud directory
|-
| || REF || Reference fasta Files || $(REF_DIR)/human.g1k.v37.fa
|-
| || DBSNP_VCF || DBSNP VCF Files || $(REF_DIR)/dbsnp_135.b37.vcf.gz
|-
| || HM3_VCF || HapMap3 VCF Files || $(REF_DIR)/hapmap_3.3.b37.sites.vcf.gz
|}

Example Configuration File

Example configuration file where the reference files are stored in /path/reference and the BAM list file is /path/freeze5.bam.list:

BAM_LIST = /path/freeze5.bam.list
OUT_DIR = /path/freeze5/output
REF_DIR = /path/reference/
REF = $(REF_DIR)/hs37d5.fa
DBSNP_VCF = $(REF_DIR)/dbsnp_135.b37.sites.vcf.gz
HM3_VCF = $(REF_DIR)/hapmap3_r3_b37.sites.vcf.gz

Example Command Line

gotcloud pipe --name recabQC --numjobs <N>

bamQC

  • What it does:
  1. qplot
  2. verifyBamID

Inputs

  • Single merged, recalibrated, and deduped BAM file for each subject (stored in a BAM_LIST File)
  • BAI file for each subject
  • Reference files
  • (Optional) configuration file to override default options
BAM_LIST File for bamQC
  • Each line of the BAM list file represents a single individual

Columns:

  1. sample id
  2. comma separated population labels (optional column)
  3. BAM File (preferable to have full path to BAM file)
  4. BAI File (preferable to have full path to BAI file)
[SAMPLE_ID] [COMMA SEPARATED POPULATION LABELS] [BAM_FILE] [BAI_FILE] 

or

[SAMPLE_ID] [BAM_FILE] [BAI_FILE] 
  • Notes:
    • tab delimited
    • population label is optional - it will default to ALL
      • only used by Thunder (part of ldrefine pipeline)
      • if all samples are from the same population, population label can be skipped or you can just specify ALL for the population label for each sample.
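
As a rough sketch of the layout above, the following Python function builds *bamQC* list lines from a mapping of sample id to BAM path. The function name is hypothetical, and the index path is assumed to be the BAM path with .bai appended, which matches the naming *recab* uses for its outputs; adjust it if your index files are named differently.

```python
from typing import Dict, List


def bamqc_list_lines(sample_bams: Dict[str, str], population: str = "") -> List[str]:
    """Build tab-delimited bamQC BAM list lines: sample, [population], BAM, BAI."""
    lines = []
    for sample, bam in sample_bams.items():
        # Assumption: the index sits next to the BAM and is named <BAM>.bai.
        bai = bam + ".bai"
        fields = [sample]
        if population:
            fields.append(population)  # optional population-label column
        fields += [bam, bai]
        lines.append("\t".join(fields))
    return lines
```

Writing the returned lines to a file, one per line, gives a list that can be supplied through the BAM_LIST configuration key.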

Outputs

Upon successful completion of the *bamQC* sub-pipeline, you should see the following files/subdirectories under the user specified output directory:

  • QCFiles/ - contains quality control results
    • VerifyBamID Output - see VerifyBamID: A guideline to interpret output files for more information
      • */SAMPLE.genoCheck.depthRG - depth distribution of the sequence reads per read group
      • */SAMPLE.genoCheck.depthSM - depth distribution of the sequence reads per sample
      • */SAMPLE.genoCheck.err - log file
      • */SAMPLE.genoCheck.log - log file
      • */SAMPLE.genoCheck.OK - temp file indicating the VerifyBAMID step completed successfully
      • */SAMPLE.genoCheck.selfRG - per-readGroup statistics describing how well each lane matches to the annotated sample
      • */SAMPLE.genoCheck.selfSM - main output file containing the contamination estimate; per-sample statistics describing how well the sample matches to the annotated sample
        • Check the 'FREEMIX' column for the genotype-free estimate of contamination (on a 0-1 scale; lower is better)
        • If [FREEMIX] >= 0.03 and [FREELK1]-[FREELK0] is large, the sample is possibly contaminated
    • Qplot Output - see: QPLOT: Diagnose sequencing quality for more info on how to use QPLOT results
      • */SAMPLE.qplot.OK - temp file indicating the qplot step completed successfully
      • */SAMPLE.qplot.R - Rscript that can be used to generate the pdf graphs
      • */SAMPLE.qplot.stats - sample statistics

You should see the .OK files listed above for each SAMPLE in the BAM list file. If you do not see them, then your *bamQC* sub-pipeline failed.

On success, the QCFiles/ folder contains the quality control output

Command-Line and Configuration Options

  • Required Options
{| class="wikitable"
! Command-line Flag !! Configuration Key !! Value Description !! Default Value
|-
| --list/--bam_list/--bamlist file || BAM_LIST || path to the BAM_LIST File || $(OUT_DIR)/bam.list
|-
| --numjobs # || || number of jobs to run in parallel || 0 (generate Makefile of steps, but do not run)
|}
  • Common Options
{| class="wikitable"
! Command-line Flag !! Configuration Key !! Value Description !! Default Value
|-
| --outdir path || OUT_DIR || output directory ||
|-
| --conf file || || configuration file to use ||
|-
| || REF_DIR || where the reference/resource files are stored || gotcloud.ref subdirectory within the base GotCloud directory
|-
| || REF || Reference fasta Files || $(REF_DIR)/human.g1k.v37.fa
|-
| || DBSNP_VCF || DBSNP VCF Files || $(REF_DIR)/dbsnp_135.b37.vcf.gz
|-
| || HM3_VCF || HapMap3 VCF Files || $(REF_DIR)/hapmap_3.3.b37.sites.vcf.gz
|}

Example Configuration File

Example configuration file where the reference files are stored in /path/reference and the BAM list file is /path/freeze5.bam.list:

BAM_LIST = /path/freeze5.bam.list
OUT_DIR = /path/freeze5/output
REF_DIR = /path/reference/
REF = $(REF_DIR)/hs37d5.fa
DBSNP_VCF = $(REF_DIR)/dbsnp_135.b37.sites.vcf.gz
HM3_VCF = $(REF_DIR)/hapmap3_r3_b37.sites.vcf.gz

Example Command Line

gotcloud pipe --name bamQC --numjobs <N>

bamQC_createIndex

  • What it does:
  1. creates a BAI file for any BAM that is missing it
  2. qplot
  3. verifyBamID
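
To preview which BAMs the first step would need to index, a small Python sketch like the one below can be used. It is not part of GotCloud; the function name is hypothetical, and it assumes the index for a BAM is the BAM path with .bai appended.

```python
import os
from typing import Iterable, List


def bams_missing_index(bam_paths: Iterable[str]) -> List[str]:
    """Return the BAM paths that have no <BAM>.bai file next to them."""
    # Assumption: an index is recognized only under the name <BAM>.bai.
    return [bam for bam in bam_paths if not os.path.exists(bam + ".bai")]
```

If the returned list is empty, every BAM already has an index under that naming assumption, and *bamQC* may be the simpler choice provided the list file also names each BAI.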

Inputs

  • Single merged, recalibrated, and deduped BAM file for each subject (stored in a BAM_LIST File)
  • Reference files
  • (Optional) configuration file to override default options
BAM_LIST File for bamQC_createIndex
  • Each line of the BAM list file represents a single individual

Columns:

  1. sample id
  2. comma separated population labels (optional column)
  3. BAM File (preferable to have full paths to BAM files)
[SAMPLE_ID] [COMMA SEPARATED POPULATION LABELS] [BAM_FILE] 

or

[SAMPLE_ID] [BAM_FILE] 
  • Notes:
    • tab delimited
    • population label is optional - it will default to ALL
      • only used by Thunder (part of ldrefine pipeline)
      • if all samples are from the same population, population label can be skipped or you can just specify ALL for the population label for each sample.

Outputs

Upon successful completion of the *bamQC_createIndex* sub-pipeline, you should see the following files/subdirectories under the user specified output directory:

  • A BAI file for each input BAM, written to the same path and name as the BAM file with .bai appended
  • QCFiles/ - contains quality control results
    • VerifyBamID Output - see VerifyBamID: A guideline to interpret output files for more information
      • */SAMPLE.genoCheck.depthRG - depth distribution of the sequence reads per read group
      • */SAMPLE.genoCheck.depthSM - depth distribution of the sequence reads per sample
      • */SAMPLE.genoCheck.err - log file
      • */SAMPLE.genoCheck.log - log file
      • */SAMPLE.genoCheck.OK - temp file indicating the VerifyBAMID step completed successfully
      • */SAMPLE.genoCheck.selfRG - per-readGroup statistics describing how well each lane matches to the annotated sample
      • */SAMPLE.genoCheck.selfSM - main output file containing the contamination estimate; per-sample statistics describing how well the sample matches to the annotated sample
        • Check the 'FREEMIX' column for the genotype-free estimate of contamination (on a 0-1 scale; lower is better)
        • If [FREEMIX] >= 0.03 and [FREELK1]-[FREELK0] is large, the sample is possibly contaminated
    • Qplot Output - see: QPLOT: Diagnose sequencing quality for more info on how to use QPLOT results
      • */SAMPLE.qplot.OK - temp file indicating the qplot step completed successfully
      • */SAMPLE.qplot.R - Rscript that can be used to generate the pdf graphs
      • */SAMPLE.qplot.stats - sample statistics

You should see .done and .OK files for each SAMPLE in the BAM list file. If you do not see the .done and .OK files, then your *bamQC_createIndex* sub-pipeline failed.

On success, the QCFiles/ folder contains the quality control output

Command-Line and Configuration Options

  • Required Options
{| class="wikitable"
! Command-line Flag !! Configuration Key !! Value Description !! Default Value
|-
| --list/--bam_list/--bamlist file || BAM_LIST || path to the BAM_LIST File || $(OUT_DIR)/bam.list
|-
| --numjobs # || || number of jobs to run in parallel || 0 (generate Makefile of steps, but do not run)
|}
  • Common Options
{| class="wikitable"
! Command-line Flag !! Configuration Key !! Value Description !! Default Value
|-
| --outdir path || OUT_DIR || output directory ||
|-
| --conf file || || configuration file to use ||
|-
| || REF_DIR || where the reference/resource files are stored || gotcloud.ref subdirectory within the base GotCloud directory
|-
| || REF || Reference fasta Files || $(REF_DIR)/human.g1k.v37.fa
|-
| || DBSNP_VCF || DBSNP VCF Files || $(REF_DIR)/dbsnp_135.b37.vcf.gz
|-
| || HM3_VCF || HapMap3 VCF Files || $(REF_DIR)/hapmap_3.3.b37.sites.vcf.gz
|}

Example Configuration File

Example configuration file where the reference files are stored in /path/reference and the BAM list file is /path/freeze5.bam.list:

BAM_LIST = /path/freeze5.bam.list
OUT_DIR = /path/freeze5/output
REF_DIR = /path/reference/
REF = $(REF_DIR)/hs37d5.fa
DBSNP_VCF = $(REF_DIR)/dbsnp_135.b37.sites.vcf.gz
HM3_VCF = $(REF_DIR)/hapmap3_r3_b37.sites.vcf.gz

Example Command Line

gotcloud pipe --name bamQC_createIndex --numjobs <N>