Difference between revisions of "GotCloud: Alignment Sub-Pipelines"

From Genome Analysis Wiki
Latest revision as of 21:37, 18 March 2015



List of Alignment Sub-Pipelines

recab

This sub-pipeline takes in a list of bam files for each sample, merges the BAMs for samples that have multiple BAMs, dedups and recalibrates, and then indexes the recalibrated BAM.

recabQC

This sub-pipeline does everything that *recab* does (takes in a list of bam files for each sample, merges the BAMs for samples that have multiple BAMs, dedups and recalibrates, and then indexes the recalibrated BAM). It then goes the next step to perform quality control (running qplot and verifyBamID).

bamQC

This sub-pipeline takes in a single, recalibrated BAM file and its index file (.bai) and performs quality control (running qplot and verifyBamID). It differs from *bamQC_createIndex* in that it requires that the user already have .bai files for the recalibrated BAM files.

bamQC_createIndex

This sub-pipeline takes in a single, recalibrated BAM file, creates an index file for it, and performs quality control (running qplot and verifyBamID). It differs from *bamQC* in that it does not require that the user already have a .bai file for the recalibrated BAM file.

recab

  • What it does:
  1. merge BAMs for samples that have multiple BAMs
  2. dedup and recalibrate
  3. index the recalibrated BAM

Inputs

  • Bam files (stored in a BAM_LIST File)
  • Reference files
  • (Optional) configuration file to override default options
BAM_LIST File for recab
  • Each line of the BAM list file represents a single individual

Columns:

  1. sample id
  2. comma separated population labels (optional column)
  3. BAM File 1 (preferable to have full paths to BAM files)
  4. BAM File 2 (if more than 1 BAM per sample)
  ...
  N. BAM File N (if more than 1 BAM per sample)

[SAMPLE_ID] [COMMA SEPARATED POPULATION LABELS] [BAM_FILE1] [BAM_FILE2] ...

or

[SAMPLE_ID] [BAM_FILE1] [BAM_FILE2] ...
  • Notes:
    • tab delimited
    • multiple BAMs per individual may be provided, but should all be on the same line of the list file
    • population label is optional - it will default to ALL
      • only used by Thunder (part of ldrefine pipeline)
      • if all samples are from the same population, population label can be skipped or you can just specify ALL for the population label for each sample.
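
The column layout above can be sketched as a small parser. This is only an illustration, not GotCloud's own reader: the name `parse_bam_list_line` is hypothetical, and the rule for spotting the optional population column (a second field that does not end in `.bam`) is an assumption.

```python
from typing import List, NamedTuple


class BamListEntry(NamedTuple):
    sample_id: str
    populations: List[str]
    bam_files: List[str]


def parse_bam_list_line(line: str) -> BamListEntry:
    """Split one tab-delimited BAM list line into sample id, labels, and BAMs."""
    fields = line.rstrip("\n").split("\t")
    if len(fields) < 2:
        raise ValueError("expected a sample id and at least one BAM file")
    sample_id, rest = fields[0], fields[1:]
    # Assumption: if the second column is not a .bam path, treat it as the
    # optional comma-separated population-label column.
    if not rest[0].endswith(".bam"):
        populations = rest[0].split(",")
        bam_files = rest[1:]
    else:
        populations = ["ALL"]  # documented default when the column is omitted
        bam_files = rest
    if not bam_files:
        raise ValueError(f"no BAM files listed for sample {sample_id}")
    return BamListEntry(sample_id, populations, bam_files)
```

All BAMs for one individual come back in a single `bam_files` list, matching the requirement that they sit on the same line.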

Outputs

Upon successful completion of the *recab* sub-pipeline, you should see the following files/subdirectories under the user specified output directory:

  • recab/mergedBams/
    • */SAMPLE.merged.bam - a merged BAM file
    • */SAMPLE.merged.bam.log - merge log
    • */SAMPLE.merged.bam.OK - temp file indicating the merge step completed successfully
  • recab/
    • */SAMPLE.recal.bam - a merged, recalibrated, and deduped BAM file
    • */SAMPLE.recal.bam.bai - an indexed version of the merged, recalibrated, and deduped BAM file
    • */SAMPLE.recal.bam.metrics - dedup & recalibration log
    • */SAMPLE.recal.bam.qemp - recalibration tables
    • */SAMPLE.recal.bam.done - temp file indicating the recalibration step completed successfully
    • */SAMPLE.recal.bam.bai.done - temp file indicating the indexing step completed successfully

You should see .done and .OK files for each SAMPLE listed in the BAM list file. If you do not see the .done and .OK files, then your *recab* sub-pipeline failed.

On success, the recab/ folder contains the final BAMs and bais.
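
As a rough sketch of that completion check, the helper below looks for the two `.done` marker files under `recab/` for each sample. The function name `missing_recab_markers` is hypothetical, and the recursive search is an assumption based on the `*/SAMPLE...` paths in the listing. The merge `.OK` marker is not checked, because it is unclear whether samples with a single BAM produce one.

```python
from pathlib import Path
from typing import Iterable, List


def missing_recab_markers(out_dir: str, samples: Iterable[str]) -> List[str]:
    """Return the names of expected .done marker files that were not found."""
    recab_dir = Path(out_dir) / "recab"
    missing = []
    for sample in samples:
        for suffix in (".recal.bam.done", ".recal.bam.bai.done"):
            name = sample + suffix
            # Search recab/ and its subdirectories for the marker file.
            if not any(recab_dir.rglob(name)):
                missing.append(name)
    return missing
```

An empty result suggests every listed sample finished recalibration and indexing; any returned name points at the step to investigate.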

Command-Line and Configuration Options

  • Required Options
Command-line Flag | Configuration Key | Value Description | Default Value
--list/--bam_list/--bamlist file | BAM_LIST | path to the BAM_LIST File | $(OUT_DIR)/bam.list
--numjobs # | (none) | number of jobs to run in parallel | 0 (generate Makefile of steps, but do not run)
  • Common Options
Command-line Flag | Configuration Key | Value Description | Default Value
--outdir path | OUT_DIR | output directory | (none)
--conf file | (none) | configuration file to use | (none)
(none) | REF_DIR | where the reference/resource files are stored | gotcloud.ref subdirectory within the base GotCloud directory
(none) | REF | Reference fasta Files | $(REF_DIR)/human.g1k.v37.fa
(none) | DBSNP_VCF | DBSNP VCF Files | $(REF_DIR)/dbsnp_135.b37.vcf.gz

Example Configuration File

Example configuration file where the reference files are stored in /path/reference and the BAM list file is /path/freeze5.bam.list:

BAM_LIST = /path/freeze5.bam.list
OUT_DIR = /path/freeze5/output
REF_DIR = /path/reference/
REF = $(REF_DIR)/hs37d5.fa
DBSNP_VCF = $(REF_DIR)/dbsnp_135.b37.sites.vcf.gz
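
To make the `$(KEY)` references in the example concrete, here is a minimal sketch of how such values could be expanded. It is not GotCloud's configuration parser; the function name `parse_config` and the treatment of `#` lines as comments are assumptions made for illustration.

```python
import re
from typing import Dict


def parse_config(text: str) -> Dict[str, str]:
    """Read KEY = VALUE lines and expand $(KEY) references between them."""
    raw: Dict[str, str] = {}
    for line in text.splitlines():
        line = line.strip()
        # Assumption: blank lines and lines starting with '#' carry no setting.
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, value = line.split("=", 1)
        raw[key.strip()] = value.strip()

    def expand(value: str, depth: int = 0) -> str:
        if depth > 10:
            raise ValueError("too many nested references")

        def repl(match):
            name = match.group(1)
            # Leave references to keys not set in this file untouched.
            return expand(raw[name], depth + 1) if name in raw else match.group(0)

        return re.sub(r"\$\((\w+)\)", repl, value)

    return {key: expand(value) for key, value in raw.items()}


example = """
REF_DIR = /path/reference/
REF = $(REF_DIR)/hs37d5.fa
"""
print(parse_config(example)["REF"])
```

Because `REF_DIR` ends with a slash in the example, the expanded `REF` contains a doubled slash, which file systems normally treat the same as a single one.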

Example Command Line

gotcloud pipe --name recab --numjobs <N>
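
When launching the sub-pipeline from a script, the command can be assembled as an argument list. The sketch below uses only the flags listed in the options tables (`--name`, `--numjobs`, `--conf`); the helper name `gotcloud_pipe_command` and the path `/path/freeze5.conf` are hypothetical.

```python
import shlex
from typing import List, Optional


def gotcloud_pipe_command(name: str, numjobs: int,
                          conf: Optional[str] = None) -> List[str]:
    """Build the argument list for a 'gotcloud pipe' run."""
    cmd = ["gotcloud", "pipe", "--name", name, "--numjobs", str(numjobs)]
    if conf is not None:
        cmd += ["--conf", conf]
    return cmd


# Print the command instead of running it; pass the list to subprocess.run
# to execute it on a machine where gotcloud is installed.
print(shlex.join(gotcloud_pipe_command("recab", 4, conf="/path/freeze5.conf")))
```

Per the options table, a `numjobs` value of 0 only generates the Makefile of steps without running them.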

recabQC

  • What it does:
  1. merge BAMs for samples that have multiple BAMs
  2. dedup and recalibrate
  3. index the recalibrated BAM
  4. qplot
  5. verifyBamID

Inputs

  • Bam files (stored in a BAM_LIST file)
  • Reference files
  • (Optional) configuration file to override default options
BAM_LIST File for recabQC
  • Each line of the BAM list file represents a single individual

Columns:

  1. sample id
  2. comma separated population labels (optional column)
  3. BAM File 1 (preferable to have full paths to BAM files)
  4. BAM File 2 (if more than 1 BAM per sample)
  ...
  N. BAM File N (if more than 1 BAM per sample)

[SAMPLE_ID] [COMMA SEPARATED POPULATION LABELS] [BAM_FILE1] [BAM_FILE2] ...

or

[SAMPLE_ID] [BAM_FILE1] [BAM_FILE2] ...
  • Notes:
    • tab delimited
    • multiple BAMs per individual may be provided, but should all be on the same line of the list file
    • population label is optional - it will default to ALL
      • only used by Thunder (part of ldrefine pipeline)
      • if all samples are from the same population, population label can be skipped or you can just specify ALL for the population label for each sample.

Outputs

Upon successful completion of the *recabQC* sub-pipeline, you should see the following files/subdirectories under the user specified output directory:

  • recab/mergedBams/ - contains merge results
    • */SAMPLE.merged.bam - a merged BAM file
    • */SAMPLE.merged.bam.log - merge log
    • */SAMPLE.merged.bam.OK - temp file indicating the merge step completed successfully
  • recab/ - contains recalibration results
    • */SAMPLE.recal.bam - a merged, recalibrated, and deduped BAM file
    • */SAMPLE.recal.bam.bai - an indexed version of the merged, recalibrated, and deduped BAM file
    • */SAMPLE.recal.bam.metrics - dedup & recalibration log
    • */SAMPLE.recal.bam.qemp - recalibration tables
    • */SAMPLE.recal.bam.done - temp file indicating the recalibration step completed successfully
    • */SAMPLE.recal.bam.bai.done - temp file indicating the indexing step completed successfully
  • QCFiles/ - contains quality control results
    • VerifyBamID Output - see VerifyBamID: A guideline to interpret output files for more information
      • */SAMPLE.genoCheck.depthRG - depth distribution of the sequence reads per read group
      • */SAMPLE.genoCheck.depthSM - depth distribution of the sequence reads per sample
      • */SAMPLE.genoCheck.err - log file
      • */SAMPLE.genoCheck.log - log file
      • */SAMPLE.genoCheck.OK - temp file indicating the VerifyBAMID step completed successfully
      • */SAMPLE.genoCheck.selfRG - per-readGroup statistics describing how well each lane matches to the annotated sample
      • */SAMPLE.genoCheck.selfSM - main output file containing the contamination estimate; per-sample statistics describing how well the sample matches to the annotated sample
        • Check the 'FREEMIX' column for the genotype-free estimate of contamination (0-1 scale; lower is better)
        • If [FREEMIX] >= 0.03 and [FREELK1]-[FREELK0] is large, possible contamination
    • Qplot Output - see: QPLOT: Diagnose sequencing quality for more info on how to use QPLOT results
      • */SAMPLE.qplot.OK - temp file indicating the qplot step completed successfully
      • */SAMPLE.qplot.R - Rscript that can be used to generate the pdf graphs
      • */SAMPLE.qplot.stats - sample statistics
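
The FREEMIX rule of thumb above can be sketched as a small filter over a .selfSM file. This assumes the file is tab-delimited with a header row naming the FREEMIX, FREELK1, and FREELK0 columns. The function name `flag_contamination` is hypothetical, and because the text does not say how large the likelihood difference must be, `min_lk_diff` is left for the caller to choose.

```python
import csv
from typing import Dict, Iterable, List


def flag_contamination(lines: Iterable[str], min_lk_diff: float,
                       freemix_cutoff: float = 0.03) -> List[Dict[str, str]]:
    """Return the rows whose FREEMIX and likelihood difference look suspicious."""
    reader = csv.DictReader(lines, delimiter="\t")
    flagged = []
    for row in reader:
        freemix = float(row["FREEMIX"])
        lk_diff = float(row["FREELK1"]) - float(row["FREELK0"])
        # Rule of thumb from the text: FREEMIX >= 0.03 and a large difference.
        if freemix >= freemix_cutoff and lk_diff >= min_lk_diff:
            flagged.append(row)
    return flagged
```

A typical call would pass an open file handle, for example `flag_contamination(open(path), min_lk_diff=...)`, and treat any returned rows as samples that deserve a closer look rather than as a definitive verdict.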

You should see .done and .OK files for each SAMPLE listed in the BAM list file. If you do not see the .done and .OK files, then your *recabQC* sub-pipeline failed.

On success, the recab/ folder contains the final BAMs and bais, while the QCFiles/ folder contains the quality control output

Command-Line and Configuration Options

  • Required Options
Command-line Flag | Configuration Key | Value Description | Default Value
--list/--bam_list/--bamlist file | BAM_LIST | path to the BAM_LIST File | $(OUT_DIR)/bam.list
--numjobs # | (none) | number of jobs to run in parallel | 0 (generate Makefile of steps, but do not run)
  • Common Options
Command-line Flag | Configuration Key | Value Description | Default Value
--outdir path | OUT_DIR | output directory | (none)
--conf file | (none) | configuration file to use | (none)
(none) | REF_DIR | where the reference/resource files are stored | gotcloud.ref subdirectory within the base GotCloud directory
(none) | REF | Reference fasta Files | $(REF_DIR)/human.g1k.v37.fa
(none) | DBSNP_VCF | DBSNP VCF Files | $(REF_DIR)/dbsnp_135.b37.vcf.gz
(none) | HM3_VCF | HapMap3 VCF Files | $(REF_DIR)/hapmap_3.3.b37.sites.vcf.gz

Example Configuration File

Example configuration file where the reference files are stored in /path/reference and the BAM list file is /path/freeze5.bam.list:

BAM_LIST = /path/freeze5.bam.list
OUT_DIR = /path/freeze5/output
REF_DIR = /path/reference/
REF = $(REF_DIR)/hs37d5.fa
DBSNP_VCF = $(REF_DIR)/dbsnp_135.b37.sites.vcf.gz
HM3_VCF = $(REF_DIR)/hapmap3_r3_b37.sites.vcf.gz

Example Command Line

gotcloud pipe --name recabQC --numjobs <N>

bamQC

  • What it does:
  1. qplot
  2. verifyBamID

Inputs

  • Single merged, recalibrated, and deduped BAM file for each subject (stored in a BAM_LIST File)
  • BAI file for each subject
  • Reference files
  • (Optional) configuration file to override default options
BAM_LIST File for bamQC
  • Each line of the BAM list file represents a single individual

Columns:

  1. sample id
  2. comma separated population labels (optional column)
  3. BAM File (preferable to have full path to BAM file)
  4. BAI File (preferable to have full path to BAI file)
[SAMPLE_ID] [COMMA SEPARATED POPULATION LABELS] [BAM_FILE] [BAI_FILE] 

or

[SAMPLE_ID] [BAM_FILE] [BAI_FILE] 
  • Notes:
    • tab delimited
    • population label is optional - it will default to ALL
      • only used by Thunder (part of ldrefine pipeline)
      • if all samples are from the same population, population label can be skipped or you can just specify ALL for the population label for each sample.
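
Because a bamQC line always carries exactly one BAM and one BAI, the optional population column can be detected from the column count alone. The sketch below illustrates this; the function name `split_bamqc_line` is hypothetical and this is not GotCloud's own reader.

```python
from typing import Tuple


def split_bamqc_line(line: str) -> Tuple[str, str, str, str]:
    """Return (sample_id, population_labels, bam_path, bai_path) for one line."""
    fields = line.rstrip("\n").split("\t")
    if len(fields) == 4:
        sample_id, populations, bam, bai = fields
    elif len(fields) == 3:
        sample_id, bam, bai = fields
        populations = "ALL"  # documented default when the column is omitted
    else:
        raise ValueError(f"expected 3 or 4 tab-delimited columns, got {len(fields)}")
    return sample_id, populations, bam, bai
```

A line with any other column count is rejected, which catches space-delimited files and lines that omit the BAI path.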

Outputs

Upon successful completion of the *bamQC* sub-pipeline, you should see the following files/subdirectories under the user specified output directory:

  • QCFiles/ - contains quality control results
    • VerifyBamID Output - see VerifyBamID: A guideline to interpret output files for more information
      • */SAMPLE.genoCheck.depthRG - depth distribution of the sequence reads per read group
      • */SAMPLE.genoCheck.depthSM - depth distribution of the sequence reads per sample
      • */SAMPLE.genoCheck.err - log file
      • */SAMPLE.genoCheck.log - log file
      • */SAMPLE.genoCheck.OK - temp file indicating the VerifyBAMID step completed successfully
      • */SAMPLE.genoCheck.selfRG - per-readGroup statistics describing how well each lane matches to the annotated sample
      • */SAMPLE.genoCheck.selfSM - main output file containing the contamination estimate; per-sample statistics describing how well the sample matches to the annotated sample
        • Check the 'FREEMIX' column for the genotype-free estimate of contamination (0-1 scale; lower is better)
        • If [FREEMIX] >= 0.03 and [FREELK1]-[FREELK0] is large, possible contamination
    • Qplot Output - see: QPLOT: Diagnose sequencing quality for more info on how to use QPLOT results
      • */SAMPLE.qplot.OK - temp file indicating the qplot step completed successfully
      • */SAMPLE.qplot.R - Rscript that can be used to generate the pdf graphs
      • */SAMPLE.qplot.stats - sample statistics

You should see .done and .OK files for each SAMPLE listed in the BAM list file. If you do not see the .done and .OK files, then your *bamQC* sub-pipeline failed.

On success, the QCFiles/ folder contains the quality control output

Command-Line and Configuration Options

  • Required Options
Command-line Flag | Configuration Key | Value Description | Default Value
--list/--bam_list/--bamlist file | BAM_LIST | path to the BAM_LIST File | $(OUT_DIR)/bam.list
--numjobs # | (none) | number of jobs to run in parallel | 0 (generate Makefile of steps, but do not run)
  • Common Options
Command-line Flag | Configuration Key | Value Description | Default Value
--outdir path | OUT_DIR | output directory | (none)
--conf file | (none) | configuration file to use | (none)
(none) | REF_DIR | where the reference/resource files are stored | gotcloud.ref subdirectory within the base GotCloud directory
(none) | REF | Reference fasta Files | $(REF_DIR)/human.g1k.v37.fa
(none) | DBSNP_VCF | DBSNP VCF Files | $(REF_DIR)/dbsnp_135.b37.vcf.gz
(none) | HM3_VCF | HapMap3 VCF Files | $(REF_DIR)/hapmap_3.3.b37.sites.vcf.gz

Example Configuration File

Example configuration file where the reference files are stored in /path/reference and the BAM list file is /path/freeze5.bam.list:

BAM_LIST = /path/freeze5.bam.list
OUT_DIR = /path/freeze5/output
REF_DIR = /path/reference/
REF = $(REF_DIR)/hs37d5.fa
DBSNP_VCF = $(REF_DIR)/dbsnp_135.b37.sites.vcf.gz
HM3_VCF = $(REF_DIR)/hapmap3_r3_b37.sites.vcf.gz

Example Command Line

gotcloud pipe --name bamQC --numjobs <N>

bamQC_createIndex

  • What it does:
  1. creates a BAI file for any BAM that is missing it
  2. qplot
  3. verifyBamID

Inputs

  • Single merged, recalibrated, and deduped BAM file for each subject (stored in a BAM_LIST File)
  • Reference files
  • (Optional) configuration file to override default options
BAM_LIST File for bamQC_createIndex
  • Each line of the BAM list file represents a single individual

Columns:

  1. sample id
  2. comma separated population labels (optional column)
  3. BAM File (preferable to have full paths to BAM files)
[SAMPLE_ID] [COMMA SEPARATED POPULATION LABELS] [BAM_FILE] 

or

[SAMPLE_ID] [BAM_FILE] 
  • Notes:
    • tab delimited
    • population label is optional - it will default to ALL
      • only used by Thunder (part of ldrefine pipeline)
      • if all samples are from the same population, population label can be skipped or you can just specify ALL for the population label for each sample.

Outputs

Upon successful completion of the *bamQC_createIndex* sub-pipeline, you should see the following files/subdirectories under the user specified output directory:

  • A BAI file with the same path and name as the input BAM file, with .bai appended
  • QCFiles/ - contains quality control results
    • VerifyBamID Output - see VerifyBamID: A guideline to interpret output files for more information
      • */SAMPLE.genoCheck.depthRG - depth distribution of the sequence reads per read group
      • */SAMPLE.genoCheck.depthSM - depth distribution of the sequence reads per sample
      • */SAMPLE.genoCheck.err - log file
      • */SAMPLE.genoCheck.log - log file
      • */SAMPLE.genoCheck.OK - temp file indicating the VerifyBAMID step completed successfully
      • */SAMPLE.genoCheck.selfRG - per-readGroup statistics describing how well each lane matches to the annotated sample
      • */SAMPLE.genoCheck.selfSM - main output file containing the contamination estimate; per-sample statistics describing how well the sample matches to the annotated sample
        • Check the 'FREEMIX' column for the genotype-free estimate of contamination (0-1 scale; lower is better)
        • If [FREEMIX] >= 0.03 and [FREELK1]-[FREELK0] is large, possible contamination
    • Qplot Output - see: QPLOT: Diagnose sequencing quality for more info on how to use QPLOT results
      • */SAMPLE.qplot.OK - temp file indicating the qplot step completed successfully
      • */SAMPLE.qplot.R - Rscript that can be used to generate the pdf graphs
      • */SAMPLE.qplot.stats - sample statistics

You should see .done and .OK files for each SAMPLE listed in the BAM list file. If you do not see the .done and .OK files, then your *bamQC_createIndex* sub-pipeline failed.

On success, the QCFiles/ folder contains the quality control output

Command-Line and Configuration Options

  • Required Options
Command-line Flag | Configuration Key | Value Description | Default Value
--list/--bam_list/--bamlist file | BAM_LIST | path to the BAM_LIST File | $(OUT_DIR)/bam.list
--numjobs # | (none) | number of jobs to run in parallel | 0 (generate Makefile of steps, but do not run)
  • Common Options
Command-line Flag | Configuration Key | Value Description | Default Value
--outdir path | OUT_DIR | output directory | (none)
--conf file | (none) | configuration file to use | (none)
(none) | REF_DIR | where the reference/resource files are stored | gotcloud.ref subdirectory within the base GotCloud directory
(none) | REF | Reference fasta Files | $(REF_DIR)/human.g1k.v37.fa
(none) | DBSNP_VCF | DBSNP VCF Files | $(REF_DIR)/dbsnp_135.b37.vcf.gz
(none) | HM3_VCF | HapMap3 VCF Files | $(REF_DIR)/hapmap_3.3.b37.sites.vcf.gz

Example Configuration File

Example configuration file where the reference files are stored in /path/reference and the BAM list file is /path/freeze5.bam.list:

BAM_LIST = /path/freeze5.bam.list
OUT_DIR = /path/freeze5/output
REF_DIR = /path/reference/
REF = $(REF_DIR)/hs37d5.fa
DBSNP_VCF = $(REF_DIR)/dbsnp_135.b37.sites.vcf.gz
HM3_VCF = $(REF_DIR)/hapmap3_r3_b37.sites.vcf.gz

Example Command Line

gotcloud pipe --name bamQC_createIndex --numjobs <N>