Changes

From Genome Analysis Wiki
Line 28: Line 28:     
====Inputs====
 
====Inputs====
* Bam files (stored in a [[#BAM_LIST|BAM_LIST]] file)
+
* Bam files (stored in a [[#BAM_LIST File for recab|BAM_LIST File]])
 
* Reference files
 
* Reference files
 
* (Optional) configuration file to override default options
 
* (Optional) configuration file to override default options
   −
=====BAM_LIST File=====
+
=====BAM_LIST File for recab=====
 
* Each line of the BAM list file represents a single individual
 
* Each line of the BAM list file represents a single individual
   Line 78: Line 78:  
! Command-line Flag !! Configuration Key !! Value Description !! Default Value
 
! Command-line Flag !! Configuration Key !! Value Description !! Default Value
 
|-
 
|-
| --list/--bam_list/--bamlist ''file'' || BAM_LIST || path to the [[#BAM_LIST File|BAM_LIST File]] || $(OUT_DIR)/bam.list
+
| --list/--bam_list/--bamlist ''file'' || BAM_LIST || path to the [[#BAM_LIST File for recab|BAM_LIST File]] || $(OUT_DIR)/bam.list
 
|-
 
|-
 
| --numjobs ''#'' || || number of jobs to run in parallel || 0 (generate Makefile of steps, but do not run)
 
| --numjobs ''#'' || || number of jobs to run in parallel || 0 (generate Makefile of steps, but do not run)
Line 119: Line 119:     
====Inputs====
 
====Inputs====
* Bam files (stored in a [[#BAM_LIST|BAM_LIST]] file)
+
* Bam files (stored in a [[#BAM_LIST File for recabQC|BAM_LIST]] file)
 
* Reference files
 
* Reference files
 
* (Optional) configuration file to override default options
 
* (Optional) configuration file to override default options
   −
=====BAM_LIST File=====
+
=====BAM_LIST File for recabQC=====
 
* Each line of the BAM list file represents a single individual
 
* Each line of the BAM list file represents a single individual
   Line 146: Line 146:     
====Outputs====
 
====Outputs====
Upon successful completion of the *recab* sub-pipeline, you should see the following files/subdirectories under the user specified output directory:
+
Upon successful completion of the *recabQC* sub-pipeline, you should see the following files/subdirectories under the user specified output directory:
*'''recab/mergedBams/'''
+
*'''recab/mergedBams/''' - contains merge results
 
** '''''*/SAMPLE.merged.bam'' - a merged BAM file'''
 
** '''''*/SAMPLE.merged.bam'' - a merged BAM file'''
 
** ''*/SAMPLE.merged.bam.log'' - merge log
 
** ''*/SAMPLE.merged.bam.log'' - merge log
 
** ''*/SAMPLE.merged.bam.OK'' - temp file indicating the merge step completed successfully
 
** ''*/SAMPLE.merged.bam.OK'' - temp file indicating the merge step completed successfully
   −
* '''recab/'''
+
* '''recab/''' - contains recalibration results
 
** '''''*/SAMPLE.recal.bam'' - a merged, recalibrated, and deduped BAM file'''
 
** '''''*/SAMPLE.recal.bam'' - a merged, recalibrated, and deduped BAM file'''
 
** '''''*/SAMPLE.recal.bam.bai'' - the index (BAI) for the merged, recalibrated, and deduped BAM file'''
 
** '''''*/SAMPLE.recal.bam.bai'' - the index (BAI) for the merged, recalibrated, and deduped BAM file'''
Line 159: Line 159:  
** ''*/SAMPLE.recal.bam.done'' - temp file indicating the recalibration step completed successfully
 
** ''*/SAMPLE.recal.bam.done'' - temp file indicating the recalibration step completed successfully
 
** ''*/SAMPLE.recal.bam.bai.done'' - temp file indicating the indexing step completed successfully
 
** ''*/SAMPLE.recal.bam.bai.done'' - temp file indicating the indexing step completed successfully
You should see .done and .OK files for each SAMPLE in the index file. If you do not see the .done and .OK files, then your *recab* sub-pipeline failed.
     −
'''On success, the recab/ folder contains the final BAMs and bais.'''
+
* '''QCFiles/''' - contains quality control results
 +
** VerifyBamID Output - see [[VerifyBamID#A_guideline_to_interpret_output_files|VerifyBamID: A guideline to interpret output files]] for more information
 +
*** ''*/SAMPLE.genoCheck.depthRG'' - depth distribution of the sequence reads per read group
 +
*** ''*/SAMPLE.genoCheck.depthSM'' - depth distribution of the sequence reads per sample
 +
*** ''*/SAMPLE.genoCheck.err'' - log file
 +
*** ''*/SAMPLE.genoCheck.log'' - log file
 +
*** ''*/SAMPLE.genoCheck.OK'' - temp file indicating the VerifyBamID step completed successfully
 +
*** ''*/SAMPLE.genoCheck.selfRG'' - per-readGroup statistics describing how well each lane matches to the annotated sample
 +
*** '''''*/SAMPLE.genoCheck.selfSM'' - main output file containing the contamination estimate; per-sample statistics describing how well the sample matches to the annotated sample'''
 +
**** Check the 'FREEMIX' column for the genotype-free estimate of contamination (0-1 scale; lower is better)
 +
**** If [FREEMIX] >= 0.03 and [FREELK1]-[FREELK0] is large, possible contamination
 +
** Qplot Output - see: [[QPLOT#Diagnose_sequencing_quality|QPLOT: Diagnose sequencing quality]] for more info on how to use QPLOT results
 +
*** ''*/SAMPLE.qplot.OK'' - temp file indicating the qplot step completed successfully
 +
*** '''''*/SAMPLE.qplot.R'' - Rscript that can be used to generate the pdf graphs'''
 +
*** '''''*/SAMPLE.qplot.stats'' - sample statistics'''
 +
 
 +
You should see .done and .OK files for each SAMPLE in the index file. If you do not see the .done and .OK files, then your *recabQC* sub-pipeline failed.
 +
 
 +
'''On success, the recab/ folder contains the final BAMs and bais, while the QCFiles/ folder contains the quality control output'''
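
The FREEMIX rule described above can be applied to a ''SAMPLE.genoCheck.selfSM'' file with a few lines of Python. This is only a minimal sketch, not part of GotCloud: it assumes the file is tab-delimited with a header row containing <code>FREEMIX</code>, <code>FREELK1</code>, and <code>FREELK0</code> columns, and it treats the first column as the sample identifier. The function name <code>flag_contamination</code> and the sample values in the usage note are hypothetical. Because "large" is not defined on this page, the sketch only reports the likelihood difference and leaves the judgment to the reader.

```python
import csv
import io


def flag_contamination(selfsm_text, freemix_cutoff=0.03):
    """Return (sample, FREEMIX, FREELK1 - FREELK0) for rows at or above the cutoff.

    Assumes tab-delimited text with a header row; the first column is
    treated as the sample identifier whatever its name is.
    """
    reader = csv.DictReader(io.StringIO(selfsm_text), delimiter="\t")
    flagged = []
    for row in reader:
        freemix = float(row["FREEMIX"])
        if freemix >= freemix_cutoff:
            # Report the likelihood difference; no threshold is applied to it.
            lk_diff = float(row["FREELK1"]) - float(row["FREELK0"])
            flagged.append((row[reader.fieldnames[0]], freemix, lk_diff))
    return flagged
```

For example, passing the contents of a selfSM file read with <code>open(path).read()</code> returns one tuple per sample whose FREEMIX is at least 0.03; an empty list means no sample crossed the cutoff.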
 +
 
 +
===Command-Line and Configuration Options===
 +
 
 +
*Required Options
 +
{| class="wikitable" style="margin: 1em 1em 1em 0; background-color: #f9f9f9; border: 1px #aaa solid; border-collapse: collapse;" border="1"
 +
! Command-line Flag !! Configuration Key !! Value Description !! Default Value
 +
|-
 +
| --list/--bam_list/--bamlist ''file'' || BAM_LIST || path to the [[#BAM_LIST File for recabQC|BAM_LIST File]] || $(OUT_DIR)/bam.list
 +
|-
 +
| --numjobs ''#'' || || number of jobs to run in parallel || 0 (generate Makefile of steps, but do not run)
 +
|}
 +
 
 +
*Common Options
 +
 
 +
{| class="wikitable" style="margin: 1em 1em 1em 0; background-color: #f9f9f9; border: 1px #aaa solid; border-collapse: collapse;" border="1"
 +
! Command-line Flag !! Configuration Key !! Value Description !! Default Value
 +
|-
 +
| --outdir ''path'' || OUT_DIR || output directory ||
 +
|-
 +
| --conf ''file'' || || configuration file to use ||
 +
|-
 +
|  || REF_DIR || where the reference/resource files are stored || gotcloud.ref subdirectory within the base GotCloud directory
 +
|-
 +
| || REF || [[GotCloud: Genetic Reference and Resource Files#Reference fasta Files|Reference fasta Files]] || $(REF_DIR)/human.g1k.v37.fa
 +
|-
 +
| || DBSNP_VCF || [[GotCloud: Genetic Reference and Resource Files#DBSNP VCF File|DBSNP VCF Files]] || $(REF_DIR)/dbsnp_135.b37.vcf.gz
 +
|-
 +
| || HM3_VCF || [[GotCloud: Genetic Reference and Resource Files#HapMap3 VCF File|HapMap3 VCF Files]] || $(REF_DIR)/hapmap_3.3.b37.sites.vcf.gz
 +
|}
 +
 
 +
==== Example Configuration File ====
 +
Example configuration file where the reference files happen to be stored in /path/reference, and the bam list file is stored in path/freeze5
 +
BAM_LIST = /path/freeze5.bam.list
 +
OUT_DIR = /path/freeze5/output
 +
REF_DIR = /path/reference/
 +
REF = $(REF_DIR)/hs37d5.fa
 +
DBSNP_VCF = $(REF_DIR)/dbsnp_135.b37.sites.vcf.gz
 +
HM3_VCF = $(REF_DIR)/hapmap3_r3_b37.sites.vcf.gz
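
To see how the <code>$(REF_DIR)</code> references in a configuration like the one above resolve, the following Python sketch expands them. It is an illustration only and not GotCloud's own parser: it assumes one <code>KEY = VALUE</code> pair per line, that a referenced key is defined on an earlier line, and that lines starting with <code>#</code> are comments (the comment syntax is an assumption). The function name <code>parse_conf</code> is hypothetical.

```python
import re


def parse_conf(text):
    """Parse KEY = VALUE lines and expand $(KEY) references to earlier keys."""
    values = {}
    for line in text.splitlines():
        line = line.strip()
        # Skip blank lines, assumed comment lines, and lines without '='.
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        # Replace each $(NAME) with its value; unknown names are left as-is.
        value = re.sub(
            r"\$\((\w+)\)",
            lambda m: values.get(m.group(1), m.group(0)),
            value.strip(),
        )
        values[key.strip()] = value
    return values
```

With the example above, <code>REF</code> expands to <code>/path/reference//hs37d5.fa</code>. The doubled slash comes from the trailing slash in <code>REF_DIR</code> and is harmless on POSIX file systems.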
 +
 
 +
==== Example Command Line ====
 +
gotcloud pipe --name recabQC --numjobs <N>
 +
 
 +
== bamQC ==
 +
*What it does:
 +
# qplot
 +
# verifyBamID
 +
 
 +
====Inputs====
 +
* Single merged, recalibrated, and deduped BAM file for each subject (stored in a [[#BAM_LIST File for bamQC|BAM_LIST File]])
 +
* BAI file for each subject
 +
* Reference files
 +
* (Optional) configuration file to override default options
 +
 
 +
=====BAM_LIST File for bamQC=====
 +
* Each line of the BAM list file represents a single individual
 +
 
 +
Columns:
 +
# sample id
 +
# comma separated population labels (optional column)
 +
# BAM File (preferable to have full path to BAM file)
 +
# BAI File (preferable to have full path to BAI file)
 +
 
 +
[SAMPLE_ID] [COMMA SEPARATED POPULATION LABELS] [BAM_FILE] [BAI_FILE]
 +
or
 +
[SAMPLE_ID] [BAM_FILE] [BAI_FILE]
 +
 
 +
* Notes:
 +
** tab delimited
 +
** population label is optional - it will default to <code>ALL</code>
 +
*** only used by Thunder (part of ldrefine pipeline)
 +
*** if all samples are from the same population, population label can be skipped or you can just specify <code>ALL</code> for the population label for each sample.
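
The two accepted line layouts can be told apart by counting the tab-separated fields. The Python sketch below shows one way to read a single line of the bamQC BAM list; it is a hedged illustration rather than GotCloud's own code, and the function name <code>parse_bam_list_line</code> and the returned dictionary keys are hypothetical.

```python
def parse_bam_list_line(line):
    """Split one tab-delimited bamQC BAM list line into its columns.

    Four fields: sample id, population labels, BAM file, BAI file.
    Three fields: sample id, BAM file, BAI file (population defaults to ALL).
    """
    fields = line.rstrip("\n").split("\t")
    if len(fields) == 4:
        sample, labels, bam, bai = fields
        populations = labels.split(",")
    elif len(fields) == 3:
        sample, bam, bai = fields
        populations = ["ALL"]
    else:
        raise ValueError(f"expected 3 or 4 tab-separated fields, got {len(fields)}")
    return {"sample": sample, "populations": populations, "bam": bam, "bai": bai}
```

A line with any other number of fields raises <code>ValueError</code>, which makes space-delimited files (a common mistake) fail early instead of being silently misread.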
 +
 
 +
====Outputs====
 +
Upon successful completion of the *bamQC* sub-pipeline, you should see the following files/subdirectories under the user specified output directory:
 +
 
 +
* '''QCFiles/''' - contains quality control results
 +
** VerifyBamID Output - see [[VerifyBamID#A_guideline_to_interpret_output_files|VerifyBamID: A guideline to interpret output files]] for more information
 +
*** ''*/SAMPLE.genoCheck.depthRG'' - depth distribution of the sequence reads per read group
 +
*** ''*/SAMPLE.genoCheck.depthSM'' - depth distribution of the sequence reads per sample
 +
*** ''*/SAMPLE.genoCheck.err'' - log file
 +
*** ''*/SAMPLE.genoCheck.log'' - log file
 +
*** ''*/SAMPLE.genoCheck.OK'' - temp file indicating the VerifyBamID step completed successfully
 +
*** ''*/SAMPLE.genoCheck.selfRG'' - per-readGroup statistics describing how well each lane matches to the annotated sample
 +
*** '''''*/SAMPLE.genoCheck.selfSM'' - main output file containing the contamination estimate; per-sample statistics describing how well the sample matches to the annotated sample'''
 +
**** Check the 'FREEMIX' column for the genotype-free estimate of contamination (0-1 scale; lower is better)
 +
**** If [FREEMIX] >= 0.03 and [FREELK1]-[FREELK0] is large, possible contamination
 +
** Qplot Output - see: [[QPLOT#Diagnose_sequencing_quality|QPLOT: Diagnose sequencing quality]] for more info on how to use QPLOT results
 +
*** ''*/SAMPLE.qplot.OK'' - temp file indicating the qplot step completed successfully
 +
*** '''''*/SAMPLE.qplot.R'' - Rscript that can be used to generate the pdf graphs'''
 +
*** '''''*/SAMPLE.qplot.stats'' - sample statistics'''
 +
 
 +
You should see .done and .OK files for each SAMPLE in the index file. If you do not see the .done and .OK files, then your *bamQC* sub-pipeline failed.
 +
 
 +
'''On success, the QCFiles/ folder contains the quality control output'''
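
Checking the marker files by hand is tedious for large sample sets. The Python sketch below lists the samples that lack a <code>genoCheck.OK</code> or <code>qplot.OK</code> file anywhere under <code>QCFiles/</code>. It is an illustration under stated assumptions, not a GotCloud tool: the subdirectory layout below <code>QCFiles/</code> is not spelled out on this page, so the sketch searches recursively, and the function name <code>samples_missing_ok</code> is hypothetical.

```python
from pathlib import Path


def samples_missing_ok(out_dir, samples, steps=("genoCheck", "qplot")):
    """Map each sample to the QC steps that have no SAMPLE.<step>.OK file."""
    qc_dir = Path(out_dir) / "QCFiles"
    missing = {}
    for sample in samples:
        # rglob searches every subdirectory, so the exact layout does not matter.
        absent = [
            step for step in steps
            if not any(qc_dir.rglob(f"{sample}.{step}.OK"))
        ]
        if absent:
            missing[sample] = absent
    return missing
```

An empty dictionary means every listed sample has both marker files; any other result names the samples and steps to investigate in the corresponding log files.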
    
===Command-Line and Configuration Options===
 
===Command-Line and Configuration Options===
Line 169: Line 280:  
! Command-line Flag !! Configuration Key !! Value Description !! Default Value
 
! Command-line Flag !! Configuration Key !! Value Description !! Default Value
 
|-
 
|-
| --list/--bam_list/--bamlist ''file'' || BAM_LIST || path to the [[#BAM_LIST File|BAM_LIST File]] || $(OUT_DIR)/bam.list
+
| --list/--bam_list/--bamlist ''file'' || BAM_LIST || path to the [[#BAM_LIST File for bamQC|BAM_LIST File]] || $(OUT_DIR)/bam.list
 
|-
 
|-
 
| --numjobs ''#'' || || number of jobs to run in parallel || 0 (generate Makefile of steps, but do not run)
 
| --numjobs ''#'' || || number of jobs to run in parallel || 0 (generate Makefile of steps, but do not run)
Line 187: Line 298:  
| || REF || [[GotCloud: Genetic Reference and Resource Files#Reference fasta Files|Reference fasta Files]] || $(REF_DIR)/human.g1k.v37.fa
 
| || REF || [[GotCloud: Genetic Reference and Resource Files#Reference fasta Files|Reference fasta Files]] || $(REF_DIR)/human.g1k.v37.fa
 
|-
 
|-
| || DBSNP_VCF || [[GotCloud: Genetic Reference and Resource Files#DBSNP VCF Files|DBSNP VCF Files]] || $(REF_DIR)/dbsnp_135.b37.vcf.gz
+
| || DBSNP_VCF || [[GotCloud: Genetic Reference and Resource Files#DBSNP VCF File|DBSNP VCF Files]] || $(REF_DIR)/dbsnp_135.b37.vcf.gz
 +
|-
 +
| || HM3_VCF || [[GotCloud: Genetic Reference and Resource Files#HapMap3 VCF File|HapMap3 VCF Files]] || $(REF_DIR)/hapmap_3.3.b37.sites.vcf.gz
 
|}
 
|}
   Line 197: Line 310:  
  REF = $(REF_DIR)/hs37d5.fa
 
  REF = $(REF_DIR)/hs37d5.fa
 
  DBSNP_VCF = $(REF_DIR)/dbsnp_135.b37.sites.vcf.gz
 
  DBSNP_VCF = $(REF_DIR)/dbsnp_135.b37.sites.vcf.gz
 +
HM3_VCF = $(REF_DIR)/hapmap3_r3_b37.sites.vcf.gz
    
==== Example Command Line ====
 
==== Example Command Line ====
  gotcloud pipe –-name recab --numjobs <N>
+
  gotcloud pipe --name bamQC --numjobs <N>
   −
== bamQC ==
   
== bamQC_createIndex ==
 
== bamQC_createIndex ==
 +
*What it does:
 +
# creates a BAI file for any BAM that is missing it
 +
# qplot
 +
# verifyBamID
 +
 +
====Inputs====
 +
* Single merged, recalibrated, and deduped BAM file for each subject (stored in a [[#BAM_LIST File for bamQC_createIndex|BAM_LIST File]])
 +
* Reference files
 +
* (Optional) configuration file to override default options
 +
 +
=====BAM_LIST File for bamQC_createIndex=====
 +
* Each line of the BAM list file represents a single individual
 +
 +
Columns:
 +
# sample id
 +
# comma separated population labels (optional column)
 +
# BAM File (preferable to have full paths to BAM files)
 +
 +
[SAMPLE_ID] [COMMA SEPARATED POPULATION LABELS] [BAM_FILE]
 +
or
 +
[SAMPLE_ID] [BAM_FILE]
 +
 +
* Notes:
 +
** tab delimited
 +
** population label is optional - it will default to <code>ALL</code>
 +
*** only used by Thunder (part of ldrefine pipeline)
 +
*** if all samples are from the same population, population label can be skipped or you can just specify <code>ALL</code> for the population label for each sample.
 +
 +
====Outputs====
 +
Upon successful completion of the *bamQC_createIndex* sub-pipeline, you should see the following files/subdirectories under the user specified output directory:
 +
* A BAI file with the exact same path and name as the BAM file that was input, with *.bai on the end
 +
* '''QCFiles/''' - contains quality control results
 +
** VerifyBamID Output - see [[VerifyBamID#A_guideline_to_interpret_output_files|VerifyBamID: A guideline to interpret output files]] for more information
 +
*** ''*/SAMPLE.genoCheck.depthRG'' - depth distribution of the sequence reads per read group
 +
*** ''*/SAMPLE.genoCheck.depthSM'' - depth distribution of the sequence reads per sample
 +
*** ''*/SAMPLE.genoCheck.err'' - log file
 +
*** ''*/SAMPLE.genoCheck.log'' - log file
 +
*** ''*/SAMPLE.genoCheck.OK'' - temp file indicating the VerifyBamID step completed successfully
 +
*** ''*/SAMPLE.genoCheck.selfRG'' - per-readGroup statistics describing how well each lane matches to the annotated sample
 +
*** '''''*/SAMPLE.genoCheck.selfSM'' - main output file containing the contamination estimate; per-sample statistics describing how well the sample matches to the annotated sample'''
 +
**** Check the 'FREEMIX' column for the genotype-free estimate of contamination (0-1 scale; lower is better)
 +
**** If [FREEMIX] >= 0.03 and [FREELK1]-[FREELK0] is large, possible contamination
 +
** Qplot Output - see: [[QPLOT#Diagnose_sequencing_quality|QPLOT: Diagnose sequencing quality]] for more info on how to use QPLOT results
 +
*** ''*/SAMPLE.qplot.OK'' - temp file indicating the qplot step completed successfully
 +
*** '''''*/SAMPLE.qplot.R'' - Rscript that can be used to generate the pdf graphs'''
 +
*** '''''*/SAMPLE.qplot.stats'' - sample statistics'''
 +
 +
You should see .done and .OK files for each SAMPLE in the index file. If you do not see the .done and .OK files, then your *bamQC_createIndex* sub-pipeline failed.
 +
 +
'''On success, the QCFiles/ folder contains the quality control output'''
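
Since the BAI file is written next to the input BAM with <code>.bai</code> appended to the full file name, it is easy to predict which inputs still need an index. The Python sketch below does this check; it is only an illustration and not how GotCloud itself decides, and the function name <code>bams_missing_index</code> is hypothetical.

```python
import os


def bams_missing_index(bam_paths):
    """Return the BAM paths that have no '<bam path>.bai' file beside them."""
    return [bam for bam in bam_paths if not os.path.exists(bam + ".bai")]
```

Running it on the BAM column of the BAM list before and after the sub-pipeline shows which indexes were created; after a successful run the list should be empty.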
 +
 +
===Command-Line and Configuration Options===
 +
 +
*Required Options
 +
{| class="wikitable" style="margin: 1em 1em 1em 0; background-color: #f9f9f9; border: 1px #aaa solid; border-collapse: collapse;" border="1"
 +
! Command-line Flag !! Configuration Key !! Value Description !! Default Value
 +
|-
 +
| --list/--bam_list/--bamlist ''file'' || BAM_LIST || path to the [[#BAM_LIST File for bamQC_createIndex|BAM_LIST File]] || $(OUT_DIR)/bam.list
 +
|-
 +
| --numjobs ''#'' || || number of jobs to run in parallel || 0 (generate Makefile of steps, but do not run)
 +
|}
 +
 +
*Common Options
 +
 +
{| class="wikitable" style="margin: 1em 1em 1em 0; background-color: #f9f9f9; border: 1px #aaa solid; border-collapse: collapse;" border="1"
 +
! Command-line Flag !! Configuration Key !! Value Description !! Default Value
 +
|-
 +
| --outdir ''path'' || OUT_DIR || output directory ||
 +
|-
 +
| --conf ''file'' || || configuration file to use ||
 +
|-
 +
|  || REF_DIR || where the reference/resource files are stored || gotcloud.ref subdirectory within the base GotCloud directory
 +
|-
 +
| || REF || [[GotCloud: Genetic Reference and Resource Files#Reference fasta Files|Reference fasta Files]] || $(REF_DIR)/human.g1k.v37.fa
 +
|-
 +
| || DBSNP_VCF || [[GotCloud: Genetic Reference and Resource Files#DBSNP VCF File|DBSNP VCF Files]] || $(REF_DIR)/dbsnp_135.b37.vcf.gz
 +
|-
 +
| || HM3_VCF || [[GotCloud: Genetic Reference and Resource Files#HapMap3 VCF File|HapMap3 VCF Files]] || $(REF_DIR)/hapmap_3.3.b37.sites.vcf.gz
 +
|}
 +
 +
==== Example Configuration File ====
 +
Example configuration file where the reference files happen to be stored in /path/reference, and the bam list file is stored in path/freeze5
 +
BAM_LIST = /path/freeze5.bam.list
 +
OUT_DIR = /path/freeze5/output
 +
REF_DIR = /path/reference/
 +
REF = $(REF_DIR)/hs37d5.fa
 +
DBSNP_VCF = $(REF_DIR)/dbsnp_135.b37.sites.vcf.gz
 +
HM3_VCF = $(REF_DIR)/hapmap3_r3_b37.sites.vcf.gz
 +
 +
==== Example Command Line ====
 +
gotcloud pipe --name bamQC_createIndex --numjobs <N>