Changes

11,437 bytes added, 21:37, 18 March 2015

→‎Example Command Line

Line 28: Line 28:

====Inputs====

−

* Bam files (stored in a [[#BAM_LIST|BAM_LIST]] ~~file~~)

+

* Bam files (stored in a [[#BAM_LIST File for recab|BAM_LIST File]])

* Reference files

* (Optional) configuration file to override default options

−

=====BAM_LIST File=====

+

=====BAM_LIST File for recab=====

* Each line of the BAM list file represents a single individual

Line 78: Line 78:

! Command-line Flag !! Configuration Key !! Value Description !! Default Value

|-

−

| --list/--bam_list/--bamlist ''file'' || BAM_LIST || path to the [[#BAM_LIST File|BAM_LIST File]] || $(OUT_DIR)/bam.list

+

| --list/--bam_list/--bamlist ''file'' || BAM_LIST || path to the [[#BAM_LIST File for recab|BAM_LIST File]] || $(OUT_DIR)/bam.list

|-

| --numjobs ''#'' || || number of jobs to run in parallel || 0 (generate Makefile of steps, but do not run)

Line 119: Line 119:

====Inputs====

−

* Bam files (stored in a [[#BAM_LIST|BAM_LIST]] file)

+

* Bam files (stored in a [[#BAM_LIST File for recabQC|BAM_LIST]] file)

* Reference files

* (Optional) configuration file to override default options

−

=====BAM_LIST File=====

+

=====BAM_LIST File for recabQC=====

* Each line of the BAM list file represents a single individual

Line 146: Line 146:

====Outputs====

−

Upon successful completion of the *~~recab~~* sub-pipeline, you should see the following files/subdirectories under the user specified output directory:

+

Upon successful completion of the *recabQC* sub-pipeline, you should see the following files/subdirectories under the user specified output directory:

−

*'''recab/mergedBams/'''

+

*'''recab/mergedBams/''' - contains merge results

** '''''*/SAMPLE.merged.bam'' - a merged BAM file'''

** ''*/SAMPLE.merged.bam.log'' - merge log

** ''*/SAMPLE.merged.bam.OK'' - temp file indicating the merge step completed successfully

−

* '''recab/'''

+

* '''recab/''' - contains recalibration results

** '''''*/SAMPLE.recal.bam'' - a merged, recalibrated, and deduped BAM file'''

** '''''*/SAMPLE.recal.bam.bai'' - an indexed version of the merged, recalibrated, and deduped BAM file'''

Line 159: Line 159:

** ''*/SAMPLE.recal.bam.done'' - temp file indicating the recalibration step completed successfully

** ''*/SAMPLE.recal.bam.bai.done'' - temp file indicating the indexing step completed successfully

−

~~You should see .done and .OK files for each SAMPLE in the index file. If you do not see the .done and .OK files, then your *recab* sub-pipeline failed.~~

−

'''On success, the recab/ folder contains the final BAMs and bais.'''

+

* '''QCFiles/''' - contains quality control results

+

** VerifyBamID Output - see [[VerifyBamID#A_guideline_to_interpret_output_files|VerifyBamID: A guideline to interpret output files]] for more information

+

*** ''*/SAMPLE.genoCheck.depthRG'' - depth distribution of the sequence reads per read group

+

*** ''*/SAMPLE.genoCheck.depthSM'' - depth distribution of the sequence reads per sample

+

*** ''*/SAMPLE.genoCheck.err'' - log file

+

*** ''*/SAMPLE.genoCheck.log'' - log file

+

*** ''*/SAMPLE.genoCheck.OK'' - temp file indicating the verifyBamID step completed successfully

+

*** ''*/SAMPLE.genoCheck.selfRG'' - per-readGroup statistics describing how well each lane matches to the annotated sample

+

*** '''''*/SAMPLE.genoCheck.selfSM'' - main output file containing the contamination estimate; per-sample statistics describing how well the sample matches to the annotated sample'''

+

**** Check the 'FREEMIX' column for the genotype-free estimate of contamination (0-1 scale; lower is better)

+

**** If [FREEMIX] >= 0.03 and [FREELK1]-[FREELK0] is large, the sample may be contaminated

+

** Qplot Output - see: [[QPLOT#Diagnose_sequencing_quality|QPLOT: Diagnose sequencing quality]] for more info on how to use QPLOT results

+

*** ''*/SAMPLE.qplot.OK'' - temp file indicating the qplot step completed successfully

+

*** '''''*/SAMPLE.qplot.R'' - Rscript that can be used to generate the pdf graphs'''

+

*** '''''*/SAMPLE.qplot.stats'' - sample statistics'''

+

You should see .done and .OK files for each SAMPLE in the index file. If you do not see the .done and .OK files, then your *recabQC* sub-pipeline failed.

+

'''On success, the recab/ folder contains the final BAMs and bais, while the QCFiles/ folder contains the quality control output'''
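The completion check described above can be scripted. The following is a minimal sketch, not part of GotCloud: the function name <code>missing_markers</code> is hypothetical, and it assumes the marker files sit somewhere beneath the listed folders (the <code>*/</code> subdirectory is searched recursively because its name is not specified here).

```python
from pathlib import Path

# (folder under the output directory, marker file name) pairs taken from the
# output listing above. Searching recursively below each folder is an
# assumption, since the listing only shows "*/" for the subdirectory.
MARKERS = [
    ("recab/mergedBams", "{sample}.merged.bam.OK"),
    ("recab", "{sample}.recal.bam.done"),
    ("recab", "{sample}.recal.bam.bai.done"),
    ("QCFiles", "{sample}.genoCheck.OK"),
    ("QCFiles", "{sample}.qplot.OK"),
]


def missing_markers(out_dir, samples, markers=MARKERS):
    """Return {sample: [missing marker names]} for samples that look incomplete."""
    out = Path(out_dir)
    report = {}
    for sample in samples:
        missing = []
        for folder, template in markers:
            name = template.format(sample=sample)
            base = out / folder
            # A marker counts as present if it exists anywhere below the folder.
            if not (base.is_dir() and any(base.rglob(name))):
                missing.append(f"{folder}/.../{name}")
        if missing:
            report[sample] = missing
    return report
```

An empty result means every listed marker was found for every sample; any entry in the result names the steps worth inspecting in the corresponding log files.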

+

===Command-Line and Configuration Options===

+

*Required Options

+

{| class="wikitable" style="margin: 1em 1em 1em 0; background-color: #f9f9f9; border: 1px #aaa solid; border-collapse: collapse;" border="1"

+

! Command-line Flag !! Configuration Key !! Value Description !! Default Value

+

|-

+

| --list/--bam_list/--bamlist ''file'' || BAM_LIST || path to the [[#BAM_LIST File for recabQC|BAM_LIST File]] || $(OUT_DIR)/bam.list

+

|-

+

| --numjobs ''#'' || || number of jobs to run in parallel || 0 (generate Makefile of steps, but do not run)

+

|}

+

*Common Options

+

{| class="wikitable" style="margin: 1em 1em 1em 0; background-color: #f9f9f9; border: 1px #aaa solid; border-collapse: collapse;" border="1"

+

! Command-line Flag !! Configuration Key !! Value Description !! Default Value

+

|-

+

| --outdir ''path'' || OUT_DIR || output directory ||

+

|-

+

| --conf ''file'' || || configuration file to use ||

+

|-

+

| || REF_DIR || where the reference/resource files are stored || gotcloud.ref subdirectory within the base GotCloud directory

+

|-

+

| || REF || [[GotCloud: Genetic Reference and Resource Files#Reference fasta Files|Reference fasta Files]] || $(REF_DIR)/human.g1k.v37.fa

+

|-

+

| || DBSNP_VCF || [[GotCloud: Genetic Reference and Resource Files#DBSNP VCF File|DBSNP VCF Files]] || $(REF_DIR)/dbsnp_135.b37.vcf.gz

+

|-

+

| || HM3_VCF || [[GotCloud: Genetic Reference and Resource Files#HapMap3 VCF File|HapMap3 VCF Files]] || $(REF_DIR)/hapmap_3.3.b37.sites.vcf.gz

+

|}

+

==== Example Configuration File ====

+

Example configuration file where the reference files are stored in /path/reference and the BAM list file is /path/freeze5.bam.list:

+

BAM_LIST = /path/freeze5.bam.list

+

OUT_DIR = /path/freeze5/output

+

REF_DIR = /path/reference/

+

REF = $(REF_DIR)/hs37d5.fa

+

DBSNP_VCF = $(REF_DIR)/dbsnp_135.b37.sites.vcf.gz

+

HM3_VCF = $(REF_DIR)/hapmap3_r3_b37.sites.vcf.gz
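To preview which paths a configuration like the one above resolves to, the <code>$(KEY)</code> references can be expanded by hand. This is a rough sketch only: <code>expand_config</code> is a hypothetical helper, and the parsing rules it uses (one <code>KEY = VALUE</code> per line, <code>#</code> starting a comment line, unknown references left untouched) are assumptions rather than GotCloud's actual behaviour.

```python
import re

REFERENCE = re.compile(r"\$\((\w+)\)")


def expand_config(text):
    """Parse KEY = VALUE lines and expand $(KEY) references between them."""
    values = {}
    for raw in text.splitlines():
        line = raw.strip()
        # Assumed conventions: blank lines and '#' comment lines are skipped.
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, value = (part.strip() for part in line.split("=", 1))
        values[key] = value

    def resolve(value, depth=0):
        if depth > 10:
            raise ValueError("reference nesting too deep (possible cycle)")

        def substitute(match):
            key = match.group(1)
            if key not in values:
                return match.group(0)  # leave unknown references untouched
            return resolve(values[key], depth + 1)

        return REFERENCE.sub(substitute, value)

    return {key: resolve(value) for key, value in values.items()}
```

With the example above, <code>REF</code> expands to <code>/path/reference//hs37d5.fa</code>; the doubled slash comes from the trailing slash on <code>REF_DIR</code> and is harmless on POSIX file systems.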

+

==== Example Command Line ====

+

gotcloud pipe --name recabQC --numjobs <N>
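When the command is launched from a script, building the argument list explicitly avoids quoting and dash-encoding mistakes. A minimal sketch, using only flags documented on this page (<code>--name</code>, <code>--numjobs</code>, <code>--conf</code>, <code>--outdir</code>); the helper name and the configuration path in the usage line are hypothetical.

```python
import shlex


def build_gotcloud_command(pipeline, numjobs, conf=None, outdir=None):
    """Return the argument list for 'gotcloud pipe' without running it."""
    cmd = ["gotcloud", "pipe", "--name", pipeline, "--numjobs", str(numjobs)]
    if conf is not None:
        cmd += ["--conf", conf]
    if outdir is not None:
        cmd += ["--outdir", outdir]
    return cmd


# Print the command for review; pass the list to subprocess.run() to execute it.
print(shlex.join(build_gotcloud_command("recabQC", 4, conf="/path/freeze5.conf")))
```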

+

== bamQC ==

+

*What it does:

+

# qplot

+

# verifyBamID

+

====Inputs====

+

* Single merged, recalibrated, and deduped BAM file for each subject (stored in a [[#BAM_LIST File for bamQC|BAM_LIST File]])

+

* BAI file for each subject

+

* Reference files

+

* (Optional) configuration file to override default options

+

=====BAM_LIST File for bamQC=====

+

* Each line of the BAM list file represents a single individual

+

Columns:

+

# sample id

+

# comma separated population labels (optional column)

+

# BAM File (preferable to have full path to BAM file)

+

# BAI File (preferable to have full path to BAI file)

+

[SAMPLE_ID] [COMMA SEPARATED POPULATION LABELS] [BAM_FILE] [BAI_FILE]

+

or

+

[SAMPLE_ID] [BAM_FILE] [BAI_FILE]

+

* Notes:

+

** tab delimited

+

** population label is optional - it will default to <code>ALL</code>

+

*** only used by Thunder (part of ldrefine pipeline)

+

*** if all samples are from the same population, population label can be skipped or you can just specify <code>ALL</code> for the population label for each sample.
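The column rules above can be checked before starting a run. The sketch below is illustrative only: <code>parse_bamqc_list</code> is a hypothetical helper, and skipping blank lines is an assumption not stated on this page.

```python
def parse_bamqc_list(lines):
    """Parse tab-delimited bamQC BAM list lines into dictionaries.

    Accepts either 4 columns (sample, populations, BAM, BAI) or
    3 columns (sample, BAM, BAI); the population defaults to ALL.
    """
    records = []
    for lineno, line in enumerate(lines, start=1):
        line = line.rstrip("\r\n")
        if not line.strip():
            continue  # assumption: blank lines are ignored
        fields = line.split("\t")
        if len(fields) == 4:
            sample, populations, bam, bai = fields
        elif len(fields) == 3:
            sample, bam, bai = fields
            populations = "ALL"
        else:
            raise ValueError(
                f"line {lineno}: expected 3 or 4 tab-delimited columns, "
                f"found {len(fields)}"
            )
        records.append({
            "sample": sample,
            "populations": populations.split(","),
            "bam": bam,
            "bai": bai,
        })
    return records
```

Passing an open file handle works as well as a list of strings, for example <code>parse_bamqc_list(open("bam.list"))</code>.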

+

====Outputs====

+

Upon successful completion of the *bamQC* sub-pipeline, you should see the following files/subdirectories under the user specified output directory:

+

* '''QCFiles/''' - contains quality control results

+

** VerifyBamID Output - see [[VerifyBamID#A_guideline_to_interpret_output_files|VerifyBamID: A guideline to interpret output files]] for more information

+

*** ''*/SAMPLE.genoCheck.depthRG'' - depth distribution of the sequence reads per read group

+

*** ''*/SAMPLE.genoCheck.depthSM'' - depth distribution of the sequence reads per sample

+

*** ''*/SAMPLE.genoCheck.err'' - log file

+

*** ''*/SAMPLE.genoCheck.log'' - log file

+

*** ''*/SAMPLE.genoCheck.OK'' - temp file indicating the verifyBamID step completed successfully

+

*** ''*/SAMPLE.genoCheck.selfRG'' - per-readGroup statistics describing how well each lane matches to the annotated sample

+

*** '''''*/SAMPLE.genoCheck.selfSM'' - main output file containing the contamination estimate; per-sample statistics describing how well the sample matches to the annotated sample'''

+

**** Check the 'FREEMIX' column for the genotype-free estimate of contamination (0-1 scale; lower is better)

+

**** If [FREEMIX] >= 0.03 and [FREELK1]-[FREELK0] is large, the sample may be contaminated

+

** Qplot Output - see: [[QPLOT#Diagnose_sequencing_quality|QPLOT: Diagnose sequencing quality]] for more info on how to use QPLOT results

+

*** ''*/SAMPLE.qplot.OK'' - temp file indicating the qplot step completed successfully

+

*** '''''*/SAMPLE.qplot.R'' - Rscript that can be used to generate the pdf graphs'''

+

*** '''''*/SAMPLE.qplot.stats'' - sample statistics'''

+

You should see .done and .OK files for each SAMPLE in the index file. If you do not see the .done and .OK files, then your *bamQC* sub-pipeline failed.

+

'''On success, the QCFiles/ folder contains the quality control output'''
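The FREEMIX rule of thumb above can be applied to a ''SAMPLE.genoCheck.selfSM'' file with a short script. This is a sketch under assumptions: it expects a tab-delimited file whose header row names the columns <code>#SEQ_ID</code>, <code>FREEMIX</code>, <code>FREELK1</code> and <code>FREELK0</code>, and the function name is hypothetical. Because this page does not say how large the likelihood difference must be, the script reports the difference and leaves that judgement to the reader.

```python
import csv


def flag_contamination(selfsm_path, freemix_cutoff=0.03):
    """Return (sample, FREEMIX, FREELK1 - FREELK0) for rows at or above the cutoff."""
    flagged = []
    with open(selfsm_path, newline="") as handle:
        # Columns are looked up by header name, so their order does not matter.
        reader = csv.DictReader(handle, delimiter="\t")
        for row in reader:
            freemix = float(row["FREEMIX"])
            if freemix >= freemix_cutoff:
                lk_diff = float(row["FREELK1"]) - float(row["FREELK0"])
                flagged.append((row["#SEQ_ID"], freemix, lk_diff))
    return flagged
```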

===Command-Line and Configuration Options===

Line 169: Line 280:

! Command-line Flag !! Configuration Key !! Value Description !! Default Value

|-

−

| --list/--bam_list/--bamlist ''file'' || BAM_LIST || path to the [[#BAM_LIST File|BAM_LIST File]] || $(OUT_DIR)/bam.list

+

| --list/--bam_list/--bamlist ''file'' || BAM_LIST || path to the [[#BAM_LIST File for bamQC|BAM_LIST File]] || $(OUT_DIR)/bam.list

|-

| --numjobs ''#'' || || number of jobs to run in parallel || 0 (generate Makefile of steps, but do not run)

Line 187: Line 298:

| || REF || [[GotCloud: Genetic Reference and Resource Files#Reference fasta Files|Reference fasta Files]] || $(REF_DIR)/human.g1k.v37.fa

|-

−

| || DBSNP_VCF || [[GotCloud: Genetic Reference and Resource Files#DBSNP VCF ~~Files~~|DBSNP VCF Files]] || $(REF_DIR)/dbsnp_135.b37.vcf.gz

+

| || DBSNP_VCF || [[GotCloud: Genetic Reference and Resource Files#DBSNP VCF File|DBSNP VCF Files]] || $(REF_DIR)/dbsnp_135.b37.vcf.gz

+

|-

+

| || HM3_VCF || [[GotCloud: Genetic Reference and Resource Files#HapMap3 VCF File|HapMap3 VCF Files]] || $(REF_DIR)/hapmap_3.3.b37.sites.vcf.gz

|}

Line 197: Line 310:

REF = $(REF_DIR)/hs37d5.fa

DBSNP_VCF = $(REF_DIR)/dbsnp_135.b37.sites.vcf.gz

+

HM3_VCF = $(REF_DIR)/hapmap3_r3_b37.sites.vcf.gz

==== Example Command Line ====

−

gotcloud pipe –-name ~~recab~~ --numjobs <N>

+

gotcloud pipe --name bamQC --numjobs <N>

−

~~== bamQC ==~~

== bamQC_createIndex ==

+

*What it does:

+

# creates a BAI file for any BAM that is missing it

+

# qplot

+

# verifyBamID

+

====Inputs====

+

* Single merged, recalibrated, and deduped BAM file for each subject (stored in a [[#BAM_LIST File for bamQC_createIndex|BAM_LIST File]])

+

* Reference files

+

* (Optional) configuration file to override default options

+

=====BAM_LIST File for bamQC_createIndex=====

+

* Each line of the BAM list file represents a single individual

+

Columns:

+

# sample id

+

# comma separated population labels (optional column)

+

# BAM File (preferable to have full paths to BAM files)

+

[SAMPLE_ID] [COMMA SEPARATED POPULATION LABELS] [BAM_FILE]

+

or

+

[SAMPLE_ID] [BAM_FILE]

+

* Notes:

+

** tab delimited

+

** population label is optional - it will default to <code>ALL</code>

+

*** only used by Thunder (part of ldrefine pipeline)

+

*** if all samples are from the same population, population label can be skipped or you can just specify <code>ALL</code> for the population label for each sample.

+

====Outputs====

+

Upon successful completion of the *bamQC_createIndex* sub-pipeline, you should see the following files/subdirectories under the user specified output directory:

+

* A BAI file with the exact same path and name as the BAM file that was input, with *.bai on the end

+

* '''QCFiles/''' - contains quality control results

+

** VerifyBamID Output - see [[VerifyBamID#A_guideline_to_interpret_output_files|VerifyBamID: A guideline to interpret output files]] for more information

+

*** ''*/SAMPLE.genoCheck.depthRG'' - depth distribution of the sequence reads per read group

+

*** ''*/SAMPLE.genoCheck.depthSM'' - depth distribution of the sequence reads per sample

+

*** ''*/SAMPLE.genoCheck.err'' - log file

+

*** ''*/SAMPLE.genoCheck.log'' - log file

+

*** ''*/SAMPLE.genoCheck.OK'' - temp file indicating the verifyBamID step completed successfully

+

*** ''*/SAMPLE.genoCheck.selfRG'' - per-readGroup statistics describing how well each lane matches to the annotated sample

+

*** '''''*/SAMPLE.genoCheck.selfSM'' - main output file containing the contamination estimate; per-sample statistics describing how well the sample matches to the annotated sample'''

+

**** Check the 'FREEMIX' column for the genotype-free estimate of contamination (0-1 scale; lower is better)

+

**** If [FREEMIX] >= 0.03 and [FREELK1]-[FREELK0] is large, the sample may be contaminated

+

** Qplot Output - see: [[QPLOT#Diagnose_sequencing_quality|QPLOT: Diagnose sequencing quality]] for more info on how to use QPLOT results

+

*** ''*/SAMPLE.qplot.OK'' - temp file indicating the qplot step completed successfully

+

*** '''''*/SAMPLE.qplot.R'' - Rscript that can be used to generate the pdf graphs'''

+

*** '''''*/SAMPLE.qplot.stats'' - sample statistics'''

+

You should see .done and .OK files for each SAMPLE in the index file. If you do not see the .done and .OK files, then your *bamQC_createIndex* sub-pipeline failed.

+

'''On success, the QCFiles/ folder contains the quality control output'''
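Since each BAI file is written next to its BAM with <code>.bai</code> appended to the full BAM name, the indexing step can be verified by looking for that path. A minimal sketch; the function name is hypothetical and not part of GotCloud.

```python
import os


def bams_missing_index(bam_paths):
    """Return the BAM paths that have no '<bam path>.bai' file beside them."""
    return [bam for bam in bam_paths if not os.path.exists(bam + ".bai")]
```

Running it before the sub-pipeline shows which BAMs will be indexed; running it afterwards should return an empty list.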

+

===Command-Line and Configuration Options===

+

*Required Options

+

{| class="wikitable" style="margin: 1em 1em 1em 0; background-color: #f9f9f9; border: 1px #aaa solid; border-collapse: collapse;" border="1"

+

! Command-line Flag !! Configuration Key !! Value Description !! Default Value

+

|-

+

| --list/--bam_list/--bamlist ''file'' || BAM_LIST || path to the [[#BAM_LIST File for bamQC_createIndex|BAM_LIST File]] || $(OUT_DIR)/bam.list

+

|-

+

| --numjobs ''#'' || || number of jobs to run in parallel || 0 (generate Makefile of steps, but do not run)

+

|}

+

*Common Options

+

{| class="wikitable" style="margin: 1em 1em 1em 0; background-color: #f9f9f9; border: 1px #aaa solid; border-collapse: collapse;" border="1"

+

! Command-line Flag !! Configuration Key !! Value Description !! Default Value

+

|-

+

| --outdir ''path'' || OUT_DIR || output directory ||

+

|-

+

| --conf ''file'' || || configuration file to use ||

+

|-

+

| || REF_DIR || where the reference/resource files are stored || gotcloud.ref subdirectory within the base GotCloud directory

+

|-

+

| || REF || [[GotCloud: Genetic Reference and Resource Files#Reference fasta Files|Reference fasta Files]] || $(REF_DIR)/human.g1k.v37.fa

+

|-

+

| || DBSNP_VCF || [[GotCloud: Genetic Reference and Resource Files#DBSNP VCF File|DBSNP VCF Files]] || $(REF_DIR)/dbsnp_135.b37.vcf.gz

+

|-

+

| || HM3_VCF || [[GotCloud: Genetic Reference and Resource Files#HapMap3 VCF File|HapMap3 VCF Files]] || $(REF_DIR)/hapmap_3.3.b37.sites.vcf.gz

+

|}

+

==== Example Configuration File ====

+

Example configuration file where the reference files are stored in /path/reference and the BAM list file is /path/freeze5.bam.list:

+

BAM_LIST = /path/freeze5.bam.list

+

OUT_DIR = /path/freeze5/output

+

REF_DIR = /path/reference/

+

REF = $(REF_DIR)/hs37d5.fa

+

DBSNP_VCF = $(REF_DIR)/dbsnp_135.b37.sites.vcf.gz

+

HM3_VCF = $(REF_DIR)/hapmap3_r3_b37.sites.vcf.gz

+

==== Example Command Line ====

+

gotcloud pipe --name bamQC_createIndex --numjobs <N>

Kleckner

87

edits


GotCloud: Alignment Sub-Pipelines (view source)

Revision as of 21:37, 18 March 2015
