Changes

no edit summary
Line 1: Line 1: −
= Mapping Pipeline =
+
= Alignment Pipeline =  
   −
Back to parent: [[GotCloud]]
+
Back to parent: [[GotCloud]]  
   −
The Mapping Pipeline takes FASTQ files and generates recalibrated BAM files from them.
+
The Alignment/Mapping Pipeline takes FASTQ files and generates recalibrated BAM files from them.  
   −
== Running the GotCloud Mapping Pipeline ==
+
== Running the GotCloud Alignment Pipeline ==  
   −
The mapping pipeline is run using the <code>gen_biopipeline.pl</code> script found in the <code>bin/</code> directory under the <code>gotcloud/</code> installation.
+
The alignment pipeline is run using the <code>align</code> option of the <code>gotcloud</code> script.  This option calls <code>align.pl</code> found in the <code>bin/</code> directory under the <code>gotcloud</code> installation.  
   −
===Running the Automatic Test===
+
===Running the Automated Test===  
   −
The automatic test runs the mapping pipeline on a small testset and checks the results against expected results validating that GotCloud is installed correctly.
+
The automated test runs the alignment pipeline on a small set of test data and checks the results against expected results, validating that GotCloud is installed correctly.
   −
*Run alignment pipeline test:
+
*Run alignment pipeline test:  
  gen_biopipeline.pl --test OUTPUT_DIR
+
  gotcloud align --test OUTPUT_DIR  
where OUTPUT_DIR is the directory where you want to store the test results
+
where OUTPUT_DIR is the directory where you want to store the test results  
   −
If you see "Successfully ran the test case, congratulations!", then you are ready to align samples.
+
If you see "Successfully ran the test case, congratulations!", then you are ready to align samples.  
      −
== Overview of Mapping Pipeline Steps ==
+
== Overview of Alignment Pipeline Steps ==  
Here is an overview of the Mapping Pipeline:
+
Here is an overview of the Alignment Pipeline:  
   −
[[File:MappingSteps.png]]
+
[[File:MappingSteps.png]]  
   −
== Input Data:==
+
== Input Data:==  
*Raw Sequence (FASTQ) files
+
*Raw Sequence (FASTQ) files  
*Sequence Index file containing fastqs & RG info
+
*Sequence Index file containing fastqs & RG info  
*Reference files
+
*Reference files  
*(Optional) Configuration file to override default options
+
*(Optional) Configuration file to override default options  
   −
=== Raw Sequence (FASTQ) files ===
+
=== Raw Sequence (FASTQ) files ===  
   −
These are the FASTQ files that need to be mapped to BAM files.
+
These are the FASTQ files that need to be mapped to BAM files.  
   −
These files are specified in the [[#Sequence Index File|Sequence Index File]].
+
These files are specified in the [[#Sequence Index File|Sequence Index File]].  
   −
=== Sequence Index File ===
+
=== Sequence Index File ===  
This file specifies the FASTQ files that need to be processed and the Read Group information for them.
+
This file specifies the FASTQ files that need to be processed and the Read Group information for them.  
    
This file is specified either via the command line parameter <code>--index_file</code> or via the configuration file setting <code>INDEX_FILE</code>.   
 
This file is specified either via the command line parameter <code>--index_file</code> or via the configuration file setting <code>INDEX_FILE</code>.   
   −
The command-line setting takes precedence over the configuration file setting.
+
The command-line setting takes precedence over the configuration file setting.  
   −
The Sequence Index is a tab delimited file that starts with a header line.  The columns may be in any order.
+
The Sequence Index is a tab delimited file that starts with a header line.  The columns may be in any order.  
   −
Following the header line, there is one line per single-end read and one line per paired-end read (only 1 line per pair).
+
Following the header line, there is one line per single-end read and one line per paired-end read (only 1 line per pair).  
    
Required Column Names:  
 
Required Column Names:  
* MERGE_NAME - base name for the resulting BAM file for the sample (used to group multiple fastqs or fastq pairs into a single BAM)
+
* MERGE_NAME - base name for the resulting BAM file for the sample (used to group multiple fastqs or fastq pairs into a single BAM)  
* FASTQ1 - name of the fastq or the first in the pair if paired-end.  (Only 1 line per pair)
+
* FASTQ1 - name of the fastq or the first in the pair if paired-end.  (Only 1 line per pair)  
   −
Optional Column Names:
+
Optional Column Names:  
* FASTQ2 - name of the 2nd fastq in paired-end reads.  Specify '.' if the column exists, but this line is single-ended.
+
* FASTQ2 - name of the 2nd fastq in paired-end reads.  Specify '.' if the column exists, but this line is single-ended.  
* RGID - Read Group ID for this entry
+
* RGID - Read Group ID for this entry  
* SAMPLE - Sample Name for this entry
+
* SAMPLE - Sample Name for this entry  
* LIBRARY - Library for this entry
+
* LIBRARY - Library for this entry  
* CENTER - Center Name for this entry
+
* CENTER - Center Name for this entry  
* PLATFORM - Platform for this entry
+
* PLATFORM - Platform for this entry  
   −
The RGID, SAMPLE, LIBRARY, CENTER, and PLATFORM columns are used to populate the Read Group information for this entry.  These fields are optional.  Either leave the column header out of the file or specify '.' if the column header exists but the data is N/A.  As long as the RGID field is specified, all non-N/A fields are added to the BAM file.
+
The RGID, SAMPLE, LIBRARY, CENTER, and PLATFORM columns are used to populate the Read Group information for this entry.  These fields are optional.  Either leave the column header out of the file or specify '.' if the column header exists but the data is N/A.  As long as the RGID field is specified, all non-N/A fields are added to the BAM file.
   −
  MERGE_NAME FASTQ1 FASTQ2 RGID SAMPLE LIBRARY CENTER PLATFORM
+
  MERGE_NAME FASTQ1 FASTQ2 RGID SAMPLE LIBRARY CENTER PLATFORM  
  Sample1 fastq/S1/F1_R1.fastq.gz fastq/S1/F1_R2.fastq.gz RGID1 SampleID1 Lib1 UM ILLUMINA
+
  Sample1 fastq/S1/F1_R1.fastq.gz fastq/S1/F1_R2.fastq.gz RGID1 SampleID1 Lib1 UM ILLUMINA  
  Sample1 fastq/S1/F2_R1.fastq.gz fastq/S1/F2_R2.fastq.gz RGID1a SampleID1 Lib1 UM ILLUMINA
+
  Sample1 fastq/S1/F2_R1.fastq.gz fastq/S1/F2_R2.fastq.gz RGID1a SampleID1 Lib1 UM ILLUMINA  
  Sample2 fastq/S2/F1_R1.fastq.gz fastq/S2/F1_R2.fastq.gz RGID2 SampleID2 Lib2 UM ILLUMINA
+
  Sample2 fastq/S2/F1_R1.fastq.gz fastq/S2/F1_R2.fastq.gz RGID2 SampleID2 Lib2 UM ILLUMINA  
  Sample2 fastq/S2/F2.fastq.gz . RGID2 SampleID2 Lib2 UM ILLUMINA
+
  Sample2 fastq/S2/F2.fastq.gz . RGID2 SampleID2 Lib2 UM ILLUMINA  
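The index file rules above (tab-delimited, header line, columns in any order, '.' for N/A) can be sketched in Python as follows.  This is only an illustrative reader, not GotCloud's own parser: the function names are hypothetical, and the mapping from index columns to SAM <code>@RG</code> tags (ID, SM, LB, CN, PL) is an assumption based on the SAM header conventions rather than something this page states.

```python
import csv
import io

REQUIRED = ("MERGE_NAME", "FASTQ1")

# Assumed mapping from index columns to SAM @RG tags (not confirmed by this page).
RG_TAGS = {"RGID": "ID", "SAMPLE": "SM", "LIBRARY": "LB", "CENTER": "CN", "PLATFORM": "PL"}


def read_index(handle):
    """Parse a tab-delimited sequence index; columns may appear in any order."""
    reader = csv.DictReader(handle, delimiter="\t")
    missing = [c for c in REQUIRED if c not in (reader.fieldnames or [])]
    if missing:
        raise ValueError("missing required column(s): " + ", ".join(missing))
    entries = []
    for row in reader:
        # '.' marks a field as N/A; drop it so it looks like an absent column.
        entries.append({k: v for k, v in row.items()
                        if k is not None and v not in (None, "", ".")})
    return entries


def read_group_fields(entry):
    """Return @RG tag/value pairs for one entry, or {} when RGID is not given."""
    if "RGID" not in entry:
        return {}
    return {tag: entry[col] for col, tag in RG_TAGS.items() if col in entry}


def group_by_merge_name(entries):
    """Group entries that will be merged into the same BAM."""
    groups = {}
    for entry in entries:
        groups.setdefault(entry["MERGE_NAME"], []).append(entry)
    return groups


EXAMPLE = (
    "MERGE_NAME\tFASTQ1\tFASTQ2\tRGID\tSAMPLE\tLIBRARY\tCENTER\tPLATFORM\n"
    "Sample1\tfastq/S1/F1_R1.fastq.gz\tfastq/S1/F1_R2.fastq.gz\tRGID1\tSampleID1\tLib1\tUM\tILLUMINA\n"
    "Sample2\tfastq/S2/F2.fastq.gz\t.\tRGID2\tSampleID2\tLib2\tUM\tILLUMINA\n"
)

entries = read_index(io.StringIO(EXAMPLE))
groups = group_by_merge_name(entries)
```

Here the second entry is single-ended, so its FASTQ2 key is dropped, and each MERGE_NAME collects the entries that end up in one BAM.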
   −
The <code>--fastq</code>/<code>FASTQ</code> setting can be used to specify a prefix to the FASTQ1/FASTQ2 file paths that should be applied before using the files.
+
The <code>--fastq</code>/<code>FASTQ</code> setting can be used to specify a prefix to the FASTQ1/FASTQ2 file paths that should be applied before using the files.  
   −
=== Reference Files ===
+
=== Reference Files ===  
   −
The following Reference Files are required:
+
The following Reference Files are required:  
* Reference File fasta files
+
* Reference File fasta files  
** Files required: .fa, -bs.umfa, .GCContent, .amb, .ann, .bwt, .pac, .rbwt, .rpac, .rsa, .sa
+
** Files required: .fa, -bs.umfa, .GCContent, .amb, .ann, .bwt, .pac, .rbwt, .rpac, .rsa, .sa  
*** If you don't have the -bs.umfa file, the software will try to create it in the same directory as the reference fasta.
+
*** If you don't have the -bs.umfa file, the software will try to create it in the same directory as the reference fasta.  
*** .GCContent can be generated using qplot, see: [[QPLOT#Input_files| QPLOT: Input Files: --gccontent]] and name the resulting file as <code>.fa.GCcontent</code>
+
*** .GCContent can be generated using qplot, see: [[QPLOT#Input_files| QPLOT: Input Files: --gccontent]] and name the resulting file as <code>.fa.GCcontent</code>  
*** Use <code>bin/bwa index ref.fa</code> if you need to generate the bwa reference files (.amb, .ann, .bwt, .pac, .rbwt, .rpac, .rsa, .sa)
+
*** Use <code>bin/bwa index ref.fa</code> if you need to generate the bwa reference files (.amb, .ann, .bwt, .pac, .rbwt, .rpac, .rsa, .sa)  
** Configuration Name: FA_REF - specify the ref.fa/ref.fa.gz name
+
** Configuration Name: FA_REF - specify the ref.fa/ref.fa.gz name  
* DBSNP File
+
* DBSNP File  
** tab delimited file/VCF, can be compressed
+
** tab delimited file/VCF, can be compressed  
*** 1st column -> chromosome
+
*** 1st column -> chromosome  
*** 2nd column -> 1-based position
+
*** 2nd column -> 1-based position  
** Configuration Name: DBSNP_VCF
+
** Configuration Name: DBSNP_VCF  
* PLINK-compatible binary genotype files
+
* PLINK-compatible binary genotype files  
** Files required: .bed, .bim, .fam
+
** Files required: .bed, .bim, .fam
** Configuration Name: PLINK
+
** Configuration Name: PLINK  
   −
=== Configuration File ===
+
=== Configuration File ===  
The configuration file contains the run-time options, including the software binaries and command-line arguments.  A default configuration file is automatically loaded.  Users may specify their own configuration file containing just the values that differ from the defaults.  The configuration file is not required if there are no values to override.
+
The configuration file contains the run-time options, including the software binaries and command-line arguments.  A default configuration file is automatically loaded.  Users may specify their own configuration file containing just the values that differ from the defaults.  The configuration file is not required if there are no values to override.
   −
Comments begin with a <code>#</code>
+
Comments begin with a <code>#</code>  
   −
Format: KEY = value
+
Format: KEY = value  
   −
Where KEY is the item being set and value is its new value
+
Where KEY is the item being set and value is its new value  
   −
See [[#Command-Line Options|Command-Line Options]] for values that can be set either via command line or via configuration.
+
See [[#Command-Line Options|Command-Line Options]] for values that can be set either via command line or via configuration.  
   −
Note: Command-line options take priority over configuration file settings
+
Note: Command-line options take priority over configuration file settings  
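The <code>KEY = value</code> format and the precedence rule can be illustrated with a small Python sketch.  It is not the pipeline's actual configuration loader: the function names are hypothetical, the default values in the usage lines are made up for the example, and it assumes a comment occupies a whole line starting with <code>#</code>.

```python
def parse_config(text):
    """Parse 'KEY = value' lines, skipping blank lines and '#' comment lines."""
    settings = {}
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#"):
            continue
        key, sep, value = line.partition("=")
        if not sep:
            raise ValueError("expected 'KEY = value', got: " + raw)
        settings[key.strip()] = value.strip()
    return settings


def effective_settings(defaults, user_conf, command_line):
    """Later sources win: defaults < user configuration file < command line."""
    merged = dict(defaults)
    merged.update(user_conf)
    merged.update(command_line)
    return merged


# Hypothetical defaults and command-line values, for illustration only.
user = parse_config("# my overrides\nRUN_QPLOT = 0\nBWA_THREADS = -t 4\n")
merged = effective_settings(
    {"RUN_QPLOT": "1", "OUT_DIR": "from_defaults"},
    user,
    {"OUT_DIR": "output"},
)
```

In this sketch <code>RUN_QPLOT</code> comes from the user configuration file, while <code>OUT_DIR</code> comes from the command line because it has the highest priority.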
   −
==== Required Settings ====
+
==== Required Settings ====  
See [[#Reference Files|Reference Files]] for the required reference file settings.
+
See [[#Reference Files|Reference Files]] for the required reference file settings.  
   −
See [[#Sequence Index File|Sequence Index File]] for how to set the index file either via command line options or via configuration.
+
See [[#Sequence Index File|Sequence Index File]] for how to set the index file either via command line options or via configuration.  
   −
==== Turning Off Optional Steps====
+
==== Turning Off Optional Steps====  
Quality Control steps can be disabled.
+
Quality Control steps can be disabled.  
   −
To Disable QPLOT, set:
+
To Disable QPLOT, set:  
  RUN_QPLOT = 0
+
  RUN_QPLOT = 0  
   −
To Disable VerifyBamID, set:
+
To Disable VerifyBamID, set:  
  RUN_VERIFY_BAM_ID = 0
+
  RUN_VERIFY_BAM_ID = 0  
   −
==== Optional Configurable Settings ====
+
==== Optional Configurable Settings ====  
You may want to adjust the amount of memory/threads that are used:
+
You may want to adjust the amount of memory/threads that are used:  
   −
There are additional configurable settings, but these are the ones most likely to be adjusted.
+
There are additional configurable settings, but these are the ones most likely to be adjusted.  
   −
* BWA_THREADS = -t N
+
* BWA_THREADS = -t N  
** Fill in the N with the number of threads you want BWA to run with, default is 1
+
** Fill in the N with the number of threads you want BWA to run with, default is 1  
* BWA_MAX_MEM = 2000000000
+
* BWA_MAX_MEM = 2000000000  
** Maximum amount of memory used by samtools sort after running bwa
+
** Maximum amount of memory used by samtools sort after running bwa  
* JAVA_MEM = -Xmx4g
+
* JAVA_MEM = -Xmx4g  
** Set the maximum size of the java memory allocation pool.  Default is 4g, adjust that as necessary.
+
** Set the maximum size of the java memory allocation pool.  Default is 4g, adjust that as necessary.  
      −
== Running the Mapping Pipeline ==
+
== Running the Alignment Pipeline ==  
   −
=== Command-Line Options ===
+
=== Command-Line Options ===  
* help - print usage
+
* help - print usage  
* test OUTPUT_DIR - run the test example placing the output in a user specified OUTPUT_DIR.  No other options are required.
+
* test OUTPUT_DIR - run the test example placing the output in a user specified OUTPUT_DIR.  No other options are required.  
* out_dir OUTPUT_DIR - directory for the output
+
* out_dir OUTPUT_DIR - directory for the output  
** May also be specified via OUT_DIR in the configuration file
+
** May also be specified via OUT_DIR in the configuration file  
** Required to be set either via command-line or configuration
+
** Required to be set either via command-line or configuration  
* conf CONFIG_FILE - configuration file
+
* conf CONFIG_FILE - configuration file  
* index_file INDEX_FILE_NAME  - name of the index file
+
* index_file INDEX_FILE_NAME  - name of the index file  
** May also be specified via INDEX_FILE in the configuration file
+
** May also be specified via INDEX_FILE in the configuration file  
** Required to be set either via command-line or configuration
+
** Required to be set either via command-line or configuration  
* ref_dir REFERENCE_DIR - value to set config key REF_DIR to, overriding other values, REF_DIR can then be used inside config files.
+
* ref_dir REFERENCE_DIR - value to set config key REF_DIR to, overriding other values, REF_DIR can then be used inside config files.  
** May also be specified via REF_DIR in the configuration file
+
** May also be specified via REF_DIR in the configuration file  
* fastq FASTQ_PATH - prefix path to the fastq files specified in the INDEX_FILE
+
* fastq FASTQ_PATH - prefix path to the fastq files specified in the INDEX_FILE  
** May also be specified via FASTQ in the configuration file
+
** May also be specified via FASTQ in the configuration file  
* keepTmp - Do not remove the temporary files (removed by default)
+
* keepTmp - Do not remove the temporary files (removed by default)  
** May also be specified via KEEP_TMP in the configuration file
+
** May also be specified via KEEP_TMP in the configuration file  
* numjobs N - Replace N with the number of jobs that should be run in parallel
+
* numjobs N - Replace N with the number of jobs that should be run in parallel  
   −
Note: Command-line options take priority over configuration file settings
+
Note: Command-line options take priority over configuration file settings  
   −
The mapping pipeline is currently a 2 step process.
     −
===Step 1: Generate the Makefiles, 1 per sample===
+
===Running the Alignment Pipeline===  
Run bin/gen_biopipeline.pl with the appropriate command-line parameters.
+
Run <code>gotcloud align</code> with the appropriate command-line parameters.  
   −
Example:
+
Example:  
  bin/gen_biopipeline.pl --conf config.txt --out_dir output
+
  gotcloud align --conf config.txt --outdir output  
   −
This step generates 1 Makefile per sample in the output/Makefiles/ directory, but does not run them.  These Makefiles contain all of the information to run each sample.
+
This step generates 1 Makefile per sample in the output/Makefiles/ directory and then automatically runs them.  The Makefiles contain all of the information to run each sample.  
   −
Instructions are printed for running the Makefiles.
+
If you only want to generate the makefiles and not run them, use the <code>--dryrun</code> option.  It will generate the Makefiles and print instructions for running the Makefiles.  
   −
===Step 2: Run the Samples===
+
Each Makefile is independent and can be run in parallel and across a cloud.  
Run make -f on each file in the Makefiles directory.  Each Makefile is independent and can be run in parallel and across a cloud.
     −
It is recommended that you redirect stdout and stderr to files to save the results.
+
On success, you will see:
 +
Processing finished in nn secs with no errors reported
 +
and should see the following subdirectories under the user specified output directory:
 +
* bams/
 +
* Makefiles/
 +
* QCFiles/ (unless all quality control steps are disabled)
 +
* tmp/
   −
On failure, the Makefile should report a message like:
+
You should see a <code>.OK</code> for each Sample in the index file.  
make: *** [...] Error 1
     −
Where ... is filled in with other text indicating what step failed.
+
If you do not see these <code>.OK</code> files, then your Alignment Pipeline failed.  
   −
On success, you should see the following subdirectories under the user specified output directory:
+
On success, the bams/ directory contains the final BAMs, bais, and md5s.
* alignment_recal/
  −
* Makefiles/
  −
* QCFiles/ (if all quality control is not disabled)
  −
* tmp/
     −
You should see a <code>.OK</code> for each Sample in the index file.
+
If processing fails part way through, you can pick up where you left off by rerunning gotcloud or the make command.
 
  −
If you do not see these <code>.OK</code> files, then your Mapping Pipeline failed.
  −
 
  −
On success, the alignment.recal/ directory contains the final BAMs, bais, and md5s.
  −
 
  −
If processing fails part way through, you can pick up where you left off by rerunning the make command.
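The success checks described above can be scripted.  The following Python sketch is an assumption-laden helper, not part of GotCloud: the function names are hypothetical, it uses the directory names of the newer revision (bams/, Makefiles/, QCFiles/, tmp/), and because this page does not say where the <code>.OK</code> files are written, it simply searches the whole output directory for them.

```python
import os
import tempfile

EXPECTED_DIRS = ("bams", "Makefiles", "tmp")
QC_DIRS = ("QCFiles",)  # only expected when quality control is enabled


def missing_output_dirs(out_dir, qc_enabled=True):
    """Return the expected subdirectories that are absent from out_dir."""
    wanted = EXPECTED_DIRS + (QC_DIRS if qc_enabled else ())
    return [name for name in wanted
            if not os.path.isdir(os.path.join(out_dir, name))]


def find_ok_files(out_dir):
    """List every file ending in '.OK' anywhere under out_dir (location is assumed)."""
    found = []
    for root, _dirs, files in os.walk(out_dir):
        found.extend(os.path.join(root, name) for name in files if name.endswith(".OK"))
    return sorted(found)


# Demonstration on a throwaway directory that mimics a partially finished run.
with tempfile.TemporaryDirectory() as demo:
    for name in ("bams", "Makefiles"):
        os.mkdir(os.path.join(demo, name))
    open(os.path.join(demo, "bams", "Sample1.OK"), "w").close()  # hypothetical file name
    demo_missing = missing_output_dirs(demo)
    demo_ok_count = len(find_ok_files(demo))
```

Comparing the number of <code>.OK</code> files found with the number of samples in the index file gives a quick indication of whether every sample completed.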
 
