Difference between revisions of "GotCloud: Alignment Pipeline"

Revision as of 13:34, 6 November 2012

The Mapping Pipeline takes FASTQ files and generates recalibrated BAM files from them.

Input Data:

Raw Sequence (FASTQ) files
Sequence Index file
Reference files
(Optional) Configuration file to override default options

Raw Sequence (FASTQ) files

These are the FASTQ files to be aligned; the pipeline writes the mapped reads to BAM files.

Sequence Index File

This file specifies the FASTQ files that need to be processed and the Read Group information for them.

The Sequence Index is a tab delimited file that starts with a header line. The columns may be in any order.

Following the header line, there is one line per single-end FASTQ file and one line per paired-end FASTQ pair (only 1 line per pair, not one per file).

Required Column Names:

MERGE_NAME - base name for the resulting BAM file for the sample (used to group multiple fastqs or fastq pairs into a single BAM)
FASTQ1 - name of the fastq or the first in the pair if paired-end. (Only 1 line per pair)

Optional Column Names:

FASTQ2 - name of the 2nd fastq in paired-end reads. Specify '.' if the column exists but this line is single-end.
RGID - Read Group ID for this entry
SAMPLE - Sample Name for this entry
LIBRARY - Library for this entry
CENTER - Center Name for this entry
PLATFORM - Platform for this entry
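
As an illustration of the layout described above, here is a minimal Python sketch that reads a Sequence Index and groups its entries by MERGE_NAME. It is not part of the pipeline: the function name `read_sequence_index` and the example file names are hypothetical, and treating a missing FASTQ2 column the same as '.' is an assumption.

```python
import csv
import io
from collections import defaultdict

REQUIRED = ("MERGE_NAME", "FASTQ1")

def read_sequence_index(handle):
    """Group FASTQ entries by MERGE_NAME from a tab-delimited index."""
    # DictReader keys each row by the header line, so column order is irrelevant.
    reader = csv.DictReader(handle, delimiter="\t")
    missing = [c for c in REQUIRED if c not in (reader.fieldnames or [])]
    if missing:
        raise ValueError(f"missing required column(s): {', '.join(missing)}")
    groups = defaultdict(list)
    for row in reader:
        fastq2 = row.get("FASTQ2") or "."
        # '.' (or no FASTQ2 column at all) is treated as a single-end entry.
        mate = None if fastq2 == "." else fastq2
        groups[row["MERGE_NAME"]].append((row["FASTQ1"], mate))
    return dict(groups)

# Hypothetical index: shuffled column order, one paired-end and one single-end line.
example = (
    "FASTQ1\tMERGE_NAME\tFASTQ2\n"
    "s1_lane1_1.fastq\tsample1\ts1_lane1_2.fastq\n"
    "s1_lane2.fastq\tsample1\t.\n"
)
index = read_sequence_index(io.StringIO(example))
print(index)
```

Both lines share the same MERGE_NAME, so they end up in one group, mirroring how multiple fastqs or fastq pairs are merged into a single BAM.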

The RGID, SAMPLE, LIBRARY, CENTER, and PLATFORM columns are used to populate the Read Group information for this entry. These columns are optional: either leave the column header out of the file, or specify '.' if the column header exists but the value is N/A for this line. As long as RGID is specified, every field that is not N/A is added to the BAM file.
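
The Read Group rule above can be sketched in Python as follows. This is only an illustration, not the pipeline's own code: the helper `read_group_line` is hypothetical, and the mapping of index columns to the standard SAM @RG tags (ID, SM, LB, CN, PL) is an assumption about how the fields are written.

```python
# Assumed mapping from Sequence Index columns to SAM @RG tags.
RG_TAGS = (
    ("RGID", "ID"),
    ("SAMPLE", "SM"),
    ("LIBRARY", "LB"),
    ("CENTER", "CN"),
    ("PLATFORM", "PL"),
)

def read_group_line(row):
    """Return an @RG header line for one index row, or None if RGID is N/A."""
    def value(column):
        # A missing column and a '.' entry both count as N/A.
        v = row.get(column, ".")
        return None if v in (".", "", None) else v

    if value("RGID") is None:
        return None
    fields = [f"{tag}:{value(col)}" for col, tag in RG_TAGS if value(col) is not None]
    return "\t".join(["@RG"] + fields)

# Hypothetical row: LIBRARY is '.', CENTER column is absent.
row = {
    "MERGE_NAME": "sample1",
    "FASTQ1": "s1_1.fastq",
    "RGID": "rg1",
    "SAMPLE": "sample1",
    "LIBRARY": ".",
    "PLATFORM": "ILLUMINA",
}
rg = read_group_line(row)
print(rg)
```

Here only ID, SM, and PL appear in the result, because LIBRARY is marked N/A and the CENTER column is left out.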

Reference Files

The following Reference Files are required:

Reference FASTA files
- Files required: .fa, -bs.umfa, .GCContent, .amb, .ann, .bwt, .pac, .rbwt, .rpac, .rsa, .sa
  - If you don't have the -bs.umfa file, the software will try to create it in the same directory as the reference fasta.
  - .GCContent can be generated using qplot (see QPLOT: Input Files: --gccontent); name the resulting file .fa.GCcontent
  - Use bin/bwa index ref.fa if you need to generate the bwa reference files (.amb, .ann, .bwt, .pac, .rbwt, .rpac, .rsa, .sa)
- Configuration Name: FA_REF - specify the ref.fa/ref.fa.gz name
DBSNP File
- tab delimited file/VCF, can be compressed
  - 1st column -> chromosome
  - 2nd column -> 1-based position
- Configuration Name: DBSNP_VCF
PLINK-compatible binary genotype files
- Files required: .bed, .bim, .fam
- Configuration Name: PLINK
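
To make the DBSNP file layout concrete, the following Python sketch reads the first two columns (chromosome and 1-based position) from a tab-delimited or VCF file. It is a hedged illustration only: the function `read_positions` is hypothetical, compression is assumed to be gzip detected by a `.gz` suffix, lines starting with '#' are assumed to be header lines, and the sample records are made up.

```python
import gzip
import os
import tempfile

def read_positions(path):
    """Yield (chromosome, 1-based position) from a tab-delimited or VCF file."""
    # Assumption: a compressed file is gzip and carries a .gz suffix.
    opener = gzip.open if str(path).endswith(".gz") else open
    with opener(path, "rt") as handle:
        for line in handle:
            # Skip blank lines and '#' header lines (as found in VCF).
            if not line.strip() or line.startswith("#"):
                continue
            chrom, pos = line.rstrip("\n").split("\t")[:2]
            yield chrom, int(pos)

# Write a tiny made-up compressed file and read it back.
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "example_dbsnp.vcf.gz")
    with gzip.open(path, "wt") as out:
        out.write("##fileformat=VCFv4.1\n")
        out.write("#CHROM\tPOS\tID\n")
        out.write("20\t61098\texample_id_1\n")
        out.write("20\t61270\t.\n")
    sites = list(read_positions(path))
print(sites)
```

Only the first two columns are used, so any additional columns are ignored.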

Configuration File

Running the Mapping Pipeline

cd ~/myseq
 /usr/local/biopipe/bin/gen_biopipeline.pl -out aligner -index ???

Revision as of 13:34, 6 November 2012 by Mktrost (moved Alignment Pipeline to Mapping Pipeline); previous revision as of 23:49, 5 November 2012 by Mktrost.