GotCloud: Alignment Pipeline

From Genome Analysis Wiki

Revision as of 13:34, 6 November 2012

The Mapping Pipeline takes FASTQ files and generates recalibrated BAM files from them.

Input Data:

  • Raw Sequence (FASTQ) files
  • Sequence Index file
  • Reference files
  • (Optional) Configuration file to override default options

Raw Sequence (FASTQ) files

These are the FASTQ files to be aligned; the pipeline writes the aligned reads to BAM files.


Sequence Index File

This file specifies the FASTQ files that need to be processed and the Read Group information for them.

The Sequence Index is a tab delimited file that starts with a header line. The columns may be in any order.

Following the header line, there is one line per single-end FASTQ file and one line per paired-end FASTQ pair (only 1 line per pair, not one per file).

Required Column Names:

  • MERGE_NAME - base name for the resulting BAM file for the sample (used to group multiple fastqs or fastq pairs into a single BAM)
  • FASTQ1 - name of the fastq or the first in the pair if paired-end. (Only 1 line per pair)

Optional Column Names:

  • FASTQ2 - name of the 2nd fastq in paired-end reads. Specify '.' if the column exists, but this line is single-ended.
  • RGID - Read Group ID for this entry
  • SAMPLE - Sample Name for this entry
  • LIBRARY - Library for this entry
  • CENTER - Center Name for this entry
  • PLATFORM - Platform for this entry

The RGID, SAMPLE, LIBRARY, CENTER, and PLATFORM columns are used to populate the Read Group information for this entry. These columns are optional: either leave the column header out of the file, or specify '.' if the column header exists but the value is N/A for that line. As long as RGID is specified, all Read Group fields that are not N/A are added to the BAM file.
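
As a rough illustration of the format described above, the following Python sketch parses a Sequence Index, treats '.' as N/A, and groups the entries by MERGE_NAME. It is not the pipeline's own parser: the file contents, the sample and file names, and the helpers parse_sequence_index and read_group_fields are hypothetical. ID, SM, LB, CN, and PL are the standard SAM @RG tags; the mapping from index columns to those tags is an assumption.

```python
import csv
import io
from collections import defaultdict

REQUIRED = ("MERGE_NAME", "FASTQ1")
# Assumed mapping from index columns to standard SAM @RG tags.
RG_TAGS = {"RGID": "ID", "SAMPLE": "SM", "LIBRARY": "LB",
           "CENTER": "CN", "PLATFORM": "PL"}


def parse_sequence_index(handle):
    """Return {MERGE_NAME: [row dict, ...]}; '.' and empty values become None."""
    reader = csv.DictReader(handle, delimiter="\t")
    missing = [c for c in REQUIRED if c not in (reader.fieldnames or [])]
    if missing:
        raise ValueError("missing required column(s): " + ", ".join(missing))
    groups = defaultdict(list)
    for row in reader:
        clean = {k: (None if v in (".", "", None) else v) for k, v in row.items()}
        groups[clean["MERGE_NAME"]].append(clean)
    return dict(groups)


def read_group_fields(row):
    """Return the @RG tags for one entry, or {} when RGID is absent or N/A."""
    if not row.get("RGID"):
        return {}
    return {tag: row[col] for col, tag in RG_TAGS.items() if row.get(col)}


# Hypothetical index: columns in arbitrary order, one paired-end line
# and one single-end line that both merge into the same BAM.
EXAMPLE = (
    "FASTQ1\tMERGE_NAME\tFASTQ2\tRGID\tSAMPLE\tLIBRARY\n"
    "s1_1.fastq\tSample1\ts1_2.fastq\tRG1\tSample1\tLib1\n"
    "s1_se.fastq\tSample1\t.\tRG2\tSample1\t.\n"
)

groups = parse_sequence_index(io.StringIO(EXAMPLE))
for merge_name, rows in groups.items():
    for row in rows:
        kind = "paired" if row.get("FASTQ2") else "single"
        print(merge_name, row["FASTQ1"], kind, read_group_fields(row))
```

Because the columns are looked up by header name, reordering them in the file does not change the result.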

Reference Files

The following Reference Files are required:

  • Reference FASTA files
    • Files required: .fa, -bs.umfa, .GCContent, .amb, .ann, .bwt, .pac, .rbwt, .rpac, .rsa, .sa
      • If you don't have the -bs.umfa file, the software will try to create it in the same directory as the reference fasta.
      • .GCContent can be generated using qplot (see QPLOT: Input Files: --gccontent); name the resulting file .fa.GCcontent
      • Use bin/bwa index ref.fa if you need to generate the bwa reference files (.amb, .ann, .bwt, .pac, .rbwt, .rpac, .rsa, .sa)
    • Configuration Name: FA_REF - specify the ref.fa/ref.fa.gz name
  • DBSNP File
    • tab delimited file/VCF, can be compressed
      • 1st column -> chromosome
      • 2nd column -> 1-based position
    • Configuration Name: DBSNP_VCF
  • PLINK-compatible binary genotype files
    • Files required: .bed, .bim, .fam
    • Configuration Name: PLINK
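
The file requirements above can be checked before launching a run. The following is a minimal sketch, assuming each listed extension is appended to a common prefix (for example ref.fa.bwt next to ref.fa, which is how bwa index names its output). The naming the pipeline expects for the -bs.umfa and GC content files may differ, so they are left out here. missing_files is a hypothetical helper, not part of GotCloud.

```python
import os
import tempfile

# bwa index extensions listed above; assumed to be appended to the FASTA name.
BWA_SUFFIXES = (".amb", ".ann", ".bwt", ".pac", ".rbwt", ".rpac", ".rsa", ".sa")


def missing_files(prefix, suffixes):
    """Return every path prefix+suffix that does not exist as a regular file."""
    return [prefix + s for s in suffixes if not os.path.isfile(prefix + s)]


# Demonstration in a scratch directory with a hypothetical,
# only partially indexed reference.
with tempfile.TemporaryDirectory() as tmp:
    ref = os.path.join(tmp, "ref.fa")
    for suffix in ("", ".amb", ".ann", ".bwt"):
        open(ref + suffix, "w").close()
    absent = [os.path.basename(p) for p in missing_files(ref, BWA_SUFFIXES)]
    print("missing:", absent)
```

The same helper can be pointed at any other prefix and extension list, such as the PLINK file set.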


Configuration File

Running the Mapping Pipeline

 cd ~/myseq
 /usr/local/biopipe/bin/gen_biopipeline.pl -out aligner -index ???