Examples of Read Mapping with Karma and BWA

From Genome Analysis Wiki
Revision as of 11:36, 14 December 2009 by Goncalo (talk | contribs)
Jump to navigationJump to search


#  Some instructions for read mapping and variant calling using the 
#  University of Michigan tools and procedures.  
	#  -- Paul Anderson and Tom Blackwell, December 11, 2009  --
#  These instructions will cover three components:
#  (1)  read mapping using karma
#  (2)  read mapping using bwa
#  (3)  samtools processing from .sam to .glf
#  These instructions are specifically with reference to the CEU trio 
#  chromosome 20 test data set that has been distributed.  Please 
#  see some general discussion at the beginning of item 2.  Everything 
#  that's not commented out should be runnable code, although you 
#  will need to alter the directory and file names.

#  (1)  Command line procedure for karma read mapping of the CEU trio 
#  chromosome 20 test data set.  Paul Anderson writes:  
#  Start with four subdirectories:  
#  indiv  containing single and paired end Illumina .fastq sequence files, 
#  ab     containing AB SOLiD color space reads as in the test data set, 
#  k.ref  containing the gzipped human genome reference sequence,  and 
#  k.out  an empty directory for the resulting karma .sam and .stats files.
#  First, build karma's binary word index files:

cd k.ref zcat human_g1k_v37.fasta.gz > human_g1k_v37.fa karma --reference human_g1k_v37.fa --createIndex --occurrenceCutoff 5000

#  then map Illumina paired end or single end reads as, for example:

cd ../k.out karma --reference ../k.ref/human_g1k_v37.fa --pairedReads --maxInsert 2000 \ ../indiv/SRR010941_1.recal.fastq.gz ../indiv/SRR010941_2.recal.fastq.gz

karma --reference ../k.ref/human_g1k_v37.fa --maxInsert 2000 \ ../indiv/SRR010936.recal.fastq.gz

#  The read lengths in the Illumina data I saw are long enough so that the 
#  only parameter worth tweaking might be  --occurrenceCutoff  in the index 
#  structures.  I will need to study the mapping performance to see if this 
#  makes any appreciable difference in speed.
#  LS 454 fastq files can also be mapped using the same index, but the 
#  mapping success rate will be much lower.
#  For AB Solid data, a color space reference and set of index files must be 
#  created.  For the lengths of reads I see, we will create two -- one with a 
#  12-mer index, and the other with a 15-mer.

cd ../k.ref ln -s human_g1k_v37.fa human_g1k_v37_12CS.fa ln -s human_g1k_v37.fa human_g1k_v37_15CS.fa karma --reference human_g1k_v37_12CS.fa --colorSpace --createIndex --wordSize 12 karma --reference human_g1k_v37_15CS.fa --colorSpace --createIndex --wordSize 15

#  Then, short color space reads (25/26-mers) should be mapped using the 12-mer index:

cd ../k.out karma --reference ../k.ref/human_g1k_v37.fa \ --csreference ../k.ref/human_g1k_v37_12CS.fa \ --colorSpace --pairedReads --maxInsert 2000 \ ../ab/TG150_1.color.space.fastq.gz \ ../ab/TG150_2.color.space.fastq.gz

#  To map longer color space reads (>30-mer), use the bigger index:

karma --reference ../k.ref/human_g1k_v37.fa \ --csreference ../k.ref/human_g1k_v37_15CS.fa \ --colorSpace --pairedReads --maxInsert 2000 \ ../ab/TG152_1.color.space.fastq.gz \ ../ab/TG152_2.color.space.fastq.gz

#  You can use the shorter reference to map any length reads, but it is slower.  
#  The 15-mer color space index (human_g1k_v37_15CS.fa) is a good compromise 
#  for mapping reads 30 or more bases, and for machines with 24GB of RAM.
#  The maxInsert of 2000 is somewhat arbitrary - you want a number that includes 
#  the bulk of the paired end read insert distances you are interested in.  More 
#  is slower, so you don't want an arbitrarily large value.

#  (2)  Command line procedure for bwa read mapping of the CEU trio 
#  chromosome 20 test data set.
#  As a general principle, one wants to keep all sequence data separated 
#  by sequencing technology and by individual throughout the entire mapping 
#  and variant calling process.  Different technologies may have different 
#  requirements for read mapping and different characteristics for variant 
#  calling.  This can be done either using the directory structure or with a 
#  file tracking database.  
#  As a further complication, the Broad Institute Illumina sequencing runs 
#  in the current test data set benefit from removing leading and trailing Ns 
#  before read mapping with bwa.  All Illumina data for the CEU trio that 
#  have SRR identifiers are Broad Institute runs.  (There are no Broad 
#  runs for the YRI trio.  There, SRR identifiers will indicate either Beijing 
#  or Wash U runs.)  I have used exactly the same bwa command lines with 
#  or without N-trimming.
#  In what follows, I will use somewhat simplified directory and file names.
#  At this stage, it's easiest to have the command line prompt remain at top 
#  level in the directory structure, and refer to all files using relative 
#  pathnames, relative to the location of the prompt.  At the start, suppose 
#  that there are two subdirectories:  'bwa.ref'  containing the gzipped human 
#  genome reference sequence, and  'indiv'  containing many .fastq files for 
#  the target individual generated by the appropriate sequencing technology.  
#  The bwa program has an inconvenient habit of writing to std.err.  I routinely 
#  redirect that to a log file so that I can keep on working while bwa runs in 
#  the background.  
#  In general, bwa read mapping requires initial indexing of the genome 
#  reference sequence, followed by two passes for each .fastq file.  The 
#  software components of this process are:  
#  bwa  index  --  Do this only when a new version of the genome reference 
#                  sequence is released.  Takes just under two hours.
#  bwa   aln   --  Run this on every .fastq file individually.  Highly 
#                  variable timings -- takes between 10 minutes and many 
#                  hours per .fastq file.  Better data runs quicker.
#  bwa  samse  --  Converts a single .fastq / .sai pair to .sam alignment 
#                  format.  Usually under 1 minute per file.
#  bwa  sampe  --  Converts paired end .fastq / .sai pairs (four files total) 
#                  to a single .sam alignment file.  1 - 2 minutes per run.  
#                  (In general, the CEU trio chromosome 20 test data set 
#                  contains VERY small .fastq files.  Chromosome 20 is just 
#                  over 2% of the entire genome.)
#  Sample command lines.  This is csh syntax.  
#  The parameter string  " -n 0.002  -M 7  -R 25 "  used in bwa aln below 
#  is just my initial guess.  I have made an entire run using this string.  
#  I might make another run using Heng Li's default values for all three 
#  parameters and compare the results, but I have no answers from this yet.  

bwa index -a bwtsw -p bwa.ref/ncbi.v37.ref \ bwa.ref/human_g1k_v37.fasta.gz >>& logfile

mkdir bwa.sai bwa.sam

( nice +20 bwa aln -n 0.002 -M 7 -R 25 bwa.ref/ncbi.v37.ref \ indiv/SRRx_1.fastq.gz > bwa.sai/SRRx_1.sai ) >>& logfile

#  ... and the same for  SRRx_2.fastq.gz  or  SRRx.fastq.gz.
#  Then, depending on whether these are paired end reads or not, either:

( nice +20 bwa sampe bwa.ref/ncbi.v37.ref \ bwa.sai/SRRx_1.sai bwa.sai/SRRx_2.sai \ indiv/SRRx_1.fastq.gz indiv/SRRx_2.fastq.gz \ > bwa.sam/SRRx.pair.sam ) >>& logfile

#  or, for single end reads:

( nice +20 bwa samse bwa.ref/ncbi.v37.ref \ bwa.sai/SRRx.sai indiv/SRRx.fastq.gz \ > bwa.sam/SRRx.single.sam ) >>& logfile

#  For LS 454 sequence data, bwa read mapping is a one-step process.  
#  This uses the same genome reference sequence index as for Illumina 
#  data above, and does not use paired end information in the mapping.  
#  (The directory  bwa.sam  shown here should be separate from that 
#  created above for Illumina data.  Similarly for AB SOLiD below.)  

( nice +20 bwa dbwtsw bwa.ref/ncbi.v37.ref indiv/SRRx.fastq.gz \ > bwa.sam/SRRx.sam ) >>& logfile

#  For AB SOLiD data, one must build a separate index structure for 
#  the genome reference sequence, and (completely undocumented) one 
#  must rewrite all of the .fastq files replacing  0,1,2,3,"."  with 
#  A,C,G,T,N  in every line of color space sequence and omitting the 
#  first two characters from each line of converted sequence and from 
#  the base call quality strings.  (An awk script does this conversion 
#  really quickly.)

bwa index -a bwtsw -p bwa.ref/ncbi.color.ref -c \ bwa.ref/human_g1k_v37.fasta.gz >>& logfile

mkdir bwa.sai bwa.sam

( nice +20 bwa aln -n 0.002 -M 7 -R 25 -o 0 -c bwa.ref/ncbi.color.ref \ indiv/TG145_1.cvt.gz > bwa.sai/TG145_1.sai ) >>& logfile

#  ... and the same for  TG145_2.cvt.gz  or  TG145.cvt.gz.
#  Then, depending on whether these are paired end reads or not, either:

( nice +20 bwa sampe bwa.ref/ncbi.color.ref \ bwa.sai/TG145_1.sai bwa.sai/TG145_2.sai \ indiv/TG145_1.cvt.gz indiv/TG145_2.cvt.gz \ > bwa.sam/SRRx.pair.sam ) >>& logfile

#  or:

( nice +20 bwa samse bwa.ref/ncbi.color.ref \ bwa.sai/TG145.sai indiv/TG145.cvt.gz \ > bwa.sam/SRRx.single.sam ) >>& logfile

#  (3)  Here is the process which builds the .glf format files used for 
#  SNP calling out of .sam format files for individual sequencing runs 
#  produced using either read mapping algorithm.  This will use only 
#  samtools utilities and contains nothing specific to either read mapper.  
#  For illustration, I will assume two of the directories left over from 
#  bwa read mapping in the preceding section.  These are:  
#  bwa.ref  containing files  human_g1k_v37.fasta.gz, human_g1k_v37.fasta.fai, and 
#  bwa.sam  containing all of the single-end and paired-end .sam files generated 
#  	 	 	for one individual and sequencing technology.
#  The full set of command lines is:  (Note csh syntax again -- and this time, 
#     I will change directories for convenience in file naming.)

mkdir bwa.bam indiv.bam cd bwa.bam set minq=17

foreach file ( ../bwa.sam/*.sam )

  set unsorted=`basename $file .sam`.nosort.bam
  samtools view -bhuS  -o $unsorted  -q $minq  $file	>>& logfile
  samtools sort  $unsorted  `basename $file .sam`	>>& logfile
  rm $unsorted

unset minq

#  This is the point where one would insert steps to recalibrate 
#  base call quality values, check genotype identity or remove 
#  duplicate sequence reads.  Each of these steps is specific to 
#  an individual sequencing run.

cd ../indiv.bam samtools merge person.bam ../bwa.bam/*.bam samtools index person.bam samtools view person.bam 20 | samtools pileup \ -f ../bwa.ref/human_g1k_v37.fasta.gz \ -t ../bwa.ref/human_g1k_v37.fasta.fai \ -g - > person.glf