Difference between revisions of "Examples of Read Mapping with Karma and BWA"

From Genome Analysis Wiki
Jump to navigationJump to search
Line 1: Line 1:
<source lang="csh">
<source lang="text">
  #  Some instructions for read mapping and variant calling using the  
  #  Some instructions for read mapping and variant calling using the  

Revision as of 11:37, 14 December 2009

 #  Some instructions for read mapping and variant calling using the 
 #  University of Michigan tools and procedures.  

 	#  -- Paul Anderson and Tom Blackwell, December 11, 2009  --

 #  These instructions will cover three components:

 #  (1)  read mapping using karma
 #  (2)  read mapping using bwa
 #  (3)  samtools processing from .sam to .glf

 #  These instructions are specifically with reference to the CEU trio 
 #  chromosome 20 test data set that has been distributed.  Please 
 #  see some general discussion at the beginning of item 2.  Everything 
 #  that's not commented out should be runnable code, although you 
 #  will need to alter the directory and file names.

 #  (1)  Command line procedure for karma read mapping of the CEU trio 
 #  chromosome 20 test data set.  Paul Anderson writes:  

 #  Start with four subdirectories:  

 #  indiv  containing single and paired end Illumina .fastq sequence files, 
 #  ab     containing AB SOLiD color space reads as in the test data set, 
 #  k.ref  containing the gzipped human genome reference sequence,  and 
 #  k.out  an empty directory for the resulting karma .sam and .stats files.

 #  First, build karma's binary word index files:

cd     k.ref
zcat   human_g1k_v37.fasta.gz  >  human_g1k_v37.fa
karma  --reference human_g1k_v37.fa  --createIndex  --occurrenceCutoff 5000

 #  then map Illumina paired end or single end reads as, for example:

cd	../k.out
karma	--reference ../k.ref/human_g1k_v37.fa  --pairedReads   --maxInsert 2000	  \
	../indiv/SRR010941_1.recal.fastq.gz  ../indiv/SRR010941_2.recal.fastq.gz

karma	--reference ../k.ref/human_g1k_v37.fa  --maxInsert 2000        	 	  \

 #  The read lengths in the Illumina data I saw are long enough so that the 
 #  only parameter worth tweaking might be  --occurrenceCutoff  in the index 
 #  structures.  I will need to study the mapping performance to see if this 
 #  makes any appreciable difference in speed.

 #  LS 454 fastq files can also be mapped using the same index, but the 
 #  mapping success rate will be much lower.

 #  For AB Solid data, a color space reference and set of index files must be 
 #  created.  For the lengths of reads I see, we will create two -- one with a 
 #  12-mer index, and the other with a 15-mer.

cd    ../k.ref
ln -s    human_g1k_v37.fa  human_g1k_v37_12CS.fa
ln -s    human_g1k_v37.fa  human_g1k_v37_15CS.fa
karma  --reference human_g1k_v37_12CS.fa  --colorSpace  --createIndex  --wordSize 12
karma  --reference human_g1k_v37_15CS.fa  --colorSpace  --createIndex  --wordSize 15

 #  Then, short color space reads (25/26-mers) should be mapped using the 12-mer index:

cd	../k.out
karma	--reference   ../k.ref/human_g1k_v37.fa   	\
	--csreference ../k.ref/human_g1k_v37_12CS.fa	\
	--colorSpace  --pairedReads  --maxInsert 2000	\
	../ab/TG150_1.color.space.fastq.gz     	 	\

 #  To map longer color space reads (>30-mer), use the bigger index:

karma	--reference   ../k.ref/human_g1k_v37.fa     	\
	--csreference ../k.ref/human_g1k_v37_15CS.fa	\
	--colorSpace  --pairedReads  --maxInsert 2000	\
	../ab/TG152_1.color.space.fastq.gz    	 	\

 #  You can use the shorter reference to map any length reads, but it is slower.  
 #  The 15-mer color space index (human_g1k_v37_15CS.fa) is a good compromise 
 #  for mapping reads 30 or more bases, and for machines with 24GB of RAM.

 #  The maxInsert of 2000 is somewhat arbitrary - you want a number that includes 
 #  the bulk of the paired end read insert distances you are interested in.  More 
 #  is slower, so you don't want an arbitrarily large value.

 #  (2)  Command line procedure for bwa read mapping of the CEU trio 
 #  chromosome 20 test data set.

 #  As a general principle, one wants to keep all sequence data separated 
 #  by sequencing technology and by individual throughout the entire mapping 
 #  and variant calling process.  Different technologies may have different 
 #  requirements for read mapping and different characteristics for variant 
 #  calling.  This can be done either using the directory structure or with a 
 #  file tracking database.  

 #  As a further complication, the Broad Institute Illumina sequencing runs 
 #  in the current test data set benefit from removing leading and trailing Ns 
 #  before read mapping with bwa.  All Illumina data for the CEU trio that 
 #  have SRR identifiers are Broad Institute runs.  (There are no Broad 
 #  runs for the YRI trio.  There, SRR identifiers will indicate either Beijing 
 #  or Wash U runs.)  I have used exactly the same bwa command lines with 
 #  or without N-trimming.

 #  In what follows, I will use somewhat simplified directory and file names.
 #  At this stage, it's easiest to have the command line prompt remain at top 
 #  level in the directory structure, and refer to all files using relative 
 #  pathnames, relative to the location of the prompt.  At the start, suppose 
 #  that there are two subdirectories:  'bwa.ref'  containing the gzipped human 
 #  genome reference sequence, and  'indiv'  containing many .fastq files for 
 #  the target individual generated by the appropriate sequencing technology.  
 #  The bwa program has an inconvenient habit of writing to std.err.  I routinely 
 #  redirect that to a log file so that I can keep on working while bwa runs in 
 #  the background.  

 #  In general, bwa read mapping requires initial indexing of the genome 
 #  reference sequence, followed by two passes for each .fastq file.  The 
 #  software components of this process are:  

 #  bwa  index  --  Do this only when a new version of the genome reference 
 #                  sequence is released.  Takes just under two hours.
 #  bwa   aln   --  Run this on every .fastq file individually.  Highly 
 #                  variable timings -- takes between 10 minutes and many 
 #                  hours per .fastq file.  Better data runs quicker.
 #  bwa  samse  --  Converts a single .fastq / .sai pair to .sam alignment 
 #                  format.  Usually under 1 minute per file.
 #  bwa  sampe  --  Converts paired end .fastq / .sai pairs (four files total) 
 #                  to a single .sam alignment file.  1 - 2 minutes per run.  
 #                  (In general, the CEU trio chromosome 20 test data set 
 #                  contains VERY small .fastq files.  Chromosome 20 is just 
 #                  over 2% of the entire genome.)

 #  Sample command lines.  This is csh syntax.  

 #  The parameter string  " -n 0.002  -M 7  -R 25 "  used in bwa aln below 
 #  is just my initial guess.  I have made an entire run using this string.  
 #  I might make another run using Heng Li's default values for all three 
 #  parameters and compare the results, but I have no answers from this yet.  

bwa index  -a bwtsw  -p	bwa.ref/ncbi.v37.ref	 	 	 	\
	 	 	bwa.ref/human_g1k_v37.fasta.gz  >>& logfile

mkdir bwa.sai bwa.sam

( nice +20  bwa aln  -n 0.002  -M 7  -R 25  bwa.ref/ncbi.v37.ref	\
	    indiv/SRRx_1.fastq.gz  >  bwa.sai/SRRx_1.sai ) >>& logfile

 #  ... and the same for  SRRx_2.fastq.gz  or  SRRx.fastq.gz.
 #  Then, depending on whether these are paired end reads or not, either:

( nice +20  bwa sampe  bwa.ref/ncbi.v37.ref	 	 	\
	    bwa.sai/SRRx_1.sai	   bwa.sai/SRRx_2.sai	 	\
	    indiv/SRRx_1.fastq.gz  indiv/SRRx_2.fastq.gz	\
	      >  bwa.sam/SRRx.pair.sam )  >>& logfile

 #  or, for single end reads:

( nice +20  bwa samse  bwa.ref/ncbi.v37.ref	 	 	\
	    bwa.sai/SRRx.sai	   indiv/SRRx.fastq.gz	 	\
	      >  bwa.sam/SRRx.single.sam )  >>& logfile

 #  For LS 454 sequence data, bwa read mapping is a one-step process.  
 #  This uses the same genome reference sequence index as for Illumina 
 #  data above, and does not use paired end information in the mapping.  
 #  (The directory  bwa.sam  shown here should be separate from that 
 #  created above for Illumina data.  Similarly for AB SOLiD below.)  

( nice +20  bwa dbwtsw  bwa.ref/ncbi.v37.ref  indiv/SRRx.fastq.gz  \
	      >  bwa.sam/SRRx.sam )  >>& logfile

 #  For AB SOLiD data, one must build a separate index structure for 
 #  the genome reference sequence, and (completely undocumented) one 
 #  must rewrite all of the .fastq files replacing  0,1,2,3,"."  with 
 #  A,C,G,T,N  in every line of color space sequence and omitting the 
 #  first two characters from each line of converted sequence and from 
 #  the base call quality strings.  (An awk script does this conversion 
 #  really quickly.)

bwa index  -a bwtsw  -p	bwa.ref/ncbi.color.ref  -c	 	 	\
	 	 	bwa.ref/human_g1k_v37.fasta.gz  >>& logfile

mkdir bwa.sai bwa.sam

( nice +20  bwa aln -n 0.002 -M 7 -R 25 -o 0 -c  bwa.ref/ncbi.color.ref	\
	    indiv/TG145_1.cvt.gz  >  bwa.sai/TG145_1.sai ) >>& logfile

 #  ... and the same for  TG145_2.cvt.gz  or  TG145.cvt.gz.
 #  Then, depending on whether these are paired end reads or not, either:

( nice +20  bwa sampe  bwa.ref/ncbi.color.ref	 	 	\
	    bwa.sai/TG145_1.sai	   bwa.sai/TG145_2.sai	 	\
	    indiv/TG145_1.cvt.gz   indiv/TG145_2.cvt.gz	 	\
	      >  bwa.sam/SRRx.pair.sam )  >>& logfile

 #  or:

( nice +20  bwa samse  bwa.ref/ncbi.color.ref	 	 	\
	    bwa.sai/TG145.sai	   indiv/TG145.cvt.gz	 	\
	      >  bwa.sam/SRRx.single.sam )  >>& logfile

 #  (3)  Here is the process which builds the .glf format files used for 
 #  SNP calling out of .sam format files for individual sequencing runs 
 #  produced using either read mapping algorithm.  This will use only 
 #  samtools utilities and contains nothing specific to either read mapper.  

 #  For illustration, I will assume two of the directories left over from 
 #  bwa read mapping in the preceding section.  These are:  

 #  bwa.ref  containing files  human_g1k_v37.fasta.gz, human_g1k_v37.fasta.fai, and 
 #  bwa.sam  containing all of the single-end and paired-end .sam files generated 
 #  	 	 	for one individual and sequencing technology.

 #  The full set of command lines is:  (Note csh syntax again -- and this time, 
 #     I will change directories for convenience in file naming.)

mkdir  bwa.bam  indiv.bam
cd     bwa.bam
set    minq=17

foreach  file  ( ../bwa.sam/*.sam )
   set unsorted=`basename $file .sam`.nosort.bam
   samtools view -bhuS  -o $unsorted  -q $minq  $file	>>& logfile
   samtools sort  $unsorted  `basename $file .sam`	>>& logfile
   rm $unsorted

unset  minq

 #  This is the point where one would insert steps to recalibrate 
 #  base call quality values, check genotype identity or remove 
 #  duplicate sequence reads.  Each of these steps is specific to 
 #  an individual sequencing run.

cd  ../indiv.bam
samtools  merge  person.bam  ../bwa.bam/*.bam
samtools  index  person.bam
samtools  view   person.bam  20 | samtools pileup	\
	  -f ../bwa.ref/human_g1k_v37.fasta.gz	 	\
	  -t ../bwa.ref/human_g1k_v37.fasta.fai	 	\
	  -g  -  >  person.glf