Difference between revisions of "Examples of Read Mapping with Karma and BWA"

From Genome Analysis Wiki
Jump to navigationJump to search
(3 intermediate revisions by one other user not shown)
Line 1: Line 1:
<!--        BANNER ACROSS TOP OF PAGE        -->
{| style="width:100%; background:#ffb6c1; margin-top:1.2em; border:1px solid #ccc;" |
| style="width:100%; text-align:center; white-space:nowrap; color:#000;" |
<div style="font-size:162%; border:none; margin:0; padding:.1em; color:#000;">KARMA is obsolete and not maintained</div>
<source lang="text">
  #  Some instructions for read mapping and variant calling using the  
  #  Some instructions for read mapping and variant calling using the  
Line 250: Line 256:
  -g  -  >  person.glf
  -g  -  >  person.glf

Latest revision as of 16:50, 23 September 2014

KARMA is obsolete and not maintained
 #  Some instructions for read mapping and variant calling using the 
 #  University of Michigan tools and procedures.  

 	#  -- Paul Anderson and Tom Blackwell, December 11, 2009  --

 #  These instructions will cover three components:

 #  (1)  read mapping using karma
 #  (2)  read mapping using bwa
 #  (3)  samtools processing from .sam to .glf

 #  These instructions are specifically with reference to the CEU trio 
 #  chromosome 20 test data set that has been distributed.  Please 
 #  see some general discussion at the beginning of item 2.  Everything 
 #  that's not commented out should be runnable code, although you 
 #  will need to alter the directory and file names.

 #  (1)  Command line procedure for karma read mapping of the CEU trio 
 #  chromosome 20 test data set.  Paul Anderson writes:  

 #  Start with four subdirectories:  

 #  indiv  containing single and paired end Illumina .fastq sequence files, 
 #  ab     containing AB SOLiD color space reads as in the test data set, 
 #  k.ref  containing the gzipped human genome reference sequence,  and 
 #  k.out  an empty directory for the resulting karma .sam and .stats files.

 #  First, build karma's binary word index files:

cd     k.ref
zcat   human_g1k_v37.fasta.gz  >  human_g1k_v37.fa
karma  --reference human_g1k_v37.fa  --createIndex  --occurrenceCutoff 5000

 #  then map Illumina paired end or single end reads as, for example:

cd	../k.out
karma	--reference ../k.ref/human_g1k_v37.fa  --pairedReads   --maxInsert 2000	  \
	../indiv/SRR010941_1.recal.fastq.gz  ../indiv/SRR010941_2.recal.fastq.gz

karma	--reference ../k.ref/human_g1k_v37.fa  --maxInsert 2000        	 	  \

 #  The read lengths in the Illumina data I saw are long enough so that the 
 #  only parameter worth tweaking might be  --occurrenceCutoff  in the index 
 #  structures.  I will need to study the mapping performance to see if this 
 #  makes any appreciable difference in speed.

 #  LS 454 fastq files can also be mapped using the same index, but the 
 #  mapping success rate will be much lower.

 #  For AB Solid data, a color space reference and set of index files must be 
 #  created.  For the lengths of reads I see, we will create two -- one with a 
 #  12-mer index, and the other with a 15-mer.

cd    ../k.ref
ln -s    human_g1k_v37.fa  human_g1k_v37_12CS.fa
ln -s    human_g1k_v37.fa  human_g1k_v37_15CS.fa
karma  --reference human_g1k_v37_12CS.fa  --colorSpace  --createIndex  --wordSize 12
karma  --reference human_g1k_v37_15CS.fa  --colorSpace  --createIndex  --wordSize 15

 #  Then, short color space reads (25/26-mers) should be mapped using the 12-mer index:

cd	../k.out
karma	--reference   ../k.ref/human_g1k_v37.fa   	\
	--csreference ../k.ref/human_g1k_v37_12CS.fa	\
	--colorSpace  --pairedReads  --maxInsert 2000	\
	../ab/TG150_1.color.space.fastq.gz     	 	\

 #  To map longer color space reads (>30-mer), use the bigger index:

karma	--reference   ../k.ref/human_g1k_v37.fa     	\
	--csreference ../k.ref/human_g1k_v37_15CS.fa	\
	--colorSpace  --pairedReads  --maxInsert 2000	\
	../ab/TG152_1.color.space.fastq.gz    	 	\

 #  You can use the shorter reference to map any length reads, but it is slower.  
 #  The 15-mer color space index (human_g1k_v37_15CS.fa) is a good compromise 
 #  for mapping reads 30 or more bases, and for machines with 24GB of RAM.

 #  The maxInsert of 2000 is somewhat arbitrary - you want a number that includes 
 #  the bulk of the paired end read insert distances you are interested in.  More 
 #  is slower, so you don't want an arbitrarily large value.

 #  (2)  Command line procedure for bwa read mapping of the CEU trio 
 #  chromosome 20 test data set.

 #  As a general principle, one wants to keep all sequence data separated 
 #  by sequencing technology and by individual throughout the entire mapping 
 #  and variant calling process.  Different technologies may have different 
 #  requirements for read mapping and different characteristics for variant 
 #  calling.  This can be done either using the directory structure or with a 
 #  file tracking database.  

 #  As a further complication, the Broad Institute Illumina sequencing runs 
 #  in the current test data set benefit from removing leading and trailing Ns 
 #  before read mapping with bwa.  All Illumina data for the CEU trio that 
 #  have SRR identifiers are Broad Institute runs.  (There are no Broad 
 #  runs for the YRI trio.  There, SRR identifiers will indicate either Beijing 
 #  or Wash U runs.)  I have used exactly the same bwa command lines with 
 #  or without N-trimming.

 #  In what follows, I will use somewhat simplified directory and file names.
 #  At this stage, it's easiest to have the command line prompt remain at top 
 #  level in the directory structure, and refer to all files using relative 
 #  pathnames, relative to the location of the prompt.  At the start, suppose 
 #  that there are two subdirectories:  'bwa.ref'  containing the gzipped human 
 #  genome reference sequence, and  'indiv'  containing many .fastq files for 
 #  the target individual generated by the appropriate sequencing technology.  
 #  The bwa program has an inconvenient habit of writing to std.err.  I routinely 
 #  redirect that to a log file so that I can keep on working while bwa runs in 
 #  the background.  

 #  In general, bwa read mapping requires initial indexing of the genome 
 #  reference sequence, followed by two passes for each .fastq file.  The 
 #  software components of this process are:  

 #  bwa  index  --  Do this only when a new version of the genome reference 
 #                  sequence is released.  Takes just under two hours.
 #  bwa   aln   --  Run this on every .fastq file individually.  Highly 
 #                  variable timings -- takes between 10 minutes and many 
 #                  hours per .fastq file.  Better data runs quicker.
 #  bwa  samse  --  Converts a single .fastq / .sai pair to .sam alignment 
 #                  format.  Usually under 1 minute per file.
 #  bwa  sampe  --  Converts paired end .fastq / .sai pairs (four files total) 
 #                  to a single .sam alignment file.  1 - 2 minutes per run.  
 #                  (In general, the CEU trio chromosome 20 test data set 
 #                  contains VERY small .fastq files.  Chromosome 20 is just 
 #                  over 2% of the entire genome.)

 #  Sample command lines.  This is csh syntax.  

 #  The parameter string  " -n 0.002  -M 7  -R 25 "  used in bwa aln below 
 #  is just my initial guess.  I have made an entire run using this string.  
 #  I might make another run using Heng Li's default values for all three 
 #  parameters and compare the results, but I have no answers from this yet.  

bwa index  -a bwtsw  -p	bwa.ref/ncbi.v37.ref	 	 	 	\
	 	 	bwa.ref/human_g1k_v37.fasta.gz  >>& logfile

mkdir bwa.sai bwa.sam

( nice +20  bwa aln  -n 0.002  -M 7  -R 25  bwa.ref/ncbi.v37.ref	\
	    indiv/SRRx_1.fastq.gz  >  bwa.sai/SRRx_1.sai ) >>& logfile

 #  ... and the same for  SRRx_2.fastq.gz  or  SRRx.fastq.gz.
 #  Then, depending on whether these are paired end reads or not, either:

( nice +20  bwa sampe  bwa.ref/ncbi.v37.ref	 	 	\
	    bwa.sai/SRRx_1.sai	   bwa.sai/SRRx_2.sai	 	\
	    indiv/SRRx_1.fastq.gz  indiv/SRRx_2.fastq.gz	\
	      >  bwa.sam/SRRx.pair.sam )  >>& logfile

 #  or, for single end reads:

( nice +20  bwa samse  bwa.ref/ncbi.v37.ref	 	 	\
	    bwa.sai/SRRx.sai	   indiv/SRRx.fastq.gz	 	\
	      >  bwa.sam/SRRx.single.sam )  >>& logfile

 #  For LS 454 sequence data, bwa read mapping is a one-step process.  
 #  This uses the same genome reference sequence index as for Illumina 
 #  data above, and does not use paired end information in the mapping.  
 #  (The directory  bwa.sam  shown here should be separate from that 
 #  created above for Illumina data.  Similarly for AB SOLiD below.)  

( nice +20  bwa dbwtsw  bwa.ref/ncbi.v37.ref  indiv/SRRx.fastq.gz  \
	      >  bwa.sam/SRRx.sam )  >>& logfile

 #  For AB SOLiD data, one must build a separate index structure for 
 #  the genome reference sequence, and (completely undocumented) one 
 #  must rewrite all of the .fastq files replacing  0,1,2,3,"."  with 
 #  A,C,G,T,N  in every line of color space sequence and omitting the 
 #  first two characters from each line of converted sequence and from 
 #  the base call quality strings.  (An awk script does this conversion 
 #  really quickly.)

bwa index  -a bwtsw  -p	bwa.ref/ncbi.color.ref  -c	 	 	\
	 	 	bwa.ref/human_g1k_v37.fasta.gz  >>& logfile

mkdir bwa.sai bwa.sam

( nice +20  bwa aln -n 0.002 -M 7 -R 25 -o 0 -c  bwa.ref/ncbi.color.ref	\
	    indiv/TG145_1.cvt.gz  >  bwa.sai/TG145_1.sai ) >>& logfile

 #  ... and the same for  TG145_2.cvt.gz  or  TG145.cvt.gz.
 #  Then, depending on whether these are paired end reads or not, either:

( nice +20  bwa sampe  bwa.ref/ncbi.color.ref	 	 	\
	    bwa.sai/TG145_1.sai	   bwa.sai/TG145_2.sai	 	\
	    indiv/TG145_1.cvt.gz   indiv/TG145_2.cvt.gz	 	\
	      >  bwa.sam/SRRx.pair.sam )  >>& logfile

 #  or:

( nice +20  bwa samse  bwa.ref/ncbi.color.ref	 	 	\
	    bwa.sai/TG145.sai	   indiv/TG145.cvt.gz	 	\
	      >  bwa.sam/SRRx.single.sam )  >>& logfile

 #  (3)  Here is the process which builds the .glf format files used for 
 #  SNP calling out of .sam format files for individual sequencing runs 
 #  produced using either read mapping algorithm.  This will use only 
 #  samtools utilities and contains nothing specific to either read mapper.  

 #  For illustration, I will assume two of the directories left over from 
 #  bwa read mapping in the preceding section.  These are:  

 #  bwa.ref  containing files  human_g1k_v37.fasta.gz, human_g1k_v37.fasta.fai, and 
 #  bwa.sam  containing all of the single-end and paired-end .sam files generated 
 #  	 	 	for one individual and sequencing technology.

 #  The full set of command lines is:  (Note csh syntax again -- and this time, 
 #     I will change directories for convenience in file naming.)

mkdir  bwa.bam  indiv.bam
cd     bwa.bam
set    minq=17

foreach  file  ( ../bwa.sam/*.sam )
   set unsorted=`basename $file .sam`.nosort.bam
   samtools view -bhuS  -o $unsorted  -q $minq  $file	>>& logfile
   samtools sort  $unsorted  `basename $file .sam`	>>& logfile
   rm $unsorted

unset  minq

 #  This is the point where one would insert steps to recalibrate 
 #  base call quality values, check genotype identity or remove 
 #  duplicate sequence reads.  Each of these steps is specific to 
 #  an individual sequencing run.

cd  ../indiv.bam
samtools  merge  person.bam  ../bwa.bam/*.bam
samtools  index  person.bam
samtools  view   person.bam  20 | samtools pileup	\
	  -f ../bwa.ref/human_g1k_v37.fasta.gz	 	\
	  -t ../bwa.ref/human_g1k_v37.fasta.fai	 	\
	  -g  -  >  person.glf