Examples of Read Mapping with Karma and BWA
# Some instructions for read mapping and variant calling using the # University of Michigan tools and procedures.
# -- Paul Anderson and Tom Blackwell, December 11, 2009 --
# These instructions will cover three components:
# (1) read mapping using karma # (2) read mapping using bwa # (3) samtools processing from .sam to .glf
# These instructions are specifically with reference to the CEU trio # chromosome 20 test data set that has been distributed. Please # see some general discussion at the beginning of item 2. Everything # that's not commented out should be runnable code, although you # will need to alter the directory and file names.
# (1) Command line procedure for karma read mapping of the CEU trio # chromosome 20 test data set. Paul Anderson writes:
# Start with four subdirectories:
# indiv containing single and paired end Illumina .fastq sequence files, # ab containing AB SOLiD color space reads as in the test data set, # k.ref containing the gzipped human genome reference sequence, and # k.out an empty directory for the resulting karma .sam and .stats files.
# First, build karma's binary word index files:
cd k.ref zcat human_g1k_v37.fasta.gz > human_g1k_v37.fa karma --reference human_g1k_v37.fa --createIndex --occurrenceCutoff 5000
# then map Illumina paired end or single end reads as, for example:
cd ../k.out karma --reference ../k.ref/human_g1k_v37.fa --pairedReads --maxInsert 2000 \ ../indiv/SRR010941_1.recal.fastq.gz ../indiv/SRR010941_2.recal.fastq.gz
karma --reference ../k.ref/human_g1k_v37.fa --maxInsert 2000 \ ../indiv/SRR010936.recal.fastq.gz
# The read lengths in the Illumina data I saw are long enough so that the # only parameter worth tweaking might be --occurrenceCutoff in the index # structures. I will need to study the mapping performance to see if this # makes any appreciable difference in speed.
# LS 454 fastq files can also be mapped using the same index, but the # mapping success rate will be much lower.
# For AB Solid data, a color space reference and set of index files must be # created. For the lengths of reads I see, we will create two -- one with a # 12-mer index, and the other with a 15-mer.
cd ../k.ref ln -s human_g1k_v37.fa human_g1k_v37_12CS.fa ln -s human_g1k_v37.fa human_g1k_v37_15CS.fa karma --reference human_g1k_v37_12CS.fa --colorSpace --createIndex --wordSize 12 karma --reference human_g1k_v37_15CS.fa --colorSpace --createIndex --wordSize 15
# Then, short color space reads (25/26-mers) should be mapped using the 12-mer index:
cd ../k.out karma --reference ../k.ref/human_g1k_v37.fa \ --csreference ../k.ref/human_g1k_v37_12CS.fa \ --colorSpace --pairedReads --maxInsert 2000 \ ../ab/TG150_1.color.space.fastq.gz \ ../ab/TG150_2.color.space.fastq.gz
# To map longer color space reads (>30-mer), use the bigger index:
karma --reference ../k.ref/human_g1k_v37.fa \ --csreference ../k.ref/human_g1k_v37_15CS.fa \ --colorSpace --pairedReads --maxInsert 2000 \ ../ab/TG152_1.color.space.fastq.gz \ ../ab/TG152_2.color.space.fastq.gz
# You can use the shorter reference to map any length reads, but it is slower. # The 15-mer color space index (human_g1k_v37_15CS.fa) is a good compromise # for mapping reads 30 or more bases, and for machines with 24GB of RAM.
# The maxInsert of 2000 is somewhat arbitrary - you want a number that includes # the bulk of the paired end read insert distances you are interested in. More # is slower, so you don't want an arbitrarily large value.
# (2) Command line procedure for bwa read mapping of the CEU trio # chromosome 20 test data set.
# As a general principle, one wants to keep all sequence data separated # by sequencing technology and by individual throughout the entire mapping # and variant calling process. Different technologies may have different # requirements for read mapping and different characteristics for variant # calling. This can be done either using the directory structure or with a # file tracking database.
# As a further complication, the Broad Institute Illumina sequencing runs # in the current test data set benefit from removing leading and trailing Ns # before read mapping with bwa. All Illumina data for the CEU trio that # have SRR identifiers are Broad Institute runs. (There are no Broad # runs for the YRI trio. There, SRR identifiers will indicate either Beijing # or Wash U runs.) I have used exactly the same bwa command lines with # or without N-trimming.
# In what follows, I will use somewhat simplified directory and file names. # At this stage, it's easiest to have the command line prompt remain at top # level in the directory structure, and refer to all files using relative # pathnames, relative to the location of the prompt. At the start, suppose # that there are two subdirectories: 'bwa.ref' containing the gzipped human # genome reference sequence, and 'indiv' containing many .fastq files for # the target individual generated by the appropriate sequencing technology. # The bwa program has an inconvenient habit of writing to std.err. I routinely # redirect that to a log file so that I can keep on working while bwa runs in # the background.
# In general, bwa read mapping requires initial indexing of the genome # reference sequence, followed by two passes for each .fastq file. The # software components of this process are:
# bwa index # Do this only when a new version of the genome reference
# sequence is released. Takes just under two hours.
# bwa aln # Run this on every .fastq file individually. Highly
# variable timings -- takes between 10 minutes and many # hours per .fastq file. Better data runs quicker.
# bwa samse # Converts a single .fastq / .sai pair to .sam alignment
# format. Usually under 1 minute per file.
# bwa sampe # Converts paired end .fastq / .sai pairs (four files total)
# to a single .sam alignment file. 1 - 2 minutes per run. # (In general, the CEU trio chromosome 20 test data set # contains VERY small .fastq files. Chromosome 20 is just # over 2% of the entire genome.)
# Sample command lines. This is csh syntax.
# The parameter string " -n 0.002 -M 7 -R 25 " used in bwa aln below # is just my initial guess. I have made an entire run using this string. # I might make another run using Heng Li's default values for all three # parameters and compare the results, but I have no answers from this yet.
bwa index -a bwtsw -p bwa.ref/ncbi.v37.ref \ bwa.ref/human_g1k_v37.fasta.gz >>& logfile
mkdir bwa.sai bwa.sam
( nice +20 bwa aln -n 0.002 -M 7 -R 25 bwa.ref/ncbi.v37.ref \ indiv/SRRx_1.fastq.gz > bwa.sai/SRRx_1.sai ) >>& logfile
# ... and the same for SRRx_2.fastq.gz or SRRx.fastq.gz. # Then, depending on whether these are paired end reads or not, either:
( nice +20 bwa sampe bwa.ref/ncbi.v37.ref \ bwa.sai/SRRx_1.sai bwa.sai/SRRx_2.sai \ indiv/SRRx_1.fastq.gz indiv/SRRx_2.fastq.gz \ > bwa.sam/SRRx.pair.sam ) >>& logfile
# or, for single end reads:
( nice +20 bwa samse bwa.ref/ncbi.v37.ref \ bwa.sai/SRRx.sai indiv/SRRx.fastq.gz \ > bwa.sam/SRRx.single.sam ) >>& logfile
# For LS 454 sequence data, bwa read mapping is a one-step process. # This uses the same genome reference sequence index as for Illumina # data above, and does not use paired end information in the mapping. # (The directory bwa.sam shown here should be separate from that # created above for Illumina data. Similarly for AB SOLiD below.)
( nice +20 bwa dbwtsw bwa.ref/ncbi.v37.ref indiv/SRRx.fastq.gz \ > bwa.sam/SRRx.sam ) >>& logfile
# For AB SOLiD data, one must build a separate index structure for # the genome reference sequence, and (completely undocumented) one # must rewrite all of the .fastq files replacing 0,1,2,3,"." with # A,C,G,T,N in every line of color space sequence and omitting the # first two characters from each line of converted sequence and from # the base call quality strings. (An awk script does this conversion # really quickly.)
bwa index -a bwtsw -p bwa.ref/ncbi.color.ref -c \ bwa.ref/human_g1k_v37.fasta.gz >>& logfile
mkdir bwa.sai bwa.sam
( nice +20 bwa aln -n 0.002 -M 7 -R 25 -o 0 -c bwa.ref/ncbi.color.ref \ indiv/TG145_1.cvt.gz > bwa.sai/TG145_1.sai ) >>& logfile
# ... and the same for TG145_2.cvt.gz or TG145.cvt.gz. # Then, depending on whether these are paired end reads or not, either:
( nice +20 bwa sampe bwa.ref/ncbi.color.ref \ bwa.sai/TG145_1.sai bwa.sai/TG145_2.sai \ indiv/TG145_1.cvt.gz indiv/TG145_2.cvt.gz \ > bwa.sam/SRRx.pair.sam ) >>& logfile
# or:
( nice +20 bwa samse bwa.ref/ncbi.color.ref \ bwa.sai/TG145.sai indiv/TG145.cvt.gz \ > bwa.sam/SRRx.single.sam ) >>& logfile
# (3) Here is the process which builds the .glf format files used for # SNP calling out of .sam format files for individual sequencing runs # produced using either read mapping algorithm. This will use only # samtools utilities and contains nothing specific to either read mapper.
# For illustration, I will assume two of the directories left over from # bwa read mapping in the preceding section. These are:
# bwa.ref containing files human_g1k_v37.fasta.gz, human_g1k_v37.fasta.fai, and # bwa.sam containing all of the single-end and paired-end .sam files generated # for one individual and sequencing technology.
# The full set of command lines is: (Note csh syntax again -- and this time, # I will change directories for convenience in file naming.)
mkdir bwa.bam indiv.bam cd bwa.bam set minq=17
foreach file ( ../bwa.sam/*.sam )
set unsorted=`basename $file .sam`.nosort.bam samtools view -bhuS -o $unsorted -q $minq $file >>& logfile samtools sort $unsorted `basename $file .sam` >>& logfile rm $unsorted end
unset minq
# This is the point where one would insert steps to recalibrate # base call quality values, check genotype identity or remove # duplicate sequence reads. Each of these steps is specific to # an individual sequencing run.
cd ../indiv.bam samtools merge person.bam ../bwa.bam/*.bam samtools index person.bam samtools view person.bam 20 | samtools pileup \ -f ../bwa.ref/human_g1k_v37.fasta.gz \ -t ../bwa.ref/human_g1k_v37.fasta.fai \ -g - > person.glf