Examples of Read Mapping with Karma and BWA
From Genome Analysis WikiJump to navigationJump to search
The printable version is no longer supported and may have rendering errors. Please update your browser bookmarks and please use the default browser print function instead.
# Some instructions for read mapping and variant calling using the # University of Michigan tools and procedures. # -- Paul Anderson and Tom Blackwell, December 11, 2009 -- # These instructions will cover three components: # (1) read mapping using karma # (2) read mapping using bwa # (3) samtools processing from .sam to .glf # These instructions are specifically with reference to the CEU trio # chromosome 20 test data set that has been distributed. Please # see some general discussion at the beginning of item 2. Everything # that's not commented out should be runnable code, although you # will need to alter the directory and file names. # (1) Command line procedure for karma read mapping of the CEU trio # chromosome 20 test data set. Paul Anderson writes: # Start with four subdirectories: # indiv containing single and paired end Illumina .fastq sequence files, # ab containing AB SOLiD color space reads as in the test data set, # k.ref containing the gzipped human genome reference sequence, and # k.out an empty directory for the resulting karma .sam and .stats files. # First, build karma's binary word index files: cd k.ref zcat human_g1k_v37.fasta.gz > human_g1k_v37.fa karma --reference human_g1k_v37.fa --createIndex --occurrenceCutoff 5000 # then map Illumina paired end or single end reads as, for example: cd ../k.out karma --reference ../k.ref/human_g1k_v37.fa --pairedReads --maxInsert 2000 \ ../indiv/SRR010941_1.recal.fastq.gz ../indiv/SRR010941_2.recal.fastq.gz karma --reference ../k.ref/human_g1k_v37.fa --maxInsert 2000 \ ../indiv/SRR010936.recal.fastq.gz # The read lengths in the Illumina data I saw are long enough so that the # only parameter worth tweaking might be --occurrenceCutoff in the index # structures. I will need to study the mapping performance to see if this # makes any appreciable difference in speed. # LS 454 fastq files can also be mapped using the same index, but the # mapping success rate will be much lower. # For AB Solid data, a color space reference and set of index files must be # created. For the lengths of reads I see, we will create two -- one with a # 12-mer index, and the other with a 15-mer. cd ../k.ref ln -s human_g1k_v37.fa human_g1k_v37_12CS.fa ln -s human_g1k_v37.fa human_g1k_v37_15CS.fa karma --reference human_g1k_v37_12CS.fa --colorSpace --createIndex --wordSize 12 karma --reference human_g1k_v37_15CS.fa --colorSpace --createIndex --wordSize 15 # Then, short color space reads (25/26-mers) should be mapped using the 12-mer index: cd ../k.out karma --reference ../k.ref/human_g1k_v37.fa \ --csreference ../k.ref/human_g1k_v37_12CS.fa \ --colorSpace --pairedReads --maxInsert 2000 \ ../ab/TG150_1.color.space.fastq.gz \ ../ab/TG150_2.color.space.fastq.gz # To map longer color space reads (>30-mer), use the bigger index: karma --reference ../k.ref/human_g1k_v37.fa \ --csreference ../k.ref/human_g1k_v37_15CS.fa \ --colorSpace --pairedReads --maxInsert 2000 \ ../ab/TG152_1.color.space.fastq.gz \ ../ab/TG152_2.color.space.fastq.gz # You can use the shorter reference to map any length reads, but it is slower. # The 15-mer color space index (human_g1k_v37_15CS.fa) is a good compromise # for mapping reads 30 or more bases, and for machines with 24GB of RAM. # The maxInsert of 2000 is somewhat arbitrary - you want a number that includes # the bulk of the paired end read insert distances you are interested in. More # is slower, so you don't want an arbitrarily large value. # (2) Command line procedure for bwa read mapping of the CEU trio # chromosome 20 test data set. # As a general principle, one wants to keep all sequence data separated # by sequencing technology and by individual throughout the entire mapping # and variant calling process. Different technologies may have different # requirements for read mapping and different characteristics for variant # calling. This can be done either using the directory structure or with a # file tracking database. # As a further complication, the Broad Institute Illumina sequencing runs # in the current test data set benefit from removing leading and trailing Ns # before read mapping with bwa. All Illumina data for the CEU trio that # have SRR identifiers are Broad Institute runs. (There are no Broad # runs for the YRI trio. There, SRR identifiers will indicate either Beijing # or Wash U runs.) I have used exactly the same bwa command lines with # or without N-trimming. # In what follows, I will use somewhat simplified directory and file names. # At this stage, it's easiest to have the command line prompt remain at top # level in the directory structure, and refer to all files using relative # pathnames, relative to the location of the prompt. At the start, suppose # that there are two subdirectories: 'bwa.ref' containing the gzipped human # genome reference sequence, and 'indiv' containing many .fastq files for # the target individual generated by the appropriate sequencing technology. # The bwa program has an inconvenient habit of writing to std.err. I routinely # redirect that to a log file so that I can keep on working while bwa runs in # the background. # In general, bwa read mapping requires initial indexing of the genome # reference sequence, followed by two passes for each .fastq file. The # software components of this process are: # bwa index -- Do this only when a new version of the genome reference # sequence is released. Takes just under two hours. # bwa aln -- Run this on every .fastq file individually. Highly # variable timings -- takes between 10 minutes and many # hours per .fastq file. Better data runs quicker. # bwa samse -- Converts a single .fastq / .sai pair to .sam alignment # format. Usually under 1 minute per file. # bwa sampe -- Converts paired end .fastq / .sai pairs (four files total) # to a single .sam alignment file. 1 - 2 minutes per run. # (In general, the CEU trio chromosome 20 test data set # contains VERY small .fastq files. Chromosome 20 is just # over 2% of the entire genome.) # Sample command lines. This is csh syntax. # The parameter string " -n 0.002 -M 7 -R 25 " used in bwa aln below # is just my initial guess. I have made an entire run using this string. # I might make another run using Heng Li's default values for all three # parameters and compare the results, but I have no answers from this yet. bwa index -a bwtsw -p bwa.ref/ncbi.v37.ref \ bwa.ref/human_g1k_v37.fasta.gz >>& logfile mkdir bwa.sai bwa.sam ( nice +20 bwa aln -n 0.002 -M 7 -R 25 bwa.ref/ncbi.v37.ref \ indiv/SRRx_1.fastq.gz > bwa.sai/SRRx_1.sai ) >>& logfile # ... and the same for SRRx_2.fastq.gz or SRRx.fastq.gz. # Then, depending on whether these are paired end reads or not, either: ( nice +20 bwa sampe bwa.ref/ncbi.v37.ref \ bwa.sai/SRRx_1.sai bwa.sai/SRRx_2.sai \ indiv/SRRx_1.fastq.gz indiv/SRRx_2.fastq.gz \ > bwa.sam/SRRx.pair.sam ) >>& logfile # or, for single end reads: ( nice +20 bwa samse bwa.ref/ncbi.v37.ref \ bwa.sai/SRRx.sai indiv/SRRx.fastq.gz \ > bwa.sam/SRRx.single.sam ) >>& logfile # For LS 454 sequence data, bwa read mapping is a one-step process. # This uses the same genome reference sequence index as for Illumina # data above, and does not use paired end information in the mapping. # (The directory bwa.sam shown here should be separate from that # created above for Illumina data. Similarly for AB SOLiD below.) ( nice +20 bwa dbwtsw bwa.ref/ncbi.v37.ref indiv/SRRx.fastq.gz \ > bwa.sam/SRRx.sam ) >>& logfile # For AB SOLiD data, one must build a separate index structure for # the genome reference sequence, and (completely undocumented) one # must rewrite all of the .fastq files replacing 0,1,2,3,"." with # A,C,G,T,N in every line of color space sequence and omitting the # first two characters from each line of converted sequence and from # the base call quality strings. (An awk script does this conversion # really quickly.) bwa index -a bwtsw -p bwa.ref/ncbi.color.ref -c \ bwa.ref/human_g1k_v37.fasta.gz >>& logfile mkdir bwa.sai bwa.sam ( nice +20 bwa aln -n 0.002 -M 7 -R 25 -o 0 -c bwa.ref/ncbi.color.ref \ indiv/TG145_1.cvt.gz > bwa.sai/TG145_1.sai ) >>& logfile # ... and the same for TG145_2.cvt.gz or TG145.cvt.gz. # Then, depending on whether these are paired end reads or not, either: ( nice +20 bwa sampe bwa.ref/ncbi.color.ref \ bwa.sai/TG145_1.sai bwa.sai/TG145_2.sai \ indiv/TG145_1.cvt.gz indiv/TG145_2.cvt.gz \ > bwa.sam/SRRx.pair.sam ) >>& logfile # or: ( nice +20 bwa samse bwa.ref/ncbi.color.ref \ bwa.sai/TG145.sai indiv/TG145.cvt.gz \ > bwa.sam/SRRx.single.sam ) >>& logfile # (3) Here is the process which builds the .glf format files used for # SNP calling out of .sam format files for individual sequencing runs # produced using either read mapping algorithm. This will use only # samtools utilities and contains nothing specific to either read mapper. # For illustration, I will assume two of the directories left over from # bwa read mapping in the preceding section. These are: # bwa.ref containing files human_g1k_v37.fasta.gz, human_g1k_v37.fasta.fai, and # bwa.sam containing all of the single-end and paired-end .sam files generated # for one individual and sequencing technology. # The full set of command lines is: (Note csh syntax again -- and this time, # I will change directories for convenience in file naming.) mkdir bwa.bam indiv.bam cd bwa.bam set minq=17 foreach file ( ../bwa.sam/*.sam ) set unsorted=`basename $file .sam`.nosort.bam samtools view -bhuS -o $unsorted -q $minq $file >>& logfile samtools sort $unsorted `basename $file .sam` >>& logfile rm $unsorted end unset minq # This is the point where one would insert steps to recalibrate # base call quality values, check genotype identity or remove # duplicate sequence reads. Each of these steps is specific to # an individual sequencing run. cd ../indiv.bam samtools merge person.bam ../bwa.bam/*.bam samtools index person.bam samtools view person.bam 20 | samtools pileup \ -f ../bwa.ref/human_g1k_v37.fasta.gz \ -t ../bwa.ref/human_g1k_v37.fasta.fai \ -g - > person.glf