Difference between revisions of "Examples of Read Mapping with Karma and BWA"
From Genome Analysis WikiJump to navigationJump to search
Revision as of 07:21, 23 November 2010
# Some instructions for read mapping and variant calling using the # University of Michigan tools and procedures. # -- Paul Anderson and Tom Blackwell, December 11, 2009 -- # These instructions will cover three components: # (1) read mapping using karma # (2) read mapping using bwa # (3) samtools processing from .sam to .glf # These instructions are specifically with reference to the CEU trio # chromosome 20 test data set that has been distributed. Please # see some general discussion at the beginning of item 2. Everything # that's not commented out should be runnable code, although you # will need to alter the directory and file names. # (1) Command line procedure for karma read mapping of the CEU trio # chromosome 20 test data set. Paul Anderson writes: # Start with four subdirectories: # indiv containing single and paired end Illumina .fastq sequence files, # ab containing AB SOLiD color space reads as in the test data set, # k.ref containing the gzipped human genome reference sequence, and # k.out an empty directory for the resulting karma .sam and .stats files. # First, build karma's binary word index files: cd k.ref zcat human_g1k_v37.fasta.gz > human_g1k_v37.fa karma --reference human_g1k_v37.fa --createIndex --occurrenceCutoff 5000 # then map Illumina paired end or single end reads as, for example: cd ../k.out karma --reference ../k.ref/human_g1k_v37.fa --pairedReads --maxInsert 2000 \ ../indiv/SRR010941_1.recal.fastq.gz ../indiv/SRR010941_2.recal.fastq.gz karma --reference ../k.ref/human_g1k_v37.fa --maxInsert 2000 \ ../indiv/SRR010936.recal.fastq.gz # The read lengths in the Illumina data I saw are long enough so that the # only parameter worth tweaking might be --occurrenceCutoff in the index # structures. I will need to study the mapping performance to see if this # makes any appreciable difference in speed. # LS 454 fastq files can also be mapped using the same index, but the # mapping success rate will be much lower. # For AB Solid data, a color space reference and set of index files must be # created. For the lengths of reads I see, we will create two -- one with a # 12-mer index, and the other with a 15-mer. cd ../k.ref ln -s human_g1k_v37.fa human_g1k_v37_12CS.fa ln -s human_g1k_v37.fa human_g1k_v37_15CS.fa karma --reference human_g1k_v37_12CS.fa --colorSpace --createIndex --wordSize 12 karma --reference human_g1k_v37_15CS.fa --colorSpace --createIndex --wordSize 15 # Then, short color space reads (25/26-mers) should be mapped using the 12-mer index: cd ../k.out karma --reference ../k.ref/human_g1k_v37.fa \ --csreference ../k.ref/human_g1k_v37_12CS.fa \ --colorSpace --pairedReads --maxInsert 2000 \ ../ab/TG150_1.color.space.fastq.gz \ ../ab/TG150_2.color.space.fastq.gz # To map longer color space reads (>30-mer), use the bigger index: karma --reference ../k.ref/human_g1k_v37.fa \ --csreference ../k.ref/human_g1k_v37_15CS.fa \ --colorSpace --pairedReads --maxInsert 2000 \ ../ab/TG152_1.color.space.fastq.gz \ ../ab/TG152_2.color.space.fastq.gz # You can use the shorter reference to map any length reads, but it is slower. # The 15-mer color space index (human_g1k_v37_15CS.fa) is a good compromise # for mapping reads 30 or more bases, and for machines with 24GB of RAM. # The maxInsert of 2000 is somewhat arbitrary - you want a number that includes # the bulk of the paired end read insert distances you are interested in. More # is slower, so you don't want an arbitrarily large value. # (2) Command line procedure for bwa read mapping of the CEU trio # chromosome 20 test data set. # As a general principle, one wants to keep all sequence data separated # by sequencing technology and by individual throughout the entire mapping # and variant calling process. Different technologies may have different # requirements for read mapping and different characteristics for variant # calling. This can be done either using the directory structure or with a # file tracking database. # As a further complication, the Broad Institute Illumina sequencing runs # in the current test data set benefit from removing leading and trailing Ns # before read mapping with bwa. All Illumina data for the CEU trio that # have SRR identifiers are Broad Institute runs. (There are no Broad # runs for the YRI trio. There, SRR identifiers will indicate either Beijing # or Wash U runs.) I have used exactly the same bwa command lines with # or without N-trimming. # In what follows, I will use somewhat simplified directory and file names. # At this stage, it's easiest to have the command line prompt remain at top # level in the directory structure, and refer to all files using relative # pathnames, relative to the location of the prompt. At the start, suppose # that there are two subdirectories: 'bwa.ref' containing the gzipped human # genome reference sequence, and 'indiv' containing many .fastq files for # the target individual generated by the appropriate sequencing technology. # The bwa program has an inconvenient habit of writing to std.err. I routinely # redirect that to a log file so that I can keep on working while bwa runs in # the background. # In general, bwa read mapping requires initial indexing of the genome # reference sequence, followed by two passes for each .fastq file. The # software components of this process are: # bwa index -- Do this only when a new version of the genome reference # sequence is released. Takes just under two hours. # bwa aln -- Run this on every .fastq file individually. Highly # variable timings -- takes between 10 minutes and many # hours per .fastq file. Better data runs quicker. # bwa samse -- Converts a single .fastq / .sai pair to .sam alignment # format. Usually under 1 minute per file. # bwa sampe -- Converts paired end .fastq / .sai pairs (four files total) # to a single .sam alignment file. 1 - 2 minutes per run. # (In general, the CEU trio chromosome 20 test data set # contains VERY small .fastq files. Chromosome 20 is just # over 2% of the entire genome.) # Sample command lines. This is csh syntax. # The parameter string " -n 0.002 -M 7 -R 25 " used in bwa aln below # is just my initial guess. I have made an entire run using this string. # I might make another run using Heng Li's default values for all three # parameters and compare the results, but I have no answers from this yet. bwa index -a bwtsw -p bwa.ref/ncbi.v37.ref \ bwa.ref/human_g1k_v37.fasta.gz >>& logfile mkdir bwa.sai bwa.sam ( nice +20 bwa aln -n 0.002 -M 7 -R 25 bwa.ref/ncbi.v37.ref \ indiv/SRRx_1.fastq.gz > bwa.sai/SRRx_1.sai ) >>& logfile # ... and the same for SRRx_2.fastq.gz or SRRx.fastq.gz. # Then, depending on whether these are paired end reads or not, either: ( nice +20 bwa sampe bwa.ref/ncbi.v37.ref \ bwa.sai/SRRx_1.sai bwa.sai/SRRx_2.sai \ indiv/SRRx_1.fastq.gz indiv/SRRx_2.fastq.gz \ > bwa.sam/SRRx.pair.sam ) >>& logfile # or, for single end reads: ( nice +20 bwa samse bwa.ref/ncbi.v37.ref \ bwa.sai/SRRx.sai indiv/SRRx.fastq.gz \ > bwa.sam/SRRx.single.sam ) >>& logfile # For LS 454 sequence data, bwa read mapping is a one-step process. # This uses the same genome reference sequence index as for Illumina # data above, and does not use paired end information in the mapping. # (The directory bwa.sam shown here should be separate from that # created above for Illumina data. Similarly for AB SOLiD below.) ( nice +20 bwa dbwtsw bwa.ref/ncbi.v37.ref indiv/SRRx.fastq.gz \ > bwa.sam/SRRx.sam ) >>& logfile # For AB SOLiD data, one must build a separate index structure for # the genome reference sequence, and (completely undocumented) one # must rewrite all of the .fastq files replacing 0,1,2,3,"." with # A,C,G,T,N in every line of color space sequence and omitting the # first two characters from each line of converted sequence and from # the base call quality strings. (An awk script does this conversion # really quickly.) bwa index -a bwtsw -p bwa.ref/ncbi.color.ref -c \ bwa.ref/human_g1k_v37.fasta.gz >>& logfile mkdir bwa.sai bwa.sam ( nice +20 bwa aln -n 0.002 -M 7 -R 25 -o 0 -c bwa.ref/ncbi.color.ref \ indiv/TG145_1.cvt.gz > bwa.sai/TG145_1.sai ) >>& logfile # ... and the same for TG145_2.cvt.gz or TG145.cvt.gz. # Then, depending on whether these are paired end reads or not, either: ( nice +20 bwa sampe bwa.ref/ncbi.color.ref \ bwa.sai/TG145_1.sai bwa.sai/TG145_2.sai \ indiv/TG145_1.cvt.gz indiv/TG145_2.cvt.gz \ > bwa.sam/SRRx.pair.sam ) >>& logfile # or: ( nice +20 bwa samse bwa.ref/ncbi.color.ref \ bwa.sai/TG145.sai indiv/TG145.cvt.gz \ > bwa.sam/SRRx.single.sam ) >>& logfile # (3) Here is the process which builds the .glf format files used for # SNP calling out of .sam format files for individual sequencing runs # produced using either read mapping algorithm. This will use only # samtools utilities and contains nothing specific to either read mapper. # For illustration, I will assume two of the directories left over from # bwa read mapping in the preceding section. These are: # bwa.ref containing files human_g1k_v37.fasta.gz, human_g1k_v37.fasta.fai, and # bwa.sam containing all of the single-end and paired-end .sam files generated # for one individual and sequencing technology. # The full set of command lines is: (Note csh syntax again -- and this time, # I will change directories for convenience in file naming.) mkdir bwa.bam indiv.bam cd bwa.bam set minq=17 foreach file ( ../bwa.sam/*.sam ) set unsorted=`basename $file .sam`.nosort.bam samtools view -bhuS -o $unsorted -q $minq $file >>& logfile samtools sort $unsorted `basename $file .sam` >>& logfile rm $unsorted end unset minq # This is the point where one would insert steps to recalibrate # base call quality values, check genotype identity or remove # duplicate sequence reads. Each of these steps is specific to # an individual sequencing run. cd ../indiv.bam samtools merge person.bam ../bwa.bam/*.bam samtools index person.bam samtools view person.bam 20 | samtools pileup \ -f ../bwa.ref/human_g1k_v37.fasta.gz \ -t ../bwa.ref/human_g1k_v37.fasta.fai \ -g - > person.glf