Difference between revisions of "Examples of Read Mapping with Karma and BWA"
From Genome Analysis Wiki
Jump to navigationJump to search(4 intermediate revisions by one other user not shown) | |||
Line 1: | Line 1: | ||
− | <source> | + | <!-- BANNER ACROSS TOP OF PAGE --> |
+ | {| style="width:100%; background:#ffb6c1; margin-top:1.2em; border:1px solid #ccc;" | | ||
+ | | style="width:100%; text-align:center; white-space:nowrap; color:#000;" | | ||
+ | <div style="font-size:162%; border:none; margin:0; padding:.1em; color:#000;">KARMA is obsolete and not maintained</div> | ||
+ | |} | ||
+ | |||
+ | <source lang="text"> | ||
# Some instructions for read mapping and variant calling using the | # Some instructions for read mapping and variant calling using the |
Latest revision as of 15:50, 23 September 2014
KARMA is obsolete and not maintained
|
# Some instructions for read mapping and variant calling using the
# University of Michigan tools and procedures.
# -- Paul Anderson and Tom Blackwell, December 11, 2009 --
# These instructions will cover three components:
# (1) read mapping using karma
# (2) read mapping using bwa
# (3) samtools processing from .sam to .glf
# These instructions are specifically with reference to the CEU trio
# chromosome 20 test data set that has been distributed. Please
# see some general discussion at the beginning of item 2. Everything
# that's not commented out should be runnable code, although you
# will need to alter the directory and file names.
# (1) Command line procedure for karma read mapping of the CEU trio
# chromosome 20 test data set. Paul Anderson writes:
# Start with four subdirectories:
# indiv containing single and paired end Illumina .fastq sequence files,
# ab containing AB SOLiD color space reads as in the test data set,
# k.ref containing the gzipped human genome reference sequence, and
# k.out an empty directory for the resulting karma .sam and .stats files.
# First, build karma's binary word index files:
cd k.ref
zcat human_g1k_v37.fasta.gz > human_g1k_v37.fa
karma --reference human_g1k_v37.fa --createIndex --occurrenceCutoff 5000
# then map Illumina paired end or single end reads as, for example:
cd ../k.out
karma --reference ../k.ref/human_g1k_v37.fa --pairedReads --maxInsert 2000 \
../indiv/SRR010941_1.recal.fastq.gz ../indiv/SRR010941_2.recal.fastq.gz
karma --reference ../k.ref/human_g1k_v37.fa --maxInsert 2000 \
../indiv/SRR010936.recal.fastq.gz
# The read lengths in the Illumina data I saw are long enough so that the
# only parameter worth tweaking might be --occurrenceCutoff in the index
# structures. I will need to study the mapping performance to see if this
# makes any appreciable difference in speed.
# LS 454 fastq files can also be mapped using the same index, but the
# mapping success rate will be much lower.
# For AB Solid data, a color space reference and set of index files must be
# created. For the lengths of reads I see, we will create two -- one with a
# 12-mer index, and the other with a 15-mer.
cd ../k.ref
ln -s human_g1k_v37.fa human_g1k_v37_12CS.fa
ln -s human_g1k_v37.fa human_g1k_v37_15CS.fa
karma --reference human_g1k_v37_12CS.fa --colorSpace --createIndex --wordSize 12
karma --reference human_g1k_v37_15CS.fa --colorSpace --createIndex --wordSize 15
# Then, short color space reads (25/26-mers) should be mapped using the 12-mer index:
cd ../k.out
karma --reference ../k.ref/human_g1k_v37.fa \
--csreference ../k.ref/human_g1k_v37_12CS.fa \
--colorSpace --pairedReads --maxInsert 2000 \
../ab/TG150_1.color.space.fastq.gz \
../ab/TG150_2.color.space.fastq.gz
# To map longer color space reads (>30-mer), use the bigger index:
karma --reference ../k.ref/human_g1k_v37.fa \
--csreference ../k.ref/human_g1k_v37_15CS.fa \
--colorSpace --pairedReads --maxInsert 2000 \
../ab/TG152_1.color.space.fastq.gz \
../ab/TG152_2.color.space.fastq.gz
# You can use the shorter reference to map any length reads, but it is slower.
# The 15-mer color space index (human_g1k_v37_15CS.fa) is a good compromise
# for mapping reads 30 or more bases, and for machines with 24GB of RAM.
# The maxInsert of 2000 is somewhat arbitrary - you want a number that includes
# the bulk of the paired end read insert distances you are interested in. More
# is slower, so you don't want an arbitrarily large value.
# (2) Command line procedure for bwa read mapping of the CEU trio
# chromosome 20 test data set.
# As a general principle, one wants to keep all sequence data separated
# by sequencing technology and by individual throughout the entire mapping
# and variant calling process. Different technologies may have different
# requirements for read mapping and different characteristics for variant
# calling. This can be done either using the directory structure or with a
# file tracking database.
# As a further complication, the Broad Institute Illumina sequencing runs
# in the current test data set benefit from removing leading and trailing Ns
# before read mapping with bwa. All Illumina data for the CEU trio that
# have SRR identifiers are Broad Institute runs. (There are no Broad
# runs for the YRI trio. There, SRR identifiers will indicate either Beijing
# or Wash U runs.) I have used exactly the same bwa command lines with
# or without N-trimming.
# In what follows, I will use somewhat simplified directory and file names.
# At this stage, it's easiest to have the command line prompt remain at top
# level in the directory structure, and refer to all files using relative
# pathnames, relative to the location of the prompt. At the start, suppose
# that there are two subdirectories: 'bwa.ref' containing the gzipped human
# genome reference sequence, and 'indiv' containing many .fastq files for
# the target individual generated by the appropriate sequencing technology.
# The bwa program has an inconvenient habit of writing to std.err. I routinely
# redirect that to a log file so that I can keep on working while bwa runs in
# the background.
# In general, bwa read mapping requires initial indexing of the genome
# reference sequence, followed by two passes for each .fastq file. The
# software components of this process are:
# bwa index -- Do this only when a new version of the genome reference
# sequence is released. Takes just under two hours.
# bwa aln -- Run this on every .fastq file individually. Highly
# variable timings -- takes between 10 minutes and many
# hours per .fastq file. Better data runs quicker.
# bwa samse -- Converts a single .fastq / .sai pair to .sam alignment
# format. Usually under 1 minute per file.
# bwa sampe -- Converts paired end .fastq / .sai pairs (four files total)
# to a single .sam alignment file. 1 - 2 minutes per run.
# (In general, the CEU trio chromosome 20 test data set
# contains VERY small .fastq files. Chromosome 20 is just
# over 2% of the entire genome.)
# Sample command lines. This is csh syntax.
# The parameter string " -n 0.002 -M 7 -R 25 " used in bwa aln below
# is just my initial guess. I have made an entire run using this string.
# I might make another run using Heng Li's default values for all three
# parameters and compare the results, but I have no answers from this yet.
bwa index -a bwtsw -p bwa.ref/ncbi.v37.ref \
bwa.ref/human_g1k_v37.fasta.gz >>& logfile
mkdir bwa.sai bwa.sam
( nice +20 bwa aln -n 0.002 -M 7 -R 25 bwa.ref/ncbi.v37.ref \
indiv/SRRx_1.fastq.gz > bwa.sai/SRRx_1.sai ) >>& logfile
# ... and the same for SRRx_2.fastq.gz or SRRx.fastq.gz.
# Then, depending on whether these are paired end reads or not, either:
( nice +20 bwa sampe bwa.ref/ncbi.v37.ref \
bwa.sai/SRRx_1.sai bwa.sai/SRRx_2.sai \
indiv/SRRx_1.fastq.gz indiv/SRRx_2.fastq.gz \
> bwa.sam/SRRx.pair.sam ) >>& logfile
# or, for single end reads:
( nice +20 bwa samse bwa.ref/ncbi.v37.ref \
bwa.sai/SRRx.sai indiv/SRRx.fastq.gz \
> bwa.sam/SRRx.single.sam ) >>& logfile
# For LS 454 sequence data, bwa read mapping is a one-step process.
# This uses the same genome reference sequence index as for Illumina
# data above, and does not use paired end information in the mapping.
# (The directory bwa.sam shown here should be separate from that
# created above for Illumina data. Similarly for AB SOLiD below.)
( nice +20 bwa dbwtsw bwa.ref/ncbi.v37.ref indiv/SRRx.fastq.gz \
> bwa.sam/SRRx.sam ) >>& logfile
# For AB SOLiD data, one must build a separate index structure for
# the genome reference sequence, and (completely undocumented) one
# must rewrite all of the .fastq files replacing 0,1,2,3,"." with
# A,C,G,T,N in every line of color space sequence and omitting the
# first two characters from each line of converted sequence and from
# the base call quality strings. (An awk script does this conversion
# really quickly.)
bwa index -a bwtsw -p bwa.ref/ncbi.color.ref -c \
bwa.ref/human_g1k_v37.fasta.gz >>& logfile
mkdir bwa.sai bwa.sam
( nice +20 bwa aln -n 0.002 -M 7 -R 25 -o 0 -c bwa.ref/ncbi.color.ref \
indiv/TG145_1.cvt.gz > bwa.sai/TG145_1.sai ) >>& logfile
# ... and the same for TG145_2.cvt.gz or TG145.cvt.gz.
# Then, depending on whether these are paired end reads or not, either:
( nice +20 bwa sampe bwa.ref/ncbi.color.ref \
bwa.sai/TG145_1.sai bwa.sai/TG145_2.sai \
indiv/TG145_1.cvt.gz indiv/TG145_2.cvt.gz \
> bwa.sam/SRRx.pair.sam ) >>& logfile
# or:
( nice +20 bwa samse bwa.ref/ncbi.color.ref \
bwa.sai/TG145.sai indiv/TG145.cvt.gz \
> bwa.sam/SRRx.single.sam ) >>& logfile
# (3) Here is the process which builds the .glf format files used for
# SNP calling out of .sam format files for individual sequencing runs
# produced using either read mapping algorithm. This will use only
# samtools utilities and contains nothing specific to either read mapper.
# For illustration, I will assume two of the directories left over from
# bwa read mapping in the preceding section. These are:
# bwa.ref containing files human_g1k_v37.fasta.gz, human_g1k_v37.fasta.fai, and
# bwa.sam containing all of the single-end and paired-end .sam files generated
# for one individual and sequencing technology.
# The full set of command lines is: (Note csh syntax again -- and this time,
# I will change directories for convenience in file naming.)
mkdir bwa.bam indiv.bam
cd bwa.bam
set minq=17
foreach file ( ../bwa.sam/*.sam )
set unsorted=`basename $file .sam`.nosort.bam
samtools view -bhuS -o $unsorted -q $minq $file >>& logfile
samtools sort $unsorted `basename $file .sam` >>& logfile
rm $unsorted
end
unset minq
# This is the point where one would insert steps to recalibrate
# base call quality values, check genotype identity or remove
# duplicate sequence reads. Each of these steps is specific to
# an individual sequencing run.
cd ../indiv.bam
samtools merge person.bam ../bwa.bam/*.bam
samtools index person.bam
samtools view person.bam 20 | samtools pileup \
-f ../bwa.ref/human_g1k_v37.fasta.gz \
-t ../bwa.ref/human_g1k_v37.fasta.fai \
-g - > person.glf