Changes

From Genome Analysis Wiki
172 bytes removed, 12:13, 3 February 2012
Line 37: Line 37:  
Here, we will simply use BWA to find the most likely sequence location for each read using the <code>bwa aln</code> command. This command requires two parameters, one corresponding to the reference genome, the other corresponding to a fastq file containing reads to be mapped.  
 
   −
   bwa aln -q 15 ref/human_g1k_v37_chr20.fa fastq/NA20589.fastq.gz > bwa.sai/NA20589.sai
+
   bin/bwa aln -q 15 ref/human_g1k_v37_chr20.fa fastq/SAMPLE1.fastq > bwa.sai/SAMPLE1.sai
   −
The file NA20589.fastq.gz contains DNA sequence reads for sample NA20589. To conserve disk space, the file has been compressed with gzip but, since fastq is a simple text format, you can easily view the contents of the file using a command like:
+
The file SAMPLE1.fastq contains DNA sequence reads for sample SAMPLE1.  
 
  −
  zcat NA20589.fastq.gz | more
      
A fastq file consists of a series of multi-line records. Each record starts with a read name, followed by a DNA sequence, a separator line, and a set of per-base quality scores. Base quality scores estimate the probability of error at each sequenced base (a base quality of 10 denotes an error probability of 10%, base quality 20 denotes 1% error probability and base quality 30 denotes 0.1% error probability). These error probabilities are each encoded in a single character (for compactness) and can be decoded using an [http://www.google.com/search?q=ascii+table ASCII table]: look up the ASCII code for each quality character and subtract 33 to get the base quality. By inspecting the FastQ file you should be able to learn about the length of reads being mapped and their base qualities (is base quality typically higher at the start or end of each read?).
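The decoding rule described above can be sketched in shell. This is only a minimal illustration, assuming the Phred+33 encoding described in the text; the function names <code>phred_score</code>, <code>error_probability</code>, <code>decode_quality_line</code> and <code>read_lengths</code> are hypothetical helpers introduced here, not part of BWA or samtools.

```shell
# Hypothetical helpers (not part of BWA or samtools); assumes Phred+33 encoding.

# Decode one quality character into its base quality score.
phred_score() {
    # A leading single quote makes printf emit the character's ASCII code.
    echo $(( $(printf '%d' "'$1") - 33 ))
}

# Convert a base quality Q into an error probability, 10^(-Q/10).
error_probability() {
    awk -v q="$1" 'BEGIN { printf "%g\n", 10 ^ (-q / 10) }'
}

# Decode a whole quality string into space-separated base qualities.
decode_quality_line() {
    qual=$1
    out=""
    while [ -n "$qual" ]; do
        ch=${qual%"${qual#?}"}   # first character of the remaining string
        qual=${qual#?}           # drop that character
        out="$out$(phred_score "$ch") "
    done
    echo "${out% }"
}

# Print the length of every read in a fastq stream on stdin.
# Assumes four-line records, with the sequence on the second line of each.
read_lengths() {
    awk 'NR % 4 == 2 { print length($0) }'
}
```

For example, <code>phred_score 'I'</code> prints 40, since the ASCII code of <code>I</code> is 73, and <code>error_probability 20</code> prints 0.01. Something like <code>read_lengths < fastq/SAMPLE1.fastq | sort -n | uniq -c</code> would then summarize the read lengths in the file.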
 
Line 49: Line 47:  
The .sai alignment format is specific to BWA, so the first thing to do is to convert the alignment to a more standard format that will be compatible with downstream analysis tools. We can do this with a combination of the <code>bwa samse</code>, <code>samtools view</code>, and <code>samtools sort</code> commands.
 
   −
   bwa samse ref/human_g1k_v37_chr20.fa bwa.sai/NA20589.sai fastq/NA20589.fastq.gz | \
+
   bin/bwa samse -r "@RG\tID:ILLUMINA\tSM:SAMPLE1" ref/human_g1k_v37_chr20.fa bwa.sai/SAMPLE1.sai fastq/SAMPLE1.fastq | \
   −
      samtools view -uhS - | samtools sort -m 2000000000 - bams/NA20589
+
    samtools view -uhS - | samtools sort -m 2000000000 - bams/SAMPLE1
    
The result BAM file uses a compact binary format to represent the  
 
Line 56: Line 54:  
of the file using the <code>samtools view</code> command, like so:
 
   −
   samtools view bams/NA20589.bam | more
+
   samtools view bams/SAMPLE1.bam | more
    
The text representation of the alignment produced by <code>samtools view</code> describes
 
Line 80: Line 78:  
genome location. We do this with the <code>samtools index</code> command, like so:
 
   −
   samtools index bams/NA20589.bam
+
   samtools index bams/SAMPLE1.bam
    
=== Browsing Alignment Results ===
 