Difference between revisions of "BaseQualityCheck"

From Genome Analysis Wiki
Jump to navigationJump to search
 
(4 intermediate revisions by the same user not shown)
Line 1: Line 1:
 
'''Base Quality Check'''
 
'''Base Quality Check'''
  
'''(May 11, 2010 - Paul, Xiaowei)'''
+
'''(Aug 27, 2010 - Paul, Xiaowei)'''
  
  
Line 9: Line 9:
  
 
It read SAM/BAM file line by line. Then according to CIGAR string, it compares the alignment to reference genome (base by base) and record match and mismatch frequencies grouped by observed base quality.  
 
It read SAM/BAM file line by line. Then according to CIGAR string, it compares the alignment to reference genome (base by base) and record match and mismatch frequencies grouped by observed base quality.  
The output will be observed quality (generated by Illumina machine) and empirical quality (calculated by Prob(Mismatch bases | base quality Q) = (Total number of mismatched bases with base quality Q) / (Total number of bases with base quality Q)), both in Phred quality score.
+
The output will be observed quality (generated by Illumina machine) and empirical quality (calculated by Prob(Mismatch bases | base quality Q) = (Total number of mismatched bases with base quality Q) / (Total number of bases with base quality Q)), both in Phred quality score. For convenience, you can pipe the output by '| Rscript --vanilla -' to obtain a graph.
  
 
By default, we omit soft clips, insertion and deletion.  
 
By default, we omit soft clips, insertion and deletion.  
Line 17: Line 17:
 
'''Syntax''':
 
'''Syntax''':
  
  baseQualityCheck [-c max record count] [-q minimumMapQuality] [-r reference] [-s dbSNP file] [-v]
+
  baseQualityCheck [-c max record count] [-q minimumMapQuality] [-r reference] [-s dbSNP file] [-v] [-g or -R] [-2]
  -c -> only process first (max record count).
+
  -c -> only process first (max record count) of alignment.
 
  -q -> alignment with less than minimum mapping quality will not be counted
 
  -q -> alignment with less than minimum mapping quality will not be counted
 
  -r -> reference genome (in KARMA format)
 
  -r -> reference genome (in KARMA format)
 
  -s -> load SNP positions from the file.  It may either be a text file with chr/index pairs, using 1-index position, one per line, or you may use a file created from mkgenomevector (binary memory mapped file). For NCBI 37, a sample dbSNP file is located in /home/bingshan/data/db/dbSNP130.UCSC.coordinates.tbl
 
  -s -> load SNP positions from the file.  It may either be a text file with chr/index pairs, using 1-index position, one per line, or you may use a file created from mkgenomevector (binary memory mapped file). For NCBI 37, a sample dbSNP file is located in /home/bingshan/data/db/dbSNP130.UCSC.coordinates.tbl
 
  -v -> output SAM record in which mismatched bases exist
 
  -v -> output SAM record in which mismatched bases exist
   
+
  -g -> output in GNU Plot code, you can pipe the output using '|gnuplot'
 +
-R -> output in R code, you can pipe the output using '|Rscript --vanilla - '
 +
-2 -> use SNP position for color space reads
 +
 
 +
Example:
 +
Check first 20000 lines of abc.sam, using /data/local/ref/karma.ref/human.g1k.v37.fa as reference genome, excluding SNP sites specified in /home/bingshan/data/db/dbSNP130.UCSC.coordinates.tbl
 +
baseQualityCheck -c 20000 -r /data/local/ref/karma.ref/human.g1k.v37.fa -s /home/bingshan/data/db/dbSNP130.UCSC.coordinates.tbl abc.sam
 +
 
 +
 
 
Thank '''Bingshan''' for his qPlot program and his input to finish this program.
 
Thank '''Bingshan''' for his qPlot program and his input to finish this program.

Latest revision as of 04:35, 27 August 2010

Base Quality Check

(Aug 27, 2010 - Paul, Xiaowei)


Location: $(repository)/baseQualityCheck/baseQualityCheck

Algorithm:

It read SAM/BAM file line by line. Then according to CIGAR string, it compares the alignment to reference genome (base by base) and record match and mismatch frequencies grouped by observed base quality. The output will be observed quality (generated by Illumina machine) and empirical quality (calculated by Prob(Mismatch bases | base quality Q) = (Total number of mismatched bases with base quality Q) / (Total number of bases with base quality Q)), both in Phred quality score. For convenience, you can pipe the output by '| Rscript --vanilla -' to obtain a graph.

By default, we omit soft clips, insertion and deletion.

Note, we strongly recommend excluding dbSNP sites, or else you are likely to underestimate empirical quality. The program can read a plain text file to specify dbSNP positions via '-s' option.

Syntax:

baseQualityCheck [-c max record count] [-q minimumMapQuality] [-r reference] [-s dbSNP file] [-v] [-g or -R] [-2]
-c -> only process first (max record count) of alignment.
-q -> alignment with less than minimum mapping quality will not be counted
-r -> reference genome (in KARMA format)
-s -> load SNP positions from the file.  It may either be a text file with chr/index pairs, using 1-index position, one per line, or you may use a file created from mkgenomevector (binary memory mapped file). For NCBI 37, a sample dbSNP file is located in /home/bingshan/data/db/dbSNP130.UCSC.coordinates.tbl
-v -> output SAM record in which mismatched bases exist
-g -> output in GNU Plot code, you can pipe the output using '|gnuplot'
-R -> output in R code, you can pipe the output using '|Rscript --vanilla - '
-2 -> use SNP position for color space reads
Example:
Check first 20000 lines of abc.sam, using /data/local/ref/karma.ref/human.g1k.v37.fa as reference genome, excluding SNP sites specified in /home/bingshan/data/db/dbSNP130.UCSC.coordinates.tbl
baseQualityCheck -c 20000 -r /data/local/ref/karma.ref/human.g1k.v37.fa -s /home/bingshan/data/db/dbSNP130.UCSC.coordinates.tbl abc.sam


Thank Bingshan for his qPlot program and his input to finish this program.