Changes

From Genome Analysis Wiki
no edit summary
Line 1: Line 1: −
== Grouping ==
+
== Grouping ==
   −
When evaluating read mappers, we should always focus on well defined sets of reads:
+
When evaluating read mappers, we should always focus on well defined sets of reads:  
   −
* Reads with no polymorphisms.
+
*Reads with no polymorphisms.  
* Reads with 1, 2, 3 or more SNPs.
+
*Reads with 1, 2, 3 or more SNPs.  
* Reads with specific types of short indels (<10bp).
+
*Reads with specific types of short indels (&lt;10bp).  
* Reads with larger structural variants (>100bp).
+
*Reads with larger structural variants (&gt;100bp).
   −
SNPs and errors are different because SNPs can lead to mismatches in high-quality bases. In addition to integrating according to the metrics above, we could separate results by the number of errors in each read.
+
SNPs and errors are different because SNPs can lead to mismatches in high-quality bases. In addition to integrating according to the metrics above, we could separate results by the number of errors in each read.  
   −
Results should also be grouped according to whether reads are '''paired-end''' or '''single-end''' and according to '''read-length'''.
+
Results should also be grouped according to whether reads are '''paired-end''' or '''single-end''' and according to '''read-length'''.  
   −
== Bulk Statistics ==
+
== Bulk Statistics ==
   −
* Speed (millions of reads per hour)
+
*Speed (millions of reads per hour)  
* Memory requirements
+
*Memory requirements  
* Size of output files
+
*Size of output files  
* Raw count of mapped reads
+
*Raw count of mapped reads
   −
== Mapping Accuracy ==
+
== Mapping Accuracy ==
   −
The key quantities are:
+
The key quantities are:  
   −
* How many reads were not mapped at all?
+
*How many reads were not mapped at all?  
* How many reads were mapped incorrectly? '''This is the least desirable outcome'''.
+
*How many reads were mapped incorrectly? '''This is the least desirable outcome'''.  
* How many reads were mapped correctly?
+
*How many reads were mapped correctly?
   −
Correct mapping should be defined as:
+
Correct mapping should be defined as:  
   −
* Most stringent: matches simulated location and CIGAR string.
+
*Most stringent: matches simulated location and CIGAR string.  
* Less stringent: overlaps simulated location at base-pair level, CIGAR string and end positions may differ.
+
*Less stringent: overlaps simulated location at base-pair level, CIGAR string and end positions may differ.  
* Incorrect: Doesn't overlap simulated location.
+
*Incorrect: Doesn't overlap simulated location.
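
The three levels of correctness above can be sketched in Python. This is a minimal illustration, not the evaluation code used for this page: the function names, the tuple of arguments, and the half-open interval convention are all assumptions. The reference span is derived from the CIGAR string by summing the operations that consume reference bases (M, D, N, = and X).

```python
import re

# One CIGAR operation: a length followed by an operation letter.
_CIGAR_OP = re.compile(r"(\d+)([MIDNSHP=X])")
# Operations that advance along the reference sequence.
_REF_CONSUMING = set("MDN=X")


def reference_span(cigar):
    """Number of reference bases covered by an alignment with this CIGAR."""
    return sum(int(n) for n, op in _CIGAR_OP.findall(cigar) if op in _REF_CONSUMING)


def classify_mapping(sim_chrom, sim_pos, sim_cigar, map_chrom, map_pos, map_cigar):
    """Return 'exact', 'overlap' or 'incorrect' (hypothetical labels)."""
    # Most stringent: same location and same CIGAR string.
    if (sim_chrom, sim_pos, sim_cigar) == (map_chrom, map_pos, map_cigar):
        return "exact"
    # Less stringent: the two alignments share at least one reference base.
    if sim_chrom == map_chrom:
        sim_end = sim_pos + reference_span(sim_cigar)  # half-open interval end
        map_end = map_pos + reference_span(map_cigar)
        if sim_pos < map_end and map_pos < sim_end:
            return "overlap"
    return "incorrect"
```

For example, a read simulated at position 100 with CIGAR 50M and mapped at position 120 with CIGAR 30M5D20M would fall into the less stringent category, since the two reference intervals share bases.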
   −
== Mapping Qualities ==
+
== Mapping Qualities ==
   −
We should evaluate mapping qualities by counting how many reads are assigned each mapping quality (or greater) and among those how many map correctly or incorrectly. This gives a Heng Li graph, where one plots number of correctly mapped reads vs. number of mismapped reads.
+
We should evaluate mapping qualities by counting how many reads are assigned each mapping quality (or greater) and among those how many map correctly or incorrectly. This gives a Heng Li graph, where one plots number of correctly mapped reads vs. number of mismapped reads.  
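
The cumulative counts behind such a graph can be sketched as follows. The input format (one `(mapq, is_correct)` pair per mapped read) and the function name are assumptions made for illustration. Each output point gives, for one mapping-quality threshold, the number of correctly and incorrectly mapped reads with that quality or greater.

```python
def cumulative_by_mapq(reads):
    """reads: iterable of (mapq, is_correct) pairs for mapped reads.

    Returns a list of (threshold, n_correct, n_incorrect) tuples, ordered
    from the highest mapping quality down to the lowest.
    """
    # Tally correct and incorrect reads at each distinct mapping quality.
    per_quality = {}
    for mapq, is_correct in reads:
        correct, incorrect = per_quality.get(mapq, (0, 0))
        if is_correct:
            correct += 1
        else:
            incorrect += 1
        per_quality[mapq] = (correct, incorrect)

    # Accumulate from high to low quality so each point covers "mapq or greater".
    points = []
    total_correct = total_incorrect = 0
    for mapq in sorted(per_quality, reverse=True):
        correct, incorrect = per_quality[mapq]
        total_correct += correct
        total_incorrect += incorrect
        points.append((mapq, total_correct, total_incorrect))
    return points
```

Plotting `n_correct` against `n_incorrect` for the returned points gives one curve per mapper.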
   −
== Available Test Datasets ==
+
== Available Test Datasets ==
   −
*Location: wonderland:~zhanxw/BigSimulation
+
*Location: wonderland:~zhanxw/BigSimulation  
*Scenarios:  
+
*Scenarios:
no polymorphism ;
  −
1, 2, 3 SNP ;
  −
Deletion 5, 30, 200;
  −
Insertion 5, 30
  −
* Quality String
  −
Picked the 75th percentile of the Sanger Illumina 108-mer test data set
  −
* Format
  −
both base space and color space
  −
both single end and paired end, and paired end reads are given insert size 1500.
     −
* Program (generator)  
+
no polymorphism&nbsp;; 1, 2, 3 SNP&nbsp;; Deletion 5, 30, 200; Insertion 5, 30
 +
 
 +
*Quality String
 +
 
 +
Picked the 75th percentile of the Sanger Illumina 108-mer test data set
 +
 
 +
*Format
 +
 
 +
Both base space and color space; both single end and paired end. Paired-end reads are given an insert size of 1500.
 +
 
 +
*Program (generator)
 +
 
 +
Usage:
   −
Usage:
   
         generator [bs|cs] [se|pe] [exact|snpXX|indelXX|delXX] -n numbers -l readLength -i insertSize
 
         generator [bs|cs] [se|pe] [exact|snpXX|indelXX|delXX] -n numbers -l readLength -i insertSize
 
         exact: Accurate sample from reference genome
 
         exact: Accurate sample from reference genome
Line 61: Line 63:  
         e.g. ./generator bs se exact -n 100 -l 35
 
         e.g. ./generator bs se exact -n 100 -l 35
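
When many scenarios need to be generated, the command line can be assembled programmatically. The helper below is a hypothetical sketch that only follows the usage string shown above. The `./generator` path and the choice to pass `-i` only when an insert size is supplied are assumptions.

```python
def generator_args(space, layout, scenario, n_reads, read_length, insert_size=None):
    """Build an argument list matching the usage string of the generator."""
    if space not in ("bs", "cs"):
        raise ValueError("space must be 'bs' (base space) or 'cs' (color space)")
    if layout not in ("se", "pe"):
        raise ValueError("layout must be 'se' (single end) or 'pe' (paired end)")
    args = ["./generator", space, layout, scenario,
            "-n", str(n_reads), "-l", str(read_length)]
    # Assumption: the insert size is only relevant when explicitly provided.
    if insert_size is not None:
        args += ["-i", str(insert_size)]
    return args
```

The resulting list can be passed to `subprocess.run` on a machine where the generator binary is available.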
   −
* Output
+
*Output
Simulation files are named like BS_SE_EXACT_1000000_35, meaning base space, single end, exact (no polymorphism), 1M reads, 35 bp per read.  
+
 
For each read, the tag was named in a similar way to Sanger's.
+
Simulation files are named like BS_SE_EXACT_1000000_35, meaning base space, single end, exact (no polymorphism), 1M reads, 35 bp per read. Each read's tag is named in a similar way to Sanger's.  
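
The naming convention can be decoded with a few lines of Python. This is a sketch under assumptions: the `SimName` type and `parse_sim_name` function are hypothetical, and the optional `.fastq` extension and trailing mate field are inferred from the file names listed in the results on this page.

```python
from typing import NamedTuple, Optional


class SimName(NamedTuple):
    space: str            # BS (base space) or CS (color space)
    layout: str           # SE (single end) or PE (paired end)
    scenario: str         # e.g. EXACT, SNP1, DEL30
    n_reads: int
    read_length: int
    mate: Optional[str]   # assumed trailing field on paired-end files


def parse_sim_name(name):
    """Split a simulation file name into its underscore-separated fields."""
    stem = name[:-len(".fastq")] if name.endswith(".fastq") else name
    parts = stem.split("_")
    if len(parts) not in (5, 6):
        raise ValueError("unexpected simulation file name: " + name)
    space, layout, scenario, n_reads, read_length = parts[:5]
    mate = parts[5] if len(parts) == 6 else None
    return SimName(space, layout, scenario, int(n_reads), int(read_length), mate)
```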
 +
 
 +
<br>
 +
 
 +
== Bulk Statistics Results ==
 +
 
 +
<br>
 +
 
 +
BWA (seconds) Karma (seconds) Scenario
 +
2594 7182 BS_SE_DEL200_1000000_50.fastq
 +
2641 -1 BS_SE_DEL30_1000000_50.fastq
 +
2355 -1 BS_SE_DEL5_1000000_50.fastq
 +
441 7941 BS_SE_EXACT_1000000_50.fastq
 +
809 282 BS_SE_INDEL30_1000000_50.fastq
 +
2217 -1 BS_SE_INDEL5_1000000_50.fastq
 +
645 7206 BS_SE_SNP1_1000000_50.fastq
 +
1102 -1 BS_SE_SNP2_1000000_50.fastq
 +
1142 -1 BS_SE_SNP3_1000000_50.fastq
 +
6536 8874 BS_PE_DEL200_1000000_50_?.fastq
 +
6699 9017 BS_PE_DEL30_1000000_50_?.fastq
 +
6468 9033 BS_PE_DEL5_1000000_50_?.fastq
 +
1743 10112 BS_PE_EXACT_1000000_50_?.fastq
 +
2305 231 BS_PE_INDEL30_1000000_50_?.fastq
 +
5703 2989 BS_PE_INDEL5_1000000_50_?.fastq
 +
1974 3718 BS_PE_SNP1_1000000_50_?.fastq
 +
2396 3339 BS_PE_SNP2_1000000_50_?.fastq
 +
2817 3131 BS_PE_SNP3_1000000_50_?.fastq
 +
4362 16074 CS_PE_DEL200_1000000_50_?.fastq
 +
4385 -1 CS_PE_DEL30_1000000_50_?.fastq
 +
4373 9287 CS_PE_DEL5_1000000_50_?.fastq
 +
773 -1 CS_PE_EXACT_1000000_50_?.fastq
 +
1735 3142 CS_PE_INDEL30_1000000_50_?.fastq
 +
4023 8591 CS_PE_INDEL5_1000000_50_?.fastq
 +
1034 10528 CS_PE_SNP1_1000000_50_?.fastq
 +
2236 -1 CS_PE_SNP2_1000000_50_?.fastq
 +
3810 6617 CS_PE_SNP3_1000000_50_?.fastq
 +
7129 1493 CS_SE_DEL200_1000000_50.fastq
 +
7115 1513 CS_SE_DEL30_1000000_50.fastq
 +
7065 1542 CS_SE_DEL5_1000000_50.fastq
 +
1544 1666 CS_SE_EXACT_1000000_50.fastq
 +
2954 289 CS_SE_INDEL30_1000000_50.fastq
 +
6547 1390 CS_SE_INDEL5_1000000_50.fastq
 +
1690 1661 CS_SE_SNP1_1000000_50.fastq
 +
2853 1449 CS_SE_SNP2_1000000_50.fastq
 +
4039 1237 CS_SE_SNP3_1000000_50.fastq
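
To compare these timings with the speed metric listed under Bulk Statistics (millions of reads per hour), the seconds can be converted as sketched below. The function name is hypothetical, and treating non-positive times such as -1 as "no usable measurement" is an assumption, since the page does not say what -1 means.

```python
def millions_of_reads_per_hour(n_reads, seconds):
    """Convert a wall-clock time for n_reads into millions of reads per hour."""
    # Assumption: a non-positive time (e.g. -1) marks a run with no usable timing.
    if seconds <= 0:
        return None
    return n_reads / seconds * 3600 / 1e6
```

For instance, the BWA time of 441 seconds for the 1,000,000 reads of BS_SE_EXACT_1000000_50.fastq corresponds to roughly 8.16 million reads per hour.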