== Grouping ==

When evaluating read mappers, we should always focus on well-defined sets of reads:

* Reads with no polymorphisms.
* Reads with 1, 2, 3 or more SNPs.
* Reads with specific types of short indels (<10bp).
* Reads with larger structural variants (>100bp).

SNPs and sequencing errors are different because SNPs can lead to mismatches in high-quality bases. In addition to grouping by the categories above, we could separate results by the number of errors in each read.

Results should also be grouped according to whether reads are '''paired-end''' or '''single-end''' and according to '''read length'''.

== Bulk Statistics ==

* Speed (millions of reads per hour)
* Memory requirements
* Size of output files
* Raw count of mapped reads

== Mapping Accuracy ==

The key quantities are:

* How many reads were not mapped at all?
* How many reads were mapped incorrectly? '''This is the least desirable outcome'''.
* How many reads were mapped correctly?

Correctness should be judged at the following levels:

* Most stringent: the mapping matches the simulated location and CIGAR string.
* Less stringent: the mapping overlaps the simulated location at the base-pair level; the CIGAR string and end positions may differ.
* Incorrect: the mapping does not overlap the simulated location.

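The three levels above can be sketched as a small classifier. This is only an illustration under stated assumptions: positions are leftmost reference coordinates as in SAM, the chromosome must match, strand is ignored, and the names <code>reference_span</code> and <code>classify</code> are hypothetical rather than part of any existing tool.

```python
import re

# One CIGAR operation: a length followed by an operation code.
CIGAR_OP = re.compile(r"(\d+)([MIDNSHP=X])")


def reference_span(pos, cigar):
    """Return the half-open reference interval [start, end) of an alignment.

    Only operations that consume reference bases (M, D, N, =, X) add length.
    """
    length = sum(int(n) for n, op in CIGAR_OP.findall(cigar) if op in "MDN=X")
    return pos, pos + length


def classify(sim_chrom, sim_pos, sim_cigar, map_chrom, map_pos, map_cigar):
    """Classify one read as 'unmapped', 'exact', 'overlap' or 'incorrect'.

    Assumption: map_chrom is None when the mapper reported no alignment.
    """
    if map_chrom is None:
        return "unmapped"
    if map_chrom != sim_chrom:
        return "incorrect"
    # Most stringent: same start position and same CIGAR string.
    if map_pos == sim_pos and map_cigar == sim_cigar:
        return "exact"
    # Less stringent: the two reference intervals share at least one base.
    s0, s1 = reference_span(sim_pos, sim_cigar)
    m0, m1 = reference_span(map_pos, map_cigar)
    if m0 < s1 and s0 < m1:
        return "overlap"
    return "incorrect"
```

Counting the four labels over all reads in a group gives the key quantities listed above; "exact" and "overlap" can be reported separately or merged, depending on the stringency wanted.
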
== Mapping Qualities ==

We should evaluate mapping qualities by counting how many reads are assigned each mapping quality (or greater) and, among those, how many map correctly or incorrectly. This gives a Heng Li graph, where one plots the number of correctly mapped reads vs. the number of mismapped reads.

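The cumulative counts behind such a graph can be computed as in the sketch below. It assumes each mapped read has already been reduced to a <code>(mapq, is_correct)</code> pair; the function name <code>mapq_curve</code> is hypothetical.

```python
from collections import Counter


def mapq_curve(reads):
    """Return (threshold, n_correct, n_incorrect) tuples, highest threshold first.

    reads: iterable of (mapq, is_correct) pairs, one per mapped read.
    Each tuple counts the reads whose mapping quality is >= threshold.
    """
    correct = Counter()
    incorrect = Counter()
    for mapq, is_correct in reads:
        if is_correct:
            correct[mapq] += 1
        else:
            incorrect[mapq] += 1

    curve = []
    n_correct = n_incorrect = 0
    # Walk from the highest quality downwards, accumulating as we go.
    for q in sorted(set(correct) | set(incorrect), reverse=True):
        n_correct += correct[q]
        n_incorrect += incorrect[q]
        curve.append((q, n_correct, n_incorrect))
    return curve
```

Plotting the second column against the third, one point per threshold, gives one curve per mapper and read group.
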
== Available Test Datasets ==

* Location: wonderland:~zhanxw/BigSimulation
* Scenarios:
** No polymorphism
** 1, 2 or 3 SNPs
** Deletions of length 5, 30 or 200
** Insertions of length 5 or 30
* Quality string
** Taken from the 75th percentile of the Sanger Illumina 108-mer test data set.
* Format
** Both base space and color space.
** Both single end and paired end; paired-end reads are given an insert size of 1500.

* Program (generator)

 Usage:
 generator [bs|cs] [se|pe] [exact|snpXX|indelXX|delXX] -n numbers -l readLength -i insertSize
   exact:   sample exactly from the reference genome
   snpXX:   introduce a total of XX SNPs into a single read or a pair of reads
   indelXX: insert a random piece of length XX into a single read, or at the same position for paired reads
   delXX:   delete a random piece of length XX from a single read, or at the same position for paired reads

Example:

 ./generator bs se exact -n 100 -l 35

* Output
** Simulation files are named like BS_SE_EXACT_1000000_35, meaning base space, single end, exact (no polymorphism), 1M reads, 35 bp per read.
** For each read, the tag is named in a similar way to Sanger's.

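When collecting results per scenario, it can help to recover the simulation parameters from the file name. The sketch below is based only on the single example name given above. It assumes five underscore-separated fields, and that <code>CS</code> and <code>PE</code> are the codes for color space and paired end (by analogy with the generator's <code>cs</code> and <code>pe</code> options). Paired-end names may carry extra fields that this does not handle. <code>SimulationName</code> and <code>parse_simulation_name</code> are hypothetical names.

```python
from typing import NamedTuple


class SimulationName(NamedTuple):
    space: str        # "base space" or "color space"
    layout: str       # "single end" or "paired end"
    scenario: str     # scenario code as written in the name, e.g. "EXACT"
    n_reads: int
    read_length: int


# BS and SE come from the documented example; CS and PE are assumed codes.
SPACES = {"BS": "base space", "CS": "color space"}
LAYOUTS = {"SE": "single end", "PE": "paired end"}


def parse_simulation_name(name):
    """Split a name such as BS_SE_EXACT_1000000_35 into its fields."""
    parts = name.split("_")
    if len(parts) != 5:
        raise ValueError(f"expected 5 fields, got {len(parts)}: {name!r}")
    space, layout, scenario, n_reads, read_length = parts
    if space not in SPACES or layout not in LAYOUTS:
        raise ValueError(f"unrecognised space or layout code in {name!r}")
    return SimulationName(
        SPACES[space], LAYOUTS[layout], scenario, int(n_reads), int(read_length)
    )
```
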