== Grouping ==

When evaluating read mappers, we should always focus on well-defined sets of reads:

*Reads with no polymorphisms.
*Reads with 1, 2, 3 or more SNPs.
*Reads with specific types of short indels (<10bp).
*Reads with larger structural variants (>100bp).

SNPs and sequencing errors should be treated differently, because SNPs can lead to mismatches in high-quality bases. In addition to grouping results by the categories above, we could separate results by the number of errors in each read.

Results should also be grouped according to whether reads are '''paired-end''' or '''single-end''' and according to '''read length'''.
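
As a rough illustration of the grouping above, the sketch below assigns each simulated read to a (variant class, layout, read length) bucket. The <code>SimRead</code> record, its field names and the bucket labels are all hypothetical and do not come from any particular simulator. The list above leaves indels of 10-100bp unassigned, so the sketch puts them in their own bucket, and it assumes that a read carrying both an indel and SNPs is classified by its indel.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class SimRead:
    """Hypothetical record describing one simulated read."""
    name: str
    length: int            # read length in bases
    paired: bool           # True for paired-end, False for single-end
    n_snps: int = 0        # number of SNPs overlapping the read
    max_indel_bp: int = 0  # longest overlapping indel/SV in bp, 0 if none


def variant_class(read):
    """Map a read to one of the variant categories listed above."""
    if read.max_indel_bp > 100:
        return "structural_variant"
    if read.max_indel_bp >= 10:
        return "mid_size_indel"  # assumption: 10-100bp is not covered by the list
    if read.max_indel_bp > 0:
        return "short_indel"
    if read.n_snps == 0:
        return "no_polymorphism"
    return "snps=3+" if read.n_snps >= 3 else f"snps={read.n_snps}"


def group_reads(reads):
    """Group read names by (variant class, layout, read length)."""
    groups = defaultdict(list)
    for read in reads:
        layout = "paired-end" if read.paired else "single-end"
        groups[(variant_class(read), layout, read.length)].append(read.name)
    return dict(groups)
```

Each bucket can then be evaluated separately, so that, for example, 100bp paired-end reads with two SNPs are never pooled with 36bp single-end reads containing a short indel.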

== Bulk Statistics ==

*Speed (millions of reads per hour)
*Memory requirements
*Size of output files
*Raw count of mapped reads

== Mapping Accuracy ==

The key quantities are:

*How many reads were not mapped at all?
*How many reads were mapped incorrectly? '''This is the least desirable outcome'''.
*How many reads were mapped correctly?

Mapping correctness should be defined at the following levels:

*Most stringent: matches the simulated location and CIGAR string.
*Less stringent: overlaps the simulated location at the base-pair level; the CIGAR string and end positions may differ.
*Incorrect: does not overlap the simulated location.
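
The definitions above could be turned into a classifier along the following lines. This is a minimal sketch, not a reference implementation: the <code>Placement</code> record and the category labels are hypothetical, positions are assumed to be 1-based leftmost coordinates as in SAM, and strand is ignored. The reference span is computed from the CIGAR operations that consume reference bases (M, D, N, = and X).

```python
import re
from dataclasses import dataclass
from typing import Optional

_CIGAR_OP = re.compile(r"(\d+)([MIDNSHP=X])")
_REF_CONSUMING = set("MDN=X")  # CIGAR operations that advance along the reference


def reference_span(cigar):
    """Number of reference bases covered by an alignment with this CIGAR."""
    return sum(int(n) for n, op in _CIGAR_OP.findall(cigar) if op in _REF_CONSUMING)


@dataclass(frozen=True)
class Placement:
    """Hypothetical record for a simulated or reported alignment."""
    chrom: str
    pos: int    # 1-based leftmost reference position
    cigar: str


def classify(truth, observed: Optional[Placement]):
    """Return 'unmapped', 'exact', 'overlap' or 'incorrect'."""
    if observed is None:
        return "unmapped"
    if observed == truth:
        return "exact"      # most stringent: same location and CIGAR string
    if observed.chrom != truth.chrom:
        return "incorrect"
    truth_end = truth.pos + reference_span(truth.cigar) - 1
    observed_end = observed.pos + reference_span(observed.cigar) - 1
    if observed.pos <= truth_end and truth.pos <= observed_end:
        return "overlap"    # less stringent: shares at least one reference base
    return "incorrect"
```

Counting the four labels within each group of reads gives the unmapped, correctly mapped and incorrectly mapped totals at both levels of stringency.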

== Mapping Qualities ==

We should evaluate mapping qualities by counting how many reads are assigned each mapping quality (or greater) and among those how many map correctly or incorrectly. This gives a Heng Li graph, where one plots number of correctly mapped reads vs. number of mismapped reads.
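
The cumulative counts behind such a graph could be computed as in the sketch below. It assumes each mapped read has already been reduced to a (mapping quality, is_correct) pair; the function name and input format are illustrative only.

```python
from collections import Counter


def mapq_curve(calls):
    """Cumulative correct/incorrect counts per mapping-quality threshold.

    calls: iterable of (mapq, is_correct) pairs, one per mapped read.
    Returns (threshold, n_correct, n_incorrect) tuples, highest threshold
    first, counting all reads with mapping quality >= threshold.
    """
    correct = Counter()
    incorrect = Counter()
    for mapq, is_correct in calls:
        (correct if is_correct else incorrect)[mapq] += 1

    curve = []
    n_correct = n_incorrect = 0
    for q in sorted(set(correct) | set(incorrect), reverse=True):
        n_correct += correct[q]
        n_incorrect += incorrect[q]
        curve.append((q, n_correct, n_incorrect))
    return curve
```

Plotting <code>n_correct</code> against <code>n_incorrect</code> for each threshold yields one curve per mapper, and curves for different mappers can be compared within the same group of reads.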

== Available Test Datasets ==