Line 1: |
Line 1: |
− | == Grouping == | + | == Grouping == |
| | | |
− | When evaluating read mappers, we should always focus on well-defined sets of reads: | + | When evaluating read mappers, we should always focus on well-defined sets of reads: |
| | | |
− | * Reads with no polymorphisms. | + | *Reads with no polymorphisms. |
− | * Reads with 1, 2, 3 or more SNPs. | + | *Reads with 1, 2, 3 or more SNPs. |
− | * Reads with specific types of short indels (<10bp). | + | *Reads with specific types of short indels (<10bp). |
− | * Reads with larger structural variants (>100bp). | + | *Reads with larger structural variants (>100bp). |
| | | |
− | SNPs and errors are different because SNPs can lead to mismatches in high-quality bases. In addition to integrating according to the metrics above, we could separate results by the number of errors in each read. | + | SNPs and errors are different because SNPs can lead to mismatches in high-quality bases. In addition to integrating according to the metrics above, we could separate results by the number of errors in each read. |
| | | |
− | Results should also be grouped according to whether reads are '''paired-end''' or '''single-end''' and according to '''read length'''. | + | Results should also be grouped according to whether reads are '''paired-end''' or '''single-end''' and according to '''read length'''. |
| | | |
− | == Bulk Statistics == | + | == Bulk Statistics == |
| | | |
− | * Speed (millions of reads per hour) | + | *Speed (millions of reads per hour) |
− | * Memory requirements | + | *Memory requirements |
− | * Size of output files | + | *Size of output files |
− | * Raw count of mapped reads | + | *Raw count of mapped reads |
| | | |
− | == Mapping Accuracy == | + | == Mapping Accuracy == |
| | | |
− | The key quantities are: | + | The key quantities are: |
| | | |
− | * How many reads were not mapped at all? | + | *How many reads were not mapped at all? |
− | * How many reads were mapped incorrectly? '''This is the least desirable outcome'''. | + | *How many reads were mapped incorrectly? '''This is the least desirable outcome'''. |
− | * How many reads were mapped correctly? | + | *How many reads were mapped correctly? |
| | | |
− | Correct mapping should be defined as: | + | Correct mapping should be defined as: |
| | | |
− | * Most stringent: matches simulated location and CIGAR string. | + | *Most stringent: matches simulated location and CIGAR string. |
− | * Less stringent: overlaps the simulated location at the base-pair level; the CIGAR string and end positions may differ. | + | *Less stringent: overlaps the simulated location at the base-pair level; the CIGAR string and end positions may differ. |
− | * Incorrect: does not overlap the simulated location. | + | *Incorrect: does not overlap the simulated location. |
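The three categories above can be sketched as a small classifier. This is a minimal illustration, not the evaluation code used for this page: the function name, the argument layout, and the 0-based half-open coordinate convention are all assumptions made for the example.

```python
def classify_mapping(sim_chrom, sim_start, sim_end, sim_cigar,
                     map_chrom, map_start, map_end, map_cigar):
    """Classify one mapped read against its simulated origin.

    Assumption: coordinates are 0-based, half-open [start, end).
    Returns "exact", "overlap", or "incorrect".
    """
    if map_chrom != sim_chrom:
        return "incorrect"
    # Most stringent: same start position and same CIGAR string.
    if map_start == sim_start and map_cigar == sim_cigar:
        return "exact"
    # Less stringent: the intervals share at least one base.
    # Half-open intervals overlap when each starts before the other ends.
    if map_start < sim_end and sim_start < map_end:
        return "overlap"
    return "incorrect"
```

Unmapped reads would be counted separately before calling such a function, since they have no mapped position to compare.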
| | | |
− | == Mapping Qualities == | + | == Mapping Qualities == |
| | | |
− | We should evaluate mapping qualities by counting how many reads are assigned each mapping quality (or greater) and among those how many map correctly or incorrectly. This gives a Heng Li graph, where one plots number of correctly mapped reads vs. number of mismapped reads. | + | We should evaluate mapping qualities by counting how many reads are assigned each mapping quality (or greater) and among those how many map correctly or incorrectly. This gives a Heng Li graph, where one plots number of correctly mapped reads vs. number of mismapped reads. |
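The cumulative counts behind such a graph could be computed roughly as follows. This is a sketch under assumptions: the input is taken to be a list of `(mapq, is_correct)` pairs for mapped reads, and `mapq_curve` is a hypothetical helper name, not part of any mapper or toolkit.

```python
from collections import Counter


def mapq_curve(reads):
    """Return cumulative correct/incorrect counts per mapping-quality threshold.

    reads: iterable of (mapq, is_correct) pairs, one per mapped read.
    Result: list of (threshold, n_correct, n_incorrect), ordered from the
    highest threshold to the lowest; each row counts reads with
    mapq >= threshold.
    """
    correct = Counter()
    incorrect = Counter()
    for mapq, is_correct in reads:
        if is_correct:
            correct[mapq] += 1
        else:
            incorrect[mapq] += 1

    curve = []
    n_correct = n_incorrect = 0
    # Walk thresholds from strictest to most permissive, accumulating counts.
    for q in sorted(set(correct) | set(incorrect), reverse=True):
        n_correct += correct[q]
        n_incorrect += incorrect[q]
        curve.append((q, n_correct, n_incorrect))
    return curve
```

Plotting `n_correct` against `n_incorrect` for each row gives one curve per mapper, so mappers can be compared at the same number of mismapped reads.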
| | | |
− | == Available Test Datasets == | + | == Available Test Datasets == |
| | | |
− | *Location: wonderland:~zhanxw/BigSimulation | + | *Location: wonderland:~zhanxw/BigSimulation |
− | *Scenarios: | + | *Scenarios: |
− | no polymorphism ;
| |
− | 1, 2, 3 SNP ;
| |
− | Deletion 5, 30, 200;
| |
− | Insertion 5, 30
| |
− | * Quality String
| |
− | Picked the 75th percentile of the Sanger Illumina 108-mer test data set
| |
− | * Format
| |
− | both base space and color space
| |
− | both single end and paired end, and paired end reads are given insert size 1500.
| |
| | | |
− | * Program (generator) | + | no polymorphism ; 1, 2, 3 SNP ; Deletion 5, 30, 200; Insertion 5, 30 |
| + | |
| + | *Quality String |
| + | |
| + | Picked the 75th percentile of the Sanger Illumina 108-mer test data set |
| + | |
| + | *Format |
| + | |
| + | Both base space and color space; both single end and paired end. Paired-end reads are given an insert size of 1500. |
| + | |
| + | *Program (generator) |
| + | |
| + | Usage: |
| | | |
− | Usage:
| |
| generator [bs|cs] [se|pe] [exact|snpXX|indelXX|delXX] -n numbers -l readLength -i insertSize | | generator [bs|cs] [se|pe] [exact|snpXX|indelXX|delXX] -n numbers -l readLength -i insertSize |
| exact: Accurate sample from reference genome | | exact: Accurate sample from reference genome |
Line 61: |
Line 63: |
| e.g. ./generator bs se exact -n 100 -l 35 | | e.g. ./generator bs se exact -n 100 -l 35 |
| | | |
− | * Output | + | *Output |
− | Simulation files are named like BS_SE_EXACT_1000000_35, meaning base space, single end, exact (no polymorphism), 1M reads, 35 bp per read. | + | |
− | For each read, the tag was named in a similar way to Sanger's. | + | Simulation files are named like BS_SE_EXACT_1000000_35, meaning base space, single end, exact (no polymorphism), 1M reads, 35 bp per read. For each read, the tag was named in a similar way to Sanger's. |
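The naming scheme can be decoded mechanically. The sketch below is an illustration only: `parse_simulation_name` is a hypothetical helper, and the handling of an optional trailing mate field and `.fastq` extension is an assumption based on file names such as `BS_PE_DEL200_1000000_50_?.fastq`.

```python
import re

# Fields: space, ends, scenario (+ optional size), read count, read length,
# then an optional mate field and an optional ".fastq" extension.
_NAME_RE = re.compile(
    r"^(?P<space>BS|CS)_(?P<ends>SE|PE)_(?P<scenario>[A-Z]+)(?P<size>\d*)"
    r"_(?P<n_reads>\d+)_(?P<read_length>\d+)"
    r"(?:_(?P<mate>[^.]+))?(?:\.fastq)?$"
)


def parse_simulation_name(name):
    """Split a simulation file name into its fields; raise ValueError if it does not fit."""
    match = _NAME_RE.match(name)
    if match is None:
        raise ValueError(f"unrecognized simulation file name: {name!r}")
    fields = match.groupdict()
    # Convert numeric fields; scenarios such as EXACT carry no size.
    fields["size"] = int(fields["size"]) if fields["size"] else None
    fields["n_reads"] = int(fields["n_reads"])
    fields["read_length"] = int(fields["read_length"])
    return fields
```

For example, `parse_simulation_name("BS_SE_EXACT_1000000_35")` yields base space (`BS`), single end (`SE`), scenario `EXACT`, 1000000 reads, and a read length of 35.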
| + | |
| + | <br> |
| + | |
| + | = Bulk Statistics Results = |
| + | |
| + | <br> |
| + | |
| + | BWA (seconds) Karma (seconds) Scenario |
| + | 2594 7182 BS_SE_DEL200_1000000_50.fastq |
| + | 2641 -1 BS_SE_DEL30_1000000_50.fastq |
| + | 2355 -1 BS_SE_DEL5_1000000_50.fastq |
| + | 441 7941 BS_SE_EXACT_1000000_50.fastq |
| + | 809 282 BS_SE_INDEL30_1000000_50.fastq |
| + | 2217 -1 BS_SE_INDEL5_1000000_50.fastq |
| + | 645 7206 BS_SE_SNP1_1000000_50.fastq |
| + | 1102 -1 BS_SE_SNP2_1000000_50.fastq |
| + | 1142 -1 BS_SE_SNP3_1000000_50.fastq |
| + | 6536 8874 BS_PE_DEL200_1000000_50_?.fastq |
| + | 6699 9017 BS_PE_DEL30_1000000_50_?.fastq |
| + | 6468 9033 BS_PE_DEL5_1000000_50_?.fastq |
| + | 1743 10112 BS_PE_EXACT_1000000_50_?.fastq |
| + | 2305 231 BS_PE_INDEL30_1000000_50_?.fastq |
| + | 5703 2989 BS_PE_INDEL5_1000000_50_?.fastq |
| + | 1974 3718 BS_PE_SNP1_1000000_50_?.fastq |
| + | 2396 3339 BS_PE_SNP2_1000000_50_?.fastq |
| + | 2817 3131 BS_PE_SNP3_1000000_50_?.fastq |
| + | 4362 16074 CS_PE_DEL200_1000000_50_?.fastq |
| + | 4385 -1 CS_PE_DEL30_1000000_50_?.fastq |
| + | 4373 9287 CS_PE_DEL5_1000000_50_?.fastq |
| + | 773 -1 CS_PE_EXACT_1000000_50_?.fastq |
| + | 1735 3142 CS_PE_INDEL30_1000000_50_?.fastq |
| + | 4023 8591 CS_PE_INDEL5_1000000_50_?.fastq |
| + | 1034 10528 CS_PE_SNP1_1000000_50_?.fastq |
| + | 2236 -1 CS_PE_SNP2_1000000_50_?.fastq |
| + | 3810 6617 CS_PE_SNP3_1000000_50_?.fastq |
| + | 7129 1493 CS_SE_DEL200_1000000_50.fastq |
| + | 7115 1513 CS_SE_DEL30_1000000_50.fastq |
| + | 7065 1542 CS_SE_DEL5_1000000_50.fastq |
| + | 1544 1666 CS_SE_EXACT_1000000_50.fastq |
| + | 2954 289 CS_SE_INDEL30_1000000_50.fastq |
| + | 6547 1390 CS_SE_INDEL5_1000000_50.fastq |
| + | 1690 1661 CS_SE_SNP1_1000000_50.fastq |
| + | 2853 1449 CS_SE_SNP2_1000000_50.fastq |
| + | 4039 1237 CS_SE_SNP3_1000000_50.fastq |
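To compare these timings with the speed metric listed under Bulk Statistics (millions of reads per hour), the run times can be converted as in the sketch below. Two assumptions are made here and are not stated on this page: that the read count in the file name (1,000,000) is the number of reads processed, and that a time of -1 marks a run with no usable timing.

```python
def millions_of_reads_per_hour(n_reads, seconds):
    """Convert a run time in seconds to throughput in millions of reads per hour.

    Returns None for non-positive times (assumed here to mean the run has
    no usable timing, e.g. the -1 entries in the table).
    """
    if seconds <= 0:
        return None
    reads_per_hour = n_reads / seconds * 3600
    return reads_per_hour / 1e6


# Example: BWA on BS_SE_EXACT_1000000_50.fastq took 441 seconds.
bwa_exact = millions_of_reads_per_hour(1_000_000, 441)
print(f"{bwa_exact:.2f} million reads per hour")
```

With 1,000,000 reads in 441 seconds this works out to 3600/441, or about 8.16 million reads per hour.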