Line 35: |
Line 35: |
| == Mapping Qualities == | | == Mapping Qualities == |
| | | |
− | We should evaluate mapping qualities by counting how many reads are assigned each mapping quality (or greater) and among those how many map correctly or incorrectly. This gives a Heng Li graph, where one plots number of correctly mapped reads vs. number of mismapped reads. | + | We should evaluate mapping qualities by counting how many reads are assigned each mapping quality (or greater) and among those how many map correctly or incorrectly. This gives a Heng Li graph, where one plots number of correctly mapped reads vs. number of mismapped reads. |
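The counting described above can be sketched as follows. This is a minimal illustration, not the evaluation script used for this page: the function name `mapq_curve` and the input format (one `(mapq, is_correct)` pair per read) are assumptions made for the example.

```python
from collections import Counter


def mapq_curve(reads):
    """Return (threshold, n_correct, n_incorrect) for reads with MAPQ >= threshold.

    `reads` is an iterable of (mapq, is_correct) pairs (hypothetical input format).
    Thresholds are listed from the highest mapping quality down to the lowest.
    """
    correct = Counter()
    wrong = Counter()
    for mapq, is_correct in reads:
        if is_correct:
            correct[mapq] += 1
        else:
            wrong[mapq] += 1

    curve = []
    n_correct = n_wrong = 0
    # Accumulate from high to low quality so each row covers "this MAPQ or greater".
    for q in sorted(set(correct) | set(wrong), reverse=True):
        n_correct += correct[q]
        n_wrong += wrong[q]
        curve.append((q, n_correct, n_wrong))
    return curve


example = [(60, True), (60, True), (30, True), (30, False), (0, False)]
curve = mapq_curve(example)
for threshold, ok, bad in curve:
    print(threshold, ok, bad)
```

Plotting the second column against the third gives the correctly mapped vs. mismapped curve.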
− | | |
− | == Available Test Datasets ==
| |
− | | |
− | *Location: wonderland:~zhanxw/BigSimulation
| |
− | *Scenarios:
| |
− | | |
− | no polymorphism ; 1, 2, 3 SNP ; Deletion 5, 30, 200; Insertion 5, 30
| |
− | | |
− | *Quality String
| |
− | | |
− | Picked the 75th percentile of the Sanger Illumina 108-mer test data set
| |
− | | |
− | *Format
| |
− | | |
− | ; Both base space and color space
| |
− | ; Both single end and paired end, and paired end reads are given insert size 1500.
| |
− | ; Forward strand and reverse strand are randomly assigned, each with probability 1/2
| |
− | | |
− | * Tag
| |
− | @2:12345:F:SE:Exact
| |
− | @2:12345:F:SE:SNP:2,12345,A,G;2,12346,T,C
| |
− | @2:12345:F:PE+offset:SNP:2,12345,A,G (ref is A, read is G)
| |
− | @2:12345:F:PE+offset:Indel:25M30D5M
| |
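A tag in the format listed above could be split into its fields roughly as follows. This is a sketch under assumptions: the function name `parse_tag` is hypothetical, and "PE+offset" is assumed to mean the letters `PE` followed directly by a signed integer (for example `PE+1500`), which the examples above do not spell out.

```python
import re


def parse_tag(tag):
    """Split a simulated read tag into chromosome, position, strand, layout and variants."""
    fields = tag.lstrip("@").split(":", 5)
    chrom, pos, strand, layout, kind = fields[:5]
    detail = fields[5] if len(fields) > 5 else ""

    offset = None
    if layout != "SE":
        # Assumption: paired-end layout is written as PE plus a signed integer offset.
        match = re.fullmatch(r"PE([+-]\d+)", layout)
        if match is None:
            raise ValueError("unrecognised layout field: " + layout)
        offset = int(match.group(1))
        layout = "PE"

    variants = []
    cigar = None
    if kind == "SNP":
        # Each SNP is chromosome,position,reference base,read base.
        for item in detail.split(";"):
            snp_chrom, snp_pos, ref_base, read_base = item.split(",")
            variants.append((snp_chrom, int(snp_pos), ref_base, read_base))
    elif kind == "Indel":
        cigar = detail

    return {
        "chrom": chrom,
        "pos": int(pos),
        "strand": strand,
        "layout": layout,
        "offset": offset,
        "kind": kind,
        "variants": variants,
        "cigar": cigar,
    }


snp_tag = parse_tag("@2:12345:F:SE:SNP:2,12345,A,G;2,12346,T,C")
indel_tag = parse_tag("@2:12345:F:PE+1500:Indel:25M30D5M")
```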
− | | |
− | * File Naming
| |
− | BS_SE_EXACT_1M_50
| |
− | BS_SE_SNP1_1M_50
| |
− | CS_SE_INDEL1_1M
| |
− | CS_SE_INDEL30_1M
| |
− | CS_SE_INDEL200_1M
| |
− | CS_SE_DEL1_1M
| |
− | | |
− | For PE, append "_1" and "_2", e.g.:
| |
− | PE_EXACT_1M_1
| |
− | PE_EXACT_1M_2
| |
− | | |
− | *Program (generator)
| |
− | | |
− | Usage:
| |
− | | |
− | generator [bs|cs] [se|pe] [exact|snpXX|indelXX|delXX] -n numbers -l readLength -i insertSize
| |
− | exact: Sample reads exactly from the reference genome (no polymorphism)
| |
− | snpXX: Introduce a total of XX SNPs into a single read or a pair of reads
| |
− | indelXX: Insert a random XX-length piece into a single read, or at the same position for a pair of reads
| |
− | delXX: Delete a random XX-length piece from a single read, or at the same position for a pair of reads
| |
− | e.g. ./generator bs se exact -n 100 -l 35
| |
− | | |
− | *Output
| |
− | | |
− | Simulation files are named like BS_SE_EXACT_1000000_35, meaning base space, single end, exact (no polymorphism), 1M reads, 35 bp per read. Each read's tag is named in a way similar to Sanger's.
| |
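The naming scheme above can be decoded mechanically. The sketch below is illustrative only: `parse_sim_name` is a hypothetical helper, and the optional `.fastq` extension and trailing mate number are assumed from the file names used elsewhere on this page.

```python
import re


def parse_sim_name(name):
    """Decode a simulation file name such as BS_SE_EXACT_1000000_35."""
    stem = name[:-len(".fastq")] if name.endswith(".fastq") else name
    parts = stem.split("_")
    if len(parts) not in (5, 6):
        raise ValueError("unexpected file name: " + name)

    space, layout, scenario, n_reads, read_len = parts[:5]
    mate = int(parts[5]) if len(parts) == 6 else None  # 1 or 2 for paired-end files

    # Scenario is a keyword (EXACT, SNP, INDEL, DEL) with an optional size or count.
    match = re.fullmatch(r"([A-Z]+)(\d*)", scenario)
    if match is None:
        raise ValueError("unexpected scenario: " + scenario)

    return {
        "space": {"BS": "base space", "CS": "color space"}[space],
        "layout": {"SE": "single end", "PE": "paired end"}[layout],
        "scenario": match.group(1),
        "size": int(match.group(2)) if match.group(2) else None,
        "n_reads": int(n_reads),
        "read_length": int(read_len),
        "mate": mate,
    }


info = parse_sim_name("BS_SE_EXACT_1000000_35")
pe_info = parse_sim_name("CS_PE_DEL200_1000000_50_2.fastq")
```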
− | | |
− | * Example
| |
− | For Illumina (from Sanger, 108-mer hap1 test file):
| |
− | Example:
| |
− | <pre>
| |
− | _1 file:
| |
− | @20:14812275:F:217;None;None/1
| |
− | AGTTGTTTACTTTCCTTTCCTACCTGGCTGCATCTGTCACATGCATATAGTGTCCCCTGACATGAAGCTCTGATATTGATCTGGAGCCCTATTGGTCTGCAAGTGACT
| |
− | +
| |
− | %27::2:::<70<<::95<<6/8<.)3;::9-,3:6/67731/.+)66;;53'31;9<815.%%%+%4-%%%90-)./26<831))(.%%%%%%%)%0%2%%%%%+%%
| |
− | | |
− | @15:59364621:R:-118;None;None/1
| |
− | TGTTCAACCCACTATTAAGCCAGTATTAAATTGTTAATATCAGTTATTATACTTTTATTTCTAAAATTTCTATTTGATCCCTTTTTTTATAAACTCCAATGCATTCTC
| |
− | +
| |
− | %%2=;28>>>>=><>>>>>=>>=>>>;>=>9<1%+,//0+)<<91<4=;;<.%)2::8;;/9<;;;;8647<<;8;;066:<:4628;;;;5:9<<0/25752:3482
| |
− | | |
− | _2 file:
| |
− | @20:14812275:F:217;None;None/2
| |
− | CACTGGAGGGAATCCAATCCCAAATTAATATAACAAAACCAGAAGCTTGCTTAAAAAATATTTTATCAGATTCCAAAGTTGAGCTTGTGTTAGGGTGTACTGGAACTC
| |
− | +
| |
− | %%0;+250::-863486::599<9679/2%%))%+80%--7<;9/1%33,-%%)28/),3,67-8;56<1%)0/%%8;<;59/%%,())%%1%%+%).%099'4;+%-
| |
− | | |
− | @15:59364621:R:-118;None;None/2
| |
− | AGAAATAAGACCACATGACAATGTTAAAAATAAAACAGGCAATAGCAATAGTCCCAGAGGTGGTTACAATATGATTTCATGCTCCAGAAAGTATAGGAGAAGACAAAG
| |
− | +
| |
− | %3===;==;7<<;7<5;==<<4<;9=8==<====:<<<<<;<==:=<58;===;:8'8:<===:.9:38908:=;;7;57)%.+%)967%%-%%'6:-%)7);<;0+%
| |
− | </pre>
| |
− | | |
− | Conclusion:
| |
− | If the first read is forward, then it is the same as the reference sequence and the second read is the reverse complement of the reference sequence.
| |
− | If the first read is reverse, then it is the reverse complement of the reference sequence and the second read is the same as the reference sequence.
| |
− | The position of the first read can always be obtained from the first two fields of the tag (separated by colons).
| |
− | The position of the second read is the position of the first read plus the offset.
| |
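The rules in this conclusion can be expressed in a few lines. This is only a sketch: `mate_layout` is a hypothetical name, and the tag is assumed to have the form seen in the example above (chromosome, position, strand and offset separated by colons, followed by semicolon-separated variant fields).

```python
def mate_layout(tag):
    """Derive both read positions and orientations from a Sanger-style paired-end tag."""
    # Keep only the part before the first ';', e.g. "20:14812275:F:217".
    head = tag.lstrip("@>").split(";")[0]
    chrom, pos, strand, offset = head.split(":")
    first_pos = int(pos)
    second_pos = first_pos + int(offset)  # second read = first read position + offset

    first_is_forward = strand == "F"
    return {
        "chrom": chrom,
        "first_pos": first_pos,
        "second_pos": second_pos,
        # A forward first read matches the reference and its mate is reverse complemented;
        # a reverse first read is the other way round.
        "first_matches_reference": first_is_forward,
        "second_matches_reference": not first_is_forward,
    }


forward_pair = mate_layout("@20:14812275:F:217;None;None/1")
reverse_pair = mate_layout("@15:59364621:R:-118;None;None/1")
```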
− | | |
− | For SOLiD (from Sanger, 50-mer hap1 test file):
| |
− | e.g.
| |
− | | |
− | <pre>
| |
− | _1 file:
| |
− | >2:67043752:F:1445;2,67043761,A,G;None
| |
− | T12221203021201200302123102221322000012301300211213
| |
− | 22212031230012003021211022213220000123013022112123 (ref)
| |
− | >4:125830377:R:-1541;None;None
| |
− | T30002222300330113020203010322111010300030003230320
| |
− | | |
− | _2 file:
| |
− | >2:67043752:F:1445;2,67043761,A,G;None
| |
− | G13031223023023012201210020003310110111111203310211
| |
− | 30312230230230122012100200033121201111113033112112 (ref)
| |
− |
| |
− | >4:125830377:R:-1541;None;None
| |
− | G13311131230200010201210032223330120312000301230032
| |
− | | |
− | </pre>
| |
− | | |
− | Conclusion:
| |
− | The first and second reads have the same direction (both the same as the reference genome, or both the reverse complement of the reference genome),
| |
− | and their positions are determined in the same way as for Illumina reads.
| |
− | | |
− | <br>
| |
− | | |
− | = Bulk statistics result =
| |
− | Running time (all submitted to the MOSIX client nodes)
| |
− | <br>
| |
− | Calculated by "./parseRunbatch.py batch2.log |cutrange 0,-1|charrange :-1".
| |
− | | |
− | The log file is from runbatch.pl; a negative time means the run was unfinished (at the time of editing).
| |
− | | |
− | TODO: Add file size comparison; add link to memory page summarized by Dharknes.
| |
− | <pre>
| |
− | BWA (seconds) Karma (seconds) Scenario
| |
− | 2594 7182 BS_SE_DEL200_1000000_50.fastq
| |
− | 2641 -1 BS_SE_DEL30_1000000_50.fastq
| |
− | 2355 -1 BS_SE_DEL5_1000000_50.fastq
| |
− | 441 7941 BS_SE_EXACT_1000000_50.fastq
| |
− | 809 282 BS_SE_INDEL30_1000000_50.fastq
| |
− | 2217 -1 BS_SE_INDEL5_1000000_50.fastq
| |
− | 645 7206 BS_SE_SNP1_1000000_50.fastq
| |
− | 1102 -1 BS_SE_SNP2_1000000_50.fastq
| |
− | 1142 -1 BS_SE_SNP3_1000000_50.fastq
| |
− | 6536 8874 BS_PE_DEL200_1000000_50_?.fastq
| |
− | 6699 9017 BS_PE_DEL30_1000000_50_?.fastq
| |
− | 6468 9033 BS_PE_DEL5_1000000_50_?.fastq
| |
− | 1743 10112 BS_PE_EXACT_1000000_50_?.fastq
| |
− | 2305 231 BS_PE_INDEL30_1000000_50_?.fastq
| |
− | 5703 2989 BS_PE_INDEL5_1000000_50_?.fastq
| |
− | 1974 3718 BS_PE_SNP1_1000000_50_?.fastq
| |
− | 2396 3339 BS_PE_SNP2_1000000_50_?.fastq
| |
− | 2817 3131 BS_PE_SNP3_1000000_50_?.fastq
| |
− | 4362 16074 CS_PE_DEL200_1000000_50_?.fastq
| |
− | 4385 -1 CS_PE_DEL30_1000000_50_?.fastq
| |
− | 4373 9287 CS_PE_DEL5_1000000_50_?.fastq
| |
− | 773 -1 CS_PE_EXACT_1000000_50_?.fastq
| |
− | 1735 3142 CS_PE_INDEL30_1000000_50_?.fastq
| |
− | 4023 8591 CS_PE_INDEL5_1000000_50_?.fastq
| |
− | 1034 10528 CS_PE_SNP1_1000000_50_?.fastq
| |
− | 2236 -1 CS_PE_SNP2_1000000_50_?.fastq
| |
− | 3810 6617 CS_PE_SNP3_1000000_50_?.fastq
| |
− | 7129 1493 CS_SE_DEL200_1000000_50.fastq
| |
− | 7115 1513 CS_SE_DEL30_1000000_50.fastq
| |
− | 7065 1542 CS_SE_DEL5_1000000_50.fastq
| |
− | 1544 1666 CS_SE_EXACT_1000000_50.fastq
| |
− | 2954 289 CS_SE_INDEL30_1000000_50.fastq
| |
− | 6547 1390 CS_SE_INDEL5_1000000_50.fastq
| |
− | 1690 1661 CS_SE_SNP1_1000000_50.fastq
| |
− | 2853 1449 CS_SE_SNP2_1000000_50.fastq
| |
− | 4039 1237 CS_SE_SNP3_1000000_50.fastq
| |
− | </pre>
| |
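When comparing the two columns, unfinished runs (negative times) need to be excluded. A minimal sketch of how that could be done, assuming the table is available as plain whitespace-separated text; `parse_timings` is a hypothetical helper and is not the `parseRunbatch.py` script mentioned above.

```python
def parse_timings(text):
    """Return (scenario, bwa_seconds, karma_seconds) rows; unfinished runs become None."""
    rows = []
    for line in text.splitlines():
        parts = line.split()
        if len(parts) != 3:
            continue
        try:
            bwa, karma = int(parts[0]), int(parts[1])
        except ValueError:
            continue  # header or other non-numeric line
        rows.append((parts[2],
                     bwa if bwa >= 0 else None,
                     karma if karma >= 0 else None))
    return rows


sample = """\
BWA(second) Karma(second) Scenarios
2594 7182 BS_SE_DEL200_1000000_50.fastq
2641 -1 BS_SE_DEL30_1000000_50.fastq
809 282 BS_SE_INDEL30_1000000_50.fastq
"""

rows = parse_timings(sample)
for scenario, bwa, karma in rows:
    if bwa is None or karma is None:
        print(scenario, "unfinished")
    else:
        print(scenario, "Karma/BWA time ratio: %.2f" % (karma / bwa))
```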