Line 35: |
Line 35: |
| == Mapping Qualities == | | == Mapping Qualities == |
| | | |
− | We should evaluate mapping qualities by counting how many reads are assigned each mapping quality (or greater) and among those how many map correctly or incorrectly. This gives a Heng Li graph, where one plots number of correctly mapped reads vs. number of mismapped reads. | + | We should evaluate mapping qualities by counting how many reads are assigned each mapping quality (or greater) and among those how many map correctly or incorrectly. This gives a Heng Li graph, where one plots number of correctly mapped reads vs. number of mismapped reads. |
− | | |
− | == Available Test Datasets ==
| |
− | | |
− | *Location: wonderland:~zhanxw/BigSimulation
| |
− | *Scenarios:
| |
− | | |
− | no polymorphism ; 1, 2, 3 SNP ; Deletion 5, 30, 200; Insertion 5, 30
| |
− | | |
− | *Quality String
| |
− | | |
− | Picked the 75 percentile of Sanger Iluumina 108 mer test data set
| |
− | BCCCCBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBAAAAAAAAAA@@@@@@@@@@@@@@@???????????>>>>>>>>>>>>=========<<<<<<<<<<;;";
| |
− | | |
− | | |
− | *Format
| |
− | | |
− | Both base space and color space
| |
− | | |
− | Both single end and paired end, and paired end reads are given insert size 1500.
| |
− | | |
− | Forward strand and reverse strand are randomly assign with probability 1/2
| |
− | | |
− | * Tag
| |
− | | |
− | @2:12345:F:SE:Exact
| |
− | | |
− | @2:12345:F:SE:SNP:2,12345,A,G;2,12346,T,C
| |
− | | |
− | @2:12345:F:PE+offset:SNP:2,12345,A,G (ref is A, read is G)
| |
− | | |
− | @2:12345:F:PE+offset:Indel:25M30D5M
| |
− | | |
− | * File Naming
| |
− | | |
− | BS_SE_EXACT_1M_50
| |
− | | |
− | BS_SE_SNP1_1M_50
| |
− | | |
− | CS_SE_INDEL1_1M
| |
− | | |
− | CS_SE_INDEL30_1M
| |
− | | |
− | CS_SE_INDEL200_1M
| |
− | | |
− | CS_SE_DEL1_1M
| |
− | | |
− | For PE, appending "_1" and "_2", e.g.:
| |
− | | |
− | PE_EXACT_1M_1
| |
− | | |
− | PE_EXACT_1M_2
| |
− | | |
− | *Program (generator)
| |
− | | |
− | Usage:
| |
− | | |
− | generator [bs|cs] [se|pe] [exact|snpXX|indelXX|delXX] -n numbers -l readLength -i insertSize
| |
− | exact: Accurate sample from reference genome
| |
− | snpXX: Bring total XXX SNP for a single read or a pair of reads
| |
− | indelXX: Insert a random XX-length piece for a single read, or at the same position for a paired reads
| |
− | delXX: Delete a random XX-length piece for a single read, or at the same position for a paired reads
| |
− | e.g. ./generator bs se exact -n 100 -l 35
| |
− | | |
− | *Output
| |
− | | |
− | Simulation file are named like: BS_SE_EXACT_1000000_35, meaning base space, single end, exact (no polymorphism), 1M reads, 35 bp per read. For each read, the tag was named in a similar way to Sanger's.
| |
− | | |
− | * Example
| |
− | | |
− | For illumina (from Sanger, 108mer hap1 test file):
| |
− | | |
− | Example:
| |
− | <pre>
| |
− | _1 file:
| |
− | @20:14812275:F:217;None;None/1
| |
− | AGTTGTTTACTTTCCTTTCCTACCTGGCTGCATCTGTCACATGCATATAGTGTCCCCTGACATGAAGCTCTGATATTGATCTGGAGCCCTATTGGTCTGCAAGTGACT
| |
− | +
| |
− | %27::2:::<70<<::95<<6/8<.)3;::9-,3:6/67731/.+)66;;53'31;9<815.%%%+%4-%%%90-)./26<831))(.%%%%%%%)%0%2%%%%%+%%
| |
− | | |
− | @15:59364621:R:-118;None;None/1
| |
− | TGTTCAACCCACTATTAAGCCAGTATTAAATTGTTAATATCAGTTATTATACTTTTATTTCTAAAATTTCTATTTGATCCCTTTTTTTATAAACTCCAATGCATTCTC
| |
− | +
| |
− | %%2=;28>>>>=><>>>>>=>>=>>>;>=>9<1%+,//0+)<<91<4=;;<.%)2::8;;/9<;;;;8647<<;8;;066:<:4628;;;;5:9<<0/25752:3482
| |
− | | |
− | _2 file:
| |
− | @20:14812275:F:217;None;None/2
| |
− | CACTGGAGGGAATCCAATCCCAAATTAATATAACAAAACCAGAAGCTTGCTTAAAAAATATTTTATCAGATTCCAAAGTTGAGCTTGTGTTAGGGTGTACTGGAACTC
| |
− | +
| |
− | %%0;+250::-863486::599<9679/2%%))%+80%--7<;9/1%33,-%%)28/),3,67-8;56<1%)0/%%8;<;59/%%,())%%1%%+%).%099'4;+%-
| |
− | | |
− | @15:59364621:R:-118;None;None/2
| |
− | AGAAATAAGACCACATGACAATGTTAAAAATAAAACAGGCAATAGCAATAGTCCCAGAGGTGGTTACAATATGATTTCATGCTCCAGAAAGTATAGGAGAAGACAAAG
| |
− | +
| |
− | %3===;==;7<<;7<5;==<<4<;9=8==<====:<<<<<;<==:=<58;===;:8'8:<===:.9:38908:=;;7;57)%.+%)967%%-%%'6:-%)7);<;0+%
| |
− | </pre>
| |
− | | |
− | Conclusion:
| |
− | | |
− | If the first read is forward, then itself is the same as reference sequence and the second read is reverse complement to the reference sequence.
| |
− | | |
− | If the first read is backward, then itself is reverse complement to the reference genome and the second read is the same as the reference sequence.
| |
− | | |
− | The first strand always position can always obtain from tag, first two fields (seperated by colon).
| |
− | | |
− | The second strand position is first strand position plus the offset.
| |
− | | |
− | For SOLiD (from Sanger, 50 mer hap1 test file)
| |
− | e.g.
| |
− | | |
− | <pre>
| |
− | _1 file:
| |
− | >2:67043752:F:1445;2,67043761,A,G;None
| |
− | T12221203021201200302123102221322000012301300211213
| |
− | 22212031230012003021211022213220000123013022112123 (ref)
| |
− | >4:125830377:R:-1541;None;None
| |
− | T30002222300330113020203010322111010300030003230320
| |
− | | |
− | _2 file:
| |
− | >2:67043752:F:1445;2,67043761,A,G;None
| |
− | G13031223023023012201210020003310110111111203310211
| |
− | 30312230230230122012100200033121201111113033112112 (ref)
| |
− |
| |
− | >4:125830377:R:-1541;None;None
| |
− | G13311131230200010201210032223330120312000301230032
| |
− | | |
− | </pre>
| |
− | | |
− | Conclusion:
| |
− | The first strand and second strand have the same direction (both either same as the reference genome, or reverse complement to reference genome),
| |
− | where their positions are the same as Illumina reads.
| |
− | | |
− | <br>
| |
− | | |
− | = Bulk statistics result =
| |
− | Running time (all submitted to the MOSIX client nodes)
| |
− | <br>
| |
− | Calculated by "./parseRunbatch.py batch2.log |cutrange 0,-1|charrange :-1".
| |
− | | |
− | Log file is from runbatch.pl and negative time means unfinished (at the moment of editing).
| |
− | | |
− | TODO: Add file size comparison; add link to memory page summarized by Dharknes.
| |
− | <pre>
| |
− | BWA(second) Karma(second) Scenarios
| |
− | 7561 4638 BS_PE_DEL200_1000000_50_?.fastq
| |
− | 7548 4677 BS_PE_DEL30_1000000_50_?.fastq
| |
− | 7225 4730 BS_PE_DEL5_1000000_50_?.fastq
| |
− | 975 6531 BS_PE_EXACT_1000000_50_?.fastq
| |
− | 1726 793 BS_PE_INDEL30_1000000_50_?.fastq
| |
− | 6199 4140 BS_PE_INDEL5_1000000_50_?.fastq
| |
− | 1193 4949 BS_PE_SNP1_1000000_50_?.fastq
| |
− | 1646 4513 BS_PE_SNP2_1000000_50_?.fastq
| |
− | 2064 4089 BS_PE_SNP3_1000000_50_?.fastq
| |
− | 2594 3707 BS_SE_DEL200_1000000_50.fastq
| |
− | 2641 3942 BS_SE_DEL30_1000000_50.fastq
| |
− | 2355 4263 BS_SE_DEL5_1000000_50.fastq
| |
− | 441 4228 BS_SE_EXACT_1000000_50.fastq
| |
− | 809 764 BS_SE_INDEL30_1000000_50.fastq
| |
− | 2217 3932 BS_SE_INDEL5_1000000_50.fastq
| |
− | 645 3808 BS_SE_SNP1_1000000_50.fastq
| |
− | 1102 3473 BS_SE_SNP2_1000000_50.fastq
| |
− | 1142 3267 BS_SE_SNP3_1000000_50.fastq
| |
− | 6193 6909 CS_PE_DEL200_1000000_50_?.fastq
| |
− | 6173 6636 CS_PE_DEL30_1000000_50_?.fastq
| |
− | 6096 6702 CS_PE_DEL5_1000000_50_?.fastq
| |
− | 858 8496 CS_PE_EXACT_1000000_50_?.fastq
| |
− | 1743 948 CS_PE_INDEL30_1000000_50_?.fastq
| |
− | 5517 5412 CS_PE_INDEL5_1000000_50_?.fastq
| |
− | 1253 8454 CS_PE_SNP1_1000000_50_?.fastq
| |
− | 2113 7420 CS_PE_SNP2_1000000_50_?.fastq
| |
− | 2622 6076 CS_PE_SNP3_1000000_50_?.fastq
| |
− | 3878 1493 CS_SE_DEL200_1000000_50.fastq
| |
− | 3859 1513 CS_SE_DEL30_1000000_50.fastq
| |
− | 3775 1542 CS_SE_DEL5_1000000_50.fastq
| |
− | 621 1666 CS_SE_EXACT_1000000_50.fastq
| |
− | 1392 289 CS_SE_INDEL30_1000000_50.fastq
| |
− | 3525 1390 CS_SE_INDEL5_1000000_50.fastq
| |
− | 874 1661 CS_SE_SNP1_1000000_50.fastq
| |
− | 1965 1449 CS_SE_SNP2_1000000_50.fastq
| |
− | 3314 1237 CS_SE_SNP3_1000000_50.fastq
| |
− | </pre>
| |