Changes

From Genome Analysis Wiki
Jump to navigationJump to search
no edit summary
Line 35: Line 35:  
== Mapping Qualities  ==
 
== Mapping Qualities  ==
   −
We should evaluate mapping qualities by counting how many reads are assigned each mapping quality (or greater) and among those how many map correctly or incorrectly. This gives a Heng Li graph, where one plots number of correctly mapped reads vs. number of mismapped reads.  
+
We should evaluate mapping qualities by counting how many reads are assigned each mapping quality (or greater) and among those how many map correctly or incorrectly. This gives a Heng Li graph, where one plots number of correctly mapped reads vs. number of mismapped reads.
 
  −
== Available Test Datasets  ==
  −
 
  −
*Location: wonderland:~zhanxw/BigSimulation
  −
*Scenarios:
  −
 
  −
no polymorphism ; 1, 2, 3 SNP ; Deletion 5, 30, 200; Insertion 5, 30
  −
 
  −
*Quality String
  −
 
  −
Picked the 75 percentile of Sanger Iluumina 108 mer test data set
  −
BCCCCBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBAAAAAAAAAA@@@@@@@@@@@@@@@???????????>>>>>>>>>>>>=========<<<<<<<<<<;;";
  −
 
  −
 
  −
*Format
  −
 
  −
; Both base space and color space
  −
; Both single end and paired end, and paired end reads are given insert size 1500.
  −
; Forward strand and reverse strand are randomly assign with probability 1/2
  −
 
  −
* Tag
  −
@2:12345:F:SE:Exact
  −
@2:12345:F:SE:SNP:2,12345,A,G;2,12346,T,C
  −
@2:12345:F:PE+offset:SNP:2,12345,A,G  (ref is A, read is G)
  −
@2:12345:F:PE+offset:Indel:25M30D5M
  −
 
  −
* File Naming
  −
BS_SE_EXACT_1M_50
  −
BS_SE_SNP1_1M_50
  −
CS_SE_INDEL1_1M
  −
CS_SE_INDEL30_1M
  −
CS_SE_INDEL200_1M
  −
CS_SE_DEL1_1M
  −
 
  −
For PE, appending "_1" and "_2", e.g.:
  −
PE_EXACT_1M_1
  −
PE_EXACT_1M_2
  −
 
  −
*Program (generator)
  −
 
  −
Usage:
  −
 
  −
        generator [bs|cs] [se|pe] [exact|snpXX|indelXX|delXX] -n numbers -l readLength -i insertSize
  −
        exact: Accurate sample from reference genome
  −
        snpXX: Bring total XXX SNP for a single read or a pair of reads
  −
        indelXX: Insert a random XX-length piece for a single read, or at the same position for a paired reads
  −
        delXX: Delete a random XX-length piece for a single read, or at the same position for a paired reads
  −
        e.g. ./generator bs se exact -n 100 -l 35
  −
 
  −
*Output
  −
 
  −
Simulation file are named like: BS_SE_EXACT_1000000_35, meaning base space, single end, exact (no polymorphism), 1M reads, 35 bp per read. For each read, the tag was named in a similar way to Sanger's.
  −
 
  −
* Example
  −
For illumina (from Sanger, 108mer hap1 test file):
  −
Example:
  −
<pre>
  −
_1 file:
  −
@20:14812275:F:217;None;None/1
  −
AGTTGTTTACTTTCCTTTCCTACCTGGCTGCATCTGTCACATGCATATAGTGTCCCCTGACATGAAGCTCTGATATTGATCTGGAGCCCTATTGGTCTGCAAGTGACT
  −
+
  −
%27::2:::<70<<::95<<6/8<.)3;::9-,3:6/67731/.+)66;;53'31;9<815.%%%+%4-%%%90-)./26<831))(.%%%%%%%)%0%2%%%%%+%%
  −
 
  −
@15:59364621:R:-118;None;None/1
  −
TGTTCAACCCACTATTAAGCCAGTATTAAATTGTTAATATCAGTTATTATACTTTTATTTCTAAAATTTCTATTTGATCCCTTTTTTTATAAACTCCAATGCATTCTC
  −
+
  −
%%2=;28>>>>=><>>>>>=>>=>>>;>=>9<1%+,//0+)<<91<4=;;<.%)2::8;;/9<;;;;8647<<;8;;066:<:4628;;;;5:9<<0/25752:3482
  −
 
  −
_2 file:
  −
@20:14812275:F:217;None;None/2
  −
CACTGGAGGGAATCCAATCCCAAATTAATATAACAAAACCAGAAGCTTGCTTAAAAAATATTTTATCAGATTCCAAAGTTGAGCTTGTGTTAGGGTGTACTGGAACTC
  −
+
  −
%%0;+250::-863486::599<9679/2%%))%+80%--7<;9/1%33,-%%)28/),3,67-8;56<1%)0/%%8;<;59/%%,())%%1%%+%).%099'4;+%-
  −
 
  −
@15:59364621:R:-118;None;None/2
  −
AGAAATAAGACCACATGACAATGTTAAAAATAAAACAGGCAATAGCAATAGTCCCAGAGGTGGTTACAATATGATTTCATGCTCCAGAAAGTATAGGAGAAGACAAAG
  −
+
  −
%3===;==;7<<;7<5;==<<4<;9=8==<====:<<<<<;<==:=<58;===;:8'8:<===:.9:38908:=;;7;57)%.+%)967%%-%%'6:-%)7);<;0+%
  −
</pre>
  −
 
  −
Conclusion:
  −
If the first read is forward, then itself is the same as reference sequence and the second read is reverse complement to the reference sequence.
  −
If the first read is backward, then itself is reverse complement to the reference genome and the second read is the same as the reference sequence.
  −
The first strand always position can always obtain from tag, first two fields (seperated by colon).
  −
The second strand position is first strand position plus the offset.
  −
 
  −
For SOLiD (from Sanger, 50 mer hap1 test file)
  −
e.g.
  −
 
  −
<pre>
  −
_1 file:
  −
>2:67043752:F:1445;2,67043761,A,G;None
  −
T12221203021201200302123102221322000012301300211213
  −
  22212031230012003021211022213220000123013022112123 (ref)
  −
>4:125830377:R:-1541;None;None
  −
T30002222300330113020203010322111010300030003230320
  −
 
  −
_2 file:
  −
>2:67043752:F:1445;2,67043761,A,G;None
  −
G13031223023023012201210020003310110111111203310211
  −
  30312230230230122012100200033121201111113033112112 (ref)
  −
  −
>4:125830377:R:-1541;None;None
  −
G13311131230200010201210032223330120312000301230032
  −
 
  −
</pre>
  −
 
  −
Conclusion:
  −
The first strand and second strand have the same direction (both either same as the reference genome, or reverse complement to reference genome),
  −
where their positions are the same as Illumina reads.
  −
 
  −
<br>
  −
 
  −
= Bulk statistics result  =
  −
Running time (all submitted to the MOSIX client nodes)
  −
<br>
  −
Calculated by "./parseRunbatch.py batch2.log  |cutrange 0,-1|charrange :-1".
  −
 
  −
Log file is from runbatch.pl and negative time means unfinished (at the moment of editing).
  −
 
  −
TODO: Add file size comparison; add link to memory page summarized by Dharknes.
  −
<pre>
  −
BWA(second) Karma(second) Scenarios
  −
2594 7182 BS_SE_DEL200_1000000_50.fastq
  −
2641 -1 BS_SE_DEL30_1000000_50.fastq
  −
2355 -1 BS_SE_DEL5_1000000_50.fastq
  −
441 7941 BS_SE_EXACT_1000000_50.fastq
  −
809 282 BS_SE_INDEL30_1000000_50.fastq
  −
2217 -1 BS_SE_INDEL5_1000000_50.fastq
  −
645 7206 BS_SE_SNP1_1000000_50.fastq
  −
1102 -1 BS_SE_SNP2_1000000_50.fastq
  −
1142 -1 BS_SE_SNP3_1000000_50.fastq
  −
6536 8874 BS_PE_DEL200_1000000_50_?.fastq
  −
6699 9017 BS_PE_DEL30_1000000_50_?.fastq
  −
6468 9033 BS_PE_DEL5_1000000_50_?.fastq
  −
1743 10112 BS_PE_EXACT_1000000_50_?.fastq
  −
2305 231 BS_PE_INDEL30_1000000_50_?.fastq
  −
5703 2989 BS_PE_INDEL5_1000000_50_?.fastq
  −
1974 3718 BS_PE_SNP1_1000000_50_?.fastq
  −
2396 3339 BS_PE_SNP2_1000000_50_?.fastq
  −
2817 3131 BS_PE_SNP3_1000000_50_?.fastq
  −
4362 16074 CS_PE_DEL200_1000000_50_?.fastq
  −
4385 -1 CS_PE_DEL30_1000000_50_?.fastq
  −
4373 9287 CS_PE_DEL5_1000000_50_?.fastq
  −
773 -1 CS_PE_EXACT_1000000_50_?.fastq
  −
1735 3142 CS_PE_INDEL30_1000000_50_?.fastq
  −
4023 8591 CS_PE_INDEL5_1000000_50_?.fastq
  −
1034 10528 CS_PE_SNP1_1000000_50_?.fastq
  −
2236 -1 CS_PE_SNP2_1000000_50_?.fastq
  −
3810 6617 CS_PE_SNP3_1000000_50_?.fastq
  −
7129 1493 CS_SE_DEL200_1000000_50.fastq
  −
7115 1513 CS_SE_DEL30_1000000_50.fastq
  −
7065 1542 CS_SE_DEL5_1000000_50.fastq
  −
1544 1666 CS_SE_EXACT_1000000_50.fastq
  −
2954 289 CS_SE_INDEL30_1000000_50.fastq
  −
6547 1390 CS_SE_INDEL5_1000000_50.fastq
  −
1690 1661 CS_SE_SNP1_1000000_50.fastq
  −
2853 1449 CS_SE_SNP2_1000000_50.fastq
  −
4039 1237 CS_SE_SNP3_1000000_50.fastq
  −
</pre>
 

Navigation menu