Changes

From Genome Analysis Wiki
Line 50: Line 50:  
*Format
 
   −
both base space and color space both single end and paired end, and paired end reads are given insert size 1500.  
+
; Both base space and color space  
 +
; Both single end and paired end, and paired end reads are given insert size 1500.  
 +
; Forward strand and reverse strand are randomly assigned, each with probability 1/2
 +
 
 +
* Tag
 +
@2:12345:F:SE:Exact
 +
@2:12345:F:SE:SNP:2,12345,A,G;2,12346,T,C
 +
@2:12345:F:PE+offset:SNP:2,12345,A,G  (ref is A, read is G)
 +
@2:12345:F:PE+offset:Indel:25M30D5M
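The tag examples above can be split into fields mechanically. The following is a minimal sketch, not the generator's own code: the names `ReadTag`, `parse_tag` and `parse_snps` are hypothetical, and the field meanings (chromosome, position, strand, SE or PE+offset, variant type, variant detail) are inferred from the four examples. The text after `PE+` is kept as a string because the examples only show the placeholder `offset`.

```python
from typing import List, NamedTuple, Optional, Tuple


class ReadTag(NamedTuple):
    chrom: str
    pos: int
    strand: str            # "F" or "R"
    layout: str            # "SE" or "PE"
    offset: Optional[str]  # text after "PE+", None for single end
    kind: str              # e.g. "Exact", "SNP", "Indel"
    detail: str            # SNP list or CIGAR-like string, "" if absent


def parse_tag(tag: str) -> ReadTag:
    """Split a tag such as '@2:12345:F:SE:SNP:2,12345,A,G' into fields."""
    fields = tag.lstrip("@").split(":", 5)
    if len(fields) < 5:
        raise ValueError(f"unexpected tag: {tag!r}")
    chrom, pos, strand, layout, kind = fields[:5]
    detail = fields[5] if len(fields) > 5 else ""
    # "PE+offset" carries the mate offset after the plus sign.
    layout, _, offset = layout.partition("+")
    return ReadTag(chrom, int(pos), strand, layout, offset or None, kind, detail)


def parse_snps(detail: str) -> List[Tuple[str, int, str, str]]:
    """Parse 'chr,pos,ref,read' entries separated by ';'."""
    snps = []
    for item in detail.split(";"):
        chrom, pos, ref_base, read_base = item.split(",")
        snps.append((chrom, int(pos), ref_base, read_base))
    return snps
```

For example, `parse_snps(parse_tag("@2:12345:F:SE:SNP:2,12345,A,G;2,12346,T,C").detail)` yields two SNP records, each giving the reference base followed by the read base.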
 +
 
 +
* File Naming
 +
BS_SE_EXACT_1M_50
 +
BS_SE_SNP1_1M_50
 +
CS_SE_INDEL1_1M
 +
CS_SE_INDEL30_1M
 +
CS_SE_INDEL200_1M
 +
CS_SE_DEL1_1M
 +
 
 +
For PE, append "_1" and "_2" to the file name, e.g.:
 +
PE_EXACT_1M_1
 +
PE_EXACT_1M_2
    
*Program (generator)
 
Line 67: Line 87:  
Simulation files are named like BS_SE_EXACT_1000000_35, meaning base space, single end, exact (no polymorphism), 1M reads, 35 bp per read. Each read's tag is named in a way similar to Sanger's.
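The naming scheme can be decoded with a few lines of code. This is a minimal sketch that assumes the five-field form shown above (space, end type, variant type, read count, read length). The names `SimFileName` and `parse_sim_name` are hypothetical, and accepting a count written as "1M" is an assumption based on the names listed under File Naming.

```python
from typing import NamedTuple

SPACES = {"BS": "base space", "CS": "color space"}
ENDS = {"SE": "single end", "PE": "paired end"}


class SimFileName(NamedTuple):
    space: str
    ends: str
    variant: str      # e.g. "EXACT", "SNP1", "INDEL30"
    n_reads: int
    read_length: int  # bp per read


def parse_sim_name(name: str) -> SimFileName:
    """Decode a name such as 'BS_SE_EXACT_1000000_35'."""
    parts = name.split("_")
    if len(parts) != 5:
        raise ValueError(f"expected 5 fields, got {len(parts)}: {name!r}")
    space, ends, variant, count, length = parts
    if space not in SPACES or ends not in ENDS:
        raise ValueError(f"unknown prefix in {name!r}")
    # Assumption: a trailing "M" means millions of reads.
    if count.upper().endswith("M"):
        n_reads = int(count[:-1]) * 1_000_000
    else:
        n_reads = int(count)
    return SimFileName(SPACES[space], ENDS[ends], variant, n_reads, int(length))
```

Names that omit a field (for example those without a read length) are rejected by this sketch rather than guessed at.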
   −
<br>  
+
* Example
 +
For Illumina (from Sanger, 108-mer hap1 test file):
 +
Example:
 +
_1 file:
 +
@20:14812275:F:217;None;None/1
 +
AGTTGTTTACTTTCCTTTCCTACCTGGCTGCATCTGTCACATGCATATAGTGTCCCCTGACATGAAGCTCTGATATTGATCTGGAGCCCTATTGGTCTGCAAGTGACT
 +
+
 +
%27::2:::<70<<::95<<6/8<.)3;::9-,3:6/67731/.+)66;;53'31;9<815.%%%+%4-%%%90-)./26<831))(.%%%%%%%)%0%2%%%%%+%%
 +
 
 +
@15:59364621:R:-118;None;None/1
 +
TGTTCAACCCACTATTAAGCCAGTATTAAATTGTTAATATCAGTTATTATACTTTTATTTCTAAAATTTCTATTTGATCCCTTTTTTTATAAACTCCAATGCATTCTC
 +
+
 +
%%2=;28>>>>=><>>>>>=>>=>>>;>=>9<1%+,//0+)<<91<4=;;<.%)2::8;;/9<;;;;8647<<;8;;066:<:4628;;;;5:9<<0/25752:3482
 +
 
 +
_2 file:
 +
@20:14812275:F:217;None;None/2
 +
CACTGGAGGGAATCCAATCCCAAATTAATATAACAAAACCAGAAGCTTGCTTAAAAAATATTTTATCAGATTCCAAAGTTGAGCTTGTGTTAGGGTGTACTGGAACTC
 +
+
 +
%%0;+250::-863486::599<9679/2%%))%+80%--7<;9/1%33,-%%)28/),3,67-8;56<1%)0/%%8;<;59/%%,())%%1%%+%).%099'4;+%-
 +
 
 +
@15:59364621:R:-118;None;None/2
 +
AGAAATAAGACCACATGACAATGTTAAAAATAAAACAGGCAATAGCAATAGTCCCAGAGGTGGTTACAATATGATTTCATGCTCCAGAAAGTATAGGAGAAGACAAAG
 +
+
 +
%3===;==;7<<;7<5;==<<4<;9=8==<====:<<<<<;<==:=<58;===;:8'8:<===:.9:38908:=;;7;57)%.+%)967%%-%%'6:-%)7);<;0+%
 +
 
 +
Conclusion:
 +
If the first read is forward, it matches the reference sequence and the second read is the reverse complement of the reference sequence.
 +
If the first read is reverse, it is the reverse complement of the reference genome and the second read matches the reference sequence.
 +
The first read's position can always be obtained from the first two fields of the tag (separated by colons).
 +
The second read's position is the first read's position plus the offset.
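The rules in this conclusion can be written down as a short helper. This is a sketch under the assumption that the header layout is `chr:pos:strand:offset;...;.../N`, as in the Sanger examples above. The function name `mate_info` and the keys of the returned dictionary are hypothetical.

```python
def mate_info(header: str) -> dict:
    """Derive both read positions and orientations from a Sanger-style header.

    Example header: '@20:14812275:F:217;None;None/1'
    """
    # Drop the leading '@' or '>' and the trailing '/1' or '/2'.
    name = header.lstrip("@>").split("/")[0]
    # The location block is everything before the first ';'.
    chrom, pos, strand, offset = name.split(";")[0].split(":")
    first_pos = int(pos)
    first_is_forward = strand == "F"
    return {
        "chrom": chrom,
        "first_pos": first_pos,
        # Second read position = first read position + offset.
        "second_pos": first_pos + int(offset),
        # Illumina pairs: exactly one of the two reads matches the reference.
        "first_matches_reference": first_is_forward,
        "second_matches_reference": not first_is_forward,
    }
```

Applied to the two headers above, this gives a second-read position of 14812492 for the forward pair and 59364503 for the reverse pair.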
 +
 
 +
For SOLiD (from Sanger, 50-mer hap1 test file):
 +
e.g.
 +
_1 file:
 +
>2:67043752:F:1445;2,67043761,A,G;None
 +
T12221203021201200302123102221322000012301300211213
 +
  22212031230012003021211022213220000123013022112123 (ref)
 +
>4:125830377:R:-1541;None;None
 +
T30002222300330113020203010322111010300030003230320
 +
 
 +
_2 file:
 +
>2:67043752:F:1445;2,67043761,A,G;None
 +
G13031223023023012201210020003310110111111203310211
 +
  30312230230230122012100200033121201111113033112112 (ref)
 +
 +
>4:125830377:R:-1541;None;None
 +
G13311131230200010201210032223330120312000301230032
 +
 
 +
Conclusion:
 +
The first and second reads have the same orientation (both match the reference genome, or both are its reverse complement),
 +
and their positions are determined in the same way as for Illumina reads.
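For readers comparing the color space reads above against base space, the standard SOLiD two-base encoding can be sketched as follows. Each color is the XOR of the 2-bit codes of adjacent bases (A=0, C=1, G=2, T=3), and the leading letter is the primer base. The function names are hypothetical, and this illustrates the encoding in general rather than the generator used for these files.

```python
BASE_CODE = {"A": 0, "C": 1, "G": 2, "T": 3}
CODE_BASE = "ACGT"


def encode_color_space(bases: str, primer: str = "T") -> str:
    """Encode a base space sequence as primer base plus one color per base."""
    colors = []
    prev = primer
    for base in bases:
        # A color records the transition between two adjacent bases.
        colors.append(str(BASE_CODE[prev] ^ BASE_CODE[base]))
        prev = base
    return primer + "".join(colors)


def decode_color_space(read: str) -> str:
    """Invert encode_color_space: recover bases from primer base plus colors."""
    prev = BASE_CODE[read[0]]
    bases = []
    for color in read[1:]:
        prev ^= int(color)
        bases.append(CODE_BASE[prev])
    return "".join(bases)
```

Because each base is decoded from the previous one, a single wrong color changes every base after it, which is why color space reads are usually compared with the reference in color space, as in the `(ref)` lines above.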
 +
 
 +
<br>
    
= Bulk statistics result  =
 