Changes

BamGenotypeCheck (view source)

Revision as of 22:50, 23 November 2009

978 bytes removed , 22:50, 23 November 2009

no edit summary

Line 50: Line 50:

== TODO ==

−

Frequently, we will want to run lane checking on a mapped .bam file which already contains sequence data from many different instrument runs merged together. They are merged because the sequencing center said they all belong to the same individual. In Pilot 3, this was true for all of the Baylor LS 454 sequencing data. In this case, the read identifier in column 1 of the .sam file carries information about which sequencing run each read belongs to, as well as information that uniquely identifies that read within its run. The read identifiers often are dot or colon-separated strings of the form 'run_name<sep>read_number'. The 'run_name' may be either an SRR / ERR identifier or the sequencing center's own alpha-numeric internal run identifier.

+

1. Separate the results by "Read group classifier".

−

The "Read group classifier" is an extended regular expression such as '\(^[^.:]+\)[.:].*' which matches just the part of each read identifier that is common to all reads from one instrument run and which differs between instrument runs. The regular expression is passed into the lane checking program as an ASCII string. The program keeps track of all distinct values it has seen for the matched portion, and must keep a separate tally of matches and mismatches for each combination of [read group x candidate individual]. By itself, the matched portion of each read identifier does not fully specify which original .fastq file a read came from. The 'bitwise flag' value in column 2 of the .sam format has the remaining information. This is able to distinguish between the 'left end', 'right end' and 'single end' reads which come from each Illumina paired-end sequencing run. The Baylor LS 454 data were all single end reads, so I did not have to deal with this complication.<br>

+

The mapped .bam file may contain sequence data from different instrument runs. The read identifiers are often dot- or colon-separated strings of the form 'run_name<sep>read_number'. The 'run_name' may be either an SRR / ERR identifier or the sequencing center's own alpha-numeric internal run identifier. Allow users to input an extended regular expression such as '\(^[^.:]+\)[.:].*' which matches just the part of each read identifier that is common to all reads from one instrument run and which differs between instrument runs.
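A minimal Python sketch of how such a classifier could split read identifiers by run. It is not the program's actual implementation. The function names and the sample identifiers in the test are hypothetical. Note that Python's `re` module treats `\(` as a literal parenthesis, so the sketch writes the capturing group with plain parentheses; otherwise the pattern is the same as the one quoted above.

```python
import re
from collections import Counter

# Same classifier as in the text, with the group written as ( ... )
# because Python's re treats \( as a literal parenthesis.
READ_GROUP_CLASSIFIER = r"^([^.:]+)[.:].*"


def read_group(read_id, pattern=READ_GROUP_CLASSIFIER):
    """Return the run-name part of a read identifier, or None if it does not match."""
    match = re.match(pattern, read_id)
    return match.group(1) if match else None


def count_reads_per_group(read_ids, pattern=READ_GROUP_CLASSIFIER):
    """Count how many read identifiers fall into each read group."""
    counts = Counter()
    for read_id in read_ids:
        counts[read_group(read_id, pattern)] += 1
    return counts
```

Identifiers that contain no dot or colon do not match and are counted under `None`, which keeps them visible instead of silently assigning them to a run.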

+

2. Use a model-based approach to calculate the probability that a lane comes from the claimed individual in the index file, given a pool of individuals.
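The TODO does not say which model is intended. As one possible illustration only, the sketch below assumes a simple binomial error model: each lane has a tally of (matches, mismatches) against every candidate individual at the same set of sites, a read base disagrees with the true individual's genotype with a fixed probability `error_rate`, and the prior over the pool is uniform. The function name, the `error_rate` default, and the tally layout are all assumptions; heterozygous sites and base qualities are ignored.

```python
import math


def posterior_over_pool(tallies, error_rate=0.01):
    """Posterior probability of each candidate individual for one lane.

    tallies maps individual -> (matches, mismatches), all counted over the
    same sites so the likelihoods are comparable (assumed, not checked).
    """
    # Log-likelihood under the assumed binomial error model.
    log_lik = {
        individual: matches * math.log(1.0 - error_rate)
        + mismatches * math.log(error_rate)
        for individual, (matches, mismatches) in tallies.items()
    }
    # Subtract the maximum before exponentiating to avoid underflow.
    peak = max(log_lik.values())
    total = sum(math.exp(value - peak) for value in log_lik.values())
    return {
        individual: math.exp(value - peak) / total
        for individual, value in log_lik.items()
    }
```

The probability for the claimed individual is then the entry of the returned dictionary keyed by the name given in the index file; a low value suggests the lane may belong to someone else in the pool.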

+

<br>

Weich

533

edits
