Difference between revisions of "Base Caller Summaries"

Latest revision as of 17:04, 12 March 2010

Standard Illumina Base Caller (Bustard)

Sequencing-by-Synthesis (SBS)

DNA sample containing many copies of the same sequences is obtained and randomly fragmented
Single-stranded DNA fragments attached to slide and amplified so there is a cluster of each fragment
DNA polymerase and four terminator bases (each with a distinct fluorescent marker) added
Clusters excited by lasers and photos taken in optimal wavelengths for 4 fluorophores
Fluorophores and terminators removed and process repeated for L cycles

Image Analysis

Corrects for imperfect repositioning of camera and aberrations of lens by aligning images to reference from original cycle
Signal for each cluster characterized as time series data of fluorescence intensities and noise

Base Calling

Converts fluorescence signals into actual sequence data with quality scores
Takes intensities of four channels for every cluster in each cycle and determines concentration of each base
Renormalizes concentrations by multiplying by ratio of average concentrations in first cycle and current cycle
Uses Markov model to determine transition matrix modeling probability of phasing (no new base synthesized), prephasing (two new bases synthesized), and normal incorporation
Uses transition matrix and observed concentrations of each base to determine concentrations in absence of phasing and reports these as base calls
- Assumes crosstalk matrix constant for a given sequencing run and that phasing affects all nucleotides in the same way

General Noise Factors

Phasing
- Failures in nucleotide incorporation or block removal, or incorporation of more than one nucleotide in a particular cycle
Fading
- Decay in fluorescent signal intensity with each cycle
- Likely attributable to material loss during sequencing
Crosstalk
- C channel illumination overlaps with A: a C label fluoresces in A channel (similarly G and T overlap)
- Likely caused by overlap in dye emission frequencies
T Accumulation
- The fluorophores used for thymine are not always removed properly after each iteration
- Intensity of T signal increases across sequencing run
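
Two of these noise factors are easy to illustrate numerically. The sketch below is only a toy model: the crosstalk coefficients, the per-cycle decay rate, and all function names are hypothetical. It mixes and unmixes the A/C dye pair with a 2x2 crosstalk matrix, and undoes fading by rescaling each cycle to the mean intensity of the first cycle.

```python
def mix_pair(a, c, c_into_a, a_into_c):
    """Forward crosstalk model for one dye pair: each channel picks up
    a fraction of the other dye's emission."""
    return a + c_into_a * c, c + a_into_c * a


def unmix_pair(obs_a, obs_c, c_into_a, a_into_c):
    """Invert the 2x2 crosstalk matrix [[1, c_into_a], [a_into_c, 1]] in closed form."""
    det = 1.0 - c_into_a * a_into_c
    return ((obs_a - c_into_a * obs_c) / det,
            (obs_c - a_into_c * obs_a) / det)


def renormalize(intensities_by_cycle):
    """Undo fading by scaling each cycle to the mean intensity of the first cycle."""
    means = [sum(row) / len(row) for row in intensities_by_cycle]
    return [[x * means[0] / m for x in row]
            for row, m in zip(intensities_by_cycle, means)]


# A pure C cluster leaks into the A channel (coefficients are made up).
obs_a, obs_c = mix_pair(0.0, 1000.0, c_into_a=0.4, a_into_c=0.1)
a, c = unmix_pair(obs_a, obs_c, c_into_a=0.4, a_into_c=0.1)

# Two clusters whose signal decays by an assumed 2% per cycle.
decay = 0.98
faded = [[1000.0 * decay ** t, 800.0 * decay ** t] for t in range(3)]
restored = renormalize(faded)
```

The G/T pair can be handled the same way with its own pair of coefficients.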

Alta-Cyclic

Training Stage

Learns run-specific noise patterns according to model and finds optimized solution reducing effect of noise sources using a Support Vector Machine (SVM)
Half of training set used for cross-validation

Base Calling Stage

Reports all sequences from run with optimized parameters

Differences from Standard Illumina Base Caller

Calling parameters optimized empirically and tested to enhance accuracy of each run
Calculates phasing parameters based on parametric model
Dynamically tracks changes in crosstalk, which disrupt signals in later cycles

Probabilistic Base Calling

Produces an alternative probabilistic base calling method based on the fluorescence intensity quantifications that uses:
- Extended IUPAC alphabet to code ambiguous bases
- Information criterion to control length of trustable reads
Reduced systematic bias by addressing:
- Crosstalk
- Dephasing
- Optical effect in which tiles in center of image appear brighter; corrected by fitting a 2D loess model to intensities and subtracting difference between fit and median intensities
Measure level of uncertainty in base calling by entropy (uncertainty in determination of correct kth base)
Does not consider fine-tuning image analysis
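
The entropy measure and the ambiguity codes can be illustrated with a short sketch. The IUPAC table is standard, but the rule used here to pick a code (keep every base whose probability reaches a threshold) is an assumption made for the example and may differ from the criterion used in the paper.

```python
import math

# Standard IUPAC nucleotide codes, keyed by the set of bases they represent.
IUPAC = {
    frozenset("A"): "A", frozenset("C"): "C", frozenset("G"): "G", frozenset("T"): "T",
    frozenset("AG"): "R", frozenset("CT"): "Y", frozenset("CG"): "S",
    frozenset("AT"): "W", frozenset("GT"): "K", frozenset("AC"): "M",
    frozenset("CGT"): "B", frozenset("AGT"): "D", frozenset("ACT"): "H",
    frozenset("ACG"): "V", frozenset("ACGT"): "N",
}


def entropy(probs):
    """Shannon entropy (in bits) of the base probabilities at one position."""
    return sum(-p * math.log2(p) for p in probs.values() if p > 0)


def iupac_call(probs, threshold=0.25):
    """Hypothetical rule: report the code covering all bases with p >= threshold."""
    kept = frozenset(b for b, p in probs.items() if p >= threshold)
    return IUPAC.get(kept, "N")


confident = {"A": 0.97, "C": 0.01, "G": 0.01, "T": 0.01}
ambiguous = {"A": 0.50, "C": 0.03, "G": 0.45, "T": 0.02}

print(iupac_call(confident), round(entropy(confident), 3))
print(iupac_call(ambiguous), round(entropy(ambiguous), 3))
```

Entropy ranges from 0 bits (one base is certain) to 2 bits (all four bases equally likely), so it gives a per-position uncertainty on a fixed scale.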

BayesCall

Model-based approach to base calling
Main goal is to model sequencing process by taking stochasticity into account and by explicitly modeling how errors may arise
Obtain base calls by maximizing posterior distribution of sequences given observed data and assuming a uniform prior on sequences
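
With a uniform prior, the posterior is proportional to the likelihood, so the maximum a posteriori call is also the maximum-likelihood call. The sketch below shows this for a single cycle under a deliberately simplified model that is not BayesCall's: an independent Gaussian likelihood around an idealized intensity vector, with hypothetical `amplitude` and `sigma` parameters. The actual method models the whole sequencing process, so cycles are not independent there.

```python
import math

BASES = "ACGT"


def posterior(obs, amplitude=1.0, sigma=0.2):
    """Posterior over the four bases for one cycle, assuming a uniform prior and
    an isotropic Gaussian likelihood (a toy stand-in for the full model)."""
    log_like = []
    for i in range(len(BASES)):
        expected = [amplitude if k == i else 0.0 for k in range(len(BASES))]
        sq_err = sum((o - e) ** 2 for o, e in zip(obs, expected))
        log_like.append(-sq_err / (2 * sigma ** 2))
    top = max(log_like)                       # subtract the max for numerical stability
    weights = [math.exp(v - top) for v in log_like]
    total = sum(weights)
    return {b: w / total for b, w in zip(BASES, weights)}


def map_call(obs):
    """Base maximizing the posterior; equals the maximum-likelihood base here."""
    post = posterior(obs)
    return max(post, key=post.get)


post = posterior([0.9, 0.1, 0.0, 0.05])
call = map_call([0.9, 0.1, 0.0, 0.05])
```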

Swift

Performs both image analysis and base calling

Image Analysis

Background subtraction – minimal pixel value within a window around each pixel subtracted from central pixel’s value
Image correlation – alignment of images to reference cycle
Object identification and intensity extraction
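
The background subtraction step amounts to a sliding-window minimum filter. A minimal sketch, assuming the image is a list of rows and using a hypothetical `radius` parameter for the window size (windows are clipped at the image border):

```python
def subtract_background(image, radius=1):
    """Subtract from each pixel the minimum value found in the
    (2 * radius + 1)-wide window centered on it."""
    rows, cols = len(image), len(image[0])
    result = []
    for r in range(rows):
        out_row = []
        for c in range(cols):
            window = [image[rr][cc]
                      for rr in range(max(0, r - radius), min(rows, r + radius + 1))
                      for cc in range(max(0, c - radius), min(cols, c + radius + 1))]
            out_row.append(image[r][c] - min(window))
        result.append(out_row)
    return result


# One bright cluster on a flat background of 10.
image = [[10, 10, 10],
         [10, 50, 10],
         [10, 10, 10]]
cleaned = subtract_background(image)
```

In this example the flat background drops to zero and the cluster keeps its height above the background.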

Base Calling

Corrects for crosstalk by performing linear regression on crosstalk plots and using the slope to derive a correction matrix; repeated iteratively until the slope is zero
Phasing correction by ranking clusters by chastity (the ratio of the highest intensity to the sum of the top two intensities) - use top 400 clusters to estimate phasing and apply it as a correction
After correction, base with maximum intensity chosen as called base
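
Chastity ranking and the final call can be written in a few lines. This is a sketch rather than Swift's code: the function names and the toy cluster data are made up, and the intensities are assumed to be in A, C, G, T order.

```python
BASES = "ACGT"


def chastity(intensities):
    """Ratio of the highest intensity to the sum of the two highest."""
    top, second = sorted(intensities, reverse=True)[:2]
    total = top + second
    return top / total if total > 0 else 0.0


def purest_clusters(clusters, n=400):
    """The n clusters with the highest chastity, e.g. for estimating phasing."""
    return sorted(clusters, key=chastity, reverse=True)[:n]


def call_base(intensities):
    """After correction, the channel with the maximum intensity gives the base."""
    return BASES[max(range(len(BASES)), key=lambda i: intensities[i])]


clusters = [
    [100.0, 20.0, 5.0, 5.0],   # clean A signal
    [40.0, 38.0, 3.0, 2.0],    # ambiguous between A and C
    [2.0, 90.0, 4.0, 1.0],     # clean C signal
]
best_two = purest_clusters(clusters, n=2)
calls = [call_base(c) for c in clusters]
```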

Ibis

Method

Estimate sequencing chemistry model as a parameter directly from data using statistical learning
Training set from Bustard output using raw cluster intensities
Used a base caller with one SVM classifier per cycle, each taking as input the intensity values of the current cycle as well as the previous and following cycles (if they exist)
Data set created by aligning raw reads from a fraction of the tiles to a reference sequence, allowing mismatches
- Half of this set used as a training set and the other half as a test set used to check results of training
Estimate parameters for calculating a quality score given class assignment and distances to the classification/decision boundary from SVM
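
The input to each per-cycle classifier can be sketched as a window over neighboring cycles. The helper below is illustrative only (the function name and data layout are assumptions); it builds the feature vector for one cluster and cycle, leaving the SVM training and the quality-score fit aside.

```python
def cycle_features(cluster, cycle):
    """Concatenate the four-channel intensities of the previous, current, and
    following cycles, skipping neighbors that do not exist at the read ends."""
    features = []
    for t in (cycle - 1, cycle, cycle + 1):
        if 0 <= t < len(cluster):
            features.extend(cluster[t])
    return features


# One cluster, three cycles, intensities in A, C, G, T order (toy values).
cluster = [
    [0.9, 0.1, 0.0, 0.0],
    [0.2, 0.8, 0.1, 0.0],
    [0.0, 0.1, 0.1, 0.7],
]
first = cycle_features(cluster, 0)    # current + following cycle: 8 values
middle = cycle_features(cluster, 1)   # previous + current + following: 12 values
```

Feeding neighboring cycles to the classifier is what lets the learning procedure absorb effects such as phasing without modeling them explicitly.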

Comparison

Unlike Alta-Cyclic, includes base-specific phasing parameters, so it can correct raw intensities for T accumulation
Does not call an 'N' character for poor quality bases
Process is unique in that causes of sequencing error are not modeled separately
- Consider causes together by using neighboring signals in statistical learning procedure

References

Erlich, Y., Mitra, P.P., delaBastide, M., McCombie, W.R., Hannon, G.J. (2008) Alta-Cyclic: A self-optimizing base caller for next-generation sequencing. Nature Methods 5:679-682

Kao, W.-C., Stevens, K., Song, Y.S. (2009) BayesCall: A model-based base-calling algorithm for high-throughput short-read sequencing. Genome Research 19:1884-1895

Kircher, M., Stenzel, U., Kelso, J. (2009) Improved base calling for the Illumina Genome Analyzer using machine learning strategies. Genome Biol. 10(8):Article R83

Rougemont, J., Amzallag, A., Iseli, C., Farinelli, L., Xenarios, I., Naef, F. (2008) Probabilistic base calling of Solexa sequencing data. BMC Bioinformatics 9:Article 431

Whiteford, N., Skelly, T., Curtis, C., Ritchie, M.E., Löhr, A., Zaranek, A.W., Abnizova, I., Brown, C. (2009) Swift: Primary data analysis for the Illumina Solexa sequencing platform. Bioinformatics 25:2194-2199

