Base Caller Summaries

From Genome Analysis Wiki
Latest revision as of 17:04, 12 March 2010

Standard Illumina Base Caller (Bustard)

Sequencing-by-Synthesis (SBS)

  • DNA sample obtained, containing many copies of same sequences and randomly fragmented
  • Single-stranded DNA fragments attached to slide and amplified so there is a cluster of each fragment
  • DNA polymerase and 4 reversible terminator bases (with distinct fluorescent markers) added
  • Clusters excited by lasers and images taken at optimal wavelengths for the 4 fluorophores
  • Fluorophores and terminators removed and process repeated for L cycles

Image Analysis

  • Corrects for imperfect repositioning of camera and aberrations of lens by aligning images to reference from original cycle
  • Signal for each cluster characterized as time series data of fluorescence intensities and noise

Base Calling

  • Converts fluorescence signals into actual sequence data with quality scores
  • Takes intensities of four channels for every cluster in each cycle and determines concentration of each base
  • Renormalizes concentrations by multiplying by the ratio of the average concentration in the first cycle to the average concentration in the current cycle
  • Uses Markov model to determine transition matrix modeling probability of phasing (no new base synthesized), prephasing (two new bases synthesized), and normal incorporation
  • Uses transition matrix and observed concentrations of each base to determine concentrations in absence of phasing and reports these as base calls
    • Assumes crosstalk matrix constant for a given sequencing run and that phasing affects all nucleotides in the same way
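
The Markov model of phasing described above can be sketched as follows. This is a minimal illustration, not Bustard's implementation: the function name phasing_matrix, the fixed per-cycle probabilities, and the assumption that strands behave independently are all choices made here for the example.

```python
def phasing_matrix(n_cycles, p_phase, p_prephase):
    """Return Q where Q[t][j] is the fraction of strands in a cluster
    that emit the signal of template position j during cycle t (0-based).

    Assumes (for illustration) constant phasing and prephasing
    probabilities that are the same for all four nucleotides.
    """
    p_normal = 1.0 - p_phase - p_prephase
    # state[k] = fraction of strands that have incorporated k bases so far.
    # A strand advances at most two bases per cycle, so 2 * n_cycles + 1
    # states are enough to conserve probability mass.
    state = [0.0] * (2 * n_cycles + 1)
    state[0] = 1.0
    rows = []
    for _ in range(n_cycles):
        nxt = [0.0] * len(state)
        for k, mass in enumerate(state):
            if mass == 0.0:
                continue
            nxt[k] += mass * p_phase          # phasing: no new base
            nxt[k + 1] += mass * p_normal     # normal: one new base
            nxt[k + 2] += mass * p_prephase   # prephasing: two new bases
        state = nxt
        # A strand holding k bases shows the label of template position k - 1.
        rows.append(state[1:n_cycles + 1])
    return rows
```

Under this model the observed signal in cycle t is approximately the sum over j of Q[t][j] times the true signal of position j, so removing phasing amounts to inverting Q. With both probabilities set to zero, Q is the identity matrix and no correction is needed.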

General Noise Factors

  • Phasing
    • Failures in nucleotide incorporation or block removal or incorporation of more than one nucleotide in a particular cycle
  • Fading
    • Decay in fluorescent signal intensity with each cycle
    • Likely attributable to material loss during sequencing
  • Crosstalk
    • C channel illumination overlaps with A: a C label fluoresces in A channel (similarly G and T overlap)
    • Likely caused by overlap in dye emission frequencies
  • T Accumulation
    • The fluorophores used for thymine are not always removed properly after each iteration
    • Intensity of T signal increases across sequencing run
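
Two of these noise factors lend themselves to a small worked sketch. The code below assumes that crosstalk between the A and C channels is linear with known leak coefficients, and that fading can be undone by rescaling each cycle to the mean of the first cycle (the renormalization described for the standard base caller). The function names and the leak values used in the example are hypothetical.

```python
def unmix_pair(obs_a, obs_c, leak_c_into_a, leak_a_into_c):
    """Recover the true A and C signals from the observed ones.

    Assumed linear mixing model (illustrative only):
        obs_a = true_a + leak_c_into_a * true_c
        obs_c = leak_a_into_c * true_a + true_c
    """
    det = 1.0 - leak_c_into_a * leak_a_into_c
    true_a = (obs_a - leak_c_into_a * obs_c) / det
    true_c = (obs_c - leak_a_into_c * obs_a) / det
    return true_a, true_c


def renormalize(intensities_by_cycle):
    """Scale every cycle so that its mean equals the mean of the first cycle."""
    first = intensities_by_cycle[0]
    first_mean = sum(first) / len(first)
    out = []
    for cycle in intensities_by_cycle:
        mean = sum(cycle) / len(cycle)
        out.append([x * first_mean / mean for x in cycle])
    return out


# A pure A signal of 100 with a 20% leak into the C channel
# is observed as (100, 20); unmixing recovers roughly (100, 0).
print(unmix_pair(100.0, 20.0, leak_c_into_a=0.4, leak_a_into_c=0.2))
```

The same 2x2 solve would apply to the G and T pair. A full correction would use a 4x4 crosstalk matrix estimated from the run rather than fixed coefficients.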

Alta-Cyclic

Training Stage

  • Learns run-specific noise patterns according to model and finds optimized solution reducing effect of noise sources using a Support Vector Machine (SVM)
  • Half of training set used for cross-validation

Base Calling Stage

  • Reports all sequences from run with optimized parameters

Differences from Standard Illumina Base Caller

  • Calling parameters optimized empirically and tested to enhance accuracy of each run
  • Calculates phasing parameters based on parametric model
  • Dynamically tracks changes in crosstalk, which disrupt signals in later cycles

Probabilistic Base Calling

  • Produces an alternative probabilistic base calling method based on the fluorescence intensity quantifications that uses:
    • Extended IUPAC alphabet to code ambiguous bases
    • Information criterion to control length of trustable reads
  • Reduced systematic bias by addressing:
    • Crosstalk
    • Dephasing
    • Optical effect in which tiles in the center of the image appear brighter; corrected by fitting a 2D loess model to intensities and subtracting the difference between fit and median intensities
  • Measures level of uncertainty in base calling by entropy (uncertainty in determination of correct kth base)
  • Does not consider fine-tuning image analysis
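
The entropy measure and the use of ambiguity codes can be illustrated with a short sketch. The entropy is the standard Shannon entropy of the four base probabilities at one position. The rule for choosing an IUPAC code (add bases in order of decreasing probability until a cumulative mass threshold is reached) is an assumption made for this example and is not taken from the paper; the names entropy_bits, iupac_call and the mass parameter are likewise hypothetical.

```python
import math

# Standard IUPAC nucleotide ambiguity codes, keyed by the sorted base set.
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "AG": "R", "CT": "Y", "CG": "S", "AT": "W", "GT": "K", "AC": "M",
    "CGT": "B", "AGT": "D", "ACT": "H", "ACG": "V", "ACGT": "N",
}


def entropy_bits(probs):
    """Shannon entropy (in bits) of a dict mapping base -> probability."""
    return -sum(p * math.log2(p) for p in probs.values() if p > 0)


def iupac_call(probs, mass=0.9):
    """Smallest set of most probable bases whose total probability
    reaches `mass`, reported as an IUPAC code (illustrative rule only)."""
    chosen, total = [], 0.0
    for base in sorted(probs, key=probs.get, reverse=True):
        chosen.append(base)
        total += probs[base]
        if total >= mass:
            break
    return IUPAC["".join(sorted(chosen))]


position = {"A": 0.5, "C": 0.03, "G": 0.45, "T": 0.02}
print(iupac_call(position), round(entropy_bits(position), 3))
```

Entropy is 0 bits when one base is certain and 2 bits when all four are equally likely, so it gives a per-position uncertainty that could feed a criterion for where to stop trusting a read.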

BayesCall

  • Model-based approach to base calling
  • Main goal is to model sequencing process by taking stochasticity into account and by explicitly modeling how errors may arise
  • Obtain base calls by maximizing posterior distribution of sequences given observed data and assuming a uniform prior on sequences
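
The effect of the uniform prior can be written out explicitly. The symbols below are notation introduced here for illustration: S is a candidate sequence and I is the observed intensity data.

```latex
\hat{S}
  = \arg\max_{S} P(S \mid I)
  = \arg\max_{S} \frac{P(I \mid S)\, P(S)}{P(I)}
  = \arg\max_{S} P(I \mid S)
```

The last step holds because P(I) does not depend on S and a uniform prior makes P(S) a constant, so the maximum a posteriori call coincides with the maximum likelihood call under the model.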

Swift

Performs both image analysis and base calling

Image Analysis

  • Background subtraction – minimal pixel value within a window around each pixel subtracted from central pixel’s value
  • Image correlation – alignment of images to reference cycle
  • Object identification and intensity extraction
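
The background subtraction step can be sketched in a few lines. This is an illustration under stated assumptions, not Swift's code: the image is a plain list of rows, the window is a square of the given radius that includes the central pixel, and windows are truncated at the image border. The name subtract_background and the radius parameter are hypothetical.

```python
def subtract_background(image, radius=1):
    """Subtract from each pixel the minimum value found in the
    (2 * radius + 1)-wide square window centered on it."""
    height, width = len(image), len(image[0])
    out = []
    for y in range(height):
        row = []
        for x in range(width):
            window = [
                image[yy][xx]
                for yy in range(max(0, y - radius), min(height, y + radius + 1))
                for xx in range(max(0, x - radius), min(width, x + radius + 1))
            ]
            row.append(image[y][x] - min(window))
        out.append(row)
    return out


# A bright spot of 9 on a flat background of 5 becomes 4 on a background of 0.
tile = [[5, 5, 5],
        [5, 9, 5],
        [5, 5, 5]]
print(subtract_background(tile))
```

A local minimum is a cheap estimate of background that adapts to slow brightness changes across the image, provided the window is wider than a cluster.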

Base Calling

  • Corrects for crosstalk by performing linear regression on crosstalk plots and using the slope to derive a correction matrix; repeated iteratively until the slope is zero
  • Corrects for phasing by ranking clusters by chastity (the ratio of the highest intensity to the sum of the top two intensities); the top 400 clusters are used to estimate phasing, which is then applied as a correction
  • After correction, base with maximum intensity chosen as called base
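
Chastity ranking and the final maximum-intensity call can be sketched as below. For simplicity each cluster is represented by a single cycle's four intensities in A, C, G, T order, which is an assumption of this example; the function names and the zero-denominator guard are also introduced here.

```python
BASES = "ACGT"


def chastity(intensities):
    """Highest intensity divided by the sum of the two highest."""
    top, second = sorted(intensities, reverse=True)[:2]
    denom = top + second
    # Guard (an assumption of this sketch): treat empty signal as chastity 0.
    return top / denom if denom > 0 else 0.0


def purest_clusters(clusters, n=400):
    """The n clusters with the highest chastity, best first."""
    return sorted(clusters, key=chastity, reverse=True)[:n]


def call_base(intensities):
    """Base whose (corrected) channel intensity is largest."""
    return BASES[max(range(4), key=lambda i: intensities[i])]


clusters = [[10.0, 10.0, 0.0, 0.0], [100.0, 1.0, 0.0, 0.0], [5.0, 2.0, 60.0, 3.0]]
best = purest_clusters(clusters, n=2)
print([call_base(c) for c in best])
```

Chastity ranges from 0.5 (two channels equally bright) to 1.0 (a single clean channel), so the highest-ranked clusters give the least ambiguous signal from which to estimate phasing.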

Ibis

Method

  • Estimate sequencing chemistry model as a parameter directly from data using statistical learning
  • Training set from Bustard output using raw cluster intensities
  • Uses a base caller with one SVM classifier per cycle; each classifier takes as input the intensity values of the current cycle as well as those of the previous and following cycles (if they exist)
  • Data set created by aligning raw reads with mismatches for a fraction of the tiles to a reference sequence
    • Half of this set used as a training set and the other half as a test set used to check results of training
  • Estimate parameters for calculating a quality score given class assignment and distances to the classification/decision boundary from SVM
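
The input to each per-cycle classifier can be illustrated with a small sketch. The layout of the feature vector (current cycle first, then previous, then following) and the name cycle_features are assumptions of this example; only the choice of which cycles contribute comes from the description above.

```python
def cycle_features(cluster, cycle):
    """Feature vector for one cluster at one cycle.

    cluster: list of per-cycle [A, C, G, T] intensities.
    Returns the intensities of the current cycle followed by those of
    the previous and following cycles, where they exist.
    """
    features = list(cluster[cycle])
    if cycle > 0:
        features += cluster[cycle - 1]
    if cycle + 1 < len(cluster):
        features += cluster[cycle + 1]
    return features


cluster = [[1, 2, 3, 4], [5, 6, 7, 8], [9, 10, 11, 12]]
print(len(cycle_features(cluster, 0)))  # first cycle: 8 values
print(len(cycle_features(cluster, 1)))  # middle cycle: 12 values
```

Because the first and last cycles have only one neighbor, their classifiers see 8 values instead of 12, which is one reason a separate classifier per cycle is convenient.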

Comparison

  • Unlike Alta-Cyclic, includes base-specific phasing parameters so can correct raw intensities for T accumulation
  • Does not call an 'N' character for poor quality bases
  • Process is unique in that causes of sequencing error are not modeled separately
    • Causes are considered together by using neighboring signals in the statistical learning procedure

References

Erlich, Y., Mitra, P.P., delaBastide, M., McCombie, W.R., Hannon, G.J. (2008) Alta-Cyclic: A self-optimizing base caller for next-generation sequencing. Nature Methods 5:679-682

Kao, W.-C., Stevens, K., Song, Y.S. (2009) BayesCall: A model-based base-calling algorithm for high-throughput short-read sequencing. Genome Research 19:1884-1895

Kircher, M., Stenzel, U., Kelso, J. (2009) Improved base calling for the Illumina Genome Analyzer using machine learning strategies. Genome Biology 10(8):Article R83

Rougemont, J., Amzallag, A., Iseli, C., Farinelli, L., Xenarios, I., Naef, F. (2008) Probabilistic base calling of Solexa sequencing data. BMC Bioinformatics 9:Article 431

Whiteford, N., Skelly, T., Curtis, C., Ritchie, M.E., Löhr, A., Zaranek, A.W., Abnizova, I., Brown, C. (2009) Swift: Primary data analysis for the Illumina Solexa sequencing platform. Bioinformatics 25:2194-2199