Difference between revisions of "Tandem Repeat Concepts"
Revision as of 19:15, 25 February 2016
Introduction
Tandem repeats are a common polymorphism in the genome.
This wiki attempts to consolidate past work by many people, including:
- Gary Benson's Tandem Repeats Finder (TRF)
- Gymrek et al.
- Highnam et al.
Definition
A tandem repeat is a series of contiguous copies of a motif.
Concepts
Motif Canonical Class
Using the concept of shifting, the canonical class of a set of motifs can be defined through an equivalence relation.
ACG ~ CGA because there exists a shift i such that s(ACG, i) = CGA.
This relation can be shown to be reflexive, symmetric and transitive, so equivalence classes can be defined on it.
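The equivalence relation above can be sketched in a few lines of Python. This is only an illustration: the function names are hypothetical, and choosing the lexicographically smallest shift as the class representative is an assumed convention, not something this page specifies.

```python
def shift(motif: str, i: int) -> str:
    """Rotate a non-empty motif left by i positions (leading bases wrap to the end)."""
    i %= len(motif)
    return motif[i:] + motif[:i]


def canonical(motif: str) -> str:
    """Return one representative per shift class.

    Assumption: the lexicographically smallest shift is used as the
    representative; any fixed, deterministic choice would work equally well.
    """
    return min(shift(motif, i) for i in range(len(motif)))


def same_class(a: str, b: str) -> bool:
    """Two motifs are equivalent if one is a shift of the other."""
    return len(a) == len(b) and canonical(a) == canonical(b)


if __name__ == "__main__":
    print(canonical("CGA"))          # representative of the class containing ACG
    print(same_class("ACG", "CGA"))  # shifts of each other
    print(same_class("ACG", "AGC"))  # not shifts of each other
```

Comparing canonical representatives avoids testing every pair of shifts explicitly, which is convenient when collapsing a large distribution of motifs.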
Example:
This is a distribution of motifs without collapsing
A C G T AA CC GG TT AC AG AT CG CT GT CA GA
- show all permutations
- show acyclic
- show shift
- show reverse complement
Shifting
Consider the motifs ACTT, TACT and TTAC. The following stretches can be observed in the genome:
GGGGGGACTTACTTACTTACTTACTTAGGGGG
GGGGGGTACTTACTTACTTACTTACTTGGGGG
GGGGGGTTACTTACTTACTTACTTACTGGGGG
Read from the right flank, however, the motifs could just as easily be CTTA, ACTT and TACT, respectively.
The concept of shifting a sequence is useful for grouping such related motifs together.
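We define a shift as follows:
A shift of a sequence by i positions, s(sequence, i), slides the sequence i bases to the left, with the bases that fall off the front wrapped around to the end. For example, s(ACG, 1) = CGA.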
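A minimal sketch of the shift operation follows. It assumes a left rotation, which is the direction consistent with the ACG ~ CGA example and the ACACA shift table elsewhere on this page; the function names are illustrative only.

```python
def shift(seq: str, i: int) -> str:
    """Rotate a non-empty sequence left by i positions.

    The first i bases are removed from the front and appended to the end.
    """
    i %= len(seq)
    return seq[i:] + seq[:i]


def all_shifts(seq: str) -> list[str]:
    """List every shift of seq, starting with the trivial shift 0."""
    return [shift(seq, i) for i in range(len(seq))]


if __name__ == "__main__":
    # ACTT, CTTA, TTAC and TACT are all shifts of one another.
    for i, s in enumerate(all_shifts("ACTT")):
        print(f"shift {i}: {s}")
```

Under this definition the motifs read from the left flank (ACTT, TACT, TTAC) and from the right flank (CTTA, ACTT, TACT) all fall into the same set of shifts.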
Reverse Complement
The reverse complement is a common concept in sequence analysis.
For example, the reverse complement of ACCCTCCCCTCT is AGAGGGGAGGGT.
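As an illustration, the reverse complement can be computed as below. This sketch assumes an upper-case alphabet of A, C, G and T only; the function name is hypothetical.

```python
# Translation table mapping each base to its complement.
_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq: str) -> str:
    """Complement every base, then reverse the result."""
    return seq.translate(_COMPLEMENT)[::-1]


if __name__ == "__main__":
    print(reverse_complement("ACCCTCCCCTCT"))
```

A motif and its reverse complement describe the same repeat seen from opposite strands, which is why this operation matters when grouping motifs.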
Acyclicity
Motifs are required to be acyclic. For example, a motif ACACAC should just be represented by AC as it is 3 copies of AC.
A sequence is cyclic if and only if it consists of multiple copies of one of its subsequences.
The definition can be made more explicit as follows:
A sequence is cyclic if and only if there exists a nontrivial shift of the sequence that is identical to the sequence itself.
Take, for example, the sequence ACACA. Is this a bona fide motif? After all, it looks like 2.5 copies of AC, so AC might be more appropriate.
shift 0: ACACA
shift 1: CACAA
shift 2: ACAAC
shift 3: CAACA
shift 4: AACAC
No nontrivial shift reproduces ACACA, so it is acyclic and therefore a bona fide motif.
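The shift-based test for cyclicity can be sketched as follows. The helper names are hypothetical, and `reduce_motif` is an added convenience (not described on this page) that collapses a cyclic motif to its shortest repeating unit.

```python
def is_cyclic(seq: str) -> bool:
    """True if some nontrivial shift of seq equals seq itself."""
    return any(seq[i:] + seq[:i] == seq for i in range(1, len(seq)))


def reduce_motif(seq: str) -> str:
    """Return the shortest unit whose whole-number repetition gives seq."""
    n = len(seq)
    for k in range(1, n + 1):
        # Only unit lengths that divide n can tile seq exactly.
        if n % k == 0 and seq[:k] * (n // k) == seq:
            return seq[:k]
    return seq  # only reached for the empty string


if __name__ == "__main__":
    for motif in ("ACACAC", "ACACA"):
        print(motif, is_cyclic(motif), reduce_motif(motif))
```

With these definitions ACACAC is reported as cyclic and reduces to AC, while ACACA is acyclic and is returned unchanged.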
Fractional counts
While it is natural to think of the number of copies in a repeat tract as an integer, it is useful to consider fractional counts, especially in large repeat tracts. This is done in Tandem Repeats Finder.
GGGTTAAGGGTTAAGGGTTAAGGGTTAAGGGTTAAGGG
This is about 5.4 copies of the repeat GGGTTAA: 5 full copies plus the first 3 of its 7 bases.
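One simple way to obtain a fractional count for a pure tract is to divide the tract length by the motif length. The sketch below does this under the assumption that the tract is an exact repetition of the motif starting at the motif's first base; the function name is hypothetical and this is not necessarily how Tandem Repeats Finder computes its copy number.

```python
def fractional_copies(tract: str, motif: str) -> float:
    """Copy number of motif in a pure tract, allowing a partial last copy."""
    full, partial = divmod(len(tract), len(motif))
    expected = motif * full + motif[:partial]
    if tract != expected:
        raise ValueError("tract is not a pure repetition of the motif")
    return len(tract) / len(motif)


if __name__ == "__main__":
    tract = "GGGTTAAGGGTTAAGGGTTAAGGGTTAAGGGTTAAGGG"
    print(fractional_copies(tract, "GGGTTAA"))
```

Inexact tracts would need an alignment against the motif before a copy number could be derived in this way.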
Scoring
You can use a score bounded by one.
By random matching -
TRF Scoring
A useful scheme is +2 for each match and -7 for each indel or mismatch; it provides a convenient way of scoring purity.
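The +2/-7 scheme can be illustrated with a deliberately simplified sketch. It compares the tract base by base against a perfect repetition of the motif, so it handles mismatches only; scoring indels would require an alignment step that is not shown here. The function name and the ungapped comparison are assumptions made for illustration.

```python
def ungapped_score(tract: str, motif: str, match: int = 2, penalty: int = -7) -> int:
    """Score a tract against a perfect repeat of motif, position by position.

    Simplification: no indels are modelled, so every position is either a
    match (+2 by default) or a mismatch (-7 by default).
    """
    score = 0
    for pos, base in enumerate(tract):
        expected = motif[pos % len(motif)]
        score += match if base == expected else penalty
    return score


if __name__ == "__main__":
    print(ungapped_score("ACACAC", "AC"))  # pure tract
    print(ungapped_score("ACAGAC", "AC"))  # one mismatch
```

Because a single mismatch costs more than three matches gain, impure tracts are penalised quickly, which is what makes the score a useful proxy for purity.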
Inexact Tandem Repeats
We see a lot of these:
AAAAAAAAAAAGAAAAAAAAAAAAAA
or
ACACACACACACGACACACAACACACACACAC
or
GAAAGAAGGAAAAGAGAGAAAAGAAGAAGAA
Characteristics
- motif length
- motif basis
- repeat tract length
- purity
Questions
Is AAAAC really different from AAAAAC?
Algorithm for Detection
Exact alignment algorithm
Fuzzy alignment algorithm
Detection of a motif in a sequence
Model free left alignment and right alignment
Model based fuzzy left alignment and right alignment
Model free fuzzy left alignment and right alignment
Implementation
This is implemented in vt.
Citation
Maintained by
This page is maintained by Adrian.