Introduction

Tandem repeats are a common polymorphism in the genome.

This wiki is an attempt to consolidate past work by many people.

We discuss the representation of tandem repeats, their major characteristics, and a set of useful definitions and algorithms for working with them.

• Gary Benson's Tandem Repeats Finder
• Gymrek et al.
• Highnam et al.

Definition

A tandem repeat is a series of contiguous copies of a motif.

Concepts

Motif Canonical Class

Using the concept of shifting, the canonical class of a set of motifs can be defined through an equivalence relation.

```  ACG ~ CGA if there exists a shift that allows s(ACG, i) = CGA
```

This relation can be shown to be reflexive, symmetric and transitive, so it is an equivalence relation, and equivalence classes of motifs can be defined on it.
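The relation can be tested directly: b is a shift of a exactly when the two have the same length and b occurs in a concatenated with itself. A minimal Python sketch follows; the function name `shift_equivalent` is illustrative and not taken from any library.

```python
def shift_equivalent(a: str, b: str) -> bool:
    """Return True if b can be obtained from a by some shift (rotation)."""
    # Every rotation of a appears as a length-len(a) window of a + a.
    return len(a) == len(b) and b in a + a


# ACG ~ CGA holds, while ACG and AGC are not related by any shift.
print(shift_equivalent("ACG", "CGA"))
print(shift_equivalent("ACG", "AGC"))
```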

Example:

This is a distribution of motifs without collapsing:

``` A
C
G
T
AA
CC
GG
TT
AC
AG
AT
CG
CT
GT
CA
GA

```

To do: show all permutations, show acyclic, show shift, show reverse complement.
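The collapsing steps can be sketched in Python for motifs of length 1 and 2. This is an illustration only: picking the lexicographically smallest member of a class as its representative is an assumption made here, and the helper names (`shifts`, `is_acyclic`, `canonical`) are hypothetical.

```python
from itertools import product

COMPLEMENT = str.maketrans("ACGT", "TGCA")


def shifts(motif):
    """All rotations of a motif, starting with the motif itself."""
    return [motif[i:] + motif[:i] for i in range(len(motif))]


def is_acyclic(motif):
    """True if no nontrivial shift reproduces the motif."""
    return all(s != motif for s in shifts(motif)[1:])


def canonical(motif, use_reverse_complement=False):
    """Smallest member of the motif's class (assumed representative)."""
    candidates = shifts(motif)
    if use_reverse_complement:
        rc = motif.translate(COMPLEMENT)[::-1]
        candidates += shifts(rc)
    return min(candidates)


# All permutations of length 1 and 2.
all_motifs = ["".join(p) for k in (1, 2) for p in product("ACGT", repeat=k)]
# Drop cyclic motifs such as AA.
acyclic = [m for m in all_motifs if is_acyclic(m)]
# Collapse by shift, then by shift and reverse complement.
by_shift = sorted({canonical(m) for m in acyclic})
by_shift_rc = sorted({canonical(m, True) for m in acyclic})

print(len(all_motifs), len(acyclic), len(by_shift), len(by_shift_rc))
```

Each step reduces the number of distinct motifs that need to be tracked.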

Shifting

Consider the motifs ACTT, TACT and TTAC. The following stretches can be observed in the genome:

``` GGGGGGACTTACTTACTTACTTACTTAGGGGG
GGGGGGTACTTACTTACTTACTTACTTGGGGG
GGGGGGTTACTTACTTACTTACTTACTGGGGG
```

Read from the right flank, however, the motifs could just as easily be CTTA, ACTT and TACT respectively.

The concept of shifting a sequence is useful for grouping such similar motifs together.

We define a shift as follows:

``` A shift of a sequence by i positions slides the sequence i positions to the left, with the leading i bases wrapped around to the end.
```
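A minimal Python sketch of a shift. The direction (leading bases wrap to the end) is chosen to match the shift table for ACACA in the Acyclicity section, where shift 1 of ACACA is CACAA; the function name `shift` is illustrative.

```python
def shift(seq: str, i: int) -> str:
    """Slide seq left by i positions, wrapping the leading bases to the end."""
    if not seq:
        return seq
    i %= len(seq)  # shifts repeat with period len(seq)
    return seq[i:] + seq[:i]


# The left-flank and right-flank readings of each stretch differ by one shift.
for motif in ("ACTT", "TACT", "TTAC"):
    print(motif, "->", shift(motif, 1))
```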

Reverse Complement

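Reverse complement is a common concept in sequence analysis. The reverse complement of a sequence is obtained by reversing it and replacing each base with its complement (A with T, C with G). For example, the two sequences below are reverse complements of each other.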

``` ACCCTCCCCTCT
AGAGGGGAGGGT
```
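A short Python sketch of the operation, assuming sequences over the plain A, C, G, T alphabet (ambiguity codes are not handled):

```python
# Translation table mapping each base to its complement.
_COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")


def reverse_complement(seq: str) -> str:
    """Complement every base, then reverse the result."""
    return seq.translate(_COMPLEMENT)[::-1]


print(reverse_complement("ACCCTCCCCTCT"))
```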

Acyclicity

Motifs are required to be acyclic. For example, a motif ACACAC should just be represented by AC as it is 3 copies of AC.

```  A sequence is cyclic if and only if it consists of multiple copies of a shorter subsequence.
```

The definition can be made more explicit as follows:

``` A sequence is cyclic if and only if there exists a nontrivial shift of the sequence that is identical to the sequence.
```

Take, for example, the sequence ACACA. Is this a bona fide motif? After all, it looks like 2.5 copies of AC, so AC might be more appropriate.

``` shift 0: ACACA
shift 1: CACAA
shift 2: ACAAC
shift 3: CAACA
shift 4: AACAC

```

No nontrivial shift reproduces ACACA, so it is acyclic and therefore a bona fide motif.
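The shift-based definition translates into a short check. The Python sketch below also reduces a cyclic sequence to its shortest repeating unit; both function names are illustrative.

```python
def is_cyclic(seq: str) -> bool:
    """True if some nontrivial shift of seq equals seq."""
    return any(seq[i:] + seq[:i] == seq for i in range(1, len(seq)))


def reduce_to_acyclic(seq: str) -> str:
    """Return the shortest unit whose whole copies make up seq."""
    n = len(seq)
    for k in range(1, n + 1):
        # Only unit lengths that divide n can tile seq exactly.
        if n % k == 0 and seq[:k] * (n // k) == seq:
            return seq[:k]
    return seq


print(is_cyclic("ACACAC"), reduce_to_acyclic("ACACAC"))
print(is_cyclic("ACACA"), reduce_to_acyclic("ACACA"))
```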

Fractional counts

While it is natural to think of the number of copies in a repeat tract as an integer, it is useful to consider fractional counts, especially in large repeat tracts. This is done in Tandem Repeats Finder.

``` GGGTTAAGGGTTAAGGGTTAAGGGTTAAGGGTTAAGGG
```

This is about 5.4 copies of the repeat GGGTTAA: 5 full copies followed by the first 3 of its 7 bases.
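For an exact tract, one simple way to get a fractional count is to divide the tract length by the motif length. The Python sketch below assumes that definition and does not attempt to reproduce how Tandem Repeats Finder computes its copy number; the function name is hypothetical.

```python
def fractional_copies(tract: str, motif: str) -> float:
    """Copy number of motif in an exact tract that starts on a motif boundary."""
    if not motif:
        raise ValueError("motif must not be empty")
    # Build enough copies of the motif to cover the tract, then compare.
    copies_needed = len(tract) // len(motif) + 1
    if (motif * copies_needed)[: len(tract)] != tract:
        raise ValueError("tract is not an exact repeat of the motif")
    return len(tract) / len(motif)


tract = "GGGTTAA" * 5 + "GGG"  # 5 full copies plus a partial copy
print(round(fractional_copies(tract, "GGGTTAA"), 2))
print(fractional_copies("ACACA", "AC"))
```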

Scoring

You can use a score bounded by one.

By random matching -

TRF Scoring

A useful scheme is +2 for each match and -7 for each mismatch or indel. It provides a convenient way of scoring purity.
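As a rough illustration of the +2/-7 scheme, the Python sketch below scores a tract against perfect copies of a motif position by position. It handles substitutions only; scoring indels requires an alignment, which is omitted here. The function name and its parameters are hypothetical.

```python
def ungapped_score(tract: str, motif: str, match: int = 2, mismatch: int = -7) -> int:
    """Score a tract against a perfect repeat of motif, without gaps."""
    score = 0
    for pos, base in enumerate(tract):
        expected = motif[pos % len(motif)]  # base a pure repeat would have here
        score += match if base == expected else mismatch
    return score


print(ungapped_score("ACACAC", "AC"))  # pure tract: every base matches
print(ungapped_score("ACAGAC", "AC"))  # one substitution lowers the score
```

A pure tract scores twice its length, so each substitution costs 9 points relative to that maximum.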

Inexact Tandem Repeats

We see a lot of these:

``` AAAAAAAAAAAGAAAAAAAAAAAAAA
```

or

``` ACACACACACACGACACACAACACACACACAC
```

or

```  GAAAGAAGGAAAAGAGAGAAAAGAAGAAGAA
```

Characteristics

• motif length
• motif basis
• repeat tract length
• purity

Questions:

``` Is AAAAC really different from AAAAAC?
```

Algorithm for Detection

Exact alignment algorithm

Fuzzy alignment algorithm

Detection of a motif in a sequence

The following trace shows how the algorithm works:
```   ============================================
ANNOTATING INDEL FUZZILY
********************************************
EXTRACTIING REGION BY EXACT LEFT AND RIGHT ALIGNMENT

20:131948:C/CCA
EXACT REGION 131948-131965 (18)
CCACACACACACACACAA
FINAL EXACT REGION 131948-131965 (18)
CCACACACACACACACAA
********************************************
PICK CANDIDATE MOTIFS

Longest Allele : C[CA]CACACACACACACACAA
detecting motifs for an str
seq: CCACACACACACACACACAA
len : 20
cmax_len : 10
candidate motifs: 25
AC : 0.894737 2 0
AAC : 0.5 3 0.0555556
ACC : 0.5 3 0.0555556
AAAC : 0.0588235 4 0.125 (< 2 copies)
ACCC : 0.0588235 4 0.125 (< 2 copies)
AACAC : 0.5 5 0.02
ACACC : 0.5 5 0.02
AAACAC : 0.0666667 6 0.0555556 (< 2 copies)
ACACCC : 0.0666667 6 0.0555556 (< 2 copies)
AACACAC : 0.5 7 0.0102041
ACACACC : 0.5 7 0.0102041
AAACACAC : 0.0769231 8 0.03125 (< 2 copies)
ACACACCC : 0.0769231 8 0.03125 (< 2 copies)
AACACACAC : 0.5 9 0.00617284 (< 2 copies)
ACACACACC : 0.5 9 0.00617284 (< 2 copies)
AAACACACAC : 0.0909091 10 0.02 (< 2 copies)
ACACACACCC : 0.0909091 10 0.02 (< 2 copies)
********************************************
PICKING NEXT BEST MOTIF

selected:         AC 0.89 0.00
********************************************
DETECTING REPEAT TRACT FUZZILY
++++++++++++++++++++++++++++++++++++++++++++
Exact left/right alignment

repeat_tract              : CACACACACACACACA
position                  : [131949,131964]
motif_concordance         : 1
repeat units              : 8
exact repeat units        : 8
total no. of repeat units : 8

++++++++++++++++++++++++++++++++++++++++++++
Fuzzy right alignment

repeat motif : CA
rflank       : AACTC
mlen         : 2
rflen        : 5
plen         : 111

rlen         : 106

optimal score: 50.5073
optimal state: MR
optimal track: MR|r|0|5
optimal probe len: 25
optimal path length : 107
max j: 106
probe: (1~82) [1~10] (1~5)

motif #           : 10 [83,101]
motif concordance : 95% (9/10)
motif discordance : 0|1|0|0|0|0|0|0|0|0

Model:  ----------------------------------------------------------------------------------CACACACACACACACACACAAACTC
SYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYMMMDMMMMMMMMMMMMMMMMMMMMME
oo++oo++oo++oo++oo++RRRRR

++++++++++++++++++++++++++++++++++++++++++++
Fuzzy left alignment

lflank       : ATCTTA
repeat motif : CA
lflen        : 6
mlen         : 2
plen         : 111

rlen         : 105

optimal score: 50.5858
optimal state: Z
optimal track: Z|m|10|2
optimal probe len: 26
optimal path length : 106
max j: 105
mismatch penalty: 3

model: (1~6) [1~10]

motif #           : 10 [7,25]
motif concordance : 95% (9/10)
motif discordance : 0|1|0|0|0|0|0|0|0|0

Model:  ATCTTACACACACACACACACACACA--------------------------------------------------------------------------------
SMMMMMMMMMDMMMMMMMMMMMMMMMMZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZE
LLLLLLoo++oo++oo++oo++oo++

xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx
VNTR Summary
rid          : 19
motif        : AC
ru           : CA

Exact
repeat_tract                    : CACACACACACACACA
position                        : [131949,131964]
reference repeat unit length    : 8
motif_concordance               : 1
repeat units                    : 8
exact repeat units              : 8
total no. of repeat units       : 8

Fuzzy
repeat_tract                    : CACCACACACACACACACA
position                        : [131946,131964]
reference repeat unit length    : 19
motif_concordance               : 0.95
repeat units                    : 19
exact repeat units              : 9
total no. of repeat units       : 10
xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx
```
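The "Exact left/right alignment" step in the trace reports the stretch of the region made of whole, exact repeat units. The Python sketch below is one simple way to find such a stretch, allowing the unit to be any shift of the motif. It is an illustration under that assumption, not the implementation that produced the trace, and the function name is hypothetical.

```python
def longest_exact_tract(seq: str, motif: str) -> tuple:
    """Return (start, end), half-open, of the longest run of whole exact units."""
    m = len(motif)
    doubled = motif + motif  # every shift of motif is a window of this string
    best = (0, 0)
    for start in range(len(seq) - m + 1):
        unit = seq[start:start + m]
        if unit not in doubled:
            continue  # this window is not a shift of the motif
        end = start + m
        while seq[end:end + m] == unit:  # extend one whole unit at a time
            end += m
        if end - start > best[1] - best[0]:
            best = (start, end)
    return best


region = "CCACACACACACACACAA"  # FINAL EXACT REGION from the trace
start, end = longest_exact_tract(region, "AC")
print(region[start:end], (end - start) // 2)
```

On the region from the trace this selects the same 16-base tract of 8 CA units that the trace lists as the exact repeat tract.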

Implementation

This is implemented in vt.