Difference between revisions of "Tandem Repeat Concepts"

From Genome Analysis Wiki
 
Latest revision as of 19:17, 25 February 2016

Introduction

Tandem repeats are a common polymorphism in the genome.

This wiki is an attempt to consolidate past work by many people, including:

  • Gary Benson's Tandem Repeats Finder (TRF)
  • Gymrek et al.
  • Highnam et al.

We talk about how tandem repeats are represented, their major characteristics, and a set of useful definitions and algorithms for working with them.

Definition

A tandem repeat is a series of contiguous repeats of a motif.


Concepts

Motif Canonical Class

Using the concept of shifting, the canonical class of a set of motifs can be defined through an equivalence relation.

  ACG ~ CGA if there exists a shift i such that s(ACG, i) = CGA

This relation can be shown to be reflexive, symmetric and transitive, so equivalence classes can be defined on it.

Example:

This is a distribution of motifs without collapsing

 A
 C
 G
 T
 AA
 CC
 GG 
 TT
 AC
 AG
 AT
 CG
 CT
 GT
 CA
 GA
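
Collapsing the list above by shift can be sketched as follows. This is a minimal illustration rather than the vt implementation: the function names are hypothetical, and using the lexicographically smallest shift as the representative of a class is an assumption made here for convenience.

```python
from collections import defaultdict


def shift_equivalent(a, b):
    """True if b is some shift (rotation) of a."""
    # Every rotation of a appears as a substring of a + a.
    return len(a) == len(b) and b in a + a


def canonical(motif):
    """Representative of the shift class (assumed: the smallest shift)."""
    doubled = motif + motif
    n = len(motif)
    return min(doubled[i:i + n] for i in range(n))


motifs = ["A", "C", "G", "T", "AA", "CC", "GG", "TT",
          "AC", "AG", "AT", "CG", "CT", "GT", "CA", "GA"]

# Group motifs that fall in the same shift class.
classes = defaultdict(list)
for m in motifs:
    classes[canonical(m)].append(m)

for representative, members in sorted(classes.items()):
    print(representative, members)
```

With this grouping AC and CA fall into one class, as do AG and GA, so the 16 motifs collapse into 14 classes.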
 

Still to show: all permutations, acyclic, shift, reverse complement.



Shifting

Consider the motifs ACTT, TACT and TTAC. The following stretches can be observed in the genome:

 GGGGGGACTTACTTACTTACTTACTTAGGGGG
 GGGGGGTACTTACTTACTTACTTACTTGGGGG
 GGGGGGTTACTTACTTACTTACTTACTGGGGG

Read from the right flank, however, the motifs could just as easily be CTTA, ACTT and TACT respectively.

Shifting the sequence is a useful concept for grouping such motifs together.

We define a shift as follows:

 A shift of a sequence by i positions slides the sequence i bases to the left, with the bases that fall off the front wrapped around to the end.
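
A shift is straightforward to express in code. The sketch below is illustrative only: the direction of the shift (leading bases wrap to the end) is inferred from the s(ACG, i) = CGA example and from the shift table in the Acyclicity section, and the function names are hypothetical.

```python
def shift(seq, i):
    """Slide seq left by i bases, wrapping the leading bases to the end."""
    if not seq:
        return seq
    i %= len(seq)
    return seq[i:] + seq[:i]


def all_shifts(seq):
    """All len(seq) shifts of seq, starting with the trivial shift 0."""
    return [shift(seq, i) for i in range(len(seq))]


print(all_shifts("ACTT"))  # ['ACTT', 'CTTA', 'TTAC', 'TACT']
```

The motifs ACTT, TACT and TTAC from the example all appear in this list, which is why they can be grouped together.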

Reverse Complement

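Reverse complement is a common concept in sequence analysis: the sequence is reversed and each base is replaced by its complement (A with T, C with G). For example, the second sequence below is the reverse complement of the first.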

 ACCCTCCCCTCT
 AGAGGGGAGGGT
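
A minimal sketch of the operation, assuming upper-case sequences over A, C, G and T only (the function name is hypothetical):

```python
# Complement table for the four DNA bases (upper case only; this sketch
# assumes ambiguity codes such as N do not occur).
_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Complement each base, then reverse the sequence."""
    return seq.translate(_COMPLEMENT)[::-1]


print(reverse_complement("ACCCTCCCCTCT"))  # AGAGGGGAGGGT
```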

Acyclicity

Motifs are required to be acyclic. For example, a motif ACACAC should just be represented by AC as it is 3 copies of AC.

  A sequence is cyclic if and only if it consists of multiple copies of a shorter subsequence.

The definition can be made more explicit as follows:

 A sequence is cyclic if and only if there exists a nontrivial shift of the sequence that is equal to the sequence.

Take, for example, the sequence ACACA. Is this a bona fide motif? After all, it looks like 2.5 copies of AC, so AC might seem more appropriate.

 shift 0: ACACA
 shift 1: CACAA
 shift 2: ACAAC
 shift 3: CAACA
 shift 4: AACAC

None of the nontrivial shifts is equal to ACACA, so ACACA is acyclic and is a bona fide motif.
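
The acyclicity check can be sketched as below. The function names are hypothetical, and the doubled-string test is a standard string trick used here in place of enumerating every shift; it is not taken from vt.

```python
def is_cyclic(seq):
    """True if some nontrivial shift of seq equals seq."""
    # seq reappears inside (seq + seq) with its two end characters removed
    # exactly when a shift by 1..len(seq)-1 maps seq onto itself.
    return seq in (seq + seq)[1:-1]


def acyclic_unit(seq):
    """Shortest unit whose whole-number repetition gives seq."""
    n = len(seq)
    for k in range(1, n + 1):
        if n % k == 0 and seq[:k] * (n // k) == seq:
            return seq[:k]
    return seq


print(is_cyclic("ACACAC"), acyclic_unit("ACACAC"))  # True AC
print(is_cyclic("ACACA"), acyclic_unit("ACACA"))    # False ACACA
```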

Fractional counts

While it is natural to think of the number of repeat units in a tract as an integer, it is useful to consider fractional counts, especially in large repeat tracts. This is done in Tandem Repeats Finder.

 GGGTTAAGGGTTAAGGGTTAAGGGTTAAGGGTTAAGGG

This is 5 full copies of the repeat GGGTTAA followed by a partial copy (GGG), or about 5.4 copies (38 bases / 7 bases per copy).
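
For an exact tract, a fractional count can be computed as tract length divided by motif length. The sketch below makes that assumption explicit and only handles exact, in-phase tracts; it is not how Tandem Repeats Finder derives its copy number, and the function name is hypothetical.

```python
def fractional_copies(tract, motif):
    """Copy number of motif in an exact tract, allowing a partial last copy.

    Returns None if the tract is not an exact repeat of motif starting in
    phase at the first base.
    """
    if not motif:
        return None
    full, rest = divmod(len(tract), len(motif))
    expected = motif * full + motif[:rest]
    if tract != expected:
        return None
    return len(tract) / len(motif)


tract = "GGGTTAAGGGTTAAGGGTTAAGGGTTAAGGGTTAAGGG"
print(round(fractional_copies(tract, "GGGTTAA"), 2))  # 5.43
```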

Scoring

You can use a score bounded by one.

By random matching -

TRF Scoring

A useful scheme is +2 for a match and -7 for an indel or mismatch; it provides a convenient way of scoring purity.
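
The +2/-7 scheme can be illustrated with a simplified scorer. Several things here are assumptions of the sketch rather than part of TRF: the tract is compared position by position against a perfect repeat (so indels are not modelled), the function name is hypothetical, and dividing by the best possible score is just one way to obtain a score bounded by one.

```python
MATCH = 2       # reward per matching base
PENALTY = -7    # cost per mismatching base (indels are not modelled here)


def purity_score(tract, motif):
    """Raw and normalized score of tract against a perfect repeat of motif."""
    if not tract or not motif:
        raise ValueError("tract and motif must be non-empty")
    copies = -(-len(tract) // len(motif))          # ceiling division
    perfect = (motif * copies)[:len(tract)]
    matches = sum(a == b for a, b in zip(tract, perfect))
    mismatches = len(tract) - matches
    raw = MATCH * matches + PENALTY * mismatches
    return raw, raw / (MATCH * len(tract))         # normalized value is <= 1


print(purity_score("ACACACAC", "AC"))  # (16, 1.0)
print(purity_score("ACACGCAC", "AC"))  # (7, 0.4375)
```

A single mismatch in an 8-base tract already more than halves the normalized score, which is what makes the heavy penalty convenient for measuring purity.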

Inexact Tandem Repeats

Inexact repeats such as the following are seen a lot:

 AAAAAAAAAAAGAAAAAAAAAAAAAA

or

 ACACACACACACGACACACAACACACACACAC

or

  GAAAGAAGGAAAAGAGAGAAAAGAAGAAGAA

Characteristics

  • motif length
  • motif basis
  • repeat tract length
  • purity

Open questions:

 Is AAAAC really different from AAAAAC?



Algorithm for Detection

Exact alignment algorithm

Fuzzy alignment algorithm

Detection of a motif in a sequence

The following trace shows how the algorithm works:

   ============================================
   ANNOTATING INDEL FUZZILY
   ********************************************
   EXTRACTIING REGION BY EXACT LEFT AND RIGHT ALIGNMENT
   
   20:131948:C/CCA
   EXACT REGION 131948-131965 (18) 
                CCACACACACACACACAA
   FINAL EXACT REGION 131948-131965 (18) 
                      CCACACACACACACACAA
   ********************************************
   PICK CANDIDATE MOTIFS
   
   Longest Allele : C[CA]CACACACACACACACAA
   detecting motifs for an str
   seq: CCACACACACACACACACAA
   len : 20
   cmax_len : 10
   candidate motifs: 25
   AC : 0.894737 2 0
   AAC : 0.5 3 0.0555556
   ACC : 0.5 3 0.0555556
   AAAC : 0.0588235 4 0.125 (< 2 copies)
   ACCC : 0.0588235 4 0.125 (< 2 copies)
   AACAC : 0.5 5 0.02
   ACACC : 0.5 5 0.02
   AAACAC : 0.0666667 6 0.0555556 (< 2 copies)
   ACACCC : 0.0666667 6 0.0555556 (< 2 copies)
   AACACAC : 0.5 7 0.0102041
   ACACACC : 0.5 7 0.0102041
   AAACACAC : 0.0769231 8 0.03125 (< 2 copies)
   ACACACCC : 0.0769231 8 0.03125 (< 2 copies)
   AACACACAC : 0.5 9 0.00617284 (< 2 copies)
   ACACACACC : 0.5 9 0.00617284 (< 2 copies)
   AAACACACAC : 0.0909091 10 0.02 (< 2 copies)
   ACACACACCC : 0.0909091 10 0.02 (< 2 copies)
   ********************************************
   PICKING NEXT BEST MOTIF
   
   selected:         AC 0.89 0.00
   ********************************************
   DETECTING REPEAT TRACT FUZZILY
   ++++++++++++++++++++++++++++++++++++++++++++
   Exact left/right alignment
   
   repeat_tract              : CACACACACACACACA
   position                  : [131949,131964]
   motif_concordance         : 1
   repeat units              : 8
   exact repeat units        : 8
   total no. of repeat units : 8
   
   ++++++++++++++++++++++++++++++++++++++++++++
   Fuzzy right alignment
   
   repeat motif : CA
   rflank       : AACTC
   mlen         : 2
   rflen        : 5
   plen         : 111
   
   read         : AGAAATGATAGTCACTTCAACAGATGGTGTTGGGAAAACTGGATTTCCACAGGCAGAACAAATGAAATGGATCCTTATCTTACACCACACACACACACACAAACTC
   rlen         : 106
   
   optimal score: 50.5073
   optimal state: MR
   optimal track: MR|r|0|5
   optimal probe len: 25
   optimal path length : 107
   max j: 106
   probe: (1~82) [1~10] (1~5)
   read : (1~82) [83~101] (102~106)
   
   motif #           : 10 [83,101]
   motif concordance : 95% (9/10)
   motif discordance : 0|1|0|0|0|0|0|0|0|0
   
   Model:  ----------------------------------------------------------------------------------CACACACACACACACACACAAACTC 
          SYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYMMMDMMMMMMMMMMMMMMMMMMMMME
                                                                                             oo++oo++oo++oo++oo++RRRRR 
   Read:   AGAAATGATAGTCACTTCAACAGATGGTGTTGGGAAAACTGGATTTCCACAGGCAGAACAAATGAAATGGATCCTTATCTTACAC-CACACACACACACACAAACTC 
   
   ++++++++++++++++++++++++++++++++++++++++++++
   Fuzzy left alignment
   
   lflank       : ATCTTA
   repeat motif : CA
   lflen        : 6
   mlen         : 2
   plen         : 111
   
   read         : ATCTTACACCACACACACACACACAAACTCAAAATGGATTTAAAGACTTAAATGTGAGCCTGGCAAACTTAAAACTCCTAAAATAAAACAGAAGGGAATATCTTT
   rlen         : 105
   
   optimal score: 50.5858
   optimal state: Z
   optimal track: Z|m|10|2
   optimal probe len: 26
   optimal path length : 106
   max j: 105
   mismatch penalty: 3
   
   model: (1~6) [1~10]
   read : (1~6) [7~25][26~106]
   
   motif #           : 10 [7,25]
   motif concordance : 95% (9/10)
   motif discordance : 0|1|0|0|0|0|0|0|0|0
   
   Model:  ATCTTACACACACACACACACACACA-------------------------------------------------------------------------------- 
          SMMMMMMMMMDMMMMMMMMMMMMMMMMZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZE
           LLLLLLoo++oo++oo++oo++oo++                                                                                 
   Read:   ATCTTACAC-CACACACACACACACAAACTCAAAATGGATTTAAAGACTTAAATGTGAGCCTGGCAAACTTAAAACTCCTAAAATAAAACAGAAGGGAATATCTTT 
   
   xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx
   VNTR Summary
   rid          : 19
   motif        : AC
   ru           : CA
   
   Exact
   repeat_tract                    : CACACACACACACACA
   position                        : [131949,131964]
   reference repeat unit length    : 8
   motif_concordance               : 1
   repeat units                    : 8
   exact repeat units              : 8
   total no. of repeat units       : 8
   
   Fuzzy
   repeat_tract                    : CACCACACACACACACACA
   position                        : [131946,131964]
   reference repeat unit length    : 19
   motif_concordance               : 0.95
   repeat units                    : 19
   exact repeat units              : 9
   total no. of repeat units       : 10
   xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx
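
The first number on each candidate motif line in the trace above appears to be consistent with the fraction of k-mers in the sequence that are a shift of the candidate. This is an inference from the printed values, not a confirmed description of vt, and the function names below are hypothetical. A minimal sketch under that assumption:

```python
def canonical(kmer):
    """Lexicographically smallest shift of kmer (assumed representative)."""
    doubled = kmer + kmer
    return min(doubled[i:i + len(kmer)] for i in range(len(kmer)))


def motif_fraction(seq, motif):
    """Fraction of k-mers in seq that are a shift of motif."""
    k = len(motif)
    total = len(seq) - k + 1
    if k == 0 or total <= 0:
        return 0.0
    target = canonical(motif)
    hits = sum(canonical(seq[i:i + k]) == target for i in range(total))
    return hits / total


seq = "CCACACACACACACACACAA"
for motif in ("AC", "AAC", "ACC", "AAAC"):
    print(motif, round(motif_fraction(seq, motif), 6))
```

For the sequence in the trace this gives 17/19 for AC, 9/18 for AAC and ACC, and 1/17 for AAAC, matching the 0.894737, 0.5, 0.5 and 0.0588235 printed there.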

Model free left alignment and right alignment

Model based fuzzy left alignment and right alignment

Model free fuzzy left alignment and right alignment

Implementation

This is implemented in vt.

Citation

Maintained by

This page is maintained by Adrian.