Changes

From Genome Analysis Wiki
8,277 bytes added, 19:17, 25 February 2016
Line 1: Line 1:  
= Introduction =
 
= Introduction =
   −
This page is about Tandem Repeats.
+
Tandem repeats are a common polymorphism in the genome.   
 +
 
 +
This page attempts to consolidate past work by many people.
 +
 
 +
It covers the representation of tandem repeats, their major characteristics, and a set of useful definitions and algorithms for working with them.
 +
 
 +
* Gary Benson's Tandem Repeats Finder
 +
* Gymrek et al.
 +
* Highnam et al.
    
= Definition =
 
= Definition =
Line 7: Line 15:  
A series of repeats that are contiguous
 
A series of repeats that are contiguous
   −
= Classification =
      +
= Concepts =
 +
 +
== Motif Canonical Class ==
 +
 +
Using the concept of shifting, the canonical class of a set of motifs can be defined through
 +
an equivalence relation.
 +
 +
  ACG ~ CGA if there exists a shift i such that s(ACG, i) = CGA
 +
 +
This relation can be shown to be reflexive, symmetric and transitive, so equivalence
 +
classes can be defined on it.
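
One way to make the equivalence class concrete is to pick a single representative per class. The sketch below assumes the representative is the lexicographically smallest shift; that choice, and the names `shifts` and `canonical`, are illustrative and not taken from any existing tool.

```python
def shifts(motif):
    """Return every shift (rotation) of motif, starting with shift 0."""
    return [motif[i:] + motif[:i] for i in range(len(motif))]


def canonical(motif):
    """Class representative: the lexicographically smallest shift (an assumption)."""
    return min(shifts(motif))


# ACG, CGA and GAC are shifts of one another, so they share one representative.
for m in ("ACG", "CGA", "GAC"):
    print(m, "->", canonical(m))
```

Two motifs are then in the same class exactly when their representatives are equal, which turns collapsing a motif distribution into a dictionary lookup.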
 +
 +
Example:
 +
 +
This is a distribution of motifs without collapsing
 +
 +
  A
 +
  C
 +
  G
 +
  T
 +
  AA
 +
  CC
 +
  GG
 +
  TT
 +
  AC
 +
  AG
 +
  AT
 +
  CG
 +
  CT
 +
  GT
 +
  CA
 +
  GA
 +
 
 +
show all permutations
 +
show acyclic
 +
show shift
 +
show reverse complement
 +
 +
 +
 
 +
 +
 +
=== Shifting ===
 +
 +
Consider the motifs ACTT, TACT and TTAC. The following stretches can be observed in the genome:
 +
 +
  GGGGGGACTTACTTACTTACTTACTTAGGGGG
 +
  GGGGGGTACTTACTTACTTACTTACTTGGGGG
 +
  GGGGGGTTACTTACTTACTTACTTACTGGGGG
 +
 +
but read from the right flank, the motifs can just as easily be CTTA, ACTT and TACT respectively.
 +
 +
Shifting the sequence is useful for grouping such related motifs together.
 +
 +
We define a shift as follows:
 +
 +
  A shift of a sequence by i slides the sequence i positions to the left, with the bases that fall off the front wrapped around to the end.
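
A minimal sketch of the shift operation, assuming a shift by i moves the first i bases to the end (the direction that turns ACTT into CTTA at shift 1). The names `shift` and `is_shift_of` are hypothetical helpers for illustration.

```python
def shift(seq, i):
    """Slide seq left by i positions; bases leaving the front wrap to the end."""
    if not seq:
        return seq
    i %= len(seq)
    return seq[i:] + seq[:i]


def is_shift_of(a, b):
    """True if some shift of a equals b."""
    return len(a) == len(b) and any(shift(a, i) == b for i in range(max(len(a), 1)))


# The three motifs from the example are shifts of one another.
print(shift("ACTT", 1))             # first base wraps to the end
print(is_shift_of("ACTT", "TACT"))
print(is_shift_of("ACTT", "TTAC"))
```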
 +
 +
=== Reverse Complement ===
 +
 +
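Reverse complement is a common concept in sequence analysis: the sequence is reversed and each base is replaced by its complement, giving the sequence as read off the opposite strand.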
 +
 +
  ACCCTCCCCTCT
 +
  AGAGGGGAGGGT
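
The pair above can be reproduced with a short sketch. It assumes an uppercase A/C/G/T alphabet only; `reverse_complement` is an illustrative name.

```python
# Translation table mapping each base to its complement.
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Complement every base, then reverse the result."""
    return seq.translate(COMPLEMENT)[::-1]


print(reverse_complement("ACCCTCCCCTCT"))
```

Applying the function twice returns the original sequence, which is a quick sanity check for any implementation.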
 +
 +
== Acyclicity  ==
 +
 +
Motifs are required to be acyclic.  For example, a motif ACACAC should just be represented by AC as it is 3 copies of AC.
 +
 +
  A sequence is cyclic if and only if it consists of multiple whole copies of a shorter subsequence.
 +
 +
The definition can be more explicit as follows:
 +
 +
  A sequence is cyclic if and only if there exists a non-trivial shift of the sequence that is identical to the sequence.
 +
 +
Take, for example, the sequence ACACA. Is this a bona fide motif? After all, it looks like 2.5 copies of AC, and AC might be
 +
more appropriate.
 +
 +
  shift 0: ACACA
 +
  shift 1: CACAA
 +
  shift 2: ACAAC
 +
  shift 3: CAACA
 +
  shift 4: AACAC
 +
 +
No non-trivial shift reproduces ACACA, so it is acyclic and therefore a bona fide motif.
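
The shift test can be written directly as code. This is a sketch under the definition given in this section; `is_cyclic` and `reduce_motif` are hypothetical names, and `reduce_motif` is one possible way to recover the acyclic unit.

```python
def is_cyclic(seq):
    """True if some non-trivial shift (0 < i < len) reproduces seq."""
    return any(seq[i:] + seq[:i] == seq for i in range(1, len(seq)))


def reduce_motif(seq):
    """Return the shortest unit whose whole copies spell seq."""
    n = len(seq)
    for k in range(1, n + 1):
        if n % k == 0 and seq[:k] * (n // k) == seq:
            return seq[:k]
    return seq


for m in ("ACACAC", "ACACA"):
    print(m, is_cyclic(m), reduce_motif(m))
```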
 +
 +
== Fractional counts  ==
 +
 +
While it is natural to think of the number of repeat units in a tract as an integer, it is useful to consider fractional counts, especially in large repeat tracts.
 +
This is done in Tandem Repeats Finder.
 +
 +
  GGGTTAAGGGTTAAGGGTTAAGGGTTAAGGGTTAAGGG
 +
 +
This is roughly 5.4 copies of the repeat GGGTTAA: 5 whole copies plus 3 of the 7 bases of a sixth.
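
A minimal sketch of a fractional count, assuming the tract is a pure repeat so that the count is simply tract length divided by motif length. Tandem Repeats Finder derives its copy number from an alignment, so this is a simplification and `copy_number` is an illustrative name.

```python
def copy_number(tract, motif):
    """Fractional number of motif copies in a pure repeat tract."""
    return len(tract) / len(motif)


motif = "GGGTTAA"
tract = motif * 5 + "GGG"   # five whole copies plus a partial sixth
whole, partial = divmod(len(tract), len(motif))
print(whole, "whole copies,", partial, "extra bases,", copy_number(tract, motif))
```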
 +
 +
== Scoring  ==
 +
 
 +
You can use a score bounded by one. 
 +
 +
By random matching -
 +
 +
== TRF Scoring  ==
 +
 +
A useful scheme is +2 for matches and -7 for mismatches and indels; it provides a convenient way of scoring purity.
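
As a rough illustration of the +2/-7 scheme, the sketch below scores a tract from counts of matched, mismatched and inserted or deleted bases. How those counts are obtained (the alignment itself) is outside this sketch, and `trf_style_score` is a hypothetical name.

```python
def trf_style_score(matches, mismatches, indels, match=2, penalty=7):
    """Score = +match per matching base, -penalty per mismatch or indel."""
    return match * matches - penalty * (mismatches + indels)


# A pure 20-base tract versus the same tract with one mismatched base.
pure = trf_style_score(matches=20, mismatches=0, indels=0)
one_mismatch = trf_style_score(matches=19, mismatches=1, indels=0)
print(pure, one_mismatch)
```

Because one error costs as much as several matches gain, impure tracts drop in score quickly, which is what makes the scheme usable as a purity measure.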
 +
 +
== Inexact Tandem Repeats  ==
 +
 +
We see a lot of these:
 +
 +
  AAAAAAAAAAAGAAAAAAAAAAAAAA
 +
 +
or
 +
 +
  ACACACACACACGACACACAACACACACACAC
 +
 +
or
 +
 +
  GAAAGAAGGAAAAGAGAGAAAAGAAGAAGAA
 +
 +
= Characteristics =
 +
 
 
* motif length
 
* motif length
 
* motif basis
 
* motif basis
* repeat tract lengfth
+
* repeat tract length
 
* purity
 
* purity
 +
 +
Open questions:
 +
 +
  Is AAAAC really different from AAAAAC?
 +
 +
 
 +
    
== Algorithm for Detection ==
 
== Algorithm for Detection ==
 +
 +
Exact alignment algorithm
 +
 +
Fuzzy alignment algorithm
 +
 +
== Detection of a motif in a sequence ==
 +
 +
  The following shows the trace of how the algorithm works
 +
 +
    ============================================
 +
    ANNOTATING INDEL FUZZILY
 +
    ********************************************
 +
    EXTRACTIING REGION BY EXACT LEFT AND RIGHT ALIGNMENT
 +
   
 +
    20:131948:C/CCA
 +
    EXACT REGION 131948-131965 (18)
 +
                CCACACACACACACACAA
 +
    FINAL EXACT REGION 131948-131965 (18)
 +
                      CCACACACACACACACAA
 +
    ********************************************
 +
    PICK CANDIDATE MOTIFS
 +
   
 +
    Longest Allele : C[CA]CACACACACACACACAA
 +
    detecting motifs for an str
 +
    seq: CCACACACACACACACACAA
 +
    len : 20
 +
    cmax_len : 10
 +
    candidate motifs: 25
 +
    AC : 0.894737 2 0
 +
    AAC : 0.5 3 0.0555556
 +
    ACC : 0.5 3 0.0555556
 +
    AAAC : 0.0588235 4 0.125 (< 2 copies)
 +
    ACCC : 0.0588235 4 0.125 (< 2 copies)
 +
    AACAC : 0.5 5 0.02
 +
    ACACC : 0.5 5 0.02
 +
    AAACAC : 0.0666667 6 0.0555556 (< 2 copies)
 +
    ACACCC : 0.0666667 6 0.0555556 (< 2 copies)
 +
    AACACAC : 0.5 7 0.0102041
 +
    ACACACC : 0.5 7 0.0102041
 +
    AAACACAC : 0.0769231 8 0.03125 (< 2 copies)
 +
    ACACACCC : 0.0769231 8 0.03125 (< 2 copies)
 +
    AACACACAC : 0.5 9 0.00617284 (< 2 copies)
 +
    ACACACACC : 0.5 9 0.00617284 (< 2 copies)
 +
    AAACACACAC : 0.0909091 10 0.02 (< 2 copies)
 +
    ACACACACCC : 0.0909091 10 0.02 (< 2 copies)
 +
    ********************************************
 +
    PICKING NEXT BEST MOTIF
 +
   
 +
    selected:        AC 0.89 0.00
 +
    ********************************************
 +
    DETECTING REPEAT TRACT FUZZILY
 +
    ++++++++++++++++++++++++++++++++++++++++++++
 +
    Exact left/right alignment
 +
   
 +
    repeat_tract              : CACACACACACACACA
 +
    position                  : [131949,131964]
 +
    motif_concordance        : 1
 +
    repeat units              : 8
 +
    exact repeat units        : 8
 +
    total no. of repeat units : 8
 +
   
 +
    ++++++++++++++++++++++++++++++++++++++++++++
 +
    Fuzzy right alignment
 +
   
 +
    repeat motif : CA
 +
    rflank      : AACTC
 +
    mlen        : 2
 +
    rflen        : 5
 +
    plen        : 111
 +
   
 +
    read        : AGAAATGATAGTCACTTCAACAGATGGTGTTGGGAAAACTGGATTTCCACAGGCAGAACAAATGAAATGGATCCTTATCTTACACCACACACACACACACAAACTC
 +
    rlen        : 106
 +
   
 +
    optimal score: 50.5073
 +
    optimal state: MR
 +
    optimal track: MR|r|0|5
 +
    optimal probe len: 25
 +
    optimal path length : 107
 +
    max j: 106
 +
    probe: (1~82) [1~10] (1~5)
 +
    read : (1~82) [83~101] (102~106)
 +
   
 +
    motif #          : 10 [83,101]
 +
    motif concordance : 95% (9/10)
 +
    motif discordance : 0|1|0|0|0|0|0|0|0|0
 +
   
 +
    Model:  ----------------------------------------------------------------------------------CACACACACACACACACACAAACTC
 +
          SYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYMMMDMMMMMMMMMMMMMMMMMMMMME
 +
                                                                                              oo++oo++oo++oo++oo++RRRRR
 +
    Read:  AGAAATGATAGTCACTTCAACAGATGGTGTTGGGAAAACTGGATTTCCACAGGCAGAACAAATGAAATGGATCCTTATCTTACAC-CACACACACACACACAAACTC
 +
   
 +
    ++++++++++++++++++++++++++++++++++++++++++++
 +
    Fuzzy left alignment
 +
   
 +
    lflank      : ATCTTA
 +
    repeat motif : CA
 +
    lflen        : 6
 +
    mlen        : 2
 +
    plen        : 111
 +
   
 +
    read        : ATCTTACACCACACACACACACACAAACTCAAAATGGATTTAAAGACTTAAATGTGAGCCTGGCAAACTTAAAACTCCTAAAATAAAACAGAAGGGAATATCTTT
 +
    rlen        : 105
 +
   
 +
    optimal score: 50.5858
 +
    optimal state: Z
 +
    optimal track: Z|m|10|2
 +
    optimal probe len: 26
 +
    optimal path length : 106
 +
    max j: 105
 +
    mismatch penalty: 3
 +
   
 +
    model: (1~6) [1~10]
 +
    read : (1~6) [7~25][26~106]
 +
   
 +
    motif #          : 10 [7,25]
 +
    motif concordance : 95% (9/10)
 +
    motif discordance : 0|1|0|0|0|0|0|0|0|0
 +
   
 +
    Model:  ATCTTACACACACACACACACACACA--------------------------------------------------------------------------------
 +
          SMMMMMMMMMDMMMMMMMMMMMMMMMMZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZE
 +
            LLLLLLoo++oo++oo++oo++oo++                                                                               
 +
    Read:  ATCTTACAC-CACACACACACACACAAACTCAAAATGGATTTAAAGACTTAAATGTGAGCCTGGCAAACTTAAAACTCCTAAAATAAAACAGAAGGGAATATCTTT
 +
   
 +
    xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx
 +
    VNTR Summary
 +
    rid          : 19
 +
    motif        : AC
 +
    ru          : CA
 +
   
 +
    Exact
 +
    repeat_tract                    : CACACACACACACACA
 +
    position                        : [131949,131964]
 +
    reference repeat unit length    : 8
 +
    motif_concordance              : 1
 +
    repeat units                    : 8
 +
    exact repeat units              : 8
 +
    total no. of repeat units      : 8
 +
   
 +
    Fuzzy
 +
    repeat_tract                    : CACCACACACACACACACA
 +
    position                        : [131946,131964]
 +
    reference repeat unit length    : 19
 +
    motif_concordance              : 0.95
 +
    repeat units                    : 19
 +
    exact repeat units              : 9
 +
    total no. of repeat units      : 10
 +
    xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx
 +
 +
== Model free left alignment and right alignment ==
 +
 +
 +
 +
 +
== Model based fuzzy left alignment and right alignment ==
 +
 +
== Model free fuzzy left alignment and right alignment ==
    
= Implementation =
 
= Implementation =