Line 1,153: |
| </div> | | </div> |
| | | |
− | === Annotate Indels === | + | === Remove overlap === |
| | | |
− | Annotates indels with VNTR information and adds a VNTR record. Facilitates the simultaneous calling of VNTR together with Indels and SNPs.
| + | Removes overlapping variants in a VCF file by tagging such variants with the FILTER flag overlap. |
| | | |
| <div class=" mw-collapsible mw-collapsed"> | | <div class=" mw-collapsible mw-collapsed"> |
Line 1,162: |
| | | |
| <div style="height:20em; overflow:auto; border: 2px solid #FFF"> | | <div style="height:20em; overflow:auto; border: 2px solid #FFF"> |
− | CHROM POS ID REF ALT QUAL FILTER INFO
| + | |
− | 20 82079 . G A 1255.98 . NSAMPLES=1;E=43;N=51;ESUM=43;NSUM=51;FLANKSEQ=GGAGCACGCC[G/A]CCATGCCCGG
| |
− | 20 82217 . G A 1632.77 . NSAMPLES=1;E=56;N=61;ESUM=56;NSUM=61;FLANKSEQ=GAGCCACCGC[G/A]CCCGGCCCAG
| |
− | 20 83250 . CTGTGTGTG C . . NSAMPLES=1;E=18;N=35;ESUM=18;NSUM=35;FLANKS=83250,83304;FZ_FLANKS=83250,83303;FLANKSEQ=TCTCTCTCTC[TGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGT]TTAGTATTTG;GMOTIF=GT;TR=20:83251:TGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTG:<VNTR>:GT
| |
− | 20 83250 . CTGTGTGTGTG C . . NSAMPLES=1;E=3;N=35;ESUM=3;NSUM=35;FLANKS=83250,83304;FZ_FLANKS=83250,83303;FLANKSEQ=TCTCTCTCTC[TGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGT]TTAGTATTTG;GMOTIF=GT;TR=20:83251:TGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTG:<VNTR>:GT
| |
− | 20 83251 . TGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTG <VNTR> . . MOTIF=GT;RU=TG;FZ_CONCORDANCE=1;FZ_RL=52;FZ_LL=0;FLANKS=83250,83304;FZ_FLANKS=83250,83303;FZ_RU_COUNTS=26,26;FLANKSEQ=TCTCTCTCTC[TGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTG]TTTAGTATTT
| |
− | 20 83252 . G C 359.204 . NSAMPLES=1;E=13;N=14;ESUM=13;NSUM=14;FLANKSEQ=CTCTCTCTCT[G/C]TGTGTGTGTG
| |
− | 20 83260 . G C 500.163 . NSAMPLES=1;E=18;N=34;ESUM=18;NSUM=34;FLANKSEQ=CTGTGTGTGT[G/C]TGTGTGTGTG
| |
− | 20 83267 . T C 247.043 . NSAMPLES=1;E=11;N=43;ESUM=11;NSUM=43;FLANKSEQ=TGTGTGTGTG[T/C]GTGTGTGTGT
| |
− | 20 83275 . T C 609.669 . NSAMPLES=1;E=24;N=43;ESUM=24;NSUM=43;FLANKSEQ=TGTGTGTGTG[T/C]GTGTGTGTGT
| |
− | 20 90008 . C A 1546.88 . NSAMPLES=1;E=52;N=60;ESUM=52;NSUM=60;FLANKSEQ=AACAGAAAAC[C/A]AAATACTGTA
| |
− | 20 91088 . C T 1766.04 . NSAMPLES=1;E=58;N=66;ESUM=58;NSUM=66;FLANKSEQ=CCCAGCATAC[C/T]ATGGTTGTGC
| |
− | 20 91508 . G A 1266.93 . NSAMPLES=1;E=44;N=53;ESUM=44;NSUM=53;FLANKSEQ=AATTAGTAAG[G/A]CTTACGTAAG
| |
− | 20 91707 . C T 888.134 . NSAMPLES=1;E=30;N=53;ESUM=30;NSUM=53;FLANKSEQ=TGATTTTCTA[C/T]AGCAGGACCT
| |
− | 20 92527 . A G 828.593 . NSAMPLES=1;E=34;N=40;ESUM=34;NSUM=40;FLANKSEQ=ATTAATTGCC[A/G]TTCTCTCTTT
| |
− | 20 93440 . A G 688.144 . NSAMPLES=1;E=24;N=58;ESUM=24;NSUM=58;FLANKSEQ=TTGGATGCAT[A/G]GTCTGTAAAT
| |
− | 20 93636 . TTTTTTCTTTCTTTTTTTTTTTTTTTTTTTTTTTT <VNTR> . . MOTIF=T;RU=T;FZ_CONCORDANCE=0.939394;FZ_RL=35;FZ_LL=0;FLANKS=93646,93671;FZ_FLANKS=93635,93671;FZ_RU_COUNTS=31,33;FLANKSEQ=TCTAGGATTC[TTTTTTCTTTCTTTTTTTTTTTTTTTTTTTTTTTT]GAGATGGAGT
| |
− | 20 93646 . C CT . . NSAMPLES=1;E=2;N=29;ESUM=2;NSUM=29;FLANKS=93646,93671;FZ_FLANKS=93635,93671;FLANKSEQ=TTTTTCTTTC[TTTTTTTTTTTTTTTTTTTTTTTT]GAGATGGAGT;GMOTIF=T;TR=20:93636:TTTTTTCTTTCTTTTTTTTTTTTTTTTTTTTTTTT:<VNTR>:T
| |
− | 20 93717 . A T 31.7622 . NSAMPLES=1;E=2;N=29;ESUM=2;NSUM=29;FLANKSEQ=CAGTGGCGTG[A/T]TCTTAGATCA
| |
− | 20 93931 . G A 628.149 . NSAMPLES=1;E=22;N=53;ESUM=22;NSUM=53;FLANKSEQ=GATTACAGGT[G/A]TGAGCCGCTG
| |
− | 20 100699 . C T 809.09 . NSAMPLES=1;E=28;N=61;ESUM=28;NSUM=61;FLANKSEQ=GGTGAAAAAT[C/T]ACCTGTCAGT
| |
− | 20 101362 . G A 1087.13 . NSAMPLES=1;E=36;N=67;ESUM=36;NSUM=67;FLANKSEQ=TAATACTGAA[G/A]TTTACTTCTC
| |
| | | |
− | </div> | + | <div class="mw-collapsible-content"> |
− | | + | |
− | The following shows the trace of how the algorithm works | + | remove_overlap v0.57 |
| | | |
− | ============================================
| + | description : Removes overlapping variants in a VCF file by tagging such variants with the FILTER flag overlap. |
− | ANNOTATING INDEL FUZZILY
| |
− | ********************************************
| |
− | EXTRACTIING REGION BY EXACT LEFT AND RIGHT ALIGNMENT
| |
− |
| |
− | 20:131948:C/CCA
| |
− | EXACT REGION 131948-131965 (18)
| |
− | CCACACACACACACACAA
| |
− | FINAL EXACT REGION 131948-131965 (18)
| |
− | CCACACACACACACACAA
| |
− | ********************************************
| |
− | PICK CANDIDATE MOTIFS
| |
− |
| |
− | Longest Allele : C[CA]CACACACACACACACAA
| |
− | detecting motifs for an str
| |
− | seq: CCACACACACACACACACAA
| |
− | len : 20
| |
− | cmax_len : 10
| |
− | candidate motifs: 25
| |
− | AC : 0.894737 2 0
| |
− | AAC : 0.5 3 0.0555556
| |
− | ACC : 0.5 3 0.0555556
| |
− | AAAC : 0.0588235 4 0.125 (< 2 copies)
| |
− | ACCC : 0.0588235 4 0.125 (< 2 copies)
| |
− | AACAC : 0.5 5 0.02
| |
− | ACACC : 0.5 5 0.02
| |
− | AAACAC : 0.0666667 6 0.0555556 (< 2 copies)
| |
− | ACACCC : 0.0666667 6 0.0555556 (< 2 copies)
| |
− | AACACAC : 0.5 7 0.0102041
| |
− | ACACACC : 0.5 7 0.0102041
| |
− | AAACACAC : 0.0769231 8 0.03125 (< 2 copies)
| |
− | ACACACCC : 0.0769231 8 0.03125 (< 2 copies)
| |
− | AACACACAC : 0.5 9 0.00617284 (< 2 copies)
| |
− | ACACACACC : 0.5 9 0.00617284 (< 2 copies)
| |
− | AAACACACAC : 0.0909091 10 0.02 (< 2 copies)
| |
− | ACACACACCC : 0.0909091 10 0.02 (< 2 copies)
| |
− | ********************************************
| |
− | PICKING NEXT BEST MOTIF
| |
− |
| |
− | selected: AC 0.89 0.00
| |
− | ********************************************
| |
− | DETECTING REPEAT TRACT FUZZILY
| |
− | ++++++++++++++++++++++++++++++++++++++++++++
| |
− | Exact left/right alignment
| |
− |
| |
− | repeat_tract : CACACACACACACACA
| |
− | position : [131949,131964]
| |
− | motif_concordance : 1
| |
− | repeat units : 8
| |
− | exact repeat units : 8
| |
− | total no. of repeat units : 8
| |
− |
| |
− | ++++++++++++++++++++++++++++++++++++++++++++
| |
− | Fuzzy right alignment
| |
− |
| |
− | repeat motif : CA
| |
− | rflank : AACTC
| |
− | mlen : 2
| |
− | rflen : 5
| |
− | plen : 111
| |
− |
| |
− | read : AGAAATGATAGTCACTTCAACAGATGGTGTTGGGAAAACTGGATTTCCACAGGCAGAACAAATGAAATGGATCCTTATCTTACACCACACACACACACACAAACTC
| |
− | rlen : 106
| |
− |
| |
− | optimal score: 50.5073
| |
− | optimal state: MR
| |
− | optimal track: MR|r|0|5
| |
− | optimal probe len: 25
| |
− | optimal path length : 107
| |
− | max j: 106
| |
− | probe: (1~82) [1~10] (1~5)
| |
− | read : (1~82) [83~101] (102~106)
| |
− |
| |
− | motif # : 10 [83,101]
| |
− | motif concordance : 95% (9/10)
| |
− | motif discordance : 0|1|0|0|0|0|0|0|0|0
| |
− |
| |
− | Model: ----------------------------------------------------------------------------------CACACACACACACACACACAAACTC
| |
− | SYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYMMMDMMMMMMMMMMMMMMMMMMMMME
| |
− | oo++oo++oo++oo++oo++RRRRR
| |
− | Read: AGAAATGATAGTCACTTCAACAGATGGTGTTGGGAAAACTGGATTTCCACAGGCAGAACAAATGAAATGGATCCTTATCTTACAC-CACACACACACACACAAACTC
| |
− |
| |
− | ++++++++++++++++++++++++++++++++++++++++++++
| |
− | Fuzzy left alignment
| |
− |
| |
− | lflank : ATCTTA
| |
− | repeat motif : CA
| |
− | lflen : 6
| |
− | mlen : 2
| |
− | plen : 111
| |
− |
| |
− | read : ATCTTACACCACACACACACACACAAACTCAAAATGGATTTAAAGACTTAAATGTGAGCCTGGCAAACTTAAAACTCCTAAAATAAAACAGAAGGGAATATCTTT
| |
− | rlen : 105
| |
− |
| |
− | optimal score: 50.5858
| |
− | optimal state: Z
| |
− | optimal track: Z|m|10|2
| |
− | optimal probe len: 26
| |
− | optimal path length : 106
| |
− | max j: 105
| |
− | mismatch penalty: 3
| |
− |
| |
− | model: (1~6) [1~10]
| |
− | read : (1~6) [7~25][26~106]
| |
− |
| |
− | motif # : 10 [7,25]
| |
− | motif concordance : 95% (9/10)
| |
− | motif discordance : 0|1|0|0|0|0|0|0|0|0
| |
− |
| |
− | Model: ATCTTACACACACACACACACACACA--------------------------------------------------------------------------------
| |
− | SMMMMMMMMMDMMMMMMMMMMMMMMMMZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZE
| |
− | LLLLLLoo++oo++oo++oo++oo++
| |
− | Read: ATCTTACAC-CACACACACACACACAAACTCAAAATGGATTTAAAGACTTAAATGTGAGCCTGGCAAACTTAAAACTCCTAAAATAAAACAGAAGGGAATATCTTT
| |
− |
| |
− | xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx
| |
− | VNTR Summary
| |
− | rid : 19
| |
− | motif : AC
| |
− | ru : CA
| |
− |
| |
− | Exact
| |
− | repeat_tract : CACACACACACACACA
| |
− | position : [131949,131964]
| |
− | reference repeat unit length : 8
| |
− | motif_concordance : 1
| |
− | repeat units : 8
| |
− | exact repeat units : 8
| |
− | total no. of repeat units : 8
| |
− |
| |
− | Fuzzy
| |
− | repeat_tract : CACCACACACACACACACA
| |
− | position : [131946,131964]
| |
− | reference repeat unit length : 19
| |
− | motif_concordance : 0.95
| |
− | repeat units : 19
| |
− | exact repeat units : 9
| |
− | total no. of repeat units : 10
| |
− | xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx
| |
| | | |
− | <div class="mw-collapsible-content">
| + | usage : vt remove_overlap [options] <in.vcf> |
− | usage : vt annotate_indels [options] <in.vcf> | |
| | | |
− | options : -v add vntr record [false] | + | options : -o output VCF file [-] |
− | -x override tags [false]
| |
− | -f filter expression []
| |
− | -d debug [false]
| |
− | -m mode [f]
| |
− | e : by exact alignment f : by fuzzy alignment
| |
− | -c classification schemas of tandem repeat [6]
| |
− | 1 : lai2003
| |
− | 2 : kelkar2008
| |
− | 3 : fondon2012
| |
− | 4 : ananda2013
| |
− | 5 : willems2014
| |
− | 6 : tan_kang2015
| |
− | -a annotation type [v]
| |
− | v : a. output VNTR variant (defined by classification).
| |
− | RU repeat unit on reference sequence (CA)
| |
− | MOTIF canonical representation (AC)
| |
− | RL repeat tract length in bases (11)
| |
− | FLANKS flanking positions of repeat tract determined by exact alignment
| |
− | RU_COUNTS number of exact repeat units and total number of repeat units in
| |
− | repeat tract determined by exact alignment
| |
− | FZ_RL fuzzy repeat tract length in bases (11)
| |
− | FZ_FLANKS flanking positions of repeat tract determined by fuzzy alignment
| |
− | FZ_RU_COUNTS number of exact repeat units and total number of repeat units in
| |
− | repeat tract determined by fuzzy alignment
| |
− | FLANKSEQ flanking sequence of indel
| |
− | LARGE_REPEAT_REGION repeat region exceeding 2000bp
| |
− | b. mark indels with overlapping VNTR.
| |
− | FLANKS flanking positions of repeat tract determined by exact alignment
| |
− | FZ_FLANKS flanking positions of repeat tract determined by fuzzy alignment
| |
− | GMOTIF generating motif used in fuzzy alignment
| |
− | TR position and alleles of VNTR (20:23413:CACACACACAC:<VNTR>)
| |
− | a : annotate each indel with RU, RL, MOTIF, REF.
| |
− | -r reference sequence fasta file []
| |
− | -o output VCF file [-]
| |
| -I file containing list of intervals [] | | -I file containing list of intervals [] |
− | -i intervals | + | -i intervals [] |
| -? displays help | | -? displays help |
| </div> | | </div> |
| </div> | | </div> |
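The tagging behaviour described for remove_overlap can be illustrated with a short Python sketch. This is not vt's implementation. It assumes that two variants overlap when their reference spans, taken as POS through POS + len(REF) - 1, intersect on the same chromosome, and that every variant involved in an overlap receives the flag. The helper names (ref_span, add_filter, tag_overlaps) and the example records are hypothetical.

```python
from typing import List, Sequence, Tuple

# Column order of a VCF data line: CHROM POS ID REF ALT QUAL FILTER INFO
CHROM, POS, REF, FILTER = 0, 1, 3, 6


def ref_span(rec: Sequence[str]) -> Tuple[int, int]:
    """Return the inclusive, 1-based reference span covered by a record."""
    start = int(rec[POS])
    return start, start + len(rec[REF]) - 1


def add_filter(rec: List[str], flag: str) -> None:
    """Add a flag to the FILTER column, replacing '.' or 'PASS'."""
    current = rec[FILTER]
    if current in (".", "PASS", ""):
        rec[FILTER] = flag
    elif flag not in current.split(";"):
        rec[FILTER] = current + ";" + flag


def tag_overlaps(records: Sequence[Sequence[str]], flag: str = "overlap") -> List[List[str]]:
    """Tag every record whose reference span intersects another record's span.

    Assumes the records are sorted by CHROM, then POS (hypothetical helper).
    """
    out = [list(r) for r in records]
    chrom = None
    max_end = -1  # furthest span end seen so far on the current chromosome
    max_idx = -1  # index of the record that reaches max_end
    for i, rec in enumerate(out):
        start, end = ref_span(rec)
        if rec[CHROM] != chrom:
            # New chromosome: restart the sweep.
            chrom, max_end, max_idx = rec[CHROM], end, i
            continue
        if start <= max_end:
            # This record starts inside an earlier span: tag both.
            add_filter(rec, flag)
            add_filter(out[max_idx], flag)
        if end > max_end:
            max_end, max_idx = end, i
    return out


# Hypothetical records, for illustration only.
example = [
    ["20", "100", ".", "A", "G", ".", ".", "."],
    ["20", "150", ".", "CTG", "C", ".", ".", "."],  # spans 150-152
    ["20", "151", ".", "T", "C", ".", ".", "."],    # falls inside 150-152
    ["20", "200", ".", "G", "A", ".", ".", "."],
    ["21", "151", ".", "T", "C", ".", ".", "."],    # different chromosome
]

for rec in tag_overlaps(example):
    print("\t".join(rec))
```

In this sketch the records are only tagged, never dropped, which matches the description above: downstream tools can then exclude records whose FILTER column contains overlap.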
− |
| |
− |
| |
− |
| |
| | | |
| === Annotate Indels === | | === Annotate Indels === |