Line 240: |
Line 240: |
| | | |
| ==== Looking at final INDEL VCF ==== | | ==== Looking at final INDEL VCF ==== |
− | Check the number of passing INDEL variants:
| |
− | $GC/bin/vt peek ~/$SAMPLE/output/indel/final/all.genotypes.vcf.gz -f "PASS"
| |
− | Gives something like:
| |
− | no. Indels : 570661
| |
− | 2 alleles (ins/del) : 570661 (0.87) [265448/305213]
| |
− | >=3 alleles (ins/del) : 0 (-nan) [0/0]
| |
| | | |
− | Check the number of passing INDEL's with allele count > 0:
| + | Note that because this is single-sample calling, many of the INFO fields are less meaningful: values such as HWE p-values, allele frequencies, and the inbreeding coefficient are functions of a population. |
| + | Nonetheless, we may examine the results. First, we see how many indels were discovered for your genome: |
| + | |
| + | $GC/bin/vt peek ~/$SAMPLE/output/indel/final/all.genotypes.vcf.gz |
| + | no. Indels : 588566 |
| + | 2 alleles (ins/del) : 588566 (0.87) [273261/315305] |
| + | |
| + | This gives us 588,566 indels with an insertion/deletion ratio of 0.87. |
| + | |
| + | We next look at the filtered set. The PASS filter extracts all non-overlapping variants, and INFO.AC!=0 extracts all indels that are either heterozygous or homozygous alternative. |
| + | Some indels that were originally discovered were genotyped as homozygous reference. Invariably, these are relatively high-depth calls where the |
| + | alternative allele is discovered less often or is mis-specified. |
| + | |
| $GC/bin/vt peek ~/$SAMPLE/output/indel/final/all.genotypes.vcf.gz -f "PASS&&INFO.AC!=0" | | $GC/bin/vt peek ~/$SAMPLE/output/indel/final/all.genotypes.vcf.gz -f "PASS&&INFO.AC!=0" |
− | Gives something like:
| + | |
| no. Indels : 549963 | | no. Indels : 549963 |
| 2 alleles (ins/del) : 549963 (0.91) [261480/288483] | | 2 alleles (ins/del) : 549963 (0.91) [261480/288483] |
− | >=3 alleles (ins/del) : 0 (-nan) [0/0]
| |
− | Some INDELs had allele count 0.
| |
− |
| |
− |
| |
− | Check the number of passing INDEL's with allele count 2:
| |
− | $GC/bin/vt peek ~/$SAMPLE/output/indel/final/all.genotypes.vcf.gz -f "PASS&&INFO.AC==2"
| |
− | Gives something like:
| |
− | no. Indels : 216134
| |
− | 2 alleles (ins/del) : 216134 (1.17) [116511/99623]
| |
− | >=3 alleles (ins/del) : 0 (-nan) [0/0]
| |
| | | |
| + | About 38K indels were removed, and the insertion/deletion ratio increases to 0.91. Note that in general, for high-depth data, discovered indels are reported with insertion/deletion ratios |
| + | close to 1, so this is a good sign. Next-generation sequencing errors are biased toward deletions. |
| | | |
− | Check the number of passing INDEL's with allele balance > 0.5:
| + | It is possible to perform slightly more stringent filtering using allele balance. The allele balance estimator is still meaningful for an individual in this case because it is a function of read depth. |
− | $GC/bin/vt peek ~/$SAMPLE/output/indel/final/all.genotypes.vcf.gz -f "PASS&&INFO.AB>0.5" | + | Note that AB>0.5 denotes reference bias and AB<0.5 denotes alternative allele bias. |
− | Gives something like:
| |
− | no. Indels : 132878
| |
− | 2 alleles (ins/del) : 132878 (0.68) [53714/79164]
| |
− | >=3 alleles (ins/del) : 0 (-nan) [0/0]
| |
| | | |
| + | $GC/bin/vt peek ~/$SAMPLE/output/indel/final/all.genotypes.vcf.gz -f "PASS&&INFO.AB<0.7&&INFO.AB>0.3" |
| + | no. Indels : 490965 |
| + | 2 alleles (ins/del) : 490965 (0.92) [235254/255711] |
| | | |
− | Check the number of passing INDEL's with allele balance < 0.5:
| + | The insertion/deletion ratio increases from 0.91 to 0.92. |
− | $GC/bin/vt peek ~/$SAMPLE/output/indel/final/all.genotypes.vcf.gz -f "PASS&&INFO.AB<0.5"
| |
− | Gives something like:
| |
− | no. Indels : 169198
| |
− | 2 alleles (ins/del) : 169198 (0.89) [79504/89694]
| |
− | >=3 alleles (ins/del) : 0 (-nan) [0/0]
| |
| | | |
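The ratios quoted in the revised text can be checked directly from the bracketed insertion/deletion counts that vt peek prints. The following is a minimal sketch: `ins_del_ratio` is a hypothetical helper, not part of vt, and it assumes only that `awk` is available.

```shell
# Hypothetical helper (not part of vt): insertion/deletion ratio from the
# bracketed [insertions/deletions] counts in vt peek output, to 2 decimals.
ins_del_ratio() {
    awk -v ins="$1" -v del="$2" 'BEGIN { printf "%.2f\n", ins / del }'
}

# Counts copied from the vt peek outputs shown in the revised text.
ins_del_ratio 273261 315305   # all discovered indels
ins_del_ratio 261480 288483   # PASS&&INFO.AC!=0
ins_del_ratio 235254 255711   # PASS&&INFO.AB<0.7&&INFO.AB>0.3

# Number of indels removed by the PASS&&INFO.AC!=0 filter.
echo $((588566 - 549963))
```

The three ratios should match the values in parentheses in the vt peek output, and the final line gives the exact count behind the "about 38K" figure.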
| </div> | | </div> |
| </div> | | </div> |