Tutorial: Low Pass Sequence Analysis Answers

From Genome Analysis Wiki
Jump to navigationJump to search

Low Pass Sequence Analysis Answers

  • Q1: What is the base quality of the fifth nucleotide of the third read in the file HG00111.lowcoverage.chr20.smallregion_1.fastq.gz?

The third read in the file is:

@ERR020230.76497044/1
CTGTACTACTAAAGTAAAACTAGTTTTCCAATAGTTTGTTGCAGGATAAGCAGTTTTACTTTTGTTGACAATATGTGTATGAATTTACTTC
+
DFEEGFKIFKIKLKIJLMMIMKMJKKKIKLMKKLKLLLKKLKLMMJLLJMKMMJLKLLJNLLLIKLJMILKLJKLKKKKKMMMJJJIFJFA

The quality string is the 4th line of each read, then the base quality of the fifth nucleotide is encoded with the character "G". Its decimal ASCII code is 71, so the base quality of this nucleotide is 38 (71-33)

  • Q2: Which is the mean depth of the sample HG00108? And the mapping rate?

The mean depth is 4.60X and mapping rate is 99.19%. However, keep in mind that these statistics are evaluated only in the 100kb included in our example dataset.

  • Q3: What is the depth of position 33594959 for the sample HG00108? What would be the most likely genotype looking at the reads? (You can answer this question by using tview or mpileup.)

The depth of the sample HG00108 at the position 33594959 is 11, there are 3 G's and 8 T's piling up at this position. Just looking at the nucleotide, the most likely genotype would be G/T

  • Q4: What is the genotype at this position? (T/T,C/T or C/C?) How many reads are covering this position? Is this consistent with the result you can obtain by using tview or mpileup?
POS       REF     ALT     FORMAT       HG00111
33514465  T       C       GT:GD:GQ:PL  1/1:3:10:117,9,0

The genotype (GT) is encoded as 1/1, which means that both chromosomes carry the alternative allele (ALT), the genotype is then C/C The depth at this position is encoded in the GD field and its value is 3. Running the mpileup:

 > samtools view  -uh  align/bams/HG00111.recal.bam 20:33514465| samtools mpileup - | grep 33514465


 [mpileup] 1 samples in 1 input files
 <mpileup> Set max per-file depth to 8000
 20      33514465        N       3       cCc     :65

At this position there are 3 C's so the result is consistent with the call in the vcf file.

  • Q5: Which is the "Total Depth at Site" for the variant at position 33500378?

The "Total Depth at Site" is encoded in the INFO field with "DP". To extract it:

> zgrep 33500378 snpcall/vcfs/chr20/chr20.filtered.vcf.gz | cut -f 1,2,8
20      33500378        DP=37;MQ=58;NS=10;AN=20;AC=15;AF=0.737200;AB=0.6246;AZ=0.9025;FIC=0.1934;SLRT=0.1851;HWEAF=0.7372;HWDAF=0.3125,0.5682;LBS=0,0,0,0,0,1,0,0;OBS=17,14,0,0,5,3,0,0;STR=0.054;STZ=0.335;CBR=0.035;CBZ=0.218;IOR=0.000;IOZ=-0.199;AOI=-180.991;AOZ=-180.792;LQR=0.025;MQ0=0.000;MQ10=0.000;MQ20=0.000;MQ30=0.026;SVM=0.995957

The total depth at this site is 37 and it is the sum of the depth of the 10 individuals at this position

  • Q6: How many alternate alleles are found at position 33505937?

The number of alternate alleles (or "Alternate Allele Counts in Samples with Coverage") is encoded in the INFO field with "AC". To extract it:

> zgrep  33505937 snpcall/vcfs/chr20/chr20.filtered.vcf.gz | cut -f 1,2,8
20      33505937        DP=54;MQ=59;NS=10;AN=20;AC=14;AF=0.670715;AB=0.4931;AZ=-0.0684;FIC=0.1444;SLRT=0.1432;HWEAF=0.6707;HWDAF=0.3779,0.4753;LBS=0,0,1,3,0,0,0,1;OBS=0,0,14,21,0,0,8,6;STR=0.150;STZ=1.051;CBR=0.295;CBZ=2.068;IOR=0.000;IOZ=-0.154;AOI=-262.472;AOZ=-262.317;LQR=0.093;MQ0=0.000;MQ10=0.000;MQ20=0.000;MQ30=0.000;SVM=1.03116

At this position, in total there are 14 alternative allele in the 10 individuals genotypes (20 alleles in total).


  • Q7: Is the genotype of HG00108 at position 33538999 consistent with what you predicted in Q3? (be careful about choosing the right column with the "cut" command)

The sample HG00108 is the 13th column of the file, so :

 > zgrep -E "CHROM|33594959" snpcall/vcfs/chr20/chr20.filtered.vcf.gz | cut -f 2,4,5,9,13
 POS     REF     ALT     FORMAT  HG00108
 33594959        G       T       GT:GD:GQ:PL     0/1:11:99:185,0,87

The predicted genotype is then G/T with depth 11, consistent with the mpileup results.

  • Q8: How many variant sites were detected in this dataset?

193 variants in total

  • Q9: Is this genotype concordant with the one found in the low pass variant calling? Which genotype do you think is more accurate?

The two genotypes are discordant (0/1 exome vs 1/1 low pass). Since the exome sequenced sample has higher depth (16 vs 3), it should be more accurate (but always double check your variants and your reads to avoid false positive!!)

  • Q10: What can be the reason of the genotype discordance?

The reason of discordance is in the lower number of reads in the low pass. All the 3 fragments, piling up at this position, belong to the chromosome containing the C allele, there are no fragment from the other chromosome containing the T allele.

  • Q11: Compare the genotype of the sample HG00111 at position 33514465 in the exome and in the LD-refined VCF. Did something change? Why?
 > zgrep -E "CHROM|33514465" snpcall/thunder/chr20/1000G/thunder/chr20.filtered.PASS.beagled.1000G.thunder.vcf.gz | cut -f 2,4,5,9,14
 POS     REF     ALT     FORMAT  HG00111
 33514465        T       C       GT:DS:GD:GQ:PL:BD       0|1:1.000:3:10:117,9,0:1.0014

The genotype in the LD-refined vcf and in the exome vcf are now consistent. The LD refinement reconstructs the haplotypes in this small region and it is able to "fix" some genotyping errors generated by the low coverage of the low pass data.

  • Q12: Check position 33523840 in the low pass VCF for sample HG00111.
  • What is the genotype assigned by the variant caller?
  • What is your predicted genotype according to the reads piling up at this site?
  • What is the genotype in the exome VCF?
  • What is the genotype after LD refinement?