Difference between revisions of "Tutorial: Low Pass Sequence Analysis Answers"

From Genome Analysis Wiki

Revision as of 09:03, 9 September 2013

Low Pass Sequence Analysis Answers

  • Q1: What is the base quality of the fifth nucleotide of the third read in the file HG00111.lowcoverage.chr20.smallregion_1.fastq.gz?

The third read in the file is:

@ERR020230.76497044/1
CTGTACTACTAAAGTAAAACTAGTTTTCCAATAGTTTGTTGCAGGATAAGCAGTTTTACTTTTGTTGACAATATGTGTATGAATTTACTTC
+
DFEEGFKIFKIKLKIJLMMIMKMJKKKIKLMKKLKLLLKKLKLMMJLLJMKMMJLKLLJNLLLIKLJMILKLJKLKKKKKMMMJJJIFJFA

The quality string is the fourth line of each read record. The base quality of the fifth nucleotide is encoded by the fifth character of that string, "G". Its decimal ASCII code is 71, so the base quality of this nucleotide is 38 (71 - 33).
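
The same arithmetic can be sketched in a few lines of Python. This is only an illustration: the function name `base_quality` is made up here, and the offset of 33 assumes the Phred+33 encoding used in the answer above.

```python
# Assumption: the FASTQ file uses the Phred+33 encoding, as in the answer above.
PHRED_OFFSET = 33


def base_quality(quality_string: str, position: int) -> int:
    """Return the Phred quality of the base at a 1-based position."""
    return ord(quality_string[position - 1]) - PHRED_OFFSET


# Quality string of the third read, copied from the record shown above.
quality = (
    "DFEEGFKIFKIKLKIJLMMIMKMJKKKIKLMKKLKLLLKKLKLMMJLLJMKMMJLKLLJNLLLIKLJMILKLJKLKKKKKMMMJJJIFJFA"
)

fifth_quality = base_quality(quality, 5)
print(fifth_quality)  # 38, since ord("G") is 71 and 71 - 33 = 38
```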

  • Q2: What is the mean depth of the sample HG00108? And the mapping rate?

The mean depth is 4.60X and mapping rate is 99.19%. However, keep in mind that these statistics are evaluated only in the 100kb included in our example dataset.

  • Q3: What is the depth of position 33538999 for the sample HG00108? What would be the most likely genotype looking at the reads? (You can answer this question by using tview or mpileup.)

The depth of the sample HG00108 at position 33538999 is 9: there are 5 G's and 4 C's piling up at this position. Judging from the nucleotides alone, the most likely genotype is C/G.
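
Counting bases by eye works for a single site, but the tally can also be scripted. The sketch below counts the bases in the read-base column of an mpileup line. The function name `count_bases` is hypothetical, and the example string is made up to match the 5 G / 4 C tally reported above; it is not actual tool output.

```python
from collections import Counter


def count_bases(pileup_bases: str, ref: str = "N") -> Counter:
    """Tally the bases in the read-base column of one mpileup line."""
    counts = Counter()
    i = 0
    while i < len(pileup_bases):
        ch = pileup_bases[i]
        if ch == "^":
            # Start of a read: the next character is its mapping quality.
            i += 2
            continue
        if ch in "+-":
            # Indel written as +<length><sequence> or -<length><sequence>.
            j = i + 1
            while j < len(pileup_bases) and pileup_bases[j].isdigit():
                j += 1
            i = j + int(pileup_bases[i + 1:j])
            continue
        if ch in ".,":
            # Match to the reference on the forward or reverse strand.
            counts[ref.upper()] += 1
        elif ch.upper() in "ACGTN":
            # Explicit base; lower case marks the reverse strand.
            counts[ch.upper()] += 1
        # Anything else ("$" end of read, "*" deleted base) is skipped.
        i += 1
    return counts


# Illustrative string only, chosen to give 5 G's and 4 C's.
example = "GgcCGgCcG"
counts = count_bases(example)
print(counts["G"], counts["C"])
```

Two well-supported alleles in roughly equal proportion, as here, are what suggests a heterozygous genotype.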

  • Q4: What is the genotype at this position? (T/T, C/T or C/C?) How many reads are covering this position? Is this consistent with the result you can obtain by using tview or mpileup?

POS       REF     ALT     FORMAT       HG00111
33514465  T       C       GT:GD:GQ:PL  1/1:3:10:117,9,0

The genotype (GT) is encoded as 1/1, which means that both chromosomes carry the alternate allele (ALT); the genotype is therefore C/C. The depth at this position is encoded in the GD field, and its value is 3.

Running the mpileup:

 > samtools view  -uh  align/bams/HG00111.recal.bam 20:33514465| samtools mpileup - | grep 33514465

[bam_header_read] EOF marker is absent. The input is probably truncated.
[mpileup] 1 samples in 1 input files
<mpileup> Set max per-file depth to 8000
20      33514465        N       3       cCc     :65

At this position there are 3 C's, so the result is consistent with the call in the VCF file.
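
The decoding of the sample column can be sketched as follows. The function name `decode_sample` is hypothetical, and the sketch assumes a called genotype (no missing "." alleles) and a FORMAT string that contains the GD key, as in the record shown above.

```python
def decode_sample(ref: str, alt: str, fmt: str, sample: str):
    """Translate a VCF sample column into (genotype as bases, depth)."""
    # Pair each FORMAT key with the matching value of the sample column.
    fields = dict(zip(fmt.split(":"), sample.split(":")))
    # Allele index 0 is REF; 1, 2, ... are the comma-separated ALT alleles.
    alleles = [ref] + alt.split(",")
    # "/" separates unphased alleles, "|" phased ones; treat both alike.
    indices = fields["GT"].replace("|", "/").split("/")
    genotype = "/".join(alleles[int(i)] for i in indices)
    return genotype, int(fields["GD"])


# Values copied from the record at position 33514465 shown above.
genotype, depth = decode_sample("T", "C", "GT:GD:GQ:PL", "1/1:3:10:117,9,0")
print(genotype, depth)  # C/C 3
```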

  • Q5: What is the "Total Depth at Site" for the variant at position 33500378?

The "Total Depth at Site" is encoded in the INFO field with "DP". To extract it:

> zgrep 33500378 snpcall/vcfs/chr20/chr20.filtered.vcf.gz | cut -f 1,2,8
20      33500378        DP=37;MQ=58;NS=10;AN=20;AC=15;AF=0.737200;AB=0.6246;AZ=0.9025;FIC=0.1934;SLRT=0.1851;HWEAF=0.7372;HWDAF=0.3125,0.5682;LBS=0,0,0,0,0,1,0,0;OBS=17,14,0,0,5,3,0,0;STR=0.054;STZ=0.335;CBR=0.035;CBZ=0.218;IOR=0.000;IOZ=-0.199;AOI=-180.991;AOZ=-180.792;LQR=0.025;MQ0=0.000;MQ10=0.000;MQ20=0.000;MQ30=0.026;SVM=0.995957

The total depth at this site is 37, which is the sum of the depths of the 10 individuals at this position.
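
Instead of reading the value off the line by eye, the INFO column can be split into key/value pairs. The helper `parse_info` below is a hypothetical sketch, and the example string holds only the first few entries of the record shown above, to keep it short.

```python
def parse_info(info: str) -> dict:
    """Split a VCF INFO column into a dictionary of key/value pairs."""
    result = {}
    for entry in info.split(";"):
        key, sep, value = entry.partition("=")
        # Entries without "=" are flags; store them as True.
        result[key] = value if sep else True
    return result


# First entries of the INFO column for position 33500378 (truncated here).
info = parse_info("DP=37;MQ=58;NS=10;AN=20;AC=15;AF=0.737200")
total_depth = int(info["DP"])
print(total_depth)  # 37
```

The same dictionary also gives access to the other keys, for example `info["AC"]` and `info["AN"]`.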

  • Q6: How many alternate alleles are found at position 33505937?

The number of alternate alleles (or "Alternate Allele Counts in Samples with Coverage") is encoded in the INFO field with "AC". To extract it:

> zgrep  33505937 snpcall/vcfs/chr20/chr20.filtered.vcf.gz | cut -f 1,2,8
20      33505937        DP=54;MQ=59;NS=10;AN=20;AC=14;AF=0.670715;AB=0.4931;AZ=-0.0684;FIC=0.1444;SLRT=0.1432;HWEAF=0.6707;HWDAF=0.3779,0.4753;LBS=0,0,1,3,0,0,0,1;OBS=0,0,14,21,0,0,8,6;STR=0.150;STZ=1.051;CBR=0.295;CBZ=2.068;IOR=0.000;IOZ=-0.154;AOI=-262.472;AOZ=-262.317;LQR=0.093;MQ0=0.000;MQ10=0.000;MQ20=0.000;MQ30=0.000;SVM=1.03116

At this position there are 14 alternate alleles in total among the genotypes of the 10 individuals (20 alleles in total).


  • Q7: Is the genotype of HG00108 at position 33538999 consistent with what you predicted in Q3? (Be careful about choosing the right column with the "cut" command.)


  • Q8: How many variant sites were detected in this dataset?

193 variant sites were detected in total.