Difference between revisions of "Haploxt"

From Genome Analysis Wiki

Revision as of 09:57, 21 October 2010

haploxt is a C/C++ program developed by Yun Li and Goncalo Abecasis. It calculates linkage disequilibrium (LD) measures, D' and r², from phased haplotypes.
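
As an illustration of the two measures, the following is a minimal Python sketch of pairwise D' and r² computed from phased haplotype strings. It is not haploxt's implementation: the function name `pairwise_ld` is hypothetical, both sites are assumed to be biallelic, and D' is reported as an absolute value.

```python
def pairwise_ld(haplotypes, i, j):
    """Return (D', r^2) for 0-based sites i and j, or None if a site is monomorphic.

    Assumes equal-length haplotype strings and biallelic sites (illustrative only).
    """
    n = len(haplotypes)
    a = haplotypes[0][i]  # reference allele at site i (arbitrary choice)
    b = haplotypes[0][j]  # reference allele at site j (arbitrary choice)
    p_a = sum(h[i] == a for h in haplotypes) / n
    p_b = sum(h[j] == b for h in haplotypes) / n
    p_ab = sum(h[i] == a and h[j] == b for h in haplotypes) / n

    d = p_ab - p_a * p_b  # raw disequilibrium coefficient D
    denom = p_a * (1 - p_a) * p_b * (1 - p_b)
    if denom == 0:
        return None  # LD is undefined when either site is monomorphic

    r2 = d * d / denom
    # D' rescales |D| by its maximum possible value given the allele frequencies.
    if d >= 0:
        d_max = min(p_a * (1 - p_b), (1 - p_a) * p_b)
    else:
        d_max = min(p_a * p_b, (1 - p_a) * (1 - p_b))
    return abs(d) / d_max, r2
```

For example, `pairwise_ld(["11", "11", "22", "22"], 0, 1)` describes two sites in complete LD, while `["11", "12", "21", "22"]` describes two independent sites.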

Input Files

Required

Haplotype File

The input haplotype file contains one haplotype per line, with no delimiter between alleles. The file may contain other fields, but they must precede the haplotype field.

Sample haplotype file 1:
Indiv1 HAPLO1 142344111132344413444421333122313342113321324233112323134222
Indiv1 HAPLO2 144222444221443323334431431242131234311333144223411321334422
Indiv2 HAPLO1 144224441221443323334331431243131232311321123323432321334412
Indiv2 HAPLO2 342222444221344312343433333122313344113321324223112323334412

Sample haplotype file 2:
142344111132344413444421333122313342113321324233112323134222
144222444221443323334431431242131234311333144223411321334422
144224441221443323334331431243131232311321123323432321334412
342222444221344312343433333122313344113321324223112323334412
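
Both sample layouts can be read the same way if the haplotype is taken to be the last whitespace-separated field on each line. The sketch below assumes exactly that; the function name `read_haplotypes` is hypothetical and this is not the parser haploxt uses.

```python
def read_haplotypes(lines):
    """Collect the haplotype string from each non-empty line.

    Assumption: the haplotype is the last whitespace-separated field,
    so any leading fields (individual ID, haplotype label) are skipped.
    """
    haplotypes = []
    for line in lines:
        fields = line.split()
        if not fields:
            continue  # skip blank lines
        haplotypes.append(fields[-1])

    # All haplotypes should cover the same number of markers.
    if len({len(h) for h in haplotypes}) > 1:
        raise ValueError("haplotypes differ in length")
    return haplotypes
```

To read from disk, pass an open file object, for example `read_haplotypes(open("sample.hap"))`, where `sample.hap` is a placeholder file name.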

Options

--impped --impdat

specify one input pedigree set.

--trueped --truedat

specify the other input pedigree set.

--match

generates a matrix whose entries take the values 0, 1 or 2, indicating the number of matched alleles. The matrix has one row per individual and one column per marker, restricted to the individuals and markers present in both input pedigree sets.
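
A minimal sketch of how such a matrix could be built, assuming genotypes are unordered pairs of alleles and each input set is a nested dictionary `{person: {marker: (allele, allele)}}`. The function names and data layout are hypothetical, and missing genotypes are not handled here.

```python
from collections import Counter


def matched_alleles(true_geno, imputed_geno):
    """Number of alleles (0, 1 or 2) shared by two unordered genotypes."""
    shared = Counter(true_geno) & Counter(imputed_geno)  # multiset intersection
    return sum(shared.values())


def match_matrix(true_set, imputed_set):
    """Return (people, markers, matrix) over the overlap of the two sets."""
    people = sorted(set(true_set) & set(imputed_set))
    if not people:
        return [], [], []
    # Keep only markers typed for every overlapping person in both sets.
    markers = sorted(
        set.intersection(*(set(true_set[p]) & set(imputed_set[p]) for p in people))
    )
    matrix = [
        [matched_alleles(true_set[p][m], imputed_set[p][m]) for m in markers]
        for p in people
    ]
    return people, markers, matrix
```

Using a multiset intersection makes the comparison independent of allele order, so A/G and G/A count as two matched alleles.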

--bySNP

is turned on by default and generates SNP-specific measures. The output file .bySNP contains the following 6 fields for each SNP:

 (1) SNP: SNP name
 (2) gErr: genotypic discordance rate
 (3) aErr: allelic discordance rate
 (4) matchedG: number of genotypes matched
 (5) matchedA: number of alleles matched
 (6) maskedG: total number of genotypes evaluated/masked (<=n of course) (I should change the naming to comparedG or evaluatedG)
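
The relationship between these fields can be sketched as follows. The formulas are an assumption inferred from the field descriptions, not taken from the CalcMatch source: gErr is taken as 1 - matchedG/maskedG and aErr as 1 - matchedA/(2 × maskedG). The function name `snp_summary` is hypothetical.

```python
def snp_summary(match_counts):
    """Summarise one SNP from per-genotype matched-allele counts (0, 1 or 2).

    Assumed definitions (not confirmed by the source):
      gErr = 1 - matchedG / maskedG
      aErr = 1 - matchedA / (2 * maskedG)
    """
    masked_g = len(match_counts)                      # genotypes compared
    matched_g = sum(c == 2 for c in match_counts)     # fully matching genotypes
    matched_a = sum(match_counts)                     # matching alleles
    if masked_g == 0:
        return None  # nothing was compared for this SNP
    return {
        "gErr": 1 - matched_g / masked_g,
        "aErr": 1 - matched_a / (2 * masked_g),
        "matchedG": matched_g,
        "matchedA": matched_a,
        "maskedG": masked_g,
    }
```

With four compared genotypes whose matched-allele counts are 2, 2, 1 and 0, this gives matchedG = 2, matchedA = 5 and maskedG = 4.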


--byGeno

can be added on top of --bySNP. It generates the following fields after the 6 fields above:

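 (7) hetAerr: allelic discordance rate among heterozygotes
 (8) AL1: allele 1 (an arbitrary allele)
 (9) AL2: allele 2
 (10) freq1: frequency of AL1
 (11) MAF: minor allele frequency
 (12) #true 1/1: # of individuals with experimental genotype AL1/AL1
 (13) mm1/2: # of true AL1/AL1 imputed as AL1/AL2
 (14) mm2/2: # of true AL1/AL1 imputed as AL2/AL2
 (15) #true 1/2: # of individuals with experimental genotype AL1/AL2
 (16) mm1/1: # of true AL1/AL2 imputed as AL1/AL1
 (17) mm2/2: # of true AL1/AL2 imputed as AL2/AL2
 (18) #true 2/2: # of individuals with experimental genotype AL2/AL2
 (19) mm1/1: # of true AL2/AL2 imputed as AL1/AL1
 (20) mm1/2: # of true AL2/AL2 imputed as AL1/AL2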
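
Fields (12) to (20) amount to a 3×3 table of true versus imputed genotype classes. A minimal sketch of that tabulation is given below, assuming biallelic SNPs and genotypes given as pairs of alleles; `genotype_class` and `confusion_counts` are hypothetical names, not CalcMatch functions.

```python
from collections import Counter


def genotype_class(geno, al1, al2):
    """Map a genotype such as ("A", "G") to "1/1", "1/2" or "2/2"."""
    codes = []
    for allele in geno:
        if allele == al1:
            codes.append("1")
        elif allele == al2:
            codes.append("2")
        else:
            raise ValueError(f"unexpected allele {allele!r}")
    return "/".join(sorted(codes))


def confusion_counts(pairs, al1, al2):
    """Count (true class, imputed class) combinations for one SNP."""
    return Counter(
        (genotype_class(t, al1, al2), genotype_class(i, al1, al2))
        for t, i in pairs
    )
```

Under this layout, "#true 1/1" would be the sum of all counts whose true class is "1/1", and the first mm1/2 field would be the count for ("1/1", "1/2").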



--accuracyByGeno

Like --byGeno, this option is used on top of --bySNP, and the two may be combined. It generates the following fields, placed after fields (7-20) if --byGeno is turned on, or after the 6th field otherwise:

 (A) almajor: major allele
 (B) alminor: minor allele
 (C) freq1: major allele frequency
 (D) accuracy11: allelic concordance rate for major-allele homozygotes
 (E) accuracy12: allelic concordance rate for heterozygotes
 (F) accuracy22: allelic concordance rate for minor-allele homozygotes


--byPerson

generates a separate output file, .byPerson, which contains the following fields for each person:

 (1) famid
 (2) subjID
 (3) gErr
 (4) aErr
 (5) matchedG
 (6) matchedA
 (7) maskedG


The --byPerson option is useful for detecting potential sample swaps or differences between individuals, e.g., in sequencing depth or in the number of markers genotyped.


--maskflag --maskped --maskdat

By default, CalcMatch compares all genotypes that overlap between the two input sets. However, when --maskflag is turned on AND --maskped and --maskdat are specified (I know ...), it compares only a subset of the overlapping genotypes: those that are either not found in --maskped / --maskdat (i.e., the individual or marker is not included) or missing there (included but with value 0/0, N/N, ./. etc). These options are useful when some individuals were masked for some SNPs while others were masked for a different set of SNPs.
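
The selection rule can be sketched as a small predicate. This is an illustration under assumptions, not CalcMatch's code: the mask set is represented as a nested dictionary `{person: {marker: "A/G"}}`, the missing codes are limited to the three listed above, and the name `is_compared` is hypothetical.

```python
# Missing-genotype codes named in the text; the real list may be longer ("etc").
MISSING_CODES = {"0/0", "N/N", "./."}


def is_compared(person, marker, mask_set):
    """Return True if this genotype would be included in the comparison.

    A genotype is compared when the mask set does not contain it
    (person or marker absent) or contains it as a missing value.
    """
    person_genos = mask_set.get(person)
    if person_genos is None:
        return True  # individual not in the mask set
    geno = person_genos.get(marker)
    if geno is None:
        return True  # marker not in the mask set for this individual
    return geno in MISSING_CODES
```

A genotype that is present and non-missing in the mask set is treated as known at imputation time and is therefore excluded from the comparison.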

Example command lines

 CalcMatch --trueped true.ped --truedat true.dat --impped imp.ped --impdat imp.dat -o CalcMatch.Output --byPerson 

Will generate CalcMatch.Output.bySNP (6 fields only) and CalcMatch.Output.byPerson.

 CalcMatch --trueped true.ped --truedat true.dat --impped imp.ped --impdat imp.dat -o CalcMatch.Output --byGeno --byPerson 

Will generate CalcMatch.Output.bySNP (6+14 fields, i.e., fields 1-20) and CalcMatch.Output.byPerson.

 CalcMatch --trueped true.ped --truedat true.dat --impped imp.ped --impdat imp.dat -o CalcMatch.Output --accuracyByGeno --byPerson 

Will generate CalcMatch.Output.bySNP (6+6 fields only) and CalcMatch.Output.byPerson.

 CalcMatch --trueped true.ped --truedat true.dat --impped imp.ped --impdat imp.dat -o CalcMatch.Output --accuracyByGeno --byGeno --byPerson 

Will generate CalcMatch.Output.bySNP (6+14+6 fields, i.e., fields 1-20 followed by A-F) and CalcMatch.Output.byPerson.