Haploxt

From Genome Analysis Wiki
Latest revision as of 18:22, 21 October 2010

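haploxt is a C/C++ program developed by Yun Li (https://www.sph.umich.edu/csg/yli/) and Goncalo Abecasis (https://www.sph.umich.edu/csg/abecasis/). It calculates linkage disequilibrium (LD) measures (D' and r²) from phased haplotypes.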

Input Files

Required Input

Haplotype File

The input haplotype file is passed with the option --haps or -h. It contains one haplotype per line, with no delimiter between alleles. The file may contain fields other than the haplotype itself, but they must precede the haplotype field. Allele coding is flexible: accepted codings include 1234, ACGT, 12, AB, etc.

Sample haplotype file 1:
Indiv1 HAPLO1 142344111132344413444421333122313342113321324233112323134222
Indiv1 HAPLO2 144222444221443323334431431242131234311333144223411321334422
Indiv2 HAPLO1 144224441221443323334331431243131232311321123323432321334412
Indiv2 HAPLO2 342222444221344312343433333122313344113321324223112323334412

Sample haplotype file 2:
142344111132344413444421333122313342113321324233112323134222
144222444221443323334431431242131234311333144223411321334422
144224441221443323334331431243131232311321123323432321334412
342222444221344312343433333122313344113321324223112323334412
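
Both sample layouts can be read the same way, because any extra fields come before the haplotype. The following Python sketch is not part of haploxt; the function name `read_haplotypes` is hypothetical, and it assumes fields are separated by whitespace so that the last field on each line is the haplotype.

```python
def read_haplotypes(lines):
    """Return one haplotype string per non-blank line.

    Assumes whitespace-separated fields, with the haplotype as the last field.
    """
    haplotypes = []
    for line in lines:
        fields = line.split()
        if not fields:  # skip blank lines
            continue
        haplotypes.append(fields[-1])
    # All haplotypes should cover the same SNPs, hence have the same length.
    lengths = {len(h) for h in haplotypes}
    if len(lengths) > 1:
        raise ValueError(f"haplotypes differ in length: {sorted(lengths)}")
    return haplotypes


sample = [
    "Indiv1 HAPLO1 1423441111",
    "Indiv1 HAPLO2 1442224442",
    "1442244412",
]
print(read_haplotypes(sample))
```

Checking that all haplotypes have the same length catches truncated lines early, before any per-SNP statistic is computed.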

SNP File

The input SNP file is passed with the option --snps or -s. It lists SNP names, one per line, in the same order as the alleles appear in the haplotype file.

Sample SNP file:
SNP1
SNP2
SNP3
...
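
Because the SNP file and the haplotype file are linked only by order, the i-th SNP name corresponds to the i-th allele of every haplotype. A minimal Python sketch of that pairing follows; the function name `alleles_by_snp` is hypothetical and the code is an illustration, not haploxt's own implementation.

```python
def alleles_by_snp(snp_names, haplotypes):
    """Map each SNP name to the list of alleles observed at that position."""
    for h in haplotypes:
        if len(h) != len(snp_names):
            raise ValueError(
                f"haplotype has {len(h)} alleles but {len(snp_names)} SNPs are listed"
            )
    return {
        name: [h[i] for h in haplotypes]
        for i, name in enumerate(snp_names)
    }


columns = alleles_by_snp(["SNP1", "SNP2", "SNP3"], ["142", "144", "342"])
print(columns["SNP1"])  # alleles of the first SNP across the three haplotypes
```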

Optional Input

Relevant SNP File

Fed to the option --relevant or -r, the relevant SNP file is a list of SNPs among which LD values are desired.

Options

--allelefrequency

Option to calculate allele frequency and output to prefix.freq.

--allelecounts

Option to calculate allele counts and output to prefix.ac.

--ld

Option to calculate LD. Note that this option has to be turned on for LD to be calculated.
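
This page does not give the formulas behind the two LD measures. As a hedged illustration, the Python sketch below uses the standard textbook definitions for two biallelic SNPs: D = p_AB - p_A p_B, r² = D² / (p_A (1 - p_A) p_B (1 - p_B)), and D' = |D| / D_max. The function name `ld_pair` is hypothetical, reporting the absolute value of D' is an assumption, and haploxt's own implementation may differ in detail.

```python
def ld_pair(col_a, col_b):
    """Return (D', r^2) for two biallelic SNPs.

    col_a and col_b hold one allele per haplotype, in the same haplotype order.
    Returns None if either SNP is monomorphic, since LD is then undefined.
    """
    n = len(col_a)
    a1, b1 = col_a[0], col_b[0]  # arbitrary reference alleles
    p_a = sum(x == a1 for x in col_a) / n
    p_b = sum(y == b1 for y in col_b) / n
    p_ab = sum(x == a1 and y == b1 for x, y in zip(col_a, col_b)) / n

    d = p_ab - p_a * p_b
    denom = p_a * (1 - p_a) * p_b * (1 - p_b)
    if denom == 0:
        return None

    r2 = d * d / denom
    # D_max depends on the sign of D.
    if d >= 0:
        d_max = min(p_a * (1 - p_b), (1 - p_a) * p_b)
    else:
        d_max = min(p_a * p_b, (1 - p_a) * (1 - p_b))
    d_prime = abs(d) / d_max
    return d_prime, r2


print(ld_pair(list("1122"), list("3344")))  # perfectly correlated SNPs
```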

--windowSize or -w

Option to specify the number of flanking SNPs on each side with which LD values are calculated for each SNP. The default is 1,000, meaning that LD is calculated with up to 1,000 SNPs on each side (up to 2,000 in total) of each SNP.
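
To make the window concrete, the Python sketch below enumerates the SNP pairs that fall within a given window. It is an illustration only: the function name `window_pairs` is hypothetical, and listing each unordered pair once (earlier SNP first) is an assumption about the output order, not something this page states.

```python
def window_pairs(n_snps, window_size=1000):
    """Yield index pairs (i, j) with i < j and j - i <= window_size.

    Pairing each SNP with its following neighbours covers the preceding
    neighbours too, because the pair (i, j) is the same as (j, i).
    """
    for i in range(n_snps):
        last = min(i + window_size, n_snps - 1)
        for j in range(i + 1, last + 1):
            yield i, j


print(list(window_pairs(4, window_size=2)))
```

With n SNPs and a window of w, this produces roughly n × w pairs instead of n × (n - 1) / 2, which is why a window keeps the computation manageable on dense marker sets.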

--r2Threshold or -t

Minimum r² value for a pair of SNPs to be in output. Default is 0.00.

--DprimeThreshold or -d

Minimum D' value for a pair of SNPs to be in output. Default is 0.00.

--pairWithSNP

Option to calculate LD only with a particular SNP.

--pairWithList

A list of SNPs with which LD values will be calculated.

--coupling

Option to output for each pair the alleles that are positively correlated.
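
Two alleles are positively correlated when they occur together on the same haplotype more often than expected from their individual frequencies, that is, when D > 0 for that allele pair. The Python sketch below illustrates the idea under that standard definition. The function name `coupled_alleles` is hypothetical, and which of the two coupled allele pairs haploxt reports is not stated on this page, so the choice made here is an assumption.

```python
def coupled_alleles(col_a, col_b):
    """Return one pair (allele at SNP A, allele at SNP B) with positive correlation.

    Returns None when D is exactly zero (no correlation in either direction).
    """
    n = len(col_a)
    a1, b1 = col_a[0], col_b[0]  # arbitrary reference alleles
    p_a = sum(x == a1 for x in col_a) / n
    p_b = sum(y == b1 for y in col_b) / n
    p_ab = sum(x == a1 and y == b1 for x, y in zip(col_a, col_b)) / n
    d = p_ab - p_a * p_b

    if d > 0:
        return a1, b1
    if d < 0:
        # a1 is coupled with the other allele at SNP B.
        b2 = next(y for y in col_b if y != b1)
        return a1, b2
    return None


print(coupled_alleles(list("1122"), list("3344")))
```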

--prefix or -o

Option to specify output prefix.

Output

.freq

Generated when option --allelefrequency is turned on.

sample.freq
SNP AL1 AL2 Freq1 MAF
chr12:16099 1 3 0.4000 0.4000
chr12:16163 4 2 0.9000 0.1000
rs7358779 2 3 0.1000 0.1000
chr12:17063 1 3 0.8000 0.2000
...

.ac

Generated when option --allelecounts is turned on.

sample.ac
SNP AL1 AL2 AC1 MAC
chr12:16099 1 3 4 4
chr12:16163 4 2 9 1
rs7358779 2 3 1 1
chr12:17063 1 3 8 2
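
The columns of the .freq and .ac files are related: with n haplotypes, Freq1 = AC1 / n and MAF = MAC / n, where MAC is the smaller of the two allele counts. The Python sketch below computes these values for one SNP. It is an illustration, not haploxt's code; the function name `allele_summary` is hypothetical, and taking the first allele seen as AL1 is an assumption, since this page does not say how AL1 is chosen.

```python
from collections import Counter


def allele_summary(column):
    """Return (AL1, AL2, AC1, MAC, Freq1, MAF) for one biallelic SNP.

    column holds one allele per haplotype. AL2 is None for a monomorphic SNP.
    """
    counts = Counter(column)
    if len(counts) > 2:
        raise ValueError(f"more than two alleles observed: {sorted(counts)}")
    n = len(column)
    al1 = column[0]  # assumption: first allele seen is AL1
    others = [a for a in counts if a != al1]
    al2 = others[0] if others else None
    ac1 = counts[al1]
    mac = min(ac1, n - ac1)
    return al1, al2, ac1, mac, ac1 / n, mac / n


# Ten haplotypes with four copies of allele 1, as in the chr12:16099 sample row.
print(allele_summary(list("1111333333")))
```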

.xt

Generated when option --ld is turned on.

sample.xt
M1 M2 DPRIME DELTASQ COUPLING
chr12:16252 chr12:16585 1.0000 0.6667 2,1
chr12:16252 chr12:16665 1.0000 1.0000 2,3
chr12:16252 chr12:16693 1.0000 1.0000 2,4
...
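
For downstream analysis, the .xt file can be loaded into records keyed by the header names. The Python sketch below is a hedged example: the function name `parse_xt` is hypothetical, and it assumes the file is whitespace-delimited with the header shown above as its first line. It reads column names from the header rather than hard-coding them, so it does not depend on the COUPLING column being present.

```python
def parse_xt(lines):
    """Parse .xt lines into a list of dicts, converting LD values to floats."""
    header = lines[0].split()
    rows = []
    for line in lines[1:]:
        fields = line.split()
        if len(fields) != len(header):  # skip blank or truncated lines
            continue
        row = dict(zip(header, fields))
        for key in ("DPRIME", "DELTASQ"):
            if key in row:
                row[key] = float(row[key])
        rows.append(row)
    return rows


sample_xt = [
    "M1 M2 DPRIME DELTASQ COUPLING",
    "chr12:16252 chr12:16585 1.0000 0.6667 2,1",
    "chr12:16252 chr12:16665 1.0000 1.0000 2,3",
]
for row in parse_xt(sample_xt):
    print(row["M1"], row["M2"], row["DELTASQ"])
```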

Download

You can download the source code and example files for haploxt from https://www.sph.umich.edu/csg/yli/haploxt_V108.tgz.

To install, simply type the following command:

 ./build.csh

Sample command line

 ./haploxt_names -s sample.snps -h sample.hap --allelefreq --ld -w 500 -t 0.5 --coupling -o sample.out

Additional Questions

Please email Yun Li (yunli@med.unc.edu).