Difference between revisions of "CalcMatch"

From Genome Analysis Wiki

Revision as of 22:51, 22 June 2010

CalcMatch is a C/C++ program developed by Yun Li that compares two sets of pedigree files. It was initially written to compare imputed genotypes with their true/experimental counterparts, but it can be used to measure the concordance between any two sets of pedigree files. The input data are in standard Merlin/QTDT format (http://www.sph.umich.edu/csg/abecasis/Merlin/tour/input_files.html).

--impped and --impdat specify one input pedigree set.

--trueped and --truedat specify the other input pedigree set.

--match generates a matrix whose entries take the values 0, 1, or 2, indicating the number of matched alleles. The dimensions of the matrix are the number of overlapping individuals times the number of overlapping markers of the two input pedigree sets.
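
As a rough illustration of what such a matrix holds, here is a minimal Python sketch (not CalcMatch's actual code). It assumes genotypes are available as unordered pairs of allele labels with no missing values, and that both sets are keyed by a person ID; the names `matched_alleles` and `match_matrix` are hypothetical.

```python
from collections import Counter


def matched_alleles(geno_a, geno_b):
    """Return how many alleles (0, 1, or 2) two unordered genotypes share."""
    # Multiset intersection, so ("A", "G") and ("G", "A") share both alleles.
    shared = Counter(geno_a) & Counter(geno_b)
    return sum(shared.values())


def match_matrix(imputed, truth, imputed_markers, truth_markers):
    """Build the match matrix over overlapping individuals and markers.

    `imputed` and `truth` map person ID -> {marker name: (allele, allele)}.
    Assumption: every person carries a genotype for every marker in their set.
    """
    truth_marker_set = set(truth_markers)
    people = [p for p in imputed if p in truth]
    markers = [m for m in imputed_markers if m in truth_marker_set]
    matrix = [
        [matched_alleles(imputed[p][m], truth[p][m]) for m in markers]
        for p in people
    ]
    return people, markers, matrix
```

Each row corresponds to one overlapping individual and each column to one overlapping marker, matching the dimensions described above.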

--bySNP is turned on by default to generate SNP-specific measures. The output .bySNP will contain the following 6 fields for each SNP:

   (1) SNP: SNP name
   (2) gErr: genotypic discordance rate
   (3) aErr: allelic discordance rate
   (4) matchedG: number of genotypes matched
   (5) matchedA: number of alleles matched
   (6) maskedG: total number of genotypes evaluated/masked (<= n; a clearer name would be comparedG or evaluatedG)
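
To make the relationship between these fields concrete, the following Python sketch computes them for one SNP. It rests on assumptions not stated on this page: that gErr is the fraction of compared genotypes that do not match exactly, and that aErr is the fraction of compared alleles that do not match. The function name `snp_summary` is hypothetical.

```python
from collections import Counter


def snp_summary(pairs):
    """Summarise one SNP from a list of (imputed_genotype, true_genotype) pairs.

    Each genotype is an unordered pair of allele labels, e.g. ("A", "G").
    """
    masked_g = len(pairs)  # genotypes actually compared
    # A genotype matches when both alleles agree, ignoring allele order.
    matched_g = sum(1 for imp, tru in pairs if sorted(imp) == sorted(tru))
    # Alleles matched: multiset overlap of the two genotypes (0, 1, or 2).
    matched_a = sum(sum((Counter(imp) & Counter(tru)).values()) for imp, tru in pairs)
    if masked_g == 0:
        g_err = a_err = None  # nothing compared, so no rate is defined
    else:
        g_err = 1 - matched_g / masked_g        # assumed definition of gErr
        a_err = 1 - matched_a / (2 * masked_g)  # assumed definition of aErr
    return {
        "gErr": g_err,
        "aErr": a_err,
        "matchedG": matched_g,
        "matchedA": matched_a,
        "maskedG": masked_g,
    }
```

Under these assumed definitions, a heterozygote imputed as a homozygote counts as one full genotype error but only half an allelic error, which is why aErr is never larger than gErr.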


--byGeno can be added on top of --bySNP. It generates the following fields after the 6 fields above:

   (7) hetAerr : allelic discordance rate among heterozygotes
   (8) AL1: allele 1 (an arbitrary allele)
   (9) AL2: allele 2
   (10) freq1: frequency of AL1
   (11) MAF
   (12) #true 1/1: # individuals with experimental genotype AL1/AL1
   (13) mm1/2: # of true AL1/AL1 being imputed as AL1/AL2
   (14) mm2/2: # of true AL1/AL1 being imputed as AL2/AL2
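   (15) #true 1/2: # individuals with experimental genotype AL1/AL2
   (16) mm1/1: # of true AL1/AL2 being imputed as AL1/AL1
   (17) mm2/2: # of true AL1/AL2 being imputed as AL2/AL2
   (18) #true 2/2: # individuals with experimental genotype AL2/AL2
   (19) mm1/1: # of true AL2/AL2 being imputed as AL1/AL1
   (20) mm1/2: # of true AL2/AL2 being imputed as AL1/AL2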
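
Fields (12) through (20) amount to a 3 x 3 table of true versus imputed genotype classes with the diagonal left out. A small Python sketch of that tabulation follows; it assumes a biallelic SNP whose genotypes contain only AL1 and AL2, and the names `geno_class` and `by_geno_fields` are hypothetical rather than taken from CalcMatch.

```python
CLASSES = ("1/1", "1/2", "2/2")


def geno_class(genotype, al2):
    """Classify a biallelic genotype by how many copies of AL2 it carries."""
    return CLASSES[sum(1 for allele in genotype if allele == al2)]


def by_geno_fields(pairs, al2):
    """Return fields (12)-(20) as (name, count) tuples, in the listed order.

    `pairs` is a list of (imputed_genotype, true_genotype) for one SNP.
    """
    # table[true_class][imputed_class] counts individuals in that cell.
    table = {t: {i: 0 for i in CLASSES} for t in CLASSES}
    for imputed, truth in pairs:
        table[geno_class(truth, al2)][geno_class(imputed, al2)] += 1

    fields = []
    for true_class in CLASSES:
        fields.append(("#true " + true_class, sum(table[true_class].values())))
        for imputed_class in CLASSES:
            if imputed_class != true_class:  # only mismatches are reported
                fields.append(("mm" + imputed_class, table[true_class][imputed_class]))
    return fields
```

Because the same mismatch label (for example mm1/1) appears under more than one true genotype, the sketch returns an ordered list rather than a dictionary.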



--accuracyByGeno is an option I added most recently to represent the above (7-20) information in a different way. Similar to --byGeno, it is used on top of --bySNP. It can be used together with --byGeno. It will generate the following fields, after (7-20) if --byGeno is turned on, or after the 6th field otherwise.

   (A) almajor: major allele
   (B) alminor: minor allele
   (C) freq1: major allele frequency
   (D) accuracy11: allelic concordance rate for major allele homozygotes
   (E) accuracy12: allelic concordance rate for heterozygotes
   (F) accuracy22: allelic concordance rate for minor allele homozygotes
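
One plausible reading of fields (D) through (F) is sketched below in Python. It assumes that individuals are grouped by their true genotype and that the allelic concordance rate is the number of matched alleles divided by the number of alleles compared in that group; neither assumption is confirmed on this page, and `accuracy_by_geno` is a hypothetical name.

```python
from collections import Counter


def accuracy_by_geno(pairs, minor):
    """Allelic concordance per true genotype class for one biallelic SNP.

    `pairs` is a list of (imputed_genotype, true_genotype); `minor` is the
    minor allele label. Returns None for a class with no individuals.
    """
    names = ("accuracy11", "accuracy12", "accuracy22")
    matched = dict.fromkeys(names, 0)
    compared = dict.fromkeys(names, 0)
    for imputed, truth in pairs:
        # Class index = copies of the minor allele in the true genotype.
        name = names[sum(1 for allele in truth if allele == minor)]
        matched[name] += sum((Counter(imputed) & Counter(truth)).values())
        compared[name] += 2  # two alleles compared per genotype
    return {n: (matched[n] / compared[n] if compared[n] else None) for n in names}
```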


--byPerson generates a separate output file, .byPerson, containing the following information for each person:

   (1) famid
   (2) subjID
   (3) gErr
   (4) aErr
   (5) matchedG
   (6) matchedA
   (7) maskedG


The --byPerson option is useful if there is a potential sample swap or an inter-individual difference, e.g., in sequencing depth or in the number of markers genotyped.
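
The per-person view can be sketched in Python as follows. This is an illustration only: it assumes the same rate definitions as the per-SNP measures (mismatching genotypes or alleles divided by the number compared), and the helper names `by_person` and `possible_swaps` and the 0.5 threshold are hypothetical choices, not part of CalcMatch.

```python
from collections import Counter


def by_person(imputed, truth):
    """Per-person concordance rows over the markers shared by both sets.

    `imputed` and `truth` map (famid, subjID) -> {marker: (allele, allele)}.
    """
    rows = []
    for key, imp_genos in imputed.items():
        if key not in truth:
            continue  # person is not in both sets
        shared = [m for m in imp_genos if m in truth[key]]
        if not shared:
            continue  # no overlapping markers to compare
        matched_g = sum(
            1 for m in shared if sorted(imp_genos[m]) == sorted(truth[key][m])
        )
        matched_a = sum(
            sum((Counter(imp_genos[m]) & Counter(truth[key][m])).values())
            for m in shared
        )
        masked_g = len(shared)
        rows.append({
            "famid": key[0],
            "subjID": key[1],
            "gErr": 1 - matched_g / masked_g,        # assumed definition
            "aErr": 1 - matched_a / (2 * masked_g),  # assumed definition
            "matchedG": matched_g,
            "matchedA": matched_a,
            "maskedG": masked_g,
        })
    return rows


def possible_swaps(rows, max_gerr=0.5):
    """Flag people whose genotypic discordance exceeds an arbitrary threshold."""
    return [(r["famid"], r["subjID"]) for r in rows if r["gErr"] > max_gerr]
```

A person whose discordance is far above that of everyone else is a candidate for a sample swap, while a low maskedG points to a person with few markers compared.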


By default, CalcMatch compares all genotypes that overlap between the two input sets. However, when --maskflag is turned on and both --maskped and --maskdat are specified (all three are needed), it compares only the following subset of the overlapping genotypes: genotypes that are either not found in --maskped / --maskdat (i.e., the individual or marker is not included) or missing there (included but with a value such as 0/0, N/N, or ./.). These options are useful when some individuals were masked at some SNPs while others were masked at a different set of SNPs.
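
The selection rule can be written as a single predicate. The Python sketch below assumes the mask set has already been read into a dictionary of genotype strings and a set of marker names; `is_compared`, `mask_genotypes`, and `mask_markers` are hypothetical names, and the set of missing codes contains only the three examples given above.

```python
# Missing-genotype codes named in the text; the real list may be longer.
MISSING = {"0/0", "N/N", "./."}


def is_compared(person, marker, mask_genotypes, mask_markers):
    """Decide whether an overlapping genotype is evaluated under masking.

    `mask_genotypes` maps person ID -> {marker: genotype string} for the
    mask pedigree set; `mask_markers` is the set of markers it lists.
    """
    # Not found: the individual or the marker is absent from the mask set.
    if person not in mask_genotypes or marker not in mask_markers:
        return True
    # Missing: present in the mask set but with a missing-genotype code.
    return mask_genotypes[person].get(marker, "0/0") in MISSING
```

In other words, a genotype that is observed in the mask set is treated as known input and skipped, so only the genotypes hidden from the imputation are scored.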