Difference between revisions of "CalcMatch"
Revision as of 22:51, 22 June 2010
CalcMatch is a C/C++ program developed by Yun Li that compares two sets of pedigree files. It was initially written to compare imputed genotypes with their true/experimental counterparts, but it can be used to measure the concordance between any two sets of pedigree files. The input data are in standard Merlin/QTDT format (http://www.sph.umich.edu/csg/abecasis/Merlin/tour/input_files.html).
--impped --impdat specify one input pedigree set.
--trueped --truedat specify the other input pedigree set.
--match generates a matrix taking values 0,1,2 indicating # of matched alleles. The dimension of the matrix is # of overlapping individuals times # of overlapping markers of the two input pedigree sets.
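As an illustration of what the --match matrix holds, here is a minimal Python sketch. It is not CalcMatch's own code: the function names, the dict-of-dicts data layout, and the rule of counting shared alleles between two unordered genotypes are all assumptions made for this example, and missing genotypes are not handled.

```python
from collections import Counter

def matched_alleles(g1, g2):
    """Count alleles shared by two unordered genotypes (result is 0, 1 or 2)."""
    # Multiset intersection, so A/G versus G/A counts as 2 matched alleles.
    return sum((Counter(g1) & Counter(g2)).values())

def match_matrix(imputed, truth):
    """Build a (people x markers) matrix over the overlap of two genotype sets.

    Hypothetical layout: each input maps person -> {marker -> (allele, allele)}.
    """
    people = sorted(set(imputed) & set(truth))
    if not people:
        return [], [], []
    # Keep only markers present for every overlapping person in both sets.
    markers = sorted(
        set.intersection(*[set(imputed[p]) & set(truth[p]) for p in people])
    )
    matrix = [
        [matched_alleles(imputed[p][m], truth[p][m]) for m in markers]
        for p in people
    ]
    return people, markers, matrix

imputed = {
    "p1": {"rs1": ("A", "G"), "rs2": ("C", "C")},
    "p2": {"rs1": ("A", "A"), "rs2": ("C", "T")},
    "p3": {"rs1": ("A", "A"), "rs2": ("C", "C")},
}
truth = {
    "p1": {"rs1": ("G", "A"), "rs2": ("T", "T"), "rs3": ("A", "A")},
    "p2": {"rs1": ("A", "G"), "rs2": ("C", "T"), "rs3": ("A", "A")},
}
people, markers, matrix = match_matrix(imputed, truth)
```

In this toy input, p3 and rs3 each appear in only one set, so they drop out of the matrix.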
--bySNP is turned on by default to generate SNP specific measures. The output .bySNP will contain the following 6 fields for each SNP:
(1) SNP : SNP name
(2) gErr : genotypic discordance rate
(3) aErr : allelic discordance rate
(4) matchedG : number of genotypes matched
(5) matchedA: number of alleles matched
(6) maskedG: total number of genotypes evaluated/masked (<=n of course) (I should change the naming to comparedG or evaluatedG)
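The relationship between these fields can be sketched in Python. The formulas below are an assumption based on the field descriptions (gErr as the fraction of compared genotypes that do not match fully, aErr as the fraction of compared alleles that do not match); they have not been checked against the CalcMatch source, and the function name is hypothetical.

```python
def snp_summary(match_counts):
    """Summarise one SNP from per-person matched-allele counts (each 0, 1 or 2)."""
    compared = len(match_counts)                          # maskedG
    matched_g = sum(1 for m in match_counts if m == 2)    # matchedG
    matched_a = sum(match_counts)                         # matchedA
    if compared == 0:
        # No genotypes compared, so the rates are undefined.
        g_err = a_err = None
    else:
        g_err = 1 - matched_g / compared
        a_err = 1 - matched_a / (2 * compared)
    return {
        "gErr": g_err,
        "aErr": a_err,
        "matchedG": matched_g,
        "matchedA": matched_a,
        "maskedG": compared,
    }

summary = snp_summary([2, 2, 1, 0])
```

For four compared genotypes with matched-allele counts 2, 2, 1, 0, two genotypes match completely and five of the eight alleles match.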
--byGeno can be added on top of --bySNP. It generates the following fields after the 6 fields above:
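(7) hetAerr : allelic discordance rate among heterozygotes
(8) AL1: allele 1 (an arbitrary allele)
(9) AL2: allele 2
(10) freq1: frequency of AL1
(11) MAF: minor allele frequency
(12) #true 1/1: # of individuals with experimental genotype AL1/AL1
(13) mm1/2: # of true AL1/AL1 being imputed as AL1/AL2
(14) mm2/2: # of true AL1/AL1 being imputed as AL2/AL2
(15) #true 1/2: # of individuals with experimental genotype AL1/AL2
(16) mm1/1: # of true AL1/AL2 being imputed as AL1/AL1
(17) mm2/2: # of true AL1/AL2 being imputed as AL2/AL2
(18) #true 2/2: # of individuals with experimental genotype AL2/AL2
(19) mm1/1: # of true AL2/AL2 being imputed as AL1/AL1
(20) mm1/2: # of true AL2/AL2 being imputed as AL1/AL2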
--accuracyByGeno is an option I added most recently to represent the above (7-20) information in a different way. Similar to --byGeno, it is used on top of --bySNP, and it can be used together with --byGeno. It generates the following fields, after (7-20) if --byGeno is turned on or after the 6th field otherwise:
(A) almajor: major allele
(B) alminor: minor allele
(C) freq1: major allele frequency
(D) accuracy11: allelic concordance rate for homozygotes major allele
(E) accuracy12: allelic concordance rate for heterozygotes
(F) accuracy22: allelic concordance rate for homozygotes minor allele
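One way to read the three accuracy fields is as allelic concordance stratified by the true genotype class. The Python sketch below assumes that reading. Coding genotypes as minor-allele counts (0, 1, 2) and the function name are choices made for this example, not something taken from CalcMatch.

```python
def accuracy_by_genotype(pairs):
    """Allelic concordance per true genotype class.

    pairs: iterable of (true, imputed) genotypes coded as minor-allele counts,
    so 0 = major/major, 1 = heterozygote, 2 = minor/minor.
    """
    matched = {0: 0, 1: 0, 2: 0}   # matched alleles per true class
    total = {0: 0, 1: 0, 2: 0}     # compared alleles per true class
    for true_g, imputed_g in pairs:
        # For biallelic markers, shared alleles = 2 - |difference in counts|.
        matched[true_g] += 2 - abs(true_g - imputed_g)
        total[true_g] += 2
    names = {0: "accuracy11", 1: "accuracy12", 2: "accuracy22"}
    return {
        names[g]: (matched[g] / total[g] if total[g] else None)
        for g in (0, 1, 2)
    }

acc = accuracy_by_genotype([(0, 0), (0, 1), (1, 1), (1, 2), (2, 2), (2, 0)])
```

A class with no true genotypes gets None rather than a rate, since the concordance is undefined there.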
--byPerson generates a separate output file, .byPerson, which contains the following information for each person:
(1) famid
(2) subjID
(3) gErr
(4) aErr
(5) matchedG
(6) matchedA
(7) maskedG
This --byPerson option is useful if there is a potential sample swap or inter-individual differences, e.g., in sequencing depth or number of markers genotyped.
CalcMatch compares all genotypes overlapping the two input sets. However, when --maskflag is turned on AND --maskped and --maskdat are specified (I know ...), it compares only the following subset of the overlapping genotypes: genotypes either not found (i.e., individual or marker not included) or missing (included but with value 0/0, N/N, ./., etc.) in --maskped / --maskdat. These options are useful when some individuals were masked for some SNPs while others were masked for a different set of SNPs.
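The masking rule can be written as a small predicate. This Python sketch only illustrates the rule as described here: the data layout, the function name, and the exact set of missing-value codes (the text says "etc.", so the real list may be longer) are assumptions.

```python
# Missing-genotype codes named in the text; the real list may include others.
MISSING = {("0", "0"), ("N", "N"), (".", ".")}

def is_compared(person, marker, mask):
    """Return True if this genotype would be compared under the masking rule.

    mask: hypothetical layout mapping person -> {marker -> (allele, allele)},
    standing in for the --maskped / --maskdat pedigree set.
    """
    genotype = mask.get(person, {}).get(marker)
    if genotype is None:
        # Individual or marker not included in the mask set.
        return True
    # Included in the mask set but recorded as missing.
    return tuple(genotype) in MISSING

mask = {"p1": {"rs1": ("A", "G"), "rs2": ("0", "0")}}
```

Here p1 has an observed genotype at rs1 in the mask set, so that genotype is skipped, while rs2 (missing), rs3 (marker absent) and anything for p2 (individual absent) are compared.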