http://genome.sph.umich.edu/w/api.php?action=feedcontributions&user=Ppwhite&feedformat=atomGenome Analysis Wiki - User contributions [en]2024-03-29T12:47:15ZUser contributionsMediaWiki 1.35.9http://genome.sph.umich.edu/w/index.php?title=Thunder&diff=14658Thunder2017-02-21T16:08:50Z<p>Ppwhite: /* (step 2) Genotype/haplotype calling using thunder thunder_glf_freq */</p>
<hr />
<div>This page documents how to perform variant calling from low-coverage sequencing data using glfMultiples and thunder. The pipeline was originally developed by [mailto:yunli@med.unc.edu Yun Li] and [mailto:goncalo@umich.edu Goncalo Abecasis] for the 1000 Genomes Low Coverage Pilot Project. <br />
<br />
== Input Data ==<br />
<br />
To get started, you will need glf files in the standard [http://samtools.sourceforge.net/SAM1.pdf glf format]. Sample files are available at [ftp://share.sph.umich.edu/1000genomes/pilot1/examples/glf.tgz sample glf files]. <br />
<br />
If you do not have glf files, you can generate them from bam files (the bam format is also specified in the [http://samtools.sourceforge.net/SAM1.pdf SAM/BAM specification]) using the following command line: <br />
<br />
samtools pileup -g -T 1 -f ref.fa my.bam &gt; my.glf<br />
<br />
Note: you will need the reference fasta file ref.fa to create a glf file from a bam file.<br />
<br />
== How to Run ==<br />
<br />
This variant calling pipeline has two steps: (step 1) promotion of a set of potential polymorphisms, and (step 2) genotype/haplotype calling using LD information. <br />
<br />
=== (step 1) Site promotion using software glfMultiples [https://csg.sph.umich.edu//yli/GPT_Freq.011.source.tgz GPT_Freq] ===<br />
<br />
GPT_Freq -b my.out -p 0.9 --minDepth 10 --maxDepth 1000 *.glf <br />
<br />
minDepth and maxDepth are cutoffs on total depth (summed across all individuals). We have found it useful to exclude sites with extremely low or high total depth. Please see Important Filters below.<br />
<br />
=== (step 2) Genotype/haplotype calling using thunder [https://csg.sph.umich.edu//yli/thunder/thunder.V011.source.tgz thunder_glf_freq] ===<br />
<br />
thunder_glf_freq --shotgun my.out.$chr --detailedInput -r 100 --states 200 --dosage --phase --interim 25 -o my.final.out<br />
<br />
Notes: <br />
<br />
(1) The program thunder used in step 2 is an extension of MaCH, the genotype imputation software we have previously developed. For details regarding the shared options, please check out [http://sph.umich.edu/csg/abecasis/mach/index.html MaCH website] and [http://genome.sph.umich.edu/wiki/Mach MaCH wiki]. <br />
<br />
(2) Check out example files and command lines under examples/thunder/ in the thunder package [https://sph.umich.edu/csg/abecasis/thunder/thunder.V011.source.tgz thunder_glf_freq].<br />
<br />
== Example Showing the Whole Pipeline ==<br />
In the thunder [https://csg.sph.umich.edu//yli/thunder/thunder.V011.source.tgz thunder_glf_freq] tarball, you can find, under the example/thunder/ folder, input files extracted from real data and a C-shell script that executes the whole analysis pipeline.<br />
<br />
== Ligate Haplotypes ==<br />
Please use [http://csg.sph.umich.edu//yli/ligateHap.V004.tgz ligateHaplotypes].<br />
<br />
== Important Filters ==<br />
<br />
We have found that the following filters are helpful.<br />
<br />
=== allelic imbalance ===<br />
A statistic developed by Dr. Tom Blackwell; see [http://genome.sph.umich.edu/wiki/Genotype_Likelihood_Based_Allele_Balance allelic imbalance]. <br />
<br />
=== indel filter ===<br />
We recommend distance to known indels >= 5bp. A catalog of known indels can be found at [ftp://ftp-trace.ncbi.nih.gov/1000genomes/ftp/pilot_data/release/2010_07/low_coverage/indels/ indel catalog].<br />
<br />
=== site promotion filter ===<br />
We recommend setting the parameter -p to at least 0.9 in step 1 (running glfMultiples).<br />
<br />
=== strand bias filter ===<br />
<br />
=== total depth filter ===<br />
For the 1000 Genomes Project (average depth per individual ~4X), we have found it useful to exclude sites with average total depth per individual < 0.5X or > 20X.<br />
<br />
=== coverage filter ===<br />
We recommend requiring coverage in >50% of individuals.<br />
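As a concrete illustration of the total depth and coverage filters above, a per-site check could be sketched as follows (the function name and its defaults are assumptions following the text; this is not part of glfMultiples or thunder):<br />

```python
def passes_depth_and_coverage(depths, min_avg=0.5, max_avg=20.0,
                              min_frac_covered=0.5):
    """Apply the total depth and coverage filters to one candidate site.

    depths -- per-individual read depth at the site.
    A site passes if the average depth per individual falls within
    [min_avg, max_avg] X and more than min_frac_covered of the
    individuals have at least one read.
    """
    n = len(depths)
    avg_depth = sum(depths) / n
    frac_covered = sum(1 for d in depths if d > 0) / n
    return (min_avg <= avg_depth <= max_avg) and frac_covered > min_frac_covered

print(passes_depth_and_coverage([4, 5, 3, 4]))  # True: ~4X, everyone covered
print(passes_depth_and_coverage([8, 0, 0, 0]))  # False: only 25% covered
```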
<br />
=== flanking sequence filter ===<br />
We recommend excluding sites whose flanking 10-mer occurs with >0.1% frequency among candidate sites. Base quality re-calibration can be performed with samtools calmd -br.<br />
<br />
== Citation ==<br />
Li Y, Sidore C, Kang HM, Boehnke M, Abecasis GR. Low-coverage sequencing: Implications for design of complex trait association studies. <em>Genome Res.</em> 2011 Jun;21(6):940-51. <br><br />
<br />
== Inference with External Reference ==<br />
<br />
Please refer to [http://genome.sph.umich.edu/wiki/UMAKE UMAKE]. <br><br />
<br />
== Questions and Comments? ==<br />
<br />
Email [mailto:yunli@med.unc.edu Yun Li].</div>Ppwhitehttp://genome.sph.umich.edu/w/index.php?title=Famrvtest&diff=14656Famrvtest2017-02-21T14:34:41Z<p>Ppwhite: /* PED and DAT Files */</p>
<hr />
<div>== Useful Wiki Pages ==<br />
<br />
There are a few pages in this Wiki that may be useful to famRvTest users. Here are links to key pages:<br />
<br />
* The [[FamRvTest_command|'''famrvtest''' Command Reference]]<br />
* The [[FamRvTest_tutorial|'''famrvtest''' Tutorial]]<br />
<br />
== Brief Description ==<br />
<br />
'''famrvtest''' is a computationally efficient tool for family-based rare variant association analyses using genotyping array or sequencing data. '''famrvtest''' supports both single variant and gene-level associations. <br />
<br />
For any questions, please contact [[Shuang_Feng |Shuang Feng]] (sfengsph at umich.edu) or [[Goncalo_Abecasis|Gonçalo Abecasis]] (goncalo at umich.edu).<br />
<br />
== Download and Installation ==<br />
* University of Michigan CSG users can go to the following:<br />
/net/fantasia/home/sfengsph/code/famrvtest/bin/famrvtest<br />
<br />
=== Where to Download ===<br />
* Source code can be downloaded in the following<br />
<br />
[[Media:LINUX_famrvtest.2.4.tgz|Source for '''LINUX''']]<br />
[[Media:MAC_famrvtest.2.4.tgz|Source for '''MAC''']]<br />
[[Media:MINGW_famrvtest.2.4.tgz|Source for '''MINGW''']]<br />
[[Media:CYGWIN64_famrvtest.2.4.tgz|Source for '''CYGWIN64''']]<br />
<br />
* Executable can be downloaded in the following:<br />
<br />
[[Media:Famrvtest.2.4.linux.executable.tgz |Executable for '''LINUX''']]<br />
<br />
=== How to Compile ===<br />
* Save it to your local path and decompress using the following command:<br />
tar xvzf LINUX_famrvtest.2.4.tgz<br />
* Go to the famrvtest directory and type the following command to compile:<br />
make<br />
<br />
=== How to Execute ===<br />
* Go to famrvtest/bin and use the following:<br />
./famrvtest<br />
<br />
==Command Reference==<br />
Please go to [[FamRvTest_command|Command Reference Page]] for details.<br />
<br />
==Approach==<br />
'''famrvtest''' uses a linear mixed model approach with an efficient optimization algorithm to account for familial relatedness; kinship is either quantified from pedigree structure or estimated from genome-wide marker genotypes. Single-marker association tests (score, likelihood ratio, and Wald) and gene-level association methods (weighted and un-weighted burden, SKAT, and variable threshold tests) have been implemented. A manuscript is in preparation.<br />
<br />
== Input Files ==<br />
famrvtest needs the following files as input: PED and DAT files in Merlin format, '''AND/OR''' a VCF file. When genotypes are stored in the PED and DAT files, the VCF file is not needed. However, even if genotypes are stored in a VCF file, PED and DAT files are still needed to carry covariate and trait information. <br />
<br />
=== PED and DAT Files ===<br />
* When PED file has genotypes saved, there is no need for a VCF file as input.<br />
* '''famrvtest''' takes PED/DAT file in [http://www.sph.umich.edu/csg/abecasis/Merlin/index.html '''Merlin'''] format. Please refer to [http://sph.umich.edu/csg/abecasis/merlin/tour/input_files.html PED/DAT format description] for details.<br />
* An example PED file is in the following:<br />
1 1 0 0 1 1.5 1 23 A A A A A A A A A A<br />
2 1 0 0 1 1.0 1 34 A C A C A C A C A C<br />
3 1 0 0 2 0.4 1 43 A A A A A A A A A A<br />
4 1 0 0 2 0.9 1 13 A C A C A C A C A C<br />
* The matching DAT file is in the following:<br />
T YourTraitName<br />
C SEX<br />
C AGE<br />
M 1:123456a<br />
M 1:234567<br />
M 2:111111<br />
M 2:222222<br />
M X:12345<br />
* DAT file must have variant names in the following format "M chr:pos". <br />
* The order of labels in the DAT file has to match the order of fields in the PED file. <br />
* '''Markers in PED and DAT file must be sorted by chromosome and position.'''<br />
<br />
* Covariate and trait values are saved in PED file. Covariate and trait descriptions are saved in DAT file.<br />
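Because markers in the PED and DAT files must be sorted by chromosome and position, it can help to validate a DAT file before running '''famrvtest'''. A minimal sketch (the helper and its chromosome ordering, with X and Y after the autosomes, are assumptions for illustration and not part of famrvtest):<br />

```python
def dat_markers_sorted(dat_lines):
    """Return True if the 'M chr:pos' entries of a Merlin DAT file
    are sorted by chromosome and then position."""
    chrom_order = {str(i): i for i in range(1, 23)}
    chrom_order.update({"X": 23, "Y": 24})
    keys = []
    for line in dat_lines:
        fields = line.split()
        if fields and fields[0] == "M":
            chrom, pos = fields[1].split(":")[:2]
            # tolerate a trailing non-digit suffix such as '1:123456a'
            pos = "".join(ch for ch in pos if ch.isdigit())
            keys.append((chrom_order[chrom], int(pos)))
    return keys == sorted(keys)

dat = ["T YourTraitName", "C SEX", "C AGE",
       "M 1:123456a", "M 1:234567", "M 2:111111", "M 2:222222", "M X:12345"]
print(dat_markers_sorted(dat))  # True
```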
<br />
=== VCF File ===<br />
* Another option is to use VCF as input. Please refer to the following link for VCF file specification: [http://www.1000genomes.org/wiki/Analysis/Variant%20Call%20Format/vcf-variant-call-format-version-41 1000 genome wiki VCF specs] <br />
* VCF file should be compressed by bgzip and indexed by tabix, using the following command:<br />
bgzip input.vcf ## this command will generate input.vcf.gz<br />
tabix -p vcf -f input.vcf.gz ## this command will generate input.vcf.gz.tbi<br />
* Even with the presence of VCF file, PED/DAT files are still needed for covariates and phenotypes.<br />
<br />
=== Group File for Gene-level Tests===<br />
* Grouping methods are only necessary for gene-level tests.<br />
* With the --groupFile option, you can specify a particular set of variants to be grouped for burden tests.<br />
* The group file must be a tab or space delimited file in the following format:<br />
GROUP_ID MARKER1_ID MARKER2_ID MARKER3_ID ... <br />
* MARKER_ID must be in the following format:<br />
CHR:POS:ALLELE1:ALLELE2<br />
* An example group file is:<br />
PLEKHN1 1:901922:G:A 1:901923:C:A 1:902088:G:A 1:902128:C:T 1:902133:C:G 1:902176:C:T 1:905669:C:G <br />
HES4 1:934735:A:C 1:934770:G:A 1:934801:C:T 1:935085:G:A 1:935089:C:G<br />
* '''Version 2.4 and later allow variants from different chromosomes to be grouped for testing. This might be useful for pathway analysis.'''<br />
* '''Note: any variants whose alleles differ from those listed in the group file will be excluded from gene-level tests.'''<br />
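Reading a group file of this shape is straightforward; the sketch below (the helper name is an assumption) parses it into a dictionary mapping each GROUP_ID to its marker list:<br />

```python
def parse_group_file(lines):
    """Parse tab- or space-delimited group file lines into
    {GROUP_ID: [CHR:POS:ALLELE1:ALLELE2, ...]}."""
    groups = {}
    for line in lines:
        fields = line.split()
        if fields:  # skip blank lines
            groups[fields[0]] = fields[1:]
    return groups

lines = ["PLEKHN1 1:901922:G:A 1:901923:C:A 1:902088:G:A",
         "HES4 1:934735:A:C 1:934770:G:A"]
groups = parse_group_file(lines)
print(sorted(groups))          # ['HES4', 'PLEKHN1']
print(len(groups["PLEKHN1"]))  # 3
```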
<br />
== Example Command Line ==<br />
===Single Variant Analysis===<br />
The following command line runs single variant association analysis of trait "LDL" using the score test, after inverse normalization of the quantitative trait and adjustment for covariates. --traitName specifies the trait or traits to analyze in this batch; if this option is not used, all traits coded in the data file are analyzed. --SingleVarLRT provides essentially the same test as merlin's --fastAssoc option. <br />
./famrvtest --ped your.ped --dat your.dat --vcf your.vcf.gz --SingleVarScore --inverseNormal --useCovariates --traitName LDL<br />
<br />
All of the above commands perform family-based association analysis using kinship matrices generated from the pedigree structure coded in the pedigree file. The following command line shows an example of using genotypes to estimate an empirical relationship matrix instead. <br />
./famrvtest --ped your.ped --dat your.dat --SingleVarScore --inverseNormal --useCovariates --traitName LDL --kinPedigree<br />
<br />
===Gene-level Association===<br />
<br />
The following command line runs gene-level association analysis of the genes listed in "your.genes.groupfile" for trait "LDL" using the SKAT, Madsen-Browning weighted burden, un-weighted rare allele count burden, collapsing burden, and variable threshold tests, after inverse normalization of the quantitative trait and adjustment for covariates. Only rare variants with MAF less than or equal to 0.05 and minor allele count greater than or equal to 3 are grouped.<br />
./famrvtest -ped your.ped -dat your.dat --SKAT_BETA --MB --burden --VT --inverseNormal --useCovariates --traitName LDL --groupFile your.genes.groupfile --maf 0.05<br />
<br />
== Change Log ==<br />
<br />
* Released version 0.0.9, fixing a bug that could cause a compiling error. (10/10/2013)<br />
* Released version 2.0, a faster version, with a family-based single variant permutation test added. (7/14/2014)<br />
* Released version 2.2, fixing a bug that prevented single variant tests from being run alone. (7/15/2014)<br />
* Uploaded a new source code package for version 2.2, with updated makefiles. (8/4/14)<br />
* Released version 2.3, fixing a bug that caused a compiling error (not finding the correct makefile). (8/20/14)<br />
* Released version 2.4, enabling pathway analysis in which variants from different chromosomes can be grouped. (9/27/2014)</div>Ppwhitehttp://genome.sph.umich.edu/w/index.php?title=CalcMatch&diff=14653CalcMatch2017-02-02T16:03:38Z<p>Ppwhite: /* Download */</p>
<hr />
<div>CalcMatch is a C/C++ program developed by [https://csg.sph.umich.edu//yli/ Yun Li]. It compares two sets of pedigree files. It was initially written to compare imputed genotypes with their true/experimental counterparts, but it can be used to assess concordance between any two sets of pedigree files. The input data are in standard Merlin/QTDT format (http://csg.sph.umich.edu//abecasis/Merlin/tour/input_files.html). <br />
<br />
= Options =<br />
== --impped --impdat <br> ==<br />
specify one input pedigree set. <br />
<br />
== --trueped --truedat <br> ==<br />
specify the other input pedigree set.<br />
<br />
== --match == <br />
generates a matrix taking values 0, 1, 2, indicating the number of matched alleles. The dimension of the matrix is the number of overlapping individuals times the number of overlapping markers between the two input pedigree sets. <br />
<br />
== --bySNP == <br />
is turned on by default (which means: if you put --bySNP on the command line, it will be turned OFF!). It generates SNP-specific measures. The output file .bySNP will contain the following 6 fields for each SNP: <br />
<br />
(1) SNP&nbsp;: SNP name<br />
(2) gErr&nbsp;: genotypic discordance rate<br />
(3) aErr&nbsp;: allelic discordance rate<br />
(4) matchedG&nbsp;: number of genotypes matched<br />
(5) matchedA: number of alleles matched<br />
(6) maskedG: total number of genotypes evaluated/masked (&lt;=n of course) (I should change the naming to comparedG or evaluatedG)<br />
<br />
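One plausible reading of how fields (2)-(6) relate to one another is sketched below. CalcMatch's exact conventions may differ (for example in how missing genotypes are coded), so treat this purely as an illustration:<br />

```python
from collections import Counter

def by_snp_stats(true_genos, imp_genos):
    """Compute per-SNP concordance fields for one SNP, given parallel
    lists of true and imputed genotypes (each genotype an allele pair,
    e.g. ('A', 'C'); None marks a genotype that is not evaluated)."""
    masked_g = matched_g = matched_a = 0
    for t, i in zip(true_genos, imp_genos):
        if t is None or i is None:
            continue  # not evaluated
        masked_g += 1
        # count matched alleles as an unordered (multiset) comparison
        matched = sum((Counter(t) & Counter(i)).values())  # 0, 1, or 2
        matched_a += matched
        if matched == 2:
            matched_g += 1
    return {"gErr": 1 - matched_g / masked_g,
            "aErr": 1 - matched_a / (2 * masked_g),
            "matchedG": matched_g, "matchedA": matched_a,
            "maskedG": masked_g}

truth = [("A", "A"), ("A", "C"), ("C", "C"), None]
imputed = [("A", "A"), ("A", "A"), ("C", "C"), ("A", "C")]
stats = by_snp_stats(truth, imputed)
print(stats["matchedG"], stats["maskedG"])  # 2 3
```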
<br> <br />
<br />
== --byGeno ==<br />
NOTE: this option is turned on by default. If you put --byGeno on the command line, it will be turned OFF!<br />
It can be added on top of --bySNP and generates the following fields after the 6 fields above: <br />
<br />
(7) hetAerr : allelic discordance rate among heterozygotes<br />
(8) AL1: allele 1 (an arbitrary allele)<br />
(9) AL2: allele 2<br />
(10) freq1: frequency of AL1<br />
(11) MAF<br />
(12) #true 1/1: # individuals with experimental genotype AL1/AL1<br />
(13) mm1/2: # of true AL1/AL1 being imputed as AL1/AL2<br />
(14) mm2/2: # of true AL1/AL1 being imputed as AL2/AL2<br />
(15) #true 1/2<br />
(16) mm1/1<br />
(17) mm2/2<br />
(18) #true 2/2<br />
(19) mm1/1<br />
(20) mm1/2<br />
<br />
<br />
<br />
<br><br />
<br />
== --accuracyByGeno ==<br />
Similar to --byGeno, it is used on top of --bySNP, and may be used together with --byGeno. It will generate the following fields, after fields (7-20) if --byGeno is turned on, or after the 6th field otherwise. <br />
<br />
(A) almajor: major allele<br />
(B) alminor: minor allele<br />
(C) freq1: major allele frequency<br />
(D) accuracy11: allelic concordance rate for homozygotes major allele<br />
(E) accuracy12: allelic concordance rate for heterozygotes<br />
(F) accuracy22: allelic concordance rate for homozygotes minor allele<br />
<br />
<br> <br />
== --byPerson ==<br />
generates a separate output file .byPerson and contains the following information for each person: <br />
<br />
(1) famid<br />
(2) subjID<br />
(3) gErr<br />
(4) aErr<br />
(5) matchedG<br />
(6) matchedA<br />
(7) maskedG<br />
<br />
<br> This --byPerson option is useful if there is a potential sample swap, or when there are inter-individual differences, e.g., in sequencing depth or in the number of markers genotyped. <br />
<br />
<br> <br />
<br />
== --maskflag --maskped --maskdat ==<br />
CalcMatch compares all genotypes overlapping between the two input sets. However, when --maskflag is turned on AND --maskped and --maskdat are specified (I know ...), it compares only the following subset of the overlapping genotypes: genotypes that are either not found (i.e., individual or marker not included) or missing (included but with value 0/0, N/N, ./., etc.) in --maskped / --maskdat. These options are useful when some individuals were masked for some SNPs while others were masked for a different set of SNPs.<br />
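The masking rule amounts to set logic over (individual, marker) genotype keys. A sketch (the function, key representation, and the set of missing-value codes are illustrative assumptions, not CalcMatch internals):<br />

```python
MISSING = {None, "0/0", "N/N", "./."}

def genotypes_to_compare(overlap_keys, mask_genotypes):
    """Given the (individual, marker) keys present in both input sets,
    keep only those whose genotype is absent or missing in the mask
    pedigree, per --maskflag/--maskped/--maskdat."""
    return {k for k in overlap_keys if mask_genotypes.get(k) in MISSING}

overlap = {("fam1", "rs1"), ("fam1", "rs2"), ("fam2", "rs1")}
mask = {("fam1", "rs1"): "A/C",   # present in mask: excluded
        ("fam1", "rs2"): "./."}   # missing in mask: compared
print(sorted(genotypes_to_compare(overlap, mask)))
# [('fam1', 'rs2'), ('fam2', 'rs1')]
```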
<br />
= output files =<br />
== .bySNP ==<br />
See option --bySNP <br><br />
<br />
== .byPerson ==<br />
See option --byPerson <br><br />
<br />
== .minusstrand ==<br />
Reports the list of SNPs that appear on the minus strand (that is, SNPs for which more than two alleles are seen when combining the imputed and true pedigree files). This file will only be generated if --byGeno or --accuracyByGeno is turned on. The former option, --byGeno, is turned on by default. <br><br />
<br />
= example command lines =<br />
<br />
CalcMatch --trueped true.ped --truedat true.dat --impped imp.ped --impdat imp.dat -o CalcMatch.Output --byPerson <br />
<br />
Will generate CalcMatch.Output.bySNP (6 fields only) and CalcMatch.Output.byPerson.<br />
<br />
CalcMatch --trueped true.ped --truedat true.dat --impped imp.ped --impdat imp.dat -o CalcMatch.Output --byGeno --byPerson <br />
<br />
Will generate CalcMatch.Output.bySNP (6+14 fields) and CalcMatch.Output.byPerson.<br />
<br />
CalcMatch --trueped true.ped --truedat true.dat --impped imp.ped --impdat imp.dat -o CalcMatch.Output --accuracyByGeno --byPerson <br />
<br />
Will generate CalcMatch.Output.bySNP (6+6 fields only) and CalcMatch.Output.byPerson.<br />
<br />
CalcMatch --trueped true.ped --truedat true.dat --impped imp.ped --impdat imp.dat -o CalcMatch.Output --accuracyByGeno --byGeno --byPerson <br />
<br />
Will generate CalcMatch.Output.bySNP (6+14+6 fields) and CalcMatch.Output.byPerson.<br />
<br />
= Download =<br />
Please go to http://csg.sph.umich.edu//yli/software.html</div>Ppwhitehttp://genome.sph.umich.edu/w/index.php?title=MaCH:_machX&diff=14652MaCH: machX2017-02-02T16:02:50Z<p>Ppwhite: </p>
<hr />
<div>This page documents how to perform X chromosome (non-pseudo-autosomal part) imputation using MaCH [http://csg.sph.umich.edu/csg/yli/mach] and minimac [http://genome.sph.umich.edu/wiki/Minimac]. <br />
<br />
== Getting Started ==<br />
<br />
=== Your Own Data ===<br />
<br />
To get started, you will need to store your data in [[Merlin]] format pedigree and data files, one per chromosome. For details of the Merlin file format, see the Merlin tutorial [http://csg.sph.umich.edu//abecasis/Merlin/tour/input_files.html]. <br><br />
<br />
Within each file, markers should be sorted by chromosome position. Alleles should be stored on the forward strand and can be encoded as 'A', 'C', 'G' or 'T' (there is no need to use numeric identifiers for each allele). <br><br />
<br />
Note that, for males, hemizygous genotypes should be coded as homozygotes. <br><br />
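For example, a male carrying a single 'A' at an X-linked marker occupies both allele columns of the pedigree file with 'A'. A trivial sketch (the helper name is an assumption):<br />

```python
def male_x_genotype(allele):
    """Return the two pedigree-file allele columns for a male's
    non-pseudo-autosomal X marker: the single hemizygous allele
    is simply written twice, i.e. coded as a homozygote."""
    return [allele, allele]

print(male_x_genotype("A"))  # ['A', 'A']
```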
<br />
=== Reference Haplotypes ===<br />
<br />
You can download the reference haplotypes from MaCH download page [http://csg.sph.umich.edu//yli/mach/download/chrX.html].<br />
<br />
== Two-Step Imputation ==<br />
<br />
=== Phase Your Own Data ===<br />
<br />
If there are no missing genotypes in males, you will only need to phase the females. Make sure that all alleles are stored on the forward strand before phasing. <br />
<br />
mach1 -d sample.dat -p sample.ped --states 200 -r 20 --phase -o sample.phased > sample.phased.log<br />
<br />
=== Impute ===<br />
<br />
Imputation will then be performed on the phased haplotypes using minimac [http://genome.sph.umich.edu/wiki/Minimac].<br />
<br />
minimac --refHaps ref.hap.gz --refSnps ref.snps --haps sample.phased.gz --snps sample.snps --rounds 5 --states 200 --prefix sample.imputed > sample.imputed.log<br />
<br />
== FAQ ==<br />
=== Shall I phase/impute males and females together or separately? ===<br />
Phasing males together with or separately from females doesn't seem to affect imputation quality. <br />
<br />
Imputing males together with or separately from females doesn't seem to affect imputation quality either. <br />
<br />
== Questions and Comments? ==<br />
<br />
Email [mailto:yunli@med.unc.edu Yun Li].</div>Ppwhitehttp://genome.sph.umich.edu/w/index.php?title=BAM_Review_Action_Items&diff=14651BAM Review Action Items2017-02-02T16:01:04Z<p>Ppwhite: /* Accessing String Values */</p>
<hr />
<div>[[Category:libStatGen]]<br />
[[Category:libStatGen BAM]]<br />
<br />
== Review Sept 20th ==<br />
=== Notes ===<br />
* returning const char*<br />
* SamFileHeader change referenceContigs, etc to private from public<br />
* Add way to copy a SAM record.<br />
<br />
== Review Sept 17th ==<br />
=== Topics Discussed ===<br />
* [[#Return Statuses|Checking if methods succeeded/failed (checking return values/return statuses)]]<br />
* [[#Accessing String Values|Strings as return values]]<br />
<br />
=== NOTES From Meeting ===<br />
* General Notes:<br />
**InputFile should not use <code>long int</code>; it should instead use <code>long long</code>.<br />
* Error Handling Notes:<br />
**Any time there is an error, the code could call handleError, which would have a switch to return the error, throw an exception, or abort. It would be called with an error code and a string. Perhaps an error handler class that could be used everywhere; each class would have a member of that type containing this information.<br />
*Returning values of Strings Notes:<br />
** Problems with returning const char*<br />
*** If the pointer is stored when returned, it becomes invalid if the class modifies the underlying string.<br />
** Problems with passing in std::string& as a parameter to be set.<br />
*** people typically want to operate on the return of the method.<br />
** One idea was returning a reference to a string<br />
*** Does that solve the problem? Won't the contents change when a new one is read? Is that what we want?<br />
<br />
<br />
=== Useful Links ===<br />
BAM Library FAQs: http://genome.sph.umich.edu/wiki/SAM/BAM_Library_FAQs<br />
<br />
Source Code: http://csg.sph.umich.edu//mktrost/doxygen/html/<br />
<br />
Test code for setting values in the library: http://csg.sph.umich.edu//mktrost/doxygen/html/WriteFiles_8cpp-source.html<br />
<br />
=== Topics for Discussion ===<br />
==== Return Statuses ====<br />
Currently anytime you do anything on a SAM/BAM file, you have to check the status for failure:<br />
<source lang="cpp"><br />
SamFile samIn;<br />
if(!samIn.OpenForRead(argv[1]))<br />
{<br />
fprintf(stderr, "%s\n", samIn.GetStatusMessage());<br />
return(samIn.GetStatus());<br />
}<br />
<br />
// Read the sam header.<br />
SamFileHeader samHeader;<br />
if(!samIn.ReadHeader(samHeader))<br />
{<br />
fprintf(stderr, "%s\n", samIn.GetStatusMessage());<br />
return(samIn.GetStatus());<br />
}<br />
</source><br />
A previous recommendation was to "Add an option by class that says whether or not to abort on failure. (or even an option on each method)"<br />
<br />
I am proposing modifying the classes to throw exceptions on failures. It would then be up to the user to catch them if they want to handle them, or to let them exit the program (which would print out the error message).<br />
<source lang="cpp"><br />
SamFile samIn;<br />
samIn.OpenForRead(argv[1]);<br />
<br />
// Read the sam header.<br />
SamFileHeader samHeader;<br />
samIn.ReadHeader(samHeader);<br />
<br />
// Open the output file for writing.<br />
SamFile samOut;<br />
try<br />
{<br />
samOut.OpenForWrite(argv[2]);<br />
samOut.WriteHeader(samHeader);<br />
}<br />
catch(const GenomeException&amp; e)<br />
{<br />
std::cout << "Caught an Exception" << e.what() << std::endl;<br />
}<br />
std::cout << "Continue Processing\n";<br />
</source><br />
For caught exceptions, you would see the following and processing would continue:<br />
<pre><br />
Caught an Exception<br />
FAIL_IO: Failed to Open testFiles/unknown for writing<br />
Continue Processing<br />
</pre><br />
<br />
For an uncaught exception, you would see the following and processing would be stopped:<br />
<pre><br />
terminate called after throwing an instance of 'GenomeException'<br />
what(): <br />
FAIL_IO: Failed to Open testFiles/unknown for reading<br />
Aborted<br />
</pre><br />
<br />
<br />
==== Accessing String Values ====<br />
SAM/BAM files have strings in them that people will want to read out.<br />
How should we handle this interface?<br />
Currently we do a mix of returning const char*, like:<br />
<source lang="cpp"><br />
const char* SamRecord::getSequence()<br />
{<br />
myStatus = SamStatus::SUCCESS;<br />
if(mySequence.Length() == 0)<br />
{<br />
// 0 Length, means that it is in the buffer, but has not yet<br />
// been synced to the string, so do the sync.<br />
setSequenceAndQualityFromBuffer();<br />
}<br />
return mySequence.c_str();<br />
}<br />
const std::string& SamRecord::getSequence()<br />
{<br />
myStatus = SamStatus::SUCCESS;<br />
if(mySequence.Length() == 0)<br />
{<br />
// 0 Length, means that it is in the buffer, but has not yet<br />
// been synced to the string, so do the sync.<br />
setSequenceAndQualityFromBuffer();<br />
}<br />
    return mySequence;<br />
}<br />
<br />
</source><br />
and passing in references to strings, like:<br />
<source lang="cpp"><br />
// Set the passed in string to the header line at the specified index.<br />
// It does NOT clear the current contents of header.<br />
// NOTE: some indexes will return blank if the entry was deleted.<br />
bool SamFileHeader::getHeaderLine(unsigned int index, std::string& header) const<br />
{<br />
// Check to see if the index is in range of the header records vector.<br />
if(index < myHeaderRecords.size())<br />
{<br />
// In range of the header records vector, so get the string for<br />
// that record.<br />
SamHeaderRecord* hdrRec = myHeaderRecords[index];<br />
hdrRec->appendString(header);<br />
return(true);<br />
}<br />
else<br />
{<br />
unsigned int commentIndex = index - myHeaderRecords.size();<br />
// Check to see if it is in range of the comments.<br />
if(commentIndex < myComments.size())<br />
{<br />
// It is in range of the comments, so add the type.<br />
header += "@CO\t";<br />
// Add the comment.<br />
header += myComments[commentIndex];<br />
// Add the new line.<br />
header += "\n";<br />
return(true);<br />
}<br />
}<br />
// Invalid index.<br />
return(false);<br />
}<br />
</source><br />
<br />
http://csg.sph.umich.edu//mktrost/doxygen/html/SamRecord_8h-source.html<br />
<br />
==== SamFileHeader ====<br />
*Should this be renamed to SamHeader?<br />
*Do we like the classes being named starting with Sam? Should it be Bam?<br />
<br />
Should we add the following to SamFileHeader:<br />
<source lang="cpp"><br />
//////////////////////////////////<br />
// Set methods for header fields.<br />
bool setVersion(const char* version);<br />
bool setSortOrder(const char* sortOrder);<br />
bool addSequenceName(const char* sequenceName);<br />
bool setSequenceLength(const char* keyID, int sequenceLength);<br />
bool setGenomeAssemblyId(const char* keyID, const char* genomeAssemblyId);<br />
bool setMD5Checksum(const char* keyID, const char* md5sum);<br />
bool setURI(const char* keyID, const char* uri);<br />
bool setSpecies(const char* keyID, const char* species);<br />
bool addReadGroupID(const char* readGroupID);<br />
bool setSample(const char* keyID, const char* sample);<br />
bool setLibrary(const char* keyID, const char* library);<br />
bool setDescription(const char* keyID, const char* description);<br />
bool setPlatformUnit(const char* keyID, const char* platform);<br />
bool setPredictedMedianInsertSize(const char* keyID, const char* isize);<br />
bool setSequencingCenter(const char* keyID, const char* center);<br />
bool setRunDate(const char* keyID, const char* runDate);<br />
bool setTechnology(const char* keyID, const char* technology);<br />
bool addProgram(const char* programID);<br />
bool setProgramVersion(const char* keyID, const char* version);<br />
bool setCommandLine(const char* keyID, const char* commandLine);<br />
<br />
///////////////////////////////////<br />
// Get methods for header fields.<br />
// Returns the number of SQ entries in the header.<br />
int32_t getSequenceDictionaryCount();<br />
// Return the Sort Order value that is set in the Header.<br />
// If this field does not exist, "" is returned.<br />
const char* getSortOrder();<br />
/// Additional gets for the rest of the fields.<br />
</source><br />
Should these also be added to SamHeaderRG, SamHeaderSQ, etc., as appropriate?<br />
<br />
== Review June 7th ==<br />
<br />
* <S>Move the examples from the SamFile wiki page to their own page</s><br />
** <S>include links from the main library page and the SamFile page.</s><br />
** <S>look into why the one example has two if checks on SamIn status</s> <span style="color:blue">- one was printing the result and one was setting the return value - cleaned up to be in one if statement.</span><br />
* <S>Create 1 library for all of our library code rather than having libcsg, libbam, libfqf separated.</s><br />
** <S>What should this library be called?</s> <span style="color:blue">- Created library: libstatgen and reorganized into a new repository: statgen.</span><br />
*** <S>libdna</s><br />
*** <S>libdna++</s><br />
*** <S>libsequence++</s><br />
*** <S>libDNA</s><br />
*** <S>libgenotype</s><br />
* Add an option by class that says whether or not to abort on failure. (or even an option on each method)<br />
** This allows calling code to set that option and then not have to check for failures since the code it calls would abort on a failure.<br />
** Could/should this be achieved using exceptions? User can decide to catch them or let them terminate the program.<br />
*<S>SamFile add a constructor that takes the filename and a flag to indicate open for read/write. (abort on failure to open)</s><br />
** <S>Also have 2 subclasses one that opens for read, one for write: SamReadFile, SamWriteFile? Or SamFileRead, SamFileWrite?</s> <span style="color:blue">- went with SamFileReader and SamFileWriter</span><br />
* Add a function that says: skipInvalidRecords, validateRecords, etc.<br />
** That way, ReadRecord will keep reading records until a valid/parseable one is found.<br />
*SamFileHeader::setTag - instead of having separate ones for PG, RG, etc, have a generic one that takes as a parameter which one it is.<br />
** KeyID, then Value as parameters....(keyID first, then value)<br />
* SamFileHeader::setProgramName, etc...have specific methods for setting fields so users don't need to know the specific tags, etc. used for certain values in the header.<br />
** KeyID, then Value as parameters....(keyID first, then value)<br />
* BAM write utility could add a PG field with default settings (user could specify alternate settings) when it writes a file.<br />
* Future methods to add:<br />
** <S>SamFile::setReadSection(const std::string& refName) - take in the reference name by string since that is what most people will know.</s><br />
*** <S>"" would indicate the ones not associated with a reference.</s></div>Ppwhitehttp://genome.sph.umich.edu/w/index.php?title=C%2B%2B_Class:_CigarRoller&diff=14650C++ Class: CigarRoller2017-02-02T16:00:06Z<p>Ppwhite: /* Cigar */</p>
<hr />
<div>[[Category:C++]]<br />
[[Category:libStatGen]]<br />
[[Category:libStatGen general]]<br />
<br />
= Cigar =<br />
This class is part of [[libStatGen: general]].<br />
<br />
The purpose of this class is to provide utilities for processing CIGARs. It provides read-only operators that do not allow modification of the class, other than for lazy evaluation.<br />
<br />
See: http://csg.sph.umich.edu//mktrost/doxygen/current/classCigar.html for documentation.<br />
<br />
The static methods are helpful for determining information about the operator.<br />
<br />
See [[C++ Class: CigarRoller#Mapping Between Reference and Read/Query|Mapping Between Reference and Read/Query]] for a more detailed explanation with examples as to how the mapping between the read/query works.<br />
<br />
See [[C++ Class: CigarRoller#Determining the Number of Reference and Read/Query Overlaps|Determining the Number of Reference and Read/Query Overlaps]] for a more detailed explanation with examples as to how determining overlaps works.<br />
<br />
= CigarRoller =<br />
This class is part of [[libStatGen: general]].<br />
<br />
The purpose of this class is to provide accessors for setting, updating, and modifying the CIGAR object. It is a child class of Cigar.<br />
<br />
See: http://csg.sph.umich.edu//mktrost/doxygen/current/classCigarRoller.html for documentation.<br />
<br />
= Mapping Between Reference and Read/Query =<br />
<code>int32_t Cigar::getRefOffset(int32_t queryIndex)</code> and <code>int32_t Cigar::getQueryIndex(int32_t refOffset)</code> are used to map between the reference and the read.<br />
<br />
The queryIndex is the index in the read - from 0 to (read length - 1).<br />
The refOffset is the offset into the reference from the starting position of the read.<br />
<br />
For Example:<br />
Reference: ACTGAACCTTGGAAACTGCCGGGGACT<br />
Read: ACTGACTGAAACCATT<br />
CIGAR: 4M10N4M3I2M4D3M<br />
POS: 5<br />
<br />
This means it aligns:<br />
Reference: ACTGAACCTTGGAAACTG   CCGGGGACT<br />
Read:      ACTG          ACTGAAACC    ATT<br />
<br />
Adding the position:<br />
RefPos:       5  6  7  8  9 10 11 12 13 14 15 16 17 18 19 20 21 22           23 24 25 26 27 28 29 30 31<br />
Reference:    A  C  T  G  A  A  C  C  T  T  G  G  A  A  A  C  T  G            C  C  G  G  G  G  A  C  T<br />
Read:         A  C  T  G                              A  C  T  G  A  A  A  C  C              A  T  T<br />
<br />
Adding the offsets:<br />
RefPos:       5  6  7  8  9 10 11 12 13 14 15 16 17 18 19 20 21 22           23 24 25 26 27 28 29 30 31<br />
refOffset:    0  1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17           18 19 20 21 22 23 24 25 26<br />
Reference:    A  C  T  G  A  A  C  C  T  T  G  G  A  A  A  C  T  G            C  C  G  G  G  G  A  C  T<br />
Read:         A  C  T  G                              A  C  T  G  A  A  A  C  C              A  T  T<br />
queryIndex:   0  1  2  3                              4  5  6  7  8  9 10 11 12             13 14 15<br />
<br />
The results of a call to getRefOffset for each value passed in (where NA stands for INDEX_NA):<br />
queryIndex:   0  1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 (and any value over 16)<br />
Return:       0  1  2  3 14 15 16 17 NA NA NA 18 19 24 25 26 NA<br />
<br />
The results of a call to getQueryIndex for each value passed in (where NA stands for INDEX_NA):<br />
refOffset:    0  1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 (and any value over 27)<br />
Return:       0  1  2  3 NA NA NA NA NA NA NA NA NA NA  4  5  6  7 11 12 NA NA NA NA 13 14 15 NA<br />
<br />
The results of a call to getRefPosition passing in start position 5 (where NA stands for INDEX_NA):<br />
queryIndex:   0  1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 (and any value over 16)<br />
Return:       5  6  7  8 19 20 21 22 NA NA NA 23 24 29 30 31 NA<br />
<br />
The results of a call to getQueryIndex using refPosition and start position 5 (where NA stands for INDEX_NA):<br />
refPosition:  5  6  7  8  9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 (and any value over 32)<br />
Return:       0  1  2  3 NA NA NA NA NA NA NA NA NA NA  4  5  6  7 11 12 NA NA NA NA 13 14 15 NA<br />
<br />
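The mapping tables above can be reproduced with a short, self-contained sketch. This is not the libStatGen implementation, just an illustration of how the standard CIGAR operation semantics (M/=/X consume both query and reference, I/S consume the query only, D/N consume the reference only) yield the queryIndex-to-refOffset table for 4M10N4M3I2M4D3M; the helper names `queryToRefOffsets` and `refOffsetOf` are invented for this example:

```cpp
#include <cctype>
#include <string>
#include <vector>

// Sentinel mirroring Cigar::INDEX_NA: "no corresponding position".
const int INDEX_NA = -1;

// Build the queryIndex -> refOffset table for one CIGAR string using the
// standard operation semantics: M/=/X advance both query and reference,
// I/S advance the query only, D/N advance the reference only.
std::vector<int> queryToRefOffsets(const std::string& cigar)
{
    std::vector<int> offsets;
    int refOffset = 0;
    size_t i = 0;
    while (i < cigar.size())
    {
        int len = 0;
        while (i < cigar.size() && isdigit((unsigned char)cigar[i]))
            len = len * 10 + (cigar[i++] - '0');
        char op = cigar[i++];
        if (op == 'M' || op == '=' || op == 'X')
            for (int j = 0; j < len; ++j) offsets.push_back(refOffset++);
        else if (op == 'I' || op == 'S')
            for (int j = 0; j < len; ++j) offsets.push_back(INDEX_NA);
        else if (op == 'D' || op == 'N')
            refOffset += len;   // reference-only: no query entries
        // H and P consume neither side.
    }
    return offsets;
}

// Sketch of Cigar::getRefOffset(queryIndex): out-of-range -> INDEX_NA.
int refOffsetOf(const std::string& cigar, int queryIndex)
{
    std::vector<int> offsets = queryToRefOffsets(cigar);
    if (queryIndex < 0 || queryIndex >= (int)offsets.size())
        return INDEX_NA;
    return offsets[queryIndex];
}
```

For the example CIGAR this returns 0-3 for queryIndex 0-3, 14-17 for 4-7, INDEX_NA for the inserted bases 8-10, 18-19 for 11-12, and 24-26 for 13-15, matching the table above; getRefPosition is then just this value plus the read's start position (5 in the example).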
<br />
== Determining the Number of Reference and Read/Query Overlaps ==<br />
<br />
A useful concept is determining the number of bases that overlap between the reference and the read in a given region.<br />
<br />
To do this, use <code>getNumOverlaps</code>, passing in the reference start and end positions for the region as well as the reference position where the read begins. start is inclusive, while end is exclusive.<br />
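As a sketch of the semantics (not the actual libStatGen code), getNumOverlaps can be thought of as counting the match/mismatch ('M') positions whose reference position falls inside [start, end), with -1 meaning unbounded on that side:

```cpp
#include <cctype>
#include <string>

// Count read bases aligned to the reference (CIGAR 'M', '=', 'X') whose
// reference position lies in [start, end), for a read whose first aligned
// base sits at reference position readStartPos. As described above,
// start is inclusive, end is exclusive, and -1 means "unbounded".
int getNumOverlaps(const std::string& cigar, int start, int end,
                   int readStartPos)
{
    int refPos = readStartPos;
    int overlaps = 0;
    size_t i = 0;
    while (i < cigar.size())
    {
        int len = 0;
        while (i < cigar.size() && isdigit((unsigned char)cigar[i]))
            len = len * 10 + (cigar[i++] - '0');
        char op = cigar[i++];
        if (op == 'M' || op == '=' || op == 'X')
        {
            for (int j = 0; j < len; ++j, ++refPos)
                if ((start == -1 || refPos >= start) &&
                    (end == -1 || refPos < end))
                    ++overlaps;
        }
        else if (op == 'D' || op == 'N')
        {
            refPos += len;  // reference-only operations move refPos
        }
        // I/S/H/P do not consume reference positions.
    }
    return overlaps;
}
```

With the example read below (CIGAR 4M10N4M3I2M4D3M, start position 5), this sketch reproduces the listed return values.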
<br />
Using the above example:<br />
RefPos:       5  6  7  8  9 10 11 12 13 14 15 16 17 18 19 20 21 22           23 24 25 26 27 28 29 30 31<br />
refOffset:    0  1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17           18 19 20 21 22 23 24 25 26<br />
Reference:    A  C  T  G  A  A  C  C  T  T  G  G  A  A  A  C  T  G            C  C  G  G  G  G  A  C  T<br />
Read:         A  C  T  G                              A  C  T  G  A  A  A  C  C              A  T  T<br />
queryIndex:   0  1  2  3                              4  5  6  7  8  9 10 11 12             13 14 15<br />
<br />
getNumOverlaps(5,32,5) = 13 - [5, 32) covers the whole read - 13 cigar positions are "M" (found in both the reference and the read)<br />
getNumOverlaps(5,31,5) = 12 - skips the last overlapping position<br />
getNumOverlaps(0,100,5) = 13 - covers the whole read.<br />
getNumOverlaps(-1, -1,5) = 13 - covers the whole read.<br />
getNumOverlaps(-1,10,5) = 4<br />
getNumOverlaps(10,-1,5) = 9<br />
getNumOverlaps(9,19,5) = 0 - all skipped<br />
getNumOverlaps(9,20,5) = 1<br />
getNumOverlaps(9,6,5) = 0 - start is after end<br />
getNumOverlaps(0,5,5) = 0 - outside of read<br />
getNumOverlaps(32,40,5) = 0 - outside of read<br />
getNumOverlaps(0,5,1) = 4 - with a different start position, this range overlaps the read with 4 bases<br />
getNumOverlaps(32,40,32) = 4 - with a different start position, this range overlaps the read with 4 bases</div>Ppwhitehttp://genome.sph.umich.edu/w/index.php?title=VcfCodingSnps&diff=14649VcfCodingSnps2017-02-02T15:58:32Z<p>Ppwhite: </p>
<hr />
<div>'''vcfCodingSnps'''[http://csg.sph.umich.edu//liyanmin/vcfCodingSnps/index.shtml] is a SNP annotation tool that annotates coding variants in a [[VCF]] format input file. It takes a VCF as input and generates an annotated VCF file as output. The tool is currently under development by Yanming Li, a doctoral student at the University of Michigan Center for Statistical Genetics. For any issues with the program, please contact [mailto:liyanmin@umich.edu Yanming]. A detailed tutorial and download page can be found at [http://csg.sph.umich.edu//liyanmin/vcfCodingSnps/index.shtml] <br />
<br />
== Basic Usage Example ==<br />
<br />
Here is an example of how <code>vcfCodingSnps</code> works: <br />
<br />
vcfCodingSnps -s chrom22-CHB.vcf -g genelist.txt -o annotated-chrom22-CHB.vcf<br />
<br />
== Command Line Options ==<br />
<br />
-s SNP file             Specifies the name of the input VCF-format SNP file<br />
-g gene file            Specifies the name of the input gene file; by default uses an NCBI36 gene list file (genelist.txt) in UCSC known gene format generated by the UCSC genome browser<br />
-o output file          Specifies the name of the output VCF-format SNP file; by default named vcfCodingSNP.out.vcf<br />
-r reference genome     Specifies the name of the reference genome file; by default uses the NCBI build 36 reference genome<br />
-l log file             Specifies the name of the log file, which gives more detailed information for each annotated SNP; by default named vcfCodingSNP.log<br />
--n1 parameter          User-defined number of bps into the intron for a splice site; by default set to 8<br />
--n2 parameter          User-defined number of bps into the exon for a splice site; by default set to 3<br />
--ns parameter          User-defined number of kbps for the range of the upstream or downstream of a gene; by default set to 5<br />
<br />
== Library Compiling Guideline ==<br />
<br />
To compile the source code, please first re-compile the .c functions in the library folder on your local machine:<br />
1. Go into the folder "libcsg". Type "gcc -c -O2 *.cpp -D_FILE_OFFSET_BITS=32" to re-compile the library files if you use a 32-bit machine<br />
(type "gcc -c -O2 *.cpp -D_FILE_OFFSET_BITS=64" to re-compile the library files if you use a 64-bit machine)<br />
2. In the same folder, type "ar -rc libcsg.a *.o"<br />
3. Go to the root folder and type "make clean" and then "make". (Nothing in the Makefile needs to be changed in this step)<br />
<br />
== Input File Information ==<br />
<br />
1. Example header lines of an input VCF-format SNP file: <br />
<br />
##format=VCFv3.2<br />
##NA12891=../depthFilter/filtered.NA12891.chrom22.SLX.maq.SRP000032.2009_07.glf<br />
##NA12892=../depthFilter/filtered.NA12892.chrom22.SLX.maq.SRP000032.2009_07.glf<br />
##NA12878=../merged/NA12878.chrom22.merged.glf<br />
##minTotalDepth=0<br />
##maxTotalDepth=1000<br />
##minMapQuality=30<br />
##minPosterior=0.9990<br />
##program=glfTrio<br />
##versionDate=Tue Dec 1 00:42:24 2009<br />
#CHROM POS ID REF ALT QUAL FILTER INFO FORMAT NA12891 NA12892 NA12878<br />
22 14439753 . a t 100 mapQ=0 depth=68;duples=homs;mac=2 GT:GQ:DP 1|1:100:40 0|0:81:28 1|0:84:0<br />
22 14441250 . t c 59 mapQ=0 depth=40 GT:GQ:DP 1|1:56:25 1|1:31:15 1|1:32:0<br />
22 14443154 . t g 45 mapQ=9 depth=92;duples=homs;mac=2 GT:GQ:DP 1|1:49:21 0|0:60:20 1|0:100:51<br />
... ...<br />
<br />
2. The gene list and the reference genome provided by the user can come from various gene tracks and assemblies. The latest version takes gene list tracks such as UCSC known genes, RefSeq genes, Gencode genes, CCDS genes and Ensembl genes, and the assembly of the gene list and the reference genome can be hg16, hg17, hg18 or hg19. One can explore the UCSC genome browser for a better understanding of the different tracks and assemblies. By default vcfCodingSnps uses an hg18 UCSC known gene list and the hg18 reference genome. It also provides versions for other tracks and assemblies for the user's convenience, so they don't need to download those themselves. The input gene file should be a plain text file generated by the [http://genome.ucsc.edu/ UCSC genome browser]. A sample path for generating an input gene file is <br />
<br />
Go to http://genome.ucsc.edu/ ►► Click "Tables" ►► Specify the required fields (clade: Mammal, genome: Human, etc.) ►► In the "track" field, select "UCSC Genes" ►► get the output gene file<br />
<br />
1. The gene file should be in [http://genome.ucsc.edu/FAQ/FAQformat#format9 GenePred table format]. The following 11 tab-delimited fields are required and must appear in the order shown below:<br />
string name; "Name of gene"<br />
string chrom; "Chromosome name"<br />
char[1] strand; "+ or - for strand"<br />
uint txStart; "Transcription start position"<br />
uint txEnd; "Transcription end position"<br />
uint cdsStart; "Coding region start"<br />
uint cdsEnd; "Coding region end"<br />
uint exonCount; "Number of exons"<br />
uint[exonCount] exonStarts; "Exon start positions"<br />
uint[exonCount] exonEnds; "Exon end positions"<br />
string symbol; "Standard gene symbol"<br />
<br />
Note: the 11th field is a mandatory field for running vcfCodingSnps. In the gene lists provided with the package, this field gives the standard gene symbols such as "APOE", "LDL-R", etc. <br />
If a gene list you downloaded on your own does not contain such a field, you can simply make the 11th field equal to the first field (the gene name in a specific track) with a command like<br />
<br />
awk -F'\t' '{ print $0 "\t" $1 }' yourGenelist &gt; yourNewGenelist<br />
<br />
2. If the gene file is in the [http://genome.ucsc.edu/FAQ/FAQformat#format9 extended GenePred format], there will be an extra "exonframe" field. Please refer to [https://lists.soe.ucsc.edu/pipermail/genome/2006-November/012218.html here] for the definition of "exonframe". For some genes, due to translational frame shifts or other <br />
reasons, the exon frame might not match what one would compute by counting codons mod 3. In such cases, the program will report a warning message that the "number of base pairs between code start and code end is<br />
not a multiple of three", and will still use the usual mod-3 method for counting codons.<br />
3. Detailed instructions on using the Table Browser can be found at [http://genome.ucsc.edu/cgi-bin/hgTables?command=start#Help genome.ucsc.edu/cgi-bin/hgTables].<br />
4. One can specify the region to be the whole genome or any particular gene position (e.g. chr21:33031597-33041570).<br />
<br />
Here is an example of the first lines of an input gene file: <br />
<br />
#name chrom strand txStart txEnd cdsStart cdsEnd exonCount exonStarts exonEnds proteinID alignID<br />
uc001aaa.3 chr1 + 11873 14409 11873 11873 3 11873,12612,13220, 12227,12721,14409, uc001aaa.3<br />
uc010nxq.1 chr1 + 11873 14409 12189 13639 3 11873,12594,13402, 12227,12721,14409, B7ZGX9 uc010nxq.1<br />
uc010nxr.1 chr1 + 11873 14409 11873 11873 3 11873,12645,13220, 12227,12697,14409, uc010nxr.1<br />
uc009vis.2 chr1 - 14362 16765 14362 14362 4 14362,14969,15795,16606, 14829,15038,15942,16765, uc009vis.2<br />
uc009vjc.1 chr1 - 16857 17751 16857 16857 2 16857,17232, 17055,17751, uc009vjc.1<br />
uc009vjd.2 chr1 - 15795 18061 15795 15795 5 15795,16606,16857,17232,17605, 15947,16765,17055,17368,18061, uc009vjd.2<br />
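As an illustration of the 11-field layout described above, a gene line such as the sample lines can be parsed as follows. This is a hypothetical helper, not part of vcfCodingSnps, and for brevity it splits on any whitespace rather than strict tabs:

```cpp
#include <cstdlib>
#include <sstream>
#include <string>
#include <vector>

// Hypothetical container for the 11 required GenePred fields listed above.
struct GenePredRecord {
    std::string name, chrom, strand, symbol;
    long txStart, txEnd, cdsStart, cdsEnd;
    int exonCount;
    std::vector<long> exonStarts, exonEnds;
};

// UCSC exon lists are comma-separated and comma-terminated ("11873,12612,").
static std::vector<long> splitPositions(const std::string& s)
{
    std::vector<long> v;
    std::istringstream in(s);
    std::string item;
    while (std::getline(in, item, ','))
        if (!item.empty()) v.push_back(atol(item.c_str()));
    return v;
}

// Parse one gene line; returns false if fields are missing or inconsistent.
bool parseGenePred(const std::string& line, GenePredRecord& g)
{
    std::istringstream in(line);
    std::string starts, ends;
    if (!(in >> g.name >> g.chrom >> g.strand >> g.txStart >> g.txEnd
             >> g.cdsStart >> g.cdsEnd >> g.exonCount >> starts >> ends
             >> g.symbol))
        return false;
    g.exonStarts = splitPositions(starts);
    g.exonEnds = splitPositions(ends);
    return (int)g.exonStarts.size() == g.exonCount &&
           (int)g.exonEnds.size() == g.exonCount;
}
```

Note that the 11th token read here is the symbol field the tool requires; rows carrying extra columns (such as alignID in the sample above) would need those trailing fields skipped explicitly.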
<br />
== Output File ==<br />
<br />
Some possible annotating results for a single SNP with the meanings of their output format are listed below: <br />
<br />
5'UTR=A26C2[-] means the SNP is in the 5'UTR region of gene A26C2 with a minus strand.<br />
INTRONIC=POTEG[-] means the SNP is in the intronic region of gene POTEG with a minus strand.<br />
SYNONYMOUS_CODING=BARD1(uc002veu.2):His506His[-] means the SNP is synonymous coding at the 506th codon in gene BARD1 with a minus strand and it keeps amino-acid His unchanged.<br />
NON_SYNONYMOUS_CODING=BARD1(uc002veu.2):Arg658Cys[-] means the SNP is non-synonymous coding at the 658th codon in gene BARD1 (ucsc gene name uc002veu.2) with a minus strand and it changes the amino acid from Arg to Cys.<br />
SPLICE_SITE=FARP2(uc002wbi.1)[+] means the SNP is in the SPLICE_SITE (5 bp within exon start or end positions in the coding region) of gene FARP2 (ucsc gene name uc002wbi.1) with a plus strand.<br />
STOP_GAINED=C2orf83(uc002vph.1):Trp141stop[-] means the SNP is at the 141st codon in gene C2orf83 (ucsc gene name uc002vph.1) with a minus strand and it changes amino-acid Trp to a stop codon.<br />
STOP_LOST=OR2M3(uc001ieb.1):stop313Arg[+] means the SNP is at the 313th codon in gene OR2M3 (ucsc gene name uc001ieb.1) with a plus strand and it changes a stop codon to amino-acid Arg.<br />
<br />
The annotation result will be added to the "INFO" entry of the input VCF SNP file and output together with the other information. If a SNP is annotated differently with respect to different genes (or different isoforms of the same gene), all the annotation results will be added to the "INFO" entry. If the SNP is NOT in any gene coding region, the original "INFO" will be output unchanged. Here are example header lines of input and output VCF files: <br />
<br />
Input VCF header lines: <br />
<br />
##format=VCFv3.2<br />
##NA12891=../GLF/NA12891.chrom8.SLX.SRP000032.2009_07.glf<br />
##NA12892=../GLF/NA12892.chrom8.SLX.SRP000032.2009_07.glf<br />
##NA12878=../merged/NA12878.chrom8.merged.glf<br />
##minTotalDepth=0<br />
##maxTotalDepth=1000<br />
##minMapQuality=40<br />
##minPosterior=0.9990<br />
##program=glfTrio<br />
##versionDate=Thu Aug 27 18:23:18 2009<br />
#CHROM POS ID REF ALT QUAL FILTER INFO FORMAT NA12891 NA12892 NA12878<br />
8 146284 . c a 54 . depth=29;duples=hets;mac=2;tdt=0/2 GT:GQ:GD 1/0:31:12 1/0:32:3 0/0:28:14<br />
8 146703 . c t 92 . depth=41;mac=1;tdt=0/1 GT:GQ:GD 1/1:42:14 0/1:54:9 1/1:24:18<br />
8 151532 . t c 100 . depth=131 GT:GQ:GD 0/0:8:37 1/0:100:26 1/0:100:68<br />
8 151573 . g t 72 . depth=113;mac=1;tdt=1/1 GT:GQ:GD 0/1:48:35 0/0:39:26 0/1:100:52<br />
8 151638 . a c 100 . depth=124;duples=hets;mac=2;tdt=1/2 GT:GQ:GD 0/1:100:55 0/1:100:58 0/1:87:11<br />
8 151651 . c g 100 . depth=124;duples=hets;mac=2;tdt=1/2 GT:GQ:GD 0/1:87:56 0/1:100:56 0/1:24:12<br />
8 151763 . t a 100 . depth=127;duples=hets;mac=2;tdt=1/2 GT:GQ:GD 1/0:100:49 1/0:100:54 1/0:100:24<br />
8 151936 . a g 32 . depth=105;duples=hets;mac=2;tdt=0/2 GT:GQ:GD 0/1:42:44 0/1:23:47 0/0:39:14<br />
8 152578 . c t 87 . depth=108 GT:GQ:GD 1/1:95:31 1/1:89:30 1/1:100:47<br />
<br />
Output VCF header lines: <br />
<br />
##format=VCFv3.2 <br />
##NA12891=../GLF/NA12891.chrom8.SLX.SRP000032.2009_07.glf <br />
##NA12892=../GLF/NA12892.chrom8.SLX.SRP000032.2009_07.glf <br />
##NA12878=../merged/NA12878.chrom8.merged.glf <br />
##minTotalDepth=0 <br />
##maxTotalDepth=1000 <br />
##minMapQuality=40 <br />
##minPosterior=0.9990 <br />
##program=glfTrio <br />
##versionDate=Thu Aug 27 18:23:18 2009 <br />
#CHROM POS ID REF ALT QUAL FILTER INFO FORMAT NA12891 NA12892 NA12878<br />
8 146284 . c a 54 . depth=29;duples=hets;mac=2;tdt=0/2 GT:GQ:GD 1/0:31:12 1/0:32:3 0/0:28:14 <br />
8 146703 . c t 92 . depth=41;mac=1;tdt=0/1 GT:GQ:GD 1/1:42:14 0/1:54:9 1/1:24:18 <br />
8 151532 . t c 100 . depth=131;5'UTR=RPL23A_20_869(uc010lra.1)[-];5'UTR=RPL23A_20_869(uc003woq.2)[-];5'UTR=RPL23A_20_869(uc010lrb.1)[-] GT:GQ:GD 0/0:8:37 1/0:100:26 1/0:100:68 <br />
8 151573 . g t 72 . depth=113;mac=1;tdt=1/1;5'UTR=RPL23A_20_869(uc010lra.1)[-];5'UTR=RPL23A_20_869(uc003woq.2)[-];5'UTR=RPL23A_20_869(uc010lrb.1)[-] GT:GQ:GD 0/1:48:35 0/0:39:26 0/1:100:52 <br />
8 151638 . a c 100 . depth=124;duples=hets;mac=2;tdt=1/2;5'UTR=RPL23A_20_869(uc010lra.1)[-];5'UTR=RPL23A_20_869(uc003woq.2)[-];5'UTR=RPL23A_20_869(uc010lrb.1)[-] GT:GQ:GD 0/1:100:55 0/1:100:58 0/1:87:11 <br />
8 151651 . c g 100 . depth=124;duples=hets;mac=2;tdt=1/2;5'UTR=RPL23A_20_869(uc010lra.1)[-];5'UTR=RPL23A_20_869(uc003woq.2)[-];5'UTR=RPL23A_20_869(uc010lrb.1)[-] GT:GQ:GD 0/1:87:56 0/1:100:56 0/1:24:12 <br />
8 151763 . t a 100 . depth=127;duples=hets;mac=2;tdt=1/2;5'UTR=RPL23A_20_869(uc010lra.1)[-];5'UTR=RPL23A_20_869(uc003woq.2)[-];5'UTR=RPL23A_20_869(uc010lrb.1)[-] GT:GQ:GD 1/0:100:49 1/0:100:54 1/0:100:24 <br />
8 151936 . a g 32 . depth=105;duples=hets;mac=2;tdt=0/2;5'UTR=RPL23A_20_869(uc010lra.1)[-];5'UTR=RPL23A_20_869(uc003woq.2)[-];5'UTR=RPL23A_20_869(uc010lrb.1)[-] GT:GQ:GD 0/1:42:44 0/1:23:47 0/0:39:14 <br />
8 152578 . c t 87 . depth=108;5'UTR=RPL23A_20_869(uc010lra.1)[-];5'UTR=RPL23A_20_869(uc003woq.2)[-];5'UTR=RPL23A_20_869(uc010lrb.1)[-] GT:GQ:GD 1/1:95:31 1/1:89:30 1/1:100:47<br />
<br />
Output log file header lines: <br />
<br />
##chr pos ref alt ucsc_name genestrend genestart geneend ref_codon ref_AA alt_codon alt_AA codon_start codon_end genesymbol codonCount type<br />
chr2 214811129 T c uc010fuz.1 + 213857360 214814327 CTA Leu CCA Pro 214811128 214811130 SPAG16 433 NON_SYNONYMOUS_CODING<br />
chr2 214811129 T c uc002veq.1 + 213857360 214983470 . . . . . . SPAG16 . INTRONIC<br />
chr2 214811129 T c uc002ver.1 + 213857360 214983470 . . . . . . SPAG16 . INTRONIC<br />
chr2 214811174 T a uc010fuz.1 + 213857360 214814327 . . . . . . SPAG16 . 3'UTR<br />
chr2 214811174 T a uc002veq.1 + 213857360 214983470 . . . . . . SPAG16 . INTRONIC</div>Ppwhitehttp://genome.sph.umich.edu/w/index.php?title=TabAnno&diff=14648TabAnno2017-02-02T15:57:26Z<p>Ppwhite: /* Contact */</p>
<hr />
<div>= Introduction =<br />
<br />
TabAnno is short for "annotation"; it is used to annotate variants. Our goal is to provide abundant information for genetic variants promptly. For example, annotations to all transcripts of a gene are provided instead of just one single annotation. TabAnno supports various file formats: VCF files, plain files, PLINK association output files and [http://genome.sph.umich.edu/wiki/EPACTS epacts] files.<br />
<br />
= Quick tutorial =<br />
<br />
You have an input file in VCF format, and your goal is to annotate it using the refFlat gene database.<br />
You just need the following command:<br />
<br />
anno -i input.vcf -o output.vcf -r hs37d5.fa -g refFlat_hg19.txt.gz -p priority.txt -c codon.txt<br />
<br />
Required files:<br />
<br />
* [http://csg.sph.umich.edu//zhanxw/software/anno/resources/hs37d5.fa hs37d5.fa] Reference genome in NCBI build 37 (you also need to download [http://csg.sph.umich.edu//zhanxw/software/anno/resources/hs37d5.fa.fai hs37d5.fa.fai])<br />
<br />
* [http://csg.sph.umich.edu//zhanxw/software/anno/resources/refFlat_hg19.txt.gz refFlat_hg19.txt.gz] Gene database in refFlat format (from the UCSC website). You can also use [http://csg.sph.umich.edu//zhanxw/software/anno/resources/refFlat.gencode.v7.gz Gencode version 7] or [http://csg.sph.umich.edu//zhanxw/software/anno/resources/refFlat.gencode.v11.gz Gencode version 11].<br />
<br />
* [http://csg.sph.umich.edu//zhanxw/software/anno/codon.txt codon.txt] Human codon table.<br />
<br />
* [http://csg.sph.umich.edu//zhanxw/software/anno/priority.txt priority.txt] Priority file, which determines which annotation type is more important<br />
<br />
Outputs:<br />
<br />
The annotated VCF will be named ''output.vcf''. Annotations come in two tags: ANNO and ANNOFULL. The ANNO tag shows the most important annotation types (determined by the priority file); the ANNOFULL tag shows the detailed annotations. Let's look at one example annotation:<br />
ANNO=Nonsynonymous:GENE1|GENE3;ANNOFULL=GENE1/CODING_GENE:+:Nonsynonymous(CCT/Pro/P->CAT/His/H:Base3/30:Codon1/10:Exon1/5):Normal_Splice_Site:Exon|GENE3/CODING_GENE:-:Nonsynonymous(AGG/Arg/R->ATG/Met/M:Base30/30:Codon10/10:Exon5/5):Normal_Splice_Site:Exon|GENE2/NON_CODING_GENE:+:Upstream<br />
<br />
The ANNO tag shows that the most important variant type is ''Nonsynonymous'', occurring in GENE1 and GENE3;<br />
the ANNOFULL tag gives the full set of annotations, one part per gene separated by "|". The first part is for GENE1:<br />
<br />
GENE1/CODING_GENE:+:Nonsynonymous(CCT/Pro/P->CAT/His/H:Base3/30:Codon1/10:Exon1/5):Normal_Splice_Site:Exon<br />
<br />
The format can be explained section by section; we use the annotation for GENE1 as an example:<br />
<br />
* GENE1 : gene name<br />
* CODING_GENE : gene type (a coding gene; non-coding genes are marked NON_CODING_GENE)<br />
* ''+'' : forward strand<br />
* Nonsynonymous, Normal_Splice_Site, Exon : various annotation types<br />
* CCT/Pro/P->CAT/His/H : Proline to Histidine mutation<br />
* Base3/30 : the mutation is at the 3rd base of the 30 total bases<br />
* Codon1/10 : the mutation is at the 1st codon of the 10 total codons<br />
* Exon1/5 : the mutation is in the 1st exon of the 5 total exons<br />
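The Synonymous/Nonsynonymous decision itself reduces to comparing the reference and alternate amino acids. A hypothetical sketch follows (the exact label spellings for stop gain/loss below are assumptions for illustration, not taken from the anno source):

```cpp
#include <string>

// Classify a coding change from reference and alternate amino acids
// (3-letter codes, with "Stp" marking a stop codon as in codon.txt).
// Synonymous/Nonsynonymous match the ANNO output shown above;
// the stop labels here are illustrative only.
std::string classifyCodingChange(const std::string& refAA,
                                 const std::string& altAA)
{
    if (refAA == altAA) return "Synonymous";
    if (altAA == "Stp") return "Stop_Gain";   // assumed label
    if (refAA == "Stp") return "Stop_Loss";   // assumed label
    return "Nonsynonymous";
}
```

For the GENE1 example above, Pro -> His classifies as Nonsynonymous.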
<br />
= Where to Find It =<br />
<br />
TabAnno's code is hosted online at [https://github.com/zhanxw/anno anno]. You can download the source and compile it (type 'make release').<br />
<br />
For CSG internal users, the compiled executable file is at: /net/fantasia/home/zhanxw/anno/executable/anno<br />
<br />
The source code is located at: /net/fantasia/home/zhanxw/anno<br />
You can type 'make release' to compile your own executable file.<br />
Typing "make test1" or "make test2" will demonstrate the command lines to annotate example VCF files.<br />
<br />
= Usage =<br />
<br />
== Command line ==<br />
After you obtain the anno executable (either by compiling the source code or by downloading the pre-compiled binary), you will find it under executable/anno. <br />
<br />
Here is the anno help page by invoking anno without any command line arguments:<br />
<br />
some_linux_host > executable/anno<br />
.............................................. <br />
... G(ene) A(nnotation) ... <br />
... Xiaowei Zhan, Goncalo Abecasis ... <br />
... zhanxw@umich.edu ... <br />
... Dec 2012 ... <br />
................................................<br />
<br />
Required Parameters<br />
-i : Specify input VCF file<br />
-o : Specify output VCF file<br />
Gene Annotation Parameters<br />
-g : Specify gene file<br />
-r : Specify reference genome position<br />
--inputFormat : Specify format (default: vcf). "-f plain " will use fir<br />
st 4 columns as chrom, pos, ref, alt<br />
--checkReference : Check whether reference alleles matches genome reference<br />
-f : Specify gene file format (default: refFlat, other optio<br />
ns: knownGene, refGene)<br />
-p : Specify priority of annotations<br />
-c : Specify codon file (default: codon.txt)<br />
-u : Specify upstream range (default: 50)<br />
-d : Specify downstream range (default: 50)<br />
--se : Specify splice into extron range (default: 3)<br />
--si : Specify splice into intron range (default: 8)<br />
--outputFormat : Specify predefined annotation words (default or epact)<br />
Other Annotation Tools<br />
--genomeScore : Specify the folder of genome score (e.g. GERP=dirGerp/,<br />
SIFT=dirSift/)<br />
--bed : Specify the bed file and tag (e.g. ONTARGET1=a1.bed,ONT<br />
ARGET2=a2.bed)<br />
--tabix : Specify the tabix file and tag (e.g. abc.txt.gz(chrom=1<br />
,pos=7,ref=3,alt=4,SIFT=7,PolyPhen=10)<br />
Please specify input file<br />
<br />
== Input files ==<br />
<br />
TabAnno runs on the input VCF file specified on the command-line using flag '-i'.<br />
<br />
Additionally, you need to specify gene file using flag '-g'. You can use the default refFlat file (using HG19 genome build): /net/fantasia/home/zhanxw/anno/refFlat_hg19.txt.gz <br />
<br />
== Parameters ==<br />
<br />
Some of the command line parameters are described here, but most are self-explanatory.<br />
<br />
*Reference genome file<br />
<br />
-r Specify a FASTA format reference genome file. <br />
<br />
Specifying the ''-r'' option enables TabAnno to give more detailed information; for example, instead of annotating a variant as exonic, it will tell you which codon and which exon the variant is located in, whether it is synonymous/non-synonymous, etc.<br />
<br />
TabAnno requires a FASTA index file, which saves memory and speeds up annotation. You can use "samtools faidx XXX.fa" to generate the FASTA index file.<br />
<br />
For example, you can specify Fasta file of the whole genome and use "-r /data/local/ref/karma.ref/human.g1k.v37.fa"<br />
<br />
*Gene file format<br />
<br />
Currently, TabAnno supports gene files in refFlat format. A prepared list of all genes obtained from the UCSC website is: "/net/fantasia/home/zhanxw/anno/refFlat_hg19.txt.gz". <br />
To use that file, use the flag ''-g''. <br />
<br />
Since TabAnno supports the refFlat format by default, you can use refFlat files without specifying the gene format flag ''-f''.<br />
<br />
To use the knownGene or refGene format, you need to specify both the ''-g'' and ''-f'' flags to tell TabAnno which gene file to use and which format it is in.<br />
For example, ''-g /net/fantasia/home/zhanxw/anno/knownGene.txt.gz -f knownGene''.<br />
<br />
*Codon file<br />
<br />
The codon file gives the relationship between triplet codons and amino acids. A default file is located at: ''/net/fantasia/home/zhanxw/anno/codon.txt''.<br />
If you have a special codon file, you can specify it using the flag ''-c''; otherwise, TabAnno will use the default codon file:<br />
<br />
''default codon file''<br />
# DNA codon table. Stop codon is coded as: Stp, O, Stop<br />
#Codon AminoAcid Letter FullName<br />
AAA Lys K Lysine<br />
AAC Asn N Asparagine<br />
AAG Lys K Lysine<br />
AAT Asn N Asparagine<br />
ACA Thr T Threonine<br />
...<br />
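A minimal sketch of reading this format (an illustration, not the anno implementation): skip '#' comment lines and map the first column to the second.

```cpp
#include <map>
#include <sstream>
#include <string>

// Parse codon-table text in the format shown above ("AAA Lys K Lysine",
// '#' for comments) into a codon -> 3-letter amino-acid map.
std::map<std::string, std::string> parseCodonTable(const std::string& text)
{
    std::map<std::string, std::string> table;
    std::istringstream in(text);
    std::string line;
    while (std::getline(in, line))
    {
        if (line.empty() || line[0] == '#')
            continue;                      // skip comments and blanks
        std::istringstream fields(line);
        std::string codon, aminoAcid;
        if (fields >> codon >> aminoAcid)
            table[codon] = aminoAcid;      // e.g. "AAA" -> "Lys"
    }
    return table;
}
```

The letter and full-name columns are ignored here; a real reader would keep all four fields.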
*Annotation ranges<br />
-u how far from the 5' end of a gene is counted as upstream<br />
-d how far from the 3' end of a gene is counted as downstream<br />
<br />
The ''-u'' and ''-d'' parameters define the range of upstream and downstream.<br />
<br />
--se how many bases into the exon are counted as splice sites<br />
--si how many bases into the intron are counted as splice sites<br />
<br />
The ''--se'' and ''--si'' options define the splice regions. By default, 3 bases into the exon and 8 bases into the intron are defined as splice sites. If a mutation falls in these regions, we annotate it as "Normal_Splice_Site", unless the mutation falls in the traditionally important "GU...AG" region of the intron, in which case we use "Essential_Splice_Site" as the annotation.<br />
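One way to read that rule is the sketch below, an illustration using the stated defaults rather than the anno source; positions and exon bounds are assumed 0-based and half-open:

```cpp
// True if pos falls within the splice region around either exon boundary,
// using the defaults described above: se bases into the exon and si bases
// into the intron. exonStart/exonEnd are 0-based half-open (assumption).
bool inSpliceRegion(int pos, int exonStart, int exonEnd,
                    int se = 3, int si = 8)
{
    // Window around the exon start boundary: [exonStart - si, exonStart + se)
    if (pos >= exonStart - si && pos < exonStart + se)
        return true;
    // Window around the exon end boundary: [exonEnd - se, exonEnd + si)
    if (pos >= exonEnd - se && pos < exonEnd + si)
        return true;
    return false;
}
```

Distinguishing "Essential_Splice_Site" would additionally require checking the two intronic bases adjacent to each boundary (the "GU...AG" dinucleotides), which needs the reference sequence and is omitted here.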
<br />
* Annotate by range<br />
<br />
--bed : Specify the bed file and tag (e.g. ONTARGET1=a1.bed,ONTARGET2=a2.bed)<br />
<br />
BED files are commonly used to represent genomic ranges. TabAnno checks whether each genome position is contained in one or more of the ranges specified in the BED file.<br />
An example BED file, example.bed, is as follows:<br />
1 10 20 a<br />
1 15 40 b<br />
<br />
There are two ranges: range a is from chromosome 1 position 10 (inclusive) to chromosome 1 position 20 (exclusive); range b is from chromosome 1 position 15 (inclusive) to chromosome 1 position 40 (exclusive).<br />
<br />
When using ''--bed ONTARGET=example.bed'' as a parameter, the output may look like this: ''ONTARGET=a,b'' in the VCF INFO field.<br />
<br />
* Annotate by genome score<br />
<br />
--genomeScore : Specify the folder of genome score (e.g. GERP=dirGerp/,SIFT=dirSift/)<br />
<br />
Genome score is a special binary file format (designed by Hyun) storing per-position genome scores, such as GERP or SIFT scores.<br />
You will need to pre-process those scores into a directory. <br />
To annotate genome scores, use ''--genomeScore TAG_NAME=DIRECTORY''. In the VCF output file, you will have ''TAG_NAME=some_value'' in the INFO field. In other output formats, you will see a separate column.<br />
<br />
* Annotate by tabix input<br />
<br />
--tabix : Specify the tabix file and tag (e.g. abc.txt.gz(chrom=1,pos=7,ref=3,alt=4,SIFT=7,PolyPhen=10)<br />
<br />
Tabix files can be used for annotation. The file must be bgzipped and tabix-indexed. In the example above, abc.txt.gz is the input, and columns 1, 7, 3, and 4 are used as chromosome, position, reference allele, and alternative allele. Column 7 is also used as the SIFT score and column 10 as the PolyPhen score. <br />
With syntax similar to the previous options, you will get output similar to 'SIFT=0.110;PolyPhen=0.00' in the VCF INFO field.<br />
NOTE: in bash, please put quotes around the parentheses; otherwise they are not treated correctly. A working example is as follows:<br />
--tabix '/net/fantasia/home/hmkang/bin/annovar/humandb/hg19_ljb_all.txt.gz(chrom=1,pos=2,ref=4,alt=5,mySift=6,mySC=7,myPP2=8,myPC=9)' <br />
<br />
<br />
= Example =<br />
<br />
TabAnno can annotate a VCF file and also output statistics in four frequency tables: annotation type, base change, codon change, and indel size. More details are given below.<br />
<br />
== Built-in example ==<br />
<br />
In the example/ folder, you can find test.vcf, a toy example. You can invoke anno using the following command line:<br />
<br />
cd example; ./anno -i test.vcf -r test.fa -g test.gene.txt -c ../codon.txt -o test.out.vcf<br />
<br />
Sample outputs are listed below:<br />
<br />
1) Annotated VCF file, ''test.out.vcf'' <br />
<br />
#VCF_test<br />
#from http://csg.sph.umich.edu//liyanmin/vcfCodingSnps/Tutorial.shtml<br />
#CHROM POS ID REF ALT QUAL FILTER INFO FORMAT NA12891 NA12892 NA12878<br />
1 3 . A G 50 depth=20 ANNO=GENE1/CODING_GENE:+:Exon:Utr5:Normal_Splice_Site|GENE3/CODING_GENE:-:Exon:Utr3:Normal_Splice_Site|GENE2/NON_CODING_GENE:+:Upstream GT:GQ:GD 1/0:31:12 0/0:28:14<br />
1 5 . C A 50 depth=20 ANNO=GENE1/CODING_GENE:+:Exon:Nonsynonymous(CCT/Pro/P->CAT/His/H:Base3/30:Codon1/10:Exon1/5):Normal_Splice_Site|GENE3/CODING_GENE:-:Exon:Nonsynonymous(AGG/Arg/R->ATG/Met/M:Base30/30:Codon10/10:Exon5/5):Normal_Splice_Site|GENE2/NON_CODING_GENE:+:Upstream GT:GQ:GD 1/0:31:12 0/0:28:14<br />
...<br />
<br />
The annotation results are stored in the INFO column after the ANNO= tag. <br />
The annotation format is defined as follows:<br />
<br />
"|" separates different transcripts, e.g. in the first line, chromosome 1 position 3, there are 3 annotations: "GENE1/CODING_GENE:+:Exon:Utr5:Normal_Splice_Site" and "GENE3/CODING_GENE:-:Exon:Utr3:Normal_Splice_Site" and "GENE2/NON_CODING_GENE:+:Upstream"<br />
<br />
":" separates within gene annotation in the following order: gene, strand, exon/intron, details.<br />
<br />
2) Statistics files:<br />
<br />
Four frequency tables will be generated after annotation. For example:<br />
<br />
''test.out.vcf.anno.frq'' <br />
Stop_Loss 1<br />
Utr5 2<br />
Utr3 2<br />
CodonRegion 2<br />
CodonGain 2<br />
Frameshift 2<br />
Synonymous 3<br />
StructuralVariation 3<br />
Noncoding 3<br />
Nonsynonymous 4<br />
Deletion 6<br />
Upstream 6<br />
Insertion 6<br />
Essential_Splice_Site 8 <br />
Downstream 8<br />
Intron 12<br />
Exon 21<br />
Normal_Splice_Site 25<br />
<br />
''test.out.vcf.base.frq''<br />
A->G 1<br />
T->C 1<br />
T->G 2<br />
A->C 2<br />
C->A 5<br />
<br />
''test.out.vcf.codon.frq''<br />
Arg->Met 1<br />
Pro->Thr 1<br />
Arg->Arg 1<br />
Pro->His 1<br />
Gly->Gly 1<br />
Pro->Pro 1<br />
Stp->Tyr 1<br />
Leu->Val 1<br />
<br />
''test.out.vcf.indel.frq''<br />
1 1<br />
-4 1<br />
3 1<br />
-3 1<br />
<br />
= Contact =<br />
<br />
Questions and requests should be sent to Xiaowei Zhan ([mailto:zhanxw@umich.edu zhanxw@umich.edu]) or Goncalo Abecasis ([mailto:goncalo@umich.edu goncalo@umich.edu])<br />
<br />
The author sincerely thanks Yanming Li for his wonderful tutorial on the gene annotation software [http://csg.sph.umich.edu//liyanmin/vcfCodingSnps/Tutorial.shtml vcfCodingSnps], and Hyun Min Kang for his code related to genome scores and his consistent suggestions and feedback.</div>Ppwhitehttp://genome.sph.umich.edu/w/index.php?title=TabAnno&diff=14647TabAnno2017-02-02T15:55:09Z<p>Ppwhite: /* Quick tutorial */</p>
<hr />
<div>= Introduction =<br />
<br />
TabAnno is short for "annotation"; it is used to annotate variants. Our goal is to provide abundant information on genetic variants promptly. For example, annotations for all transcripts of a gene are provided, instead of just listing one single annotation. TabAnno supports various file formats: VCF files, plain files, plink association output files, and [http://genome.sph.umich.edu/wiki/EPACTS epacts] files.<br />
<br />
= Quick tutorial =<br />
<br />
You have an input file in VCF format, and your goal is to annotate it using refFlat genes database.<br />
Then you just need the following command:<br />
<br />
anno -i input.vcf -o output.vcf -r hs37d5.fa -g refFlat_hg19.txt.gz -p priority.txt -c codon.txt<br />
<br />
Required files:<br />
<br />
* [http://csg.sph.umich.edu//zhanxw/software/anno/resources/hs37d5.fa hs37d5.fa] Reference genome in NCBI build 37 (you also need to download [http://csg.sph.umich.edu//zhanxw/software/anno/resources/hs37d5.fa.fai hs37d5.fai] )<br />
<br />
* [http://csg.sph.umich.edu//zhanxw/software/anno/resources/refFlat_hg19.txt.gz refFlat_hg19.txt.gz] Gene database in refFlat format (from the UCSC website). You can also use [http://csg.sph.umich.edu//zhanxw/software/anno/resources/refFlat.gencode.v7.gz Gencode version 7] or [http://csg.sph.umich.edu//zhanxw/software/anno/resources/refFlat.gencode.v11.gz Gencode version 11].<br />
<br />
* [http://csg.sph.umich.edu//zhanxw/software/anno/codon.txt codon.txt] Human codon table.<br />
<br />
* [http://csg.sph.umich.edu//zhanxw/software/anno/priority.txt priority.txt] Priority file, which determines which annotation type is more important<br />
<br />
Outputs:<br />
<br />
The annotated VCF will be named ''output.vcf''. Annotations come in two tags: ANNO and ANNOFULL. The ANNO tag shows the most important annotation types (determined by the priority file); the ANNOFULL tag shows detailed annotations. Let's look at one example annotation:<br />
ANNO=Nonsynonymous:GENE1|GENE3;ANNOFULL=GENE1/CODING_GENE:+:Nonsynonymous(CCT/Pro/P->CAT/His/H:Base3/30:Codon1/10:Exon1/5):Normal_Splice_Site:Exon|GENE3/CODING_GENE:-:Nonsynonymous(AGG/Arg/R->ATG/Met/M:Base30/30:Codon10/10:Exon5/5):Normal_Splice_Site:Exon|GENE2/NON_CODING_GENE:+:Upstream<br />
<br />
The ANNO tag shows that the most important variant type is ''Nonsynonymous'' and that it occurs at GENE1 and GENE3;<br />
the ANNOFULL tag is the full set of annotations, with one "|"-separated part per transcript (here GENE1, GENE3, and GENE2). The first part is for GENE1:<br />
<br />
GENE1/CODING_GENE:+:Nonsynonymous(CCT/Pro/P->CAT/His/H:Base3/30:Codon1/10:Exon1/5):Normal_Splice_Site:Exon<br />
<br />
The format can be explained by sections, and we use annotation for GENE1 as an example:<br />
<br />
* GENE1 : gene name<br />
* CODING_GENE : gene type (a coding gene; compare NON_CODING_GENE for GENE2)<br />
* ''+'' : forward strand<br />
* Nonsynonymous, Normal_Splice_Site, Exon : various annotation types<br />
* CCT/Pro/P->CAT/His/H : Proline to Histidine mutation<br />
* Base3/30 : mutation happens on the 3rd base of the total 30 bases<br />
* Codon1/10 : mutation happens on the 1st codon of the total 10 codons<br />
* Exon1/5 : mutation happens on the 1st exon of the total 5 exons<br />
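Since the nested separator format above is regular, an ANNOFULL value can be unpacked mechanically. The following Python sketch is illustrative only (it is not code shipped with TabAnno); it splits an ANNOFULL string into per-transcript parts while keeping the parenthesized detail groups intact:<br />

```python
import re

def parse_annofull(annofull):
    """Split a TabAnno ANNOFULL value into (gene, gene_type, fields) tuples.

    "|" separates transcripts; ":" separates fields within one transcript,
    except inside parenthesized detail groups such as Nonsynonymous(...).
    """
    result = []
    for entry in annofull.split("|"):
        # Split on ":" only when the colon is not inside "(...)".
        tokens = re.split(r":(?![^()]*\))", entry)
        gene, _, gene_type = tokens[0].partition("/")
        result.append((gene, gene_type, tokens[1:]))
    return result

example = ("GENE1/CODING_GENE:+:Nonsynonymous(CCT/Pro/P->CAT/His/H"
           ":Base3/30:Codon1/10:Exon1/5):Normal_Splice_Site:Exon"
           "|GENE2/NON_CODING_GENE:+:Upstream")
for gene, gtype, fields in parse_annofull(example):
    print(gene, gtype, fields)
```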
<br />
= Where to Find It =<br />
<br />
TabAnno code is hosted online [https://github.com/zhanxw/anno anno]. You can download the source and compile (type 'make release').<br />
<br />
For CSG internal users, the compiled executable file is at: /net/fantasia/home/zhanxw/anno/executable/anno<br />
<br />
The source code is located at:/net/fantasia/home/zhanxw/anno<br />
You can type 'make release' to compile your own executable file.<br />
Type "make test1" or "make test2" will demonstrate the command line to annotate example VCF files.<br />
<br />
= Usage =<br />
<br />
== Command line ==<br />
After you obtain the anno executable (either by compiling the source code or by downloading the pre-compiled binary), you will find it under executable/anno. <br />
<br />
Here is the anno help page, shown by invoking anno without any command-line arguments:<br />
<br />
some_linux_host > executable/anno<br />
.............................................. <br />
... G(ene) A(nnotation) ... <br />
... Xiaowei Zhan, Goncalo Abecasis ... <br />
... zhanxw@umich.edu ... <br />
... Dec 2012 ... <br />
................................................<br />
<br />
Required Parameters<br />
-i : Specify input VCF file<br />
-o : Specify output VCF file<br />
Gene Annotation Parameters<br />
-g : Specify gene file<br />
-r : Specify reference genome position<br />
--inputFormat : Specify format (default: vcf). "-f plain " will use fir<br />
st 4 columns as chrom, pos, ref, alt<br />
--checkReference : Check whether reference alleles matches genome reference<br />
-f : Specify gene file format (default: refFlat, other optio<br />
ns: knownGene, refGene)<br />
-p : Specify priority of annotations<br />
-c : Specify codon file (default: codon.txt)<br />
-u : Specify upstream range (default: 50)<br />
-d : Specify downstream range (default: 50)<br />
--se : Specify splice into extron range (default: 3)<br />
--si : Specify splice into intron range (default: 8)<br />
--outputFormat : Specify predefined annotation words (default or epact)<br />
Other Annotation Tools<br />
--genomeScore : Specify the folder of genome score (e.g. GERP=dirGerp/,<br />
SIFT=dirSift/)<br />
--bed : Specify the bed file and tag (e.g. ONTARGET1=a1.bed,ONT<br />
ARGET2=a2.bed)<br />
--tabix : Specify the tabix file and tag (e.g. abc.txt.gz(chrom=1<br />
,pos=7,ref=3,alt=4,SIFT=7,PolyPhen=10)<br />
Please specify input file<br />
<br />
== Input files ==<br />
<br />
TabAnno runs on the input VCF file specified on the command line using the flag '-i'.<br />
<br />
Additionally, you need to specify a gene file using the flag '-g'. You can use the default refFlat file (using the HG19 genome build): /net/fantasia/home/zhanxw/anno/refFlat_hg19.txt.gz <br />
<br />
== Parameters ==<br />
<br />
Some of the command line parameters are described here; most are self-explanatory.<br />
<br />
*Reference genome file<br />
<br />
-r Specify a FASTA format reference genome file. <br />
<br />
Specifying the ''-r'' option enables TabAnno to give more detailed information; for example, instead of annotating a variant simply as exonic, it will tell you which codon and which exon the variant is located in, whether it is synonymous or non-synonymous, and so on.<br />
<br />
TabAnno requires a FASTA index file, which saves memory and speeds up annotation. You can use "samtools faidx XXX.fa" to generate the FASTA index file.<br />
<br />
For example, you can specify a FASTA file of the whole genome and use "-r /data/local/ref/karma.ref/human.g1k.v37.fa"<br />
<br />
*Gene file format<br />
<br />
Currently, TabAnno supports gene files in refFlat format. A prepared list of all genes, obtained from the UCSC website, is: "/net/fantasia/home/zhanxw/anno/refFlat_hg19.txt.gz" . <br />
To use that file, use the flag ''-g''. <br />
<br />
Because TabAnno supports the refFlat format by default, you can use refFlat gene files without specifying the gene format flag ''-f''.<br />
<br />
To use the knownGene or refGene format, you need to specify both the ''-g'' and ''-f'' flags to tell TabAnno which gene file to use and which format it is in.<br />
For example, ''-g /net/fantasia/home/zhanxw/anno/knownGene.txt.gz -f knownGene''.<br />
<br />
*Codon file<br />
<br />
The codon file defines the relationship between triplet codons and amino acids. A default file is located at: ''/net/fantasia/home/zhanxw/anno/codon.txt''.<br />
If you have a special codon file, you can specify it using the flag ''-c''; otherwise, TabAnno will use the default codon file:<br />
<br />
''default codon file''<br />
# DNA codon table. Stop codon is coded as: Stp, O, Stop<br />
#Codon AminoAcid Letter FullName<br />
AAA Lys K Lysine<br />
AAC Asn N Asparagine<br />
AAG Lys K Lysine<br />
AAT Asn N Asparagine<br />
ACA Thr T Threonine<br />
...<br />
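A table in this format can be read with a few lines of code. The Python sketch below is illustrative only (not part of TabAnno); it loads codon-table lines of the form shown above into a dictionary and looks up a codon:<br />

```python
def load_codon_table(lines):
    """Map each triplet codon to (3-letter AA, 1-letter AA, full name)."""
    table = {}
    for line in lines:
        if line.startswith("#") or not line.strip():
            continue  # skip comment and blank lines
        codon, aa3, letter, full = line.split()
        table[codon] = (aa3, letter, full)
    return table

lines = [
    "# DNA codon table. Stop codon is coded as: Stp, O, Stop",
    "#Codon AminoAcid Letter FullName",
    "AAA Lys K Lysine",
    "AAC Asn N Asparagine",
]
table = load_codon_table(lines)
print(table["AAA"])  # ('Lys', 'K', 'Lysine')
```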
*Annotation ranges<br />
-u how far apart from 5'-end of the gene are counted as upstream<br />
-d how far apart from 3'-end of the gene are counted as downstream<br />
<br />
The ''-u'' and ''-d'' parameters define the range of upstream and downstream.<br />
<br />
-se how many bases into the exon are counted as the splice region<br />
-si how many bases into the intron are counted as the splice region<br />
<br />
The ''-se'' and ''-si'' parameters define the splice region. By default, 3 bases into the exon and 8 bases into the intron are treated as splice sites. Mutations in these regions are annotated as "Normal_Splice_Site", unless the mutation falls within the canonically important "GU...AG" region of the intron, in which case the annotation is "Essential_Splice_Site".<br />
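The -se/-si idea can be sketched in a few lines. The following Python snippet is a hedged illustration only: it assumes inclusive exon coordinates and simple boundary arithmetic, and TabAnno's exact implementation may differ.<br />

```python
# Hedged sketch of the -se/-si logic: with the defaults se=3 and si=8,
# positions within 3 bases inside an exon boundary or 8 bases into the
# adjoining intron fall in the splice region. Illustrative only; the
# exact boundary arithmetic in TabAnno may differ.

def in_splice_region(pos, exon_start, exon_end, se=3, si=8):
    """exon_start and exon_end are inclusive exon coordinates."""
    near_start = exon_start - si <= pos <= exon_start + se - 1
    near_end = exon_end - se + 1 <= pos <= exon_end + si
    return near_start or near_end

# For an exon spanning positions 100..200:
print(in_splice_region(101, 100, 200))  # True  (2nd base of the exon)
print(in_splice_region(150, 100, 200))  # False (deep inside the exon)
print(in_splice_region(205, 100, 200))  # True  (5 bases into the intron)
```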
<br />
* Annotate by range<br />
<br />
--bed : Specify the bed file and tag (e.g. ONTARGET1=a1.bed,ONTARGET2=a2.bed)<br />
<br />
BED files are commonly used to represent genomic ranges. TabAnno checks whether each genome position is contained in one or more of the ranges specified in the BED file.<br />
An example BED file, example.bed, is as follows:<br />
1 10 20 a<br />
1 15 40 b<br />
<br />
There are two ranges: range a is from chromosome 1 position 10 (inclusive) to chromosome 1 position 20 (exclusive); range b is from chromosome 1 position 15 (inclusive) to chromosome 1 position 40 (exclusive).<br />
<br />
When using ''--bed ONTARGET=example.bed'' as a parameter, the output may look like this: ''ONTARGET=a,b'' in the VCF INFO field.<br />
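The interval semantics described above (inclusive start, exclusive end) can be shown in a short Python sketch. This is illustrative only, not TabAnno's own code, and it ignores the 0-based-BED versus 1-based-VCF coordinate offset:<br />

```python
def bed_tags(chrom, pos, bed_lines):
    """Return the labels of all BED ranges containing (chrom, pos).

    BED ranges are treated as half-open [start, end): start inclusive,
    end exclusive, matching the description in the text above.
    """
    tags = []
    for line in bed_lines:
        c, start, end, name = line.split()
        if c == chrom and int(start) <= pos < int(end):
            tags.append(name)
    return tags

bed = ["1 10 20 a", "1 15 40 b"]
print(bed_tags("1", 15, bed))  # ['a', 'b']  -> INFO field: ONTARGET=a,b
print(bed_tags("1", 25, bed))  # ['b']       (25 is past the end of range a)
```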
<br />
* Annotate by genome score<br />
<br />
--genomeScore : Specify the folder of genome score (e.g. GERP=dirGerp/,SIFT=dirSift/)<br />
<br />
Genome score is a special binary file format (designed by Hyun) storing per-position genome scores, such as GERP or SIFT scores.<br />
You will need to pre-process those scores into a directory. <br />
To annotate genome scores, use ''--genomeScore TAG_NAME=DIRECTORY''. In the VCF output file, you will have ''TAG_NAME=some_value'' in the INFO field. In other output formats, you will see a separate column.<br />
<br />
* Annotate by tabix input<br />
<br />
--tabix : Specify the tabix file and tag (e.g. abc.txt.gz(chrom=1,pos=7,ref=3,alt=4,SIFT=7,PolyPhen=10)<br />
<br />
Tabix files can be used for annotation. The file must be bgzipped and tabix-indexed. In the example above, abc.txt.gz is the input, and columns 1, 7, 3, and 4 are used as chromosome, position, reference allele, and alternative allele. Column 7 is also used as the SIFT score and column 10 as the PolyPhen score. <br />
With syntax similar to the previous options, you will get output similar to 'SIFT=0.110;PolyPhen=0.00' in the VCF INFO field.<br />
NOTE: in bash, please put quotes around the parentheses; otherwise they are not treated correctly. A working example is as follows:<br />
--tabix '/net/fantasia/home/hmkang/bin/annovar/humandb/hg19_ljb_all.txt.gz(chrom=1,pos=2,ref=4,alt=5,mySift=6,mySC=7,myPP2=8,myPC=9)' <br />
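The column-mapping argument has a regular shape, FILE(key=col,...), so it can be parsed generically. The Python sketch below is illustrative only; TabAnno's own parser may differ in details:<br />

```python
import re

def parse_tabix_spec(spec):
    """Return (file_path, {tag: column_number}) for a --tabix argument."""
    m = re.fullmatch(r"(.+)\((.*)\)", spec)
    if not m:
        raise ValueError("expected FILE(key=col,...): %r" % spec)
    path, body = m.groups()
    columns = {}
    for pair in body.split(","):
        key, _, col = pair.partition("=")
        columns[key] = int(col)
    return path, columns

path, cols = parse_tabix_spec(
    "abc.txt.gz(chrom=1,pos=7,ref=3,alt=4,SIFT=7,PolyPhen=10)")
print(path)          # abc.txt.gz
print(cols["SIFT"])  # 7
```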
<br />
<br />
= Example =<br />
<br />
TabAnno can annotate a VCF file and also output statistics in four frequency tables: annotation type, base change, codon change, and indel size. More details are given below.<br />
<br />
== Built-in example ==<br />
<br />
In the example/ folder, you can find test.vcf, a toy example. You can invoke anno using the following command line:<br />
<br />
cd example; ./anno -i test.vcf -r test.fa -g test.gene.txt -c ../codon.txt -o test.out.vcf<br />
<br />
Sample outputs are listed below:<br />
<br />
1) Annotated VCF file, ''test.out.vcf'' <br />
<br />
#VCF_test<br />
#from http://csg.sph.umich.edu//liyanmin/vcfCodingSnps/Tutorial.shtml<br />
#CHROM POS ID REF ALT QUAL FILTER INFO FORMAT NA12891 NA12892 NA12878<br />
1 3 . A G 50 depth=20 ANNO=GENE1/CODING_GENE:+:Exon:Utr5:Normal_Splice_Site|GENE3/CODING_GENE:-:Exon:Utr3:Normal_Splice_Site|GENE2/NON_CODING_GENE:+:Upstream GT:GQ:GD 1/0:31:12 0/0:28:14<br />
1 5 . C A 50 depth=20 ANNO=GENE1/CODING_GENE:+:Exon:Nonsynonymous(CCT/Pro/P->CAT/His/H:Base3/30:Codon1/10:Exon1/5):Normal_Splice_Site|GENE3/CODING_GENE:-:Exon:Nonsynonymous(AGG/Arg/R->ATG/Met/M:Base30/30:Codon10/10:Exon5/5):Normal_Splice_Site|GENE2/NON_CODING_GENE:+:Upstream GT:GQ:GD 1/0:31:12 0/0:28:14<br />
...<br />
<br />
The annotation results are stored in the INFO column after the ANNO= tag. <br />
The annotation format is defined as follows:<br />
<br />
"|" separates different transcripts, e.g. in the first line, chromosome 1 position 3, there are 3 annotations: "GENE1/CODING_GENE:+:Exon:Utr5:Normal_Splice_Site" and "GENE3/CODING_GENE:-:Exon:Utr3:Normal_Splice_Site" and "GENE2/NON_CODING_GENE:+:Upstream"<br />
<br />
":" separates within gene annotation in the following order: gene, strand, exon/intron, details.<br />
<br />
2) Statistics files:<br />
<br />
Four frequency tables will be generated after annotation. For example:<br />
<br />
''test.out.vcf.anno.frq'' <br />
Stop_Loss 1<br />
Utr5 2<br />
Utr3 2<br />
CodonRegion 2<br />
CodonGain 2<br />
Frameshift 2<br />
Synonymous 3<br />
StructuralVariation 3<br />
Noncoding 3<br />
Nonsynonymous 4<br />
Deletion 6<br />
Upstream 6<br />
Insertion 6<br />
Essential_Splice_Site 8 <br />
Downstream 8<br />
Intron 12<br />
Exon 21<br />
Normal_Splice_Site 25<br />
<br />
''test.out.vcf.base.frq''<br />
A->G 1<br />
T->C 1<br />
T->G 2<br />
A->C 2<br />
C->A 5<br />
<br />
''test.out.vcf.codon.frq''<br />
Arg->Met 1<br />
Pro->Thr 1<br />
Arg->Arg 1<br />
Pro->His 1<br />
Gly->Gly 1<br />
Pro->Pro 1<br />
Stp->Tyr 1<br />
Leu->Val 1<br />
<br />
''test.out.vcf.indel.frq''<br />
1 1<br />
-4 1<br />
3 1<br />
-3 1<br />
<br />
= Contact =<br />
<br />
Questions and requests should be sent to Xiaowei Zhan ([mailto:zhanxw@umich.edu zhanxw@umich.edu]) or Goncalo Abecasis ([mailto:goncalo@umich.edu goncalo@umich.edu])<br />
<br />
The author sincerely thanks Yanming Li for his wonderful tutorial on the gene annotation software [http://www.sph.umich.edu/csg/liyanmin/vcfCodingSnps/Tutorial.shtml vcfCodingSnps], and Hyun Min Kang for his code related to genome scores and his consistent suggestions and feedback.</div>Ppwhitehttp://genome.sph.umich.edu/w/index.php?title=UMAKE&diff=14646UMAKE2017-02-02T15:53:30Z<p>Ppwhite: /* Exercise with Example Resources */</p>
<hr />
<div>[[Category:Software|UMAKE]]<br />
<br />
<!-- BANNER ACROSS TOP OF PAGE --><br />
{| style="width:100%; background:#ffb6c1; margin-top:1.2em; border:1px solid #ccc;" |<br />
| style="width:100%; text-align:center; color:#000;" | <br />
<div style="font-size:162%; border:none; margin:0; padding:.1em; color:#000;">'''NOTE: The UMAKE pipeline is now included in [[GotCloud]] package and is no longer maintained as a separate download. Please see [[GotCloud]] instead. The documentation on this page is outdated. Please visit [[GotCloud]] page for more details on the latest version'''</div><br />
|}<br />
<br />
<br />
'''UMAKE''' is a software pipeline to detect SNPs and call their genotypes from a list of BAM files. The '''UMAKE''' pipeline has been successfully applied to detect SNPs in many large-scale next-generation sequencing studies. <br />
<br />
''Updated version of UMAKE including SVM filtering will be available very soon''<br />
<br />
== Download UMAKE ==<br />
<br />
To get a copy go to the [http://csg.sph.umich.edu//kang/umake/download UMAKE Download] download page.<br />
<br />
== Build UMAKE ==<br />
<br />
To build UMAKE, download the UMAKE package from the link above and run the following series of commands.<br />
tar xzvf umake.v1.0.1.20110706.tar.gz<br />
cd umake<br />
make<br />
<br />
UMAKE is designed to be portable. However, since development occurs only on Ubuntu 9.10 (and later) x86 and x64 platforms, there are likely portability issues on other systems. <br />
<br />
Currently we support UMAKE only on Ubuntu 9.10 and later, on 64-bit processors. Perl (5.0 or higher) must be installed with the IO::File, IO::Zlib, and Getopt::Long modules.<br />
<br />
Note that UMAKE requires external software packages to be copied to <code>UMAKE_HOME/ext/</code> directory <br />
* Create an "ext" folder under UMAKE_HOME (the path to the UMAKE package)<br />
<br />
* <code>bgzip</code> and <code>tabix</code> - To download, go to [http://sourceforge.net/projects/samtools/files/tabix/ TABIX Download] (after compiling the source code, copy bgzip and tabix to the "ext" folder above)<br />
<br />
* <code>beagle</code> - To download, go to [http://faculty.washington.edu/browning/beagle/beagle.html#download BEAGLE Download] (rename "beagle.jar" to "beagle.20101226.jar" and copy it to the "ext" folder)<br />
<br />
* Copy the executables <code>bgzip</code> and <code>tabix</code> to /usr/cluster/bin/, OR replace "/usr/cluster/bin" with the complete path of the "ext" folder above at lines 652 and 654 of umake.pl under UMAKE_HOME/scripts/<br />
<br />
== Basic Usage Example ==<br />
<br />
Here is a typical command line:<br />
<br />
perl $(UMAKE_HOME)/scripts/umake.pl --conf [conf.file]<br />
<br />
An example configuration file can be found at examples/umake-example.conf. Users have to modify the configuration file to match their own input files and environment. <br />
<br />
The full pipeline of UMAKE has to be partitioned into three parts: (1) SNP detection; (2) LD-aware genotype refinement using Beagle; and (3) MaCH/Thunder genotype refinement on top of the Beagle haplotypes. These steps can be run with the same configuration file using the following options:<br />
<br />
perl $(UMAKE_HOME)/scripts/umake.pl --conf [conf.file] --snpcall<br />
perl $(UMAKE_HOME)/scripts/umake.pl --conf [conf.file] --beagle<br />
perl $(UMAKE_HOME)/scripts/umake.pl --conf [conf.file] --thunder<br />
perl $(UMAKE_HOME)/scripts/umake.pl --conf [conf.file] --extract<br />
<br />
== Exercise with Example Resources ==<br />
Example input files can be downloaded at [http://csg.sph.umich.edu//kang/umake/download UMAKE Download]. These example resource files include sequence alignment files for 60 individuals from the 1000 Genomes Project, focusing on a 300kb region in chromosome 20. Note that the reference genome FASTA file has also been modified to contain chromosome 20 only.<br />
<br />
Let <code>UMAKE_HOME</code> be the path to the UMAKE package and <code>EXAMPLE_HOME</code> be the path to the example resource files. <br />
<br />
* First, modify the <code>UMAKE_ROOT, INPUT_ROOT, OUTPUT_ROOT</code> parameters accordingly. <br />
* Run all of the following commands under the <code>EXAMPLE_HOME</code> folder. After each perl script finishes, run the two make commands printed by the script.<br />
* Second, perform SNP calling procedure using the following command <br />
perl $(UMAKE_HOME)/scripts/umake.pl --snpcall<br />
* Third, run BEAGLE genotype refinement using the following command: <br />
perl $(UMAKE_HOME)/scripts/umake.pl --beagle<br />
* Finally, run BEAGLE/THUNDER genotype refinement using the following command: <br />
perl $(UMAKE_HOME)/scripts/umake.pl --thunder<br />
* If using MOSIX nodes, change the default MOS_PREFIX as follows:<br />
MOS_PREFIX = mosrun -E/tmp -t -i -m 2000 # PREFIX FOR MOSIX COMMAND (BLANK IF UNUSED)<br />
<br />
== Preparing Your Own Input Files ==<br />
<br />
UMAKE requires three types of input files (1) a set of BAM files (2) index file (3) configuration file<br />
<br />
* BAM files need to be duplicate-marked and base-quality recalibrated in order to obtain high quality SNP calls.<br />
* Each line of the index file represents one individual, in the following format. Note that multiple BAMs per individual may be provided.<br />
[SAMPLE_ID] [COMMA SEPARATED POPULATION LABELS] [BAM_FILE1] [BAM_FILE2] ...<br />
* Additional input files may be provided, including pedigree files (PED format, to specify gender information for chrX calling) and target information (UCSC BED format) for targeted or whole-exome capture sequencing.<br />
* The configuration file contains the core run-time options, including the software binaries and command line arguments. Refer to the example configuration file for further information.<br />
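The index-file layout described above is a simple whitespace-delimited format, so one line can be parsed as follows. This Python sketch is illustrative only and is not part of UMAKE:<br />

```python
# Parse one line of the UMAKE BAM index file:
# [SAMPLE_ID] [COMMA SEPARATED POPULATION LABELS] [BAM_FILE1] [BAM_FILE2] ...

def parse_index_line(line):
    """Return (sample_id, population_labels, bam_paths)."""
    fields = line.split()
    if len(fields) < 3:
        raise ValueError("need sample id, populations, and at least one BAM")
    return fields[0], fields[1].split(","), fields[2:]

sid, pops, bams = parse_index_line("NA12878 CEU,EUR run1.bam run2.bam")
print(sid, pops, bams)  # NA12878 ['CEU', 'EUR'] ['run1.bam', 'run2.bam']
```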
<br />
== Configuration File ==<br />
<br />
The example configuration file below illustrates how to set up the UMAKE configuration file. Here are a few highlights:<br />
* Steps to run can be set automatically using --snpcall, --beagle, --thunder, or --extract, or manually by uncommenting options in STEPS_TO_RUN. Note that the steps to run should be in consecutive order<br />
* To run on full genome data, the resource files should be downloaded from [ftp://share.sph.umich.edu/1000genomes/umake-resources/ FTP Download of Full Resource Files].<br />
* You need to comment out the target-related configuration lines in order to run on whole genome data<br />
* FILTER_ARGS needs to be carefully calibrated in order to obtain a good filtered set of SNPs<br />
<br />
##################################################################<br />
# UMAKE CONFIGURATION FILE<br />
# This configuration file contains run-time configuration of<br />
# UMAKE SNP calling pipeline<br />
###############################################################################<br />
## KEY ELEMENTS TO CONFIGURE : NEED TO MODIFY<br />
###############################################################################<br />
#UMAKE_ROOT = FULL_PATH_TO_UMAKE ## e.g. /home/myid/code/umake<br />
#INPUT_ROOT = FULL_PATH_TO_CURRENT_DIR ## e.g. /home/myid/data/umake-examples<br />
#OUTPUT_ROOT = FULL_PATH_TO_OUTPUT_DIR ## e.g. /home/myid/data/umake-examples/out<br />
BAM_INDEX = $(INPUT_ROOT)/umake-example.index # SAMPLE INDEX FILE (See documentation for detailed format)<br />
CHRS = 20 # List of chromosomes to call SNPs. For multiple chromosomes, separate by whitespace<br />
OUT_DIR = $(OUTPUT_ROOT) # output directory<br />
OUT_PREFIX = umake-example # prefix of output Makefile $(OUT_PREFIX).Makefile will be generated<br />
#PED_INDEX = $(INPUT_ROOT)/umake-example.ped # SAMPLE PED FILE (required only for chrX calling)<br />
#<br />
###############################################################################<br />
## STEPS TO RUN : COMMENT OUT TO EXCLUDE CERTAIN STEPS<br />
## --snpcall, --extract, --beagle, --thunder commands automatically set them<br />
###############################################################################<br />
#RUN_INDEX = TRUE # create BAM index file<br />
#RUN_PILEUP = TRUE # create GLF file from BAM<br />
#RUN_GLFMULTIPLES = TRUE # create unfiltered SNP calls<br />
#RUN_VCFPILEUP = TRUE # create PVCF files using vcfPileup and run infoCollector<br />
#RUN_FILTER = TRUE # filter SNPs using vcfCooker<br />
#RUN_SPLIT = TRUE # split SNPs into chunks for genotype refinement<br />
#RUN_BEAGLE = TRUE # BEAGLE - MUST SET AFTER FINISHING PREVIOUS STEPS<br />
#RUN_SUBSET = TRUE # SUBSET FOR THUNDER - MAY BE SET WITH BEAGLE STEP TOGETHER<br />
#RUN_THUNDER = TRUE # THUNDER - MUST SET AFTER FINISHING PREVIOUS STEPS<br />
#<br />
###############################################################################<br />
## OPTIONS FOR GLFEXTRACT (GLFMULTIPLES, VCFPILEUP, FILTER MUST BE TURNED OFF)<br />
###############################################################################<br />
#RUN_EXTRACT = TRUE # Instead of discovering SNPs, extract genotype likelihoods at the sites in VCF_EXTRACT<br />
#VCF_EXTRACT = # whole-genome (gzipped and tabixed) .vcf.gz file to extract the site information to genotype (such as 1000 Genomes site list)<br />
#<br />
###############################################################################<br />
## OPTIONS FOR EXOME/TARGETED SEQUENCING : COMMENT OUT IF WHOLE GENOME SEQUENCING<br />
###############################################################################<br />
WRITE_TARGET_LOCI = TRUE # FOR TARGETED SEQUENCING ONLY -- Write loci file when performing pileup<br />
UNIFORM_TARGET_BED = $(INPUT_ROOT)/umake-example.bed # Targeted sequencing : when all individuals have the same target. Otherwise, comment it out<br />
OFFSET_OFF_TARGET = 50 # Extend target by given # of bases <br />
MULTIPLE_TARGET_MAP = # Target per individual : Each line contains [SM_ID] [TARGET_BED]<br />
TARGET_DIR = target # Directory to store target information<br />
SAMTOOLS_VIEW_TARGET_ONLY = TRUE # When performing samtools view, exclude off-target regions (may make command line too long)<br />
#<br />
###############################################################################<br />
## RESOURCE FILES : Download the full resources for full genome calling<br />
###############################################################################<br />
REF = $(INPUT_ROOT)/data/ref/human_g1k_v37_chr20.fa # Reference FASTA sequence. Note that the FASTA file in the example package is only chr20.<br />
INDEL_PREFIX = $(INPUT_ROOT)/data/indels/1kg.pilot_release.merged.indels.sites.hg19 # 1000 Genomes Pilot 1 indel VCF prefix<br />
DBSNP_PREFIX = $(INPUT_ROOT)/data/dbsnp/dbsnp_129_b37.rod # dbSNP file prefix<br />
HM3_PREFIX = $(INPUT_ROOT)/data/HapMap/hapmap3_r3_b37_fwd.consensus.qc.poly # HapMap3 polymorphic site prefix<br />
#<br />
###############################################################################<br />
## BINARIES<br />
###############################################################################<br />
SAMTOOLS_FOR_PILEUP = $(UMAKE_ROOT)/bin/samtools-hybrid # for samtools pileup<br />
SAMTOOLS_FOR_OTHERS = $(UMAKE_ROOT)/bin/samtools-hybrid # for samtools view and calmd<br />
GLFMERGE = $(UMAKE_ROOT)/bin/glfMerge # glfMerge when multiple BAMs exist per individual<br />
GLFMULTIPLES = $(UMAKE_ROOT)/bin/glfMultiples --minMapQuality 0 --minDepth 1 --maxDepth 10000000 --uniformTsTv --smartFilter # glfMultiples and options<br />
GLFEXTRACT = $(UMAKE_ROOT)/bin/glfExtract # glfExtract for obtaining VCF for known sites<br />
VCFPILEUP = $(UMAKE_ROOT)/bin/vcfPileup # vcfPileup to generate rich per-site information<br />
INFOCOLLECTOR = $(UMAKE_ROOT)/bin/infoCollector # create filtering statistics<br />
VCFMERGE = perl $(UMAKE_ROOT)/scripts/bams2vcfMerge.pl # merge multiple BAMs separated by chunk of genomes<br />
VCFCOOKER = $(UMAKE_ROOT)/bin/vcfCooker # vcfCooker for filtering<br />
VCFSUMMARY = perl $(UMAKE_ROOT)/scripts/vcfSummary.pl # Get summary statistics of discovered site<br />
VCFSPLIT = perl $(UMAKE_ROOT)/scripts/vcfSplit.pl # split VCF into overlapping chunks for genotype refinement<br />
VCFPASTE = perl $(UMAKE_ROOT)/scripts/vcfPaste.pl # vcfPaste to generate filtered genotype VCF<br />
BEAGLE = java -Xmx4g -jar $(UMAKE_ROOT)/ext/beagle.20101226.jar seed=993478 gprobs=true niterations=50 lowmem=true # BEAGLE BINARY : NEED TO COPY BEAGLE TO $(UMAKE_ROOT)/ext DIRECTORY BEFORE RUNNING PIPELINE<br />
VCF2BEAGLE = perl $(UMAKE_ROOT)/scripts/vcf2Beagle.pl --PL # convert VCF (with PL tag) into beagle input<br />
BEAGLE2VCF = perl $(UMAKE_ROOT)/scripts/beagle2Vcf.pl # convert beagle output to VCF<br />
THUNDER = $(UMAKE_ROOT)/bin/thunderVCF -r 30 --phase --dosage --compact --inputPhased # MaCH/Thunder genotype refinement step<br />
LIGATEVCF = perl $(UMAKE_ROOT)/scripts/ligateVcf.pl # ligate multiple phased VCFs while resolving the phase between VCFs<br />
BGZIP = $(UMAKE_ROOT)/ext/bgzip # NEED TO COPY BGZIP TO $(UMAKE_ROOT)/ext DIRECTORY BEFORE RUNNING PIPELINE<br />
TABIX = $(UMAKE_ROOT)/ext/tabix # NEED TO COPY TABIX TO $(UMAKE_ROOT)/ext DIRECTORY BEFORE RUNNING PIPELINE<br />
#<br />
###############################################################################<br />
## ARGUMENT FOR FILTERING<br />
###############################################################################<br />
SAMTOOLS_VIEW_FILTER = -q 20 -F 0x0704 # samtools view filter (-q by MQ, -F by flag)<br />
FILTER_MAX_SAMPLE_DP = 20 # Max Depth per Sample (20x default) -- will generate FILTER_MAX_TOTAL_DP automatically<br />
FILTER_MIN_SAMPLE_DP = 0.5 # Min Depth per Sample (0.5x default) -- will generate FILTER_MIN_TOTAL_DP automatically<br />
FILTER_ARGS = --write-vcf --filter --maxDP $(FILTER_MAX_TOTAL_DP) --minDP $(FILTER_MIN_TOTAL_DP) --maxAB 70 --maxSTR 20 --minSTR -20 --winIndel 5 --maxSTZ 5 --minSTZ -5 --maxAOI 5 # arguments for filtering (refer to vcfCooker for details)<br />
#<br />
#############################################################################<br />
## RELATIVE DIRECTORY UNDER OUT_DIR<br />
#############################################################################<br />
BAM_GLF_DIR = glfs/bams # BAM level GLF<br />
SM_GLF_DIR = glfs/samples # sample level GLF (after glfMerge if necessary)<br />
VCF_DIR = vcfs # unfiltered and filtered VCF<br />
PVCF_DIR = pvcfs # vcfPileup results<br />
SPLIT_DIR = split # chunks split into multiple overlapping pieces<br />
BEAGLE_DIR = beagle # beagle output<br />
THUNDER_DIR = thunder # MaCH/thunder output<br />
GLF_INDEX = glfIndex.ped # glfMultiples/glfExtract index file info<br />
#<br />
#############################################################################<br />
## OTHER OPTIONS<br />
#############################################################################<br />
UNIT_CHUNK = 5000000 # Chunk size of SNP calling : 5Mb is default<br />
LD_NSNPS = 10000 # Chunk size of genotype refinement : 10,000 SNPs<br />
LD_OVERLAP = 1000 # Overlapping # of SNPs between chunks : 1,000 SNPs<br />
RUN_INDEX_FORCE = FALSE # Regenerate BAM index file even if it exists<br />
MERGE_BEFORE_FILTER = FALSE # Merge across the chromosome before filtering<br />
NOBAQ_SUBSTRINGS = SOLID # Skip BAQ if the BAM file name contains this substring<br />
ASSERT_BAM_EXIST = FALSE # Check if BAM file exists<br />
#<br />
#############################################################################<br />
## CLUSTER SETTING : CURRENTLY COMPATIBLE WITH MOSIX PLATFORM<br />
#############################################################################<br />
MOS_PREFIX = # PREFIX FOR MOSIX COMMAND (BLANK IF UNUSED)<br />
MOS_NODES = # COMMA-SEPARATED LIST OF NODES TO SUBMIT JOBS<br />
REMOTE_PREFIX = # REMOTE_PREFIX : Set if cluster nodes see the directory differently (e.g. /net/mymachine/[original-dir])<br />
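The per-sample depth cutoffs above (FILTER_MAX_SAMPLE_DP, FILTER_MIN_SAMPLE_DP) are expanded into the total-depth cutoffs referenced by FILTER_ARGS. A minimal sketch of that expansion, assuming it simply scales the per-sample value by the number of samples (the sample count and variable names below are hypothetical; check the UMAKE documentation for the exact rule):<br />

```shell
# Hypothetical illustration (not part of UMAKE itself): derive total-depth
# cutoffs from per-sample cutoffs, assuming a simple "per-sample x N" scaling.
N_SAMPLES=60          # hypothetical number of samples in the BAM index
MAX_SAMPLE_DP=20      # FILTER_MAX_SAMPLE_DP
MIN_SAMPLE_DP=0.5     # FILTER_MIN_SAMPLE_DP
MAX_TOTAL_DP=$(awk -v n="$N_SAMPLES" -v d="$MAX_SAMPLE_DP" 'BEGIN{print n * d}')
MIN_TOTAL_DP=$(awk -v n="$N_SAMPLES" -v d="$MIN_SAMPLE_DP" 'BEGIN{print n * d}')
echo "--maxDP $MAX_TOTAL_DP --minDP $MIN_TOTAL_DP" > /tmp/umake_total_dp.txt
cat /tmp/umake_total_dp.txt
```

With 60 samples this yields --maxDP 1200 --minDP 30, i.e. 20x and 0.5x per sample on average.<br />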
<br />
<br />
<br />
== Software Components ==<br />
The UMAKE pipeline consists of the following software components (details TBA)<br />
* [[samtools-hybrid]]<br />
* [[glfMerge]]<br />
* [[glfMultiples]]<br />
* [[vcfPileup]]<br />
* [[infoCollector]]<br />
* [[vcfCooker]]<br />
* [[thunderVCF]]<br />
<br />
<br />
== Common Problems ==<br />
If UMAKE did not run successfully, please double-check these prerequisites. <br />
<br />
* The input region file (BED) should only contain chromosomes 1, 2, ..., 22, X, and Y<br />
<br />
* BAM index files (e.g. bams.index) should use an absolute path for each BAM file<br />
<br />
* Your working partition (where OUTPUT_ROOT is located) must support symbolic links, so a Windows partition will not work.<br />
<br />
* If UMAKE complains about resource files (e.g. it cannot find HapMap SNPs or 1000 Genomes indels), please check that your UMAKE configuration file points to the resource files in their correct locations. A common mistake is to use the path HapMap/ while the resource folder is HapMap3/ .<br />
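For the BAM index requirement above, here is a small sketch of rewriting relative BAM paths to absolute ones. The two-column "sample, BAM path" layout and the file names are hypothetical; check your own bams.index format before adapting this:<br />

```shell
# Sketch (hypothetical index layout): prefix any non-absolute path in
# column 2 with a chosen root directory, keeping absolute paths untouched.
mkdir -p /tmp/umake_demo
printf 'NA12878\tbams/NA12878.bam\nNA12891\t/abs/path/NA12891.bam\n' \
    > /tmp/umake_demo/bams.index
awk -v root="/tmp/umake_demo" 'BEGIN{OFS="\t"}
    {if ($2 !~ /^\//) $2 = root "/" $2; print}' \
    /tmp/umake_demo/bams.index > /tmp/umake_demo/bams.abs.index
cat /tmp/umake_demo/bams.abs.index
```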
<br />
== Acknowledgements ==<br />
<br />
UMAKE is the result of a collaborative effort by Hyun Min Kang, Goo Jun, Carlo Sidore, Yun Li, Paul Anderson, Mary Kate Wing, Wei Chen, Tom Blackwell, and Goncalo Abecasis. Please email Hyun Min Kang [[mailto:hmkang@umich.edu| hmkang@umich.edu ]] with any questions.</div>Ppwhitehttp://genome.sph.umich.edu/w/index.php?title=LASER&diff=14645LASER2017-02-02T15:51:32Z<p>Ppwhite: /* Advanced options */</p>
<hr />
<div>= Introduction =<br />
<br />
LASER, which stands for Locating Ancestry using SEquencing Reads, is a C++ software package that can estimate individual ancestry directly from genome-wide shotgun sequencing reads without calling genotypes. The method relies on the availability of a set of reference individuals whose genome-wide SNP genotypes and ancestral information are known. We first construct a reference coordinate system by applying principal components analysis (PCA) to the genotype data of the reference individuals. Then, for each sequencing sample, we use the genome-wide sequencing reads to place the sample into the reference PCA space. With an appropriate reference panel, the estimated coordinates of the sequencing samples identify their ancestral background and can be directly used to correct for population structure in association studies or to ensure adequate matching of cases and controls.<br />
<br />
<br />
Note:<br />
The goal of this wiki page is to help you get started using LASER.<br />
This page was created for LASER 1.0. Some of the information might be outdated for LASER 2.0. <br />
A more up-to-date wiki page can be found at [http://genome.sph.umich.edu/wiki/SeqShop:_Estimates_of_Genetic_Ancestry_Practical 2014 UM Sequencing Workshop].<br />
We also encourage you to read the [http://csg.sph.umich.edu/chaolong/LASER/LASER_Manual.pdf manual] for more details of the software.<br />
<br />
= Download =<br />
<br />
To get a copy of the software and manual, go to the [http://csg.sph.umich.edu//chaolong/LASER/ LASER Download] page.<br />
<br />
= Workflow =<br />
<br />
LASER generates the coordinates of both reference individuals and sequence samples. It essentially requires two input files: <br />
<br />
[[File:LASER-Workflow.png|thumb|center|alt=LASER workflow|400px|LASER Workflow]] <br />
<br />
*Seq file: a text file processed from BAM (alignment) files. (See [[#Process sequencing file (BAM)|Processing sequencing file]] for how to prepare seq file) <br />
*Geno file: genotypes of reference individuals. (See [[#Geno file|Geno file]] to understand geno file format)<br />
<br />
LASER typically outputs two coord files: (1) the reference individuals' coord file (Reference.coord) contains the coordinates of the reference individuals in the PCA space; (2) the sequence samples' coord file (AllSamples.coord) contains the inferred ancestry coordinates of the sequence samples, placed in the reference PCA space.<br />
<br />
An example result of the coord file of sequence samples is shown below:<br />
<br />
popID indivID L1 Ci t PC1 PC2<br />
YRI NA19238 1409 0.304122 0.98933 52.7634 -39.7924<br />
CEU NA12892 1552 0.330037 0.989709 9.82674 25.2898<br />
CEU NA12891 1609 0.362198 0.988082 0.439573 26.8872<br />
CEU NA12878 1579 0.334825 0.988677 8.83775 28.1342<br />
YRI NA19239 1558 0.34898 0.988302 53.9104 -39.1727<br />
YRI NA19240 1735 0.404142 0.990264 59.8379 -45.2765<br />
<br />
In the header line, popID is the population ID, indivID the individual ID, L1 the number of loci covered by at least one read, Ci the average coverage, and t the Procrustes similarity. PC1 and PC2 are the coordinates of the first and second principal components.<br />
<br />
= Tutorial =<br />
<br />
In this tutorial, we will show you how to prepare data and run LASER.<br />
<br />
== Process sequencing file (BAM) ==<br />
<br />
We illustrate how to obtain .seq file from BAM files in this section. <br />
In this example, we use the HGDP data set as a reference, which contains 938 individuals and 632,958 markers.<br />
[[File:LASER-DataProcessing.png|thumb|center|alt=LASER workflow|400px|LASER Data Processing Procedure]] <br />
<br />
1. Obtain pileup files from BAM files <br />
<br />
The first step is to generate a BED file:<br />
<br />
cat ../resource/HGDP/HGDP_938.site |awk '{if (NR > 1) {print $1, $2-1, $2;}}' > HGDP_938.bed<br />
<br />
This BED file contains the positions of all the reference markers. <br />
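The conversion above can be checked on a toy site file: the site file positions are 1-based, so the awk command emits a 0-based start ($2-1) and a 1-based end ($2), while NR > 1 skips the header line:<br />

```shell
# Toy check of the site-to-BED conversion: NR > 1 skips the header, and
# $2-1 turns the 1-based site position into a 0-based BED start.
cat > /tmp/toy.site <<'EOF'
CHR POS ID REF ALT
1 752566 rs3094315 G A
1 768448 rs12562034 G A
EOF
awk '{if (NR > 1) {print $1, $2-1, $2;}}' /tmp/toy.site > /tmp/toy.bed
cat /tmp/toy.bed
```

The two output lines match the BED example given in the File format section below.<br />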
<br />
Then we use ''samtools'' to extract the sequence bases overlapping these 632,958 reference markers.<br />
Assuming your BAM file name is ''NA12878.chrom22.recal.bam'' (our example BAM file), you can use this:<br />
<br />
samtools mpileup -q 30 -Q 20 -f ../../LASER-resource/reference/hs37d5.fa -l HGDP_938.bed exampleBAM/NA12878.chrom22.recal.bam > NA12878.chrom22.pileup<br />
<br />
to obtain a pileup file named ''NA12878.chrom22.pileup''. The ''.pileup'' suffix is required.<br />
<br />
2. Obtain a seq file from pileup files. <br />
<br />
After obtaining pileup files from each BAM file, you can convert them into a single seq file before running LASER. <br />
Use the same site file and all generated pileup files from step 1 to generate a seq file:<br />
<br />
python pileup2seq.py -m ../resource/HGDP/HGDP_938.site -o test NA12878.chrom22.pileup<br />
<br />
You should obtain test.seq file after this step.<br />
<br />
== Estimate ancestries of sequence samples ==<br />
<br />
The easiest way to run LASER on its example data is: <br />
<br />
./laser -s pileup2seq/test.seq -g resource/HGDP/HGDP_938.geno -c resource/HGDP/HGDP_938.RefPC.coord -o test -k 2<br />
<br />
Upon successful calculation, you will find a result file "test.SeqPC.coord".<br />
<br />
<br><br />
<br />
<br />
== Interpret LASER outputs ==<br />
<br />
Upon successfully running the LASER command line above, the output messages should be similar to below: <br />
<br />
===================================================================<br />
==== LASER: Locating Ancestry from SEquencing Reads ====<br />
==== Version 1.0 | (c) Chaolong Wang 2013 ====<br />
====================================================================<br />
Started at: Fri Nov 15 01:05:48 2013<br />
<br />
938 individuals are detected in the GENO_FILE.<br />
632958 loci are detected in the GENO_FILE.<br />
1 individuals are detected in the SEQ_FILE.<br />
632958 loci are detected in the SEQ_FILE.<br />
938 individuals are detected in the COORD_FILE.<br />
100 PCs are detected in the COORD_FILE.<br />
<br />
Parameter values used in execution:<br />
-------------------------------------------------<br />
GENO_FILE (-g)resource/HGDP/HGDP_938.geno<br />
SEQ_FILE (-s)pileup2seq/test.seq<br />
COORD_FILE (-c)resource/HGDP/HGDP_938.RefPC.coord<br />
OUT_PREFIX (-o)test<br />
DIM (-k)2<br />
MIN_LOCI (-l)100<br />
SEQ_ERR (-e)0.01<br />
FIRST_IND (-x)1<br />
LAST_IND (-y)1<br />
REPS (-r)1<br />
OUTPUT_REPS (-R)0<br />
CHECK_FORMAT (-fmt)10<br />
CHECK_COVERAGE (-cov)0<br />
PCA_MODE (-pca)0<br />
-------------------------------------------------<br />
<br />
Fri Nov 15 01:05:50 2013<br />
Checking data format ...<br />
GENO_FILE: OK.<br />
SEQ_FILE: OK.<br />
COORD_FILE: OK.<br />
<br />
Fri Nov 15 01:06:01 2013<br />
Reading reference genotypes ...<br />
<br />
Fri Nov 15 01:09:15 2013<br />
Reading reference PCA coordinates ...<br />
<br />
Fri Nov 15 01:09:15 2013<br />
Analyzing sequence samples ...<br />
Results for the sequence samples are output to 'test.SeqPC.coord'.<br />
<br />
Finished at: Fri Nov 15 01:09:21 2013<br />
====================================================================<br />
<br />
The ancestry estimates of the input samples are stored in the file '''test.SeqPC.coord''', whose content is shown below:<br />
<br />
popID indivID L1 Ci t PC1 PC2<br />
NA12878.chrom22 NA12878.chrom22 1601 0.00858193 0.977243 31.522 224.098<br />
<br />
The ancestry coordinates for the NA12878 sample are given in PC1 (31.522) and PC2 (224.098).<br />
<br />
It is recommended to visualize these results together with the HGDP reference samples, whose coordinates are given in the file resource/HGDP/HGDP_938.RefPC.coord<br />
<br />
An example figure from our manuscript is shown below: <br />
<br />
[[File:LASER paper Figure 2.png|thumb|center|alt=LASER example outputs as in Figure 2|400px|LASER Outputs]] <br />
<br />
In this figure, 238 individuals were randomly selected from the total 938 HGDP samples as the testing set (colored symbols), <br />
and the remaining 700 HGDP individuals were used as the reference panel (gray symbols).<br />
<br />
= File format =<br />
<br />
== Geno file ==<br />
<br />
The geno file contains genotypes of the reference samples, which LASER uses as the reference panel. You can generate a geno file from VCF files using [https://github.com/zhanxw/vcf2geno vcf2geno].<br />
<br />
In our resource folder, we provide an example geno file for the HGDP data set (resource/HGDP/HGDP_938.geno):<br />
<br />
Brahui HGDP00001 1 2 1 1 0 2 0 2 1 2 2 2 1 1 2 1 0<br />
Brahui HGDP00003 0 0 2 0 0 2 0 2 0 2 2 2 2 0 2 2 0<br />
Brahui HGDP00005 0 2 2 0 0 1 0 2 1 2 2 2 2 1 2 2 1<br />
Brahui HGDP00007 0 2 2 0 0 2 0 2 0 2 2 2 1 1 2 2 1<br />
Brahui HGDP00009 0 1 0 1 0 2 0 2 0 2 2 2 2 0 2 2 0<br />
Brahui HGDP00011 1 1 2 1 1 2 1 1 1 2 2 2 1 1 2 2 0<br />
Brahui HGDP00013 1 2 2 1 1 2 1 2 0 2 2 2 2 0 2 2 0<br />
Brahui HGDP00015 1 1 2 0 0 2 0 2 0 2 2 2 2 0 2 2 0<br />
Brahui HGDP00017 1 1 2 0 0 1 0 0 0 2 0 1 1 2 2 2 0<br />
Brahui HGDP00019 0 2 2 0 0 1 0 1 0 2 1 2 2 1 2 2 0<br />
<br />
The first and second columns represent the population id and individual id. <br />
From the third column, each number represents a genotype.<br />
To be consistent with the sequence data, genotypes should be given on the '''forward strand'''. Genotypes are coded as 0, 1, or 2, representing the number of copies of the<br />
reference allele carried by the individual at that locus. <br />
<br />
This geno file has 632,960 columns: two ID columns followed by 632,958 marker columns (column 3 through the last column).<br />
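Because each genotype code is the number of reference-allele copies, simple per-marker statistics fall out directly. A sketch (toy data, not the real HGDP file) computing the reference allele frequency at the first marker:<br />

```shell
# Toy geno lines (hypothetical): columns 1-2 are IDs, column 3 is the
# genotype at the first marker, coded as copies of the reference allele.
cat > /tmp/toy.geno <<'EOF'
Brahui HGDP00001 1 2 1
Brahui HGDP00003 0 0 2
Brahui HGDP00005 0 2 2
Brahui HGDP00007 2 2 2
EOF
# Reference allele frequency at marker 1 = (sum of codes) / (2 x individuals)
awk '{sum += $3; n++} END {printf "%.3f\n", sum / (2 * n)}' \
    /tmp/toy.geno > /tmp/toy.freq
cat /tmp/toy.freq
```

Here the sum of codes is 3 over 4 individuals, giving a frequency of 3/8 = 0.375.<br />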
<br />
== Seq file ==<br />
A seq file is generated from pileup files. It contains the sequencing information organized in a format LASER can read.<br />
The first two columns represent population id and individual id.<br />
Subsequent columns are total read depths and reference base counts.<br />
For example, columns 3 and 4 are 0 and 0 in the example below, meaning that at the first marker the read depth is 0 and therefore no read carries the reference base.<br />
We enforce tab delimiters between markers and space delimiters between each read depth and reference base count.<br />
One line of a seq file looks like below:<br />
<br />
NA12878.chrom22 NA12878.chrom22 0 0 0 0 0 0 0 0 0 <br />
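The L1 and Ci columns of the coord file can be recomputed from a seq line: columns 3, 5, 7, ... are depths and columns 4, 6, 8, ... are reference base counts. A sketch on a toy line (depth values chosen for illustration):<br />

```shell
# Toy seq line: after two ID columns, markers come as (depth, ref-count)
# pairs, so depths sit in columns 3, 5, 7, ...
printf 'NA12878 NA12878 0 0 2 1 1 1 3 2\n' > /tmp/toy.seq
awk '{n = 0; covered = 0; depth = 0
      for (i = 3; i <= NF; i += 2) {n++; depth += $i; if ($i > 0) covered++}
      printf "L1=%d Ci=%.2f\n", covered, depth / n}' /tmp/toy.seq > /tmp/toy.seqstats
cat /tmp/toy.seqstats
```

With depths 0, 2, 1, 3 over four markers, three markers are covered (L1=3) and the average coverage is 6/4 = 1.50.<br />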
<br />
== Pileup file ==<br />
<br />
Pileup files are generated using samtools. An example pileup file is shown below:<br />
<br />
22 17094749 A 1 c D<br />
22 17202602 T 1 . D<br />
22 17411899 A 1 . C<br />
22 17450515 G 2 ., 9<<br />
22 17452966 T 1 c 5<br />
22 17470779 C 1 , A<br />
22 17492203 G 1 , B<br />
22 17504945 C 3 ,.. BCA<br />
22 17529814 T 3 .., CCC<br />
<br />
The columns are chromosome, position (1-based), reference base, depth, bases and base qualities.<br />
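The core of the pileup-to-seq conversion is counting, at each site, how many bases match the reference: "." marks a match on the forward strand and "," a match on the reverse strand. A simplified sketch of that counting step (real pileup base strings can also contain "^", "$", and indel runs, which a full parser such as pileup2seq.py must strip first; this toy input avoids them):<br />

```shell
# One pileup site: depth 3, bases ",.." (all matching the reference C).
printf '22\t17504945\tC\t3\t,..\tBCA\n' > /tmp/toy.pileup
# gsub() returns the number of '.'/',' characters removed, i.e. the number
# of reads matching the reference base at this site.
awk '{bases = $5; ref = gsub(/[.,]/, "", bases); print $4, ref}' \
    /tmp/toy.pileup > /tmp/toy.counts
cat /tmp/toy.counts
```

The output pair "depth, reference count" is exactly what one marker column pair in a seq file holds.<br />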
<br />
== BED file ==<br />
BED file represents genomic regions and it follows [http://genome.ucsc.edu/FAQ/FAQformat.html#format1 UCSC conventions]:<br />
<br />
1 752565 752566<br />
1 768447 768448<br />
1 1005805 1005806<br />
1 1018703 1018704<br />
1 1021414 1021415<br />
<br />
The columns are: chromosome, start position (0-based) and end position (1-based).<br />
<br />
== Coord file ==<br />
Coord files represent the ancestries of both reference samples and sequence samples.<br />
An example coord file looks like below:<br />
<br />
popID indivID L1 Ci t PC1 PC2<br />
YRI NA19238 1409 0.304122 0.98933 52.7634 -39.7924<br />
CEU NA12892 1552 0.330037 0.989709 9.82674 25.2898<br />
CEU NA12891 1609 0.362198 0.988082 0.439573 26.8872<br />
CEU NA12878 1579 0.334825 0.988677 8.83775 28.1342<br />
YRI NA19239 1558 0.34898 0.988302 53.9104 -39.1727<br />
YRI NA19240 1735 0.404142 0.990264 59.8379 -45.2765<br />
<br />
The columns are: popID is the population ID, indivID the individual ID, L1 the number of loci covered by at least one read, Ci the average coverage, and t the Procrustes similarity.<br />
PC1 and PC2 are the coordinates of the first and second principal components. You may notice that L1, Ci, and t are omitted in the coord files of reference samples; this is because reference samples use genotypes and have no coverage information.<br />
<br />
== Site file ==<br />
The site file serves a purpose similar to the BED file: it represents marker positions, but with 1-based coordinates and allele information. An example site file looks like below:<br />
CHR POS ID REF ALT<br />
1 752566 rs3094315 G A<br />
1 768448 rs12562034 G A<br />
1 1005806 rs3934834 C T<br />
1 1018704 rs9442372 A G<br />
1 1021415 rs3737728 A G<br />
<br />
The site file has a header line; its columns are chromosome, position (1-based), ID (usually the marker name), REF (reference allele), and ALT (alternative allele).<br />
<br />
= Advanced options =<br />
<br />
LASER has advanced options including (1) parallel computing; (2) increasing ancestry inference accuracy using repeated runs; and (3) generating PCA coordinates from genotypes.<br />
See [http://csg.sph.umich.edu//chaolong/LASER/LASER_Manual.pdf LASER Manual] for detailed information.<br />
<br />
= Contact =<br />
Comments on this wiki page or questions related to preparing input files for LASER can be sent to [mailto:zhanxw@umich.edu Xiaowei Zhan].<br />
Comments on the LASER software or the user's manual can be sent to [mailto:chaolong@umich.edu Chaolong Wang].<br />
This project was directed by Gonçalo Abecasis and Sebastian Zöllner at the University of Michigan.</div>Ppwhitehttp://genome.sph.umich.edu/w/index.php?title=QPLOT&diff=14644QPLOT2017-02-02T15:42:20Z<p>Ppwhite: /* Source Code Distribution */</p>
<hr />
<div>= Introduction =<br />
<br />
The qplot program calculates various summary statistics, some of which are plotted in a PDF file. These statistics can be used to assess the sequencing quality of sequence reads mapped to the reference genome. The main statistics are empirical Phred scores, which are calculated based on the background mismatch rate. The background mismatch rate is the rate at which sequenced bases differ from the reference genome, EXCLUDING dbSNP positions. Other statistics include GC bias, insert size distribution, depth distribution, genome coverage, empirical Q20 count, and so on. <br />
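The empirical Phred score mentioned above is the standard Phred transform, Q = -10 * log10(error rate), applied to the background mismatch rate. A sketch with made-up counts (qplot's exact bookkeeping may differ):<br />

```shell
# Made-up counts: 100 mismatching bases out of 100,000 aligned bases at
# non-dbSNP sites gives a mismatch rate of 0.001, i.e. Phred 30.
awk 'BEGIN {mismatch = 100; total = 100000
            q = -10 * log(mismatch / total) / log(10)
            printf "empirical Phred = %.1f\n", q}' > /tmp/toy.phred
cat /tmp/toy.phred
```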
<br />
In the following sections, we will guide you through: [[#Where to Find It |how to obtain qplot]], [[#Usage |how to use qplot]], [[#Built-in example |example outputs]], [[#anchorOfInteractiveQplot |interactive diagnostic plots]], and [[#Diagnose sequencing quality |real applications]] in which qplot has helped identify sequencing problems.<br />
<br />
= Citing QPLOT =<br />
<br />
If you find QPLOT useful and want to cite it in your paper, please copy and paste the information below.<br />
<br />
* Bingshan Li, Xiaowei Zhan, Mary-Kate Wing, Paul Anderson, Hyun Min Kang, and Goncalo R. Abecasis, “QPLOT: A Quality Assessment Tool for Next Generation Sequencing Data,” BioMed Research International, vol. 2013, Article ID 865181, 4 pages, 2013. doi:10.1155/2013/865181 http://www.hindawi.com/journals/bmri/2013/865181/<br />
<br />
= Where to Find It =<br />
<br />
You can obtain qplot in two ways: <br />
<br />
(1) Download the pre-compiled binary along with the source code as described in [[#Binary Download|Binary Download]]. <br />
<br />
(2) Download source code only and compile it on your own machine. Please follow the instruction in [[#Source Code Distribution|Source Code Distribution]] on fetching source code and building instructions.<br />
<br />
== Binary Download ==<br />
<br />
We have prepared a pre-compiled (under Ubuntu) qplot binary along with the source code. You can download it from: [http://csg.sph.umich.edu//zhanxw/software/qplot/qplot.20130627.tar.gz qplot.20130627.tar.gz (File Size: 1.7G)] <br />
<br />
The executable file is under qplot/bin/qplot. <br />
<br />
In addition, we provided the necessary input files under qplot/data/ (NCBI human genome build v37, dbSNP 130, and pre-computed GC file with windows size 100).<br />
<br />
You can also find an example BAM input file under qplot/example/chrom20.9M.10M.bam. It is taken from the 1000 Genomes Project, with sequencing reads aligned to chromosome 20 positions 9M to 10M.<br />
<br />
== Source Code Distribution ==<br />
<br />
We provide a source-code-only download in [http://csg.sph.umich.edu//zhanxw/software/qplot/qplot-source.20130627.tar.gz qplot-source.20130627.tar.gz]. Optionally, you can download the example file and/or the data file:<br />
<br />
[http://csg.sph.umich.edu//zhanxw/software/qplot/qplot-example.tar.gz example]: example input file, and the expected outputs if you follow the [[#Built-in example | directions]]. <br />
<br />
[http://csg.sph.umich.edu//zhanxw/software/qplot/qplot-data.tar.gz resources data]: necessary input files for qplot, including NCBI human genome build v37, dbSNP 130, and pre-computed GC file with windows size 100.<br />
<br />
You can put above file(s) in the same folder and follow these steps:<br />
<br />
* 1. Unarchive downloaded file<br />
tar zvxf qplot-source.20130627.tar.gz<br />
<br />
A new folder ''qplot'' will be created.<br />
<br />
* 2. Build libStatGen<br />
cd qplot<br />
(cd ../libStatGen; make cloneLib)<br />
<br />
This step will download a necessary software library [http://genome.sph.umich.edu/wiki/C%2B%2B_Library:_libStatGen libStatGen] and compile source code into a binary code library.<br />
<br />
* 3. Build qplot<br />
make <br />
<br />
This step will then build qplot. Upon success, the executable qplot can be found under qplot/bin/.<br />
<br />
* 4. (Optional) unarchive example and/or data<br />
tar zvxf qplot-example.tar.gz<br />
<br />
An example file, ''chrom20.9M.10M.bam'', will be extracted to qplot/example/. It contains ~1.1 million aligned Illumina sequencing reads of NA12878 from the 1000 Genomes Project. An example command line, ''cmd.sh'', and example outputs, ''qplot.pdf'', ''qplot.stats'', and ''qplot.R'', are also provided and will be extracted to qplot/example/ as well. <br />
<br />
tar zvxf qplot-data.tar.gz<br />
<br />
Three files will be extracted to qplot/data/: ''human.g1k.v37-bs.umfa'' is binary NCBI reference genome build 37; ''dbSNP130.UCSC.coordinates.tbl'' is dbSNP version 130; and ''human.g1k.w100.gc'' is pre-calculated GC content with windows size 100.<br />
<br />
<!-- Please download source code from [[]], the building <br />
{{ToolGitRepo|repoName=qplot|noDownload=}}<br />
--><br />
<br />
= Usage =<br />
<br />
== Command line ==<br />
<br />
After you obtain the qplot executable (either by compiling the source code or by downloading the pre-compiled binary file), you will find the executable file under qplot/bin/qplot. <br />
<br />
Here is the qplot help page by invoking qplot without any command line arguments:<br />
<br />
some_linux_host > qplot/bin/qplot<br />
The following parameters are available. Ones with "[]" are in effect:<br />
<br />
<br />
<br />
References : --reference [/net/fantasia/home/zhanxw/software/qplot/data/human.g1k.v37.fa],<br />
--dbsnp [/net/fantasia/home/zhanxw/software/qplot/data/dbSNP130.UCSC.coordinates.tbl]<br />
GC content file options : --winsize [100]<br />
Region list : --regions [], --invertRegion<br />
Flag filters : --read1_skip, --read2_skip, --paired_skip,<br />
--unpaired_skip<br />
Dup and QCFail : --dup_keep, --qcfail_keep<br />
Mapping filters : --minMapQuality [0.00]<br />
Records to process : --first_n_record [-1]<br />
Lanes to process : --lanes []<br />
Read group to process : --readGroup []<br />
Input file options : --noeof<br />
Output files : --plot [], --stats [], --Rcode [], --xml []<br />
Plot labels : --label [], --bamLabel []<br />
Obsoleted (DO NOT USE) : --gccontent [], --create_gc<br />
<br />
== Input files ==<br />
<br />
qplot runs on the input BAM/SAM file(s) specified on the command-line after all other parameters.<br />
<br />
Additionally, three (3) precomputed files are required. <br />
<br />
* <code>--reference</code><br />
<br />
The reference genome is in the same format as the karma reference genome. If the index files do not exist, qplot will create them '''automatically''' from the input reference fasta file.<br />
<br />
* <code>--dbsnp</code><br />
<br />
This file has two columns. The first column is the chromosome name, which must be consistent with the reference created above; the second column is the 1-based SNP position. If you want to create your own dbSNP data from a downloaded UCSC dbSNP file, one way to do it is: <code>cat dbsnp_129_b36.rod|grep "single" | awk '$4-$3==1' |cut -f2,4 > dbSNP_129_b36.tbl</code> <br />
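A toy run of the command above; the column layout is assumed from the command itself (field 2 = chromosome, field 3 = 0-based start, field 4 = 1-based end, so $4-$3==1 keeps single-base sites):<br />

```shell
# Toy dbSNP rod lines (tab-delimited, hypothetical values); only
# single-base "single" entries survive the grep/awk filter, and cut
# keeps chromosome + 1-based position.
printf '585\tchr1\t10432\t10433\trs123\tsingle\n585\tchr1\t10450\t10455\trs124\tdeletion\n585\tchr1\t11007\t11008\trs125\tsingle\n' > /tmp/toy.rod
cat /tmp/toy.rod | grep "single" | awk '$4-$3==1' | cut -f2,4 > /tmp/toy.tbl
cat /tmp/toy.tbl
```

Only the two "single" SNP lines survive, each reduced to chromosome and 1-based position.<br />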
<br />
* <code> **OBSOLETED** --gccontent, --create_gc </code><br />
<br />
Although GC content can be calculated on the fly each time, it is much more efficient to load a precomputed GC content from a file. <br />
GC content file name is automatically determined in this format: <reference_genome_base_file_name>.winsize<gc_content_window_size>.gc.<br />
For example, if your reference genome is human.g1k.v37.fa and the window size is 100, then the GC content file name is: human.g1k.v37.winsize100.gc .<br />
<br />
As noted, there is no need to use --gccontent to specify the GC content file in each run.<br />
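A sketch of the naming rule above (strip the .fa suffix from the reference name, then append .winsize&lt;N&gt;.gc):<br />

```shell
# Name derivation: drop the .fa suffix, append .winsize<N>.gc
ref=human.g1k.v37.fa
winsize=100
gcfile="${ref%.fa}.winsize${winsize}.gc"
echo "$gcfile" > /tmp/toy.gcname
cat /tmp/toy.gcname
```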
<br />
* <code> input files </code><br />
<br />
QPLOT takes SAM/BAM files as input.<br />
<br />
''Note'': Before running qplot, it is critical to check how the chromosome names are coded. Some BAM/SAM files use just numbers, others use chr + numbers. '''You need to make sure that the chromosome names from the reference and dbSNP are consistent with the BAM/SAM files.'''<br />
<br />
== Parameters ==<br />
<br />
Some of the command line parameters are described here; most are self-explanatory.<br />
<br />
*Flag filter<br />
<br />
By default all reads are processed. If you want to check only the first read of each pair, use <code>--read2_skip</code> to ignore the second read, and so on.<br />
<br />
*Duplication and QCFail<br />
<br />
By default, reads marked as duplicates or QC-fail are ignored, but they can be retained by <br />
--dup_keep <br />
or <br />
--qcfail_keep<br />
<br />
<br />
*Records to process <br />
<br />
The <code>--first_n_record</code> option followed by a number, '''n''', will enable qplot to read the first '''n''' reads to test the bam files and verify it works.<br />
<br />
* Lanes to process (only works for Illumina sequences)<br />
<br />
If the input bam files have more than one lane and only some of them need to be checked, use something like <code>--lanes 1,3,5</code> to specify that only lanes 1, 3, and 5 need to be checked.<br />
<br />
'''NOTE''' In order for this to work, the lane info has to be encoded in the read name such that the lane number is the second field with the delimiter ":".<br />
<br />
<br />
* Read group to process : <br />
<br />
The read group option can restrict qplot to process a subset of reads. For example, if the BAM contains the following @RG tags:<br />
<br />
@RG ID:UM0348_1:1 PL:ILLUMINA LB:M5390 SM:M5390 CN:UM<br />
@RG ID:UM0348_2:1 PL:ILLUMINA LB:M5390 SM:M5390 CN:UM<br />
@RG ID:UM0348_3:1 PL:ILLUMINA LB:M5390 SM:M5390 CN:UM<br />
@RG ID:UM0348_4:1 PL:ILLUMINA LB:M5390 SM:M5390 CN:UM<br />
@RG ID:UM0360_1:1 PL:ILLUMINA LB:M5390 SM:M5390 CN:UM<br />
@RG ID:UM0360_2:1 PL:ILLUMINA LB:M5390 SM:M5390 CN:UM<br />
@RG ID:UM0360_3:1 PL:ILLUMINA LB:M5390 SM:M5390 CN:UM<br />
@RG ID:UM0360_4:1 PL:ILLUMINA LB:M5390 SM:M5390 CN:UM<br />
<br />
QPLOT will by default (without specifying --readGroup) process all reads.<br />
<br />
If you specify "--readGroup UM0348", then only read groups UM0348_1, UM0348_2, UM0348_3, and UM0348_4 will be processed.<br />
<br />
If you specify "--readGroup UM0348_1", then only one read group, UM0348_1, will be processed.<br />
<br />
<br />
* Input file options :<br />
<br />
BAM files are compressed using BGZF and should contain the EOF indicator by default. QPLOT will, by default, stop working if it does not find a valid EOF indicator inside the BAM files. <br />
However, you can force QPLOT to continue processing BAM files without an EOF indicator using --noeof. But you should be aware that the input files may be corrupted.<br />
<br />
<br />
* Mapping filters<br />
<br />
Qplot will exclude reads with mapping quality lower than the user-specified parameter, <code>--minMapQuality</code>. By default, mapped reads of all mapping qualities are included in the analysis.<br />
<br />
<br />
*Region list<br />
<br />
If qplot should consider only a list of regions, e.g. exons, this can be achieved by providing a region list. Each line of the file should have the form "chr start end label" (NOTE: ''start'' and ''end'' positions are inclusive and they follow the convention of [http://genome.ucsc.edu/FAQ/FAQformat#format1 BED file]). <br />
For this option to work, the regions within each chromosome (contig) have to be sorted by starting position, and the input bam files have to be sorted as well. <br />
For example, you can create a text file, region.txt like following:<br />
<br />
1 100 500 region_A<br />
1 600 800 region_B<br />
2 100 300 region_C<br />
<br />
Then specifying <code> --regions region.txt</code> enables qplot to calculate various statistics out of sequenced bases only within the above 3 regions.<br />
<br />
Qplot also provides the <code>--invertRegion</code> option. Enabling this option tells qplot to operate on those sequence bases that are outside the given region.<br />
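The within-chromosome sort order required above can be produced with standard tools. A sketch on a deliberately shuffled toy region list:<br />

```shell
# Toy region list, shuffled, then sorted by chromosome and numerically
# by start position as qplot expects.
printf '2 100 300 region_C\n1 600 800 region_B\n1 100 500 region_A\n' > /tmp/regions.txt
sort -k1,1 -k2,2n /tmp/regions.txt > /tmp/regions.sorted.txt
cat /tmp/regions.sorted.txt
```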
<br />
<br />
* Plot labels<br />
<br />
Two kinds of labels are enabled. <code>--label</code> is the label for the plot (default is empty), which is appended to the title of each subplot. <code>--bamLabels</code> followed by a comma-separated list of labels provides the labels for each input SAM/BAM file, e.g. sample IDs (default is numbers 1, 2, ... up to the number of input bam files). For example:<br />
--label Run100 --bamLabels s1,s2,s3,s4,s5,s6,s7,s8<br />
<br />
== Output files ==<br />
<br />
There are three (optional) output files.<br />
* <code>--plot ''qa.pdf''</code><br />
<br />
Qplot will generate a PDF file named ''qa.pdf'' containing 2 pages each with 4 figures. The plot is generated using Rscript.<br />
<br />
* <code>--stats ''qa.stats''</code><br />
<br />
Qplot will generate a text file named ''qa.stats'' containing various summary statistics for each input BAM/SAM file.<br />
<br />
* <code>--Rcode ''qa.R''</code><br />
<br />
Qplot will generate ''qa.R'' which is the R code used for plotting the figures in the ''qa.pdf'' file. If Rscript is not installed in the system, you can use the qa.R to generate the figures on other machines, or extract plotting data from each run and combine multiple runs together to generate more comprehensive plots (See [[#Example | Example]]).<br />
<br />
= Example =<br />
<br />
Qplot can generate diagnostic graphs, related R code, and summary statistics for each SAM/BAM file.<br />
<br />
== Built-in example ==<br />
<br />
In the pre-compiled binary download, you will find a subdirectory named examples. We provide a sample file from the 1000 Genomes Project; it contains aligned reads on chromosome 20 from position 9 Mbp to 10 Mbp. You can invoke qplot using the following command line:<br />
<br />
../bin/qplot --reference ../data/human.g1k.v37.umfa --dbsnp ../data/dbSNP130.UCSC.coordinates.tbl --gccontent ../data/human.g1k.w100.gc --plot qplot.pdf --stats qplot.stats --Rcode qplot.R --label "chr20:9M-10M" chrom20.9M.10M.bam<br />
<br />
Sample outputs are listed below:<br />
<br />
1) Figure: [[Media:qplot.pdf | qplot.pdf]]<br />
<br />
2) Summary statistics:<br />
Stats\BAM chrom20.9M.10M.bam<br />
TotalReads(e6) 1.11<br />
MappingRate(%) 97.24<br />
MapRate_MQpass(%) 97.24<br />
TargetMapping(%) 0.00<br />
ZeroMapQual(%) 2.39<br />
MapQual<10(%) 2.86<br />
PairedReads(%) 83.76<br />
ProperPaired(%) 71.34<br />
MappedBases(e9) 0.04<br />
Q20Bases(e9) 0.04<br />
Q20BasesPct(%) 88.63<br />
MeanDepth 42.22<br />
GenomeCover(%) 0.03<br />
EPS_MSE 1.81<br />
EPS_Cycle_Mean 18.71<br />
GCBiasMSE 0.01<br />
ISize_mode 137<br />
ISize_medium 184<br />
DupRate(%) 5.90<br />
QCFailRate(%) 0.00<br />
BaseComp_A(%) 29.9<br />
BaseComp_C(%) 20.1<br />
BaseComp_G(%) 20.2<br />
BaseComp_T(%) 29.8<br />
BaseComp_O(%) 0.1<br />
<br />
== Gallery of examples ==<br />
<br />
Here we show how qplot can be applied in various sequencing scenarios. Users can also customize the statistics generated by qplot to their needs.<br />
<br />
* Whole genome sequencing with 24-multiplexing<br />
<br />
With a customized script, we aggregated 24 bar-coded samples into the same graph.<br />
The graph helps compare sequencing quality across samples. <br />
<br />
[[Media: qplot.Pool.9847.pdf | QPlot of 24 samples(PDF) ]]<br />
<br />
* Interactive qplot <br />
<br />
<span id="anchorOfInteractiveQplot"></span><br />
Qplot can be interactive. In the following example, you can use the mouse scroll wheel to zoom in and out on each graph and pan to a particular part of the graph.<br />
By presenting qplot data on a web page, users can easily identify problematic sequencing samples. Users of qplot can convert its outputs into web-page format, greatly easing data exploration.<br />
<br />
[http://www-personal.umich.edu/~zhanxw/qplot.Pool.9847.html QPlot of 24 samples(HTML) ]<br />
<br />
== Diagnose sequencing quality ==<br />
<br />
Qplot is designed and implemented for checking sequencing quality. <br />
Besides the example of analyzing RNA-seq data shown in our manuscript, <br />
here we demonstrate two additional scenarios in which qplot can help identify problems after sequencing data are obtained. <br />
<br />
* Base quality distributed abnormally<br />
<br />
[[Media: WrongBaseQual.pdf | Example of qplot helping to identify wrong phred base quality]]<br />
<br />
By checking the first graph, "Empirical vs reported Phred score", we found that the reported base qualities were shifted to the right.<br />
In this particular example, 33 was incorrectly added to all base qualities. <br />
When such data are used in variant calling, the number of false positive SNP calls may increase.<br />
<br />
* Bar-coded samples<br />
<br />
[[Media: WrongBarCoding.pdf | Example of qplot identifying the effect of ignoring bar-coding]]<br />
<br />
By checking "Empirical phred score by cycle" (top right graph on the first page), we noticed the empirical qualities in the first several cycles are abnormally low. This phenomenon leads us to hypothesize that the first several bases have different properties. Further investigation confirmed that this sequencing was done using bar-coded DNA samples, but the analysis did not properly de-multiplex each sample.<br />
<br />
= Contact =<br />
<br />
Questions and requests should be sent to Bingshan Li ([mailto:bingshan@umich.edu bingshan@umich.edu]) or Xiaowei Zhan ([mailto:zhanxw@umich.edu zhanxw@umich.edu]) or Goncalo Abecasis ([mailto:goncalo@umich.edu goncalo@umich.edu])</div>Ppwhitehttp://genome.sph.umich.edu/w/index.php?title=QPLOT&diff=14643QPLOT2017-02-02T15:41:33Z<p>Ppwhite: /* Binary Download */</p>
<hr />
<div>= Introduction =<br />
<br />
The qplot program calculates various summary statistics some of which are plotted in a PDF file. These statistics can be used to assess the sequencing quality of sequence reads mapped to the reference genome. The main statistics are empirical Phred scores which are calculated based on the background mismatch rate. Background mismatch rate is the rate that sequenced bases are different from the reference genome, EXCLUDING dbSNP positions. Other statistics include GC biases, insert size distribution, depth distribution, genome coverage, empirical Q20 count, and so on. <br />
<br />
In the following sections, we will guide you through: [[#Where to Find It |how to obtain qplot]], [[#Usage |how to use qplot]], [[#Built-in example |example outputs]], [[#anchorOfInteractiveQplot |interactive diagnostic plots]], and [[#Diagnose sequencing quality |real applications]] in which qplot has helped identify sequencing problems.<br />
<br />
= Citing QPLOT =<br />
<br />
If you find QPLOT useful and want to cite it in your paper, please copy and paste the information below.<br />
<br />
* Bingshan Li, Xiaowei Zhan, Mary-Kate Wing, Paul Anderson, Hyun Min Kang, and Goncalo R. Abecasis, “QPLOT: A Quality Assessment Tool for Next Generation Sequencing Data,” BioMed Research International, vol. 2013, Article ID 865181, 4 pages, 2013. doi:10.1155/2013/865181 http://www.hindawi.com/journals/bmri/2013/865181/<br />
<br />
= Where to Find It =<br />
<br />
You can obtain qplot in two ways: <br />
<br />
(1) Download the pre-compiled binary along with the source code as described in [[#Binary Download|Binary Download]]. <br />
<br />
(2) Download source code only and compile it on your own machine. Please follow the instruction in [[#Source Code Distribution|Source Code Distribution]] on fetching source code and building instructions.<br />
<br />
== Binary Download ==<br />
<br />
We have prepared a pre-compiled (under Ubuntu) qplot along with its source code. You can download it from: [http://csg.sph.umich.edu//zhanxw/software/qplot/qplot.20130627.tar.gz qplot.20130627.tar.gz (File Size: 1.7G)] <br />
<br />
The executable file is under qplot/bin/qplot. <br />
<br />
In addition, we provide the necessary input files under qplot/data/ (NCBI human genome build v37, dbSNP 130, and a pre-computed GC content file with window size 100).<br />
<br />
You can also find an example BAM input file under qplot/example/chrom20.9M.10M.bam. It is taken from the 1000 Genomes Project, with sequencing reads aligned to chromosome 20 positions 8M to 9M.<br />
<br />
== Source Code Distribution ==<br />
<br />
We provide a source code only download in [http://www.sph.umich.edu/csg/zhanxw/software/qplot/qplot-source.20130627.tar.gz qplot-source.20130627.tar.gz]. Optionally, you can download example file and/or data file:<br />
<br />
[http://www.sph.umich.edu/csg/zhanxw/software/qplot/qplot-example.tar.gz example]: an example input file and the expected outputs if you follow the [[#Built-in example | directions]]. <br />
<br />
[http://www.sph.umich.edu/csg/zhanxw/software/qplot/qplot-data.tar.gz resources data]: necessary input files for qplot, including NCBI human genome build v37, dbSNP 130, and a pre-computed GC content file with window size 100.<br />
<br />
Put the above file(s) in the same folder and follow these steps:<br />
<br />
* 1. Unarchive the downloaded file<br />
tar zvxf qplot-source.20130627.tar.gz<br />
<br />
A new folder ''qplot'' will be created.<br />
<br />
* 2. Build libStatGen<br />
cd qplot<br />
(cd ../libStatGen; make cloneLib)<br />
<br />
This step will download a necessary software library [http://genome.sph.umich.edu/wiki/C%2B%2B_Library:_libStatGen libStatGen] and compile source code into a binary code library.<br />
<br />
* 3. Build qplot<br />
make <br />
<br />
This step will then build qplot. Upon success, the executable qplot can be found under qplot/bin/.<br />
<br />
* 4. (Optional) unarchive example and/or data<br />
tar zvxf qplot-example.tar.gz<br />
<br />
An example file, ''chrom20.9M.10M.bam'', will be extracted to qplot/example/. It contains ~1.1 million aligned Illumina sequencing reads of NA12878 from the 1000 Genomes Project. An example command line, ''cmd.sh'', and example outputs, ''qplot.pdf'', ''qplot.stats'', and ''qplot.R'', are also provided and will be extracted to qplot/example/ as well. <br />
<br />
tar zvxf qplot-data.tar.gz<br />
<br />
Three files will be extracted to qplot/data/: ''human.g1k.v37-bs.umfa'' is the binary NCBI reference genome build 37; ''dbSNP130.UCSC.coordinates.tbl'' is dbSNP version 130; and ''human.g1k.w100.gc'' is pre-calculated GC content with window size 100.<br />
<br />
<br />
= Usage =<br />
<br />
== Command line ==<br />
<br />
After you obtain the qplot executable (either by compiling the source code or by downloading the pre-compiled binary file), you will find the executable file under qplot/bin/qplot. <br />
<br />
Here is the qplot help page by invoking qplot without any command line arguments:<br />
<br />
some_linux_host > qplot/bin/qplot<br />
The following parameters are available. Ones with "[]" are in effect:<br />
<br />
<br />
<br />
References : --reference [/net/fantasia/home/zhanxw/software/qplot/data/human.g1k.v37.fa],<br />
--dbsnp [/net/fantasia/home/zhanxw/software/qplot/data/dbSNP130.UCSC.coordinates.tbl]<br />
GC content file options : --winsize [100]<br />
Region list : --regions [], --invertRegion<br />
Flag filters : --read1_skip, --read2_skip, --paired_skip,<br />
--unpaired_skip<br />
Dup and QCFail : --dup_keep, --qcfail_keep<br />
Mapping filters : --minMapQuality [0.00]<br />
Records to process : --first_n_record [-1]<br />
Lanes to process : --lanes []<br />
Read group to process : --readGroup []<br />
Input file options : --noeof<br />
Output files : --plot [], --stats [], --Rcode [], --xml []<br />
Plot labels : --label [], --bamLabel []<br />
Obsoleted (DO NOT USE) : --gccontent [], --create_gc<br />
<br />
== Input files ==<br />
<br />
qplot runs on the input BAM/SAM file(s) specified on the command-line after all other parameters.<br />
<br />
Additionally, three precomputed files are required. <br />
<br />
* <code>--reference</code><br />
<br />
The reference genome is the same as the karma reference genome. If the index files do not exist, qplot will create them '''automatically''' from the input reference FASTA file.<br />
<br />
* <code>--dbsnp</code><br />
<br />
This file has two columns. The first column is the chromosome name, which must be consistent with the reference created above. The second column is the 1-based SNP position. If you want to create your own dbSNP table from a downloaded UCSC dbSNP file, one way to do it is: <code>cat dbsnp_129_b36.rod | grep "single" | awk '$4-$3==1' | cut -f2,4 > dbSNP_129_b36.tbl</code> <br />
<br />
* <code> **OBSOLETED** --gccontent, --create_gc </code><br />
<br />
Although GC content can be calculated on the fly each time, it is much more efficient to load a precomputed GC content from a file. <br />
The GC content file name is determined automatically in this format: <reference_genome_base_file_name>.winsize<gc_content_window_size>.gc.<br />
For example, if your reference genome is human.g1k.v37.fa and the window size is 100, then the GC content file name is human.g1k.v37.winsize100.gc.<br />
<br />
As a result, there is no need to use --gccontent to specify the GC content file on each run.<br />
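The underlying windowed GC computation can be sketched as follows. This is a generic illustration of per-window GC fractions only; qplot's binary .gc file format is not reproduced here.

```python
def gc_by_window(sequence, winsize=100):
    """Fraction of G/C bases in consecutive fixed-size windows
    of a reference sequence (trailing partial window ignored)."""
    seq = sequence.upper()
    fractions = []
    for start in range(0, len(seq) - winsize + 1, winsize):
        window = seq[start:start + winsize]
        gc = sum(base in "GC" for base in window)
        fractions.append(gc / winsize)
    return fractions

print(gc_by_window("GCGCATATAT", winsize=5))  # [0.8, 0.0]
```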
<br />
* <code> input files </code><br />
<br />
QPLOT takes SAM/BAM files as input.<br />
<br />
''Note'': Before running qplot, it is critical to check how the chromosome names are coded. Some BAM/SAM files use just numbers, others use chr + numbers. '''You need to make sure that the chromosome names from the reference and dbSNP are consistent with the BAM/SAM files.'''<br />
<br />
== Parameters ==<br />
<br />
Some of the command line parameters are described here, but most are self-explanatory.<br />
<br />
*Flag filter<br />
<br />
By default, all reads are processed. To check only the first read of a pair, use <code>--read2_skip</code> to ignore the second read, and so on.<br />
<br />
*Duplication and QCFail<br />
<br />
By default, reads marked as duplicates or QC failures are ignored, but they can be retained with <br />
--dup_keep <br />
or <br />
--qcfail_keep<br />
<br />
<br />
*Records to process <br />
<br />
The <code>--first_n_record</code> option, followed by a number '''n''', makes qplot read only the first '''n''' reads, which is useful for testing that the BAM files work.<br />
<br />
* Lanes to process (only works for Illumina sequences)<br />
<br />
If the input bam files have more than one lane and only some of them need to be checked, use something like <code>--lanes 1,3,5</code> to specify that only lanes 1, 3, and 5 need to be checked.<br />
<br />
'''NOTE''': For this to work, the lane information has to be encoded in the read name such that the lane number is the second ":"-delimited field.<br />
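The lane-extraction convention described in the note can be sketched as follows. The read name is a made-up example; the only assumption taken from the text is that the lane number is the second ":"-separated field.

```python
def lane_of(read_name):
    """Lane number from an Illumina-style read name, assuming the lane
    is the second ':'-separated field
    (e.g. 'HWI-ST123:4:1101:1208:2035' -> lane 4)."""
    fields = read_name.split(":")
    if len(fields) < 2 or not fields[1].isdigit():
        return None
    return int(fields[1])

def keep_read(read_name, lanes):
    """Mimic --lanes: keep only reads from the listed lanes."""
    return lane_of(read_name) in lanes

wanted = {1, 3, 5}  # as in --lanes 1,3,5
print(keep_read("HWI-ST123:3:1101:1208:2035", wanted))  # True
print(keep_read("HWI-ST123:2:1101:1208:2035", wanted))  # False
```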
<br />
<br />
* Read group to process : <br />
<br />
The read group option can restrict qplot to process a subset of reads. For example, if the BAM contains the following @RG tags:<br />
<br />
@RG ID:UM0348_1:1 PL:ILLUMINA LB:M5390 SM:M5390 CN:UM<br />
@RG ID:UM0348_2:1 PL:ILLUMINA LB:M5390 SM:M5390 CN:UM<br />
@RG ID:UM0348_3:1 PL:ILLUMINA LB:M5390 SM:M5390 CN:UM<br />
@RG ID:UM0348_4:1 PL:ILLUMINA LB:M5390 SM:M5390 CN:UM<br />
@RG ID:UM0360_1:1 PL:ILLUMINA LB:M5390 SM:M5390 CN:UM<br />
@RG ID:UM0360_2:1 PL:ILLUMINA LB:M5390 SM:M5390 CN:UM<br />
@RG ID:UM0360_3:1 PL:ILLUMINA LB:M5390 SM:M5390 CN:UM<br />
@RG ID:UM0360_4:1 PL:ILLUMINA LB:M5390 SM:M5390 CN:UM<br />
<br />
QPLOT will by default (without specifying --readgroup) process all reads.<br />
<br />
If you specify "--readGroup UM0348", then only read groups UM0348_1, UM0348_2, UM0348_3, and UM0348_4 will be processed.<br />
<br />
If you specify "--readGroup UM0348_1", then only one read group, UM0348_1, will be processed.<br />
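The selection behavior described above can be sketched as prefix matching on the @RG IDs. This is a sketch of the documented behavior, not qplot's actual matching code.

```python
def select_read_groups(rg_ids, pattern):
    """Prefix-style selection of @RG IDs, illustrating the documented
    --readGroup behavior."""
    return [rg for rg in rg_ids if rg.startswith(pattern)]

rg_ids = ["UM0348_1:1", "UM0348_2:1", "UM0360_1:1"]
print(select_read_groups(rg_ids, "UM0348"))    # both UM0348 groups
print(select_read_groups(rg_ids, "UM0348_1"))  # only UM0348_1:1
```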
<br />
<br />
* Input file options :<br />
<br />
BAM files are compressed using BGZF and should contain the EOF indicator by default. QPLOT will, by default, stop working if it does not find a valid EOF indicator inside the BAM files. <br />
However, you can force QPLOT to continue processing BAM files without an EOF indicator using --noeof. You should be aware, though, that such input files may be truncated or corrupted.<br />
<br />
<br />
* Mapping filters<br />
<br />
Qplot excludes reads with mapping quality lower than the user-specified <code>--minMapQuality</code>. By default, mapped reads of any mapping quality are included in the analysis.<br />
<br />
<br />
*Region list<br />
<br />
To restrict qplot to a list of regions, e.g. exons, provide a region list file. Each line should be of the form "chr start end label" (NOTE: the ''start'' and ''end'' positions are inclusive and follow the convention of the [http://genome.ucsc.edu/FAQ/FAQformat#format1 BED file] format). <br />
For this option to work, the regions within each chromosome (contig) have to be sorted by start position, and the input BAM files have to be sorted as well. <br />
For example, you can create a text file, region.txt like following:<br />
<br />
1 100 500 region_A<br />
1 600 800 region_B<br />
2 100 300 region_C<br />
<br />
Then specifying <code>--regions region.txt</code> makes qplot calculate its statistics from sequenced bases within the above three regions only.<br />
<br />
Qplot also provides the <code>--invertRegion</code> option. Enabling it tells qplot to operate on sequenced bases that fall outside the given regions.<br />
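The sorting requirement for region files can be verified with a short sketch like the one below (an illustrative check, not part of qplot):

```python
def regions_sorted(lines):
    """Check that regions are sorted by start position within each
    chromosome, as the --regions option requires."""
    last_start = {}
    for line in lines:
        chrom, start, _end, _label = line.split()
        start = int(start)
        if start < last_start.get(chrom, 0):
            return False
        last_start[chrom] = start
    return True

regions = ["1 100 500 region_A", "1 600 800 region_B", "2 100 300 region_C"]
print(regions_sorted(regions))        # True
print(regions_sorted(regions[::-1]))  # False: chr1 starts go 600 -> 100
```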
<br />
<br />
* Plot labels<br />
<br />
Two kinds of labels are supported. <code>--label</code> gives a label for the plot (default empty) that is appended to the title of each subplot. <code>--bamLabels</code>, followed by a comma-separated list of labels, gives a label for each input SAM/BAM file, e.g. a sample ID (the default is the numbers 1, 2, ..., up to the number of input BAM files). For example:<br />
--label Run100 --bamLabels s1,s2,s3,s4,s5,s6,s7,s8<br />
<br />
== Output files ==<br />
<br />
There are three (optional) output files.<br />
* <code>--plot ''qa.pdf''</code><br />
<br />
Qplot will generate a PDF file named ''qa.pdf'' containing 2 pages each with 4 figures. The plot is generated using Rscript.<br />
<br />
* <code>--stats ''qa.stats''</code><br />
<br />
Qplot will generate a text file named ''qa.stats'' containing various summary statistics for each input BAM/SAM file.<br />
<br />
* <code>--Rcode ''qa.R''</code><br />
<br />
Qplot will generate ''qa.R'', the R code used to plot the figures in ''qa.pdf''. If Rscript is not installed on the system, you can use ''qa.R'' to generate the figures on another machine, or extract the plotting data from each run and combine multiple runs to generate more comprehensive plots (see [[#Example | Example]]).<br />
<br />
= Example =<br />
<br />
Qplot can generate diagnostic graphs, related R code, and summary statistics for each SAM/BAM file.<br />
<br />
== Built-in example ==<br />
<br />
In the pre-compiled binary download, you will find a subdirectory named examples. We provide a sample file from the 1000 Genomes Project; it contains aligned reads on chromosome 20 from position 8 Mbp to 9 Mbp. You can invoke qplot using the following command line:<br />
<br />
../bin/qplot --reference ../data/human.g1k.v37.umfa --dbsnp ../data/dbSNP130.UCSC.coordinates.tbl --gccontent ../data/human.g1k.w100.gc --plot qplot.pdf --stats qplot.stats --Rcode qplot.R --label "chr20:9M-10M" chrom20.9M.10M.bam<br />
<br />
Sample outputs are listed below:<br />
<br />
1) Figure: [[Media:qplot.pdf | qplot.pdf]]<br />
<br />
2) Summary statistics:<br />
Stats\BAM chrom20.9M.10M.bam<br />
TotalReads(e6) 1.11<br />
MappingRate(%) 97.24<br />
MapRate_MQpass(%) 97.24<br />
TargetMapping(%) 0.00<br />
ZeroMapQual(%) 2.39<br />
MapQual<10(%) 2.86<br />
PairedReads(%) 83.76<br />
ProperPaired(%) 71.34<br />
MappedBases(e9) 0.04<br />
Q20Bases(e9) 0.04<br />
Q20BasesPct(%) 88.63<br />
MeanDepth 42.22<br />
GenomeCover(%) 0.03<br />
EPS_MSE 1.81<br />
EPS_Cycle_Mean 18.71<br />
GCBiasMSE 0.01<br />
ISize_mode 137<br />
ISize_medium 184<br />
DupRate(%) 5.90<br />
QCFailRate(%) 0.00<br />
BaseComp_A(%) 29.9<br />
BaseComp_C(%) 20.1<br />
BaseComp_G(%) 20.2<br />
BaseComp_T(%) 29.8<br />
BaseComp_O(%) 0.1<br />
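The .stats file is plain whitespace-delimited text, so it is easy to post-process, for example to flag low-quality runs across many batches. A minimal parsing sketch, assuming a single input BAM (one value column):

```python
def parse_stats(text):
    """Parse qplot's whitespace-delimited .stats output into a dict,
    assuming one input BAM file (a single value column)."""
    stats = {}
    for line in text.strip().splitlines():
        key, value = line.split(None, 1)
        stats[key] = value.strip()
    return stats

sample = """Stats\\BAM chrom20.9M.10M.bam
MappingRate(%) 97.24
MeanDepth 42.22"""
stats = parse_stats(sample)
print(stats["MeanDepth"])  # 42.22
# e.g. flag a sample: float(stats["MappingRate(%)"]) < 90
```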
<br />
== Gallery of examples ==<br />
<br />
Here we show how qplot can be applied in various sequencing scenarios. Users can also customize the statistics generated by qplot to their needs.<br />
<br />
* Whole genome sequencing with 24-multiplexing<br />
<br />
With a customized script, we aggregated 24 bar-coded samples into the same graph.<br />
The graph helps compare sequencing quality across samples. <br />
<br />
[[Media: qplot.Pool.9847.pdf | QPlot of 24 samples(PDF) ]]<br />
<br />
* Interactive qplot <br />
<br />
<span id="anchorOfInteractiveQplot"></span><br />
Qplot can be interactive. In the following example, you can use the mouse scroll wheel to zoom in and out on each graph and pan to a particular part of the graph.<br />
By presenting qplot data on a web page, users can easily identify problematic sequencing samples. Users of qplot can convert its outputs into web-page format, greatly easing data exploration.<br />
<br />
[http://www-personal.umich.edu/~zhanxw/qplot.Pool.9847.html QPlot of 24 samples(HTML) ]<br />
<br />
== Diagnose sequencing quality ==<br />
<br />
Qplot is designed and implemented for checking sequencing quality. <br />
Besides the example of analyzing RNA-seq data shown in our manuscript, <br />
here we demonstrate two additional scenarios in which qplot can help identify problems after sequencing data are obtained. <br />
<br />
* Base quality distributed abnormally<br />
<br />
[[Media: WrongBaseQual.pdf | Example of qplot helping to identify wrong phred base quality]]<br />
<br />
By checking the first graph, "Empirical vs reported Phred score", we found that the reported base qualities were shifted to the right.<br />
In this particular example, 33 was incorrectly added to all base qualities. <br />
When such data are used in variant calling, the number of false positive SNP calls may increase.<br />
<br />
* Bar-coded samples<br />
<br />
[[Media: WrongBarCoding.pdf | Example of qplot identifying the effect of ignoring bar-coding]]<br />
<br />
By checking "Empirical phred score by cycle" (top right graph on the first page), we noticed the empirical qualities in the first several cycles are abnormally low. This phenomenon leads us to hypothesize that the first several bases have different properties. Further investigation confirmed that this sequencing was done using bar-coded DNA samples, but the analysis did not properly de-multiplex each sample.<br />
<br />
= Contact =<br />
<br />
Questions and requests should be sent to Bingshan Li ([mailto:bingshan@umich.edu bingshan@umich.edu]) or Xiaowei Zhan ([mailto:zhanxw@umich.edu zhanxw@umich.edu]) or Goncalo Abecasis ([mailto:goncalo@umich.edu goncalo@umich.edu])</div>Ppwhitehttp://genome.sph.umich.edu/w/index.php?title=CheckVCF.py&diff=14642CheckVCF.py2017-02-02T15:36:36Z<p>Ppwhite: /* Download */</p>
<hr />
<div>= checkVCF.py =<br />
<br />
checkVCF.py is a small tool written in [http://www.python.org/ Python] to check input [http://www.1000genomes.org/wiki/Analysis/Variant%20Call%20Format/vcf-variant-call-format-version-41 VCF] files before association tests. It can report monomorphic sites, sites with reference alleles inconsistent with the reference genome, sites with invalid genotypes, non-SNP sites (e.g. indels), and all sites with allele frequencies greater than ''0.5''. After the checks pass, you can go on to run [https://github.com/zhanxw/rvtests rvtests], a rare-variant test software package.<br />
<br />
== Download ==<br />
<br />
Download [http://csg.sph.umich.edu//zhanxw/software/checkVCF/checkVCF-20131123.tar.gz this archive] and unarchive it. It includes the [https://github.com/zhanxw/checkVCF/blob/master/checkVCF.py checkVCF.py] script, a reference genome in FASTA format, and its index file.<br />
<br />
== Example ==<br />
<br />
<pre>python checkVCF.py -r hs37d5.fa -o test $your_VCF</pre><br />
== Outputs ==<br />
<br />
=== Console output and .log file ===<br />
<br />
Upon successfully running checkVCF.py on the example file, you will see the following output:<br />
<br />
<pre>checkVCF.py -- check validity of VCF file for meta-analysis<br />
version 1.3 (20130223)<br />
contact zhanxw@umich.edu or dajiang@umich.edu for problems.<br />
Python version is [ 2.7.3.final.0 ] <br />
Begin checking vcfFile [ example.vcf.gz ]<br />
--------------- REPORT ---------------<br />
Total [ 18 ] lines processed<br />
Examine [ 7 ] VCF header lines, [ 11 ] variant sites, [ 6 ] samples<br />
[ 0 ] duplicated sites<br />
[ 0 ] NonSNP site are outputted to [ tmp.check.nonSnp ]<br />
[ 10 ] Inconsistent reference sites are outputted to [ tmp.check.ref ]<br />
[ 0 ] Variant sites with invalid genotypes are outputted to [ tmp.check.geno ]<br />
[ 1 ] Alternative allele frequency &gt; 0.5 sites are outputted to [ tmp.check.af ]<br />
[ 1 ] Monomorphic sites are outputted to [ tmp.check.mono ]<br />
--------------- ACTION ITEM ---------------<br />
* Read tmp.check.ref, for autosomal sites, make sure the you are using the forward strand<br />
* Upload these files to the ftp: tmp.check.log tmp.check.dup tmp.check.noSnp tmp.check.ref tmp.check.geno tmp.check.af tmp.check.mono</pre><br />
=== .check.nonSnp file ===<br />
<br />
This file includes all non-SNP sites. A site is flagged when the length of its reference or alternative allele is greater than one (for example, a reference allele of AT). Non-SNP sites also include those whose reference allele is not composed of 'A', 'C', 'G', 'T' or whose alternative allele is not composed of 'A', 'C', 'G', 'T', '.'.<br />
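The classification rules just described can be sketched as a small predicate. This is an illustration of the stated rules, not checkVCF.py's actual code.

```python
def is_snp(ref, alts):
    """Classify a VCF site per the rules above: a SNP has a one-base
    reference allele in {A,C,G,T} and one-base alternative alleles in
    {A,C,G,T,.}; anything else is a non-SNP site."""
    if len(ref) != 1 or ref.upper() not in "ACGT":
        return False
    for alt in alts:
        if len(alt) != 1 or alt.upper() not in "ACGT.":
            return False
    return True

print(is_snp("A", ["T"]))      # True
print(is_snp("AT", ["A"]))     # False (indel)
print(is_snp("A", ["<DEL>"]))  # False (symbolic allele)
```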
<br />
=== .check.ref file ===<br />
<br />
This file includes the variant sites whose reference alleles do not match the reference genome. That can happen when: (1) a variant's chromosome name does not appear in the reference genome file; you will see a line with &quot;FailedGetBase&quot; and chromosome:position from the input VCF file; (2) the reference allele does not match; you will see &quot;MismatchRefBase&quot; and chromosome:position:trueReferenceAllele-referenceAlleleInVCF/alternativeAlleleInVCF. For example:<br />
<br />
<pre>MismatchRefBase 19:50578409:G-C/T<br />
FailedGetBase 23:208316</pre><br />
=== .check.geno file ===<br />
<br />
This file contains the line numbers at which genotypes are missing or incorrectly formatted. You will get either &quot;IndividualMissingGTField&quot; or &quot;IndividualHasInvalidGT&quot; warnings.<br />
<br />
=== .check.af file ===<br />
<br />
This file contains the sites where the alternative allele frequency is larger than 0.5. It is normal for this file to contain a number of lines. For a human exome chip, you are likely to have ~10k lines in this file, meaning that out of ~250k total variants, around 10k SNP variants have alternative allele frequencies larger than 0.5.<br />
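Computing the alternative allele frequency from VCF genotype strings can be sketched as follows; this is a generic illustration, not checkVCF.py's implementation, and it ignores missing alleles ('.').

```python
def alt_allele_freq(genotypes):
    """Alternative allele frequency from VCF GT strings (e.g. '0/1'),
    counting any non-reference allele and skipping missing alleles."""
    alt = total = 0
    for gt in genotypes:
        # treat phased '|' and unphased '/' separators alike
        for allele in gt.replace("|", "/").split("/"):
            if allele == ".":
                continue
            total += 1
            if allele != "0":
                alt += 1
    return alt / total if total else None

print(alt_allele_freq(["0/1", "1/1", "0|0"]))  # 0.5
```

A site would be written to .check.af when this frequency exceeds 0.5, and to .check.mono when it is 0 or 1.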
<br />
=== .check.mono file ===<br />
<br />
This file contains the monomorphic sites. It is normal for this file to contain a number of lines. Ideally, VCF files should contain only variant sites; in practice, however, it is often convenient to keep some monomorphic sites in the VCF file. This file records them.<br />
<br />
== Contact ==<br />
<br />
Questions or comments can be sent to [mailto:zhanxw@umich.edu Xiaowei Zhan] or [mailto:dajiang@umich.edu Dajiang Liu].</div>Ppwhitehttp://genome.sph.umich.edu/w/index.php?title=Rare_variant_tests&diff=14641Rare variant tests2017-02-02T15:35:35Z<p>Ppwhite: /* Summary of rare variant tests for sequence data */</p>
<hr />
<div>=== Summary of discussion from ESP rare variant working group ===<br />
<br />
The rare variant working group within ESP has discussed the issue of<br />
rare variant tests on several conference calls. The end result is<br />
that we recommend selecting one test from each of these three<br />
categories:<br />
<br />
1. Aggregate tests (typically with 1% threshold, nonsynonymous SNPs<br />
only, with meta-analysis across different ethnic groups)<br />
<br />
2. Tests that allow for risk and/or protective variants (again,<br />
probably 1% threshold, nonsynonymous SNPs only, with meta-analysis<br />
across different ethnic groups)<br />
<br />
3. Weighted tests that allow incorporation of more common variants<br />
(possibly apply 5% threshold?, nonsynonymous only, etc.)<br />
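The aggregate (burden) idea behind category 1 can be sketched as collapsing the rare variants in a gene into a single carrier indicator per individual, which is then tested against the phenotype. This is a generic sketch of the CMC/T1-style idea, not any particular package's implementation.

```python
def collapse_rare(genotype_matrix, mafs, threshold=0.01):
    """Collapse rare variants into a 0/1 indicator per individual:
    1 if the individual carries any variant with MAF below threshold.

    genotype_matrix[i][j] is the allele count (0/1/2) of individual i
    at variant j; mafs[j] is the minor allele frequency of variant j.
    """
    rare = [j for j, maf in enumerate(mafs) if maf < threshold]
    return [int(any(row[j] > 0 for j in rare)) for row in genotype_matrix]

genos = [[0, 1, 0], [0, 0, 2], [0, 0, 0]]
mafs = [0.30, 0.005, 0.002]  # only the last two variants are rare at 1%
print(collapse_rare(genos, mafs))  # [1, 1, 0]
```

The resulting indicator vector can then enter a standard chi-square or regression test of carriers versus non-carriers.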
<br />
A brief summary of the RV discussion;<br />
<br />
- Permutations (where we permute phenotype while maintaining ethnic<br />
group) will likely be required to get empirical p-values. These RV<br />
tests typically provide conservative p-values (deflated QQ plot), but<br />
not always. Thus, a computationally intensive test will not be<br />
practical for performing large numbers of permutations (at least<br />
1000).<br />
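Permuting phenotypes while maintaining ethnic group, as described above, amounts to shuffling labels within each group stratum. A minimal sketch (group labels and phenotypes are hypothetical):

```python
import random

def permute_within_groups(phenotypes, groups, rng=random):
    """Permute phenotype labels within each ethnic group, preserving
    group membership, for stratified permutation p-values."""
    by_group = {}
    for idx, g in enumerate(groups):
        by_group.setdefault(g, []).append(idx)
    permuted = list(phenotypes)
    for indices in by_group.values():
        values = [phenotypes[i] for i in indices]
        rng.shuffle(values)
        for i, v in zip(indices, values):
            permuted[i] = v
    return permuted

phenos = [1, 1, 0, 0]
groups = ["EUR", "EUR", "AFR", "AFR"]
perm = permute_within_groups(phenos, groups)
# each group keeps its own multiset of phenotypes
print(sorted(perm[:2]), sorted(perm[2:]))  # [1, 1] [0, 0]
```

Repeating this, say, 1000 times and recomputing the test statistic each time yields the empirical p-value.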
<br />
- Using too many tests will decrease the power overall because of<br />
correction for family-wise error.<br />
<br />
- Although we'd like to evaluate power and type I error rates of these<br />
tests under a variety of genetic models, the reality is that we have<br />
so few known positive examples it would be difficult to assess them<br />
all in a fair way at this time. Instead, we expect to re-convene this<br />
discussion group at a later date once some true positive associations<br />
are identified.<br />
<br />
- Shamil Sunyaev is performing a bake-off with some of these tests,<br />
and we look forward to seeing his results in the future.<br />
<br />
- PLINKSeq is on its way, but is likely a month away from release (end Feb 2011)<br />
<br />
<br />
<br />
=== Summary of rare variant tests for sequence data ===<br />
<br />
Compiled by Cristen Willer and Suzanne Leal for the ESP<br />
Feb 1, 2011<br />
<br />
* indicates applicability to quantitative data<br />
<br />
<br />
<br />
<br />
'''1) Aggregate tests using a cut off e.g. 1 % analyzing nonsynonymous variants to detect detrimental variants'''<br />
<br />
{| width="75%" cellspacing="0" cellpadding="5" border="1"<br />
|-<br />
! scope="col" align="left" | Test Name<br />
! scope="col" align="left" | Reference<br />
! scope="col" align="left" | Software<br />
! scope="col" align="left" | Notes <br />
|-<br />
| CMC/T1 test* || [http://www.ncbi.nlm.nih.gov/pubmed/18691683 Li & Leal, 2008] <br />
|<br />
| [http://atgu.mgh.harvard.edu/plinkseq/ Will be implemented in PlinkSeq] <br />
|-<br />
| KBAC || [http://www.ncbi.nlm.nih.gov/pubmed/20976247 Liu & Leal, 2010] || <br />
| [http://atgu.mgh.harvard.edu/plinkseq/ Will be implemented in PlinkSeq] <br />
|-<br />
| VT* || [http://www.ncbi.nlm.nih.gov/pubmed/20471002 Price et al., 2010] <br />
| http://genetics.bwh.harvard.edu/rare_variants/ <br />
| Incorporating functional weights but not VT, [http://atgu.mgh.harvard.edu/plinkseq/ Will be implemented in PlinkSeq]<br />
|-<br />
| WSS || [http://www.ncbi.nlm.nih.gov/pubmed/19214210 Madsen & Browning, 2009] || <br />
| with 1% cutoff, [http://atgu.mgh.harvard.edu/plinkseq/ Will be implemented in PlinkSeq] <br />
|-<br />
| CMAT || [http://www.ncbi.nlm.nih.gov/pubmed/21070896 Zawistowski et al. 2010] || || <br />
|-<br />
| ANRV/GRANVIL* || [http://www.ncbi.nlm.nih.gov/pubmed/19810025 Morris & Zeggini] || || <br />
|-<br />
| RARECOVER || [http://www.ploscompbiol.org/article/info:doi/10.1371/journal.pcbi.1000954 Bhatia et al. 2010] || || <br />
|-<br />
| CCRaVAT and QuTie* || [http://www.ncbi.nlm.nih.gov/pubmed/20964851 Lawrence et al. 2010] <br />
| http://www.sanger.ac.uk/resources/software/rarevariant/ || <br />
|-<br />
| RVE (rare variant exclusive) || Cohen & Hobbs || <br />
| underpowered, [http://atgu.mgh.harvard.edu/plinkseq/ Will be implemented in PlinkSeq] <br />
|}<br />
<br />
<br />
<br />
'''2) Aggregate tests for protective and detrimental variants (recommend 1% cutoff)'''<br />
<br />
{| width="75%" cellspacing="0" cellpadding="5" border="1"<br />
|-<br />
! scope="col" align="left" | Test Name<br />
! scope="col" align="left" | Reference<br />
! scope="col" align="left" | Software<br />
! scope="col" align="left" | Notes <br />
|-<br />
| C-alpha || [Neale et al., submitted] || <br />
| [http://atgu.mgh.harvard.edu/plinkseq/ Will be implemented in PlinkSeq] <br />
|-<br />
| Ionita-Laza & Lange <br />
| [http://www.plosgenetics.org/article/info%3Adoi%2F10.1371%2Fjournal.pgen.1001289 Ionita-Laza & Lange, 2011] || ||<br />
|-<br />
| DASH* || [http://www.ncbi.nlm.nih.gov/pubmed/20413981 Han & Pan] || || Computational burden <br />
|-<br />
| SKAT* || [http://www.ncbi.nlm.nih.gov/pubmed/20560208 Wu et al., 2010] <br />
| http://www.hsph.harvard.edu/~xlin/software.html <br />
| For some kernel choices, need to code 0=major homozygote, 1=het, 2-minor homozygote <br />
|-<br />
| WHaIT || [http://www.ncbi.nlm.nih.gov/pubmed/21055717 Li et al. 2010] <br />
| http://csg.sph.umich.edu//yli/whait/ || <br />
|-<br />
| EMMPAT* <br />
| [http://www.ncbi.nlm.nih.gov/pmc/articles/PMC2978703/pdf/pgen.1001202.pdf King et al. 2010] <br />
| http://home.uchicago.edu/~crk8e/papersup.html || <br />
|}<br />
<br />
<br />
<br />
'''3) Analyzing common and rare variants together (could down-weight or threshold common variants)'''<br />
<br />
{| width="75%" cellspacing="0" cellpadding="5" border="1"<br />
|-<br />
! scope="col" align="left" | Test Name<br />
! scope="col" align="left" | Reference<br />
! scope="col" align="left" | Software<br />
! scope="col" align="left" | Notes<br />
|-<br />
| WSS || [http://www.ncbi.nlm.nih.gov/pubmed/19214210 Madsen & Browning, 2009] || <br />
| with 1% or 5% cutoff, [http://atgu.mgh.harvard.edu/plinkseq/ Will be implemented in PlinkSeq] <br />
|-<br />
| RARECOVER <br />
| [http://www.ploscompbiol.org/article/info:doi/10.1371/journal.pcbi.1000954 Bhatia et al. 2010] || || <br />
|-<br />
| Step-Up Collapsing* <br />
| [http://www.plosone.org/article/info:doi/10.1371/journal.pone.0013584 Hoffman et al. 2010] || <br />
| [http://atgu.mgh.harvard.edu/plinkseq/ Will be implemented in PlinkSeq] <br />
|-<br />
| CMC/T5 test* || [http://www.ncbi.nlm.nih.gov/pubmed/18691683 Li & Leal, 2008] || <br />
| [http://atgu.mgh.harvard.edu/plinkseq/ Will be implemented in PlinkSeq] <br />
|-<br />
| MENDEL* || [http://www.ncbi.nlm.nih.gov/pubmed/21121038 Zhou et al. 2011] <br />
| http://www.genetics.ucla.edu/software/download?package=1 || <br />
|}<br />
<br />
<br />
<br />
'''4) Analyze higher-frequency rare variants (>1%) individually'''<br />
Use the same regression framework that has been used for common variants*<br />
Use meta-analysis to combine results from sequence data and imputed genotypes to increase power*<br />
<br />
'''Additional tests'''<br />
<br />
{| width="75%" cellspacing="0" cellpadding="5" border="1"<br />
|-<br />
! scope="col" align="left" | Test Name<br />
! scope="col" align="left" | Reference<br />
! scope="col" align="left" | Software<br />
! scope="col" align="left" | Notes <br />
|-<br />
| Logic regression* || [http://kooperberg.fhcrc.org/papers/2001gaw.pdf Kooperberg et al. 2001] || || <br />
|-<br />
| Sequence diversity || Anderson et al. 2006 || || <br />
|-<br />
| Sequence dissimilarity* || Schork et al. 2008, Wessel et al. 2006 || || <br />
|-<br />
| Ridge regression * || [http://www.cell.com/AJHG/abstract/S0002-9297(08)00091-8 Malo et al. 2008] || || <br />
|}</div>
<hr />
<div>This page documents the splitRef program, which splits a reference haplotype file into smaller files with subsets of markers.<br />
<br />
== Input Files ==<br />
=== Required Input Files ===<br />
==== Haplotype file (.hap) ====<br />
File fed to the -hap option. One line per haplotype, with the last field containing the actual alleles, with no separators between alleles. <br><br />
<br />
==== Marker list (.snps) file ====<br />
File fed to -snps option. One line for each marker: marker name only. <br><br />
<br />
=== Optional Input Files ===<br />
==== Map file ====<br />
File fed to -map option, containing chromosome, marker name, and marker coordinate (in base pairs) information for each marker. Markers should be stored in the same order as in the marker information file. <br><br />
<br />
== Options ==<br />
=== Required options ===<br />
==== window size ====<br />
Window size can be specified by one of the following three options: (1) -nWindows (2) -windowSize and (3) -windowLength. <br><br />
-nWindows specifies the number of windows to split into and the program splits markers evenly into output windows. <br><br />
-windowSize specifies the number of markers in one output window. The remainder goes to the last window. <br><br />
-windowLength specifies the length (in base pairs) of one output window. The remainder goes to the last window. Note that this option is only allowed when map input file is specified. <br><br />
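As a rough sketch of the -windowSize behavior described above (this is one plausible reading of "the remainder goes to the last window", not splitRef source code):<br />

```python
# Hypothetical sketch of -windowSize allocation: markers are chunked into
# fixed-size windows, and a short trailing remainder is folded into the
# last window rather than forming a tiny window of its own.
def split_by_window_size(markers, window_size):
    windows = [markers[i:i + window_size]
               for i in range(0, len(markers), window_size)]
    if len(windows) > 1 and len(windows[-1]) < window_size:
        windows[-2].extend(windows.pop())  # remainder goes to the last window
    return windows
```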
<br />
==== flanking region ====<br />
Size of flanking region on each side can be specified by one of the following two options: (1) -overlapSize and (2) -overlapLength. <br><br />
-overlapSize specifies the number of markers in each flanking region (so that the total number of flanking markers for each window is twice the number specified except for the first and last window). <br><br />
-overlapLength specifies the length (in base pairs) of each flanking region (so that the total length of the flanking regions is twice the number specified except for the first and last window). <br><br />
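The flanking behavior described above can be sketched as follows (an assumption about the logic, not splitRef source code): each window borrows up to the requested number of markers from each neighboring window, so interior windows gain two flanks while the first and last windows gain only one.<br />

```python
# Hypothetical sketch of -overlapSize flanking applied to already-split windows.
def add_flanks(windows, overlap_size):
    flanked = []
    for i, w in enumerate(windows):
        left = windows[i - 1][-overlap_size:] if i > 0 else []
        right = windows[i + 1][:overlap_size] if i + 1 < len(windows) else []
        flanked.append(left + w + right)
    return flanked
```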
<br />
==== Output prefix ====<br />
Specified by -o option. <br><br />
<br />
=== Additional options ===<br />
==== Estimate window size only ====<br />
This is controlled by the -extimateWindowOnly option. By default, splitting is performed. <br><br />
If one only wishes to preview how the markers would be allocated into output windows, use "-extimateWindowOnly 1". <br><br />
<br />
== Example Commands ==<br />
splitRef.pl -hap example.hap.gz -snps example.snps -map example.map -windowLength 10000000 -overlapLength 1000000 -o split <br />
splitRef.pl -hap example.hap.gz -snps example.snps -windowSize 10000 -overlapSize 1000 -o split <br />
splitRef.pl -hap example.hap.gz -snps example.snps -nWindows 12 -overlapSize 1000 -o split <br />
<br />
== Download ==<br />
You can download splitRef at [http://csg.sph.umich.edu//yli/splitRef/download/ splitRef Download Page].<br />
<br />
== Questions and Comments? ==<br />
<br />
Email [mailto:yunli@med.unc.edu Yun Li].</div>
<hr />
<div>This page documents the splitPed program, which splits a pedigree file into smaller files with subsets of markers.<br />
<br />
== Input Files ==<br />
=== Required Input Files ===<br />
==== Pedigree file (.ped) ====<br />
File fed to the -ped option: a [[Merlin]]-format pedigree file. For details of the Merlin file format, see the Merlin tutorial [http://csg.sph.umich.edu//abecasis/Merlin/tour/input_files.html]. <br><br />
Within each file, markers should be stored by chromosome position. Alleles should be stored on the forward strand and can be encoded as 'A', 'C', 'G' or 'T' (there is no need to use numeric identifiers for each allele). <br><br />
<br />
==== Marker information (.dat) file ====<br />
File fed to -dat option, in [[Merlin]] format marker information file. For details of the Merlin file format, see the Merlin tutorial [http://csg.sph.umich.edu//abecasis/Merlin/tour/input_files.html]. <br><br />
<br />
=== Optional Input Files ===<br />
==== Map file ====<br />
File fed to -map option, containing chromosome, marker name, and marker coordinate (in base pairs) information for each marker. Markers should be stored in the same order as in the marker information file. <br><br />
<br />
== Options ==<br />
=== Required options ===<br />
==== window size ====<br />
Window size can be specified by one of the following three options: (1) -nWindows (2) -windowSize and (3) -windowLength. <br><br />
-nWindows specifies the number of windows to split into and the program splits markers evenly into output windows. <br><br />
-windowSize specifies the number of markers in one output window. The remainder goes to the last window. <br><br />
-windowLength specifies the length (in base pairs) of one output window. The remainder goes to the last window. Note that this option is only allowed when map input file is specified. <br><br />
<br />
==== flanking region ====<br />
Size of flanking region on each side can be specified by one of the following two options: (1) -overlapSize and (2) -overlapLength. <br><br />
-overlapSize specifies the number of markers in each flanking region (so that the total number of flanking markers for each window is twice the number specified except for the first and last window). <br><br />
-overlapLength specifies the length (in base pairs) of each flanking region (so that the total length of the flanking regions is twice the number specified except for the first and last window). <br><br />
<br />
==== Output prefix ====<br />
Specified by -o option. <br><br />
<br />
=== Additional options ===<br />
==== Split original pedigree file? ====<br />
This is controlled by -splitPed option. By default, all the output marker information (.dat) files share the same input pedigree (.ped) file and NO output pedigree (.ped) file is generated. <br><br />
If one wants separate .ped and .dat files for each output window, use "-splitPed 1". <br><br />
<br />
==== Estimate window size only ====<br />
This is controlled by the -extimateWindowOnly option. By default, splitting is performed. <br><br />
If one only wishes to preview how the markers would be allocated into output windows, use "-extimateWindowOnly 1". <br><br />
<br />
== Example Commands ==<br />
splitPed.pl -ped example.ped -dat example.dat -map example.map -windowLength 10000000 -overlapLength 1000000 -o split<br />
splitPed.pl -ped example.ped -dat example.dat -map example.map -windowLength 10000000 -overlapLength 1000000 -splitPed 1 -o split.with_ped<br />
splitPed.pl -ped example.ped -dat example.dat -windowSize 10000 -overlapSize 1000 -o split<br />
splitPed.pl -ped example.ped -dat example.dat -nWindows 12 -overlapSize 1000 -o split<br />
<br />
== Download ==<br />
You can download splitPed at [http://csg.sph.umich.edu//yli/splitPed/download/ splitPed Download Page].<br />
<br />
== Questions and Comments? ==<br />
<br />
Email [mailto:yunli@med.unc.edu Yun Li].</div>
<hr />
<div>CalcMatch is a C/C++ program developed by [https://csg.sph.umich.edu//yli/ Yun Li]. It compares two sets of pedigree files. It was initially written to compare imputed genotypes with their true/experimental counterparts, but it can be used to compare the concordance between any two sets of pedigree files. The input data are in standard Merlin/QTDT format (http://csg.sph.umich.edu//abecasis/Merlin/tour/input_files.html). <br />
<br />
= Options =<br />
== --impped --impdat <br> ==<br />
specify one input pedigree set. <br />
<br />
== --trueped --truedat <br> ==<br />
specify the other input pedigree set.<br />
<br />
== --match == <br />
generates a matrix taking values 0,1,2 indicating # of matched alleles. The dimension of the matrix is # of overlapping individuals times # of overlapping markers of the two input pedigree sets. <br />
<br />
== --bySNP == <br />
is turned on by default (note: if you put --bySNP on the command line, it will be turned OFF!) and generates SNP-specific measures. The output .bySNP file will contain the following 6 fields for each SNP: <br />
<br />
(1) SNP&nbsp;: SNP name<br />
(2) gErr&nbsp;: genotypic discordance rate<br />
(3) aErr&nbsp;: allelic discordance rate<br />
(4) matchedG&nbsp;: number of genotypes matched<br />
(5) matchedA: number of alleles matched<br />
(6) maskedG: total number of genotypes evaluated/masked (&lt;=n of course) (I should change the naming to comparedG or evaluatedG)<br />
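The per-SNP quantities above can be sketched as follows. This is an illustrative reimplementation under stated assumptions (genotypes as unordered allele pairs, "0" marking a missing allele), not the CalcMatch source:<br />

```python
from collections import Counter

# Hypothetical sketch of gErr/aErr/matchedG/matchedA/maskedG for one SNP.
def snp_concordance(true_genos, imp_genos):
    matched_g = matched_a = compared_g = 0
    for t, p in zip(true_genos, imp_genos):
        if "0" in t or "0" in p:      # skip genotypes missing in either set
            continue
        compared_g += 1
        shared = sum((Counter(t) & Counter(p)).values())  # 0, 1, or 2 matched alleles
        matched_a += shared
        matched_g += shared == 2      # a genotype matches only if both alleles match
    g_err = 1 - matched_g / compared_g        # genotypic discordance (gErr)
    a_err = 1 - matched_a / (2 * compared_g)  # allelic discordance (aErr)
    return g_err, a_err, matched_g, matched_a, compared_g
```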
<br />
<br> <br />
<br />
== --byGeno ==<br />
NOTE: this option is turned on by default. If you put --byGeno on the command line, it will be turned OFF!<br />
It can be added on top of --bySNP and will generate the following fields after the 6 fields above: <br />
<br />
(7) hetAerr : allelic discordance rate among heterozygotes<br />
(8) AL1: allele 1 (an arbitrary allele)<br />
(9) AL2: allele 2<br />
(10) freq1: frequency of AL1<br />
(11) MAF<br />
(12) #true 1/1: # individuals with experimental genotype AL1/AL1<br />
(13) mm1/2: # of true AL1/AL1 being imputed as AL1/AL2<br />
(14) mm2/2: # of true AL1/AL1 being imputed as AL2/AL2<br />
(15) #true 1/2<br />
(16) mm1/1<br />
(17) mm2/2<br />
(18) #true 2/2<br />
(19) mm1/1<br />
(20) mm1/2<br />
<br />
<br />
<br />
<br><br />
<br />
== --accuracyByGeno ==<br />
Similar to --byGeno, it is used on top of --bySNP and may be used together with --byGeno. It will generate the following fields, after fields (7)-(20) if --byGeno is turned on, or after the 6th field otherwise. <br />
<br />
(A) almajor: major allele<br />
(B) alminor: minor allele<br />
(C) freq1: major allele frequency<br />
(D) accuracy11: allelic concordance rate for major-allele homozygotes<br />
(E) accuracy12: allelic concordance rate for heterozygotes<br />
(F) accuracy22: allelic concordance rate for minor-allele homozygotes<br />
<br />
<br> <br />
== --byPerson ==<br />
generates a separate output file .byPerson and contains the following information for each person: <br />
<br />
(1) famid<br />
(2) subjID<br />
(3) gErr<br />
(4) aErr<br />
(5) matchedG<br />
(6) matchedA<br />
(7) maskedG<br />
<br />
<br> This --byPerson option is useful if there is a potential sample swap or inter-individual differences, e.g., in sequencing depth or in the number of markers genotyped. <br />
<br />
<br> <br />
<br />
== --maskflag --maskped --maskdat ==<br />
CalcMatch compares all genotypes overlapping between the two input sets. However, when --maskflag is turned on AND --maskped and --maskdat are specified (I know ...), it compares only the following subset of the overlapping genotypes: genotypes either not found (i.e., individual or marker not included) or missing (included but with value 0/0, N/N, ./. etc.) in --maskped / --maskdat. These options are useful when some individuals were masked for some SNPs while others were masked for a different set of SNPs.<br />
<br />
= output files =<br />
== .bySNP ==<br />
See option --bySNP <br><br />
<br />
== .byPerson ==<br />
See option --byPerson <br><br />
<br />
== .minusstrand ==<br />
Reports the list of SNPs that appear on the minus strand (that is, SNPs for which more than two alleles are seen when combining the imputed and true pedigree files). This file will only be generated if --byGeno or --accuracyByGeno is turned on; the former option --byGeno is turned on by default. <br><br />
<br />
= example command lines =<br />
<br />
CalcMatch --trueped true.ped --truedat true.dat --impped imp.ped --impdat imp.dat -o CalcMatch.Output --byPerson <br />
<br />
Will generate CalcMatch.Output.bySNP (6 fields only) and CalcMatch.Output.byPerson.<br />
<br />
CalcMatch --trueped true.ped --truedat true.dat --impped imp.ped --impdat imp.dat -o CalcMatch.Output --byGeno --byPerson <br />
<br />
Will generate CalcMatch.Output.bySNP (6+20 fields) and CalcMatch.Output.byPerson.<br />
<br />
CalcMatch --trueped true.ped --truedat true.dat --impped imp.ped --impdat imp.dat -o CalcMatch.Output --accuracyByGeno --byPerson <br />
<br />
Will generate CalcMatch.Output.bySNP (6+6 fields only) and CalcMatch.Output.byPerson.<br />
<br />
CalcMatch --trueped true.ped --truedat true.dat --impped imp.ped --impdat imp.dat -o CalcMatch.Output --accuracyByGeno --byGeno --byPerson <br />
<br />
Will generate CalcMatch.Output.bySNP (6+20+6 fields only) and CalcMatch.Output.byPerson.<br />
<br />
= Download =<br />
Please go to http://www.sph.umich.edu/csg/yli/software.html</div>Ppwhitehttp://genome.sph.umich.edu/w/index.php?title=MaCH:_Input_Files&diff=14636MaCH: Input Files2017-02-02T15:31:07Z<p>Ppwhite: /* Optional Phased Haplotypes */</p>
<hr />
<div>An earlier version of this page is available at http://csg.sph.umich.edu//abecasis/MaCH/tour/input_files.html.<br />
<br />
MACH input files include information on experimental genotypes for a set of individuals and, optionally, on a set of known haplotypes. MACH can use these to estimate haplotypes for each sampled individual (conditional on the observed genotypes) or to fill in missing genotypes (conditional on observed genotypes at flanking markers and on the observed genotypes at other individuals). Since an essential first step in any analysis is to make sure data is formatted correctly, it is worthwhile to go over the input files MACH expects and their formats.<br />
<br />
== Observed Genotypes ==<br />
The essential inputs for MACH are a set of observed genotypes for each individual being studied. Typically, MACH expects that all the markers being examined map to one chromosome and appear in map order in the input files. These requirements can be relaxed when using phased haplotypes as input (see below).<br />
<br />
MACH expects observed genotype data to be stored in a set of matched pedigree and data files. The two files are intrinsically linked: the data file describes the contents of the pedigree file (every pedigree file is slightly different), and the pedigree file itself can only be decoded with its companion data file. The two files can use either the more modern [[Merlin]] / [[QTDT]] format or the classic [[LINKAGE]] format. Detailed descriptions of each format are available elsewhere (for example, see [http://csg.sph.umich.edu//abecasis/Merlin/tour/input_files.html details of Merlin input formats]), and here we focus on providing an overview of the bare essentials required for using MACH.<br />
<br />
Data files can describe a variety of fields, including disease status information, quantitative traits and covariates, and marker genotypes. A minimal MACH data file simply lists names for a series of genetic markers. Each marker name appears on its own line, prefaced by an " M " field code. Here is an example:<br />
<br />
'''<Example of a simple data file>'''<br />
M marker1<br />
M marker2<br />
...<br />
'''<End of simple data file>'''<br />
<br />
The actual genotypes are stored in a pedigree file. The pedigree file encodes one individual per row. Each row should start with a family id and individual id, followed by a father and mother id (which typically are both set to 0, 'zero', for unrelated individuals), and sex. These initial columns are followed by a series of marker genotypes, each with two alleles. We recommend that the alleles be coded as A, C, G, T. For compatibility with older analysis tools, it is also possible to encode alleles as 1 (for A), 2 (for C), 3 (for G) and 4 (for T). See below for an example:<br />
<br />
'''<Example of a pedigree file with base-pair coded alleles>'''<br />
FAM1001 ID1234 0 0 M A A A C C C<br />
FAM1002 ID5678 0 0 F A C C C G G<br />
...<br />
'''<End of pedigree file>'''<br />
<br />
Some people prefer to use a "/" to separate alleles, as it makes the pedigree easier to read. Thus, the following pedigree is equivalent:<br />
<br />
'''<Example of a pedigree file with base-pair coded alleles>'''<br />
FAM1001 ID1234 0 0 M A/A A/C C/C<br />
FAM1002 ID5678 0 0 F A/C C/C G/G<br />
...<br />
'''<End of pedigree file>'''<br />
<br />
Missing genotypes can be encoded with a '.', "dot", or a '0', "zero". For example, here are two individuals that are missing the first genotype:<br />
<br />
'''<Example of a pedigree file with base-pair coded alleles>'''<br />
FAM1003 ID1234 0 0 M ./. A/C C/C<br />
FAM1004 ID5678 0 0 F 0/0 C/C G/G<br />
...<br />
'''<End of pedigree file>'''<br />
<br />
<br />
Although we don't recommend it, it is possible to use a pedigree file with numerically coded alleles. For an example, see [[MaCH: Pedigree with Integer Allele Codes|obsolete input formats]].<br />
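The pedigree layout described above can be illustrated with a small parser. This is a hypothetical sketch under the assumptions stated in the text (whitespace-separated fields, optional "/" allele separator), not part of MaCH itself:<br />

```python
# Hypothetical parser for one pedigree line: family id, individual id,
# father, mother, sex, then marker genotypes ("A C" pairs or "A/C" tokens).
def parse_ped_line(line):
    famid, pid, father, mother, sex, *rest = line.split()
    if any("/" in tok for tok in rest):
        genos = [tuple(tok.split("/")) for tok in rest]  # "A/C" style
    else:
        genos = list(zip(rest[0::2], rest[1::2]))        # "A C" style
    return famid, pid, genos
```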
<br />
In the MACH command line, the name of the data and pedigree files is indicated with the -d and -p options (in short hand form) or the --datfile and --pedfile options (in long form) respectively. <br />
<br />
For example: <br />
<br />
mach -d genotypes.dat -p genotypes.ped<br />
<br />
Or:<br />
<br />
mach --datfile genotypes.dat --pedfile genotypes.ped<br />
<br />
== Optional Phased Haplotypes ==<br />
<br />
For many analyses, but in particular for genotype imputation, it can be very helpful to provide a set of reference haplotypes as input. Reference haplotypes can include genotypes for markers that were not examined in your own sample but which can often be imputed based on genotypes at flanking markers. Most commonly, these haplotypes might be derived from a public resource such as the International HapMap Project and, eventually, the 1000 Genomes Project. <br />
<br />
You can retrieve a current set of phased HapMap format haplotypes from http://hapmap.org/downloads/phasing/2007-08_rel22/phased/. <br />
<br />
HapMap III phased haplotypes are in a different format; you will need to use our converted haplotypes available at http://csg.sph.umich.edu//yli/mach/download/HapMap3.r2.b36.html<br />
<br />
Additional reference files (e.g., those based on data from the 1000 Genomes Project; combined reference files) can be found through links at http://csg.sph.umich.edu//yli/mach/download/<br />
<br />
Phased haplotype information is encoded in two files. The first file (which MACH calls the "snp file") lists the markers in the phased haplotypes. The second file (which MACH calls the "haplotype file") lists one haplotype per line. If you retrieved these files from the HapMap website, simply combine the --hapmapFormat option with the --snps option to indicate the name of the HapMap legend file and the --haps option to indicate the name of the file with phased haplotypes. Here is an example:<br />
<br />
prompt> mach1 --hapmapFormat --snps genotypes_chr1_CEU_r22_nr.b36_fwd_legend.txt.gz --haps genotypes_chr1_CEU_r22_nr.b36_fwd.phase.gz ...<br />
<br />
If you don't use the --hapmapFormat option, MACH expects the snp file (indicated with the --snps option) to simply list one marker name per line and the haplotype file (indicated with the --haps option) to list one haplotype per line. Haplotypes can be prefaced by one or two optional labels followed by a series of single-character alleles, one for each marker. Within each haplotype, spaces are ignored. Here are two examples:<br />
<br />
'''<Example of a snp list file>'''<br />
marker1<br />
marker2<br />
...<br />
marker13<br />
'''<End of snp list file>'''<br />
<br />
In the sample haplotype file below, note that the first two columns are automatically ignored (because, based on the snp list file, MACH knows the phased haplotypes should include only 13 markers, corresponding to the last string of characters on each line). <br />
<br />
'''<Example of a phased haplotype file>'''<br />
FAMILY1->PERSON1 HAPLO1 CGGCGCGCTTGGC<br />
FAMILY1->PERSON1 HAPLO2 CGGCGCGTCCAGC<br />
FAMILY2->PERSON1 HAPLO1 GGGCGCGCTTGGC<br />
FAMILY2->PERSON1 HAPLO2 GGAAGCACTCGGC<br />
...<br />
'''<End of phased haplotype file>'''<br />
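The label-skipping behavior described above can be sketched in a few lines. This is an illustrative assumption about the parsing logic (the function name is hypothetical), not MaCH code:<br />

```python
# Hypothetical sketch: recover the allele string from one haplotype line, given
# the marker count from the snp file. Leading labels and any spaces within the
# haplotype are ignored; the last n_markers characters are the alleles.
def haplotype_alleles(line, n_markers):
    compact = "".join(line.split())
    return compact[-n_markers:]
```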
<br />
If you provide MACH a set of reference haplotypes as input, the marker order in the phased haplotypes overrides any marker order that may be specified in the pedigree and data files that contain the genotype data. This means that one convenient way to re-order markers in your original pedigree and data file is to simply create an empty haplotype file and a companion snp list file that lists markers in the desired order. When you provide these two files as input, they'll override the marker order specified in the data file.<br />
<br />
== Saving Disk Space ==<br />
<br />
'''Useful Tip:''' You can usually economize disk space by using gzip to compress your input files (the data and pedigree files and any files containing the reference haplotypes). MACH can automatically recognize gzipped files and decompress them on the fly.<br />
<br />
That is all you should need to get started!</div>Ppwhitehttp://genome.sph.umich.edu/w/index.php?title=MaCH:_Input_Files&diff=14635MaCH: Input Files2017-02-02T15:30:29Z<p>Ppwhite: /* Optional Phased Haplotypes */</p>
<hr />
<div>An earlier version of this page is available at http://csg.sph.umich.edu//abecasis/MaCH/tour/input_files.html.<br />
<br />
MACH input files include information on experimental genotypes for a set of individuals and, optionally, on a set of known haplotypes. MACH can use these to estimate haplotypes for each sampled individual (conditional on the observed genotypes) or to fill in missing genotypes (conditional on observed genotypes at flanking markers and on the observed genotypes at other individuals). Since an essential first step in any analysis is to make sure data is formatted correctly, it is worthwhile to go over the input files MACH expects and their formats.<br />
<br />
== Observed Genotypes ==<br />
The essential inputs for MACH are a set of observed genotypes for each individual being studied. Typically, MACH expects that all the markers being examined map to one chromosome and that appear in map order in the input files. These requirements can be relaxed when using phased haplotypes as input (see below).<br />
<br />
MACH expects observed genotype data to be stored in a set of matched pedigree and data files. The two files are intrinsically linked, the data file describes the contents of the pedigree file (every pedigree file is slightly different) and the pedigree file itself can only be decoded with its companion data file. The two files can use either the more modern [[Merlin]] / [[QTDT]] format or the classic [[LINKAGE]] format. Detailed descriptions of each format are available elsewhere (for example, see [http://csg.sph.umich.edu//abecasis/Merlin/tour/input_files.html details of Merlin input formats]), and here we focus on providing an overview of the bare essentials required for using MACH.<br />
<br />
Data files can describe a variety of fields, including disease status information, quantitative traits and covariates, and marker genotypes. A simple MACH data file simply lists names for a series of genetic markers. Each marker name appears its own line prefaced by an " M " field code. Here is an example:<br />
<br />
'''<Example of a simple data file>'''<br />
M marker1<br />
M marker2<br />
...<br />
'''<End of simple data file>'''<br />
<br />
The actual genotypes are stored in a pedigree file. The pedigree file encodes one individual per row. Each row should start with an family id and individual id, followed by a father and mother id (which typically are both set to 0, 'zero', for unrelated individuals), and sex. These initial columns are followed by a series of marker genotypes, each with two alleles. We recommend that the alleles should be coded as A, C, G, T. For compatibility with older analysis tools, it is also possible to encode allels as 1 (for A), 2 (for C), 3 (for G) and 4 (for T). See below for an example:<br />
<br />
'''<Example of a pedigree file with base-pair coded alleles>'''<br />
FAM1001 ID1234 0 0 M A A A C C C<br />
FAM1002 ID5678 0 0 F A C C C G G<br />
...<br />
'''<End of pedigree file>'''<br />
<br />
Some people prefer to use a "/" to separate alleles, as it makes the pedigree easier to read. Thus, the following pedigree is equivalent:<br />
<br />
'''<Example of a pedigree file with base-pair coded alleles>'''<br />
FAM1001 ID1234 0 0 M A/A A/C C/C<br />
FAM1002 ID5678 0 0 F A/C C/C G/G<br />
...<br />
'''<End of pedigree file>'''<br />
<br />
Missing genotypes can be encoded with a '.', "dot", or a '0', "zero". For example, here are two individuals that are missing the first genotype:<br />
<br />
'''<Example of a pedigree file with base-pair coded alleles>'''<br />
FAM1003 ID1234 0 0 M ./. A/C C/C<br />
FAM1004 ID5678 0 0 F 0/0 C/C G/G<br />
...<br />
'''<End of pedigree file>'''<br />
<br />
<br />
Although we don't recommend it, it is possible to use a pedigree file with numerically coded alleles. For an example, see [[MaCH: Pedigree with Integer Allele Codes|obsolete input formats]].<br />
<br />
In the MACH command line, the name of the data and pedigree files is indicated with the -d and -p options (in short hand form) or the --datfile and --pedfile options (in long form) respectively. <br />
<br />
For example: <br />
<br />
mach -d genotypes.dat -p genotypes.ped<br />
<br />
Or:<br />
<br />
mach --datfile genotypes.dat --pedfile genotypes.ped<br />
<br />
== Optional Phased Haplotypes ==<br />
<br />
For many analyses, but in particular for genotype imputation, it can be very helpful to provide a set of reference haplotypes as input. Reference haplotypes can include genotypes for markers that were not examined in your own sample but which can, often, be inputed based on genotypes at flanking markers. Most commonly, these haplotypes might be derived from a public resource such as the International HapMap Project and, eventually, the 1000 Genomes Project. <br />
<br />
You can retrieve a current set of phased HapMap format haplotypes from http://hapmap.org/downloads/phasing/2007-08_rel22/phased/. <br />
<br />
HapMap III phased haplotypes are in different format, you will need to use our converted haplotypes available at http://www.sph.umich.edu/csg/yli/mach/download/HapMap3.r2.b36.html<br />
<br />
Additional reference files (e.g., those based on data from the 1000 Genomes Project; combined reference files) can be found through links at http://csg.sph.umich.edu//yli/mach/download/<br />
<br />
Phased haplotype information is encoded in two files. The first file (which MACH calls the "snp file") lists the markers in the phased haplotypes. The second file (which MACH calls the "haplotype file") lists one haplotype per line. If you retrieved these files from the HapMap website, simply combine the --hapmapFormat option with the --snps option to indicate the name of the HapMap legend file and the --haps option to indicate the name of the file with phased haplotypes. Here is an example:<br />
<br />
prompt> mach1 --hapmapFormat --snps genotypes_chr1_CEU_r22_nr.b36_fwd_legend.txt.gz --haps genotypes_chr1_CEU_r22_nr.b36_fwd.phase.gz ...<br />
<br />
If you don't use the --hapmapFormat option, MACH expects the snp file (indicated with the --snps option) to simply list one marker name per line and the haplotype file (indicated with the --haps option) to list one haplotype per line. Haplotypes can be prefaced by one or two optional labels, followed by a series of single-character alleles, one for each marker. Within each haplotype, spaces are ignored. Here are two examples:<br />
<br />
'''<Example of a snp list file>'''<br />
marker1<br />
marker2<br />
...<br />
marker13<br />
'''<End of snp list file>'''<br />
<br />
In the sample haplotype file below, note that the first two columns are automatically ignored (because, based on the snp list file, MACH knows the phased haplotypes should include only 13 markers, corresponding to the last string of characters on each line). <br />
<br />
'''<Example of a phased haplotype file>'''<br />
FAMILY1->PERSON1 HAPLO1 CGGCGCGCTTGGC<br />
FAMILY1->PERSON1 HAPLO2 CGGCGCGTCCAGC<br />
FAMILY2->PERSON1 HAPLO1 GGGCGCGCTTGGC<br />
FAMILY2->PERSON1 HAPLO2 GGAAGCACTCGGC<br />
...<br />
'''<End of phased haplotype file>'''<br />
<br />
If you provide MACH with a set of reference haplotypes as input, the marker order in the phased haplotypes overrides any marker order that may be specified in the pedigree and data files that contain the genotype data. This means that one convenient way to re-order markers in your original pedigree and data files is to create an empty haplotype file and a companion snp list file that lists markers in the desired order. When you provide these two as input, they override the marker order specified in the data file.<br />
<br />
== Saving Disk Space ==<br />
<br />
'''Useful Tip:''' You can usually economize disk space by using gzip to compress your input files (the data and pedigree files and any files containing the reference haplotypes). MACH can automatically recognize gzipped files and decompress them on the fly.<br />
<br />
That is all you should need to get started!</div>Ppwhitehttp://genome.sph.umich.edu/w/index.php?title=MaCH:_Input_Files&diff=14634MaCH: Input Files2017-02-02T15:29:41Z<p>Ppwhite: </p>
<hr />
<div>An earlier version of this page is available at http://csg.sph.umich.edu//abecasis/MaCH/tour/input_files.html.<br />
<br />
MACH input files include information on experimental genotypes for a set of individuals and, optionally, on a set of known haplotypes. MACH can use these to estimate haplotypes for each sampled individual (conditional on the observed genotypes) or to fill in missing genotypes (conditional on observed genotypes at flanking markers and on the genotypes observed in other individuals). Since an essential first step in any analysis is to make sure data is formatted correctly, it is worthwhile to go over the input files MACH expects and their formats.<br />
<br />
== Observed Genotypes ==<br />
The essential inputs for MACH are a set of observed genotypes for each individual being studied. Typically, MACH expects that all the markers being examined map to one chromosome and that they appear in map order in the input files. These requirements can be relaxed when using phased haplotypes as input (see below).<br />
<br />
MACH expects observed genotype data to be stored in a set of matched pedigree and data files. The two files are intrinsically linked: the data file describes the contents of the pedigree file (every pedigree file is slightly different), and the pedigree file itself can only be decoded with its companion data file. The two files can use either the more modern [[Merlin]] / [[QTDT]] format or the classic [[LINKAGE]] format. Detailed descriptions of each format are available elsewhere (for example, see [http://csg.sph.umich.edu//abecasis/Merlin/tour/input_files.html details of Merlin input formats]), and here we focus on providing an overview of the bare essentials required for using MACH.<br />
<br />
Data files can describe a variety of fields, including disease status information, quantitative traits and covariates, and marker genotypes. A simple MACH data file just lists names for a series of genetic markers. Each marker name appears on its own line, prefaced by an "M" field code. Here is an example:<br />
<br />
'''<Example of a simple data file>'''<br />
M marker1<br />
M marker2<br />
...<br />
'''<End of simple data file>'''<br />
<br />
The actual genotypes are stored in a pedigree file. The pedigree file encodes one individual per row. Each row should start with a family id and individual id, followed by a father and mother id (which are typically both set to 0, 'zero', for unrelated individuals), and sex. These initial columns are followed by a series of marker genotypes, each with two alleles. We recommend that the alleles be coded as A, C, G, T. For compatibility with older analysis tools, it is also possible to encode alleles as 1 (for A), 2 (for C), 3 (for G) and 4 (for T). See below for an example:<br />
<br />
'''<Example of a pedigree file with base-pair coded alleles>'''<br />
FAM1001 ID1234 0 0 M A A A C C C<br />
FAM1002 ID5678 0 0 F A C C C G G<br />
...<br />
'''<End of pedigree file>'''<br />
<br />
Some people prefer to use a "/" to separate alleles, as it makes the pedigree easier to read. Thus, the following pedigree is equivalent:<br />
<br />
'''<Example of a pedigree file with base-pair coded alleles>'''<br />
FAM1001 ID1234 0 0 M A/A A/C C/C<br />
FAM1002 ID5678 0 0 F A/C C/C G/G<br />
...<br />
'''<End of pedigree file>'''<br />
<br />
Missing genotypes can be encoded with a '.' ("dot") or a '0' ("zero"). For example, here are two individuals that are missing the first genotype:<br />
<br />
'''<Example of a pedigree file with base-pair coded alleles>'''<br />
FAM1003 ID1234 0 0 M ./. A/C C/C<br />
FAM1004 ID5678 0 0 F 0/0 C/C G/G<br />
...<br />
'''<End of pedigree file>'''<br />
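The matched data/pedigree layout above can be sketched in code. The following Python fragment is illustrative only (the marker and sample names are made up, and MACH itself does not use this code); it shows how the marker list from the data file drives decoding of the pedigree rows, for both slash-separated and space-separated alleles, with '.' and '0' treated as missing:<br />

```python
# Sketch: parse matched MACH-style data/pedigree content.
# Marker and sample names below are illustrative, not taken from MACH itself.

def parse_datfile(lines):
    """Collect marker names from 'M <name>' lines of a data file."""
    return [line.split()[1] for line in lines if line.split()[:1] == ["M"]]

def parse_pedfile(lines, markers):
    """Return (family, person, genotypes) per row; '.'/'0' alleles -> None."""
    records = []
    for line in lines:
        fields = line.split()
        if not fields:
            continue
        fam, person, rest = fields[0], fields[1], fields[5:]
        if rest and "/" in rest[0]:
            # 'A/C' style: one column per marker
            pairs = [tuple(g.split("/")) for g in rest[:len(markers)]]
        else:
            # 'A C' style: two columns per marker
            pairs = [(rest[2 * i], rest[2 * i + 1]) for i in range(len(markers))]
        clean = [tuple(None if a in (".", "0") else a for a in p) for p in pairs]
        records.append((fam, person, clean))
    return records

markers = parse_datfile(["M marker1", "M marker2", "M marker3"])
rows = parse_pedfile(["FAM1001 ID1234 0 0 M A/A A/C C/C",
                      "FAM1002 ID5678 0 0 F 0/0 C/C G/G"], markers)
```

Note that the parser needs the marker count from the data file to know where the genotype columns end, which is one reason the two files can only be interpreted together.<br />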
<br />
<br />
Although we don't recommend it, it is possible to use a pedigree file with numerically coded alleles. For an example, see [[MaCH: Pedigree with Integer Allele Codes|obsolete input formats]].<br />
<br />
In the MACH command line, the names of the data and pedigree files are indicated with the -d and -p options (in shorthand form) or the --datfile and --pedfile options (in long form), respectively. <br />
<br />
For example: <br />
<br />
mach -d genotypes.dat -p genotypes.ped<br />
<br />
Or:<br />
<br />
mach --datfile genotypes.dat --pedfile genotypes.ped<br />
<br />
== Optional Phased Haplotypes ==<br />
<br />
For many analyses, but in particular for genotype imputation, it can be very helpful to provide a set of reference haplotypes as input. Reference haplotypes can include genotypes for markers that were not examined in your own sample but which can often be imputed based on genotypes at flanking markers. Most commonly, these haplotypes might be derived from a public resource such as the International HapMap Project and, eventually, the 1000 Genomes Project. <br />
<br />
You can retrieve a current set of phased HapMap format haplotypes from http://hapmap.org/downloads/phasing/2007-08_rel22/phased/. <br />
<br />
HapMap III phased haplotypes are in a different format; you will need to use our converted haplotypes available at http://www.sph.umich.edu/csg/yli/mach/download/HapMap3.r2.b36.html<br />
<br />
Additional reference files (e.g., those based on data from the 1000 Genomes Project; combined reference files) can be found through links at http://www.sph.umich.edu/csg/yli/mach/download/<br />
<br />
Phased haplotype information is encoded in two files. The first file (which MACH calls the "snp file") lists the markers in the phased haplotypes. The second file (which MACH calls the "haplotype file") lists one haplotype per line. If you retrieved these files from the HapMap website, simply combine the --hapmapFormat option with the --snps option to indicate the name of the HapMap legend file and the --haps option to indicate the name of the file with phased haplotypes. Here is an example:<br />
<br />
prompt> mach1 --hapmapFormat --snps genotypes_chr1_CEU_r22_nr.b36_fwd_legend.txt.gz --haps genotypes_chr1_CEU_r22_nr.b36_fwd.phase.gz ...<br />
<br />
If you don't use the --hapmapFormat option, MACH expects the snp file (indicated with the --snps option) to simply list one marker name per line and the haplotype file (indicated with the --haps option) to list one haplotype per line. Haplotypes can be prefaced by one or two optional labels, followed by a series of single-character alleles, one for each marker. Within each haplotype, spaces are ignored. Here are two examples:<br />
<br />
'''<Example of a snp list file>'''<br />
marker1<br />
marker2<br />
...<br />
marker13<br />
'''<End of snp list file>'''<br />
<br />
In the sample haplotype file below, note that the first two columns are automatically ignored (because, based on the snp list file, MACH knows the phased haplotypes should include only 13 markers, corresponding to the last string of characters on each line). <br />
<br />
'''<Example of a phased haplotype file>'''<br />
FAMILY1->PERSON1 HAPLO1 CGGCGCGCTTGGC<br />
FAMILY1->PERSON1 HAPLO2 CGGCGCGTCCAGC<br />
FAMILY2->PERSON1 HAPLO1 GGGCGCGCTTGGC<br />
FAMILY2->PERSON1 HAPLO2 GGAAGCACTCGGC<br />
...<br />
'''<End of phased haplotype file>'''<br />
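The snp-list and haplotype formats above can be checked mechanically. The Python sketch below is hypothetical (not part of MACH); it drops the optional leading labels, joins allele runs split by internal spaces, and verifies each haplotype against the snp count:<br />

```python
# Sketch: read MACH-style snp list and haplotype lines. Leading label
# columns are discarded and spaces inside each haplotype are ignored.

def parse_haplotypes(snp_lines, hap_lines):
    snps = [s.strip() for s in snp_lines if s.strip()]
    haplotypes = []
    for line in hap_lines:
        fields = line.split()
        # Labels (at most two) come first, so join allele fields from the
        # right until the combined length matches the snp count.
        alleles = ""
        while fields and len(alleles) < len(snps):
            alleles = fields.pop() + alleles
        if len(alleles) != len(snps):
            raise ValueError("haplotype length does not match snp list")
        haplotypes.append(alleles)
    return snps, haplotypes

snps, haps = parse_haplotypes(
    ["marker%d" % i for i in range(1, 14)],
    ["FAMILY1->PERSON1 HAPLO1 CGGCG CGCTT GGC",
     "FAMILY2->PERSON1 HAPLO1 GGGCGCGCTTGGC"])
```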
<br />
If you provide MACH with a set of reference haplotypes as input, the marker order in the phased haplotypes overrides any marker order that may be specified in the pedigree and data files that contain the genotype data. This means that one convenient way to re-order markers in your original pedigree and data files is to create an empty haplotype file and a companion snp list file that lists markers in the desired order. When you provide these two as input, they override the marker order specified in the data file.<br />
<br />
== Saving Disk Space ==<br />
<br />
'''Useful Tip:''' You can usually economize disk space by using gzip to compress your input files (the data and pedigree files and any files containing the reference haplotypes). MACH can automatically recognize gzipped files and decompress them on the fly.<br />
<br />
That is all you should need to get started!</div>Ppwhitehttp://genome.sph.umich.edu/w/index.php?title=MaCH:_machX&diff=14633MaCH: machX2017-02-02T15:28:54Z<p>Ppwhite: /* Reference Haplotypes */</p>
<hr />
<div>This page documents how to perform X chromosome (non-pseudo-autosomal part) imputation using MaCH [http://www.sph.umich.edu/csg/yli/mach] and minimac [http://genome.sph.umich.edu/wiki/Minimac]. <br />
<br />
== Getting Started ==<br />
<br />
=== Your Own Data ===<br />
<br />
To get started, you will need to store your data in [[Merlin]] format pedigree and data files, one per chromosome. For details of the Merlin file format, see the Merlin tutorial [http://csg.sph.umich.edu//abecasis/Merlin/tour/input_files.html]. <br><br />
<br />
Within each file, markers should be stored by chromosome position. Alleles should be stored on the forward strand and can be encoded as 'A', 'C', 'G' or 'T' (there is no need to use numeric identifiers for each allele). <br><br />
<br />
Note that for males, hemizygotes are coded as homozygotes. <br><br />
<br />
=== Reference Haplotypes ===<br />
<br />
You can download the reference haplotypes from MaCH download page [http://csg.sph.umich.edu//yli/mach/download/chrX.html].<br />
<br />
== Two-Step Imputation ==<br />
<br />
=== Phase Your Own Data ===<br />
<br />
If there are no missing genotypes in males, you will only need to phase the females. Make sure that all alleles are stored on the forward strand before phasing. <br />
<br />
mach1 -d sample.dat -p sample.ped --states 200 -r 20 --phase -o sample.phased > sample.phased.log<br />
<br />
=== Impute ===<br />
<br />
Imputation will then be performed on the phased haplotypes using minimac [http://genome.sph.umich.edu/wiki/Minimac].<br />
<br />
minimac --refHaps ref.hap.gz --refSnps ref.snps --haps sample.phased.gz --snps sample.snps --rounds 5 --states 200 --prefix sample.imputed > sample.imputed.log<br />
<br />
== FAQ ==<br />
=== Shall I phase/impute males and females together or separately? ===<br />
Phasing males together with or separately from females doesn't seem to affect imputation quality. <br />
<br />
Imputing males together with or separately from females doesn't seem to affect imputation quality either. <br />
<br />
== Questions and Comments? ==<br />
<br />
Email [mailto:yunli@med.unc.edu Yun Li].</div>Ppwhitehttp://genome.sph.umich.edu/w/index.php?title=MaCH:_machX&diff=14632MaCH: machX2017-02-02T15:28:36Z<p>Ppwhite: /* Your Own Data */</p>
<hr />
<div>This page documents how to perform X chromosome (non-pseudo-autosomal part) imputation using MaCH [http://www.sph.umich.edu/csg/yli/mach] and minimac [http://genome.sph.umich.edu/wiki/Minimac]. <br />
<br />
== Getting Started ==<br />
<br />
=== Your Own Data ===<br />
<br />
To get started, you will need to store your data in [[Merlin]] format pedigree and data files, one per chromosome. For details of the Merlin file format, see the Merlin tutorial [http://csg.sph.umich.edu//abecasis/Merlin/tour/input_files.html]. <br><br />
<br />
Within each file, markers should be stored by chromosome position. Alleles should be stored on the forward strand and can be encoded as 'A', 'C', 'G' or 'T' (there is no need to use numeric identifiers for each allele). <br><br />
<br />
Note that for males, hemizygotes are coded as homozygotes. <br><br />
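The hemizygote convention above amounts to a small recoding pass before phasing. The Python sketch below is illustrative only (the genotype layout is hypothetical and not part of MaCH): male single-allele X calls become homozygotes, females are left alone:<br />

```python
# Sketch: recode male X-chromosome hemizygous calls ('A') into the
# homozygote form expected in the pedigree file ('A/A').
# Input layout is hypothetical: one genotype string per marker, sex 'M'/'F'.

def recode_hemizygotes(sex, genotypes):
    if sex != "M":
        return list(genotypes)
    # A bare single character (including the missing code '0') is doubled.
    return [g if "/" in g or len(g) != 1 else g + "/" + g for g in genotypes]
```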
<br />
=== Reference Haplotypes ===<br />
<br />
You can download the reference haplotypes from MaCH download page [http://www.sph.umich.edu/csg/yli/mach/download/chrX.html].<br />
<br />
== Two-Step Imputation ==<br />
<br />
=== Phase Your Own Data ===<br />
<br />
If there are no missing genotypes in males, you will only need to phase the females. Make sure that all alleles are stored on the forward strand before phasing. <br />
<br />
mach1 -d sample.dat -p sample.ped --states 200 -r 20 --phase -o sample.phased > sample.phased.log<br />
<br />
=== Impute ===<br />
<br />
Imputation will then be performed on the phased haplotypes using minimac [http://genome.sph.umich.edu/wiki/Minimac].<br />
<br />
minimac --refHaps ref.hap.gz --refSnps ref.snps --haps sample.phased.gz --snps sample.snps --rounds 5 --states 200 --prefix sample.imputed > sample.imputed.log<br />
<br />
== FAQ ==<br />
=== Shall I phase/impute males and females together or separately? ===<br />
Phasing males together with or separately from females doesn't seem to affect imputation quality. <br />
<br />
Imputing males together with or separately from females doesn't seem to affect imputation quality either. <br />
<br />
== Questions and Comments? ==<br />
<br />
Email [mailto:yunli@med.unc.edu Yun Li].</div>Ppwhitehttp://genome.sph.umich.edu/w/index.php?title=Thunder&diff=14631Thunder2017-02-02T15:27:22Z<p>Ppwhite: /* Ligate Haplotypes */</p>
<hr />
<div>This page documents how to perform variant calling from low-coverage sequencing data using glfmultiples and thunder. The pipeline was originally developed by [mailto:yunli@med.unc.edu Yun Li] and [mailto:goncalo@umich.edu Goncalo Abecasis] for the 1000 Genomes Low Coverage Pilot Project. <br />
<br />
== Input Data ==<br />
<br />
To get started, you will need glf files in the standard [http://samtools.sourceforge.net/SAM1.pdf glf format]. Sample files are available at [ftp://share.sph.umich.edu/1000genomes/pilot1/examples/glf.tgz sample glf files]. <br />
<br />
If you do not have glf files, you can generate them from bam files (the bam format is also specified in the same [http://samtools.sourceforge.net/SAM1.pdf specification]) using the following command line: <br />
<br />
samtools pileup -g -T 1 -f ref.fa my.bam &gt; my.glf<br />
<br />
Note: you will need the reference fasta file ref.fa to create a glf file from a bam file.<br />
<br />
== How to Run ==<br />
<br />
This variant calling pipeline has two steps: (1) promotion of a set of potential polymorphisms, and (2) genotype/haplotype calling using LD information. <br />
<br />
=== (step 1) Site promotion using software glfMultiples [https://csg.sph.umich.edu//yli/GPT_Freq.011.source.tgz GPT_Freq] ===<br />
<br />
GPT_Freq -b my.out -p 0.9 --minDepth 10 --maxDepth 1000 *.glf <br />
<br />
minDepth and maxDepth are the cutoffs on total depth (summed across all individuals). We have found it useful to exclude sites with extremely low or high total depth. Please see Important Filters below.<br />
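The effect of the --minDepth/--maxDepth cutoffs can be sketched as a simple window on summed depth. The site records below (position mapped to per-individual depths) are invented for illustration and are not the format GPT_Freq uses internally:<br />

```python
# Sketch of the total-depth cutoff: keep a site only if the depth summed
# across all individuals falls inside the [min_depth, max_depth] window.

def passes_depth_filter(per_individual_depths, min_depth=10, max_depth=1000):
    total = sum(per_individual_depths)
    return min_depth <= total <= max_depth

# Hypothetical sites: position -> depth per individual.
sites = {1000: [3, 2, 4, 6], 2000: [1, 0, 2, 1], 3000: [400, 350, 380, 90]}
kept = [pos for pos, depths in sites.items() if passes_depth_filter(depths)]
```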
<br />
=== (step 2) Genotype/haplotype calling using thunder [https://csg.sph.umich.edu//yli/thunder/thunder.V011.source.tgz thunder_glf_freq] ===<br />
<br />
thunder_glf_freq --shotgun my.out.$chr --detailedInput -r 100 --states 200 --dosage --phase --interim 25 -o my.final.out<br />
<br />
Notes: <br />
<br />
(1) The program thunder used in step 2 is an extension of MaCH, the genotype imputation software we have previously developed. For details regarding the shared options, please check out [http://csg.sph.umich.edu//yli/mach/index.html MaCH website] and [http://genome.sph.umich.edu/wiki/Mach MaCH wiki]. <br />
<br />
(2) Check out example files and command lines under examples/thunder/ in the thunder package [https://csg.sph.umich.edu//yli/thunder/thunder.V011.source.tgz thunder_glf_freq].<br />
<br />
== Example Showing the Whole Pipeline ==<br />
In the thunder [https://csg.sph.umich.edu//yli/thunder/thunder.V011.source.tgz thunder_glf_freq] tarball, under the example/thunder/ folder, you can find input files extracted from real data and a C-shell script that executes the whole analysis pipeline.<br />
<br />
== Ligate Haplotypes ==<br />
Please use [http://csg.sph.umich.edu//yli/ligateHap.V004.tgz ligateHaplotypes].<br />
<br />
== Important Filters ==<br />
<br />
We have found that the following filters are helpful.<br />
<br />
=== allelic imbalance ===<br />
See the [http://genome.sph.umich.edu/wiki/Genotype_Likelihood_Based_Allele_Balance allelic imbalance] statistic developed by Dr. Tom Blackwell. <br />
<br />
=== indel filter ===<br />
We recommend requiring a distance to known indels of >= 5 bp. A catalog of known indels can be found at [ftp://ftp-trace.ncbi.nih.gov/1000genomes/ftp/pilot_data/release/2010_07/low_coverage/indels/ indel catalog].<br />
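One way to apply this distance rule is a binary search against a sorted list of known indel positions. The positions below are made up for illustration; sites at exactly 5 bp from an indel pass, matching the >= 5 bp rule:<br />

```python
# Sketch of the indel-distance filter: drop candidate sites that lie
# within 5 bp of a known indel position.
import bisect

def far_from_indels(pos, indel_positions, min_dist=5):
    """indel_positions must be sorted; True if no indel within min_dist bp."""
    i = bisect.bisect_left(indel_positions, pos)
    for j in (i - 1, i):  # nearest indel is one of the two neighbors
        if 0 <= j < len(indel_positions) and abs(indel_positions[j] - pos) < min_dist:
            return False
    return True

indels = [100, 250, 900]            # hypothetical known indel positions
candidates = [96, 104, 110, 255, 800]
kept = [p for p in candidates if far_from_indels(p, indels)]
```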
<br />
=== site promotion filter ===<br />
We recommend setting the parameter -p to at least 0.9 in step 1 (running glfMultiples).<br />
<br />
=== strand bias filter ===<br />
<br />
=== total depth filter ===<br />
For the 1000 Genomes Project (average depth per individual ~4X), we have found it useful to exclude sites with average total depth per individual < 0.5X or > 20X.<br />
<br />
=== coverage filter ===<br />
We recommend requiring nonzero coverage in more than 50% of individuals.<br />
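The two depth-based recommendations above (average depth per individual within 0.5X-20X, and coverage in more than half of the individuals) can be combined into one per-site check. The depth vectors here are illustrative, not real pipeline output:<br />

```python
# Sketch combining two site-level checks: mean depth per individual within
# [0.5X, 20X], and nonzero coverage in more than half of the individuals.

def passes_site_filters(depths, low=0.5, high=20.0, min_covered_frac=0.5):
    mean = sum(depths) / float(len(depths))
    covered_frac = sum(1 for d in depths if d > 0) / float(len(depths))
    return low <= mean <= high and covered_frac > min_covered_frac
```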
<br />
=== flanking sequence filter ===<br />
We recommend excluding sites whose flanking 10-mer occurs at >0.1% frequency among candidate sites. The command samtools calmd -br can be used to perform base quality re-calibration.<br />
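The 10-mer frequency computation above can be sketched as a counting pass: tally each candidate site's flanking 10-mer across all candidates and flag those exceeding 0.1% of the total. The sequences below are synthetic and purely illustrative:<br />

```python
# Sketch of the flanking-sequence filter: flag over-represented 10-mers.
from collections import Counter
from itertools import product

def frequent_flanks(flanks, max_frac=0.001):
    """Return the set of flanking k-mers whose frequency exceeds max_frac."""
    counts = Counter(flanks)
    total = float(len(flanks))
    return {f for f, c in counts.items() if c / total > max_frac}

# Build 1000 synthetic flanks: 998 distinct doubled 5-mers plus one
# 10-mer that appears twice (0.2% > 0.1%, so it gets flagged).
unique = ["".join(p) * 2 for p in product("ACGT", repeat=5)][:998]
flanks = ["ACGTACGTAC"] * 2 + unique
overrepresented = frequent_flanks(flanks)
```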
<br />
== Citation ==<br />
Li Y, Sidore C, Kang HM, Boehnke M, Abecasis GR. Low-coverage sequencing: Implications for design of complex trait association studies. <em>Genome Res.</em> 2011 Jun;21(6):940-51. <br><br />
<br />
== Inference with External Reference ==<br />
<br />
Please refer to [http://genome.sph.umich.edu/wiki/UMAKE UMAKE]. <br><br />
<br />
== Questions and Comments? ==<br />
<br />
Email [mailto:yunli@med.unc.edu Yun Li].</div>Ppwhitehttp://genome.sph.umich.edu/w/index.php?title=Thunder&diff=14630Thunder2017-02-02T15:27:06Z<p>Ppwhite: /* Example Showing the Whole Pipeline */</p>
<hr />
<div>This page documents how to perform variant calling from low-coverage sequencing data using glfmultiples and thunder. The pipeline was originally developed by [mailto:yunli@med.unc.edu Yun Li] and [mailto:goncalo@umich.edu Goncalo Abecasis] for the 1000 Genomes Low Coverage Pilot Project. <br />
<br />
== Input Data ==<br />
<br />
To get started, you will need glf files in the standard [http://samtools.sourceforge.net/SAM1.pdf glf format]. Sample files are available at [ftp://share.sph.umich.edu/1000genomes/pilot1/examples/glf.tgz sample glf files]. <br />
<br />
If you do not have glf files, you can generate them from bam files (the bam format is also specified in the same [http://samtools.sourceforge.net/SAM1.pdf specification]) using the following command line: <br />
<br />
samtools pileup -g -T 1 -f ref.fa my.bam &gt; my.glf<br />
<br />
Note: you will need the reference fasta file ref.fa to create a glf file from a bam file.<br />
<br />
== How to Run ==<br />
<br />
This variant calling pipeline has two steps: (1) promotion of a set of potential polymorphisms, and (2) genotype/haplotype calling using LD information. <br />
<br />
=== (step 1) Site promotion using software glfMultiples [https://csg.sph.umich.edu//yli/GPT_Freq.011.source.tgz GPT_Freq] ===<br />
<br />
GPT_Freq -b my.out -p 0.9 --minDepth 10 --maxDepth 1000 *.glf <br />
<br />
minDepth and maxDepth are the cutoffs on total depth (summed across all individuals). We have found it useful to exclude sites with extremely low or high total depth. Please see Important Filters below.<br />
<br />
=== (step 2) Genotype/haplotype calling using thunder [https://csg.sph.umich.edu//yli/thunder/thunder.V011.source.tgz thunder_glf_freq] ===<br />
<br />
thunder_glf_freq --shotgun my.out.$chr --detailedInput -r 100 --states 200 --dosage --phase --interim 25 -o my.final.out<br />
<br />
Notes: <br />
<br />
(1) The program thunder used in step 2 is an extension of MaCH, the genotype imputation software we have previously developed. For details regarding the shared options, please check out [http://csg.sph.umich.edu//yli/mach/index.html MaCH website] and [http://genome.sph.umich.edu/wiki/Mach MaCH wiki]. <br />
<br />
(2) Check out example files and command lines under examples/thunder/ in the thunder package [https://csg.sph.umich.edu//yli/thunder/thunder.V011.source.tgz thunder_glf_freq].<br />
<br />
== Example Showing the Whole Pipeline ==<br />
In the thunder [https://csg.sph.umich.edu//yli/thunder/thunder.V011.source.tgz thunder_glf_freq] tarball, under the example/thunder/ folder, you can find input files extracted from real data and a C-shell script that executes the whole analysis pipeline.<br />
<br />
== Ligate Haplotypes ==<br />
Please use [http://www.sph.umich.edu/csg/yli/ligateHap.V004.tgz ligateHaplotypes].<br />
<br />
== Important Filters ==<br />
<br />
We have found that the following filters are helpful.<br />
<br />
=== allelic imbalance ===<br />
See the [http://genome.sph.umich.edu/wiki/Genotype_Likelihood_Based_Allele_Balance allelic imbalance] statistic developed by Dr. Tom Blackwell. <br />
<br />
=== indel filter ===<br />
We recommend requiring a distance to known indels of >= 5 bp. A catalog of known indels can be found at [ftp://ftp-trace.ncbi.nih.gov/1000genomes/ftp/pilot_data/release/2010_07/low_coverage/indels/ indel catalog].<br />
<br />
=== site promotion filter ===<br />
We recommend setting the parameter -p to at least 0.9 in step 1 (running glfMultiples).<br />
<br />
=== strand bias filter ===<br />
<br />
=== total depth filter ===<br />
For the 1000 Genomes Project (average depth per individual ~4X), we have found it useful to exclude sites with average total depth per individual < 0.5X or > 20X.<br />
<br />
=== coverage filter ===<br />
We recommend requiring nonzero coverage in more than 50% of individuals.<br />
<br />
=== flanking sequence filter ===<br />
We recommend excluding sites whose flanking 10-mer occurs at >0.1% frequency among candidate sites. The command samtools calmd -br can be used to perform base quality re-calibration.<br />
<br />
== Citation ==<br />
Li Y, Sidore C, Kang HM, Boehnke M, Abecasis GR. Low-coverage sequencing: Implications for design of complex trait association studies. <em>Genome Res.</em> 2011 Jun;21(6):940-51. <br><br />
<br />
== Inference with External Reference ==<br />
<br />
Please refer to [http://genome.sph.umich.edu/wiki/UMAKE UMAKE]. <br><br />
<br />
== Questions and Comments? ==<br />
<br />
Email [mailto:yunli@med.unc.edu Yun Li].</div>Ppwhitehttp://genome.sph.umich.edu/w/index.php?title=Thunder&diff=14629Thunder2017-02-02T15:26:33Z<p>Ppwhite: /* (step 2) Genotype/haplotype calling using thunder thunder_glf_freq */</p>
<hr />
<div>This page documents how to perform variant calling from low-coverage sequencing data using glfmultiples and thunder. The pipeline was originally developed by [mailto:yunli@med.unc.edu Yun Li] and [mailto:goncalo@umich.edu Goncalo Abecasis] for the 1000 Genomes Low Coverage Pilot Project. <br />
<br />
== Input Data ==<br />
<br />
To get started, you will need glf files in the standard [http://samtools.sourceforge.net/SAM1.pdf glf format]. Sample files are available at [ftp://share.sph.umich.edu/1000genomes/pilot1/examples/glf.tgz sample glf files]. <br />
<br />
If you do not have glf files, you can generate them from bam files (the bam format is also specified in the same [http://samtools.sourceforge.net/SAM1.pdf specification]) using the following command line: <br />
<br />
samtools pileup -g -T 1 -f ref.fa my.bam &gt; my.glf<br />
<br />
Note: you will need the reference fasta file ref.fa to create a glf file from a bam file.<br />
<br />
== How to Run ==<br />
<br />
This variant calling pipeline has two steps: (1) promotion of a set of potential polymorphisms, and (2) genotype/haplotype calling using LD information. <br />
<br />
=== (step 1) Site promotion using software glfMultiples [https://csg.sph.umich.edu//yli/GPT_Freq.011.source.tgz GPT_Freq] ===<br />
<br />
GPT_Freq -b my.out -p 0.9 --minDepth 10 --maxDepth 1000 *.glf <br />
<br />
minDepth and maxDepth are the cutoffs on total depth (summed across all individuals). We have found it useful to exclude sites with extremely low or high total depth. Please see Important Filters below.<br />
<br />
=== (step 2) Genotype/haplotype calling using thunder [https://csg.sph.umich.edu//yli/thunder/thunder.V011.source.tgz thunder_glf_freq] ===<br />
<br />
thunder_glf_freq --shotgun my.out.$chr --detailedInput -r 100 --states 200 --dosage --phase --interim 25 -o my.final.out<br />
<br />
Notes: <br />
<br />
(1) The program thunder used in step 2 is an extension of MaCH, the genotype imputation software we have previously developed. For details regarding the shared options, please check out [http://csg.sph.umich.edu//yli/mach/index.html MaCH website] and [http://genome.sph.umich.edu/wiki/Mach MaCH wiki]. <br />
<br />
(2) Check out example files and command lines under examples/thunder/ in the thunder package [https://csg.sph.umich.edu//yli/thunder/thunder.V011.source.tgz thunder_glf_freq].<br />
<br />
== Example Showing the Whole Pipeline ==<br />
In the thunder [https://www.sph.umich.edu/csg/yli/thunder/thunder.V011.source.tgz thunder_glf_freq] tarball, under the example/thunder/ folder, you can find input files extracted from real data and a C-shell script that executes the whole analysis pipeline.<br />
<br />
== Ligate Haplotypes ==<br />
Please use [http://www.sph.umich.edu/csg/yli/ligateHap.V004.tgz ligateHaplotypes].<br />
<br />
== Important Filters ==<br />
<br />
We have found that the following filters are helpful.<br />
<br />
=== allelic imbalance ===<br />
See the [http://genome.sph.umich.edu/wiki/Genotype_Likelihood_Based_Allele_Balance allelic imbalance] statistic developed by Dr. Tom Blackwell. <br />
<br />
=== indel filter ===<br />
We recommend requiring a distance to known indels of >= 5 bp. A catalog of known indels can be found at [ftp://ftp-trace.ncbi.nih.gov/1000genomes/ftp/pilot_data/release/2010_07/low_coverage/indels/ indel catalog].<br />
<br />
=== site promotion filter ===<br />
We recommend setting the parameter -p to at least 0.9 in step 1 (running glfMultiples).<br />
<br />
=== strand bias filter ===<br />
<br />
=== total depth filter ===<br />
For the 1000 Genomes Project (average depth per individual ~4X), we have found it useful to exclude sites with average total depth per individual < 0.5X or > 20X.<br />
<br />
=== coverage filter ===<br />
We recommend requiring nonzero coverage in more than 50% of individuals.<br />
<br />
=== flanking sequence filter ===<br />
We recommend excluding sites whose flanking 10-mer occurs at >0.1% frequency among candidate sites. The command samtools calmd -br can be used to perform base quality re-calibration.<br />
<br />
== Citation ==<br />
Li Y, Sidore C, Kang HM, Boehnke M, Abecasis GR. Low-coverage sequencing: Implications for design of complex trait association studies. <em>Genome Res.</em> 2011 Jun;21(6):940-51. <br><br />
<br />
== Inference with External Reference ==<br />
<br />
Please refer to [http://genome.sph.umich.edu/wiki/UMAKE UMAKE]. <br><br />
<br />
== Questions and Comments? ==<br />
<br />
Email [mailto:yunli@med.unc.edu Yun Li].</div>Ppwhitehttp://genome.sph.umich.edu/w/index.php?title=Thunder&diff=14628Thunder2017-02-02T15:25:59Z<p>Ppwhite: /* (step 1) Site promotion using software glfMultiples GPT_Freq */</p>
<hr />
<div>This page documents how to perform variant calling from low-coverage sequencing data using glfmultiples and thunder. The pipeline was originally developed by [mailto:yunli@med.unc.edu Yun Li] and [mailto:goncalo@umich.edu Goncalo Abecasis] for the 1000 Genomes Low Coverage Pilot Project. <br />
<br />
== Input Data ==<br />
<br />
To get started, you will need glf files in the standard [http://samtools.sourceforge.net/SAM1.pdf glf format]. Sample files are available at [ftp://share.sph.umich.edu/1000genomes/pilot1/examples/glf.tgz sample glf files]. <br />
<br />
If you do not have glf files, you can generate them from bam files (the bam format is also specified in the same [http://samtools.sourceforge.net/SAM1.pdf specification]) using the following command line: <br />
<br />
samtools pileup -g -T 1 -f ref.fa my.bam &gt; my.glf<br />
<br />
Note: you will need the reference fasta file ref.fa to create a glf file from a bam file.<br />
<br />
== How to Run ==<br />
<br />
This variant calling pipeline has two steps: (1) promotion of a set of potential polymorphisms, and (2) genotype/haplotype calling using LD information. <br />
<br />
=== (step 1) Site promotion using software glfMultiples [https://csg.sph.umich.edu//yli/GPT_Freq.011.source.tgz GPT_Freq] ===<br />
<br />
GPT_Freq -b my.out -p 0.9 --minDepth 10 --maxDepth 1000 *.glf <br />
<br />
minDepth and maxDepth are the cutoffs on total depth (summed across all individuals). We have found it useful to exclude sites with extremely low or high total depth. Please see Important Filters below.<br />
<br />
=== (step 2) Genotype/haplotype calling using thunder [https://www.sph.umich.edu/csg/yli/thunder/thunder.V011.source.tgz thunder_glf_freq] ===<br />
<br />
thunder_glf_freq --shotgun my.out.$chr --detailedInput -r 100 --states 200 --dosage --phase --interim 25 -o my.final.out<br />
<br />
Notes: <br />
<br />
(1) The program thunder used in step 2 is an extension of MaCH, the genotype imputation software we have previously developed. For details regarding the shared options, please check out [http://www.sph.umich.edu/csg/yli/mach/index.html MaCH website] and [http://genome.sph.umich.edu/wiki/Mach MaCH wiki]. <br />
<br />
(2) Check out example files and command lines under examples/thunder/ in the thunder package [https://www.sph.umich.edu/csg/yli/thunder/thunder.V011.source.tgz thunder_glf_freq].<br />
<br />
== Example Showing the Whole Pipeline ==<br />
In the thunder [https://www.sph.umich.edu/csg/yli/thunder/thunder.V011.source.tgz thunder_glf_freq] tarball, under the example/thunder/ folder, you can find input files extracted from real data and a C-shell script that executes the whole analysis pipeline.<br />
<br />
== Ligate Haplotypes ==<br />
Please use [http://www.sph.umich.edu/csg/yli/ligateHap.V004.tgz ligateHaplotypes].<br />
<br />
== Important Filters ==<br />
<br />
We have found that the following filters are helpful.<br />
<br />
=== allelic imbalance ===<br />
Use the statistic developed by Dr. Tom Blackwell: [http://genome.sph.umich.edu/wiki/Genotype_Likelihood_Based_Allele_Balance allelic imbalance]. <br />
<br />
=== indel filter ===<br />
We recommend requiring a distance of >= 5bp to known indels. A catalog of known indels can be found at [ftp://ftp-trace.ncbi.nih.gov/1000genomes/ftp/pilot_data/release/2010_07/low_coverage/indels/ indel catalog].<br />
<br />
=== site promotion filter ===<br />
We recommend setting parameter -p to at least 0.9 in step 1 (running glfMultiples).<br />
<br />
=== strand bias filter ===<br />
<br />
=== total depth filter ===<br />
For the 1000 Genomes Project (average depth per individual ~4X), we have found it useful to exclude sites with average total depth per individual < 0.5X or > 20X.<br />
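The per-individual guideline translates into the --minDepth/--maxDepth totals passed to GPT_Freq in step 1. A small sketch (the sample size of 60 is a placeholder):

```shell
#!/bin/sh
# Turn the per-individual guideline (0.5X .. 20X average depth) into
# total-depth cutoffs summed over the whole sample.
N=60                         # number of sequenced individuals (placeholder)
minDepth=$(( N / 2 ))        # 0.5X per individual, summed over the sample
maxDepth=$(( N * 20 ))       # 20X per individual, summed over the sample
echo "--minDepth $minDepth --maxDepth $maxDepth"
```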
<br />
=== coverage filter ===<br />
We recommend requiring that >50% of individuals have coverage at the site.<br />
<br />
=== flanking sequence filter ===<br />
We recommend excluding sites whose flanking 10-mer has a frequency >0.1% among candidate sites. samtools calmd -br can be used to perform base quality re-calibration.<br />
<br />
== Citation ==<br />
Li Y, Sidore C, Kang HM, Boehnke M, Abecasis GR. Low-coverage sequencing: Implications for design of complex trait association studies. <em>Genome Res.</em> 2011 Jun;21(6):940-51. <br><br />
<br />
== Inference with External Reference ==<br />
<br />
Please refer to [http://genome.sph.umich.edu/wiki/UMAKE UMAKE]. <br><br />
<br />
== Questions and Comments? ==<br />
<br />
Email [mailto:yunli@med.unc.edu Yun Li].</div>
<hr />
<div>This is the MaCH Divide and Conquer page, documenting how to break the genome into smaller pieces before imputation/phasing and how to ligate after imputation/phasing.<br />
<br />
== Phasing without External Reference ==<br />
=== Your Data ===<br />
To get started, you will need to store your data in [[Merlin]] format pedigree and data files, one per chromosome. For details of the Merlin file format, see the Merlin tutorial [http://www.sph.umich.edu/csg/abecasis/Merlin/tour/input_files.html]. <br><br />
<br />
Within each file, markers should be stored by chromosome position. Alleles should be stored in the forward strand and can be encoded as 'A', 'C', 'G' or 'T' (there is no need to use numeric identifiers for each allele). <br><br />
<br />
=== Split Your Data ===<br />
You can split your data using [http://csg.sph.umich.edu//yli/splitPed/ splitPed]. If you follow our recommendation of using MaCH+minimac for imputation, you only need to use splitPed in the MaCH step (to phase your study sample), which does not involve an external reference. In the minimac step, imputation finishes within a day for several thousand individuals, even for the largest chromosome as a whole. A good rule of thumb is that minimac takes about 1 hour to impute 1,000,000 markers in 1,000 individuals using a reference panel with 100 haplotypes; see the [http://genome.sph.umich.edu/wiki/Minimac#Imputation minimac wiki] for more details.<br />
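Because the rule of thumb above scales linearly in each factor, a rough minimac runtime estimate can be computed directly (this is an extrapolation of the quoted rule, not a guarantee):

```shell
#!/bin/sh
# hours ~= (markers/1e6) * (individuals/1e3) * (reference haplotypes/100),
# extrapolating the quoted rule of thumb linearly in each factor.
markers=1000000; inds=1000; haps=100
awk -v m="$markers" -v n="$inds" -v h="$haps" \
    'BEGIN { printf "~%.2f hours\n", (m / 1e6) * (n / 1e3) * (h / 100) }'
```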
<br />
== Phasing/Imputation with External Reference ==<br />
When you phase/impute with an external reference panel, you only need to break the reference files into parts containing subsets of markers, because SNPs in your own data (pedigree files) but not in the reference files are automatically ignored by MaCH and minimac. <br><br />
<br />
You can split the reference data using [http://csg.sph.umich.edu//yli/splitRef/ splitRef].<br />
<br />
== Post Phasing/Imputation Ligation ==<br />
You can use [http://csg.sph.umich.edu//yli/ligateHap.V004.tgz LigateHaplotypes ] to ligate the parts.<br />
<br />
== Questions and Comments? ==<br />
<br />
Email [mailto:yunli@med.unc.edu Yun Li].</div>
<hr />
<div>== How to speed up? ==<br />
<br />
=== minimac ===<br />
<br />
This is the new 2-step procedure we recommend, particularly for people who perform imputation multiple times (using HapMap as reference, or using updated releases of the 1000 Genomes data as reference). <br><br />
The first step is a pre-phasing step using MaCH. This step does not need an external reference. It is time-consuming BUT is a one-time investment. For computational reasons, we recommend breaking the genome into small overlapping segments ([http://genome.sph.umich.edu/wiki/MaCH_FAQ#Divide_and_Conquer Divide-and-Conquer]) for this step. In general, we recommend a >500Kb overlapping region on each side. For example, for the Affymetrix 6.0 panel, a core region of 10Mb with a flanking/overlapping region of 1Mb on each side corresponds to ~3,500 SNPs in the core region and ~350 SNPs on each side. For 2,000 individuals, one job with ~4,200 SNPs running with --states 200 and -r 50 would take ~40 hours. For other combinations, use the following link to estimate computing time: [http://csg.sph.umich.edu//yli/MaCH-Admix/runtime.php#est runtime estimate]. <br><br />
The second step is the actual imputation step using minimac. This step can run on whole chromosomes. Regarding computing time, one million markers for 1,000 individuals using 100 reference haplotypes takes ~1 hour, and computing time increases linearly with each of these three parameters. See [http://genome.sph.umich.edu/wiki/Minimac minimac] for details.<br />
<br />
=== MaCH-Admix ===<br />
<br />
If you are doing imputation only a few (<5) times (think twice if this is really true) or are under immediate time pressure, you can use MaCH-Admix, which does not require pre-phased data and takes ~1/7 of the computing time typically needed for pre-phasing. For large datasets, we recommend breaking the genome into small overlapping segments ([http://genome.sph.umich.edu/wiki/MaCH_FAQ#Divide_and_Conquer Divide-and-Conquer]). For details, see [http://www.unc.edu/~yunmli/MaCH-Admix/ MaCH-Admix].<br />
<br />
=== Divide and Conquer ===<br />
See [http://genome.sph.umich.edu/wiki/Mach_DAC MaCH Divide and Conquer] for details.<br />
<br />
=== 2-step imputation ===<br />
See [http://genome.sph.umich.edu/wiki/MaCH_FAQ#Why_and_how_to_perform_a_2-step_imputation.3F 2-step imputation] for details. <br />
<br />
== Why and how to perform a 2-step imputation? ==<br />
<br />
When one has a large number of individuals (&gt;1000), we recommend a 2-step imputation to speed up. <br> <br />
<br />
&nbsp;&nbsp;&nbsp;&nbsp; A 2-step imputation contains the following 2 steps:<br> <br />
<br />
&nbsp;&nbsp;&nbsp; (step 1) a representative subset of &gt;= 200 unrelated individuals are used to calibrate model parameters; and<br>&nbsp;&nbsp;&nbsp; (step 2) actual genotype imputation is performed for every person using parameters inferred in step 1. <br> <br />
<br />
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; Example command lines for a 2-step imputation:<br> <br />
<br />
# step 1:<br />
mach1 -d sample.dat -p subset.ped -s chr20.snps -h chr20.hap --compact --greedy --autoFlip -r 100 -o par_infer &gt; mach.infer.log<br />
<br />
# step 2:<br />
mach1 -d sample.dat -p sample.ped -s chr20.snps -h chr20.hap --compact --greedy --autoFlip --errorMap par_infer.erate --crossoverMap par_infer.rec --mle --mldetails &gt; mach.imp.log<br />
<br />
In step 1, one can use --greedy in combination with --states XX in MaCH versions 16.b and above. We have found that using 1/3 of the reference haplotypes (with 1/9 of the computational time) results in almost no power loss for the current HapMap and 1000G reference panels.<br><br />
<br />
In step 2, each individual is imputed independently; the work can therefore be split into as many as n (the sample size) jobs per chromosome for parallelism.<br />
<br />
For other approaches, see the How to speed up section above.<br />
<br />
== Can MaCH perform imputation for chromosome X? ==<br />
Yes. See [http://genome.sph.umich.edu/wiki/MaCH:_machX MaCH X Chromosome] for details.<br />
<br />
== Where can I find combined HapMap reference files? ==<br />
<br />
You can find them at http://csg.sph.umich.edu//yli/mach/download/HapMap-r21.html or on the HapMap Project website.<br />
<br />
== Where can I find HapMap III / 1000 Genomes reference files? ==<br />
<br />
You can find these at the MaCH download page, which is at http://csg.sph.umich.edu//yli/mach/download/<br />
<br />
== Does --mle overwrite input genotypes? ==<br />
<br />
Yes, but not often. The --mle option outputs the most likely genotype configuration, taking into account the observed genotypes and integrating over the most similar reference haplotypes. The original genotypes are changed only if the underlying reference haplotypes strongly contradict the input genotype. <br />
<br />
== How do I get imputation quality estimates? ==<br />
<br />
A simple approach is to use the --mask option (in the second step alone, if using two-step imputation). For example, --mask 0.02 masks 2% of the genotypes at random, imputes them, and compares them with the masked originals to estimate genotypic and allelic error rates. Messages like the following will be generated to stdout: <br />
<br />
Comparing 948352 masked genotypes with MLE estimates ...<br />
Estimated per genotype error rate is 0.0568<br />
Estimated per allele error rate is 0.0293 <br />
<br />
A better approach is to mask a small proportion of SNPs (vs. genotypes in the simple approach above). One can generate a mask.dat from the original .dat file by simply changing the flag of a subset of markers from M to S2, without duplicating the .ped file. Post-imputation, one can use [http://genome.sph.umich.edu/wiki/CalcMatch CalcMatch] and [http://csg.sph.umich.edu//ylwtx/doseR2.tgz doseR2.pl] to estimate the genotypic/allelic error rates and correlation, respectively. Both programs can be downloaded from [http://csg.sph.umich.edu//ylwtx/software.html http://csg.sph.umich.edu//ylwtx/software.html]. <br />
<br />
'''Warning''': Imputation involving masked datasets should be performed separately for imputation quality estimation. For production, one should use all available information.<br />
<br />
== How do I interpret the imputation quality estimates? ==<br />
In the simple approach, you only get concordance/error estimates. There are two aspects to check: (1) the ratio between the genotypic and allelic error rates. We expect only a small proportion of errors to involve one homozygote being imputed as the other homozygote; therefore, a ~2:1 ratio is expected. (2) The absolute error rate. Several factors influence imputation quality, including the population to be imputed, the reference population, and the genotyping panel used. Typically, we expect a <2% allelic error rate among Caucasians and East Asians, and 3-5% among Africans and African Americans. The figure below shows imputation quality from the Human Genome Diversity Project (HGDP) for 52 populations across the world, by HapMap reference panel.<br />
<br />
http://csg.sph.umich.edu//yli/figure3.gif<br />
<br />
Table 3 in the MaCH 1.0 paper tabulates imputation quality by commercial panel in CEU, YRI, and CHB+JPT.<br />
<br />
== Shall I apply QC before or after imputation? If so, how? ==<br />
<br />
We strongly recommend QC both before and after imputation. Before imputation, we recommend the standard battery of QC filters, including HWE, MAF (the recommended cutoff is 1% for genotyping-based GWAS), completeness, and Mendelian inconsistency. Post-imputation, we recommend an Rsq filter of 0.3 (which removes &gt;70% of poorly-imputed SNPs at the cost of &lt;0.5% of well-imputed SNPs) and a MAF filter of 1%. <br />
<br />
== How do I get reference files for a region of interest? ==<br />
<br />
Note that you do not need to extract regional pedigree files for your own samples because SNPs in pedigree but not in reference will be automatically discarded. <br> 1. For HapMapII format, download haplotypes from http://csg.sph.umich.edu//ylwtx/HapMapForMach.tgz <br> 2. For MACH format, you can do the following: <br />
<br />
*First, find the first and last SNP in the region you are interested in. Say "rsFIRST" and "rsLAST", defined according to position. <br />
*Then, under csh:<br />
@ first = `grep -nw rsFIRST orig.snps | cut -f1 -d ':'`<br />
@ last = `grep -nw rsLAST orig.snps | cut -f1 -d ':'`<br />
under bash:<br />
first=`grep -nw rsFIRST orig.snps | cut -f1 -d ':'`<br />
last=`grep -nw rsLAST orig.snps | cut -f1 -d ':'`<br />
<br />
*Then find out the field that contains the actual haplotypes, where alleles are separated by whitespace<br />
head -1 orig.hap | wc -w<br />
Note: if the haplotypes are gz compressed, do:<br />
zcat orig.hap.gz | head -1 | wc -w<br />
<br />
* Finally (assuming you got 3 from the wc -w command above; if you got another number, replace the bold 3 below with the number you got):<br />
<br />
awk '{print $'''3'''}' orig.hap | cut -c${first}-${last} &gt; region.hap<br />
<br />
Note: if the haplotypes are gz compressed, do:<br />
zcat orig.hap.gz | awk '{print $'''3'''}' | cut -c${first}-${last} &gt; region.hap<br />
<br />
The created reference files are in MaCH format. You do NOT need to turn on --hapmapFormat option.<br />
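The steps above can be consolidated into one bash sketch (assumptions: an uncompressed MaCH-format reference, and rsFIRST/rsLAST as placeholders for your boundary SNPs; the region.snps step is an extra convenience not in the text above):

```shell
#!/bin/bash
# Locate the boundary SNPs by line number in the SNP list.
first=$(grep -nw rsFIRST orig.snps | cut -f1 -d ':')
last=$(grep -nw rsLAST orig.snps | cut -f1 -d ':')
# Find which whitespace-separated field holds the haplotype string.
field=$(head -1 orig.hap | wc -w)
# Clip each haplotype string to the region.
awk -v f="$field" '{ print $f }' orig.hap | cut -c"${first}-${last}" > region.hap
# Matching regional SNP list (convenience step, an assumption).
sed -n "${first},${last}p" orig.snps > region.snps
```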
<br />
== Do I always have to sort the pedigree file by marker position? ==<br />
<br />
If you use a reference set of haplotypes, you do not have to as long as the external reference is in correct order.<br />
<br />
== What if I specify ''--states R'' where ''R'' exceeds the maximum possible (2*number diploid individuals - 2 + number_haplotypes)? ==<br />
<br />
Mach caps the number of states at the maximum possible value. <br />
<br />
== How is AL1 defined? Which allele dosage is .dose/.mldose counting? ==<br />
<br />
AL1 is an arbitrary allele. Typically, it is the first allele read in the reference haplotypes. The earliest versions of MaCH (prior to April 2007) counted the expected number of copies of AL2; more recent versions count the number of copies of AL1. One can find out which allele is counted by following the steps below. <br />
<br />
#. First, find the two alleles for one of the markers in your data<br />
<br />
<source lang="text"><br />
prompt> head -2 mlinfo/chr21.mlinfo <br />
SNP Al1 Al2 Freq1 MAF Quality Rsq <br />
rs885550 2 4 0.9840 0.0160 0.9682 0.992<br />
</source> <br />
<br />
#. Second, check the dosage for a few individuals at this SNP.<br />
<br />
<source lang="text"><br />
prompt> head -3 mldose/chr21.mldose | cut -f3 -d ' ' <br />
1.962 <br />
1.000<br />
0.078<br />
</source> <br />
<br />
#. Finally, compare these dosages to genotypes.<br />
<br />
<source lang="text"><br />
prompt> head -1 mlgeno/chr21.mlgeno | cut -f3 -d ' ' <br />
2/2 <br />
2/4<br />
4/4<br />
</source> <br />
<br />
In this example, you can see that the first individual has a high dosage count (1.962) and most likely genotype 2/2. The last individual has a low dosage count and most likely genotype 4/4. Thus, the output corresponds to a version of MaCH released after April 2007, which tallies allele 1 counts. <br />
<br />
Note that, on the example above, .mldose could be replaced with .dose and .mlgeno could be replaced with .geno. <br />
<br />
Based on the three files above, we've confirmed that the dosage is the number of AL1 copies: you only need to check one informative case (i.e., a dosage value close to 0 or 2), since the convention is consistent across all individuals and all SNPs.<br />
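That one-case check can be scripted; a sketch using the example values shown above (Al1=2, most likely genotype 2/2, dosage 1.962 — the 0.5 tolerance is an assumption):

```shell
#!/bin/sh
# Count copies of Al1 in the most likely genotype and check that the dosage
# lies within 0.5 of that count (values taken from the example above).
al1=2; geno=2/2; dose=1.962
count=$(echo "$geno" | tr '/' '\n' | grep -c "^${al1}\$")
awk -v c="$count" -v d="$dose" 'BEGIN { exit !(d > c - 0.5 && d < c + 0.5) }' \
    && echo "dosage counts copies of AL1"
```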
<br />
== Can I use an unphased reference? ==<br />
<br />
Yes. You could create pedigree (.ped) and data (.dat) files that include both reference panel and sample genotypes, or request that MaCH merge the appropriate files on the fly. <br />
<br />
For example, if you have: <br />
<br />
'''reference.dat''' <br />
<br />
M SNP1<br />
M SNP2<br />
M SNP3<br />
M SNP4<br />
M SNP5<br />
<br />
'''reference.ped''' <br />
<br />
REF1 REF1 0 0 1 A/C C/C G/G G/A A/A<br />
<br />
'''sample.dat''' <br />
<br />
M SNP1<br />
M SNP4 <br />
<br />
'''sample.ped''' <br />
<br />
1 1 0 0 1 A/A G/G<br />
<br />
Your could create a combined data set as: <br />
<br />
'''comb.dat''' <br />
<br />
M SNP1<br />
M SNP2<br />
M SNP3<br />
M SNP4<br />
M SNP5<br />
<br />
'''comb.ped''' <br />
<br />
REF1 REF1 0 0 1 A/C C/C G/G G/A A/A <br />
1 1 0 0 1 A/A ./. ./. G/G ./. <br />
<br />
Equivalently, you could write -d reference.dat,sample.dat -p reference.ped,sample.ped on the command line and MaCH would merge both files ''on-the-fly''.<br />
<br />
== How big are the imputation output files? ==<br />
For 1,000 individuals with 8 million SNPs, the gz-compressed geno/dose/prob files take ~5GB/10GB/15GB, respectively.<br />
<br />
== How long does imputation take? ==<br />
<br />
The following factors/parameters affect computational time: <br />
<br />
#m, # of genotyped markers (number of markers in .dat file)<br> <br />
#n, # of individuals<br> <br />
#h, # of reference haplotypes (determined by --greedy or --states; by default, h = 2*(number of diploid individuals) - 2 + number_haplotypes)<br> <br />
#r, # of rounds (-r or --rounds, --mle corresponds to 1-2 rounds)<br />
<br />
Computational time increases linearly with m, n, r and quadratically with h. On our Xeon 3.0GHz machine, imputation with m=25K, n=250, h=120, and r=100 takes ~20 hours (25000*250*120^2*100/4.5/10^11). <br />
<br />
If you have a larger number of individuals to impute (e.g., > 1,000), we recommend the 2-step imputation approach: http://genome.sph.umich.edu/wiki/MaCH_FAQ#Why_and_how_to_perform_a_2-step_imputation.3F.<br />
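The back-of-envelope formula above can be evaluated directly; this sketch reproduces the quoted example (~20 hours on the reference Xeon 3.0GHz machine — the constant is machine-dependent):

```shell
#!/bin/sh
# hours ~= m * n * h^2 * r / 4.5e11 (constant calibrated on a Xeon 3.0GHz):
# linear in markers m, individuals n, rounds r; quadratic in haplotypes h.
m=25000; n=250; h=120; r=100
awk -v m="$m" -v n="$n" -v h="$h" -v r="$r" \
    'BEGIN { printf "~%.1f hours\n", m * n * h ^ 2 * r / 4.5e11 }'
```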
<br />
== undefined symbol: gzopen64 ==<br />
If you see this message, you will need to re-compile the program. Type the following commands:<br />
<br />
make clear<br />
make all<br />
<br />
The new executables mach1 and thunder will then be generated under the executables/ folder.<br />
<br />
== Install MaCH ==<br />
Source code is available through the MaCH download page: http://csg.sph.umich.edu//yli/mach/download/ <br><br />
<br />
== More questions? ==<br />
<br />
Email [mailto:yunli@med.unc.edu Yun Li] or [mailto:goncalo@umich.edu Goncalo Abecasis].</div>
<br />
<source lang="text"><br />
prompt> head -3 mldose/chr21.mldose | cut -f3 -d ' ' <br />
1.962 <br />
1.000<br />
0.078<br />
</source> <br />
<br />
#. Finally, compare these dosages to genotypes.<br />
<br />
<source lang="text"><br />
prompt> head -1 mlgeno/chr21.mlgeno | cut -f3 -d ' ' <br />
2/2 <br />
2/4<br />
4/4<br />
</source> <br />
<br />
In this example, you can see that the first individual has a high dosage count (1.962) and most likely genotype 2/2. The last individual has a low dosage count and most likely genotype 4/4. Thus, the output corresponds to version of Mach released after April 2007, which should tally allele 1 counts. <br />
<br />
Note that, on the example above, .mldose could be replaced with .dose and .mlgeno could be replaced with .geno. <br />
<br />
Based on the three files above, we've confirmed that dosage is the number of AL1 copies: you will only to check for one informative case (i.e, dosage values close to 0 or 2) since it's consistent across all individuals and all SNPs.<br />
<br />
== Can I used an unphased reference? ==<br />
<br />
Yes. You could create pedigree (.ped) and data files (.dat) that include both reference panel and sample genotypes or request that MaCH merge apppropriate files on the fly. <br />
<br />
For example, if you have: <br />
<br />
'''reference.dat''' <br />
<br />
M SNP1<br />
M SNP2<br />
M SNP3<br />
M SNP4<br />
M SNP5<br />
<br />
'''reference.ped''' <br />
<br />
REF1 REF1 0 0 1 A/C C/C G/G G/A A/A<br />
<br />
'''sample.dat''' <br />
<br />
M SNP1<br />
M SNP4 <br />
<br />
'''sample.ped''' <br />
<br />
1 1 0 0 1 A/A G/G<br />
<br />
Your could create a combined data set as: <br />
<br />
'''comb.dat''' <br />
<br />
M SNP1<br />
M SNP2<br />
M SNP3<br />
M SNP4<br />
M SNP5<br />
<br />
'''comb.ped''' <br />
<br />
REF1 REF1 0 0 1 A/C C/C G/G G/A A/A <br />
1 1 0 0 1 A/A ./. ./. G/G ./. <br />
<br />
Equivalently, you could write -d reference.dat,sample.dat -p reference.ped,sample.ped on the command line and MACH would merge both files ''on-the-fly''.<br />
<br />
== How big are the imputation output file? ==<br />
For 1,000 individuals with 8 million SNPs, gz compressed geno/dose/prob files take ~5Gb/10Gb/15Gb.<br />
<br />
== How long does imputation take? ==<br />
<br />
The following factors/parameters affect computational time: <br />
<br />
#m, # of genotyped markers (number of markers in .dat file)<br> <br />
#n, # of individuals<br> <br />
#h, # of reference haplotypes (determined by --greedy or states, by default, h = 2*number diploid individuals - 2 + number_haplotypes)<br> <br />
#r, # of rounds (-r or --rounds, --mle corresponds to 1-2 rounds)<br />
<br />
Computational time increases linearly with m, n, r and quadratically with h. On our Xeon 3.0GHz machine, imputation with m=25K, n=250, h=120, and r=100 takes ~20 hours (25000*250*120^2*100/4.5/10^11). <br />
<br />
If you have a larger number of individuals to impute (e.g., > 1,000), we recommend a 2-step imputation manner http://genome.sph.umich.edu/wiki/MaCH_FAQ#Why_and_how_to_perform_a_2-step_imputation.3F.<br />
<br />
== undefined symbol: gzopen64 ==<br />
If you see this message, you will need to re-compile the program. Type the following commands:<br />
<br />
make clear<br />
make all<br />
<br />
New executables mach1 and thunder will then be generated under folder executables/<br />
<br />
== Install MaCH ==<br />
We have source codes available through the MaCH download page: http://csg.sph.umich.edu//yli/mach/download/ <br><br />
<br />
== More questions? ==<br />
<br />
Email [mailto:yunli@med.unc.edu Yun Li] or [mailto:goncalo@umich.edu Goncalo Abecasis].</div>Ppwhitehttp://genome.sph.umich.edu/w/index.php?title=MaCH_FAQ&diff=14622MaCH FAQ2017-02-02T15:21:46Z<p>Ppwhite: /* Install MaCH */</p>
<hr />
<div>== How to speed up? ==<br />
<br />
=== minimac ===<br />
<br />
This is the new 2-step procedure we recommend, particularly for people who perform imputation multiple times (using HapMap as the reference, or using updated releases of the 1000 Genomes data as the reference). <br><br />
The first step is a pre-phasing step using MaCH. This step does not need an external reference. It is time-consuming BUT is a one-time investment. For computational reasons, we recommend breaking the genome into small overlapping segments ( [http://genome.sph.umich.edu/wiki/MaCH_FAQ#Divide_and_Conquer Divide-and-Conquer]) for this step. In general, we recommend a >500Kb overlapping region on each side. For example, for the Affymetrix 6.0 panel, a core region of 10Mb with a flanking/overlapping region of 1Mb on each side corresponds to ~3500 SNPs in the core and ~350 SNPs on each side. For 2,000 individuals, one such job with ~4,200 SNPs run with --states 200 and -r 50 takes ~40 hours. For other combinations, use the following link to estimate computing time: [http://csg.sph.umich.edu//yli/MaCH-Admix/runtime.php#est runtime estimate]. <br><br />
The second step is the actual imputation step using minimac. This step can run on whole chromosomes. Regarding computing time, one million markers for 1,000 individuals using 100 reference haplotypes take ~1 hour, and computing time increases linearly in each of these three parameters. See [http://genome.sph.umich.edu/wiki/Minimac minimac] for details.<br />
<br />
=== MaCH-Admix ===<br />
<br />
If you are doing imputation only a few (<5) times (think twice about whether this is really true) or are under immediate time pressure, you can use MaCH-Admix, which does not require pre-phased data and takes ~1/7 of the computing time typically needed for pre-phasing. For large datasets, we recommend breaking the genome into small overlapping segments ( [http://genome.sph.umich.edu/wiki/MaCH_FAQ#Divide_and_Conquer Divide-and-Conquer]). See [http://www.unc.edu/~yunmli/MaCH-Admix/ MaCH-Admix] for details.<br />
<br />
=== Divide and Conquer ===<br />
See [http://genome.sph.umich.edu/wiki/Mach_DAC MaCH Divide and Conquer] for details.<br />
<br />
=== 2-step imputation ===<br />
See [http://genome.sph.umich.edu/wiki/MaCH_FAQ#Why_and_how_to_perform_a_2-step_imputation.3F 2-step imputation] for details. <br />
<br />
== Why and how to perform a 2-step imputation? ==<br />
<br />
When one has a large number of individuals (&gt;1000), we recommend a 2-step imputation to speed up. <br> <br />
<br />
&nbsp;&nbsp;&nbsp;&nbsp; A 2-step imputation contains the following 2 steps:<br> <br />
<br />
&nbsp;&nbsp;&nbsp; (step 1) a representative subset of &gt;= 200 unrelated individuals is used to calibrate model parameters; and<br>&nbsp;&nbsp;&nbsp; (step 2) actual genotype imputation is performed for every person using the parameters inferred in step 1. <br> <br />
<br />
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; Example command lines for a 2-step imputation:<br> <br />
<br />
# step 1:<br />
mach1 -d sample.dat -p subset.ped -s chr20.snps -h chr20.hap --compact --greedy --autoFlip -r 100 -o par_infer &gt; mach.infer.log<br />
<br />
# step 2:<br />
mach1 -d sample.dat -p sample.ped -s chr20.snps -h chr20.hap --compact --greedy --autoFlip --errorMap par_infer.erate --crossoverMap par_infer.rec --mle --mldetails &gt; mach.imp.log<br />
<br />
In step 1, one can use --greedy in combination with --states XX in MaCH versions 16.b and above. We have found that using 1/3 of the reference haplotypes (with ~1/9 of the computational time) results in almost no power loss for the current HapMap and 1000G reference panels.<br><br />
<br />
In step 2, each individual is imputed independently, so the work can be split into as many as n (sample size) jobs per chromosome for parallelism.<br />
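A sketch of that parallelization in shell (the pedigree contents, file names, and the commented-out mach1 options are illustrative; adapt them to your own step-1 output):<br />

```shell
# Split a toy pedigree into single-individual files, one imputation job each.
cd "$(mktemp -d)"
printf 'IND%s IND%s 0 0 1 A/A G/G\n' 1 1 2 2 3 3 > sample.ped
split -l 1 sample.ped piece.

ls piece.* | wc -l   # 3 one-line pedigrees: piece.aa piece.ab piece.ac
# Each piece would then be imputed independently, e.g.:
# for p in piece.*; do
#   mach1 -d sample.dat -p "$p" -s chr20.snps -h chr20.hap \
#         --errorMap par_infer.erate --crossoverMap par_infer.rec --mle &
# done
# wait
```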
<br />
For other approaches to speeding things up, see the ''How to speed up?'' section above.<br />
<br />
== Can MaCH perform imputation for chromosome X? ==<br />
Yes. See [http://genome.sph.umich.edu/wiki/MaCH:_machX MaCH X Chromosome] for details.<br />
<br />
== Where can I find combined HapMap reference files? ==<br />
<br />
You can find them at http://csg.sph.umich.edu//yli/mach/download/HapMap-r21.html or on the HapMap Project website.<br />
<br />
== Where can I find HapMap III / 1000 Genomes reference files? ==<br />
<br />
You can find these at the MaCH download page, which is at http://csg.sph.umich.edu//yli/mach/download/<br />
<br />
== Does --mle overwrite input genotypes? ==<br />
<br />
Yes, but not often. The --mle option outputs the most likely genotype configuration, taking into account the observed genotypes and integrating over the most similar reference haplotypes. An original genotype will be changed only if the underlying reference haplotypes strongly contradict it. <br />
<br />
== How do I get imputation quality estimates? ==<br />
<br />
A simple approach is to use the --mask option (in the second step alone, if using two-step imputation). For example, --mask 0.02 masks 2% of the genotypes at random, imputes them, and compares the imputed values with the masked originals to estimate genotypic and allelic error rates. Messages like the following will be written to stdout: <br />
<br />
Comparing 948352 masked genotypes with MLE estimates ...<br />
Estimated per genotype error rate is 0.0568<br />
Estimated per allele error rate is 0.0293 <br />
<br />
A better approach is to mask a small proportion of SNPs (rather than individual genotypes, as in the simple approach above). One can generate a mask.dat from the original .dat file by simply changing the flag of a subset of markers from M to S2, without duplicating the .ped file. Post-imputation, one can use [http://genome.sph.umich.edu/wiki/CalcMatch CalcMatch] and [http://www.sph.umich.edu/csg/ylwtx/doseR2.tgz doseR2.pl] to estimate the genotypic/allelic error rate and the correlation, respectively. Both programs can be downloaded from [http://www.sph.umich.edu/csg/ylwtx/software.html http://www.sph.umich.edu/csg/ylwtx/software.html]. <br />
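A minimal sketch of the M-to-S2 edit, using a made-up .dat file and masking every 10th marker (the marker names and the choice of subset are illustrative):<br />

```shell
# Build mask.dat from a toy .dat file: flip the flag of every 10th
# marker from M to S2 so those SNPs are hidden during imputation.
cd "$(mktemp -d)"
printf 'M SNP%s\n' $(seq 1 20) > orig.dat
awk 'NR % 10 == 0 { $1 = "S2" } { print }' orig.dat > mask.dat

grep -c '^S2' mask.dat   # 2 markers masked, 18 left as M
```

The .ped file is left untouched; only the flags in the .dat file change.<br />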
<br />
'''Warning''': Imputation involving masked datasets should be performed separately for imputation quality estimation. For production, one should use all available information.<br />
<br />
== How do I interpret the imputation quality estimates? ==<br />
In the simple approach, you will only get concordance/error estimates. There are two aspects to check. (1) The ratio between the genotypic and allelic error rates: we expect only a small proportion of errors in which one homozygote is imputed as the other homozygote, so a ~2:1 ratio is expected. (2) The absolute error rate: several factors influence imputation quality, including the population to be imputed, the reference population, and the genotyping panel used. Typically, we expect a <2% allelic error rate among Caucasians and East Asians, and 3-5% among Africans and African Americans. The figure below shows imputation quality from the Human Genome Diversity Project (HGDP) for 52 populations across the world, by HapMap reference panel.<br />
<br />
http://csg.sph.umich.edu//yli/figure3.gif<br />
<br />
Table 3 in the MaCH 1.0 paper tabulates imputation quality by commercial panel in CEU, YRI, and CHB+JPT.<br />
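As a quick check of the ~2:1 expectation, one can take the genotypic and allelic error rates that --mask reports (here the example values 0.0568 and 0.0293 from the imputation-quality question above); a ratio near 2 means most errors involve a single wrong allele.<br />

```shell
# Ratio of genotypic to allelic error rate; the two numbers are the
# example values printed by --mask elsewhere in this FAQ.
ratio=$(awk 'BEGIN { printf "%.2f", 0.0568 / 0.0293 }')
echo "genotypic:allelic error ratio = $ratio"   # ~1.94, close to the expected 2
```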
<br />
== Shall I apply QC before or after imputation? If so, how? ==<br />
<br />
We strongly recommend QC both before and after imputation. Before imputation, we recommend the standard battery of QC filters including HWE, MAF (recommended cutoff is 1% for genotyping-based GWAS), completeness, Mendelian inconsistency etc. Post-imputation, we recommend Rsq 0.3 (which removes &gt;70% of poorly-imputed SNPs at the cost of &lt;0.5% well-imputed SNPs) and MAF of 1%. <br />
<br />
== How do I get reference files for a region of interest? ==<br />
<br />
Note that you do not need to extract regional pedigree files for your own samples, because SNPs present in the pedigree but not in the reference are automatically discarded. <br> 1. For HapMapII format, download haplotypes from http://csg.sph.umich.edu//ylwtx/HapMapForMach.tgz <br> 2. For MaCH format, you can do the following: <br />
<br />
*First, find the first and last SNP in the region you are interested in. Say "rsFIRST" and "rsLAST", defined according to position. <br />
*Then, under csh:<br />
@ first = `grep -nw rsFIRST orig.snps | cut -f1 -d ':'`<br />
@ last = `grep -nw rsLAST orig.snps | cut -f1 -d ':'`<br />
under bash:<br />
first=`grep -nw rsFIRST orig.snps | cut -f1 -d ':'`<br />
last=`grep -nw rsLAST orig.snps | cut -f1 -d ':'`<br />
<br />
*Then find out the field that contains the actual haplotypes, where alleles are separated by whitespace<br />
head -1 orig.hap | wc -w<br />
Note: if the haplotypes are gz compressed, do:<br />
zcat orig.hap.gz | head -1 | wc -w<br />
<br />
* Finally, extract the haplotype field and cut out the region. Say you got 3 from the wc -w command above; if you got a different number, replace the '''3''' below with it:<br />
<br />
awk '{print $'''3'''}' orig.hap | cut -c${first}-${last} &gt; region.hap<br />
<br />
Note: if the haplotypes are gz compressed, do:<br />
zcat orig.hap.gz | awk '{print $'''3'''}' | cut -c${first}-${last} &gt; region.hap<br />
<br />
The created reference files are in MaCH format. You do NOT need to turn on --hapmapFormat option.<br />
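Putting the steps above together, here is a runnable toy example (the rs IDs, haplotypes, and the field number 2 are all made up for illustration):<br />

```shell
# Toy MaCH-format reference: orig.snps lists markers one per line,
# orig.hap carries the concatenated alleles in field 2.
cd "$(mktemp -d)"
printf 'rs1\nrs2\nrs3\nrs4\nrs5\n' > orig.snps
printf 'HAP1 ACGTA\nHAP2 AAGTC\n' > orig.hap

# Line numbers of the first and last SNP of the region (bash syntax)
first=$(grep -nw rs2 orig.snps | cut -f1 -d ':')
last=$(grep -nw rs4 orig.snps | cut -f1 -d ':')

# head -1 orig.hap | wc -w gives 2 here, so the haplotypes are field 2
awk '{print $2}' orig.hap | cut -c${first}-${last} > region.hap
sed -n "${first},${last}p" orig.snps > region.snps

cat region.hap   # columns 2-4 of each haplotype
```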
<br />
== Do I always have to sort the pedigree file by marker position? ==<br />
<br />
If you use a reference set of haplotypes, you do not have to as long as the external reference is in correct order.<br />
<br />
== What if I specify ''--states R'' where ''R'' exceeds the maximum possible (2*number diploid individuals - 2 + number_haplotypes)? ==<br />
<br />
MaCH caps the number of states at the maximum possible value. <br />
<br />
== How is AL1 defined? Which allele dosage is .dose/.mldose counting? ==<br />
<br />
AL1 is an arbitrary allele; typically, it is the first allele read in the reference haplotypes. The earliest versions of MaCH (prior to April 2007) counted the expected number of copies of AL2, while more recent versions count the number of copies of AL1. You can find out which allele is counted by following the steps below. <br />
<br />
1. First, find the two alleles for one of the markers in your data:<br />
<br />
<source lang="text"><br />
prompt> head -2 mlinfo/chr21.mlinfo <br />
SNP Al1 Al2 Freq1 MAF Quality Rsq <br />
rs885550 2 4 0.9840 0.0160 0.9682 0.992<br />
</source> <br />
<br />
2. Second, check the dosage for a few individuals at this SNP:<br />
<br />
<source lang="text"><br />
prompt> head -3 mldose/chr21.mldose | cut -f3 -d ' ' <br />
1.962 <br />
1.000<br />
0.078<br />
</source> <br />
<br />
3. Finally, compare these dosages to genotypes:<br />
<br />
<source lang="text"><br />
prompt> head -1 mlgeno/chr21.mlgeno | cut -f3 -d ' ' <br />
2/2 <br />
2/4<br />
4/4<br />
</source> <br />
<br />
In this example, the first individual has a high dosage (1.962) and most likely genotype 2/2, while the last individual has a low dosage (0.078) and most likely genotype 4/4. The output therefore corresponds to a version of MaCH released after April 2007, which tallies allele 1 (AL1) counts. <br />
<br />
Note that, in the example above, .mldose could be replaced with .dose and .mlgeno with .geno. <br />
<br />
Based on the three files above, we've confirmed that the dosage is the number of AL1 copies. You only need to check one informative case (i.e., a dosage value close to 0 or 2), since the convention is consistent across all individuals and SNPs.<br />
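The three-file comparison can be automated. The sketch below recreates the example's values in throw-away files (in practice you would cut the corresponding columns out of your .mlinfo/.mldose/.mlgeno files, as shown above) and reports which allele each dosage tallies:<br />

```shell
cd "$(mktemp -d)"
printf '1.962\n1.000\n0.078\n' > dose.txt   # column 3 of the .mldose file
printf '2/2\n2/4\n4/4\n' > geno.txt         # column 3 of the .mlgeno file
al1=2                                       # Al1 from the .mlinfo file

# If the rounded dosage equals the number of Al1 copies in the most
# likely genotype, the dosage is counting AL1.
paste dose.txt geno.txt | while read dose geno; do
    copies=$(echo "$geno" | tr '/' '\n' | grep -cx "$al1")
    rounded=$(printf '%.0f' "$dose")
    [ "$copies" -eq "$rounded" ] && echo "AL1" || echo "AL2"
done > result.txt
cat result.txt   # AL1 on every line for a post-April-2007 MaCH
```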
<br />
== Can I use an unphased reference? ==<br />
<br />
Yes. You can either create pedigree (.ped) and data (.dat) files that include both the reference panel and your sample genotypes, or request that MaCH merge the appropriate files on the fly. <br />
<br />
For example, if you have: <br />
<br />
'''reference.dat''' <br />
<br />
M SNP1<br />
M SNP2<br />
M SNP3<br />
M SNP4<br />
M SNP5<br />
<br />
'''reference.ped''' <br />
<br />
REF1 REF1 0 0 1 A/C C/C G/G G/A A/A<br />
<br />
'''sample.dat''' <br />
<br />
M SNP1<br />
M SNP4 <br />
<br />
'''sample.ped''' <br />
<br />
1 1 0 0 1 A/A G/G<br />
<br />
You could create a combined data set as: <br />
<br />
'''comb.dat''' <br />
<br />
M SNP1<br />
M SNP2<br />
M SNP3<br />
M SNP4<br />
M SNP5<br />
<br />
'''comb.ped''' <br />
<br />
REF1 REF1 0 0 1 A/C C/C G/G G/A A/A <br />
1 1 0 0 1 A/A ./. ./. G/G ./. <br />
<br />
Equivalently, you could write -d reference.dat,sample.dat -p reference.ped,sample.ped on the command line and MaCH would merge both files ''on the fly''.<br />
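For illustration, the padding that such a merge performs can be reproduced with awk on the toy files above; this is only a sketch, and MaCH's own on-the-fly merge is the recommended route:<br />

```shell
cd "$(mktemp -d)"
printf 'M SNP1\nM SNP2\nM SNP3\nM SNP4\nM SNP5\n' > reference.dat
printf 'M SNP1\nM SNP4\n' > sample.dat
printf 'REF1 REF1 0 0 1 A/C C/C G/G G/A A/A\n' > reference.ped
printf '1 1 0 0 1 A/A G/G\n' > sample.ped

# comb.dat is just the reference marker list; comb.ped pads each sample
# individual with ./. at reference-only SNPs.
cp reference.dat comb.dat
cat reference.ped > comb.ped
awk '
  FNR == NR { refsnp[FNR] = $2; nref = FNR; next }   # reference.dat markers
  FILENAME == "sample.dat" { insmp[$2] = 1; next }   # sample.dat markers
  {                                                  # each sample.ped row
    out = $1 " " $2 " " $3 " " $4 " " $5; g = 5
    for (i = 1; i <= nref; i++)
      out = out " " (insmp[refsnp[i]] ? $(++g) : "./.")
    print out
  }' reference.dat sample.dat sample.ped >> comb.ped

cat comb.ped   # second line matches the comb.ped shown above
```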
<br />
== How big are the imputation output files? ==<br />
For 1,000 individuals with 8 million SNPs, the gz-compressed geno/dose/prob files take ~5GB, ~10GB, and ~15GB respectively.<br />
<br />
== How long does imputation take? ==<br />
<br />
The following factors/parameters affect computational time: <br />
<br />
#m: number of genotyped markers (markers in the .dat file)<br> <br />
#n: number of individuals<br> <br />
#h: number of reference haplotypes (determined by --greedy or --states; by default, h = 2*number diploid individuals - 2 + number_haplotypes)<br> <br />
#r: number of rounds (-r or --rounds; --mle corresponds to 1-2 rounds)<br />
<br />
Computational time increases linearly with m, n, and r, and quadratically with h. On our 3.0GHz Xeon machine, imputation with m=25K, n=250, h=120, and r=100 takes ~20 hours (25000*250*120^2*100/4.5e11 = 20). <br />
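The formula can be evaluated directly; note that the constant 4.5e11 is specific to the machine quoted above, so treat the result as a rough estimate:<br />

```shell
# Estimated hours = m * n * h^2 * r / 4.5e11 for the quoted example
hours=$(awk 'BEGIN { m = 25000; n = 250; h = 120; r = 100;
                     printf "%.0f", m * n * h^2 * r / 4.5e11 }')
echo "estimated runtime: ~${hours} hours"
```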
<br />
If you have a large number of individuals to impute (e.g., &gt;1,000), we recommend a 2-step imputation; see http://genome.sph.umich.edu/wiki/MaCH_FAQ#Why_and_how_to_perform_a_2-step_imputation.3F.<br />
<br />
== undefined symbol: gzopen64 ==<br />
If you see this message, you will need to re-compile the program. Type the following commands:<br />
<br />
make clear<br />
make all<br />
<br />
New executables mach1 and thunder will then be generated in the executables/ folder.<br />
<br />
== Install MaCH ==<br />
Source code is available through the MaCH download page: http://csg.sph.umich.edu//yli/mach/download/ <br><br />
<br />
== More questions? ==<br />
<br />
Email [mailto:yunli@med.unc.edu Yun Li] or [mailto:goncalo@umich.edu Goncalo Abecasis].</div>Ppwhitehttp://genome.sph.umich.edu/w/index.php?title=MaCH_FAQ&diff=14621MaCH FAQ2017-02-02T15:21:15Z<p>Ppwhite: /* How do I get reference files for an region of interest? */</p>
<hr />
<div>== How to speed up? ==<br />
<br />
=== minimac ===<br />
<br />
This is the new 2-step procedure we are recommending, particularly considering people that are performing imputation multiple times (using HapMap as reference, or using updated releases of the 1000 Genomes data as reference). <br><br />
The first step is a pre-phasing step using MaCH. This step does not need external reference. This is a time-consuming step BUT is a one-time investment. For computational reason, we recommend breaking the genome into small overlapping segments ( [http://genome.sph.umich.edu/wiki/MaCH_FAQ#Divide_and_Conquer Divide-and-Conquer]) for this step. In general, we recommend >500Kb overlapping region on each side. For example, for Affymetrix 6.0 panel, if we use core region of 10Mb and flanking/overlapping region of 1Mb on each side, it will correspond to ~3500 SNps in the core region and ~350 SNPs on each side. For 2000 individuals, one job with ~4,200 SNPs running with --states 200 and -r 50, this would take ~40 hours. For other combinations, using the following link to estimate computing time [http://csg.sph.umich.edu//yli/MaCH-Admix/runtime.php#est runtime estimate]. <br><br />
The second step is the actual imputation step using minimac. This step can run on whole chromosomes. Regarding computing time, one million markers for 1000 individuals using 100 reference haplotypes takes ~ 1 hour; and computing time increases linearly with all the above three parameters. See [http://genome.sph.umich.edu/wiki/Minimac minimac] for details.<br />
<br />
=== MaCH-Admix ===<br />
<br />
If you are doing imputation only a few (<5) times (think twice if this is really true) or under an immediate time pressure, you can use MaCH-Admix, which does not require pre-phased data and takes ~1/7 of the computing time of that typically needed for pre-phasing. For large dataset, we recommend breaking the genome into small overlapping segments ( [http://genome.sph.umich.edu/wiki/MaCH_FAQ#Divide_and_Conquer Divide-and-Conquer]). Details see [http://www.unc.edu/~yunmli/MaCH-Admix/ MaCH-Admix].<br />
<br />
=== Divide and Conquer ===<br />
See [http://genome.sph.umich.edu/wiki/Mach_DAC MaCH Divide and Conquer] for details.<br />
<br />
=== 2-step imputation ===<br />
See [http://genome.sph.umich.edu/wiki/MaCH_FAQ#Why_and_how_to_perform_a_2-step_imputation.3F 2-step imputation] for details. <br />
<br />
== Why and how to perform a 2-step imputation? ==<br />
<br />
When one has a large number of individuals (&gt;1000), we recommend a 2-step imputation to speed up. <br> <br />
<br />
&nbsp;&nbsp;&nbsp;&nbsp; A 2-step imputation contains the following 2 steps:<br> <br />
<br />
&nbsp;&nbsp;&nbsp; (step 1) a representative subset of &gt;= 200 unrelated individuals are used to calibrate model parameters; and<br>&nbsp;&nbsp;&nbsp; (step 2) actual genotype imputation is performed for every person using parameters inferred in step 1. <br> <br />
<br />
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; Example command lines for a 2-step imputation:<br> <br />
<br />
# step 1:<br />
mach1 -d sample.dat -p subset.ped -s chr20.snps -h chr20.hap --compact --greedy --autoFlip -r 100 -o par_infer &gt; mach.infer.log<br />
<br />
# step 2:<br />
mach1 -d sample.dat -p sample.ped -s chr20.snps -h chr20.hap --compact --greedy --autoFlip --errorMap par_infer.erate --crossoverMap par_infer.rec --mle --mldetails &gt; mach.imp.log<br />
<br />
In step1, one can use --greedy in combination with --states XX in MaCH versions 16.b and above. We have found that using 1/3 of the reference haplotypes (with 1/9 computational time) results in almost no power loss for the current HapMap and 1000G reference panels.<br><br />
<br />
In step2, each individual is imputed independently and can therefore be split into as many as n (sample size) jobs for each chromosome for parallelism.<br />
<br />
For other approaches to speed up, see [how to speed up].<br />
<br />
== Can MaCH perform imputation for chromosome X? ==<br />
Yes. See [http://genome.sph.umich.edu/wiki/MaCH:_machX MaCH X Chromosome] for details.<br />
<br />
== Where can I find combined HapMap reference files? ==<br />
<br />
You can find them at http://csg.sph.umich.edu//yli/mach/download/HapMap-r21.html or on the HapMap Project website.<br />
<br />
== Where can I find HapMap III / 1000 Genomes reference files? ==<br />
<br />
You can find these at the MaCH download page, which is at http://csg.sph.umich.edu//yli/mach/download/<br />
<br />
== Does --mle overwrite input genotypes? ==<br />
<br />
Yes, but not often. The --mle option outputs the most likely genotype configuration taking into account observed genotypes and integration over the most similar reference haplotypes. The original genotypes will be changed only if the underlying reference haplotypes strongly contradict the input genotype. <br />
<br />
== How do I get imputation quality estimates? ==<br />
<br />
A simple approach is to use --mask option (in the second step alone if using two-step imputation). For example, --mask 0.02 masks 2% of the genotypes at random, impute them and compare with the masked original to estimate genotypic and allelic error rates. Messages like the following will be generated to stdout: <br />
<br />
Comparing 948352 masked genotypes with MLE estimates ...<br />
Estimated per genotype error rate is 0.0568<br />
Estimated per allele error rate is 0.0293 <br />
<br />
A better approach is to mask a small proportion of SNPs (vs. genotypes in the above simple approach). One can generate a mask.dat from the original .dat file by simply changing the flag of a subset of markers from M to S2 without duplicating the .ped file. Post-imputation, one can use&nbsp;&nbsp; [http://genome.sph.umich.edu/wiki/CalcMatch CalcMatch ]and [http://www.sph.umich.edu/csg/ylwtx/doseR2.tgz doseR2.pl ]to estimate genotypic/allelic error rate and correlation respectively. Both programs can be downloaded from [http://www.sph.umich.edu/csg/ylwtx/software.html http://www.sph.umich.edu/csg/ylwtx/software.html]. <br />
<br />
'''Warning''': Imputation involving masked datasets should be performed separately for imputation quality estimation. For production, one should use all available information.<br />
<br />
== How do I interpret the imputation quality estimates? ==<br />
In the simple approach, you will only get concordance/error estimates. There are two aspects to check. (1) the ratio between the genotypic error and allelic error. We expect that only a small proportion of errors where one homozygote is imputed as the other homozygote. Therefore, a ~2:1 ratio is expected. (2) the absolute error rate. There are several factors influencing imputation quality including the population to be imputed, the reference population and the genotyping panel used. Typically, we expect <2% allelic error rate among Caucasians and East Asians; 3-5% among Africans and African Americans. Figure below show imputation quality from the Human Genome Diversity Project (HGDP) for 52 populations across the world and by different HapMap reference panel.<br />
<br />
http://csg.sph.umich.edu//yli/figure3.gif<br />
<br />
Table 3 in the MaCH 1.0 paper tabulates imputation quality by commercial panel in CEU, YRI, and CHB+JPT.<br />
<br />
== Shall I apply QC before or after imputation? If so, how? ==<br />
<br />
We strongly recommend QC both before and after imputation. Before imputation, we recommend the standard battery of QC filters including HWE, MAF (recommended cutoff is 1% for genotyping-based GWAS), completeness, Mendelian inconsistency etc. Post-imputation, we recommend Rsq 0.3 (which removes &gt;70% of poorly-imputed SNPs at the cost of &lt;0.5% well-imputed SNPs) and MAF of 1%. <br />
<br />
== How do I get reference files for an region of interest? ==<br />
<br />
Note that you do not need to extract regional pedigree files for your own samples because SNPs in pedigree but not in reference will be automatically discarded. <br> 1. For HapMapII format, download haplotypes from http://csg.sph.umich.edu//ylwtx/HapMapForMach.tgz <br> 2. For MACH format, you can do the following: <br />
<br />
*First, find the first and last SNP in the region you are interested in. Say "rsFIRST" and "rsLAST", defined according to position. <br />
*Then, under csh:<br />
@ first = `grep -nw rsFIRST orig.snps | cut -f1 -d ':'`<br />
@ last = `grep -nw rsLAST orig.snps | cut -f1 -d ':'`<br />
under bash:<br />
first=`grep -nw rsFIRST orig.snps | cut -f1 -d ':'`<br />
last=`grep -nw rsLAST orig.snps | cut -f1 -d ':'`<br />
<br />
*Then find out the field that contains the actual haplotypes, where alleles are separated by whitespace<br />
head -1 orig.hap | wc -w<br />
Note: if the haplotypes are gz compressed, do:<br />
zcat orig.hap.gz | head -1 | wc -w<br />
<br />
* Finally (say you got 3 from the above wc -w command. If you got other numbers, replace the 3 in bold below with the number you got):<br />
<br />
awk '{print $'''3'''}' orig.hap | cut -c${first}-${last} &gt; region.hap<br />
<br />
Note: if the haplotypes are gz compressed, do:<br />
zcat orig.hap.gz | awk '{print $'''3'''}' | cut -c${first}-${last} &gt; region.hap<br />
<br />
The created reference files are in MaCH format. You do NOT need to turn on --hapmapFormat option.<br />
<br />
== Do I always have to sort the pedigree file by marker position? ==<br />
<br />
If you use a reference set of haplotypes, you do not have to as long as the external reference is in correct order.<br />
<br />
== What if I specify ''--states R'' where ''R'' exceeds the maximum possible (2*number diploid individuals - 2 + number_haplotypes)? ==<br />
<br />
Mach caps the number of states at the maximum possible value. <br />
<br />
== How is AL1 defined? Which allele dosage is .dose/.mldose counting? ==<br />
<br />
AL1 is an arbitrary allele. Typically, it is the first allele read in the reference haplotypes. The earliest versions (prior to April 2007) of mach counted the expected number copies of AL2 and more recent versions count the number of AL1. One can find out which allele is counted following the steps below. <br />
<br />
#. First, find the two alleles for one of the markers in your data<br />
<br />
<source lang="text"><br />
prompt> head -2 mlinfo/chr21.mlinfo <br />
SNP Al1 Al2 Freq1 MAF Quality Rsq <br />
rs885550 2 4 0.9840 0.0160 0.9682 0.992<br />
</source> <br />
<br />
#. Second, check the dosage for a few individuals at this SNP.<br />
<br />
<source lang="text"><br />
prompt> head -3 mldose/chr21.mldose | cut -f3 -d ' ' <br />
1.962 <br />
1.000<br />
0.078<br />
</source> <br />
<br />
#. Finally, compare these dosages to genotypes.<br />
<br />
<source lang="text"><br />
prompt> head -1 mlgeno/chr21.mlgeno | cut -f3 -d ' ' <br />
2/2 <br />
2/4<br />
4/4<br />
</source> <br />
<br />
In this example, you can see that the first individual has a high dosage count (1.962) and most likely genotype 2/2. The last individual has a low dosage count and most likely genotype 4/4. Thus, the output corresponds to version of Mach released after April 2007, which should tally allele 1 counts. <br />
<br />
Note that, on the example above, .mldose could be replaced with .dose and .mlgeno could be replaced with .geno. <br />
<br />
Based on the three files above, we've confirmed that dosage is the number of AL1 copies: you will only to check for one informative case (i.e, dosage values close to 0 or 2) since it's consistent across all individuals and all SNPs.<br />
<br />
== Can I used an unphased reference? ==<br />
<br />
Yes. You could create pedigree (.ped) and data files (.dat) that include both reference panel and sample genotypes or request that MaCH merge apppropriate files on the fly. <br />
<br />
For example, if you have: <br />
<br />
'''reference.dat''' <br />
<br />
M SNP1<br />
M SNP2<br />
M SNP3<br />
M SNP4<br />
M SNP5<br />
<br />
'''reference.ped''' <br />
<br />
REF1 REF1 0 0 1 A/C C/C G/G G/A A/A<br />
<br />
'''sample.dat''' <br />
<br />
M SNP1<br />
M SNP4 <br />
<br />
'''sample.ped''' <br />
<br />
1 1 0 0 1 A/A G/G<br />
<br />
You could create a combined data set as: <br />
<br />
'''comb.dat''' <br />
<br />
M SNP1<br />
M SNP2<br />
M SNP3<br />
M SNP4<br />
M SNP5<br />
<br />
'''comb.ped''' <br />
<br />
REF1 REF1 0 0 1 A/C C/C G/G G/A A/A <br />
1 1 0 0 1 A/A ./. ./. G/G ./. <br />
<br />
Equivalently, you could write -d reference.dat,sample.dat -p reference.ped,sample.ped on the command line and MaCH would merge both files ''on the fly''.<br />
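The manual merge above can also be scripted. The sketch below is not part of MaCH; the file names come from the example, and the demo inputs are recreated inline. It rebuilds comb.dat and comb.ped, padding each reference marker absent from sample.dat with ./..<br />

```shell
# Recreate the demo input files from the example above.
printf 'M SNP1\nM SNP2\nM SNP3\nM SNP4\nM SNP5\n' > reference.dat
printf 'REF1 REF1 0 0 1 A/C C/C G/G G/A A/A\n'   > reference.ped
printf 'M SNP1\nM SNP4\n'                         > sample.dat
printf '1 1 0 0 1 A/A G/G\n'                      > sample.ped

# comb.dat is simply the reference marker list.
cp reference.dat comb.dat

# comb.ped: reference rows as-is, then sample rows with ./. inserted
# for every reference marker that is absent from sample.dat.
cp reference.ped comb.ped
awk 'NR == FNR { ref[FNR] = $2; m = FNR; next }        # pass 1: reference.dat
     FILENAME == "sample.dat" { have[$2] = ++k; next } # pass 2: sample markers
     { printf "%s %s %s %s %s", $1, $2, $3, $4, $5     # pass 3: sample.ped rows
       for (i = 1; i <= m; i++)
         printf " %s", ((ref[i] in have) ? $(5 + have[ref[i]]) : "./.")
       print "" }' reference.dat sample.dat sample.ped >> comb.ped
```

The appended sample row comes out as `1 1 0 0 1 A/A ./. ./. G/G ./.`, matching the comb.ped shown above.<br />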
<br />
== How big are the imputation output files? ==<br />
For 1,000 individuals with 8 million SNPs, the gzip-compressed geno/dose/prob files take roughly 5 GB, 10 GB, and 15 GB respectively.<br />
<br />
== How long does imputation take? ==<br />
<br />
The following factors/parameters affect computational time: <br />
<br />
# m, the number of genotyped markers (markers in the .dat file)<br> <br />
# n, the number of individuals<br> <br />
# h, the number of reference haplotypes (determined by --greedy or --states; by default, h = 2 * number of diploid individuals - 2 + number of reference haplotypes)<br> <br />
# r, the number of rounds (-r or --rounds; --mle corresponds to 1-2 rounds)<br />
<br />
Computational time increases linearly with m, n, and r, and quadratically with h. On our 3.0 GHz Xeon machine, imputation with m=25K, n=250, h=120, and r=100 takes ~20 hours (25000*250*120^2*100 / 4.5e11). <br />
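The worked example above can be wrapped in a one-liner. This is only a back-of-the-envelope sketch: `est_hours` is a hypothetical helper, not a MaCH tool, and the 4.5e11 constant is specific to the 3.0 GHz Xeon quoted above.<br />

```shell
# Estimate MaCH runtime in hours from the rule of thumb above:
# time ~ m * n * h^2 * r / 4.5e11 (constant calibrated on a 3.0 GHz Xeon).
est_hours() {
  awk -v m="$1" -v n="$2" -v h="$3" -v r="$4" \
      'BEGIN { printf "%.1f\n", m * n * h * h * r / 4.5e11 }'
}

est_hours 25000 250 120 100   # the worked example above: prints 20.0
```

Doubling the number of states (h) quadruples the estimate, which is why reducing --states is the single most effective speed-up.<br />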
<br />
If you have a larger number of individuals to impute (e.g., > 1,000), we recommend a 2-step imputation; see http://genome.sph.umich.edu/wiki/MaCH_FAQ#Why_and_how_to_perform_a_2-step_imputation.3F.<br />
<br />
== undefined symbol: gzopen64 ==<br />
If you see this message, you will need to re-compile the program. Type the following commands:<br />
<br />
make clear<br />
make all<br />
<br />
New executables mach1 and thunder will then be generated in the executables/ folder.<br />
<br />
== Install MaCH ==<br />
Source code is available through the MaCH download page: http://www.sph.umich.edu/csg/yli/mach/download/ <br><br />
<br />
== More questions? ==<br />
<br />
Email [mailto:yunli@med.unc.edu Yun Li] or [mailto:goncalo@umich.edu Goncalo Abecasis].</div>
<hr />
<div>== How to speed up? ==<br />
<br />
=== minimac ===<br />
<br />
This is the new 2-step procedure we are recommending, particularly considering people that are performing imputation multiple times (using HapMap as reference, or using updated releases of the 1000 Genomes data as reference). <br><br />
The first step is a pre-phasing step using MaCH. This step does not need external reference. This is a time-consuming step BUT is a one-time investment. For computational reason, we recommend breaking the genome into small overlapping segments ( [http://genome.sph.umich.edu/wiki/MaCH_FAQ#Divide_and_Conquer Divide-and-Conquer]) for this step. In general, we recommend >500Kb overlapping region on each side. For example, for Affymetrix 6.0 panel, if we use core region of 10Mb and flanking/overlapping region of 1Mb on each side, it will correspond to ~3500 SNps in the core region and ~350 SNPs on each side. For 2000 individuals, one job with ~4,200 SNPs running with --states 200 and -r 50, this would take ~40 hours. For other combinations, using the following link to estimate computing time [http://csg.sph.umich.edu//yli/MaCH-Admix/runtime.php#est runtime estimate]. <br><br />
The second step is the actual imputation step using minimac. This step can run on whole chromosomes. Regarding computing time, imputing one million markers for 1,000 individuals using 100 reference haplotypes takes ~1 hour, and computing time increases linearly with each of these three parameters. See [http://genome.sph.umich.edu/wiki/Minimac minimac] for details.<br />
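As a rough sketch of the divide-and-conquer layout described above, the following prints the per-job segment boundaries for a hypothetical 50 Mb chromosome, using the 10 Mb core / 1 Mb flank sizes from the Affymetrix 6.0 example (all lengths here are illustrative):

```shell
# Hypothetical chromosome length; core/flank sizes from the example above.
# Each printed range would be run as one pre-phasing job.
CHR_LEN=50000000 CORE=10000000 FLANK=1000000
start=1
while [ "$start" -le "$CHR_LEN" ]; do
  end=$((start + CORE - 1))
  [ "$end" -gt "$CHR_LEN" ] && end=$CHR_LEN
  lo=$((start - FLANK)); [ "$lo" -lt 1 ] && lo=1
  hi=$((end + FLANK));   [ "$hi" -gt "$CHR_LEN" ] && hi=$CHR_LEN
  echo "core ${start}-${end}, with flanks ${lo}-${hi}"
  start=$((end + 1))
done
```

In practice one would then extract per-segment marker lists for these base-pair ranges before running MaCH on each segment.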
<br />
=== MaCH-Admix ===<br />
<br />
If you are doing imputation only a few (<5) times (think twice whether this is really true) or are under immediate time pressure, you can use MaCH-Admix, which does not require pre-phased data and takes ~1/7 of the computing time typically needed for pre-phasing. For large datasets, we recommend breaking the genome into small overlapping segments ( [http://genome.sph.umich.edu/wiki/MaCH_FAQ#Divide_and_Conquer Divide-and-Conquer]). See [http://www.unc.edu/~yunmli/MaCH-Admix/ MaCH-Admix] for details.<br />
<br />
=== Divide and Conquer ===<br />
See [http://genome.sph.umich.edu/wiki/Mach_DAC MaCH Divide and Conquer] for details.<br />
<br />
=== 2-step imputation ===<br />
See [http://genome.sph.umich.edu/wiki/MaCH_FAQ#Why_and_how_to_perform_a_2-step_imputation.3F 2-step imputation] for details. <br />
<br />
== Why and how to perform a 2-step imputation? ==<br />
<br />
When one has a large number of individuals (&gt;1000), we recommend a 2-step imputation to speed up. <br> <br />
<br />
&nbsp;&nbsp;&nbsp;&nbsp; A 2-step imputation contains the following 2 steps:<br> <br />
<br />
&nbsp;&nbsp;&nbsp; (step 1) a representative subset of &gt;= 200 unrelated individuals is used to calibrate model parameters; and<br>&nbsp;&nbsp;&nbsp; (step 2) actual genotype imputation is performed for every person using the parameters inferred in step 1. <br> <br />
<br />
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; Example command lines for a 2-step imputation:<br> <br />
<br />
# step 1:<br />
mach1 -d sample.dat -p subset.ped -s chr20.snps -h chr20.hap --compact --greedy --autoFlip -r 100 -o par_infer &gt; mach.infer.log<br />
<br />
# step 2:<br />
mach1 -d sample.dat -p sample.ped -s chr20.snps -h chr20.hap --compact --greedy --autoFlip --errorMap par_infer.erate --crossoverMap par_infer.rec --mle --mldetails &gt; mach.imp.log<br />
<br />
In step 1, one can use --greedy in combination with --states XX in MaCH versions 16.b and above. We have found that using 1/3 of the reference haplotypes (with 1/9 of the computational time) results in almost no power loss for the current HapMap and 1000G reference panels.<br><br />
<br />
In step 2, each individual is imputed independently; imputation can therefore be split into as many as n (sample size) jobs per chromosome for parallelism.<br />
<br />
For other approaches to speed up, see [http://genome.sph.umich.edu/wiki/MaCH_FAQ#How_to_speed_up.3F How to speed up?].<br />
<br />
== Can MaCH perform imputation for chromosome X? ==<br />
Yes. See [http://genome.sph.umich.edu/wiki/MaCH:_machX MaCH X Chromosome] for details.<br />
<br />
== Where can I find combined HapMap reference files? ==<br />
<br />
You can find them at http://csg.sph.umich.edu//yli/mach/download/HapMap-r21.html or on the HapMap Project website.<br />
<br />
== Where can I find HapMap III / 1000 Genomes reference files? ==<br />
<br />
You can find these at the MaCH download page, which is at http://www.sph.umich.edu/csg/yli/mach/download/<br />
<br />
== Does --mle overwrite input genotypes? ==<br />
<br />
Yes, but rarely. The --mle option outputs the most likely genotype configuration, taking into account the observed genotypes and integrating over the most similar reference haplotypes. An original genotype will be changed only if the underlying reference haplotypes strongly contradict the input genotype. <br />
<br />
== How do I get imputation quality estimates? ==<br />
<br />
A simple approach is to use the --mask option (in the second step alone if using two-step imputation). For example, --mask 0.02 masks 2% of the genotypes at random, imputes them, and compares them with the masked originals to estimate genotypic and allelic error rates. Messages like the following will be generated to stdout: <br />
<br />
Comparing 948352 masked genotypes with MLE estimates ...<br />
Estimated per genotype error rate is 0.0568<br />
Estimated per allele error rate is 0.0293 <br />
<br />
A better approach is to mask a small proportion of SNPs (rather than genotypes, as in the simple approach above). One can generate a mask.dat from the original .dat file by simply changing the flag of a subset of markers from M to S2, without duplicating the .ped file. Post-imputation, one can use [http://genome.sph.umich.edu/wiki/CalcMatch CalcMatch] and [http://www.sph.umich.edu/csg/ylwtx/doseR2.tgz doseR2.pl] to estimate the genotypic/allelic error rate and correlation, respectively. Both programs can be downloaded from [http://www.sph.umich.edu/csg/ylwtx/software.html http://www.sph.umich.edu/csg/ylwtx/software.html]. <br />
<br />
'''Warning''': Imputation involving masked datasets should be performed separately for imputation quality estimation. For production, one should use all available information.<br />
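A minimal sketch of the M-to-S2 flag change described above, using a toy .dat file (the file names and the masking rate here are illustrative; use your real .dat and a rate around 0.02):

```shell
# Toy .dat file: flag in column 1, marker name in column 2
printf 'M SNP1\nM SNP2\nM SNP3\nM SNP4\nM SNP5\n' > orig.dat
# Flip a random subset of M flags to S2 (40% here so the toy shows changes;
# ~0.02 would mask 2% of SNPs in a real file)
awk 'BEGIN { srand(1) } $1 == "M" && rand() < 0.4 { $1 = "S2" } { print }' \
    orig.dat > mask.dat
```

The .ped file is left untouched; only the flags in the .dat file change.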
<br />
== How do I interpret the imputation quality estimates? ==<br />
In the simple approach, you will only get concordance/error estimates. There are two aspects to check. (1) The ratio between the genotypic and allelic error rates: we expect only a small proportion of errors in which one homozygote is imputed as the other homozygote, so a ~2:1 ratio is expected. (2) The absolute error rate: several factors influence imputation quality, including the population to be imputed, the reference population, and the genotyping panel used. Typically, we expect a <2% allelic error rate among Caucasians and East Asians, and 3-5% among Africans and African Americans. The figure below shows imputation quality from the Human Genome Diversity Project (HGDP) for 52 populations across the world, by HapMap reference panel.<br />
<br />
http://www.sph.umich.edu/csg/yli/figure3.gif<br />
<br />
Table 3 in the MaCH 1.0 paper tabulates imputation quality by commercial panel in CEU, YRI, and CHB+JPT.<br />
<br />
== Shall I apply QC before or after imputation? If so, how? ==<br />
<br />
We strongly recommend QC both before and after imputation. Before imputation, we recommend the standard battery of QC filters, including HWE, MAF (the recommended cutoff is 1% for genotyping-based GWAS), completeness, Mendelian inconsistency, etc. Post-imputation, we recommend an Rsq cutoff of 0.3 (which removes &gt;70% of poorly-imputed SNPs at the cost of &lt;0.5% of well-imputed SNPs) and a MAF cutoff of 1%. <br />
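On the post-imputation side, the Rsq/MAF filter can be sketched against a .mlinfo file (toy rows; the column layout SNP Al1 Al2 Freq1 MAF Quality Rsq matches the .mlinfo example elsewhere on this page):

```shell
# Toy .mlinfo file
printf 'SNP Al1 Al2 Freq1 MAF Quality Rsq
rs1 1 2 0.98 0.02 0.97 0.95
rs2 1 2 0.999 0.001 0.99 0.99
rs3 1 2 0.70 0.30 0.60 0.10
' > chr21.mlinfo
# Keep the header plus SNPs with MAF >= 0.01 (col 5) and Rsq >= 0.3 (col 7)
awk 'NR == 1 || ($5 + 0 >= 0.01 && $7 + 0 >= 0.3)' chr21.mlinfo > chr21.mlinfo.filtered
```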
<br />
== How do I get reference files for a region of interest? ==<br />
<br />
Note that you do not need to extract regional pedigree files for your own samples, because SNPs in the pedigree but not in the reference will be automatically discarded. <br> 1. For HapMapII format, download haplotypes from http://www.sph.umich.edu/csg/ylwtx/HapMapForMach.tgz <br> 2. For MaCH format, you can do the following: <br />
<br />
*First, find the first and last SNP in the region you are interested in. Say "rsFIRST" and "rsLAST", defined according to position. <br />
*Then, under csh:<br />
@ first = `grep -nw rsFIRST orig.snps | cut -f1 -d ':'`<br />
@ last = `grep -nw rsLAST orig.snps | cut -f1 -d ':'`<br />
under bash:<br />
first=`grep -nw rsFIRST orig.snps | cut -f1 -d ':'`<br />
last=`grep -nw rsLAST orig.snps | cut -f1 -d ':'`<br />
<br />
*Then find out which field contains the actual haplotypes (alleles separated by whitespace):<br />
head -1 orig.hap | wc -w<br />
Note: if the haplotypes are gz compressed, do:<br />
zcat orig.hap.gz | head -1 | wc -w<br />
<br />
* Finally (say you got 3 from the wc -w command above; if you got another number, replace the 3 in bold below with the number you got):<br />
<br />
awk '{print $'''3'''}' orig.hap | cut -c${first}-${last} &gt; region.hap<br />
<br />
Note: if the haplotypes are gz compressed, do:<br />
zcat orig.hap.gz | awk '{print $'''3'''}' | cut -c${first}-${last} &gt; region.hap<br />
<br />
The created reference files are in MaCH format. You do NOT need to turn on the --hapmapFormat option.<br />
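The recipe above can be run end-to-end in bash; a sketch with toy stand-ins for orig.snps and orig.hap (here the haplotype string is field 3, as determined by the wc -w step; extracting region.snps with sed is an extra convenience step not shown above):

```shell
# Toy inputs: one SNP name per row; one haplotype per row, alleles in field 3
printf 'rsA\nrsFIRST\nrsB\nrsLAST\nrsC\n' > orig.snps
printf 'H1 HAPLO1 ACGTA\nH2 HAPLO2 TTTTT\n' > orig.hap

# Line numbers of the boundary SNPs (here first=2, last=4)
first=$(grep -nw rsFIRST orig.snps | cut -f1 -d ':')
last=$(grep -nw rsLAST orig.snps | cut -f1 -d ':')

# Matching SNP rows, and the corresponding haplotype columns
sed -n "${first},${last}p" orig.snps > region.snps
awk '{ print $3 }' orig.hap | cut -c "${first}-${last}" > region.hap
```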
<br />
== Do I always have to sort the pedigree file by marker position? ==<br />
<br />
If you use a reference set of haplotypes, you do not have to as long as the external reference is in correct order.<br />
<br />
== What if I specify ''--states R'' where ''R'' exceeds the maximum possible (2*number of diploid individuals - 2 + number of reference haplotypes)? ==<br />
<br />
MaCH caps the number of states at the maximum possible value. <br />
<br />
== How is AL1 defined? Which allele dosage is .dose/.mldose counting? ==<br />
<br />
AL1 is an arbitrary allele. Typically, it is the first allele read in the reference haplotypes. The earliest versions of MaCH (prior to April 2007) counted the expected number of copies of AL2; more recent versions count the number of copies of AL1. One can find out which allele is counted by following the steps below. <br />
<br />
# First, find the two alleles for one of the markers in your data.<br />
<br />
<source lang="text"><br />
prompt> head -2 mlinfo/chr21.mlinfo <br />
SNP Al1 Al2 Freq1 MAF Quality Rsq <br />
rs885550 2 4 0.9840 0.0160 0.9682 0.992<br />
</source> <br />
<br />
# Second, check the dosage for a few individuals at this SNP.<br />
<br />
<source lang="text"><br />
prompt> head -3 mldose/chr21.mldose | cut -f3 -d ' ' <br />
1.962 <br />
1.000<br />
0.078<br />
</source> <br />
<br />
# Finally, compare these dosages to genotypes.<br />
<br />
<source lang="text"><br />
prompt> head -1 mlgeno/chr21.mlgeno | cut -f3 -d ' ' <br />
2/2 <br />
2/4<br />
4/4<br />
</source> <br />
<br />
In this example, you can see that the first individual has a high dosage count (1.962) and most likely genotype 2/2. The last individual has a low dosage count and most likely genotype 4/4. Thus, the output corresponds to a version of MaCH released after April 2007, which tallies allele 1 (AL1) counts. <br />
<br />
Note that, in the example above, .mldose could be replaced with .dose and .mlgeno could be replaced with .geno. <br />
<br />
Based on the three files above, we've confirmed that the dosage is the number of AL1 copies. You only need to check one informative case (i.e., a dosage value close to 0 or 2), since the convention is consistent across all individuals and all SNPs.<br />
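The three-step check can be scripted; a sketch using toy stand-ins for the dosage and genotype columns shown above (the file names are illustrative; in practice extract column 3 from your own .mldose/.mlgeno):

```shell
# Column 3 of .mldose and .mlgeno from the example above
printf '1.962\n1.000\n0.078\n' > dose.txt
printf '2/2\n2/4\n4/4\n' > geno.txt
# If dosage counts Al1 copies ("2" here), then round(dose) should equal
# the number of "2" alleles in the most likely genotype
paste dose.txt geno.txt | awk '{
  n = gsub(/2/, "&", $2)   # count copies of Al1 in the genotype
  print $1, $2, (n == int($1 + 0.5) ? "match" : "MISMATCH")
}'
```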
<br />
== Can I use an unphased reference? ==<br />
<br />
Yes. You could create pedigree (.ped) and data (.dat) files that include both reference panel and sample genotypes, or request that MaCH merge the appropriate files on the fly. <br />
<br />
For example, if you have: <br />
<br />
'''reference.dat''' <br />
<br />
M SNP1<br />
M SNP2<br />
M SNP3<br />
M SNP4<br />
M SNP5<br />
<br />
'''reference.ped''' <br />
<br />
REF1 REF1 0 0 1 A/C C/C G/G G/A A/A<br />
<br />
'''sample.dat''' <br />
<br />
M SNP1<br />
M SNP4 <br />
<br />
'''sample.ped''' <br />
<br />
1 1 0 0 1 A/A G/G<br />
<br />
You could create a combined data set as: <br />
<br />
'''comb.dat''' <br />
<br />
M SNP1<br />
M SNP2<br />
M SNP3<br />
M SNP4<br />
M SNP5<br />
<br />
'''comb.ped''' <br />
<br />
REF1 REF1 0 0 1 A/C C/C G/G G/A A/A <br />
1 1 0 0 1 A/A ./. ./. G/G ./. <br />
<br />
Equivalently, you could write -d reference.dat,sample.dat -p reference.ped,sample.ped on the command line and MaCH would merge both files ''on-the-fly''.<br />
<br />
== How big are the imputation output files? ==<br />
For 1,000 individuals with 8 million SNPs, the gz-compressed geno/dose/prob files take ~5 GB/10 GB/15 GB, respectively.<br />
<br />
== How long does imputation take? ==<br />
<br />
The following factors/parameters affect computational time: <br />
<br />
#m, # of genotyped markers (number of markers in .dat file)<br> <br />
#n, # of individuals<br> <br />
#h, # of reference haplotypes (determined by --greedy or --states; by default, h = 2*number of diploid individuals - 2 + number of reference haplotypes)<br> <br />
#r, # of rounds (-r or --rounds, --mle corresponds to 1-2 rounds)<br />
<br />
Computational time increases linearly with m, n, and r, and quadratically with h. On our Xeon 3.0GHz machine, imputation with m=25K, n=250, h=120, and r=100 takes ~20 hours (25000*250*120^2*100/4.5/10^11). <br />
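The rule of thumb can be evaluated directly; a sketch using the numbers from the example (the 4.5*10^11 divisor is the empirical constant implied above):

```shell
# hours ~= m * n * h^2 * r / 4.5e11
awk -v m=25000 -v n=250 -v h=120 -v r=100 \
    'BEGIN { printf "~%.1f hours\n", m * n * h * h * r / 4.5e11 }'
```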
<br />
If you have a large number of individuals to impute (e.g., > 1,000), we recommend a 2-step imputation; see http://genome.sph.umich.edu/wiki/MaCH_FAQ#Why_and_how_to_perform_a_2-step_imputation.3F.<br />
<br />
== undefined symbol: gzopen64 ==<br />
If you see this message, you will need to re-compile the program. Type the following commands:<br />
<br />
make clear<br />
make all<br />
<br />
New executables, mach1 and thunder, will then be generated in the executables/ folder.<br />
<br />
== Install MaCH ==<br />
Source code is available from the MaCH download page: http://www.sph.umich.edu/csg/yli/mach/download/ <br><br />
<br />
== More questions? ==<br />
<br />
Email [mailto:yunli@med.unc.edu Yun Li] or [mailto:goncalo@umich.edu Goncalo Abecasis].</div>Ppwhitehttp://genome.sph.umich.edu/w/index.php?title=TrioCaller:Archive&diff=14616TrioCaller:Archive2017-02-02T15:14:12Z<p>Ppwhite: </p>
<hr />
<div>Archive<br />
<br />
Log: add more sensitive sanity check <br />
<br />
Binary file only: [http://csg.sph.umich.edu//weich/TrioCaller.04242012.binary.tgz TrioCaller.04242012.binary.tgz]. <br />
<br />
Binary file with example datasets : [http://csg.sph.umich.edu//weich/TrioCaller.04242012.tgz TrioCaller.04242012.tgz].</div>Ppwhitehttp://genome.sph.umich.edu/w/index.php?title=RareSimu&diff=14615RareSimu2017-02-02T15:11:59Z<p>Ppwhite: /* Download */</p>
<hr />
<div>Genetic Model-based Simulator [GMS] is an efficient c++ program for simulating case control data sets based on genetic models. The input is a pool of haplotypes and a text file for model specification. <br />
The output is a set of simulated datasets in Merlin ped-file format. <br />
<br />
== Basic Usage Example ==<br />
<br />
In a typical command line, a few options need to be specified together with the input files. <br />
Here is an example of how GMS works:<br />
<br />
./GMS --hapfile test.hap --snplist test.lst --model model.heter.txt --f0 0.01 --nrep 100 --ncase 250 --nctrl 250 --causal --prefix tmp<br />
<br />
== Command Line Options ==<br />
<br />
=== Basic Options ===<br />
<br />
--hapfile a pool of simulated or real haplotypes, one chromosome per row<br />
--snplist SNP names in the order of haplotypes in the hapfile, one SNP per row<br />
--model a model file specifying genetic models, see below for details<br />
--nrep the number of replications<br />
--seed seed for random number generator<br />
--ncase the number of cases in each replicate<br />
--nctrl the number of controls in each replicate<br />
--f0 overall baseline prevalence<br />
--prefix prefix of output files (e.g. prefix.rep1.ped, prefix.rep2.ped)<br />
--causal only generate causal SNPs in the output pedigree file<br />
<br />
<br />
=== Model File Annotation ===<br />
The model file includes one header line followed by multiple rows. Each row corresponds to a set of SNPs with a desired frequency range and relative risk (RR) or odds ratio (OR).<br />
<br />
1. Heterogeneity Model<br />
<br />
a) COUNT FREQ_MIN FREQ_MAX RR1 RR2<br />
<br />
b) FRACTION FREQ_MIN FREQ_MAX RR1 RR2<br />
<br />
2. Logistic Model<br />
<br />
a) COUNT FREQ_MIN FREQ_MAX OR1 OR2<br />
<br />
b) FRACTION FREQ_MIN FREQ_MAX OR1 OR2<br />
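To make the annotation concrete, a hypothetical model.heter.txt under form 1a (the header column names are guessed from the row format above; the actual header contents may differ, so check the distributed examples):

```text
COUNT FREQ_MIN FREQ_MAX RR1 RR2
10 0.001 0.005 3.0 9.0
5 0.005 0.010 2.0 4.0
```

Read as: pick 10 causal SNPs with frequency in [0.001, 0.005) and 5 more in [0.005, 0.010), assuming RR1/RR2 are the one- and two-copy relative risks.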
<br />
== How It Works ==<br />
There are two underlying models. In both, disease status follows a Bernoulli distribution with success probability P, defined as follows. <br />
<br />
1. Heterogeneity Model<br />
<math> P(D | (AA,AA,...,AA)) = f_0 </math><br />
<br />
<math> P = \sum_{i=1}^N P(D|x_i) </math><br />
<br />
<br />
2. Logistic Model<br />
<br />
<math>logit(y) = \beta_0 + \sum_{i=1}^{N}\beta_i\times x_i</math><br />
<br />
<math> P = \frac{e^{\beta_0 + \sum_{i=1}^{N}\beta_i\times x_i}}{1+e^{\beta_0 + \sum_{i=1}^{N}\beta_i\times x_i}}</math><br />
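A numeric sketch of the logistic model, with purely hypothetical effect sizes and genotypes (the beta values here are illustrative and are not produced by GMS):

```shell
# P(D) = exp(eta) / (1 + exp(eta)),  eta = b0 + sum_i(b_i * x_i)
awk 'BEGIN {
  b0 = -4.6; b1 = 0.7; b2 = 1.1   # hypothetical intercept and log-OR effects
  x1 = 1;    x2 = 2               # risk-allele counts at two causal SNPs
  eta = b0 + b1 * x1 + b2 * x2
  printf "P(D) = %.4f\n", exp(eta) / (1 + exp(eta))
}'
```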
<br />
== Download ==<br />
<br />
The current version is available for download from http://csg.sph.umich.edu//weich/GMS.tar.gz<br />
<br />
== TODO ==<br />
1. Support quantitative traits.<br />
<br />
2. Support family structures.<br />
<br />
3. Support more "reasonable" models.</div>Ppwhitehttp://genome.sph.umich.edu/w/index.php?title=BamGenotypeCheck&diff=14614BamGenotypeCheck2017-02-02T15:11:17Z<p>Ppwhite: /* Download bamGenotypeCheck */</p>
<hr />
<div>{| style="width:100%; background:#FF8989; margin-top:1.2em; border:1px solid #ccc;" |<br />
| style="width:100%; text-align:center; white-space:nowrap; color:#000;" | <br />
<div style="font-size:162%; border:none; margin:0; padding:.1em; color:#000;">This tool has been DEPRECATED, and replaced by [[VerifyBamID]]</div><br />
|}<br />
<br />
'''bamGenotypeCheck''' is a program that verifies whether the reads in a particular file match previously known genotypes for an individual (or group of individuals).<br />
<br />
<br />
== Download bamGenotypeCheck ==<br />
<br />
To get a copy, go to the [http://csg.sph.umich.edu//pha/karma/download/ Karma Download] page.<br />
<br />
== Build bamGenotypeCheck ==<br />
<br />
Karma (which includes bamGenotypeCheck) is designed to be reasonably portable. <br />
<br />
However, since development occurs only on Ubuntu 9.10 and later, on x86 and x64 platforms, there are likely portability issues on other systems. <br />
<br />
We support Karma only on Ubuntu 9.10 and later on 64-bit processors.<br />
<br />
== Usage ==<br />
<br />
A key step in any genetic analysis is to verify whether data being generated matches expectations. This program checks whether reads in a BAM file match previous genotypes for a specific sample. <br />
<br />
Using a mathematical model that relates observed sequence reads to a hypothetical true genotype, bamGenotypeCheck tries to decide whether sequence reads match a particular individual or are more likely to be contaminated (including a small proportion of foreign DNA), derived from a closely related individual, or derived from a completely different individual.<br />
<br />
== Basic Usage Example ==<br />
<br />
Here is a typical command line:<br />
<br />
bamGenotypeCheck -r /data/local/ref/karma.ref/human.g1k.v37.fa \<br />
-k BAMfiles.txt -p test.ped -d test.dat -m test.map<br />
<br />
== Command Line Options ==<br />
<br />
=== Input Files ===<br />
<br />
-r ''genome reference in [http://en.wikipedia.org/wiki/Fasta_format simplified FASTA format]''<br />
-a ''allele Frequency file in [[MERLIN format]]''<br />
-p ''pedigree file in [[MERLIN format]]''<br />
-d ''data file in [[MERLIN format]]''<br />
-m ''map file in [[MERLIN format]]''<br />
<br />
-k ''a list of BAM files to check''<br />
-c [int] ''stop after reading [int] filtered sequence reads''<br />
-C [int] ''stop after reading [int] reads, filtered or not''<br />
<br />
=== Output Options ===<br />
<br />
-v ''verbose output''<br />
<br />
=== Filtering ===<br />
<br />
-b [int] ''exclude bases with quality less than [int]''<br />
-M [int] ''exclude reads with map quality less than [int]''<br />
-f [float] ''drop markers with minor allele frequency smaller than [float]''<br />
-F [int] ''set custom BAM flags filter (not implemented at the moment)''<br />
<br />
=== Other Options ===<br />
<br />
 -e [float] ''set minimum base error rate to [float]''<br />
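The -b, -M and -f filters above can be pictured as a single predicate applied per base; this Python sketch uses illustrative threshold values, not the program's defaults:<br />

```python
def keep_base(base_qual, map_qual, maf, min_base=20, min_map=30, min_maf=0.01):
    """Mirror of the -b / -M / -f filters: drop low-quality bases,
    poorly mapped reads, and markers with rare minor alleles.
    The threshold values here are illustrative only."""
    return base_qual >= min_base and map_qual >= min_map and maf >= min_maf

print(keep_base(30, 40, 0.05))  # True
print(keep_base(10, 40, 0.05))  # False: base quality below threshold
```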
<br />
== Principle of Operation ==<br />
<br />
Each read group in a BAM file is evaluated independently. This means that in a file with multiple read groups, problems will be flagged at the read group level (a plus). However, it also means that it might be hard to discern the correct assignment of read groups when there is very little data.<br />
<br />
For each aligned base that overlaps a known genotype, we calculate the probability that it was derived from that genotype. This comparison considers only bases that overlap previously known genotypes and that meet the base quality and mapping quality thresholds. <br />
<br />
Each individual in a pedigree has a different combination of genotypes, and bamGenotypeCheck will systematically search for the individual whose genotypes best match the observed read data.<br />
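As a rough illustration of how read data can favor one set of genotypes over another, here is a Python sketch of a per-base likelihood under a flat error-rate model; this is a simplified stand-in, not bamGenotypeCheck's exact model:<br />

```python
def base_likelihood(base, genotype, err=0.01):
    """P(observed base | genotype), assuming a flat base error rate err.
    A homozygote emits its allele with probability 1-err; a heterozygote
    emits either allele with probability ~0.5; errors are spread over
    the other three bases."""
    a1, a2 = genotype
    if a1 == a2:
        return 1.0 - err if base == a1 else err / 3.0
    if base in (a1, a2):
        return 0.5 * (1.0 - err) + 0.5 * (err / 3.0)
    return err / 3.0

def read_likelihood(bases, genotype, err=0.01):
    """Product of per-base likelihoods over the bases overlapping a site."""
    p = 1.0
    for b in bases:
        p *= base_likelihood(b, genotype, err)
    return p

# Reads full of 'A' fit genotype ('A','A') far better than ('C','C'):
print(read_likelihood("AAAA", ("A", "A")) > read_likelihood("AAAA", ("C", "C")))  # True
```

Scanning all individuals and keeping the one with the highest total likelihood is the systematic search described above.<br />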
<br />
For more about the technical details, see the page [[Verifying Sample Identities - Implementation]]<br />
<br />
== TODO ==</div>Ppwhitehttp://genome.sph.umich.edu/w/index.php?title=Karma&diff=14613Karma2017-02-02T15:10:34Z<p>Ppwhite: /* Download Karma */</p>
<hr />
<div><!-- BANNER ACROSS TOP OF PAGE --><br />
{| style="width:100%; background:#ffb6c1; margin-top:1.2em; border:1px solid #ccc;" |<br />
| style="width:100%; text-align:center; white-space:nowrap; color:#000;" | <br />
<div style="font-size:162%; border:none; margin:0; padding:.1em; color:#000;">KARMA is obsolete and not maintained</div><br />
|}<br />
<br />
<br />
'''K-tuple Alignment with Rapid Matching Algorithm''' <br />
<br />
Karma uses an existing reference to align short reads, such as those generated by Illumina sequencers. <br />
<br />
Primary features: <br />
<br />
#High performance, high sensitivity <br />
#Large and small gap detection by default <br />
#Multiple gaps per read by default <br />
#Single or paired end reads <br />
#No read length limit <br />
#Quality scores are used to assess quality of maps <br />
#All potential locations are examined exhaustively, none are omitted <br />
#Reasonable memory per CPU ratio on high core count machines<br />
<br />
The current version, 0.9.0, is optimized to rapidly map base space reads from Illumina sequencers. <br />
<br />
Color space and LS454 sequence alignments are not currently supported. These features will return in Karma 0.9.1. <br />
<br />
= Download Karma =<br />
<br />
To get a copy go to [http://csg.sph.umich.edu//pha/karma/download/ Karma Download]<br />
<br />
= Build Karma =<br />
<br />
Karma is designed to be reasonably portable. <br />
<br />
However, since development occurs only on Ubuntu 9.10 x86 and x64 platforms, there are likely other portability issues. <br />
<br />
We support Karma only on Ubuntu 9.10 on 64-bit processors. <br />
<br />
== Dependencies ==<br />
<br />
Karma requires that the following debian packages be installed on the host Linux machine: <br />
<br />
#libssl-dev <br />
#zlib1g-dev<br />
<br />
Without these installed, Karma will not build. <br />
<br />
== Building ==<br />
<br />
Assuming the karma tar file is named karma.tgz, do the following <br />
<br />
tar xvzf karma.tgz<br />
cd karma-0.9<br />
make<br />
mkdir ~/bin<br />
cp karma/karma ~/bin<br />
<br />
Alternatively, if you want to share the karma binary install it in /usr/local/bin/karma. <br />
<br />
== Testing the build ==<br />
<br />
To test karma, go to the build tree subdirectory named ''karma'', and type the command: <br />
<br />
make test<br />
<br />
The test script builds a reference for the small phiX genome, then runs single end as well as paired end alignments. It compares the results of that with known results. Differences are printed to the console, and currently look something like this: <br />
<pre>diff phiX.sam.good phiX.sam <br />
3c3<br />
&lt; @RG DT:2010-04-08T17:29Z ID:boingboing SM:NA12345<br />
---<br />
&gt; @RG DT:2010-04-08T18:13Z ID:boingboing SM:NA12345<br />
</pre> <br />
Any differences greater than that are an error and need to be fixed by the author. <br />
<br />
= Normal Workflow =<br />
<br />
Karma works using a set of index and hash files created from an existing reference. Once created, this set of reference index and hash files must always be specified in the command line when aligning reads. <br />
<br />
In concept, the simplest workflow is to first create a reference index using ''karma create'', then align reads using ''karma map''. You only have to build the index and hash once. <br />
<br />
Because the reference can be large, and because Karma will share the reference among many running instances of Karma, it is useful to put well known references in a common location readily accessible to you and your collaborators. <br />
<br />
= Build reference index and hash =<br />
<br />
Building a reference index and hash with Karma is straightforward, but because it is time consuming for longer genomes, you typically save the reference index between runs. <br />
<br />
The simplest example for creating a reference and index using a wordsize of 11-mer words is: <br />
<br />
karma create -i -w 11 phiX.fa<br />
<br />
More generally, three primary parameters are necessary for building a Karma reference index: <br />
<br />
#a boolean flag indicating base or color space <br />
#the index table word occurrence cutoff value <br />
#the word size<br />
<br />
Although the input reference is always expected to be base space and in [http://en.wikipedia.org/wiki/FASTA_format FASTA] format, the binary version of the reference, and the corresponding index and hash files, can be in either color space (ABI SOLiD) or base space (Illumina or LS454). For a given reference [http://en.wikipedia.org/wiki/FASTA_format FASTA] file, you may have either a color or base space binary reference, as well as either color or base space index/hash files, any in varying word sizes or occurrence cutoffs. <br />
<br />
Because the index and hash files are dependent on the occurrence cutoff parameter and the word size, the output files created by karma have those values in the file name. This allows you to create a variety of index/hash tables, depending on your expected use (ABI SOLiD, in particular, is sensitive to read length). <br />
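Based on the naming pattern described here (and the file-structure table later on this page), this Python sketch shows how the parameter-tagged file names might be assembled; treat the exact scheme as illustrative:<br />

```python
def reference_files(prefix, word_size=15, cutoff=5000, color_space=False):
    """Sketch of Karma's naming scheme: word size and occurrence cutoff
    are embedded in the index/hash file names (suffixes taken from the
    file-structure table on this page)."""
    space = "cs" if color_space else "bs"
    base = "%s-%s" % (prefix, space)
    tagged = "%s.%d.%d" % (base, word_size, cutoff)
    return {
        "reference":  base + ".umfa",
        "word_index": [tagged + ".umwiwp", tagged + ".umwihi"],
        "hash_left":  tagged + ".umwhl",
        "hash_right": tagged + ".umwhr",
    }

print(reference_files("NCBI37")["hash_left"])  # NCBI37-bs.15.5000.umwhl
```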
<br />
== Options for building reference index and hash ==<br />
<br />
-r ''reference'' Reference file in [http://en.wikipedia.org/wiki/FASTA_format FASTA] format<br />
-w ''word size'' Word size for index and hash (default 15, typically 10-16)<br />
-O ''occurrence cutoff'' Upper count of number of word positions to store in word positions table (default 5000)<br />
-c Creates a color space reference and index/hash<br />
-i Create the index and hash as well as the binary reference<br />
<br />
<br> <br />
<br />
= Aligning Reads =<br />
<br />
Aligning reads to the reference is easy: <br />
<br />
karma map -r phiX.fa -w 11 phiX.fastq<br />
<br />
or for paired reads: <br />
<br />
karma map -r phiX.fa -w 11 phiX-mate1.fastq phiX-mate2.fastq<br />
<br />
In both of the above examples, the -r option names the reference originally used to build the index/hash, and the -w 11 specifies that we are using the index/hash built for 11-mer words. Although you can use the default word size of 15 for phiX, the index is 4^15 * 4 = 4GBytes, so a shorter word size is prudent. <br />
<br />
Since Karma uses the word size and occurrence cutoff to help construct the actual index and hash filenames, you must specify them the same way you did when you created the reference index and hash. <br />
<br />
== Options for aligning reads ==<br />
<br />
-a [int] -&gt; maximum insert size<br />
-B [int] -&gt; max number of bases in millions<br />
-E -&gt; show reference bases (default off)<br />
-H -&gt; set SAM header line values (e.g. -H RG:SM:NA12345)<br />
-o -&gt; required output sam/bam file<br />
-O [int] -&gt; occurrence cutoff (default 5000)<br />
-q -&gt; quiet mode (no output except errors)<br />
-r [name] -&gt; required genome reference<br />
-R [int] -&gt; max number of reads<br />
-w [int] -&gt; index word size (default 15)<br />
<br />
== Aligning Reads (Illumina) ==<br />
<br />
Karma is set up so that the default options work well for mapping Illumina reads to the Human genome. <br />
<br />
== Aligning Reads (ABI SOLiD) ==<br />
<br />
Karma has been designed to align color space reads. However, in Karma 0.9.0, this functionality is not working. <br />
<br />
== Aligning Reads (LS 454) ==<br />
<br />
Karma has been designed to align LS 454 reads. However, in Karma 0.9.0, this functionality is not working. <br />
<br />
= Karma Performance Tuning =<br />
<br />
There are four components to the Karma index and hash. The first is a pure index array, keyed by N-mer word, which points into a word positions table: an ordered list of the genome positions at which that N-mer word appears. A cap called the ''occurrence cutoff'', once exceeded, causes an index word to be marked as a high-repeat pattern. Once marked as high repeat, the N-mer word is instead combined with both the N-mer word preceding it and the N-mer word succeeding it to create a 2 * N-mer word hash key. Two hash tables are populated, a left and a right hash, which are then used when that pattern is found in a read. <br />
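The index/positions/cutoff split described above can be sketched as follows; this toy Python version only marks high-repeat words, and does not build Karma's left/right neighbor hashes:<br />

```python
from collections import defaultdict

def build_index(genome, word_size, cutoff):
    """Collect genome positions per word; words seen more than `cutoff`
    times are evicted from the positions table and flagged as high-repeat
    (Karma would then key them with their neighboring words instead)."""
    positions = defaultdict(list)
    for i in range(len(genome) - word_size + 1):
        positions[genome[i:i + word_size]].append(i)
    high_repeat = {w for w, p in positions.items() if len(p) > cutoff}
    for w in high_repeat:
        del positions[w]
    return positions, high_repeat

pos, repeats = build_index("ACACACACACGT", 2, 3)
print(sorted(repeats))  # ['AC', 'CA'] exceed the cutoff of 3
```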
<br />
== Index Word Size ==<br />
<br />
Choosing an appropriate word size for larger genomes is critical to performance. The easiest case is Illumina base space reads with the human genome (3Gbases), where the default 15-mer word size is fine. <br />
<br />
For smaller genomes, consider using a smaller word size. Genomes smaller than a few million bases should be perfectly fine with a word size of 11 or 12. <br />
<br />
Since the primary index table into the word positions table is 4^(word size) * 4 bytes, it can grow large rapidly. All else being equal, a smaller word size leads to longer sets of word positions for each index value. Each increment of word size approximately quadruples storage requirements and halves runtime. Similarly, each decrement of word size reduces the index table size by 75% and doubles runtime. These approximations are old, but serve as a useful rule of thumb. <br />
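The sizing rule can be checked with a few lines of Python, assuming a 4-byte entry per possible word, which matches the 4^15 * 4 = 4GBytes example earlier on this page:<br />

```python
def index_table_bytes(word_size):
    """Primary index table size: one 4-byte entry per possible word,
    i.e. 4**word_size entries."""
    return 4 ** word_size * 4

for w in (11, 13, 15):
    print(w, index_table_bytes(w) / 2 ** 30, "GiB")
# A 15-mer index is 4 GiB; each decrement of word size divides it by 4.
```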
<br />
For ABI SOLiD reads, the word size is critical, due to the shorter length of reads as compared to Illumina or LS 454. <br />
<br />
The optimal word size is about 1/4 of the minimum expected read length. It must also be no more than 1/2 the minimum expected read length, since at least 2 full words must fit in each read. <br />
<br />
So for 48-mer reads, a reasonable value of word size is 12. Although the base space default of 15 is fine, too, Karma is able to take advantage of a higher number of index words per read, yielding substantial speedups even with the shorter read. Similarly, 52-mer reads would map better with a 13-mer word size, and 56-mer reads would map best with a 14-mer word size. <br />
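The read-length rule of thumb above can be written as a tiny Python helper (illustrative only):<br />

```python
def suggested_word_size(read_length):
    """Rule of thumb from the text: roughly 1/4 of the minimum expected
    read length, which automatically satisfies the requirement that at
    least two full words (1/2 the read) fit in each read."""
    return read_length // 4

for rl in (48, 52, 56):
    print(rl, suggested_word_size(rl))  # 48 -> 12, 52 -> 13, 56 -> 14
```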
<br />
== Occurrence Cutoff ==<br />
<br />
The occurrence cutoff value determines how quickly an N-mer pattern is declared to be ''high repeat'' and left out of the index in favor of a hash. The default value of 5000 seems adequate for Illumina reads with the human genome. If ultimate performance is necessary, some experimentation is called for with this value. <br />
<br />
== Shared Memory ==<br />
<br />
Karma uses memory mapped files to share the potentially large reference index and hash data structures. <br />
<br />
Karma uses this to great effect on our 8-processor machines with hyperthreading enabled: 16 copies of karma can share one reference index and hash, yielding a very acceptable memory-per-CPU ratio of around 1GB/CPU. <br />
<br />
A problem with large reference index and hash data structures is that they are more prone to being paged out. On a shared machine under heavy use, even simple disk I/O can cause memory pages to be reclaimed, so Karma's data structures end up swapped out. <br />
<br />
While Karma can recover on its own, it is best to either run in a production manner on dedicated machines, or to run a program such as the utility ''mapfile'' found in the utilities sub-folder. This program continually touches each page of the data structures in sequential order, forcing them to the head of the disk buffer pool, so they don't get aged out of the queue. <br />
<br />
= Modifying the Reference Header =<br />
<br />
''NB: This feature is not yet complete'' <br />
<br />
To facilitate SAM RG values being set automatically in a production environment, we keep a header in the binary version of the reference. The header can be viewed and edited using the header subcommands here. <br />
<br />
To view the header: <br />
<br />
karma header -r phiX.fa<br />
<br />
To view and edit the header: <br />
<br />
karma header -r phiX.fa -e<br />
<br />
= Optional flags =<br />
<br />
Besides conforming to the SAM specification, Karma defines its own optional tags to help evaluate mappings. <br />
<br />
{| cellspacing="1" cellpadding="1" border="1" style="width: 1034px; height: 291px;"<br />
|-<br />
| XA <br />
| Alignment PathTag<br />
|-<br />
| RG <br />
| Sample GroupID<br />
|-<br />
| HA <br />
| numMatchContributors (possible hits checked by karma)<br />
|-<br />
| UQ <br />
| phred quality of this read, assuming it is mapped correctly<br />
|-<br />
| NB <br />
| Number of equally best hits<br />
|-<br />
| ER <br />
| <br />
Different kind of errors when mapping:<br />
<br />
ER:Z:no_match -&gt; UNSET_QUALITY<br>ER:Z:invalid_bases -&gt; INVALID_DATA (not used yet)<br>ER:Z:duplicates -&gt; EARLYSTOP_DUPLICATES ( early stop due to reaching max number of best matches, with all bases matched)<br>ER:Z:quality -&gt; EARLYSTOP_QUALITY (early stop due to reaching max posterior quality)<br>ER:Z:repeats -&gt; REPEAT_QUALITY (not used yet)<br><br />
<br />
|}<br />
<br />
<br> <br />
<br />
<br><br />
<br />
= Other test and check capabilities =<br />
<br />
Due to the size and complexity of Karma input, output and index files, various checks and tests are useful, so we include some diagnostics capabilities: <br />
<br />
Tests for external files: <br />
<br />
karma check [options...] file.bam file.fastq file.sam file.fa file.umfa<br />
<br />
Tests internal to Karma: <br />
<br />
karma test [options...]<br />
-d -&gt; debug<br />
-s [int] -&gt; set random number seed [12345]<br />
<br />
= Karma file structure =<br />
<br />
Upon successfully building references, you will obtain a list of reference files like below: <br />
<br />
{| cellspacing="1" cellpadding="1" border="1" width="571" style="width: 571px; height: 288px;"<br />
|-<br />
| <br />
| <br />
Base Space <br />
<br />
| Color Space<br />
|-<br />
| <br />
Reference genome <br />
<br />
| <br />
NCBI37-bs.umfa <br />
<br />
| NCBI37-cs.umfa<br />
|-<br />
| <br />
Word Index <br />
<br />
| <br />
NCBI37-bs.15.5000.umwiwp <br />
<br />
NCBI37-bs.15.5000.umwihi <br />
<br />
| <br />
NCBI37-cs.15.5000.umwiwp <br />
<br />
NCBI37-cs.15.5000.umwihi <br />
<br />
|-<br />
| <br />
Word Hash (Left) <br />
<br />
| <br />
NCBI37-bs.15.5000.umwhl <br />
<br />
| NCBI37-cs.15.5000.umwhl<br />
|-<br />
| <br />
Word Hash (Right) <br />
<br />
| <br />
NCBI37-bs.15.5000.umwhr <br />
<br />
| NCBI37-cs.15.5000.umwhr<br />
|}<br />
<br />
<br> <br />
<br />
<br><br />
<br />
= Karma TODO List =<br />
<br />
#command line help is muddled up - UserOptions.h needs work <br />
#color space read handling needs to be re-integrated and tested <br />
#LS 454 needs to be tested, and the code adapted <br />
#pre-process some number of records to establish an appropriate max insert size <br />
#finish reference header view/edit code <br />
#investigate and document maximum memory use during ''create'' sub-command <br />
#finish and improve check and test commands<br />
<br />
= Karma CHANGELOG =<br />
<br />
*Karma 0.9.0<br />
<br />
* reference may now contain an arbitrary number of chromosomes <br />
* local re-alignment is drastically improved (handle small and large gaps better)<br />
* command line is re-vamped - now easier to use<br />
* create/naming/using reference index and hashes is easier<br />
<br />
*Karma 0.8.8S<br />
<br />
* add first version of local re-alignment<br />
* bump max number of chromosomes to 200<br />
* we no longer do Smith-Waterman on each candidate location<br />
<br />
*Karma 0.8.8 <br />
*Karma 0.8.6<br />
<br />
= Other useful links =<br />
<br />
[http://lh3lh3.users.sourceforge.net/bioinfo.shtml Heng Li's thoughts about aligners] <br />
<br />
[http://lh3lh3.users.sourceforge.net/udb.shtml Benchmark of Dictionary Structures] <br />
<br />
[[Category:Software]]</div>Ppwhitehttp://genome.sph.umich.edu/w/index.php?title=Template:SeqShopExtSetup&diff=14612Template:SeqShopExtSetup2017-02-02T15:09:55Z<p>Ppwhite: /* Download the Workshop Example Data */</p>
<hr />
<div>== Setup when running on your own outside of the SeqShop Workshop ==<br />
''This section is specifically for running on your own outside of the SeqShop Workshop.''<br />
<div class="mw-collapsible" style="width:600px"><br />
''If you are running during the SeqShop Workshop, please skip this section.''<br />
<div class="mw-collapsible-content"><br />
<br />
=== Workshop Setup Steps ===<br />
'''If this is your first SeqShop tutorial, you need to run some initial setup steps.'''<br />
<br />
''These setup steps only need to be run once.''<br />
<div class="mw-collapsible" style="width:700px"><br />
If you have already run another SeqShop tutorial, you can skip these steps (select 'Collapse' to hide the steps that you can skip)<br />
<br />
<div class="mw-collapsible-content"><br />
==== Download & Build GotCloud ====<br />
* cd to where you want GotCloud installed (you can change this to any directory you want)<br />
mkdir -p ~/seqshop<br />
cd ~/seqshop/<br />
* download, decompress, and build the version of gotcloud that was tested with this tutorial:<br />
wget https://github.com/statgen/gotcloud/archive/gotcloud.workshop.tar.gz<br />
tar xvf gotcloud.workshop.tar.gz<br />
mv gotcloud-gotcloud.workshop gotcloud<br />
cd gotcloud/src<br />
make<br />
cd ../..<br />
<br />
Remember the path to gotcloud/ - you will need to set the GC variable to the path.<br />
<br />
<br />
==== Download the Workshop Example Data ====<br />
 wget http://csg.sph.umich.edu//mktrost/seqshopExample.tar.gz<br />
 tar xvf seqshopExample.tar.gz<br />
</div><br />
</div><br />
<br />
=== Setup your run environment ===<br />
<br />
Environment variables will be used throughout the tutorial.<br />
<br />
We recommend that you setup these variables so you won't have to modify every command in the tutorial.<br />
<br />
<br />
<div class="mw-collapsible mw-collapsed" style="width:500px"><br />
I'm using bash (replace the paths below with the appropriate paths):<br />
<div class="mw-collapsible-content"><br />
* Point to where you installed GotCloud<br />
*:<pre>export GC=~/seqshop/gotcloud</pre><br />
* Point to where you installed the seqshop files<br />
*:<pre>export SS=~/seqshop/example</pre><br />
* Point to where you want the output to go<br />
*:<pre>export OUT=~/seqshop/output</pre><br />
</div><br />
</div><br />
<br />
<div class="mw-collapsible mw-collapsed" style="width:500px"><br />
I'm using tcsh (replace the paths below with the appropriate paths):<br />
<div class="mw-collapsible-content"><br />
* Point to where you installed GotCloud<br />
*:<pre>setenv GC ~/seqshop/gotcloud</pre><br />
* Point to where you installed the seqshop files<br />
*:<pre>setenv SS ~/seqshop/example</pre><br />
* Point to where you want the output to go<br />
*:<pre>setenv OUT ~/seqshop/output</pre><br />
</div><br />
</div><br />
<br />
</div><br />
</div></div>Ppwhitehttp://genome.sph.umich.edu/w/index.php?title=SeqShop:_Sequence_Mapping_and_Assembly_Practical,_June_2014&diff=14611SeqShop: Sequence Mapping and Assembly Practical, June 20142017-02-02T15:07:20Z<p>Ppwhite: /* Download the example data */</p>
<hr />
<div>'''Note:''' the latest version of this practical is available at: [[SeqShop: Sequence Mapping and Assembly Practical]]<br />
* The version here is the original one from the June workshop (updated so it can be run from elsewhere)<br />
<br />
== Introduction ==<br />
See the [[Media:SeqShop - GotCloud Align.pdf|introductory slides]] for an intro to this tutorial.<br />
<br />
== Goals of This Session ==<br />
* What we want to learn <br />
** Basic sequence data file formats (FASTQ, BAM) <br />
** How to generate aligned sequences that are ready for variant calling from raw sequence reads<br />
** How to evaluate the quality of sequence data<br />
** How to visualize sequence data to examine the reads aligned to particular genomic positions<br />
<br />
== Setup in person at the SeqShop Workshop ==<br />
''This section is specifically for the SeqShop Workshop computers.''<br />
<div class="mw-collapsible mw-collapsed" style="width:600px"><br />
''If you are not running during the SeqShop Workshop, please skip this section.''<br />
<div class="mw-collapsible-content"><br />
<br />
<br />
{{SeqShopLogin}}<br />
<br />
=== Setup your run environment===<br />
<br />
This will setup some environment variables to point you to<br />
* [[GotCloud]] program<br />
* Tutorial input files<br />
* Setup an output directory<br />
source /home/mktrost/seqshop/setup.txt<br />
* You won't see any output after running <code>source</code><br />
** It silently sets up your environment<br />
<div class="mw-collapsible mw-collapsed" style="width:200px"><br />
View setup.txt<br />
<div class="mw-collapsible-content"><br />
[[File:setup.png|500px]]<br />
</div><br />
</div><br />
</div><br />
</div><br />
<br />
== Setup when running on your own outside of the SeqShop Workshop ==<br />
''This section is specifically for running on your own outside of the SeqShop Workshop.''<br />
<div class="mw-collapsible" style="width:600px"><br />
''If you are running during the SeqShop Workshop, please skip this section.''<br />
<div class="mw-collapsible-content"><br />
=== Download & Build GotCloud ===<br />
If you do not already have GotCloud:<br />
* cd to where you want GotCloud installed (you can change this to any directory you want)<br />
mkdir -p ~/seqshop<br />
cd ~/seqshop/<br />
* download, decompress, and build the version of gotcloud that was tested with this tutorial:<br />
wget https://github.com/statgen/gotcloud/archive/gotcloud.workshop.tar.gz<br />
tar xvf gotcloud.workshop.tar.gz<br />
mv gotcloud-gotcloud.workshop gotcloud<br />
cd gotcloud/src<br />
make<br />
cd ../..<br />
<br />
Remember the path to gotcloud/ - you will need to set your GC variable to it.<br />
<br />
=== Download the example data ===<br />
Download and untar file containing the example data used in the practicals:<br />
wget http://csg.sph.umich.edu//mktrost/seqshopExample.tar.gz<br />
tar xvf seqshopExample.tar.gz<br />
<br />
You will see the names of all the files included in the example data scrolling on the screen as they are unpacked from the tar file.<br />
<br />
{{SeqShopRemoteEnv}}<br />
<br />
== Examining [[GotCloud]] Align Input Files ==<br />
=== Examining Raw Sequence Reads : FASTQs ===<br />
FASTQ : standard file format provided to you by those who did the sequencing.<br />
: For more information on the FASTQ format, see: http://en.wikipedia.org/wiki/FASTQ_format<br />
<br />
For this tutorial, we will use FASTQs for four 1000 Genomes samples<br />
* Subset of FASTQs - should map to chromosome 22 36000000-37000000<br />
<br />
ls ${SS}/fastq/<br />
There are 24 fastq files: combination of single-end & paired-end. <br />
<br />
;Can you tell which files are single-end and which are paired-end?<br />
<ul><br />
<div class="mw-collapsible mw-collapsed" style="width:400px"><br />
<li>View answer:</li><br />
<div class="mw-collapsible-content"><br />
<ul><br />
<li> Paired-end files have a '''_1.fastq''' or '''_2.fastq''' extension</li><br />
<li> This convention isn't mandatory, but something similar is common</li><br />
[[File:Fastqsm.png|700px]]<br />
</div><br />
</div><br />
</ul><br />
</ul><br />
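The _1/_2 naming convention can also be used to pair mates programmatically. This Python sketch assumes that convention holds (it is not mandatory), and the file names below are made up for illustration:<br />

```python
import re

def pair_fastqs(names):
    """Group FASTQ file names by the _1.fastq / _2.fastq convention;
    anything that does not match is treated as single-end."""
    pairs, singles = {}, []
    for name in names:
        m = re.match(r"(.*)_([12])\.fastq$", name)
        if m:
            pairs.setdefault(m.group(1), [None, None])[int(m.group(2)) - 1] = name
        else:
            singles.append(name)
    return pairs, singles

files = ["HG00551.SRR190851_1.fastq", "HG00551.SRR190851_2.fastq",
         "HG00551.SRR062141.fastq"]  # hypothetical single-end file
pairs, singles = pair_fastqs(files)
print(len(pairs), singles)  # 1 ['HG00551.SRR062141.fastq']
```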
<br />
<br />
Look at a couple of FASTQs:<br />
less -S ${SS}/fastq/HG00551.SRR190851_1.fastq<br />
<code>less</code> is a Linux command that allows you to look at a file.<br />
*<code>-S</code> option prevents line wrap<br />
* Use the arrow (up/down/left/right) keys to scroll through the file<br />
* Use the <code>space bar</code> to jump down a page<br />
Use <code>'q'</code> to exit out of <code>less</code><br />
q<br />
<br />
;Do you remember the parts of a FASTQ?<br />
<ul><br />
<div class="mw-collapsible mw-collapsed" style="width:200px"><br />
<li>No, remind me:</li><br />
<div class="mw-collapsible-content"><br />
[[File:Fastq.png|500px]]<br />
</div><br />
</div><br />
</ul><br />
<br />
<br />
Look at the paired read:<br />
less -S ${SS}/fastq/HG00551.SRR190851_2.fastq <br />
<br />
Remember, use <code>'q'</code> to exit out of <code>less</code><br />
q<br />
<br />
;Do you notice something in common?<br />
<br />
<ul><br />
<div class="mw-collapsible mw-collapsed" style="width:400px"><br />
<li>View answer:</li><br />
<div class="mw-collapsible-content"><br />
<ul><br />
<li> Paired-end reads have matching read names with different extensions</li><br />
<li> This convention isn't mandatory, but something similar is common</li><br />
[[File:Fastq3.png|500px]]<br />
</div><br />
</div><br />
</ul><br />
</ul><br />
<br />
=== Reference Files ===<br />
Reference files can be downloaded with [[GotCloud]] or from other sources<br />
* See [[GotCloud: Genetic Reference and Resource Files]] for more information on downloading/generating reference files<br />
<br />
For alignment, you need:<br />
# Reference genome FASTA file<br />
#* Contains the reference base for each position of each chromosome<br />
#* Additional information on the FASTA format: http://en.wikipedia.org/wiki/FASTA_format<br />
# VCF (variant call format) files with chromosomes/positions<br />
#* dbsnp - used to skip known variants when recalibrating<br />
#* hapmap - used for sample contamination/sample swap validation<br />
<br />
Take a look at the chromosome 22 reference files included for this tutorial:<br />
ls ${SS}/ref22<br />
<br />
<ul><br />
<div class="mw-collapsible mw-collapsed" style="width:200px"><br />
<li>View Screenshot</li><br />
<div class="mw-collapsible-content"><br />
[[File:RefDir.png|700px]]<br />
</div><br />
</div><br />
</ul><br />
<br />
Let's read the reference FASTA file (all reference bases for the chromosome):<br />
less ${SS}/ref22/human.g1k.v37.chr22.fa<br />
<br />
Remember, use <code>'q'</code> to exit out of <code>less</code><br />
q<br />
<br />
; Where is the reference sequence?<br />
<ul><br />
<div class="mw-collapsible mw-collapsed" style="width:500px"><br />
<li>Answer:</li><br />
<div class="mw-collapsible-content"><br />
<ul><br />
<li>The ends of a chromosome are 'N' - unknown bases</li><br />
<li>Let's look at 5 lines of the file starting at line 300,000</li><br />
tail -n+300000 ${SS}/ref22/human.g1k.v37.chr22.fa |head -n 5<br />
[[File:Fasta.png|500px]]<br />
</div><br />
</div><br />
</ul><br />
</ul><br />
<br />
If you want to access the FASTA file by position, you can use the <code>samtools faidx</code> command<br />
${GC}/bin/samtools faidx ${SS}/ref22/human.g1k.v37.chr22.fa 22:36000000 | less<br />
or <br />
${GC}/bin/samtools faidx ${SS}/ref22/human.g1k.v37.chr22.fa 22:36000000-36000100<br />
<br />
=== GotCloud FASTQ Index File ===<br />
The FASTQ index file is created by you to tell GotCloud about each of your FASTQ files:<br />
* Where to find it<br />
* Sample name<br />
** Each sample can have multiple FASTQs<br />
** Each FASTQ is for a single sample<br />
* Run identifier<br />
** For recalibration we need to know which reads were in the same run.<br />
<br />
FASTQ Index Format:<br />
* Tab delimited<br />
* Starts with a header line<br />
* One line per single-end read<br />
* One line per paired-end read (only 1 line per pair). <br />
<br />
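The multiple-runs question below can also be answered programmatically. This Python sketch assumes the header names the columns MERGE_NAME and RGID, as in the screenshots; the sample lines are made up for illustration:<br />

```python
def multi_run_samples(index_lines):
    """Return samples whose RGID column has more than one unique value,
    i.e. samples sequenced in multiple runs."""
    header = index_lines[0].rstrip("\n").split("\t")
    name_col, rg_col = header.index("MERGE_NAME"), header.index("RGID")
    runs = {}
    for line in index_lines[1:]:
        fields = line.rstrip("\n").split("\t")
        runs.setdefault(fields[name_col], set()).add(fields[rg_col])
    return sorted(s for s, r in runs.items() if len(r) > 1)

lines = [
    "MERGE_NAME\tFASTQ1\tFASTQ2\tRGID",
    "HG00553\ta_1.fastq\ta_2.fastq\tRG1",
    "HG00553\tb_1.fastq\tb_2.fastq\tRG2",
    "HG00551\tc_1.fastq\tc_2.fastq\tRG3",
]
print(multi_run_samples(lines))  # ['HG00553']
```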
Let's take a look at the index file I prepared for this tutorial:<br />
less -S ${SS}/align.index <br />
<br />
Remember, use <code>'q'</code> to exit out of <code>less</code><br />
q<br />
<br />
; Which samples had multiple runs?<br />
<ul><br />
<div class="mw-collapsible mw-collapsed" style="width:500px"><br />
<li>Need a reminder of the format?</li><br />
<div class="mw-collapsible-content"><br />
[[File:fqindex.png|750px]]<br />
</div><br />
</div><br />
<ul><br />
<li>Note: in the screenshots, the fields are shifted into clear columns to make it easier to read</li><br />
<ul><br />
<li>When you view the file, the fields will not line up in neat columns and it can be hard to read</li><br />
</ul><br />
</ul><br />
<div class="mw-collapsible mw-collapsed" style="width:500px"><br />
<li>Hard to read the index? Need a hint?</li><br />
<div class="mw-collapsible-content"><br />
<ul><br />
<li>Use cut to extract just the MERGE_NAME & RGID fields </li><br />
cut -f 1,4 ${SS}/align.index<br />
</ul><br />
</div><br />
</div><br />
<div class="mw-collapsible mw-collapsed" style="width:500px"><br />
<li>Answer:</li><br />
<div class="mw-collapsible-content"><br />
<ul><br />
<li>HG00553 & HG00640</li><br />
<li>They have multiple unique values in the RGID field</li><br />
[[File:fqindexRG.png|800px]]<br />
</div><br />
</div><br />
</ul><br />
</ul><br />
<br />
<br />
How do you point GotCloud to your index file?<br />
* Command-line <code>--index_file</code> option<br />
: or<br />
* Configuration file <code>INDEX_FILE</code> setting. <br />
<br />
The command-line setting takes precedence over the configuration file setting.<br />
<br />
=== GotCloud Configuration File ===<br />
This file is created by you to configure GotCloud for your data.<br />
<br />
* Default values are provided in ${GC}/bin/gotcloudDefaults.conf<br />
** Most values should be left as the defaults<br />
* Specify values in your configuration file as:<br />
** <code>KEY = value</code><br />
* Use $(KEY) to refer to another key's value<br />
* If a KEY is specified twice, the later value is used<br />
* Does not have access to environment variables<br />
* '#' indicates a comment<br />
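The configuration rules above can be summarized in a small parser sketch. This illustrates the stated rules only, and is not GotCloud's actual parser; in particular, the order in which $(KEY) references are resolved is an assumption:<br />

```python
import re

def parse_conf(text):
    """Sketch of the rules above: KEY = value lines, '#' comments,
    later values win, and $(KEY) references resolved against other
    keys (no access to environment variables)."""
    conf = {}
    for line in text.splitlines():
        line = line.split("#", 1)[0].strip()  # drop comments
        if "=" in line:
            key, _, value = line.partition("=")
            conf[key.strip()] = value.strip()  # later value replaces earlier
    def resolve(value):
        return re.sub(r"\$\((\w+)\)", lambda m: resolve(conf[m.group(1)]), value)
    return {k: resolve(v) for k, v in conf.items()}

conf = parse_conf("REF_DIR = ref22\nREF = $(REF_DIR)/human.fa  # comment\nREF_DIR = other")
print(conf["REF"])  # other/human.fa
```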
<br />
Let's look at the configuration file I created for this test:<br />
more ${SS}/gotcloud.conf<br />
<br />
Use the <code>space bar</code> to advance if the whole file isn't displayed.<br />
<br />
; If your references are in a different path than what is specified, what would you change?<br />
<ul><br />
<div class="mw-collapsible mw-collapsed" style="width:300px"><br />
<li>Answer:</li><br />
<div class="mw-collapsible-content"><br />
<ul><br />
<li>You would change <code>REF_DIR</code> to the new path</li><br />
[[File:gcConf.png|800px]]<br />
</div><br />
</div><br />
</ul><br />
</ul><br />
<br />
== Run [[GotCloud]] Align ==<br />
<br />
[[File:AlignDiagram.png|500px]]<br />
<br />
Now that we have all of our input files, a single command runs the whole alignment pipeline:<br />
${GC}/gotcloud align --conf ${SS}/gotcloud.conf --numcs 2 --base_prefix ${SS} --outdir ${OUT}<br />
<br />
* <code>${GC}/gotcloud</code> runs GotCloud<br />
* <code>align</code> tells GotCloud you want to run the alignment pipeline.<br />
* <code>--conf</code> tells GotCloud the name of the configuration file to use.<br />
** The configuration for this test was downloaded with the seqshop input files.<br />
* <code>--numcs</code> tells GotCloud to run 2 samples at a time.<br />
** How many you can run concurrently depends on your system.<br />
* <code>--base_prefix</code> tells GotCloud the prefix to prepend to relative paths.<br />
** The Configuration file cannot read environment variables, so we need to tell GotCloud the path to the input files, ${SS}<br />
** Alternatively, gotcloud.conf could be updated to specify the full paths<br />
* <code>--outdir</code> tells GotCloud where to write the output.<br />
** This could be specified in gotcloud.conf, but to allow you to use the ${OUT} to change the output location, it is specified on the command-line<br />
<br />
[[File:gcalignStart.png|850px]]<br />
<br />
This should take 1-3 minutes to run.<br />
<br />
It should end with a line like: <code>Processing finished in 133 secs with no errors reported</code><br />
<br />
If you cancelled GotCloud part way through, just rerun your GotCloud command and it will pick up where it left off.<br />
<br />
GotCloud align performs not only sequence alignment but also pre-processing of the sequence data, including deduplication and base quality recalibration, along with quality assessment, as illustrated below.<br />
<br />
[[File:Gotcloud_align_detail.png|500px]]<br />
<br />
== Examining GotCloud Align Output ==<br />
<br />
Let's look at the output directory:<br />
ls ${OUT}<br />
[[File:gcalignOutM.png|600px]]<br />
<br />
=== Quality Control Files ===<br />
Let's take a look at our quality control output directory:<br />
ls ${OUT}/QCFiles <br />
[[File:GcalignOutQCm.png|600px]]<br />
<br />
==== Sample Contamination/Swap ====<br />
Check for sample contamination:<br />
* *.selfSM : Main output file containing the contamination estimate. <br />
** Check the 'FREEMIX' column for genotype-free estimate of contamination<br />
*** 0-1 scale, the lower, the better<br />
*** If [FREEMIX] >= 0.03 and [FREELK1]-[FREELK0] is large, possible contamination<br />
** See [[VerifyBamID#A_guideline_to_interpret_output_files|VerifyBamID: A guideline to interpret output files]] for more information<br />
less -S ${OUT}/QCFiles/HG00551.genoCheck.selfSM<br />
<br />
Remember, use <code>'q'</code> to exit out of <code>less</code><br />
q<br />
<br />
; Is there evidence of sample contamination?<br />
<ul><br />
<div class="mw-collapsible mw-collapsed" style="width:200px"><br />
<li>Answer:</li><br />
<div class="mw-collapsible-content"><br />
<ul><br />
<li>No, FREEMIX = 0.00000 (<0.03)</li><br />
</ul><br />
[[File:Contam1.png|700px]]<br />
</div><br />
</div><br />
</ul><br />
<br />
==== QC Metrics ====<br />
See: [[QPLOT#Diagnose_sequencing_quality|QPLOT: Diagnose sequencing quality]] for more info on how to use QPLOT results.<br />
<br />
Let's look at some quality control metrics:<br />
cat ${OUT}/QCFiles/HG00551.qplot.stats<br />
<br />
; What is the mapping rate & average coverage for HG00551?<br />
<ul><br />
<div class="mw-collapsible mw-collapsed" style="width:200px"><br />
<li>Answer</li><br />
<div class="mw-collapsible-content"><br />
<ul><br />
<li> 98.93% Mapped</li><br />
<li>7.43 MeanDepth</li><br />
</ul><br />
[[File:qplots.png|200px]]<br />
</div><br />
</div><br />
</ul><br />
<br />
Generate a pdf of quality metrics:<br />
Rscript ${OUT}/QCFiles/HG00551.qplot.R<br />
<br />
Examine the PDF:<br />
evince ${OUT}/QCFiles/HG00551.qplot.pdf&<br />
<br />
It is ok if you see a warning message when opening evince. It should still open. If not, let me know. To close evince, just close the pdf window.<br />
<br />
;Does the Empirical vs reported Phred score look as good as we would like?<br />
<ul><br />
<div class="mw-collapsible mw-collapsed" style="width:400px"><br />
<li>Answer</li><br />
<div class="mw-collapsible-content"><br />
<ul><br />
<li> No, it is well above the line</li><br />
<li> This is due to the small region used for recalibration</li><br />
[[File:Qplotpdf.png|400px]]<br />
<li> Look at the PDF I produced when I ran the whole genome:</li> <br />
evince ${SS}/ext/HG00551.wg.qplot.pdf&<br />
</ul><br />
[[File:Qplotpdfwg.png|400px]]<br />
</div><br />
</div><br />
</ul><br />
<br />
=== BAM Files ===<br />
BAM is the binary version of the Sequence Alignment/Map (SAM) format<br />
* Maps reads to Chromosome/Position<br />
* For a detailed explanation of the SAM/BAM format, see:<br />
** SAM/BAM Spec: http://samtools.github.io/hts-specs/SAMv1.pdf<br />
** Additional information I put together as I started working with SAM/BAM: [[SAM]]<br />
<br />
Let's look at the BAMs (aligned reads that are ready for variant calling):<br />
ls ${OUT}/bams<br />
[[File:GcalignOutBAMm.png|600px]]<br />
<br />
Let's examine the first 5 lines of the BAM file using [http://samtools.sourceforge.net/samtools.shtml#3 samtools view]:<br />
${GC}/bin/samtools view -h ${OUT}/bams/HG00551.recal.bam|head -n 5<br />
<br />
; What are the chromosome and position of the first record in the BAM file?<br />
<ul><br />
<div class="mw-collapsible mw-collapsed" style="width:300px"><br />
<li>Need a reminder of the format?</li><br />
<div class="mw-collapsible-content"><br />
[[File:Bam.png|750px]]<br />
</div><br />
</div><br />
<div class="mw-collapsible mw-collapsed" style="width:300px"><br />
<li>Answer</li><br />
<div class="mw-collapsible-content"><br />
<ul><br />
<li>Chr 22, Pos: 16114122</li><br />
</ul><br />
[[File:BamRec.png|650px]]<br />
</div><br />
</div><br />
</ul><br />
<br />
==== Accessing BAMs by Position ====<br />
BAM files are big, so what if we want to see a position partway through the file?<br />
*[http://samtools.sourceforge.net/samtools.shtml#3 samtools] has an option for that.<br />
<br />
Add a region to the view command we used above. Let's find all reads that overlap positions 36907000-36907005:<br />
${GC}/bin/samtools view -h ${OUT}/bams/HG00551.recal.bam 22:36907000-36907005<br />
* Just a few reads.<br />
<br />
Let's visualize what reads in that area look like using samtools tview:<br />
${GC}/bin/samtools tview ${OUT}/bams/HG00551.recal.bam ${SS}/ref22/human.g1k.v37.chr22.fa<br />
* Type <code>g</code><br />
** Type 22:36907000 <br />
* Type <code>n</code> to color by nucleotide <br />
* Use the arrow keys to move around and look at the area.<br />
<br />
Understanding the syntax:<br />
* '.' : match to the reference on the forward strand<br />
* ',' : match to the reference on the reverse strand<br />
* ACGTN : mismatch to reference on the forward strand<br />
* acgtn : mismatch to reference on the reverse strand<br />
<br />
; Do you see anything interesting?<br />
<ul><br />
<div class="mw-collapsible mw-collapsed" style="width:500px"><br />
<li>Screenshot</li><br />
<div class="mw-collapsible-content"><br />
<ul><br />
<li>We will have to remember this region when we run snpcall to see what it says.</li><br />
</ul><br />
[[File:tview.png|750px]]<br />
</div><br />
</div><br />
</ul><br />
<br />
Other tview commands:<br />
* Type '?' for a help screen<br />
* Type 'q' to quit tview<br />
<br />
Feel free to play around more and browse the BAM files.<br />
<br />
==== Other tools for BAMs ====<br />
We have developed a lot of tools that operate on BAM files.<br />
<br />
See [[Software#BAM_Util_Tools|Software: BamUtil Tools]] for a list<br />
* Many operations:<br />
** diff : diff 2 BAM files<br />
** stats: per-position statistics<br />
** bam2Fastq : convert a BAM back to a FASTQ (how I created the fastqs for this tutorial)<br />
** Lots of others<br />
* Feel free to try some out<br />
* If you have any questions, let me know; I wrote most of them and am happy to help.<br />
<br />
== Logging Off ==<br />
<br />
''This section is specifically for the SeqShop Workshop computers.''<br />
<div class="mw-collapsible mw-collapsed" style="width:600px"><br />
''If you are not running during the SeqShop Workshop, please skip this section.''<br />
<div class="mw-collapsible-content"><br />
To logout of seqshop-server, type:<br />
exit<br />
And close the windows.<br />
<br />
When done, log out of the Windows machine.<br />
</div><br />
</div></div>Ppwhitehttp://genome.sph.umich.edu/w/index.php?title=BAM_Review_Action_Items&diff=14610BAM Review Action Items2017-02-02T15:06:26Z<p>Ppwhite: /* Useful Links */</p>
<hr />
<div>[[Category:libStatGen]]<br />
[[Category:libStatGen BAM]]<br />
<br />
== Review Sept 20th ==<br />
=== Notes ===<br />
* returning const char*<br />
* SamFileHeader change referenceContigs, etc to private from public<br />
* Add way to copy a SAM record.<br />
<br />
== Review Sept 17th ==<br />
=== Topics Discussed ===<br />
* [[#Return Statuses|Checking if methods succeeded/failed (checking return values/return statuses)]]<br />
* [[#Accessing String Values|Strings as return values]]<br />
<br />
=== NOTES From Meeting ===<br />
* General Notes:<br />
**InputFile should not use <code>long int</code>; it should instead use <code>long long</code>.<br />
* Error Handling Notes:<br />
**Any time there is an error, the code could call handleError, which would have a switch to return the error, throw an exception, or abort. It would be called with an error code and a string. Perhaps an error-handler class that could be used everywhere; each class would have a member of that type containing this information.<br />
*Returning values of Strings Notes:<br />
** Problems with returning const char*<br />
*** If the pointer is stored when returned, it becomes invalid if the class modifies the underlying string.<br />
** Problems with passing in std::string& as a parameter to be set.<br />
*** people typically want to operate on the return of the method.<br />
** One idea was returning a reference to a string<br />
*** Does that solve the problem? Won't the contents change when a new one is read? Is that what we want?<br />
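The handleError idea from the notes above might be sketched as follows. This is only a sketch of the proposal, not the library's actual API: the class name, enum values, and default action are all assumptions.

```cpp
#include <cstdio>
#include <cstdlib>
#include <stdexcept>
#include <string>

// Hypothetical sketch of the proposed error handler: one member object
// per class, configured once to return, throw, or abort on any error.
class ErrorHandler
{
public:
    enum HandlingType { RETURN, EXCEPTION, ABORT };

    ErrorHandler(HandlingType type = EXCEPTION) : myType(type) {}

    // Returns the error code when configured to RETURN;
    // otherwise throws an exception or aborts the program.
    int handleError(int errorCode, const std::string& message)
    {
        switch(myType)
        {
            case RETURN:
                return errorCode;
            case EXCEPTION:
                throw std::runtime_error(message);
            case ABORT:
            default:
                fprintf(stderr, "%s\n", message.c_str());
                abort();
        }
    }

private:
    HandlingType myType;
};
```

A file class such as SamFile could then hold an ErrorHandler member and route every failure through handleError, so callers choose the failure policy once instead of checking every return value.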
<br />
<br />
=== Useful Links ===<br />
BAM Library FAQs: http://genome.sph.umich.edu/wiki/SAM/BAM_Library_FAQs<br />
<br />
Source Code: http://csg.sph.umich.edu//mktrost/doxygen/html/<br />
<br />
Test code for setting values in the library: http://csg.sph.umich.edu//mktrost/doxygen/html/WriteFiles_8cpp-source.html<br />
<br />
=== Topics for Discussion ===<br />
==== Return Statuses ====<br />
Currently anytime you do anything on a SAM/BAM file, you have to check the status for failure:<br />
<source lang="cpp"><br />
SamFile samIn;<br />
if(!samIn.OpenForRead(argv[1]))<br />
{<br />
fprintf(stderr, "%s\n", samIn.GetStatusMessage());<br />
return(samIn.GetStatus());<br />
}<br />
<br />
// Read the sam header.<br />
SamFileHeader samHeader;<br />
if(!samIn.ReadHeader(samHeader))<br />
{<br />
fprintf(stderr, "%s\n", samIn.GetStatusMessage());<br />
return(samIn.GetStatus());<br />
}<br />
</source><br />
A previous recommendation was to "Add an option by class that says whether or not to abort on failure. (or even an option on each method)"<br />
<br />
I am proposing modifying the classes to throw exceptions on failures. It would then be up to the user to catch them if they want to handle them or to let them exit the program (which would print out the error message)<br />
<source lang="cpp"><br />
SamFile samIn;<br />
samIn.OpenForRead(argv[1]);<br />
<br />
// Read the sam header.<br />
SamFileHeader samHeader;<br />
samIn.ReadHeader(samHeader);<br />
<br />
// Open the output file for writing.<br />
SamFile samOut;<br />
try<br />
{<br />
samOut.OpenForWrite(argv[2]);<br />
samOut.WriteHeader(samHeader);<br />
}<br />
catch(GenomeException& e)<br />
{<br />
std::cout << "Caught Exception:\n" << e.what() << std::endl;<br />
}<br />
std::cout << "Continue Processing\n";<br />
</source><br />
For caught exceptions, you would see the following and processing would continue:<br />
<pre><br />
Caught Exception:<br />
FAIL_IO: Failed to Open testFiles/unknown for writing<br />
Continue Processing<br />
</pre><br />
<br />
For an uncaught exception, you would see the following and processing would be stopped:<br />
<pre><br />
terminate called after throwing an instance of 'GenomeException'<br />
what(): <br />
FAIL_IO: Failed to Open testFiles/unknown for reading<br />
Aborted<br />
</pre><br />
<br />
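A minimal exception class for this scheme might look like the sketch below. This is an assumption about how a GenomeException could be shaped, not the final class: the real one would likely also carry the SamStatus code.

```cpp
#include <exception>
#include <string>

// Sketch of a GenomeException carrying a status message, as assumed by
// the proposal above; the class shape here is illustrative only.
class GenomeException : public std::exception
{
public:
    explicit GenomeException(const std::string& message)
        : myMessage(message) {}

    // what() supplies the text printed by the default terminate handler
    // when the exception is never caught.
    const char* what() const noexcept
    {
        return myMessage.c_str();
    }

private:
    std::string myMessage;
};
```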
<br />
==== Accessing String Values ====<br />
SAM/BAM files have strings in them that people will want to read out.<br />
How should we handle this interface?<br />
Currently we do a mix of returning const char*, like:<br />
<source lang="cpp"><br />
const char* SamRecord::getSequence()<br />
{<br />
myStatus = SamStatus::SUCCESS;<br />
if(mySequence.Length() == 0)<br />
{<br />
// 0 Length, means that it is in the buffer, but has not yet<br />
// been synced to the string, so do the sync.<br />
setSequenceAndQualityFromBuffer();<br />
}<br />
return mySequence.c_str();<br />
}<br />
<br />
// Alternative: returning a reference to the string instead.<br />
const std::string& SamRecord::getSequence()<br />
{<br />
myStatus = SamStatus::SUCCESS;<br />
if(mySequence.Length() == 0)<br />
{<br />
// 0 Length, means that it is in the buffer, but has not yet<br />
// been synced to the string, so do the sync.<br />
setSequenceAndQualityFromBuffer();<br />
}<br />
return mySequence;<br />
}<br />
<br />
</source><br />
and passing in references to strings, like:<br />
<source lang="cpp"><br />
// Set the passed in string to the header line at the specified index.<br />
// It does NOT clear the current contents of header.<br />
// NOTE: some indexes will return blank if the entry was deleted.<br />
bool SamFileHeader::getHeaderLine(unsigned int index, std::string& header) const<br />
{<br />
// Check to see if the index is in range of the header records vector.<br />
if(index < myHeaderRecords.size())<br />
{<br />
// In range of the header records vector, so get the string for<br />
// that record.<br />
SamHeaderRecord* hdrRec = myHeaderRecords[index];<br />
hdrRec->appendString(header);<br />
return(true);<br />
}<br />
else<br />
{<br />
unsigned int commentIndex = index - myHeaderRecords.size();<br />
// Check to see if it is in range of the comments.<br />
if(commentIndex < myComments.size())<br />
{<br />
// It is in range of the comments, so add the type.<br />
header += "@CO\t";<br />
// Add the comment.<br />
header += myComments[commentIndex];<br />
// Add the new line.<br />
header += "\n";<br />
return(true);<br />
}<br />
}<br />
// Invalid index.<br />
return(false);<br />
}<br />
</source><br />
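The dangling-pointer concern with returning const char* can be shown with a small stand-in class (hypothetical, not the real SamRecord): the pointer from c_str() is only valid until the underlying string next changes, so callers who want to keep the value must copy it.

```cpp
#include <string>

// Hypothetical stand-in for a record class that returns const char*.
class MiniRecord
{
public:
    void setSequence(const std::string& seq) { mySequence = seq; }

    // Like SamRecord::getSequence(): the returned pointer is only
    // valid until mySequence is next modified.
    const char* getSequence() { return mySequence.c_str(); }

private:
    std::string mySequence;
};

// Safe usage: copy the value before the record is modified or reused.
std::string copySequence(MiniRecord& rec)
{
    return std::string(rec.getSequence());
}
```

Storing the raw pointer across a call that modifies the record would leave it dangling; copying into a std::string avoids the problem at the cost of a copy.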
<br />
http://www.sph.umich.edu/csg/mktrost/doxygen/html/SamRecord_8h-source.html<br />
<br />
==== SamFileHeader ====<br />
*Should this be renamed to SamHeader?<br />
*Do we like the classes being named starting with Sam? Should it be Bam?<br />
<br />
Should we add the following to SamFileHeader:<br />
<source lang="cpp"><br />
//////////////////////////////////<br />
// Set methods for header fields.<br />
bool setVersion(const char* version);<br />
bool setSortOrder(const char* sortOrder);<br />
bool addSequenceName(const char* sequenceName);<br />
bool setSequenceLength(const char* keyID, int sequenceLength);<br />
bool setGenomeAssemblyId(const char* keyID, const char* genomeAssemblyId);<br />
bool setMD5Checksum(const char* keyID, const char* md5sum);<br />
bool setURI(const char* keyID, const char* uri);<br />
bool setSpecies(const char* keyID, const char* species);<br />
bool addReadGroupID(const char* readGroupID);<br />
bool setSample(const char* keyID, const char* sample);<br />
bool setLibrary(const char* keyID, const char* library);<br />
bool setDescription(const char* keyID, const char* description);<br />
bool setPlatformUnit(const char* keyID, const char* platform);<br />
bool setPredictedMedianInsertSize(const char* keyID, const char* isize);<br />
bool setSequencingCenter(const char* keyID, const char* center);<br />
bool setRunDate(const char* keyID, const char* runDate);<br />
bool setTechnology(const char* keyID, const char* technology);<br />
bool addProgram(const char* programID);<br />
bool setProgramVersion(const char* keyID, const char* version);<br />
bool setCommandLine(const char* keyID, const char* commandLine);<br />
<br />
///////////////////////////////////<br />
// Get methods for header fields.<br />
// Returns the number of SQ entries in the header.<br />
int32_t getSequenceDictionaryCount();<br />
// Return the Sort Order value that is set in the Header.<br />
// If this field does not exist, "" is returned.<br />
const char* getSortOrder();<br />
/// Additional gets for the rest of the fields.<br />
</source><br />
Should these also be added to SamHeaderRG, SamHeaderSQ, etc as appropriate....<br />
<br />
== Review June 7th ==<br />
<br />
* <S>Move the examples from the SamFile wiki page to their own page</s><br />
** <S>include links from the main library page and the SamFile page.</s><br />
** <S>look into why the one example have two if checks on SamIn status</s> <span style="color:blue">- one was printing the result and one was setting the return value - cleaned up to be in one if statement.</span><br />
* <S>Create 1 library for all of our library code rather than having libcsg, libbam, libfqf separated.</s><br />
** <S>What should this library be called?</s> <span style="color:blue">- Created library: libstatgen and reorganized into a new repository: statgen.</span><br />
*** <S>libdna</s><br />
*** <S>libdna++</s><br />
*** <S>libsequence++</s><br />
*** <S>libDNA</s><br />
*** <S>libgenotype</s><br />
* Add an option by class that says whether or not to abort on failure. (or even an option on each method)<br />
** This allows calling code to set that option and then not have to check for failures since the code it calls would abort on a failure.<br />
** Could/should this be achieved using exceptions? User can decide to catch them or let them terminate the program.<br />
*<S>SamFile add a constructor that takes the filename and a flag to indicate open for read/write. (abort on failure to open)</s><br />
** <S>Also have 2 subclasses one that opens for read, one for write: SamReadFile, SamWriteFile? Or SamFileRead, SamFileWrite?</s> <span style="color:blue">- went with SamFileReader and SamFileWriter</span><br />
* Add a function that says: skipInvalidRecords, validateRecords, etc.<br />
** That way, ReadRecord will keep reading records until a valid/parseable one is found.<br />
*SamFileHeader::setTag - instead of having separate ones for PG, RG, etc, have a generic one that takes as a parameter which one it is.<br />
** KeyID, then Value as parameters....(keyID first, then value)<br />
* SamFileHeader::setProgramName, etc...have specific methods for setting fields so users don't need to know the specific tags, etc. used for certain values in the header.<br />
** KeyID, then Value as parameters....(keyID first, then value)<br />
* BAM write utility could add a PG field with default settings (user could specify alternate settings) when it writes a file.<br />
* Future methods to add:<br />
** <S>SamFile::setReadSection(const std::string& refName) - take in the reference name by string since that is what most people will know.</s><br />
*** <S>"" would indicate the ones not associated with a reference.</s></div>Ppwhitehttp://genome.sph.umich.edu/w/index.php?title=C%2B%2B_Class:_SamFileHeader&diff=14609C++ Class: SamFileHeader2017-02-02T15:05:16Z<p>Ppwhite: /* Sam Header Basics */</p>
<hr />
<div>[[Category:C++]]<br />
[[Category:libStatGen]]<br />
[[Category:libStatGen BAM]]<br />
<br />
== SamFileHeader ==<br />
This class allows a user to get/set the fields in a SAM/BAM Header.<br />
<br />
This class is part of [[C++ Library: libStatGen]].<br />
<br />
=== Sam Header Basics ===<br />
The SamFileHeader is comprised of multiple [http://csg.sph.umich.edu//mktrost/doxygen/current/classSamHeaderRecord.html SamHeaderRecords].<br />
<br />
There are 4 types of SAM Header Records:<br />
# HD - Header<br />
# SQ - Sequence Dictionary<br />
# RG - Read Group<br />
# PG - Program<br />
<br />
A SAM Header Record is comprised of Tag/Value pairs. Each tag only appears once within a specific record.<br />
<br />
A SAM Header can have 0 or 1 HD records, 0 or more PG records, 0 or more SQ Records, and 0 or more RG records. The PG records are keyed off of the ID tag. The SQ records are keyed off of the SN, Sequence Name, tag. The RG records are keyed off of the ID, Unique Read Group Identifier, tag. The keys must be unique for that record type within the file.<br />
<br />
The '''SamFileHeader''' also contains Comments, type CO. They are not included as part of the '''SamHeaderRecord''' class since they do not contain Tag/Value pairs.<br />
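A minimal header illustrating each record type might look like this (fields are tab-delimited; the values shown are only illustrative):

<pre>
@HD	VN:1.4	SO:coordinate
@SQ	SN:22	LN:51304566
@RG	ID:grp1	SM:HG00551	LB:lib1
@PG	ID:bwa	VN:0.5.9
@CO	This is a comment line.
</pre>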
<br />
See: http://csg.sph.umich.edu//mktrost/doxygen/current/classSamFileHeader.html for documentation.<br />
<br />
==== Additional Proposed Accessors ====<br />
* HD<br />
** getVersion - returns the VN field (will only be one)<br />
* SQ<br />
** getRefSequenceCount - count of the number of SQ entries in the header<br />
** getRefSequenceName - gets the next reference sequence name.<br />
** getRefSequenceLength - gets the length associated with the specified reference sequence.<br />
* RG<br />
** getSampleID - for a specified Read Group....???? but SampleID is the key...maybe passing in a record?<br />
** getReadGroup - pass in record, return a read group structure?<br />
** getLibrary - for a given read group<br />
** getSample - for a given read group<br />
** getTechnology - for a given read group<br />
** getPlatformUnit - for a given read group<br />
'''NOTE: More Get Accessors will be coming. Let me know if you need a specific one, and I can add that first'''</div>Ppwhitehttp://genome.sph.umich.edu/w/index.php?title=C%2B%2B_Class:_SamFile&diff=14607C++ Class: SamFile2017-02-02T15:03:54Z<p>Ppwhite: /* SamFileWriter */</p>
<hr />
<div>[[Category:C++]]<br />
[[Category:libStatGen]]<br />
[[Category:libStatGen BAM]]<br />
<br />
== Reading/Writing SAM/BAM Files In Your Program ==<br />
The '''SamFile''' class allows a user to easily read/write a SAM/BAM file.<br />
<br />
The '''SamFile''' class contains additional functionality that allows a user to read specific sections of sorted & indexed BAM files. In order to take advantage of this capability, the index file must be read prior to setting the read section. This logic saves the time of having to read the entire file and takes advantage of the seeking capability of BGZF files. <br />
<br />
'''Future Enhancements:''' Add the ability to read alignments that match a given start, end position for a specific reference sequence. <br />
<br />
This class is part of [[C++ Library: libStatGen|C++ Library: libStatGen]].<br />
<br />
=== Class Documentation ===<br />
<br />
See: http://csg.sph.umich.edu//mktrost/doxygen/current/classSamFile.html<br />
<br />
== Child Classes ==<br />
=== SamFileReader ===<br />
http://csg.sph.umich.edu//mktrost/doxygen/current/classSamFileReader.html<br />
<br />
=== SamFileWriter ===<br />
http://csg.sph.umich.edu//mktrost/doxygen/current/classSamFileWriter.html<br />
<br />
== Statistics ==<br />
=== Statistic Generation ===<br />
<br />
The following statistics can optionally be recorded when reading a SamFile by specifying <code>SamFile::GenerateStatistics()</code> and displayed with <code>SamFile::PrintStatistics()</code>.<br />
<br />
The statistics only reflect alignments that were successfully read from the BAM file. Alignments that failed to parse from the file are not reflected in the statistics, but alignments that are invalid for other reasons may show up in the statistics.<br />
<br />
{| style="margin: 1em 1em 1em 0; background-color: #f9f9f9; border: 1px #aaa solid; border-collapse: collapse;" border="1"<br />
|-style="background: #f2f2f2; text-align: center;" <br />
|+ '''Read Counts'''<br />
! Statistic !! Description<br />
|-<br />
|TotalReads<br />
| Total number of alignments that were successfully read from the file.<br />
|-<br />
|MappedReads<br />
| Total number of alignments that were successfully read from the file with FLAG bit 0x004 set to 0 (not unmapped).<br />
|-<br />
|PairedReads<br />
| Total number of alignments that were successfully read from the file with FLAG bit 0x001 set to 1 (paired).<br />
|-<br />
|ProperPair<br />
| Total number of alignments that were successfully read from the file with FLAG bits 0x001 set to 1 (paired) AND 0x002 (proper pair).<br />
|-<br />
|DuplicateReads<br />
| Total number of alignments that were successfully read from the file with FLAG bit 0x400 set to 1 (PCR or optical duplicate).<br />
|-<br />
|QCFailureReads<br />
| Total number of alignments that were successfully read from the file with FLAG bit 0x200 set to 1 (failed quality checks).<br />
|}<br />
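The FLAG tests in the table above map directly to bit operations. The helper names below are illustrative, but the bit masks are the ones listed in the table (and in the SAM specification):

```cpp
#include <stdint.h>

// FLAG-bit tests matching the Read Counts table above.
inline bool isPaired(uint16_t flag)     { return (flag & 0x001) != 0; }
inline bool isProperPair(uint16_t flag) { return (flag & 0x001) && (flag & 0x002); }
inline bool isMapped(uint16_t flag)     { return (flag & 0x004) == 0; }
inline bool isQCFailure(uint16_t flag)  { return (flag & 0x200) != 0; }
inline bool isDuplicate(uint16_t flag)  { return (flag & 0x400) != 0; }
```

For example, a read with FLAG 99 (0x63) counts toward MappedReads, PairedReads, and ProperPair; note that ProperPair requires both 0x001 and 0x002 to be set.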
<br />
{| style="margin: 1em 1em 1em 0; background-color: #f9f9f9; border: 1px #aaa solid; border-collapse: collapse;" border="1"<br />
|-style="background: #f2f2f2; text-align: center;"<br />
|+ '''Read Percentages'''<br />
! Statistic !! Description<br />
|-<br />
|MappingRate(%)<br />
| 100 * MappedReads/TotalReads<br />
|-<br />
|PairedReads(%)<br />
| 100 * PairedReads/TotalReads<br />
|-<br />
|ProperPair(%)<br />
| 100 * ProperPair/TotalReads<br />
|-<br />
|DupRate(%)<br />
| 100 * DuplicateReads/TotalReads<br />
|-<br />
|QCFailRate(%)<br />
| 100 * QCFailureReads/TotalReads<br />
|}<br />
<br />
{| style="margin: 1em 1em 1em 0; background-color: #f9f9f9; border: 1px #aaa solid; border-collapse: collapse;" border="1"<br />
|-style="background: #f2f2f2; text-align: center;"<br />
|+ '''Base Counts'''<br />
! Statistic !! Description<br />
|-<br />
|TotalBases<br />
| Sum of the SEQ lengths for all alignments that were successfully read from the file.<br />
NOTE: Includes bases that are 'N'.<br />
|-<br />
|BasesInMappedReads<br />
| Sum of the SEQ lengths for all alignments that were successfully read from the file with FLAG bit 0x004 set to 0 (not unmapped).<br />
NOTE: Includes bases that are 'N'.<br />
|}<br />
<br />
NOTE: If the TotalReads is greater than 10^6, then the Read Counts and Base Counts specify the total counts divided by 10^6. This is indicated in the output with a (e6) appended to the field name.<br />
<br />
==== Example Statistics Output ====<br />
<pre><br />
TotalReads(e6) 18.90<br />
MappedReads(e6) 14.77<br />
PairedReads(e6) 18.90<br />
ProperPair(e6) 11.28<br />
DuplicateReads(e6) 0.00<br />
QCFailureReads(e6) 0.00<br />
<br />
MappingRate(%) 78.17<br />
PairedReads(%) 100.00<br />
ProperPair(%) 59.68<br />
DupRate(%) 0.00<br />
QCFailRate(%) 0.00<br />
<br />
TotalBases(e6) 699.30<br />
BasesInMappedReads(e6) 546.67<br />
</pre><br />
<br />
== Usage Examples ==<br />
[[Sam Library Usage Examples]]</div>Ppwhite