Changes

From Genome Analysis Wiki
22 bytes added, 17:23, 23 February 2013
no edit summary
Line 1: Line 1:  
= checkVCF.py =
 
= checkVCF.py =
   −
checkVCF.py is a small tools written in Python script to check input [http://www.1000genomes.org/wiki/Analysis/Variant%20Call%20Format/vcf-variant-call-format-version-41 VCF] files before association tests. It will report monomorphic sites, sites with reference alleles inconsistent with the reference genome, sites with invalid genotypes, non-snp site (e.g. indels), and all sites with allele frequencies greater than ''0.5''. After you passed the checking, you can go on to run [https://github.com/zhanxw/rvtests rvtests] - rare-variant test software.
+
checkVCF.py is a small tool written in [http://www.python.org/ Python] to check input [http://www.1000genomes.org/wiki/Analysis/Variant%20Call%20Format/vcf-variant-call-format-version-41 VCF] files before association tests. It can report monomorphic sites, sites with reference alleles inconsistent with the reference genome, sites with invalid genotypes, non-SNP sites (e.g. indels), and all sites with allele frequencies greater than ''0.5''. After your VCF file passes these checks, you can go on to run [https://github.com/zhanxw/rvtests rvtests], a rare-variant test software.
    
== Download ==
 
== Download ==
   −
Download from [http://www.sph.umich.edu/csg/zhanxw/software/checkVCF/checkVCF-20130223.tar.gz this] and unzip downloaded file. This includes checkVCF.py script, reference genome in FASTA format and its index file.
+
Download the archive from [http://www.sph.umich.edu/csg/zhanxw/software/checkVCF/checkVCF-20130223.tar.gz this link] and unzip the downloaded file. The archive includes the checkVCF.py script, a reference genome in FASTA format, and its index file.
    
== Example ==
 
== Example ==
Line 14: Line 14:  
=== Console output and .log file ===
 
=== Console output and .log file ===
   −
Upon successfully running checkVCF.py on the example file, you will obtain following information:
+
Upon successfully running checkVCF.py on the example file, you will see the following output:
    
<pre>checkVCF.py -- check validity of VCF file for meta-analysis
 
<pre>checkVCF.py -- check validity of VCF file for meta-analysis
Line 35: Line 35:  
=== .check.nonSnp file ===
 
=== .check.nonSnp file ===
   −
This file includes all non-SNP sites. These can be detected when the length of your reference allele or alternative allele is larger than one. For example, reference allele is AT. Non-SNP sites also includes reference alleles that are not composited from 'A', 'C', 'G', 'T' alleles or alternative alleles that are not composited from 'A', 'T', 'G', 'C', '.' alleles.
+
This file includes all non-SNP sites. A site is reported as non-SNP when the length of its reference allele or alternative allele is larger than one, for example when the reference allele is AT. Non-SNP sites also include sites whose reference allele is not composed of 'A', 'C', 'G', 'T' or whose alternative allele is not composed of 'A', 'T', 'G', 'C', '.'.
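
The rule described above can be sketched in a few lines of Python. This is only an illustration of the stated criteria, not the code used inside checkVCF.py; the function name <code>is_non_snp</code> and the choice to treat lowercase bases as invalid are assumptions.

```python
# Hypothetical helper illustrating the non-SNP rule described above.
# It is NOT taken from checkVCF.py.
REF_BASES = set("ACGT")    # characters allowed in a SNP reference allele
ALT_BASES = set("ACGT.")   # characters allowed in a SNP alternative allele


def is_non_snp(ref, alt):
    """Return True if the REF/ALT pair would count as a non-SNP site."""
    # Any allele longer than one character (e.g. REF = "AT") is non-SNP.
    if len(ref) > 1 or len(alt) > 1:
        return True
    # Single-character alleles must come from the allowed alphabets.
    # Assumption: the comparison is case-sensitive.
    return ref not in REF_BASES or alt not in ALT_BASES
```

For example, <code>is_non_snp("AT", "A")</code> is true because the reference allele has two bases, while <code>is_non_snp("A", "G")</code> is false.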
    
=== .check.ref file ===
 
=== .check.ref file ===
Line 45: Line 45:  
=== .check.geno file ===
 
=== .check.geno file ===
   −
This file contains the line number in which genotypes are not found or not formatted correctly. You will get "IndividualMissingGTField" and "IndividualHasInvalidGT" warnings.
+
This file contains the line numbers on which genotypes are not found or not formatted correctly. You will get either an "IndividualMissingGTField" warning or an "IndividualHasInvalidGT" warning.
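
As a rough illustration of how these two warnings could arise, the sketch below inspects one tab-delimited VCF data line. The warning names come from the text above; everything else (the function name <code>check_genotypes</code>, the regular expression, and the exact conditions for each warning) is an assumption and may differ from what checkVCF.py actually does.

```python
import re

# Assumed definition of a well-formed GT value: allele indices or "."
# separated by "/" (unphased) or "|" (phased), e.g. "0/1", "1|1", "./.".
GT_PATTERN = re.compile(r"^(\.|\d+)([/|](\.|\d+))*$")


def check_genotypes(vcf_line):
    """Return a list of warning names for one VCF data line (hypothetical)."""
    fields = vcf_line.rstrip("\n").split("\t")
    format_keys = fields[8].split(":")      # FORMAT is the 9th column
    if "GT" not in format_keys:
        return ["IndividualMissingGTField"]
    gt_index = format_keys.index("GT")
    warnings = []
    for sample in fields[9:]:               # sample columns follow FORMAT
        values = sample.split(":")
        if gt_index >= len(values):
            warnings.append("IndividualMissingGTField")
        elif not GT_PATTERN.match(values[gt_index]):
            warnings.append("IndividualHasInvalidGT")
    return warnings
```

A line whose FORMAT column lacks GT yields one "IndividualMissingGTField" entry, and each sample whose GT value does not match the assumed pattern yields one "IndividualHasInvalidGT" entry.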
    
=== .check.af file ===
 
=== .check.af file ===
   −
This file contains the sites where alternative allele frequency larger than 0.5 . It is normal that this files contains a number of lines. For human exome chip, you are like to see 10k lines in this file. That means out of total 250k variants, around 10k SNP variants have allele frequency larger than 0.5.
+
This file contains the sites where the alternative allele frequency is larger than 0.5. It is normal for this file to contain a number of lines. For a human exome chip, you are likely to have ~10k lines in this file. That means that out of ~250k total variants, around 10k SNP variants have allele frequencies larger than 0.5.
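
To make the 0.5 threshold concrete, here is a minimal sketch that computes an alternative allele frequency from a list of GT strings. It assumes that missing alleles (".") are skipped and that every non-zero allele index counts as alternative; the function name <code>alt_allele_frequency</code> is hypothetical and the calculation in checkVCF.py may differ.

```python
import re


def alt_allele_frequency(genotypes):
    """Return the fraction of called alleles that are non-reference.

    `genotypes` is a list of GT strings such as "0/1" or "1|1".
    Returns None when no allele is called at the site.
    """
    alt_count = 0
    called = 0
    for gt in genotypes:
        for allele in re.split(r"[/|]", gt):
            if allele == ".":        # assumption: missing calls are ignored
                continue
            called += 1
            if allele != "0":        # any non-reference index counts as ALT
                alt_count += 1
    return alt_count / called if called else None
```

A site would then be listed in this file when the returned value is greater than 0.5, for example for the genotypes <code>["1/1", "0/1"]</code>.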
    
=== .check.mono file ===
 
=== .check.mono file ===
   −
This file contains the monomorphic sites. It is normal that this files contains a number of lines. In the ideal case, VCF files should only contain variant sites. However, it is practical or convenient to keep some monomorhipc sites in the VCF file. This file records the number of monomorphic sites.
+
This file contains the monomorphic sites. It is normal for this file to contain a number of lines. Ideally, VCF files should only contain variant sites. However, it is sometimes practical or convenient to keep some monomorphic sites in the VCF file. This file records these monomorphic sites.
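
One way to think about a monomorphic site is that all called alleles across samples are identical. The sketch below encodes that reading; the function name <code>is_monomorphic</code> and the decision to treat a site with no called genotypes as monomorphic are assumptions, not documented behavior of checkVCF.py.

```python
import re


def is_monomorphic(genotypes):
    """Return True if at most one distinct allele is called at the site."""
    observed = set()
    for gt in genotypes:
        for allele in re.split(r"[/|]", gt):
            if allele != ".":        # ignore missing calls
                observed.add(allele)
    # Assumption: a site with no called alleles is also reported.
    return len(observed) <= 1
```

Under this reading, <code>["0/0", "0|0", "./."]</code> is monomorphic, while <code>["0/0", "0/1"]</code> is not.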
    
== Contact ==
 
== Contact ==
    
Questions or comments can be sent to [mailto:zhanxw@umich.edu Xiaowei Zhan] or [mailto:dajiang@umich.edu Dajiang Liu].
 
Questions or comments can be sent to [mailto:zhanxw@umich.edu Xiaowei Zhan] or [mailto:dajiang@umich.edu Dajiang Liu].