Difference between revisions of "VcfCooker"
Revision as of 12:02, 20 January 2012
(Updated at 2012/01/20 10:47PM)
vcfCooker is a software tool that converts VCF/BED files into various forms. vcfCooker is currently under development and will be publicly released soon. This documentation covers only the minimal information on the functions that currently work.
Current Binary Location
The current binary version of vcfCooker is available at /net/fantasia/home/hmkang/sw/vcfCooker.
Basic Usage
The following parameters are available. Ones with "[]" are in effect:
Available Options
  Recipes : --write-bed, --write-vcf, --upgrade, --summarize, --filter, --subset
  VCF Input options : --in-vcf []
  BED Input options : --in-bfile [], --in-bed [], --in-bim [], --in-fam [], --ref [/data/local/ref/karma.ref/human.g1k.v37.fa]
  Subsetting options : --in-subset [], --mono-subset, --filt-only-subset
  Output Options : --out [./vcfCooker], --qGeno, --print-every [10000]
  Output compression Options : --plain [ON], --bgzf, --gzip
  Genotype-level Filter Options : --minGQ, --minGD
  Filter Options : --winIndel, --indelVCF [], --minQUAL, --minMQ, --maxDP [2147483647], --minDP, --maxABL [100], --winFFRQ, --maxFFRQ, --winFVAR, --merFVAR, --maxFVAR, --minNS, --maxSTP [2147483647], --maxTTT [2147483647], --minTTT [-2147483648], --maxSTR [100], --minSTR [-100], --maxSTZ [2147483647], --minSTZ [-2147483648], --maxCBR [100], --minCBR [-100], --maxQBR [100], --minQBR [-100], --maxCBZ [2147483647], --maxCSR [100], --minCSR [-100], --maxLQZ [2147483647], --minLQZ [-2147483648], --maxRBZ [2147483647], --minRBZ [-2147483648], --maxIOZ [2147483647], --minIOR [-2147483648], --maxIOR [2147483647], --maxAOZ [2147483647], --maxAOI [2147483647], --maxMQ0 [100], --maxMQ10 [100], --maxMQ20 [100], --minFIC [-2147483648], --minABE [-100], --maxABE [100], --maxLQR [100], --minMBR [-100], --maxMBR [100], --minABZ [-2147483648], --maxABZ [2147483647], --maxBCS [2147483647], --keepFilter
Converting between VCF/PLINK file format
To convert from VCF to PLINK (binary PED) format, use the following command:
vcfCooker --in-vcf [input-vcf-file] --out [output-bfile] --write-bed --verbose
This command will convert the file to PLINK format. It will work correctly only on biallelic SNPs.
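Because the conversion is only reliable for biallelic SNPs, it can help to check the input beforehand. The following is a minimal sketch of such a check, not part of vcfCooker; the function name `is_biallelic_snp` is hypothetical, and it assumes REF and ALT are taken directly from the VCF columns.

```python
def is_biallelic_snp(ref: str, alt: str) -> bool:
    """Return True when REF and ALT are each one A/C/G/T base and differ."""
    bases = {"A", "C", "G", "T"}
    ref, alt = ref.upper(), alt.upper()
    # A comma in ALT (multiple alternate alleles) or a multi-base allele
    # (indel) fails the membership test, as does the missing value ".".
    return ref in bases and alt in bases and ref != alt


if __name__ == "__main__":
    for ref, alt in [("A", "G"), ("A", "G,T"), ("AT", "A"), ("C", ".")]:
        print(ref, alt, is_biallelic_snp(ref, alt))
```

Sites for which the check returns False are the ones where the PLINK output should not be trusted.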
To convert from PLINK (binary PED) format to VCF format, use the following command:
vcfCooker --in-bfile [input-bfile] --out [output-vcf] --write-vcf --bgzf --verbose
This command will convert PLINK format into VCF format, matching the alleles against the reference sequence and assuming the forward strand by default. More specifically:
- If either of the two alleles matches the reference allele, it assumes the forward strand and determines REF/ALT.
- Otherwise, it checks whether flipping the strand makes either allele match the reference allele. If it does, it flips the strand and determines REF/ALT.
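The two rules above can be sketched roughly as follows. This is an illustration of the described logic, not vcfCooker's actual code; the function name `resolve_ref_alt` is hypothetical, and returning `None` when neither strand matches is an assumption, since the documentation does not say what happens in that case.

```python
COMPLEMENT = {"A": "T", "C": "G", "G": "C", "T": "A"}


def resolve_ref_alt(allele1, allele2, ref_base):
    """Return (REF, ALT, flipped), or None if no strand gives a match."""
    a1, a2, ref = allele1.upper(), allele2.upper(), ref_base.upper()
    # Rule 1: assume forward strand if either allele equals the reference.
    if ref in (a1, a2):
        alt = a2 if a1 == ref else a1
        return ref, alt, False
    # Rule 2: otherwise complement both alleles and try again.
    f1, f2 = COMPLEMENT.get(a1), COMPLEMENT.get(a2)
    if ref in (f1, f2):
        alt = f2 if f1 == ref else f1
        return ref, alt, True
    # Assumption: no match on either strand is reported as None.
    return None


if __name__ == "__main__":
    print(resolve_ref_alt("A", "G", "G"))  # forward strand match
    print(resolve_ref_alt("A", "G", "C"))  # match only after flipping
    print(resolve_ref_alt("A", "T", "C"))  # no match on either strand
```

Note that, because the forward strand is tried first, strand-ambiguous SNPs (A/T or C/G) are always treated as forward under this rule.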
Additional options include:
--ref [/data/local/ref/karma.ref/human.g1k.v37.fa] : Changes the genome reference sequence to compare against
--qGeno : Assigns genotype likelihood on the VCF file with fixed quality values (useful for data integration)
Subsetting the VCF file
Suppose that you have the following index file, saved as [subset-index], which lists a subset of the individuals in the VCF file together with the groups each one belongs to:
IND_ID_1 GROUP1,GROUP2,GROUP3
IND_ID_2 GROUP2
IND_ID_3 GROUP1,GROUP3
IND_ID_4 GROUP2,GROUP3
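To make the index layout concrete, the sketch below reads it into a mapping from group name to individual IDs. It is not vcfCooker's parser; the function name `parse_subset_index` is hypothetical, and it assumes whitespace-separated columns with comma-separated group names, as in the example.

```python
from collections import defaultdict


def parse_subset_index(lines):
    """Map each group name to the list of individual IDs assigned to it."""
    groups = defaultdict(list)
    for line in lines:
        fields = line.split()
        if len(fields) < 2:
            continue  # assumption: blank or incomplete lines are skipped
        ind_id, group_field = fields[0], fields[1]
        for group in group_field.split(","):
            groups[group].append(ind_id)
    return dict(groups)


if __name__ == "__main__":
    example = [
        "IND_ID_1 GROUP1,GROUP2,GROUP3",
        "IND_ID_2 GROUP2",
        "IND_ID_3 GROUP1,GROUP3",
        "IND_ID_4 GROUP2,GROUP3",
    ]
    for group, members in sorted(parse_subset_index(example).items()):
        print(group, members)
```

An individual may appear in several groups, so the groups are not required to be disjoint.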
If you run the following command:
vcfCooker --in-vcf [input-vcf-file] --out [output-prefix] --verbose --subset --in-subset [subset-index] --bgzf
vcfCooker will create the following set of files:
[output-prefix].GROUP1.vcf.gz
[output-prefix].GROUP2.vcf.gz
[output-prefix].GROUP3.vcf.gz
Each VCF contains only the markers that are polymorphic within the group (AC>0). The AC and AN fields are updated to reflect the subset.
Additional options include:
--mono-subset : Includes monomorphic SNPs in the subset
--filt-only-subset : Uses only PASS-filter SNPs for subsetting
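As an illustration of how AC and AN change for a subset, the sketch below recomputes both from the genotype strings of the retained individuals and applies the AC>0 rule. It is a simplified model under stated assumptions, not vcfCooker's implementation: the names `subset_ac_an` and `keep_site` are hypothetical, sites are assumed biallelic, and missing alleles (".") are assumed not to count toward AN.

```python
def subset_ac_an(genotypes, sample_ids, keep_ids):
    """Recompute ALT allele count (AC) and allele number (AN) for a subset."""
    keep = set(keep_ids)
    ac = an = 0
    for sample, gt in zip(sample_ids, genotypes):
        if sample not in keep:
            continue
        # Treat phased ("|") and unphased ("/") genotypes the same way.
        for allele in gt.replace("|", "/").split("/"):
            if allele == ".":
                continue  # assumption: missing alleles are excluded from AN
            an += 1
            if allele != "0":
                ac += 1
    return ac, an


def keep_site(ac, mono_subset=False):
    """Keep sites with AC>0; mono_subset mimics the --mono-subset option."""
    return mono_subset or ac > 0


if __name__ == "__main__":
    samples = ["S1", "S2", "S3", "S4"]
    gts = ["0/1", "1/1", "0/0", "./."]
    ac, an = subset_ac_an(gts, samples, ["S3", "S4"])
    print(ac, an, keep_site(ac), keep_site(ac, mono_subset=True))
```

In this example the site is variable in the full sample but monomorphic in the subset, so it is retained only when monomorphic sites are requested.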
Upgrading glfMultiples outputs (v 3.3 to v 4.0)
If you have an output from glfMultiples (06-16-2010), you can upgrade the output files using the following command:
vcfCooker --in-vcf [input-vcf-file-from-glfMultiples] --upgrade --out [output-vcf-file]
Upgraded VCFs will have the following improvements.
- The additional tab between the FORMAT field and the genotype values will be removed, if it exists
- The REF and ALT alleles will be presented as capital letters.
- The FORMAT field value, GT:GD:GQ, will be changed to GT:DP:GQ:PL
- depth will be changed to DP in the INFO field
- mapQ will be changed to MQ in the INFO field
- MAF will be changed to AF (AlleleFrequency) in the INFO field, with proper changes if needed.
- NS (NumSamples) will be added as a new INFO field
- AC (AlleleCount) will be added as a new INFO field
- AN (NumAlleles) will be added as a new INFO field
- AB (AlleleBalance) will be added as a new INFO field (suggested by Tom Blackwell at Genotype_Likelihood_Based_Allele_Balance)
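The renaming steps in this list can be pictured with a short sketch. It covers only the key renames and the capitalization of alleles; it does not add the new NS, AC, AN, or AB fields, and it leaves the AF value untouched even though the list mentions that the value may need adjusting. The names `INFO_RENAMES`, `upgrade_info`, and `upgrade_alleles` are hypothetical and not taken from vcfCooker.

```python
# Legacy INFO keys and their replacements, as listed above.
INFO_RENAMES = {"depth": "DP", "mapQ": "MQ", "MAF": "AF"}


def upgrade_info(info):
    """Rename legacy keys in a semicolon-separated INFO string."""
    upgraded = []
    for entry in info.split(";"):
        key, sep, value = entry.partition("=")
        # Flag-style entries without "=" pass through unchanged.
        upgraded.append(INFO_RENAMES.get(key, key) + sep + value)
    return ";".join(upgraded)


def upgrade_alleles(ref, alt):
    """Present REF and ALT as capital letters."""
    return ref.upper(), alt.upper()


if __name__ == "__main__":
    print(upgrade_info("depth=120;mapQ=58;MAF=0.12"))
    print(upgrade_alleles("a", "g"))
```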
Filtering a VCF file
An example command line for upgrading and filtering a glfMultiples output is as follows.
vcfCooker --in-vcf /home/csidore/1000g_CEUTSI_WG/analysis_chr20/vcf/TSI+CEU+GBR.Q10.chr20.vcf --out 1KG.20100517.EUR.chr20.vcf.gz --bgzf --upgrade \
  --filter --maxAB 65 --indelVCF /share/swg/hmkang/data/1000G/pilot_indels_2010_07/1kg.pilot_release.merged.indels.sites.hg19.chr20.vcf --winIndel 10 \
  --minDP 93 --maxDP 1860 --minNS 19 --minQUAL 10 --write-vcf --winFFRQ 10 --maxFFRQ 30
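To give a feel for what the depth, sample-count, and quality thresholds in this command express, here is a rough sketch that checks a single site against them. It is a simplification under explicit assumptions: the function name `failed_filters` and the dictionary layout are hypothetical, a site is assumed to fail only when a value lies strictly outside the threshold, and the window-based and allele-balance filters (--winIndel, --winFFRQ, --maxFFRQ, --maxAB) are not modeled. Whether vcfCooker marks or removes failing sites is not covered here.

```python
def failed_filters(site, min_dp=93, max_dp=1860, min_ns=19, min_qual=10):
    """Return the names of the thresholds that a site violates."""
    failed = []
    # Assumption: values equal to a threshold pass; only strict violations fail.
    if site["QUAL"] < min_qual:
        failed.append("minQUAL")
    if site["DP"] < min_dp:
        failed.append("minDP")
    if site["DP"] > max_dp:
        failed.append("maxDP")
    if site["NS"] < min_ns:
        failed.append("minNS")
    return failed


if __name__ == "__main__":
    print(failed_filters({"QUAL": 50, "DP": 500, "NS": 60}))
    print(failed_filters({"QUAL": 5, "DP": 2000, "NS": 10}))
```

The default values mirror the thresholds in the example command; an empty list means the site passes all four checks.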
Acknowledgements
vcfCooker is the result of a collaborative effort by Hyun Min Kang, Matthew Flickinger, Matthew Snyder, Paul Anderson, Tom Blackwell, Mary Kate Trost, and Goncalo Abecasis. Please email Hyun Min Kang [hmkang@umich.edu] with any questions.