VcfCooker

From Genome Analysis Wiki
Jump to: navigation, search


Please see GotCloud to download vcfCooker. Downloads are on github.

(Updated at 2012/01/20 10:47PM)

vcfCooker is a software that converts VCF/BED file formats in various forms. vcfCooker is currently under development, and will be publicly released soon. The current documentation contains the minimal information of currently working functions.

Current Binary Location

Current binary version of vcfCooker is available at /usr/cluster/bin/vcfCooker (as an in-house software).

Basic Usage

The following parameters are available.  Ones with "[]" are in effect:
Available Options
                         Recipes : --write-bed, --write-vcf, --upgrade,
                                   --summarize, --filter, --subset
               VCF Input options : --in-vcf []
               BED Input options : --in-bfile [], --in-bed [], --in-bim [],
                                   --in-fam [],
                                   --ref [/data/local/ref/karma.ref/human.g1k.v37.fa]
              Subsetting options : --in-subset [], --mono-subset,
                                   --filt-only-subset
                  Output Options : --out [./vcfCooker], --qGeno,
                                   --print-every [10000]
      Output compression Options : --plain [ON], --bgzf, --gzip
   Genotype-level Filter Options : --minGQ, --minGD
                  Filter Options : --winIndel, --indelVCF [], --minQUAL,
                                   --minMQ, --maxDP [2147483647], --minDP,
                                   --maxABL [100], --winFFRQ, --maxFFRQ,
                                   --winFVAR, --merFVAR, --maxFVAR, --minNS,
                                   --maxSTP [2147483647],
                                   --maxTTT [2147483647],
                                   --minTTT [-2147483648], --maxSTR [100],
                                   --minSTR [-100], --maxSTZ [2147483647],
                                   --minSTZ [-2147483648], --maxCBR [100],
                                   --minCBR [-100], --maxQBR [100],
                                   --minQBR [-100], --maxCBZ [2147483647],
                                   --maxCSR [100], --minCSR [-100],
                                   --maxLQZ [2147483647],
                                   --minLQZ [-2147483648],
                                   --maxRBZ [2147483647],
                                   --minRBZ [-2147483648],
                                   --maxIOZ [2147483647],
                                   --minIOR [-2147483648],
                                   --maxIOR [2147483647],
                                   --maxAOZ [2147483647],
                                   --maxAOI [2147483647], --maxMQ0 [100],
                                   --maxMQ10 [100], --maxMQ20 [100],
                                   --minFIC [-2147483648], --minABE [-100],
                                   --maxABE [100], --maxLQR [100],
                                   --minMBR [-100], --maxMBR [100],
                                   --minABZ [-2147483648],
                                   --maxABZ [2147483647],
                                   --maxBCS [2147483647], --keepFilter

Converting between VCF/PLINK file format

In order to convert from VCF to PLINK (binary PED) format, use the following command

vcfCooker --in-vcf [input-vcf-file] --out [output-bfile] --write-bed --verbose

This command will convert the file to PLINK format. It will work correctly only on biallelic SNPs.

In order to convert from PLINK (binary PED) format to VCF format, use the following command

vcfCooker --in-bfile [input-bfile] --out [output-vcf] --write-vcf --bgzf --verbose 

This command will convert PLINK format into VCF format, matching the reference sequence assuming forward strand by default. More specifically

  • If either of the two alleles matches with reference allele, it assumes forward strand and determine REF/ALT
  • Otherwise, it try to see if strand flipping make either allele match to the reference allele. if it does, it flips the strand and determine REF/ALT
Additional Options Includes
--ref [/data/local/ref/karma.ref/human.g1k.v37.fa] : To change the genome reference sequence to compare against
--qGeno : Assigns genotype likelihood on the VCF file with fixed quality values (useful for data integration)

Subsetting the VCF file

Suppose that you have the following index file consisting of subset of individuals in the VCF file as [subset-index]

IND_ID_1  GROUP1,GROUP2,GROUP3
IND_ID_2  GROUP2
IND_ID_3  GROUP1,GROUP3
IND_ID_4  GROUP2,GROUP3

If you run the following command:

vcfCooker --in-vcf [input-vcf-file] --out [output-prefix] --verbose --subset --in-subset [subset-index] --bgzf

Will create the following set of files

[output-prefix].GROUP1.vcf.gz
[output-prefix].GROUP2.vcf.gz
[output-prefix].GROUP3.vcf.gz

Where each VCF contains a marker polymorphic only within the group (AC>0). AC and AN fields will be updated reflecting the changes in the subset.

Additional Options Includes
--mono-subset : Includes monomorphic SNPs for the subsetting
--filt-only-subset : Use PASS-filter SNPs only for subsetting.

Genotype-level filtering a VCF file

If you want to filter individual genotypes base on genotype quality (GQ field) of genotype depth of (GD field), use either of the following commands

vcfCooker --in-vcf [input-vcf] --out [output-vcf] --bgzf --minGD [GD_thres] --write-vcf 
vcfCooker --in-vcf [input-vcf] --out [output-vcf] --bgzf --minGQ [GQ_thres] --write-vcf 

This will generate vcf file by changing the genotypes below the threshold to missing (./.), updating the AN and AC entry in the INFO field accordingly

Site level filtering a VCF file

The following options allows filtering a VCF file

vcfCooker --in-vcf [input-vcf] --out [output-vcf] --bgzf --filter --write-vcf 

After filtering the FILTER column updated with a filter tag as combination of key name (uppercase for maximum bound and lowercase for minimum bound) and threshold value. For example, DP10000 means that the site was filtered by the criteria of --maxDP 10000. dp10 means that the site was filtered by --minDP 10.

The following is currently supported filtering criteria (to PASS filters). Other criteria are not rigorously tested, so please use at your own risk

 --winIndel : Minimum distance with nearby INDELs (--indelVCF must be used together)
 --indelVCF : VCF file containing the known INDELs
 --minQUAL : Minimum SNP quality allows
 --minMQ : Minimum Mapping Quality
 --maxDP [2147483647] : Maximum Read Depth
 --minDP : Minimum Read Depth
 --maxABL [100] : Maximum % allele balance value based on genotype likelihood (formula by Tom Blackwell)
 --minNS : Minimum # of samples with positive depth
 --maxSTR [100] :  Maximum % strand balance correlation between REF/ALT and FWD/REV (-100 to 100) 
 --minSTR [-100] :  Minimum % strand balance correlation between REF/ALT and FWD/REV (-100 to 100) 
 --maxSTZ [2147483647] : Maximum Z score of the strand bias between REF/ALT and FWD/REV 
 --minSTZ [-2147483648] : Minimum Z score of strand bias between REF/ALT and FWD/REV 
 --maxCBR [100] : Maximum % cycle bias correlation between REF/ALT and read position
 --minCBR [-100] : Maximum % cycle bias correlation between REF/ALT and read position 
 --maxLQR [100] : Maximum % of low-quality base (0-100) among all reads
 --maxAOI [2147483647] : Maximum z-score quantifying incorrect calibration of base qualities 
 --maxMQ0 [100] : Maximum % of mapping quality = 0 reads
 --maxMQ10 [100] : Maximum % of mapping quality <= 10 reads
 --maxMQ20 [100] : Maximum % of mapping quality <= 20 reads
 --maxMQ30 [100] : Maximum % of mapping quality <= 30 reads
 --minFIC [-2147483648] : Minimum % inbreeding coefficient (-100 to 100)
 --minABE [0] : Minimum % allele balance (0 to 100) based on exact base quality
 --maxABE [100] : Minimum % allele balance (0 to 100) based on exact base quality
 --minABZ [-2147483648] : Minimum allele balance z-score based on exact base quality 
 --maxABZ [-2147483648] : Maximum allele balance z-score based on exact base quality 
 --keepFilter : Do not reset the filter, add filter tags to existing filter (default is OFF).

Upgrading glfMultiples outputs (v 3.3 to v 4.0)

If you have a output from glfMultiples (06-16-2010), you can upgrade the output files using the following command

vcfCooker --in-vcf [input-vcf-file-from-glfMultiples] --upgrade --out [output-vcf-file]

Upgraded VCFs will have the following improvements.

  • The additional tab between FORMAT field and genotype values will be removed, if exists
  • The REF and ALT alleles will be presented as capital letters.
  • The FORMAT field value, GT:GD:GQ will be changed to GT:DP:GQ:PL
  • depth will be changed to DP in the INFO field
  • mapQ will be changed to MQ in the INFO field
  • MAF will be changed to AF (AlleleFrequency) in the INFO field, with proper changes if needed.
  • NS (NumSamples) will be added as a new INFO field
  • AC (AlleleCount) will be added as a new INFO field
  • AN (NumAlleles) will be added as a new INFO field
  • AB (AlleleBalance) will be added as a new INFO field (suggested by Tom Blackwell at Genotype_Likelihood_Based_Allele_Balance)

Acknowledgements

vcfCooker is a result from collaborative effort by Hyun Min Kang, Matthew Flickinger, Matthew Snyder, Paul Anderson, Tom Blackwell, Mary Kate Wing, and Goncalo Abecasis. Please email to Hyun Min Kang [hmkang@umich.edu ] for any questions.