Difference between revisions of "DosageConvertor"

From Genome Analysis Wiki

Revision as of 19:35, 11 July 2017

  • Download/Re-Clone Release Version 1.0.4 (Updated July 2017) !

Introduction

DosageConvertor is a C++ tool to convert dosage files (in VCF format) from Minimac3/4 to other formats such as MaCH or PLINK.

Download

VERSION: 1.0.4 (Updated 7.12.2017) !

[NOTE: Cloning from GitHub is recommended so that updates can be pulled easily.]

Description                         Download Link
Github Repository                   DosageConvertor - Github
Source Files                        UNIX Users
Binary Executable (Ubuntu 4.8.4)    UNIX Users

Binary executables are NOT guaranteed to run on every Linux machine. If you have trouble with the executable, please compile from the source files or clone the GitHub repository. Otherwise, contact the author, Sayantan Das.

Installation

If you downloaded the source files, follow these steps to compile DosageConvertor.

## DOWNLOAD, EXTRACT AND COMPILE DOSAGECONVERTOR
 
wget ftp://share.sph.umich.edu/minimac3/DosageConvertor/DosageConvertor.v1.0.3.tar.gz
tar -xzvf DosageConvertor.v1.0.3.tar.gz
cd DosageConvertor/
make

Usage

Convert to PLINK Files

The following command will convert an input VCF dosage file to a PLINK dosage file, which can be used for downstream analysis with PLINK1.9 or PLINK2.0.

# --info is optional; --type plink is the default; --format can be 1, 2, or 3
./DosageConvertor         --vcfDose      TestDataImputedVCF.dose.vcf.gz \
                          --info         TestDataImputedVCF.info \
                          --prefix       OutPrefix \
                          --type         plink \
                          --format       1

This command will create three files: OutPrefix.plink.dosage.gz, OutPrefix.fam, and OutPrefix.map. The .fam and .map formats are described here. The --format parameter can take values 1, 2, or 3, each corresponding to one of the three formats available for PLINK dosage files (details on PLINK dosage files are given here). Note that the generated OutPrefix.fam does NOT contain any phenotype information (it will need to be edited manually before PLINK can perform association tests). The OutPrefix.fam will also NOT contain sex information unless chromosome X is available. See Converting Chromosome X Files for details.
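PLINK reads the phenotype from the sixth column of the .fam file, so that column has to be filled in before an association test. The following is a minimal sketch of one way to do this with awk, not part of DosageConvertor itself. The phenotype file pheno.txt (two columns: IID and phenotype), the output name OutPrefix.pheno.fam, and the -9 placeholder in the demo rows are all assumptions made for illustration.

```shell
# Hypothetical inputs: OutPrefix.fam as written by DosageConvertor and
# pheno.txt with two whitespace-separated columns (IID, phenotype).
# Small demo files are created here so the sketch runs on its own.
printf 'FAM1 S1 0 0 0 -9\nFAM2 S2 0 0 0 -9\nFAM3 S3 0 0 0 -9\n' > OutPrefix.fam
printf 'S1 2\nS3 1\n' > pheno.txt

# First pass (pheno.txt): load phenotypes keyed by IID.
# Second pass (OutPrefix.fam): overwrite column 6 where a phenotype is known;
# samples without a phenotype keep their original line.
awk 'NR == FNR { pheno[$1] = $2; next }
     ($2 in pheno) { $6 = pheno[$2] }
     { print }' pheno.txt OutPrefix.fam > OutPrefix.pheno.fam

cat OutPrefix.pheno.fam
```

The edited file can then be passed to PLINK together with the dosage and .map files (in PLINK 1.9, through the --dosage, --fam and --map options; check the PLINK documentation for the modifiers that match the chosen --format value).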

Convert to MaCH Files

The following command will convert an input VCF dosage file to a MaCH/minimac dosage file (the format for previous versions of minimac). The generated dosage files can be tested for association using mach2dat.

# --info is optional; --format can be 1 or 2
./DosageConvertor         --vcfDose      TestDataImputedVCF.dose.vcf.gz \
                          --info         TestDataImputedVCF.info \
                          --prefix       OutPrefix \
                          --type         mach \
                          --format       1

When --type mach is used, the --format parameter can only take values 1 and 2.

  • If the value is 1, the code generates OutPrefix.mach.dose.gz and OutPrefix.info, where OutPrefix.mach.dose.gz contains the expected alternate allele count (one value per sample per marker).
  • If the value is 2, it generates OutPrefix.mach.gprob.gz and OutPrefix.info, where OutPrefix.mach.gprob.gz contains the genotype likelihoods for reference homozygote and heterozygote (two values per sample per marker).

Note that inputting the info file using --info is optional. However, if this info file is NOT provided, the output OutPrefix.info file will have some empty columns. Thus, if available, the generated info file should be provided along with the VCF file as input.

Converting Chromosome X Files

For a minimac3/4 output file containing the pseudo-autosomal region (PAR) on chromosome X, no extra parameter is necessary. For files containing the non-PAR region, please ensure the following:

  • If your input VCF dosage file has males as diploids, just add the --allDiploid option. This will NOT generate sex information in the output PLINK .fam file.
    • If you still need the sex column in the .fam file to be correctly updated, supply a sex file using --sexFile SomeFile, where SomeFile has two columns: the first column has the sample names as found in the VCF file, and the second column has M (for males) or F (for females).
  • If your input VCF dosage file has males as haploids and also has GT information, the tool will automatically determine the sex of the samples and report it in the output .fam file. No extra parameters are required.
    • If GT tags are NOT available, you need to supply the sex file as described above. Otherwise the tool will throw an error.
    • NOTE: If your VCF file has males as haploids, do NOT use --allDiploid. The code will NOT throw any error, but the output will be erroneous.
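The two-column sex file for --sexFile can be derived from an existing PLINK .fam file when one is available. The following is a minimal sketch under stated assumptions: study.fam is a hypothetical file name, its fifth column uses the standard PLINK sex coding (1 = male, 2 = female), and its IID column matches the sample names in the VCF file. The space-separated output is also an assumption; adjust it if the tool expects another separator.

```shell
# Hypothetical input: study.fam, an existing PLINK .fam whose fifth column
# codes sex as 1 (male) or 2 (female). Demo rows are created here.
printf 'FAM1 S1 0 0 1 -9\nFAM2 S2 0 0 2 -9\nFAM3 S3 0 0 0 -9\n' > study.fam

# Write "sample M" or "sample F" for every sample with known sex; samples
# coded 0 (unknown) are reported on stderr instead of being guessed.
awk '$5 == 1 { print $2, "M"; next }
     $5 == 2 { print $2, "F"; next }
     { print "unknown sex for " $2 > "/dev/stderr" }' study.fam > SomeFile

cat SomeFile
```

Samples reported as unknown need to be resolved by hand before SomeFile is passed to --sexFile.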

Command Line Options

The command options for DosageConvertor are explained below.

Option Description
--vcfDose

mandatory parameter indicating the minimac3/4 VCF dosage file to be converted

--info

the info file generated by minimac3/4 at the same time as the VCF dosage file

(This parameter is optional, but if NO info file is provided, the output MaCH info file will have missing columns.)

--prefix

sets the prefix for output files (default value: Converted.Dosage)

--type

sets the output file format (available options: plink (default) or mach)

--tag

indicates whether to import imputed values from dosages (DS: default), genotype probabilities (GP), or hard genotype calls (GT) from the input VCF file

--format

sets the format of the converted output file:

  • If --type plink is used, --format can take values 1, 2, or 3. Each of these values corresponds to one of the three formats available for PLINK dosage files (details given here)
  • If --type mach is used, --format can take values 1 or 2. Details are given in Convert to MaCH Files

--buffer

sets the number of markers to import at a time (MaCH format only) (default value 10000)

--idDelimiter

indicates the delimiter character used to split VCF Sample ID into FID and IID for PLINK format

--allDiploid

indicates whether to assume all samples are diploid (necessary for chromosome X). If this option is active, the output PLINK .fam will NOT contain any sex information

--sexFile

indicates a file containing sample sex information, which requires two columns: the first column contains the sample names as found in the VCF file, and the second column contains either M (for males) or F (for females)

--TrimAlleles

indicates whether to trim alleles and variant IDs to 100 characters. Since PLINK does not allow variant IDs longer than 16,000 characters, this option can be used if variant names are too long
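To preview what a given --idDelimiter value would mean for a set of sample IDs, the split can be imitated with awk before running the conversion. This is only an illustrative sketch, not the tool's implementation: the FID_IID style IDs, the file names samples.txt and fid_iid.txt, the choice to split at the first occurrence of the delimiter, and the fallback for IDs without a delimiter are all assumptions.

```shell
# Hypothetical sample IDs in the FID_IID style; "_" is the assumed delimiter.
printf 'FAM1_S1\nFAM2_S2\n' > samples.txt
delim="_"

# Split each ID at the first occurrence of the delimiter and print FID and IID.
awk -v d="$delim" '{
    i = index($0, d)
    if (i == 0) { print $0, $0 }   # no delimiter found: reuse the ID for both fields
    else        { print substr($0, 1, i - 1), substr($0, i + length(d)) }
}' samples.txt > fid_iid.txt

cat fid_iid.txt
```

Comparing this preview with the OutPrefix.fam produced by the tool shows whether the chosen delimiter behaves as expected for IDs that contain it more than once.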

Contact

For queries or bug reports, please contact Sayantan Das.