Genome Analysis Wiki β

LibStatGen: VCF

4,412 bytes added, 16:09, 23 January 2013
<code>VcfFileReader</code> is declared in <code>VcfFileReader.h</code>, so be sure to include that file.
<source lang="cpp">
#include "VcfFileReader.h"
</source>
</source>
==== Reading a Subset of Samples ====
To keep only a subset of samples, also specify, when opening the file, the name of the file that lists the samples to keep and the delimiter that separates the sample names (the default is a newline, <code>'\n'</code>).
<source lang="cpp">
and discards reads that do not have <code>PASS</code> in the <code>FILTER</code> field and reads that have a genotype that is not phased or have no <code>GT</code> in the <code>FORMAT</code> fields.
There are additional discard rules that can be specified by calling methods on <code>VcfFileReader</code>.

===== Minimum Alternate Allele Count =====
To discard any records without a minimum number of alternate alleles, use:
<source lang="cpp">
</source>
The <code>minAltAlleleCount</code> parameter is the minimum number of alternate alleles that must be found in the specified subset (if specified) for the record to be kept. The <code>VcfSubsetSamples* subset</code> parameter is a pointer to the subset of samples to include when counting the number of alternate alleles. To include all samples that are read/kept, pass in NULL. See [[#Handling a Subset of Samples|Handling a Subset of Samples]] for how to use <code>VcfSubsetSamples</code>.

Use the following method to remove the DiscardMinAltAlleleCount rule:
<source lang="cpp">
VcfFileReader::rmDiscardMinAltAlleleCount()
</source>

Example: Minimum Alternate Allele Count = 4
 Sample1  Sample2  Sample3  Keep/Discard
 0|0      1|1      2|2      Keep
 0|0      0|1      2|2      Discard, only 3 Alternates (1 Allele1 & 2 Allele2)
 0|0      1|1      1|2      Keep
 0|2      1|1      2|2      Keep
 2|1      0|1      2|0      Keep

Example: Minimum Alternate Allele Count = 3 & Exclude Sample2 (without the exclusion, all would be kept)
 Sample1  Sample2  Sample3  Keep/Discard
 0|0      1|1      2|2      Discard, only 2 Alternates (0 Allele1 & 2 Allele2)
 0|0      0|1      2|2      Discard, only 2 Alternates (0 Allele1 & 2 Allele2)
 0|0      1|1      1|2      Discard, only 2 Alternates (1 Allele1 & 1 Allele2)
 0|2      1|1      2|2      Keep
 2|1      0|1      2|0      Keep

===== Minimum Minor Allele Count =====
To discard any records without a minimum number of minor alleles, use:
<source lang="cpp">
VcfFileReader::addDiscardMinMinorAlleleCount(int32_t minMinorAlleleCount, VcfSubsetSamples* subset)
</source>

The <code>minMinorAlleleCount</code> parameter is the minimum number of minor alleles that must be found in the specified subset (if specified) for the record to be kept. The <code>VcfSubsetSamples* subset</code> parameter is a pointer to the subset of samples to include when counting the number of alleles. To include all samples that are read/kept, pass in NULL. See [[#Handling a Subset of Samples|Handling a Subset of Samples]] for how to use <code>VcfSubsetSamples</code>.
Use the following method to remove the DiscardMinMinorAlleleCount rule:
<source lang="cpp">
VcfFileReader::rmDiscardMinMinorAlleleCount()
</source>

Example: Minimum Minor Allele Count = 2
 Sample1  Sample2  Sample3  Keep/Discard
 0|0      1|1      2|2      Keep
 0|0      0|1      2|2      Discard, only 1 Allele1
 0|0      1|1      1|2      Discard, only 1 Allele2
 0|2      1|1      2|2      Discard, only 1 Allele0
 2|1      0|1      2|0      Keep

Example: Minimum Minor Allele Count = 1 & Exclude Sample2 (without the exclusion, all would be kept)
 Sample1  Sample2  Sample3  Keep/Discard
 0|0      1|1      2|2      Discard, 0 Allele1
 0|0      0|1      2|2      Discard, 0 Allele1
 0|0      1|1      1|2      Keep
 0|2      1|1      2|2      Discard, 0 Allele1
 2|1      0|1      2|0      Keep
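
The counting behind the example tables can be sketched in a few lines of standalone C++. This is not the library's implementation: the <code>Genotypes</code> type, the function names, and the <code>excluded</code> mask are all hypothetical, and the sketch assumes (as the tables suggest) that the minor allele count is the smallest count over every allele of the record, reference included.

```cpp
#include <algorithm>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical stand-in for one record: one inner vector of allele
// indices per sample (0 = reference, 1..n = alternate alleles).
using Genotypes = std::vector<std::vector<int>>;

// Count alternate alleles (index > 0) over the samples that are not excluded.
// excluded[i] == true skips sample i; an empty mask keeps every sample.
int32_t countAltAlleles(const Genotypes& gts,
                        const std::vector<bool>& excluded = {})
{
    int32_t count = 0;
    for (std::size_t i = 0; i < gts.size(); ++i) {
        if (i < excluded.size() && excluded[i]) continue;
        for (int allele : gts[i]) {
            if (allele > 0) ++count;
        }
    }
    return count;
}

// Smallest per-allele count over alleles 0..numAlts (numAlts >= 0).
// Assumption: values outside that range (e.g. missing) are ignored.
int32_t minAlleleCount(const Genotypes& gts, int numAlts,
                       const std::vector<bool>& excluded = {})
{
    std::vector<int32_t> counts(static_cast<std::size_t>(numAlts) + 1, 0);
    for (std::size_t i = 0; i < gts.size(); ++i) {
        if (i < excluded.size() && excluded[i]) continue;
        for (int allele : gts[i]) {
            if (allele >= 0 && allele <= numAlts) ++counts[allele];
        }
    }
    return *std::min_element(counts.begin(), counts.end());
}

// A record is kept when the relevant count reaches the configured minimum.
bool keepByMinAlt(const Genotypes& gts, int32_t minAlt,
                  const std::vector<bool>& excluded = {})
{
    return countAltAlleles(gts, excluded) >= minAlt;
}

bool keepByMinMinor(const Genotypes& gts, int numAlts, int32_t minMinor,
                    const std::vector<bool>& excluded = {})
{
    return minAlleleCount(gts, numAlts, excluded) >= minMinor;
}
```

For the row <code>0|0 0|1 2|2</code>, <code>countAltAlleles</code> gives 3, so the record is discarded at a minimum of 4; passing a mask that excludes Sample2 drops the count to 2.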
==== Read only Certain Sections of the File / Using a VCF Index (TABIX) File ====
** <code>allPhased()</code> returns true if all the samples are phased and none unphased and false if any are not phased
** <code>allUnphased()</code> returns true if all the samples are unphased and none phased and false if any are not unphased
** <code>hasAllGenotypeAlleles()</code> returns true if all the samples have all the genotype alleles specified and false if any are missing or the GT field is missing
** <code>getNumSamples()</code> returns the number of samples (that are kept)
** <code>getNumGTs(int index)</code> returns the number of GTs for the specified sample index (starts at 0)
** <code>getGT(int sampleNum, unsigned int gtIndex)</code> returns the integer GT value for the specified sampleNum and GT index (both start at 0). <code>VcfGenotypeSample::INVALID_GT</code> is returned if the sampleNum or GT index is out of range, and <code>VcfGenotypeSample::MISSING_GT</code> is returned if the GT is '.'. (This method is also found in <code>VcfRecordGenotype</code>.)
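
The accessors above suggest a simple nested loop over samples and GT indices. The sketch below shows that pattern against a hypothetical stand-in type, <code>FakeGenotypes</code>, which only mimics the three method signatures listed; the numeric values chosen for <code>INVALID_GT</code> and <code>MISSING_GT</code> are placeholders, not the library's values, and <code>summarize</code> is an illustrative helper rather than part of libStatGen.

```cpp
#include <cstddef>
#include <vector>

// Hypothetical stand-in that mimics the accessor signatures described above.
struct FakeGenotypes {
    // Placeholder sentinel values; the real constants live in VcfGenotypeSample.
    enum { INVALID_GT = -1, MISSING_GT = -2 };

    std::vector<std::vector<int>> samples;  // one vector of GT values per sample

    int getNumSamples() const { return static_cast<int>(samples.size()); }

    int getNumGTs(int index) const {
        if (index < 0 || index >= getNumSamples()) return 0;
        return static_cast<int>(samples[static_cast<std::size_t>(index)].size());
    }

    int getGT(int sampleNum, unsigned int gtIndex) const {
        if (sampleNum < 0 || sampleNum >= getNumSamples()) return INVALID_GT;
        const std::vector<int>& gts = samples[static_cast<std::size_t>(sampleNum)];
        if (gtIndex >= gts.size()) return INVALID_GT;
        return gts[gtIndex];
    }
};

struct GTSummary {
    int called = 0;   // GT values that are real allele indices
    int missing = 0;  // GT values reported as '.'
};

// Walk every sample and every GT index, separating called from missing values.
// Works with any type exposing the three accessors and the two sentinels.
template <typename Rec>
GTSummary summarize(const Rec& rec)
{
    GTSummary summary;
    for (int s = 0; s < rec.getNumSamples(); ++s) {
        const unsigned int numGTs = static_cast<unsigned int>(rec.getNumGTs(s));
        for (unsigned int g = 0; g < numGTs; ++g) {
            const int gt = rec.getGT(s, g);
            if (gt == Rec::MISSING_GT) {
                ++summary.missing;
            } else if (gt != Rec::INVALID_GT) {
                ++summary.called;
            }
        }
    }
    return summary;
}
```

Checking for both sentinels before using a GT value as an allele index keeps out-of-range lookups and '.' genotypes from being counted as alleles.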
=== VcfRecordFilter ===
A <code>VcfRecord</code> stores the data from the <code>FILTER</code> field in a <code>VcfRecordFilter</code> object.
 
 
== Handling a Subset of Samples ==
 
If you want to process/keep only a subset of samples when reading a file, use [[#Reading a Subset of Samples|Reading a Subset of Samples]]. When that method is used, only the specified samples are stored, so any further processing applies only to those samples.
 
Some methods allow the user to specify a subset of samples to operate on. The subset specified when reading the VCF file, if any, is automatically applied since only those samples were stored. If a different/additional subset needs to be applied for other processing, you can use the <code>VcfSubsetSamples</code> class.
 
To set up a <code>VcfSubsetSamples</code> object, pass the already set VCF header to:
<source lang="cpp">
void VcfSubsetSamples::init(const VcfHeader& header, bool include)
</source>
Set the <code>include</code> parameter to:
* true if all samples should be included except any that are specified as excluded.
* false if all samples should be excluded except any that are specified as included.
 
NOTE: the header is not modified to add/remove any samples.
 
To mark a specific sample as excluded, use:
<source lang="cpp">
bool VcfSubsetSamples::addExcludeSample(const char* sampleName);
</source>
To mark a specific sample as included, use:
<source lang="cpp">
bool VcfSubsetSamples::addIncludeSample(const char* sampleName);
</source>
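
The include/exclude semantics described above can be modelled with a small standalone class. This is only a sketch: <code>SubsetModel</code> and its <code>keep</code> method are hypothetical, a plain list of sample names stands in for the <code>VcfHeader</code>, and returning false for an unknown sample name is an assumption about what the <code>bool</code> return values mean.

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical model of the include/exclude behaviour; not the library class.
class SubsetModel {
public:
    // Mirrors init(header, include): every sample starts out included
    // (include == true) or excluded (include == false).
    void init(const std::vector<std::string>& sampleNames, bool include)
    {
        myNames = sampleNames;
        myKeep.assign(sampleNames.size(), include);
    }

    bool addExcludeSample(const std::string& name) { return mark(name, false); }
    bool addIncludeSample(const std::string& name) { return mark(name, true); }

    // True if the sample at this index takes part in the subset.
    bool keep(std::size_t index) const
    {
        return index < myKeep.size() && myKeep[index];
    }

private:
    // Assumption: a name that is not in the sample list is reported as failure.
    bool mark(const std::string& name, bool value)
    {
        for (std::size_t i = 0; i < myNames.size(); ++i) {
            if (myNames[i] == name) {
                myKeep[i] = value;
                return true;
            }
        }
        return false;
    }

    std::vector<std::string> myNames;
    std::vector<bool> myKeep;
};
```

Initialising with <code>include = true</code> and then excluding Sample2 corresponds to the "Exclude Sample2" examples; initialising with <code>include = false</code> and adding samples one by one builds the subset from an empty set instead.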