Changes

From Genome Analysis Wiki
Jump to navigationJump to search
5,630 bytes added ,  18:44, 28 March 2012
Created page with 'Category:C++ Category:libStatGen Category:libStatGen VCF = Variant Call Format (VCF) = The documentation on the Variant Call Format (VCF) can be found at: http://ww…'
[[Category:C++]]
[[Category:libStatGen]]
[[Category:libStatGen VCF]]

= Variant Call Format (VCF) =

The documentation on the Variant Call Format (VCF) can be found at: http://www.1000genomes.org/wiki/Analysis/Variant%20Call%20Format/vcf-variant-call-format-version-41

<span style="color:#D2691E">'''Our APIs are currently in the initial test phase, and are not yet available for public use.'''</span>

== API for Reading VCF Files ==

You will use both an <code>VcfFileReader</code> and an <code>VCfRecord</code> for reading VCF files.

=== <code>VcfFileReader</code> ===
An instance of the VcfFileReader class is used to read VCF files.

==== include file ====
<code>VcfFileReader</code> is declared in <code>VcfFileReader.h</code>, so be sure to include that file.
<source lang="cpp">
#include "AspFileReader.h"
</source>

==== Opening the VCF File ====
<code>open</code> opens the specified file and by default throws an exception if it was not successfully opened.

There are two methods for opening a file for reading. One is standard while the other allows the specification of a subset of samples to keep.

Both methods take the filename to be read as well as a reference to the VcfHeader to be populated.

The standard method is:
<source lang="cpp">
// Open the vcf file for reading.
VcfFileReader reader;
VcfFileHeader header;
reader.open("vcfFileName.vcf", header);
</source>

==== Subsetting Samples ====
To select only a subset of samples to keep, when opening the file also specify the name of the file containing the names of the samples to keep and the delimiter separating the sample names (default is a new line, '\n').
<source lang="cpp">
// Open the vcf file for reading.
VcfFileReader reader;
VcfFileHeader header;
// Subset 1 is delimited by new lines, '\n'.
reader.open("vcfFileName.vcf", header, "subsetFile1.txt");
</source>
<source lang="cpp">
// Open the vcf file for reading.
VcfFileReader reader;
VcfFileHeader header;
// Subset 2 is delimited by ';'
reader.open("vcfFileName.vcf", header, "subsetFile2.txt", ';');
</source>

==== Reading a VCF Record ====
Once the file is open, get the next record using <code>readRecord</code>.

It returns true if a record was successfully found and false on EOF or an error. Typically on a reading error an exception is thrown.

<code>readRecord</code> takes a reference to a <code>VcfRecord</code> object as a parameter. When true is returned, the <code>VcfRecord</code> is updated with the next record.

If you only process one record at a time (are done with a record before reading the next one), you can loop until <code>readRecord</code> return false, reusing the same record for each call.
<source lang="cpp">
VcfRecord record;
while(reader.readRecord(record))
{
// Your record specific processing here.
}
// Done reading the file.
</source>

If any subsetting was specified when the file was opened, the record that is returned will have only the specified subset of samples in it.

When reading a record, you can also specify discard rules. Internal processing of <code>readRecord</code> will apply the discard rules on the record that was read and will continue reading records and not return from the method until a record is found that should not be discarded (or until the end of the file). True is returned if a record that should be kept was found, false if the end of the file was hit without finding a record to keep.

==== Specifying Discard Rules ====
When specifying the discard rules, you should use the constants found at the top of VcfFileReader.h

For Example (see the file for the current set of Discard Rules and the associated values):
<source lang="cpp">
static const int DISCARD_NON_PHASED = 0x1;
static const int DISCARD_MISSING_GT = 0x2;
static const int DISCARD_FILTERED = 0x4;
static const int DISCARD_MULTIPLE_ALTS = 0x8;
</source>

You can set the Discard using <code>setDiscardRules(uint32_t)</code>. You can add additional rules to what has already been set by using <code>addDiscardRules(uint32_t)</code>.

<source lang="cpp">
reader.setDiscardRules(DISCARD_FILTERED);
reader.addDiscardRules(DISCARD_NON_PHASED);
</source>
is the same as
<source lang="cpp">
reader.setDiscardRules(DISCARD_FILTERED | DISCARD_NON_PHASED);
</source>
and discards reads that do not have <code>PASS</code> in the <code>FILTER</code> field and reads that have a genotype that is not phased or have no <code>GT</code> in the <code>FORMAT</code> fields.

==== Check if EOF was reached ====
To see if the End Of the File has been reached, use <code>isEOF()</code>.
<source lang="cpp">
reader.isEOF();
</source>
True is returned if the End of the File has been reached, false if not.

==== See How Many Files Were Read ====
To see the total number of records in the file that were read, including any that were discarded, use <code>getNumRecords</code>.

To see only the records in the file that were kept (not discarded) use <code>getNumKeptRecords</code>.
<source lang="cpp">
std::cout << reader.getNumRecords() << " were found in the file\n."
<< "but only " << reader.getNumKeptRecords() << " were processed\n.";
</source>


==== Closing the VCF File ====
<code>close</code> closes the opened VCF file.

<source lang="cpp">
reader.close();
</source>

=== <code>VcfRecord</code> ===
VCF records are read from VCF files using <code>readRecord</code>.

Once you have the record, use methods from the <code>VcfRecord</code> to extract the desired information.

Navigation menu