LibStatGen: VCF

From Genome Analysis Wiki
Revision as of 15:51, 18 April 2012 by Mktrost (talk | contribs)
Jump to navigationJump to search


Variant Call Format (VCF)

The documentation on the Variant Call Format (VCF) can be found at: http://www.1000genomes.org/wiki/Analysis/Variant%20Call%20Format/vcf-variant-call-format-version-41

Our APIs are currently in the initial test phase, and are not yet available for public use.

API for Reading VCF Files

You will use both an VcfFileReader and an VCfRecord for reading VCF files.

VcfFileReader

An instance of the VcfFileReader class is used to read VCF files.

include file

VcfFileReader is declared in VcfFileReader.h, so be sure to include that file.

#include "AspFileReader.h"

Opening the VCF File

open opens the specified file and by default throws an exception if it was not successfully opened.

There are two methods for opening a file for reading. One is standard while the other allows the specification of a subset of samples to keep.

Both methods take the filename to be read as well as a reference to the VcfHeader to be populated.

The standard method is:

    // Open the vcf file for reading.
    VcfFileReader reader;
    VcfHeader header;
    reader.open("vcfFileName.vcf", header);

Subsetting Samples

To select only a subset of samples to keep, when opening the file also specify the name of the file containing the names of the samples to keep and the delimiter separating the sample names (default is a new line, '\n').

    // Open the vcf file for reading.
    VcfFileReader reader;
    VcfHeader header;
    // Subset 1 is delimited by new lines, '\n'.
    reader.open("vcfFileName.vcf", header, "subsetFile1.txt");
    // Open the vcf file for reading.
    VcfFileReader reader;
    VcfHeader header;
    // Subset 2 is delimited by ';'
    reader.open("vcfFileName.vcf", header, "subsetFile2.txt", ';');

Reading a VCF Record

Once the file is open, get the next record using readRecord.

It returns true if a record was successfully found and false on EOF or an error. Typically on a reading error an exception is thrown.

readRecord takes a reference to a VcfRecord object as a parameter. When true is returned, the VcfRecord is updated with the next record.

If you only process one record at a time (are done with a record before reading the next one), you can loop until readRecord return false, reusing the same record for each call.

        VcfRecord record;
        while(reader.readRecord(record))
        {
            // Your record specific processing here.
        }
        // Done reading the file.

If any subsetting was specified when the file was opened, the record that is returned will have only the specified subset of samples in it.

When reading a record, you can also specify discard rules. Internal processing of readRecord will apply the discard rules on the record that was read and will continue reading records and not return from the method until a record is found that should not be discarded (or until the end of the file). True is returned if a record that should be kept was found, false if the end of the file was hit without finding a record to keep.

Specifying Discard Rules

When specifying the discard rules, you should use the constants found at the top of VcfFileReader.h

For Example (see the file for the current set of Discard Rules and the associated values):

    static const int DISCARD_NON_PHASED = 0x1;
    static const int DISCARD_MISSING_GT = 0x2;
    static const int DISCARD_FILTERED = 0x4;
    static const int DISCARD_MULTIPLE_ALTS = 0x8;

You can set the Discard using setDiscardRules(uint32_t). You can add additional rules to what has already been set by using addDiscardRules(uint32_t).

        reader.setDiscardRules(DISCARD_FILTERED);
        reader.addDiscardRules(DISCARD_NON_PHASED);

is the same as

        reader.setDiscardRules(DISCARD_FILTERED | DISCARD_NON_PHASED);

and discards reads that do not have PASS in the FILTER field and reads that have a genotype that is not phased or have no GT in the FORMAT fields.

Read only Certain Sections of the File / Using a VCF Index (TABIX) File

The VCF library has the capability of using a tabix file to allow random access in the VCF file.

To use the INDEX File to read sections of the bam file:

  1. Open the VCF file using: VcfFileReader::open
  2. Open the VCF Index file using: VcfFileReader::readVcfIndex
  3. Set the Section to be read: VcfFileReader::setReadSection or VcfFileReader::set1BasedReadSection
  4. Read records from the section: VcfFileReader::readRecord

VcfFileReader::readVcfIndex returns true if the index file was successfully read and false if not.

    bool readVcfIndex(const char * filename);
    bool readVcfIndex();

If you specify a filename, it looks for the index file with that path/name. If you do not specify the index file name, it will attempt to find it. First it searches for a file with your VCF FileName + a ".tbi" extension. If that isn't found, it removes the ".vcf" if found in your VCF FileName and looks for a file with that name and a ".tbi" extension.

After opening both the VCF File and the Index file, set the section that should be read next using setReadSection or set1BasedReadSection.

    bool setReadSection(const char* chromName);
    bool set1BasedReadSection(const char* chromName, 
                              int32_t start, int32_t end);

Both methods currently always return true, since in the current implementation this can't fail. It currently allows the VcfIndex file to be read after this call.

The read section is reset when a new file is opened, so do not set this prior to opening the file.

setReadSection will set the code to read the entire specified chromosome. On the first readRecord call it will jump to the beginning of the chromosome. It will continue to read the chromosome for each readRecord call made until it has read the entire chromosome. Once the whole chromosome has been read, readRecord will return false until a new read section is specified.

set1BasedReadSection will set the code to read the specified chromosome starting at the specified 1-based position up to, but not including, the specified 1-based end position. On the first readRecord call it will jump the file to the beginning of this section. It will continue to read the section for each readRecord call made until it has read the entire section. Once the entire section has been read, readRecord will return false until a new read section is specified.

Implementation NOTE: internally it may read extra records prior to the section, but readRecord will keep reading and will not return until it finds a record in the section or until the section has been passed (no records in the section).

    // Open the vcf file for reading.
    VcfFileReader reader;
    VcfHeader header;
    VcfRecord record
    reader.open("vcfFileName.vcf", header);

    // Index File name is "vcfFileName.vcf.tbi" or "vcfFileName.tbi"
    reader.readVcfIndex();

    reader.set1BasedReadSection("1", 16384, 32767);
    while(reader.readRecord(record))
    {
        // Process a record from this section.
    }

    // Set another section, chromosome "X"
    reader.setReadSection("X");
    while(reader.readRecord(record))
    {
        // Process record on chromosome X
    }

Alternatively, the index file name can be specified.

    // Index File name is specified
    reader.readVcfIndex("myIndex.tbi");

Check if EOF was reached

To see if the End Of the File has been reached, use isEOF().

        reader.isEOF();

True is returned if the End of the File has been reached, false if not.

See How Many Records Were Read

To see the total number of records in the file that were read, including any that were discarded, use getNumRecords.

To see only the records in the file that were kept (not discarded) use getNumKeptRecords.

        std::cout << reader.getNumRecords() << " were found in the file\n."
                  << "but only " << reader.getNumKeptRecords() << " were processed\n.";

Closing the VCF File

close closes the opened VCF file.

    reader.close();


VcfRecord

VCF records are read from VCF files using VcfFileReader::readRecord.

Once you have the record, use methods from the VcfRecord to extract the desired information.

Extracting Data from the Record

Do not delete any of the returned pointers. They are references to strings within the record and will not be valid when the record changes or is deleted.

  • Position Information
    • getChromStr() returns const char* to this record's chromosome string
    • get1BasedPosition() returns an integer containing the 1-based position of the record
  • Identifiers
    • getIDStr() returns const char* to this record's ID. It includes the entire list of unique identifiers delimited by ';'
  • Information on Bases
    • getRefStr() returns const char* to the entire reference string
    • getAltStr() returns const char* to the entire alternate string (alternates seperated by ',')
    • getNumAlts() returns an integer containing the number of alternates in the alt string.
  • Quality Information
    • getQual() returns a float containing the phred quality score
    • getQualStr() returns a string containing the phred quality score
  • Filter Information - see VcfRecordFilter for more information on extracting information
    • getFilter() returns a reference to the Info field information, a VcfRecordInfo object reference
    • passedAllFilters() returns true if the FILTER field indicates PASS, false if not.
  • Aditional Information
    • getInfo() returns a reference to the filter information, a VcfRecordFilter object reference
  • Genotype Information
    • getGenotypeInfo() returns a reference to the information from the genotype fields, a VcfRecordGenotype object reference
    • allPhased() returns true if all the samples are phased and none unphased and false if any are not phased
    • allUnphased() returns true if all the samples are unphased and none phased and false if any are not unphased
    • hasAllGenotypeAlleles() returns true if all the samples have all the genotype alleles specified and false if any are missing or the GT field is missing.

VcfRecordFilter

VcfRecords contain the data from the INFO field in a VcfRecordFilter object.