Changes

From Genome Analysis Wiki
Jump to: navigation, search

LibStatGen: VCF

3,715 bytes added, 15:51, 18 April 2012
no edit summary
// Open the vcf file for reading.
VcfFileReader reader;
VcfFileHeader VcfHeader header;
reader.open("vcfFileName.vcf", header);
</source>
// Open the vcf file for reading.
VcfFileReader reader;
VcfFileHeader VcfHeader header;
// Subset 1 is delimited by new lines, '\n'.
reader.open("vcfFileName.vcf", header, "subsetFile1.txt");
// Open the vcf file for reading.
VcfFileReader reader;
VcfFileHeader VcfHeader header;
// Subset 2 is delimited by ';'
reader.open("vcfFileName.vcf", header, "subsetFile2.txt", ';');
</source>
and discards reads that do not have <code>PASS</code> in the <code>FILTER</code> field and reads that have a genotype that is not phased or have no <code>GT</code> in the <code>FORMAT</code> fields.
 
==== Read only Certain Sections of the File / Using a VCF Index (TABIX) File ====
The VCF library has the capability of using a tabix file to allow random access in the VCF file.
 
To use the INDEX File to read sections of the bam file:
# Open the VCF file using: <code>VcfFileReader::open</code>
# Open the VCF Index file using: <code>VcfFileReader::readVcfIndex</code>
# Set the Section to be read: <code>VcfFileReader::setReadSection</code> or <code>VcfFileReader::set1BasedReadSection</code>
# Read records from the section: <code>VcfFileReader::readRecord</code>
 
<code>VcfFileReader::readVcfIndex</code> returns true if the index file was successfully read and false if not.
<source lang="cpp">
bool readVcfIndex(const char * filename);
bool readVcfIndex();
</source>
If you specify a filename, it looks for the index file with that path/name. If you do not specify the index file name, it will attempt to find it. First it searches for a file with your VCF FileName + a ".tbi" extension. If that isn't found, it removes the ".vcf" if found in your VCF FileName and looks for a file with that name and a ".tbi" extension.
 
After opening both the VCF File and the Index file, set the section that should be read next using <code>setReadSection</code> or <code>set1BasedReadSection</code>.
<source lang="cpp">
bool setReadSection(const char* chromName);
bool set1BasedReadSection(const char* chromName,
int32_t start, int32_t end);
</source>
Both methods currently always return true, since in the current implementation this can't fail. It currently allows the VcfIndex file to be read after this call.
 
The read section is reset when a new file is opened, so do not set this prior to opening the file.
 
<code>setReadSection</code> will set the code to read the entire specified chromosome. On the first <code>readRecord</code> call it will jump to the beginning of the chromosome. It will continue to read the chromosome for each <code>readRecord</code> call made until it has read the entire chromosome. Once the whole chromosome has been read, <code>readRecord</code> will return false until a new read section is specified.
 
<code>set1BasedReadSection</code> will set the code to read the specified chromosome starting at the specified 1-based position up to, but not including, the specified 1-based end position. On the first <code>readRecord</code> call it will jump the file to the beginning of this section. It will continue to read the section for each <code>readRecord</code> call made until it has read the entire section. Once the entire section has been read, <code>readRecord</code> will return false until a new read section is specified.
 
Implementation NOTE: internally it may read extra records prior to the section, but <code>readRecord</code> will keep reading and will not return until it finds a record in the section or until the section has been passed (no records in the section).
 
<source lang="cpp">
// Open the vcf file for reading.
VcfFileReader reader;
VcfHeader header;
VcfRecord record
reader.open("vcfFileName.vcf", header);
 
// Index File name is "vcfFileName.vcf.tbi" or "vcfFileName.tbi"
reader.readVcfIndex();
 
reader.set1BasedReadSection("1", 16384, 32767);
while(reader.readRecord(record))
{
// Process a record from this section.
}
 
// Set another section, chromosome "X"
reader.setReadSection("X");
while(reader.readRecord(record))
{
// Process record on chromosome X
}
</source>
 
Alternatively, the index file name can be specified.
<source lang="cpp">
// Index File name is specified
reader.readVcfIndex("myIndex.tbi");
</source>
==== Check if EOF was reached ====
True is returned if the End of the File has been reached, false if not.
==== See How Many Files Records Were Read ====
To see the total number of records in the file that were read, including any that were discarded, use <code>getNumRecords</code>.
<< "but only " << reader.getNumKeptRecords() << " were processed\n.";
</source>
 
==== Closing the VCF File ====
reader.close();
</source>
 
=== <code>VcfRecord</code> ===

Navigation menu