Changes

From Genome Analysis Wiki
Jump to navigationJump to search
12,631 bytes added ,  16:09, 23 January 2013
no edit summary
Line 19: Line 19:  
<code>VcfFileReader</code> is declared in <code>VcfFileReader.h</code>, so be sure to include that file.
 
<code>VcfFileReader</code> is declared in <code>VcfFileReader.h</code>, so be sure to include that file.
 
<source lang="cpp">
 
<source lang="cpp">
#include "AspFileReader.h"
+
#include "VcfFileReader.h"
 
</source>
 
</source>
   Line 33: Line 33:  
     // Open the vcf file for reading.
 
     // Open the vcf file for reading.
 
     VcfFileReader reader;
 
     VcfFileReader reader;
     VcfFileHeader header;
+
     VcfHeader header;
 
     reader.open("vcfFileName.vcf", header);
 
     reader.open("vcfFileName.vcf", header);
 
</source>
 
</source>
   −
==== Subsetting Samples ====
+
==== Reading a Subset of Samples ====
 
To select only a subset of samples to keep, when opening the file also specify the name of the file containing the names of the samples to keep and the delimiter separating the sample names (default is a new line, '\n').
 
To select only a subset of samples to keep, when opening the file also specify the name of the file containing the names of the samples to keep and the delimiter separating the sample names (default is a new line, '\n').
 
<source lang="cpp">
 
<source lang="cpp">
 
     // Open the vcf file for reading.
 
     // Open the vcf file for reading.
 
     VcfFileReader reader;
 
     VcfFileReader reader;
     VcfFileHeader header;
+
     VcfHeader header;
 
     // Subset 1 is delimited by new lines, '\n'.
 
     // Subset 1 is delimited by new lines, '\n'.
     reader.open("vcfFileName.vcf", header, "subsetFile1.txt");
+
     reader.open("vcfFileName.vcf", header, "subsetFile1.txt", NULL, NULL);
 
</source>
 
</source>
 
<source lang="cpp">
 
<source lang="cpp">
 
     // Open the vcf file for reading.
 
     // Open the vcf file for reading.
 
     VcfFileReader reader;
 
     VcfFileReader reader;
     VcfFileHeader header;
+
     VcfHeader header;
 
     // Subset 2 is delimited by ';'
 
     // Subset 2 is delimited by ';'
     reader.open("vcfFileName.vcf", header, "subsetFile2.txt", ';');
+
     reader.open("vcfFileName.vcf", header, "subsetFile2.txt", NULL, NULL, ';');
 +
</source>
 +
 
 +
To select a subset of samples to remove/ignore, when opening the file also specify the name of the file containing the names of the samples to remove/ignore and the delimiter separating the sample names (default is a new line, '\n').
 +
<source lang="cpp">
 +
    // Open the vcf file for reading.
 +
    VcfFileReader reader;
 +
    VcfHeader header;
 +
    // Subset 1 is delimited by new lines, '\n'.
 +
    reader.open("vcfFileName.vcf", header, NULL, NULL, "subsetFile1.txt");
 +
</source>
 +
<source lang="cpp">
 +
    // Open the vcf file for reading.
 +
    VcfFileReader reader;
 +
    VcfHeader header;
 +
    // Subset 2 is delimited by ';'
 +
    reader.open("vcfFileName.vcf", header, NULL, NULL, "subsetFile2.txt", ';');
 +
</source>
 +
 
 +
If you just have 1 sample to be excluded, you can specify it directly in the open line.
 +
<source lang="cpp">
 +
    // Open the vcf file for reading.
 +
    VcfFileReader reader;
 +
    VcfHeader header;
 +
    // Subset 1 is delimited by new lines, '\n'.
 +
    reader.open("vcfFileName.vcf", header, NULL, "BAD_SAMPLE", NULL);
 
</source>
 
</source>
   Line 76: Line 101:     
==== Specifying Discard Rules ====
 
==== Specifying Discard Rules ====
 +
===== Basic Rules =====
 
When specifying the discard rules, you should use the constants found at the top of VcfFileReader.h
 
When specifying the discard rules, you should use the constants found at the top of VcfFileReader.h
   Line 97: Line 123:  
</source>
 
</source>
 
and discards reads that do not have <code>PASS</code> in the <code>FILTER</code> field and reads that have a genotype that is not phased or have no <code>GT</code> in the <code>FORMAT</code> fields.
 
and discards reads that do not have <code>PASS</code> in the <code>FILTER</code> field and reads that have a genotype that is not phased or have no <code>GT</code> in the <code>FORMAT</code> fields.
 +
 +
===== Minimum Alternate Allele Count =====
 +
To Discard any records without a minimum number of alternate alleles, use:
 +
<source lang="cpp">
 +
VcfFileReader::addDiscardMinAltAlleleCount(int32_t minAltAlleleCount, VcfSubsetSamples* subset)
 +
</source>
 +
 +
The <code>minAltAlleleCount</code> parameter is the minimum number of alternate alleles found in the specified subset (if specified) in order for the record to be kept.
 +
 +
The <code>VcfSubsetSamples* subset</code> parameter is a pointer to the subset of samples that you want to include when counting the number of alternate alleles.  If all samples that are read/kept are to be included, NULL should be passed in. 
 +
 +
See [[#Handling a Subset of Samples|Handling a Subset of Samples]] for how to use <code>VcfSubsetSamples</code>.
 +
 +
Use the following method to remove the DiscardMinAltAlleleCount rule:
 +
<source lang="cpp">
 +
VcfFileReader::rmDiscardMinAltAlleleCount()
 +
</source>
 +
 +
Example:  Minimum Alternate Allele Count = 4
 +
Sample1  Sample2  Sample3  Keep/Discard
 +
  0|0      1|1      2|2    Keep
 +
  0|0      0|1      2|2    Discard, only 3 Alternates (1 Allele1 & 2 Allele 2)
 +
  0|0      1|1      1|2    Keep
 +
  0|2      1|1      2|2    Keep
 +
  2|1      0|1      2|0    Keep
 +
 +
Example:  Minimum Alternate Allele Count = 3 & Exclude Sample2 (without the exclusion, all would be kept)
 +
Sample1  Sample2  Sample3  Keep/Discard
 +
  0|0      1|1      2|2    Discard, only 2 Alternates (0 Allele1 & 2 Allele 2)
 +
  0|0      0|1      2|2    Discard, only 2 Alternates (0 Allele1 & 2 Allele 2)
 +
  0|0      1|1      1|2    Discard, only 2 Alternates (1 Allele1 & 1 Allele 2)
 +
  0|2      1|1      2|2    Keep
 +
  2|1      0|1      2|0    Keep
 +
 +
 +
===== Minimum Minor Allele Count =====
 +
To Discard any records without a minimum number of minor alleles, use:
 +
<source lang="cpp">
 +
VcfFileReader::addDiscardMinMinorAlleleCount(int32_t minMinorAlleleCount, VcfSubsetSamples* subset)
 +
</source>
 +
 +
The <code>minMinorAlleleCount</code> parameter is the minimum number of minor alleles found in the specified subset (if specified) in order for the record to be kept.
 +
 +
The <code>VcfSubsetSamples* subset</code> parameter is a pointer to the subset of samples that you want to include when counting the number of alleles.  If all samples that are read/kept are to be included, NULL should be passed in. 
 +
 +
See [[#Handling a Subset of Samples|Handling a Subset of Samples]] for how to use <code>VcfSubsetSamples</code>.
 +
 +
Use the following method to remove the DiscardMinMinorAlleleCount rule:
 +
<source lang="cpp">
 +
VcfFileReader::rmDiscardMinMinorAlleleCount()
 +
</source>
 +
 +
Example:  Minimum Minor Allele Count = 2
 +
Sample1  Sample2  Sample3  Keep/Discard
 +
  0|0      1|1      2|2    Keep
 +
  0|0      0|1      2|2    Discard, only 1 Allele1
 +
  0|0      1|1      1|2    Discard, only 1 Allele2
 +
  0|2      1|1      2|2    Discard, only 1 Allele0
 +
  2|1      0|1      2|0    Keep
 +
 +
Example:  Minimum Minor Allele Count = 1 & Exclude Sample2 (without the exclusion, all would be kept)
 +
Sample1  Sample2  Sample3  Keep/Discard
 +
  0|0      1|1      2|2    Discard, 0 Allele1
 +
  0|0      0|1      2|2    Discard, 0 Allele1
 +
  0|0      1|1      1|2    Keep
 +
  0|2      1|1      2|2    Discard, 0 Allele1
 +
  2|1      0|1      2|0    Keep
 +
 +
==== Read only Certain Sections of the File / Using a VCF Index (TABIX) File ====
 +
The VCF library has the capability of using a tabix file to allow random access in the VCF file.
 +
 +
To use the INDEX File to read sections of the bam file:
 +
# Open the VCF file using: <code>VcfFileReader::open</code>
 +
# Open the VCF Index file using: <code>VcfFileReader::readVcfIndex</code>
 +
# Set the Section to be read: <code>VcfFileReader::setReadSection</code> or <code>VcfFileReader::set1BasedReadSection</code>
 +
# Read records from the section: <code>VcfFileReader::readRecord</code>
 +
 +
<code>VcfFileReader::readVcfIndex</code> returns true if the index file was successfully read and false if not.
 +
<source lang="cpp">
 +
    bool readVcfIndex(const char * filename);
 +
    bool readVcfIndex();
 +
</source>
 +
If you specify a filename, it looks for the index file with that path/name.  If you do not specify the index file name, it will attempt to find it.  First it searches for a file with your VCF FileName + a ".tbi" extension.  If that isn't found, it removes the ".vcf" if found in your VCF FileName and looks for a file with that name and a ".tbi" extension.
 +
 +
After opening both the VCF File and the Index file, set the section that should be read next using <code>setReadSection</code> or <code>set1BasedReadSection</code>.
 +
<source lang="cpp">
 +
    bool setReadSection(const char* chromName);
 +
    bool set1BasedReadSection(const char* chromName,
 +
                              int32_t start, int32_t end,
 +
                              bool overlap = false);
 +
</source>
 +
Both methods currently always return true, since in the current implementation this can't fail.  It currently allows the VcfIndex file to be read after this call.
 +
 +
The read section is reset when a new file is opened, so do not set this prior to opening the file.
 +
 +
<code>setReadSection</code> will set the code to read the entire specified chromosome.  On the first <code>readRecord</code> call it will jump to the beginning of the chromosome.  It will continue to read the chromosome for each <code>readRecord</code> call made until it has read the entire chromosome.  Once the whole chromosome has been read, <code>readRecord</code> will return false until a new read section is specified.
 +
 +
<code>set1BasedReadSection</code> will set the code to read the specified chromosome starting at the specified 1-based position up to, but not including, the specified 1-based end position.  On the first <code>readRecord</code> call it will jump the file to the beginning of this section.  It will continue to read the section for each <code>readRecord</code> call made until it has read the entire section.  Once the entire section has been read, <code>readRecord</code> will return false until a new read section is specified.
 +
 +
<code>set1BasedReadSection</code> takes an optional parameter, <code>overlap</code>.  It is defaulted to false which means that only records that start within the specified region will be returned by <code>readRecord</code>.  If <code>overlap</code> is set to true, <code>readRecord</code> will also return records that start prior to the specified region, but whose deletions occur in the specified region.
 +
 +
Implementation NOTE: internally it may read extra records prior to the section, but <code>readRecord</code> will keep reading and will not return until it finds a record in the section or until the section has been passed (no records in the section).
 +
 +
<source lang="cpp">
 +
    // Open the vcf file for reading.
 +
    VcfFileReader reader;
 +
    VcfHeader header;
 +
    VcfRecord record
 +
    reader.open("vcfFileName.vcf", header);
 +
 +
    // Index File name is "vcfFileName.vcf.tbi" or "vcfFileName.tbi"
 +
    reader.readVcfIndex();
 +
 +
    reader.set1BasedReadSection("1", 16384, 32767);
 +
    while(reader.readRecord(record))
 +
    {
 +
        // Process a record from this section.
 +
    }
 +
 +
    // Set another section, chromosome "X"
 +
    reader.setReadSection("X");
 +
    while(reader.readRecord(record))
 +
    {
 +
        // Process record on chromosome X
 +
    }
 +
</source>
 +
 +
Alternatively, the index file name can be specified.
 +
<source lang="cpp">
 +
    // Index File name is specified
 +
    reader.readVcfIndex("myIndex.tbi");
 +
</source>
    
==== Check if EOF was reached ====
 
==== Check if EOF was reached ====
Line 105: Line 263:  
True is returned if the End of the File has been reached, false if not.
 
True is returned if the End of the File has been reached, false if not.
   −
==== See How Many Files Were Read ====
+
==== See How Many Records Were Read ====
 
To see the total number of records in the file that were read, including any that were discarded, use <code>getNumRecords</code>.
 
To see the total number of records in the file that were read, including any that were discarded, use <code>getNumRecords</code>.
   Line 113: Line 271:  
                   << "but only " << reader.getNumKeptRecords() << " were processed\n.";
 
                   << "but only " << reader.getNumKeptRecords() << " were processed\n.";
 
</source>
 
</source>
      
==== Closing the VCF File ====
 
==== Closing the VCF File ====
Line 121: Line 278:  
     reader.close();
 
     reader.close();
 
</source>
 
</source>
 +
    
=== <code>VcfRecord</code> ===
 
=== <code>VcfRecord</code> ===
VCF records are read from VCF files using <code>readRecord</code>.
+
VCF records are read from VCF files using <code>VcfFileReader::readRecord</code>.
    
Once you have the record, use methods from the <code>VcfRecord</code> to extract the desired information.
 
Once you have the record, use methods from the <code>VcfRecord</code> to extract the desired information.
 +
 +
==== Extracting Data from the Record ====
 +
Do not delete any of the returned pointers.  They are references to strings within the record and will not be valid when the record changes or is deleted.
 +
 +
* Position Information
 +
** <code>getChromStr()</code> returns const char* to this record's chromosome string
 +
** <code>get1BasedPosition()</code> returns an integer containing the 1-based position of the record
 +
* Identifiers
 +
** <code>getIDStr()</code> returns const char* to this record's ID.  It includes the entire list of unique identifiers delimited by ';'
 +
* Information on Bases
 +
** <code>getRefStr()</code> returns const char* to the entire reference string
 +
** <code>getAltStr()</code> returns const char* to the entire alternate string (alternates seperated by ',')
 +
** <code>getNumAlts()</code> returns an integer containing the number of alternates in the alt string.
 +
* Quality Information
 +
** <code>getQual()</code> returns a float containing the phred quality score
 +
** <code>getQualStr()</code> returns a string containing the phred quality score
 +
* Filter Information - see [[#VcfRecordFilter|VcfRecordFilter]] for more information on extracting information
 +
** <code>getFilter()</code> returns a reference to the Info field information, a VcfRecordInfo object reference
 +
** <code>passedAllFilters()</code> returns true if the FILTER field indicates <code>PASS</code>, false if not.
 +
* Aditional Information
 +
** <code>getInfo()</code> returns a reference to the filter information, a VcfRecordFilter object reference
 +
* Genotype Information
 +
** <code>getGenotypeInfo()</code> returns a reference to the information from the genotype fields, a VcfRecordGenotype object reference
 +
** <code>allPhased()</code> returns true if all the samples are phased and none unphased and false if any are not phased
 +
** <code>allUnphased()</code> returns true if all the samples are unphased and none phased and false if any are not unphased
 +
** <code>hasAllGenotypeAlleles()</code> returns true if all the samples have all the genotype alleles specified and false if any are missing or the GT field is missing
 +
** <code>getNumSamples()</code> returns the number of samples (that are kept)
 +
** <code>getNumGTs(int index)</code> returns the number of GTs for the specified sample index (starts at 0)
 +
** <code>getGT(int sampleNum, unsigned int gtIndex)</code> returns the integer GT value for the specified sampleNum and GT index (both start at 0).  A GT of VcfGenotypeSample::INVALID_GT is returned if the sampleNum or GT index is out of range.  A GT of VcfGenotypeSample::MISSING_GT if the GT is '.'  (This method is also found in VcfRecordGenotype.
 +
 +
=== VcfRecordFilter ===
 +
<code>VcfRecords</code> contain the data from the <code>INFO</code> field in a <code>VcfRecordFilter</code> object.
 +
 +
 +
== Handling a Subset of Samples ==
 +
 +
When reading a file if you only want to process/keep a subset of samples, use [[#Reading a Subset of Samples|Reading a Subset of Samples]].  When that method is used, only the specified samples are stored.  Any further processing will only be on those samples.
 +
 +
Some methods allow the user to specify a subset of samples to operate on.  The subset specified when reading the VCF file, if any, is automatically applied since only those samples were stored.  If a different/additional subset needs to be applied for other processing, you can use the <code>VcfSubsetSamples</code> class.
 +
 +
To setup a VcfSubsetSamples object, pass the already set VCF header to:
 +
<source lang="cpp">
 +
void VcfSubsetSamples::init(const VcfHeader& header, bool include)
 +
</source>
 +
Set the <code>include</code> parameter to:
 +
* true if all samples should be included except any that are specified as excluded.
 +
* false if all samples should be excluded except any that are specified as included.
 +
 +
NOTE: the header is not modified to add/remove any samples.
 +
 +
To mark a specific sample as excluded use:
 +
<source lang="cpp">
 +
bool VcfSubsetSamples::addExcludeSample(const char* sampleName);
 +
</source>
 +
To mark a specific sample as included use:
 +
<source lang="cpp">
 +
bool VcfSubsetSamples::addIncludeSample(const char* sampleName);
 +
</source>
60

edits

Navigation menu