From Genome Analysis Wiki
4,157 bytes added, 18:57, 14 January 2010
Here is an example that uses Goncalo's library (I assume he has agreed to my doing so).<br>
The purpose of this program is (1) to extract a range of text from each line of an input file and count how often each extracted string occurs, and (2) to report which lines have the same extracted text as the first line.<br>
The code opens a file (specified by the -h parameter), takes the third field of each line, and extracts a range of text from it (the range is given by a starting position, -f, and an ending position, -t).<br>
<source lang="c">#include "StringArray.h"
#include "StringHash.h"
#include "Parameters.h"
#include "Error.h"

int main(int argc, char ** argv)
{
    String filename;
    int firstMarker = 0, lastMarker = 10;

    // Command-line options: -h <file>, -f <from position>, -t <to position>
    ParameterList pl;
    pl.Add(new StringParameter('h', "Haplotype File", filename));
    pl.Add(new IntParameter('f', "From Position", firstMarker));
    pl.Add(new IntParameter('t', "To Position", lastMarker));
    pl.Read(argc, argv);
    pl.Status();

    // IFILE reads plain and gzipped files alike
    IFILE input = ifopen(filename, "rt");
    if (input == NULL)
        error("Failed to open file %s", (const char *) filename);

    String buffer;
    StringArray tokens;
    StringArray haplos, names;
    StringIntHash haploCounts;

    while (!ifeof(input))
    {
        // Read one line and split it into whitespace-separated tokens
        buffer.ReadLine(input);
        tokens.ReplaceTokens(buffer);

        // Report and skip non-empty lines that do not have exactly 3 fields.
        // An empty line has Length() == 0, so it is not skipped by this check.
        if (buffer.Length() > 0 && tokens.Length() != 3)
        {
            printf("Expect 3 words per line but the line beginning with \"%.10s\" looks different ...\n", (const char *) buffer);
            continue;
        }

        // Extract the requested range from the third field (the haplotype)
        String haplo = tokens[2].Mid(firstMarker, lastMarker);

        haplos.Push(haplo);
        names.Push(tokens[0]);
        haploCounts.IncrementCount(haplo);
    }

    // Walk every slot of the hash table and print the ones that hold an entry
    printf("Haplotype Counts\n");
    for (int i = 0; i < haploCounts.Capacity(); i++)
        if (haploCounts.SlotInUse(i))
            printf("%s %d\n", (const char *) haploCounts[i], haploCounts.Integer(i));

    // Compare every later line with the first one; line numbers are 1-based
    printf("Haplotypes that match the first one\n");
    for (int i = 1; i < haplos.Length(); i++)
        if (haplos[i] == haplos[0])
            printf("%s (%d)\n", (const char *) names[i], i + 1);
    printf("\n");

    ifclose(input);
    return 0;
}</source>
An example input file, say INPUT.txt, looks like this:<br>
WTCCC66061->WTCCC66061 HAPLO1 AGACTCTGATAGCGATAACC<br>WTCCC66061->WTCCC66061 HAPLO2 GGGTTCCGATGGCGATAACC<br>WTCCC66062->WTCCC66062 HAPLO1 AGACTCTGATGGCGCTAACC<br>WTCCC66062->WTCCC66062 HAPLO2 AGACTCTGATAGCGATGATC<br>WTCCC66063->WTCCC66063 HAPLO1 AGACTCTTATGGCGCTAGCC<br>WTCCC66063->WTCCC66063 HAPLO2 AGACTCTTATAGCGATAACC<br>WTCCC66064->WTCCC66064 HAPLO1 AGACTCTGATGGCGATAGCC<br>WTCCC66064->WTCCC66064 HAPLO2 AGACTCTGATGACGCTAGCC<br>WTCCC66065->WTCCC66065 HAPLO1 AGACTCTGATGGCGATAACC<br>WTCCC66065->WTCCC66065 HAPLO2 AGACTCTGATGGCGATAGCC<br>
If we run "extractHaplo -h INPUT.txt -f 1 -t 3", the output looks like this:<br>
The following parameters are in effect:<br> Haplotype File : INPUT.txt (-hname)<br> From Position : 1 (-f9999)<br> To Position : 3 (-t9999)<br>
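Haplotype Counts<br>GGT 1<br>GAC 9<br>60 1<br>Haplotypes that match the first one<br>WTCCC66062->WTCCC66062 (3)<br>WTCCC66062->WTCCC66062 (4)<br>WTCCC66063->WTCCC66063 (5)<br>WTCCC66063->WTCCC66063 (6)<br>WTCCC66064->WTCCC66064 (7)<br>WTCCC66064->WTCCC66064 (8)<br>WTCCC66065->WTCCC66065 (9)<br>WTCCC66065->WTCCC66065 (10)<br><br>
Note that the counts add up to 11 although the input has 10 lines. The stray "60 1" entry presumably comes from an empty line at the end of the file: the length check in the loop only skips non-empty lines with the wrong number of fields, so an empty line is still processed.<br>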
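The sample run suggests how the -f and -t positions are interpreted: with -f 1 -t 3, the haplotype AGACTCTGATAGCGATAACC yields GAC, so both positions appear to be 0-based and the end position is included. The following is a minimal sketch of that behaviour in standard C++, for readers who want to check a range without the library. The name <code>midInclusive</code> is hypothetical, and the clamping of out-of-range positions is an assumption of this sketch, not something the library is known to do.

```cpp
#include <string>

// Hypothetical helper (not part of the library): returns the characters of
// `text` at 0-based positions first..last, with both ends included.
// Positions outside the string are clamped; that is an assumption of this
// sketch, not documented library behaviour.
std::string midInclusive(const std::string& text, int first, int last)
{
    const int length = static_cast<int>(text.size());
    if (first < 0) first = 0;
    if (last >= length) last = length - 1;
    if (first > last) return std::string();   // empty or inverted range
    return text.substr(first, last - first + 1);
}
```

For example, <code>midInclusive("GGGTTCCGATGGCGATAACC", 1, 3)</code> gives GGT, which matches the second line of the sample input.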
*To read a file, use IFILE, which is a wrapper for reading and writing files. A particularly useful feature is that it handles gzipped files transparently. The important functions are ifopen() and ifclose().<br>
*To handle strings, we prefer the String class. String manipulation comes up in many forms in this field, and the String class encapsulates the basic operations: indexing ([]), appending (+), equality comparison (==) and extraction (Left, Right, Mid). The String class works seamlessly with IFILE for file access: see the while loop in the example code and note the ReadLine() function.<br>
*To tokenize a String, use the StringArray class. Its ReplaceTokens() function splits the string and stores each token as an element of the array.<br>
*To associate a String with an integer, use the StringIntHash class. Its important functions are IncrementCount(), Capacity() and SlotInUse().
<br>
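For comparison, the tokenize, count and compare pattern described in the list above can be sketched with the C++ standard library alone. This is not the library's implementation: <code>std::istringstream</code> stands in for StringArray, <code>std::map</code> stands in for StringIntHash, and the names <code>HaploSummary</code> and <code>summarize</code> are hypothetical. Unlike the example program, this sketch skips empty lines and numbers only the valid records.

```cpp
#include <cstddef>
#include <istream>
#include <map>
#include <sstream>
#include <string>
#include <vector>

// Hypothetical result type for this sketch.
struct HaploSummary {
    std::map<std::string, int> counts;   // extracted fragment -> occurrences
    std::vector<int> matchesFirst;       // 1-based record numbers equal to record 1
};

// Reads "name label haplotype" records from `in`, extracts the 0-based,
// inclusive range first..last from the third field, counts each fragment,
// and lists the records whose fragment equals that of the first record.
HaploSummary summarize(std::istream& in, int first, int last)
{
    HaploSummary result;
    std::vector<std::string> fragments;
    std::string line;

    while (std::getline(in, line)) {
        // Split the line on whitespace
        std::istringstream fields(line);
        std::vector<std::string> tokens;
        std::string token;
        while (fields >> token)
            tokens.push_back(token);

        // Skip empty lines and lines without exactly three fields
        if (tokens.size() != 3)
            continue;

        const std::string& haplotype = tokens[2];
        std::string fragment;
        if (first >= 0 && first <= last &&
            first < static_cast<int>(haplotype.size()))
            fragment = haplotype.substr(first, last - first + 1);

        fragments.push_back(fragment);
        ++result.counts[fragment];       // a missing key starts at 0
    }

    for (std::size_t i = 1; i < fragments.size(); ++i)
        if (fragments[i] == fragments[0])
            result.matchesFirst.push_back(static_cast<int>(i) + 1);

    return result;
}
```

A caller can wrap a file in <code>std::ifstream</code> and pass it to <code>summarize</code>. Because <code>std::map</code> keeps its keys sorted, iterating over <code>counts</code> prints the fragments in alphabetical order, whereas the hash table in the example program prints them in slot order.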