An example of using libcsg
Here I showed an example using Goncalo's library (I assume he agreed me to do so).
The purpose of this program is (1) to extract a range of text from a input file, and count there frequencies. (2) count which line have the same text as the first line
The code will open a file (specified by -h parameter), take the third field, obtain a range of text (range is specified by starting position using -f and ending position using -t).
#include "StringArray.h"
#include "StringHash.h"
#include "Parameters.h"
#include "Error.h"
int main(int argc, char ** argv)
{
String filename;
int firstMarker = 0, lastMarker = 10;
ParameterList pl;
pl.Add(new StringParameter('h', "Haplotype File", filename));
pl.Add(new IntParameter('f', "From Position", firstMarker));
pl.Add(new IntParameter('t', "To Position", lastMarker));
pl.Read(argc, argv);
pl.Status();
IFILE input = ifopen(filename, "rt");
if (input == NULL)
error("Failed to open file %s", (const char *) filename);
String buffer;
StringArray tokens;
StringArray haplos, names;
StringIntHash haploCounts;
while (!ifeof(input))
{
buffer.ReadLine(input);
tokens.ReplaceTokens(buffer);
if (buffer.Length() > 0 && tokens.Length() != 3)
{
printf("Expect 3 words per line but the line beginning with \"%.10s\" looks different ...", (const char *) buffer);
continue;
}
haplos.Push(tokens[2].Mid(firstMarker, lastMarker));
names.Push(tokens[0]);
haploCounts.IncrementCount(tokens[2].Mid(firstMarker, lastMarker));
}
printf("Haplotype Counts\n");
for (int i = 0; i < haploCounts.Capacity(); i++)
if (haploCounts.SlotInUse(i))
printf("%s %d\n", (const char *) haploCounts[i], haploCounts.Integer(i));
printf("Haplotypes that match the first one\n");
for (int i = 1; i < haplos.Length(); i++)
if (haplos[i] == haplos[0])
printf("%s (%d)\n", (const char *) names[i], i + 1);
printf("\n");
ifclose(input);
}
A example input, say INPUT.txt is like:
WTCCC66061->WTCCC66061 HAPLO1 AGACTCTGATAGCGATAACC
WTCCC66061->WTCCC66061 HAPLO2 GGGTTCCGATGGCGATAACC
WTCCC66062->WTCCC66062 HAPLO1 AGACTCTGATGGCGCTAACC
WTCCC66062->WTCCC66062 HAPLO2 AGACTCTGATAGCGATGATC
WTCCC66063->WTCCC66063 HAPLO1 AGACTCTTATGGCGCTAGCC
WTCCC66063->WTCCC66063 HAPLO2 AGACTCTTATAGCGATAACC
WTCCC66064->WTCCC66064 HAPLO1 AGACTCTGATGGCGATAGCC
WTCCC66064->WTCCC66064 HAPLO2 AGACTCTGATGACGCTAGCC
WTCCC66065->WTCCC66065 HAPLO1 AGACTCTGATGGCGATAACC
WTCCC66065->WTCCC66065 HAPLO2 AGACTCTGATGGCGATAGCC
And if we run "extractHaplo -h INPUT.txt -f 1 -t 3", and the ouput looks like
The following parameters are in effect:
Haplotype File : INPUT.txt (-hname)
From Position : 1 (-f9999)
To Position : 3 (-t9999)
Haplotype Counts
GGT 1
GAC 9
60 1
Haplotypes that match the first one
WTCCC66062->WTCCC66062 (3)
WTCCC66062->WTCCC66062 (4)
WTCCC66063->WTCCC66063 (5)
WTCCC66063->WTCCC66063 (6)
WTCCC66064->WTCCC66064 (7)
WTCCC66064->WTCCC66064 (8)
WTCCC66065->WTCCC66065 (9)
WTCCC66065->WTCCC66065 (10)
- To read a file, use IFILE class, which is wrapper for read/write file. A particular useful thing is that it handle gzipped file transparently. Important functions are: ifopen(), ifclose().
- To handle strings, we prefer to use String class. In this area, handling string is a versatile task. A String class encapsulate basic operations such as index[], append(+), equality(=), extract(Left, Right, Mid). String class can be seamlessly used with IFILE class to acces file. See the while loop in the example code and notice the function ReadLine().
- To tokenize a String class, we can use StringArray class. It has ReplaceToken() which will store each token field like an array.
- To associate a String class to a integer type, there is a class named StringIntHash, important functions are IncrementCount(), Capacity() and SlotInUse().