An example of using libcsg

From Genome Analysis Wiki
Latest revision as of 00:39, 15 January 2010

Here is an example of using Goncalo's library (I assume he is fine with my doing so).


The purpose of this program is (1) to extract a range of text from each line of an input file and count how often each extracted pattern occurs, and (2) to report which lines have the same text in that range as the first line.

The code opens a file (specified by the -h parameter), takes the third field of each line, and extracts a range of text from it (the starting position is given by -f and the ending position by -t).

#include "StringArray.h"
#include "StringHash.h"
#include "Parameters.h"
#include "Error.h"

int main(int argc, char ** argv)
   {
   String filename;
   int    firstMarker = 0, lastMarker = 10;

   ParameterList pl;

   pl.Add(new StringParameter('h', "Haplotype File", filename));
   pl.Add(new IntParameter('f', "From Position", firstMarker));
   pl.Add(new IntParameter('t', "To Position", lastMarker));
   pl.Read(argc, argv);
   pl.Status();

   IFILE input = ifopen(filename, "rt");

   if (input == NULL)
      error("Failed to open file %s", (const char *) filename);

   String buffer;
   StringArray tokens;
   StringArray haplos, names;
   StringIntHash  haploCounts;

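   while (!ifeof(input))
       {
       // Read one line and split it into whitespace-separated tokens
       buffer.ReadLine(input);
       tokens.ReplaceTokens(buffer);

       // Skip empty lines; warn about lines that do not have 3 fields
       if (buffer.Length() == 0) continue;
       if (tokens.Length() != 3)
           {
           printf("Expect 3 words per line but the line beginning with \"%.10s\" looks different ...\n", (const char *) buffer);
           continue;
           }

       // Keep the selected range of the third field and the name in the first field
       haplos.Push(tokens[2].Mid(firstMarker, lastMarker));
       names.Push(tokens[0]);

       // Count how often each extracted pattern occurs
       haploCounts.IncrementCount(tokens[2].Mid(firstMarker, lastMarker));

       }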

   printf("Haplotype Counts\n");
   for (int i = 0; i < haploCounts.Capacity(); i++)
       if (haploCounts.SlotInUse(i))
          printf("%s %d\n", (const char *) haploCounts[i], haploCounts.Integer(i));

   printf("Haplotypes that match the first one\n");
   for (int i = 1; i < haplos.Length(); i++)
       if (haplos[i] == haplos[0])
          printf("%s (%d)\n", (const char *) names[i], i + 1);
   printf("\n");

   ifclose(input);
   }
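
For readers without libcsg at hand, the same logic can be sketched with the C++ standard library alone. This is only an illustration, not the library's implementation: midInclusive(), HaploSummary and summarize() are hypothetical names, and the inclusive 0-based range is an assumption based on how Mid(firstMarker, lastMarker) behaves in the example output further down.

```cpp
#include <cstddef>
#include <istream>
#include <map>
#include <sstream>
#include <string>
#include <vector>

// Hypothetical helper (not part of libcsg): inclusive, 0-based substring,
// assumed to mirror what Mid(first, last) does in the example.
std::string midInclusive(const std::string& s, int first, int last) {
    if (first < 0) first = 0;
    if (first >= static_cast<int>(s.size()) || last < first) return "";
    return s.substr(first, last - first + 1);  // substr clamps at end of string
}

struct HaploSummary {
    std::map<std::string, int> counts;  // pattern -> frequency
    std::vector<int> matchesFirst;      // 1-based record numbers matching record 1
};

HaploSummary summarize(std::istream& in, int first, int last) {
    HaploSummary out;
    std::vector<std::string> haplos;
    std::string line;
    while (std::getline(in, line)) {
        // Split the line into whitespace-separated tokens
        std::istringstream fields(line);
        std::vector<std::string> tokens;
        for (std::string tok; fields >> tok; ) tokens.push_back(tok);
        if (tokens.size() != 3) continue;  // skip blank or malformed lines

        std::string pattern = midInclusive(tokens[2], first, last);
        haplos.push_back(pattern);
        ++out.counts[pattern];
    }
    // Compare every later record with the first one
    for (std::size_t i = 1; i < haplos.size(); ++i)
        if (haplos[i] == haplos[0])
            out.matchesFirst.push_back(static_cast<int>(i) + 1);
    return out;
}
```

Calling summarize() on an input stream with first = 1 and last = 3 fills counts with one entry per distinct three-character pattern. Unlike the hash used above, std::map iterates in sorted key order.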


An example input, say INPUT.txt, looks like:

WTCCC66061->WTCCC66061 HAPLO1 AGACTCTGATAGCGATAACC
WTCCC66061->WTCCC66061 HAPLO2 GGGTTCCGATGGCGATAACC
WTCCC66062->WTCCC66062 HAPLO1 AGACTCTGATGGCGCTAACC
WTCCC66062->WTCCC66062 HAPLO2 AGACTCTGATAGCGATGATC
WTCCC66063->WTCCC66063 HAPLO1 AGACTCTTATGGCGCTAGCC
WTCCC66063->WTCCC66063 HAPLO2 AGACTCTTATAGCGATAACC
WTCCC66064->WTCCC66064 HAPLO1 AGACTCTGATGGCGATAGCC
WTCCC66064->WTCCC66064 HAPLO2 AGACTCTGATGACGCTAGCC
WTCCC66065->WTCCC66065 HAPLO1 AGACTCTGATGGCGATAACC
WTCCC66065->WTCCC66065 HAPLO2 AGACTCTGATGGCGATAGCC


I compile this source code into an executable with:

g++ -o Main Main.cpp libcsg/libcsg.a -I./libcsg -lz

where "libcsg" is the directory holding the source code checked out from the repository; it contains files such as "StringArray.h". The -I./libcsg option puts those headers on the include path, and -lz links zlib.

Large projects use a Makefile instead; an example can be found in the "csg/karma" directory.


Suppose we run "./Main -h INPUT.txt -f 1 -t 3". This reads INPUT.txt, extracts positions 1 to 3 of the third field (positions are 0-indexed, so this is the second through fourth characters), counts how often each pattern occurs, and reports which lines have the same text in that range as the first line. The output looks like:

The following parameters are in effect:
Haplotype File : INPUT.txt (-hname)
From Position : 1 (-f9999)
To Position : 3 (-t9999)

Haplotype Counts
GGT 1
GAC 9
Haplotypes that match the first one
WTCCC66062->WTCCC66062 (3)
WTCCC66062->WTCCC66062 (4)
WTCCC66063->WTCCC66063 (5)
WTCCC66063->WTCCC66063 (6)
WTCCC66064->WTCCC66064 (7)
WTCCC66064->WTCCC66064 (8)
WTCCC66065->WTCCC66065 (9)
WTCCC66065->WTCCC66065 (10)
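
The 0-indexed range is easy to misread, so here is a small sketch of the arithmetic. describeRange() is a hypothetical helper written for this page, not a libcsg function; it restates an inclusive 0-based range in 1-based terms.

```cpp
#include <string>

// Hypothetical helper for illustration; not part of libcsg.
// Describes a 0-based inclusive range [from, to] using 1-based positions.
std::string describeRange(int from, int to) {
    int length = to - from + 1;  // inclusive on both ends
    return "characters " + std::to_string(from + 1) + " to " +
           std::to_string(to + 1) + " (" + std::to_string(length) + " total)";
}
```

With -f 1 -t 3, describeRange(1, 3) gives "characters 2 to 4 (3 total)", which is why the first line AGACTCTGATAGCGATAACC contributes the pattern GAC.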

Some Notes

  • To read a file, use the IFILE class, which is a wrapper for reading and writing files. A particularly useful feature is that it handles gzipped files transparently. Important functions are ifopen() and ifclose().
  • To handle strings, we prefer the String class, since string handling comes up constantly in this area. The String class encapsulates basic operations such as indexing ([]), appending (+), equality (==), and extraction (Left, Right, Mid). It works seamlessly with the IFILE class for file access; see the while loop in the example code and notice the function ReadLine().
  • To tokenize a String, use the StringArray class. Its ReplaceTokens() function stores each token as an array element.
  • To associate a String with an integer, use the StringIntHash class. Important functions are IncrementCount(), Capacity(), and SlotInUse().
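
The Capacity()/SlotInUse() pair suggests that StringIntHash is a slot-based hash table, so you walk every slot and skip the empty ones. The sketch below shows that idea with a tiny open-addressing table. It is an assumption-laden illustration, not libcsg's code: TinyStringIntHash and its Key() accessor are hypothetical, and only the method names IncrementCount(), Capacity(), SlotInUse() and Integer() echo the calls used in the example.

```cpp
#include <cstddef>
#include <functional>
#include <string>
#include <vector>

// Illustrative only: a tiny open-addressing string -> int table.
// This is NOT libcsg's StringIntHash; the names merely echo its interface.
class TinyStringIntHash {
public:
    explicit TinyStringIntHash(std::size_t capacity = 16)
        : keys_(capacity < 2 ? 2 : capacity),
          values_(keys_.size(), 0),
          used_(keys_.size(), false),
          count_(0) {}

    void IncrementCount(const std::string& key) {
        if ((count_ + 1) * 2 > keys_.size()) Grow();  // keep the table at most half full
        std::size_t slot = FindSlot(key);
        if (!used_[slot]) { used_[slot] = true; keys_[slot] = key; ++count_; }
        ++values_[slot];
    }

    std::size_t Capacity() const { return keys_.size(); }
    bool SlotInUse(std::size_t i) const { return used_[i]; }
    const std::string& Key(std::size_t i) const { return keys_[i]; }
    int Integer(std::size_t i) const { return values_[i]; }

private:
    // Linear probing: returns the slot holding key, or the first free slot
    std::size_t FindSlot(const std::string& key) const {
        std::size_t slot = std::hash<std::string>()(key) % keys_.size();
        while (used_[slot] && keys_[slot] != key) slot = (slot + 1) % keys_.size();
        return slot;
    }

    // Double the number of slots and re-insert every stored key
    void Grow() {
        TinyStringIntHash bigger(keys_.size() * 2);
        for (std::size_t i = 0; i < keys_.size(); ++i)
            if (used_[i]) {
                std::size_t slot = bigger.FindSlot(keys_[i]);
                bigger.used_[slot] = true;
                bigger.keys_[slot] = keys_[i];
                bigger.values_[slot] = values_[i];
                ++bigger.count_;
            }
        *this = bigger;
    }

    std::vector<std::string> keys_;
    std::vector<int> values_;
    std::vector<bool> used_;
    std::size_t count_;
};
```

In a table like this, a loop from 0 to Capacity() visits keys in slot order rather than insertion order, which would explain why the counts in the example output are not listed in the order the patterns first appear.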