An example of using libcsg

From Genome Analysis Wiki

Revision as of 19:15, 14 January 2010

Here I show an example that uses Goncalo's library, libcsg (I assume he is fine with me doing so).

The purpose of this program is (1) to extract a range of text from each line of an input file and count how often each extracted pattern occurs, and (2) to report which lines have the same text in that range as the first line.

The code opens a file (specified by the -h parameter), takes the third field of each line, and extracts a range of text from it. The range is given by a starting position (-f) and an ending position (-t); both are 0-indexed, and the ending position is included in the range.

#include "StringArray.h"
#include "StringHash.h"
#include "Parameters.h"
#include "Error.h"

#include <stdio.h>

int main(int argc, char ** argv)
   {
   String filename;
   int    firstMarker = 0, lastMarker = 10;

   // Declare the command line options and read them from argv
   ParameterList pl;

   pl.Add(new StringParameter('h', "Haplotype File", filename));
   pl.Add(new IntParameter('f', "From Position", firstMarker));
   pl.Add(new IntParameter('t', "To Position", lastMarker));
   pl.Read(argc, argv);
   pl.Status();

   // IFILE reads plain and gzipped files the same way
   IFILE input = ifopen(filename, "rt");

   if (input == NULL)
      error("Failed to open file %s", (const char *) filename);

   String buffer;
   StringArray tokens;
   StringArray haplos, names;
   StringIntHash  haploCounts;

   while (!ifeof(input))
       {
       // Read one line and split it into fields
       buffer.ReadLine(input);
       tokens.ReplaceTokens(buffer);

       // Skip blank lines silently and malformed lines with a warning,
       // so that tokens[2] is only used when it exists
       if (tokens.Length() != 3)
           {
           if (buffer.Length() > 0)
              printf("Expect 3 words per line but the line beginning with \"%.10s\" looks different ...\n", (const char *) buffer);
           continue;
           }

       // Keep the requested range of the third field and the name in the first field
       haplos.Push(tokens[2].Mid(firstMarker, lastMarker));
       names.Push(tokens[0]);

       haploCounts.IncrementCount(tokens[2].Mid(firstMarker, lastMarker));

       }

   // Walk every slot of the hash and print the ones that hold an entry
   printf("Haplotype Counts\n");
   for (int i = 0; i < haploCounts.Capacity(); i++)
       if (haploCounts.SlotInUse(i))
          printf("%s %d\n", (const char *) haploCounts[i], haploCounts.Integer(i));

   // Compare every later line with the first; i + 1 is the 1-based line number
   printf("Haplotypes that match the first one\n");
   for (int i = 1; i < haplos.Length(); i++)
       if (haplos[i] == haplos[0])
          printf("%s (%d)\n", (const char *) names[i], i + 1);
   printf("\n");

   ifclose(input);
   }
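
The line check inside the while loop can also be tried out without libcsg. The following is only a rough standard-library sketch: splitFields() and classifyLine() are hypothetical helper names that do not exist in the library, and the sketch assumes that fields are separated by whitespace.

```cpp
#include <sstream>
#include <string>
#include <vector>

// Split a line into whitespace-separated fields.
// (Hypothetical stand-in for what the example does with a StringArray.)
std::vector<std::string> splitFields(const std::string& line) {
    std::istringstream in(line);
    std::vector<std::string> fields;
    std::string field;
    while (in >> field) fields.push_back(field);
    return fields;
}

enum class LineKind { Blank, Malformed, Ok };

// Decide what to do with a line before its third field is touched:
// blank lines and lines without exactly 3 fields should not be used.
LineKind classifyLine(const std::string& line) {
    std::vector<std::string> fields = splitFields(line);
    if (fields.empty()) return LineKind::Blank;
    if (fields.size() != 3) return LineKind::Malformed;
    return LineKind::Ok;
}
```

A caller would read the third field (index 2) only when classifyLine() returns LineKind::Ok, and would print a warning only for LineKind::Malformed.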

An example input file, say INPUT.txt, looks like this:

WTCCC66061->WTCCC66061 HAPLO1 AGACTCTGATAGCGATAACC
WTCCC66061->WTCCC66061 HAPLO2 GGGTTCCGATGGCGATAACC
WTCCC66062->WTCCC66062 HAPLO1 AGACTCTGATGGCGCTAACC
WTCCC66062->WTCCC66062 HAPLO2 AGACTCTGATAGCGATGATC
WTCCC66063->WTCCC66063 HAPLO1 AGACTCTTATGGCGCTAGCC
WTCCC66063->WTCCC66063 HAPLO2 AGACTCTTATAGCGATAACC
WTCCC66064->WTCCC66064 HAPLO1 AGACTCTGATGGCGATAGCC
WTCCC66064->WTCCC66064 HAPLO2 AGACTCTGATGACGCTAGCC
WTCCC66065->WTCCC66065 HAPLO1 AGACTCTGATGGCGATAACC
WTCCC66065->WTCCC66065 HAPLO2 AGACTCTGATGGCGATAGCC

Suppose we run "extractHaplo -h INPUT.txt -f 1 -t 3". This means we want to read INPUT.txt, take the text range 1-3 of the third field (positions are 0-indexed, so this is actually the second through fourth characters), count how often each pattern occurs, and find out which lines have the same text in that range as the first line. The output looks like this:

The following parameters are in effect:
Haplotype File : INPUT.txt (-hname)
From Position : 1 (-f9999)
To Position : 3 (-t9999)

Haplotype Counts
GGT 1
GAC 9
60 1
Haplotypes that match the first one
WTCCC66062->WTCCC66062 (3)
WTCCC66062->WTCCC66062 (4)
WTCCC66063->WTCCC66063 (5)
WTCCC66063->WTCCC66063 (6)
WTCCC66064->WTCCC66064 (7)
WTCCC66064->WTCCC66064 (8)
WTCCC66065->WTCCC66065 (9)
WTCCC66065->WTCCC66065 (10)
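
To make the range arithmetic concrete, here is a small standard-library sketch of the 0-indexed, inclusive range used above. midInclusive() and countRanges() are hypothetical names, not libcsg functions, and the clamping of out-of-range positions is an assumption of this sketch rather than a statement about how String::Mid() behaves.

```cpp
#include <algorithm>
#include <map>
#include <string>
#include <vector>

// Characters from index `from` to index `to`, both 0-based and inclusive.
// Positions outside the string are clamped (an assumption of this sketch);
// an empty string is returned when nothing is left.
std::string midInclusive(const std::string& s, int from, int to) {
    int last = static_cast<int>(s.size()) - 1;
    from = std::max(from, 0);
    to = std::min(to, last);
    if (from > to) return "";
    return s.substr(from, to - from + 1);
}

// Count how often each extracted range occurs among the haplotype strings.
std::map<std::string, int> countRanges(const std::vector<std::string>& haplos,
                                       int from, int to) {
    std::map<std::string, int> counts;
    for (const std::string& h : haplos) ++counts[midInclusive(h, from, to)];
    return counts;
}
```

With -f 1 -t 3, the first input line AGACTCTGATAGCGATAACC gives GAC and the second line GGGTTCCGATGGCGATAACC gives GGT. Applied to the ten haplotypes of INPUT.txt, countRanges() yields GAC 9 and GGT 1.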

  • To read a file, use the IFILE class, which is a wrapper for reading and writing files. A particularly useful feature is that it handles gzipped files transparently. Important functions are ifopen(), ifclose() and, as used in the example, ifeof().
  • To handle strings, we prefer to use the String class. String handling is a frequent and varied task in this area. The String class encapsulates basic operations such as indexing ([]), appending (+), equality comparison (==) and extraction (Left, Right, Mid). The String class can be used seamlessly with the IFILE class to access a file. See the while loop in the example code and notice the function ReadLine().
  • To tokenize a String, we can use the StringArray class. It has ReplaceTokens(), which stores each token so that the fields can be accessed like array elements.
  • To associate a String with an integer, there is a class named StringIntHash. Important functions are IncrementCount(), Capacity() and SlotInUse(); the example also uses Integer() to read the stored count.
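
The printing loop in the example walks from 0 to Capacity() and checks SlotInUse() for each index, which suggests a slot-based hash table in which only some slots hold an entry. The internals of StringIntHash are not shown on this page, so the following is only a toy illustration of that idea under this assumption. ToyCounter and sumCounts() are hypothetical and are not part of libcsg.

```cpp
#include <cstddef>
#include <functional>
#include <string>
#include <vector>

// Toy string -> int counter using open addressing with linear probing.
// Hypothetical illustration only; this is not libcsg's StringIntHash.
class ToyCounter {
public:
    explicit ToyCounter(std::size_t capacity = 8)
        : slots_(capacity < 2 ? 2 : capacity) {}

    std::size_t capacity() const { return slots_.size(); }
    bool slotInUse(std::size_t i) const { return slots_[i].used; }
    const std::string& key(std::size_t i) const { return slots_[i].key; }
    int count(std::size_t i) const { return slots_[i].count; }

    // Add one to the count stored for `k`, inserting it first if it is new.
    void increment(const std::string& k) {
        // Keep at most half of the slots occupied so probing always terminates.
        if ((used_ + 1) * 2 > slots_.size()) grow();
        Slot& s = slots_[find(k)];
        if (!s.used) { s.used = true; s.key = k; ++used_; }
        ++s.count;
    }

private:
    struct Slot { bool used = false; std::string key; int count = 0; };

    // Index of the slot holding `k`, or of the free slot where it would go.
    std::size_t find(const std::string& k) const {
        std::size_t i = std::hash<std::string>{}(k) % slots_.size();
        while (slots_[i].used && slots_[i].key != k) i = (i + 1) % slots_.size();
        return i;
    }

    // Double the table and reinsert every occupied slot.
    void grow() {
        std::vector<Slot> old(slots_.size() * 2);
        old.swap(slots_);
        for (const Slot& s : old)
            if (s.used) slots_[find(s.key)] = s;
    }

    std::vector<Slot> slots_;
    std::size_t used_ = 0;
};

// Visit every slot and use only the occupied ones, as the printing loop does.
int sumCounts(const ToyCounter& c) {
    int total = 0;
    for (std::size_t i = 0; i < c.capacity(); i++)
        if (c.slotInUse(i)) total += c.count(i);
    return total;
}
```

Because entries sit wherever their hash value places them, a loop over all slots visits them in no particular order, and the unused slots have to be skipped explicitly.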