Make file tutorial

From Genome Analysis Wiki

Revision as of 15:54, 15 June 2015

Introduction

GNU Make is often thought of as a tool for managing the compilation of large C programs. This is true, but its potential is not limited to this!

At its core, it is a generic pipelining framework that is aware of dependencies and can run steps in parallel.

Statistical genetics analyses (or any big data analyses in general) often require multiple steps: preparing the data, running computationally expensive analyses, and then collating the results.

Instead of simply compiling code, Make can also execute the steps in your analyses.

Make allows you to redo part of your analyses and rerun only the parts which are affected by the change.

Using Make can save you a lot of time and hair pulling, especially when your supervisor asks for ALL the analyses again but this time only with rare variants.

Using a script to generate a make file also lets you document the steps required in the analysis, which makes it easier to revisit the analysis in the future.

Basic Idea

 The general format of a rule in a make file is as follows:
 <target> : <dependency> ...
      <command 1>
      <command 2>
 
 Each command line must begin with a tab character, not spaces.
 
 The target is usually a small marker file that is created using "touch <target>".  
 A target can also be just a name that does not correspond to a file; in that case, it is run via "make <target>".
 
 The dependencies are files, typically the target files of earlier steps.

 The commands are single-line shell commands.  
 The last command is usually the touch command.  
 Since Make stops a rule at the first command that fails, the target file is created only if the prior commands were executed successfully.
 
 A perl script is written to generate the make file. In this script, you may document the analyses and 
 provide options to customize the variables in your analyses.
 Once the make file is generated, you can run it with make, using the -j option for parallelization.
 If a part of the analyses has to be performed again, simply delete the relevant target file; make will
 rerun that step and every step that depends on it.  
 Some other useful options in Make are -k, which runs the analyses as far as possible instead of 
 terminating the entire pipeline at the first error, and -t, which touches the target files in dependency
 order to mark them as up to date without running the commands.
 For commands that involve a series of pipes, you can use "set -o pipefail" in a bash environment
 to ensure that an error is returned if any stage of the pipe fails.  If this is not done, the return
 code of the last process in the pipe will be returned and Make will think that this series of commands
 has completed successfully even when an earlier stage failed.
 Note that by default Make runs each command line in a separate shell, which is /bin/sh unless SHELL is
 set in the make file, so the setting has to be on the same line as the pipe and SHELL has to point to bash.
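
The effect of pipefail can be checked directly in a shell. The following is a minimal sketch, assuming bash is installed; the pipeline "false | cat" is only a stand-in for a real analysis pipeline whose first stage fails, and the variable names are made up for illustration.

```shell
# Stand-in pipeline: the first stage (false) fails, the last stage (cat) succeeds.

# Default behaviour: the exit status is that of the last stage, so the failure is hidden.
status_default=0
bash -c 'false | cat' || status_default=$?

# With pipefail: the exit status is non-zero if any stage of the pipe fails.
status_pipefail=0
bash -c 'set -o pipefail; false | cat' || status_pipefail=$?

echo "default=$status_default pipefail=$status_pipefail"
```

This prints "default=0 pipefail=1": without pipefail the failed first stage goes unnoticed, which is exactly the case in which Make would wrongly create the target file.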

Example

This example does the following:

  1. generate 100 log files with a number written to it
  2. concatenate the 100 log files into one file
  3. delete the 100 log files

The example files may be found in /net/fantasia/home/atks/makefile_tutorial

 #generate make file using perl script
 ./generate_simple_stuff
 #generate make file using perl script to launch jobs on slurm
 ./generate_simple_stuff -l slurm
 #generate make file using perl script to launch jobs on slurm,
 #with files stored in <dir>, which must be given as an absolute path
 ./generate_simple_stuff -l slurm -o <dir>
 #run make file sequentially
 make -f simple_stuff.mk
 #run make file in parallel with at most 100 jobs
 make -f simple_stuff.mk -j 100
 #clear files from run
 make -f simple_stuff.mk clean
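
To see how the target files drive reruns, here is a scaled-down sketch of the first two steps (generate the log files, then concatenate them) with 2 files instead of 100. It is not the output of the perl script: the file name demo.mk, the scratch directory and the relative paths are assumptions made for illustration, the cleanup step is left out, and GNU Make is assumed to be installed.

```shell
# Work in a scratch directory (hypothetical location chosen by mktemp).
workdir=$(mktemp -d)
cd "$workdir" || exit 1

# Write a two-file version of the pipeline.
# The \t in each printf format produces the tab that must start every command line.
{
  printf '.DELETE_ON_ERROR:\n\n'
  printf 'all: all.log.OK\n\n'
  printf '1.OK:\n\techo 1 > 1.log\n\ttouch 1.OK\n\n'
  printf '2.OK:\n\techo 2 > 2.log\n\ttouch 2.OK\n\n'
  printf 'all.log.OK: 1.OK 2.OK\n\tcat 1.log 2.log > all.log\n\ttouch all.log.OK\n'
} > demo.mk

# First run: steps 1 and 2 may run in parallel, then the concatenation runs.
make -f demo.mk -j 2 > first_run.txt

# Second run: every target file exists and is up to date, so no command is run.
make -f demo.mk > second_run.txt

# Delete one target file: only step 2 and the concatenation that depends on it are rerun.
rm 2.OK
make -f demo.mk > third_run.txt
cat third_run.txt
```

Make echoes each command before running it, so the three run files show which steps were executed. The third run should list the commands for 2.OK and all.log.OK but not the ones for 1.OK.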

Script

#!/usr/bin/perl -w

use warnings;
use strict;
use POSIX;
use Getopt::Long;
use File::Path;
use File::Basename;
use Pod::Usage;

=head1 NAME

generate_simple_stuff_makefile

=head1 SYNOPSIS

 generate_simple_stuff_makefile [options]

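  -o     output directory : location of all output files
  -l     launch method : local or slurm (default: local)
  -m     output make file (default: simple_stuff.mk)
  -h     help

 example: ./generate_simple_stuff_makefile.pl

=head1 DESCRIPTION

Generates a make file that writes 100 log files, concatenates them into one file and deletes the 100 log files.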

=cut

#option variables
my $help;
my $verbose;
my $debug;
my $outputDir = getcwd();
my $makeFile = "simple_stuff.mk";
my $launchMethod = "local";

#initialize options
Getopt::Long::Configure ('bundling');

#show usage on -h, on a parsing error or if stray arguments are given
if(!GetOptions ('h'=>\$help, 'v'=>\$verbose, 'd'=>\$debug,
                'o:s'=>\$outputDir,
                'l:s'=>\$launchMethod,
                'm:s'=>\$makeFile)
  || $help
  || !defined($outputDir)
  || scalar(@ARGV)!=0)
{
    if ($help)
    {
        pod2usage(-verbose => 2);
    }
    else
    {
        pod2usage(1);
    }
}

if ($launchMethod ne "local" && $launchMethod ne "slurm")
{
    print STDERR "Launch method has to be local or slurm\n";
    exit(1);
}

##############
#print options
##############
printf("Options\n");
printf("\n");
printf("output directory : %s\n", $outputDir);
printf("launch method    : %s\n", $launchMethod);
printf("\n");

my @nodes = ();
for my $i (140..171)
{
    push(@nodes, "$i");
}
my $nodes = join(",", @nodes);

#arrays for storing targets, dependencies and commands
my @tgts = ();
my @deps = ();
my @cmds = ();

#temporary variables
my $tgt;
my $dep;
my @cmd;

mkpath($outputDir);

my $inputFiles = "";
my $inputFilesOK = "";
my $inputFile = "";
my $outputFile = "";

######################
#1. Generate 100 files
######################
for my $i (1..100)
{
    $inputFiles .= " $outputDir/$i.log";
    $inputFilesOK .= " $outputDir/$i.OK";
    $tgt = "$outputDir/$i.OK";
    $dep = "";
    @cmd = ("echo $i > $outputDir/$i.log");
    makeJob($launchMethod, $tgt, $dep, @cmd);
}

#########################
#2. Concatenate 100 files
#########################
$outputFile = "$outputDir/all.log";
$tgt = "$outputFile.OK";
$dep = $inputFilesOK;
@cmd = ("cat $inputFiles > $outputFile");
makeJob($launchMethod, $tgt, $dep, @cmd);

###########################
#3. Cleanup temporary files
###########################
$tgt = "$outputDir/cleaned.OK";
$dep = "$outputDir/all.log.OK";
@cmd = ("rm $inputFiles");
makeJob($launchMethod, $tgt, $dep, @cmd);

#*******************
#Write out make file
#*******************
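open(MAK, ">", $makeFile) || die "Cannot open $makeFile: $!\n";
print MAK ".DELETE_ON_ERROR:\n\n";
#all and clean are names, not files, so declare them phony
print MAK ".PHONY: all clean\n\n";
print MAK "all: @tgts\n\n";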

#clean
push(@tgts, "clean");
push(@deps, "");
push(@cmds, "\t-rm -rf $outputDir/*.OK $outputDir/*.log");

for(my $i=0; $i < @tgts; ++$i)
{
    print MAK "$tgts[$i]: $deps[$i]\n";
    print MAK "$cmds[$i]\n";
}
close MAK;

##########
#functions
##########

#add a rule for a job that is run either locally or via slurm
sub makeJob
{
    my ($method, $tgt, $dep, @cmd) = @_;

    if ($method eq "local")
    {
        makeLocalStep($tgt, $dep, @cmd);
    }
    elsif ($method eq "slurm")
    {
        makeSlurm($tgt, $dep, @cmd);
    }
}

#add a rule whose commands are launched with srun
sub makeSlurm
{
    my ($tgt, $dep, @cmd) = @_;

    push(@tgts, $tgt);
    push(@deps, $dep);
    my $cmd = "";
    for my $c (@cmd)
    {
        $cmd .= "\tsrun " . $c . "\n";
    }
    $cmd .= "\ttouch $tgt\n";
    push(@cmds, $cmd);
}

#add a rule whose commands are run locally
sub makeLocalStep
{
    my ($tgt, $dep, @cmd) = @_;

    push(@tgts, $tgt);
    push(@deps, $dep);
    my $cmd = "";
    for my $c (@cmd)
    {
        $cmd .= "\t" . $c . "\n";
    }
    $cmd .= "\ttouch $tgt\n";
    push(@cmds, $cmd);
}


Acknowledgement

Thanks to Hyun for introducing this trick.

Maintained by

This page is maintained by Adrian.