From Genome Analysis Wiki

arf is a genetic analysis program for sequencing data.

Basic Usage Example

 arf [options] <vcf-file>

Here are examples of how arf works:

  #-c option directs the output to STDOUT
  arf -a complexity 1000g.vcf -g genome.fa -l 30 -c
  #-o option specifies an output file name 
  arf -a complexity 1000g.vcf -g genome.fa -l 30 -o paltum.vcf
  #input VCF file can be gzipped 
  arf -a complexity 1000g.vcf.gz -g genome.fa -l 30 -o paltum.vcf
  #multiple analyses/annotations can be run at once
  arf -a complexity,f,hwe,exons 1000g.vcf.gz -g genome.fa -l 30 -o paltum.vcf -f refGene.txt.gz
  #estimates allele and genotype frequencies from genotype likelihoods.
  #AF - Allele frequency estimates of alternate alleles (EM)
  #HWEAF - Allele frequency estimates of alternate alleles under the assumption of Hardy-Weinberg equilibrium (EM)
  #GF - Genotype frequency estimates (EM)
  arf -a freq 1000g.vcf
  #conducts an HWE likelihood ratio test (LRT) from genotype likelihoods (multiallelic)
  #adds the info tags
  #HWP - HWE P-value
  #HWCHISQ - HWE Chi-square value
  #HWDOF - Degrees of freedom for test
  #also generates the frequency tags
  arf -a hwe 1000g.vcf
  #estimates the inbreeding coefficient F from genotype likelihoods
  #adds the info tag
  #F - Inbreeding coefficient
  arf -a f 1000g.vcf
  #both analyses can be run at the same time
  #performs the HWE test and estimates F
  arf -a hwe,f 1000g.vcf
  #annotates exonic regions
  #adds the info tag
  #EXON - flag  
  arf -a exons 1000g.vcf -f refGene.txt
  #reference file can be gzipped
  arf -a exons 1000g.vcf -f refGene.txt.gz
  #extracts flanking sequence around a variant
  #adds the info tag
  #FLANKS - 5' sequence, reference allele, 3' sequence up to length n defined by option -l, default is 25
  arf -a flanks 1000g.vcf -g genome.fa -l 30
  #computes a complexity measure for flanking sequences around a variant
  #adds the info tag
  #CPXY - complexity measure for flanks of length l defined by option -l, default is 25
  arf -a complexity 1000g.vcf -g genome.fa -l 30
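
The freq and hwe analyses above work from genotype likelihoods. As a rough illustration of the idea only, and not arf's actual implementation, the following Python sketch runs a textbook EM for one biallelic site and compares the unconstrained genotype frequencies against the HWE fit with a likelihood ratio test. All function names are hypothetical; the conversion from PL assumes the usual phred scaling, and the 1 degree of freedom applies to the biallelic case only.

```python
import math

def pl_to_likelihoods(pl):
    """Convert phred-scaled genotype likelihoods (PL) to plain likelihoods."""
    return [10 ** (-x / 10.0) for x in pl]

def em_genotype_freqs(gls, iters=100):
    """EM estimate of genotype frequencies (hom ref, het, hom alt)."""
    gf = [1 / 3, 1 / 3, 1 / 3]
    for _ in range(iters):
        counts = [0.0, 0.0, 0.0]
        for gl in gls:
            # E-step: posterior genotype probabilities for this sample
            w = [gf[g] * gl[g] for g in range(3)]
            total = sum(w)
            for g in range(3):
                counts[g] += w[g] / total
        # M-step: average the posteriors over samples
        gf = [c / len(gls) for c in counts]
    return gf

def em_hwe_allele_freq(gls, iters=100):
    """EM estimate of the alternate allele frequency assuming HWE."""
    p = 0.5
    for _ in range(iters):
        prior = [(1 - p) ** 2, 2 * p * (1 - p), p ** 2]
        alt = 0.0
        for gl in gls:
            w = [prior[g] * gl[g] for g in range(3)]
            total = sum(w)
            # expected number of alternate alleles carried by this sample
            alt += (w[1] + 2 * w[2]) / total
        p = alt / (2 * len(gls))
    return p

def log_likelihood(gls, gf):
    """Log likelihood of the data given genotype frequencies gf."""
    return sum(math.log(sum(gf[g] * gl[g] for g in range(3))) for gl in gls)

def hwe_lrt(gls):
    """Return (chi-square statistic, p-value) for a 1-dof HWE LRT."""
    gf = em_genotype_freqs(gls)
    p = em_hwe_allele_freq(gls)
    hwe_gf = [(1 - p) ** 2, 2 * p * (1 - p), p ** 2]
    stat = max(0.0, 2 * (log_likelihood(gls, gf) - log_likelihood(gls, hwe_gf)))
    # survival function of a chi-square distribution with 1 degree of freedom
    pvalue = math.erfc(math.sqrt(stat / 2))
    return stat, pvalue
```

Here gls is a list with one entry per sample, each entry holding the three likelihoods for the hom ref, het and hom alt genotypes, for example pl_to_likelihoods([0, 60, 60]).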

In development/Pending update

  #annotates variants
  ##INFO=<ID=VTYPE,Number=1,Type=String,Description="Annotates variant by types SNP, MNP, INDEL, SV, CR">
  #-l option defines the length threshold used to differentiate INDELs from SVs
  arf -a vartype 1000g.vcf -l 30
  #computes genotype likelihood based allele balance
  ##INFO=<ID=AB,Number=1,Type=Float,Description="Allele Balance computed from genotype likelihoods">
  #requires PL/GL and DP in the genotype fields
  arf -a ab 1000g.vcf
  #-e option: when used in conjunction with an analysis that requires allele or
  #genotype frequency estimates, arf will attempt to find the estimates in the
  #AF, GF and HWEAF fields
  arf -a ab 1000g.vcf -e
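
The vartype annotation in the block above is still in development, so the following Python sketch is only one plausible reading of it, not arf's behaviour. It assumes a variant is classified from the lengths of its REF and ALT alleles, that a length difference below the -l threshold means INDEL and anything longer means SV, and it leaves out the CR type because this page does not define it. The function name and the exact boundary rule are hypothetical.

```python
def classify_variant(ref, alt, sv_length=30):
    """Classify a REF/ALT pair as SNP, MNP, INDEL or SV (assumed rules)."""
    if len(ref) == 1 and len(alt) == 1:
        return "SNP"
    if len(ref) == len(alt):
        # same length, more than one base: multi-nucleotide polymorphism
        return "MNP"
    if abs(len(ref) - len(alt)) < sv_length:
        # length change below the threshold: small insertion or deletion
        return "INDEL"
    return "SV"
```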

Command Line Options

   vcf-file       VCF file (can be gzipped or bgzipped)
   -h             help page
   -g             genome file (FASTA file)
                  (note that if genome.fa is specified, the file actually
                   looked for is the memory mapped file genome-bs.umfa; if it
                   is not found, it is generated automatically from the FASTA file)
   -l             length of flanking sequence (default is 25)
   -a             analysis/annotation
   -o             output file name (default is arf.vcf)
   -f             annotation file name (can be gzipped)
   -c             write output to STDOUT
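
To make the role of the genome file and the flank length concrete, here is a minimal Python sketch of how flanks of up to a given length could be cut out around a variant. It assumes the chromosome sequence is already held in memory as a string and that positions are 1-based as in VCF; the function name is hypothetical, and this is not how arf reads its memory mapped genome file.

```python
def flanks(sequence, pos, ref, length=25):
    """Return (5' flank, reference allele, 3' flank) around a variant.

    sequence: chromosome sequence as a string
    pos:      1-based position of the first reference base
    ref:      reference allele
    length:   maximum flank length on each side
    """
    start = pos - 1            # 0-based index of the first reference base
    end = start + len(ref)     # 0-based index just past the reference allele
    # flanks are shorter than `length` near the ends of the sequence
    left = sequence[max(0, start - length):start]
    right = sequence[end:end + length]
    return left, ref, right
```

For example, flanks("ACGTACGTAC", 5, "A", length=3) gives the three bases on either side of the A at position 5.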


   By default, the output is written to arf.vcf.
   A different file name can be specified with the -o option.
   Log messages are written to arf.log.


   arf reads a VCF file and writes an output VCF file with additional INFO tags.
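
Since the results end up as INFO tags in the output VCF, a small reader is often all that is needed downstream. The Python sketch below is an illustration and not part of arf: it assumes a standard tab-delimited VCF with INFO in the eighth column, and the helper names are hypothetical. Values are kept as strings and flags such as EXON are stored as True.

```python
def parse_info(info_field):
    """Parse a VCF INFO column into a dict; flags map to True."""
    tags = {}
    if info_field == ".":
        return tags
    for item in info_field.split(";"):
        if not item:
            continue
        key, sep, value = item.partition("=")
        tags[key] = value if sep else True
    return tags

def read_info_tags(vcf_lines):
    """Yield (chrom, pos, info dict) for each record, skipping header lines."""
    for line in vcf_lines:
        if line.startswith("#"):
            continue
        fields = line.rstrip("\n").split("\t")
        yield fields[0], int(fields[1]), parse_info(fields[7])
```

With an open file handle on arf.vcf, each record's tags can then be looked up by name, for example info.get("HWP") or "EXON" in info.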


Binaries for Linux machines are provided for arf 0.557215.

You will also need a copy of the human genome assembly FASTA file: human.g1k.v37.fa. Please gunzip it before use. arf will generate a memory mapped file named human.g1k.v37-bs.umfa from the FASTA file.
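
If you want to check in advance whether the memory mapped file already exists, its name can be derived from the FASTA file name. The following Python sketch only mirrors the two examples given on this page (genome.fa and human.g1k.v37.fa); the helper name is hypothetical, and the rule of replacing the final extension with -bs.umfa is an assumption inferred from those examples.

```python
import os

def umfa_path(fasta_path):
    """Derive the memory mapped file name from a FASTA file name."""
    root, _ext = os.path.splitext(fasta_path)   # drop the final extension
    return root + "-bs.umfa"

def umfa_exists(fasta_path):
    """Report whether the derived memory mapped file is already present."""
    return os.path.exists(umfa_path(fasta_path))
```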

You will also need a copy of the UCSC refGene text file: refGene.txt.

This page is maintained by Adrian.