Difference between revisions of "Thunder"

From Genome Analysis Wiki
  
 
Revision as of 20:35, 17 November 2010

This page documents how to perform variant calling from low-coverage sequencing data using glfmultiples and thunder. The pipeline was originally developed by Yun Li for the 1000 Genomes Low Coverage Pilot Project.

Input Data

To get started, you will need glf files in the standard glf format (http://samtools.sourceforge.net/SAM1.pdf). Sample glf files are available at ftp://share.sph.umich.edu/1000genomes/pilot1/examples/glf.tgz.

If you do not have glf files, you can generate them from bam files (the bam format is specified in the same document) using the following command line:

 samtools pileup -g -T 1 -f ref.fa my.bam > my.glf

Note: you will need the reference fasta file ref.fa to create a glf file from a bam file.
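
With many bam files, the command above has to be repeated once per file. The following is a minimal Python sketch that only builds the command strings and does not run samtools. The helper name glf_commands and the sample file names are hypothetical; the samtools options are copied from the command above.

```python
import shlex
from pathlib import Path


def glf_commands(bam_files, ref_fasta="ref.fa"):
    """Return one samtools command string per bam file (hypothetical helper)."""
    commands = []
    for bam in bam_files:
        # Name the output after the input: my.bam -> my.glf
        glf = str(Path(bam).with_suffix(".glf"))
        commands.append(
            "samtools pileup -g -T 1 -f {} {} > {}".format(
                shlex.quote(ref_fasta), shlex.quote(bam), shlex.quote(glf)
            )
        )
    return commands


if __name__ == "__main__":
    # Hypothetical input files; print the commands instead of executing them.
    for cmd in glf_commands(["sample1.bam", "sample2.bam"]):
        print(cmd)
```

The printed lines can be reviewed and then pasted into a shell or a job script.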


How to Run

This variant calling pipeline has two steps: (step 1) promotion of a set of potential polymorphisms, and (step 2) genotype/haplotype calling using LD information.

(step 1) Site promotion using the glfMultiples software (executable GPT_Freq)

 GPT_Freq -b my.out -p 0.9 --minDepth 10 --maxDepth 1000 *.glf 

minDepth and maxDepth are the cutoffs on total depth (summed across all individuals). We have found it useful to exclude sites with extremely low or extremely high total depth. Please see Important Filters below.

(step 2) Genotype/haplotype calling using thunder (executable thunder_glf_freq)

 thunder_glf_freq --shotgun my.out.$chr -r 100 --states 200 --dosage --phase --interim 25 -o my.final.out
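
The $chr in the command above stands for a chromosome label, so step 2 is run once per chromosome. A minimal sketch that builds one command per chromosome is given here. The helper name thunder_commands is hypothetical, the chromosome labels are whatever suffixes your step 1 output files carry, and appending the label to the -o prefix is an assumption made so that runs do not overwrite each other. All other options are copied from the command above.

```python
def thunder_commands(chromosomes, site_prefix="my.out", out_prefix="my.final.out"):
    """Build one thunder_glf_freq command string per chromosome label.

    Hypothetical helper: it only formats strings and runs nothing.
    """
    template = (
        "thunder_glf_freq --shotgun {site}.{chrom} -r 100 --states 200 "
        "--dosage --phase --interim 25 -o {out}.{chrom}"
    )
    return [
        template.format(site=site_prefix, chrom=chrom, out=out_prefix)
        for chrom in chromosomes
    ]


if __name__ == "__main__":
    # Assumed labels; use the suffixes of your own step 1 output files.
    for cmd in thunder_commands(["20", "21", "22"]):
        print(cmd)
```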

Notes:

(1) The program thunder used in step 2 is an extension of MaCH, the genotype imputation software we have previously developed. For details regarding the shared options, please check out the MaCH website and the MaCH wiki.

(2) Example files and command lines can be found under examples/thunder/ in the thunder package (thunder_glf_freq).

Important Filters

We have found that the following filters are helpful.

total depth filter

For the 1000 Genomes Project (average depth per individual ~4X), we have found it useful to exclude sites with average total depth per individual < 0.5X or > 10X.
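
The per-individual bounds above translate into total-depth cutoffs by multiplying by the number of individuals. A small sketch of that arithmetic follows. The function name is hypothetical, and rounding the lower bound up and the upper bound down is an assumption about how the cutoffs should be treated, not something the programs document.

```python
import math


def total_depth_cutoffs(n_individuals, min_per_individual=0.5, max_per_individual=10.0):
    """Convert average per-individual depth bounds into total-depth cutoffs.

    Returns (min_depth, max_depth) as integers summed over all individuals.
    """
    min_depth = math.ceil(min_per_individual * n_individuals)
    max_depth = math.floor(max_per_individual * n_individuals)
    return min_depth, max_depth


if __name__ == "__main__":
    # Hypothetical sample size of 100 individuals.
    print(total_depth_cutoffs(100))
```

For example, with 100 individuals the bounds 0.5X and 10X correspond to total depths of 50 and 1000, which could be passed as --minDepth 50 --maxDepth 1000 in step 1.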

coverage filter

We recommend keeping only sites at which more than 50% of individuals have coverage.
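
As an illustration, the check can be sketched as below, assuming that "coverage" means at least one read at the site for that individual and that per-individual depths are already available as a list. Both the function name and the input layout are hypothetical.

```python
def passes_coverage_filter(depths, min_fraction=0.5):
    """Return True if more than min_fraction of individuals have depth >= 1.

    depths: list of per-individual read depths at a single site.
    """
    if not depths:
        return False
    covered = sum(1 for depth in depths if depth > 0)
    return covered / len(depths) > min_fraction


if __name__ == "__main__":
    # Two of three individuals are covered, so the site passes.
    print(passes_coverage_filter([3, 1, 0]))
```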

flanking sequence filter

We recommend excluding sites whose flanking 10-mer has a frequency above 0.1% among candidate sites.

The rationale is ....

indel filter

We recommend keeping only sites that are at least 5 bp away from known indels. A catalog of known indels can be found at [ indel catalog].
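
A minimal sketch of the distance check is shown here, under the simplifying assumption that each known indel is represented by a single position on the same chromosome as the candidate site (real indels span a range, which this sketch ignores). The function name and input layout are hypothetical.

```python
from bisect import bisect_left


def far_from_indels(position, indel_positions, min_distance=5):
    """Return True if position is at least min_distance bp from every indel.

    indel_positions must be sorted in ascending order.
    """
    i = bisect_left(indel_positions, position)
    # Only the nearest indel on each side can violate the distance bound.
    for j in (i - 1, i):
        if 0 <= j < len(indel_positions):
            if abs(indel_positions[j] - position) < min_distance:
                return False
    return True


if __name__ == "__main__":
    known_indels = [100, 200]  # hypothetical positions
    print(far_from_indels(104, known_indels))
    print(far_from_indels(105, known_indels))
```

The binary search keeps the check fast when the indel catalog is large, since only the two neighbouring indels need to be inspected per site.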

site promotion filter

We recommend setting the parameter -p to at least 0.9 in step 1 (running glfMultiples).