Changes

From Genome Analysis Wiki
Jump to navigationJump to search
1,642 bytes added ,  18:01, 19 May 2015
Line 1: Line 1: −
== Introduction ==
+
== Introduction ==
   −
We will illustrate how TrioCaller works in sequence data including trios and unrelated samples. We will start from the scratch and walk through all necessary steps  
+
We will illustrate how TrioCaller works in sequence data including trios and unrelated samples. We will walk through all necessary steps to move from raw sequence data to called genotypes.  
from raw sequence data to called genotypes. If you are new to sequence data, please be patient to go through every step. If you are experienced, you may directly jump to the section of
+
If you are new to sequence data, please review every step. If you are experienced, you may directly jump to [http://genome.sph.umich.edu/wiki/TrioCaller#Genotype_Refinement_Using_Linkage_Disequilibrium_Information_.28TrioCaller.29 TrioCaller] specific section.  
[http://genome.sph.umich.edu/wiki/TrioCaller#Genotype_Refinement_Using_Linkage_Disequilibrium_Information_.28TrioCaller.29 TrioCaller].  
      
We will start with a set of short sequence reads and associated base quality scores (stored in a fastq file), find the most likely genomic location for each read (producing a BAM file), generate an initial list of polymorphic sites and genotypes (stored in a VCF file) and use haplotype information to refine these genotypes (resulting in an updated VCF file).
 
We will start with a set of short sequence reads and associated base quality scores (stored in a fastq file), find the most likely genomic location for each read (producing a BAM file), generate an initial list of polymorphic sites and genotypes (stored in a VCF file) and use haplotype information to refine these genotypes (resulting in an updated VCF file).
 +
 +
=== Note ===
 +
 +
If you are interested in ''de novo'' mutations or are working on one or two families with deep sequence data (>30X), you should first consider our sister program, [http://genome.sph.umich.edu/wiki/Polymutt Polymutt], which ignores linkage disequilibrium information but can handle more complex pedigrees.
    
=== Download  ===
 
=== Download  ===
   −
Before downloading the program, we appreciate if you could email [mailto:weichen.mich@gmail.com weichen.mich@gmail.com] with a little descriptive information (e.g. Affiliation, depth, the number of samples and family structure).
+
Before downloading the program, we appreciate if you could email [mailto:weichen.mich@gmail.com weichen.mich@gmail.com] (Subject: TrioCaller, with/without a little descriptive information (e.g. Affiliation, depth, the number of samples and family structure). We will notify you if there is any update. 
 +
 
 +
'''A recent extension of TrioCaller: [http://genome.sph.umich.edu/wiki/FamLDCaller FamLDCaller] is coming soon with major updates (better processing function,  handling general families and reference panels). Please try the beta version below. Contact weichen.mich@gmail.com for any questions.''' 
 +
 
 +
[[Binary file:]]  [http://www.pitt.edu/~wec47/Files/FamLDCaller FamLDCaller].  [Last update: 08/15/2014]
 +
 
   −
+
'''TrioCaller''' : the version we used in the paper.
Binary file only: [http://www.sph.umich.edu/csg/weich/TrioCaller.06262012.binary.tgz TrioCaller.06262012.binary.tgz].  
+
<br>
 +
[[Binary file only:]] [http://csg.sph.umich.edu/weich/TrioCaller.06262012.binary.tgz TrioCaller.06262012.binary.tgz].  
   −
Binary file with example datasets : [http://www.sph.umich.edu/csg/weich/TrioCaller.06262012.tgz TrioCaller.06262012.tgz].  
+
[[Binary file with example datasets&nbsp;:]] [http://csg.sph.umich.edu/weich/TrioCaller.06262012.tgz TrioCaller.06262012.tgz].  
    
[http://genome.sph.umich.edu/wiki/TrioCaller:Archive Archive].  
 
[http://genome.sph.umich.edu/wiki/TrioCaller:Archive Archive].  
   −
The example dataset demonstrated here is also included. Our dataset consists of 40 individuals, including 10 parent-offspring trios and 10 unrelated individuals.  
+
'''
The average sequence depth is ~3x. README.txt describes the structure of the package. Pipeline.csh (C shell) and pipeline.bash (bash shell) are two scripts for you to run all commands listed here in batch.  
+
 
 +
== An example from sequence data to genotypes ==
 +
'''
 +
 
 +
The example dataset demonstrated here is also included. Our dataset consists of 40 individuals, including 10 parent-offspring trios and 10 unrelated individuals. The average sequence depth is ~3x. README.txt describes the structure of the package. Pipeline.csh (C shell) and pipeline.bash (bash shell) are two scripts for you to run all commands listed here in batch.  
    
To conserve time and disk-space, our analysis will focus on a small region on chromosome 20 around position 2,000,000. We will first map reads for a single individual (labeled SAMPLE1). Then we combine the results with mapped reads from all individuals to generate a list of polymorphic sites and estimate accurate genotypes at each of these sites.
 
To conserve time and disk-space, our analysis will focus on a small region on chromosome 20 around position 2,000,000. We will first map reads for a single individual (labeled SAMPLE1). Then we combine the results with mapped reads from all individuals to generate a list of polymorphic sites and estimate accurate genotypes at each of these sites.
   −
=== Required Software ===
+
=== Required Software ===
    
In addition to TrioCaller, you will need BWA ([http://bio-bwa.sourceforge.net available from Sourceforge]) and samtools ([http://samtools.sourceforge.net also from Sourceforge]) installed to run this exercise. The examples are tested in in bwa 0.6.1, samtools 0.1.18, TrioCaller 0.1.1; we expect newer versions should also work. We assume all executables are in your path.
 
In addition to TrioCaller, you will need BWA ([http://bio-bwa.sourceforge.net available from Sourceforge]) and samtools ([http://samtools.sourceforge.net also from Sourceforge]) installed to run this exercise. The examples are tested in in bwa 0.6.1, samtools 0.1.18, TrioCaller 0.1.1; we expect newer versions should also work. We assume all executables are in your path.
Line 39: Line 52:     
   #Rebuild the reference
 
   #Rebuild the reference
   bwa index -a is ref/human_g1k_v37_chr20.fa
+
   bin/bwa index -a is ref/human_g1k_v37_chr20.fa
 
</source>
 
</source>
   Line 46: Line 59:  
Next, we will use BWA to find the most likely sequence location for each read using the <code>bwa aln</code> command. This command requires two parameters, one corresponding to the reference genome, the other corresponding to a fastq file containing reads to be mapped.  
 
Next, we will use BWA to find the most likely sequence location for each read using the <code>bwa aln</code> command. This command requires two parameters, one corresponding to the reference genome, the other corresponding to a fastq file containing reads to be mapped.  
   −
   bwa aln -q 15 ref/human_g1k_v37_chr20.fa fastq/SAMPLE1.fastq > bwa.sai/SAMPLE1.sai
+
   bin/bwa aln -q 15 ref/human_g1k_v37_chr20.fa fastq/SAMPLE1.fastq > bwa.sai/SAMPLE1.sai
    
The file SAMPLE1.fastq contains DNA sequence reads for sample SAMPLE1.  
 
The file SAMPLE1.fastq contains DNA sequence reads for sample SAMPLE1.  
Line 56: Line 69:  
The .sai alignment format is specific to BWA, so the first thing to do is to convert the alignment to a more standard format that will be compatible with downstream analysis tools. We can do this with a combination of the <code>bwa samse</code> command and <code>samtools view</code> and <code>samtoosl sort</code> commands.
 
The .sai alignment format is specific to BWA, so the first thing to do is to convert the alignment to a more standard format that will be compatible with downstream analysis tools. We can do this with a combination of the <code>bwa samse</code> command and <code>samtools view</code> and <code>samtoosl sort</code> commands.
   −
   bwa samse -r "@RG\tID:ILLUMINA\tSM:SAMPLE1" ref/human_g1k_v37_chr20.fa bwa.sai/SAMPLE1.sai fastq/SAMPLE1.fastq | \
+
   bin/bwa samse -r "@RG\tID:ILLUMINA\tSM:SAMPLE1" ref/human_g1k_v37_chr20.fa bwa.sai/SAMPLE1.sai fastq/SAMPLE1.fastq | \
   samtools view -uhS - | samtools sort -m 2000000000 - bams/SAMPLE1
+
   bin/samtools view -uhS - | bin/samtools sort -m 2000000000 - bams/SAMPLE1
    
You can check the use of parameters in the bwa manual. The result BAM file uses a compact binary format to represent the  
 
You can check the use of parameters in the bwa manual. The result BAM file uses a compact binary format to represent the  
Line 63: Line 76:  
of the file using the <code>samtools view</code> command, like so:
 
of the file using the <code>samtools view</code> command, like so:
   −
   samtools view bams/SAMPLE1.bam | more
+
   bin/samtools view bams/SAMPLE1.bam | more
    
The text representation of the alignment produced by <code>samtools view</code> describes
 
The text representation of the alignment produced by <code>samtools view</code> describes
Line 87: Line 100:  
genome location. We do this with the <code>samtools index</code> command, like so:
 
genome location. We do this with the <code>samtools index</code> command, like so:
   −
   samtools index bams/SAMPLE1.bam
+
   bin/samtools index bams/SAMPLE1.bam
    
=== Browsing Alignment Results ===
 
=== Browsing Alignment Results ===
Line 96: Line 109:  
starting at position 2,100,000 on chromosome 20, we could run:
 
starting at position 2,100,000 on chromosome 20, we could run:
   −
   samtools tview bams/SAMPLE1.bam ref/human_g1k_v37_chr20.fa
+
   bin/samtools tview bams/SAMPLE1.bam ref/human_g1k_v37_chr20.fa
    
Then, type "g 20:2100000"
 
Then, type "g 20:2100000"
Line 110: Line 123:  
   foreach file (`ls fastq/SAMPLE*.fastq | cut -f 2 -d '/' | cut -f 1 -d '.'`)
 
   foreach file (`ls fastq/SAMPLE*.fastq | cut -f 2 -d '/' | cut -f 1 -d '.'`)
 
   echo $file
 
   echo $file
   bwa aln -q 15 ref/human_g1k_v37_chr20.fa fastq/$file.fastq > bwa.sai/$file.sai
+
   bin/bwa aln -q 15 ref/human_g1k_v37_chr20.fa fastq/$file.fastq > bwa.sai/$file.sai
   bwa samse -r "@RG\tID:ILLUMINA\tSM:$file" ref/human_g1k_v37_chr20.fa bwa.sai/$file.sai fastq/$file.fastq | \
+
   bin/bwa samse -r "@RG\tID:ILLUMINA\tSM:$file" ref/human_g1k_v37_chr20.fa bwa.sai/$file.sai fastq/$file.fastq | \
     samtools view -uhS - | bin/samtools sort -m 2000000000 - bams/$file
+
     bin/samtools view -uhS - | bin/samtools sort -m 2000000000 - bams/$file
   samtools index bams/$file.bam
+
   bin/samtools index bams/$file.bam
 
   end
 
   end
   Line 120: Line 133:  
  for file in `ls fastq/SAMPLE*.fastq | cut -f 2 -d '/' | cut -f 1 -d '.'`;
 
  for file in `ls fastq/SAMPLE*.fastq | cut -f 2 -d '/' | cut -f 1 -d '.'`;
 
   do echo $file;
 
   do echo $file;
   bwa aln -q 15 ref/human_g1k_v37_chr20.fa fastq/$file.fastq > bwa.sai/$file.sai;
+
   bin/bwa aln -q 15 ref/human_g1k_v37_chr20.fa fastq/$file.fastq > bwa.sai/$file.sai;
   bwa samse -r "@RG\tID:ILLUMINA\tSM:$file" ref/human_g1k_v37_chr20.fa bwa.sai/$file.sai fastq/$file.fastq | \
+
   bin/bwa samse -r "@RG\tID:ILLUMINA\tSM:$file" ref/human_g1k_v37_chr20.fa bwa.sai/$file.sai fastq/$file.fastq | \
     samtools view -uhS - | samtools sort -m 2000000000 - bams/$file;
+
     bin/samtools view -uhS - | samtools sort -m 2000000000 - bams/$file;
   samtools index bams/$file.bam;
+
   bin/samtools index bams/$file.bam;
 
  done
 
  done
   Line 139: Line 152:     
    
 
    
   samtools mpileup -Iuf ref/human_g1k_v37_chr20.fa bams/SAMPLE*bam | bcftools view -bvcg - > result/chr20.mpileup.bcf
+
   bin/samtools mpileup -Iuf ref/human_g1k_v37_chr20.fa bams/SAMPLE*bam | bin/bcftools view -bvcg - > result/chr20.mpileup.bcf
   −
   bcftools view result/chr20.mpileup.bcf  > result/chr20.mpileup.vcf
+
   bin/bcftools view result/chr20.mpileup.bcf  > result/chr20.mpileup.vcf
      Line 182: Line 195:  
=== Genotype Refinement Using Linkage Disequilibrium Information (TrioCaller) ===
 
=== Genotype Refinement Using Linkage Disequilibrium Information (TrioCaller) ===
   −
The initial set of genotype calls is generated examining a single individual at a time. These calls are typically quite good for deep sequencing data, but much less accurate for low pass sequence data. In either case, they can be greatly improved by models that combine information across sites and individuals and consider the contraints imposed by parent-offspring trios. The current version only supports SNP data, so please filter indels before running TrioCaller.  
+
The initial set of genotype calls is generated examining a single individual at a time. These calls are typically quite good for deep sequencing data, but much less accurate for low pass sequence data. In either case, they can be greatly improved by models that combine information across sites and individuals and consider the contraints imposed by parent-offspring trios.  
 +
 
 +
Note: The current version only supports SNP data, so please '''filter indels''' before running TrioCaller. It supports VCF 4.0 and 4.1 formats with the '''exception of dropped missing trailing fields''' (e.g. use complete missing notation ./.:.:.:.,.,. rather than ./. for the genotype field)
    
Here is a summary of the TrioCaller command line options (these are also listed whenever you run the program with no parameters):
 
Here is a summary of the TrioCaller command line options (these are also listed whenever you run the program with no parameters):
Line 210: Line 225:  
</source>
 
</source>
   −
Note: The pedigree files require complete trio structures (all three members of the trio exist in the file). For parent-offspring pair, create a "fake" parent in the pedigree file or code them as unrelated individuals. The order of the names in the pedigree file is NOT necessary to be consistent with that in .vcf file.  
+
Note: The pedigree files require complete trio structures (all three members of the trio exist in the file). For parent-offspring pair, create a "fake" parent in the pedigree file or code them as unrelated individuals. The order of the names in the pedigree file is NOT necessary to be consistent with that in .vcf file. The computation will be intensive if the number of samples are large.
 +
You can use option "--states" to reduce the computation cost (e.g. start with "--states 50")
    
To complete our example analysis, we could run:
 
To complete our example analysis, we could run:
 
    
 
    
   TrioCaller --vcf result/chr20.mpileup.vcf --pedfile ped/triocaller.ped --rounds 10 --prefix result/chr20.triocaller
+
   bin/TrioCaller --vcf result/chr20.mpileup.vcf --pedfile ped/triocaller.ped --states 50 --rounds 10 --prefix result/chr20.triocaller
    
The format of output file is same as the input file. Again, you can review the contents of the updated VCF file using the more command:
 
The format of output file is same as the input file. Again, you can review the contents of the updated VCF file using the more command:
Line 223: Line 239:     
If you have any question or comment, feel free to contact Wei Chen at [mailto:weichen.mich@gmail.com weichen.mich@gmail.com] or Goncalo Abecasis at [mailto:goncalo@umich.edu goncalo@umich.edu]
 
If you have any question or comment, feel free to contact Wei Chen at [mailto:weichen.mich@gmail.com weichen.mich@gmail.com] or Goncalo Abecasis at [mailto:goncalo@umich.edu goncalo@umich.edu]
 +
 +
== Citation  ==
 +
 +
Chen W, Li B, Zeng Z, Sanna S, Sidore C, Busonero F, Kang HM, Li Y, Abecasis GR. &nbsp;Genotype calling and haplotyping in parent-offspring trios. Genome Res.&nbsp;2013 Jan;23(1):142-51&nbsp;[http://genome.cshlp.org/content/early/2012/11/26/gr.142455.112.long LINK]
533

edits

Navigation menu