Changes

From Genome Analysis Wiki
Jump to navigationJump to search
693 bytes added ,  18:01, 19 May 2015
Line 1: Line 1:  
== Introduction  ==
 
== Introduction  ==
   −
We will illustrate how TrioCaller works in sequence data including trios and unrelated samples. We will start from the scratch and walk through all necessary steps from raw sequence data to called genotypes. If you are new to sequence data, please be patient to go through every step. If you are experienced, you may directly jump to the section of [http://genome.sph.umich.edu/wiki/TrioCaller#Genotype_Refinement_Using_Linkage_Disequilibrium_Information_.28TrioCaller.29 TrioCaller].  
+
We will illustrate how TrioCaller works in sequence data including trios and unrelated samples. We will walk through all necessary steps to move from raw sequence data to called genotypes.  
 +
If you are new to sequence data, please review every step. If you are experienced, you may directly jump to [http://genome.sph.umich.edu/wiki/TrioCaller#Genotype_Refinement_Using_Linkage_Disequilibrium_Information_.28TrioCaller.29 TrioCaller] specific section.  
   −
We will start with a set of short sequence reads and associated base quality scores (stored in a fastq file), find the most likely genomic location for each read (producing a BAM file), generate an initial list of polymorphic sites and genotypes (stored in a VCF file) and use haplotype information to refine these genotypes (resulting in an updated VCF file).  
+
We will start with a set of short sequence reads and associated base quality scores (stored in a fastq file), find the most likely genomic location for each read (producing a BAM file), generate an initial list of polymorphic sites and genotypes (stored in a VCF file) and use haplotype information to refine these genotypes (resulting in an updated VCF file).
   −
== '''Note:''' if you are interesting in detecting '''de novo mutations''', or are working on '''a small number of families''' with '''high coverage data''' (e.g. exome sequencing), please first try our sister program [http://genome.sph.umich.edu/wiki/Polymutt Polymutt] . ==
+
=== Note ===
 +
 
 +
If you are interested in ''de novo'' mutations or are working on one or two families with deep sequence data (>30X), you should first consider our sister program, [http://genome.sph.umich.edu/wiki/Polymutt Polymutt], which ignores linkage disequilibrium information but can handle more complex pedigrees.
    
=== Download  ===
 
=== Download  ===
   −
Before downloading the program, we appreciate if you could email [mailto:weichen.mich@gmail.com weichen.mich@gmail.com] with a little descriptive information (e.g. Affiliation, depth, the number of samples and family structure).  
+
Before downloading the program, we appreciate if you could email [mailto:weichen.mich@gmail.com weichen.mich@gmail.com] (Subject: TrioCaller, with/without a little descriptive information (e.g. Affiliation, depth, the number of samples and family structure). We will notify you if there is any update. 
 +
 
 +
'''A recent extension of TrioCaller: [http://genome.sph.umich.edu/wiki/FamLDCaller FamLDCaller] is coming soon with major updates (better processing function,  handling general families and reference panels). Please try the beta version below. Contact weichen.mich@gmail.com for any questions.''' 
 +
 
 +
[[Binary file:]]  [http://www.pitt.edu/~wec47/Files/FamLDCaller FamLDCaller].  [Last update: 08/15/2014]
 +
 
   −
<br> Binary file only: [http://www.sph.umich.edu/csg/weich/TrioCaller.06262012.binary.tgz TrioCaller.06262012.binary.tgz].  
+
'''TrioCaller''' : the version we used in the paper.
 +
<br>  
 +
[[Binary file only:]] [http://csg.sph.umich.edu/weich/TrioCaller.06262012.binary.tgz TrioCaller.06262012.binary.tgz].  
   −
Binary file with example datasets&nbsp;: [http://www.sph.umich.edu/csg/weich/TrioCaller.06262012.tgz TrioCaller.06262012.tgz].  
+
[[Binary file with example datasets&nbsp;:]] [http://csg.sph.umich.edu/weich/TrioCaller.06262012.tgz TrioCaller.06262012.tgz].  
    
[http://genome.sph.umich.edu/wiki/TrioCaller:Archive Archive].  
 
[http://genome.sph.umich.edu/wiki/TrioCaller:Archive Archive].  
 +
 +
'''
 +
 +
== An example from sequence data to genotypes ==
 +
'''
    
The example dataset demonstrated here is also included. Our dataset consists of 40 individuals, including 10 parent-offspring trios and 10 unrelated individuals. The average sequence depth is ~3x. README.txt describes the structure of the package. Pipeline.csh (C shell) and pipeline.bash (bash shell) are two scripts for you to run all commands listed here in batch.  
 
The example dataset demonstrated here is also included. Our dataset consists of 40 individuals, including 10 parent-offspring trios and 10 unrelated individuals. The average sequence depth is ~3x. README.txt describes the structure of the package. Pipeline.csh (C shell) and pipeline.bash (bash shell) are two scripts for you to run all commands listed here in batch.  
   −
To conserve time and disk-space, our analysis will focus on a small region on chromosome 20 around position 2,000,000. We will first map reads for a single individual (labeled SAMPLE1). Then we combine the results with mapped reads from all individuals to generate a list of polymorphic sites and estimate accurate genotypes at each of these sites.  
+
To conserve time and disk-space, our analysis will focus on a small region on chromosome 20 around position 2,000,000. We will first map reads for a single individual (labeled SAMPLE1). Then we combine the results with mapped reads from all individuals to generate a list of polymorphic sites and estimate accurate genotypes at each of these sites.
    
=== Required Software  ===
 
=== Required Software  ===
Line 37: Line 52:     
   #Rebuild the reference
 
   #Rebuild the reference
   bwa index -a is ref/human_g1k_v37_chr20.fa
+
   bin/bwa index -a is ref/human_g1k_v37_chr20.fa
 
</source>
 
</source>
   Line 44: Line 59:  
Next, we will use BWA to find the most likely sequence location for each read using the <code>bwa aln</code> command. This command requires two parameters, one corresponding to the reference genome, the other corresponding to a fastq file containing reads to be mapped.  
 
Next, we will use BWA to find the most likely sequence location for each read using the <code>bwa aln</code> command. This command requires two parameters, one corresponding to the reference genome, the other corresponding to a fastq file containing reads to be mapped.  
   −
   bwa aln -q 15 ref/human_g1k_v37_chr20.fa fastq/SAMPLE1.fastq > bwa.sai/SAMPLE1.sai
+
   bin/bwa aln -q 15 ref/human_g1k_v37_chr20.fa fastq/SAMPLE1.fastq > bwa.sai/SAMPLE1.sai
    
The file SAMPLE1.fastq contains DNA sequence reads for sample SAMPLE1.  
 
The file SAMPLE1.fastq contains DNA sequence reads for sample SAMPLE1.  
Line 54: Line 69:  
The .sai alignment format is specific to BWA, so the first thing to do is to convert the alignment to a more standard format that will be compatible with downstream analysis tools. We can do this with a combination of the <code>bwa samse</code> command and <code>samtools view</code> and <code>samtoosl sort</code> commands.
 
The .sai alignment format is specific to BWA, so the first thing to do is to convert the alignment to a more standard format that will be compatible with downstream analysis tools. We can do this with a combination of the <code>bwa samse</code> command and <code>samtools view</code> and <code>samtoosl sort</code> commands.
   −
   bwa samse -r "@RG\tID:ILLUMINA\tSM:SAMPLE1" ref/human_g1k_v37_chr20.fa bwa.sai/SAMPLE1.sai fastq/SAMPLE1.fastq | \
+
   bin/bwa samse -r "@RG\tID:ILLUMINA\tSM:SAMPLE1" ref/human_g1k_v37_chr20.fa bwa.sai/SAMPLE1.sai fastq/SAMPLE1.fastq | \
   samtools view -uhS - | samtools sort -m 2000000000 - bams/SAMPLE1
+
   bin/samtools view -uhS - | bin/samtools sort -m 2000000000 - bams/SAMPLE1
    
You can check the use of parameters in the bwa manual. The result BAM file uses a compact binary format to represent the  
 
You can check the use of parameters in the bwa manual. The result BAM file uses a compact binary format to represent the  
Line 61: Line 76:  
of the file using the <code>samtools view</code> command, like so:
 
of the file using the <code>samtools view</code> command, like so:
   −
   samtools view bams/SAMPLE1.bam | more
+
   bin/samtools view bams/SAMPLE1.bam | more
    
The text representation of the alignment produced by <code>samtools view</code> describes
 
The text representation of the alignment produced by <code>samtools view</code> describes
Line 85: Line 100:  
genome location. We do this with the <code>samtools index</code> command, like so:
 
genome location. We do this with the <code>samtools index</code> command, like so:
   −
   samtools index bams/SAMPLE1.bam
+
   bin/samtools index bams/SAMPLE1.bam
    
=== Browsing Alignment Results ===
 
=== Browsing Alignment Results ===
Line 94: Line 109:  
starting at position 2,100,000 on chromosome 20, we could run:
 
starting at position 2,100,000 on chromosome 20, we could run:
   −
   samtools tview bams/SAMPLE1.bam ref/human_g1k_v37_chr20.fa
+
   bin/samtools tview bams/SAMPLE1.bam ref/human_g1k_v37_chr20.fa
    
Then, type "g 20:2100000"
 
Then, type "g 20:2100000"
Line 108: Line 123:  
   foreach file (`ls fastq/SAMPLE*.fastq | cut -f 2 -d '/' | cut -f 1 -d '.'`)
 
   foreach file (`ls fastq/SAMPLE*.fastq | cut -f 2 -d '/' | cut -f 1 -d '.'`)
 
   echo $file
 
   echo $file
   bwa aln -q 15 ref/human_g1k_v37_chr20.fa fastq/$file.fastq > bwa.sai/$file.sai
+
   bin/bwa aln -q 15 ref/human_g1k_v37_chr20.fa fastq/$file.fastq > bwa.sai/$file.sai
   bwa samse -r "@RG\tID:ILLUMINA\tSM:$file" ref/human_g1k_v37_chr20.fa bwa.sai/$file.sai fastq/$file.fastq | \
+
   bin/bwa samse -r "@RG\tID:ILLUMINA\tSM:$file" ref/human_g1k_v37_chr20.fa bwa.sai/$file.sai fastq/$file.fastq | \
     samtools view -uhS - | bin/samtools sort -m 2000000000 - bams/$file
+
     bin/samtools view -uhS - | bin/samtools sort -m 2000000000 - bams/$file
   samtools index bams/$file.bam
+
   bin/samtools index bams/$file.bam
 
   end
 
   end
   Line 118: Line 133:  
  for file in `ls fastq/SAMPLE*.fastq | cut -f 2 -d '/' | cut -f 1 -d '.'`;
 
  for file in `ls fastq/SAMPLE*.fastq | cut -f 2 -d '/' | cut -f 1 -d '.'`;
 
   do echo $file;
 
   do echo $file;
   bwa aln -q 15 ref/human_g1k_v37_chr20.fa fastq/$file.fastq > bwa.sai/$file.sai;
+
   bin/bwa aln -q 15 ref/human_g1k_v37_chr20.fa fastq/$file.fastq > bwa.sai/$file.sai;
   bwa samse -r "@RG\tID:ILLUMINA\tSM:$file" ref/human_g1k_v37_chr20.fa bwa.sai/$file.sai fastq/$file.fastq | \
+
   bin/bwa samse -r "@RG\tID:ILLUMINA\tSM:$file" ref/human_g1k_v37_chr20.fa bwa.sai/$file.sai fastq/$file.fastq | \
     samtools view -uhS - | samtools sort -m 2000000000 - bams/$file;
+
     bin/samtools view -uhS - | samtools sort -m 2000000000 - bams/$file;
   samtools index bams/$file.bam;
+
   bin/samtools index bams/$file.bam;
 
  done
 
  done
   Line 137: Line 152:     
    
 
    
   samtools mpileup -Iuf ref/human_g1k_v37_chr20.fa bams/SAMPLE*bam | bcftools view -bvcg - > result/chr20.mpileup.bcf
+
   bin/samtools mpileup -Iuf ref/human_g1k_v37_chr20.fa bams/SAMPLE*bam | bin/bcftools view -bvcg - > result/chr20.mpileup.bcf
   −
   bcftools view result/chr20.mpileup.bcf  > result/chr20.mpileup.vcf
+
   bin/bcftools view result/chr20.mpileup.bcf  > result/chr20.mpileup.vcf
      Line 215: Line 230:  
To complete our example analysis, we could run:
 
To complete our example analysis, we could run:
 
    
 
    
   TrioCaller --vcf result/chr20.mpileup.vcf --pedfile ped/triocaller.ped --states 50 --rounds 10 --prefix result/chr20.triocaller
+
   bin/TrioCaller --vcf result/chr20.mpileup.vcf --pedfile ped/triocaller.ped --states 50 --rounds 10 --prefix result/chr20.triocaller
    
The format of output file is same as the input file. Again, you can review the contents of the updated VCF file using the more command:
 
The format of output file is same as the input file. Again, you can review the contents of the updated VCF file using the more command:
Line 225: Line 240:  
If you have any question or comment, feel free to contact Wei Chen at [mailto:weichen.mich@gmail.com weichen.mich@gmail.com] or Goncalo Abecasis at [mailto:goncalo@umich.edu goncalo@umich.edu]
 
If you have any question or comment, feel free to contact Wei Chen at [mailto:weichen.mich@gmail.com weichen.mich@gmail.com] or Goncalo Abecasis at [mailto:goncalo@umich.edu goncalo@umich.edu]
   −
==  Citation ==
+
== Citation ==
   −
Chen W, Li B, Zeng Z, Sanna S, Sidore C, Busonero F, Kang HM, Li Y, Abecasis GR. Genome Res. Genotype calling and haplotyping in parent-offspring trios. 2012 Nov 27. [Epub ahead of print]
+
Chen W, Li B, Zeng Z, Sanna S, Sidore C, Busonero F, Kang HM, Li Y, Abecasis GR. &nbsp;Genotype calling and haplotyping in parent-offspring trios. Genome Res.&nbsp;2013 Jan;23(1):142-51&nbsp;[http://genome.cshlp.org/content/early/2012/11/26/gr.142455.112.long LINK]
[http://genome.cshlp.org/content/early/2012/11/26/gr.142455.112.long LINK]
 
533

edits

Navigation menu