Changes

693 bytes added , 18:01, 19 May 2015

Line 1: Line 1:

== Introduction ==

−

We will illustrate how TrioCaller works in sequence data including trios and unrelated samples. We will ~~start from the scratch and~~ walk through all necessary steps from raw sequence data to called genotypes. If you are new to sequence data, please ~~be patient to go through~~ every step. If you are experienced, you may directly jump to ~~the section of~~ [http://genome.sph.umich.edu/wiki/TrioCaller#Genotype_Refinement_Using_Linkage_Disequilibrium_Information_.28TrioCaller.29 TrioCaller].

+

We will illustrate how TrioCaller works in sequence data including trios and unrelated samples. We will walk through all necessary steps to move from raw sequence data to called genotypes.

+

If you are new to sequence data, please review every step. If you are experienced, you may directly jump to [http://genome.sph.umich.edu/wiki/TrioCaller#Genotype_Refinement_Using_Linkage_Disequilibrium_Information_.28TrioCaller.29 TrioCaller] specific section.

−

We will start with a set of short sequence reads and associated base quality scores (stored in a fastq file), find the most likely genomic location for each read (producing a BAM file), generate an initial list of polymorphic sites and genotypes (stored in a VCF file) and use haplotype information to refine these genotypes (resulting in an updated VCF file).

+

We will start with a set of short sequence reads and associated base quality scores (stored in a fastq file), find the most likely genomic location for each read (producing a BAM file), generate an initial list of polymorphic sites and genotypes (stored in a VCF file) and use haplotype information to refine these genotypes (resulting in an updated VCF file).

−

== ~~'''~~Note~~:''' if~~ you are ~~interesting~~ in ~~detecting '~~''de novo ~~mutations~~''', or are working on ~~'''a small number of~~ families~~'''~~ with ~~'''high coverage~~ data~~'''~~ (~~e.g. exome sequencing~~), ~~please~~ first ~~try~~ our sister program [http://genome.sph.umich.edu/wiki/Polymutt Polymutt]~~ ~~. ==

+

=== Note ===

+

If you are interested in ''de novo'' mutations or are working on one or two families with deep sequence data (>30X), you should first consider our sister program, [http://genome.sph.umich.edu/wiki/Polymutt Polymutt], which ignores linkage disequilibrium information but can handle more complex pedigrees.

=== Download ===

−

Before downloading the program, we appreciate if you could email [mailto:weichen.mich@gmail.com weichen.mich@gmail.com] with a little descriptive information (e.g. Affiliation, depth, the number of samples and family structure).

+

Before downloading the program, we appreciate if you could email [mailto:weichen.mich@gmail.com weichen.mich@gmail.com] (Subject: TrioCaller, with/without a little descriptive information (e.g. Affiliation, depth, the number of samples and family structure). We will notify you if there is any update. 

+

'''A recent extension of TrioCaller: [http://genome.sph.umich.edu/wiki/FamLDCaller FamLDCaller] is coming soon with major updates (better processing function, handling general families and reference panels). Please try the beta version below. Contact weichen.mich@gmail.com for any questions.'''

+

[[Binary file:]] [http://www.pitt.edu/~wec47/Files/FamLDCaller FamLDCaller]. [Last update: 08/15/2014]

+

−

<br> Binary file only: [http://~~www~~.sph.umich.edu~~/csg~~/weich/TrioCaller.06262012.binary.tgz TrioCaller.06262012.binary.tgz].

+

'''TrioCaller''' : the version we used in the paper.

+

<br>

+

[[Binary file only:]] [http://csg.sph.umich.edu/weich/TrioCaller.06262012.binary.tgz TrioCaller.06262012.binary.tgz].

−

Binary file with example datasets : [http://~~www~~.sph.umich.edu~~/csg~~/weich/TrioCaller.06262012.tgz TrioCaller.06262012.tgz].

+

[[Binary file with example datasets :]] [http://csg.sph.umich.edu/weich/TrioCaller.06262012.tgz TrioCaller.06262012.tgz].

[http://genome.sph.umich.edu/wiki/TrioCaller:Archive Archive].

+

'''

+

== An example from sequence data to genotypes ==

+

'''

The example dataset demonstrated here is also included. Our dataset consists of 40 individuals, including 10 parent-offspring trios and 10 unrelated individuals. The average sequence depth is ~3x. README.txt describes the structure of the package. Pipeline.csh (C shell) and pipeline.bash (bash shell) are two scripts for you to run all commands listed here in batch.

−

To conserve time and disk-space, our analysis will focus on a small region on chromosome 20 around position 2,000,000. We will first map reads for a single individual (labeled SAMPLE1). Then we combine the results with mapped reads from all individuals to generate a list of polymorphic sites and estimate accurate genotypes at each of these sites.

+

To conserve time and disk-space, our analysis will focus on a small region on chromosome 20 around position 2,000,000. We will first map reads for a single individual (labeled SAMPLE1). Then we combine the results with mapped reads from all individuals to generate a list of polymorphic sites and estimate accurate genotypes at each of these sites.

=== Required Software ===

Line 37: Line 52:

#Rebuild the reference

−

bwa index -a is ref/human_g1k_v37_chr20.fa

+

bin/bwa index -a is ref/human_g1k_v37_chr20.fa

</source>

Line 44: Line 59:

Next, we will use BWA to find the most likely sequence location for each read using the <code>bwa aln</code> command. This command requires two parameters, one corresponding to the reference genome, the other corresponding to a fastq file containing reads to be mapped.

−

bwa aln -q 15 ref/human_g1k_v37_chr20.fa fastq/SAMPLE1.fastq > bwa.sai/SAMPLE1.sai

+

bin/bwa aln -q 15 ref/human_g1k_v37_chr20.fa fastq/SAMPLE1.fastq > bwa.sai/SAMPLE1.sai

The file SAMPLE1.fastq contains DNA sequence reads for sample SAMPLE1.

Line 54: Line 69:

The .sai alignment format is specific to BWA, so the first thing to do is to convert the alignment to a more standard format that will be compatible with downstream analysis tools. We can do this with a combination of the <code>bwa samse</code> command and <code>samtools view</code> and <code>samtoosl sort</code> commands.

−

bwa samse -r "@RG\tID:ILLUMINA\tSM:SAMPLE1" ref/human_g1k_v37_chr20.fa bwa.sai/SAMPLE1.sai fastq/SAMPLE1.fastq | \

+

bin/bwa samse -r "@RG\tID:ILLUMINA\tSM:SAMPLE1" ref/human_g1k_v37_chr20.fa bwa.sai/SAMPLE1.sai fastq/SAMPLE1.fastq | \

−

samtools view -uhS - | samtools sort -m 2000000000 - bams/SAMPLE1

+

bin/samtools view -uhS - | bin/samtools sort -m 2000000000 - bams/SAMPLE1

You can check the use of parameters in the bwa manual. The result BAM file uses a compact binary format to represent the

Line 61: Line 76:

of the file using the <code>samtools view</code> command, like so:

−

samtools view bams/SAMPLE1.bam | more

+

bin/samtools view bams/SAMPLE1.bam | more

The text representation of the alignment produced by <code>samtools view</code> describes

Line 85: Line 100:

genome location. We do this with the <code>samtools index</code> command, like so:

−

samtools index bams/SAMPLE1.bam

+

bin/samtools index bams/SAMPLE1.bam

=== Browsing Alignment Results ===

Line 94: Line 109:

starting at position 2,100,000 on chromosome 20, we could run:

−

samtools tview bams/SAMPLE1.bam ref/human_g1k_v37_chr20.fa

+

bin/samtools tview bams/SAMPLE1.bam ref/human_g1k_v37_chr20.fa

Then, type "g 20:2100000"

Line 108: Line 123:

foreach file (`ls fastq/SAMPLE*.fastq | cut -f 2 -d '/' | cut -f 1 -d '.'`)

echo $file

−

bwa aln -q 15 ref/human_g1k_v37_chr20.fa fastq/$file.fastq > bwa.sai/$file.sai

+

bin/bwa aln -q 15 ref/human_g1k_v37_chr20.fa fastq/$file.fastq > bwa.sai/$file.sai

−

bwa samse -r "@RG\tID:ILLUMINA\tSM:$file" ref/human_g1k_v37_chr20.fa bwa.sai/$file.sai fastq/$file.fastq | \

+

bin/bwa samse -r "@RG\tID:ILLUMINA\tSM:$file" ref/human_g1k_v37_chr20.fa bwa.sai/$file.sai fastq/$file.fastq | \

−

samtools view -uhS - | bin/samtools sort -m 2000000000 - bams/$file

+

bin/samtools view -uhS - | bin/samtools sort -m 2000000000 - bams/$file

−

samtools index bams/$file.bam

+

bin/samtools index bams/$file.bam

end

Line 118: Line 133:

for file in `ls fastq/SAMPLE*.fastq | cut -f 2 -d '/' | cut -f 1 -d '.'`;

do echo $file;

−

bwa aln -q 15 ref/human_g1k_v37_chr20.fa fastq/$file.fastq > bwa.sai/$file.sai;

+

bin/bwa aln -q 15 ref/human_g1k_v37_chr20.fa fastq/$file.fastq > bwa.sai/$file.sai;

−

bwa samse -r "@RG\tID:ILLUMINA\tSM:$file" ref/human_g1k_v37_chr20.fa bwa.sai/$file.sai fastq/$file.fastq | \

+

bin/bwa samse -r "@RG\tID:ILLUMINA\tSM:$file" ref/human_g1k_v37_chr20.fa bwa.sai/$file.sai fastq/$file.fastq | \

−

samtools view -uhS - | samtools sort -m 2000000000 - bams/$file;

+

bin/samtools view -uhS - | samtools sort -m 2000000000 - bams/$file;

−

samtools index bams/$file.bam;

+

bin/samtools index bams/$file.bam;

done

Line 137: Line 152:

−

samtools mpileup -Iuf ref/human_g1k_v37_chr20.fa bams/SAMPLE*bam | bcftools view -bvcg - > result/chr20.mpileup.bcf

+

bin/samtools mpileup -Iuf ref/human_g1k_v37_chr20.fa bams/SAMPLE*bam | bin/bcftools view -bvcg - > result/chr20.mpileup.bcf

−

bcftools view result/chr20.mpileup.bcf > result/chr20.mpileup.vcf

+

bin/bcftools view result/chr20.mpileup.bcf > result/chr20.mpileup.vcf

Line 215: Line 230:

To complete our example analysis, we could run:

−

TrioCaller --vcf result/chr20.mpileup.vcf --pedfile ped/triocaller.ped --states 50 --rounds 10 --prefix result/chr20.triocaller

+

bin/TrioCaller --vcf result/chr20.mpileup.vcf --pedfile ped/triocaller.ped --states 50 --rounds 10 --prefix result/chr20.triocaller

The format of output file is same as the input file. Again, you can review the contents of the updated VCF file using the more command:

Line 225: Line 240:

If you have any question or comment, feel free to contact Wei Chen at [mailto:weichen.mich@gmail.com weichen.mich@gmail.com] or Goncalo Abecasis at [mailto:goncalo@umich.edu goncalo@umich.edu]

−

== ~~Citation~~ ==

+

== Citation ==

−

Chen W, Li B, Zeng Z, Sanna S, Sidore C, Busonero F, Kang HM, Li Y, Abecasis GR. ~~Genome Res.~~ Genotype calling and haplotyping in parent-offspring trios. ~~2012 Nov 27~~. ~~[Epub ahead of print]~~

+

Chen W, Li B, Zeng Z, Sanna S, Sidore C, Busonero F, Kang HM, Li Y, Abecasis GR.  Genotype calling and haplotyping in parent-offspring trios. Genome Res. 2013 Jan;23(1):142-51 [http://genome.cshlp.org/content/early/2012/11/26/gr.142455.112.long LINK]

−

[http://genome.cshlp.org/content/early/2012/11/26/gr.142455.112.long LINK]

Weich

533

edits

Changes

TrioCaller (view source)

Revision as of 18:01, 19 May 2015

Navigation menu

Page actions

Page actions

Personal tools

quick links

teaching

Navigation

Search

Tools