http://genome.sph.umich.edu/w/api.php?action=feedcontributions&user=Weich&feedformat=atomGenome Analysis Wiki - User contributions [en]2024-03-28T18:18:08ZUser contributionsMediaWiki 1.35.9http://genome.sph.umich.edu/w/index.php?title=TrioCaller&diff=13372TrioCaller2015-05-19T22:01:47Z<p>Weich: /* Download */</p>
<hr />
<div>== Introduction ==<br />
<br />
We will illustrate how TrioCaller works on sequence data that include trios and unrelated samples. We will walk through all the steps needed to move from raw sequence data to called genotypes. <br />
If you are new to sequence data, please review every step. If you are experienced, you may jump directly to the [http://genome.sph.umich.edu/wiki/TrioCaller#Genotype_Refinement_Using_Linkage_Disequilibrium_Information_.28TrioCaller.29 TrioCaller]-specific section. <br />
<br />
We will start with a set of short sequence reads and associated base quality scores (stored in a fastq file), find the most likely genomic location for each read (producing a BAM file), generate an initial list of polymorphic sites and genotypes (stored in a VCF file) and use haplotype information to refine these genotypes (resulting in an updated VCF file).<br />
<br />
=== Note ===<br />
<br />
If you are interested in ''de novo'' mutations or are working on one or two families with deep sequence data (>30X), you should first consider our sister program, [http://genome.sph.umich.edu/wiki/Polymutt Polymutt], which ignores linkage disequilibrium information but can handle more complex pedigrees.<br />
<br />
=== Download ===<br />
<br />
Before downloading the program, we would appreciate an email to [mailto:weichen.mich@gmail.com weichen.mich@gmail.com]&nbsp;(Subject: TrioCaller), optionally with a little descriptive information (e.g. affiliation, sequencing depth, number of samples, and family structure). We will notify you if there is any update.<br />
<br />
'''A recent extension of TrioCaller, [http://genome.sph.umich.edu/wiki/FamLDCaller FamLDCaller], is coming soon with major updates (better file processing, support for general families, and reference panels). Please try the beta version below. Contact weichen.mich@gmail.com with any questions.''' <br />
<br />
'''Binary file:''' [http://www.pitt.edu/~wec47/Files/FamLDCaller FamLDCaller]. [Last update: 08/15/2014]<br />
<br />
<br />
'''TrioCaller''' : the version we used in the paper. <br />
<br> <br />
'''Binary file only:''' [http://csg.sph.umich.edu/weich/TrioCaller.06262012.binary.tgz TrioCaller.06262012.binary.tgz]. <br />
<br />
'''Binary file with example datasets:''' [http://csg.sph.umich.edu/weich/TrioCaller.06262012.tgz TrioCaller.06262012.tgz]. <br />
<br />
[http://genome.sph.umich.edu/wiki/TrioCaller:Archive Archive]. <br />
<br />
== An example from sequence data to genotypes ==<br />
<br />
The example dataset demonstrated here is included in the download package. It consists of 40 individuals: 10 parent-offspring trios and 10 unrelated individuals, sequenced to an average depth of ~3x. README.txt describes the structure of the package. pipeline.csh (C shell) and pipeline.bash (bash shell) are two scripts that run all of the commands listed here in batch. <br />
<br />
To conserve time and disk-space, our analysis will focus on a small region on chromosome 20 around position 2,000,000. We will first map reads for a single individual (labeled SAMPLE1). Then we combine the results with mapped reads from all individuals to generate a list of polymorphic sites and estimate accurate genotypes at each of these sites.<br />
<br />
=== Required Software ===<br />
<br />
In addition to TrioCaller, you will need BWA ([http://bio-bwa.sourceforge.net available from Sourceforge]) and samtools ([http://samtools.sourceforge.net also from Sourceforge]) installed to run this exercise. The examples were tested with bwa 0.6.1, samtools 0.1.18, and TrioCaller 0.1.1; we expect newer versions to work as well. We assume all executables are in your path.<br />
<br />
== Building an Index for Short Read Alignment ==<br />
<br />
To quickly place short reads along the genome, BWA and other read mappers typically build a word index for the genome. This index lists the locations of short words along the genome and<br />
can be used to seed and then extend candidate alignments.<br />
<br />
The sequence index is typically not compatible across different BWA versions. To rebuild the sequence index, issue the following commands:<br />
<br />
<source lang="bash"><br />
#Remove any earlier reference files<br />
rm ref/human_g1k_v37_chr20.fa.*<br />
<br />
#Rebuild the reference<br />
bin/bwa index -a is ref/human_g1k_v37_chr20.fa<br />
</source><br />
<br />
== Mapping Reads to The Genome ==<br />
<br />
Next, we will use BWA to find the most likely sequence location for each read using the <code>bwa aln</code> command. This command requires two parameters, one corresponding to the reference genome, the other corresponding to a fastq file containing reads to be mapped. <br />
<br />
bin/bwa aln -q 15 ref/human_g1k_v37_chr20.fa fastq/SAMPLE1.fastq > bwa.sai/SAMPLE1.sai<br />
<br />
The file SAMPLE1.fastq contains DNA sequence reads for sample SAMPLE1. <br />
<br />
A fastq file consists of a series of multi-line records. Each record starts with a read name, followed by a DNA sequence, a separator line, and a set of per-base quality scores. Base quality scores estimate the probability of error at each sequenced base (a base quality of 10 denotes a 10% error probability, base quality 20 denotes 1% and base quality 30 denotes 0.1%). These error probabilities are each encoded in a single character for compactness and can be decoded using an [http://www.google.com/search?q=ascii+table ASCII table] (simply look up the ASCII code for each quality character and subtract 33 to get the base quality). By inspecting the fastq file you should be able to learn the length of the reads being mapped and their base qualities. For example, try to figure out whether base quality is typically higher at the start or end of each read.<br />
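As a quick sanity check of the encoding (a standalone sketch, not part of the pipeline), the following decodes a single quality character:<br />
<br />
```shell
# Decode a single Phred+33 quality character to a numeric base quality.
# 'I' has ASCII code 73, so its base quality is 73 - 33 = 40
# (error probability 10^(-40/10) = 0.0001).
printf 'I' | od -An -tu1 | awk '{ print $1 - 33 }'
```
<br />
Here the character 'I' decodes to base quality 40.<br />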
<br />
=== Converting Alignments to BAM format ===<br />
<br />
The .sai alignment format is specific to BWA, so the first thing to do is to convert the alignment to a more standard format that will be compatible with downstream analysis tools. We can do this with a combination of the <code>bwa samse</code>, <code>samtools view</code> and <code>samtools sort</code> commands.<br />
<br />
bin/bwa samse -r "@RG\tID:ILLUMINA\tSM:SAMPLE1" ref/human_g1k_v37_chr20.fa bwa.sai/SAMPLE1.sai fastq/SAMPLE1.fastq | \<br />
bin/samtools view -uhS - | bin/samtools sort -m 2000000000 - bams/SAMPLE1<br />
<br />
You can check the meaning of the parameters in the bwa manual. The resulting BAM file uses a compact binary format to represent the <br />
alignment of each short read to the genome. You can view the contents<br />
of the file using the <code>samtools view</code> command, like so:<br />
<br />
bin/samtools view bams/SAMPLE1.bam | more<br />
<br />
The text representation of the alignment produced by <code>samtools view</code> describes<br />
the alignment of one read per line. The most interesting fields are column 1 (the read <br />
name), columns 3 and 4 (the alignment position), column 6 (the CIGAR string, describing <br />
any gaps in the alignment), and columns 10 and 11 (with the sequence and quality scores). In this representation, all alignments are automatically converted to the forward strand.<br />
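To make the column layout concrete, here is a sketch that parses one made-up SAM record (the read name, position, and sequence are invented for illustration):<br />
<br />
```shell
# A single made-up SAM record; fields are tab-separated.
line=$(printf 'read1\t0\t20\t2100000\t37\t5M\t*\t0\t0\tACGTA\tIIIII')

# Print read name (1), chromosome (3), position (4), CIGAR (6), sequence (10).
echo "$line" | awk -F'\t' '{ print $1, $3, $4, $6, $10 }'
```
<br />
This prints the read name, its mapped location on chromosome 20, a gapless 5-base CIGAR, and the sequence itself.<br />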
<br />
=== Indexing the BAM file ===<br />
<br />
<!--<br />
Although the current file contains all necessary information about reads and their<br />
genomic locations, it is missing some auxiliary information that BAM files typically<br />
contain to help describe their contents (for example, to specify that this file contains<br />
DNA sequence reads for sample NA20589). So, the very next step is to add this information<br />
to the file:<br />
<br />
samtools reheader bams/NA20589.header bams/noheader.NA20589.bam > bams/NA20589.bam<br />
<br />
!--><br />
<br />
If you have reached this far, rejoice! The mapping process is almost done. We will now create <br />
an index for the file, which makes it convenient to quickly extract reads from any <br />
genomic location. We do this with the <code>samtools index</code> command, like so:<br />
<br />
bin/samtools index bams/SAMPLE1.bam<br />
<br />
=== Browsing Alignment Results ===<br />
<br />
You can now view the contents of the alignment at any location using the <code>samtools view</code><br />
and <code>samtools tview</code> commands. While <code>tview</code> generates prettier output,<br />
it is not compatible with all screens. For example, to view reads overlapping <br />
position 2,100,000 on chromosome 20, we could run:<br />
<br />
bin/samtools tview bams/SAMPLE1.bam ref/human_g1k_v37_chr20.fa<br />
<br />
Then, type "g 20:2100000"<br />
<br />
So let's recap: we have mapped reads to the genome, converted them from a BWA-specific format to a <br />
format used by many different programs, and sorted and indexed the results.<br />
In most cases, the next step would be to remove duplicate reads and to ensure that base quality scores are properly calibrated. To save time, we'll skip those steps here.<br />
<br />
So far, we have only finished read mapping for one sample, SAMPLE1. We need to repeat this step for the other samples (SAMPLE2 - SAMPLE40). You can try something like:<br />
<br />
For c shell<br />
<br />
foreach file (`ls fastq/SAMPLE*.fastq | cut -f 2 -d '/' | cut -f 1 -d '.'`)<br />
echo $file<br />
bin/bwa aln -q 15 ref/human_g1k_v37_chr20.fa fastq/$file.fastq > bwa.sai/$file.sai<br />
bin/bwa samse -r "@RG\tID:ILLUMINA\tSM:$file" ref/human_g1k_v37_chr20.fa bwa.sai/$file.sai fastq/$file.fastq | \<br />
bin/samtools view -uhS - | bin/samtools sort -m 2000000000 - bams/$file<br />
bin/samtools index bams/$file.bam<br />
end<br />
<br />
For bash shell<br />
<br />
for file in `ls fastq/SAMPLE*.fastq | cut -f 2 -d '/' | cut -f 1 -d '.'`;<br />
do echo $file;<br />
bin/bwa aln -q 15 ref/human_g1k_v37_chr20.fa fastq/$file.fastq > bwa.sai/$file.sai;<br />
bin/bwa samse -r "@RG\tID:ILLUMINA\tSM:$file" ref/human_g1k_v37_chr20.fa bwa.sai/$file.sai fastq/$file.fastq | \<br />
bin/samtools view -uhS - | bin/samtools sort -m 2000000000 - bams/$file;<br />
bin/samtools index bams/$file.bam;<br />
done<br />
<br />
<br />
Once we have finished the read mapping step and generated BAM files for all samples, we can proceed to variant calling and genotype inference.<br />
<br />
== Calling variants and Inferring genotypes ==<br />
<br />
=== Initial set of variant calls ===<br />
<br />
You probably thought the initial mapping process was quite convoluted ... you'll be glad to know that<br />
the next few steps are much simpler.<br />
<br />
The first thing we'll do is use samtools to generate an initial list of variant sites, using the <code>mpileup</code> command. This command looks at the bases aligned to each location and flags locations that are likely to vary. By default, the results are stored in a BCF file, which can be converted into the more widely used VCF format using bcftools (a companion set of tools distributed with samtools).<br />
<br />
<br />
bin/samtools mpileup -Iuf ref/human_g1k_v37_chr20.fa bams/SAMPLE*bam | bin/bcftools view -bvcg - > result/chr20.mpileup.bcf<br />
<br />
bin/bcftools view result/chr20.mpileup.bcf > result/chr20.mpileup.vcf<br />
<br />
<br />
The [http://www.1000genomes.org/node/101 VCF format] is a simple text format. It starts with several header lines, which all begin with the characters '##', followed by a single line per marker that provides both summary information about the marker and genotypes for each individual. You can review the contents of the VCF file using the 'more' command:<br />
<br />
more result/chr20.mpileup.vcf<br />
<br />
Here are some questions for you to investigate:<br />
<br />
* How many variant sites were detected in this dataset? Try a command like this one:<br />
<br />
grep -vE ^# result/chr20.mpileup.vcf | wc -l<br />
<br />
(The grep command line excludes all lines beginning with # and then the wc command counts the number of lines in the file).<br />
<br />
<!-- <br />
* How many variant sites are estimated to be singletons?<br />
!--><br />
<br />
<!--<br />
<br />
Wei: I commented this out because generating GLF requires an old version of samtools that most people don't have. Goncalo<br />
<br />
=== Alternative way to call variants ===<br />
Bcftools assumes all samples are unrelated. The better way to call variants is to consider family constraints imposed by parent-offspring trio. An alternative approach for current step is to use<br />
another program [http://genome.sph.umich.edu/wiki/Polymutt Polymutt]. It consists of two steps, but the input [bam file] and output files [vcf file] are same as above. For more information, <br />
click the link. <br />
<br />
Convert bam files to glf files (It is a single line of command)<br />
<br />
bin/samtools view -bh bams/SAMPLE1.bam | bin/samtools calmd -Abr - ref/human_g1k_v37_chr20.fa 2 - | \<br />
bin/samtools mpileup -g -f ref/human_g1k_v37_chr20.fa - > bams/SAMPLE1.glf<br />
<br />
Take glf files and output vcf files.<br />
<br />
bin/polymutt -p polymutt.ped -d polymutt.dat -g polymutt.glfindex --vcf result/chr20.polymutt.vcf<br />
<br />
!--><br />
<br />
=== Genotype Refinement Using Linkage Disequilibrium Information (TrioCaller) ===<br />
<br />
The initial set of genotype calls is generated by examining a single individual at a time. These calls are typically quite good for deep sequencing data, but much less accurate for low-pass sequence data. In either case, they can be greatly improved by models that combine information across sites and individuals and consider the constraints imposed by parent-offspring trios. <br />
<br />
Note: The current version only supports SNP data, so please '''filter out indels''' before running TrioCaller. It supports the VCF 4.0 and 4.1 formats, '''except that missing trailing fields must not be dropped''' (e.g. use the complete missing notation ./.:.:.:.,.,. rather than ./. for the genotype field).<br />
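One way to filter out indels is a simple awk pass that keeps only records whose REF and ALT fields are single bases (a sketch run on a made-up two-record VCF; note that it also drops multi-allelic sites such as ALT "A,C"):<br />
<br />
```shell
# Build a toy VCF with one SNP and one indel (made-up records).
printf '##fileformat=VCFv4.1\n#CHROM\tPOS\tID\tREF\tALT\n20\t100\t.\tA\tT\n20\t200\t.\tA\tAT\n' > toy.vcf

# Keep header lines plus records whose REF and ALT are both single bases.
awk 'BEGIN { FS = "\t" }
     /^#/ { print; next }
     length($4) == 1 && length($5) == 1 { print }' toy.vcf
```
<br />
Only the SNP record at position 100 survives the filter; dedicated tools (e.g. vcftools) can do the same on real files.<br />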
<br />
Here is a summary of the TrioCaller command line options (these are also listed whenever you run the program with no parameters):<br />
<br />
<source lang="text"><br />
Available Options<br />
Shotgun Sequences: --vcf [], --pedfile [] <br />
Markov Sampler: --seed [], --burnin [], --rounds [] <br />
Haplotyper: --states [], --errorRate [], --compact<br />
Phasing: --randomPhase , --inputPhased, --refPhased<br />
Output Files: --prefix [], --phase, --interimInterval []<br />
<br />
<br />
Explanation of Options<br />
--vcf: Standard VCF file (4.0 and above). <br />
--pedfile: Pedigree file in MERLIN format.<br />
--seed: Seed for sampling, default 123456.<br />
--burnin: The number of rounds ignored at the beginning of sampling.<br />
--rounds: The total number of iterations.<br />
--states: The number of haplotypes used in the state space. The default is the maximum number: 2*(number of founders - 1).<br />
--errorRate: The pre-defined base error rate. Default 0.01.<br />
--randomPhase: The initial haplotypes are inferred from the single marker. Default option.<br />
--inputPhased: The initial haplotypes are directly from input VCF file (with "|" as separator in the genotype column).<br />
--refPhased: The initial haplotypes are built on reference alleles from VCF file.<br />
--prefix: The prefix of output file <br />
--interimInterval: The number of rounds for interim outputs<br />
</source><br />
<br />
Note: The pedigree files require complete trio structures (all three members of each trio must exist in the file). For a parent-offspring pair, create a "fake" parent in the pedigree file or code them as unrelated individuals. The order of the names in the pedigree file does NOT need to be consistent with that in the .vcf file. The computation will be intensive if the number of samples is large; <br />
you can use the "--states" option to reduce the computational cost (e.g. start with "--states 50"). <br />
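As a sketch, a MERLIN-format pedigree file (family ID, individual ID, father, mother, sex) describing one trio plus one unrelated individual might look like this; the IDs below are hypothetical, not the ones shipped with the example package:<br />
<br />
```text
FAM1  DAD1  0     0     1
FAM1  MOM1  0     0     2
FAM1  KID1  DAD1  MOM1  2
FAM2  IND1  0     0     1
```
<br />
A "0" in the father or mother column marks a founder or an unrelated individual.<br />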
<br />
To complete our example analysis, we could run:<br />
<br />
bin/TrioCaller --vcf result/chr20.mpileup.vcf --pedfile ped/triocaller.ped --states 50 --rounds 10 --prefix result/chr20.triocaller<br />
<br />
The format of the output file is the same as that of the input file. Again, you can review the contents of the updated VCF file using the more command:<br />
<br />
more result/chr20.triocaller.vcf<br />
<br />
All right. Congratulations! You have come to the end and learned basic skills for accurate genotype calling in trios.<br />
<br />
If you have any questions or comments, feel free to contact Wei Chen at [mailto:weichen.mich@gmail.com weichen.mich@gmail.com] or Goncalo Abecasis at [mailto:goncalo@umich.edu goncalo@umich.edu].<br />
<br />
== Citation ==<br />
<br />
Chen W, Li B, Zeng Z, Sanna S, Sidore C, Busonero F, Kang HM, Li Y, Abecasis GR. &nbsp;Genotype calling and haplotyping in parent-offspring trios. Genome Res.&nbsp;2013 Jan;23(1):142-51&nbsp;[http://genome.cshlp.org/content/early/2012/11/26/gr.142455.112.long LINK]</div>Weichhttp://genome.sph.umich.edu/w/index.php?title=RareSimu&diff=11386RareSimu2014-08-28T03:28:50Z<p>Weich: Created page with "Genetic Model-based Simulator [GMS] is an efficient c++ program for simulating case control data sets based on genetic models. The input is a pool of haplotypes and a text fil..."</p>
<hr />
<div>Genetic Model-based Simulator (GMS) is an efficient C++ program for simulating case-control datasets based on genetic models. The input is a pool of haplotypes and a text file specifying the model. <br />
The output is a set of simulated datasets in the format of a Merlin ped file. <br />
<br />
== Basic Usage Example ==<br />
<br />
In a typical command line, a few options need to be specified together with the input files. <br />
Here is an example of how GMS works:<br />
<br />
./GMS --hapfile test.hap --snplist test.lst --model model.heter.txt --f0 0.01 \<br />
      --nrep 100 --ncase 250 --nctrl 250 --causal --prefix tmp<br />
<br />
== Command Line Options ==<br />
<br />
=== Basic Options ===<br />
<br />
--hapfile a pool of simulated or real haplotypes, one chromosome per row<br />
--snplist snp names in the order of haplotypes in hapfile, one snp per row<br />
--model a model file specifying genetic models, see below for details<br />
--nrep the number of replications<br />
--seed seed for random number generator<br />
--ncase the number of cases in each replicate<br />
--nctrl the number of controls in each replicate<br />
--f0 overall baseline prevalence<br />
--prefix prefix of output files (e.g. prefix.rep1.ped, prefix.rep2.ped)<br />
--causal only generate causal SNPs in the output pedigree file<br />
<br />
<br />
=== Model File Annotation ===<br />
The model file includes one header line followed by multiple rows. Each row corresponds to a set of <br />
SNPs with a desired frequency range and relative risk (RR) or odds ratio (OR).<br />
<br />
1. Heterogeneity Model<br />
<br />
a) COUNT FREQ_MIN FREQ_MAX RR1 RR2<br />
<br />
b) FRACTION FREQ_MIN FREQ_MAX RR1 RR2<br />
<br />
2. Logistic Model<br />
<br />
a) COUNT FREQ_MIN FREQ_MAX OR1 OR2<br />
<br />
b) FRACTION FREQ_MIN FREQ_MAX OR1 OR2<br />
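For instance, a hypothetical heterogeneity-model file requesting 5 causal SNPs drawn from the frequency range [0.001, 0.01], with genotype relative risks of 2 and 4, might look like the following (the values are invented for illustration):<br />
<br />
```text
COUNT FREQ_MIN FREQ_MAX RR1 RR2
5     0.001    0.01     2   4
```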
<br />
== How It Works ==<br />
There are two underlying models. Disease status follows a Bernoulli distribution with probability P, defined as follows. <br />
<br />
1. Heterogeneity Model<br />
<math> P(D | (AA,AA,...,AA)) = f_0 </math><br />
<br />
<math> P = \sum_{i=1}^N P(D|x_i) </math><br />
<br />
<br />
2. Logistic Model<br />
<br />
<math>logit(y) = \beta_0 + \sum_{i=1}^{N}\beta_i\times x_i</math><br />
<br />
<math> P = \frac{e^{\beta_0 + \sum_{i=1}^{N}\beta_i\times x_i}}{1+e^{\beta_0 + \sum_{i=1}^{N}\beta_i\times x_i}}</math><br />
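As a numeric sketch of the logistic model (values invented for illustration): with beta_0 = log(0.01/0.99), so that carriers of no risk alleles have disease probability 0.01, and a single causal SNP with beta_1 = log(2), a carrier of one risk allele (x_1 = 1) has:<br />
<br />
```shell
# Evaluate P = exp(lp) / (1 + exp(lp)) for lp = b0 + b1 * x1,
# with b0 = log(0.01/0.99), b1 = log(2), x1 = 1 (illustrative values).
awk 'BEGIN { b0 = log(0.01 / 0.99); b1 = log(2); x1 = 1;
             lp = b0 + b1 * x1;
             printf "%.4f\n", exp(lp) / (1 + exp(lp)) }'
```
<br />
This prints 0.0198, i.e. roughly double the 1% baseline, as expected for an odds ratio of 2 at a low baseline prevalence.<br />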
<br />
== Download ==<br />
<br />
The current version is available for download from http://www.sph.umich.edu/csg/weich/GMS.tar.gz<br />
<br />
== TODO ==<br />
1. Support quantitative traits.<br />
<br />
2. Support family structures.<br />
<br />
3. Support more "reasonable" models.</div>Weichhttp://genome.sph.umich.edu/w/index.php?title=FamLDCaller&diff=11352FamLDCaller2014-08-18T14:05:55Z<p>Weich: </p>
<hr />
<div><br />
A general guideline for genotype calling in families: <br />
<br />
Polymutt: small to large pedigrees, modest to high depth.<br />
FamLDCaller: many small pedigrees, low to modest depth.<br />
<br />
FamLDCaller is an extension of [http://genome.sph.umich.edu/wiki/TrioCaller TrioCaller] that handles nuclear and general family structures. <br />
<br />
'''Download''': <br />
'''Binary file:''' [http://www.pitt.edu/~wec47/Files/FamLDCaller FamLDCaller]. [Last update: 08/15/2014]<br />
<br />
More details will come soon. Please contact Wei Chen at weichen.mich@gmail.com for any questions. <br />
<br />
<br />
<br />
Major updates: <br />
<br />
1. Updated the algorithm to allow nuclear and multi-generational pedigrees.<br />
<br />
2. Added a feature to use a reference panel.<br />
<br />
3. More flexible loading functions for VCF files (no need to remove non-SNP variants).<br />
<br />
<br />
The initial set of genotype calls is generated by examining a single individual at a time. These calls are typically quite good for deep sequencing data, but much less accurate for low-pass sequence data. In either case, they can be greatly improved by models that combine information across sites and individuals and consider the constraints imposed by parent-offspring trios. <br />
<br />
Here is a summary of the FamLDCaller command line options (these are also listed whenever you run the program with no parameters):<br />
<br />
<source lang="text"><br />
Available Options<br />
Shotgun Sequences: --vcf [], --pedfile [] <br />
Markov Sampler: --seed [], --burnin [], --rounds [] <br />
Haplotyper: --states [], --errorRate []<br />
Phasing: --randomPhase , --inputPhased, --refPhased<br />
Output Files: --prefix [], --phase, --interimInterval []<br />
<br />
<br />
Explanation of Options<br />
--vcf: Standard VCF file (4.0 and above). <br />
--pedfile: Pedigree file in MERLIN format.<br />
--seed: Seed for sampling, default 123456.<br />
--burnin: The number of rounds ignored at the beginning of sampling.<br />
--rounds: The total number of iterations.<br />
--states: The number of haplotypes used in the state space. The default is the maximum number: 2*(number of founders - 1).<br />
--errorRate: The pre-defined base error rate. Default 0.01.<br />
--randomPhase: The initial haplotypes are inferred from the single marker. Default option.<br />
--inputPhased: The initial haplotypes are directly from input VCF file (with "|" as separator in the genotype column).<br />
--refPhased: The initial haplotypes are built on reference alleles from VCF file.<br />
--prefix: The prefix of output file <br />
--interimInterval: The number of rounds for interim outputs<br />
</source><br />
<br />
Note: The pedigree files require complete family structures (both parents must exist in the pedigree file; e.g. for a parent-offspring pair, create a "fake" parent in the pedigree file or code them as unrelated individuals). The order of the names in the pedigree file does NOT need to be consistent with that in the .vcf file. The computation will be intensive if the number of samples is large; <br />
you can use the "--states" option to reduce the computational cost (e.g. start with "--states 50"). <br />
<br />
To complete our example analysis, we could run:<br />
<br />
FamLDCaller --vcf test.vcf --pedfile test.ped --states 50 --rounds 10 --prefix test.famldcaller</div>Weichhttp://genome.sph.umich.edu/w/index.php?title=FamLDCaller&diff=11351FamLDCaller2014-08-18T14:05:27Z<p>Weich: </p>
<hr />
<div><br />
A general guideline for genotyping calling in families. <br />
<br />
Polymutt: Small to big pedigrees, modest to high depth<br />
FamLDCaller: many small pedigrees, low to modest depths.<br />
<br />
FamLDCaller is an extension of [http://genome.sph.umich.edu/wiki/TrioCaller TrioCaller] to handle nuclear and general family structure. <br />
<br />
'''Download''': <br />
[[Binary file:]] [http://www.pitt.edu/~wec47/Files/FamLDCaller FamLDCaller]. [Last update: 08/15/2014]<br />
<br />
Please contact Wei Chen at weichen.mich@gmail.com for any questions. <br />
<br />
<br />
<br />
Major updates. <br />
<br />
1. Update the algorithm to allow nuclear and multi-generational pedigrees<br />
<br />
2. Add a feature to use reference panel<br />
<br />
3. More flexible loading functions for VCF files (no need to remove non-SNP variant)<br />
<br />
<br />
The initial set of genotype calls is generated examining a single individual at a time. These calls are typically quite good for deep sequencing data, but much less accurate for low pass sequence data. In either case, they can be greatly improved by models that combine information across sites and individuals and consider the contraints imposed by parent-offspring trios. <br />
<br />
Here is a summary of the FamLDCaller command line options (these are also listed whenever you run the program with no parameters):<br />
<br />
<source lang="text"><br />
Available Options<br />
Shotgun Sequences: --vcf [], --pedfile [] <br />
Markov Sampler: --seed [], --burnin [], --rounds [] <br />
Haplotyper: --states [], --errorRate []<br />
Phasing: --randomPhase , --inputPhased, --refPhased<br />
Output Files: --prefix [], --phase, --interimInterval []<br />
<br />
<br />
Explanation of Options<br />
--vcf: Standard VCF file (4.0 and above). <br />
--pedfile: Pedigree file in MERLIN format.<br />
--seed: Seed for sampling, default 123456.<br />
--burnin: The number of rounds ignored at the beginning of sampling.<br />
--rounds: The total number of iterations.<br />
--states: The number of haplotyes used in the state space. The default is the maximum number: 2*(number of founders -1).<br />
--errorRate: The pre-defined base error rate. Default 0.01.<br />
--randomPhase: The initial haplotypes are inferred from the single marker. Default option.<br />
--inputPhased: The initial haplotypes are directly from input VCF file (with "|" as separator in the genotype column).<br />
--refPhased: The initial haplotypes are built on reference alleles from VCF file.<br />
--prefix: The prefix of output file <br />
--interimInterval: The number of rounds for interim outputs<br />
</source><br />
<br />
Note: The pedigree files require complete family structures (both parents must exist in the pedigree file, e.g. for parent-offspring pair, create a "fake" parent in the pedigree file or code them as unrelated individuals). The order of the names in the pedigree file is NOT necessary to be consistent with that in .vcf file. The computation will be intensive if the number of samples are large. <br />
You can use option "--states" to reduce the computation cost (e.g. start with "--states 50") <br />
<br />
To complete our example analysis, we could run:<br />
<br />
FamLDCaller --vcf test.vcf --pedfile test.ped --states 50 --rounds 10 --prefix test.famldcaller</div>Weichhttp://genome.sph.umich.edu/w/index.php?title=FamLDCaller&diff=11350FamLDCaller2014-08-18T14:05:10Z<p>Weich: </p>
<hr />
<div><br />
A general guideline for genotyping calling in families. <br />
<br />
Polymutt: Small to big pedigrees, modest to high depth<br />
FamLDCaller: many small pedigrees, low to modest depths.<br />
<br />
FamLDCaller is an extension of [http://genome.sph.umich.edu/wiki/TrioCaller TrioCaller] to handle nuclear and general family structure. <br />
<br />
Download: <br />
[[Binary file:]] [http://www.pitt.edu/~wec47/Files/FamLDCaller FamLDCaller]. [Last update: 08/15/2014]<br />
<br />
Please contact Wei Chen at weichen.mich@gmail.com for any questions. <br />
<br />
Major updates. <br />
<br />
1. Update the algorithm to allow nuclear and multi-generational pedigrees<br />
<br />
2. Add a feature to use reference panel<br />
<br />
3. More flexible loading functions for VCF files (no need to remove non-SNP variant)<br />
<br />
<br />
The initial set of genotype calls is generated examining a single individual at a time. These calls are typically quite good for deep sequencing data, but much less accurate for low pass sequence data. In either case, they can be greatly improved by models that combine information across sites and individuals and consider the contraints imposed by parent-offspring trios. <br />
<br />
Here is a summary of the FamLDCaller command line options (these are also listed whenever you run the program with no parameters):<br />
<br />
<source lang="text"><br />
Available Options<br />
Shotgun Sequences: --vcf [], --pedfile [] <br />
Markov Sampler: --seed [], --burnin [], --rounds [] <br />
Haplotyper: --states [], --errorRate []<br />
Phasing: --randomPhase , --inputPhased, --refPhased<br />
Output Files: --prefix [], --phase, --interimInterval []<br />
<br />
<br />
Explanation of Options<br />
--vcf: Standard VCF file (4.0 and above). <br />
--pedfile: Pedigree file in MERLIN format.<br />
--seed: Seed for sampling, default 123456.<br />
--burnin: The number of rounds ignored at the beginning of sampling.<br />
--rounds: The total number of iterations.<br />
--states: The number of haplotyes used in the state space. The default is the maximum number: 2*(number of founders -1).<br />
--errorRate: The pre-defined base error rate. Default 0.01.<br />
--randomPhase: The initial haplotypes are inferred from the single marker. Default option.<br />
--inputPhased: The initial haplotypes are directly from input VCF file (with "|" as separator in the genotype column).<br />
--refPhased: The initial haplotypes are built on reference alleles from VCF file.<br />
--prefix: The prefix of output file <br />
--interimInterval: The number of rounds for interim outputs<br />
</source><br />
<br />
Note: The pedigree files require complete family structures (both parents must exist in the pedigree file, e.g. for parent-offspring pair, create a "fake" parent in the pedigree file or code them as unrelated individuals). The order of the names in the pedigree file is NOT necessary to be consistent with that in .vcf file. The computation will be intensive if the number of samples are large. <br />
You can use option "--states" to reduce the computation cost (e.g. start with "--states 50") <br />
<br />
To complete our example analysis, we could run:<br />
<br />
FamLDCaller --vcf test.vcf --pedfile test.ped --states 50 --rounds 10 --prefix test.famldcaller</div>Weichhttp://genome.sph.umich.edu/w/index.php?title=TrioCaller&diff=11349TrioCaller2014-08-18T14:04:54Z<p>Weich: /* Download */</p>
<hr />
<div>== Introduction ==<br />
<br />
We will illustrate how TrioCaller works in sequence data including trios and unrelated samples. We will walk through all necessary steps to move from raw sequence data to called genotypes. <br />
If you are new to sequence data, please review every step. If you are experienced, you may directly jump to [http://genome.sph.umich.edu/wiki/TrioCaller#Genotype_Refinement_Using_Linkage_Disequilibrium_Information_.28TrioCaller.29 TrioCaller] specific section. <br />
<br />
We will start with a set of short sequence reads and associated base quality scores (stored in a fastq file), find the most likely genomic location for each read (producing a BAM file), generate an initial list of polymorphic sites and genotypes (stored in a VCF file) and use haplotype information to refine these genotypes (resulting in an updated VCF file).<br />
<br />
=== Note ===<br />
<br />
If you are interested in ''de novo'' mutations or are working on one or two families with deep sequence data (>30X), you should first consider our sister program, [http://genome.sph.umich.edu/wiki/Polymutt Polymutt], which ignores linkage disequilibrium information but can handle more complex pedigrees.<br />
<br />
=== Download ===<br />
<br />
Before downloading the program, we would appreciate it if you could email [mailto:weichen.mich@gmail.com weichen.mich@gmail.com] (Subject: TrioCaller), with or without a little descriptive information (e.g. affiliation, depth, number of samples, and family structure). We will notify you if there is any update. <br />
<br />
'''A recent extension of TrioCaller: [http://genome.sph.umich.edu/wiki/FamLDCaller FamLDCaller] is coming soon with major updates (better processing function, handling general families and reference panels). Please try the beta version below. Contact weichen.mich@gmail.com for any questions.''' <br />
<br />
[[Binary file:]] [http://www.pitt.edu/~wec47/Files/FamLDCaller FamLDCaller]. [Last update: 08/15/2014]<br />
<br />
<br />
'''TrioCaller''': the version used in the paper. <br />
<br />
[[Binary file only:]] [http://www.sph.umich.edu/csg/weich/TrioCaller.06262012.binary.tgz TrioCaller.06262012.binary.tgz]. <br />
<br />
[[Binary file with example datasets&nbsp;:]] [http://www.sph.umich.edu/csg/weich/TrioCaller.06262012.tgz TrioCaller.06262012.tgz]. <br />
<br />
[http://genome.sph.umich.edu/wiki/TrioCaller:Archive Archive]. <br />
<br />
== An example from sequence data to genotypes ==<br />
<br />
The example dataset demonstrated here is included in the download package. It consists of 40 individuals: 10 parent-offspring trios (30 individuals) and 10 unrelated individuals, sequenced to an average depth of ~3x. README.txt describes the structure of the package. Pipeline.csh (C shell) and pipeline.bash (bash shell) are two scripts that run all of the commands listed here in batch. <br />
<br />
To conserve time and disk space, our analysis will focus on a small region of chromosome 20 around position 2,000,000. We will first map reads for a single individual (labeled SAMPLE1), then combine the results with mapped reads from all individuals to generate a list of polymorphic sites and estimate accurate genotypes at each of those sites.<br />
<br />
=== Required Software ===<br />
<br />
In addition to TrioCaller, you will need BWA ([http://bio-bwa.sourceforge.net available from Sourceforge]) and samtools ([http://samtools.sourceforge.net also from Sourceforge]) installed to run this exercise. The examples were tested with bwa 0.6.1, samtools 0.1.18, and TrioCaller 0.1.1; we expect newer versions to work as well. We assume all executables are in your path.<br />
<br />
== Building an Index for Short Read Alignment ==<br />
<br />
To quickly place short reads along the genome, BWA and other read mappers typically build a word index for the genome. This index lists the location of particular short words along the genome and<br />
can be used to seed and then extend particular matches.<br />
<br />
The sequence index is typically not compatible across different BWA versions. To rebuild the sequence index, issue the following commands:<br />
<br />
<source lang="bash"><br />
#Remove any earlier reference files<br />
rm ref/human_g1k_v37_chr20.fa.*<br />
<br />
#Rebuild the reference<br />
bin/bwa index -a is ref/human_g1k_v37_chr20.fa<br />
</source><br />
<br />
== Mapping Reads to The Genome ==<br />
<br />
Next, we will use BWA to find the most likely sequence location for each read using the <code>bwa aln</code> command. This command requires two parameters, one corresponding to the reference genome, the other corresponding to a fastq file containing reads to be mapped. <br />
<br />
bin/bwa aln -q 15 ref/human_g1k_v37_chr20.fa fastq/SAMPLE1.fastq > bwa.sai/SAMPLE1.sai<br />
<br />
The file SAMPLE1.fastq contains DNA sequence reads for sample SAMPLE1. <br />
<br />
A fastq file consists of a series of multi-line records. Each record starts with a read name, followed by a DNA sequence, a separator line, and a set of per-base quality scores. Base quality scores estimate the probability of error at each sequenced base (a base quality of 10 denotes a 10% error probability, base quality 20 denotes 1%, and base quality 30 denotes 0.1%). Each score is encoded in a single character for compactness and can be decoded using an [http://www.google.com/search?q=ascii+table ASCII table] (simply look up the ASCII code for each character and subtract 33 to get the base quality). By inspecting the fastq file you should be able to learn the length of the reads being mapped and their base qualities. For example, try to figure out whether base quality is typically higher at the start or at the end of each read.<br />
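The ASCII arithmetic can be checked directly in the shell. The snippet below is a standalone sketch (not part of the tutorial pipeline) that decodes one arbitrarily chosen quality character, "I":<br />

```bash
# Decode one FASTQ quality character: Phred score = ASCII code - 33,
# and the error probability it encodes is 10^(-Phred/10).
qual_char="I"                          # arbitrary example character
ascii=$(printf '%d' "'$qual_char")     # ASCII code of the character (73)
phred=$((ascii - 33))                  # Phred base quality (40)
echo "char=$qual_char phred=$phred"
awk -v q="$phred" 'BEGIN { printf "error probability = %g\n", 10^(-q / 10) }'
```

A Phred score of 40 thus corresponds to a 0.0001 error probability, consistent with the rule of thumb above.<br />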
<br />
=== Converting Alignments to BAM format ===<br />
<br />
The .sai alignment format is specific to BWA, so the first thing to do is convert the alignment to a more standard format that is compatible with downstream analysis tools. We can do this with a combination of the <code>bwa samse</code>, <code>samtools view</code>, and <code>samtools sort</code> commands.<br />
<br />
bin/bwa samse -r "@RG\tID:ILLUMINA\tSM:SAMPLE1" ref/human_g1k_v37_chr20.fa bwa.sai/SAMPLE1.sai fastq/SAMPLE1.fastq | \<br />
bin/samtools view -uhS - | bin/samtools sort -m 2000000000 - bams/SAMPLE1<br />
<br />
You can check the use of these parameters in the bwa manual. The resulting BAM file uses a compact binary format to represent the <br />
alignment of each short read to the genome. You can view the contents<br />
of the file using the <code>samtools view</code> command, like so:<br />
<br />
bin/samtools view bams/SAMPLE1.bam | more<br />
<br />
The text representation of the alignment produced by <code>samtools view</code> describes<br />
the alignment of one read per line. The most interesting fields are column 1 (the read <br />
name), columns 3 and 4 (the alignment position), column 6 (the CIGAR string, describing <br />
any gaps in the alignment), and columns 10 and 11 (the sequence and its quality scores). In this representation, all alignments are automatically converted to the forward strand.<br />
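You can pick those columns out of a single alignment line with awk. The line below is invented for illustration (it is not taken from the tutorial BAMs):<br />

```bash
# One made-up SAM alignment line; fields are tab-separated.
line=$(printf 'read1\t0\t20\t2000100\t37\t8M\t*\t0\t0\tACGTACGT\tIIIIIIII')
# Print read name (col 1), position (cols 3-4), CIGAR (col 6),
# and the read length taken from the sequence (col 10).
echo "$line" | awk -F'\t' '{ print $1, $3":"$4, $6, length($10) }'
```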
<br />
=== Indexing the BAM file ===<br />
<br />
<!--<br />
Although the current file contains all necessary information about reads and their<br />
genomic locations, it is missing some auxiliary information that BAM files typically<br />
contain to help describe their contents (for example, to specify that this file contains<br />
DNA sequence reads for sample NA20589). So, the very next step is to add this information<br />
to the file:<br />
<br />
samtools reheader bams/NA20589.header bams/noheader.NA20589.bam > bams/NA20589.bam<br />
<br />
!--><br />
<br />
If you have reached this far, rejoice! The mapping process is almost done. We will now create <br />
an index for the file, which makes it possible to quickly extract reads from any <br />
genomic location. We do this with the <code>samtools index</code> command, like so:<br />
<br />
bin/samtools index bams/SAMPLE1.bam<br />
<br />
=== Browsing Alignment Results ===<br />
<br />
You can now view the contents of the alignment at any location using the <code>samtools view</code><br />
and <code>samtools tview</code> commands. While <code>tview</code> generates prettier output,<br />
it is not compatible with all screens. For example, to view reads overlapping <br />
the region starting at position 2,100,000 on chromosome 20, we could run:<br />
<br />
bin/samtools tview bams/SAMPLE1.bam ref/human_g1k_v37_chr20.fa<br />
<br />
Then, type "g 20:2100000"<br />
<br />
So let's recap: we have mapped reads to the genome, converted them from a BWA-specific format to a more <br />
commonly used format supported by many different programs, and sorted and indexed the results.<br />
In most cases, the next step would be to remove duplicate reads and to ensure that base quality scores are properly calibrated. To save time, we'll skip those steps here.<br />
<br />
So far, we have only finished read mapping for one sample, SAMPLE1. We need to repeat this step for the other samples (SAMPLE2 - SAMPLE40). You can try something like:<br />
<br />
For C shell:<br />
<br />
foreach file (`ls fastq/SAMPLE*.fastq | cut -f 2 -d '/' | cut -f 1 -d '.'`)<br />
echo $file<br />
bin/bwa aln -q 15 ref/human_g1k_v37_chr20.fa fastq/$file.fastq > bwa.sai/$file.sai<br />
bin/bwa samse -r "@RG\tID:ILLUMINA\tSM:$file" ref/human_g1k_v37_chr20.fa bwa.sai/$file.sai fastq/$file.fastq | \<br />
bin/samtools view -uhS - | bin/samtools sort -m 2000000000 - bams/$file<br />
bin/samtools index bams/$file.bam<br />
end<br />
<br />
For bash shell:<br />
<br />
for file in `ls fastq/SAMPLE*.fastq | cut -f 2 -d '/' | cut -f 1 -d '.'`;<br />
do echo $file;<br />
bin/bwa aln -q 15 ref/human_g1k_v37_chr20.fa fastq/$file.fastq > bwa.sai/$file.sai;<br />
bin/bwa samse -r "@RG\tID:ILLUMINA\tSM:$file" ref/human_g1k_v37_chr20.fa bwa.sai/$file.sai fastq/$file.fastq | \<br />
bin/samtools view -uhS - | bin/samtools sort -m 2000000000 - bams/$file;<br />
bin/samtools index bams/$file.bam;<br />
done<br />
<br />
<br />
Once we have finished the read mapping step and generated BAM files for all samples, we can proceed to variant calling and genotype inference.<br />
<br />
== Calling variants and Inferring genotypes ==<br />
<br />
=== Initial set of variant calls ===<br />
<br />
You probably thought the initial mapping process was quite convoluted ... you'll be glad to know that<br />
the next few steps are much simpler.<br />
<br />
The first thing we'll do is use samtools to generate an initial list of variant sites, using the <code>mpileup</code> command. This command looks at the bases aligned to each location and flags locations that are likely to vary. By default, the results are stored in a BCF file, which can be converted into the more widely used VCF format using bcftools (a companion set of tools distributed with samtools).<br />
<br />
<br />
bin/samtools mpileup -Iuf ref/human_g1k_v37_chr20.fa bams/SAMPLE*bam | bin/bcftools view -bvcg - > result/chr20.mpileup.bcf<br />
<br />
bin/bcftools view result/chr20.mpileup.bcf > result/chr20.mpileup.vcf<br />
<br />
<br />
The [http://www.1000genomes.org/node/101 VCF format] is a simple text format. It starts with several header lines, which all start with the two '##' characters, and is followed by a single line per marker that provides both summary information about the marker and genotypes for each individual. You can review the contents of the VCF file using the 'more' command:<br />
<br />
more result/chr20.mpileup.vcf<br />
<br />
Here are some questions for you to investigate:<br />
<br />
* How many variant sites were detected in this dataset? Try a command like this one:<br />
<br />
grep -vE ^# result/chr20.mpileup.vcf | wc -l<br />
<br />
(The grep command excludes all lines beginning with #, and the wc command then counts the remaining lines.)<br />
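You can sanity-check this counting recipe on a miniature VCF. The file below is invented for illustration (columns shown space-separated for readability; real VCFs are tab-delimited); in the real analysis the input would be result/chr20.mpileup.vcf:<br />

```bash
# Write a 3-record toy VCF, then count its variant records by
# dropping '#' header lines and counting what remains.
cat > mini.vcf <<'EOF'
##fileformat=VCFv4.1
#CHROM POS ID REF ALT QUAL FILTER INFO
20 2000123 . A G 30 . DP=12
20 2000456 . T TA 25 . INDEL;DP=9
20 2000789 . C T 40 . DP=15
EOF
grep -vcE '^#' mini.vcf    # prints 3
```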
<br />
<!-- <br />
* How many variant sites are estimated to be singletons?<br />
!--><br />
<br />
<!--<br />
<br />
Wei: I commented this out because generating GLF requires an old version of samtools that most people don't have. Goncalo<br />
<br />
=== Alternative way to call variants ===<br />
Bcftools assumes all samples are unrelated. The better way to call variants is to consider family constraints imposed by parent-offspring trio. An alternative approach for current step is to use<br />
another program [http://genome.sph.umich.edu/wiki/Polymutt Polymutt]. It consists of two steps, but the input [bam file] and output files [vcf file] are same as above. For more information, <br />
click the link. <br />
<br />
Convert bam files to glf files (It is a single line of command)<br />
<br />
bin/samtools view -bh bams/SAMPLE1.bam | bin/samtools calmd -Abr - ref/human_g1k_v37_chr20.fa 2 - | \<br />
bin/samtools mpileup -g -f ref/human_g1k_v37_chr20.fa - > bams/SAMPLE1.glf<br />
<br />
Take glf files and output vcf files.<br />
<br />
bin/polymutt -p polymutt.ped -d polymutt.dat -g polymutt.glfindex --vcf result/chr20.polymutt.vcf<br />
<br />
!--><br />
<br />
=== Genotype Refinement Using Linkage Disequilibrium Information (TrioCaller) ===<br />
<br />
The initial set of genotype calls is generated by examining a single individual at a time. These calls are typically quite good for deep sequencing data, but much less accurate for low-pass sequence data. In either case, they can be greatly improved by models that combine information across sites and individuals and consider the constraints imposed by parent-offspring trios. <br />
<br />
Note: The current version only supports SNP data, so please '''filter indels''' before running TrioCaller. It supports the VCF 4.0 and 4.1 formats, with the '''exception of dropped trailing missing fields''' (e.g. use the complete missing notation ./.:.:.:.,.,. rather than ./. in the genotype field).<br />
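One simple way to drop indels is to keep the header plus every record whose INFO field lacks the INDEL tag that samtools mpileup writes. The sketch below runs on an invented two-record VCF (space-separated for readability); in the real pipeline you would substitute result/chr20.mpileup.vcf, and note that matching the string INDEL anywhere in a line is only a heuristic:<br />

```bash
# Toy input: one SNP and one indel (flagged with INDEL in the INFO column).
cat > toy.vcf <<'EOF'
##fileformat=VCFv4.1
#CHROM POS ID REF ALT QUAL FILTER INFO
20 2000123 . A G 30 . DP=12
20 2000456 . T TA 25 . INDEL;DP=9
EOF
# Keep header lines, then append the non-INDEL records.
{ grep -E '^#' toy.vcf; grep -vE '^#' toy.vcf | grep -v 'INDEL'; } > toy.snps.vcf
grep -vcE '^#' toy.snps.vcf    # prints 1
```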
<br />
Here is a summary of the TrioCaller command line options (these are also listed whenever you run the program with no parameters):<br />
<br />
<source lang="text"><br />
Available Options<br />
Shotgun Sequences: --vcf [], --pedfile [] <br />
Markov Sampler: --seed [], --burnin [], --rounds [] <br />
Haplotyper: --states [], --errorRate [], --compact<br />
Phasing: --randomPhase , --inputPhased, --refPhased<br />
Output Files: --prefix [], --phase, --interimInterval []<br />
<br />
<br />
Explanation of Options<br />
--vcf: Standard VCF file (4.0 and above). <br />
--pedfile: Pedigree file in MERLIN format.<br />
--seed: Seed for sampling, default 123456.<br />
--burnin: The number of rounds ignored at the beginning of sampling.<br />
--rounds: The total number of iterations.<br />
--states: The number of haplotypes used in the state space. The default is the maximum number: 2*(number of founders - 1).<br />
--errorRate: The pre-defined base error rate. Default 0.01.<br />
--randomPhase: The initial haplotypes are inferred from the single marker. Default option.<br />
--inputPhased: The initial haplotypes are directly from input VCF file (with "|" as separator in the genotype column).<br />
--refPhased: The initial haplotypes are built on reference alleles from VCF file.<br />
--prefix: The prefix of the output files. <br />
--interimInterval: The number of rounds between interim outputs.<br />
</source><br />
<br />
Note: The pedigree file requires complete trio structures (all three members of each trio must exist in the file). For a parent-offspring pair, create a "fake" parent in the pedigree file or code them as unrelated individuals. The order of the sample names in the pedigree file need NOT be consistent with that in the .vcf file. The computation will be intensive if the number of samples is large. <br />
You can use the option "--states" to reduce the computation cost (e.g. start with "--states 50") <br />
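For reference, a MERLIN-format pedigree file has one individual per line with columns: family ID, individual ID, father ID, mother ID, sex (1=male, 2=female), and at least one phenotype column, using 0 for missing parents. The IDs below are invented; a single trio plus one unrelated individual might look like:<br />

```text
FAM1  FATHER1   0        0        1  0
FAM1  MOTHER1   0        0        2  0
FAM1  CHILD1    FATHER1  MOTHER1  1  0
FAM2  SAMPLE31  0        0        1  0
```

The individual IDs must match the sample names in the VCF, though, as noted above, they need not appear in the same order.<br />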
<br />
To complete our example analysis, we could run:<br />
<br />
bin/TrioCaller --vcf result/chr20.mpileup.vcf --pedfile ped/triocaller.ped --states 50 --rounds 10 --prefix result/chr20.triocaller<br />
<br />
The format of the output file is the same as that of the input file. Again, you can review the contents of the updated VCF file using the more command:<br />
<br />
more result/chr20.triocaller.vcf<br />
<br />
All right, congratulations! You have reached the end and learned the basic skills for accurate genotype calling in trios.<br />
<br />
If you have any questions or comments, feel free to contact Wei Chen at [mailto:weichen.mich@gmail.com weichen.mich@gmail.com] or Goncalo Abecasis at [mailto:goncalo@umich.edu goncalo@umich.edu]<br />
<br />
== Citation ==<br />
<br />
Chen W, Li B, Zeng Z, Sanna S, Sidore C, Busonero F, Kang HM, Li Y, Abecasis GR. &nbsp;Genotype calling and haplotyping in parent-offspring trios. Genome Res.&nbsp;2013 Jan;23(1):142-51&nbsp;[http://genome.cshlp.org/content/early/2012/11/26/gr.142455.112.long LINK]</div>Weichhttp://genome.sph.umich.edu/w/index.php?title=FamLDCaller&diff=11348FamLDCaller2014-08-18T14:02:57Z<p>Weich: </p>
<hr />
<div><br />
A general guideline for genotype calling in families: <br />
<br />
Polymutt: small to large pedigrees, modest to high depth.<br />
FamLDCaller: many small pedigrees, low to modest depth.<br />
<br />
Download: <br />
[[Binary file:]] [http://www.pitt.edu/~wec47/Files/FamLDCaller FamLDCaller]. [Last update: 08/15/2014]<br />
<br />
Major updates: <br />
<br />
1. Update the algorithm to allow nuclear and multi-generational pedigrees<br />
<br />
2. Add a feature to use reference panel<br />
<br />
3. More flexible loading functions for VCF files (no need to remove non-SNP variants)<br />
<br />
<br />
<br />
<br />
<br />
The initial set of genotype calls is generated by examining a single individual at a time. These calls are typically quite good for deep sequencing data, but much less accurate for low-pass sequence data. In either case, they can be greatly improved by models that combine information across sites and individuals and consider the constraints imposed by parent-offspring trios. <br />
<br />
Here is a summary of the FamLDCaller command line options (these are also listed whenever you run the program with no parameters):<br />
<br />
<source lang="text"><br />
Available Options<br />
Shotgun Sequences: --vcf [], --pedfile [] <br />
Markov Sampler: --seed [], --burnin [], --rounds [] <br />
Haplotyper: --states [], --errorRate []<br />
Phasing: --randomPhase , --inputPhased, --refPhased<br />
Output Files: --prefix [], --phase, --interimInterval []<br />
<br />
<br />
Explanation of Options<br />
--vcf: Standard VCF file (4.0 and above). <br />
--pedfile: Pedigree file in MERLIN format.<br />
--seed: Seed for sampling, default 123456.<br />
--burnin: The number of rounds ignored at the beginning of sampling.<br />
--rounds: The total number of iterations.<br />
--states: The number of haplotypes used in the state space. The default is the maximum number: 2*(number of founders - 1).<br />
--errorRate: The pre-defined base error rate. Default 0.01.<br />
--randomPhase: The initial haplotypes are inferred from the single marker. Default option.<br />
--inputPhased: The initial haplotypes are directly from input VCF file (with "|" as separator in the genotype column).<br />
--refPhased: The initial haplotypes are built on reference alleles from VCF file.<br />
--prefix: The prefix of output file <br />
--interimInterval: The number of rounds for interim outputs<br />
</source><br />
<br />
Note: The pedigree file requires complete family structures (both parents must exist in the pedigree file; e.g. for a parent-offspring pair, create a "fake" parent in the pedigree file or code them as unrelated individuals). The order of the sample names in the pedigree file need NOT be consistent with that in the .vcf file. The computation will be intensive if the number of samples is large. <br />
You can use the option "--states" to reduce the computation cost (e.g. start with "--states 50") <br />
<br />
To complete our example analysis, we could run:<br />
<br />
FamLDCaller --vcf test.vcf --pedfile test.ped --states 50 --rounds 10 --prefix test.famldcaller</div>Weichhttp://genome.sph.umich.edu/w/index.php?title=FamLDCaller&diff=11347FamLDCaller2014-08-18T14:02:27Z<p>Weich: </p>
<hr />
<div><br />
A general guideline for genotyping calling in families. <br />
<br />
Polymutt: Small to big pedigrees, modest to high depth<br />
FamLDCaller: many small pedigrees, low to modest depths.<br />
<br />
Download: <br />
[[Binary file:]] Download [www.pitt.edu/~wec47/Files/FamLDCaller FamLDCaller]. [Last update: 08/15/2014]<br />
<br />
Major updates. <br />
<br />
1. Update the algorithm to allow nuclear and multi-generational pedigrees<br />
<br />
2. Add a feature to use reference panel<br />
<br />
3. More flexible loading functions for VCF files (no need to remove non-SNP variant)<br />
<br />
<br />
<br />
<br />
<br />
The initial set of genotype calls is generated examining a single individual at a time. These calls are typically quite good for deep sequencing data, but much less accurate for low pass sequence data. In either case, they can be greatly improved by models that combine information across sites and individuals and consider the contraints imposed by parent-offspring trios. <br />
<br />
Here is a summary of the FamLDCaller command line options (these are also listed whenever you run the program with no parameters):<br />
<br />
<source lang="text"><br />
Available Options<br />
Shotgun Sequences: --vcf [], --pedfile [] <br />
Markov Sampler: --seed [], --burnin [], --rounds [] <br />
Haplotyper: --states [], --errorRate []<br />
Phasing: --randomPhase , --inputPhased, --refPhased<br />
Output Files: --prefix [], --phase, --interimInterval []<br />
<br />
<br />
Explanation of Options<br />
--vcf: Standard VCF file (4.0 and above). <br />
--pedfile: Pedigree file in MERLIN format.<br />
--seed: Seed for sampling, default 123456.<br />
--burnin: The number of rounds ignored at the beginning of sampling.<br />
--rounds: The total number of iterations.<br />
--states: The number of haplotyes used in the state space. The default is the maximum number: 2*(number of founders -1).<br />
--errorRate: The pre-defined base error rate. Default 0.01.<br />
--randomPhase: The initial haplotypes are inferred from the single marker. Default option.<br />
--inputPhased: The initial haplotypes are directly from input VCF file (with "|" as separator in the genotype column).<br />
--refPhased: The initial haplotypes are built on reference alleles from VCF file.<br />
--prefix: The prefix of output file <br />
--interimInterval: The number of rounds for interim outputs<br />
</source><br />
<br />
Note: The pedigree files require complete family structures (both parents must exist in the pedigree file, e.g. for parent-offspring pair, create a "fake" parent in the pedigree file or code them as unrelated individuals). The order of the names in the pedigree file is NOT necessary to be consistent with that in .vcf file. The computation will be intensive if the number of samples are large. <br />
You can use option "--states" to reduce the computation cost (e.g. start with "--states 50") <br />
<br />
To complete our example analysis, we could run:<br />
<br />
FamLDCaller --vcf test.vcf --pedfile test.ped --states 50 --rounds 10 --prefix test.famldcaller</div>Weichhttp://genome.sph.umich.edu/w/index.php?title=FamLDCaller&diff=11346FamLDCaller2014-08-18T14:01:19Z<p>Weich: </p>
<hr />
<div><br />
A general guideline for genotyping calling in families. <br />
<br />
Polymutt: Small to big pedigrees, modest to high depth<br />
FamLDCaller: many small pedigrees, low to modest depths.<br />
<br />
Download: <br />
[[Binary file:]] [www.pitt.edu/~wec47/Files/FamLDCaller]. [Last update: 08/15/2014]<br />
<br />
Major updates. <br />
<br />
1. Update the algorithm to allow nuclear and multi-generational pedigrees<br />
<br />
2. Add a feature to use reference panel<br />
<br />
3. More flexible loading functions for VCF files (no need to remove non-SNP variant)<br />
<br />
<br />
<br />
<br />
<br />
The initial set of genotype calls is generated examining a single individual at a time. These calls are typically quite good for deep sequencing data, but much less accurate for low pass sequence data. In either case, they can be greatly improved by models that combine information across sites and individuals and consider the contraints imposed by parent-offspring trios. <br />
<br />
Here is a summary of the FamLDCaller command line options (these are also listed whenever you run the program with no parameters):<br />
<br />
<source lang="text"><br />
Available Options<br />
Shotgun Sequences: --vcf [], --pedfile [] <br />
Markov Sampler: --seed [], --burnin [], --rounds [] <br />
Haplotyper: --states [], --errorRate []<br />
Phasing: --randomPhase , --inputPhased, --refPhased<br />
Output Files: --prefix [], --phase, --interimInterval []<br />
<br />
<br />
Explanation of Options<br />
--vcf: Standard VCF file (4.0 and above). <br />
--pedfile: Pedigree file in MERLIN format.<br />
--seed: Seed for sampling, default 123456.<br />
--burnin: The number of rounds ignored at the beginning of sampling.<br />
--rounds: The total number of iterations.<br />
--states: The number of haplotyes used in the state space. The default is the maximum number: 2*(number of founders -1).<br />
--errorRate: The pre-defined base error rate. Default 0.01.<br />
--randomPhase: The initial haplotypes are inferred from the single marker. Default option.<br />
--inputPhased: The initial haplotypes are directly from input VCF file (with "|" as separator in the genotype column).<br />
--refPhased: The initial haplotypes are built on reference alleles from VCF file.<br />
--prefix: The prefix of output file <br />
--interimInterval: The number of rounds for interim outputs<br />
</source><br />
<br />
Note: The pedigree files require complete family structures (both parents must exist in the pedigree file, e.g. for parent-offspring pair, create a "fake" parent in the pedigree file or code them as unrelated individuals). The order of the names in the pedigree file is NOT necessary to be consistent with that in .vcf file. The computation will be intensive if the number of samples are large. <br />
You can use option "--states" to reduce the computation cost (e.g. start with "--states 50") <br />
<br />
To complete our example analysis, we could run:<br />
<br />
FamLDCaller --vcf test.vcf --pedfile test.ped --states 50 --rounds 10 --prefix test.famldcaller</div>Weichhttp://genome.sph.umich.edu/w/index.php?title=FamLDCaller&diff=11345FamLDCaller2014-08-18T14:00:22Z<p>Weich: </p>
<hr />
<div><br />
A general guideline for genotyping calling in families. <br />
<br />
Polymutt: Small to big pedigrees.<br />
FamLDCaller/TrioCaller: many small pedigrees, low to modest depths.<br />
<br />
Download: <br />
[[Binary file only:]] [www.pitt.edu/~wec47/Files/FamLDCaller]. [Last update: 08/15/2014]<br />
<br />
Major updates. <br />
<br />
1. Update the algorithm to allow nuclear and multi-generational pedigrees<br />
<br />
2. Add a feature to use reference panel<br />
<br />
3. More flexible loading functions for VCF files (no need to remove non-SNP variant)<br />
<br />
<br />
<br />
<br />
<br />
The initial set of genotype calls is generated examining a single individual at a time. These calls are typically quite good for deep sequencing data, but much less accurate for low pass sequence data. In either case, they can be greatly improved by models that combine information across sites and individuals and consider the contraints imposed by parent-offspring trios. <br />
<br />
Here is a summary of the FamLDCaller command line options (these are also listed whenever you run the program with no parameters):<br />
<br />
<source lang="text"><br />
Available Options<br />
Shotgun Sequences: --vcf [], --pedfile [] <br />
Markov Sampler: --seed [], --burnin [], --rounds [] <br />
Haplotyper: --states [], --errorRate []<br />
Phasing: --randomPhase , --inputPhased, --refPhased<br />
Output Files: --prefix [], --phase, --interimInterval []<br />
<br />
<br />
Explanation of Options<br />
--vcf: Standard VCF file (4.0 and above). <br />
--pedfile: Pedigree file in MERLIN format.<br />
--seed: Seed for sampling, default 123456.<br />
--burnin: The number of rounds ignored at the beginning of sampling.<br />
--rounds: The total number of iterations.<br />
--states: The number of haplotyes used in the state space. The default is the maximum number: 2*(number of founders -1).<br />
--errorRate: The pre-defined base error rate. Default 0.01.<br />
--randomPhase: The initial haplotypes are inferred from the single marker. Default option.<br />
--inputPhased: The initial haplotypes are directly from input VCF file (with "|" as separator in the genotype column).<br />
--refPhased: The initial haplotypes are built on reference alleles from VCF file.<br />
--prefix: The prefix of output file <br />
--interimInterval: The number of rounds for interim outputs<br />
</source><br />
<br />
Note: The pedigree files require complete family structures (both parents must exist in the pedigree file, e.g. for parent-offspring pair, create a "fake" parent in the pedigree file or code them as unrelated individuals). The order of the names in the pedigree file is NOT necessary to be consistent with that in .vcf file. The computation will be intensive if the number of samples are large. <br />
You can use option "--states" to reduce the computation cost (e.g. start with "--states 50") <br />
<br />
To complete our example analysis, we could run:<br />
<br />
FamLDCaller --vcf test.vcf --pedfile test.ped --states 50 --rounds 10 --prefix test.famldcaller</div>Weichhttp://genome.sph.umich.edu/w/index.php?title=FamLDCaller&diff=11344FamLDCaller2014-08-18T13:41:02Z<p>Weich: </p>
<hr />
<div><br />
A general guideline for genotyping calling in families. <br />
<br />
Polymutt: Small to big pedigrees.<br />
FamLDCaller/TrioCaller: many small pedigrees, low to modest depths.<br />
<br />
Download: <br />
[[Binary file only:]] [http://www.sph.umich.edu/csg/weich/tmp/FamLDCaller FamLDCaller]. [Last update: 08/15/2014]<br />
<br />
Major updates. <br />
<br />
1. Update the algorithm to allow nuclear and multi-generational pedigrees<br />
<br />
2. Add a feature to use reference panel<br />
<br />
3. More flexible loading functions for VCF files (no need to remove non-SNP variant)<br />
<br />
<br />
<br />
<br />
<br />
The initial set of genotype calls is generated examining a single individual at a time. These calls are typically quite good for deep sequencing data, but much less accurate for low pass sequence data. In either case, they can be greatly improved by models that combine information across sites and individuals and consider the contraints imposed by parent-offspring trios. <br />
<br />
Here is a summary of the FamLDCaller command line options (these are also listed whenever you run the program with no parameters):<br />
<br />
<source lang="text"><br />
Available Options<br />
Shotgun Sequences: --vcf [], --pedfile [] <br />
Markov Sampler: --seed [], --burnin [], --rounds [] <br />
Haplotyper: --states [], --errorRate []<br />
Phasing: --randomPhase , --inputPhased, --refPhased<br />
Output Files: --prefix [], --phase, --interimInterval []<br />
<br />
<br />
Explanation of Options<br />
--vcf: Standard VCF file (4.0 and above). <br />
--pedfile: Pedigree file in MERLIN format.<br />
--seed: Seed for sampling, default 123456.<br />
--burnin: The number of rounds ignored at the beginning of sampling.<br />
--rounds: The total number of iterations.<br />
--states: The number of haplotyes used in the state space. The default is the maximum number: 2*(number of founders -1).<br />
--errorRate: The pre-defined base error rate. Default 0.01.<br />
--randomPhase: The initial haplotypes are inferred from the single marker. Default option.<br />
--inputPhased: The initial haplotypes are directly from input VCF file (with "|" as separator in the genotype column).<br />
--refPhased: The initial haplotypes are built on reference alleles from VCF file.<br />
--prefix: The prefix of output file <br />
--interimInterval: The number of rounds for interim outputs<br />
</source><br />
<br />
Note: The pedigree files require complete family structures (both parents must exist in the pedigree file, e.g. for parent-offspring pair, create a "fake" parent in the pedigree file or code them as unrelated individuals). The order of the names in the pedigree file is NOT necessary to be consistent with that in .vcf file. The computation will be intensive if the number of samples are large. <br />
You can use option "--states" to reduce the computation cost (e.g. start with "--states 50") <br />
<br />
To complete our example analysis, we could run:<br />
<br />
FamLDCaller --vcf test.vcf --pedfile test.ped --states 50 --rounds 10 --prefix test.famldcaller</div>Weichhttp://genome.sph.umich.edu/w/index.php?title=TrioCaller&diff=9891TrioCaller2014-04-03T19:04:01Z<p>Weich: /* Download */</p>
<hr />
<div>== Introduction ==<br />
<br />
We will illustrate how TrioCaller works on sequence data that include trios and unrelated samples. We will walk through all the steps necessary to move from raw sequence data to called genotypes. <br />
If you are new to sequence data, please review every step. If you are experienced, you may jump directly to the [http://genome.sph.umich.edu/wiki/TrioCaller#Genotype_Refinement_Using_Linkage_Disequilibrium_Information_.28TrioCaller.29 TrioCaller]-specific section. <br />
<br />
We will start with a set of short sequence reads and associated base quality scores (stored in a fastq file), find the most likely genomic location for each read (producing a BAM file), generate an initial list of polymorphic sites and genotypes (stored in a VCF file) and use haplotype information to refine these genotypes (resulting in an updated VCF file).<br />
<br />
=== Note ===<br />
<br />
If you are interested in ''de novo'' mutations or are working on one or two families with deep sequence data (>30X), you should first consider our sister program, [http://genome.sph.umich.edu/wiki/Polymutt Polymutt], which ignores linkage disequilibrium information but can handle more complex pedigrees.<br />
<br />
=== Download ===<br />
<br />
Before downloading the program, we would appreciate it if you could email [mailto:weichen.mich@gmail.com weichen.mich@gmail.com]&nbsp;(Subject: TrioCaller), optionally with a little descriptive information (e.g. affiliation, depth, the number of samples and family structure). We will notify you if there is any update.&nbsp;<br />
<br />
'''A recent extension of TrioCaller, [http://genome.sph.umich.edu/wiki/FamLDCaller FamLDCaller], is coming soon with major updates (better processing functions, support for general families and reference panels). Please try the beta version below. Contact weichen.mich@gmail.com with any questions.''' <br />
<br />
[[Binary file only:]] [http://www.sph.umich.edu/csg/weich/tmp/FamLDCaller FamLDCaller]. [Last update: 01/11/2014]<br />
<br />
<br />
'''TrioCaller''' : the version we used in the paper. <br />
<br> <br />
[[Binary file only:]] [http://www.sph.umich.edu/csg/weich/TrioCaller.06262012.binary.tgz TrioCaller.06262012.binary.tgz]. <br />
<br />
[[Binary file with example datasets&nbsp;:]] [http://www.sph.umich.edu/csg/weich/TrioCaller.06262012.tgz TrioCaller.06262012.tgz]. <br />
<br />
[http://genome.sph.umich.edu/wiki/TrioCaller:Archive Archive]. <br />
<br />
== An example from sequence data to genotypes ==<br />
<br />
The example dataset demonstrated here is also included in the package. Our dataset consists of 40 individuals: 10 parent-offspring trios and 10 unrelated individuals. The average sequence depth is ~3x. README.txt describes the structure of the package. pipeline.csh (C shell) and pipeline.bash (bash shell) are two scripts that run all the commands listed here in batch. <br />
<br />
To conserve time and disk-space, our analysis will focus on a small region on chromosome 20 around position 2,000,000. We will first map reads for a single individual (labeled SAMPLE1). Then we combine the results with mapped reads from all individuals to generate a list of polymorphic sites and estimate accurate genotypes at each of these sites.<br />
<br />
=== Required Software ===<br />
<br />
In addition to TrioCaller, you will need BWA ([http://bio-bwa.sourceforge.net available from Sourceforge]) and samtools ([http://samtools.sourceforge.net also from Sourceforge]) installed to run this exercise. The examples were tested with bwa 0.6.1, samtools 0.1.18 and TrioCaller 0.1.1; we expect newer versions to work as well. We assume all executables are in your path.<br />
<br />
== Building an Index for Short Read Alignment ==<br />
<br />
To quickly place short reads along the genome, BWA and other read mappers typically build a word index for the genome. This index lists the location of particular short words along the genome and<br />
can be used to seed and then extend particular matches.<br />
<br />
The sequence index is typically not compatible across different BWA versions. To rebuild the sequence index, issue the following commands:<br />
<br />
<source lang="bash"><br />
#Remove any earlier reference files<br />
rm ref/human_g1k_v37_chr20.fa.*<br />
<br />
#Rebuild the reference<br />
bin/bwa index -a is ref/human_g1k_v37_chr20.fa<br />
</source><br />
<br />
== Mapping Reads to The Genome ==<br />
<br />
Next, we will use BWA to find the most likely sequence location for each read using the <code>bwa aln</code> command. This command requires two parameters: one specifying the reference genome, the other a fastq file containing the reads to be mapped. <br />
<br />
bin/bwa aln -q 15 ref/human_g1k_v37_chr20.fa fastq/SAMPLE1.fastq > bwa.sai/SAMPLE1.sai<br />
<br />
The file SAMPLE1.fastq contains DNA sequence reads for sample SAMPLE1. <br />
<br />
A fastq file consists of a series of multi-line records. Each record starts with a read name, followed by a DNA sequence, a separator line, and a set of per-base quality scores. Base quality scores estimate the probability of error at each sequenced base (a base quality of 10 denotes an error probability of 10%, base quality 20 denotes 1% and base quality 30 denotes 0.1%). These error probabilities are each encoded in a single character for compactness and can be decoded using an [http://www.google.com/search?q=ascii+table ASCII table] (simply look up the ASCII code for each quality character and subtract 33 to get the base quality). By inspecting the fastq file you should be able to learn about the length of the reads being mapped and their base qualities. For example, try to figure out whether base quality is typically higher at the start or the end of each read.<br />
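As a quick check of this decoding, the shell can print the ASCII code of a quality character directly; this is a standalone sketch, not part of the mapping pipeline:<br />

```bash
# In a fastq file, base quality = ASCII code of the quality character minus 33.
# A leading single quote makes printf emit the character's ASCII code.
printf '%d\n' "'I"    # prints 73, i.e. base quality 73 - 33 = 40
```

So a read whose quality string is full of 'I' characters was sequenced with an estimated error probability of 0.01% per base.<br />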
<br />
=== Converting Alignments to BAM format ===<br />
<br />
The .sai alignment format is specific to BWA, so the first thing to do is to convert the alignment to a more standard format that will be compatible with downstream analysis tools. We can do this with a combination of the <code>bwa samse</code>, <code>samtools view</code> and <code>samtools sort</code> commands.<br />
<br />
bin/bwa samse -r "@RG\tID:ILLUMINA\tSM:SAMPLE1" ref/human_g1k_v37_chr20.fa bwa.sai/SAMPLE1.sai fastq/SAMPLE1.fastq | \<br />
bin/samtools view -uhS - | bin/samtools sort -m 2000000000 - bams/SAMPLE1<br />
<br />
You can check the use of parameters in the bwa manual. The resulting BAM file uses a compact binary format to represent the <br />
alignment of each short read to the genome. You can view the contents<br />
of the file using the <code>samtools view</code> command, like so:<br />
<br />
bin/samtools view bams/SAMPLE1.bam | more<br />
<br />
The text representation of the alignment produced by <code>samtools view</code> describes<br />
the alignment of one read per line. The most interesting fields are column 1 (the read <br />
name), columns 3 and 4 (the alignment position), column 5 (the CIGAR string, describing <br />
any gaps in the alignment), and columns 10 and 11 (the sequence and its base quality scores). In this representation, all alignments are reported on the forward strand.<br />
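To see how these columns line up, here is a small awk sketch over a single made-up SAM record (the read name, position and CIGAR below are hypothetical, not taken from the example data):<br />

```bash
# A hypothetical SAM record with the 11 mandatory tab-separated fields:
# name, flag, chrom, pos, mapq, CIGAR, mate chrom, mate pos, tlen, seq, quals
printf 'read1\t0\t20\t2100000\t37\t4M\t*\t0\t0\tACGT\tIIII\n' |
  awk -F'\t' '{print $1, $3, $4, $6}'    # prints: read1 20 2100000 4M
```

The same awk filter applied to the output of <code>bin/samtools view bams/SAMPLE1.bam</code> would list the name, position and CIGAR of every aligned read.<br />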
<br />
=== Indexing the BAM file ===<br />
<br />
<!--<br />
Although the current file contains all necessary information about reads and their<br />
genomic locations, it is missing some auxiliary information that BAM files typically<br />
contain to help describe their contents (for example, to specify that this file contains<br />
DNA sequence reads for sample NA20589). So, the very next step is to add this information<br />
to the file:<br />
<br />
samtools reheader bams/NA20589.header bams/noheader.NA20589.bam > bams/NA20589.bam<br />
<br />
!--><br />
<br />
If you have reached this far, rejoice! The mapping process is almost done. We will now create <br />
an index for the file, which makes it convenient to quickly extract reads from any <br />
genome location. We do this with the <code>samtools index</code> command, like so:<br />
<br />
bin/samtools index bams/SAMPLE1.bam<br />
<br />
=== Browsing Alignment Results ===<br />
<br />
You can now view the contents of the alignment at any location using the <code>samtools view</code><br />
and <code>samtools tview</code> commands. While <code>tview</code> generates prettier output,<br />
it is not compatible with all terminals. For example, to view reads overlapping <br />
position 2,100,000 on chromosome 20, we could run:<br />
<br />
bin/samtools tview bams/SAMPLE1.bam ref/human_g1k_v37_chr20.fa<br />
<br />
Then, type "g 20:2100000"<br />
<br />
So let's recap: we have mapped reads to the genome, converted them from a BWA-specific format to a standard <br />
format used by many different programs, and sorted and indexed the results.<br />
In most cases, the next step would be to remove duplicate reads and to ensure that base quality scores are properly calibrated. To save time, we'll skip those steps now.<br />
<br />
So far, we have only finished read mapping for one sample, SAMPLE1. We need to repeat this step for the other samples (SAMPLE2 - SAMPLE40). You can try something like:<br />
<br />
For C shell:<br />
<br />
foreach file (`ls fastq/SAMPLE*.fastq | cut -f 2 -d '/' | cut -f 1 -d '.'`)<br />
echo $file<br />
bin/bwa aln -q 15 ref/human_g1k_v37_chr20.fa fastq/$file.fastq > bwa.sai/$file.sai<br />
bin/bwa samse -r "@RG\tID:ILLUMINA\tSM:$file" ref/human_g1k_v37_chr20.fa bwa.sai/$file.sai fastq/$file.fastq | \<br />
bin/samtools view -uhS - | bin/samtools sort -m 2000000000 - bams/$file<br />
bin/samtools index bams/$file.bam<br />
end<br />
<br />
For the bash shell:<br />
<br />
for file in `ls fastq/SAMPLE*.fastq | cut -f 2 -d '/' | cut -f 1 -d '.'`;<br />
do echo $file;<br />
bin/bwa aln -q 15 ref/human_g1k_v37_chr20.fa fastq/$file.fastq > bwa.sai/$file.sai;<br />
bin/bwa samse -r "@RG\tID:ILLUMINA\tSM:$file" ref/human_g1k_v37_chr20.fa bwa.sai/$file.sai fastq/$file.fastq | \<br />
bin/samtools view -uhS - | bin/samtools sort -m 2000000000 - bams/$file;<br />
bin/samtools index bams/$file.bam;<br />
done<br />
<br />
<br />
Once we have finished the read mapping step and generated BAM files for all samples, we can move on to variant calling and genotype inference.<br />
<br />
== Calling variants and Inferring genotypes ==<br />
<br />
=== Initial set of variant calls ===<br />
<br />
You probably thought the initial mapping process was quite convoluted ... you'll be glad to know that<br />
the next few steps are much simpler.<br />
<br />
The first thing we'll do is use samtools to generate an initial list of variant sites, using the <code>mpileup</code> command. This command looks at the bases aligned to each location and flags locations that are likely to vary. By default, the results are stored in a BCF file, which can be converted into the more widely used VCF format using bcftools (a companion set of tools distributed with samtools).<br />
<br />
<br />
bin/samtools mpileup -Iuf ref/human_g1k_v37_chr20.fa bams/SAMPLE*bam | bin/bcftools view -bvcg - > result/chr20.mpileup.bcf<br />
<br />
bin/bcftools view result/chr20.mpileup.bcf > result/chr20.mpileup.vcf<br />
<br />
<br />
The [http://www.1000genomes.org/node/101 VCF format] is a simple text format. It starts with several header lines, which begin with '##' (the final header line, listing column names and sample IDs, begins with a single '#'), followed by a single line per marker that provides both summary information about the marker and genotypes for each individual. You can review the contents of the VCF file using the 'more' command:<br />
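A minimal, made-up excerpt (hypothetical positions and genotypes, not taken from the example data; columns are tab-separated in a real file) illustrates the layout:<br />

<source lang="text"><br />
##fileformat=VCFv4.1<br />
#CHROM  POS      ID  REF  ALT  QUAL  FILTER  INFO   FORMAT  SAMPLE1  SAMPLE2<br />
20      2100000  .   A    G    48    .       DP=11  GT:GQ   0/1:40   0/0:35<br />
</source><br />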
<br />
more result/chr20.mpileup.vcf<br />
<br />
Here are some questions for you to investigate:<br />
<br />
* How many variant sites were detected in this dataset? Try a command like this one:<br />
<br />
grep -vE ^# result/chr20.mpileup.vcf | wc -l<br />
<br />
(The grep command line excludes all lines beginning with # and then the wc command counts the number of lines in the file).<br />
<br />
<!-- <br />
* How many variant sites are estimated to be singletons?<br />
!--><br />
<br />
<!--<br />
<br />
Wei: I commented this out because generating GLF requires an old version of samtools that most people don't have. Goncalo<br />
<br />
=== Alternative way to call variants ===<br />
Bcftools assumes all samples are unrelated. The better way to call variants is to consider family constraints imposed by parent-offspring trio. An alternative approach for current step is to use<br />
another program [http://genome.sph.umich.edu/wiki/Polymutt Polymutt]. It consists of two steps, but the input [bam file] and output files [vcf file] are same as above. For more information, <br />
click the link. <br />
<br />
Convert bam files to glf files (It is a single line of command)<br />
<br />
bin/samtools view -bh bams/SAMPLE1.bam | bin/samtools calmd -Abr - ref/human_g1k_v37_chr20.fa 2 - | \<br />
bin/samtools mpileup -g -f ref/human_g1k_v37_chr20.fa - > bams/SAMPLE1.glf<br />
<br />
Take glf files and output vcf files.<br />
<br />
bin/polymutt -p polymutt.ped -d polymutt.dat -g polymutt.glfindex --vcf result/chr20.polymutt.vcf<br />
<br />
!--><br />
<br />
=== Genotype Refinement Using Linkage Disequilibrium Information (TrioCaller) ===<br />
<br />
The initial set of genotype calls is generated by examining one individual at a time. These calls are typically quite good for deep sequencing data, but much less accurate for low-pass sequence data. In either case, they can be greatly improved by models that combine information across sites and individuals and consider the constraints imposed by parent-offspring trios. <br />
<br />
Note: The current version only supports SNP data, so please '''filter indels''' before running TrioCaller. It supports the VCF 4.0 and 4.1 formats, except that missing trailing fields must not be dropped (e.g. use the complete missing notation ./.:.:.:.,.,. rather than ./. in the genotype field).<br />
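One simple way to drop indels is an awk filter that keeps only sites where both REF and ALT are a single base. The sketch below runs on three made-up record lines (header handling is shown in the note underneath):<br />

```bash
# Hypothetical VCF body lines: a SNP (A>G), a deletion (AT>A), an insertion (C>CTT).
# Only records whose REF (column 4) and ALT (column 5) are one base long survive.
printf '20\t100\t.\tA\tG\n20\t200\t.\tAT\tA\n20\t300\t.\tC\tCTT\n' |
  awk -F'\t' 'length($4) == 1 && length($5) == 1'    # keeps only the A>G line
```

On a real file you would also pass the header through, e.g. <code>awk -F'\t' '/^#/ || (length($4) == 1 && length($5) == 1)' result/chr20.mpileup.vcf > result/chr20.snps.vcf</code>. Note that this simple test also discards multi-allelic sites, whose ALT field is longer than one character.<br />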
<br />
Here is a summary of the TrioCaller command line options (these are also listed whenever you run the program with no parameters):<br />
<br />
<source lang="text"><br />
Available Options<br />
Shotgun Sequences: --vcf [], --pedfile [] <br />
Markov Sampler: --seed [], --burnin [], --rounds [] <br />
Haplotyper: --states [], --errorRate [], --compact<br />
Phasing: --randomPhase , --inputPhased, --refPhased<br />
Output Files: --prefix [], --phase, --interimInterval []<br />
<br />
<br />
Explanation of Options<br />
--vcf: Standard VCF file (4.0 and above). <br />
--pedfile: Pedigree file in MERLIN format.<br />
--seed: Seed for sampling; default 123456.<br />
--burnin: The number of rounds ignored at the beginning of sampling.<br />
--rounds: The total number of sampling iterations.<br />
--states: The number of haplotypes used in the state space. The default is the maximum number: 2 * (number of founders - 1).<br />
--errorRate: The pre-defined base error rate; default 0.01.<br />
--randomPhase: The initial haplotypes are inferred from single markers. Default option.<br />
--inputPhased: The initial haplotypes are taken directly from the input VCF file (with "|" as the separator in the genotype column).<br />
--refPhased: The initial haplotypes are built from the reference alleles in the VCF file.<br />
--prefix: The prefix of the output files.<br />
--interimInterval: The number of rounds between interim outputs.<br />
</source><br />
<br />
Note: The pedigree file requires complete trio structures (all three members of each trio must exist in the file). For a parent-offspring pair, create a "fake" parent in the pedigree file or code them as unrelated individuals. The order of the names in the pedigree file does NOT need to be consistent with that in the .vcf file. The computation will be intensive if the number of samples is large. <br />
You can use the option "--states" to reduce the computation cost (e.g. start with "--states 50"). <br />
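As an illustration, a pedigree file for one trio plus one unrelated individual might look like the fragment below (whitespace-delimited MERLIN-style columns: family ID, individual ID, father, mother, sex, with 0 marking a founder's missing parents; all sample names here are hypothetical):<br />

<source lang="text"><br />
FAM1   SAMPLE1   0        0        1<br />
FAM1   SAMPLE2   0        0        2<br />
FAM1   SAMPLE3   SAMPLE1  SAMPLE2  1<br />
FAM11  SAMPLE31  0        0        2<br />
</source><br />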
<br />
To complete our example analysis, we could run:<br />
<br />
bin/TrioCaller --vcf result/chr20.mpileup.vcf --pedfile ped/triocaller.ped --states 50 --rounds 10 --prefix result/chr20.triocaller<br />
<br />
The format of the output file is the same as that of the input file. Again, you can review the contents of the updated VCF file using the more command:<br />
<br />
more result/chr20.triocaller.vcf<br />
<br />
All right. Congratulations! You have come to the end and learned basic skills for accurate genotype calling in trios.<br />
<br />
If you have any questions or comments, feel free to contact Wei Chen at [mailto:weichen.mich@gmail.com weichen.mich@gmail.com] or Goncalo Abecasis at [mailto:goncalo@umich.edu goncalo@umich.edu]<br />
<br />
== Citation ==<br />
<br />
Chen W, Li B, Zeng Z, Sanna S, Sidore C, Busonero F, Kang HM, Li Y, Abecasis GR. Genotype calling and haplotyping in parent-offspring trios. Genome Res. 2013 Jan;23(1):142-51 [http://genome.cshlp.org/content/early/2012/11/26/gr.142455.112.long LINK]</div>
<hr />
<div>== Introduction ==<br />
<br />
We will illustrate how TrioCaller works in sequence data including trios and unrelated samples. We will walk through all necessary steps to move from raw sequence data to called genotypes. <br />
If you are new to sequence data, please review every step. If you are experienced, you may directly jump to [http://genome.sph.umich.edu/wiki/TrioCaller#Genotype_Refinement_Using_Linkage_Disequilibrium_Information_.28TrioCaller.29 TrioCaller] specific section. <br />
<br />
We will start with a set of short sequence reads and associated base quality scores (stored in a fastq file), find the most likely genomic location for each read (producing a BAM file), generate an initial list of polymorphic sites and genotypes (stored in a VCF file) and use haplotype information to refine these genotypes (resulting in an updated VCF file).<br />
<br />
=== Note ===<br />
<br />
If you are interested in ''de novo'' mutations or are working on one or two families with deep sequence data (>30X), you should first consider our sister program, [http://genome.sph.umich.edu/wiki/Polymutt Polymutt], which ignores linkage disequilibrium information but can handle more complex pedigrees.<br />
<br />
=== Download ===<br />
<br />
Before downloading the program, we appreciate if you could email [mailto:weichen.mich@gmail.com weichen.mich@gmail.com]&nbsp;(Subject: TrioCaller,&nbsp;with/without a little descriptive information (e.g. Affiliation, depth, the number of samples and family structure). We will notify you if there is any update.&nbsp;<br />
<br />
'''A recent extension of TrioCaller: [http://genome.sph.umich.edu/wiki/FamLDCaller FamLDCaller] is coming soon with major updates (better processing function, handling general families and reference panels). Please try the beta version below. Contact weichen.mich@gmail.com for any questions.''' <br />
<br />
[[Binary file only:]] [http://www.sph.umich.edu/csg/weich/tmp/FamLDCaller FamLDCaller]. [Last update: 01/11/2014]<br />
<br />
<br />
'''TrioCaller''' : the version we used in the paper. <br />
<br> <br />
[[Binary file only:]] [http://www.sph.umich.edu/csg/weich/TrioCaller.06262012.binary.tgz TrioCaller.06262012.binary.tgz]. <br />
<br />
[[Binary file with example datasets&nbsp;:]] [http://www.sph.umich.edu/csg/weich/TrioCaller.06262012.tgz TrioCaller.06262012.tgz]. <br />
<br />
[http://genome.sph.umich.edu/wiki/TrioCaller:Archive Archive]. <br />
<br />
''''''An example from sequence data to genotypes.'''''' <br />
<br />
The example dataset demonstrated here is also included. Our dataset consists of 40 individuals, including 10 parent-offspring trios and 10 unrelated individuals. The average sequence depth is ~3x. README.txt describes the structure of the package. Pipeline.csh (C shell) and pipeline.bash (bash shell) are two scripts for you to run all commands listed here in batch. <br />
<br />
To conserve time and disk-space, our analysis will focus on a small region on chromosome 20 around position 2,000,000. We will first map reads for a single individual (labeled SAMPLE1). Then we combine the results with mapped reads from all individuals to generate a list of polymorphic sites and estimate accurate genotypes at each of these sites.<br />
<br />
=== Required Software ===<br />
<br />
In addition to TrioCaller, you will need BWA ([http://bio-bwa.sourceforge.net available from Sourceforge]) and samtools ([http://samtools.sourceforge.net also from Sourceforge]) installed to run this exercise. The examples are tested in in bwa 0.6.1, samtools 0.1.18, TrioCaller 0.1.1; we expect newer versions should also work. We assume all executables are in your path.<br />
<br />
== Building an Index for Short Read Alignment ==<br />
<br />
To quickly place short reads along the genome, BWA and other read mappers typically build a word index for the genome. This index lists the location of particular short words along the genome and<br />
can be used to seed and then extend particular matches.<br />
<br />
The sequence index is typically not compatible across different BWA versions. To rebuild the sequence index, issue the following commands:<br />
<br />
<source lang="bash"><br />
#Remove any earlier reference files<br />
rm ref/human_g1k_v37_chr20.fa.*<br />
<br />
#Rebuild the reference<br />
bin/bwa index -a is ref/human_g1k_v37_chr20.fa<br />
</source><br />
<br />
== Mapping Reads to The Genome ==<br />
<br />
Next, we will use BWA to find the most likely sequence location for each read using the <code>bwa aln</code> command. This command requires two parameters, one corresponding to the reference genome, the other corresponding to a fastq file containing reads to be mapped. <br />
<br />
bin/bwa aln -q 15 ref/human_g1k_v37_chr20.fa fastq/SAMPLE1.fastq > bwa.sai/SAMPLE1.sai<br />
<br />
The file SAMPLE1.fastq contains DNA sequence reads for sample SAMPLE1. <br />
<br />
A fastq file consists of a series of multi-line records. Each record starts with a read name, followed by a DNA sequencing, a separator line, and a set of per base quality scores. Base quality scores estimate the probability of error at each sequenced base (a base quality of 10 denotes an error probability of 10%, base quality 20 denotes 1% error probability and base quality 30 denotes 0.1% error probability). These error probabilities are each encoded in a single character for compactness and can be decoded using an [http://www.google.com/search?q=ascii+table ASCII table] (simply look up the ascii code for each base and subtract 33 to get base quality). By inspecting the FastQ file you should be able to learn about the length of reads being mapped and their base qualities. For example, try to figure out if base quality is typically higher at the start or end of each read...<br />
<br />
=== Converting Alignments to BAM format ===<br />
<br />
The .sai alignment format is specific to BWA, so the first thing to do is to convert the alignment to a more standard format that will be compatible with downstream analysis tools. We can do this with a combination of the <code>bwa samse</code> command and <code>samtools view</code> and <code>samtoosl sort</code> commands.<br />
<br />
bin/bwa samse -r "@RG\tID:ILLUMINA\tSM:SAMPLE1" ref/human_g1k_v37_chr20.fa bwa.sai/SAMPLE1.sai fastq/SAMPLE1.fastq | \<br />
bin/samtools view -uhS - | bin/samtools sort -m 2000000000 - bams/SAMPLE1<br />
<br />
You can check the use of parameters in the bwa manual. The result BAM file uses a compact binary format to represent the <br />
alignment of each short read to the genome. You can view the contents<br />
of the file using the <code>samtools view</code> command, like so:<br />
<br />
bin/samtools view bams/SAMPLE1.bam | more<br />
<br />
The text representation of the alignment produced by <code>samtools view</code> describes<br />
the alignment of one read per line. The most interesting fields are column 1 (the read <br />
name), columns 3 and 4 (the alignment position), column 5 (the CIGAR string, describing <br />
any gaps in the alignment), and columns 10 and 11 (with the sequence and quality score). In this representation, all alignments are automatically converted to the forward strand.<br />
<br />
=== Indexing the BAM file ===<br />
<br />
<!--<br />
Although the current file contains all necessary information about reads and their<br />
genomic locations, it is missing some auxiliary information that BAM files typically<br />
contain to help describe their contents (for example, to specify that this file contains<br />
DNA sequence reads for sample NA20589). So, the very next step is to add this information<br />
to the file:<br />
<br />
samtools reheader bams/NA20589.header bams/noheader.NA20589.bam > bams/NA20589.bam<br />
<br />
!--><br />
<br />
If you reached this far, rejoice! The mapping process is almost done. We will now create <br />
an index for the file, which makes it convenient to quickly extract reads from any <br />
genome location. We do this with the <code>samtools index</code> command, like so:<br />
<br />
bin/samtools index bams/SAMPLE1.bam<br />
<br />
=== Browsing Alignment Results ===<br />
<br />
You can now view the contents of the alignment at any location using the <code>samtools view</code><br />
and <code>samtools tview</code> commands. While the <code>tview</code> generates prettier output,<br />
it is not compatible with all screens. For example, to view reads overlapping <br />
starting at position 2,100,000 on chromosome 20, we could run:<br />
<br />
bin/samtools tview bams/SAMPLE1.bam ref/human_g1k_v37_chr20.fa<br />
<br />
Then, type "g 20:2100000"<br />
<br />
So let's recap: we have mapped reads to genome, converted them from a BWA specific format to a more <br />
commonly used format used by many different programs, sorted and indexed the results.<br />
In most cases, the next step would be to remove duplicate reads and to ensure that base quality scores are properly calibrated. To save time, we'll skip those steps now.<br />
<br />
Till now, we only finished read mapping for one sample SAMPLE1. We need to repeat this step for other samples (SAMPL2 - SAMPLE40). You can try something like:<br />
<br />
For c shell<br />
<br />
foreach file (`ls fastq/SAMPLE*.fastq | cut -f 2 -d '/' | cut -f 1 -d '.'`)<br />
echo $file<br />
bin/bwa aln -q 15 ref/human_g1k_v37_chr20.fa fastq/$file.fastq > bwa.sai/$file.sai<br />
bin/bwa samse -r "@RG\tID:ILLUMINA\tSM:$file" ref/human_g1k_v37_chr20.fa bwa.sai/$file.sai fastq/$file.fastq | \<br />
bin/samtools view -uhS - | bin/samtools sort -m 2000000000 - bams/$file<br />
bin/samtools index bams/$file.bam<br />
end<br />
<br />
For bash shell<br />
<br />
for file in `ls fastq/SAMPLE*.fastq | cut -f 2 -d '/' | cut -f 1 -d '.'`;<br />
do echo $file;<br />
bin/bwa aln -q 15 ref/human_g1k_v37_chr20.fa fastq/$file.fastq > bwa.sai/$file.sai;<br />
bin/bwa samse -r "@RG\tID:ILLUMINA\tSM:$file" ref/human_g1k_v37_chr20.fa bwa.sai/$file.sai fastq/$file.fastq | \<br />
bin/samtools view -uhS - | samtools sort -m 2000000000 - bams/$file;<br />
bin/samtools index bams/$file.bam;<br />
done<br />
<br />
<br />
Once we finish the read mapping step and generate bam files for all samples, we can step to variant calling and genotype inference.<br />
<br />
== Calling variants and Inferring genotypes ==<br />
<br />
=== Initial set of variant calls ===<br />
<br />
You probably thought the initial mapping process was quite convoluted ... you'll be glad to know that<br />
the next few steps are much simpler.<br />
<br />
The first thing we'll do is use samtools to generate an initial list of variant sites, using the <code>mpileup</code> command. This command looks at the bases aligned to each location and flags locations that are likely to vary. By default, the results are stored in BCF file, which can be converted into the more widely used VCF format using bcftools (a companion set of tools distributed with samtools).<br />
<br />
<br />
bin/samtools mpileup -Iuf ref/human_g1k_v37_chr20.fa bams/SAMPLE*bam | bin/bcftools view -bvcg - > result/chr20.mpileup.bcf<br />
<br />
bin/bcftools view result/chr20.mpileup.bcf > result/chr20.mpileup.vcf<br />
<br />
<br />
The [http://www.1000genomes.org/node/101 VCF format] is a simple text format. It starts with several header lines, which all start with the two '##' characters, and is followed by a single line per marker that provides both summary information about the marker and genotypes for each individual. You can review the contents of the VCF file using the 'more' command:<br />
<br />
more result/chr20.mpileup.vcf<br />
<br />
Here are some questions for you to investigate:<br />
<br />
* How many variant sites were detected in this dataset? Try a command like this one:<br />
<br />
grep -vE ^# result/chr20.mpileup.vcf | wc -l<br />
<br />
(The grep command line excludes all lines beginning with # and then the wc command counts the number of lines in the file).<br />
<br />
<!-- <br />
* How many variant sites are estimated to be singletons?<br />
!--><br />
<br />
<!--<br />
<br />
Wei: I commented this out because generating GLF requires an old version of samtools that most people don't have. Goncalo<br />
<br />
=== Alternative way to call variants ===<br />
Bcftools assumes all samples are unrelated. The better way to call variants is to consider family constraints imposed by parent-offspring trio. An alternative approach for current step is to use<br />
another program [http://genome.sph.umich.edu/wiki/Polymutt Polymutt]. It consists of two steps, but the input [bam file] and output files [vcf file] are same as above. For more information, <br />
click the link. <br />
<br />
Convert bam files to glf files (It is a single line of command)<br />
<br />
bin/samtools view -bh bams/SAMPLE1.bam | bin/samtools calmd -Abr - ref/human_g1k_v37_chr20.fa 2 - | \<br />
bin/samtools mpileup -g -f ref/human_g1k_v37_chr20.fa - > bams/SAMPLE1.glf<br />
<br />
Take glf files and output vcf files.<br />
<br />
bin/polymutt -p polymutt.ped -d polymutt.dat -g polymutt.glfindex --vcf result/chr20.polymutt.vcf<br />
<br />
!--><br />
<br />
=== Genotype Refinement Using Linkage Disequilibrium Information (TrioCaller) ===<br />
<br />
The initial set of genotype calls is generated examining a single individual at a time. These calls are typically quite good for deep sequencing data, but much less accurate for low pass sequence data. In either case, they can be greatly improved by models that combine information across sites and individuals and consider the contraints imposed by parent-offspring trios. <br />
<br />
Note: The current version only supports SNP data, so please '''filter indels''' before running TrioCaller. It supports VCF 4.0 and 4.1 formats with the '''exception of dropped missing trailing fields''' (e.g. use complete missing notation ./.:.:.:.,.,. rather than ./. for the genotype field)<br />
<br />
Here is a summary of the TrioCaller command line options (these are also listed whenever you run the program with no parameters):<br />
<br />
<source lang="text"><br />
Available Options<br />
Shotgun Sequences: --vcf [], --pedfile [] <br />
Markov Sampler: --seed [], --burnin [], --rounds [] <br />
Haplotyper: --states [], --errorRate [], --compact<br />
Phasing: --randomPhase , --inputPhased, --refPhased<br />
Output Files: --prefix [], --phase, --interimInterval []<br />
<br />
<br />
Explanation of Options<br />
--vcf: Standard VCF file (4.0 and above). <br />
--pedfile: Pedigree file in MERLIN format.<br />
--seed: Seed for sampling, default 123456.<br />
--burnin: The number of rounds ignored at the beginning of sampling.<br />
--rounds: The total number of iterations.<br />
--states: The number of haplotypes used in the state space. The default is the maximum number: 2*(number of founders - 1).<br />
--errorRate: The pre-defined base error rate. Default 0.01.<br />
--randomPhase: The initial haplotypes are inferred from single-marker genotypes. Default option.<br />
--inputPhased: The initial haplotypes are taken directly from the input VCF file (with "|" as the separator in the genotype column).<br />
--refPhased: The initial haplotypes are built from the reference alleles in the VCF file.<br />
--prefix: The prefix of the output files.<br />
--interimInterval: The number of rounds between interim outputs.<br />
</source><br />
<br />
Note: The pedigree file requires complete trio structures (all three members of each trio must exist in the file). For a parent-offspring pair, create a "fake" parent in the pedigree file or code the two samples as unrelated individuals. The order of names in the pedigree file does NOT need to be consistent with that in the VCF file. The computation will be intensive if the number of samples is large. <br />
You can use the "--states" option to reduce the computation cost (e.g. start with "--states 50").<br />
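The pedigree layout described above can be sketched as follows. The family and sample names are hypothetical, and the columns follow the usual MERLIN convention (family, individual, father, mother, sex; 1 = male, 2 = female, 0 = unknown/founder parent). FAM1 is a complete trio; FAM2 is a parent-offspring pair completed with a "fake" father:<br />

```shell
# Write a sketch of a MERLIN-format pedigree file (names are hypothetical).
{
  printf 'FAM1\tSAMPLE1\t0\t0\t1\n'              # father (founder)
  printf 'FAM1\tSAMPLE2\t0\t0\t2\n'              # mother (founder)
  printf 'FAM1\tSAMPLE3\tSAMPLE1\tSAMPLE2\t1\n'  # offspring
  printf 'FAM2\tFAKE1\t0\t0\t1\n'                # fake father completing the trio
  printf 'FAM2\tSAMPLE4\t0\t0\t2\n'              # real mother
  printf 'FAM2\tSAMPLE5\tFAKE1\tSAMPLE4\t2\n'    # offspring
} > /tmp/triocaller.ped

wc -l < /tmp/triocaller.ped   # prints 6 (three lines per complete trio)
```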
<br />
To complete our example analysis, we could run:<br />
<br />
bin/TrioCaller --vcf result/chr20.mpileup.vcf --pedfile ped/triocaller.ped --states 50 --rounds 10 --prefix result/chr20.triocaller<br />
<br />
The format of the output file is the same as that of the input file. Again, you can review the contents of the updated VCF file using the more command:<br />
<br />
more result/chr20.triocaller.vcf<br />
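As a quick sanity check on the refined calls, you can count the variant records and the phased genotype fields (assuming, as the --phase option suggests, that refined genotypes use "|" to separate the two alleles). The sketch below runs on a tiny synthetic stand-in rather than the real result/chr20.triocaller.vcf:<br />

```shell
# A tiny, synthetic stand-in for a refined VCF, purely for illustration.
{
  printf '##fileformat=VCFv4.1\n'
  printf '#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\tFORMAT\tS1\tS2\n'
  printf '20\t2000001\t.\tA\tG\t50\tPASS\t.\tGT\t0|1\t1|1\n'
  printf '20\t2000005\t.\tC\tT\t50\tPASS\t.\tGT\t0|0\t0|1\n'
} > /tmp/refined.vcf

# Number of variant records:
grep -vc '^#' /tmp/refined.vcf                               # prints 2
# Number of phased genotype calls ("|" separating the two alleles):
grep -v '^#' /tmp/refined.vcf | grep -o '[01]|[01]' | wc -l  # prints 4
```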
<br />
All right. Congratulations! You have reached the end and learned the basic skills for accurate genotype calling in trios.<br />
<br />
If you have any questions or comments, feel free to contact Wei Chen at [mailto:weichen.mich@gmail.com weichen.mich@gmail.com] or Goncalo Abecasis at [mailto:goncalo@umich.edu goncalo@umich.edu].<br />
<br />
== Citation ==<br />
<br />
Chen W, Li B, Zeng Z, Sanna S, Sidore C, Busonero F, Kang HM, Li Y, Abecasis GR. &nbsp;Genotype calling and haplotyping in parent-offspring trios. Genome Res.&nbsp;2013 Jan;23(1):142-51&nbsp;[http://genome.cshlp.org/content/early/2012/11/26/gr.142455.112.long LINK]</div>Weichhttp://genome.sph.umich.edu/w/index.php?title=TrioCaller&diff=9889TrioCaller2014-04-03T19:03:19Z<p>Weich: /* Download */</p>
<hr />
<div>== Introduction ==<br />
<br />
We will illustrate how TrioCaller works in sequence data including trios and unrelated samples. We will walk through all necessary steps to move from raw sequence data to called genotypes. <br />
If you are new to sequence data, please review every step. If you are experienced, you may directly jump to [http://genome.sph.umich.edu/wiki/TrioCaller#Genotype_Refinement_Using_Linkage_Disequilibrium_Information_.28TrioCaller.29 TrioCaller] specific section. <br />
<br />
We will start with a set of short sequence reads and associated base quality scores (stored in a fastq file), find the most likely genomic location for each read (producing a BAM file), generate an initial list of polymorphic sites and genotypes (stored in a VCF file) and use haplotype information to refine these genotypes (resulting in an updated VCF file).<br />
<br />
=== Note ===<br />
<br />
If you are interested in ''de novo'' mutations or are working on one or two families with deep sequence data (>30X), you should first consider our sister program, [http://genome.sph.umich.edu/wiki/Polymutt Polymutt], which ignores linkage disequilibrium information but can handle more complex pedigrees.<br />
<br />
=== Download ===<br />
<br />
Before downloading the program, we appreciate if you could email [mailto:weichen.mich@gmail.com weichen.mich@gmail.com]&nbsp;(Subject: TrioCaller,&nbsp;with/without a little descriptive information (e.g. Affiliation, depth, the number of samples and family structure). We will notify you if there is any update.&nbsp;<br />
<br />
'''A recent extension of TrioCaller: [http://genome.sph.umich.edu/wiki/FamLDCaller FamLDCaller] is coming soon with major updates (better processing function, handling general families and reference panels). Please try the beta version below. Contact weichen.mich@gmail.com for any questions.''' <br />
<br />
[[Binary file only:]] [http://www.sph.umich.edu/csg/weich/tmp/FamLDCaller FamLDCaller]. [Last update: 01/11/2014]<br />
<br />
<br />
'''TrioCaller''' : the version we used in the paper. <br />
<br> <br />
[[Binary file only:]] [http://www.sph.umich.edu/csg/weich/TrioCaller.06262012.binary.tgz TrioCaller.06262012.binary.tgz]. <br />
<br />
[[Binary file with example datasets&nbsp;:]] [http://www.sph.umich.edu/csg/weich/TrioCaller.06262012.tgz TrioCaller.06262012.tgz]. <br />
<br />
[http://genome.sph.umich.edu/wiki/TrioCaller:Archive Archive]. <br />
<br />
'''An example from sequence data to genotypes.''' <br />
<br />
The example dataset demonstrated here is also included. Our dataset consists of 40 individuals, including 10 parent-offspring trios and 10 unrelated individuals. The average sequence depth is ~3x. README.txt describes the structure of the package. Pipeline.csh (C shell) and pipeline.bash (bash shell) are two scripts for you to run all commands listed here in batch. <br />
<br />
To conserve time and disk-space, our analysis will focus on a small region on chromosome 20 around position 2,000,000. We will first map reads for a single individual (labeled SAMPLE1). Then we combine the results with mapped reads from all individuals to generate a list of polymorphic sites and estimate accurate genotypes at each of these sites.<br />
<br />
=== Required Software ===<br />
<br />
In addition to TrioCaller, you will need BWA ([http://bio-bwa.sourceforge.net available from Sourceforge]) and samtools ([http://samtools.sourceforge.net also from Sourceforge]) installed to run this exercise. The examples are tested in in bwa 0.6.1, samtools 0.1.18, TrioCaller 0.1.1; we expect newer versions should also work. We assume all executables are in your path.<br />
<br />
== Building an Index for Short Read Alignment ==<br />
<br />
To quickly place short reads along the genome, BWA and other read mappers typically build a word index for the genome. This index lists the location of particular short words along the genome and<br />
can be used to seed and then extend particular matches.<br />
<br />
The sequence index is typically not compatible across different BWA versions. To rebuild the sequence index, issue the following commands:<br />
<br />
<source lang="bash"><br />
#Remove any earlier reference files<br />
rm ref/human_g1k_v37_chr20.fa.*<br />
<br />
#Rebuild the reference<br />
bin/bwa index -a is ref/human_g1k_v37_chr20.fa<br />
</source><br />
<br />
== Mapping Reads to The Genome ==<br />
<br />
Next, we will use BWA to find the most likely sequence location for each read using the <code>bwa aln</code> command. This command requires two parameters, one corresponding to the reference genome, the other corresponding to a fastq file containing reads to be mapped. <br />
<br />
bin/bwa aln -q 15 ref/human_g1k_v37_chr20.fa fastq/SAMPLE1.fastq > bwa.sai/SAMPLE1.sai<br />
<br />
The file SAMPLE1.fastq contains DNA sequence reads for sample SAMPLE1. <br />
<br />
A fastq file consists of a series of multi-line records. Each record starts with a read name, followed by a DNA sequencing, a separator line, and a set of per base quality scores. Base quality scores estimate the probability of error at each sequenced base (a base quality of 10 denotes an error probability of 10%, base quality 20 denotes 1% error probability and base quality 30 denotes 0.1% error probability). These error probabilities are each encoded in a single character for compactness and can be decoded using an [http://www.google.com/search?q=ascii+table ASCII table] (simply look up the ascii code for each base and subtract 33 to get base quality). By inspecting the FastQ file you should be able to learn about the length of reads being mapped and their base qualities. For example, try to figure out if base quality is typically higher at the start or end of each read...<br />
<br />
=== Converting Alignments to BAM format ===<br />
<br />
The .sai alignment format is specific to BWA, so the first thing to do is to convert the alignment to a more standard format that will be compatible with downstream analysis tools. We can do this with a combination of the <code>bwa samse</code> command and <code>samtools view</code> and <code>samtoosl sort</code> commands.<br />
<br />
bin/bwa samse -r "@RG\tID:ILLUMINA\tSM:SAMPLE1" ref/human_g1k_v37_chr20.fa bwa.sai/SAMPLE1.sai fastq/SAMPLE1.fastq | \<br />
bin/samtools view -uhS - | bin/samtools sort -m 2000000000 - bams/SAMPLE1<br />
<br />
You can check the use of parameters in the bwa manual. The result BAM file uses a compact binary format to represent the <br />
alignment of each short read to the genome. You can view the contents<br />
of the file using the <code>samtools view</code> command, like so:<br />
<br />
bin/samtools view bams/SAMPLE1.bam | more<br />
<br />
The text representation of the alignment produced by <code>samtools view</code> describes<br />
the alignment of one read per line. The most interesting fields are column 1 (the read <br />
name), columns 3 and 4 (the alignment position), column 5 (the CIGAR string, describing <br />
any gaps in the alignment), and columns 10 and 11 (with the sequence and quality score). In this representation, all alignments are automatically converted to the forward strand.<br />
<br />
=== Indexing the BAM file ===<br />
<br />
<!--<br />
Although the current file contains all necessary information about reads and their<br />
genomic locations, it is missing some auxiliary information that BAM files typically<br />
contain to help describe their contents (for example, to specify that this file contains<br />
DNA sequence reads for sample NA20589). So, the very next step is to add this information<br />
to the file:<br />
<br />
samtools reheader bams/NA20589.header bams/noheader.NA20589.bam > bams/NA20589.bam<br />
<br />
!--><br />
<br />
If you reached this far, rejoice! The mapping process is almost done. We will now create <br />
an index for the file, which makes it convenient to quickly extract reads from any <br />
genome location. We do this with the <code>samtools index</code> command, like so:<br />
<br />
bin/samtools index bams/SAMPLE1.bam<br />
<br />
=== Browsing Alignment Results ===<br />
<br />
You can now view the contents of the alignment at any location using the <code>samtools view</code><br />
and <code>samtools tview</code> commands. While the <code>tview</code> generates prettier output,<br />
it is not compatible with all screens. For example, to view reads overlapping <br />
starting at position 2,100,000 on chromosome 20, we could run:<br />
<br />
bin/samtools tview bams/SAMPLE1.bam ref/human_g1k_v37_chr20.fa<br />
<br />
Then, type "g 20:2100000"<br />
<br />
So let's recap: we have mapped reads to genome, converted them from a BWA specific format to a more <br />
commonly used format used by many different programs, sorted and indexed the results.<br />
In most cases, the next step would be to remove duplicate reads and to ensure that base quality scores are properly calibrated. To save time, we'll skip those steps now.<br />
<br />
Till now, we only finished read mapping for one sample SAMPLE1. We need to repeat this step for other samples (SAMPL2 - SAMPLE40). You can try something like:<br />
<br />
For c shell<br />
<br />
foreach file (`ls fastq/SAMPLE*.fastq | cut -f 2 -d '/' | cut -f 1 -d '.'`)<br />
echo $file<br />
bin/bwa aln -q 15 ref/human_g1k_v37_chr20.fa fastq/$file.fastq > bwa.sai/$file.sai<br />
bin/bwa samse -r "@RG\tID:ILLUMINA\tSM:$file" ref/human_g1k_v37_chr20.fa bwa.sai/$file.sai fastq/$file.fastq | \<br />
bin/samtools view -uhS - | bin/samtools sort -m 2000000000 - bams/$file<br />
bin/samtools index bams/$file.bam<br />
end<br />
<br />
For bash shell<br />
<br />
for file in `ls fastq/SAMPLE*.fastq | cut -f 2 -d '/' | cut -f 1 -d '.'`;<br />
do echo $file;<br />
bin/bwa aln -q 15 ref/human_g1k_v37_chr20.fa fastq/$file.fastq > bwa.sai/$file.sai;<br />
bin/bwa samse -r "@RG\tID:ILLUMINA\tSM:$file" ref/human_g1k_v37_chr20.fa bwa.sai/$file.sai fastq/$file.fastq | \<br />
bin/samtools view -uhS - | samtools sort -m 2000000000 - bams/$file;<br />
bin/samtools index bams/$file.bam;<br />
done<br />
<br />
<br />
Once we finish the read mapping step and generate bam files for all samples, we can step to variant calling and genotype inference.<br />
<br />
== Calling variants and Inferring genotypes ==<br />
<br />
=== Initial set of variant calls ===<br />
<br />
You probably thought the initial mapping process was quite convoluted ... you'll be glad to know that<br />
the next few steps are much simpler.<br />
<br />
The first thing we'll do is use samtools to generate an initial list of variant sites, using the <code>mpileup</code> command. This command looks at the bases aligned to each location and flags locations that are likely to vary. By default, the results are stored in BCF file, which can be converted into the more widely used VCF format using bcftools (a companion set of tools distributed with samtools).<br />
<br />
<br />
bin/samtools mpileup -Iuf ref/human_g1k_v37_chr20.fa bams/SAMPLE*bam | bin/bcftools view -bvcg - > result/chr20.mpileup.bcf<br />
<br />
bin/bcftools view result/chr20.mpileup.bcf > result/chr20.mpileup.vcf<br />
<br />
<br />
The [http://www.1000genomes.org/node/101 VCF format] is a simple text format. It starts with several header lines, which all start with the two '##' characters, and is followed by a single line per marker that provides both summary information about the marker and genotypes for each individual. You can review the contents of the VCF file using the 'more' command:<br />
<br />
more result/chr20.mpileup.vcf<br />
<br />
Here are some questions for you to investigate:<br />
<br />
* How many variant sites were detected in this dataset? Try a command like this one:<br />
<br />
grep -vE ^# result/chr20.mpileup.vcf | wc -l<br />
<br />
(The grep command line excludes all lines beginning with # and then the wc command counts the number of lines in the file).<br />
<br />
<!-- <br />
* How many variant sites are estimated to be singletons?<br />
!--><br />
<br />
<!--<br />
<br />
Wei: I commented this out because generating GLF requires an old version of samtools that most people don't have. Goncalo<br />
<br />
=== Alternative way to call variants ===<br />
Bcftools assumes all samples are unrelated. The better way to call variants is to consider family constraints imposed by parent-offspring trio. An alternative approach for current step is to use<br />
another program [http://genome.sph.umich.edu/wiki/Polymutt Polymutt]. It consists of two steps, but the input [bam file] and output files [vcf file] are same as above. For more information, <br />
click the link. <br />
<br />
Convert bam files to glf files (It is a single line of command)<br />
<br />
bin/samtools view -bh bams/SAMPLE1.bam | bin/samtools calmd -Abr - ref/human_g1k_v37_chr20.fa 2 - | \<br />
bin/samtools mpileup -g -f ref/human_g1k_v37_chr20.fa - > bams/SAMPLE1.glf<br />
<br />
Take glf files and output vcf files.<br />
<br />
bin/polymutt -p polymutt.ped -d polymutt.dat -g polymutt.glfindex --vcf result/chr20.polymutt.vcf<br />
<br />
!--><br />
<br />
=== Genotype Refinement Using Linkage Disequilibrium Information (TrioCaller) ===<br />
<br />
The initial set of genotype calls is generated examining a single individual at a time. These calls are typically quite good for deep sequencing data, but much less accurate for low pass sequence data. In either case, they can be greatly improved by models that combine information across sites and individuals and consider the contraints imposed by parent-offspring trios. <br />
<br />
Note: The current version only supports SNP data, so please '''filter indels''' before running TrioCaller. It supports VCF 4.0 and 4.1 formats with the '''exception of dropped missing trailing fields''' (e.g. use complete missing notation ./.:.:.:.,.,. rather than ./. for the genotype field)<br />
<br />
Here is a summary of the TrioCaller command line options (these are also listed whenever you run the program with no parameters):<br />
<br />
<source lang="text"><br />
Available Options<br />
Shotgun Sequences: --vcf [], --pedfile [] <br />
Markov Sampler: --seed [], --burnin [], --rounds [] <br />
Haplotyper: --states [], --errorRate [], --compact<br />
Phasing: --randomPhase , --inputPhased, --refPhased<br />
Output Files: --prefix [], --phase, --interimInterval []<br />
<br />
<br />
Explanation of Options<br />
--vcf: Standard VCF file (4.0 and above). <br />
--pedfile: Pedigree file in MERLIN format.<br />
--seed: Seed for sampling, default 123456.<br />
--burnin: The number of rounds ignored at the beginning of sampling.<br />
--rounds: The total number of iterations.<br />
--states: The number of haplotyes used in the state space. The default is the maximum number: 2*(number of founders -1).<br />
--errorRate: The pre-defined base error rate. Default 0.01.<br />
--randomPhase: The initial haplotypes are inferred from the single marker. Default option.<br />
--inputPhased: The initial haplotypes are directly from input VCF file (with "|" as separator in the genotype column).<br />
--refPhased: The initial haplotypes are built on reference alleles from VCF file.<br />
--prefix: The prefix of output file <br />
--interimInterval: The number of rounds for interim outputs<br />
</source><br />
<br />
Note: The pedigree files require complete trio structures (all three members of the trio exist in the file). For parent-offspring pair, create a "fake" parent in the pedigree file or code them as unrelated individuals. The order of the names in the pedigree file is NOT necessary to be consistent with that in .vcf file. The computation will be intensive if the number of samples are large. <br />
You can use option "--states" to reduce the computation cost (e.g. start with "--states 50") <br />
<br />
To complete our example analysis, we could run:<br />
<br />
bin/TrioCaller --vcf result/chr20.mpileup.vcf --pedfile ped/triocaller.ped --states 50 --rounds 10 --prefix result/chr20.triocaller<br />
<br />
The format of output file is same as the input file. Again, you can review the contents of the updated VCF file using the more command:<br />
<br />
more result/chr20.triocaller.vcf<br />
<br />
All right. Congratulations! You have come to the end and learned basic skills for accurate genotype calling in trios.<br />
<br />
If you have any question or comment, feel free to contact Wei Chen at [mailto:weichen.mich@gmail.com weichen.mich@gmail.com] or Goncalo Abecasis at [mailto:goncalo@umich.edu goncalo@umich.edu]<br />
<br />
== Citation ==<br />
<br />
Chen W, Li B, Zeng Z, Sanna S, Sidore C, Busonero F, Kang HM, Li Y, Abecasis GR. &nbsp;Genotype calling and haplotyping in parent-offspring trios. Genome Res.&nbsp;2013 Jan;23(1):142-51&nbsp;[http://genome.cshlp.org/content/early/2012/11/26/gr.142455.112.long LINK]</div>Weichhttp://genome.sph.umich.edu/w/index.php?title=TrioCaller&diff=9352TrioCaller2014-01-12T21:18:29Z<p>Weich: /* Download */</p>
<hr />
<div>== Introduction ==<br />
<br />
We will illustrate how TrioCaller works in sequence data including trios and unrelated samples. We will walk through all necessary steps to move from raw sequence data to called genotypes. <br />
If you are new to sequence data, please review every step. If you are experienced, you may directly jump to [http://genome.sph.umich.edu/wiki/TrioCaller#Genotype_Refinement_Using_Linkage_Disequilibrium_Information_.28TrioCaller.29 TrioCaller] specific section. <br />
<br />
We will start with a set of short sequence reads and associated base quality scores (stored in a fastq file), find the most likely genomic location for each read (producing a BAM file), generate an initial list of polymorphic sites and genotypes (stored in a VCF file) and use haplotype information to refine these genotypes (resulting in an updated VCF file).<br />
<br />
=== Note ===<br />
<br />
If you are interested in ''de novo'' mutations or are working on one or two families with deep sequence data (>30X), you should first consider our sister program, [http://genome.sph.umich.edu/wiki/Polymutt Polymutt], which ignores linkage disequilibrium information but can handle more complex pedigrees.<br />
<br />
=== Download ===<br />
<br />
Before downloading the program, we appreciate if you could email [mailto:weichen.mich@gmail.com weichen.mich@gmail.com]&nbsp;(Subject: TrioCaller,&nbsp;with/without a little descriptive information (e.g. Affiliation, depth, the number of samples and family structure). We will notify you if there is any update.&nbsp;<br />
<br />
'''A recent extension of TrioCaller: [http://genome.sph.umich.edu/wiki/FamLDCaller FamLDCaller] is coming soon with major updates (better processing function, handling general families and reference panels). Please try the beta version below. Contact weichen.mich@gmail.com for any questions.''' <br />
<br />
[[Binary file only:]] [http://www.sph.umich.edu/csg/weich/tmp/FamLDCaller FamLDCaller]. [Last update: 01/11/2014]<br />
<br />
<br />
'''TrioCaller''' : the version we used in the paper. <br />
<br> <br />
[[Binary file only:]] [http://www.sph.umich.edu/csg/weich/TrioCaller.06262012.binary.tgz TrioCaller.06262012.binary.tgz]. <br />
<br />
[[Binary file with example datasets&nbsp;:]] [http://www.sph.umich.edu/csg/weich/TrioCaller.06262012.tgz TrioCaller.06262012.tgz]. <br />
<br />
[http://genome.sph.umich.edu/wiki/TrioCaller:Archive Archive]. <br />
<br />
The example dataset demonstrated here is also included. Our dataset consists of 40 individuals, including 10 parent-offspring trios and 10 unrelated individuals. The average sequence depth is ~3x. README.txt describes the structure of the package. Pipeline.csh (C shell) and pipeline.bash (bash shell) are two scripts for you to run all commands listed here in batch. <br />
<br />
To conserve time and disk-space, our analysis will focus on a small region on chromosome 20 around position 2,000,000. We will first map reads for a single individual (labeled SAMPLE1). Then we combine the results with mapped reads from all individuals to generate a list of polymorphic sites and estimate accurate genotypes at each of these sites.<br />
<br />
=== Required Software ===<br />
<br />
In addition to TrioCaller, you will need BWA ([http://bio-bwa.sourceforge.net available from Sourceforge]) and samtools ([http://samtools.sourceforge.net also from Sourceforge]) installed to run this exercise. The examples are tested in in bwa 0.6.1, samtools 0.1.18, TrioCaller 0.1.1; we expect newer versions should also work. We assume all executables are in your path.<br />
<br />
== Building an Index for Short Read Alignment ==<br />
<br />
To quickly place short reads along the genome, BWA and other read mappers typically build a word index for the genome. This index lists the location of particular short words along the genome and<br />
can be used to seed and then extend particular matches.<br />
<br />
The sequence index is typically not compatible across different BWA versions. To rebuild the sequence index, issue the following commands:<br />
<br />
<source lang="bash"><br />
#Remove any earlier reference files<br />
rm ref/human_g1k_v37_chr20.fa.*<br />
<br />
#Rebuild the reference<br />
bin/bwa index -a is ref/human_g1k_v37_chr20.fa<br />
</source><br />
<br />
== Mapping Reads to The Genome ==<br />
<br />
Next, we will use BWA to find the most likely sequence location for each read using the <code>bwa aln</code> command. This command requires two parameters, one corresponding to the reference genome, the other corresponding to a fastq file containing reads to be mapped. <br />
<br />
bin/bwa aln -q 15 ref/human_g1k_v37_chr20.fa fastq/SAMPLE1.fastq > bwa.sai/SAMPLE1.sai<br />
<br />
The file SAMPLE1.fastq contains DNA sequence reads for sample SAMPLE1. <br />
<br />
A fastq file consists of a series of multi-line records. Each record starts with a read name, followed by a DNA sequencing, a separator line, and a set of per base quality scores. Base quality scores estimate the probability of error at each sequenced base (a base quality of 10 denotes an error probability of 10%, base quality 20 denotes 1% error probability and base quality 30 denotes 0.1% error probability). These error probabilities are each encoded in a single character for compactness and can be decoded using an [http://www.google.com/search?q=ascii+table ASCII table] (simply look up the ascii code for each base and subtract 33 to get base quality). By inspecting the FastQ file you should be able to learn about the length of reads being mapped and their base qualities. For example, try to figure out if base quality is typically higher at the start or end of each read...<br />
<br />
=== Converting Alignments to BAM format ===<br />
<br />
The .sai alignment format is specific to BWA, so the first thing to do is to convert the alignment to a more standard format that will be compatible with downstream analysis tools. We can do this with a combination of the <code>bwa samse</code> command and <code>samtools view</code> and <code>samtoosl sort</code> commands.<br />
<br />
bin/bwa samse -r "@RG\tID:ILLUMINA\tSM:SAMPLE1" ref/human_g1k_v37_chr20.fa bwa.sai/SAMPLE1.sai fastq/SAMPLE1.fastq | \<br />
bin/samtools view -uhS - | bin/samtools sort -m 2000000000 - bams/SAMPLE1<br />
<br />
You can check the use of parameters in the bwa manual. The result BAM file uses a compact binary format to represent the <br />
alignment of each short read to the genome. You can view the contents<br />
of the file using the <code>samtools view</code> command, like so:<br />
<br />
bin/samtools view bams/SAMPLE1.bam | more<br />
<br />
The text representation of the alignment produced by <code>samtools view</code> describes<br />
the alignment of one read per line. The most interesting fields are column 1 (the read <br />
name), columns 3 and 4 (the alignment position), column 5 (the CIGAR string, describing <br />
any gaps in the alignment), and columns 10 and 11 (with the sequence and quality score). In this representation, all alignments are automatically converted to the forward strand.<br />
<br />
=== Indexing the BAM file ===<br />
<br />
<!--<br />
Although the current file contains all necessary information about reads and their<br />
genomic locations, it is missing some auxiliary information that BAM files typically<br />
contain to help describe their contents (for example, to specify that this file contains<br />
DNA sequence reads for sample NA20589). So, the very next step is to add this information<br />
to the file:<br />
<br />
samtools reheader bams/NA20589.header bams/noheader.NA20589.bam > bams/NA20589.bam<br />
<br />
!--><br />
<br />
If you reached this far, rejoice! The mapping process is almost done. We will now create <br />
an index for the file, which makes it convenient to quickly extract reads from any <br />
genome location. We do this with the <code>samtools index</code> command, like so:<br />
<br />
bin/samtools index bams/SAMPLE1.bam<br />
<br />
=== Browsing Alignment Results ===<br />
<br />
You can now view the contents of the alignment at any location using the <code>samtools view</code><br />
and <code>samtools tview</code> commands. While the <code>tview</code> generates prettier output,<br />
it is not compatible with all screens. For example, to view reads overlapping <br />
starting at position 2,100,000 on chromosome 20, we could run:<br />
<br />
bin/samtools tview bams/SAMPLE1.bam ref/human_g1k_v37_chr20.fa<br />
<br />
Then, type "g 20:2100000"<br />
<br />
So let's recap: we have mapped reads to genome, converted them from a BWA specific format to a more <br />
commonly used format used by many different programs, sorted and indexed the results.<br />
In most cases, the next step would be to remove duplicate reads and to ensure that base quality scores are properly calibrated. To save time, we'll skip those steps now.<br />
<br />
Till now, we only finished read mapping for one sample SAMPLE1. We need to repeat this step for other samples (SAMPL2 - SAMPLE40). You can try something like:<br />
<br />
For c shell<br />
<br />
foreach file (`ls fastq/SAMPLE*.fastq | cut -f 2 -d '/' | cut -f 1 -d '.'`)<br />
echo $file<br />
bin/bwa aln -q 15 ref/human_g1k_v37_chr20.fa fastq/$file.fastq > bwa.sai/$file.sai<br />
bin/bwa samse -r "@RG\tID:ILLUMINA\tSM:$file" ref/human_g1k_v37_chr20.fa bwa.sai/$file.sai fastq/$file.fastq | \<br />
bin/samtools view -uhS - | bin/samtools sort -m 2000000000 - bams/$file<br />
bin/samtools index bams/$file.bam<br />
end<br />
<br />
For bash shell<br />
<br />
for file in `ls fastq/SAMPLE*.fastq | cut -f 2 -d '/' | cut -f 1 -d '.'`;<br />
do echo $file;<br />
bin/bwa aln -q 15 ref/human_g1k_v37_chr20.fa fastq/$file.fastq > bwa.sai/$file.sai;<br />
bin/bwa samse -r "@RG\tID:ILLUMINA\tSM:$file" ref/human_g1k_v37_chr20.fa bwa.sai/$file.sai fastq/$file.fastq | \<br />
bin/samtools view -uhS - | samtools sort -m 2000000000 - bams/$file;<br />
bin/samtools index bams/$file.bam;<br />
done<br />
<br />
<br />
Once we finish the read mapping step and generate bam files for all samples, we can step to variant calling and genotype inference.<br />
<br />
== Calling variants and Inferring genotypes ==<br />
<br />
=== Initial set of variant calls ===<br />
<br />
You probably thought the initial mapping process was quite convoluted ... you'll be glad to know that<br />
the next few steps are much simpler.<br />
<br />
The first thing we'll do is use samtools to generate an initial list of variant sites, using the <code>mpileup</code> command. This command looks at the bases aligned to each location and flags locations that are likely to vary. By default, the results are stored in BCF file, which can be converted into the more widely used VCF format using bcftools (a companion set of tools distributed with samtools).<br />
<br />
<br />
bin/samtools mpileup -Iuf ref/human_g1k_v37_chr20.fa bams/SAMPLE*bam | bin/bcftools view -bvcg - > result/chr20.mpileup.bcf<br />
<br />
bin/bcftools view result/chr20.mpileup.bcf > result/chr20.mpileup.vcf<br />
<br />
<br />
The [http://www.1000genomes.org/node/101 VCF format] is a simple text format. It starts with several header lines, which all start with the two '##' characters, and is followed by a single line per marker that provides both summary information about the marker and genotypes for each individual. You can review the contents of the VCF file using the 'more' command:<br />
<br />
more result/chr20.mpileup.vcf<br />
<br />
Here are some questions for you to investigate:<br />
<br />
* How many variant sites were detected in this dataset? Try a command like this one:<br />
<br />
grep -vE ^# result/chr20.mpileup.vcf | wc -l<br />
<br />
(The grep command line excludes all lines beginning with # and then the wc command counts the number of lines in the file).<br />
<br />
<!-- <br />
* How many variant sites are estimated to be singletons?<br />
!--><br />
<br />
<!--<br />
<br />
Wei: I commented this out because generating GLF requires an old version of samtools that most people don't have. Goncalo<br />
<br />
=== Alternative way to call variants ===<br />
Bcftools assumes all samples are unrelated. The better way to call variants is to consider family constraints imposed by parent-offspring trio. An alternative approach for current step is to use<br />
another program [http://genome.sph.umich.edu/wiki/Polymutt Polymutt]. It consists of two steps, but the input [bam file] and output files [vcf file] are same as above. For more information, <br />
click the link. <br />
<br />
Convert bam files to glf files (It is a single line of command)<br />
<br />
bin/samtools view -bh bams/SAMPLE1.bam | bin/samtools calmd -Abr - ref/human_g1k_v37_chr20.fa 2 - | \<br />
bin/samtools mpileup -g -f ref/human_g1k_v37_chr20.fa - > bams/SAMPLE1.glf<br />
<br />
Take glf files and output vcf files.<br />
<br />
bin/polymutt -p polymutt.ped -d polymutt.dat -g polymutt.glfindex --vcf result/chr20.polymutt.vcf<br />
<br />
!--><br />
<br />
=== Genotype Refinement Using Linkage Disequilibrium Information (TrioCaller) ===<br />
<br />
The initial set of genotype calls is generated by examining a single individual at a time. These calls are typically quite good for deep sequencing data, but much less accurate for low-pass sequence data. In either case, they can be greatly improved by models that combine information across sites and individuals and consider the constraints imposed by parent-offspring trios. <br />
<br />
Note: The current version only supports SNP data, so please '''filter indels''' before running TrioCaller. It supports the VCF 4.0 and 4.1 formats, with the '''exception of dropped missing trailing fields''' (e.g. use the complete missing notation ./.:.:.:.,.,. rather than ./. in the genotype field).<br />
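One simple way to drop indels is a generic awk filter that keeps header lines plus records whose REF and ALT are both single bases. This is a sketch on a toy file, not a command shipped with TrioCaller; note that, as written, it also drops multi-allelic records with ALT values like "A,T".<br />

```shell
# Build a two-record toy VCF: one SNP (A->G) and one deletion (AT->A).
printf '##fileformat=VCFv4.1\n' > toy.vcf
printf '20\t100\t.\tA\tG\t50\tPASS\t.\n' >> toy.vcf
printf '20\t200\t.\tAT\tA\t50\tPASS\t.\n' >> toy.vcf
# Keep header lines plus records whose REF and ALT are both single bases,
# i.e. drop indels. On the tutorial data, filter result/chr20.mpileup.vcf the same way.
awk -F'\t' '/^#/ || (length($4) == 1 && length($5) == 1)' toy.vcf > toy.snps.vcf
grep -vc '^#' toy.snps.vcf    # prints 1: only the SNP record survives
```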
<br />
Here is a summary of the TrioCaller command line options (these are also listed whenever you run the program with no parameters):<br />
<br />
<source lang="text"><br />
Available Options<br />
Shotgun Sequences: --vcf [], --pedfile [] <br />
Markov Sampler: --seed [], --burnin [], --rounds [] <br />
Haplotyper: --states [], --errorRate [], --compact<br />
Phasing: --randomPhase , --inputPhased, --refPhased<br />
Output Files: --prefix [], --phase, --interimInterval []<br />
<br />
<br />
Explanation of Options<br />
--vcf: Standard VCF file (4.0 and above). <br />
--pedfile: Pedigree file in MERLIN format.<br />
--seed: Seed for sampling, default 123456.<br />
--burnin: The number of rounds ignored at the beginning of sampling.<br />
--rounds: The total number of iterations.<br />
--states: The number of haplotypes used in the state space. The default is the maximum number: 2*(number of founders - 1).<br />
--errorRate: The pre-defined base error rate. Default 0.01.<br />
--randomPhase: The initial haplotypes are inferred from single markers. Default option.<br />
--inputPhased: The initial haplotypes are taken directly from the input VCF file (with "|" as the separator in the genotype column).<br />
--refPhased: The initial haplotypes are built on reference alleles from the VCF file.<br />
--prefix: The prefix of the output files.<br />
--interimInterval: The number of rounds between interim outputs.<br />
</source><br />
<br />
Note: The pedigree file requires complete trio structures (all three members of each trio must exist in the file). For a parent-offspring pair, create a "fake" parent in the pedigree file or code the two samples as unrelated individuals. The order of the names in the pedigree file does NOT need to match the order in the .vcf file. The computation will be intensive if the number of samples is large. <br />
You can use the "--states" option to reduce the computational cost (e.g. start with "--states 50") <br />
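For reference, a MERLIN-format pedigree file lists one individual per line with five whitespace-separated columns: family ID, individual ID, father ID, mother ID, and sex (1 = male, 2 = female), with 0 used for the parents of founders. The sketch below is hypothetical (the sample names are placeholders; see ped/triocaller.ped in the package for the actual layout) and codes one trio plus one unrelated individual:<br />

```text
FAM1  FATHER1  0        0        1
FAM1  MOTHER1  0        0        2
FAM1  CHILD1   FATHER1  MOTHER1  1
FAM2  SAMPLE4  0        0        2
```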
<br />
To complete our example analysis, we could run:<br />
<br />
bin/TrioCaller --vcf result/chr20.mpileup.vcf --pedfile ped/triocaller.ped --states 50 --rounds 10 --prefix result/chr20.triocaller<br />
<br />
The format of the output file is the same as that of the input file. Again, you can review the contents of the updated VCF file using the more command:<br />
<br />
more result/chr20.triocaller.vcf<br />
<br />
All right, congratulations! You have reached the end and learned the basic skills for accurate genotype calling in trios.<br />
<br />
If you have any questions or comments, feel free to contact Wei Chen at [mailto:weichen.mich@gmail.com weichen.mich@gmail.com] or Goncalo Abecasis at [mailto:goncalo@umich.edu goncalo@umich.edu]<br />
<br />
== Citation ==<br />
<br />
Chen W, Li B, Zeng Z, Sanna S, Sidore C, Busonero F, Kang HM, Li Y, Abecasis GR. &nbsp;Genotype calling and haplotyping in parent-offspring trios. Genome Res.&nbsp;2013 Jan;23(1):142-51&nbsp;[http://genome.cshlp.org/content/early/2012/11/26/gr.142455.112.long LINK]</div>Weichhttp://genome.sph.umich.edu/w/index.php?title=FamLDCaller&diff=9152FamLDCaller2013-12-13T22:20:40Z<p>Weich: </p>
<hr />
<div><br />
A general guideline for genotype calling in families: <br />
<br />
Polymutt: small to large pedigrees.<br />
FamLDCaller/TrioCaller: many small pedigrees, low to modest depths.<br />
<br />
Download here. Version. <br />
<br />
Major updates: <br />
<br />
1. Updated the algorithm to allow nuclear and multi-generational pedigrees.<br />
<br />
2. Added a feature to use a reference panel.<br />
<br />
3. More flexible loading functions for VCF files (no need to remove non-SNP variants).<br />
<br />
<br />
<br />
<br />
<br />
The initial set of genotype calls is generated by examining a single individual at a time. These calls are typically quite good for deep sequencing data, but much less accurate for low-pass sequence data. In either case, they can be greatly improved by models that combine information across sites and individuals and consider the constraints imposed by parent-offspring trios. <br />
<br />
Here is a summary of the FamLDCaller command line options (these are also listed whenever you run the program with no parameters):<br />
<br />
<source lang="text"><br />
Available Options<br />
Shotgun Sequences: --vcf [], --pedfile [] <br />
Markov Sampler: --seed [], --burnin [], --rounds [] <br />
Haplotyper: --states [], --errorRate []<br />
Phasing: --randomPhase , --inputPhased, --refPhased<br />
Output Files: --prefix [], --phase, --interimInterval []<br />
<br />
<br />
Explanation of Options<br />
--vcf: Standard VCF file (4.0 and above). <br />
--pedfile: Pedigree file in MERLIN format.<br />
--seed: Seed for sampling, default 123456.<br />
--burnin: The number of rounds ignored at the beginning of sampling.<br />
--rounds: The total number of iterations.<br />
--states: The number of haplotypes used in the state space. The default is the maximum number: 2*(number of founders - 1).<br />
--errorRate: The pre-defined base error rate. Default 0.01.<br />
--randomPhase: The initial haplotypes are inferred from single markers. Default option.<br />
--inputPhased: The initial haplotypes are taken directly from the input VCF file (with "|" as the separator in the genotype column).<br />
--refPhased: The initial haplotypes are built on reference alleles from the VCF file.<br />
--prefix: The prefix of the output files.<br />
--interimInterval: The number of rounds between interim outputs.<br />
</source><br />
<br />
Note: The pedigree file requires complete family structures (both parents must exist in the pedigree file; e.g. for a parent-offspring pair, create a "fake" parent in the pedigree file or code them as unrelated individuals). The order of the names in the pedigree file does NOT need to match the order in the .vcf file. The computation will be intensive if the number of samples is large. <br />
You can use the "--states" option to reduce the computational cost (e.g. start with "--states 50") <br />
<br />
To complete our example analysis, we could run:<br />
<br />
FamLDCaller --vcf test.vcf --pedfile test.ped --states 50 --rounds 10 --prefix test.famldcaller</div>Weichhttp://genome.sph.umich.edu/w/index.php?title=TrioCaller&diff=9150TrioCaller2013-12-13T19:28:22Z<p>Weich: /* Download */</p>
<hr />
<div>== Introduction ==<br />
<br />
We will illustrate how TrioCaller works on sequence data that include trios and unrelated samples. We will walk through all the steps needed to move from raw sequence data to called genotypes. <br />
If you are new to sequence data, please review every step. If you are experienced, you may jump directly to the [http://genome.sph.umich.edu/wiki/TrioCaller#Genotype_Refinement_Using_Linkage_Disequilibrium_Information_.28TrioCaller.29 TrioCaller]-specific section. <br />
<br />
We will start with a set of short sequence reads and associated base quality scores (stored in a fastq file), find the most likely genomic location for each read (producing a BAM file), generate an initial list of polymorphic sites and genotypes (stored in a VCF file) and use haplotype information to refine these genotypes (resulting in an updated VCF file).<br />
<br />
=== Note ===<br />
<br />
If you are interested in ''de novo'' mutations or are working on one or two families with deep sequence data (>30X), you should first consider our sister program, [http://genome.sph.umich.edu/wiki/Polymutt Polymutt], which ignores linkage disequilibrium information but can handle more complex pedigrees.<br />
<br />
=== Download ===<br />
<br />
Before downloading the program, we would appreciate it if you could email [mailto:weichen.mich@gmail.com weichen.mich@gmail.com]&nbsp;(Subject: TrioCaller), ideally with a little descriptive information (e.g. affiliation, sequencing depth, number of samples, and family structure). We will notify you when there are updates.&nbsp;<br />
<br />
'''A new version with major updates (better processing function, handling general families and reference panels) is coming soon. Please try the beta version below. Contact weichen.mich@gmail.com for any questions.''' <br />
<br />
[[Binary file only:]] [http://www.sph.umich.edu/csg/weich/tmp/FamLDCaller FamLDCaller]. [Last update: 12/01/2013]<br />
<br />
<br />
'''TrioCaller''' : the version we used in the paper. <br />
<br> <br />
[[Binary file only:]] [http://www.sph.umich.edu/csg/weich/TrioCaller.06262012.binary.tgz TrioCaller.06262012.binary.tgz]. <br />
<br />
[[Binary file with example datasets&nbsp;:]] [http://www.sph.umich.edu/csg/weich/TrioCaller.06262012.tgz TrioCaller.06262012.tgz]. <br />
<br />
[http://genome.sph.umich.edu/wiki/TrioCaller:Archive Archive]. <br />
<br />
The example dataset demonstrated here is also included. Our dataset consists of 40 individuals, including 10 parent-offspring trios and 10 unrelated individuals. The average sequence depth is ~3x. README.txt describes the structure of the package. Pipeline.csh (C shell) and pipeline.bash (bash shell) are two scripts for you to run all commands listed here in batch. <br />
<br />
To conserve time and disk-space, our analysis will focus on a small region on chromosome 20 around position 2,000,000. We will first map reads for a single individual (labeled SAMPLE1). Then we combine the results with mapped reads from all individuals to generate a list of polymorphic sites and estimate accurate genotypes at each of these sites.<br />
<br />
=== Required Software ===<br />
<br />
In addition to TrioCaller, you will need BWA ([http://bio-bwa.sourceforge.net available from Sourceforge]) and samtools ([http://samtools.sourceforge.net also from Sourceforge]) installed to run this exercise. The examples were tested with bwa 0.6.1, samtools 0.1.18, and TrioCaller 0.1.1; we expect newer versions to work as well. We assume all executables are in your path.<br />
<br />
== Building an Index for Short Read Alignment ==<br />
<br />
To quickly place short reads along the genome, BWA and other read mappers typically build a word index for the genome. This index lists the location of particular short words along the genome and<br />
can be used to seed and then extend particular matches.<br />
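As a toy illustration only (this is not part of the alignment pipeline, and the sequences are made up), the kind of query an index answers can be mimicked by listing every exact occurrence of a short seed word in a reference string; a real index just makes this lookup fast before each hit is extended into a full alignment.<br />

```shell
# Toy seed lookup: report the 0-based offset of each exact, non-overlapping
# occurrence of a 4-mer "seed" in a tiny made-up reference string.
ref="ACGTACGTTACGT"
printf '%s\n' "$ref" | grep -ob 'ACGT' | cut -d: -f1    # prints 0, 4 and 9
```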
<br />
The sequence index is typically not compatible across different BWA versions. To rebuild the sequence index, issue the following commands:<br />
<br />
<source lang="bash"><br />
#Remove any earlier reference files<br />
rm ref/human_g1k_v37_chr20.fa.*<br />
<br />
#Rebuild the reference<br />
bin/bwa index -a is ref/human_g1k_v37_chr20.fa<br />
</source><br />
<br />
== Mapping Reads to The Genome ==<br />
<br />
Next, we will use BWA to find the most likely sequence location for each read using the <code>bwa aln</code> command. This command requires two parameters, one corresponding to the reference genome, the other corresponding to a fastq file containing reads to be mapped. <br />
<br />
bin/bwa aln -q 15 ref/human_g1k_v37_chr20.fa fastq/SAMPLE1.fastq > bwa.sai/SAMPLE1.sai<br />
<br />
The file SAMPLE1.fastq contains DNA sequence reads for sample SAMPLE1. <br />
<br />
A fastq file consists of a series of multi-line records. Each record starts with a read name, followed by a DNA sequence, a separator line, and a set of per-base quality scores. Base quality scores estimate the probability of error at each sequenced base (a base quality of 10 denotes an error probability of 10%, base quality 20 denotes a 1% error probability, and base quality 30 denotes a 0.1% error probability). For compactness, each error probability is encoded as a single character and can be decoded using an [http://www.google.com/search?q=ascii+table ASCII table] (simply look up the ASCII code of each quality character and subtract 33 to get the base quality). By inspecting the fastq file you should be able to learn the length of the reads being mapped and their base qualities. For example, try to figure out whether base quality is typically higher at the start or the end of each read...<br />
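As a quick sanity check of the decoding rule above, the snippet below converts a short quality string (made up for illustration, not taken from the example data) into Phred scores and error probabilities:

```shell
# Decode a Phred+33 quality string: the ASCII code of each character minus 33
# gives the quality Q, and the error probability is 10^(-Q/10).
# The quality string below is hypothetical.
qual='II5+'
for ((i = 0; i < ${#qual}; i++)); do
    c=${qual:$i:1}
    q=$(( $(printf '%d' "'$c") - 33 ))
    p=$(awk -v q="$q" 'BEGIN { printf "%.4f", 10^(-q/10) }')
    printf '%s  Q=%d  P(error)=%s\n' "$c" "$q" "$p"
done
# first line printed:  I  Q=40  P(error)=0.0001
```

Running this shows that 'I' encodes a high-confidence base (Q40, 0.01% error) while '+' encodes a low-confidence one (Q10, 10% error).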
<br />
=== Converting Alignments to BAM format ===<br />
<br />
The .sai alignment format is specific to BWA, so the first thing to do is to convert the alignment to a more standard format that will be compatible with downstream analysis tools. We can do this with a combination of the <code>bwa samse</code> command and the <code>samtools view</code> and <code>samtools sort</code> commands.<br />
<br />
bin/bwa samse -r "@RG\tID:ILLUMINA\tSM:SAMPLE1" ref/human_g1k_v37_chr20.fa bwa.sai/SAMPLE1.sai fastq/SAMPLE1.fastq | \<br />
bin/samtools view -uhS - | bin/samtools sort -m 2000000000 - bams/SAMPLE1<br />
<br />
You can check the meaning of these parameters in the bwa manual. The resulting BAM file uses a compact binary format to represent the <br />
alignment of each short read to the genome. You can view the contents<br />
of the file using the <code>samtools view</code> command, like so:<br />
<br />
bin/samtools view bams/SAMPLE1.bam | more<br />
<br />
The text representation of the alignment produced by <code>samtools view</code> describes<br />
the alignment of one read per line. The most interesting fields are column 1 (the read <br />
name), columns 3 and 4 (the alignment chromosome and position), column 6 (the CIGAR string, describing <br />
any gaps in the alignment), and columns 10 and 11 (the sequence and its base quality scores). In this representation, all alignments are reported on the forward strand.<br />
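To make the column layout concrete, the snippet below prints the fields of interest from a single made-up SAM record (the record is illustrative only, not taken from the example data):

```shell
# A hypothetical single-end SAM record; the eleven mandatory tab-separated
# fields are: QNAME FLAG RNAME POS MAPQ CIGAR RNEXT PNEXT TLEN SEQ QUAL.
printf 'read001\t0\t20\t2100000\t37\t4M\t*\t0\t0\tACGT\tIIII\n' |
    awk -F'\t' '{print "name=" $1, "chr=" $3, "pos=" $4, "cigar=" $6, "seq=" $10}'
# prints: name=read001 chr=20 pos=2100000 cigar=4M seq=ACGT
```

The same extraction works on real data by piping the output of <code>samtools view bams/SAMPLE1.bam</code> into the <code>awk</code> command above.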
<br />
=== Indexing the BAM file ===<br />
<br />
<!--<br />
Although the current file contains all necessary information about reads and their<br />
genomic locations, it is missing some auxiliary information that BAM files typically<br />
contain to help describe their contents (for example, to specify that this file contains<br />
DNA sequence reads for sample NA20589). So, the very next step is to add this information<br />
to the file:<br />
<br />
samtools reheader bams/NA20589.header bams/noheader.NA20589.bam > bams/NA20589.bam<br />
<br />
!--><br />
<br />
If you have made it this far, rejoice! The mapping process is almost done. We will now create <br />
an index for the file, which makes it convenient to quickly extract reads from any <br />
genomic location. We do this with the <code>samtools index</code> command, like so:<br />
<br />
bin/samtools index bams/SAMPLE1.bam<br />
<br />
=== Browsing Alignment Results ===<br />
<br />
You can now view the contents of the alignment at any location using the <code>samtools view</code><br />
and <code>samtools tview</code> commands. While <code>tview</code> generates prettier output,<br />
it is not compatible with all screens. For example, to view reads overlapping <br />
position 2,100,000 on chromosome 20, we could run:<br />
<br />
bin/samtools tview bams/SAMPLE1.bam ref/human_g1k_v37_chr20.fa<br />
<br />
Then, type "g 20:2100000"<br />
<br />
So let's recap: we have mapped reads to the genome, converted them from a BWA-specific format to a <br />
standard format understood by many different programs, and sorted and indexed the results.<br />
In most cases, the next steps would be to remove duplicate reads and to ensure that base quality scores are properly calibrated. To save time, we'll skip those steps here.<br />
<br />
So far, we have only finished read mapping for one sample, SAMPLE1. We need to repeat this step for the other samples (SAMPLE2 - SAMPLE40). You can try something like:<br />
<br />
For C shell:<br />
<br />
foreach file (`ls fastq/SAMPLE*.fastq | cut -f 2 -d '/' | cut -f 1 -d '.'`)<br />
echo $file<br />
bin/bwa aln -q 15 ref/human_g1k_v37_chr20.fa fastq/$file.fastq > bwa.sai/$file.sai<br />
bin/bwa samse -r "@RG\tID:ILLUMINA\tSM:$file" ref/human_g1k_v37_chr20.fa bwa.sai/$file.sai fastq/$file.fastq | \<br />
bin/samtools view -uhS - | bin/samtools sort -m 2000000000 - bams/$file<br />
bin/samtools index bams/$file.bam<br />
end<br />
<br />
For bash shell:<br />
<br />
for file in `ls fastq/SAMPLE*.fastq | cut -f 2 -d '/' | cut -f 1 -d '.'`;<br />
do echo $file;<br />
bin/bwa aln -q 15 ref/human_g1k_v37_chr20.fa fastq/$file.fastq > bwa.sai/$file.sai;<br />
bin/bwa samse -r "@RG\tID:ILLUMINA\tSM:$file" ref/human_g1k_v37_chr20.fa bwa.sai/$file.sai fastq/$file.fastq | \<br />
bin/samtools view -uhS - | bin/samtools sort -m 2000000000 - bams/$file;<br />
bin/samtools index bams/$file.bam;<br />
done<br />
<br />
<br />
Once we have finished the read mapping step and generated BAM files for all samples, we can proceed to variant calling and genotype inference.<br />
<br />
== Calling Variants and Inferring Genotypes ==<br />
<br />
=== Initial set of variant calls ===<br />
<br />
You probably thought the initial mapping process was quite convoluted ... you'll be glad to know that<br />
the next few steps are much simpler.<br />
<br />
The first thing we'll do is use samtools to generate an initial list of variant sites, using the <code>mpileup</code> command. This command examines the bases aligned at each position and flags positions that are likely to vary. By default, the results are stored in a BCF file, which can be converted into the more widely used VCF format with bcftools (a companion set of tools distributed with samtools).<br />
<br />
<br />
bin/samtools mpileup -Iuf ref/human_g1k_v37_chr20.fa bams/SAMPLE*bam | bin/bcftools view -bvcg - > result/chr20.mpileup.bcf<br />
<br />
bin/bcftools view result/chr20.mpileup.bcf > result/chr20.mpileup.vcf<br />
<br />
<br />
The [http://www.1000genomes.org/node/101 VCF format] is a simple text format. It starts with several header lines (meta-information lines beginning with '##', followed by a column-header line beginning with '#CHROM'), and then contains a single line per marker that provides both summary information about the marker and the genotypes of each individual. You can review the contents of the VCF file using the 'more' command:<br />
<br />
more result/chr20.mpileup.vcf<br />
<br />
Here are some questions for you to investigate:<br />
<br />
* How many variant sites were detected in this dataset? Try a command like this one:<br />
<br />
grep -vE ^# result/chr20.mpileup.vcf | wc -l<br />
<br />
(The grep command excludes all lines beginning with #, and the wc command then counts the remaining lines.)<br />
<br />
<!-- <br />
* How many variant sites are estimated to be singletons?<br />
!--><br />
<br />
<!--<br />
<br />
Wei: I commented this out because generating GLF requires an old version of samtools that most people don't have. Goncalo<br />
<br />
=== Alternative way to call variants ===<br />
Bcftools assumes all samples are unrelated. The better way to call variants is to consider family constraints imposed by parent-offspring trio. An alternative approach for current step is to use<br />
another program [http://genome.sph.umich.edu/wiki/Polymutt Polymutt]. It consists of two steps, but the input [bam file] and output files [vcf file] are same as above. For more information, <br />
click the link. <br />
<br />
Convert bam files to glf files (It is a single line of command)<br />
<br />
bin/samtools view -bh bams/SAMPLE1.bam | bin/samtools calmd -Abr - ref/human_g1k_v37_chr20.fa 2 - | \<br />
bin/samtools mpileup -g -f ref/human_g1k_v37_chr20.fa - > bams/SAMPLE1.glf<br />
<br />
Take glf files and output vcf files.<br />
<br />
bin/polymutt -p polymutt.ped -d polymutt.dat -g polymutt.glfindex --vcf result/chr20.polymutt.vcf<br />
<br />
!--><br />
<br />
=== Genotype Refinement Using Linkage Disequilibrium Information (TrioCaller) ===<br />
<br />
The initial set of genotype calls is generated by examining one individual at a time. These calls are typically quite good for deep sequencing data, but much less accurate for low-pass sequence data. In either case, they can be greatly improved by models that combine information across sites and individuals and consider the constraints imposed by parent-offspring trios. <br />
<br />
Note: The current version only supports SNP data, so please '''filter indels''' before running TrioCaller. It supports the VCF 4.0 and 4.1 formats, with the '''exception that missing trailing fields must not be dropped''' (e.g. use the complete missing notation ./.:.:.:.,.,. rather than ./. in the genotype column).<br />
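One simple way to drop indels is an <code>awk</code> filter that keeps only records whose REF and ALT columns are a single base. This is a sketch, demonstrated on a tiny hypothetical VCF rather than the tutorial's real file; note that it also discards multi-allelic sites (e.g. ALT of "A,T"), which TrioCaller's SNP-only model would not handle anyway:

```shell
# Build a tiny hypothetical tab-delimited VCF with two SNPs and one indel.
{
    printf '##fileformat=VCFv4.1\n'
    printf '#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\n'
    printf '20\t2000100\t.\tA\tG\t50\tPASS\tDP=9\n'
    printf '20\t2000200\t.\tAT\tA\t45\tPASS\tDP=7\n'
    printf '20\t2000300\t.\tC\tT\t48\tPASS\tDP=8\n'
} > /tmp/demo.vcf
# Header lines pass through; a record survives only if REF (column 4) and
# ALT (column 5) are both a single base.
awk -F'\t' '/^#/ || (length($4) == 1 && length($5) == 1)' /tmp/demo.vcf > /tmp/demo.snps.vcf
grep -cv '^#' /tmp/demo.snps.vcf
# prints: 2
```

Applying the same <code>awk</code> line to result/chr20.mpileup.vcf would produce a SNP-only VCF suitable as TrioCaller input.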
<br />
Here is a summary of the TrioCaller command line options (these are also listed whenever you run the program with no parameters):<br />
<br />
<source lang="text"><br />
Available Options<br />
Shotgun Sequences: --vcf [], --pedfile [] <br />
Markov Sampler: --seed [], --burnin [], --rounds [] <br />
Haplotyper: --states [], --errorRate [], --compact<br />
Phasing: --randomPhase , --inputPhased, --refPhased<br />
Output Files: --prefix [], --phase, --interimInterval []<br />
<br />
<br />
Explanation of Options<br />
--vcf: Standard VCF file (4.0 and above). <br />
--pedfile: Pedigree file in MERLIN format.<br />
--seed: Seed for sampling, default 123456.<br />
--burnin: The number of rounds ignored at the beginning of sampling.<br />
--rounds: The total number of iterations.<br />
--states: The number of haplotypes used in the state space. The default is the maximum number: 2*(number of founders - 1).<br />
--errorRate: The pre-defined base error rate. Default 0.01.<br />
--randomPhase: The initial haplotypes are inferred from the single marker. Default option.<br />
--inputPhased: The initial haplotypes are directly from input VCF file (with "|" as separator in the genotype column).<br />
--refPhased: The initial haplotypes are built on reference alleles from VCF file.<br />
--prefix: The prefix of output file <br />
--interimInterval: The number of rounds for interim outputs<br />
</source><br />
<br />
Note: The pedigree file requires complete trio structures (all three members of each trio must be present in the file). For a parent-offspring pair, create a "fake" parent in the pedigree file, or code the two as unrelated individuals. The order of names in the pedigree file does NOT need to match the order in the .vcf file. The computation will be intensive if the number of samples is large; <br />
you can use the "--states" option to reduce the computational cost (e.g. start with "--states 50") <br />
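For reference, a MERLIN-style pedigree file is whitespace-delimited with one row per individual: family ID, individual ID, father ID, mother ID, and sex, with 0 marking an absent or unknown parent. The IDs and path below are illustrative, not the ones shipped with the example data, and the exact column conventions should be checked against the MERLIN documentation:

```shell
# Write a small hypothetical MERLIN-format pedigree: one complete trio
# (father, mother, child) plus one unrelated individual.
# Columns: family ID, individual ID, father ID, mother ID, sex (1=male, 2=female).
cat > /tmp/triocaller.demo.ped <<'EOF'
FAM1 SAMPLE1 0 0 1
FAM1 SAMPLE2 0 0 2
FAM1 SAMPLE3 SAMPLE1 SAMPLE2 2
FAM2 SAMPLE31 0 0 1
EOF
# Rows naming both parents define the trio children.
awk '$3 != 0 && $4 != 0 { print $2 " is the child of " $3 " and " $4 }' /tmp/triocaller.demo.ped
# prints: SAMPLE3 is the child of SAMPLE1 and SAMPLE2
```

Here SAMPLE3's row names both parents, so the trio is complete; unrelated individuals simply list 0 for both parent columns.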
<br />
To complete our example analysis, we could run:<br />
<br />
bin/TrioCaller --vcf result/chr20.mpileup.vcf --pedfile ped/triocaller.ped --states 50 --rounds 10 --prefix result/chr20.triocaller<br />
<br />
The format of the output file is the same as that of the input file. Again, you can review the contents of the updated VCF file using the <code>more</code> command:<br />
<br />
more result/chr20.triocaller.vcf<br />
<br />
All right. Congratulations! You have reached the end and learned the basic skills for accurate genotype calling in trios.<br />
<br />
If you have any questions or comments, feel free to contact Wei Chen at [mailto:weichen.mich@gmail.com weichen.mich@gmail.com] or Goncalo Abecasis at [mailto:goncalo@umich.edu goncalo@umich.edu]<br />
<br />
== Citation ==<br />
<br />
Chen W, Li B, Zeng Z, Sanna S, Sidore C, Busonero F, Kang HM, Li Y, Abecasis GR. Genotype calling and haplotyping in parent-offspring trios. Genome Res. 2013 Jan;23(1):142-51 [http://genome.cshlp.org/content/early/2012/11/26/gr.142455.112.long LINK]</div>
<hr />
<div>== Introduction ==<br />
<br />
We will illustrate how TrioCaller works in sequence data including trios and unrelated samples. We will walk through all necessary steps to move from raw sequence data to called genotypes. <br />
If you are new to sequence data, please review every step. If you are experienced, you may directly jump to [http://genome.sph.umich.edu/wiki/TrioCaller#Genotype_Refinement_Using_Linkage_Disequilibrium_Information_.28TrioCaller.29 TrioCaller] specific section. <br />
<br />
We will start with a set of short sequence reads and associated base quality scores (stored in a fastq file), find the most likely genomic location for each read (producing a BAM file), generate an initial list of polymorphic sites and genotypes (stored in a VCF file) and use haplotype information to refine these genotypes (resulting in an updated VCF file).<br />
<br />
=== Note ===<br />
<br />
If you are interested in ''de novo'' mutations or are working on one or two families with deep sequence data (>30X), you should first consider our sister program, [http://genome.sph.umich.edu/wiki/Polymutt Polymutt], which ignores linkage disequilibrium information but can handle more complex pedigrees.<br />
<br />
=== Download ===<br />
<br />
Before downloading the program, we appreciate if you could email [mailto:weichen.mich@gmail.com weichen.mich@gmail.com]&nbsp;(Subject: TrioCaller,&nbsp;with/without a little descriptive information (e.g. Affiliation, depth, the number of samples and family structure). We will notify you if there is any update.&nbsp;<br />
<br />
'''A new version with major updates (better processing function, handling general families and reference panels) is coming soon. Please try the beta version below. Contact weichen.mich@gmail.com for any questions.''' <br />
<br />
[[Binary file only:]] [http://www.sph.umich.edu/csg/weich/FamLDCaller FamLDCaller]. [Last update: 12/01/2013]<br />
<br />
<br />
'''TrioCaller'''<br />
<br> <br />
[[Binary file only:]] [http://www.sph.umich.edu/csg/weich/TrioCaller.06262012.binary.tgz TrioCaller.06262012.binary.tgz]. <br />
<br />
[[Binary file with example datasets&nbsp;:]] [http://www.sph.umich.edu/csg/weich/TrioCaller.06262012.tgz TrioCaller.06262012.tgz]. <br />
<br />
[http://genome.sph.umich.edu/wiki/TrioCaller:Archive Archive]. <br />
<br />
The example dataset demonstrated here is also included. Our dataset consists of 40 individuals, including 10 parent-offspring trios and 10 unrelated individuals. The average sequence depth is ~3x. README.txt describes the structure of the package. Pipeline.csh (C shell) and pipeline.bash (bash shell) are two scripts for you to run all commands listed here in batch. <br />
<br />
To conserve time and disk-space, our analysis will focus on a small region on chromosome 20 around position 2,000,000. We will first map reads for a single individual (labeled SAMPLE1). Then we combine the results with mapped reads from all individuals to generate a list of polymorphic sites and estimate accurate genotypes at each of these sites.<br />
<br />
=== Required Software ===<br />
<br />
In addition to TrioCaller, you will need BWA ([http://bio-bwa.sourceforge.net available from Sourceforge]) and samtools ([http://samtools.sourceforge.net also from Sourceforge]) installed to run this exercise. The examples were tested with bwa 0.6.1, samtools 0.1.18, and TrioCaller 0.1.1; we expect newer versions to work as well. We assume all executables are in your path.<br />
<br />
== Building an Index for Short Read Alignment ==<br />
<br />
To quickly place short reads along the genome, BWA and other read mappers typically build a word index for the genome. This index lists the locations of short words along the genome and<br />
can be used to seed and then extend candidate matches.<br />
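To make the idea concrete, here is a toy sketch of a word index built in the shell (this only illustrates the seed-lookup concept; BWA's actual on-disk index is a compressed BWT/FM-index, not a plain k-mer table):<br />

```shell
# Toy word (k-mer) index with k=3: record the 1-based start positions of
# every 3-mer in a short sequence. This only illustrates the idea of a
# seed lookup; BWA's real index is a compressed BWT/FM-index on disk.
seq='ACGTACG'
kmer_index=$(printf '%s\n' "$seq" | awk '{
  for (i = 1; i <= length($0) - 2; i++) {
    w = substr($0, i, 3)
    idx[w] = idx[w] " " i
  }
  for (w in idx) print w ":" idx[w]
}' | sort)
printf '%s\n' "$kmer_index"
```

Each 3-mer maps to the list of positions where it occurs, so a read's seed words can be looked up directly and the alignment extended from there.<br />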
<br />
The sequence index is typically not compatible across different BWA versions. To rebuild the sequence index, issue the following commands:<br />
<br />
<source lang="bash"><br />
#Remove any earlier reference files<br />
rm ref/human_g1k_v37_chr20.fa.*<br />
<br />
#Rebuild the reference<br />
bin/bwa index -a is ref/human_g1k_v37_chr20.fa<br />
</source><br />
<br />
== Mapping Reads to The Genome ==<br />
<br />
Next, we will use BWA to find the most likely sequence location for each read using the <code>bwa aln</code> command. This command requires two parameters, one corresponding to the reference genome, the other corresponding to a fastq file containing reads to be mapped. <br />
<br />
bin/bwa aln -q 15 ref/human_g1k_v37_chr20.fa fastq/SAMPLE1.fastq > bwa.sai/SAMPLE1.sai<br />
<br />
The file SAMPLE1.fastq contains DNA sequence reads for sample SAMPLE1. <br />
<br />
A fastq file consists of a series of multi-line records. Each record starts with a read name, followed by the DNA sequence, a separator line, and a set of per-base quality scores. Base quality scores estimate the probability of error at each sequenced base (a base quality of 10 denotes a 10% error probability, 20 denotes 1%, and 30 denotes 0.1%). These error probabilities are each encoded in a single character for compactness and can be decoded using an [http://www.google.com/search?q=ascii+table ASCII table] (look up the ASCII code for each character and subtract 33 to get the base quality). By inspecting the fastq file you should be able to learn the length of the reads being mapped and their base qualities. For example, try to figure out whether base quality is typically higher at the start or the end of each read...<br />
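The decoding rule can be checked by hand in the shell (the quality character 'I' below is just an illustrative value, not taken from the example fastq):<br />

```shell
# Decode one FASTQ quality character by hand. Phred+33 encoding: base
# quality = ASCII code - 33; error probability = 10^(-quality/10).
# 'I' is an illustrative value: ASCII 73, so the base quality is 40.
qual_char='I'
ascii=$(printf '%d' "'$qual_char")
quality=$((ascii - 33))
echo "quality: $quality"
awk -v q="$quality" 'BEGIN { printf "error probability: %g\n", 10^(-q / 10) }'
```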
<br />
=== Converting Alignments to BAM format ===<br />
<br />
The .sai alignment format is specific to BWA, so the first thing to do is convert the alignment to a more standard format that will be compatible with downstream analysis tools. We can do this with a combination of the <code>bwa samse</code>, <code>samtools view</code>, and <code>samtools sort</code> commands.<br />
<br />
bin/bwa samse -r "@RG\tID:ILLUMINA\tSM:SAMPLE1" ref/human_g1k_v37_chr20.fa bwa.sai/SAMPLE1.sai fastq/SAMPLE1.fastq | \<br />
bin/samtools view -uhS - | bin/samtools sort -m 2000000000 - bams/SAMPLE1<br />
<br />
You can check the use of parameters in the bwa manual. The resulting BAM file uses a compact binary format to represent the <br />
alignment of each short read to the genome. You can view the contents<br />
of the file using the <code>samtools view</code> command, like so:<br />
<br />
bin/samtools view bams/SAMPLE1.bam | more<br />
<br />
The text representation of the alignment produced by <code>samtools view</code> describes<br />
the alignment of one read per line. The most interesting fields are column 1 (the read <br />
name), columns 3 and 4 (the alignment position), column 6 (the CIGAR string, describing <br />
any gaps in the alignment), and columns 10 and 11 (the sequence and its quality scores). In this representation, all alignments are automatically converted to the forward strand.<br />
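To make the column layout concrete, here is a toy SAM record (invented for illustration, not taken from the example data) with the key fields pulled out by awk:<br />

```shell
# A toy single-line SAM record (tab-separated; not from the example data):
# column 1 = read name, columns 3-4 = reference and position, column 6 =
# CIGAR, columns 10-11 = sequence and base qualities.
fields=$(printf 'read1\t0\t20\t2100000\t60\t5M\t*\t0\t0\tACGTA\tIIIII\n' | \
  awk -F'\t' '{ print "name=" $1, "chr=" $3, "pos=" $4, "cigar=" $6, "seq=" $10 }')
echo "$fields"
```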
<br />
=== Indexing the BAM file ===<br />
<br />
<!--<br />
Although the current file contains all necessary information about reads and their<br />
genomic locations, it is missing some auxiliary information that BAM files typically<br />
contain to help describe their contents (for example, to specify that this file contains<br />
DNA sequence reads for sample NA20589). So, the very next step is to add this information<br />
to the file:<br />
<br />
samtools reheader bams/NA20589.header bams/noheader.NA20589.bam > bams/NA20589.bam<br />
<br />
!--><br />
<br />
If you have reached this far, rejoice! The mapping process is almost done. We will now create <br />
an index for the file, which makes it convenient to quickly extract reads from any <br />
genomic location. We do this with the <code>samtools index</code> command, like so:<br />
<br />
bin/samtools index bams/SAMPLE1.bam<br />
<br />
=== Browsing Alignment Results ===<br />
<br />
You can now view the contents of the alignment at any location using the <code>samtools view</code><br />
and <code>samtools tview</code> commands. While <code>tview</code> generates prettier output,<br />
it is not compatible with all screens. For example, to view reads overlapping <br />
position 2,100,000 on chromosome 20, we could run:<br />
<br />
bin/samtools tview bams/SAMPLE1.bam ref/human_g1k_v37_chr20.fa<br />
<br />
Then, type "g 20:2100000"<br />
<br />
So let's recap: we have mapped reads to the genome, converted them from a BWA-specific format to a <br />
standard format used by many different programs, and sorted and indexed the results.<br />
In most cases, the next step would be to remove duplicate reads and to ensure that base quality scores are properly calibrated. To save time, we'll skip those steps here.<br />
<br />
So far, we have only finished read mapping for one sample, SAMPLE1. We need to repeat this step for the other samples (SAMPLE2 - SAMPLE40). You can try something like:<br />
<br />
For C shell:<br />
<br />
foreach file (`ls fastq/SAMPLE*.fastq | cut -f 2 -d '/' | cut -f 1 -d '.'`)<br />
echo $file<br />
bin/bwa aln -q 15 ref/human_g1k_v37_chr20.fa fastq/$file.fastq > bwa.sai/$file.sai<br />
bin/bwa samse -r "@RG\tID:ILLUMINA\tSM:$file" ref/human_g1k_v37_chr20.fa bwa.sai/$file.sai fastq/$file.fastq | \<br />
bin/samtools view -uhS - | bin/samtools sort -m 2000000000 - bams/$file<br />
bin/samtools index bams/$file.bam<br />
end<br />
<br />
For bash shell:<br />
<br />
for file in `ls fastq/SAMPLE*.fastq | cut -f 2 -d '/' | cut -f 1 -d '.'`;<br />
do echo $file;<br />
bin/bwa aln -q 15 ref/human_g1k_v37_chr20.fa fastq/$file.fastq > bwa.sai/$file.sai;<br />
bin/bwa samse -r "@RG\tID:ILLUMINA\tSM:$file" ref/human_g1k_v37_chr20.fa bwa.sai/$file.sai fastq/$file.fastq | \<br />
 bin/samtools view -uhS - | bin/samtools sort -m 2000000000 - bams/$file;<br />
bin/samtools index bams/$file.bam;<br />
done<br />
<br />
<br />
Once we have finished the read mapping step and generated BAM files for all samples, we can proceed to variant calling and genotype inference.<br />
<br />
== Calling variants and Inferring genotypes ==<br />
<br />
=== Initial set of variant calls ===<br />
<br />
You probably thought the initial mapping process was quite convoluted ... you'll be glad to know that<br />
the next few steps are much simpler.<br />
<br />
The first thing we'll do is use samtools to generate an initial list of variant sites, using the <code>mpileup</code> command. This command looks at the bases aligned to each location and flags locations that are likely to vary. By default, the results are stored in a BCF file, which can be converted into the more widely used VCF format using bcftools (a companion set of tools distributed with samtools).<br />
<br />
<br />
bin/samtools mpileup -Iuf ref/human_g1k_v37_chr20.fa bams/SAMPLE*bam | bin/bcftools view -bvcg - > result/chr20.mpileup.bcf<br />
<br />
bin/bcftools view result/chr20.mpileup.bcf > result/chr20.mpileup.vcf<br />
<br />
<br />
The [http://www.1000genomes.org/node/101 VCF format] is a simple text format. It starts with several header lines, which begin with '#' characters (meta-information lines start with '##'), followed by a single line per marker that provides both summary information about the marker and genotypes for each individual. You can review the contents of the VCF file using the 'more' command:<br />
<br />
more result/chr20.mpileup.vcf<br />
<br />
Here are some questions for you to investigate:<br />
<br />
* How many variant sites were detected in this dataset? Try a command like this one:<br />
<br />
grep -vE ^# result/chr20.mpileup.vcf | wc -l<br />
<br />
(The grep command excludes all lines beginning with #, and the wc command then counts the remaining lines in the file.)<br />
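To make the record layout concrete, here is one toy VCF data line (invented values, not from the example output) pulled apart with awk:<br />

```shell
# One toy VCF data line. The first nine tab-separated columns are CHROM,
# POS, ID, REF, ALT, QUAL, FILTER, INFO and FORMAT; each later column holds
# one sample, with the genotype (GT) as the first colon-separated subfield.
record=$(printf '20\t2100000\t.\tA\tG\t50\tPASS\tDP=120\tGT:DP\t0/1:4\n' | \
  awk -F'\t' '{ split($10, g, ":"); print "pos=" $2, "ref=" $4, "alt=" $5, "gt=" g[1] }')
echo "$record"
```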
<br />
<!-- <br />
* How many variant sites are estimated to be singletons?<br />
!--><br />
<br />
<!--<br />
<br />
Wei: I commented this out because generating GLF requires an old version of samtools that most people don't have. Goncalo<br />
<br />
=== Alternative way to call variants ===<br />
Bcftools assumes all samples are unrelated. The better way to call variants is to consider family constraints imposed by parent-offspring trio. An alternative approach for current step is to use<br />
another program [http://genome.sph.umich.edu/wiki/Polymutt Polymutt]. It consists of two steps, but the input [bam file] and output files [vcf file] are same as above. For more information, <br />
click the link. <br />
<br />
Convert bam files to glf files (It is a single line of command)<br />
<br />
bin/samtools view -bh bams/SAMPLE1.bam | bin/samtools calmd -Abr - ref/human_g1k_v37_chr20.fa 2 - | \<br />
bin/samtools mpileup -g -f ref/human_g1k_v37_chr20.fa - > bams/SAMPLE1.glf<br />
<br />
Take glf files and output vcf files.<br />
<br />
bin/polymutt -p polymutt.ped -d polymutt.dat -g polymutt.glfindex --vcf result/chr20.polymutt.vcf<br />
<br />
!--><br />
<br />
=== Genotype Refinement Using Linkage Disequilibrium Information (TrioCaller) ===<br />
<br />
The initial set of genotype calls is generated by examining a single individual at a time. These calls are typically quite good for deep sequencing data, but much less accurate for low-pass sequence data. In either case, they can be greatly improved by models that combine information across sites and individuals and consider the constraints imposed by parent-offspring trios. <br />
<br />
Note: The current version only supports SNP data, so please '''filter indels''' before running TrioCaller. It supports the VCF 4.0 and 4.1 formats, with the '''exception that missing trailing fields must not be dropped''' (e.g. use the complete missing notation ./.:.:.:.,.,. rather than ./. for the genotype field).<br />
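A rough way to drop indels is to keep only records whose REF and ALT alleles are a single base, sketched below on a toy two-record input (a real pipeline might use vcftools or a similar tool instead, and may also need to handle multi-allelic sites):<br />

```shell
# A rough SNP-only filter on a toy two-record VCF: header lines pass
# through; data lines are kept only when both REF (column 4) and ALT
# (column 5) are a single base, so the indel at position 200 is dropped.
snps=$(printf '##fileformat=VCFv4.1\n20\t100\t.\tA\tG\t50\tPASS\tDP=9\n20\t200\t.\tAT\tA\t50\tPASS\tDP=9\n' | \
  awk -F'\t' '/^#/ || (length($4) == 1 && length($5) == 1)')
echo "$snps"
```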
<br />
Here is a summary of the TrioCaller command line options (these are also listed whenever you run the program with no parameters):<br />
<br />
<source lang="text"><br />
Available Options<br />
Shotgun Sequences: --vcf [], --pedfile [] <br />
Markov Sampler: --seed [], --burnin [], --rounds [] <br />
Haplotyper: --states [], --errorRate [], --compact<br />
Phasing: --randomPhase , --inputPhased, --refPhased<br />
Output Files: --prefix [], --phase, --interimInterval []<br />
<br />
<br />
Explanation of Options<br />
--vcf: Standard VCF file (4.0 and above). <br />
--pedfile: Pedigree file in MERLIN format.<br />
--seed: Seed for sampling, default 123456.<br />
--burnin: The number of rounds ignored at the beginning of sampling.<br />
--rounds: The total number of iterations.<br />
--states: The number of haplotypes used in the state space. The default is the maximum number: 2*(number of founders - 1).<br />
--errorRate: The pre-defined base error rate. Default 0.01.<br />
--randomPhase: The initial haplotypes are inferred from the single marker. Default option.<br />
--inputPhased: The initial haplotypes are directly from input VCF file (with "|" as separator in the genotype column).<br />
--refPhased: The initial haplotypes are built on reference alleles from VCF file.<br />
--prefix: The prefix of output file <br />
--interimInterval: The number of rounds for interim outputs<br />
</source><br />
<br />
Note: The pedigree file requires complete trio structures (all three members of each trio must be present in the file). For a parent-offspring pair, create a "fake" parent in the pedigree file or code the pair as unrelated individuals. The order of the names in the pedigree file does NOT need to be consistent with that in the .vcf file. The computation will be intensive if the number of samples is large. <br />
You can use the "--states" option to reduce the computational cost (e.g. start with "--states 50").<br />
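For reference, a minimal pedigree sketch for one trio plus one unrelated sample might look like the following (toy family and sample IDs; check the MERLIN documentation for the exact column layout your data requires):<br />

```shell
# Write a toy MERLIN-style pedigree (hypothetical IDs): whitespace-separated
# columns are family ID, individual ID, father ID, mother ID, and sex
# (1 = male, 2 = female); 0 marks a parent absent from the pedigree.
cat > toy.ped <<'EOF'
FAM1 SAMPLE1 0 0 1
FAM1 SAMPLE2 0 0 2
FAM1 SAMPLE3 SAMPLE1 SAMPLE2 1
FAM2 SAMPLE4 0 0 2
EOF
echo "wrote $(wc -l < toy.ped | tr -d ' ') pedigree lines"
```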
<br />
To complete our example analysis, we could run:<br />
<br />
bin/TrioCaller --vcf result/chr20.mpileup.vcf --pedfile ped/triocaller.ped --states 50 --rounds 10 --prefix result/chr20.triocaller<br />
<br />
The format of the output file is the same as that of the input file. Again, you can review the contents of the updated VCF file using the more command:<br />
<br />
more result/chr20.triocaller.vcf<br />
<br />
All right. Congratulations! You have reached the end and learned the basic skills for accurate genotype calling in trios.<br />
<br />
If you have any questions or comments, feel free to contact Wei Chen at [mailto:weichen.mich@gmail.com weichen.mich@gmail.com] or Goncalo Abecasis at [mailto:goncalo@umich.edu goncalo@umich.edu]<br />
<br />
== Citation ==<br />
<br />
Chen W, Li B, Zeng Z, Sanna S, Sidore C, Busonero F, Kang HM, Li Y, Abecasis GR. Genotype calling and haplotyping in parent-offspring trios. Genome Res. 2013 Jan;23(1):142-51. [http://genome.cshlp.org/content/early/2012/11/26/gr.142455.112.long LINK]</div>
<hr />
<div>== Introduction ==<br />
<br />
We will illustrate how TrioCaller works in sequence data including trios and unrelated samples. We will walk through all necessary steps to move from raw sequence data to called genotypes. <br />
If you are new to sequence data, please review every step. If you are experienced, you may directly jump to [http://genome.sph.umich.edu/wiki/TrioCaller#Genotype_Refinement_Using_Linkage_Disequilibrium_Information_.28TrioCaller.29 TrioCaller] specific section. <br />
<br />
We will start with a set of short sequence reads and associated base quality scores (stored in a fastq file), find the most likely genomic location for each read (producing a BAM file), generate an initial list of polymorphic sites and genotypes (stored in a VCF file) and use haplotype information to refine these genotypes (resulting in an updated VCF file).<br />
<br />
=== Note ===<br />
<br />
If you are interested in ''de novo'' mutations or are working on one or two families with deep sequence data (>30X), you should first consider our sister program, [http://genome.sph.umich.edu/wiki/Polymutt Polymutt], which ignores linkage disequilibrium information but can handle more complex pedigrees.<br />
<br />
=== Download ===<br />
<br />
'''A new version with major updates to handle general families and reference panels is coming soon. Please try the beta version below. Contact weichen.mich@gmail.com for any questions.''' <br />
<br />
Binary file only: [http://www.sph.umich.edu/csg/weich/FamLDCaller FamLDCaller]. [Last update: 12/01/2013]<br />
<br />
Before downloading the program, we would appreciate it if you could email [mailto:weichen.mich@gmail.com weichen.mich@gmail.com] (Subject: TrioCaller), with or without brief descriptive information (e.g., affiliation, depth, number of samples, and family structure). We will notify you of any updates.<br />
<br />
<br> <br />
Binary file only: [http://www.sph.umich.edu/csg/weich/TrioCaller.06262012.binary.tgz TrioCaller.06262012.binary.tgz]. <br />
<br />
Binary file with example datasets: [http://www.sph.umich.edu/csg/weich/TrioCaller.06262012.tgz TrioCaller.06262012.tgz]. <br />
<br />
[http://genome.sph.umich.edu/wiki/TrioCaller:Archive Archive]. <br />
<br />
The example dataset demonstrated here is included in the package. It consists of 40 individuals: 10 parent-offspring trios plus 10 unrelated individuals, with an average sequence depth of ~3x. README.txt describes the structure of the package. pipeline.csh (C shell) and pipeline.bash (bash shell) are two scripts that let you run all of the commands listed here in batch. <br />
<br />
To conserve time and disk-space, our analysis will focus on a small region on chromosome 20 around position 2,000,000. We will first map reads for a single individual (labeled SAMPLE1). Then we combine the results with mapped reads from all individuals to generate a list of polymorphic sites and estimate accurate genotypes at each of these sites.<br />
<br />
=== Required Software ===<br />
<br />
In addition to TrioCaller, you will need BWA ([http://bio-bwa.sourceforge.net available from Sourceforge]) and samtools ([http://samtools.sourceforge.net also from Sourceforge]) installed to run this exercise. The examples were tested with bwa 0.6.1, samtools 0.1.18, and TrioCaller 0.1.1; we expect newer versions to work as well. We assume all executables are in your path.<br />
<br />
== Building an Index for Short Read Alignment ==<br />
<br />
To quickly place short reads along the genome, BWA and other read mappers typically build a word index for the genome. This index lists the location of particular short words along the genome and<br />
can be used to seed and then extend particular matches.<br />
<br />
The sequence index is typically not compatible across different BWA versions. To rebuild the sequence index, issue the following commands:<br />
<br />
<source lang="bash"><br />
#Remove any earlier reference files<br />
rm ref/human_g1k_v37_chr20.fa.*<br />
<br />
#Rebuild the reference<br />
bin/bwa index -a is ref/human_g1k_v37_chr20.fa<br />
</source><br />
<br />
== Mapping Reads to The Genome ==<br />
<br />
Next, we will use BWA to find the most likely sequence location for each read using the <code>bwa aln</code> command. This command requires two parameters, one corresponding to the reference genome, the other corresponding to a fastq file containing reads to be mapped. <br />
<br />
bin/bwa aln -q 15 ref/human_g1k_v37_chr20.fa fastq/SAMPLE1.fastq > bwa.sai/SAMPLE1.sai<br />
<br />
The file SAMPLE1.fastq contains DNA sequence reads for sample SAMPLE1. <br />
<br />
A fastq file consists of a series of multi-line records. Each record starts with a read name, followed by a DNA sequence, a separator line, and a set of per-base quality scores. Base quality scores estimate the probability of error at each sequenced base (a base quality of 10 denotes an error probability of 10%, base quality 20 denotes 1% error probability, and base quality 30 denotes 0.1% error probability). These error probabilities are each encoded in a single character for compactness and can be decoded using an [http://www.google.com/search?q=ascii+table ASCII table] (simply look up the ASCII code for each base and subtract 33 to get the base quality). By inspecting the fastq file you should be able to learn the length of the reads being mapped and their base qualities. For example, try to figure out whether base quality is typically higher at the start or the end of each read.<br />
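The Phred+33 decoding described above is easy to check directly in the shell. The quality character below ('I') is a hypothetical example, not a base taken from SAMPLE1.fastq:<br />
<br />
```shell
# Decode a Phred+33 quality character: ASCII code minus 33.
# 'I' has ASCII code 73, so its base quality is 40,
# i.e. an error probability of 10^(-40/10) = 0.0001 (0.01%).
char=I
ascii=$(printf '%d' "'$char")
echo "base quality: $((ascii - 33))"
# prints: base quality: 40
```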
<br />
=== Converting Alignments to BAM format ===<br />
<br />
The .sai alignment format is specific to BWA, so the first thing to do is convert the alignment to a more standard format that will be compatible with downstream analysis tools. We can do this with a combination of the <code>bwa samse</code>, <code>samtools view</code>, and <code>samtools sort</code> commands.<br />
<br />
bin/bwa samse -r "@RG\tID:ILLUMINA\tSM:SAMPLE1" ref/human_g1k_v37_chr20.fa bwa.sai/SAMPLE1.sai fastq/SAMPLE1.fastq | \<br />
bin/samtools view -uhS - | bin/samtools sort -m 2000000000 - bams/SAMPLE1<br />
<br />
You can check the use of these parameters in the bwa manual. The resulting BAM file uses a compact binary format to represent the <br />
alignment of each short read to the genome. You can view the contents<br />
of the file using the <code>samtools view</code> command, like so:<br />
<br />
bin/samtools view bams/SAMPLE1.bam | more<br />
<br />
The text representation of the alignment produced by <code>samtools view</code> describes<br />
the alignment of one read per line. The most interesting fields are column 1 (the read <br />
name), columns 3 and 4 (the alignment position), column 5 (the CIGAR string, describing <br />
any gaps in the alignment), and columns 10 and 11 (with the sequence and quality score). In this representation, all alignments are automatically converted to the forward strand.<br />
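Those fields are easy to pull out with awk. The alignment line below is entirely invented for illustration; in practice you would pipe the output of <code>bin/samtools view bams/SAMPLE1.bam</code> into the same awk command:<br />
<br />
```shell
# Print the read name (col 1), alignment position (cols 3-4)
# and CIGAR string (col 6) of a single, invented SAM line.
printf 'read1\t0\t20\t2100000\t37\t36M\t*\t0\t0\tACGTACGT\tIIIIIIII\n' | \
  awk -F'\t' '{print $1, $3":"$4, $6}'
# prints: read1 20:2100000 36M
```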
<br />
=== Indexing the BAM file ===<br />
<br />
<!--<br />
Although the current file contains all necessary information about reads and their<br />
genomic locations, it is missing some auxiliary information that BAM files typically<br />
contain to help describe their contents (for example, to specify that this file contains<br />
DNA sequence reads for sample NA20589). So, the very next step is to add this information<br />
to the file:<br />
<br />
samtools reheader bams/NA20589.header bams/noheader.NA20589.bam > bams/NA20589.bam<br />
<br />
!--><br />
<br />
If you reached this far, rejoice! The mapping process is almost done. We will now create <br />
an index for the file, which makes it convenient to quickly extract reads from any <br />
genome location. We do this with the <code>samtools index</code> command, like so:<br />
<br />
bin/samtools index bams/SAMPLE1.bam<br />
<br />
=== Browsing Alignment Results ===<br />
<br />
You can now view the contents of the alignment at any location using the <code>samtools view</code><br />
and <code>samtools tview</code> commands. While the <code>tview</code> generates prettier output,<br />
it is not compatible with all screens. For example, to view reads overlapping <br />
position 2,100,000 on chromosome 20, we could run:<br />
<br />
bin/samtools tview bams/SAMPLE1.bam ref/human_g1k_v37_chr20.fa<br />
<br />
Then, type "g 20:2100000"<br />
<br />
So let's recap: we have mapped reads to the genome, converted them from a BWA-specific format to a more <br />
commonly used format supported by many different programs, and sorted and indexed the results.<br />
In most cases, the next steps would be to remove duplicate reads and to ensure that base quality scores are properly calibrated. To save time, we'll skip those steps here.<br />
<br />
So far, we have only finished read mapping for one sample, SAMPLE1. We need to repeat this step for the other samples (SAMPLE2 - SAMPLE40). You can try something like:<br />
<br />
For c shell<br />
<br />
foreach file (`ls fastq/SAMPLE*.fastq | cut -f 2 -d '/' | cut -f 1 -d '.'`)<br />
echo $file<br />
bin/bwa aln -q 15 ref/human_g1k_v37_chr20.fa fastq/$file.fastq > bwa.sai/$file.sai<br />
bin/bwa samse -r "@RG\tID:ILLUMINA\tSM:$file" ref/human_g1k_v37_chr20.fa bwa.sai/$file.sai fastq/$file.fastq | \<br />
bin/samtools view -uhS - | bin/samtools sort -m 2000000000 - bams/$file<br />
bin/samtools index bams/$file.bam<br />
end<br />
<br />
For bash shell<br />
<br />
for file in `ls fastq/SAMPLE*.fastq | cut -f 2 -d '/' | cut -f 1 -d '.'`;<br />
do echo $file;<br />
bin/bwa aln -q 15 ref/human_g1k_v37_chr20.fa fastq/$file.fastq > bwa.sai/$file.sai;<br />
bin/bwa samse -r "@RG\tID:ILLUMINA\tSM:$file" ref/human_g1k_v37_chr20.fa bwa.sai/$file.sai fastq/$file.fastq | \<br />
bin/samtools view -uhS - | bin/samtools sort -m 2000000000 - bams/$file;<br />
bin/samtools index bams/$file.bam;<br />
done<br />
<br />
<br />
Once we have finished the read mapping step and generated BAM files for all samples, we can proceed to variant calling and genotype inference.<br />
<br />
== Calling variants and Inferring genotypes ==<br />
<br />
=== Initial set of variant calls ===<br />
<br />
You probably thought the initial mapping process was quite convoluted ... you'll be glad to know that<br />
the next few steps are much simpler.<br />
<br />
The first thing we'll do is use samtools to generate an initial list of variant sites, using the <code>mpileup</code> command. This command looks at the bases aligned to each location and flags locations that are likely to vary. By default, the results are stored in a BCF file, which can be converted into the more widely used VCF format using bcftools (a companion set of tools distributed with samtools).<br />
<br />
<br />
bin/samtools mpileup -Iuf ref/human_g1k_v37_chr20.fa bams/SAMPLE*bam | bin/bcftools view -bvcg - > result/chr20.mpileup.bcf<br />
<br />
bin/bcftools view result/chr20.mpileup.bcf > result/chr20.mpileup.vcf<br />
<br />
<br />
The [http://www.1000genomes.org/node/101 VCF format] is a simple text format. It starts with several header lines, which begin with the '#' character ('##' for meta-information lines), and is followed by a single line per marker that provides both summary information about the marker and genotypes for each individual. You can review the contents of the VCF file using the 'more' command:<br />
<br />
more result/chr20.mpileup.vcf<br />
<br />
Here are some questions for you to investigate:<br />
<br />
* How many variant sites were detected in this dataset? Try a command like this one:<br />
<br />
grep -vE ^# result/chr20.mpileup.vcf | wc -l<br />
<br />
(The grep command line excludes all lines beginning with # and then the wc command counts the number of lines in the file).<br />
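You can convince yourself the counting logic is right on a tiny, invented VCF fragment (the file name mini.vcf and its records are made up for this check, not taken from chr20.mpileup.vcf):<br />
<br />
```shell
# Two header lines plus three variant records: the count should be 3.
printf '##fileformat=VCFv4.1\n#CHROM\tPOS\tID\tREF\tALT\n20\t100\t.\tA\tG\n20\t200\t.\tC\tT\n20\t300\t.\tG\tA\n' > mini.vcf
grep -vE '^#' mini.vcf | wc -l
# prints: 3
```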
<br />
<!-- <br />
* How many variant sites are estimated to be singletons?<br />
!--><br />
<br />
<!--<br />
<br />
Wei: I commented this out because generating GLF requires an old version of samtools that most people don't have. Goncalo<br />
<br />
=== Alternative way to call variants ===<br />
Bcftools assumes all samples are unrelated. The better way to call variants is to consider family constraints imposed by parent-offspring trio. An alternative approach for current step is to use<br />
another program [http://genome.sph.umich.edu/wiki/Polymutt Polymutt]. It consists of two steps, but the input [bam file] and output files [vcf file] are same as above. For more information, <br />
click the link. <br />
<br />
Convert bam files to glf files (It is a single line of command)<br />
<br />
bin/samtools view -bh bams/SAMPLE1.bam | bin/samtools calmd -Abr - ref/human_g1k_v37_chr20.fa 2 - | \<br />
bin/samtools mpileup -g -f ref/human_g1k_v37_chr20.fa - > bams/SAMPLE1.glf<br />
<br />
Take glf files and output vcf files.<br />
<br />
bin/polymutt -p polymutt.ped -d polymutt.dat -g polymutt.glfindex --vcf result/chr20.polymutt.vcf<br />
<br />
!--><br />
<br />
=== Genotype Refinement Using Linkage Disequilibrium Information (TrioCaller) ===<br />
<br />
The initial set of genotype calls is generated by examining a single individual at a time. These calls are typically quite good for deep sequencing data, but much less accurate for low-pass sequence data. In either case, they can be greatly improved by models that combine information across sites and individuals and consider the constraints imposed by parent-offspring trios. <br />
<br />
Note: The current version only supports SNP data, so please '''filter indels''' before running TrioCaller. It supports the VCF 4.0 and 4.1 formats, with the '''exception of dropped missing trailing fields''' (e.g., use the complete missing notation ./.:.:.:.,.,. rather than ./. for the genotype field).<br />
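One simple way to drop indels is to keep only records whose REF and ALT fields are single bases. This is a rough sketch only (it ignores multi-allelic ALT fields and is not part of the TrioCaller distribution); the file demo.vcf and its two records are invented:<br />
<br />
```shell
# A tiny invented VCF: one SNP (pos 100) and one indel (pos 200).
printf '#CHROM\tPOS\tID\tREF\tALT\n20\t100\t.\tA\tG\n20\t200\t.\tAT\tA\n' > demo.vcf
# Keep header lines; keep records only when REF (col 4) and ALT (col 5)
# are both one base long, so the indel at position 200 is dropped.
awk -F'\t' '/^#/ {print; next} length($4)==1 && length($5)==1' demo.vcf
```
For the tutorial data you would run the same awk filter on result/chr20.mpileup.vcf and redirect the output to a new file.<br />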
<br />
Here is a summary of the TrioCaller command line options (these are also listed whenever you run the program with no parameters):<br />
<br />
<source lang="text"><br />
Available Options<br />
Shotgun Sequences: --vcf [], --pedfile [] <br />
Markov Sampler: --seed [], --burnin [], --rounds [] <br />
Haplotyper: --states [], --errorRate [], --compact<br />
Phasing: --randomPhase , --inputPhased, --refPhased<br />
Output Files: --prefix [], --phase, --interimInterval []<br />
<br />
<br />
Explanation of Options<br />
--vcf: Standard VCF file (4.0 and above). <br />
--pedfile: Pedigree file in MERLIN format.<br />
--seed: Seed for sampling, default 123456.<br />
--burnin: The number of rounds ignored at the beginning of sampling.<br />
--rounds: The total number of iterations.<br />
--states: The number of haplotypes used in the state space. The default is the maximum number: 2*(number of founders - 1).<br />
--errorRate: The pre-defined base error rate. Default 0.01.<br />
--randomPhase: The initial haplotypes are inferred from the single marker. Default option.<br />
--inputPhased: The initial haplotypes are directly from input VCF file (with "|" as separator in the genotype column).<br />
--refPhased: The initial haplotypes are built on reference alleles from VCF file.<br />
--prefix: The prefix of output file <br />
--interimInterval: The number of rounds for interim outputs<br />
</source><br />
<br />
Note: The pedigree file requires complete trio structures (all three members of each trio must exist in the file). For a parent-offspring pair, create a "fake" parent in the pedigree file or code the two individuals as unrelated. The order of the names in the pedigree file does NOT need to be consistent with that in the .vcf file. The computation will be intensive if the number of samples is large. <br />
You can use the "--states" option to reduce the computational cost (e.g., start with "--states 50"). <br />
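As a sketch, a MERLIN-format pedigree file for one trio plus one unrelated sample might look like the following (all IDs are invented; the columns are family ID, individual ID, father ID, mother ID, and sex, with 0 marking a missing parent):<br />
<br />
```text
TRIO1   FATHER1   0         0         1
TRIO1   MOTHER1   0         0         2
TRIO1   CHILD1    FATHER1   MOTHER1   2
UNREL1  SAMPLE31  0         0         1
```
Depending on your accompanying data files, MERLIN pedigree files may carry additional phenotype columns after the sex column; check the MERLIN documentation for details.<br />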
<br />
To complete our example analysis, we could run:<br />
<br />
bin/TrioCaller --vcf result/chr20.mpileup.vcf --pedfile ped/triocaller.ped --states 50 --rounds 10 --prefix result/chr20.triocaller<br />
<br />
The format of the output file is the same as that of the input file. Again, you can review the contents of the updated VCF file using the more command:<br />
<br />
more result/chr20.triocaller.vcf<br />
<br />
All right, congratulations! You have reached the end and learned the basic skills for accurate genotype calling in trios.<br />
<br />
If you have any questions or comments, feel free to contact Wei Chen at [mailto:weichen.mich@gmail.com weichen.mich@gmail.com] or Goncalo Abecasis at [mailto:goncalo@umich.edu goncalo@umich.edu].<br />
<br />
== Citation ==<br />
<br />
Chen W, Li B, Zeng Z, Sanna S, Sidore C, Busonero F, Kang HM, Li Y, Abecasis GR. Genotype calling and haplotyping in parent-offspring trios. Genome Res. 2013 Jan;23(1):142-51 [http://genome.cshlp.org/content/early/2012/11/26/gr.142455.112.long LINK]</div>
<br />
Binary file with example datasets&nbsp;: [http://www.sph.umich.edu/csg/weich/TrioCaller.06262012.tgz TrioCaller.06262012.tgz]. <br />
<br />
[http://genome.sph.umich.edu/wiki/TrioCaller:Archive Archive]. <br />
<br />
The example dataset demonstrated here is also included. Our dataset consists of 40 individuals, including 10 parent-offspring trios and 10 unrelated individuals. The average sequence depth is ~3x. README.txt describes the structure of the package. Pipeline.csh (C shell) and pipeline.bash (bash shell) are two scripts for you to run all commands listed here in batch. <br />
<br />
To conserve time and disk-space, our analysis will focus on a small region on chromosome 20 around position 2,000,000. We will first map reads for a single individual (labeled SAMPLE1). Then we combine the results with mapped reads from all individuals to generate a list of polymorphic sites and estimate accurate genotypes at each of these sites.<br />
<br />
=== Required Software ===<br />
<br />
In addition to TrioCaller, you will need BWA ([http://bio-bwa.sourceforge.net available from Sourceforge]) and samtools ([http://samtools.sourceforge.net also from Sourceforge]) installed to run this exercise. The examples are tested in in bwa 0.6.1, samtools 0.1.18, TrioCaller 0.1.1; we expect newer versions should also work. We assume all executables are in your path.<br />
<br />
== Building an Index for Short Read Alignment ==<br />
<br />
To quickly place short reads along the genome, BWA and other read mappers typically build a word index for the genome. This index lists the location of particular short words along the genome and<br />
can be used to seed and then extend particular matches.<br />
<br />
The sequence index is typically not compatible across different BWA versions. To rebuild the sequence index, issue the following commands:<br />
<br />
<source lang="bash"><br />
#Remove any earlier reference files<br />
rm ref/human_g1k_v37_chr20.fa.*<br />
<br />
#Rebuild the reference<br />
bin/bwa index -a is ref/human_g1k_v37_chr20.fa<br />
</source><br />
<br />
== Mapping Reads to The Genome ==<br />
<br />
Next, we will use BWA to find the most likely sequence location for each read using the <code>bwa aln</code> command. This command requires two parameters, one corresponding to the reference genome, the other corresponding to a fastq file containing reads to be mapped. <br />
<br />
bin/bwa aln -q 15 ref/human_g1k_v37_chr20.fa fastq/SAMPLE1.fastq > bwa.sai/SAMPLE1.sai<br />
<br />
The file SAMPLE1.fastq contains DNA sequence reads for sample SAMPLE1. <br />
<br />
A fastq file consists of a series of multi-line records. Each record starts with a read name, followed by a DNA sequencing, a separator line, and a set of per base quality scores. Base quality scores estimate the probability of error at each sequenced base (a base quality of 10 denotes an error probability of 10%, base quality 20 denotes 1% error probability and base quality 30 denotes 0.1% error probability). These error probabilities are each encoded in a single character for compactness and can be decoded using an [http://www.google.com/search?q=ascii+table ASCII table] (simply look up the ascii code for each base and subtract 33 to get base quality). By inspecting the FastQ file you should be able to learn about the length of reads being mapped and their base qualities. For example, try to figure out if base quality is typically higher at the start or end of each read...<br />
<br />
=== Converting Alignments to BAM format ===<br />
<br />
The .sai alignment format is specific to BWA, so the first thing to do is to convert the alignment to a more standard format that will be compatible with downstream analysis tools. We can do this with a combination of the <code>bwa samse</code> command and <code>samtools view</code> and <code>samtoosl sort</code> commands.<br />
<br />
bin/bwa samse -r "@RG\tID:ILLUMINA\tSM:SAMPLE1" ref/human_g1k_v37_chr20.fa bwa.sai/SAMPLE1.sai fastq/SAMPLE1.fastq | \<br />
bin/samtools view -uhS - | bin/samtools sort -m 2000000000 - bams/SAMPLE1<br />
<br />
You can check the use of parameters in the bwa manual. The result BAM file uses a compact binary format to represent the <br />
alignment of each short read to the genome. You can view the contents<br />
of the file using the <code>samtools view</code> command, like so:<br />
<br />
bin/samtools view bams/SAMPLE1.bam | more<br />
<br />
The text representation of the alignment produced by <code>samtools view</code> describes<br />
the alignment of one read per line. The most interesting fields are column 1 (the read <br />
name), columns 3 and 4 (the alignment position), column 5 (the CIGAR string, describing <br />
any gaps in the alignment), and columns 10 and 11 (with the sequence and quality score). In this representation, all alignments are automatically converted to the forward strand.<br />
<br />
=== Indexing the BAM file ===<br />
<br />
<!--<br />
Although the current file contains all necessary information about reads and their<br />
genomic locations, it is missing some auxiliary information that BAM files typically<br />
contain to help describe their contents (for example, to specify that this file contains<br />
DNA sequence reads for sample NA20589). So, the very next step is to add this information<br />
to the file:<br />
<br />
samtools reheader bams/NA20589.header bams/noheader.NA20589.bam > bams/NA20589.bam<br />
<br />
!--><br />
<br />
If you reached this far, rejoice! The mapping process is almost done. We will now create <br />
an index for the file, which makes it convenient to quickly extract reads from any <br />
genome location. We do this with the <code>samtools index</code> command, like so:<br />
<br />
bin/samtools index bams/SAMPLE1.bam<br />
<br />
=== Browsing Alignment Results ===<br />
<br />
You can now view the contents of the alignment at any location using the <code>samtools view</code><br />
and <code>samtools tview</code> commands. While the <code>tview</code> generates prettier output,<br />
it is not compatible with all screens. For example, to view reads overlapping <br />
starting at position 2,100,000 on chromosome 20, we could run:<br />
<br />
bin/samtools tview bams/SAMPLE1.bam ref/human_g1k_v37_chr20.fa<br />
<br />
Then, type "g 20:2100000"<br />
<br />
So let's recap: we have mapped reads to genome, converted them from a BWA specific format to a more <br />
commonly used format used by many different programs, sorted and indexed the results.<br />
In most cases, the next step would be to remove duplicate reads and to ensure that base quality scores are properly calibrated. To save time, we'll skip those steps now.<br />
<br />
Till now, we only finished read mapping for one sample SAMPLE1. We need to repeat this step for other samples (SAMPL2 - SAMPLE40). You can try something like:<br />
<br />
For c shell<br />
<br />
foreach file (`ls fastq/SAMPLE*.fastq | cut -f 2 -d '/' | cut -f 1 -d '.'`)<br />
echo $file<br />
bin/bwa aln -q 15 ref/human_g1k_v37_chr20.fa fastq/$file.fastq > bwa.sai/$file.sai<br />
bin/bwa samse -r "@RG\tID:ILLUMINA\tSM:$file" ref/human_g1k_v37_chr20.fa bwa.sai/$file.sai fastq/$file.fastq | \<br />
bin/samtools view -uhS - | bin/samtools sort -m 2000000000 - bams/$file<br />
bin/samtools index bams/$file.bam<br />
end<br />
<br />
For bash shell<br />
<br />
for file in `ls fastq/SAMPLE*.fastq | cut -f 2 -d '/' | cut -f 1 -d '.'`;<br />
do echo $file;<br />
bin/bwa aln -q 15 ref/human_g1k_v37_chr20.fa fastq/$file.fastq > bwa.sai/$file.sai;<br />
bin/bwa samse -r "@RG\tID:ILLUMINA\tSM:$file" ref/human_g1k_v37_chr20.fa bwa.sai/$file.sai fastq/$file.fastq | \<br />
bin/samtools view -uhS - | samtools sort -m 2000000000 - bams/$file;<br />
bin/samtools index bams/$file.bam;<br />
done<br />
<br />
<br />
Once we finish the read mapping step and generate bam files for all samples, we can step to variant calling and genotype inference.<br />
<br />
== Calling variants and Inferring genotypes ==<br />
<br />
=== Initial set of variant calls ===<br />
<br />
You probably thought the initial mapping process was quite convoluted ... you'll be glad to know that<br />
the next few steps are much simpler.<br />
<br />
The first thing we'll do is use samtools to generate an initial list of variant sites, using the <code>mpileup</code> command. This command looks at the bases aligned to each location and flags locations that are likely to vary. By default, the results are stored in BCF file, which can be converted into the more widely used VCF format using bcftools (a companion set of tools distributed with samtools).<br />
<br />
<br />
bin/samtools mpileup -Iuf ref/human_g1k_v37_chr20.fa bams/SAMPLE*bam | bin/bcftools view -bvcg - > result/chr20.mpileup.bcf<br />
<br />
bin/bcftools view result/chr20.mpileup.bcf > result/chr20.mpileup.vcf<br />
<br />
<br />
The [http://www.1000genomes.org/node/101 VCF format] is a simple text format. It starts with several header lines, which all start with the two '##' characters, and is followed by a single line per marker that provides both summary information about the marker and genotypes for each individual. You can review the contents of the VCF file using the 'more' command:<br />
<br />
more result/chr20.mpileup.vcf<br />
<br />
Here are some questions for you to investigate:<br />
<br />
* How many variant sites were detected in this dataset? Try a command like this one:<br />
<br />
grep -vE ^# result/chr20.mpileup.vcf | wc -l<br />
<br />
(The grep command line excludes all lines beginning with # and then the wc command counts the number of lines in the file).<br />
<br />
<!-- <br />
* How many variant sites are estimated to be singletons?<br />
!--><br />
<br />
<!--<br />
<br />
Wei: I commented this out because generating GLF requires an old version of samtools that most people don't have. Goncalo<br />
<br />
=== Alternative way to call variants ===<br />
Bcftools assumes all samples are unrelated. The better way to call variants is to consider family constraints imposed by parent-offspring trio. An alternative approach for current step is to use<br />
another program [http://genome.sph.umich.edu/wiki/Polymutt Polymutt]. It consists of two steps, but the input [bam file] and output files [vcf file] are same as above. For more information, <br />
click the link. <br />
<br />
Convert bam files to glf files (It is a single line of command)<br />
<br />
bin/samtools view -bh bams/SAMPLE1.bam | bin/samtools calmd -Abr - ref/human_g1k_v37_chr20.fa 2 - | \<br />
bin/samtools mpileup -g -f ref/human_g1k_v37_chr20.fa - > bams/SAMPLE1.glf<br />
<br />
Take glf files and output vcf files.<br />
<br />
bin/polymutt -p polymutt.ped -d polymutt.dat -g polymutt.glfindex --vcf result/chr20.polymutt.vcf<br />
<br />
!--><br />
<br />
=== Genotype Refinement Using Linkage Disequilibrium Information (TrioCaller) ===<br />
<br />
The initial set of genotype calls is generated examining a single individual at a time. These calls are typically quite good for deep sequencing data, but much less accurate for low pass sequence data. In either case, they can be greatly improved by models that combine information across sites and individuals and consider the contraints imposed by parent-offspring trios. <br />
<br />
Note: The current version only supports SNP data, so please '''filter indels''' before running TrioCaller. It supports VCF 4.0 and 4.1 formats with the '''exception of dropped missing trailing fields''' (e.g. use complete missing notation ./.:.:.:.,.,. rather than ./. for the genotype field)<br />
<br />
Here is a summary of the TrioCaller command line options (these are also listed whenever you run the program with no parameters):<br />
<br />
<source lang="text"><br />
Available Options<br />
Shotgun Sequences: --vcf [], --pedfile [] <br />
Markov Sampler: --seed [], --burnin [], --rounds [] <br />
Haplotyper: --states [], --errorRate [], --compact<br />
Phasing: --randomPhase , --inputPhased, --refPhased<br />
Output Files: --prefix [], --phase, --interimInterval []<br />
<br />
<br />
Explanation of Options<br />
--vcf: Standard VCF file (4.0 and above). <br />
--pedfile: Pedigree file in MERLIN format.<br />
--seed: Seed for sampling, default 123456.<br />
--burnin: The number of rounds ignored at the beginning of sampling.<br />
--rounds: The total number of iterations.<br />
--states: The number of haplotyes used in the state space. The default is the maximum number: 2*(number of founders -1).<br />
--errorRate: The pre-defined base error rate. Default 0.01.<br />
--randomPhase: The initial haplotypes are inferred from the single marker. Default option.<br />
--inputPhased: The initial haplotypes are directly from input VCF file (with "|" as separator in the genotype column).<br />
--refPhased: The initial haplotypes are built on reference alleles from VCF file.<br />
--prefix: The prefix of output file <br />
--interimInterval: The number of rounds for interim outputs<br />
</source><br />
<br />
Note: The pedigree files require complete trio structures (all three members of the trio exist in the file). For parent-offspring pair, create a "fake" parent in the pedigree file or code them as unrelated individuals. The order of the names in the pedigree file is NOT necessary to be consistent with that in .vcf file. The computation will be intensive if the number of samples are large. <br />
You can use option "--states" to reduce the computation cost (e.g. start with "--states 50") <br />
<br />
To complete our example analysis, we could run:<br />
<br />
bin/TrioCaller --vcf result/chr20.mpileup.vcf --pedfile ped/triocaller.ped --states 50 --rounds 10 --prefix result/chr20.triocaller<br />
<br />
The format of output file is same as the input file. Again, you can review the contents of the updated VCF file using the more command:<br />
<br />
more result/chr20.triocaller.vcf<br />
<br />
All right. Congratulations! You have come to the end and learned basic skills for accurate genotype calling in trios.<br />
<br />
If you have any question or comment, feel free to contact Wei Chen at [mailto:weichen.mich@gmail.com weichen.mich@gmail.com] or Goncalo Abecasis at [mailto:goncalo@umich.edu goncalo@umich.edu]<br />
<br />
== Citation ==<br />
<br />
Chen W, Li B, Zeng Z, Sanna S, Sidore C, Busonero F, Kang HM, Li Y, Abecasis GR. &nbsp;Genotype calling and haplotyping in parent-offspring trios. Genome Res.&nbsp;2013 Jan;23(1):142-51&nbsp;[http://genome.cshlp.org/content/early/2012/11/26/gr.142455.112.long LINK]</div>Weichhttp://genome.sph.umich.edu/w/index.php?title=TrioCaller&diff=9143TrioCaller2013-12-13T19:19:05Z<p>Weich: /* Download */</p>
<hr />
<div>== Introduction ==<br />
<br />
We will illustrate how TrioCaller works in sequence data including trios and unrelated samples. We will walk through all necessary steps to move from raw sequence data to called genotypes. <br />
If you are new to sequence data, please review every step. If you are experienced, you may directly jump to [http://genome.sph.umich.edu/wiki/TrioCaller#Genotype_Refinement_Using_Linkage_Disequilibrium_Information_.28TrioCaller.29 TrioCaller] specific section. <br />
<br />
We will start with a set of short sequence reads and associated base quality scores (stored in a fastq file), find the most likely genomic location for each read (producing a BAM file), generate an initial list of polymorphic sites and genotypes (stored in a VCF file) and use haplotype information to refine these genotypes (resulting in an updated VCF file).<br />
<br />
=== Note ===<br />
<br />
If you are interested in ''de novo'' mutations or are working on one or two families with deep sequence data (>30X), you should first consider our sister program, [http://genome.sph.umich.edu/wiki/Polymutt Polymutt], which ignores linkage disequilibrium information but can handle more complex pedigrees.<br />
<br />
=== Download ===<br />
<br />
'''A new version with major updates to handle general families and reference panels is coming soon. Contact weichen.mich@gmail.com for a test version.''' <br />
<br />
Binary file only: [http://www.sph.umich.edu/csg/weich/FamLDCaller.12012013 FamLDCaller]. <br />
<br />
Before downloading the program, we appreciate if you could email [mailto:weichen.mich@gmail.com weichen.mich@gmail.com]&nbsp;(Subject: TrioCaller,&nbsp;with/without a little descriptive information (e.g. Affiliation, depth, the number of samples and family structure). We will notify you if there is any update.&nbsp;<br />
<br />
<br> <br />
Binary file only: [http://www.sph.umich.edu/csg/weich/TrioCaller.06262012.binary.tgz TrioCaller.06262012.binary.tgz]. <br />
<br />
Binary file with example datasets&nbsp;: [http://www.sph.umich.edu/csg/weich/TrioCaller.06262012.tgz TrioCaller.06262012.tgz]. <br />
<br />
[http://genome.sph.umich.edu/wiki/TrioCaller:Archive Archive]. <br />
<br />
The example dataset demonstrated here is also included. Our dataset consists of 40 individuals, including 10 parent-offspring trios and 10 unrelated individuals. The average sequence depth is ~3x. README.txt describes the structure of the package. Pipeline.csh (C shell) and pipeline.bash (bash shell) are two scripts for you to run all commands listed here in batch. <br />
<br />
To conserve time and disk-space, our analysis will focus on a small region on chromosome 20 around position 2,000,000. We will first map reads for a single individual (labeled SAMPLE1). Then we combine the results with mapped reads from all individuals to generate a list of polymorphic sites and estimate accurate genotypes at each of these sites.<br />
<br />
=== Required Software ===<br />
<br />
In addition to TrioCaller, you will need BWA ([http://bio-bwa.sourceforge.net available from Sourceforge]) and samtools ([http://samtools.sourceforge.net also from Sourceforge]) installed to run this exercise. The examples are tested in in bwa 0.6.1, samtools 0.1.18, TrioCaller 0.1.1; we expect newer versions should also work. We assume all executables are in your path.<br />
<br />
== Building an Index for Short Read Alignment ==<br />
<br />
To quickly place short reads along the genome, BWA and other read mappers typically build a word index for the genome. This index lists the location of particular short words along the genome and<br />
can be used to seed and then extend particular matches.<br />
<br />
The sequence index is typically not compatible across different BWA versions. To rebuild the sequence index, issue the following commands:<br />
<br />
<source lang="bash"><br />
#Remove any earlier reference files<br />
rm ref/human_g1k_v37_chr20.fa.*<br />
<br />
#Rebuild the reference<br />
bin/bwa index -a is ref/human_g1k_v37_chr20.fa<br />
</source><br />
<br />
== Mapping Reads to The Genome ==<br />
<br />
Next, we will use BWA to find the most likely sequence location for each read using the <code>bwa aln</code> command. This command requires two parameters, one corresponding to the reference genome, the other corresponding to a fastq file containing reads to be mapped. <br />
<br />
bin/bwa aln -q 15 ref/human_g1k_v37_chr20.fa fastq/SAMPLE1.fastq > bwa.sai/SAMPLE1.sai<br />
<br />
The file SAMPLE1.fastq contains DNA sequence reads for sample SAMPLE1. <br />
<br />
A fastq file consists of a series of multi-line records. Each record starts with a read name, followed by a DNA sequencing, a separator line, and a set of per base quality scores. Base quality scores estimate the probability of error at each sequenced base (a base quality of 10 denotes an error probability of 10%, base quality 20 denotes 1% error probability and base quality 30 denotes 0.1% error probability). These error probabilities are each encoded in a single character for compactness and can be decoded using an [http://www.google.com/search?q=ascii+table ASCII table] (simply look up the ascii code for each base and subtract 33 to get base quality). By inspecting the FastQ file you should be able to learn about the length of reads being mapped and their base qualities. For example, try to figure out if base quality is typically higher at the start or end of each read...<br />
<br />
=== Converting Alignments to BAM format ===<br />
<br />
The .sai alignment format is specific to BWA, so the first thing to do is to convert the alignment to a more standard format that will be compatible with downstream analysis tools. We can do this with a combination of the <code>bwa samse</code> command and <code>samtools view</code> and <code>samtoosl sort</code> commands.<br />
<br />
bin/bwa samse -r "@RG\tID:ILLUMINA\tSM:SAMPLE1" ref/human_g1k_v37_chr20.fa bwa.sai/SAMPLE1.sai fastq/SAMPLE1.fastq | \<br />
bin/samtools view -uhS - | bin/samtools sort -m 2000000000 - bams/SAMPLE1<br />
<br />
You can check the use of parameters in the bwa manual. The result BAM file uses a compact binary format to represent the <br />
alignment of each short read to the genome. You can view the contents<br />
of the file using the <code>samtools view</code> command, like so:<br />
<br />
bin/samtools view bams/SAMPLE1.bam | more<br />
<br />
The text representation of the alignment produced by <code>samtools view</code> describes<br />
the alignment of one read per line. The most interesting fields are column 1 (the read <br />
name), columns 3 and 4 (the alignment position), column 5 (the CIGAR string, describing <br />
any gaps in the alignment), and columns 10 and 11 (with the sequence and quality score). In this representation, all alignments are automatically converted to the forward strand.<br />
<br />
=== Indexing the BAM file ===<br />
<br />
<!--<br />
Although the current file contains all necessary information about reads and their<br />
genomic locations, it is missing some auxiliary information that BAM files typically<br />
contain to help describe their contents (for example, to specify that this file contains<br />
DNA sequence reads for sample NA20589). So, the very next step is to add this information<br />
to the file:<br />
<br />
samtools reheader bams/NA20589.header bams/noheader.NA20589.bam > bams/NA20589.bam<br />
<br />
!--><br />
<br />
If you reached this far, rejoice! The mapping process is almost done. We will now create <br />
an index for the file, which makes it convenient to quickly extract reads from any <br />
genome location. We do this with the <code>samtools index</code> command, like so:<br />
<br />
bin/samtools index bams/SAMPLE1.bam<br />
<br />
=== Browsing Alignment Results ===<br />
<br />
You can now view the contents of the alignment at any location using the <code>samtools view</code><br />
and <code>samtools tview</code> commands. While the <code>tview</code> generates prettier output,<br />
it is not compatible with all screens. For example, to view reads overlapping <br />
starting at position 2,100,000 on chromosome 20, we could run:<br />
<br />
bin/samtools tview bams/SAMPLE1.bam ref/human_g1k_v37_chr20.fa<br />
<br />
Then, type "g 20:2100000"<br />
<br />
So let's recap: we have mapped reads to genome, converted them from a BWA specific format to a more <br />
commonly used format used by many different programs, sorted and indexed the results.<br />
In most cases, the next step would be to remove duplicate reads and to ensure that base quality scores are properly calibrated. To save time, we'll skip those steps now.<br />
<br />
Till now, we only finished read mapping for one sample SAMPLE1. We need to repeat this step for other samples (SAMPL2 - SAMPLE40). You can try something like:<br />
<br />
For c shell<br />
<br />
foreach file (`ls fastq/SAMPLE*.fastq | cut -f 2 -d '/' | cut -f 1 -d '.'`)<br />
echo $file<br />
bin/bwa aln -q 15 ref/human_g1k_v37_chr20.fa fastq/$file.fastq > bwa.sai/$file.sai<br />
bin/bwa samse -r "@RG\tID:ILLUMINA\tSM:$file" ref/human_g1k_v37_chr20.fa bwa.sai/$file.sai fastq/$file.fastq | \<br />
bin/samtools view -uhS - | bin/samtools sort -m 2000000000 - bams/$file<br />
bin/samtools index bams/$file.bam<br />
end<br />
<br />
For bash shell<br />
<br />
for file in `ls fastq/SAMPLE*.fastq | cut -f 2 -d '/' | cut -f 1 -d '.'`;<br />
do echo $file;<br />
bin/bwa aln -q 15 ref/human_g1k_v37_chr20.fa fastq/$file.fastq > bwa.sai/$file.sai;<br />
bin/bwa samse -r "@RG\tID:ILLUMINA\tSM:$file" ref/human_g1k_v37_chr20.fa bwa.sai/$file.sai fastq/$file.fastq | \<br />
bin/samtools view -uhS - | samtools sort -m 2000000000 - bams/$file;<br />
bin/samtools index bams/$file.bam;<br />
done<br />
<br />
<br />
Once we finish the read mapping step and generate bam files for all samples, we can step to variant calling and genotype inference.<br />
<br />
== Calling variants and Inferring genotypes ==<br />
<br />
=== Initial set of variant calls ===<br />
<br />
You probably thought the initial mapping process was quite convoluted ... you'll be glad to know that<br />
the next few steps are much simpler.<br />
<br />
The first thing we'll do is use samtools to generate an initial list of variant sites, using the <code>mpileup</code> command. This command looks at the bases aligned to each location and flags locations that are likely to vary. By default, the results are stored in BCF file, which can be converted into the more widely used VCF format using bcftools (a companion set of tools distributed with samtools).<br />
<br />
<br />
bin/samtools mpileup -Iuf ref/human_g1k_v37_chr20.fa bams/SAMPLE*bam | bin/bcftools view -bvcg - > result/chr20.mpileup.bcf<br />
<br />
bin/bcftools view result/chr20.mpileup.bcf > result/chr20.mpileup.vcf<br />
<br />
<br />
The [http://www.1000genomes.org/node/101 VCF format] is a simple text format. It starts with several header lines, which all start with the two '##' characters, and is followed by a single line per marker that provides both summary information about the marker and genotypes for each individual. You can review the contents of the VCF file using the 'more' command:<br />
<br />
more result/chr20.mpileup.vcf<br />
<br />
Here are some questions for you to investigate:<br />
<br />
* How many variant sites were detected in this dataset? Try a command like this one:<br />
<br />
grep -vE ^# result/chr20.mpileup.vcf | wc -l<br />
<br />
(The grep command line excludes all lines beginning with # and then the wc command counts the number of lines in the file).<br />
<br />
<!-- <br />
* How many variant sites are estimated to be singletons?<br />
!--><br />
<br />
<!--<br />
<br />
Wei: I commented this out because generating GLF requires an old version of samtools that most people don't have. Goncalo<br />
<br />
=== Alternative way to call variants ===<br />
Bcftools assumes all samples are unrelated. The better way to call variants is to consider family constraints imposed by parent-offspring trio. An alternative approach for current step is to use<br />
another program [http://genome.sph.umich.edu/wiki/Polymutt Polymutt]. It consists of two steps, but the input [bam file] and output files [vcf file] are same as above. For more information, <br />
click the link. <br />
<br />
Convert bam files to glf files (It is a single line of command)<br />
<br />
bin/samtools view -bh bams/SAMPLE1.bam | bin/samtools calmd -Abr - ref/human_g1k_v37_chr20.fa 2 - | \<br />
bin/samtools mpileup -g -f ref/human_g1k_v37_chr20.fa - > bams/SAMPLE1.glf<br />
<br />
Take glf files and output vcf files.<br />
<br />
bin/polymutt -p polymutt.ped -d polymutt.dat -g polymutt.glfindex --vcf result/chr20.polymutt.vcf<br />
<br />
!--><br />
<br />
=== Genotype Refinement Using Linkage Disequilibrium Information (TrioCaller) ===<br />
<br />
The initial set of genotype calls is generated examining a single individual at a time. These calls are typically quite good for deep sequencing data, but much less accurate for low pass sequence data. In either case, they can be greatly improved by models that combine information across sites and individuals and consider the contraints imposed by parent-offspring trios. <br />
<br />
Note: The current version only supports SNP data, so please '''filter indels''' before running TrioCaller. It supports VCF 4.0 and 4.1 formats with the '''exception of dropped missing trailing fields''' (e.g. use complete missing notation ./.:.:.:.,.,. rather than ./. for the genotype field)<br />
<br />
Here is a summary of the TrioCaller command line options (these are also listed whenever you run the program with no parameters):<br />
<br />
<source lang="text"><br />
Available Options<br />
Shotgun Sequences: --vcf [], --pedfile [] <br />
Markov Sampler: --seed [], --burnin [], --rounds [] <br />
Haplotyper: --states [], --errorRate [], --compact<br />
Phasing: --randomPhase , --inputPhased, --refPhased<br />
Output Files: --prefix [], --phase, --interimInterval []<br />
<br />
<br />
Explanation of Options<br />
--vcf: Standard VCF file (4.0 and above). <br />
--pedfile: Pedigree file in MERLIN format.<br />
--seed: Seed for sampling, default 123456.<br />
--burnin: The number of rounds ignored at the beginning of sampling.<br />
--rounds: The total number of iterations.<br />
--states: The number of haplotyes used in the state space. The default is the maximum number: 2*(number of founders -1).<br />
--errorRate: The pre-defined base error rate. Default 0.01.<br />
--randomPhase: The initial haplotypes are inferred from the single marker. Default option.<br />
--inputPhased: The initial haplotypes are directly from input VCF file (with "|" as separator in the genotype column).<br />
--refPhased: The initial haplotypes are built on reference alleles from VCF file.<br />
--prefix: The prefix of output file <br />
--interimInterval: The number of rounds for interim outputs<br />
</source><br />
<br />
Note: The pedigree file requires complete trio structures (all three members of each trio must exist in the file). For a parent-offspring pair, create a "fake" parent in the pedigree file or code the two as unrelated individuals. The order of the names in the pedigree file does NOT need to be consistent with that in the .vcf file. The computation will be intensive if the number of samples is large.<br />
You can use the option "--states" to reduce the computation cost (e.g. start with "--states 50").<br />
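For illustration, a minimal MERLIN-format pedigree file might look like the sketch below (columns: family ID, individual ID, father, mother, sex; 0 marks a founder). All sample names here are invented, including the "fake" parent added to complete a parent-offspring pair in the second family:<br />

```text
FAM1  FATHER1   0         0        1
FAM1  MOTHER1   0         0        2
FAM1  CHILD1    FATHER1   MOTHER1  1
FAM2  FAKEDAD2  0         0        1
FAM2  MOTHER2   0         0        2
FAM2  CHILD2    FAKEDAD2  MOTHER2  2
```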
<br />
To complete our example analysis, we could run:<br />
<br />
bin/TrioCaller --vcf result/chr20.mpileup.vcf --pedfile ped/triocaller.ped --states 50 --rounds 10 --prefix result/chr20.triocaller<br />
<br />
The format of the output file is the same as that of the input file. Again, you can review the contents of the updated VCF file using the more command:<br />
<br />
more result/chr20.triocaller.vcf<br />
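As a quick sanity check on the refined calls, you can count how many genotype fields in a record are phased (contain "|") versus unphased. The snippet below is only a sketch on a single made-up record, since the real column layout depends on your samples:<br />

```shell
# One made-up VCF body record with one phased and one unphased genotype.
printf '20\t2000100\t.\tA\tG\t50\tPASS\t.\tGT\t0|1\t0/1\n' > toy.record.vcf

# Count phased genotype columns (fields 10 and up) in each record.
awk -F'\t' '{ n = 0; for (i = 10; i <= NF; i++) if ($i ~ /\|/) n++; print n }' toy.record.vcf
# prints 1
```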
<br />
All right. Congratulations! You have reached the end and learned the basic skills for accurate genotype calling in trios.<br />
<br />
If you have any questions or comments, feel free to contact Wei Chen at [mailto:weichen.mich@gmail.com weichen.mich@gmail.com] or Goncalo Abecasis at [mailto:goncalo@umich.edu goncalo@umich.edu]<br />
<br />
== Citation ==<br />
<br />
Chen W, Li B, Zeng Z, Sanna S, Sidore C, Busonero F, Kang HM, Li Y, Abecasis GR. &nbsp;Genotype calling and haplotyping in parent-offspring trios. Genome Res.&nbsp;2013 Jan;23(1):142-51&nbsp;[http://genome.cshlp.org/content/early/2012/11/26/gr.142455.112.long LINK]</div>Weichhttp://genome.sph.umich.edu/w/index.php?title=FamLDCaller&diff=9135FamLDCaller2013-12-10T21:34:36Z<p>Weich: </p>
<hr />
<div><br />
Download here. Version. <br />
<br />
Major updates:<br />
<br />
1. Updated the algorithm to allow nuclear and multi-generational pedigrees.<br />
<br />
2. Added a feature to use a reference panel.<br />
<br />
3. More flexible loading functions for VCF files (no need to remove non-SNP variants).<br />
<br />
<br />
<br />
The initial set of genotype calls is generated by examining a single individual at a time. These calls are typically quite good for deep sequencing data, but much less accurate for low-pass sequence data. In either case, they can be greatly improved by models that combine information across sites and individuals and consider the constraints imposed by parent-offspring trios. <br />
<br />
Here is a summary of the FamLDCaller command line options (these are also listed whenever you run the program with no parameters):<br />
<br />
<source lang="text"><br />
Available Options<br />
Shotgun Sequences: --vcf [], --pedfile [] <br />
Markov Sampler: --seed [], --burnin [], --rounds [] <br />
Haplotyper: --states [], --errorRate []<br />
Phasing: --randomPhase , --inputPhased, --refPhased<br />
Output Files: --prefix [], --phase, --interimInterval []<br />
<br />
<br />
Explanation of Options<br />
--vcf: Standard VCF file (4.0 and above).<br />
--pedfile: Pedigree file in MERLIN format.<br />
--seed: Seed for sampling, default 123456.<br />
--burnin: The number of rounds ignored at the beginning of sampling.<br />
--rounds: The total number of iterations.<br />
--states: The number of haplotypes used in the state space. The default is the maximum number: 2*(number of founders - 1).<br />
--errorRate: The pre-defined base error rate. Default 0.01.<br />
--randomPhase: The initial haplotypes are inferred from single-marker genotypes. This is the default option.<br />
--inputPhased: The initial haplotypes are taken directly from the input VCF file (with "|" as the separator in the genotype column).<br />
--refPhased: The initial haplotypes are built on reference alleles from the VCF file.<br />
--prefix: The prefix of the output files.<br />
--interimInterval: The number of rounds between interim outputs.<br />
</source><br />
<br />
Note: The pedigree file requires complete family structures (both parents must exist in the pedigree file; e.g. for a parent-offspring pair, create a "fake" parent in the pedigree file or code the two as unrelated individuals). The order of the names in the pedigree file does NOT need to be consistent with that in the .vcf file. The computation will be intensive if the number of samples is large.<br />
You can use the option "--states" to reduce the computation cost (e.g. start with "--states 50").<br />
<br />
To complete our example analysis, we could run:<br />
<br />
FamLDCaller --vcf test.vcf --pedfile test.ped --states 50 --rounds 10 --prefix test.famldcaller</div>Weichhttp://genome.sph.umich.edu/w/index.php?title=TrioCaller&diff=9094TrioCaller2013-12-04T16:38:09Z<p>Weich: /* Download */</p>
<hr />
<div>== Introduction ==<br />
<br />
We will illustrate how TrioCaller works on sequence data including trios and unrelated samples. We will walk through all the necessary steps to move from raw sequence data to called genotypes. <br />
If you are new to sequence data, please review every step. If you are experienced, you may jump directly to the [http://genome.sph.umich.edu/wiki/TrioCaller#Genotype_Refinement_Using_Linkage_Disequilibrium_Information_.28TrioCaller.29 TrioCaller]-specific section. <br />
<br />
We will start with a set of short sequence reads and associated base quality scores (stored in a fastq file), find the most likely genomic location for each read (producing a BAM file), generate an initial list of polymorphic sites and genotypes (stored in a VCF file) and use haplotype information to refine these genotypes (resulting in an updated VCF file).<br />
<br />
=== Note ===<br />
<br />
If you are interested in ''de novo'' mutations or are working on one or two families with deep sequence data (>30X), you should first consider our sister program, [http://genome.sph.umich.edu/wiki/Polymutt Polymutt], which ignores linkage disequilibrium information but can handle more complex pedigrees.<br />
<br />
=== Download ===<br />
<br />
'''A new version with major updates to handle general families and reference panels is coming soon. Contact weichen.mich@gmail.com for a test version.''' <br />
<br />
<br />
<br />
Before downloading the program, we would appreciate it if you could email [mailto:weichen.mich@gmail.com weichen.mich@gmail.com]&nbsp;(Subject: TrioCaller), with or without a little descriptive information (e.g. affiliation, depth, the number of samples, and family structure). We will notify you if there is any update.&nbsp;<br />
<br />
<br> Binary file only: [http://www.sph.umich.edu/csg/weich/TrioCaller.06262012.binary.tgz TrioCaller.06262012.binary.tgz]. <br />
<br />
Binary file with example datasets&nbsp;: [http://www.sph.umich.edu/csg/weich/TrioCaller.06262012.tgz TrioCaller.06262012.tgz]. <br />
<br />
[http://genome.sph.umich.edu/wiki/TrioCaller:Archive Archive]. <br />
<br />
The example dataset demonstrated here is also included. Our dataset consists of 40 individuals: 10 parent-offspring trios plus 10 unrelated individuals. The average sequence depth is ~3x. README.txt describes the structure of the package. Pipeline.csh (C shell) and pipeline.bash (bash shell) are two scripts for running all the commands listed here in batch. <br />
<br />
To conserve time and disk-space, our analysis will focus on a small region on chromosome 20 around position 2,000,000. We will first map reads for a single individual (labeled SAMPLE1). Then we combine the results with mapped reads from all individuals to generate a list of polymorphic sites and estimate accurate genotypes at each of these sites.<br />
<br />
=== Required Software ===<br />
<br />
In addition to TrioCaller, you will need BWA ([http://bio-bwa.sourceforge.net available from Sourceforge]) and samtools ([http://samtools.sourceforge.net also from Sourceforge]) installed to run this exercise. The examples were tested with bwa 0.6.1, samtools 0.1.18, and TrioCaller 0.1.1; we expect newer versions to work as well. We assume all executables are in your path.<br />
<br />
== Building an Index for Short Read Alignment ==<br />
<br />
To quickly place short reads along the genome, BWA and other read mappers typically build a word index for the genome. This index lists the location of particular short words along the genome and<br />
can be used to seed and then extend particular matches.<br />
<br />
The sequence index is typically not compatible across different BWA versions. To rebuild the sequence index, issue the following commands:<br />
<br />
<source lang="bash"><br />
#Remove any earlier reference files<br />
rm ref/human_g1k_v37_chr20.fa.*<br />
<br />
#Rebuild the reference<br />
bin/bwa index -a is ref/human_g1k_v37_chr20.fa<br />
</source><br />
<br />
== Mapping Reads to The Genome ==<br />
<br />
Next, we will use BWA to find the most likely sequence location for each read using the <code>bwa aln</code> command. This command requires two parameters, one corresponding to the reference genome, the other corresponding to a fastq file containing reads to be mapped. <br />
<br />
bin/bwa aln -q 15 ref/human_g1k_v37_chr20.fa fastq/SAMPLE1.fastq > bwa.sai/SAMPLE1.sai<br />
<br />
The file SAMPLE1.fastq contains DNA sequence reads for sample SAMPLE1. <br />
<br />
A fastq file consists of a series of multi-line records. Each record starts with a read name, followed by a DNA sequence, a separator line, and a set of per-base quality scores. Base quality scores estimate the probability of error at each sequenced base (a base quality of 10 denotes an error probability of 10%, base quality 20 denotes a 1% error probability, and base quality 30 denotes a 0.1% error probability). These error probabilities are each encoded in a single character for compactness and can be decoded using an [http://www.google.com/search?q=ascii+table ASCII table] (simply look up the ASCII code for each character and subtract 33 to get the base quality). By inspecting the fastq file you should be able to learn the length of the reads being mapped and their base qualities. For example, try to figure out whether base quality is typically higher at the start or the end of each read...<br />
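To see the decoding in action, the small bash sketch below converts a made-up quality string to Phred scores by subtracting 33 from each character's ASCII code (this assumes the Sanger/Phred+33 encoding used for data like this tutorial's):<br />

```shell
# Decode a (made-up) FASTQ quality string, one score per character.
qual='IIIH@'
for (( i = 0; i < ${#qual}; i++ )); do
  c=${qual:i:1}
  # printf '%d' "'$c" yields the ASCII code of c; Phred+33 subtracts 33.
  printf '%s=%d ' "$c" $(( $(printf '%d' "'$c") - 33 ))
done
printf '\n'
# prints: I=40 I=40 I=40 H=39 @=31
```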
<br />
=== Converting Alignments to BAM format ===<br />
<br />
The .sai alignment format is specific to BWA, so the first thing to do is to convert the alignment to a more standard format that will be compatible with downstream analysis tools. We can do this with a combination of the <code>bwa samse</code> command and the <code>samtools view</code> and <code>samtools sort</code> commands.<br />
<br />
bin/bwa samse -r "@RG\tID:ILLUMINA\tSM:SAMPLE1" ref/human_g1k_v37_chr20.fa bwa.sai/SAMPLE1.sai fastq/SAMPLE1.fastq | \<br />
bin/samtools view -uhS - | bin/samtools sort -m 2000000000 - bams/SAMPLE1<br />
<br />
You can check the use of parameters in the bwa manual. The resulting BAM file uses a compact binary format to represent the <br />
alignment of each short read to the genome. You can view the contents<br />
of the file using the <code>samtools view</code> command, like so:<br />
<br />
bin/samtools view bams/SAMPLE1.bam | more<br />
<br />
The text representation of the alignment produced by <code>samtools view</code> describes<br />
the alignment of one read per line. The most interesting fields are column 1 (the read <br />
name), columns 3 and 4 (the alignment position), column 6 (the CIGAR string, describing <br />
any gaps in the alignment), and columns 10 and 11 (with the sequence and quality scores). In this representation, all alignments are automatically converted to the forward strand.<br />
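To pick those fields out of an alignment line, a short awk sketch (using a fabricated read, not one from the example data) looks like this:<br />

```shell
# A fabricated, tab-separated SAM record: name, flag, chrom, pos, mapq,
# CIGAR, mate fields, sequence, qualities.
printf 'read001\t0\t20\t2100000\t37\t8M\t*\t0\t0\tACGTACGT\tIIIIIIII\n' |
  awk -F'\t' '{ print "name=" $1, "pos=" $3 ":" $4, "cigar=" $6, "seq=" $10 }'
# prints: name=read001 pos=20:2100000 cigar=8M seq=ACGTACGT
```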
<br />
=== Indexing the BAM file ===<br />
<br />
<!--<br />
Although the current file contains all necessary information about reads and their<br />
genomic locations, it is missing some auxiliary information that BAM files typically<br />
contain to help describe their contents (for example, to specify that this file contains<br />
DNA sequence reads for sample NA20589). So, the very next step is to add this information<br />
to the file:<br />
<br />
samtools reheader bams/NA20589.header bams/noheader.NA20589.bam > bams/NA20589.bam<br />
<br />
!--><br />
<br />
If you have reached this far, rejoice! The mapping process is almost done. We will now create <br />
an index for the file, which makes it convenient to quickly extract reads from any <br />
genome location. We do this with the <code>samtools index</code> command, like so:<br />
<br />
bin/samtools index bams/SAMPLE1.bam<br />
<br />
=== Browsing Alignment Results ===<br />
<br />
You can now view the contents of the alignment at any location using the <code>samtools view</code><br />
and <code>samtools tview</code> commands. While the <code>tview</code> generates prettier output,<br />
it is not compatible with all screens. For example, to view reads overlapping <br />
position 2,100,000 on chromosome 20, we could run:<br />
<br />
bin/samtools tview bams/SAMPLE1.bam ref/human_g1k_v37_chr20.fa<br />
<br />
Then, type "g 20:2100000"<br />
<br />
So let's recap: we have mapped reads to the genome, converted them from a BWA-specific format to a more <br />
commonly used format supported by many different programs, and sorted and indexed the results.<br />
In most cases, the next step would be to remove duplicate reads and to ensure that base quality scores are properly calibrated. To save time, we'll skip those steps now.<br />
<br />
So far, we have only finished read mapping for one sample, SAMPLE1. We need to repeat this step for the other samples (SAMPLE2 - SAMPLE40). You can try something like:<br />
<br />
For c shell<br />
<br />
foreach file (`ls fastq/SAMPLE*.fastq | cut -f 2 -d '/' | cut -f 1 -d '.'`)<br />
echo $file<br />
bin/bwa aln -q 15 ref/human_g1k_v37_chr20.fa fastq/$file.fastq > bwa.sai/$file.sai<br />
bin/bwa samse -r "@RG\tID:ILLUMINA\tSM:$file" ref/human_g1k_v37_chr20.fa bwa.sai/$file.sai fastq/$file.fastq | \<br />
bin/samtools view -uhS - | bin/samtools sort -m 2000000000 - bams/$file<br />
bin/samtools index bams/$file.bam<br />
end<br />
<br />
For bash shell<br />
<br />
for file in `ls fastq/SAMPLE*.fastq | cut -f 2 -d '/' | cut -f 1 -d '.'`;<br />
do echo $file;<br />
bin/bwa aln -q 15 ref/human_g1k_v37_chr20.fa fastq/$file.fastq > bwa.sai/$file.sai;<br />
bin/bwa samse -r "@RG\tID:ILLUMINA\tSM:$file" ref/human_g1k_v37_chr20.fa bwa.sai/$file.sai fastq/$file.fastq | \<br />
bin/samtools view -uhS - | bin/samtools sort -m 2000000000 - bams/$file;<br />
bin/samtools index bams/$file.bam;<br />
done<br />
<br />
<br />
Once we have finished the read mapping step and generated bam files for all samples, we can move on to variant calling and genotype inference.<br />
<br />
== Calling variants and Inferring genotypes ==<br />
<br />
=== Initial set of variant calls ===<br />
<br />
You probably thought the initial mapping process was quite convoluted ... you'll be glad to know that<br />
the next few steps are much simpler.<br />
<br />
The first thing we'll do is use samtools to generate an initial list of variant sites, using the <code>mpileup</code> command. This command looks at the bases aligned to each location and flags locations that are likely to vary. By default, the results are stored in a BCF file, which can be converted into the more widely used VCF format using bcftools (a companion set of tools distributed with samtools).<br />
<br />
<br />
bin/samtools mpileup -Iuf ref/human_g1k_v37_chr20.fa bams/SAMPLE*bam | bin/bcftools view -bvcg - > result/chr20.mpileup.bcf<br />
<br />
bin/bcftools view result/chr20.mpileup.bcf > result/chr20.mpileup.vcf<br />
<br />
<br />
The [http://www.1000genomes.org/node/101 VCF format] is a simple text format. It starts with several header lines, which all begin with '##', followed by a single line per marker that provides both summary information about the marker and genotypes for each individual. You can review the contents of the VCF file using the 'more' command:<br />
<br />
more result/chr20.mpileup.vcf<br />
<br />
Here are some questions for you to investigate:<br />
<br />
* How many variant sites were detected in this dataset? Try a command like this one:<br />
<br />
grep -vE ^# result/chr20.mpileup.vcf | wc -l<br />
<br />
(The grep command excludes all lines beginning with #, and the wc command then counts the number of remaining lines, i.e. the number of variant sites.)<br />
<br />
<!-- <br />
* How many variant sites are estimated to be singletons?<br />
!--><br />
<br />
<!--<br />
<br />
Wei: I commented this out because generating GLF requires an old version of samtools that most people don't have. Goncalo<br />
<br />
=== Alternative way to call variants ===<br />
Bcftools assumes all samples are unrelated. The better way to call variants is to consider family constraints imposed by parent-offspring trio. An alternative approach for current step is to use<br />
another program [http://genome.sph.umich.edu/wiki/Polymutt Polymutt]. It consists of two steps, but the input [bam file] and output files [vcf file] are same as above. For more information, <br />
click the link. <br />
<br />
Convert bam files to glf files (It is a single line of command)<br />
<br />
bin/samtools view -bh bams/SAMPLE1.bam | bin/samtools calmd -Abr - ref/human_g1k_v37_chr20.fa 2 - | \<br />
bin/samtools mpileup -g -f ref/human_g1k_v37_chr20.fa - > bams/SAMPLE1.glf<br />
<br />
Take glf files and output vcf files.<br />
<br />
bin/polymutt -p polymutt.ped -d polymutt.dat -g polymutt.glfindex --vcf result/chr20.polymutt.vcf<br />
<br />
!--><br />
<br />
=== Genotype Refinement Using Linkage Disequilibrium Information (TrioCaller) ===<br />
<br />
The initial set of genotype calls is generated by examining a single individual at a time. These calls are typically quite good for deep sequencing data, but much less accurate for low-pass sequence data. In either case, they can be greatly improved by models that combine information across sites and individuals and consider the constraints imposed by parent-offspring trios. <br />
<br />
Note: The current version only supports SNP data, so please '''filter indels''' before running TrioCaller. It supports VCF 4.0 and 4.1 formats, '''except that missing trailing fields must not be dropped''' (e.g. use the complete missing notation ./.:.:.:.,.,. rather than the shorthand ./. in the genotype field).<br />
<br />
Here is a summary of the TrioCaller command line options (these are also listed whenever you run the program with no parameters):<br />
<br />
<source lang="text"><br />
Available Options<br />
Shotgun Sequences: --vcf [], --pedfile [] <br />
Markov Sampler: --seed [], --burnin [], --rounds [] <br />
Haplotyper: --states [], --errorRate [], --compact<br />
Phasing: --randomPhase , --inputPhased, --refPhased<br />
Output Files: --prefix [], --phase, --interimInterval []<br />
<br />
<br />
Explanation of Options<br />
--vcf: Standard VCF file (4.0 and above).<br />
--pedfile: Pedigree file in MERLIN format.<br />
--seed: Seed for sampling, default 123456.<br />
--burnin: The number of rounds ignored at the beginning of sampling.<br />
--rounds: The total number of iterations.<br />
--states: The number of haplotypes used in the state space. The default is the maximum number: 2*(number of founders - 1).<br />
--errorRate: The pre-defined base error rate. Default 0.01.<br />
--randomPhase: The initial haplotypes are inferred from single-marker genotypes. This is the default option.<br />
--inputPhased: The initial haplotypes are taken directly from the input VCF file (with "|" as the separator in the genotype column).<br />
--refPhased: The initial haplotypes are built on reference alleles from the VCF file.<br />
--prefix: The prefix of the output files.<br />
--interimInterval: The number of rounds between interim outputs.<br />
</source><br />
<br />
Note: The pedigree file requires complete trio structures (all three members of each trio must exist in the file). For a parent-offspring pair, either create a "fake" parent in the pedigree file or code the two samples as unrelated individuals. The order of the names in the pedigree file does NOT need to be consistent with that in the .vcf file. The computation will be intensive if the number of samples is large. <br />
You can use the option "--states" to reduce the computational cost (e.g. start with "--states 50").<br />
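For reference, a MERLIN-style pedigree file lists one individual per line. The sketch below uses hypothetical family and sample IDs; the ped/triocaller.ped file shipped with the example data is the authoritative version:<br />

```shell
# Hypothetical MERLIN-style pedigree: one complete trio plus one unrelated
# individual. Columns: family ID, individual ID, father ID, mother ID, sex
# (1 = male, 2 = female). Founders carry 0 for both parent IDs, and the
# individual IDs must match the sample names in the VCF file.
cat > triocaller.ped <<'EOF'
FAM1 SAMPLE1 0 0 1
FAM1 SAMPLE2 0 0 2
FAM1 SAMPLE3 SAMPLE1 SAMPLE2 1
FAM2 SAMPLE31 0 0 2
EOF
```

Here SAMPLE3 is the offspring of SAMPLE1 and SAMPLE2, while SAMPLE31 is coded as an unrelated founder in its own family.<br />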
<br />
To complete our example analysis, we could run:<br />
<br />
bin/TrioCaller --vcf result/chr20.mpileup.vcf --pedfile ped/triocaller.ped --states 50 --rounds 10 --prefix result/chr20.triocaller<br />
<br />
The format of the output file is the same as that of the input file. Again, you can review the contents of the updated VCF file using the more command:<br />
<br />
more result/chr20.triocaller.vcf<br />
<br />
All right, congratulations! You have reached the end and learned the basic skills for accurate genotype calling in trios.<br />
<br />
If you have any questions or comments, feel free to contact Wei Chen at [mailto:weichen.mich@gmail.com weichen.mich@gmail.com] or Goncalo Abecasis at [mailto:goncalo@umich.edu goncalo@umich.edu]<br />
<br />
== Citation ==<br />
<br />
Chen W, Li B, Zeng Z, Sanna S, Sidore C, Busonero F, Kang HM, Li Y, Abecasis GR. &nbsp;Genotype calling and haplotyping in parent-offspring trios. Genome Res.&nbsp;2013 Jan;23(1):142-51&nbsp;[http://genome.cshlp.org/content/early/2012/11/26/gr.142455.112.long LINK]</div>Weichhttp://genome.sph.umich.edu/w/index.php?title=TrioCaller&diff=9093TrioCaller2013-12-04T16:29:49Z<p>Weich: /* Download */</p>
<hr />
<div>== Introduction ==<br />
<br />
We will illustrate how TrioCaller works in sequence data including trios and unrelated samples. We will walk through all necessary steps to move from raw sequence data to called genotypes. <br />
If you are new to sequence data, please review every step. If you are experienced, you may directly jump to [http://genome.sph.umich.edu/wiki/TrioCaller#Genotype_Refinement_Using_Linkage_Disequilibrium_Information_.28TrioCaller.29 TrioCaller] specific section. <br />
<br />
We will start with a set of short sequence reads and associated base quality scores (stored in a fastq file), find the most likely genomic location for each read (producing a BAM file), generate an initial list of polymorphic sites and genotypes (stored in a VCF file) and use haplotype information to refine these genotypes (resulting in an updated VCF file).<br />
<br />
=== Note ===<br />
<br />
If you are interested in ''de novo'' mutations or are working on one or two families with deep sequence data (>30X), you should first consider our sister program, [http://genome.sph.umich.edu/wiki/Polymutt Polymutt], which ignores linkage disequilibrium information but can handle more complex pedigrees.<br />
<br />
=== Download ===<br />
<br />
'''A new version with major updates to handle general families and reference panels is coming soon. Contact weichen.mich@gmail.com for test version.''' <br />
<br />
Before downloading the program, we would appreciate it if you could email [mailto:weichen.mich@gmail.com weichen.mich@gmail.com]&nbsp;(Subject: TrioCaller), ideally with a little descriptive information (e.g. affiliation, sequencing depth, number of samples, and family structure). We will notify you if there is any update.&nbsp;<br />
<br />
<br> Binary file only: [http://www.sph.umich.edu/csg/weich/TrioCaller.06262012.binary.tgz TrioCaller.06262012.binary.tgz]. <br />
<br />
Binary file with example datasets&nbsp;: [http://www.sph.umich.edu/csg/weich/TrioCaller.06262012.tgz TrioCaller.06262012.tgz]. <br />
<br />
[http://genome.sph.umich.edu/wiki/TrioCaller:Archive Archive]. <br />
<br />
The example dataset demonstrated here is also included. Our dataset consists of 40 individuals: 10 parent-offspring trios and 10 unrelated individuals. The average sequencing depth is ~3x. README.txt describes the structure of the package. pipeline.csh (C shell) and pipeline.bash (bash shell) are two scripts that run all of the commands listed here in batch. <br />
<br />
To conserve time and disk-space, our analysis will focus on a small region on chromosome 20 around position 2,000,000. We will first map reads for a single individual (labeled SAMPLE1). Then we combine the results with mapped reads from all individuals to generate a list of polymorphic sites and estimate accurate genotypes at each of these sites.<br />
<br />
=== Required Software ===<br />
<br />
In addition to TrioCaller, you will need BWA ([http://bio-bwa.sourceforge.net available from Sourceforge]) and samtools ([http://samtools.sourceforge.net also from Sourceforge]) installed to run this exercise. The examples were tested with bwa 0.6.1, samtools 0.1.18, and TrioCaller 0.1.1; we expect newer versions to work as well. We assume all executables are in your path.<br />
<br />
== Building an Index for Short Read Alignment ==<br />
<br />
To quickly place short reads along the genome, BWA and other read mappers typically build a word index for the genome. This index lists the location of particular short words along the genome and<br />
can be used to seed and then extend particular matches.<br />
<br />
The sequence index is typically not compatible across different BWA versions. To rebuild the sequence index, issue the following commands:<br />
<br />
<source lang="bash"><br />
#Remove any earlier reference files<br />
rm ref/human_g1k_v37_chr20.fa.*<br />
<br />
#Rebuild the reference<br />
bin/bwa index -a is ref/human_g1k_v37_chr20.fa<br />
</source><br />
<br />
== Mapping Reads to The Genome ==<br />
<br />
Next, we will use BWA to find the most likely sequence location for each read using the <code>bwa aln</code> command. This command requires two parameters, one corresponding to the reference genome, the other corresponding to a fastq file containing reads to be mapped. <br />
<br />
bin/bwa aln -q 15 ref/human_g1k_v37_chr20.fa fastq/SAMPLE1.fastq > bwa.sai/SAMPLE1.sai<br />
<br />
The file SAMPLE1.fastq contains DNA sequence reads for sample SAMPLE1. <br />
<br />
A fastq file consists of a series of multi-line records. Each record starts with a read name, followed by the DNA sequence, a separator line, and a set of per-base quality scores. Base quality scores estimate the probability of error at each sequenced base (a base quality of 10 denotes an error probability of 10%, base quality 20 denotes 1% error probability, and base quality 30 denotes 0.1% error probability). These error probabilities are each encoded in a single character for compactness and can be decoded using an [http://www.google.com/search?q=ascii+table ASCII table] (simply look up the ASCII code for each quality character and subtract 33 to get the base quality). By inspecting the fastq file you should be able to learn about the length of reads being mapped and their base qualities. For example, try to figure out whether base quality is typically higher at the start or end of each read...<br />
<br />
=== Converting Alignments to BAM format ===<br />
<br />
The .sai alignment format is specific to BWA, so the first thing to do is to convert the alignment to a more standard format that will be compatible with downstream analysis tools. We can do this with a combination of the <code>bwa samse</code>, <code>samtools view</code>, and <code>samtools sort</code> commands.<br />
<br />
bin/bwa samse -r "@RG\tID:ILLUMINA\tSM:SAMPLE1" ref/human_g1k_v37_chr20.fa bwa.sai/SAMPLE1.sai fastq/SAMPLE1.fastq | \<br />
bin/samtools view -uhS - | bin/samtools sort -m 2000000000 - bams/SAMPLE1<br />
<br />
You can check the use of these parameters in the bwa manual. The resulting BAM file uses a compact binary format to represent the <br />
alignment of each short read to the genome. You can view the contents<br />
of the file using the <code>samtools view</code> command, like so:<br />
<br />
bin/samtools view bams/SAMPLE1.bam | more<br />
<br />
The text representation of the alignment produced by <code>samtools view</code> describes<br />
the alignment of one read per line. The most interesting fields are column 1 (the read <br />
name), columns 3 and 4 (the alignment position), column 6 (the CIGAR string, describing <br />
any gaps in the alignment), and columns 10 and 11 (the sequence and quality scores). In this representation, all alignments are automatically converted to the forward strand.<br />
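To scan just those fields you can pipe the output through awk; the alignment line below is fabricated for illustration (in a real run you would pipe <code>bin/samtools view bams/SAMPLE1.bam</code> instead of using a toy file):<br />

```shell
# One fabricated alignment line standing in for `samtools view` output.
printf 'read1\t0\t20\t2100000\t37\t36M\t*\t0\t0\tACGTACGT\tIIIIIIII\n' > toy.sam

# Pull out the read name, chromosome, position, and CIGAR string.
awk -F'\t' '{print $1, $3, $4, $6}' toy.sam
# prints: read1 20 2100000 36M
```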
<br />
=== Indexing the BAM file ===<br />
<br />
<!--<br />
Although the current file contains all necessary information about reads and their<br />
genomic locations, it is missing some auxiliary information that BAM files typically<br />
contain to help describe their contents (for example, to specify that this file contains<br />
DNA sequence reads for sample NA20589). So, the very next step is to add this information<br />
to the file:<br />
<br />
samtools reheader bams/NA20589.header bams/noheader.NA20589.bam > bams/NA20589.bam<br />
<br />
!--><br />
<br />
If you reached this far, rejoice! The mapping process is almost done. We will now create <br />
an index for the file, which makes it convenient to quickly extract reads from any <br />
genome location. We do this with the <code>samtools index</code> command, like so:<br />
<br />
bin/samtools index bams/SAMPLE1.bam<br />
<br />
=== Browsing Alignment Results ===<br />
<br />
You can now view the contents of the alignment at any location using the <code>samtools view</code><br />
and <code>samtools tview</code> commands. While <code>tview</code> generates prettier output,<br />
it is not compatible with all screens. For example, to view reads overlapping the region <br />
starting at position 2,100,000 on chromosome 20, we could run:<br />
<br />
bin/samtools tview bams/SAMPLE1.bam ref/human_g1k_v37_chr20.fa<br />
<br />
Then, type "g 20:2100000"<br />
<br />
So let's recap: we have mapped reads to the genome, converted them from a BWA-specific format to a more <br />
widely used format supported by many different programs, and sorted and indexed the results.<br />
In most cases, the next step would be to remove duplicate reads and to ensure that base quality scores are properly calibrated. To save time, we'll skip those steps now.<br />
<br />
So far, we have only finished read mapping for one sample, SAMPLE1. We need to repeat this step for the other samples (SAMPLE2 - SAMPLE40). You can try something like:<br />
<br />
For C shell<br />
<br />
foreach file (`ls fastq/SAMPLE*.fastq | cut -f 2 -d '/' | cut -f 1 -d '.'`)<br />
echo $file<br />
bin/bwa aln -q 15 ref/human_g1k_v37_chr20.fa fastq/$file.fastq > bwa.sai/$file.sai<br />
bin/bwa samse -r "@RG\tID:ILLUMINA\tSM:$file" ref/human_g1k_v37_chr20.fa bwa.sai/$file.sai fastq/$file.fastq | \<br />
bin/samtools view -uhS - | bin/samtools sort -m 2000000000 - bams/$file<br />
bin/samtools index bams/$file.bam<br />
end<br />
<br />
For bash shell<br />
<br />
for file in `ls fastq/SAMPLE*.fastq | cut -f 2 -d '/' | cut -f 1 -d '.'`;<br />
do echo $file;<br />
bin/bwa aln -q 15 ref/human_g1k_v37_chr20.fa fastq/$file.fastq > bwa.sai/$file.sai;<br />
bin/bwa samse -r "@RG\tID:ILLUMINA\tSM:$file" ref/human_g1k_v37_chr20.fa bwa.sai/$file.sai fastq/$file.fastq | \<br />
bin/samtools view -uhS - | bin/samtools sort -m 2000000000 - bams/$file;<br />
bin/samtools index bams/$file.bam;<br />
done<br />
<br />
<br />
Once we have finished the read mapping step and generated BAM files for all samples, we can proceed to variant calling and genotype inference.<br />
<br />
== Calling variants and Inferring genotypes ==<br />
<br />
=== Initial set of variant calls ===<br />
<br />
You probably thought the initial mapping process was quite convoluted ... you'll be glad to know that<br />
the next few steps are much simpler.<br />
<br />
The first thing we'll do is use samtools to generate an initial list of variant sites, using the <code>mpileup</code> command. This command looks at the bases aligned to each location and flags locations that are likely to vary. By default, the results are stored in a BCF file, which can be converted into the more widely used VCF format using bcftools (a companion set of tools distributed with samtools).<br />
<br />
<br />
bin/samtools mpileup -Iuf ref/human_g1k_v37_chr20.fa bams/SAMPLE*bam | bin/bcftools view -bvcg - > result/chr20.mpileup.bcf<br />
<br />
bin/bcftools view result/chr20.mpileup.bcf > result/chr20.mpileup.vcf<br />
<br />
<br />
The [http://www.1000genomes.org/node/101 VCF format] is a simple text format. It starts with several header lines, which all start with the two '##' characters, and is followed by a single line per marker that provides both summary information about the marker and genotypes for each individual. You can review the contents of the VCF file using the 'more' command:<br />
<br />
more result/chr20.mpileup.vcf<br />
<br />
Here are some questions for you to investigate:<br />
<br />
* How many variant sites were detected in this dataset? Try a command like this one:<br />
<br />
grep -vE ^# result/chr20.mpileup.vcf | wc -l<br />
<br />
(The grep command line excludes all lines beginning with # and then the wc command counts the number of lines in the file).<br />
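Other quick summaries follow the same pattern; for instance, a tally of transitions versus transversions among the SNP calls (the awk is illustrative, and the two records below are fabricated stand-ins for result/chr20.mpileup.vcf):<br />

```shell
# Two fabricated SNP records standing in for result/chr20.mpileup.vcf.
printf '##fileformat=VCFv4.1\n' > toy.vcf
printf '20\t100\t.\tA\tG\t50\tPASS\t.\n20\t200\t.\tA\tC\t50\tPASS\t.\n' >> toy.vcf

# Tally transitions (A<->G, C<->T) versus transversions among biallelic SNPs,
# classifying each record by its concatenated REF and ALT alleles.
awk -F'\t' '!/^#/ && $4 $5 ~ /^[ACGT][ACGT]$/ {
    if ($4 $5 ~ /^(AG|GA|CT|TC)$/) ts++; else tv++
} END { print "transitions:", ts + 0, "transversions:", tv + 0 }' toy.vcf
# prints: transitions: 1 transversions: 1
```

A transition/transversion ratio around 2 is typical for genuine human SNPs, so a much lower ratio can hint at an excess of false-positive calls.<br />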
<br />
<!-- <br />
* How many variant sites are estimated to be singletons?<br />
!--><br />
<br />
<!--<br />
<br />
Wei: I commented this out because generating GLF requires an old version of samtools that most people don't have. Goncalo<br />
<br />
=== Alternative way to call variants ===<br />
Bcftools assumes all samples are unrelated. The better way to call variants is to consider family constraints imposed by parent-offspring trio. An alternative approach for current step is to use<br />
another program [http://genome.sph.umich.edu/wiki/Polymutt Polymutt]. It consists of two steps, but the input [bam file] and output files [vcf file] are same as above. For more information, <br />
click the link. <br />
<br />
Convert bam files to glf files (It is a single line of command)<br />
<br />
bin/samtools view -bh bams/SAMPLE1.bam | bin/samtools calmd -Abr - ref/human_g1k_v37_chr20.fa 2 - | \<br />
bin/samtools mpileup -g -f ref/human_g1k_v37_chr20.fa - > bams/SAMPLE1.glf<br />
<br />
Take glf files and output vcf files.<br />
<br />
bin/polymutt -p polymutt.ped -d polymutt.dat -g polymutt.glfindex --vcf result/chr20.polymutt.vcf<br />
<br />
!--><br />
<br />
=== Genotype Refinement Using Linkage Disequilibrium Information (TrioCaller) ===<br />
<br />
The initial set of genotype calls is generated by examining a single individual at a time. These calls are typically quite good for deep sequencing data, but much less accurate for low-pass sequence data. In either case, they can be greatly improved by models that combine information across sites and individuals and consider the constraints imposed by parent-offspring trios. <br />
<br />
Note: The current version only supports SNP data, so please '''filter indels''' before running TrioCaller. It supports the VCF 4.0 and 4.1 formats, '''except that missing trailing fields must not be dropped''' (e.g. use the complete missing notation ./.:.:.:.,.,. rather than the shorthand ./. in the genotype field).<br />
<br />
Here is a summary of the TrioCaller command line options (these are also listed whenever you run the program with no parameters):<br />
<br />
<source lang="text"><br />
Available Options<br />
Shotgun Sequences: --vcf [], --pedfile [] <br />
Markov Sampler: --seed [], --burnin [], --rounds [] <br />
Haplotyper: --states [], --errorRate [], --compact<br />
Phasing: --randomPhase , --inputPhased, --refPhased<br />
Output Files: --prefix [], --phase, --interimInterval []<br />
<br />
<br />
Explanation of Options<br />
--vcf: Standard VCF file (4.0 and above). <br />
--pedfile: Pedigree file in MERLIN format.<br />
--seed: Seed for sampling, default 123456.<br />
--burnin: The number of rounds ignored at the beginning of sampling.<br />
--rounds: The total number of iterations.<br />
--states: The number of haplotypes used in the state space. The default is the maximum number: 2*(number of founders - 1).<br />
--errorRate: The pre-defined base error rate. Default 0.01.<br />
--randomPhase: The initial haplotypes are inferred from single-marker genotypes. This is the default option.<br />
--inputPhased: The initial haplotypes are taken directly from the input VCF file (with "|" as the separator in the genotype column).<br />
--refPhased: The initial haplotypes are built on the reference alleles from the VCF file.<br />
--prefix: The prefix of the output files.<br />
--interimInterval: The number of rounds between interim outputs.<br />
</source><br />
<br />
Note: The pedigree file requires complete trio structures (all three members of each trio must exist in the file). For a parent-offspring pair, either create a "fake" parent in the pedigree file or code the two samples as unrelated individuals. The order of the names in the pedigree file does NOT need to be consistent with that in the .vcf file. The computation will be intensive if the number of samples is large. <br />
You can use the option "--states" to reduce the computational cost (e.g. start with "--states 50").<br />
<br />
To complete our example analysis, we could run:<br />
<br />
bin/TrioCaller --vcf result/chr20.mpileup.vcf --pedfile ped/triocaller.ped --states 50 --rounds 10 --prefix result/chr20.triocaller<br />
<br />
The format of the output file is the same as that of the input file. Again, you can review the contents of the updated VCF file using the more command:<br />
<br />
more result/chr20.triocaller.vcf<br />
<br />
All right, congratulations! You have reached the end and learned the basic skills for accurate genotype calling in trios.<br />
<br />
If you have any questions or comments, feel free to contact Wei Chen at [mailto:weichen.mich@gmail.com weichen.mich@gmail.com] or Goncalo Abecasis at [mailto:goncalo@umich.edu goncalo@umich.edu]<br />
<br />
== Citation ==<br />
<br />
Chen W, Li B, Zeng Z, Sanna S, Sidore C, Busonero F, Kang HM, Li Y, Abecasis GR. &nbsp;Genotype calling and haplotyping in parent-offspring trios. Genome Res.&nbsp;2013 Jan;23(1):142-51&nbsp;[http://genome.cshlp.org/content/early/2012/11/26/gr.142455.112.long LINK]</div>Weichhttp://genome.sph.umich.edu/w/index.php?title=TrioCaller&diff=9092TrioCaller2013-12-04T16:29:39Z<p>Weich: /* Download */</p>
<hr />
<div>== Introduction ==<br />
<br />
We will illustrate how TrioCaller works in sequence data including trios and unrelated samples. We will walk through all necessary steps to move from raw sequence data to called genotypes. <br />
If you are new to sequence data, please review every step. If you are experienced, you may directly jump to [http://genome.sph.umich.edu/wiki/TrioCaller#Genotype_Refinement_Using_Linkage_Disequilibrium_Information_.28TrioCaller.29 TrioCaller] specific section. <br />
<br />
We will start with a set of short sequence reads and associated base quality scores (stored in a fastq file), find the most likely genomic location for each read (producing a BAM file), generate an initial list of polymorphic sites and genotypes (stored in a VCF file) and use haplotype information to refine these genotypes (resulting in an updated VCF file).<br />
<br />
=== Note ===<br />
<br />
If you are interested in ''de novo'' mutations or are working on one or two families with deep sequence data (>30X), you should first consider our sister program, [http://genome.sph.umich.edu/wiki/Polymutt Polymutt], which ignores linkage disequilibrium information but can handle more complex pedigrees.<br />
<br />
=== Download ===<br />
<br />
'''A new version with major updates to handle general families and reference panels is coming soon. Contact weichen.mich@gmail.com for test version.''' <br />
<br />
Before downloading the program, we appreciate if you could email [mailto:weichen.mich@gmail.com weichen.mich@gmail.com]&nbsp;(Subject: TrioCaller,&nbsp;with/without a little descriptive information (e.g. Affiliation, depth, the number of samples and family structure). We will notify you if there is any update.&nbsp;<br />
<br />
<br> Binary file only: [http://www.sph.umich.edu/csg/weich/TrioCaller.06262012.binary.tgz TrioCaller.06262012.binary.tgz]. <br />
<br />
Binary file with example datasets&nbsp;: [http://www.sph.umich.edu/csg/weich/TrioCaller.06262012.tgz TrioCaller.06262012.tgz]. <br />
<br />
[http://genome.sph.umich.edu/wiki/TrioCaller:Archive Archive]. <br />
<br />
The example dataset demonstrated here is also included. Our dataset consists of 40 individuals, including 10 parent-offspring trios and 10 unrelated individuals. The average sequence depth is ~3x. README.txt describes the structure of the package. Pipeline.csh (C shell) and pipeline.bash (bash shell) are two scripts for you to run all commands listed here in batch. <br />
<br />
To conserve time and disk-space, our analysis will focus on a small region on chromosome 20 around position 2,000,000. We will first map reads for a single individual (labeled SAMPLE1). Then we combine the results with mapped reads from all individuals to generate a list of polymorphic sites and estimate accurate genotypes at each of these sites.<br />
<br />
=== Required Software ===<br />
<br />
In addition to TrioCaller, you will need BWA ([http://bio-bwa.sourceforge.net available from Sourceforge]) and samtools ([http://samtools.sourceforge.net also from Sourceforge]) installed to run this exercise. The examples are tested in in bwa 0.6.1, samtools 0.1.18, TrioCaller 0.1.1; we expect newer versions should also work. We assume all executables are in your path.<br />
<br />
== Building an Index for Short Read Alignment ==<br />
<br />
To quickly place short reads along the genome, BWA and other read mappers typically build a word index for the genome. This index lists the location of particular short words along the genome and<br />
can be used to seed and then extend particular matches.<br />
<br />
The sequence index is typically not compatible across different BWA versions. To rebuild the sequence index, issue the following commands:<br />
<br />
<source lang="bash"><br />
#Remove any earlier reference files<br />
rm ref/human_g1k_v37_chr20.fa.*<br />
<br />
#Rebuild the reference<br />
bin/bwa index -a is ref/human_g1k_v37_chr20.fa<br />
</source><br />
<br />
== Mapping Reads to The Genome ==<br />
<br />
Next, we will use BWA to find the most likely sequence location for each read using the <code>bwa aln</code> command. This command requires two parameters, one corresponding to the reference genome, the other corresponding to a fastq file containing reads to be mapped. <br />
<br />
bin/bwa aln -q 15 ref/human_g1k_v37_chr20.fa fastq/SAMPLE1.fastq > bwa.sai/SAMPLE1.sai<br />
<br />
The file SAMPLE1.fastq contains DNA sequence reads for sample SAMPLE1. <br />
<br />
A fastq file consists of a series of multi-line records. Each record starts with a read name, followed by a DNA sequencing, a separator line, and a set of per base quality scores. Base quality scores estimate the probability of error at each sequenced base (a base quality of 10 denotes an error probability of 10%, base quality 20 denotes 1% error probability and base quality 30 denotes 0.1% error probability). These error probabilities are each encoded in a single character for compactness and can be decoded using an [http://www.google.com/search?q=ascii+table ASCII table] (simply look up the ascii code for each base and subtract 33 to get base quality). By inspecting the FastQ file you should be able to learn about the length of reads being mapped and their base qualities. For example, try to figure out if base quality is typically higher at the start or end of each read...<br />
<br />
=== Converting Alignments to BAM format ===<br />
<br />
The .sai alignment format is specific to BWA, so the first thing to do is to convert the alignment to a more standard format that will be compatible with downstream analysis tools. We can do this with a combination of the <code>bwa samse</code> command and <code>samtools view</code> and <code>samtoosl sort</code> commands.<br />
<br />
bin/bwa samse -r "@RG\tID:ILLUMINA\tSM:SAMPLE1" ref/human_g1k_v37_chr20.fa bwa.sai/SAMPLE1.sai fastq/SAMPLE1.fastq | \<br />
bin/samtools view -uhS - | bin/samtools sort -m 2000000000 - bams/SAMPLE1<br />
<br />
You can check the use of parameters in the bwa manual. The result BAM file uses a compact binary format to represent the <br />
alignment of each short read to the genome. You can view the contents<br />
of the file using the <code>samtools view</code> command, like so:<br />
<br />
bin/samtools view bams/SAMPLE1.bam | more<br />
<br />
The text representation of the alignment produced by <code>samtools view</code> describes<br />
the alignment of one read per line. The most interesting fields are column 1 (the read <br />
name), columns 3 and 4 (the alignment position), column 5 (the CIGAR string, describing <br />
any gaps in the alignment), and columns 10 and 11 (with the sequence and quality score). In this representation, all alignments are automatically converted to the forward strand.<br />
<br />
=== Indexing the BAM file ===<br />
<br />
<!--<br />
Although the current file contains all necessary information about reads and their<br />
genomic locations, it is missing some auxiliary information that BAM files typically<br />
contain to help describe their contents (for example, to specify that this file contains<br />
DNA sequence reads for sample NA20589). So, the very next step is to add this information<br />
to the file:<br />
<br />
samtools reheader bams/NA20589.header bams/noheader.NA20589.bam > bams/NA20589.bam<br />
<br />
!--><br />
<br />
If you reached this far, rejoice! The mapping process is almost done. We will now create <br />
an index for the file, which makes it convenient to quickly extract reads from any <br />
genome location. We do this with the <code>samtools index</code> command, like so:<br />
<br />
bin/samtools index bams/SAMPLE1.bam<br />
<br />
=== Browsing Alignment Results ===<br />
<br />
You can now view the contents of the alignment at any location using the <code>samtools view</code><br />
and <code>samtools tview</code> commands. While the <code>tview</code> generates prettier output,<br />
it is not compatible with all screens. For example, to view reads overlapping <br />
starting at position 2,100,000 on chromosome 20, we could run:<br />
<br />
bin/samtools tview bams/SAMPLE1.bam ref/human_g1k_v37_chr20.fa<br />
<br />
Then, type "g 20:2100000"<br />
<br />
So let's recap: we have mapped reads to genome, converted them from a BWA specific format to a more <br />
commonly used format used by many different programs, sorted and indexed the results.<br />
In most cases, the next step would be to remove duplicate reads and to ensure that base quality scores are properly calibrated. To save time, we'll skip those steps now.<br />
<br />
Till now, we only finished read mapping for one sample SAMPLE1. We need to repeat this step for other samples (SAMPL2 - SAMPLE40). You can try something like:<br />
<br />
For c shell<br />
<br />
foreach file (`ls fastq/SAMPLE*.fastq | cut -f 2 -d '/' | cut -f 1 -d '.'`)<br />
echo $file<br />
bin/bwa aln -q 15 ref/human_g1k_v37_chr20.fa fastq/$file.fastq > bwa.sai/$file.sai<br />
bin/bwa samse -r "@RG\tID:ILLUMINA\tSM:$file" ref/human_g1k_v37_chr20.fa bwa.sai/$file.sai fastq/$file.fastq | \<br />
bin/samtools view -uhS - | bin/samtools sort -m 2000000000 - bams/$file<br />
bin/samtools index bams/$file.bam<br />
end<br />
<br />
For bash shell<br />
<br />
for file in `ls fastq/SAMPLE*.fastq | cut -f 2 -d '/' | cut -f 1 -d '.'`;<br />
do echo $file;<br />
bin/bwa aln -q 15 ref/human_g1k_v37_chr20.fa fastq/$file.fastq > bwa.sai/$file.sai;<br />
bin/bwa samse -r "@RG\tID:ILLUMINA\tSM:$file" ref/human_g1k_v37_chr20.fa bwa.sai/$file.sai fastq/$file.fastq | \<br />
bin/samtools view -uhS - | samtools sort -m 2000000000 - bams/$file;<br />
bin/samtools index bams/$file.bam;<br />
done<br />
<br />
<br />
Once we finish the read mapping step and generate bam files for all samples, we can step to variant calling and genotype inference.<br />
<br />
== Calling variants and Inferring genotypes ==<br />
<br />
=== Initial set of variant calls ===<br />
<br />
You probably thought the initial mapping process was quite convoluted ... you'll be glad to know that<br />
the next few steps are much simpler.<br />
<br />
The first thing we'll do is use samtools to generate an initial list of variant sites, using the <code>mpileup</code> command. This command looks at the bases aligned to each location and flags locations that are likely to vary. By default, the results are stored in BCF file, which can be converted into the more widely used VCF format using bcftools (a companion set of tools distributed with samtools).<br />
<br />
<br />
bin/samtools mpileup -Iuf ref/human_g1k_v37_chr20.fa bams/SAMPLE*bam | bin/bcftools view -bvcg - > result/chr20.mpileup.bcf<br />
<br />
bin/bcftools view result/chr20.mpileup.bcf > result/chr20.mpileup.vcf<br />
<br />
<br />
The [http://www.1000genomes.org/node/101 VCF format] is a simple text format. It starts with several header lines, which all start with the two '##' characters, and is followed by a single line per marker that provides both summary information about the marker and genotypes for each individual. You can review the contents of the VCF file using the 'more' command:<br />
<br />
more result/chr20.mpileup.vcf<br />
<br />
Here are some questions for you to investigate:<br />
<br />
* How many variant sites were detected in this dataset? Try a command like this one:<br />
<br />
grep -vE ^# result/chr20.mpileup.vcf | wc -l<br />
<br />
(The grep command line excludes all lines beginning with # and then the wc command counts the number of lines in the file).<br />
<br />
<!-- <br />
* How many variant sites are estimated to be singletons?<br />
!--><br />
<br />
<!--<br />
<br />
Wei: I commented this out because generating GLF requires an old version of samtools that most people don't have. Goncalo<br />
<br />
=== Alternative way to call variants ===<br />
Bcftools assumes all samples are unrelated. The better way to call variants is to consider family constraints imposed by parent-offspring trio. An alternative approach for current step is to use<br />
another program [http://genome.sph.umich.edu/wiki/Polymutt Polymutt]. It consists of two steps, but the input [bam file] and output files [vcf file] are same as above. For more information, <br />
click the link. <br />
<br />
Convert bam files to glf files (It is a single line of command)<br />
<br />
bin/samtools view -bh bams/SAMPLE1.bam | bin/samtools calmd -Abr - ref/human_g1k_v37_chr20.fa 2 - | \<br />
bin/samtools mpileup -g -f ref/human_g1k_v37_chr20.fa - > bams/SAMPLE1.glf<br />
<br />
Take glf files and output vcf files.<br />
<br />
bin/polymutt -p polymutt.ped -d polymutt.dat -g polymutt.glfindex --vcf result/chr20.polymutt.vcf<br />
<br />
!--><br />
<br />
=== Genotype Refinement Using Linkage Disequilibrium Information (TrioCaller) ===<br />
<br />
The initial set of genotype calls is generated examining a single individual at a time. These calls are typically quite good for deep sequencing data, but much less accurate for low pass sequence data. In either case, they can be greatly improved by models that combine information across sites and individuals and consider the contraints imposed by parent-offspring trios. <br />
<br />
Note: The current version only supports SNP data, so please '''filter indels''' before running TrioCaller. It supports VCF 4.0 and 4.1 formats with the '''exception of dropped missing trailing fields''' (e.g. use complete missing notation ./.:.:.:.,.,. rather than ./. for the genotype field)<br />
<br />
Here is a summary of the TrioCaller command line options (these are also listed whenever you run the program with no parameters):<br />
<br />
<source lang="text"><br />
Available Options<br />
Shotgun Sequences: --vcf [], --pedfile [] <br />
Markov Sampler: --seed [], --burnin [], --rounds [] <br />
Haplotyper: --states [], --errorRate [], --compact<br />
Phasing: --randomPhase , --inputPhased, --refPhased<br />
Output Files: --prefix [], --phase, --interimInterval []<br />
<br />
<br />
Explanation of Options<br />
--vcf: Standard VCF file (4.0 and above).<br />
--pedfile: Pedigree file in MERLIN format.<br />
--seed: Seed for sampling; default 123456.<br />
--burnin: The number of rounds ignored at the beginning of sampling.<br />
--rounds: The total number of iterations.<br />
--states: The number of haplotypes used in the state space. The default is the maximum number: 2*(number of founders - 1).<br />
--errorRate: The pre-defined base error rate. Default 0.01.<br />
--randomPhase: The initial haplotypes are inferred from single-marker genotypes, with random phase. This is the default.<br />
--inputPhased: The initial haplotypes are taken directly from the input VCF file (with "|" as the separator in the genotype column).<br />
--refPhased: The initial haplotypes are built on the reference alleles from the VCF file.<br />
--prefix: The prefix of the output files.<br />
--interimInterval: The number of rounds between interim outputs.<br />
</source><br />
<br />
Note: The pedigree file requires complete trio structures (all three members of each trio must be present in the file). For a parent-offspring pair, create a "fake" parent in the pedigree file or code the two samples as unrelated individuals. The order of the names in the pedigree file does NOT need to be consistent with that in the .vcf file. The computation will be intensive if the number of samples is large; <br />
you can use the "--states" option to reduce the computational cost (e.g., start with "--states 50"). <br />
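For reference, a MERLIN-format pedigree file lists one individual per line: family ID, individual ID, father's ID, mother's ID, and sex (1 = male, 2 = female; 0 marks a missing/founder parent). The IDs below are invented, showing one complete trio and one unrelated sample; check the bundled ped/triocaller.ped for the exact layout expected by your version:<br />

```text
FAM1    DAD1    0       0       1
FAM1    MOM1    0       0       2
FAM1    KID1    DAD1    MOM1    1
FAM2    SAMP31  0       0       2
```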
<br />
To complete our example analysis, we could run:<br />
<br />
bin/TrioCaller --vcf result/chr20.mpileup.vcf --pedfile ped/triocaller.ped --states 50 --rounds 10 --prefix result/chr20.triocaller<br />
<br />
The format of the output file is the same as that of the input file. Again, you can review the contents of the updated VCF file using the more command:<br />
<br />
more result/chr20.triocaller.vcf<br />
<br />
All right. Congratulations! You have reached the end and learned the basic skills for accurate genotype calling in trios.<br />
<br />
If you have any questions or comments, feel free to contact Wei Chen at [mailto:weichen.mich@gmail.com weichen.mich@gmail.com] or Goncalo Abecasis at [mailto:goncalo@umich.edu goncalo@umich.edu]<br />
<br />
== Citation ==<br />
<br />
Chen W, Li B, Zeng Z, Sanna S, Sidore C, Busonero F, Kang HM, Li Y, Abecasis GR. Genotype calling and haplotyping in parent-offspring trios. Genome Res. 2013 Jan;23(1):142-51 [http://genome.cshlp.org/content/early/2012/11/26/gr.142455.112.long LINK]</div>
http://genome.sph.umich.edu/w/index.php?title=TrioCaller&diff=9090TrioCaller2013-12-04T16:26:56Z<p>Weich: /* Download */</p>
<hr />
<div>== Introduction ==<br />
<br />
We will illustrate how TrioCaller works on sequence data that include trios and unrelated samples. We will walk through all of the necessary steps to move from raw sequence data to called genotypes. <br />
If you are new to sequence data, please review every step. If you are experienced, you may jump directly to the [http://genome.sph.umich.edu/wiki/TrioCaller#Genotype_Refinement_Using_Linkage_Disequilibrium_Information_.28TrioCaller.29 TrioCaller]-specific section. <br />
<br />
We will start with a set of short sequence reads and associated base quality scores (stored in a fastq file), find the most likely genomic location for each read (producing a BAM file), generate an initial list of polymorphic sites and genotypes (stored in a VCF file) and use haplotype information to refine these genotypes (resulting in an updated VCF file).<br />
<br />
=== Polymutt ===<br />
<br />
If you are interested in ''de novo'' mutations or are working on families with deep sequence data, you should also consider our sister program, [http://genome.sph.umich.edu/wiki/Polymutt Polymutt], which ignores linkage disequilibrium information but can handle more complex pedigrees.<br />
<br />
=== Download ===<br />
<br />
'''A new version with major updates is coming soon. 12/01/2013''' <br />
<br />
Before downloading the program, we would appreciate it if you could email [mailto:weichen.mich@gmail.com weichen.mich@gmail.com] (Subject: TrioCaller), optionally with a little descriptive information (e.g., affiliation, depth, number of samples, and family structure). We will notify you if there is any update. <br />
<br />
<br> Binary file only: [http://www.sph.umich.edu/csg/weich/TrioCaller.06262012.binary.tgz TrioCaller.06262012.binary.tgz]. <br />
<br />
Binary file with example datasets&nbsp;: [http://www.sph.umich.edu/csg/weich/TrioCaller.06262012.tgz TrioCaller.06262012.tgz]. <br />
<br />
[http://genome.sph.umich.edu/wiki/TrioCaller:Archive Archive]. <br />
<br />
The example dataset demonstrated here is also included. Our dataset consists of 40 individuals: 10 parent-offspring trios plus 10 unrelated individuals. The average sequence depth is ~3x. README.txt describes the structure of the package. Pipeline.csh (C shell) and pipeline.bash (bash shell) are two scripts that let you run all of the commands listed here in batch. <br />
<br />
To conserve time and disk-space, our analysis will focus on a small region on chromosome 20 around position 2,000,000. We will first map reads for a single individual (labeled SAMPLE1). Then we combine the results with mapped reads from all individuals to generate a list of polymorphic sites and estimate accurate genotypes at each of these sites.<br />
<br />
=== Required Software ===<br />
<br />
In addition to TrioCaller, you will need BWA ([http://bio-bwa.sourceforge.net available from Sourceforge]) and samtools ([http://samtools.sourceforge.net also from Sourceforge]) installed to run this exercise. The examples were tested with bwa 0.6.1, samtools 0.1.18, and TrioCaller 0.1.1; we expect newer versions to work as well. We assume all executables are in your path.<br />
<br />
== Building an Index for Short Read Alignment ==<br />
<br />
To quickly place short reads along the genome, BWA and other read mappers typically build a word index for the genome. This index lists the location of particular short words along the genome and<br />
can be used to seed and then extend particular matches.<br />
<br />
The sequence index is typically not compatible across different BWA versions. To rebuild the sequence index, issue the following commands:<br />
<br />
<source lang="bash"><br />
#Remove any earlier reference files<br />
rm ref/human_g1k_v37_chr20.fa.*<br />
<br />
#Rebuild the reference<br />
bin/bwa index -a is ref/human_g1k_v37_chr20.fa<br />
</source><br />
<br />
== Mapping Reads to The Genome ==<br />
<br />
Next, we will use BWA to find the most likely sequence location for each read using the <code>bwa aln</code> command. This command requires two parameters, one corresponding to the reference genome, the other corresponding to a fastq file containing reads to be mapped. <br />
<br />
bin/bwa aln -q 15 ref/human_g1k_v37_chr20.fa fastq/SAMPLE1.fastq > bwa.sai/SAMPLE1.sai<br />
<br />
The file SAMPLE1.fastq contains DNA sequence reads for sample SAMPLE1. <br />
<br />
A fastq file consists of a series of multi-line records. Each record starts with a read name, followed by the DNA sequence, a separator line, and a set of per-base quality scores. Base quality scores estimate the probability of error at each sequenced base (a base quality of 10 denotes an error probability of 10%, base quality 20 denotes a 1% error probability, and base quality 30 denotes a 0.1% error probability). These error probabilities are each encoded in a single character for compactness and can be decoded using an [http://www.google.com/search?q=ascii+table ASCII table] (simply look up the ASCII code for each character and subtract 33 to get the base quality). By inspecting the fastq file you should be able to learn the length of the reads being mapped and their base qualities. For example, try to figure out whether base quality is typically higher at the start or the end of each read...<br />
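To see the subtract-33 rule in action, the sketch below decodes a made-up quality string (not taken from the example data) into numeric Phred scores:<br />

```shell
# Decode a Phred+33-encoded quality string: each character's ASCII code
# minus 33 is the numeric base quality. 'II5+' is an invented example.
qual='II5+'
printf '%s' "$qual" | od -An -tu1 \
  | awk '{for(i=1;i<=NF;i++) printf "%s%d",(i>1?" ":""),$i-33} END{print ""}'
# prints: 40 40 20 10
```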
<br />
=== Converting Alignments to BAM format ===<br />
<br />
The .sai alignment format is specific to BWA, so the first thing to do is to convert the alignment to a more standard format that will be compatible with downstream analysis tools. We can do this with a combination of the <code>bwa samse</code>, <code>samtools view</code>, and <code>samtools sort</code> commands.<br />
<br />
bin/bwa samse -r "@RG\tID:ILLUMINA\tSM:SAMPLE1" ref/human_g1k_v37_chr20.fa bwa.sai/SAMPLE1.sai fastq/SAMPLE1.fastq | \<br />
bin/samtools view -uhS - | bin/samtools sort -m 2000000000 - bams/SAMPLE1<br />
<br />
You can check the use of these parameters in the bwa manual. The resulting BAM file uses a compact binary format to represent the <br />
alignment of each short read to the genome. You can view the contents<br />
of the file using the <code>samtools view</code> command, like so:<br />
<br />
bin/samtools view bams/SAMPLE1.bam | more<br />
<br />
The text representation of the alignment produced by <code>samtools view</code> describes<br />
the alignment of one read per line. The most interesting fields are column 1 (the read<br />
name), columns 3 and 4 (the alignment position), column 6 (the CIGAR string, describing<br />
any gaps in the alignment), and columns 10 and 11 (the sequence and its quality scores). In this representation, all alignments are automatically converted to the forward strand.<br />
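To make the field layout concrete, here is a single invented alignment line in the SAM text format with those fields pulled out (real SAM lines are tab-separated; awk's default field splitting handles both):<br />

```shell
# A hypothetical SAM alignment line
# (fields: QNAME FLAG RNAME POS MAPQ CIGAR RNEXT PNEXT TLEN SEQ QUAL).
line='read1 0 20 2100000 60 36M * 0 0 ACGTACGT IIIIIIII'
echo "$line" | awk '{print "name="$1, "chr="$3, "pos="$4, "cigar="$6, "seq="$10, "qual="$11}'
# prints: name=read1 chr=20 pos=2100000 cigar=36M seq=ACGTACGT qual=IIIIIIII
```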
<br />
=== Indexing the BAM file ===<br />
<br />
<!--<br />
Although the current file contains all necessary information about reads and their<br />
genomic locations, it is missing some auxiliary information that BAM files typically<br />
contain to help describe their contents (for example, to specify that this file contains<br />
DNA sequence reads for sample NA20589). So, the very next step is to add this information<br />
to the file:<br />
<br />
samtools reheader bams/NA20589.header bams/noheader.NA20589.bam > bams/NA20589.bam<br />
<br />
!--><br />
<br />
If you have reached this far, rejoice! The mapping process is almost done. We will now create <br />
an index for the file, which makes it convenient to quickly extract reads from any <br />
genome location. We do this with the <code>samtools index</code> command, like so:<br />
<br />
bin/samtools index bams/SAMPLE1.bam<br />
<br />
=== Browsing Alignment Results ===<br />
<br />
You can now view the contents of the alignment at any location using the <code>samtools view</code><br />
and <code>samtools tview</code> commands. While <code>tview</code> generates prettier output,<br />
it is not compatible with all screens. For example, to view reads overlapping<br />
position 2,100,000 on chromosome 20, we could run:<br />
<br />
bin/samtools tview bams/SAMPLE1.bam ref/human_g1k_v37_chr20.fa<br />
<br />
Then, type "g 20:2100000" to jump to position 2,100,000 on chromosome 20.<br />
<br />
So let's recap: we have mapped reads to the genome, converted them from a BWA-specific format to a<br />
standard format used by many different programs, and sorted and indexed the results.<br />
In most cases, the next steps would be to remove duplicate reads and to ensure that base quality scores are properly calibrated. To save time, we'll skip those steps here.<br />
<br />
So far, we have only finished read mapping for one sample, SAMPLE1. We need to repeat this step for the other samples (SAMPLE2 - SAMPLE40). You can try something like:<br />
<br />
For the C shell:<br />
<br />
foreach file (`ls fastq/SAMPLE*.fastq | cut -f 2 -d '/' | cut -f 1 -d '.'`)<br />
echo $file<br />
bin/bwa aln -q 15 ref/human_g1k_v37_chr20.fa fastq/$file.fastq > bwa.sai/$file.sai<br />
bin/bwa samse -r "@RG\tID:ILLUMINA\tSM:$file" ref/human_g1k_v37_chr20.fa bwa.sai/$file.sai fastq/$file.fastq | \<br />
bin/samtools view -uhS - | bin/samtools sort -m 2000000000 - bams/$file<br />
bin/samtools index bams/$file.bam<br />
end<br />
<br />
For the bash shell:<br />
<br />
for file in `ls fastq/SAMPLE*.fastq | cut -f 2 -d '/' | cut -f 1 -d '.'`;<br />
do echo $file;<br />
bin/bwa aln -q 15 ref/human_g1k_v37_chr20.fa fastq/$file.fastq > bwa.sai/$file.sai;<br />
bin/bwa samse -r "@RG\tID:ILLUMINA\tSM:$file" ref/human_g1k_v37_chr20.fa bwa.sai/$file.sai fastq/$file.fastq | \<br />
bin/samtools view -uhS - | bin/samtools sort -m 2000000000 - bams/$file;<br />
bin/samtools index bams/$file.bam;<br />
done<br />
<br />
<br />
Once we finish the read-mapping step and generate BAM files for all samples, we can move on to variant calling and genotype inference.<br />
<br />
== Calling variants and Inferring genotypes ==<br />
<br />
=== Initial set of variant calls ===<br />
<br />
You probably thought the initial mapping process was quite convoluted ... you'll be glad to know that<br />
the next few steps are much simpler.<br />
<br />
The first thing we'll do is use samtools to generate an initial list of variant sites, using the <code>mpileup</code> command. This command looks at the bases aligned to each location and flags locations that are likely to vary. By default, the results are stored in a BCF file, which can be converted into the more widely used VCF format using bcftools (a companion set of tools distributed with samtools).<br />
<br />
<br />
bin/samtools mpileup -Iuf ref/human_g1k_v37_chr20.fa bams/SAMPLE*bam | bin/bcftools view -bvcg - > result/chr20.mpileup.bcf<br />
<br />
bin/bcftools view result/chr20.mpileup.bcf > result/chr20.mpileup.vcf<br />
<br />
<br />
The [http://www.1000genomes.org/node/101 VCF format] is a simple text format. It starts with several header lines, which all begin with the '#' character (meta-information lines use '##'), and is followed by a single line per marker that provides both summary information about the marker and genotypes for each individual. You can review the contents of the VCF file using the 'more' command:<br />
<br />
more result/chr20.mpileup.vcf<br />
<br />
Here are some questions for you to investigate:<br />
<br />
* How many variant sites were detected in this dataset? Try a command like this one:<br />
<br />
grep -vE ^# result/chr20.mpileup.vcf | wc -l<br />
<br />
(The grep command line excludes all lines beginning with # and then the wc command counts the number of lines in the file).<br />
<br />
<!-- <br />
* How many variant sites are estimated to be singletons?<br />
!--><br />
<br />
<!--<br />
<br />
Wei: I commented this out because generating GLF requires an old version of samtools that most people don't have. Goncalo<br />
<br />
=== Alternative way to call variants ===<br />
Bcftools assumes all samples are unrelated. The better way to call variants is to consider family constraints imposed by parent-offspring trio. An alternative approach for current step is to use<br />
another program [http://genome.sph.umich.edu/wiki/Polymutt Polymutt]. It consists of two steps, but the input [bam file] and output files [vcf file] are same as above. For more information, <br />
click the link. <br />
<br />
Convert bam files to glf files (It is a single line of command)<br />
<br />
bin/samtools view -bh bams/SAMPLE1.bam | bin/samtools calmd -Abr - ref/human_g1k_v37_chr20.fa 2 - | \<br />
bin/samtools mpileup -g -f ref/human_g1k_v37_chr20.fa - > bams/SAMPLE1.glf<br />
<br />
Take glf files and output vcf files.<br />
<br />
bin/polymutt -p polymutt.ped -d polymutt.dat -g polymutt.glfindex --vcf result/chr20.polymutt.vcf<br />
<br />
!--><br />
<br />
=== Genotype Refinement Using Linkage Disequilibrium Information (TrioCaller) ===<br />
<br />
The initial set of genotype calls is generated by examining a single individual at a time. These calls are typically quite good for deep sequencing data, but much less accurate for low-pass sequence data. In either case, they can be greatly improved by models that combine information across sites and individuals and consider the constraints imposed by parent-offspring trios. <br />
<br />
Note: The current version only supports SNP data, so please '''filter out indels''' before running TrioCaller. It supports the VCF 4.0 and 4.1 formats, '''except that missing trailing fields must not be dropped''' (e.g. use the complete missing notation ./.:.:.:.,.,. rather than ./. in the genotype column).<br />
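Since the mpileup output above may contain indels, here is one hedged sketch of such filtering (a simple awk filter over a toy file; dedicated tools such as vcftools can also do this). It keeps only records whose REF and ALT are single bases:<br />
<br />
```shell
# Toy VCF with one SNP and two indels (hypothetical data; columns are
# space-separated here, while real VCFs use tabs -- awk's default
# whitespace splitting handles both).
cat > toy.vcf <<'EOF'
##fileformat=VCFv4.1
#CHROM POS ID REF ALT QUAL FILTER INFO
20 2000123 . A G 50 PASS .
20 2000456 . CT C 40 PASS .
20 2000789 . G GAA 30 PASS .
EOF

# Keep header lines plus records where REF ($4) and ALT ($5) are one base long.
awk '/^#/ || (length($4) == 1 && length($5) == 1)' toy.vcf > snps.vcf

# Count the remaining variant records.
grep -vc '^#' snps.vcf    # → 1
```
<br />
(Note that this sketch would also drop multi-allelic SNPs such as ALT "G,T"; a production filter may need to handle those separately.)<br />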
<br />
Here is a summary of the TrioCaller command line options (these are also listed whenever you run the program with no parameters):<br />
<br />
<source lang="text"><br />
Available Options<br />
Shotgun Sequences: --vcf [], --pedfile [] <br />
Markov Sampler: --seed [], --burnin [], --rounds [] <br />
Haplotyper: --states [], --errorRate [], --compact<br />
Phasing: --randomPhase , --inputPhased, --refPhased<br />
Output Files: --prefix [], --phase, --interimInterval []<br />
<br />
<br />
Explanation of Options<br />
--vcf: Standard VCF file (4.0 and above). <br />
--pedfile: Pedigree file in MERLIN format.<br />
--seed: Seed for sampling, default 123456.<br />
--burnin: The number of rounds ignored at the beginning of sampling.<br />
--rounds: The total number of iterations.<br />
--states: The number of haplotypes used in the state space. The default is the maximum number: 2 * (number of founders - 1).<br />
--errorRate: The pre-defined base error rate. Default 0.01.<br />
--randomPhase: The initial haplotypes are inferred from single-marker genotypes. This is the default.<br />
--inputPhased: The initial haplotypes are taken directly from the input VCF file (with "|" as the separator in the genotype column).<br />
--refPhased: The initial haplotypes are built from the reference alleles in the VCF file.<br />
--prefix: The prefix of the output files.<br />
--interimInterval: The number of rounds between interim outputs.<br />
</source><br />
<br />
Note: The pedigree file requires complete trio structures (all three members of each trio must be present in the file). For a parent-offspring pair, create a "fake" parent in the pedigree file or code the two samples as unrelated individuals. The order of sample names in the pedigree file does NOT need to match the order in the .vcf file. The computation will be intensive if the number of samples is large. <br />
You can use the "--states" option to reduce the computational cost (e.g. start with "--states 50"). <br />
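For illustration, a MERLIN-format pedigree file lists family ID, individual ID, father, mother, and sex (with 0 for founders' parents). A minimal sketch (the exact contents of the tutorial's ped/triocaller.ped are not shown here, so the IDs below are made up) with one complete trio and one parent-offspring pair padded with a "fake" mother could look like:<br />
<br />
```shell
# Write a toy MERLIN-style ped file (hypothetical IDs).
# Columns: FAMILY  INDIVIDUAL  FATHER  MOTHER  SEX (1 = male, 2 = female)
cat > toy.ped <<'EOF'
FAM1 DAD1 0 0 1
FAM1 MOM1 0 0 2
FAM1 KID1 DAD1 MOM1 1
FAM2 DAD2 0 0 1
FAM2 FAKE_MOM2 0 0 2
FAM2 KID2 DAD2 FAKE_MOM2 2
EOF

# Sanity check: every line should have exactly 5 columns.
awk 'NF != 5 { bad++ } END { print bad + 0 }' toy.ped    # → 0
```
<br />
(FAM2 shows the "fake parent" trick: FAKE_MOM2 exists only to complete the trio structure for the DAD2/KID2 pair.)<br />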
<br />
To complete our example analysis, we could run:<br />
<br />
bin/TrioCaller --vcf result/chr20.mpileup.vcf --pedfile ped/triocaller.ped --states 50 --rounds 10 --prefix result/chr20.triocaller<br />
<br />
The format of the output file is the same as that of the input file. Again, you can review the contents of the updated VCF file using the more command:<br />
<br />
more result/chr20.triocaller.vcf<br />
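As a quick sanity check on the refined calls (a sketch over a single toy record, relying on the convention noted above that phased genotypes use "|" and unphased ones "/" as the separator), you could tally phased versus unphased genotype fields:<br />
<br />
```shell
# Toy VCF record (hypothetical values); fields from the 10th onward
# are per-sample genotypes.
line='20 2000123 . A G 50 PASS . GT 0|1 1|1 0/1'

# Count fields containing '|' (phased) versus '/' (unphased).
phased=0; unphased=0
for field in $line; do
  case $field in
    *\|*) phased=$((phased + 1)) ;;
    */*)  unphased=$((unphased + 1)) ;;
  esac
done
echo "$phased phased, $unphased unphased"    # → 2 phased, 1 unphased
```
<br />
In a fully refined TrioCaller output you would expect the unphased count at each site to be zero or close to it.<br />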
<br />
All right. Congratulations! You have come to the end and learned basic skills for accurate genotype calling in trios.<br />
<br />
If you have any questions or comments, feel free to contact Wei Chen at [mailto:weichen.mich@gmail.com weichen.mich@gmail.com] or Goncalo Abecasis at [mailto:goncalo@umich.edu goncalo@umich.edu]<br />
<br />
== Citation ==<br />
<br />
Chen W, Li B, Zeng Z, Sanna S, Sidore C, Busonero F, Kang HM, Li Y, Abecasis GR. Genotype calling and haplotyping in parent-offspring trios. Genome Res. 2013 Jan;23(1):142-51. [http://genome.cshlp.org/content/early/2012/11/26/gr.142455.112.long LINK]</div>
<hr />
<div>== Introduction ==<br />
<br />
We will illustrate how TrioCaller works in sequence data including trios and unrelated samples. We will walk through all necessary steps to move from raw sequence data to called genotypes. <br />
If you are new to sequence data, please review every step. If you are experienced, you may directly jump to [http://genome.sph.umich.edu/wiki/TrioCaller#Genotype_Refinement_Using_Linkage_Disequilibrium_Information_.28TrioCaller.29 TrioCaller] specific section. <br />
<br />
We will start with a set of short sequence reads and associated base quality scores (stored in a fastq file), find the most likely genomic location for each read (producing a BAM file), generate an initial list of polymorphic sites and genotypes (stored in a VCF file) and use haplotype information to refine these genotypes (resulting in an updated VCF file).<br />
<br />
=== Polymutt ===<br />
<br />
If you are interested in ''de novo'' mutations or are working on families with deep sequence data, you should also consider our sister program, [http://genome.sph.umich.edu/wiki/Polymutt Polymutt], which ignores linkage disequilibrium information but can handle more complex pedigrees.<br />
<br />
=== Download ===<br />
<br />
A new version with major updates is coming soon. <br />
<br />
Before downloading the program, we appreciate if you could email [mailto:weichen.mich@gmail.com weichen.mich@gmail.com]&nbsp;(Subject: TrioCaller,&nbsp;with/without a little descriptive information (e.g. Affiliation, depth, the number of samples and family structure). We will notify you if there is any update.&nbsp;<br />
<br />
<br> Binary file only: [http://www.sph.umich.edu/csg/weich/TrioCaller.06262012.binary.tgz TrioCaller.06262012.binary.tgz]. <br />
<br />
Binary file with example datasets&nbsp;: [http://www.sph.umich.edu/csg/weich/TrioCaller.06262012.tgz TrioCaller.06262012.tgz]. <br />
<br />
[http://genome.sph.umich.edu/wiki/TrioCaller:Archive Archive]. <br />
<br />
The example dataset demonstrated here is also included. Our dataset consists of 40 individuals, including 10 parent-offspring trios and 10 unrelated individuals. The average sequence depth is ~3x. README.txt describes the structure of the package. Pipeline.csh (C shell) and pipeline.bash (bash shell) are two scripts for you to run all commands listed here in batch. <br />
<br />
To conserve time and disk-space, our analysis will focus on a small region on chromosome 20 around position 2,000,000. We will first map reads for a single individual (labeled SAMPLE1). Then we combine the results with mapped reads from all individuals to generate a list of polymorphic sites and estimate accurate genotypes at each of these sites.<br />
<br />
=== Required Software ===<br />
<br />
In addition to TrioCaller, you will need BWA ([http://bio-bwa.sourceforge.net available from Sourceforge]) and samtools ([http://samtools.sourceforge.net also from Sourceforge]) installed to run this exercise. The examples are tested in in bwa 0.6.1, samtools 0.1.18, TrioCaller 0.1.1; we expect newer versions should also work. We assume all executables are in your path.<br />
<br />
== Building an Index for Short Read Alignment ==<br />
<br />
To quickly place short reads along the genome, BWA and other read mappers typically build a word index for the genome. This index lists the location of particular short words along the genome and<br />
can be used to seed and then extend particular matches.<br />
<br />
The sequence index is typically not compatible across different BWA versions. To rebuild the sequence index, issue the following commands:<br />
<br />
<source lang="bash"><br />
#Remove any earlier reference files<br />
rm ref/human_g1k_v37_chr20.fa.*<br />
<br />
#Rebuild the reference<br />
bin/bwa index -a is ref/human_g1k_v37_chr20.fa<br />
</source><br />
<br />
== Mapping Reads to The Genome ==<br />
<br />
Next, we will use BWA to find the most likely sequence location for each read using the <code>bwa aln</code> command. This command requires two parameters, one corresponding to the reference genome, the other corresponding to a fastq file containing reads to be mapped. <br />
<br />
bin/bwa aln -q 15 ref/human_g1k_v37_chr20.fa fastq/SAMPLE1.fastq > bwa.sai/SAMPLE1.sai<br />
<br />
The file SAMPLE1.fastq contains DNA sequence reads for sample SAMPLE1. <br />
<br />
A fastq file consists of a series of multi-line records. Each record starts with a read name, followed by a DNA sequencing, a separator line, and a set of per base quality scores. Base quality scores estimate the probability of error at each sequenced base (a base quality of 10 denotes an error probability of 10%, base quality 20 denotes 1% error probability and base quality 30 denotes 0.1% error probability). These error probabilities are each encoded in a single character for compactness and can be decoded using an [http://www.google.com/search?q=ascii+table ASCII table] (simply look up the ascii code for each base and subtract 33 to get base quality). By inspecting the FastQ file you should be able to learn about the length of reads being mapped and their base qualities. For example, try to figure out if base quality is typically higher at the start or end of each read...<br />
<br />
=== Converting Alignments to BAM format ===<br />
<br />
The .sai alignment format is specific to BWA, so the first thing to do is to convert the alignment to a more standard format that will be compatible with downstream analysis tools. We can do this with a combination of the <code>bwa samse</code> command and <code>samtools view</code> and <code>samtoosl sort</code> commands.<br />
<br />
bin/bwa samse -r "@RG\tID:ILLUMINA\tSM:SAMPLE1" ref/human_g1k_v37_chr20.fa bwa.sai/SAMPLE1.sai fastq/SAMPLE1.fastq | \<br />
bin/samtools view -uhS - | bin/samtools sort -m 2000000000 - bams/SAMPLE1<br />
<br />
You can check the use of parameters in the bwa manual. The result BAM file uses a compact binary format to represent the <br />
alignment of each short read to the genome. You can view the contents<br />
of the file using the <code>samtools view</code> command, like so:<br />
<br />
bin/samtools view bams/SAMPLE1.bam | more<br />
<br />
The text representation of the alignment produced by <code>samtools view</code> describes<br />
the alignment of one read per line. The most interesting fields are column 1 (the read <br />
name), columns 3 and 4 (the alignment position), column 5 (the CIGAR string, describing <br />
any gaps in the alignment), and columns 10 and 11 (with the sequence and quality score). In this representation, all alignments are automatically converted to the forward strand.<br />
<br />
=== Indexing the BAM file ===<br />
<br />
<!--<br />
Although the current file contains all necessary information about reads and their<br />
genomic locations, it is missing some auxiliary information that BAM files typically<br />
contain to help describe their contents (for example, to specify that this file contains<br />
DNA sequence reads for sample NA20589). So, the very next step is to add this information<br />
to the file:<br />
<br />
samtools reheader bams/NA20589.header bams/noheader.NA20589.bam > bams/NA20589.bam<br />
<br />
!--><br />
<br />
If you reached this far, rejoice! The mapping process is almost done. We will now create <br />
an index for the file, which makes it convenient to quickly extract reads from any <br />
genome location. We do this with the <code>samtools index</code> command, like so:<br />
<br />
bin/samtools index bams/SAMPLE1.bam<br />
<br />
=== Browsing Alignment Results ===<br />
<br />
You can now view the contents of the alignment at any location using the <code>samtools view</code><br />
and <code>samtools tview</code> commands. While the <code>tview</code> generates prettier output,<br />
it is not compatible with all screens. For example, to view reads overlapping <br />
starting at position 2,100,000 on chromosome 20, we could run:<br />
<br />
bin/samtools tview bams/SAMPLE1.bam ref/human_g1k_v37_chr20.fa<br />
<br />
Then, type "g 20:2100000"<br />
<br />
So let's recap: we have mapped reads to genome, converted them from a BWA specific format to a more <br />
commonly used format used by many different programs, sorted and indexed the results.<br />
In most cases, the next step would be to remove duplicate reads and to ensure that base quality scores are properly calibrated. To save time, we'll skip those steps now.<br />
<br />
Till now, we only finished read mapping for one sample SAMPLE1. We need to repeat this step for other samples (SAMPL2 - SAMPLE40). You can try something like:<br />
<br />
For c shell<br />
<br />
foreach file (`ls fastq/SAMPLE*.fastq | cut -f 2 -d '/' | cut -f 1 -d '.'`)<br />
echo $file<br />
bin/bwa aln -q 15 ref/human_g1k_v37_chr20.fa fastq/$file.fastq > bwa.sai/$file.sai<br />
bin/bwa samse -r "@RG\tID:ILLUMINA\tSM:$file" ref/human_g1k_v37_chr20.fa bwa.sai/$file.sai fastq/$file.fastq | \<br />
bin/samtools view -uhS - | bin/samtools sort -m 2000000000 - bams/$file<br />
bin/samtools index bams/$file.bam<br />
end<br />
<br />
For bash shell<br />
<br />
for file in `ls fastq/SAMPLE*.fastq | cut -f 2 -d '/' | cut -f 1 -d '.'`;<br />
do echo $file;<br />
bin/bwa aln -q 15 ref/human_g1k_v37_chr20.fa fastq/$file.fastq > bwa.sai/$file.sai;<br />
bin/bwa samse -r "@RG\tID:ILLUMINA\tSM:$file" ref/human_g1k_v37_chr20.fa bwa.sai/$file.sai fastq/$file.fastq | \<br />
bin/samtools view -uhS - | samtools sort -m 2000000000 - bams/$file;<br />
bin/samtools index bams/$file.bam;<br />
done<br />
<br />
<br />
Once we finish the read mapping step and generate bam files for all samples, we can step to variant calling and genotype inference.<br />
<br />
== Calling variants and Inferring genotypes ==<br />
<br />
=== Initial set of variant calls ===<br />
<br />
You probably thought the initial mapping process was quite convoluted ... you'll be glad to know that<br />
the next few steps are much simpler.<br />
<br />
The first thing we'll do is use samtools to generate an initial list of variant sites, using the <code>mpileup</code> command. This command looks at the bases aligned to each location and flags locations that are likely to vary. By default, the results are stored in BCF file, which can be converted into the more widely used VCF format using bcftools (a companion set of tools distributed with samtools).<br />
<br />
<br />
bin/samtools mpileup -Iuf ref/human_g1k_v37_chr20.fa bams/SAMPLE*bam | bin/bcftools view -bvcg - > result/chr20.mpileup.bcf<br />
<br />
bin/bcftools view result/chr20.mpileup.bcf > result/chr20.mpileup.vcf<br />
<br />
<br />
The [http://www.1000genomes.org/node/101 VCF format] is a simple text format. It starts with several header lines, which all start with the two '##' characters, and is followed by a single line per marker that provides both summary information about the marker and genotypes for each individual. You can review the contents of the VCF file using the 'more' command:<br />
<br />
more result/chr20.mpileup.vcf<br />
<br />
Here are some questions for you to investigate:<br />
<br />
* How many variant sites were detected in this dataset? Try a command like this one:<br />
<br />
grep -vE ^# result/chr20.mpileup.vcf | wc -l<br />
<br />
(The grep command line excludes all lines beginning with # and then the wc command counts the number of lines in the file).<br />
<br />
<!-- <br />
* How many variant sites are estimated to be singletons?<br />
!--><br />
<br />
<!--<br />
<br />
Wei: I commented this out because generating GLF requires an old version of samtools that most people don't have. Goncalo<br />
<br />
=== Alternative way to call variants ===<br />
Bcftools assumes all samples are unrelated. The better way to call variants is to consider family constraints imposed by parent-offspring trio. An alternative approach for current step is to use<br />
another program [http://genome.sph.umich.edu/wiki/Polymutt Polymutt]. It consists of two steps, but the input [bam file] and output files [vcf file] are same as above. For more information, <br />
click the link. <br />
<br />
Convert bam files to glf files (It is a single line of command)<br />
<br />
bin/samtools view -bh bams/SAMPLE1.bam | bin/samtools calmd -Abr - ref/human_g1k_v37_chr20.fa 2 - | \<br />
bin/samtools mpileup -g -f ref/human_g1k_v37_chr20.fa - > bams/SAMPLE1.glf<br />
<br />
Take glf files and output vcf files.<br />
<br />
bin/polymutt -p polymutt.ped -d polymutt.dat -g polymutt.glfindex --vcf result/chr20.polymutt.vcf<br />
<br />
!--><br />
<br />
=== Genotype Refinement Using Linkage Disequilibrium Information (TrioCaller) ===<br />
<br />
The initial set of genotype calls is generated by examining a single individual at a time. These calls are typically quite good for deep sequencing data, but much less accurate for low-pass sequence data. In either case, they can be greatly improved by models that combine information across sites and individuals and consider the constraints imposed by parent-offspring trios. <br />
<br />
Note: The current version only supports SNP data, so please '''filter indels''' before running TrioCaller. It supports the VCF 4.0 and 4.1 formats, except that '''missing trailing fields must not be dropped''' (e.g. use the complete missing notation ./.:.:.:.,.,. rather than ./. for the genotype field).<br />
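Because indels must be filtered out first, a simple pre-filter can be applied to the VCF before running TrioCaller. The snippet below is a minimal sketch under stated assumptions: the file paths and tiny example VCF are illustrative, the awk length test also drops multi-allelic records (e.g. ALT "A,G"), and the tutorial's <code>mpileup -I</code> call already skips indel calling, so this mainly matters for VCFs produced elsewhere.<br />

```shell
# Minimal indel-filter sketch: keep VCF header lines plus records whose
# REF (column 4) and ALT (column 5) are both a single base.
# Paths and the tiny example VCF below are illustrative only.
printf '##fileformat=VCFv4.1\n'               >  /tmp/example.vcf
printf '20\t2000123\t.\tA\tG\t30\tPASS\t.\n'  >> /tmp/example.vcf
printf '20\t2000456\t.\tAT\tA\t30\tPASS\t.\n' >> /tmp/example.vcf

# Keep headers and single-base REF/ALT records; the indel record is dropped.
awk -F'\t' '/^#/ || (length($4) == 1 && length($5) == 1)' \
    /tmp/example.vcf > /tmp/example.snps.vcf
```

In the tutorial layout, the same awk filter would read result/chr20.mpileup.vcf and write the SNP-only VCF passed to TrioCaller.<br />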
<br />
Here is a summary of the TrioCaller command line options (these are also listed whenever you run the program with no parameters):<br />
<br />
<source lang="text"><br />
Available Options<br />
Shotgun Sequences: --vcf [], --pedfile [] <br />
Markov Sampler: --seed [], --burnin [], --rounds [] <br />
Haplotyper: --states [], --errorRate [], --compact<br />
Phasing: --randomPhase , --inputPhased, --refPhased<br />
Output Files: --prefix [], --phase, --interimInterval []<br />
<br />
<br />
Explanation of Options<br />
--vcf: Standard VCF file (4.0 and above). <br />
--pedfile: Pedigree file in MERLIN format.<br />
--seed: Seed for sampling, default 123456.<br />
--burnin: The number of rounds ignored at the beginning of sampling.<br />
--rounds: The total number of iterations.<br />
--states: The number of haplotypes used in the state space. The default is the maximum number: 2*(number of founders - 1).<br />
--errorRate: The pre-defined base error rate. Default 0.01.<br />
--randomPhase: The initial haplotypes are inferred from single-marker genotypes. Default option.<br />
--inputPhased: The initial haplotypes are directly from input VCF file (with "|" as separator in the genotype column).<br />
--refPhased: The initial haplotypes are built on reference alleles from VCF file.<br />
--prefix: The prefix of the output files.<br />
--interimInterval: The number of rounds between interim outputs.<br />
</source><br />
<br />
Note: The pedigree file requires complete trio structures (all three members of each trio must be present in the file). For a parent-offspring pair, either create a "fake" parent in the pedigree file or code the two samples as unrelated individuals. The order of the names in the pedigree file does NOT need to match that in the .vcf file. The computation becomes intensive when the number of samples is large; <br />
you can use the "--states" option to reduce the computational cost (e.g. start with "--states 50"). <br />
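As an illustration of the expected structure, here is a hypothetical MERLIN-format pedigree with one complete trio and one unrelated individual, written out via a shell heredoc. All family and sample IDs are made up for this sketch; the five whitespace-delimited columns are family ID, individual ID, father ID, mother ID, and sex (1=male, 2=female), with 0 marking a missing parent.<br />

```shell
# Hypothetical pedigree sketch: one complete trio plus one unrelated
# sample listed as a founder in its own family. IDs are illustrative,
# not taken from the tutorial dataset.
cat > /tmp/triocaller.ped <<'EOF'
FAM1 SAMPLE1 0 0 1
FAM1 SAMPLE2 0 0 2
FAM1 SAMPLE3 SAMPLE1 SAMPLE2 2
FAM11 SAMPLE31 0 0 1
EOF
```

Unrelated individuals are simply founders with both parental IDs set to 0; the file is then passed to TrioCaller via the --pedfile option.<br />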
<br />
To complete our example analysis, we could run:<br />
<br />
bin/TrioCaller --vcf result/chr20.mpileup.vcf --pedfile ped/triocaller.ped --states 50 --rounds 10 --prefix result/chr20.triocaller<br />
<br />
The format of the output file is the same as that of the input file. Again, you can review the contents of the updated VCF file using the more command:<br />
<br />
more result/chr20.triocaller.vcf<br />
<br />
Congratulations! You have reached the end of the tutorial and learned the basic skills for accurate genotype calling in trios.<br />
<br />
If you have any questions or comments, feel free to contact Wei Chen at [mailto:weichen.mich@gmail.com weichen.mich@gmail.com] or Goncalo Abecasis at [mailto:goncalo@umich.edu goncalo@umich.edu].<br />
<br />
== Citation ==<br />
<br />
Chen W, Li B, Zeng Z, Sanna S, Sidore C, Busonero F, Kang HM, Li Y, Abecasis GR. Genotype calling and haplotyping in parent-offspring trios. Genome Res. 2013 Jan;23(1):142-51 [http://genome.cshlp.org/content/early/2012/11/26/gr.142455.112.long LINK]</div>
<hr />
<div>== Introduction ==<br />
<br />
We will illustrate how TrioCaller works in sequence data including trios and unrelated samples. We will start from the scratch and walk through all necessary steps from raw sequence data to called genotypes. If you are new to sequence data, please be patient to go through every step. If you are experienced, you may directly jump to the section of [http://genome.sph.umich.edu/wiki/TrioCaller#Genotype_Refinement_Using_Linkage_Disequilibrium_Information_.28TrioCaller.29 TrioCaller]. <br />
<br />
We will start with a set of short sequence reads and associated base quality scores (stored in a fastq file), find the most likely genomic location for each read (producing a BAM file), generate an initial list of polymorphic sites and genotypes (stored in a VCF file) and use haplotype information to refine these genotypes (resulting in an updated VCF file). <br />
<br />
== '''Note:''' if you are interesting in detecting '''de novo mutations''', or are working on '''a small number of families''' with '''high coverage data''' (e.g. exome sequencing), please first try our sister program [http://genome.sph.umich.edu/wiki/Polymutt Polymutt]&nbsp;. ==<br />
<br />
=== Download ===<br />
<br />
Before downloading the program, we appreciate if you could email [mailto:weichen.mich@gmail.com weichen.mich@gmail.com]&nbsp;(Subject: TrioCaller,&nbsp;with/without a little descriptive information (e.g. Affiliation, depth, the number of samples and family structure). We will notify you if there is any update.&nbsp;<br />
<br />
<br> Binary file only: [http://www.sph.umich.edu/csg/weich/TrioCaller.06262012.binary.tgz TrioCaller.06262012.binary.tgz]. <br />
<br />
Binary file with example datasets&nbsp;: [http://www.sph.umich.edu/csg/weich/TrioCaller.06262012.tgz TrioCaller.06262012.tgz]. <br />
<br />
[http://genome.sph.umich.edu/wiki/TrioCaller:Archive Archive]. <br />
<br />
The example dataset demonstrated here is also included. Our dataset consists of 40 individuals, including 10 parent-offspring trios and 10 unrelated individuals. The average sequence depth is ~3x. README.txt describes the structure of the package. Pipeline.csh (C shell) and pipeline.bash (bash shell) are two scripts for you to run all commands listed here in batch. <br />
<br />
To conserve time and disk-space, our analysis will focus on a small region on chromosome 20 around position 2,000,000. We will first map reads for a single individual (labeled SAMPLE1). Then we combine the results with mapped reads from all individuals to generate a list of polymorphic sites and estimate accurate genotypes at each of these sites.<br />
<br />
=== Required Software ===<br />
<br />
In addition to TrioCaller, you will need BWA ([http://bio-bwa.sourceforge.net available from Sourceforge]) and samtools ([http://samtools.sourceforge.net also from Sourceforge]) installed to run this exercise. The examples are tested in in bwa 0.6.1, samtools 0.1.18, TrioCaller 0.1.1; we expect newer versions should also work. We assume all executables are in your path.<br />
<br />
== Building an Index for Short Read Alignment ==<br />
<br />
To quickly place short reads along the genome, BWA and other read mappers typically build a word index for the genome. This index lists the location of particular short words along the genome and<br />
can be used to seed and then extend particular matches.<br />
<br />
The sequence index is typically not compatible across different BWA versions. To rebuild the sequence index, issue the following commands:<br />
<br />
<source lang="bash"><br />
#Remove any earlier reference files<br />
rm ref/human_g1k_v37_chr20.fa.*<br />
<br />
#Rebuild the reference<br />
bin/bwa index -a is ref/human_g1k_v37_chr20.fa<br />
</source><br />
<br />
== Mapping Reads to The Genome ==<br />
<br />
Next, we will use BWA to find the most likely sequence location for each read using the <code>bwa aln</code> command. This command requires two parameters, one corresponding to the reference genome, the other corresponding to a fastq file containing reads to be mapped. <br />
<br />
bin/bwa aln -q 15 ref/human_g1k_v37_chr20.fa fastq/SAMPLE1.fastq > bwa.sai/SAMPLE1.sai<br />
<br />
The file SAMPLE1.fastq contains DNA sequence reads for sample SAMPLE1. <br />
<br />
A fastq file consists of a series of multi-line records. Each record starts with a read name, followed by a DNA sequencing, a separator line, and a set of per base quality scores. Base quality scores estimate the probability of error at each sequenced base (a base quality of 10 denotes an error probability of 10%, base quality 20 denotes 1% error probability and base quality 30 denotes 0.1% error probability). These error probabilities are each encoded in a single character for compactness and can be decoded using an [http://www.google.com/search?q=ascii+table ASCII table] (simply look up the ascii code for each base and subtract 33 to get base quality). By inspecting the FastQ file you should be able to learn about the length of reads being mapped and their base qualities. For example, try to figure out if base quality is typically higher at the start or end of each read...<br />
<br />
=== Converting Alignments to BAM format ===<br />
<br />
The .sai alignment format is specific to BWA, so the first thing to do is to convert the alignment to a more standard format that will be compatible with downstream analysis tools. We can do this with a combination of the <code>bwa samse</code> command and <code>samtools view</code> and <code>samtoosl sort</code> commands.<br />
<br />
bin/bwa samse -r "@RG\tID:ILLUMINA\tSM:SAMPLE1" ref/human_g1k_v37_chr20.fa bwa.sai/SAMPLE1.sai fastq/SAMPLE1.fastq | \<br />
bin/samtools view -uhS - | bin/samtools sort -m 2000000000 - bams/SAMPLE1<br />
<br />
You can check the use of parameters in the bwa manual. The result BAM file uses a compact binary format to represent the <br />
alignment of each short read to the genome. You can view the contents<br />
of the file using the <code>samtools view</code> command, like so:<br />
<br />
bin/samtools view bams/SAMPLE1.bam | more<br />
<br />
The text representation of the alignment produced by <code>samtools view</code> describes<br />
the alignment of one read per line. The most interesting fields are column 1 (the read <br />
name), columns 3 and 4 (the alignment position), column 5 (the CIGAR string, describing <br />
any gaps in the alignment), and columns 10 and 11 (with the sequence and quality score). In this representation, all alignments are automatically converted to the forward strand.<br />
<br />
=== Indexing the BAM file ===<br />
<br />
<!--<br />
Although the current file contains all necessary information about reads and their<br />
genomic locations, it is missing some auxiliary information that BAM files typically<br />
contain to help describe their contents (for example, to specify that this file contains<br />
DNA sequence reads for sample NA20589). So, the very next step is to add this information<br />
to the file:<br />
<br />
samtools reheader bams/NA20589.header bams/noheader.NA20589.bam > bams/NA20589.bam<br />
<br />
!--><br />
<br />
If you reached this far, rejoice! The mapping process is almost done. We will now create <br />
an index for the file, which makes it convenient to quickly extract reads from any <br />
genome location. We do this with the <code>samtools index</code> command, like so:<br />
<br />
bin/samtools index bams/SAMPLE1.bam<br />
<br />
=== Browsing Alignment Results ===<br />
<br />
You can now view the contents of the alignment at any location using the <code>samtools view</code><br />
and <code>samtools tview</code> commands. While the <code>tview</code> generates prettier output,<br />
it is not compatible with all screens. For example, to view reads overlapping <br />
starting at position 2,100,000 on chromosome 20, we could run:<br />
<br />
bin/samtools tview bams/SAMPLE1.bam ref/human_g1k_v37_chr20.fa<br />
<br />
Then, type "g 20:2100000"<br />
<br />
So let's recap: we have mapped reads to genome, converted them from a BWA specific format to a more <br />
commonly used format used by many different programs, sorted and indexed the results.<br />
In most cases, the next step would be to remove duplicate reads and to ensure that base quality scores are properly calibrated. To save time, we'll skip those steps now.<br />
<br />
Till now, we only finished read mapping for one sample SAMPLE1. We need to repeat this step for other samples (SAMPL2 - SAMPLE40). You can try something like:<br />
<br />
For c shell<br />
<br />
foreach file (`ls fastq/SAMPLE*.fastq | cut -f 2 -d '/' | cut -f 1 -d '.'`)<br />
echo $file<br />
bin/bwa aln -q 15 ref/human_g1k_v37_chr20.fa fastq/$file.fastq > bwa.sai/$file.sai<br />
bin/bwa samse -r "@RG\tID:ILLUMINA\tSM:$file" ref/human_g1k_v37_chr20.fa bwa.sai/$file.sai fastq/$file.fastq | \<br />
bin/samtools view -uhS - | bin/samtools sort -m 2000000000 - bams/$file<br />
bin/samtools index bams/$file.bam<br />
end<br />
<br />
For bash shell<br />
<br />
for file in `ls fastq/SAMPLE*.fastq | cut -f 2 -d '/' | cut -f 1 -d '.'`;<br />
do echo $file;<br />
bin/bwa aln -q 15 ref/human_g1k_v37_chr20.fa fastq/$file.fastq > bwa.sai/$file.sai;<br />
bin/bwa samse -r "@RG\tID:ILLUMINA\tSM:$file" ref/human_g1k_v37_chr20.fa bwa.sai/$file.sai fastq/$file.fastq | \<br />
bin/samtools view -uhS - | samtools sort -m 2000000000 - bams/$file;<br />
bin/samtools index bams/$file.bam;<br />
done<br />
<br />
<br />
Once we finish the read mapping step and generate bam files for all samples, we can step to variant calling and genotype inference.<br />
<br />
== Calling variants and Inferring genotypes ==<br />
<br />
=== Initial set of variant calls ===<br />
<br />
You probably thought the initial mapping process was quite convoluted ... you'll be glad to know that<br />
the next few steps are much simpler.<br />
<br />
The first thing we'll do is use samtools to generate an initial list of variant sites, using the <code>mpileup</code> command. This command looks at the bases aligned to each location and flags locations that are likely to vary. By default, the results are stored in BCF file, which can be converted into the more widely used VCF format using bcftools (a companion set of tools distributed with samtools).<br />
<br />
<br />
bin/samtools mpileup -Iuf ref/human_g1k_v37_chr20.fa bams/SAMPLE*bam | bcftools view -bvcg - > result/chr20.mpileup.bcf<br />
<br />
bin/bcftools view result/chr20.mpileup.bcf > result/chr20.mpileup.vcf<br />
<br />
<br />
The [http://www.1000genomes.org/node/101 VCF format] is a simple text format. It starts with several header lines, which all start with the two '##' characters, and is followed by a single line per marker that provides both summary information about the marker and genotypes for each individual. You can review the contents of the VCF file using the 'more' command:<br />
<br />
more result/chr20.mpileup.vcf<br />
<br />
Here are some questions for you to investigate:<br />
<br />
* How many variant sites were detected in this dataset? Try a command like this one:<br />
<br />
grep -vE ^# result/chr20.mpileup.vcf | wc -l<br />
<br />
(The grep command excludes all lines beginning with #, and the wc command then counts the remaining lines.)<br />
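To see how this counting recipe behaves without the tutorial data, here is a self-contained sketch on a toy VCF; the records below are made up for illustration (the real file is result/chr20.mpileup.vcf, and real VCFs are tab-delimited, though grep and wc do not care about the field separator):<br />

```shell
# Toy VCF: two header lines followed by three variant records.
cat > /tmp/toy.vcf <<'EOF'
##fileformat=VCFv4.1
#CHROM POS ID REF ALT QUAL FILTER INFO
20 2000001 . A G 30 PASS .
20 2000005 . T C 25 PASS .
20 2000009 . G A 40 PASS .
EOF

# Same recipe as above: drop every line beginning with '#', count the rest.
grep -vE '^#' /tmp/toy.vcf | wc -l    # prints 3
```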
<br />
<!-- <br />
* How many variant sites are estimated to be singletons?<br />
!--><br />
<br />
<!--<br />
<br />
Wei: I commented this out because generating GLF requires an old version of samtools that most people don't have. Goncalo<br />
<br />
=== Alternative way to call variants ===<br />
Bcftools assumes all samples are unrelated. The better way to call variants is to consider family constraints imposed by parent-offspring trio. An alternative approach for current step is to use<br />
another program [http://genome.sph.umich.edu/wiki/Polymutt Polymutt]. It consists of two steps, but the input [bam file] and output files [vcf file] are same as above. For more information, <br />
click the link. <br />
<br />
Convert bam files to glf files (It is a single line of command)<br />
<br />
bin/samtools view -bh bams/SAMPLE1.bam | bin/samtools calmd -Abr - ref/human_g1k_v37_chr20.fa 2 - | \<br />
bin/samtools mpileup -g -f ref/human_g1k_v37_chr20.fa - > bams/SAMPLE1.glf<br />
<br />
Take glf files and output vcf files.<br />
<br />
bin/polymutt -p polymutt.ped -d polymutt.dat -g polymutt.glfindex --vcf result/chr20.polymutt.vcf<br />
<br />
!--><br />
<br />
=== Genotype Refinement Using Linkage Disequilibrium Information (TrioCaller) ===<br />
<br />
The initial set of genotype calls is generated by examining a single individual at a time. These calls are typically quite good for deep sequencing data, but much less accurate for low-pass sequence data. In either case, they can be greatly improved by models that combine information across sites and individuals and consider the constraints imposed by parent-offspring trios. <br />
<br />
Note: The current version only supports SNP data, so please '''filter indels''' before running TrioCaller. It supports the VCF 4.0 and 4.1 formats, with the exception that '''missing trailing fields must not be dropped''' (e.g. use the complete missing notation ./.:.:.:.,.,. rather than ./. in the genotype field).<br />
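One generic way to drop indel records while keeping header lines is an awk one-liner. This is a sketch, not a TrioCaller feature: it keeps only records whose REF and ALT are both single bases, and multi-allelic ALT fields (e.g. "A,G") would need extra handling. The toy input below is made up for illustration:<br />

```shell
# Toy input mixing one SNP and two indels (illustrative records only).
cat > /tmp/mixed.vcf <<'EOF'
##fileformat=VCFv4.1
#CHROM POS ID REF ALT QUAL FILTER INFO
20 2000001 . A G 30 PASS .
20 2000002 . AT A 30 PASS .
20 2000003 . C CTT 30 PASS .
EOF

# Keep header lines plus records whose REF ($4) and ALT ($5) are both
# single bases; multi-base REF or ALT (insertions/deletions) are dropped.
awk '/^#/ || (length($4) == 1 && length($5) == 1)' /tmp/mixed.vcf > /tmp/snps.vcf

grep -cv '^#' /tmp/snps.vcf    # prints 1: only the A->G SNP survives
```

On the tutorial data, the same filter would be applied to result/chr20.mpileup.vcf before the TrioCaller step.<br />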
<br />
Here is a summary of the TrioCaller command line options (these are also listed whenever you run the program with no parameters):<br />
<br />
<source lang="text"><br />
Available Options<br />
Shotgun Sequences: --vcf [], --pedfile [] <br />
Markov Sampler: --seed [], --burnin [], --rounds [] <br />
Haplotyper: --states [], --errorRate [], --compact<br />
Phasing: --randomPhase , --inputPhased, --refPhased<br />
Output Files: --prefix [], --phase, --interimInterval []<br />
<br />
<br />
Explanation of Options<br />
--vcf: Standard VCF file (4.0 and above). <br />
--pedfile: Pedigree file in MERLIN format.<br />
--seed: Seed for sampling, default 123456.<br />
--burnin: The number of rounds ignored at the beginning of sampling.<br />
--rounds: The total number of iterations.<br />
--states: The number of haplotypes used in the state space. The default is the maximum number: 2*(number of founders -1).<br />
--errorRate: The pre-defined base error rate. Default 0.01.<br />
--randomPhase: The initial haplotypes are inferred marker by marker with random phase. Default option.<br />
--inputPhased: The initial haplotypes are directly from input VCF file (with "|" as separator in the genotype column).<br />
--refPhased: The initial haplotypes are built on reference alleles from VCF file.<br />
--prefix: The prefix of the output files.<br />
--interimInterval: The number of rounds between interim outputs.<br />
</source><br />
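As a worked example of the default --states value, consider the tutorial dataset: 10 trios contribute 20 parents, and the 10 unrelated samples are also founders, giving 30 founders (this count is our reading of the dataset description, not stated by the program):<br />

```shell
# 10 trios -> 20 parents; plus 10 unrelated samples, all founders.
founders=$((10 * 2 + 10))          # 30 founders
# Default state-space size per the option summary: 2*(number of founders - 1).
states=$((2 * (founders - 1)))
echo "$states"                     # prints 58
```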
<br />
Note: The pedigree file requires complete trio structures (all three members of each trio must exist in the file). For a parent-offspring pair, create a "fake" parent in the pedigree file or code the two samples as unrelated individuals. The order of the names in the pedigree file does NOT need to match that in the .vcf file. The computation will be intensive if the number of samples is large; <br />
you can use the option "--states" to reduce the computational cost (e.g. start with "--states 50").<br />
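The MERLIN pedigree layout can be sketched as follows. This is a hypothetical file: the sample names and family numbers are placeholders matching the tutorial's SAMPLE* naming, and your TrioCaller version may expect additional phenotype columns per MERLIN's .dat conventions:<br />

```shell
# Hypothetical MERLIN-style pedigree: columns are family ID, person ID,
# father ID, mother ID, and sex (1 = male, 2 = female); founders use 0
# for both parents. Family 1 is a complete trio, family 2 an unrelated sample.
cat > /tmp/example.ped <<'EOF'
1 SAMPLE1 0 0 1
1 SAMPLE2 0 0 2
1 SAMPLE3 SAMPLE1 SAMPLE2 1
2 SAMPLE31 0 0 2
EOF

# Sanity check: exactly one offspring row names both parents,
# so the trio in family 1 is complete.
grep -c 'SAMPLE1 SAMPLE2' /tmp/example.ped    # prints 1
```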
<br />
To complete our example analysis, we could run:<br />
<br />
bin/TrioCaller --vcf result/chr20.mpileup.vcf --pedfile ped/triocaller.ped --states 50 --rounds 10 --prefix result/chr20.triocaller<br />
<br />
The format of the output file is the same as that of the input file. Again, you can review the contents of the updated VCF file using the more command:<br />
<br />
more result/chr20.triocaller.vcf<br />
<br />
All right. Congratulations! You have reached the end and learned the basic skills for accurate genotype calling in trios.<br />
<br />
If you have any questions or comments, feel free to contact Wei Chen at [mailto:weichen.mich@gmail.com weichen.mich@gmail.com] or Goncalo Abecasis at [mailto:goncalo@umich.edu goncalo@umich.edu]<br />
<br />
== Citation ==<br />
<br />
Chen W, Li B, Zeng Z, Sanna S, Sidore C, Busonero F, Kang HM, Li Y, Abecasis GR. Genotype calling and haplotyping in parent-offspring trios. Genome Res. 2013 Jan;23(1):142-51 [http://genome.cshlp.org/content/early/2012/11/26/gr.142455.112.long LINK]</div>
<hr />
<div>== Introduction ==<br />
<br />
We will illustrate how TrioCaller works in sequence data including trios and unrelated samples. We will start from the scratch and walk through all necessary steps from raw sequence data to called genotypes. If you are new to sequence data, please be patient to go through every step. If you are experienced, you may directly jump to the section of [http://genome.sph.umich.edu/wiki/TrioCaller#Genotype_Refinement_Using_Linkage_Disequilibrium_Information_.28TrioCaller.29 TrioCaller]. <br />
<br />
We will start with a set of short sequence reads and associated base quality scores (stored in a fastq file), find the most likely genomic location for each read (producing a BAM file), generate an initial list of polymorphic sites and genotypes (stored in a VCF file) and use haplotype information to refine these genotypes (resulting in an updated VCF file). <br />
<br />
== '''Note:''' if you are interesting in detecting '''de novo mutations''', or are working on '''a small number of families''' with '''high coverage data''' (e.g. exome sequencing), please first try our sister program [http://genome.sph.umich.edu/wiki/Polymutt Polymutt]&nbsp;. ==<br />
<br />
=== Download ===<br />
<br />
Before downloading the program, we appreciate if you could email [mailto:weichen.mich@gmail.com weichen.mich@gmail.com]&nbsp;(Subject: TrioCaller,&nbsp;with/without a little descriptive information (e.g. Affiliation, depth, the number of samples and family structure). We will notify you if there is any update.&nbsp;<br />
<br />
<br> Binary file only: [http://www.sph.umich.edu/csg/weich/TrioCaller.06262012.binary.tgz TrioCaller.06262012.binary.tgz]. <br />
<br />
Binary file with example datasets&nbsp;: [http://www.sph.umich.edu/csg/weich/TrioCaller.06262012.tgz TrioCaller.06262012.tgz]. <br />
<br />
[http://genome.sph.umich.edu/wiki/TrioCaller:Archive Archive]. <br />
<br />
The example dataset demonstrated here is also included. Our dataset consists of 40 individuals, including 10 parent-offspring trios and 10 unrelated individuals. The average sequence depth is ~3x. README.txt describes the structure of the package. Pipeline.csh (C shell) and pipeline.bash (bash shell) are two scripts for you to run all commands listed here in batch. <br />
<br />
To conserve time and disk-space, our analysis will focus on a small region on chromosome 20 around position 2,000,000. We will first map reads for a single individual (labeled SAMPLE1). Then we combine the results with mapped reads from all individuals to generate a list of polymorphic sites and estimate accurate genotypes at each of these sites.<br />
<br />
=== Required Software ===<br />
<br />
In addition to TrioCaller, you will need BWA ([http://bio-bwa.sourceforge.net available from Sourceforge]) and samtools ([http://samtools.sourceforge.net also from Sourceforge]) installed to run this exercise. The examples are tested in in bwa 0.6.1, samtools 0.1.18, TrioCaller 0.1.1; we expect newer versions should also work. We assume all executables are in your path.<br />
<br />
== Building an Index for Short Read Alignment ==<br />
<br />
To quickly place short reads along the genome, BWA and other read mappers typically build a word index for the genome. This index lists the location of particular short words along the genome and<br />
can be used to seed and then extend particular matches.<br />
<br />
The sequence index is typically not compatible across different BWA versions. To rebuild the sequence index, issue the following commands:<br />
<br />
<source lang="bash"><br />
#Remove any earlier reference files<br />
rm ref/human_g1k_v37_chr20.fa.*<br />
<br />
#Rebuild the reference<br />
bwa index -a is ref/human_g1k_v37_chr20.fa<br />
</source><br />
<br />
== Mapping Reads to The Genome ==<br />
<br />
Next, we will use BWA to find the most likely sequence location for each read using the <code>bwa aln</code> command. This command requires two parameters, one corresponding to the reference genome, the other corresponding to a fastq file containing reads to be mapped. <br />
<br />
bwa aln -q 15 ref/human_g1k_v37_chr20.fa fastq/SAMPLE1.fastq > bwa.sai/SAMPLE1.sai<br />
<br />
The file SAMPLE1.fastq contains DNA sequence reads for sample SAMPLE1. <br />
<br />
A fastq file consists of a series of multi-line records. Each record starts with a read name, followed by a DNA sequencing, a separator line, and a set of per base quality scores. Base quality scores estimate the probability of error at each sequenced base (a base quality of 10 denotes an error probability of 10%, base quality 20 denotes 1% error probability and base quality 30 denotes 0.1% error probability). These error probabilities are each encoded in a single character for compactness and can be decoded using an [http://www.google.com/search?q=ascii+table ASCII table] (simply look up the ascii code for each base and subtract 33 to get base quality). By inspecting the FastQ file you should be able to learn about the length of reads being mapped and their base qualities. For example, try to figure out if base quality is typically higher at the start or end of each read...<br />
<br />
=== Converting Alignments to BAM format ===<br />
<br />
The .sai alignment format is specific to BWA, so the first thing to do is to convert the alignment to a more standard format that will be compatible with downstream analysis tools. We can do this with a combination of the <code>bwa samse</code> command and <code>samtools view</code> and <code>samtoosl sort</code> commands.<br />
<br />
bwa samse -r "@RG\tID:ILLUMINA\tSM:SAMPLE1" ref/human_g1k_v37_chr20.fa bwa.sai/SAMPLE1.sai fastq/SAMPLE1.fastq | \<br />
samtools view -uhS - | samtools sort -m 2000000000 - bams/SAMPLE1<br />
<br />
You can check the use of parameters in the bwa manual. The result BAM file uses a compact binary format to represent the <br />
alignment of each short read to the genome. You can view the contents<br />
of the file using the <code>samtools view</code> command, like so:<br />
<br />
samtools view bams/SAMPLE1.bam | more<br />
<br />
The text representation of the alignment produced by <code>samtools view</code> describes<br />
the alignment of one read per line. The most interesting fields are column 1 (the read <br />
name), columns 3 and 4 (the alignment position), column 5 (the CIGAR string, describing <br />
any gaps in the alignment), and columns 10 and 11 (with the sequence and quality score). In this representation, all alignments are automatically converted to the forward strand.<br />
<br />
=== Indexing the BAM file ===<br />
<br />
<!--<br />
Although the current file contains all necessary information about reads and their<br />
genomic locations, it is missing some auxiliary information that BAM files typically<br />
contain to help describe their contents (for example, to specify that this file contains<br />
DNA sequence reads for sample NA20589). So, the very next step is to add this information<br />
to the file:<br />
<br />
samtools reheader bams/NA20589.header bams/noheader.NA20589.bam > bams/NA20589.bam<br />
<br />
!--><br />
<br />
If you reached this far, rejoice! The mapping process is almost done. We will now create <br />
an index for the file, which makes it convenient to quickly extract reads from any <br />
genome location. We do this with the <code>samtools index</code> command, like so:<br />
<br />
samtools index bams/SAMPLE1.bam<br />
<br />
=== Browsing Alignment Results ===<br />
<br />
You can now view the contents of the alignment at any location using the <code>samtools view</code><br />
and <code>samtools tview</code> commands. While the <code>tview</code> generates prettier output,<br />
it is not compatible with all screens. For example, to view reads overlapping <br />
starting at position 2,100,000 on chromosome 20, we could run:<br />
<br />
samtools tview bams/SAMPLE1.bam ref/human_g1k_v37_chr20.fa<br />
<br />
Then, type "g 20:2100000"<br />
<br />
So let's recap: we have mapped reads to genome, converted them from a BWA specific format to a more <br />
commonly used format used by many different programs, sorted and indexed the results.<br />
In most cases, the next step would be to remove duplicate reads and to ensure that base quality scores are properly calibrated. To save time, we'll skip those steps now.<br />
<br />
Till now, we only finished read mapping for one sample SAMPLE1. We need to repeat this step for other samples (SAMPL2 - SAMPLE40). You can try something like:<br />
<br />
For c shell<br />
<br />
foreach file (`ls fastq/SAMPLE*.fastq | cut -f 2 -d '/' | cut -f 1 -d '.'`)<br />
echo $file<br />
bwa aln -q 15 ref/human_g1k_v37_chr20.fa fastq/$file.fastq > bwa.sai/$file.sai<br />
bwa samse -r "@RG\tID:ILLUMINA\tSM:$file" ref/human_g1k_v37_chr20.fa bwa.sai/$file.sai fastq/$file.fastq | \<br />
samtools view -uhS - | bin/samtools sort -m 2000000000 - bams/$file<br />
samtools index bams/$file.bam<br />
end<br />
<br />
For bash shell<br />
<br />
for file in `ls fastq/SAMPLE*.fastq | cut -f 2 -d '/' | cut -f 1 -d '.'`;<br />
do echo $file;<br />
bwa aln -q 15 ref/human_g1k_v37_chr20.fa fastq/$file.fastq > bwa.sai/$file.sai;<br />
bwa samse -r "@RG\tID:ILLUMINA\tSM:$file" ref/human_g1k_v37_chr20.fa bwa.sai/$file.sai fastq/$file.fastq | \<br />
samtools view -uhS - | samtools sort -m 2000000000 - bams/$file;<br />
samtools index bams/$file.bam;<br />
done<br />
<br />
<br />
Once we finish the read mapping step and generate bam files for all samples, we can step to variant calling and genotype inference.<br />
<br />
== Calling variants and Inferring genotypes ==<br />
<br />
=== Initial set of variant calls ===<br />
<br />
You probably thought the initial mapping process was quite convoluted ... you'll be glad to know that<br />
the next few steps are much simpler.<br />
<br />
The first thing we'll do is use samtools to generate an initial list of variant sites, using the <code>mpileup</code> command. This command looks at the bases aligned to each location and flags locations that are likely to vary. By default, the results are stored in BCF file, which can be converted into the more widely used VCF format using bcftools (a companion set of tools distributed with samtools).<br />
<br />
<br />
samtools mpileup -Iuf ref/human_g1k_v37_chr20.fa bams/SAMPLE*bam | bcftools view -bvcg - > result/chr20.mpileup.bcf<br />
<br />
bcftools view result/chr20.mpileup.bcf > result/chr20.mpileup.vcf<br />
<br />
<br />
The [http://www.1000genomes.org/node/101 VCF format] is a simple text format. It starts with several meta-information lines beginning with '##', followed by a single column header line beginning with '#', and then one line per marker that provides both summary information about the marker and genotypes for each individual. You can review the contents of the VCF file using the 'more' command:<br />
<br />
more result/chr20.mpileup.vcf<br />
<br />
Here are some questions for you to investigate:<br />
<br />
* How many variant sites were detected in this dataset? Try a command like this one:<br />
<br />
grep -vE ^# result/chr20.mpileup.vcf | wc -l<br />
<br />
(The grep command excludes all lines beginning with #, and wc then counts the remaining lines.)<br />
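Similar one-liners can answer related questions. For example, to count the variant sites called with at least some quality (QUAL is column 6 of a VCF), a small awk helper works; the function name and threshold below are purely illustrative:<br />
<br />
```shell
# count_min_qual: count non-header VCF records (read on stdin) whose
# QUAL column (field 6) is numeric and at least the given threshold.
count_min_qual() {
  awk -F'\t' -v min="$1" \
    '!/^#/ && $6 != "." && $6 + 0 >= min { n++ } END { print n + 0 }'
}

# e.g.: count_min_qual 20 < result/chr20.mpileup.vcf
```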
<br />
<!-- <br />
* How many variant sites are estimated to be singletons?<br />
!--><br />
<br />
<!--<br />
<br />
Wei: I commented this out because generating GLF requires an old version of samtools that most people don't have. Goncalo<br />
<br />
=== Alternative way to call variants ===<br />
Bcftools assumes all samples are unrelated. The better way to call variants is to consider family constraints imposed by parent-offspring trio. An alternative approach for current step is to use<br />
another program [http://genome.sph.umich.edu/wiki/Polymutt Polymutt]. It consists of two steps, but the input [bam file] and output files [vcf file] are same as above. For more information, <br />
click the link. <br />
<br />
Convert bam files to glf files (It is a single line of command)<br />
<br />
bin/samtools view -bh bams/SAMPLE1.bam | bin/samtools calmd -Abr - ref/human_g1k_v37_chr20.fa 2 - | \<br />
bin/samtools mpileup -g -f ref/human_g1k_v37_chr20.fa - > bams/SAMPLE1.glf<br />
<br />
Take glf files and output vcf files.<br />
<br />
bin/polymutt -p polymutt.ped -d polymutt.dat -g polymutt.glfindex --vcf result/chr20.polymutt.vcf<br />
<br />
!--><br />
<br />
=== Genotype Refinement Using Linkage Disequilibrium Information (TrioCaller) ===<br />
<br />
The initial set of genotype calls is generated by examining a single individual at a time. These calls are typically quite good for deep sequencing data, but much less accurate for low-pass sequence data. In either case, they can be greatly improved by models that combine information across sites and individuals and consider the constraints imposed by parent-offspring trios. <br />
<br />
Note: The current version only supports SNP data, so please '''filter indels''' before running TrioCaller. It supports the VCF 4.0 and 4.1 formats with the '''exception of dropped missing trailing fields''' (e.g. use the complete missing notation ./.:.:.:.,.,. rather than ./. for the genotype field).<br />
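Since the indel filtering is left to the user, one minimal way to do it is with awk, keeping header lines plus records whose REF and ALT alleles are both single bases. This helper is a sketch of ours, not part of TrioCaller; adapt the filenames to your own data:<br />
<br />
```shell
# filter_indels: read a VCF on stdin and keep header lines plus
# biallelic single-base (SNP) records; indels are dropped.
filter_indels() {
  awk -F'\t' '/^#/ { print; next }
              length($4) == 1 && length($5) == 1 { print }'
}

# e.g.: filter_indels < result/chr20.mpileup.vcf > result/chr20.mpileup.snps.vcf
```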
<br />
Here is a summary of the TrioCaller command line options (these are also listed whenever you run the program with no parameters):<br />
<br />
<source lang="text"><br />
Available Options<br />
Shotgun Sequences: --vcf [], --pedfile [] <br />
Markov Sampler: --seed [], --burnin [], --rounds [] <br />
Haplotyper: --states [], --errorRate [], --compact<br />
Phasing: --randomPhase , --inputPhased, --refPhased<br />
Output Files: --prefix [], --phase, --interimInterval []<br />
<br />
<br />
Explanation of Options<br />
--vcf: Standard VCF file (4.0 and above). <br />
--pedfile: Pedigree file in MERLIN format.<br />
--seed: Seed for sampling, default 123456.<br />
--burnin: The number of rounds ignored at the beginning of sampling.<br />
--rounds: The total number of iterations.<br />
--states: The number of haplotypes used in the state space. The default is the maximum number: 2*(number of founders -1).<br />
--errorRate: The pre-defined base error rate. Default 0.01.<br />
--randomPhase: The initial haplotypes are inferred from the single marker. Default option.<br />
--inputPhased: The initial haplotypes are directly from input VCF file (with "|" as separator in the genotype column).<br />
--refPhased: The initial haplotypes are built on reference alleles from VCF file.<br />
--prefix: The prefix of output file <br />
--interimInterval: The number of rounds for interim outputs<br />
</source><br />
<br />
Note: The pedigree file requires complete trio structures (all three members of each trio must be present in the file). For a parent-offspring pair, either create a "fake" parent in the pedigree file or code the pair as unrelated individuals. The order of the names in the pedigree file does NOT need to match the order in the .vcf file. The computation can be intensive when the number of samples is large; <br />
you can use the option "--states" to reduce the computational cost (e.g. start with "--states 50"). <br />
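For reference, a MERLIN-format pedigree file lists one individual per line: family ID, individual ID, father's ID, mother's ID and sex (1 = male, 2 = female), with 0 in the parent columns for founders. The family and sample IDs below are purely illustrative, and depending on your accompanying .dat file further phenotype columns may follow:<br />
<br />
```text
FAM1    SAMPLE1    0          0          1
FAM1    SAMPLE2    0          0          2
FAM1    SAMPLE3    SAMPLE1    SAMPLE2    2
FAM11   SAMPLE31   0          0          1
```
Here SAMPLE3 is the offspring of SAMPLE1 and SAMPLE2, while SAMPLE31 is an unrelated individual in its own family.<br />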
<br />
To complete our example analysis, we could run:<br />
<br />
TrioCaller --vcf result/chr20.mpileup.vcf --pedfile ped/triocaller.ped --states 50 --rounds 10 --prefix result/chr20.triocaller<br />
<br />
The format of the output file is the same as that of the input file. Again, you can review the contents of the updated VCF file using the more command:<br />
<br />
more result/chr20.triocaller.vcf<br />
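As a quick sanity check on the refined calls, you can tally how many genotype entries came out phased (separated by "|") versus unphased (separated by "/"). The helper below is our own sketch and reads a VCF on stdin:<br />
<br />
```shell
# phase_summary: count phased ("|") vs unphased ("/") genotype calls
# across the sample columns (fields 10 and up) of a VCF read on stdin.
phase_summary() {
  awk -F'\t' '!/^#/ { for (i = 10; i <= NF; i++) {
                        split($i, g, ":")
                        if (g[1] ~ /\|/) p++
                        else if (g[1] ~ /\//) u++ } }
              END { printf "phased=%d unphased=%d\n", p + 0, u + 0 }'
}

# e.g.: phase_summary < result/chr20.triocaller.vcf
```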
<br />
All right, congratulations! You have reached the end and learned the basic skills for accurate genotype calling in trios.<br />
<br />
If you have any questions or comments, feel free to contact Wei Chen at [mailto:weichen.mich@gmail.com weichen.mich@gmail.com] or Goncalo Abecasis at [mailto:goncalo@umich.edu goncalo@umich.edu].<br />
<br />
== Citation ==<br />
<br />
Chen W, Li B, Zeng Z, Sanna S, Sidore C, Busonero F, Kang HM, Li Y, Abecasis GR. &nbsp;Genotype calling and haplotyping in parent-offspring trios. Genome Res.&nbsp;2013 Jan;23(1):142-51&nbsp;[http://genome.cshlp.org/content/early/2012/11/26/gr.142455.112.long LINK]</div>
<hr />
<div>== Introduction ==<br />
<br />
We will illustrate how TrioCaller works in sequence data including trios and unrelated samples. We will start from the scratch and walk through all necessary steps from raw sequence data to called genotypes. If you are new to sequence data, please be patient to go through every step. If you are experienced, you may directly jump to the section of [http://genome.sph.umich.edu/wiki/TrioCaller#Genotype_Refinement_Using_Linkage_Disequilibrium_Information_.28TrioCaller.29 TrioCaller]. <br />
<br />
We will start with a set of short sequence reads and associated base quality scores (stored in a fastq file), find the most likely genomic location for each read (producing a BAM file), generate an initial list of polymorphic sites and genotypes (stored in a VCF file) and use haplotype information to refine these genotypes (resulting in an updated VCF file). <br />
<br />
== '''Note:''' if you are interesting in detecting '''de novo mutations''', or are working on '''a small number of families''' with '''high coverage data''' (e.g. exome sequencing), please first try our sister program [http://genome.sph.umich.edu/wiki/Polymutt Polymutt]&nbsp;. ==<br />
<br />
=== Download ===<br />
<br />
Before downloading the program, we appreciate if you could email [mailto:weichen.mich@gmail.com weichen.mich@gmail.com] with a little descriptive information (e.g. Affiliation, depth, the number of samples and family structure). <br />
<br />
<br> Binary file only: [http://www.sph.umich.edu/csg/weich/TrioCaller.06262012.binary.tgz TrioCaller.06262012.binary.tgz]. <br />
<br />
Binary file with example datasets&nbsp;: [http://www.sph.umich.edu/csg/weich/TrioCaller.06262012.tgz TrioCaller.06262012.tgz]. <br />
<br />
[http://genome.sph.umich.edu/wiki/TrioCaller:Archive Archive]. <br />
<br />
The example dataset demonstrated here is also included. Our dataset consists of 40 individuals, including 10 parent-offspring trios and 10 unrelated individuals. The average sequence depth is ~3x. README.txt describes the structure of the package. Pipeline.csh (C shell) and pipeline.bash (bash shell) are two scripts for you to run all commands listed here in batch. <br />
<br />
To conserve time and disk-space, our analysis will focus on a small region on chromosome 20 around position 2,000,000. We will first map reads for a single individual (labeled SAMPLE1). Then we combine the results with mapped reads from all individuals to generate a list of polymorphic sites and estimate accurate genotypes at each of these sites. <br />
<br />
=== Required Software ===<br />
<br />
In addition to TrioCaller, you will need BWA ([http://bio-bwa.sourceforge.net available from Sourceforge]) and samtools ([http://samtools.sourceforge.net also from Sourceforge]) installed to run this exercise. The examples are tested in in bwa 0.6.1, samtools 0.1.18, TrioCaller 0.1.1; we expect newer versions should also work. We assume all executables are in your path.<br />
<br />
== Building an Index for Short Read Alignment ==<br />
<br />
To quickly place short reads along the genome, BWA and other read mappers typically build a word index for the genome. This index lists the location of particular short words along the genome and<br />
can be used to seed and then extend particular matches.<br />
<br />
The sequence index is typically not compatible across different BWA versions. To rebuild the sequence index, issue the following commands:<br />
<br />
<source lang="bash"><br />
#Remove any earlier reference files<br />
rm ref/human_g1k_v37_chr20.fa.*<br />
<br />
#Rebuild the reference<br />
bwa index -a is ref/human_g1k_v37_chr20.fa<br />
</source><br />
<br />
== Mapping Reads to The Genome ==<br />
<br />
Next, we will use BWA to find the most likely sequence location for each read using the <code>bwa aln</code> command. This command requires two parameters, one corresponding to the reference genome, the other corresponding to a fastq file containing reads to be mapped. <br />
<br />
bwa aln -q 15 ref/human_g1k_v37_chr20.fa fastq/SAMPLE1.fastq > bwa.sai/SAMPLE1.sai<br />
<br />
The file SAMPLE1.fastq contains DNA sequence reads for sample SAMPLE1. <br />
<br />
A fastq file consists of a series of multi-line records. Each record starts with a read name, followed by a DNA sequencing, a separator line, and a set of per base quality scores. Base quality scores estimate the probability of error at each sequenced base (a base quality of 10 denotes an error probability of 10%, base quality 20 denotes 1% error probability and base quality 30 denotes 0.1% error probability). These error probabilities are each encoded in a single character for compactness and can be decoded using an [http://www.google.com/search?q=ascii+table ASCII table] (simply look up the ascii code for each base and subtract 33 to get base quality). By inspecting the FastQ file you should be able to learn about the length of reads being mapped and their base qualities. For example, try to figure out if base quality is typically higher at the start or end of each read...<br />
<br />
=== Converting Alignments to BAM format ===<br />
<br />
The .sai alignment format is specific to BWA, so the first thing to do is to convert the alignment to a more standard format that will be compatible with downstream analysis tools. We can do this with a combination of the <code>bwa samse</code> command and <code>samtools view</code> and <code>samtoosl sort</code> commands.<br />
<br />
bwa samse -r "@RG\tID:ILLUMINA\tSM:SAMPLE1" ref/human_g1k_v37_chr20.fa bwa.sai/SAMPLE1.sai fastq/SAMPLE1.fastq | \<br />
samtools view -uhS - | samtools sort -m 2000000000 - bams/SAMPLE1<br />
<br />
You can check the use of parameters in the bwa manual. The result BAM file uses a compact binary format to represent the <br />
alignment of each short read to the genome. You can view the contents<br />
of the file using the <code>samtools view</code> command, like so:<br />
<br />
samtools view bams/SAMPLE1.bam | more<br />
<br />
The text representation of the alignment produced by <code>samtools view</code> describes<br />
the alignment of one read per line. The most interesting fields are column 1 (the read <br />
name), columns 3 and 4 (the alignment position), column 5 (the CIGAR string, describing <br />
any gaps in the alignment), and columns 10 and 11 (with the sequence and quality score). In this representation, all alignments are automatically converted to the forward strand.<br />
<br />
=== Indexing the BAM file ===<br />
<br />
<!--<br />
Although the current file contains all necessary information about reads and their<br />
genomic locations, it is missing some auxiliary information that BAM files typically<br />
contain to help describe their contents (for example, to specify that this file contains<br />
DNA sequence reads for sample NA20589). So, the very next step is to add this information<br />
to the file:<br />
<br />
samtools reheader bams/NA20589.header bams/noheader.NA20589.bam > bams/NA20589.bam<br />
<br />
!--><br />
<br />
If you reached this far, rejoice! The mapping process is almost done. We will now create <br />
an index for the file, which makes it convenient to quickly extract reads from any <br />
genome location. We do this with the <code>samtools index</code> command, like so:<br />
<br />
samtools index bams/SAMPLE1.bam<br />
<br />
=== Browsing Alignment Results ===<br />
<br />
You can now view the contents of the alignment at any location using the <code>samtools view</code><br />
and <code>samtools tview</code> commands. While the <code>tview</code> generates prettier output,<br />
it is not compatible with all screens. For example, to view reads overlapping <br />
starting at position 2,100,000 on chromosome 20, we could run:<br />
<br />
samtools tview bams/SAMPLE1.bam ref/human_g1k_v37_chr20.fa<br />
<br />
Then, type "g 20:2100000"<br />
<br />
So let's recap: we have mapped reads to genome, converted them from a BWA specific format to a more <br />
commonly used format used by many different programs, sorted and indexed the results.<br />
In most cases, the next step would be to remove duplicate reads and to ensure that base quality scores are properly calibrated. To save time, we'll skip those steps now.<br />
<br />
Till now, we only finished read mapping for one sample SAMPLE1. We need to repeat this step for other samples (SAMPL2 - SAMPLE40). You can try something like:<br />
<br />
For c shell<br />
<br />
foreach file (`ls fastq/SAMPLE*.fastq | cut -f 2 -d '/' | cut -f 1 -d '.'`)<br />
echo $file<br />
bwa aln -q 15 ref/human_g1k_v37_chr20.fa fastq/$file.fastq > bwa.sai/$file.sai<br />
bwa samse -r "@RG\tID:ILLUMINA\tSM:$file" ref/human_g1k_v37_chr20.fa bwa.sai/$file.sai fastq/$file.fastq | \<br />
samtools view -uhS - | bin/samtools sort -m 2000000000 - bams/$file<br />
samtools index bams/$file.bam<br />
end<br />
<br />
For bash shell<br />
<br />
for file in `ls fastq/SAMPLE*.fastq | cut -f 2 -d '/' | cut -f 1 -d '.'`;<br />
do echo $file;<br />
bwa aln -q 15 ref/human_g1k_v37_chr20.fa fastq/$file.fastq > bwa.sai/$file.sai;<br />
bwa samse -r "@RG\tID:ILLUMINA\tSM:$file" ref/human_g1k_v37_chr20.fa bwa.sai/$file.sai fastq/$file.fastq | \<br />
samtools view -uhS - | samtools sort -m 2000000000 - bams/$file;<br />
samtools index bams/$file.bam;<br />
done<br />
<br />
<br />
Once we finish the read mapping step and generate bam files for all samples, we can step to variant calling and genotype inference.<br />
<br />
== Calling variants and Inferring genotypes ==<br />
<br />
=== Initial set of variant calls ===<br />
<br />
You probably thought the initial mapping process was quite convoluted ... you'll be glad to know that<br />
the next few steps are much simpler.<br />
<br />
The first thing we'll do is use samtools to generate an initial list of variant sites, using the <code>mpileup</code> command. This command looks at the bases aligned to each location and flags locations that are likely to vary. By default, the results are stored in BCF file, which can be converted into the more widely used VCF format using bcftools (a companion set of tools distributed with samtools).<br />
<br />
<br />
samtools mpileup -Iuf ref/human_g1k_v37_chr20.fa bams/SAMPLE*bam | bcftools view -bvcg - > result/chr20.mpileup.bcf<br />
<br />
bcftools view result/chr20.mpileup.bcf > result/chr20.mpileup.vcf<br />
<br />
<br />
The [http://www.1000genomes.org/node/101 VCF format] is a simple text format. It starts with several header lines, which all start with the two '##' characters, and is followed by a single line per marker that provides both summary information about the marker and genotypes for each individual. You can review the contents of the VCF file using the 'more' command:<br />
<br />
more result/chr20.mpileup.vcf<br />
<br />
Here are some questions for you to investigate:<br />
<br />
* How many variant sites were detected in this dataset? Try a command like this one:<br />
<br />
grep -vE ^# result/chr20.mpileup.vcf | wc -l<br />
<br />
(The grep command line excludes all lines beginning with # and then the wc command counts the number of lines in the file).<br />
<br />
<!-- <br />
* How many variant sites are estimated to be singletons?<br />
!--><br />
<br />
<!--<br />
<br />
Wei: I commented this out because generating GLF requires an old version of samtools that most people don't have. Goncalo<br />
<br />
=== Alternative way to call variants ===<br />
Bcftools assumes all samples are unrelated. The better way to call variants is to consider family constraints imposed by parent-offspring trio. An alternative approach for current step is to use<br />
another program [http://genome.sph.umich.edu/wiki/Polymutt Polymutt]. It consists of two steps, but the input [bam file] and output files [vcf file] are same as above. For more information, <br />
click the link. <br />
<br />
Convert bam files to glf files (It is a single line of command)<br />
<br />
bin/samtools view -bh bams/SAMPLE1.bam | bin/samtools calmd -Abr - ref/human_g1k_v37_chr20.fa 2 - | \<br />
bin/samtools mpileup -g -f ref/human_g1k_v37_chr20.fa - > bams/SAMPLE1.glf<br />
<br />
Take glf files and output vcf files.<br />
<br />
bin/polymutt -p polymutt.ped -d polymutt.dat -g polymutt.glfindex --vcf result/chr20.polymutt.vcf<br />
<br />
!--><br />
<br />
=== Genotype Refinement Using Linkage Disequilibrium Information (TrioCaller) ===<br />
<br />
The initial set of genotype calls is generated examining a single individual at a time. These calls are typically quite good for deep sequencing data, but much less accurate for low pass sequence data. In either case, they can be greatly improved by models that combine information across sites and individuals and consider the contraints imposed by parent-offspring trios. <br />
<br />
Note: The current version only supports SNP data, so please '''filter indels''' before running TrioCaller. It supports VCF 4.0 and 4.1 formats with the '''exception of dropped missing trailing fields''' (e.g. use complete missing notation ./.:.:.:.,.,. rather than ./. for the genotype field)<br />
<br />
Here is a summary of the TrioCaller command line options (these are also listed whenever you run the program with no parameters):<br />
<br />
<source lang="text"><br />
Available Options<br />
Shotgun Sequences: --vcf [], --pedfile [] <br />
Markov Sampler: --seed [], --burnin [], --rounds [] <br />
Haplotyper: --states [], --errorRate [], --compact<br />
Phasing: --randomPhase , --inputPhased, --refPhased<br />
Output Files: --prefix [], --phase, --interimInterval []<br />
<br />
<br />
Explanation of Options<br />
--vcf: Standard VCF file (4.0 and above). <br />
--pedfile: Pedigree file in MERLIN format.<br />
--seed: Seed for sampling, default 123456.<br />
--burnin: The number of rounds ignored at the beginning of sampling.<br />
--rounds: The total number of iterations.<br />
--states: The number of haplotyes used in the state space. The default is the maximum number: 2*(number of founders -1).<br />
--errorRate: The pre-defined base error rate. Default 0.01.<br />
--randomPhase: The initial haplotypes are inferred from the single marker. Default option.<br />
--inputPhased: The initial haplotypes are directly from input VCF file (with "|" as separator in the genotype column).<br />
--refPhased: The initial haplotypes are built on reference alleles from VCF file.<br />
--prefix: The prefix of output file <br />
--interimInterval: The number of rounds for interim outputs<br />
</source><br />
<br />
Note: The pedigree files require complete trio structures (all three members of the trio exist in the file). For parent-offspring pair, create a "fake" parent in the pedigree file or code them as unrelated individuals. The order of the names in the pedigree file is NOT necessary to be consistent with that in .vcf file. The computation will be intensive if the number of samples are large. <br />
You can use option "--states" to reduce the computation cost (e.g. start with "--states 50") <br />
<br />
To complete our example analysis, we could run:<br />
<br />
TrioCaller --vcf result/chr20.mpileup.vcf --pedfile ped/triocaller.ped --states 50 --rounds 10 --prefix result/chr20.triocaller<br />
<br />
The format of output file is same as the input file. Again, you can review the contents of the updated VCF file using the more command:<br />
<br />
more result/chr20.triocaller.vcf<br />
<br />
All right. Congratulations! You have come to the end and learned basic skills for accurate genotype calling in trios.<br />
<br />
If you have any question or comment, feel free to contact Wei Chen at [mailto:weichen.mich@gmail.com weichen.mich@gmail.com] or Goncalo Abecasis at [mailto:goncalo@umich.edu goncalo@umich.edu]<br />
<br />
== Citation ==<br />
<br />
Chen W, Li B, Zeng Z, Sanna S, Sidore C, Busonero F, Kang HM, Li Y, Abecasis GR. &nbsp;Genotype calling and haplotyping in parent-offspring trios. Genome Res.&nbsp;2013 Jan;23(1):142-51&nbsp;[http://genome.cshlp.org/content/early/2012/11/26/gr.142455.112.long LINK]</div>Weichhttp://genome.sph.umich.edu/w/index.php?title=TrioCaller&diff=6451TrioCaller2013-02-17T22:50:57Z<p>Weich: </p>
<hr />
<div>== Introduction ==<br />
<br />
We will illustrate how TrioCaller works in sequence data including trios and unrelated samples. We will start from the scratch and walk through all necessary steps from raw sequence data to called genotypes. If you are new to sequence data, please be patient to go through every step. If you are experienced, you may directly jump to the section of [http://genome.sph.umich.edu/wiki/TrioCaller#Genotype_Refinement_Using_Linkage_Disequilibrium_Information_.28TrioCaller.29 TrioCaller]. <br />
<br />
We will start with a set of short sequence reads and associated base quality scores (stored in a fastq file), find the most likely genomic location for each read (producing a BAM file), generate an initial list of polymorphic sites and genotypes (stored in a VCF file) and use haplotype information to refine these genotypes (resulting in an updated VCF file). <br />
<br />
== '''Note:''' if you are interesting in detecting '''de novo mutations''', or are working on '''a small number of families''' with '''high coverage data''' (e.g. exome sequencing), please first try our sister program [http://genome.sph.umich.edu/wiki/Polymutt Polymutt]&nbsp;. ==<br />
<br />
=== Download ===<br />
<br />
Before downloading the program, we appreciate if you could email [mailto:weichen.mich@gmail.com weichen.mich@gmail.com] with a little descriptive information (e.g. Affiliation, depth, the number of samples and family structure). <br />
<br />
<br> Binary file only: [http://www.sph.umich.edu/csg/weich/TrioCaller.06262012.binary.tgz TrioCaller.06262012.binary.tgz]. <br />
<br />
Binary file with example datasets&nbsp;: [http://www.sph.umich.edu/csg/weich/TrioCaller.06262012.tgz TrioCaller.06262012.tgz]. <br />
<br />
[http://genome.sph.umich.edu/wiki/TrioCaller:Archive Archive]. <br />
<br />
The example dataset demonstrated here is also included. Our dataset consists of 40 individuals, including 10 parent-offspring trios and 10 unrelated individuals. The average sequence depth is ~3x. README.txt describes the structure of the package. Pipeline.csh (C shell) and pipeline.bash (bash shell) are two scripts for you to run all commands listed here in batch. <br />
<br />
To conserve time and disk-space, our analysis will focus on a small region on chromosome 20 around position 2,000,000. We will first map reads for a single individual (labeled SAMPLE1). Then we combine the results with mapped reads from all individuals to generate a list of polymorphic sites and estimate accurate genotypes at each of these sites. <br />
<br />
=== Required Software ===<br />
<br />
In addition to TrioCaller, you will need BWA ([http://bio-bwa.sourceforge.net available from Sourceforge]) and samtools ([http://samtools.sourceforge.net also from Sourceforge]) installed to run this exercise. The examples are tested in in bwa 0.6.1, samtools 0.1.18, TrioCaller 0.1.1; we expect newer versions should also work. We assume all executables are in your path.<br />
<br />
== Building an Index for Short Read Alignment ==<br />
<br />
To quickly place short reads along the genome, BWA and other read mappers typically build a word index for the genome. This index lists the location of particular short words along the genome and<br />
can be used to seed and then extend particular matches.<br />
<br />
The sequence index is typically not compatible across different BWA versions. To rebuild the sequence index, issue the following commands:<br />
<br />
<source lang="bash"><br />
#Remove any earlier reference files<br />
rm ref/human_g1k_v37_chr20.fa.*<br />
<br />
#Rebuild the reference<br />
bwa index -a is ref/human_g1k_v37_chr20.fa<br />
</source><br />
<br />
== Mapping Reads to The Genome ==<br />
<br />
Next, we will use BWA to find the most likely sequence location for each read using the <code>bwa aln</code> command. This command requires two parameters, one corresponding to the reference genome, the other corresponding to a fastq file containing reads to be mapped. <br />
<br />
bwa aln -q 15 ref/human_g1k_v37_chr20.fa fastq/SAMPLE1.fastq > bwa.sai/SAMPLE1.sai<br />
<br />
The file SAMPLE1.fastq contains DNA sequence reads for sample SAMPLE1. <br />
<br />
A fastq file consists of a series of multi-line records. Each record starts with a read name, followed by a DNA sequencing, a separator line, and a set of per base quality scores. Base quality scores estimate the probability of error at each sequenced base (a base quality of 10 denotes an error probability of 10%, base quality 20 denotes 1% error probability and base quality 30 denotes 0.1% error probability). These error probabilities are each encoded in a single character for compactness and can be decoded using an [http://www.google.com/search?q=ascii+table ASCII table] (simply look up the ascii code for each base and subtract 33 to get base quality). By inspecting the FastQ file you should be able to learn about the length of reads being mapped and their base qualities. For example, try to figure out if base quality is typically higher at the start or end of each read...<br />
<br />
=== Converting Alignments to BAM format ===<br />
<br />
The .sai alignment format is specific to BWA, so the first thing to do is to convert the alignment to a more standard format that will be compatible with downstream analysis tools. We can do this with a combination of the <code>bwa samse</code> command and <code>samtools view</code> and <code>samtoosl sort</code> commands.<br />
<br />
bwa samse -r "@RG\tID:ILLUMINA\tSM:SAMPLE1" ref/human_g1k_v37_chr20.fa bwa.sai/SAMPLE1.sai fastq/SAMPLE1.fastq | \<br />
samtools view -uhS - | samtools sort -m 2000000000 - bams/SAMPLE1<br />
<br />
You can check the use of parameters in the bwa manual. The result BAM file uses a compact binary format to represent the <br />
alignment of each short read to the genome. You can view the contents<br />
of the file using the <code>samtools view</code> command, like so:<br />
<br />
samtools view bams/SAMPLE1.bam | more<br />
<br />
The text representation of the alignment produced by <code>samtools view</code> describes<br />
the alignment of one read per line. The most interesting fields are column 1 (the read <br />
name), columns 3 and 4 (the alignment position), column 5 (the CIGAR string, describing <br />
any gaps in the alignment), and columns 10 and 11 (with the sequence and quality score). In this representation, all alignments are automatically converted to the forward strand.<br />
<br />
=== Indexing the BAM file ===<br />
<br />
<!--<br />
Although the current file contains all necessary information about reads and their<br />
genomic locations, it is missing some auxiliary information that BAM files typically<br />
contain to help describe their contents (for example, to specify that this file contains<br />
DNA sequence reads for sample NA20589). So, the very next step is to add this information<br />
to the file:<br />
<br />
samtools reheader bams/NA20589.header bams/noheader.NA20589.bam > bams/NA20589.bam<br />
<br />
!--><br />
<br />
If you reached this far, rejoice! The mapping process is almost done. We will now create <br />
an index for the file, which makes it convenient to quickly extract reads from any <br />
genome location. We do this with the <code>samtools index</code> command, like so:<br />
<br />
samtools index bams/SAMPLE1.bam<br />
<br />
=== Browsing Alignment Results ===<br />
<br />
You can now view the contents of the alignment at any location using the <code>samtools view</code><br />
and <code>samtools tview</code> commands. While the <code>tview</code> generates prettier output,<br />
it is not compatible with all screens. For example, to view reads overlapping <br />
starting at position 2,100,000 on chromosome 20, we could run:<br />
<br />
samtools tview bams/SAMPLE1.bam ref/human_g1k_v37_chr20.fa<br />
<br />
Then, type "g 20:2100000"<br />
<br />
So let's recap: we have mapped reads to genome, converted them from a BWA specific format to a more <br />
commonly used format used by many different programs, sorted and indexed the results.<br />
In most cases, the next step would be to remove duplicate reads and to ensure that base quality scores are properly calibrated. To save time, we'll skip those steps now.<br />
<br />
Till now, we only finished read mapping for one sample SAMPLE1. We need to repeat this step for other samples (SAMPL2 - SAMPLE40). You can try something like:<br />
<br />
For c shell<br />
<br />
foreach file (`ls fastq/SAMPLE*.fastq | cut -f 2 -d '/' | cut -f 1 -d '.'`)<br />
echo $file<br />
bwa aln -q 15 ref/human_g1k_v37_chr20.fa fastq/$file.fastq > bwa.sai/$file.sai<br />
bwa samse -r "@RG\tID:ILLUMINA\tSM:$file" ref/human_g1k_v37_chr20.fa bwa.sai/$file.sai fastq/$file.fastq | \<br />
samtools view -uhS - | bin/samtools sort -m 2000000000 - bams/$file<br />
samtools index bams/$file.bam<br />
end<br />
<br />
For bash shell<br />
<br />
for file in `ls fastq/SAMPLE*.fastq | cut -f 2 -d '/' | cut -f 1 -d '.'`;<br />
do echo $file;<br />
bwa aln -q 15 ref/human_g1k_v37_chr20.fa fastq/$file.fastq > bwa.sai/$file.sai;<br />
bwa samse -r "@RG\tID:ILLUMINA\tSM:$file" ref/human_g1k_v37_chr20.fa bwa.sai/$file.sai fastq/$file.fastq | \<br />
samtools view -uhS - | samtools sort -m 2000000000 - bams/$file;<br />
samtools index bams/$file.bam;<br />
done<br />
<br />
<br />
Once we finish the read mapping step and generate bam files for all samples, we can step to variant calling and genotype inference.<br />
<br />
== Calling variants and Inferring genotypes ==<br />
<br />
=== Initial set of variant calls ===<br />
<br />
You probably thought the initial mapping process was quite convoluted ... you'll be glad to know that<br />
the next few steps are much simpler.<br />
<br />
The first thing we'll do is use samtools to generate an initial list of variant sites, using the <code>mpileup</code> command. This command looks at the bases aligned to each location and flags locations that are likely to vary. By default, the results are stored in a BCF file, which can be converted into the more widely used VCF format using bcftools (a companion set of tools distributed with samtools).<br />
<br />
<br />
samtools mpileup -Iuf ref/human_g1k_v37_chr20.fa bams/SAMPLE*bam | bcftools view -bvcg - > result/chr20.mpileup.bcf<br />
<br />
bcftools view result/chr20.mpileup.bcf > result/chr20.mpileup.vcf<br />
<br />
<br />
The [http://www.1000genomes.org/node/101 VCF format] is a simple text format. It starts with several header lines, each beginning with '##', followed by a single line per marker that provides both summary information about the marker and genotypes for each individual. You can review the contents of the VCF file using the 'more' command:<br />
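As a concrete illustration of that layout, here is a minimal, made-up VCF fragment with one header line, the column-header line, and a single marker record (the real files produced above are tab-delimited and carry many more INFO fields; the site and genotype values here are hypothetical):<br />

```shell
# Write a tiny, hypothetical VCF fragment (tab-delimited) to illustrate the layout.
printf '##fileformat=VCFv4.1\n' > example.vcf
printf '#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\tFORMAT\tSAMPLE1\n' >> example.vcf
printf '20\t2100000\t.\tA\tG\t40\t.\tDP=12\tGT:GQ\t0/1:38\n' >> example.vcf

# Each data line has ten tab-separated fields here: eight site-level columns,
# the FORMAT column, and one genotype column per sample.
awk -F'\t' '!/^#/ {print NF}' example.vcf
```

With one sample, the marker line has ten fields; each additional sample in the real file adds one more genotype column.<br />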
<br />
more result/chr20.mpileup.vcf<br />
<br />
Here are some questions for you to investigate:<br />
<br />
* How many variant sites were detected in this dataset? Try a command like this one:<br />
<br />
grep -vE ^# result/chr20.mpileup.vcf | wc -l<br />
<br />
(The grep command line excludes all lines beginning with # and then the wc command counts the number of lines in the file).<br />
<br />
<!-- <br />
* How many variant sites are estimated to be singletons?<br />
!--><br />
<br />
<!--<br />
<br />
Wei: I commented this out because generating GLF requires an old version of samtools that most people don't have. Goncalo<br />
<br />
=== Alternative way to call variants ===<br />
Bcftools assumes all samples are unrelated. The better way to call variants is to consider family constraints imposed by parent-offspring trio. An alternative approach for current step is to use<br />
another program [http://genome.sph.umich.edu/wiki/Polymutt Polymutt]. It consists of two steps, but the input [bam file] and output files [vcf file] are same as above. For more information, <br />
click the link. <br />
<br />
Convert bam files to glf files (It is a single line of command)<br />
<br />
bin/samtools view -bh bams/SAMPLE1.bam | bin/samtools calmd -Abr - ref/human_g1k_v37_chr20.fa 2 - | \<br />
bin/samtools mpileup -g -f ref/human_g1k_v37_chr20.fa - > bams/SAMPLE1.glf<br />
<br />
Take glf files and output vcf files.<br />
<br />
bin/polymutt -p polymutt.ped -d polymutt.dat -g polymutt.glfindex --vcf result/chr20.polymutt.vcf<br />
<br />
!--><br />
<br />
=== Genotype Refinement Using Linkage Disequilibrium Information (TrioCaller) ===<br />
<br />
The initial set of genotype calls is generated by examining a single individual at a time. These calls are typically quite good for deep sequencing data, but much less accurate for low-pass sequence data. In either case, they can be greatly improved by models that combine information across sites and individuals and consider the constraints imposed by parent-offspring trios. <br />
<br />
Note: The current version only supports SNP data, so please '''filter indels''' before running TrioCaller. It supports VCF 4.0 and 4.1 formats, '''except that missing trailing fields must not be dropped''' (e.g. use the complete missing notation ./.:.:.:.,.,. rather than ./. for the genotype field)<br />
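One simple way to drop indels is to keep only records whose REF and ALT columns are single bases. The sketch below builds a tiny mock VCF so it is self-contained; in practice you would run the awk filter on result/chr20.mpileup.vcf. Note that samtools/bcftools also tag indel records with "INDEL" in the INFO column, so <code>grep -v INDEL</code> is another option:<br />

```shell
# Build a tiny mock VCF (tab-delimited) with one SNP record and one indel record.
printf '##fileformat=VCFv4.1\n20\t100\t.\tA\tG\t40\t.\tDP=10\n20\t200\t.\tAT\tA\t40\t.\tINDEL;DP=8\n' > mock.vcf

# Keep header lines plus sites where REF (col 4) and ALT (col 5) are single
# bases; this drops indels (and, as a side effect, multi-allelic records).
awk -F'\t' '/^#/ || (length($4)==1 && length($5)==1)' mock.vcf > mock.snps.vcf

grep -vc '^#' mock.snps.vcf    # number of variant records kept
```

Applied to the mock file, only the SNP record survives; the same filter applied to result/chr20.mpileup.vcf produces an SNP-only VCF suitable for TrioCaller.<br />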
<br />
Here is a summary of the TrioCaller command line options (these are also listed whenever you run the program with no parameters):<br />
<br />
<source lang="text"><br />
Available Options<br />
Shotgun Sequences: --vcf [], --pedfile [] <br />
Markov Sampler: --seed [], --burnin [], --rounds [] <br />
Haplotyper: --states [], --errorRate [], --compact<br />
Phasing: --randomPhase , --inputPhased, --refPhased<br />
Output Files: --prefix [], --phase, --interimInterval []<br />
<br />
<br />
Explanation of Options<br />
--vcf: Standard VCF file (4.0 and above). <br />
--pedfile: Pedigree file in MERLIN format.<br />
--seed: Seed for sampling, default 123456.<br />
--burnin: The number of rounds ignored at the beginning of sampling.<br />
--rounds: The total number of iterations.<br />
--states: The number of haplotypes used in the state space. The default is the maximum number: 2*(number of founders -1).<br />
--errorRate: The pre-defined base error rate. Default 0.01.<br />
--randomPhase: The initial haplotypes are inferred from the single marker. Default option.<br />
--inputPhased: The initial haplotypes are directly from input VCF file (with "|" as separator in the genotype column).<br />
--refPhased: The initial haplotypes are built on reference alleles from VCF file.<br />
--prefix: The prefix of output file <br />
--interimInterval: The number of rounds for interim outputs<br />
</source><br />
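As a quick sanity check of the <code>--states</code> default, consider the example dataset described earlier (10 trios and 10 unrelated individuals): each trio contributes two founders, so there are 30 founders in total, and the default state count works out as:<br />

```shell
# 10 trios contribute 2 founders each, plus 10 unrelated founders = 30.
founders=$((10 * 2 + 10))
# Default number of states: 2 * (number of founders - 1)
echo $((2 * (founders - 1)))
```

This prints 58, which is why capping the state space with something like "--states 50" already covers most of the default for this dataset while reducing run time on larger samples.<br />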
<br />
Note: The pedigree files require complete trio structures (all three members of the trio must exist in the file). For a parent-offspring pair, create a "fake" parent in the pedigree file or code them as unrelated individuals. The order of the names in the pedigree file does NOT need to match that in the .vcf file. The computation will be intensive if the number of samples is large. <br />
You can use the option "--states" to reduce the computation cost (e.g. start with "--states 50") <br />
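As a sketch of what the MERLIN-format file passed to <code>--pedfile</code> might look like, here is a minimal, hypothetical pedigree with one complete trio plus one unrelated individual (whitespace-delimited columns: family ID, person ID, father ID, mother ID, sex; the sample names are invented and must match the sample names in your VCF — consult the MERLIN documentation for the exact column requirements):<br />

```shell
# Write a minimal, hypothetical pedigree: one complete trio (family 1)
# and one unrelated individual (family 2). 0 = unknown parent; sex 1=male, 2=female.
cat > triocaller.ped <<'EOF'
1 FATHER1 0 0 1
1 MOTHER1 0 0 2
1 CHILD1 FATHER1 MOTHER1 1
2 SAMPLE31 0 0 2
EOF

wc -l < triocaller.ped    # one line per individual
```

Every trio member appears on its own line, and the child's father/mother columns name the two parent lines, satisfying the complete-trio requirement above.<br />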
<br />
To complete our example analysis, we could run:<br />
<br />
TrioCaller --vcf result/chr20.mpileup.vcf --pedfile ped/triocaller.ped --states 50 --rounds 10 --prefix result/chr20.triocaller<br />
<br />
The format of the output file is the same as that of the input file. Again, you can review the contents of the updated VCF file using the more command:<br />
<br />
more result/chr20.triocaller.vcf<br />
<br />
All right. Congratulations! You have come to the end and learned the basic skills for accurate genotype calling in trios.<br />
<br />
If you have any questions or comments, feel free to contact Wei Chen at [mailto:weichen.mich@gmail.com weichen.mich@gmail.com] or Goncalo Abecasis at [mailto:goncalo@umich.edu goncalo@umich.edu]<br />
<br />
== Citation ==<br />
<br />
Chen W, Li B, Zeng Z, Sanna S, Sidore C, Busonero F, Kang HM, Li Y, Abecasis GR. Genotype calling and haplotyping in parent-offspring trios. Genome Research. 2013 Jan;23(1):142-51 [http://genome.cshlp.org/content/early/2012/11/26/gr.142455.112.long LINK]</div>
<hr />
<div>== Introduction ==<br />
<br />
We will illustrate how TrioCaller works in sequence data including trios and unrelated samples. We will start from the scratch and walk through all necessary steps from raw sequence data to called genotypes. If you are new to sequence data, please be patient to go through every step. If you are experienced, you may directly jump to the section of [http://genome.sph.umich.edu/wiki/TrioCaller#Genotype_Refinement_Using_Linkage_Disequilibrium_Information_.28TrioCaller.29 TrioCaller]. <br />
<br />
We will start with a set of short sequence reads and associated base quality scores (stored in a fastq file), find the most likely genomic location for each read (producing a BAM file), generate an initial list of polymorphic sites and genotypes (stored in a VCF file) and use haplotype information to refine these genotypes (resulting in an updated VCF file). <br />
<br />
== '''Note:''' if you are interesting in detecting '''de novo mutations''', or are working on '''a small number of families''' with '''high coverage data''' (e.g. exome sequencing), please first try the other program [http://genome.sph.umich.edu/wiki/Polymutt Polymutt] we developed. ==<br />
<br />
=== Download ===<br />
<br />
Before downloading the program, we appreciate if you could email [mailto:weichen.mich@gmail.com weichen.mich@gmail.com] with a little descriptive information (e.g. Affiliation, depth, the number of samples and family structure). <br />
<br />
<br> Binary file only: [http://www.sph.umich.edu/csg/weich/TrioCaller.06262012.binary.tgz TrioCaller.06262012.binary.tgz]. <br />
<br />
Binary file with example datasets&nbsp;: [http://www.sph.umich.edu/csg/weich/TrioCaller.06262012.tgz TrioCaller.06262012.tgz]. <br />
<br />
[http://genome.sph.umich.edu/wiki/TrioCaller:Archive Archive]. <br />
<br />
The example dataset demonstrated here is also included. Our dataset consists of 40 individuals, including 10 parent-offspring trios and 10 unrelated individuals. The average sequence depth is ~3x. README.txt describes the structure of the package. Pipeline.csh (C shell) and pipeline.bash (bash shell) are two scripts for you to run all commands listed here in batch. <br />
<br />
To conserve time and disk-space, our analysis will focus on a small region on chromosome 20 around position 2,000,000. We will first map reads for a single individual (labeled SAMPLE1). Then we combine the results with mapped reads from all individuals to generate a list of polymorphic sites and estimate accurate genotypes at each of these sites. <br />
<br />
=== Required Software ===<br />
<br />
In addition to TrioCaller, you will need BWA ([http://bio-bwa.sourceforge.net available from Sourceforge]) and samtools ([http://samtools.sourceforge.net also from Sourceforge]) installed to run this exercise. The examples are tested in in bwa 0.6.1, samtools 0.1.18, TrioCaller 0.1.1; we expect newer versions should also work. We assume all executables are in your path.<br />
<br />
== Building an Index for Short Read Alignment ==<br />
<br />
To quickly place short reads along the genome, BWA and other read mappers typically build a word index for the genome. This index lists the location of particular short words along the genome and<br />
can be used to seed and then extend particular matches.<br />
<br />
The sequence index is typically not compatible across different BWA versions. To rebuild the sequence index, issue the following commands:<br />
<br />
<source lang="bash"><br />
#Remove any earlier reference files<br />
rm ref/human_g1k_v37_chr20.fa.*<br />
<br />
#Rebuild the reference<br />
bwa index -a is ref/human_g1k_v37_chr20.fa<br />
</source><br />
<br />
== Mapping Reads to The Genome ==<br />
<br />
Next, we will use BWA to find the most likely sequence location for each read using the <code>bwa aln</code> command. This command requires two parameters, one corresponding to the reference genome, the other corresponding to a fastq file containing reads to be mapped. <br />
<br />
bwa aln -q 15 ref/human_g1k_v37_chr20.fa fastq/SAMPLE1.fastq > bwa.sai/SAMPLE1.sai<br />
<br />
The file SAMPLE1.fastq contains DNA sequence reads for sample SAMPLE1. <br />
<br />
A fastq file consists of a series of multi-line records. Each record starts with a read name, followed by a DNA sequencing, a separator line, and a set of per base quality scores. Base quality scores estimate the probability of error at each sequenced base (a base quality of 10 denotes an error probability of 10%, base quality 20 denotes 1% error probability and base quality 30 denotes 0.1% error probability). These error probabilities are each encoded in a single character for compactness and can be decoded using an [http://www.google.com/search?q=ascii+table ASCII table] (simply look up the ascii code for each base and subtract 33 to get base quality). By inspecting the FastQ file you should be able to learn about the length of reads being mapped and their base qualities. For example, try to figure out if base quality is typically higher at the start or end of each read...<br />
<br />
=== Converting Alignments to BAM format ===<br />
<br />
The .sai alignment format is specific to BWA, so the first thing to do is to convert the alignment to a more standard format that will be compatible with downstream analysis tools. We can do this with a combination of the <code>bwa samse</code> command and <code>samtools view</code> and <code>samtoosl sort</code> commands.<br />
<br />
bwa samse -r "@RG\tID:ILLUMINA\tSM:SAMPLE1" ref/human_g1k_v37_chr20.fa bwa.sai/SAMPLE1.sai fastq/SAMPLE1.fastq | \<br />
samtools view -uhS - | samtools sort -m 2000000000 - bams/SAMPLE1<br />
<br />
You can check the use of parameters in the bwa manual. The result BAM file uses a compact binary format to represent the <br />
alignment of each short read to the genome. You can view the contents<br />
of the file using the <code>samtools view</code> command, like so:<br />
<br />
samtools view bams/SAMPLE1.bam | more<br />
<br />
The text representation of the alignment produced by <code>samtools view</code> describes<br />
the alignment of one read per line. The most interesting fields are column 1 (the read <br />
name), columns 3 and 4 (the alignment position), column 5 (the CIGAR string, describing <br />
any gaps in the alignment), and columns 10 and 11 (with the sequence and quality score). In this representation, all alignments are automatically converted to the forward strand.<br />
<br />
=== Indexing the BAM file ===<br />
<br />
<!--<br />
Although the current file contains all necessary information about reads and their<br />
genomic locations, it is missing some auxiliary information that BAM files typically<br />
contain to help describe their contents (for example, to specify that this file contains<br />
DNA sequence reads for sample NA20589). So, the very next step is to add this information<br />
to the file:<br />
<br />
samtools reheader bams/NA20589.header bams/noheader.NA20589.bam > bams/NA20589.bam<br />
<br />
!--><br />
<br />
If you reached this far, rejoice! The mapping process is almost done. We will now create <br />
an index for the file, which makes it convenient to quickly extract reads from any <br />
genome location. We do this with the <code>samtools index</code> command, like so:<br />
<br />
samtools index bams/SAMPLE1.bam<br />
<br />
=== Browsing Alignment Results ===<br />
<br />
You can now view the contents of the alignment at any location using the <code>samtools view</code><br />
and <code>samtools tview</code> commands. While the <code>tview</code> generates prettier output,<br />
it is not compatible with all screens. For example, to view reads overlapping <br />
starting at position 2,100,000 on chromosome 20, we could run:<br />
<br />
samtools tview bams/SAMPLE1.bam ref/human_g1k_v37_chr20.fa<br />
<br />
Then, type "g 20:2100000"<br />
<br />
So let's recap: we have mapped reads to genome, converted them from a BWA specific format to a more <br />
commonly used format used by many different programs, sorted and indexed the results.<br />
In most cases, the next step would be to remove duplicate reads and to ensure that base quality scores are properly calibrated. To save time, we'll skip those steps now.<br />
<br />
Till now, we only finished read mapping for one sample SAMPLE1. We need to repeat this step for other samples (SAMPL2 - SAMPLE40). You can try something like:<br />
<br />
For c shell<br />
<br />
foreach file (`ls fastq/SAMPLE*.fastq | cut -f 2 -d '/' | cut -f 1 -d '.'`)<br />
echo $file<br />
bwa aln -q 15 ref/human_g1k_v37_chr20.fa fastq/$file.fastq > bwa.sai/$file.sai<br />
bwa samse -r "@RG\tID:ILLUMINA\tSM:$file" ref/human_g1k_v37_chr20.fa bwa.sai/$file.sai fastq/$file.fastq | \<br />
samtools view -uhS - | bin/samtools sort -m 2000000000 - bams/$file<br />
samtools index bams/$file.bam<br />
end<br />
<br />
For bash shell<br />
<br />
for file in `ls fastq/SAMPLE*.fastq | cut -f 2 -d '/' | cut -f 1 -d '.'`;<br />
do echo $file;<br />
bwa aln -q 15 ref/human_g1k_v37_chr20.fa fastq/$file.fastq > bwa.sai/$file.sai;<br />
bwa samse -r "@RG\tID:ILLUMINA\tSM:$file" ref/human_g1k_v37_chr20.fa bwa.sai/$file.sai fastq/$file.fastq | \<br />
samtools view -uhS - | samtools sort -m 2000000000 - bams/$file;<br />
samtools index bams/$file.bam;<br />
done<br />
<br />
<br />
Once we have finished the read mapping step and generated BAM files for all samples, we can proceed to variant calling and genotype inference.<br />
<br />
== Calling variants and Inferring genotypes ==<br />
<br />
=== Initial set of variant calls ===<br />
<br />
You probably thought the initial mapping process was quite convoluted ... you'll be glad to know that<br />
the next few steps are much simpler.<br />
<br />
The first thing we'll do is use samtools to generate an initial list of variant sites, using the <code>mpileup</code> command. This command looks at the bases aligned to each location and flags locations that are likely to vary. By default, the results are stored in a BCF file, which can be converted into the more widely used VCF format using bcftools (a companion set of tools distributed with samtools).<br />
<br />
<br />
samtools mpileup -Iuf ref/human_g1k_v37_chr20.fa bams/SAMPLE*bam | bcftools view -bvcg - > result/chr20.mpileup.bcf<br />
<br />
bcftools view result/chr20.mpileup.bcf > result/chr20.mpileup.vcf<br />
<br />
<br />
The [http://www.1000genomes.org/node/101 VCF format] is a simple text format. It starts with several meta-information lines beginning with '##' and a column header line beginning with '#', followed by a single line per marker that provides both summary information about the marker and genotypes for each individual. You can review the contents of the VCF file using the 'more' command:<br />
<br />
more result/chr20.mpileup.vcf<br />
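For orientation, each data line has eight fixed columns (CHROM, POS, ID, REF, ALT, QUAL, FILTER, INFO), a FORMAT column, and one genotype column per sample. The record below is a made-up illustration of the general layout, not a line from this dataset:<br />

```text
#CHROM  POS      ID  REF  ALT  QUAL  FILTER  INFO           FORMAT    SAMPLE1
20      2100000  .   A    G    46    .       DP=98;AF1=0.2  GT:PL:GQ  0/1:51,0,39:42
```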
<br />
Here are some questions for you to investigate:<br />
<br />
* How many variant sites were detected in this dataset? Try a command like this one:<br />
<br />
grep -vE ^# result/chr20.mpileup.vcf | wc -l<br />
<br />
(The grep command excludes all header lines beginning with '#', and the wc command counts the remaining lines, one per variant site.)<br />
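Along the same lines, samtools mpileup marks indel calls with an INDEL tag in the INFO column, so (assuming your file carries that annotation) you can count how many of the detected sites are indels:<br />

```shell
# Count records whose INFO column carries the INDEL flag
grep -c INDEL result/chr20.mpileup.vcf
```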
<br />
<!-- <br />
* How many variant sites are estimated to be singletons?<br />
!--><br />
<br />
<!--<br />
<br />
Wei: I commented this out because generating GLF requires an old version of samtools that most people don't have. Goncalo<br />
<br />
=== Alternative way to call variants ===<br />
Bcftools assumes all samples are unrelated. The better way to call variants is to consider family constraints imposed by parent-offspring trio. An alternative approach for current step is to use<br />
another program [http://genome.sph.umich.edu/wiki/Polymutt Polymutt]. It consists of two steps, but the input [bam file] and output files [vcf file] are same as above. For more information, <br />
click the link. <br />
<br />
Convert bam files to glf files (It is a single line of command)<br />
<br />
bin/samtools view -bh bams/SAMPLE1.bam | bin/samtools calmd -Abr - ref/human_g1k_v37_chr20.fa 2 - | \<br />
bin/samtools mpileup -g -f ref/human_g1k_v37_chr20.fa - > bams/SAMPLE1.glf<br />
<br />
Take glf files and output vcf files.<br />
<br />
bin/polymutt -p polymutt.ped -d polymutt.dat -g polymutt.glfindex --vcf result/chr20.polymutt.vcf<br />
<br />
!--><br />
<br />
=== Genotype Refinement Using Linkage Disequilibrium Information (TrioCaller) ===<br />
<br />
The initial set of genotype calls is generated by examining a single individual at a time. These calls are typically quite good for deep sequencing data, but much less accurate for low-pass sequence data. In either case, they can be greatly improved by models that combine information across sites and individuals and consider the constraints imposed by parent-offspring trios. <br />
<br />
Note: The current version only supports SNP data, so please '''filter indels''' before running TrioCaller. It supports the VCF 4.0 and 4.1 formats, except that '''missing trailing fields must not be dropped''' (e.g. use the complete missing notation ./.:.:.:.,.,. rather than the abbreviated ./. in the genotype field).<br />
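As a minimal sketch of that filtering step (assuming biallelic sites; multi-allelic records with comma-separated ALT alleles are dropped as well), keep the header plus records whose REF and ALT are both single bases:<br />

```shell
# Keep header lines plus single-base REF/ALT records; write a SNP-only VCF
awk '/^#/ || (length($4) == 1 && length($5) == 1)' \
    result/chr20.mpileup.vcf > result/chr20.mpileup.snps.vcf
```

TrioCaller would then read result/chr20.mpileup.snps.vcf instead of the unfiltered file.<br />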
<br />
Here is a summary of the TrioCaller command line options (these are also listed whenever you run the program with no parameters):<br />
<br />
<source lang="text"><br />
Available Options<br />
Shotgun Sequences: --vcf [], --pedfile [] <br />
Markov Sampler: --seed [], --burnin [], --rounds [] <br />
Haplotyper: --states [], --errorRate [], --compact<br />
Phasing: --randomPhase , --inputPhased, --refPhased<br />
Output Files: --prefix [], --phase, --interimInterval []<br />
<br />
<br />
Explanation of Options<br />
--vcf: Standard VCF file (4.0 and above). <br />
--pedfile: Pedigree file in MERLIN format.<br />
--seed: Seed for sampling, default 123456.<br />
--burnin: The number of rounds ignored at the beginning of sampling.<br />
--rounds: The total number of iterations.<br />
--states: The number of haplotypes used in the state space. The default is the maximum number: 2*(number of founders - 1).<br />
--errorRate: The pre-defined base error rate. Default 0.01.<br />
--randomPhase: The initial haplotypes are inferred from single markers, with random phase. Default option.<br />
--inputPhased: The initial haplotypes are directly from input VCF file (with "|" as separator in the genotype column).<br />
--refPhased: The initial haplotypes are built on reference alleles from VCF file.<br />
--prefix: The prefix of output files.<br />
--interimInterval: The number of rounds between interim outputs.<br />
</source><br />
<br />
Note: The pedigree file requires complete trio structures (all three members of each trio must exist in the file). For a parent-offspring pair, create a "fake" parent in the pedigree file or code the two samples as unrelated individuals. The order of the names in the pedigree file does NOT need to match that in the .vcf file. The computation will be intensive if the number of samples is large; <br />
you can use the option "--states" to reduce the computational cost (e.g. start with "--states 50"). <br />
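As an illustration (the family and sample names below are hypothetical), a MERLIN-format pedigree file has five whitespace-separated columns: family ID, individual ID, father ID, mother ID and sex (1=male, 2=female), with 0 in the parent columns for founders. A trio plus a mother-child pair completed by a "fake" father might look like this:<br />

```text
FAM1    SAMPLE1    0         0          1
FAM1    SAMPLE2    0         0          2
FAM1    SAMPLE3    SAMPLE1   SAMPLE2    2
FAM11   FAKE1      0         0          1
FAM11   SAMPLE31   0         0          2
FAM11   SAMPLE32   FAKE1     SAMPLE31   1
```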
<br />
To complete our example analysis, we could run:<br />
<br />
TrioCaller --vcf result/chr20.mpileup.vcf --pedfile ped/triocaller.ped --states 50 --rounds 10 --prefix result/chr20.triocaller<br />
<br />
The output file has the same format as the input file. Again, you can review the contents of the updated VCF file using the more command:<br />
<br />
more result/chr20.triocaller.vcf<br />
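To spot-check how refinement changed the calls, you can print each site's position next to the genotype columns of the first few samples (in VCF, sample columns start at column 10; adjust the column range to your sample count):<br />

```shell
# Show chromosome, position and the genotypes of the first three samples
grep -v '^#' result/chr20.triocaller.vcf | cut -f 1,2,10-12 | head
```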
<br />
All right. Congratulations! You have reached the end and learned the basic skills for accurate genotype calling in trios.<br />
<br />
If you have any questions or comments, feel free to contact Wei Chen at [mailto:weichen.mich@gmail.com weichen.mich@gmail.com] or Goncalo Abecasis at [mailto:goncalo@umich.edu goncalo@umich.edu].<br />
<br />
== Citation ==<br />
<br />
Chen W, Li B, Zeng Z, Sanna S, Sidore C, Busonero F, Kang HM, Li Y, Abecasis GR. Genotype calling and haplotyping in parent-offspring trios. Genome Res. 2012 Nov 27. [Epub ahead of print] <br />
[http://genome.cshlp.org/content/early/2012/11/26/gr.142455.112.long LINK]</div>
<hr />
<div>== Introduction ==<br />
<br />
We will illustrate how TrioCaller works in sequence data including trios and unrelated samples. We will start from the scratch and walk through all necessary steps <br />
from raw sequence data to called genotypes. If you are new to sequence data, please be patient to go through every step. If you are experienced, you may directly jump to the section of<br />
[http://genome.sph.umich.edu/wiki/TrioCaller#Genotype_Refinement_Using_Linkage_Disequilibrium_Information_.28TrioCaller.29 TrioCaller]. <br />
<br />
We will start with a set of short sequence reads and associated base quality scores (stored in a fastq file), find the most likely genomic location for each read (producing a BAM file), generate an initial list of polymorphic sites and genotypes (stored in a VCF file) and use haplotype information to refine these genotypes (resulting in an updated VCF file).<br />
<br />
'''Note:''' if you are interesting in detecting '''de novo mutations''', or are working on a small number of families with high coverage data (e.g. exome sequencing),<br />
please try the other program [http://genome.sph.umich.edu/wiki/Polymutt Polymutt] we developed. <br />
<br />
=== Download ===<br />
<br />
Before downloading the program, we appreciate if you could email [mailto:weichen.mich@gmail.com weichen.mich@gmail.com] with a little descriptive information (e.g. Affiliation, depth, the number of samples and family structure).<br />
<br />
<br />
Binary file only: [http://www.sph.umich.edu/csg/weich/TrioCaller.06262012.binary.tgz TrioCaller.06262012.binary.tgz]. <br />
<br />
Binary file with example datasets : [http://www.sph.umich.edu/csg/weich/TrioCaller.06262012.tgz TrioCaller.06262012.tgz]. <br />
<br />
[http://genome.sph.umich.edu/wiki/TrioCaller:Archive Archive]. <br />
<br />
The example dataset demonstrated here is also included. Our dataset consists of 40 individuals, including 10 parent-offspring trios and 10 unrelated individuals. <br />
The average sequence depth is ~3x. README.txt describes the structure of the package. Pipeline.csh (C shell) and pipeline.bash (bash shell) are two scripts for you to run all commands listed here in batch. <br />
<br />
To conserve time and disk-space, our analysis will focus on a small region on chromosome 20 around position 2,000,000. We will first map reads for a single individual (labeled SAMPLE1). Then we combine the results with mapped reads from all individuals to generate a list of polymorphic sites and estimate accurate genotypes at each of these sites.<br />
<br />
=== Required Software ===<br />
<br />
In addition to TrioCaller, you will need BWA ([http://bio-bwa.sourceforge.net available from Sourceforge]) and samtools ([http://samtools.sourceforge.net also from Sourceforge]) installed to run this exercise. The examples are tested in in bwa 0.6.1, samtools 0.1.18, TrioCaller 0.1.1; we expect newer versions should also work. We assume all executables are in your path.<br />
<br />
== Building an Index for Short Read Alignment ==<br />
<br />
To quickly place short reads along the genome, BWA and other read mappers typically build a word index for the genome. This index lists the location of particular short words along the genome and<br />
can be used to seed and then extend particular matches.<br />
<br />
The sequence index is typically not compatible across different BWA versions. To rebuild the sequence index, issue the following commands:<br />
<br />
<source lang="bash"><br />
#Remove any earlier reference files<br />
rm ref/human_g1k_v37_chr20.fa.*<br />
<br />
#Rebuild the reference<br />
bwa index -a is ref/human_g1k_v37_chr20.fa<br />
</source><br />
<br />
== Mapping Reads to The Genome ==<br />
<br />
Next, we will use BWA to find the most likely sequence location for each read using the <code>bwa aln</code> command. This command requires two parameters, one corresponding to the reference genome, the other corresponding to a fastq file containing reads to be mapped. <br />
<br />
bwa aln -q 15 ref/human_g1k_v37_chr20.fa fastq/SAMPLE1.fastq > bwa.sai/SAMPLE1.sai<br />
<br />
The file SAMPLE1.fastq contains DNA sequence reads for sample SAMPLE1. <br />
<br />
A fastq file consists of a series of multi-line records. Each record starts with a read name, followed by a DNA sequencing, a separator line, and a set of per base quality scores. Base quality scores estimate the probability of error at each sequenced base (a base quality of 10 denotes an error probability of 10%, base quality 20 denotes 1% error probability and base quality 30 denotes 0.1% error probability). These error probabilities are each encoded in a single character for compactness and can be decoded using an [http://www.google.com/search?q=ascii+table ASCII table] (simply look up the ascii code for each base and subtract 33 to get base quality). By inspecting the FastQ file you should be able to learn about the length of reads being mapped and their base qualities. For example, try to figure out if base quality is typically higher at the start or end of each read...<br />
<br />
=== Converting Alignments to BAM format ===<br />
<br />
The .sai alignment format is specific to BWA, so the first thing to do is to convert the alignment to a more standard format that will be compatible with downstream analysis tools. We can do this with a combination of the <code>bwa samse</code> command and <code>samtools view</code> and <code>samtoosl sort</code> commands.<br />
<br />
bwa samse -r "@RG\tID:ILLUMINA\tSM:SAMPLE1" ref/human_g1k_v37_chr20.fa bwa.sai/SAMPLE1.sai fastq/SAMPLE1.fastq | \<br />
samtools view -uhS - | samtools sort -m 2000000000 - bams/SAMPLE1<br />
<br />
You can check the use of parameters in the bwa manual. The result BAM file uses a compact binary format to represent the <br />
alignment of each short read to the genome. You can view the contents<br />
of the file using the <code>samtools view</code> command, like so:<br />
<br />
samtools view bams/SAMPLE1.bam | more<br />
<br />
The text representation of the alignment produced by <code>samtools view</code> describes<br />
the alignment of one read per line. The most interesting fields are column 1 (the read <br />
name), columns 3 and 4 (the alignment position), column 5 (the CIGAR string, describing <br />
any gaps in the alignment), and columns 10 and 11 (with the sequence and quality score). In this representation, all alignments are automatically converted to the forward strand.<br />
<br />
=== Indexing the BAM file ===<br />
<br />
<!--<br />
Although the current file contains all necessary information about reads and their<br />
genomic locations, it is missing some auxiliary information that BAM files typically<br />
contain to help describe their contents (for example, to specify that this file contains<br />
DNA sequence reads for sample NA20589). So, the very next step is to add this information<br />
to the file:<br />
<br />
samtools reheader bams/NA20589.header bams/noheader.NA20589.bam > bams/NA20589.bam<br />
<br />
!--><br />
<br />
If you reached this far, rejoice! The mapping process is almost done. We will now create <br />
an index for the file, which makes it convenient to quickly extract reads from any <br />
genome location. We do this with the <code>samtools index</code> command, like so:<br />
<br />
samtools index bams/SAMPLE1.bam<br />
<br />
=== Browsing Alignment Results ===<br />
<br />
You can now view the contents of the alignment at any location using the <code>samtools view</code><br />
and <code>samtools tview</code> commands. While the <code>tview</code> generates prettier output,<br />
it is not compatible with all screens. For example, to view reads overlapping <br />
starting at position 2,100,000 on chromosome 20, we could run:<br />
<br />
samtools tview bams/SAMPLE1.bam ref/human_g1k_v37_chr20.fa<br />
<br />
Then, type "g 20:2100000"<br />
<br />
So let's recap: we have mapped reads to genome, converted them from a BWA specific format to a more <br />
commonly used format used by many different programs, sorted and indexed the results.<br />
In most cases, the next step would be to remove duplicate reads and to ensure that base quality scores are properly calibrated. To save time, we'll skip those steps now.<br />
<br />
Till now, we only finished read mapping for one sample SAMPLE1. We need to repeat this step for other samples (SAMPL2 - SAMPLE40). You can try something like:<br />
<br />
For the C shell:<br />
<br />
foreach file (`ls fastq/SAMPLE*.fastq | cut -f 2 -d '/' | cut -f 1 -d '.'`)<br />
echo $file<br />
bwa aln -q 15 ref/human_g1k_v37_chr20.fa fastq/$file.fastq > bwa.sai/$file.sai<br />
bwa samse -r "@RG\tID:ILLUMINA\tSM:$file" ref/human_g1k_v37_chr20.fa bwa.sai/$file.sai fastq/$file.fastq | \<br />
samtools view -uhS - | samtools sort -m 2000000000 - bams/$file<br />
samtools index bams/$file.bam<br />
end<br />
<br />
For the Bash shell:<br />
<br />
for file in `ls fastq/SAMPLE*.fastq | cut -f 2 -d '/' | cut -f 1 -d '.'`;<br />
do echo $file;<br />
bwa aln -q 15 ref/human_g1k_v37_chr20.fa fastq/$file.fastq > bwa.sai/$file.sai;<br />
bwa samse -r "@RG\tID:ILLUMINA\tSM:$file" ref/human_g1k_v37_chr20.fa bwa.sai/$file.sai fastq/$file.fastq | \<br />
samtools view -uhS - | samtools sort -m 2000000000 - bams/$file;<br />
samtools index bams/$file.bam;<br />
done<br />
<br />
<br />
Once we have finished the read mapping step and generated BAM files for all samples, we can proceed to variant calling and genotype inference.<br />
<br />
== Calling Variants and Inferring Genotypes ==<br />
<br />
=== Initial set of variant calls ===<br />
<br />
You probably thought the initial mapping process was quite convoluted ... you'll be glad to know that<br />
the next few steps are much simpler.<br />
<br />
The first thing we'll do is use samtools to generate an initial list of variant sites, using the <code>mpileup</code> command. This command looks at the bases aligned to each location and flags locations that are likely to vary. By default, the results are stored in a BCF file, which can be converted into the more widely used VCF format using bcftools (a companion set of tools distributed with samtools).<br />
<br />
<br />
samtools mpileup -Iuf ref/human_g1k_v37_chr20.fa bams/SAMPLE*bam | bcftools view -bvcg - > result/chr20.mpileup.bcf<br />
<br />
bcftools view result/chr20.mpileup.bcf > result/chr20.mpileup.vcf<br />
<br />
<br />
The [http://www.1000genomes.org/node/101 VCF format] is a simple text format. It starts with several header lines, which all start with the two '##' characters, and is followed by a single line per marker that provides both summary information about the marker and genotypes for each individual. You can review the contents of the VCF file using the 'more' command:<br />
<br />
more result/chr20.mpileup.vcf<br />
<br />
Here are some questions for you to investigate:<br />
<br />
* How many variant sites were detected in this dataset? Try a command like this one:<br />
<br />
grep -vE ^# result/chr20.mpileup.vcf | wc -l<br />
<br />
(The grep command line excludes all lines beginning with # and then the wc command counts the number of lines in the file).<br />
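The same count can be computed in a few lines of Python, which is handy if grep is unavailable. This is a minimal sketch: the inline two-record VCF is made up for illustration, and a "variant site" is simply any non-header line.<br />

```python
# Count variant records in a VCF: every line not starting with '#' is one
# marker. Equivalent to: grep -vE '^#' result/chr20.mpileup.vcf | wc -l
# EXAMPLE_VCF is a hypothetical two-record file for demonstration only.

EXAMPLE_VCF = """##fileformat=VCFv4.1
##source=samtools-mpileup
#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\tFORMAT\tSAMPLE1
20\t2100000\t.\tA\tG\t50\t.\tDP=20\tGT:PL\t0/1:40,0,37
20\t2100050\t.\tC\tT\t60\t.\tDP=18\tGT:PL\t1/1:70,9,0
"""

def count_variant_sites(vcf_text):
    """Return the number of data (non-header, non-empty) lines."""
    return sum(1 for line in vcf_text.splitlines()
               if line and not line.startswith('#'))

print(count_variant_sites(EXAMPLE_VCF))  # 2
```

To run this on the real output, read <code>result/chr20.mpileup.vcf</code> into a string first.<br />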
<br />
<!-- <br />
* How many variant sites are estimated to be singletons?<br />
!--><br />
<br />
<!--<br />
<br />
Wei: I commented this out because generating GLF requires an old version of samtools that most people don't have. Goncalo<br />
<br />
=== Alternative way to call variants ===<br />
Bcftools assumes all samples are unrelated. The better way to call variants is to consider family constraints imposed by parent-offspring trio. An alternative approach for current step is to use<br />
another program [http://genome.sph.umich.edu/wiki/Polymutt Polymutt]. It consists of two steps, but the input [bam file] and output files [vcf file] are same as above. For more information, <br />
click the link. <br />
<br />
Convert bam files to glf files (It is a single line of command)<br />
<br />
bin/samtools view -bh bams/SAMPLE1.bam | bin/samtools calmd -Abr - ref/human_g1k_v37_chr20.fa 2 - | \<br />
bin/samtools mpileup -g -f ref/human_g1k_v37_chr20.fa - > bams/SAMPLE1.glf<br />
<br />
Take glf files and output vcf files.<br />
<br />
bin/polymutt -p polymutt.ped -d polymutt.dat -g polymutt.glfindex --vcf result/chr20.polymutt.vcf<br />
<br />
!--><br />
<br />
=== Genotype Refinement Using Linkage Disequilibrium Information (TrioCaller) ===<br />
<br />
The initial set of genotype calls is generated by examining a single individual at a time. These calls are typically quite good for deep sequencing data, but much less accurate for low-pass sequence data. In either case, they can be greatly improved by models that combine information across sites and individuals and consider the constraints imposed by parent-offspring trios. <br />
<br />
Note: The current version only supports SNP data, so please '''filter indels''' before running TrioCaller. It supports the VCF 4.0 and 4.1 formats, '''except that missing trailing fields must not be dropped''' (e.g. use the complete missing notation ./.:.:.:.,.,. rather than ./. for the genotype field).<br />
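Indels can be filtered with standard VCF tools, or with a short script like the sketch below (this is an illustration, not part of TrioCaller). Per the VCF column layout, REF is the 4th field and ALT the 5th; a record is kept only when every allele is a single base.<br />

```python
# Minimal SNP-only filter for VCF data lines: drop any record whose REF or
# ALT allele is longer than one base (i.e. an indel). Header lines pass
# through unchanged. The 'records' below are made-up example lines.

def is_snp(fields):
    """True if REF and every comma-separated ALT allele are single bases."""
    ref, alt = fields[3], fields[4]
    alleles = [ref] + alt.split(',')
    return all(len(a) == 1 and a in 'ACGT' for a in alleles)

def filter_indels(vcf_lines):
    for line in vcf_lines:
        if line.startswith('#') or is_snp(line.split('\t')):
            yield line

records = [
    '20\t100\t.\tA\tG\t50\t.\t.\tGT\t0/1',    # SNP: kept
    '20\t200\t.\tAT\tA\t50\t.\t.\tGT\t0/1',   # deletion: removed
    '20\t300\t.\tC\tCTT\t50\t.\t.\tGT\t1/1',  # insertion: removed
]
print(list(filter_indels(records)))
```

Applied line by line to <code>result/chr20.mpileup.vcf</code>, this leaves a SNP-only VCF suitable as TrioCaller input.<br />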
<br />
Here is a summary of the TrioCaller command line options (these are also listed whenever you run the program with no parameters):<br />
<br />
<source lang="text"><br />
Available Options<br />
Shotgun Sequences: --vcf [], --pedfile [] <br />
Markov Sampler: --seed [], --burnin [], --rounds [] <br />
Haplotyper: --states [], --errorRate [], --compact<br />
Phasing: --randomPhase , --inputPhased, --refPhased<br />
Output Files: --prefix [], --phase, --interimInterval []<br />
<br />
<br />
Explanation of Options<br />
--vcf: Standard VCF file (4.0 and above). <br />
--pedfile: Pedigree file in MERLIN format.<br />
--seed: Seed for sampling, default 123456.<br />
--burnin: The number of rounds ignored at the beginning of sampling.<br />
--rounds: The total number of iterations.<br />
--states: The number of haplotypes used in the state space. The default is the maximum number: 2*(number of founders - 1).<br />
--errorRate: The pre-defined base error rate. Default 0.01.<br />
--randomPhase: The initial haplotypes are inferred from the single marker. Default option.<br />
--inputPhased: The initial haplotypes are directly from input VCF file (with "|" as separator in the genotype column).<br />
--refPhased: The initial haplotypes are built on reference alleles from VCF file.<br />
--prefix: The prefix of output file <br />
--interimInterval: The number of rounds for interim outputs<br />
</source><br />
<br />
Note: The pedigree file requires complete trio structures (all three members of each trio must exist in the file). For a parent-offspring pair, create a "fake" parent in the pedigree file or code the two as unrelated individuals. The order of the names in the pedigree file does NOT need to be consistent with that in the .vcf file. The computation will be intensive if the number of samples is large. <br />
You can use the "--states" option to reduce the computational cost (e.g. start with "--states 50").<br />
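For reference, a MERLIN-format pedigree file for one trio plus one unrelated sample might look like the sketch below (the sample IDs are hypothetical and must match the names in your VCF). The whitespace-delimited columns are family ID, individual ID, father ID, mother ID, and sex (1 = male, 2 = female); founders and unrelated individuals have their parent IDs coded as 0.<br />

```text
FAM1  SAMPLE1  0        0        1
FAM1  SAMPLE2  0        0        2
FAM1  SAMPLE3  SAMPLE1  SAMPLE2  2
FAM2  SAMPLE4  0        0        1
```

Here SAMPLE3 is the offspring of SAMPLE1 and SAMPLE2, while SAMPLE4 is an unrelated individual in its own family.<br />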
<br />
To complete our example analysis, we could run:<br />
<br />
TrioCaller --vcf result/chr20.mpileup.vcf --pedfile ped/triocaller.ped --states 50 --rounds 10 --prefix result/chr20.triocaller<br />
<br />
The format of the output file is the same as that of the input file. Again, you can review the contents of the updated VCF file using the more command:<br />
<br />
more result/chr20.triocaller.vcf<br />
<br />
All right. Congratulations! You have reached the end and learned the basic skills for accurate genotype calling in trios.<br />
<br />
If you have any questions or comments, feel free to contact Wei Chen at [mailto:weichen.mich@gmail.com weichen.mich@gmail.com] or Goncalo Abecasis at [mailto:goncalo@umich.edu goncalo@umich.edu].<br />
<br />
== Citation ==<br />
<br />
Chen W, Li B, Zeng Z, Sanna S, Sidore C, Busonero F, Kang HM, Li Y, Abecasis GR. Genotype calling and haplotyping in parent-offspring trios. Genome Res. 2012 Nov 27. [Epub ahead of print] <br />
[http://genome.cshlp.org/content/early/2012/11/26/gr.142455.112.long LINK]</div>Weich