This page describes the procedure for pileup-based variant calling. The software package is currently in-house at Center for Statistical Genetics (CSG) at University of Michigan, so only CSG individuals are able to perform analysis with this tool at this point. If you have questions, please contact Hyun Min Kang (firstname.lastname@example.org).
Outline of the Procedure
Pileup-based variant calling toolkit consists of the following procedure
- Detecting variant site from aligned sequence reads
- Producing pileups of sequence reads at variant sites
- Detecting and estimating DNA contamination from the pileups
- Calculating contamination-aware genotype likelihoods and filtering statistics
- Perform SVM filtering to produce high quality variants
- Perform LD-aware genotyping to phase the sequence data to create sequence-based
Currently, only a small part of these procedure is described in this page. This document will be further updated later comprehensively.
Producing pileups of sequence reads at variant sites
You can use the capt-pileup pipeline to produce pileups of sequence reads at variant sites. Below are brief usage.
% /net/fantasia/home/hmkang/bin/captTest/bin/capt-pileup Usage: capt pileup [options] Required Options (Run epacts single -man or see wiki for more info): -loci STR Input genomic position to perform pileup -index STR Index file containing sample IDs and BAM file path -out STR Output prefix Key Options (Run capt pileup -man or see wiki for more info): -help Print out brief help message [OFF] -man Print the full documentation in man page style [OFF] -ref STR Reference FASTA file [/data/local/ref/karma.ref/human.g1k.v37.fa] -max-depth INT Maximum depth to pile up -ignore-MQ Do not report mapping quality -ignore-cycle Do not report the read cycle -samtools STR The full path of samtools binary [/usr/cluster/bin/samtools] -bgzip STR The full path of bgzip binary [/usr/cluster/bin/bgzip] -tabix STR The full path of bgzip binary [/usr/cluster/bin/tabix] -mosix-nodes STR Comma separated list of mosix nodes -view-param STR Parameter for samtools view [-q 20 -F 0x0704] -baq-param STR Parameter for samsools calmd (for BAQ) [-AEbr] -clip-overlap STR Clip overlapping read fragments -run INT Number of jobs to run in parallel  -restart Restart jobs ignoring the previous outputs [OFF]
What you minimally need is the following list of input files
- loci file : two column file containing base positions containing the genomic position - [CHROM] [POS] - to collect pileups. The positions need to be sorted in genomic coordinate.
- index file : two column file containing [SAMPLE_ID] [FULL_PATH_OF_BAM_FILE]. The pileups will be performed across all samples.
Also you need --out [out_prefix] argument. Then [out_prefix].[sample_id].txt.gz files will be generated for pileups.
By default, the pipeline will produce [out_prefix].Makefile, and you need to run it by calling GNU make. You can use --run [INT] option to automatically start running make.
By default, the pileup procedure will do the following things
- At each specified genomic position, it will produce pileups containing the information about (1) base reads and strands (2) base qualities (3) mapping qualities and (4) cycles. --ignore-MQ or --ignore-cycle options can change this default option.
- QC-failed reads, reads with mapping quality less than 20 will be ignored during the pileups. --view-param will override this default.
- Extended BAQ is applied by default, and this can be overriden by --baq-param option.
- Overlapping pair of read fragments are not specially handled by default. **It is recommended** to explicitly turn on --clip-overlap option to clip either side of overlapping read fragment to improve the filtering performance.
- By default, it assume that it runs in one machine. If you are running in MOSIX enable cluster, mosix-nodes [node1,node2,node3,..,noden] will allow to spread the jobs to multiple nodes in parallel
Internally, the current implementation runs samtools to collect this pileup. We have a separate software package that handles indels and SNPs together and will replace samtools soon.
Examples from the 1000 Genomes project is available at
For example, you can modify from the following command
/net/fantasia/home/hmkang/bin/captTest/bin/capt-pileup --index /net/1000g/hmkang/1KG/phase3/index/20130502.gotcloud.low_coverage.2col.index \\ --out /net/1000g/hmkang/1KG/phase3/wg.consensus/lcmpus/phase3.low_coverage.wgs \\ --loci /net/1000g/hmkang/1KG/phase3/wg.consensus/union/union.snps.sites.loci \\ --mosix-nodes 10,11,12,13 \\ --ref /net/1000g/hmkang/1KG/phase3/gotcloud/gotcloud.ref/hs37d5.fa \\ --clip-overlap
Have a peek of the each input file to better understand what you actually need to prepare
$ head /net/1000g/hmkang/1KG/phase3/index/20130502.gotcloud.low_coverage.2col.index HG00096 /net/1000g/1000g/data/HG00096/alignment/HG00096.mapped.ILLUMINA.bwa.GBR.low_coverage.20120522.bam HG00097 /net/1000g/1000g/data/HG00097/alignment/HG00097.mapped.ILLUMINA.bwa.GBR.low_coverage.20130415.bam HG00099 /net/1000g/1000g/data/HG00099/alignment/HG00099.mapped.ILLUMINA.bwa.GBR.low_coverage.20130415.bam HG00100 /net/1000g/1000g/data/HG00100/alignment/HG00100.mapped.ILLUMINA.bwa.GBR.low_coverage.20130415.bam HG00101 /net/1000g/1000g/data/HG00101/alignment/HG00101.mapped.ILLUMINA.bwa.GBR.low_coverage.20130415.bam HG00102 /net/1000g/1000g/data/HG00102/alignment/HG00102.mapped.ILLUMINA.bwa.GBR.low_coverage.20130415.bam HG00103 /net/1000g/1000g/data/HG00103/alignment/HG00103.mapped.ILLUMINA.bwa.GBR.low_coverage.20120522.bam HG00105 /net/1000g/1000g/data/HG00105/alignment/HG00105.mapped.ILLUMINA.bwa.GBR.low_coverage.20130415.bam HG00106 /net/1000g/1000g/data/HG00106/alignment/HG00106.mapped.ILLUMINA.bwa.GBR.low_coverage.20121211.bam HG00107 /net/1000g/1000g/data/HG00107/alignment/HG00107.mapped.ILLUMINA.bwa.GBR.low_coverage.20130415.bam
$ head /net/1000g/hmkang/1KG/phase3/wg.consensus/union/union.snps.sites.loci 1 10002 1 10004 1 10005 1 10327 1 10469 1 10470 1 10471 1 10472 1 10473 1 10478
The output file has a bit cryptic format, but it will be readable to the downstream software. See [Samtools Pileup Format] web page to understand the details of the output format.
$ zcat /net/1000g/hmkang/1KG/phase3/wg.consensus/lcmpus/phase3.low_coverage.wgs.HG00096.txt.gz | head 1 10327 T 4 ,.,. E@,H >>>F 76,30,36,18 1 10469 C 5 .g,$,. A?/;P >>>FF 78,66,100,38,12 1 10470 G 4 .,,. 8?CH >>FF 79,67,39,13 1 10471 C 4 .,,. D=:Q >>FF 80,68,40,14 1 10472 G 4 .,,. <DLH >>FF 81,69,41,15 1 10473 G 4 .,,. C=DR >>FF 82,70,42,16 1 10478 C 5 .,,., DJQR> >>FF> 87,75,47,21,2 1 10492 C 5 ,,T,, QKAA@ >FF>> 89,61,35,16,7 1 10494 G 5 ,,.,, QQ<G> >FF>> 91,63,37,18,9 1 10503 T 5 ,$,.,, /QDAG >FF>> 100,72,46,27,18
(TO BE CONTINUED)..