From Genome Analysis Wiki
Jump to: navigation, search


This page describes the procedure for pileup-based variant calling. The software package is currently in-house at Center for Statistical Genetics (CSG) at University of Michigan, so only CSG individuals are able to perform analysis with this tool at this point. If you have questions, please contact Hyun Min Kang (hmkang@umich.edu).

Outline of the Procedure

Pileup-based variant calling toolkit consists of the following procedure

  1. Detecting variant site from aligned sequence reads
  2. Producing pileups of sequence reads at variant sites
  3. Detecting and estimating DNA contamination from the pileups
  4. Calculating contamination-aware genotype likelihoods and filtering statistics
  5. Perform SVM filtering to produce high quality variants
  6. Perform LD-aware genotyping to phase the sequence data to create sequence-based

Currently, only a small part of these procedure is described in this page. This document will be further updated later comprehensively.

Producing pileups of sequence reads at variant sites

You can use the capt-pileup pipeline to produce pileups of sequence reads at variant sites. Below are brief usage.

% /net/fantasia/home/hmkang/bin/captTest/bin/capt-pileup
   capt pileup [options]

    Required Options (Run capt-pileup -man or see wiki for more info):
      -loci STR         Input genomic position to perform pileup
      -index STR        Index file containing sample IDs and BAM file path
      -out STR          Output prefix

    Key Options (Run capt pileup -man or see wiki for more info):
      -help             Print out brief help message [OFF]
      -man              Print the full documentation in man page style [OFF]
      -ref STR          Reference FASTA file [/data/local/ref/karma.ref/human.g1k.v37.fa]
      -max-depth INT    Maximum depth to pile up
      -ignore-MQ        Do not report mapping quality
      -ignore-cycle     Do not report the read cycle
      -samtools STR     The full path of samtools binary [/usr/cluster/bin/samtools]
      -bgzip STR        The full path of bgzip binary [/usr/cluster/bin/bgzip]
      -tabix STR        The full path of bgzip binary [/usr/cluster/bin/tabix]
      -mosix-nodes STR  Comma separated list of mosix nodes
      -view-param STR   Parameter for samtools view [-q 20 -F 0x0704]
      -baq-param STR    Parameter for samsools calmd (for BAQ) [-AEbr]
      -clip-overlap STR Clip overlapping read fragments
      -run INT          Number of jobs to run in parallel []
      -restart          Restart jobs ignoring the previous outputs [OFF]

What you minimally need is the following list of input files

  • loci file : two column file containing base positions containing the genomic position - [CHROM] [POS] - to collect pileups. The positions need to be sorted in genomic coordinate.
  • index file : two column file containing [SAMPLE_ID] [FULL_PATH_OF_BAM_FILE]. The pileups will be performed across all samples.

Also you need --out [out_prefix] argument. Then [out_prefix].[sample_id].txt.gz files will be generated for pileups.

By default, the pipeline will produce [out_prefix].Makefile, and you need to run it by calling GNU make. You can use --run [INT] option to automatically start running make.

By default, the pileup procedure will do the following things

  • At each specified genomic position, it will produce pileups containing the information about (1) base reads and strands (2) base qualities (3) mapping qualities and (4) cycles. --ignore-MQ or --ignore-cycle options can change this default option.
  • QC-failed reads, reads with mapping quality less than 20 will be ignored during the pileups. --view-param will override this default.
  • Extended BAQ is applied by default, and this can be overriden by --baq-param option.
  • Overlapping pair of read fragments are not specially handled by default. **It is recommended** to explicitly turn on --clip-overlap option to clip either side of overlapping read fragment to improve the filtering performance.
  • By default, it assume that it runs in one machine. If you are running in MOSIX enable cluster, mosix-nodes [node1,node2,node3,..,noden] will allow to spread the jobs to multiple nodes in parallel

Internally, the current implementation runs samtools to collect this pileup. We have a separate software package that handles indels and SNPs together and will replace samtools soon.

Examples from the 1000 Genomes project is available at

 cat /net/1000g/hmkang/1KG/phase3/scripts/m02-create-pileups.sh

For example, you can modify from the following command

/net/fantasia/home/hmkang/bin/captTest/bin/capt-pileup --index /net/1000g/hmkang/1KG/phase3/index/20130502.gotcloud.low_coverage.2col.index \\
   --out /net/1000g/hmkang/1KG/phase3/wg.consensus/lcmpus/phase3.low_coverage.wgs \\
   --loci /net/1000g/hmkang/1KG/phase3/wg.consensus/union/union.snps.sites.loci \\
   --mosix-nodes 10,11,12,13 \\
   --ref /net/1000g/hmkang/1KG/phase3/gotcloud/gotcloud.ref/hs37d5.fa \\

Have a peek of the each input file to better understand what you actually need to prepare

$ head /net/1000g/hmkang/1KG/phase3/index/20130502.gotcloud.low_coverage.2col.index
HG00096	/net/1000g/1000g/data/HG00096/alignment/HG00096.mapped.ILLUMINA.bwa.GBR.low_coverage.20120522.bam
HG00097	/net/1000g/1000g/data/HG00097/alignment/HG00097.mapped.ILLUMINA.bwa.GBR.low_coverage.20130415.bam
HG00099	/net/1000g/1000g/data/HG00099/alignment/HG00099.mapped.ILLUMINA.bwa.GBR.low_coverage.20130415.bam
HG00100	/net/1000g/1000g/data/HG00100/alignment/HG00100.mapped.ILLUMINA.bwa.GBR.low_coverage.20130415.bam
HG00101	/net/1000g/1000g/data/HG00101/alignment/HG00101.mapped.ILLUMINA.bwa.GBR.low_coverage.20130415.bam
HG00102	/net/1000g/1000g/data/HG00102/alignment/HG00102.mapped.ILLUMINA.bwa.GBR.low_coverage.20130415.bam
HG00103	/net/1000g/1000g/data/HG00103/alignment/HG00103.mapped.ILLUMINA.bwa.GBR.low_coverage.20120522.bam
HG00105	/net/1000g/1000g/data/HG00105/alignment/HG00105.mapped.ILLUMINA.bwa.GBR.low_coverage.20130415.bam
HG00106	/net/1000g/1000g/data/HG00106/alignment/HG00106.mapped.ILLUMINA.bwa.GBR.low_coverage.20121211.bam
HG00107	/net/1000g/1000g/data/HG00107/alignment/HG00107.mapped.ILLUMINA.bwa.GBR.low_coverage.20130415.bam
$ head /net/1000g/hmkang/1KG/phase3/wg.consensus/union/union.snps.sites.loci
1	10002
1	10004
1	10005
1	10327
1	10469
1	10470
1	10471
1	10472
1	10473
1	10478

The output file has a bit cryptic format, but it will be readable to the downstream software. See [Samtools Pileup Format] web page to understand the details of the output format.

$ zcat /net/1000g/hmkang/1KG/phase3/wg.consensus/lcmpus/phase3.low_coverage.wgs.HG00096.txt.gz | head
1	10327	T	4	,.,.	E@,H	>>>F	76,30,36,18
1	10469	C	5	.g,$,.	A?/;P	>>>FF	78,66,100,38,12
1	10470	G	4	.,,.	8?CH	>>FF	79,67,39,13
1	10471	C	4	.,,.	D=:Q	>>FF	80,68,40,14
1	10472	G	4	.,,.	<DLH	>>FF	81,69,41,15
1	10473	G	4	.,,.	C=DR	>>FF	82,70,42,16
1	10478	C	5	.,,.,	DJQR>	>>FF>	87,75,47,21,2
1	10492	C	5	,,T,,	QKAA@	>FF>>	89,61,35,16,7
1	10494	G	5	,,.,,	QQ<G>	>FF>>	91,63,37,18,9
1	10503	T	5	,$,.,,	/QDAG	>FF>>	100,72,46,27,18