Variant Call Pipeline

From Genome Analysis Wiki

Input Files

Index File

The index file contains the list of BAM/GLF files to be analyzed. It is a simple tab-separated file, modeled on the 1000G sequence index file. Minimum requirements are:

  • Header containing name of the fields
  • Column containing name of the file to be analyzed
  • Column containing the SAMPLE_NAME of the sample to be analyzed (used to merge different files from different platforms)

The index file can contain additional fields that can be used to filter or group samples. There is no minimum or maximum number of fields.
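
The requirements above can be illustrated with a minimal Python sketch that reads such an index and groups files by sample. The column name FILE, the PLATFORM column and the example rows are assumptions made for illustration; the source only requires a header, a file-name column and a SAMPLE_NAME column.

```python
import csv
import io

# Hypothetical name for the file-name column; the source does not fix it.
FILE_COLUMN = "FILE"
# SAMPLE_NAME is the column name given in the source.
SAMPLE_COLUMN = "SAMPLE_NAME"


def read_index(handle, file_column=FILE_COLUMN):
    """Read a tab-separated index with a header; return one dict per row."""
    reader = csv.DictReader(handle, delimiter="\t")
    header = reader.fieldnames or []
    missing = [c for c in (file_column, SAMPLE_COLUMN) if c not in header]
    if missing:
        raise ValueError("index file lacks required column(s): " + ", ".join(missing))
    return list(reader)


def files_by_sample(rows, file_column=FILE_COLUMN):
    """Collect the files of each sample, e.g. to merge files across platforms."""
    grouped = {}
    for row in rows:
        grouped.setdefault(row[SAMPLE_COLUMN], []).append(row[file_column])
    return grouped


# Made-up index content, for illustration only.
example = (
    "FILE\tSAMPLE_NAME\tPLATFORM\n"
    "s1.illumina.bam\tS1\tILLUMINA\n"
    "s1.solid.bam\tS1\tSOLID\n"
    "s2.illumina.bam\tS2\tILLUMINA\n"
)
rows = read_index(io.StringIO(example))
print(files_by_sample(rows))
```

Extra columns such as PLATFORM are simply carried along in each row, which is what makes them available for later grouping or filtering.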

Configuration File

The configuration file can be specified with the -c option. If -c is not specified, the file "seq_pipeline.conf" is read.
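
As a rough sketch of this behaviour, the following Python fragment falls back to "seq_pipeline.conf" when -c is absent. The pipeline's real launcher and the on-disk layout of the configuration file are not described here, so the "NAME value" line layout and the "#" comment handling are assumptions.

```python
import argparse

# Default file name taken from the text above.
DEFAULT_CONFIG = "seq_pipeline.conf"


def parse_args(argv):
    """Parse the -c option; a hypothetical stand-in for the real launcher."""
    parser = argparse.ArgumentParser(description="hypothetical pipeline launcher")
    parser.add_argument("-c", dest="config", default=DEFAULT_CONFIG,
                        help="configuration file (default: %(default)s)")
    return parser.parse_args(argv)


def parse_config(lines):
    """Parse 'NAME value' lines (assumed layout) into a dict of strings."""
    settings = {}
    for raw in lines:
        line = raw.strip()
        if not line or line.startswith("#"):  # comment syntax is an assumption
            continue
        parts = line.split(None, 1)
        settings[parts[0]] = parts[1] if len(parts) > 1 else ""
    return settings


print(parse_args([]).config)
print(parse_args(["-c", "my.conf"]).config)
print(parse_config(["N_CPU 4", "", "OUTPUT_PATH results"]))
```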

Basic Configuration (one population, one platform, no group or filter)

All the fields have their default value:

Parameter      Default                                     Description
STEPS          1,2,3,4,5,6,7,8                             Which steps will be executed
N_CPU          1                                           Number of parallel jobs to use
REFERENCE_FA   /data/local/ref/GATK/human_g1k_v37.fasta    Reference file to use for samtools pileup
CMD_PREFIX     <empty string>                              Command prefix, useful for running on a cluster (e.g. /usr/bin/mosrun -b -e -t)
EXEC_PATH      <current dir>                               Directory containing the executable files
OUTPUT_PATH    <current dir>                               Directory in which to place pipeline results
CHR            1-22,X,M,Y                                  Chromosomes to be analysed
INPUT          GENOME_BAM                                  Format of the files listed in the index file
GENOTYPE_FILE  <empty string>                              File containing genotypes to be merged with GLF variants
  • STEPS: numeric values can be replaced by a list of tags:
  1. UPDATE_GLF = '1'
  2. SPLIT_GLF = '2'
  3. CHECK_DEPTH = '3'
  4. MERGE_GLF = '4'
  5. GPT_FREQ = '5'
  6. MERGE_GENO = '6'
  7. CHR_CHUNKER = '7'
  8. RUN_THUNDER = '8'
  • INPUT can be one of the following formats:
    • GENOME_BAM : one BAM file containing all chromosomes
    • GENOME_GLF : one GLF file containing all chromosomes
    • CHR_BAM : multiple BAM files per sample, each containing one chromosome
    • CHR_GLF : multiple GLF files per sample, each containing one chromosome

NOTE : if CHR_BAM or CHR_GLF is selected, the index file MUST contain a CHR column for each file
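
The INPUT rule above can be expressed as a small check. This is only a sketch: the function name and error messages are hypothetical, while the four format names and the CHR column come from the text.

```python
# The four INPUT formats listed above.
VALID_INPUT = {"GENOME_BAM", "GENOME_GLF", "CHR_BAM", "CHR_GLF"}
# Formats where each file holds a single chromosome.
PER_CHROMOSOME = {"CHR_BAM", "CHR_GLF"}


def check_input_format(input_format, index_header):
    """Raise ValueError if INPUT is unknown or the index lacks a needed CHR column."""
    if input_format not in VALID_INPUT:
        raise ValueError("unknown INPUT format: %s" % input_format)
    if input_format in PER_CHROMOSOME and "CHR" not in index_header:
        raise ValueError("INPUT=%s requires a CHR column in the index file" % input_format)


# Passes: whole-genome input needs no CHR column.
check_input_format("GENOME_BAM", ["FILE", "SAMPLE_NAME"])
# Passes: per-chromosome input with a CHR column.
check_input_format("CHR_GLF", ["FILE", "SAMPLE_NAME", "CHR"])
```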

    • GENOTYPE_FILE_TABLE is a file containing a line for each chromosome
    • Each line must contain the chromosome number and the corresponding file
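
Two of the basic parameters, STEPS and CHR, take compact list values. The sketch below shows one way such values could be expanded. It assumes that entries are comma-separated, that tags and numbers may be mixed in STEPS, and that a range such as 1-22 in CHR is inclusive; the pipeline's own parsing may differ.

```python
# Tag-to-number mapping as listed for the STEPS parameter.
STEP_TAGS = {
    "UPDATE_GLF": 1, "SPLIT_GLF": 2, "CHECK_DEPTH": 3, "MERGE_GLF": 4,
    "GPT_FREQ": 5, "MERGE_GENO": 6, "CHR_CHUNKER": 7, "RUN_THUNDER": 8,
}


def parse_steps(value):
    """Turn '1,2,3' or 'UPDATE_GLF,MERGE_GLF' into a list of step numbers."""
    steps = []
    for token in value.split(","):
        token = token.strip()
        if token in STEP_TAGS:
            steps.append(STEP_TAGS[token])
        elif token.isdigit() and 1 <= int(token) <= 8:
            steps.append(int(token))
        else:
            raise ValueError("unknown step: %r" % token)
    return steps


def parse_chromosomes(value):
    """Expand '1-22,X,M,Y' into ['1', ..., '22', 'X', 'M', 'Y']."""
    chroms = []
    for token in value.split(","):
        token = token.strip()
        first, dash, last = token.partition("-")
        if dash and first.isdigit() and last.isdigit():
            # Numeric range, assumed inclusive at both ends.
            chroms.extend(str(n) for n in range(int(first), int(last) + 1))
        else:
            chroms.append(token)
    return chroms


print(parse_steps("UPDATE_GLF,SPLIT_GLF,8"))
print(parse_chromosomes("1-22,X,M,Y"))
```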

Advanced configuration

All the fields have their default value:

Parameter           Default         Description
SECTION_NAME        <empty string>  Section name to add: empty for NCBI37; 'chr' for HG18 and HG19
GROUP_BY            <empty string>  Which header field is used to group input files. Grouped files are analyzed together for depth analysis and filtering. Multiple grouping fields are allowed
FILTER              <empty string>  Which header field is used to filter and which values are permitted; syntax is FILTER <header_field> value1[|value2]
MERGE_POP           <empty string>  Which populations are merged together during GPT calling; syntax is MERGE_POP <pop1>[+<pop2>][+<pop3>]...
THUNDER_CHUNK       20000           Number of SNPs contained in each chunk (only if CHUNKER_FILE_TABLE is not defined)
THUNDER_OVERLAP     1000            Number of SNPs overlapping between two adjacent chunks (only if CHUNKER_FILE_TABLE is not defined)
CHUNKER_FILE_TABLE  <empty string>  File containing the regions to launch in parallel using thunder
Q                   10              Quality filter used in GPT to evaluate the postco parameter
INPUT               GENOME_BAM      Format of the files listed in the index file
GENOTYPE_FILE       <empty string>  File containing genotypes to be merged with GLF variants
  • GROUP_BY: if multiple fields are specified, the pipeline groups files by the combination of those fields


fileC SOLID    SPH
fileD SOLID    NIH
fileE SOLID    NIH

In this case 4 groups will be generated: [ILLUMINA.SPH, ILLUMINA.NIH, SOLID.SPH, SOLID.NIH]
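
The grouping can be sketched in Python as follows. The column names FILE, PLATFORM and CENTER and the two ILLUMINA rows are hypothetical, added only so that all four groups appear; the group label joins the field values with a dot, matching the group names listed above.

```python
from collections import defaultdict


def group_files(rows, group_by, file_column="FILE"):
    """Group file names by the dot-joined values of the GROUP_BY fields."""
    groups = defaultdict(list)
    for row in rows:
        key = ".".join(row[field] for field in group_by)
        groups[key].append(row[file_column])
    return dict(groups)


rows = [
    {"FILE": "fileA", "PLATFORM": "ILLUMINA", "CENTER": "SPH"},  # hypothetical row
    {"FILE": "fileB", "PLATFORM": "ILLUMINA", "CENTER": "NIH"},  # hypothetical row
    {"FILE": "fileC", "PLATFORM": "SOLID", "CENTER": "SPH"},
    {"FILE": "fileD", "PLATFORM": "SOLID", "CENTER": "NIH"},
    {"FILE": "fileE", "PLATFORM": "SOLID", "CENTER": "NIH"},
]
groups = group_files(rows, ["PLATFORM", "CENTER"])
for name in sorted(groups):
    print(name, groups[name])
```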

  • FILTER: only one header field can be used as a filter; however, multiple values are allowed and must be separated by |



For example, a filter whose permitted values are TSI|CEU will exclude all BAM files not belonging to the TSI or CEU population.
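
A minimal sketch of the FILTER syntax, FILTER <header_field> value1[|value2], is given below. The header field name POPULATION and the example rows are assumptions; only the TSI and CEU values come from the text.

```python
def parse_filter(value):
    """Split '<header_field> value1|value2' into (field, set of permitted values)."""
    field, allowed = value.split(None, 1)
    return field, set(allowed.split("|"))


def apply_filter(rows, field, allowed):
    """Keep only rows whose value in the filter field is permitted."""
    return [row for row in rows if row[field] in allowed]


# POPULATION is a hypothetical header field name.
field, allowed = parse_filter("POPULATION TSI|CEU")
rows = [
    {"FILE": "a.bam", "POPULATION": "TSI"},
    {"FILE": "b.bam", "POPULATION": "CHB"},
    {"FILE": "c.bam", "POPULATION": "CEU"},
]
kept = apply_filter(rows, field, allowed)
print([row["FILE"] for row in kept])
```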

  • MERGE_POP: multiple populations can be merged together



For example, a MERGE_POP value combining CEU, CHB and TSI will group together all BAM files belonging to those populations when calling GPT.
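
The MERGE_POP syntax, <pop1>[+<pop2>][+<pop3>]..., could be handled along these lines. The order of the populations in the example value and the "+"-joined label used for the merged set are assumptions made for illustration.

```python
def parse_merge_pop(value):
    """Split 'CEU+CHB+TSI' into the list of populations to merge."""
    return [pop.strip() for pop in value.split("+") if pop.strip()]


def calling_population(population, merged):
    """Return the label under which a population is called.

    Merged populations share one label (hypothetical naming); all other
    populations keep their own name.
    """
    return "+".join(merged) if population in merged else population


merged = parse_merge_pop("CEU+CHB+TSI")
for pop in ["CEU", "TSI", "YRI"]:
    print(pop, "->", calling_population(pop, merged))
```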

  • Q: only three values are currently allowed: 10 (postco 0.9), 20 (postco 0.99), 30 (postco 0.999)
  • CHUNKER_FILE_TABLE: the file contains one region per line according to the format:

One thunder run will be generated for each region. If CHUNKER_FILE_TABLE is specified, THUNDER_OVERLAP and THUNDER_CHUNK will be ignored.
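
When CHUNKER_FILE_TABLE is not given, chunks are defined by THUNDER_CHUNK and THUNDER_OVERLAP. The sketch below shows one plausible reading, in which each chunk holds up to THUNDER_CHUNK SNPs and starts THUNDER_CHUNK minus THUNDER_OVERLAP SNPs after the previous one. This interpretation and the function name are assumptions, not the pipeline's documented algorithm.

```python
def chunk_ranges(n_snps, chunk=20000, overlap=1000):
    """Return (start, end) SNP index pairs, end exclusive, for overlapping chunks."""
    if overlap >= chunk:
        raise ValueError("overlap must be smaller than chunk size")
    ranges = []
    step = chunk - overlap  # adjacent chunks share `overlap` SNPs
    start = 0
    while start < n_snps:
        end = min(start + chunk, n_snps)
        ranges.append((start, end))
        if end == n_snps:
            break
        start += step
    return ranges


print(chunk_ranges(50000))
# [(0, 20000), (19000, 39000), (38000, 50000)]
```

With the default values, a chromosome of 50000 SNPs yields three chunks, each sharing 1000 SNPs with its neighbour.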


Each file produced by the pipeline is placed inside the OUTPUT_PATH directory specified in the configuration file. Each step has one or more specific directories:

  1. UPDATE_GLF : creates .md5 files in the md5/ dir and .glf files in the glf/ dir
  2. SPLIT_GLF : creates .glf files split by chromosome in the chr/ dir
  3. CHECK_DEPTH : creates total_depth.*.txt and depth_per_site.*.txt in the OUTPUT_PATH dir; filtered .glf files are placed in the filter/ dir, separated by group if GROUP_BY is specified
  4. MERGE_GLF : creates one .glf per SAMPLE_NAME and places it in the merge/ dir
  5. GPT_FREQ : creates a .tin file for each population and uses the GPT/ dir
  6. MERGE_GENO : creates a .tin file in the merge_geno/ dir
  7. CHR_CHUNKER : creates a list of .tin files for each chromosome according to the number of SNPs
  8. RUN_THUNDER : places thunder output for each chunk in the thunder/ dir
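
The directory layout above can be summarised as a lookup from step to sub-directories of OUTPUT_PATH. This is an illustrative sketch: the mapping is read off the list above, no directory is named for CHR_CHUNKER so it is left empty, and the helper name is hypothetical.

```python
from pathlib import Path

# Sub-directories of OUTPUT_PATH used by each step, as listed above.
STEP_DIRS = {
    "UPDATE_GLF": ["md5", "glf"],
    "SPLIT_GLF": ["chr"],
    "CHECK_DEPTH": ["filter"],  # depth .txt files go directly in OUTPUT_PATH
    "MERGE_GLF": ["merge"],
    "GPT_FREQ": ["GPT"],
    "MERGE_GENO": ["merge_geno"],
    "CHR_CHUNKER": [],  # the text names no directory for this step
    "RUN_THUNDER": ["thunder"],
}


def output_dirs(output_path, steps):
    """List the directories the given steps write to under output_path."""
    base = Path(output_path)
    return [base / name for step in steps for name in STEP_DIRS[step]]


for path in output_dirs("results", ["UPDATE_GLF", "SPLIT_GLF", "RUN_THUNDER"]):
    print(path)
```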