Variant Call Pipeline

From Genome Analysis Wiki


Input Files

Index File

The index file contains the list of BAM/GLF files to be analyzed. It is a simple tab-separated file, inspired by the 1000G sequence index file. Minimum requirements are:

  • Header containing the names of the fields
  • Column containing the name of the file to be analyzed
  • Column containing the SAMPLE_NAME of the sample to be analyzed (used to merge different files from different platforms)

The index file can contain more fields, which can be used to filter/group samples. There is no requirement on the minimum or maximum number of fields.
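As an illustration of the layout described above, the following minimal sketch reads a tab-separated index and collects the files of each sample. The SAMPLE_NAME column comes from the requirements above; the FILE column name, the sample names and the file names are assumptions made only for this example.

```python
import csv
import io
from collections import defaultdict


def read_index(handle, file_column="FILE", sample_column="SAMPLE_NAME"):
    """Group the files listed in a tab-separated index by sample name.

    file_column is an assumed header name; the pipeline only requires
    that some column holds the file name.
    """
    reader = csv.DictReader(handle, delimiter="\t")
    header = reader.fieldnames or []
    for required in (file_column, sample_column):
        if required not in header:
            raise ValueError(f"index file is missing the {required} column")
    files_by_sample = defaultdict(list)
    for row in reader:
        files_by_sample[row[sample_column]].append(row[file_column])
    return dict(files_by_sample)


# Hypothetical index content, used only for illustration.
example_index = (
    "FILE\tSAMPLE_NAME\tPLATFORM\n"
    "s1.illumina.bam\tNA00001\tILLUMINA\n"
    "s1.solid.bam\tNA00001\tSOLID\n"
    "s2.illumina.bam\tNA00002\tILLUMINA\n"
)
samples = read_index(io.StringIO(example_index))
print(samples)
```

Files that share a SAMPLE_NAME end up in the same list, which mirrors the idea of merging files from different platforms for one sample.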

Configuration File

The configuration file can be specified with the -c option. If -c is not specified, the file "seq_pipeline.conf" is read.
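The fallback to "seq_pipeline.conf" can be sketched as follows. This is not the pipeline's own code: the function names are hypothetical, and the "PARAMETER value" line layout is an assumption inferred from the examples on this page.

```python
import argparse


def parse_args(argv):
    """Read the -c option, defaulting to seq_pipeline.conf."""
    parser = argparse.ArgumentParser(description="pipeline launcher (sketch)")
    parser.add_argument("-c", dest="config", default="seq_pipeline.conf",
                        help="configuration file (default: seq_pipeline.conf)")
    return parser.parse_args(argv)


def parse_config(lines):
    """Parse 'PARAMETER value' lines; this layout is an assumption."""
    config = {}
    for line in lines:
        parts = line.strip().split(None, 1)
        if not parts:
            continue  # skip blank lines
        config[parts[0]] = parts[1] if len(parts) > 1 else ""
    return config


print(parse_args([]).config)
print(parse_args(["-c", "my_run.conf"]).config)
print(parse_config(["N_CPU 4", "", "MERGE_POP TSI+CEU+CHB"]))
```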

Basic Configuration (one population, one platform, no group or filter)

All fields have a default value:

Parameter      Default                                   Description
STEPS          1,2,3,4,5,6,7,8                           Steps to be executed
N_CPU          1                                         Number of parallel jobs to run
REFERENCE_FA   /data/local/ref/GATK/human_g1k_v37.fasta  Reference file to use for samtools pileup
CMD_PREFIX     <empty string>                            Command prefix, useful for running on a cluster (e.g. /usr/bin/mosrun -b -e -t)
EXEC_PATH      <current dir>                             Directory containing the executable files
OUTPUT_PATH    <current dir>                             Directory in which to place pipeline results
CHR            1-22,X,M,Y                                Chromosomes to be analysed
INPUT          GENOME_BAM                                Format of the files listed in the index file
GENOTYPE_FILE  <empty string>                            File containing genotypes to be merged with GLF variants
  • STEPS numeric values can be replaced by the corresponding tags:
  1. UPDATE_GLF = '1'
  2. SPLIT_GLF = '2'
  3. CHECK_DEPTH = '3'
  4. MERGE_GLF = '4'
  5. GPT_FREQ = '5'
  6. MERGE_GENO = '6'
  7. CHR_CHUNKER = '7'
  8. RUN_THUNDER = '8'
  • INPUT can be one of the following formats:
    • GENOME_BAM : one BAM file containing all chromosomes
    • GENOME_GLF : one GLF file containing all chromosomes
    • CHR_BAM : multiple BAM files per sample, each containing one chromosome
    • CHR_GLF : multiple GLF files per sample, each containing one chromosome

NOTE : if CHR_BAM or CHR_GLF is selected, the index file MUST contain a CHR column giving the chromosome of each file
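The tag-to-number mapping and the CHR column requirement can be expressed as a small check. This is a hedged sketch: the function names are hypothetical, and it assumes that tags are given as a comma-separated list, like the numeric default.

```python
# Step tags and their numeric equivalents, as listed above.
STEP_TAGS = {
    "UPDATE_GLF": 1, "SPLIT_GLF": 2, "CHECK_DEPTH": 3, "MERGE_GLF": 4,
    "GPT_FREQ": 5, "MERGE_GENO": 6, "CHR_CHUNKER": 7, "RUN_THUNDER": 8,
}
# Input formats that provide one file per chromosome.
PER_CHROMOSOME_INPUTS = {"CHR_BAM", "CHR_GLF"}


def resolve_steps(value):
    """Turn a STEPS value made of numbers and/or tags into step numbers."""
    steps = []
    for token in value.split(","):
        token = token.strip()
        if token in STEP_TAGS:
            steps.append(STEP_TAGS[token])
        elif token.isdigit() and 1 <= int(token) <= 8:
            steps.append(int(token))
        else:
            raise ValueError(f"unknown step: {token!r}")
    return steps


def check_chr_column(input_format, index_header):
    """Raise if a per-chromosome input format is used without a CHR column."""
    if input_format in PER_CHROMOSOME_INPUTS and "CHR" not in index_header:
        raise ValueError(f"{input_format} requires a CHR column in the index file")


print(resolve_steps("1,2,3"))
print(resolve_steps("CHECK_DEPTH,MERGE_GLF"))
check_chr_column("CHR_BAM", ["FILE", "SAMPLE_NAME", "CHR"])  # passes silently
```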

  • GENOTYPE_FILE can be replaced by GENOTYPE_FILE_TABLE
    • GENOTYPE_FILE_TABLE is a file containing a line for each chromosome
    • Each line must contain the chromosome number and the corresponding file
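A GENOTYPE_FILE_TABLE as described above could be read along these lines. The sketch assumes whitespace-separated fields and file names without spaces; the function name and the file names are made up for illustration.

```python
def read_genotype_table(lines):
    """Map each chromosome to its genotype file.

    Assumes two whitespace-separated fields per line: chromosome, file.
    """
    table = {}
    for line in lines:
        fields = line.split()
        if not fields:
            continue  # skip blank lines
        if len(fields) != 2:
            raise ValueError(f"expected 'chromosome file', got: {line!r}")
        chromosome, path = fields
        table[chromosome] = path
    return table


# Hypothetical table content.
genotype_files = read_genotype_table([
    "1 genotypes.chr1.txt",
    "2 genotypes.chr2.txt",
    "X genotypes.chrX.txt",
])
print(genotype_files)
```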

Advanced Configuration

All fields have a default value:

Parameter           Default         Description
SECTION_NAME        <empty string>  Section name to add: empty for NCBI37, 'chr' for HG18 and HG19
GROUP_BY            <empty string>  Header field used to group input files. Grouped files are analyzed together for depth analysis and filtering. Multiple grouping fields are allowed
FILTER              <empty string>  Header field used to filter and its permitted values; syntax is FILTER <header_field> value1[|value2]
MERGE_POP           <empty string>  Populations to be merged together during GPT calling; syntax is MERGE_POP <pop1>[+<pop2>][+<pop3>]...
THUNDER_CHUNK       20000           Number of SNPs contained in each chunk (only if CHUNKER_FILE_TABLE is not defined)
THUNDER_OVERLAP     1000            Number of SNPs overlapping between two adjacent chunks (only if CHUNKER_FILE_TABLE is not defined)
CHUNKER_FILE_TABLE  <empty string>  File containing the regions to launch in parallel using thunder
Q                   10              Quality filter used in GPT to evaluate the postco parameter
INPUT               GENOME_BAM      Format of the files listed in the index file
GENOTYPE_FILE       <empty string>  File containing genotypes to be merged with GLF variants
  • GROUP_BY if multiple fields are specified, the pipeline groups files by the combination of their values

e.g.

FILE  PLATFORM LABNAME
fileA ILLUMINA SPH
fileB ILLUMINA NIH
fileC SOLID    SPH
fileD SOLID    NIH
fileE SOLID    NIH

In this case 4 groups will be generated: [ILLUMINA.SPH, ILLUMINA.NIH, SOLID.SPH, SOLID.NIH]
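The grouping in this example can be reproduced with a short sketch. It only illustrates how the group labels arise from the two fields; the function name is hypothetical and this is not the pipeline's own implementation.

```python
from collections import defaultdict

# The five index rows from the example above.
rows = [
    {"FILE": "fileA", "PLATFORM": "ILLUMINA", "LABNAME": "SPH"},
    {"FILE": "fileB", "PLATFORM": "ILLUMINA", "LABNAME": "NIH"},
    {"FILE": "fileC", "PLATFORM": "SOLID", "LABNAME": "SPH"},
    {"FILE": "fileD", "PLATFORM": "SOLID", "LABNAME": "NIH"},
    {"FILE": "fileE", "PLATFORM": "SOLID", "LABNAME": "NIH"},
]


def group_files(rows, group_by):
    """Group files by the dot-joined values of the GROUP_BY fields."""
    groups = defaultdict(list)
    for row in rows:
        label = ".".join(row[field] for field in group_by)
        groups[label].append(row["FILE"])
    return dict(groups)


groups = group_files(rows, ["PLATFORM", "LABNAME"])
for label, files in groups.items():
    print(label, files)
```

fileD and fileE share both PLATFORM and LABNAME, so they fall into the same SOLID.NIH group, which is why five files yield four groups.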

  • FILTER only one header field can be used as a filter; however, multiple values are allowed and must be separated by |

e.g.

FILTER POPULATION=TSI|CEU 

will exclude all BAM files not belonging to the TSI or CEU population
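The effect of this filter can be sketched as follows. The parser accepts both the = form used in the example and the whitespace-separated form given in the parameter table; the function names and file names are hypothetical.

```python
import re


def parse_filter(spec):
    """Split a FILTER value into the header field and its permitted values."""
    field, values = re.split(r"[=\s]+", spec.strip(), maxsplit=1)
    return field, set(values.split("|"))


def apply_filter(rows, spec):
    """Keep only the rows whose field value is among the permitted values."""
    field, allowed = parse_filter(spec)
    return [row for row in rows if row[field] in allowed]


# Hypothetical index rows.
rows = [
    {"FILE": "a.bam", "POPULATION": "TSI"},
    {"FILE": "b.bam", "POPULATION": "CEU"},
    {"FILE": "c.bam", "POPULATION": "CHB"},
]
kept = apply_filter(rows, "POPULATION=TSI|CEU")
print([row["FILE"] for row in kept])
```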

  • MERGE_POP multiple populations can be merged together

e.g.

MERGE_POP TSI+CEU+CHB 

will group together all BAM files belonging to CEU, CHB and TSI when calling GPT
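One way to picture the merge is to assign every file of a merged population to a shared label. This is a sketch only: using the MERGE_POP value itself as the label is an assumption, and the function name is hypothetical.

```python
def merged_population_label(population, merge_spec):
    """Return the label under which a population is called.

    Populations listed in merge_spec share one label (here the spec
    itself, an assumed naming choice); all others keep their own name.
    """
    members = merge_spec.split("+") if merge_spec else []
    return merge_spec if population in members else population


for population in ("TSI", "CEU", "CHB", "YRI"):
    print(population, "->", merged_population_label(population, "TSI+CEU+CHB"))
```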

  • Q only three values are currently allowed: 10 (postco 0.9), 20 (postco 0.99), 30 (postco 0.999)
  • CHUNKER_FILE_TABLE the file contains one region per line, in the format:
REGION_NAME CHR START STOP

One thunder run will be generated for each region. If CHUNKER_FILE_TABLE is specified, THUNDER_OVERLAP and THUNDER_CHUNK will be ignored.
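The two ways of defining chunks can be sketched as follows. The exact way the pipeline applies THUNDER_OVERLAP is not stated here, so the sketch assumes that each chunk holds at most THUNDER_CHUNK SNPs and shares THUNDER_OVERLAP SNPs with the previous one. The function names and the region line are hypothetical.

```python
def chunk_bounds(n_snps, chunk=20000, overlap=1000):
    """Return half-open (start, end) SNP index ranges with overlapping ends.

    Assumption: the next chunk starts `overlap` SNPs before the
    previous chunk ends.
    """
    if not 0 <= overlap < chunk:
        raise ValueError("overlap must be non-negative and smaller than chunk")
    if n_snps <= 0:
        return []
    bounds = []
    start = 0
    while True:
        end = min(start + chunk, n_snps)
        bounds.append((start, end))
        if end >= n_snps:
            return bounds
        start = end - overlap


def parse_region(line):
    """Parse one 'REGION_NAME CHR START STOP' line of a CHUNKER_FILE_TABLE."""
    name, chromosome, start, stop = line.split()
    return name, chromosome, int(start), int(stop)


print(chunk_bounds(50000))
print(parse_region("region1 20 1 5000000"))
```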

Output Files

Each file produced by the pipeline is placed inside the OUTPUT_PATH directory specified in the configuration file. Each step has one or more specific directories:

  1. UPDATE_GLF : creates .md5 files in the md5/ dir and .glf files in the glf/ dir
  2. SPLIT_GLF : creates .glf files split by chromosome in the chr/ dir
  3. CHECK_DEPTH : creates total_depth.*.txt and depth_per_site.*.txt in the OUTPUT_PATH dir; filtered .glf files are placed in the filter/ dir, separated by group if GROUP_BY is specified
  4. MERGE_GLF : creates one .glf per SAMPLE_NAME and places it in the merge/ dir
  5. GPT_FREQ : creates a .tin file for each population and uses the GPT/ dir
  6. MERGE_GENO : creates a .tin file in the merge_geno/ dir
  7. CHR_CHUNKER : creates a list of .tin files for each chromosome according to the number of SNPs
  8. RUN_THUNDER : places thunder output for each chunk in the thunder/ dir
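The directory layout listed above can be summarised in a small lookup, for example to check which directories a given set of steps should populate. This is a sketch with a hypothetical function name; no directory is listed for CHR_CHUNKER, so that entry is left empty.

```python
from pathlib import Path

# Sub-directories of OUTPUT_PATH used by each step, as listed above.
STEP_DIRS = {
    "UPDATE_GLF": ["md5", "glf"],
    "SPLIT_GLF": ["chr"],
    "CHECK_DEPTH": ["filter"],
    "MERGE_GLF": ["merge"],
    "GPT_FREQ": ["GPT"],
    "MERGE_GENO": ["merge_geno"],
    "CHR_CHUNKER": [],  # no dedicated directory is named in the list above
    "RUN_THUNDER": ["thunder"],
}


def expected_dirs(output_path, steps):
    """List the directories the given steps are expected to write to."""
    root = Path(output_path)
    return [root / name for step in steps for name in STEP_DIRS[step]]


for directory in expected_dirs("results", ["UPDATE_GLF", "RUN_THUNDER"]):
    print(directory)
```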