Input Files

Index File

The index file lists the BAM/GLF files to be analyzed. It is a simple tab-separated file, modeled on the 1000G sequence index file. Minimum requirements are:

A header line containing the names of the fields
A column containing the name of the file to be analyzed
A column containing the SAMPLE_NAME of the sample to be analyzed (used to merge files from different platforms)

The index file can contain additional fields that can be used to filter or group samples; there is no minimum or maximum number of fields.
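The minimum requirements above can be checked with a few lines of Python. This is a minimal sketch and not part of the pipeline: `read_index` is a hypothetical helper, and the file column name `FILE` is an assumption borrowed from the GROUP_BY example further down this page (only SAMPLE_NAME is named in the requirements).

```python
import csv
import io

# SAMPLE_NAME comes from the requirements above; FILE is an assumed name
# for the column holding the file to be analyzed.
REQUIRED_COLUMNS = ("FILE", "SAMPLE_NAME")


def read_index(handle):
    """Read a tab-separated index with a header line into a list of dicts."""
    reader = csv.DictReader(handle, delimiter="\t")
    header = reader.fieldnames or []
    missing = [name for name in REQUIRED_COLUMNS if name not in header]
    if missing:
        raise ValueError("index file lacks column(s): " + ", ".join(missing))
    return list(reader)


# Illustrative content only; the file and sample names are made up.
example = io.StringIO(
    "FILE\tSAMPLE_NAME\tPLATFORM\n"
    "fileA.bam\tsample1\tILLUMINA\n"
    "fileB.bam\tsample1\tSOLID\n"
)
rows = read_index(example)
```

Extra columns such as PLATFORM are simply carried along in each row, which is what makes them available for filtering and grouping.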

Configuration File

The configuration file can be specified with the -c option. If -c is not specified, the file "seq_pipeline.conf" is read.
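The fallback to the default file name can be illustrated as follows. This is only a sketch of the behaviour described above: the pipeline's own option parsing is not shown on this page, and `parse_args` is a hypothetical helper.

```python
import argparse


def parse_args(argv):
    """Return the configuration file path, defaulting to seq_pipeline.conf."""
    parser = argparse.ArgumentParser(description="hypothetical pipeline launcher")
    parser.add_argument("-c", dest="config", default="seq_pipeline.conf",
                        help="configuration file")
    return parser.parse_args(argv).config
```

Calling `parse_args([])` yields the default name, while `parse_args(["-c", "my.conf"])` yields the file given on the command line.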

Basic Configuration ( One population, one platform, no group or filter )

All the fields have their default value:

Parameter	Default	Description
STEPS	1,2,3,4,5,6,7,8	Which steps will be executed
N_CPU	1	How many parallel jobs to use
REFERENCE_FA	/data/local/ref/GATK/human_g1k_v37.fasta	Which reference file to use for samtools pileup
CMD_PREFIX	<empty string>	Command prefix, useful to run on a cluster, e.g. /usr/bin/mosrun -b -e -t
EXEC_PATH	<current dir>	Directory containing executable file
OUTPUT_PATH	<current dir>	Directory to place pipeline results
CHR	1-22,X,M,Y	Chromosomes to be analysed
INPUT	GENOME_BAM	Format of the files listed in the index file
GENOTYPE_FILE	<empty string>	File containing genotypes to be merged with GLF variants
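The default CHR value combines a numeric range with single names. How the pipeline expands it is not documented here; the sketch below assumes that `a-b` denotes an inclusive numeric range and that items are comma-separated. `expand_chromosomes` is a hypothetical helper.

```python
def expand_chromosomes(spec):
    """Expand a CHR value such as '1-22,X,M,Y' into a list of names."""
    names = []
    for item in spec.split(","):
        item = item.strip()
        start, dash, end = item.partition("-")
        if dash and start.isdigit() and end.isdigit():
            # Assumed: a numeric range includes both of its ends.
            names.extend(str(n) for n in range(int(start), int(end) + 1))
        elif item:
            names.append(item)
    return names


chromosomes = expand_chromosomes("1-22,X,M,Y")
```

Under these assumptions the default value expands to 25 chromosome names, in the order given.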

STEPS: numeric values can be replaced by a list of tags:

UPDATE_GLF = '1'
SPLIT_GLF = '2'
CHECK_DEPTH = '3'
MERGE_GLF = '4'
GPT_FREQ = '5'
MERGE_GENO = '6'
CHR_CHUNKER = '7'
RUN_THUNDER = '8'
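The tag table above maps directly onto a lookup. The following sketch normalises a STEPS value into step numbers; it assumes the value is comma-separated and that tags and numbers may be mixed, neither of which is stated on this page. `parse_steps` is a hypothetical helper.

```python
# Tag-to-number mapping copied from the list above.
STEP_TAGS = {
    "UPDATE_GLF": "1",
    "SPLIT_GLF": "2",
    "CHECK_DEPTH": "3",
    "MERGE_GLF": "4",
    "GPT_FREQ": "5",
    "MERGE_GENO": "6",
    "CHR_CHUNKER": "7",
    "RUN_THUNDER": "8",
}


def parse_steps(value):
    """Turn a STEPS value made of numbers and/or tags into step numbers."""
    steps = []
    for item in value.split(","):
        item = item.strip()
        if item in STEP_TAGS:
            steps.append(int(STEP_TAGS[item]))
        elif item in STEP_TAGS.values():
            steps.append(int(item))
        else:
            raise ValueError("unknown step: %r" % item)
    return steps
```

For example, `parse_steps("UPDATE_GLF,SPLIT_GLF,8")` selects steps 1, 2 and 8.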

INPUT can be in one of the following formats:
- GENOME_BAM : one BAM file containing all chromosomes
- GENOME_GLF : one GLF file containing all chromosomes
- CHR_BAM : multiple BAM files per sample, each containing one chromosome
- CHR_GLF : multiple GLF files per sample, each containing one chromosome

NOTE : if CHR_BAM or CHR_GLF is selected, the index file MUST contain a CHR column giving the chromosome of each file

GENOTYPE_FILE can be replaced by GENOTYPE_FILE_TABLE
- GENOTYPE_FILE_TABLE is a file containing a line for each chromosome
- Each line must contain the chromosome number and the corresponding file
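A GENOTYPE_FILE_TABLE could be read as sketched below. The separator is not specified on this page, so whitespace is assumed, with the chromosome first and the file second; the file names in the example are made up and `read_genotype_table` is a hypothetical helper.

```python
import io


def read_genotype_table(handle):
    """Map chromosome -> genotype file, one '<chromosome> <file>' pair per line."""
    table = {}
    for line_no, line in enumerate(handle, start=1):
        fields = line.split()
        if not fields:
            continue  # skip blank lines
        if len(fields) != 2:
            raise ValueError("line %d: expected '<chromosome> <file>'" % line_no)
        chrom, path = fields
        table[chrom] = path
    return table


# Illustrative content only.
example = io.StringIO("1 geno.chr1.txt\n2 geno.chr2.txt\nX geno.chrX.txt\n")
genotype_files = read_genotype_table(example)
```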

Advanced configuration

All the fields have their default value:

Parameter	Default	Description
SECTION_NAME	<empty string>	Section name prefix: empty for NCBI37; HG18 and HG19 use 'chr'
GROUP_BY	<empty string>	Which header field is used to group input files. Grouped files are analyzed together for depth analysis and filtering. Multiple grouping fields are allowed
FILTER	<empty string>	Which header field is used to filter and which values are permitted; syntax is FILTER <header_field> value1[|value2]
MERGE_POP	<empty string>	Which populations are merged together during GPT calling; syntax is MERGE_POP <pop1>[+<pop2>][+<pop3>]...
THUNDER_CHUNK	20000	Number of SNPs contained in each chunk (only if CHUNKER_FILE_TABLE is not defined)
THUNDER_OVERLAP	1000	Number of SNPs overlapping between two adjacent chunks (only if CHUNKER_FILE_TABLE is not defined)
CHUNKER_FILE_TABLE	<empty string>	File containing the regions to launch in parallel using thunder
Q	10	Quality filter used in GPT to determine the postco parameter
INPUT	GENOME_BAM	Format of the files listed in the index file
GENOTYPE_FILE	<empty string>	File containing genotypes to be merged with GLF variants

GROUP_BY: if multiple fields are specified, the pipeline groups files by the combination of their values

e.g.

FILE  PLATFORM LABNAME
fileA ILLUMINA SPH
fileB ILLUMINA NIH
fileC SOLID    SPH
fileD SOLID    NIH
fileE SOLID    NIH

In this case 4 groups will be generated: [ILLUMINA.SPH, ILLUMINA.NIH, SOLID.SPH, SOLID.NIH]
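The grouping in this example can be reproduced with a short Python sketch. It joins the values of the grouping fields with a dot, matching the group names listed above; `group_rows` is a hypothetical helper and not the pipeline's own code.

```python
from collections import defaultdict

# The example index from above, one dict per line.
rows = [
    {"FILE": "fileA", "PLATFORM": "ILLUMINA", "LABNAME": "SPH"},
    {"FILE": "fileB", "PLATFORM": "ILLUMINA", "LABNAME": "NIH"},
    {"FILE": "fileC", "PLATFORM": "SOLID", "LABNAME": "SPH"},
    {"FILE": "fileD", "PLATFORM": "SOLID", "LABNAME": "NIH"},
    {"FILE": "fileE", "PLATFORM": "SOLID", "LABNAME": "NIH"},
]


def group_rows(rows, group_by):
    """Group file names by the dot-joined values of the GROUP_BY fields."""
    groups = defaultdict(list)
    for row in rows:
        label = ".".join(row[field] for field in group_by)
        groups[label].append(row["FILE"])
    return dict(groups)


groups = group_rows(rows, ["PLATFORM", "LABNAME"])
```

Here fileD and fileE share both values, so they end up in the same SOLID.NIH group, which is why five files give four groups.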

FILTER: only one header field can be used as a filter, but multiple values are allowed, separated by |

e.g.

FILTER POPULATION=TSI|CEU

excludes all BAM files that do not belong to the TSI or CEU population
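The effect of such a filter can be sketched in Python. The sketch parses the `FIELD=value1|value2` form used in the example; the parameter table writes the syntax with a space instead of `=`, and which form the pipeline actually accepts is not settled on this page. `parse_filter` and `apply_filter` are hypothetical helpers, and the rows are made-up data.

```python
def parse_filter(value):
    """Split 'FIELD=value1|value2' into (field, set of permitted values)."""
    field, sep, values = value.partition("=")
    if not sep or not field.strip() or not values.strip():
        raise ValueError("expected FIELD=value1[|value2]: %r" % value)
    return field.strip(), {v.strip() for v in values.split("|")}


def apply_filter(rows, value):
    """Keep only the rows whose filter field holds a permitted value."""
    field, allowed = parse_filter(value)
    return [row for row in rows if row.get(field) in allowed]


# Illustrative index rows only.
rows = [
    {"FILE": "a.bam", "POPULATION": "TSI"},
    {"FILE": "b.bam", "POPULATION": "CHB"},
    {"FILE": "c.bam", "POPULATION": "CEU"},
]
kept = apply_filter(rows, "POPULATION=TSI|CEU")
```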

MERGE_POP: multiple populations can be merged together

e.g.

MERGE_POP TSI+CEU+CHB

groups together all BAM files belonging to CEU, CHB and TSI when calling GPT
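A MERGE_POP value can be split on `+` as in the sketch below. The label under which merged populations are called is a made-up convention for illustration only; `parse_merge_pop` and `calling_group` are hypothetical helpers.

```python
def parse_merge_pop(value):
    """Split a MERGE_POP value such as 'TSI+CEU+CHB' into population names."""
    populations = [p.strip() for p in value.split("+")]
    if any(not p for p in populations):
        raise ValueError("empty population name in %r" % value)
    return populations


def calling_group(population, merged):
    """Return the (hypothetical) label under which a population is called."""
    return "+".join(merged) if population in merged else population


merged = parse_merge_pop("TSI+CEU+CHB")
```

With this value, files from CEU, CHB and TSI share one calling group, while any other population keeps its own.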

Q: only three values are currently allowed: 10 (postco 0.9), 20 (postco 0.99), 30 (postco 0.999)
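The three permitted values can be kept in a small lookup table, as sketched below. The listed postco values coincide with the Phred-style relation 1 - 10^(-Q/10); whether GPT derives them that way is not stated here, so the lookup rather than the formula is used. `postco_for_q` is a hypothetical helper.

```python
# Permitted Q values and their postco, copied from the line above.
Q_TO_POSTCO = {10: 0.9, 20: 0.99, 30: 0.999}


def postco_for_q(q):
    """Return the postco value for a permitted Q, or raise ValueError."""
    try:
        return Q_TO_POSTCO[int(q)]
    except KeyError:
        raise ValueError("Q must be one of 10, 20, 30, got %r" % q)
```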

CHUNKER_FILE_TABLE: the file contains one region per line, in the format:

REGION_NAME CHR START STOP

One thunder run is generated for each region. If CHUNKER_FILE_TABLE is specified, THUNDER_OVERLAP and THUNDER_CHUNK are ignored.
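When no CHUNKER_FILE_TABLE is given, chunks are defined by THUNDER_CHUNK and THUNDER_OVERLAP. The exact boundary handling of the pipeline's chunker is not documented on this page; the sketch below shows one plausible reading, in which each chunk holds at most THUNDER_CHUNK SNPs and shares THUNDER_OVERLAP SNPs with the previous chunk. `chunk_ranges` is a hypothetical helper.

```python
def chunk_ranges(n_snps, chunk=20000, overlap=1000):
    """Return (start, stop) SNP index pairs, with stop exclusive.

    Adjacent chunks share `overlap` SNPs; the last chunk may be shorter.
    """
    if not 0 <= overlap < chunk:
        raise ValueError("overlap must be >= 0 and smaller than chunk")
    ranges = []
    start = 0
    while start < n_snps:
        stop = min(start + chunk, n_snps)
        ranges.append((start, stop))
        if stop == n_snps:
            break
        # Step back so that the next chunk re-uses the last `overlap` SNPs.
        start = stop - overlap
    return ranges
```

With the default values, a chromosome with 45000 SNPs would be split into three chunks: SNPs 0-19999, 19000-38999 and 38000-44999.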

OUTPUT FILES

Each file produced by the pipeline is placed inside the OUTPUT_PATH directory specified in the configuration file. Each step has one or more specific directories:

UPDATE_GLF : creates .md5 files in the md5/ dir and .glf files in the glf/ dir
SPLIT_GLF : creates .glf files split by chromosome in the chr/ dir
CHECK_DEPTH : creates total_depth.*.txt and depth_per_site.*.txt in the OUTPUT_PATH dir; filtered .glf files are placed in the filter/ dir, separated by group if GROUP_BY is specified
MERGE_GLF : creates one .glf per SAMPLE_NAME and places it in the merge/ dir
GPT_FREQ : creates a .tin file for each population and uses the GPT/ dir
MERGE_GENO : creates a .tin file in the merge_geno/ dir
CHR_CHUNKER : creates a list of .tin files for each chromosome according to the number of SNPs
RUN_THUNDER : places thunder output for each chunk in the thunder/ dir

Variant Call Pipeline
