Variant Call Pipeline
Input Files
Index File
The index file lists the BAM/GLF files to be analyzed. It is a simple tab-separated file, modeled on the 1000G sequence index file. The minimum requirements are:
- A header containing the field names
- A column containing the name of the file to be analyzed
- A column containing the SAMPLE_NAME of the sample to be analyzed (used to merge files from different platforms)
The index file can contain additional fields, which can be used to filter or group samples; there is no minimum or maximum number of fields.
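The requirements above can be checked with a few lines of Python. This is only a sketch: SAMPLE_NAME comes from the text, while the column name FILE, the sample name sample1, and the helper `read_index` are assumptions made for illustration and are not part of the pipeline.

```python
import csv
import io

# SAMPLE_NAME is named in the text above; "FILE" is an assumed name for the
# column holding the BAM/GLF path.
REQUIRED_COLUMNS = ("FILE", "SAMPLE_NAME")


def read_index(handle, required=REQUIRED_COLUMNS):
    """Read a tab-separated index file and return its rows as dictionaries."""
    reader = csv.DictReader(handle, delimiter="\t")
    header = reader.fieldnames or []
    missing = [name for name in required if name not in header]
    if missing:
        raise ValueError("index file is missing column(s): " + ", ".join(missing))
    return list(reader)


# Hypothetical index with one extra, optional column (PLATFORM).
example = io.StringIO(
    "FILE\tSAMPLE_NAME\tPLATFORM\n"
    "fileA.bam\tsample1\tILLUMINA\n"
    "fileB.bam\tsample1\tSOLID\n"
)
rows = read_index(example)
print(len(rows), "files for samples", sorted({row["SAMPLE_NAME"] for row in rows}))
```

Both rows share the same SAMPLE_NAME, which is the case the text describes: two files from different platforms that belong to one sample.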
Configuration File
The configuration file can be specified with the -c option. If -c is not specified, the file "seq_pipeline.conf" is read.
Basic Configuration (one population, one platform, no group or filter)
All fields have a default value:
Parameter | Default | Description |
---|---|---|
STEPS | 1,2,3,4,5,6,7,8 | Steps to be executed |
N_CPU | 1 | Number of parallel jobs to run |
REFERENCE_FA | /data/local/ref/GATK/human_g1k_v37.fasta | Reference file to use for samtools pileup |
CMD_PREFIX | <empty string> | Command prefix, useful for running on a cluster (e.g. /usr/bin/mosrun -b -e -t) |
EXEC_PATH | <current dir> | Directory containing the executable files |
OUTPUT_PATH | <current dir> | Directory in which to place pipeline results |
CHR | 1-22,X,M,Y | Chromosomes to be analysed |
INPUT | GENOME_BAM | Format of the files listed in the index file |
GENOTYPE_FILE | <empty string> | File containing genotypes to be merged with GLF variants |
- STEPS: numeric values can be replaced by a list of tags:
- UPDATE_GLF = '1'
- SPLIT_GLF = '2'
- CHECK_DEPTH = '3'
- MERGE_GLF = '4'
- GPT_FREQ = '5'
- MERGE_GENO = '6'
- CHR_CHUNKER = '7'
- RUN_THUNDER = '8'
- INPUT can be one of the following formats:
- GENOME_BAM : one BAM file containing all chromosomes
- GENOME_GLF : one GLF file containing all chromosomes
- CHR_BAM : multiple BAM files per sample, each containing one chromosome
- CHR_GLF : multiple GLF files per sample, each containing one chromosome
NOTE : if CHR_BAM or CHR_GLF is selected, the index file MUST contain a CHR column for each file
- GENOTYPE_FILE can be replaced by GENOTYPE_FILE_TABLE
- GENOTYPE_FILE_TABLE is a file containing one line per chromosome
- Each line must contain the chromosome number and the corresponding file
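As an illustration of how the STEPS and CHR values above might be interpreted, the following sketch maps step tags to their numbers and expands a chromosome list. The helpers `parse_steps` and `parse_chr` are hypothetical. Mixing tags and numbers in one value, returning steps in ascending order, and reading "1-22" as an inclusive range are assumptions, not documented pipeline behaviour.

```python
# Tag-to-number mapping copied from the STEPS tag list above.
STEP_TAGS = {
    "UPDATE_GLF": 1, "SPLIT_GLF": 2, "CHECK_DEPTH": 3, "MERGE_GLF": 4,
    "GPT_FREQ": 5, "MERGE_GENO": 6, "CHR_CHUNKER": 7, "RUN_THUNDER": 8,
}


def parse_steps(value):
    """Turn a STEPS value such as '1,SPLIT_GLF,3' into sorted step numbers."""
    steps = set()
    for token in value.split(","):
        token = token.strip()
        if token in STEP_TAGS:
            steps.add(STEP_TAGS[token])
        elif token.isdigit() and 1 <= int(token) <= 8:
            steps.add(int(token))
        else:
            raise ValueError("unknown step: %r" % token)
    return sorted(steps)


def parse_chr(value):
    """Expand a CHR value such as '1-22,X,M,Y' into a list of chromosome names."""
    chromosomes = []
    for token in value.split(","):
        token = token.strip()
        if "-" in token:
            # Assumed meaning: an inclusive numeric range.
            first, last = token.split("-")
            chromosomes.extend(str(n) for n in range(int(first), int(last) + 1))
        else:
            chromosomes.append(token)
    return chromosomes


print(parse_steps("UPDATE_GLF,SPLIT_GLF,CHECK_DEPTH"))
print(parse_chr("1-22,X,M,Y"))
```

With the default CHR value, the expansion yields 25 names: 1 to 22, then X, M and Y.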
Advanced Configuration
All fields have a default value:
Parameter | Default | Description |
---|---|---|
SECTION_NAME | <empty string> | Section name to add: empty for NCBI37; HG18 and HG19 use 'chr' |
GROUP_BY | <empty string> | Header field used to group input files. Grouped files are analyzed together for depth analysis and filtering. Multiple grouping fields are allowed |
FILTER | <empty string> | Header field used for filtering and its permitted values; syntax is FILTER <header_field> value1[|value2] |
MERGE_POP | <empty string> | Populations to be merged together during GPT calling; syntax is MERGE_POP <pop1>[+<pop2>][+<pop3>]... |
THUNDER_CHUNK | 20000 | Number of SNPs in each chunk (only if CHUNKER_FILE_TABLE is not defined) |
THUNDER_OVERLAP | 1000 | Number of SNPs shared by two adjacent chunks (only if CHUNKER_FILE_TABLE is not defined) |
CHUNKER_FILE_TABLE | <empty string> | File containing the regions to run in parallel with thunder |
Q | 10 | Quality filter used in GPT to evaluate the postco parameter |
INPUT | GENOME_BAM | Format of the files listed in the index file |
GENOTYPE_FILE | <empty string> | File containing genotypes to be merged with GLF variants |
- GROUP_BY: if multiple fields are specified, the pipeline groups files by each combination of the specified fields
e.g.
FILE | PLATFORM | LABNAME |
---|---|---|
fileA | ILLUMINA | SPH |
fileB | ILLUMINA | NIH |
fileC | SOLID | SPH |
fileD | SOLID | NIH |
fileE | SOLID | NIH |
In this case, grouping by PLATFORM and LABNAME generates 4 groups: [ILLUMINA.SPH, ILLUMINA.NIH, SOLID.SPH, SOLID.NIH]
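The grouping in this example can be reproduced with a short sketch. The function `group_rows` is a hypothetical helper, not the pipeline's own code; it assumes group names are built by joining the field values with a dot, as in the group list above.

```python
from collections import defaultdict

# Rows of the example index: FILE, PLATFORM, LABNAME.
rows = [
    {"FILE": "fileA", "PLATFORM": "ILLUMINA", "LABNAME": "SPH"},
    {"FILE": "fileB", "PLATFORM": "ILLUMINA", "LABNAME": "NIH"},
    {"FILE": "fileC", "PLATFORM": "SOLID", "LABNAME": "SPH"},
    {"FILE": "fileD", "PLATFORM": "SOLID", "LABNAME": "NIH"},
    {"FILE": "fileE", "PLATFORM": "SOLID", "LABNAME": "NIH"},
]


def group_rows(rows, fields):
    """Group file names by the dot-joined values of the given header fields."""
    groups = defaultdict(list)
    for row in rows:
        key = ".".join(row[field] for field in fields)
        groups[key].append(row["FILE"])
    return dict(groups)


groups = group_rows(rows, ["PLATFORM", "LABNAME"])
for name, files in groups.items():
    print(name, files)
```

Five files fall into four groups because fileD and fileE share both PLATFORM and LABNAME, so both end up in SOLID.NIH.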
- FILTER: only one header field can be used as a filter; however, multiple values are allowed and must be separated by |
e.g.
FILTER POPULATION=TSI|CEU
will exclude all BAM files not belonging to the TSI or CEU population
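A minimal sketch of this filter, following the `POPULATION=TSI|CEU` form used in the example: `parse_filter` and `apply_filter` are hypothetical helpers, and the three sample rows are invented for illustration only.

```python
def parse_filter(spec):
    """Split a spec such as 'POPULATION=TSI|CEU' into (field, allowed values)."""
    field, _, values = spec.partition("=")
    allowed = {value.strip() for value in values.split("|") if value.strip()}
    return field.strip(), allowed


def apply_filter(rows, spec):
    """Keep only the rows whose value for the filter field is permitted."""
    field, allowed = parse_filter(spec)
    return [row for row in rows if row.get(field) in allowed]


# Hypothetical index rows.
rows = [
    {"FILE": "fileA", "POPULATION": "TSI"},
    {"FILE": "fileB", "POPULATION": "CEU"},
    {"FILE": "fileC", "POPULATION": "CHB"},
]

kept = apply_filter(rows, "POPULATION=TSI|CEU")
print([row["FILE"] for row in kept])
```

Here fileC is dropped because CHB is not among the permitted values.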
- MERGE_POP: multiple populations can be merged together
e.g.
MERGE_POP TSI+CEU+CHB
will group together all BAM files belonging to CEU, CHB and TSI when calling GPT
- Q: only three values are currently allowed: 10 (postco 0.9), 20 (postco 0.99), 30 (postco 0.999)
- CHUNKER_FILE_TABLE: the file contains one region per line, in the format:
REGION_NAME | CHR | START | STOP |
One thunder run will be generated for each region. If CHUNKER_FILE_TABLE is specified, THUNDER_OVERLAP and THUNDER_CHUNK are ignored.
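When CHUNKER_FILE_TABLE is not given, chunks are defined by THUNDER_CHUNK and THUNDER_OVERLAP. The exact boundaries the pipeline uses are not described here, so the sketch below shows only one plausible reading: each chunk holds up to THUNDER_CHUNK SNPs and shares THUNDER_OVERLAP SNPs with the previous one. The function `make_chunks` is a hypothetical helper.

```python
def make_chunks(n_snps, chunk=20000, overlap=1000):
    """Return (start, stop) SNP index pairs, stop exclusive, covering n_snps."""
    if not 0 <= overlap < chunk:
        raise ValueError("overlap must be non-negative and smaller than chunk")
    step = chunk - overlap  # distance between the starts of adjacent chunks
    chunks = []
    start = 0
    while start < n_snps:
        stop = min(start + chunk, n_snps)
        chunks.append((start, stop))
        if stop == n_snps:
            break  # the last chunk reached the end of the chromosome
        start += step
    return chunks


# A chromosome with 50000 SNPs and the default settings.
for start, stop in make_chunks(50000):
    print(start, stop)
```

With the defaults, 50000 SNPs give three chunks, (0, 20000), (19000, 39000) and (38000, 50000), and each adjacent pair shares 1000 SNPs.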
Output Files
Each file produced by the pipeline is placed inside the OUTPUT_PATH directory specified in the configuration file. Each step has one or more specific directories:
- UPDATE_GLF : creates .md5 files in the md5/ dir and .glf files in the glf/ dir
- SPLIT_GLF : creates .glf files split by chromosome in the chr/ dir
- CHECK_DEPTH : creates total_depth.*.txt and depth_per_site.*.txt in the OUTPUT_PATH dir; filtered .glf files are placed in the filter/ dir, separated by group if GROUP_BY is specified
- MERGE_GLF : creates one .glf per SAMPLE_NAME and places it in the merge/ dir
- GPT_FREQ : creates a .tin file for each population and uses the GPT/ dir
- MERGE_GENO : creates a .tin file in the merge_geno/ dir
- CHR_CHUNKER : creates a list of .tin files for each chromosome according to the number of SNPs
- RUN_THUNDER : places the thunder output for each chunk in the thunder/ dir