Tutorial: Low Pass Sequence Analysis
- 1 Sequence Analysis Workshop
Sequence Analysis Workshop
In this workshop, we will illustrate some of the essential steps in the analysis of next generation sequence data. As part of the process, you will learn about many of the file formats commonly used to store next generation sequence data.
For questions or comments please contact Carlo Sidore.
We will start with a set of short sequence reads and associated base quality scores (stored in a fastq file), find the most likely genomic location for each read (producing a BAM file), generate an initial list of polymorphic sites and genotypes (stored in a VCF file) and use haplotype information to refine these genotypes (resulting in an updated VCF file).
The dataset for the tutorial can be downloaded here
Our dataset consists of 10 individuals sequenced by the 1000 Genomes Project. As with other 1000 Genomes Project samples, these individuals have been sequenced to an average depth of about 4x.
To conserve time and disk-space, our analysis will focus on a small region of chromosome 20, from 33,500,000 to 33,600,000 bp. We will first map reads for 3 individuals. We will then perform the variant calling by combining the results with mapped reads from the other 7 individuals to generate a list of polymorphic sites and estimate genotypes at each of these sites. We will compare the results of the variant calling on the low pass dataset with results from the exome sequencing of the same individual. Finally we will use the LD refinement to increase the accuracy of our genotypes.
The example dataset will be availabe in the folder Workshops/Abecasis/Workshop_lowpass/, so let's move there
> cd D2.WS03_NGS_variant_calling-CarloSidore/Workshop_low_pass/
Building an Index for Short Read Alignment
To quickly place short reads along the genome, BWA and other read mappers typically build a word index for the genome. This index lists the location of particular short words along the genome and can be used to seed and then extend particular matches.
The sequence index is typically not compatible across different BWA versions. To rebuild the sequence index, issue the following commands (they will take approximately 2 minutes to complete):
> bwa index -a is ref/human_g1k_v37_chr20.fa > samtools faidx ref/human_g1k_v37_chr20.fa
A quick look to the fastq files
The sequencers provides unmapped reads which are stored in fastq file. For this workshop, you will find DNA sequence reads for 3 samples in fastq format. To conserve disk space, the files have been compressed with gzip but, since fastq is a simple text format, you can easily view the contents of the files using a command like:
> zcat fastq/HG00111.lowcoverage.chr20.smallregion_1.fastq.gz | less
A fastq file consists of a series of multi-line records. Each record starts with a read name, followed by a DNA sequencing, a separator line, and a set of per base quality scores (expressed as Phred score). Base quality scores estimate the probability of error at each sequenced base (a base quality of 10 denotes an error probability of 10%, base quality 20 denotes 1% error probability and base quality 30 denotes 0.1% error probability). These error probabilities are each encoded in a single character (for compactness) and can be decoded using an ascii table - you can look up the ascii code for each base and subtract 33 to get base quality. By inspecting the fastq file you should be able to learn about the length of reads being mapped and their base qualities (is base quality typically higher at the start or end of each read). You can find more details about the fastq format here Wikipedia fastq format. For each sample you will find two fastq files, since the 1000G samples are sequenced in paired end. In paired end sequencing, each DNA fragment has been sequenced twice, once in the forward and once in the reverse direction.
- Q1: What is the base quality of the fifth nucleotide of the third read in the file HG00111.lowcoverage.chr20.smallregion_1.fastq.gz?
Mapping reads to the genome
There are many different tools for mapping DNA sequence reads. One of the most commonly used tools is BWA, developed by Heng Li and Richard Durbin at the Sanger Center. As with other read mappers, BWA first builds an index of the reference genome and then uses this index to quickly assign each sequence read to a genomic location.
To learn more about BWA, you should visit the BWA website at http://bio-bwa.sourceforge.net
Here, we will use the GotCloud
align command to run BWA to find the most likely sequence location for each read. For time reasons we will map only 3 samples, and you will find the remaining 7 samples in the folder bams/.
align command requires the configuration file, which contains the index file and the files to be used as reference.
> cat config/gotCloud.align.conf
INDEX_FILE = index/gotCloud.align.index ################### # References REF_DIR = ref AS = NCBI37 REF = $(REF_DIR)/human_g1k_v37_chr20.fa DBSNP_VCF = $(REF_DIR)/dbsnp_135.b37.chr20.smallregion.vcf.gz HM3_VCF = $(REF_DIR)/hapmap_3.3.b37.chr20.smallregion.sites.vcf.gz
You can find the index file containing the samples to be used in the index folder
> cat index/gotCloud.align.index
MERGE_NAME FASTQ1 FASTQ2 RGID SAMPLE LIBRARY CENTER PLATFORM HG00108 fastq/HG00108.lowcoverage.chr20.smallregion_1.fastq.gz fastq/HG00108.lowcoverage.chr20.smallregion_2.fastq.gz HG00108 HG00108 HG00108 1000G ILLUMINA HG00111 fastq/HG00111.lowcoverage.chr20.smallregion_1.fastq.gz fastq/HG00111.lowcoverage.chr20.smallregion_2.fastq.gz HG00111 HG00111 HG00111 1000G ILLUMINA HG00120 fastq/HG00120.lowcoverage.chr20.smallregion_1.fastq.gz fastq/HG00120.lowcoverage.chr20.smallregion_2.fastq.gz HG00120 HG00120 HG00120 1000G ILLUMINA
We are now ready to align our fastq files. Since we are aligning only 100kb in 3 samples, this step will require more or less 2 minutes.
> gotcloud align --conf config/gotCloud.align.conf --outDir align --basePrefix .
File sizes of 6 FASTQ input files referenced in '/net/sardinia/progenia/csidore/Bertinoro/testdir/index /gotCloud.align.index' = 0.01 GB Total temp space will be about 0.05 GB Be sure you have enough space to hold all this data Created /net/sardinia/progenia/csidore/Bertinoro/testdir/align/Makefiles/align_HG00111.Makefile Created /net/sardinia/progenia/csidore/Bertinoro/testdir/align/Makefiles/align_HG00108.Makefile Created /net/sardinia/progenia/csidore/Bertinoro/testdir/align/Makefiles/align_HG00120.Makefile --------------------------------------------------------------------- Waiting while samples are processed... Processing finished in 77 secs with no errors reported
You can now see the bam files (HG00XXX.recal.bam) you just created in :
> ls align/bams/
In the same folder you can also see the .bai files (the index files used to quickly access every region of the genome) and some other files specific to the gotCloud pipeline.
align command maps the reads to the genome, marks duplicate reads, and recalibrates quality scores to allow better error estimation in genotype evaluation.
GotCloud also provides some statistics on the identity verification and contamination evaluation by using verifyBamID and some useful quality statistics by using QPLOT. Let's take a look at some quality statistics for the sample HG00108
> cat align/QCFiles/HG00108.qplot.stats
- Q2. Which is the mean depth of the sample HG00108? And the mapping rate?
Browsing Alignment Results
You can view the contents of the alignment at any location using the
samtools tview commands. While
tview generates prettier output,
it is not compatible with all screens. For example, to view reads overlapping
starting at position 33,350,971 on chromosome 20, we could run:
> samtools tview align/bams/HG00111.recal.bam ref/human_g1k_v37_chr20.fa
Then, type "g 20:33350971" to move the visualization there and press "." to hide/unhide the nucleotide equal to the reference.
The first line of the view indicates the positions in the genome.
The second line is the reference genome.
The third line is the reconstruction of the sequence of the individual HG00111 using the reads contained in the bam file. Note the candidate variant at position 33350987. Since at this position there are 4 reads with C and 2 read with T (equal to the reference) the most likely genotype is C/T indicated as Y (according to IUPAC )
Note: The total count of C is 4, but 2 of them are duplicates. Similarly the number of T is 2 but one is an orphan read (underlined). Samtools tview shows them, but they will discarded from further analysis
The 4th and following lines represent the reads contained in the bam files, each group of letters is a read.
You can play with the visualization help to set different way to visualize nucleotides, base qualities, mapping qualities and so on.
Press "?" in the tview screen to show the help and the available options
Press "q" to exit
Another way to check the reads covering a position is to use
The header of the mpileup format is "CHR POS REF DEPTH BASES QUALITIES".
> samtools view -uh align/bams/HG00111.recal.bam 20:33350987| samtools mpileup - | grep 33350987
- Q3: What is the depth of position 33594959 for the sample HG00108? What would be the most likely genotype looking at the reads? (You can answer this question by using tview or mpileup.)
Initial set of variant calls
We can also use GotCloud
snpcall to identify the SNPs present in our bam files and generate a VCF file containing the variant calls.
The variant calling pipeline has multiple built-in steps to generate BAMs:
- Filter out reads with low mapping quality
- Per Base Alignment Quality Adjustment (BAQ)
- Resolve overlapping paired end reads
- Generate genotype likelihood files
- Perform variant calling
- Extract features from variant sites
- Perform variant filtering
Let's start the variant calling with:
> gotcloud snpcall --conf config/gotCloud.snpcall.conf --outDir snpcall
This step will create a Makefile containing the commands to be executed and their mutual dependencies to facilitate the command parallelization.
Now run the Makefiles as gotcloud suggests and continue with the workshop while gotCloud executes (it will take 5-10 minutes):
> make -f snpcall/umake.Makefile &> snpcall.log &
Note that, in this case we are using a single CPU to run the snp calling. If you have multiple CPUs you can run gotcloud in parallel using multiple CPUs by setting the parameter "-j".
While waiting for gotCloud to take care of all these steps, we will take a look to the configuration and index file.
> cat config/gotCloud.snpcall.conf
CHRS = 20 # you can add here more chromosomes BAM_INDEX = index/gotCloud.snpcall.index ############ # References REF_ROOT = ref # REF = $(REF_ROOT)/human_g1k_v37_chr20.fa INDEL_PREFIX = $(REF_ROOT)/1kg.pilot_release.merged.indels.sites.hg19 DBSNP_VCF = $(REF_ROOT)/dbsnp_135.b37.chr20.smallregion.vcf.gz HM3_VCF = $(REF_ROOT)/hapmap_3.3.b37.sites.chr20.smallregion.vcf.gz OMNI_VCF = $(REF_ROOT)/1000G_omni2.5.b37.sites.PASS.chr20.smallregion.vcf.gz
In this case, we want to run a single chromosome (20) and we are using a different index file to include all the 10 samples in the workshop dataset
> cat index/gotCloud.snpcall.index
HG00108 1000G snpcall_bams/HG00108.lowcoverage.chr20.smallregion.bam HG00111 1000G snpcall_bams/HG00111.lowcoverage.chr20.smallregion.bam HG00120 1000G snpcall_bams/HG00120.lowcoverage.chr20.smallregion.bam HG00096 1000G snpcall_bams/HG00096.lowcoverage.chr20.smallregion.bam HG00100 1000G snpcall_bams/HG00100.lowcoverage.chr20.smallregion.bam HG00103 1000G snpcall_bams/HG00103.lowcoverage.chr20.smallregion.bam HG00114 1000G snpcall_bams/HG00114.lowcoverage.chr20.smallregion.bam HG00116 1000G snpcall_bams/HG00116.lowcoverage.chr20.smallregion.bam HG00117 1000G snpcall_bams/HG00117.lowcoverage.chr20.smallregion.bam HG00119 1000G snpcall_bams/HG00119.lowcoverage.chr20.smallregion.bam
Results of the variant calling are stored in vcf format, for a complete description of this format, you can take a look at VCF Format Specifications
The first section of the vcf is the meta-information, every line in this section starts with "##". You can find some useful information about the data that we are going to analyse and the meaning of the fields.
After the meta-information, we can see the header line starting with "#". This line contains the column description and the identifiers of the samples included in the variant calling.
Finally, in the data section we find a line for each of the variants found. Each line has 8 fixed fields ( CHROM POS ID REF ALT QUAL FILTER INFO ) followed by a column for each individual included in the analysis.
The INFO column reports a set of features, as described in the meta-information section, and these features help in evaluating the quality and the frequency of a variant. You may also add or customize your own features and report them in the meta-information section and in this column.
The FORMAT field describes the format of each genotype in the sample genotype columns, again you can see some information about their meaning in the meta-information section.
At this point, gotcloud should have completed the snp calling and generated the file:
> ls snpcall/split/chr20/subset.OK
If not, the "ls" command will report an error, just wait a little bit more:
> ls snpcall/split/chr20/subset.OK ls: cannot access snpcall/split/chr20/subset.OK: No such file or directory
Take some time to inspect the meta-information and the header sections:
> zless -nS snpcall/vcfs/chr20/chr20.filtered.vcf.gz
Let's consider a sample genotyping at the position 33514465 (if needed, check the FORMAT fields in the meta-information section in the vcf to understand the data format)
> zgrep -E "CHROM|33514465" snpcall/vcfs/chr20/chr20.filtered.vcf.gz | cut -f 2,4,5,9,14
POS REF ALT FORMAT HG00111 33514465 T C GT:GD:GQ:PL 1/1:3:10:117,9,0
- Q4: What is the genotype at this position? (T/T,C/T or C/C?) How many reads are covering this position? Is this consistent with the result you can obtain by using tview or mpileup?
- Q5: Which is the "Total Depth at Site" for the variant at position 33500378?
- Q6: How many alternate alleles are found at position 33505937?
- Q7: Is the genotype of HG00108 at position 33594959 consistent with what you predicted in Q3? (be careful about choosing the right column with the "cut" command)
- Q8: How many variant sites were detected in this dataset? Try a command like this one:
> zgrep -vE ^# snpcall/vcfs/chr20/chr20.filtered.vcf.gz | wc -l
(The grep command line excludes all lines beginning with # and then the wc command counts the number of lines in the file).
Genotype Refinement Using Linkage Disequilibrium Information
The initial set of genotype calls is generated examining a single individual at a time. These calls are typically quite good for deep sequencing data, but much less accurate for low pass sequence data.
For instance , let's check the genotype of HG00111 at position 33514465, extracting the information from a vcf generated with gotCloud and exome sequencing on the sample HG00111. This time rather than using "zgrep", that could be very slow for large dataset, we will use "tabix" which provides quick access to every line of the vcf.
We first need to index the .vcf file
> tabix exome/vcfs/chr20/chr20.filtered.vcf.gz
This command creates a .vcf.gz.tbi, the index file which contains position information and allows for quick access.
We can now pick the lines that we want with the notation "CHR:FIRSTPOS-LASTPOS". Since we are interested in a single position, FIRSTPOS and LASTPOS are the same
> tabix -h exome/vcfs/chr20/chr20.filtered.vcf.gz 20:33514465-33514465| tail -n2| cut -f 2,10
POS HG00111 33514465 0/1:16:85:137,0,82
The pileup of this position from the bam file reports 4T's and 12C's
- Q9: Is this genotype concordant with the one found in the low pass variant calling? Which genotype do you think is more accurate?
- Q10: What can be the reason of the genotype discordance?
Low pass sequencing data, however, can be greatly improved by models that combine information across sites and individuals.
Here is how that might work:
> gotcloud ldrefine --conf config/gotCloud.snpcall.conf --outDir snpcall --numjobs 1
Again, you can review the contents of the updated VCF file using the zless command:
> zless snpcall/thunder/chr20/1000G/thunder/chr20.filtered.PASS.beagled.1000G.thunder.vcf.gz
- Q11: Compare the genotype of the sample HG00111 at position 33514465 in the low-pass and in the LD-refined VCF. Did something change? Why?
- Q12: Check position 33523840 in the low pass VCF for sample HG00111.
- What is the genotype assigned by the variant caller?
- What is your predicted genotype according to the reads piling up at this site?
- What is the genotype in the exome VCF?
- What is the genotype after LD refinement?