Changes

5,491 bytes added , 10:23, 2 February 2017

→‎Getting started

Line 1: Line 1: +

'''Note:''' the latest version of this practical is available at: [[SeqShop: Estimates of Genetic Ancestry Practical]]

+

* The ones here is the original one from the June workshop (updated to be run from elsewhere)

+

== Introduction ==

See the [[Media:LASER-tutorial.pdf|tutorial slides]] for an introduction of the LASER analysis workflow, input/output file formats, and usage of the LASER software.

−

The main purpose of this page is to provide step-by-step command lines for using LASER to estimate ancestry of 6 targeted sequenced samples (2 HapMap trios) in a principal component space generated using genome-wide SNP data from the Human Genome Diversity Project (HGDP).

+

The main purpose of this page is to provide step-by-step command lines for using LASER to estimate ancestry of 6 targeted sequenced samples (2 HapMap trios) in a principal component space generated using genome-wide SNP data from the Human Genome Diversity Project (HGDP). The HGDP reference panel contains genotype data across 632,958 autosomal loci for 938 individuals from 53 populations worldwide.

−

For more details about the options and usage of LASER, please read the [http://~~www~~.sph.umich.edu/~~csg~~/chaolong/LASER/LASER_Manual.pdf manual].

+

For more details about the options and usage of LASER, please read the [http://csg.sph.umich.edu//chaolong/LASER/LASER_Manual.pdf manual].

== LASER workflow ==

−

~~== Getting started ==~~

−

~~Create a working directory:~~

−

~~mkdir ancestry~~

+

== Setup in person at the SeqShop Workshop ==

−

~~cd ancestry~~

+

''This section is specifically for the SeqShop Workshop computers.''

+

+

''If you are not running during the SeqShop Workshop, please skip this section.''

+

+

−

~~Download and decompress software package:~~

+

−

~~wget http://www.sph.umich.edu/csg/chaolong/LASER/LASER-2.01.tar.gz~~

+

=== Setup your run environment===

−

~~tar xzvf LASER-2.01.tar~~.gz

+

This is the same setup you did for the previous tutorial, but you need to redo it each time you log in.

−

~~Set up~~ to ~~access data:~~

+

This will setup some environment variables to point you to

+

* Tutorial input files

source /home/chaolong/LASER-Tutorial/setup.txt

+

* You won't see any output after running <code>source</code>

+

** It silently sets up your environment

+

What is in the setup.txt file:

+

export GC=/home/mktrost/seqshop/gotcloud

+

export REF=/home/mktrost/seqshop/reference/all

+

export HGDP=/home/chaolong/LASER-Tutorial/HGDP

export BAM=/home/chaolong/LASER-Tutorial/BAM

−

export ~~REF~~=/~~home~~/~~chaolong~~/~~LASER~~-Tutorial/~~reference~~

+

</div>

−

export HGDP=/~~home~~/chaolong/LASER-~~Tutorial/HGDP~~

+

</div>

+

Point to where you want the output to go replacing the path with where you would like your output to go

+

<pre>export OUT=~/seqshop_output/</pre>

+

</div>

+

</div>

+

== Setup when running on your own outside of the SeqShop Workshop ==

+

''This section is specifically for running on your own outside of the SeqShop Workshop.''

+

+

''If you are running during the SeqShop Workshop, please skip this section.''

+

+

This tutorial uses samtools from GotCloud, as well as example data downloaded in the Sequence Mapping & Assembly tutorial, so if you have not already installed GotCloud and the tutorial data in a previous tutorial, please do so now: [[SeqShop:_Sequence_Mapping_and_Assembly_Practical, June 2014#Setup_when_running_on_your_own_outside_of_the_SeqShop_Workshop|Tutorial Setup]]

+

<ul>

+

<li> Additional variables for Ancestry:</li>

+

<ul>

+

+

<li>Using bash (replace the paths below with the appropriate paths):</li>

+

+

:<pre>export REF=$SS/ancestry/ref
export HGDP=$SS/ancestry/HGDP
export BAM=$SS/ancestry/bams</pre>

+

</div>

+

</div>

+

+

<li>Using tcsh (replace the paths below with the appropriate paths):</li>

+

+

:<pre>setenv REF $SS/ancestry/ref
setenv HGDP $SS/ancestry/HGDP
setenv BAM $SS/ancestry/bams</pre>

+

</div>

+

</div>

+

</ul>

+

</ul>

+

</div>

+

</div>

+

== Getting started ==

+

Create a working directory:

+

mkdir $OUT/ancestry

+

cd $OUT/ancestry

+

Download and decompress software package:

+

wget http://csg.sph.umich.edu//chaolong/LASER/LASER-2.01.tar.gz

+

tar xzvf LASER-2.01.tar.gz

== Preparing input files for LASER ==

−

Step 0: vcf --> geno

+

=== Step 0: vcf --> geno ===

This step prepares the reference panel by converting a VCF genotype file to a GENO file. We will skip this step and use a ready-to-use HGDP reference panel. A typical command to run the vcf2geno tool is given in the file "./LASER-2.01/vcf2geno/cmd.sh":

+

# cd ./LASER-2.01/vcf2geno/

+

# ./vcf2geno --inVcf exampleVCF/example.vcf.gz --updateID test.updateId --out test

+

=== Step 1: bam --> pileup ===

−

.~~/vcf2geno --inVcf exampleVCF/example.vcf.gz --updateID test.updateId --out test~~

+

This step uses samtools to generate pileup files from bam files.

+

+

In person at workshop notes:

+

+

*Please only try one sample so that we won't overload the sever with everyone running 6 jobs at the same time. Pileup files for these 6 samples have been prepared for later steps.

+

*It takes about 2 mins for each pileup job.

+

</div>

+

</div>

−

~~Step 1~~: ~~bam~~ --> ~~pileup~~

+

+

Outside of the workshop notes:

+

+

*The BAMs provided as part of the download are chr22 only BAMs. They are used to demonstrate how to run this step.

+

*Pileup files for the whole genome BAMs are provided with the download and will be used in the next step.

+

* You only need to try one of these.

+

</div>

+

</div>

−

~~This step uses~~ samtools ~~to generate~~ pileup ~~files from~~ bam ~~files~~.

+

$GC/bin/samtools mpileup -q 30 -Q 20 -f $REF/human.g1k.v37.fa -l $HGDP/HGDP_938.bed $BAM/121101035.recal.bam > 121101035.recal.pileup

−

~~Please only try one sample so that we won't overload the sever with everyone running 6 jobs at the same time~~. ~~Pileup files for these 6 samples have been prepared for later steps~~.

+

# $GC/bin/samtools mpileup -q 30 -Q 20 -f $REF/human.g1k.v37.fa -l $HGDP/HGDP_938.bed $BAM/121101043.recal.bam > 121101043.recal.pileup

−

~~It takes about 2 mins for each~~ pileup ~~job~~.

+

# $GC/bin/samtools mpileup -q 30 -Q 20 -f $REF/human.g1k.v37.fa -l $HGDP/HGDP_938.bed $BAM/121101050.recal.bam > 121101050.recal.pileup

+

# $GC/bin/samtools mpileup -q 30 -Q 20 -f $REF/human.g1k.v37.fa -l $HGDP/HGDP_938.bed $BAM/121101052.recal.bam > 121101052.recal.pileup

+

# $GC/bin/samtools mpileup -q 30 -Q 20 -f $REF/human.g1k.v37.fa -l $HGDP/HGDP_938.bed $BAM/121101415.recal.bam > 121101415.recal.pileup

+

# $GC/bin/samtools mpileup -q 30 -Q 20 -f $REF/human.g1k.v37.fa -l $HGDP/HGDP_938.bed $BAM/121101861.recal.bam > 121101861.recal.pileup

−

~~$GC/bin/samtools mpileup~~ -q 30 -Q 20 ~~-f $REF/hs37d5.fa.rz -l $HGDP/HGDP_938.bed $BAM/121101035.recal.bam > 121101035.recal.pileup &~~

+

We use -q 30 and -Q 20 to exclude reads that have mapping quality score lower than 30 or base quality score lower than 20.

−

~~# $GC/bin/samtools mpileup -q~~ 30 -Q 20 ~~-f $REF/hs37d5.fa.rz -l $HGDP/HGDP_938.bed $BAM/121101043.recal.bam > 121101043.recal.pileup &~~

−

~~# $GC/bin/samtools mpileup -q 30 -Q 20 -f $REF/hs37d5.fa.rz -l $HGDP/HGDP_938.bed $BAM/121101050.recal.bam > 121101050.recal.pileup &~~

−

~~# $GC/bin/samtools mpileup -q 30 -Q 20 -f $REF/hs37d5.fa.rz -l $HGDP/HGDP_938.bed $BAM/121101052.recal.bam > 121101052.recal.pileup &~~

−

~~# $GC/bin/samtools mpileup -q 30 -Q 20 -f $REF/hs37d5.fa.rz -l $HGDP/HGDP_938.bed $BAM/121101415.recal.bam > 121101415.recal.pileup &~~

−

~~# $GC/bin/samtools mpileup -q 30 -Q 20 -f $REF/hs37d5.fa.rz -l $HGDP/HGDP_938.bed $BAM/121101861.recal.bam > 121101861.recal~~.~~pileup &~~

−

Step 2: pileup --> seq

+

=== Step 2: pileup --> seq ===

In this step, we will generate a file called "hapmap_trios.seq", containing the information of 6 samples. It takes about 30 seconds to run.

We will use the pre-generated pileup files in the $BAM folder.

+

* These pre-generated pileup files are for the whole genome of all 6 samples

python ./LASER-2.01/pileup2seq/pileup2seq.py \

Line 65: Line 144:

$BAM/121101052.recal.pileup \

$BAM/121101415.recal.pileup \

−

$BAM/121101861.recal.pileup &

+

$BAM/121101861.recal.pileup

−

In the above command, -i specifies alternative IDs for the BAM files to be used in the .seq file (including popID and indivID). -b and -i are optional.

+

In the above command, -b provides the targeted regions to exclude and -i specifies alternative IDs for the BAM files to be used in the .seq file (including popID and indivID).

+

-b and -i are optional.

+

== Estimating ancestry coordinates ==

−

== ~~Estimating~~ ancestry ~~coordinates using LASER~~ ==

+

=== Step 0: Generate the reference ancestry space ===

−

~~Step 0: Generate~~ the reference ancestry space ~~(using the PCA mode of the LASER program)~~

+

LASER can perform principal components analysis (PCA) on genotype data of the reference panel to generate a reference ancestry space.

−

# ./LASER-2.01/laser -g ./HGDP/HGDP_938.geno -pca 1 -k 30 -o HGDP_938

+

# ./LASER-2.01/laser -g $HGDP/HGDP_938.geno -pca 1 -k 30 -o HGDP_938

−

The above command takes ~~about 15 mins~~ to finish.

+

The above command takes ~20 minutes to finish.

We will skip this step, and use a set of reference ancestry coordinates that have been generated in the file $HGDP/HGDP_938.RefPC.coord.

+

View the reference coordinates:

−

Step 1: ~~Place~~ sequenced samples into the reference ancestry space:

+

less -S $HGDP/HGDP_938.RefPC.coord

+

=== Step 1: Estimate ancestry for sequenced samples ===

+

Submit two jobs to place sequenced samples into the reference ancestry space:

./LASER-2.01/laser -g $HGDP/HGDP_938.geno -c $HGDP/HGDP_938.RefPC.coord -s hapmap_trios.seq -K 20 -k 4 -x 1 -y 3 -o hapmap_trios.1-3 &

Line 86: Line 172:

The first job will process samples 1 to 3 and the second job will processed samples 4 to 6.

Each sequenced sample will be projected from a 20-dimensional PCA space onto a 4-dimensional reference ancestry space.

−

The running time is ~7 minutes for processing 3 samples in each job.

+

The running time is ~10 minutes for processing 3 samples in each job.

+

=== Step 2: Combine results ===

+

Results from previous step will be output to two files "hapmap_trios.1-3.SeqPC.coord" and "hapmap_trios.4-6.SeqPC.coord".

+

Here we simply concatenate the two files while skipping the header line of the second file.

+

cp hapmap_trios.1-3.SeqPC.coord hapmap_trios.SeqPC.coord

+

more +2 hapmap_trios.4-6.SeqPC.coord >> hapmap_trios.SeqPC.coord

+

View the results:

+

less -S hapmap_trios.SeqPC.coord

+

The results should look like below (results will vary slightly):

+

popID indivID L1 Ci K t PC1 PC2 PC3 PC4

+

YRI NA19238 78386 0.170864 20 0.999688 467.989 -210.294 -14.1729 -14.4204

+

CEU NA12892 85486 0.185973 20 0.999723 10.796 199.095 -9.90387 -21.4534

+

CEU NA12891 87588 0.190442 20 0.99973 2.04224 196.07 -19.5705 -12.8022

+

CEU NA12878 83213 0.181748 20 0.999711 4.34591 199.861 -12.4825 -22.6281

+

YRI NA19239 87564 0.193424 20 0.999734 474.464 -215.96 -9.02921 -19.7372

+

YRI NA19240 95866 0.213874 20 0.999748 469.914 -214.94 -14.9923 -13.6559

+

== Visualizing results ==

+

Example R codes are available in ./LASER-2.01/plot/. Let's copy the folder to current working directory:

+

cp -r ./LASER-2.01/plot/ ./

+

Go to the plot folder and run the script to plot results:

+

cd plot

+

Rscript plotHGDP.r $HGDP/HGDP_938.RefPC.coord ../hapmap_trios.SeqPC.coord

+

A figure named "Results_on_HGDP.pdf" will be generated.

+

Visualize the figure:

+

evince Results_on_HGDP.pdf &

+

We expect to see the following figure, in which 3 CEU samples cluster with HGDP Europeans and 3 YRI samples cluster with HGDP Africans:

+

Ppwhite

96

edits

Changes

SeqShop: Estimates of Genetic Ancestry Practical, June 2014 (view source)

Revision as of 10:23, 2 February 2017

Navigation menu

Page actions

Page actions

Personal tools

quick links

teaching

Navigation

Search

Tools