Changes

3,432 bytes added , 10:23, 2 February 2017

→‎Getting started

Line 1: Line 1: +

'''Note:''' the latest version of this practical is available at: [[SeqShop: Estimates of Genetic Ancestry Practical]]

+

* The ones here is the original one from the June workshop (updated to be run from elsewhere)

+

== Introduction ==

See the [[Media:LASER-tutorial.pdf|tutorial slides]] for an introduction of the LASER analysis workflow, input/output file formats, and usage of the LASER software.

Line 4: Line 8:

The main purpose of this page is to provide step-by-step command lines for using LASER to estimate ancestry of 6 targeted sequenced samples (2 HapMap trios) in a principal component space generated using genome-wide SNP data from the Human Genome Diversity Project (HGDP). The HGDP reference panel contains genotype data across 632,958 autosomal loci for 938 individuals from 53 populations worldwide.

−

For more details about the options and usage of LASER, please read the [http://~~www~~.sph.umich.edu/~~csg~~/chaolong/LASER/LASER_Manual.pdf manual].

+

For more details about the options and usage of LASER, please read the [http://csg.sph.umich.edu//chaolong/LASER/LASER_Manual.pdf manual].

== LASER workflow ==

−

~~== Getting started ==~~

−

~~Create a working directory:~~

−

~~mkdir ancestry~~

+

== Setup in person at the SeqShop Workshop ==

−

~~cd ancestry~~

+

''This section is specifically for the SeqShop Workshop computers.''

+

+

''If you are not running during the SeqShop Workshop, please skip this section.''

+

+

−

~~Download and decompress software package:~~

+

−

~~wget http://www.sph.umich.edu/csg/chaolong/LASER/LASER-2.01.tar.gz~~

+

=== Setup your run environment===

−

~~tar xzvf LASER-2.01.tar~~.gz

+

This is the same setup you did for the previous tutorial, but you need to redo it each time you log in.

−

~~Set up~~ to ~~access data:~~

+

This will setup some environment variables to point you to

+

* Tutorial input files

source /home/chaolong/LASER-Tutorial/setup.txt

+

* You won't see any output after running <code>source</code>

+

** It silently sets up your environment

+

What is in the setup.txt file:

+

export GC=/home/mktrost/seqshop/gotcloud

+

export REF=/home/mktrost/seqshop/reference/all

+

export HGDP=/home/chaolong/LASER-Tutorial/HGDP

export BAM=/home/chaolong/LASER-Tutorial/BAM

−

export ~~REF~~=/~~home~~/~~chaolong~~/~~LASER~~-Tutorial/~~reference~~

+

</div>

−

export HGDP=/~~home~~/chaolong/LASER-~~Tutorial/HGDP~~

+

</div>

+

Point to where you want the output to go replacing the path with where you would like your output to go

+

<pre>export OUT=~/seqshop_output/</pre>

+

</div>

+

</div>

+

== Setup when running on your own outside of the SeqShop Workshop ==

+

''This section is specifically for running on your own outside of the SeqShop Workshop.''

+

+

''If you are running during the SeqShop Workshop, please skip this section.''

+

+

This tutorial uses samtools from GotCloud, as well as example data downloaded in the Sequence Mapping & Assembly tutorial, so if you have not already installed GotCloud and the tutorial data in a previous tutorial, please do so now: [[SeqShop:_Sequence_Mapping_and_Assembly_Practical, June 2014#Setup_when_running_on_your_own_outside_of_the_SeqShop_Workshop|Tutorial Setup]]

+

<ul>

+

<li> Additional variables for Ancestry:</li>

+

<ul>

+

+

<li>Using bash (replace the paths below with the appropriate paths):</li>

+

+

:<pre>export REF=$SS/ancestry/ref
export HGDP=$SS/ancestry/HGDP
export BAM=$SS/ancestry/bams</pre>

+

</div>

+

</div>

+

+

<li>Using tcsh (replace the paths below with the appropriate paths):</li>

+

+

:<pre>setenv REF $SS/ancestry/ref
setenv HGDP $SS/ancestry/HGDP
setenv BAM $SS/ancestry/bams</pre>

+

</div>

+

</div>

+

</ul>

+

</ul>

+

</div>

+

</div>

+

== Getting started ==

+

Create a working directory:

+

mkdir $OUT/ancestry

+

cd $OUT/ancestry

+

Download and decompress software package:

+

wget http://csg.sph.umich.edu//chaolong/LASER/LASER-2.01.tar.gz

+

tar xzvf LASER-2.01.tar.gz

== Preparing input files for LASER ==

−

Step 0: vcf --> geno

+

=== Step 0: vcf --> geno ===

This step prepares the reference panel by converting a VCF genotype file to a GENO file. We will skip this step and use a ready-to-use HGDP reference panel. A typical command to run the vcf2geno tool is given in the file "./LASER-2.01/vcf2geno/cmd.sh":

Line 37: Line 98: −

Step 1: bam --> pileup

+

=== Step 1: bam --> pileup ===

This step uses samtools to generate pileup files from bam files.

−

~~Please only try one sample so that we won't overload the sever with everyone running 6 jobs at the same time. Pileup files for these 6 samples have been prepared for later steps.~~

−

~~It takes about 2 mins for each pileup job.~~

−

~~$GC/bin/samtools mpileup~~ -~~q 30~~ -~~Q 20~~ -~~f $REF/hs37d5.fa.rz~~ -~~l $HGDP/HGDP_938.bed $BAM/121101035.recal.bam~~ > ~~121101035~~.~~recal~~.pileup &

+

−

~~# $GC~~/~~bin/samtools mpileup -q 30 -Q 20 -f $REF/hs37d5.fa.rz -l $HGDP/HGDP_938.bed $BAM~~/~~121101043.recal.bam~~ > ~~121101043.recal.pileup &~~

+

In person at workshop notes:

−

~~# $GC/bin/samtools mpileup -q 30 -Q 20~~ -~~f $REF/hs37d5.fa.rz -l $HGDP/HGDP_938.bed $BAM/121101050.recal.bam~~ > ~~121101050.recal.pileup &~~

+

−

~~# $GC/bin/samtools mpileup~~ -~~q 30~~ -~~Q 20 -f $REF/hs37d5.fa.rz -l $HGDP/HGDP_938.bed $BAM/121101052.recal.bam~~ > ~~121101052~~.~~recal~~.~~pileup &~~

+

*Please only try one sample so that we won't overload the sever with everyone running 6 jobs at the same time. Pileup files for these 6 samples have been prepared for later steps.

−

~~# $GC/bin/samtools mpileup -q 30 -Q 20 -f $REF/hs37d5~~.fa.~~rz -l $HGDP/HGDP_938.bed $BAM~~/~~121101415.recal.bam~~ > ~~121101415.recal.pileup &~~

+

*It takes about 2 mins for each pileup job.

−

~~# $GC/bin/samtools mpileup -q 30 -Q 20 -f $REF~~/~~hs37d5.fa.rz -l $HGDP/HGDP_938.bed $BAM/121101861.recal.bam~~ > ~~121101861.recal.pileup &~~

+

</div>

+

</div>

+

+

Outside of the workshop notes:

+

+

*The BAMs provided as part of the download are chr22 only BAMs. They are used to demonstrate how to run this step.

+

*Pileup files for the whole genome BAMs are provided with the download and will be used in the next step.

+

* You only need to try one of these.

+

</div>

+

</div>

−

Step 2: pileup --> seq

+

$GC/bin/samtools mpileup -q 30 -Q 20 -f $REF/human.g1k.v37.fa -l $HGDP/HGDP_938.bed $BAM/121101035.recal.bam > 121101035.recal.pileup

+

# $GC/bin/samtools mpileup -q 30 -Q 20 -f $REF/human.g1k.v37.fa -l $HGDP/HGDP_938.bed $BAM/121101043.recal.bam > 121101043.recal.pileup

+

# $GC/bin/samtools mpileup -q 30 -Q 20 -f $REF/human.g1k.v37.fa -l $HGDP/HGDP_938.bed $BAM/121101050.recal.bam > 121101050.recal.pileup

+

# $GC/bin/samtools mpileup -q 30 -Q 20 -f $REF/human.g1k.v37.fa -l $HGDP/HGDP_938.bed $BAM/121101052.recal.bam > 121101052.recal.pileup

+

# $GC/bin/samtools mpileup -q 30 -Q 20 -f $REF/human.g1k.v37.fa -l $HGDP/HGDP_938.bed $BAM/121101415.recal.bam > 121101415.recal.pileup

+

# $GC/bin/samtools mpileup -q 30 -Q 20 -f $REF/human.g1k.v37.fa -l $HGDP/HGDP_938.bed $BAM/121101861.recal.bam > 121101861.recal.pileup

+

We use -q 30 and -Q 20 to exclude reads that have mapping quality score lower than 30 or base quality score lower than 20.

+

=== Step 2: pileup --> seq ===

In this step, we will generate a file called "hapmap_trios.seq", containing the information of 6 samples. It takes about 30 seconds to run.

We will use the pre-generated pileup files in the $BAM folder.

+

* These pre-generated pileup files are for the whole genome of all 6 samples

python ./LASER-2.01/pileup2seq/pileup2seq.py \

Line 65: Line 144:

$BAM/121101052.recal.pileup \

$BAM/121101415.recal.pileup \

−

$BAM/121101861.recal.pileup &

+

$BAM/121101861.recal.pileup

In the above command, -b provides the targeted regions to exclude and -i specifies alternative IDs for the BAM files to be used in the .seq file (including popID and indivID).

Line 72: Line 151:

== Estimating ancestry coordinates ==

−

Step 0: Generate the reference ancestry space (~~using the~~ PCA ~~mode~~ of the ~~LASER program)~~

+

=== Step 0: Generate the reference ancestry space ===

+

LASER can perform principal components analysis (PCA) on genotype data of the reference panel to generate a reference ancestry space.

# ./LASER-2.01/laser -g $HGDP/HGDP_938.geno -pca 1 -k 30 -o HGDP_938

Line 82: Line 163:

less -S $HGDP/HGDP_938.RefPC.coord

−

Step 1: ~~Place~~ sequenced samples into the reference ancestry space:

+

=== Step 1: Estimate ancestry for sequenced samples ===

+

Submit two jobs to place sequenced samples into the reference ancestry space:

./LASER-2.01/laser -g $HGDP/HGDP_938.geno -c $HGDP/HGDP_938.RefPC.coord -s hapmap_trios.seq -K 20 -k 4 -x 1 -y 3 -o hapmap_trios.1-3 &

Line 91: Line 174:

The running time is ~10 minutes for processing 3 samples in each job.

−

Step 2: Combine results

+

=== Step 2: Combine results ===

Results from previous step will be output to two files "hapmap_trios.1-3.SeqPC.coord" and "hapmap_trios.4-6.SeqPC.coord".

Line 102: Line 185:

less -S hapmap_trios.SeqPC.coord

−

The results should look like below:

+

The results should look like below (results will vary slightly):

popID indivID L1 Ci K t PC1 PC2 PC3 PC4

Line 114: Line 197:

== Visualizing results ==

−

Example R codes are available in ./LASER-2.01/plot/. Let's ~~first~~ copy the folder to current working directory:

+

Example R codes are available in ./LASER-2.01/plot/. Let's copy the folder to current working directory:

cp -r ./LASER-2.01/plot/ ./

−

~~And then we enter~~ the ~~directory~~ and run the script to plot results:

+

Go to the plot folder and run the script to plot results:

cd plot

Rscript plotHGDP.r $HGDP/HGDP_938.RefPC.coord ../hapmap_trios.SeqPC.coord

Line 127: Line 210:

evince Results_on_HGDP.pdf &

−

We expect to see the following figure, in which 3 ~~HapMap~~ CEU samples cluster with HGDP Europeans and 3 ~~HapMap~~ YRI samples cluster with HGDP Africans:

+

We expect to see the following figure, in which 3 CEU samples cluster with HGDP Europeans and 3 YRI samples cluster with HGDP Africans:

Ppwhite

96

edits

Changes

SeqShop: Estimates of Genetic Ancestry Practical, June 2014 (view source)

Revision as of 10:23, 2 February 2017

Navigation menu

Page actions

Page actions

Personal tools

quick links

teaching

Navigation

Search

Tools