Changes

From Genome Analysis Wiki
Jump to navigationJump to search
Line 1: Line 1:  +
'''Note:''' the latest version of this practical is available at: [[SeqShop: Estimates of Genetic Ancestry Practical]]
 +
* The ones here is the original one from the June workshop (updated to be run from elsewhere)
 +
 +
 
== Introduction ==
 
== Introduction ==
 
See the [[Media:LASER-tutorial.pdf|tutorial slides]] for an introduction of the LASER analysis workflow, input/output file formats, and usage of the LASER software.
 
See the [[Media:LASER-tutorial.pdf|tutorial slides]] for an introduction of the LASER analysis workflow, input/output file formats, and usage of the LASER software.
   −
The main purpose of this page is to provide step-by-step command lines for using LASER to estimate ancestry of 6 targeted sequenced samples (2 HapMap trios) in a principal component space generated using genome-wide SNP data from the Human Genome Diversity Project (HGDP).
+
The main purpose of this page is to provide step-by-step command lines for using LASER to estimate ancestry of 6 targeted sequenced samples (2 HapMap trios) in a principal component space generated using genome-wide SNP data from the Human Genome Diversity Project (HGDP). The HGDP reference panel contains genotype data across 632,958 autosomal loci for 938 individuals from 53 populations worldwide.
   −
For more details about the options and usage of LASER, please read the [http://www.sph.umich.edu/csg/chaolong/LASER/LASER_Manual.pdf manual].
+
For more details about the options and usage of LASER, please read the [http://csg.sph.umich.edu//chaolong/LASER/LASER_Manual.pdf manual].
    
== LASER workflow ==
 
== LASER workflow ==
 
[[File:LASER-workflow.png|thumb|center|alt=LASER workflow|400px|LASER workflow]]
 
[[File:LASER-workflow.png|thumb|center|alt=LASER workflow|400px|LASER workflow]]
   −
== Getting started ==
  −
Create a working directory:
     −
mkdir ancestry
+
== Setup in person at the SeqShop Workshop ==
cd ancestry
+
''This section is specifically for the SeqShop Workshop computers.''
 +
<div class="mw-collapsible mw-collapsed" style="width:600px">
 +
''If you are not running during the SeqShop Workshop, please skip this section.''
 +
<div class="mw-collapsible-content">
 +
 
   −
Download and decompress software package:
+
{{SeqShopLogin}}
   −
wget http://www.sph.umich.edu/csg/chaolong/LASER/LASER-2.01.tar.gz
+
=== Setup your run environment===
tar xzvf LASER-2.01.tar.gz
+
This is the same setup you did for the previous tutorial, but you need to redo it each time you log in.
   −
Set up to access data:
+
This will setup some environment variables to point you to
 +
* Tutorial input files
 
  source /home/chaolong/LASER-Tutorial/setup.txt
 
  source /home/chaolong/LASER-Tutorial/setup.txt
 +
* You won't see any output after running <code>source</code>
 +
** It silently sets up your environment
    +
<div class="mw-collapsible mw-collapsed" style="width:400px">
 
What is in the setup.txt file:
 
What is in the setup.txt file:
 +
<div class="mw-collapsible-content">
 
  export GC=/home/mktrost/seqshop/gotcloud
 
  export GC=/home/mktrost/seqshop/gotcloud
 +
export REF=/home/mktrost/seqshop/reference/all
 +
export HGDP=/home/chaolong/LASER-Tutorial/HGDP
 
  export BAM=/home/chaolong/LASER-Tutorial/BAM
 
  export BAM=/home/chaolong/LASER-Tutorial/BAM
export REF=/home/chaolong/LASER-Tutorial/reference
+
</div>
export HGDP=/home/chaolong/LASER-Tutorial/HGDP
+
</div>
 +
 
 +
Point to where you want the output to go replacing the path with where you would like your output to go
 +
<pre>export OUT=~/seqshop_output/</pre>
 +
</div>
 +
</div>
 +
 
 +
 
 +
== Setup when running on your own outside of the SeqShop Workshop ==
 +
''This section is specifically for running on your own outside of the SeqShop Workshop.''
 +
<div class="mw-collapsible" style="width:600px">
 +
''If you are running during the SeqShop Workshop, please skip this section.''
 +
<div class="mw-collapsible-content">
 +
 
 +
This tutorial uses samtools from GotCloud, as well as example data downloaded in the Sequence Mapping & Assembly tutorial, so if you have not already installed GotCloud and the tutorial data in a previous tutorial, please do so now: [[SeqShop:_Sequence_Mapping_and_Assembly_Practical, June 2014#Setup_when_running_on_your_own_outside_of_the_SeqShop_Workshop|Tutorial Setup]]
 +
 
 +
 
 +
{{SeqShopRemoteEnv}}
 +
 
 +
<ul>
 +
<li> Additional variables for Ancestry:</li>
 +
<ul>
 +
<div class="mw-collapsible" style="width:500px">
 +
<li>Using bash (replace the paths below with the appropriate paths):</li>
 +
<div class="mw-collapsible-content">
 +
:<pre>export REF=$SS/ancestry/ref&#10;export HGDP=$SS/ancestry/HGDP&#10;export BAM=$SS/ancestry/bams</pre>
 +
</div>
 +
</div>
 +
<div class="mw-collapsible mw-collapsed" style="width:500px">
 +
<li>Using tcsh (replace the paths below with the appropriate paths):</li>
 +
<div class="mw-collapsible-content">
 +
:<pre>setenv REF $SS/ancestry/ref&#10;setenv HGDP $SS/ancestry/HGDP&#10;setenv BAM $SS/ancestry/bams</pre>
 +
</div>
 +
</div>
 +
</ul>
 +
</ul>
 +
</div>
 +
</div>
 +
 
 +
== Getting started ==
 +
Create a working directory:
 +
 
 +
mkdir $OUT/ancestry
 +
cd $OUT/ancestry
 +
 
 +
Download and decompress software package:
 +
 
 +
wget http://csg.sph.umich.edu//chaolong/LASER/LASER-2.01.tar.gz
 +
tar xzvf LASER-2.01.tar.gz
    
== Preparing input files for LASER ==
 
== Preparing input files for LASER ==
Step 0: vcf --> geno
+
=== Step 0: vcf --> geno ===
    
This step prepares the reference panel by converting a VCF genotype file to a GENO file. We will skip this step and use a ready-to-use HGDP reference panel. A typical command to run the vcf2geno tool is given in the file "./LASER-2.01/vcf2geno/cmd.sh":
 
This step prepares the reference panel by converting a VCF genotype file to a GENO file. We will skip this step and use a ready-to-use HGDP reference panel. A typical command to run the vcf2geno tool is given in the file "./LASER-2.01/vcf2geno/cmd.sh":
 +
# cd ./LASER-2.01/vcf2geno/
 +
# ./vcf2geno --inVcf exampleVCF/example.vcf.gz --updateID test.updateId --out test
 +
 +
 +
=== Step 1: bam --> pileup ===
   −
./vcf2geno --inVcf exampleVCF/example.vcf.gz --updateID test.updateId --out test
+
This step uses samtools to generate pileup files from bam files.  
    +
<div class="mw-collapsible mw-collapsed" style="width:500px">
 +
In person at workshop notes:
 +
<div class="mw-collapsible-content">
 +
*Please only try one sample so that we won't overload the sever with everyone running 6 jobs at the same time. Pileup files for these 6 samples have been prepared for later steps.
 +
*It takes about 2 mins for each pileup job.
 +
</div>
 +
</div>
   −
Step 1: bam --> pileup
+
<div class="mw-collapsible" style="width:500px">
 +
Outside of the workshop notes:
 +
<div class="mw-collapsible-content">
 +
*The BAMs provided as part of the download are chr22 only BAMs.  They are used to demonstrate how to run this step.
 +
*Pileup files for the whole genome BAMs are provided with the download and will be used in the next step.
 +
* You only need to try one of these.
 +
</div>
 +
</div>
   −
This step uses samtools to generate pileup files from bam files.  
+
  $GC/bin/samtools mpileup -q 30 -Q 20 -f $REF/human.g1k.v37.fa -l $HGDP/HGDP_938.bed $BAM/121101035.recal.bam > 121101035.recal.pileup
Please only try one sample so that we won't overload the sever with everyone running 6 jobs at the same time. Pileup files for these 6 samples have been prepared for later steps.
+
# $GC/bin/samtools mpileup -q 30 -Q 20 -f $REF/human.g1k.v37.fa -l $HGDP/HGDP_938.bed $BAM/121101043.recal.bam > 121101043.recal.pileup
It takes about 2 mins for each pileup job.  
+
# $GC/bin/samtools mpileup -q 30 -Q 20 -f $REF/human.g1k.v37.fa -l $HGDP/HGDP_938.bed $BAM/121101050.recal.bam > 121101050.recal.pileup
 +
# $GC/bin/samtools mpileup -q 30 -Q 20 -f $REF/human.g1k.v37.fa -l $HGDP/HGDP_938.bed $BAM/121101052.recal.bam > 121101052.recal.pileup
 +
# $GC/bin/samtools mpileup -q 30 -Q 20 -f $REF/human.g1k.v37.fa -l $HGDP/HGDP_938.bed $BAM/121101415.recal.bam > 121101415.recal.pileup
 +
# $GC/bin/samtools mpileup -q 30 -Q 20 -f $REF/human.g1k.v37.fa -l $HGDP/HGDP_938.bed $BAM/121101861.recal.bam > 121101861.recal.pileup
   −
$GC/bin/samtools mpileup -q 30 -Q 20 -f $REF/hs37d5.fa.rz -l $HGDP/HGDP_938.bed $BAM/121101035.recal.bam > 121101035.recal.pileup &
+
We use -q 30 and -Q 20 to exclude reads that have mapping quality score lower than 30 or base quality score lower than 20.
# $GC/bin/samtools mpileup -q 30 -Q 20 -f $REF/hs37d5.fa.rz -l $HGDP/HGDP_938.bed $BAM/121101043.recal.bam > 121101043.recal.pileup &
  −
# $GC/bin/samtools mpileup -q 30 -Q 20 -f $REF/hs37d5.fa.rz -l $HGDP/HGDP_938.bed $BAM/121101050.recal.bam > 121101050.recal.pileup &
  −
# $GC/bin/samtools mpileup -q 30 -Q 20 -f $REF/hs37d5.fa.rz -l $HGDP/HGDP_938.bed $BAM/121101052.recal.bam > 121101052.recal.pileup &
  −
# $GC/bin/samtools mpileup -q 30 -Q 20 -f $REF/hs37d5.fa.rz -l $HGDP/HGDP_938.bed $BAM/121101415.recal.bam > 121101415.recal.pileup &
  −
# $GC/bin/samtools mpileup -q 30 -Q 20 -f $REF/hs37d5.fa.rz -l $HGDP/HGDP_938.bed $BAM/121101861.recal.bam > 121101861.recal.pileup &
     −
Step 2: pileup --> seq
+
=== Step 2: pileup --> seq ===
    
In this step, we will generate a file called "hapmap_trios.seq", containing the information of 6 samples. It takes about 30 seconds to run.
 
In this step, we will generate a file called "hapmap_trios.seq", containing the information of 6 samples. It takes about 30 seconds to run.
 
We will use the pre-generated pileup files in the $BAM folder.
 
We will use the pre-generated pileup files in the $BAM folder.
 +
* These pre-generated pileup files are for the whole genome of all 6 samples
    
  python ./LASER-2.01/pileup2seq/pileup2seq.py \
 
  python ./LASER-2.01/pileup2seq/pileup2seq.py \
Line 65: Line 144:  
  $BAM/121101052.recal.pileup \
 
  $BAM/121101052.recal.pileup \
 
  $BAM/121101415.recal.pileup \
 
  $BAM/121101415.recal.pileup \
  $BAM/121101861.recal.pileup &
+
  $BAM/121101861.recal.pileup
   −
In the above command, -i specifies alternative IDs for the BAM files to be used in the .seq file (including popID and indivID). -b and -i are optional.
+
In the above command, -b provides the targeted regions to exclude and -i specifies alternative IDs for the BAM files to be used in the .seq file (including popID and indivID).  
 +
-b and -i are optional.
    +
== Estimating ancestry coordinates ==
   −
== Estimating ancestry coordinates using LASER ==
+
=== Step 0: Generate the reference ancestry space ===
   −
Step 0: Generate the reference ancestry space (using the PCA mode of the LASER program)
+
LASER can perform principal components analysis (PCA) on genotype data of the reference panel to generate a reference ancestry space.
   −
  # ./LASER-2.01/laser -g ./HGDP/HGDP_938.geno -pca 1 -k 30 -o HGDP_938
+
  # ./LASER-2.01/laser -g $HGDP/HGDP_938.geno -pca 1 -k 30 -o HGDP_938
   −
The above command takes about 15 mins to finish.  
+
The above command takes ~20 minutes to finish.  
 
We will skip this step, and use a set of reference ancestry coordinates that have been generated in the file $HGDP/HGDP_938.RefPC.coord.
 
We will skip this step, and use a set of reference ancestry coordinates that have been generated in the file $HGDP/HGDP_938.RefPC.coord.
 +
View the reference coordinates:
   −
Step 1: Place sequenced samples into the reference ancestry space:
+
less -S $HGDP/HGDP_938.RefPC.coord
 +
 
 +
=== Step 1: Estimate ancestry for sequenced samples ===
 +
 
 +
Submit two jobs to place sequenced samples into the reference ancestry space:
    
  ./LASER-2.01/laser -g $HGDP/HGDP_938.geno -c $HGDP/HGDP_938.RefPC.coord -s hapmap_trios.seq -K 20 -k 4 -x 1 -y 3 -o hapmap_trios.1-3 &
 
  ./LASER-2.01/laser -g $HGDP/HGDP_938.geno -c $HGDP/HGDP_938.RefPC.coord -s hapmap_trios.seq -K 20 -k 4 -x 1 -y 3 -o hapmap_trios.1-3 &
Line 86: Line 172:  
The first job will process samples 1 to 3 and the second job will processed samples 4 to 6.  
 
The first job will process samples 1 to 3 and the second job will processed samples 4 to 6.  
 
Each sequenced sample will be projected from a 20-dimensional PCA space onto a 4-dimensional reference ancestry space.  
 
Each sequenced sample will be projected from a 20-dimensional PCA space onto a 4-dimensional reference ancestry space.  
The running time is ~7 minutes for processing 3 samples in each job.
+
The running time is ~10 minutes for processing 3 samples in each job.
 +
 
 +
=== Step 2: Combine results ===
 +
 
 +
Results from previous step will be output to two files "hapmap_trios.1-3.SeqPC.coord" and "hapmap_trios.4-6.SeqPC.coord".
 +
Here we simply concatenate the two files while skipping the header line of the second file.
 +
 
 +
cp hapmap_trios.1-3.SeqPC.coord hapmap_trios.SeqPC.coord
 +
more +2 hapmap_trios.4-6.SeqPC.coord >> hapmap_trios.SeqPC.coord
 +
 
 +
View the results:
 +
less -S hapmap_trios.SeqPC.coord
 +
 
 +
The results should look like below (results will vary slightly):
 +
 
 +
popID  indivID  L1      Ci        K    t          PC1        PC2        PC3        PC4
 +
YRI    NA19238  78386  0.170864  20    0.999688  467.989    -210.294    -14.1729    -14.4204
 +
CEU    NA12892  85486  0.185973  20    0.999723  10.796    199.095    -9.90387    -21.4534
 +
CEU    NA12891  87588  0.190442  20    0.99973    2.04224    196.07      -19.5705    -12.8022
 +
CEU    NA12878  83213  0.181748  20    0.999711  4.34591    199.861    -12.4825    -22.6281
 +
YRI    NA19239  87564  0.193424  20    0.999734  474.464    -215.96    -9.02921    -19.7372
 +
YRI    NA19240  95866  0.213874  20    0.999748  469.914    -214.94    -14.9923    -13.6559
 +
 
 +
== Visualizing results ==
 +
 
 +
Example R codes are available in ./LASER-2.01/plot/. Let's copy the folder to current working directory:
 +
 
 +
cp -r ./LASER-2.01/plot/ ./
 +
 
 +
Go to the plot folder and run the script to plot results:
 +
cd plot
 +
Rscript plotHGDP.r $HGDP/HGDP_938.RefPC.coord ../hapmap_trios.SeqPC.coord
 +
 
 +
A figure named "Results_on_HGDP.pdf" will be generated.
 +
Visualize the figure:
 +
 
 +
evince Results_on_HGDP.pdf &
 +
 
 +
We expect to see the following figure, in which 3 CEU samples cluster with HGDP Europeans and 3 YRI samples cluster with HGDP Africans:
 +
[[File:Results_on_HGDP.png|thumb|center|alt=LASER results|400px|LASER results]]
96

edits

Navigation menu