Changes

From Genome Analysis Wiki
Jump to navigationJump to search
Line 1: Line 1:  +
'''Note:''' the latest version of this practical is available at: [[SeqShop: Estimates of Genetic Ancestry Practical]]
 +
* The ones here is the original one from the June workshop (updated to be run from elsewhere)
 +
 +
 
== Introduction ==
 
== Introduction ==
 
See the [[Media:LASER-tutorial.pdf|tutorial slides]] for an introduction of the LASER analysis workflow, input/output file formats, and usage of the LASER software.
 
See the [[Media:LASER-tutorial.pdf|tutorial slides]] for an introduction of the LASER analysis workflow, input/output file formats, and usage of the LASER software.
Line 4: Line 8:  
The main purpose of this page is to provide step-by-step command lines for using LASER to estimate ancestry of 6 targeted sequenced samples (2 HapMap trios) in a principal component space generated using genome-wide SNP data from the Human Genome Diversity Project (HGDP). The HGDP reference panel contains genotype data across 632,958 autosomal loci for 938 individuals from 53 populations worldwide.
 
The main purpose of this page is to provide step-by-step command lines for using LASER to estimate ancestry of 6 targeted sequenced samples (2 HapMap trios) in a principal component space generated using genome-wide SNP data from the Human Genome Diversity Project (HGDP). The HGDP reference panel contains genotype data across 632,958 autosomal loci for 938 individuals from 53 populations worldwide.
   −
For more details about the options and usage of LASER, please read the [http://www.sph.umich.edu/csg/chaolong/LASER/LASER_Manual.pdf manual].
+
For more details about the options and usage of LASER, please read the [http://csg.sph.umich.edu//chaolong/LASER/LASER_Manual.pdf manual].
    
== LASER workflow ==
 
== LASER workflow ==
 
[[File:LASER-workflow.png|thumb|center|alt=LASER workflow|400px|LASER workflow]]
 
[[File:LASER-workflow.png|thumb|center|alt=LASER workflow|400px|LASER workflow]]
   −
== Getting started ==
  −
Create a working directory:
     −
mkdir ancestry
+
== Setup in person at the SeqShop Workshop ==
cd ancestry
+
''This section is specifically for the SeqShop Workshop computers.''
 +
<div class="mw-collapsible mw-collapsed" style="width:600px">
 +
''If you are not running during the SeqShop Workshop, please skip this section.''
 +
<div class="mw-collapsible-content">
 +
 
   −
Download and decompress software package:
+
{{SeqShopLogin}}
   −
wget http://www.sph.umich.edu/csg/chaolong/LASER/LASER-2.01.tar.gz
+
=== Setup your run environment===
tar xzvf LASER-2.01.tar.gz
+
This is the same setup you did for the previous tutorial, but you need to redo it each time you log in.
   −
Set up to access data:
+
This will setup some environment variables to point you to
 +
* Tutorial input files
 
  source /home/chaolong/LASER-Tutorial/setup.txt
 
  source /home/chaolong/LASER-Tutorial/setup.txt
 +
* You won't see any output after running <code>source</code>
 +
** It silently sets up your environment
    +
<div class="mw-collapsible mw-collapsed" style="width:400px">
 
What is in the setup.txt file:
 
What is in the setup.txt file:
 +
<div class="mw-collapsible-content">
 
  export GC=/home/mktrost/seqshop/gotcloud
 
  export GC=/home/mktrost/seqshop/gotcloud
 +
export REF=/home/mktrost/seqshop/reference/all
 +
export HGDP=/home/chaolong/LASER-Tutorial/HGDP
 
  export BAM=/home/chaolong/LASER-Tutorial/BAM
 
  export BAM=/home/chaolong/LASER-Tutorial/BAM
export REF=/home/chaolong/LASER-Tutorial/reference
+
</div>
export HGDP=/home/chaolong/LASER-Tutorial/HGDP
+
</div>
 +
 
 +
Point to where you want the output to go replacing the path with where you would like your output to go
 +
<pre>export OUT=~/seqshop_output/</pre>
 +
</div>
 +
</div>
 +
 
 +
 
 +
== Setup when running on your own outside of the SeqShop Workshop ==
 +
''This section is specifically for running on your own outside of the SeqShop Workshop.''
 +
<div class="mw-collapsible" style="width:600px">
 +
''If you are running during the SeqShop Workshop, please skip this section.''
 +
<div class="mw-collapsible-content">
 +
 
 +
This tutorial uses samtools from GotCloud, as well as example data downloaded in the Sequence Mapping & Assembly tutorial, so if you have not already installed GotCloud and the tutorial data in a previous tutorial, please do so now: [[SeqShop:_Sequence_Mapping_and_Assembly_Practical, June 2014#Setup_when_running_on_your_own_outside_of_the_SeqShop_Workshop|Tutorial Setup]]
 +
 
 +
 
 +
{{SeqShopRemoteEnv}}
 +
 
 +
<ul>
 +
<li> Additional variables for Ancestry:</li>
 +
<ul>
 +
<div class="mw-collapsible" style="width:500px">
 +
<li>Using bash (replace the paths below with the appropriate paths):</li>
 +
<div class="mw-collapsible-content">
 +
:<pre>export REF=$SS/ancestry/ref&#10;export HGDP=$SS/ancestry/HGDP&#10;export BAM=$SS/ancestry/bams</pre>
 +
</div>
 +
</div>
 +
<div class="mw-collapsible mw-collapsed" style="width:500px">
 +
<li>Using tcsh (replace the paths below with the appropriate paths):</li>
 +
<div class="mw-collapsible-content">
 +
:<pre>setenv REF $SS/ancestry/ref&#10;setenv HGDP $SS/ancestry/HGDP&#10;setenv BAM $SS/ancestry/bams</pre>
 +
</div>
 +
</div>
 +
</ul>
 +
</ul>
 +
</div>
 +
</div>
 +
 
 +
== Getting started ==
 +
Create a working directory:
 +
 
 +
mkdir $OUT/ancestry
 +
cd $OUT/ancestry
 +
 
 +
Download and decompress software package:
 +
 
 +
wget http://csg.sph.umich.edu//chaolong/LASER/LASER-2.01.tar.gz
 +
tar xzvf LASER-2.01.tar.gz
    
== Preparing input files for LASER ==
 
== Preparing input files for LASER ==
Step 0: vcf --> geno
+
=== Step 0: vcf --> geno ===
    
This step prepares the reference panel by converting a VCF genotype file to a GENO file. We will skip this step and use a ready-to-use HGDP reference panel. A typical command to run the vcf2geno tool is given in the file "./LASER-2.01/vcf2geno/cmd.sh":
 
This step prepares the reference panel by converting a VCF genotype file to a GENO file. We will skip this step and use a ready-to-use HGDP reference panel. A typical command to run the vcf2geno tool is given in the file "./LASER-2.01/vcf2geno/cmd.sh":
Line 37: Line 98:       −
Step 1: bam --> pileup
+
=== Step 1: bam --> pileup ===
    
This step uses samtools to generate pileup files from bam files.  
 
This step uses samtools to generate pileup files from bam files.  
Please only try one sample so that we won't overload the sever with everyone running 6 jobs at the same time. Pileup files for these 6 samples have been prepared for later steps.
  −
It takes about 2 mins for each pileup job.
     −
$GC/bin/samtools mpileup -q 30 -Q 20 -f $REF/hs37d5.fa.rz -l $HGDP/HGDP_938.bed $BAM/121101035.recal.bam > 121101035.recal.pileup &
+
<div class="mw-collapsible mw-collapsed" style="width:500px">
# $GC/bin/samtools mpileup -q 30 -Q 20 -f $REF/hs37d5.fa.rz -l $HGDP/HGDP_938.bed $BAM/121101043.recal.bam > 121101043.recal.pileup &
+
In person at workshop notes:
# $GC/bin/samtools mpileup -q 30 -Q 20 -f $REF/hs37d5.fa.rz -l $HGDP/HGDP_938.bed $BAM/121101050.recal.bam > 121101050.recal.pileup &
+
<div class="mw-collapsible-content">
# $GC/bin/samtools mpileup -q 30 -Q 20 -f $REF/hs37d5.fa.rz -l $HGDP/HGDP_938.bed $BAM/121101052.recal.bam > 121101052.recal.pileup &
+
*Please only try one sample so that we won't overload the sever with everyone running 6 jobs at the same time. Pileup files for these 6 samples have been prepared for later steps.
# $GC/bin/samtools mpileup -q 30 -Q 20 -f $REF/hs37d5.fa.rz -l $HGDP/HGDP_938.bed $BAM/121101415.recal.bam > 121101415.recal.pileup &
+
*It takes about 2 mins for each pileup job.
# $GC/bin/samtools mpileup -q 30 -Q 20 -f $REF/hs37d5.fa.rz -l $HGDP/HGDP_938.bed $BAM/121101861.recal.bam > 121101861.recal.pileup &
+
</div>
 +
</div>
 +
 
 +
<div class="mw-collapsible" style="width:500px">
 +
Outside of the workshop notes:
 +
<div class="mw-collapsible-content">
 +
*The BAMs provided as part of the download are chr22 only BAMs. They are used to demonstrate how to run this step.
 +
*Pileup files for the whole genome BAMs are provided with the download and will be used in the next step.
 +
* You only need to try one of these.
 +
</div>
 +
</div>
   −
Step 2: pileup --> seq
+
  $GC/bin/samtools mpileup -q 30 -Q 20 -f $REF/human.g1k.v37.fa -l $HGDP/HGDP_938.bed $BAM/121101035.recal.bam > 121101035.recal.pileup
 +
# $GC/bin/samtools mpileup -q 30 -Q 20 -f $REF/human.g1k.v37.fa -l $HGDP/HGDP_938.bed $BAM/121101043.recal.bam > 121101043.recal.pileup
 +
# $GC/bin/samtools mpileup -q 30 -Q 20 -f $REF/human.g1k.v37.fa -l $HGDP/HGDP_938.bed $BAM/121101050.recal.bam > 121101050.recal.pileup
 +
# $GC/bin/samtools mpileup -q 30 -Q 20 -f $REF/human.g1k.v37.fa -l $HGDP/HGDP_938.bed $BAM/121101052.recal.bam > 121101052.recal.pileup
 +
# $GC/bin/samtools mpileup -q 30 -Q 20 -f $REF/human.g1k.v37.fa -l $HGDP/HGDP_938.bed $BAM/121101415.recal.bam > 121101415.recal.pileup
 +
# $GC/bin/samtools mpileup -q 30 -Q 20 -f $REF/human.g1k.v37.fa -l $HGDP/HGDP_938.bed $BAM/121101861.recal.bam > 121101861.recal.pileup
 +
 
 +
We use -q 30 and -Q 20 to exclude reads that have mapping quality score lower than 30 or base quality score lower than 20.
 +
 
 +
=== Step 2: pileup --> seq ===
    
In this step, we will generate a file called "hapmap_trios.seq", containing the information of 6 samples. It takes about 30 seconds to run.
 
In this step, we will generate a file called "hapmap_trios.seq", containing the information of 6 samples. It takes about 30 seconds to run.
 
We will use the pre-generated pileup files in the $BAM folder.
 
We will use the pre-generated pileup files in the $BAM folder.
 +
* These pre-generated pileup files are for the whole genome of all 6 samples
    
  python ./LASER-2.01/pileup2seq/pileup2seq.py \
 
  python ./LASER-2.01/pileup2seq/pileup2seq.py \
Line 65: Line 144:  
  $BAM/121101052.recal.pileup \
 
  $BAM/121101052.recal.pileup \
 
  $BAM/121101415.recal.pileup \
 
  $BAM/121101415.recal.pileup \
  $BAM/121101861.recal.pileup &
+
  $BAM/121101861.recal.pileup
    
In the above command, -b provides the targeted regions to exclude and -i specifies alternative IDs for the BAM files to be used in the .seq file (including popID and indivID).  
 
In the above command, -b provides the targeted regions to exclude and -i specifies alternative IDs for the BAM files to be used in the .seq file (including popID and indivID).  
Line 72: Line 151:  
== Estimating ancestry coordinates ==
 
== Estimating ancestry coordinates ==
   −
Step 0: Generate the reference ancestry space (using the PCA mode of the LASER program)
+
=== Step 0: Generate the reference ancestry space ===
 +
 
 +
LASER can perform principal components analysis (PCA) on genotype data of the reference panel to generate a reference ancestry space.
    
  # ./LASER-2.01/laser -g $HGDP/HGDP_938.geno -pca 1 -k 30 -o HGDP_938
 
  # ./LASER-2.01/laser -g $HGDP/HGDP_938.geno -pca 1 -k 30 -o HGDP_938
Line 82: Line 163:  
  less -S $HGDP/HGDP_938.RefPC.coord
 
  less -S $HGDP/HGDP_938.RefPC.coord
   −
Step 1: Place sequenced samples into the reference ancestry space:
+
=== Step 1: Estimate ancestry for sequenced samples ===
 +
 
 +
Submit two jobs to place sequenced samples into the reference ancestry space:
    
  ./LASER-2.01/laser -g $HGDP/HGDP_938.geno -c $HGDP/HGDP_938.RefPC.coord -s hapmap_trios.seq -K 20 -k 4 -x 1 -y 3 -o hapmap_trios.1-3 &
 
  ./LASER-2.01/laser -g $HGDP/HGDP_938.geno -c $HGDP/HGDP_938.RefPC.coord -s hapmap_trios.seq -K 20 -k 4 -x 1 -y 3 -o hapmap_trios.1-3 &
Line 91: Line 174:  
The running time is ~10 minutes for processing 3 samples in each job.
 
The running time is ~10 minutes for processing 3 samples in each job.
   −
Step 2: Combine results  
+
=== Step 2: Combine results ===
    
Results from previous step will be output to two files "hapmap_trios.1-3.SeqPC.coord" and "hapmap_trios.4-6.SeqPC.coord".  
 
Results from previous step will be output to two files "hapmap_trios.1-3.SeqPC.coord" and "hapmap_trios.4-6.SeqPC.coord".  
Line 102: Line 185:  
  less -S hapmap_trios.SeqPC.coord
 
  less -S hapmap_trios.SeqPC.coord
   −
The results should look like below:
+
The results should look like below (results will vary slightly):
    
  popID  indivID  L1      Ci        K    t          PC1        PC2        PC3        PC4
 
  popID  indivID  L1      Ci        K    t          PC1        PC2        PC3        PC4
Line 114: Line 197:  
== Visualizing results ==
 
== Visualizing results ==
   −
Example R codes are available in ./LASER-2.01/plot/. Let's first copy the folder to current working directory:
+
Example R codes are available in ./LASER-2.01/plot/. Let's copy the folder to current working directory:
    
  cp -r ./LASER-2.01/plot/ ./
 
  cp -r ./LASER-2.01/plot/ ./
   −
And then we enter the directory and run the script to plot results:
+
Go to the plot folder and run the script to plot results:
 
  cd plot
 
  cd plot
 
  Rscript plotHGDP.r $HGDP/HGDP_938.RefPC.coord ../hapmap_trios.SeqPC.coord
 
  Rscript plotHGDP.r $HGDP/HGDP_938.RefPC.coord ../hapmap_trios.SeqPC.coord
Line 127: Line 210:  
  evince Results_on_HGDP.pdf &
 
  evince Results_on_HGDP.pdf &
   −
We expect to see the following figure, in which 3 HapMap CEU samples cluster with HGDP Europeans and 3 HapMap YRI samples cluster with HGDP Africans:  
+
We expect to see the following figure, in which 3 CEU samples cluster with HGDP Europeans and 3 YRI samples cluster with HGDP Africans:  
 
[[File:Results_on_HGDP.png|thumb|center|alt=LASER results|400px|LASER results]]
 
[[File:Results_on_HGDP.png|thumb|center|alt=LASER results|400px|LASER results]]
96

edits

Navigation menu