Changes

From Genome Analysis Wiki
458 bytes added ,  17:35, 2 February 2013
no edit summary
Line 1: Line 1:  
= Introduction  =
 
= Introduction  =
   −
LASER is a C++ software that can estimate individual ancestry directly from genome-wide shortgun sequencing reads without calling genotypes. The method relies on the availability of a set of reference individuals whose genome-wide SNP genotypes and ancestral information are known. We first construct a reference coordinate system by applying principal components analysis (PCA) to the genotype data of the reference individuals. Then, for each sequence sample, uses the genome-wide sequencing reads to place the sample into the reference PCA space. With an appropriate reference panel, the estimated coordinates of the sequence samples identify their ancestral background and can be directly used to correct for population structure in association studies or to ensure adequate matching of cases and controls.  
+
LASER, which stands for Locating Ancestry using SEquencing Reads, is a C++ software package that can estimate individual ancestry directly from genome-wide shotgun sequencing reads without calling genotypes. The method relies on the availability of a set of reference individuals whose genome-wide SNP genotypes and ancestral information are known. We first construct a reference coordinate system by applying principal components analysis (PCA) to the genotype data of the reference individuals. Then, for each sequencing sample, we use the genome-wide sequencing reads to place the sample into the reference PCA space. With an appropriate reference panel, the estimated coordinates of the sequencing samples identify their ancestral background and can be directly used to correct for population structure in association studies or to ensure adequate matching of cases and controls.  
   −
This goal of this wiki page is to help you get start using LASER, and we encourage you to read the [http://www.sph.umich.edu/csg/chaolong/LASER/LASER_Manual.pdf manual] for details.
+
The goal of this wiki page is to help you get started with LASER, and we encourage you to read the [http://www.sph.umich.edu/csg/chaolong/LASER/LASER_Manual.pdf manual] for more details.
   −
= Download LASER =
+
= Download =
   −
The software package, resource bundle and a detailed manual can be downloaded from: [http://www.sph.umich.edu/csg/chaolong/LASER/ LASER link].
+
To get a copy, go to the [http://www.sph.umich.edu/csg/chaolong/LASER/ LASER Download] page.
    
= Workflow  =
 
= Workflow  =
   −
LASER generates the coordinates of reference panels and sequencing samples together, and it requires essentially two input files:  
+
LASER generates coordinates for both reference individuals and sequence samples. It requires essentially two input files:  
    
[[File:LASER-Workflow.png|thumb|center|alt=LASER workflow|400px|LASER Workflow]]  
 
[[File:LASER-Workflow.png|thumb|center|alt=LASER workflow|400px|LASER Workflow]]  
   −
*Seq file: a text file processed from BAM (alignment) files. (See [[#Process sequencing file (BAM)|Processing sequencing file]] for how to obtain seq file)  
+
*Seq file: a text file processed from BAM (alignment) files. (See [[#Process sequencing file (BAM)|Processing sequencing file]] for how to prepare a seq file)  
*Geno file: genotypes of reference panels. (See [[#Geno file|Geno file]] to understand geno file format)
+
*Geno file: genotypes of reference individuals. (See [[#Geno file|Geno file]] to understand the geno file format)
   −
In coord files of reference (Reference.coord), LASER outputs the result of Principal Component Analysis (PCA) of the reference samples; in coord files of sequencing samples(AllSamples.coord), LASER infers their ancestries by placing their ancestry coordinates based on the reference coordinates.  
+
LASER typically outputs two coord files: (1) the reference individuals' coord file (Reference.coord), which contains the reference coordinates in the PCA space; and (2) the sequence samples' coord file (AllSamples.coord), which contains the ancestry coordinates of the sequence samples placed in the reference samples' PCA space.
   −
 
+
An example result of the coord file of sequence samples is shown below:
An example result (Seis shown below:
      
  popID  indivID  L1    Ci        t        PC1      PC2
 
  popID  indivID  L1    Ci        t        PC1      PC2
Line 31: Line 30:  
  YRI    NA19240  1735  0.404142  0.990264  59.8379  -45.2765
 
  YRI    NA19240  1735  0.404142  0.990264  59.8379  -45.2765
   −
In the header line, popID means "population ID", indivID means "individual ID", L1 means number of loci has been covered, Ci means "average coverage", t means Procrustes similarity.
+
In the header line, popID means "population ID", indivID means "individual ID", L1 means the number of loci covered by at least one read, Ci means "average coverage", and t means Procrustes similarity. PC1 and PC2 are the coordinates on the first and second principal components.
PC1, PC2 means coordinates of first and second principal components.
      
= Tutorial  =
 
= Tutorial  =
Line 41: Line 39:     
We illustrate how to obtain .seq file from BAM files in this section.  
 
We illustrate how to obtain .seq file from BAM files in this section.  
In this example, we use HGDP data set, which contain 938 individuals and 632,958 markers as the reference.
+
In this example, we use the HGDP data set, which contains 938 individuals and 632,958 markers, as the reference.
 
[[File:LASER-DataProcessing.png|thumb|center|alt=LASER workflow|400px|LASER Data Processing Procedure]]  
 
[[File:LASER-DataProcessing.png|thumb|center|alt=LASER workflow|400px|LASER Data Processing Procedure]]  
    
1. Obtain pileup files from BAM files   
 
1. Obtain pileup files from BAM files   
   −
We use samtools to extract the bases on the 632,958 reference markers using:
+
We use ''samtools'' to extract the sequence bases overlapping the 632,958 reference markers:
 
  samtools mpileup -q 30 -Q 20 -f ../../LASER-resource/reference/hs37d5.fa -l HGDP_938.bed exampleBAM/NA12878.chrom22.recal.bam > NA12878.chrom22.pileup
 
  samtools mpileup -q 30 -Q 20 -f ../../LASER-resource/reference/hs37d5.fa -l HGDP_938.bed exampleBAM/NA12878.chrom22.recal.bam > NA12878.chrom22.pileup
   −
2. Obtain seq files from pileup files.  
+
2. Obtain a seq file from pileup files.  
   −
To convert pile up files into seq file format, we first generate site file:
+
To convert pileup files into a single seq file before running LASER, we first generate a site file:
    
  cat ../resource/HGDP/HGDP_938.site |awk '{if (NR > 1) {print $1, $2-1, $2;}}' > HGDP_938.bed
 
  cat ../resource/HGDP/HGDP_938.site |awk '{if (NR > 1) {print $1, $2-1, $2;}}' > HGDP_938.bed
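
The awk command above turns each marker position in the site file into a one-base BED interval. The same conversion can be sketched in Python. This is only a minimal sketch, assuming the site file has exactly one header line and whitespace-separated columns with the chromosome first and the 1-based position second (as the awk command implies). The function name site_to_bed and the example header are hypothetical and are not part of LASER.

```python
def site_to_bed(site_lines):
    """Convert site-file lines into BED lines (0-based start, 1-based end)."""
    bed_lines = []
    for line_number, line in enumerate(site_lines, start=1):
        if line_number == 1:
            continue  # skip the header line, like NR > 1 in the awk command
        fields = line.split()
        if len(fields) < 2:
            continue  # ignore blank or malformed lines (an added safeguard)
        chrom, position = fields[0], int(fields[1])
        # BED start is the position minus one; BED end is the position itself.
        bed_lines.append(f"{chrom} {position - 1} {position}")
    return bed_lines


# Hypothetical two-line site file: a header, then one marker on chromosome 1.
example = ["CHR POS", "1 752566"]
print(site_to_bed(example))
```

For a marker at position 752566 on chromosome 1, the sketch produces the interval 1 752565 752566, the same form as the lines in the BED file section.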
Line 59: Line 57:  
  python pileup2seq.py  -m ../resource/HGDP/HGDP_938.site -o test NA12878.chrom22.pileup  
 
  python pileup2seq.py  -m ../resource/HGDP/HGDP_938.site -o test NA12878.chrom22.pileup  
   −
== Estimating ancestry using LASER ==
+
== Estimate ancestries of sequence samples ==
   −
The easiest way to use LASER using provide example is:  
+
The easiest way to run LASER on its example data is:  
    
  ./laser -s pileup2seq/test.seq  -g resource/HGDP/HGDP_938.geno -c resource/HGDP/HGDP_938.RefPC.coord -o test -k 2
 
  ./laser -s pileup2seq/test.seq  -g resource/HGDP/HGDP_938.geno -c resource/HGDP/HGDP_938.RefPC.coord -o test -k 2
   −
Upon successful running, you will find result file "test.SeqPC.coord".
+
After a successful run, you will find the result file "test.SeqPC.coord".
    
<br>
 
<br>
Line 86: Line 84:     
The first and second columns represent the population id and individual id.  
 
The first and second columns represent the population id and individual id.  
From the third column, the number represents the genotype.
+
From the third column, each number represents a genotype.
 
In this geno file, we have 632,960 columns, which contain 632,958 markers from column 3 to the last column.
 
In this geno file, we have 632,960 columns, which contain 632,958 markers from column 3 to the last column.
    
== Seq file  ==
 
== Seq file  ==
Seq file organizes the sequencing information into LASER readable format.
+
A seq file is generated from pileup files. It contains sequencing information organized in a LASER-readable format.
The first two columns are intended for population id and individual id.
+
The first two columns represent population id and individual id.
Subsequent columns are total read depth and reference base count.
+
Subsequent columns are total read depths and reference base counts.
For example, column 3 and 4 are 0, 0 in the following example, meaning at first marker, the read depth is 0 and none of read has reference base.
+
For example, columns 3 and 4 are 0, 0 in the following example. That means that at the first marker, the sequence read depth is 0 and thus none of the reads has the reference base.
We enforce tab delimiters between markers and space delimiters between read depth and reference base counts.
+
We enforce tab delimiters between markers and space delimiters between each marker's read depth and reference base count.
An example seq file is shown below:
+
One line of a seq file looks like this:
    
  NA12878.chrom22 NA12878.chrom22 0 0 0 0 0 0 0 0 0  
 
  NA12878.chrom22 NA12878.chrom22 0 0 0 0 0 0 0 0 0  
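
To make the delimiter rules concrete, here is a minimal Python sketch of reading one seq line. It assumes that the two ID columns are also separated by tabs from each other and from the markers, which the description above does not state explicitly. The function name parse_seq_line is hypothetical and is not part of LASER.

```python
def parse_seq_line(line):
    """Split one seq-file line into IDs and per-marker (depth, ref_count) pairs."""
    fields = line.rstrip("\n").split("\t")
    pop_id, indiv_id = fields[0], fields[1]  # assumed to be tab-delimited too
    markers = []
    for field in fields[2:]:
        if not field.strip():
            continue  # tolerate a trailing tab at the end of the line
        # Within a marker, read depth and reference base count are space-delimited.
        depth, ref_count = field.split()
        markers.append((int(depth), int(ref_count)))
    return pop_id, indiv_id, markers


line = "NA12878.chrom22\tNA12878.chrom22\t0 0\t3 2\n"
pop_id, indiv_id, markers = parse_seq_line(line)
print(pop_id, indiv_id, markers)
```

In this made-up line, the first marker has depth 0 and the second has depth 3 with 2 reads carrying the reference base.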
Line 101: Line 99:  
== Pileup file  ==
 
== Pileup file  ==
   −
Pileup file are generate by samtools. An example pileup file is listed below:
+
Pileup files are generated using samtools. An example pileup file is shown below:
 
   
 
   
 
  22 17094749 A 1 c D
 
  22 17094749 A 1 c D
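
The relationship between a pileup line and the two numbers stored per marker in a seq file can be illustrated with a short Python sketch. This is not the logic of pileup2seq.py, which is not shown on this page. It assumes the standard samtools pileup columns (chromosome, position, reference base, depth, read bases, base qualities) and assumes that "reference base count" means the number of reads matching the reference genome base, written as '.' or ',' in the read-bases column. Both function names are hypothetical.

```python
def count_reference_matches(read_bases):
    """Count '.' and ',' symbols (reference matches) in a pileup read-bases string."""
    count = 0
    i = 0
    while i < len(read_bases):
        symbol = read_bases[i]
        if symbol == "^":
            i += 2  # '^' marks a read start; the next character is a mapping quality
            continue
        if symbol in "+-":
            # Indel: a sign, the indel length in digits, then that many bases.
            j = i + 1
            while j < len(read_bases) and read_bases[j].isdigit():
                j += 1
            i = j + int(read_bases[i + 1:j])
            continue
        if symbol in ".,":
            count += 1
        i += 1
    return count


def pileup_line_to_counts(line):
    """Return (depth, reference_match_count) for one pileup line."""
    fields = line.split()
    depth = int(fields[3])
    read_bases = fields[4] if len(fields) > 4 else ""
    return depth, count_reference_matches(read_bases)


print(pileup_line_to_counts("22 17094749 A 1 c D"))
```

For the example line above, the single read carries a 'c' rather than a reference match, so under these assumptions the depth is 1 and the reference match count is 0.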
Line 116: Line 114:     
== BED file  ==
 
== BED file  ==
BED file represents genomic regions and it follows UCSC conventions:
+
A BED file represents genomic regions and follows [http://genome.ucsc.edu/FAQ/FAQformat.html#format1 UCSC conventions]:
    
  1 752565 752566
 
  1 752565 752566
Line 127: Line 125:     
== Coord file  ==
 
== Coord file  ==
Coord files are used to represents the ancestries of both reference samples and sequencing samples.
+
Coord files represent the ancestries of both reference samples and sequence samples.
 
An example coord file looks like below:
 
An example coord file looks like below:
   Line 139: Line 137:     
The columns are: popID means "population ID", indivID means "individual ID", L1 means the number of loci that have been covered, Ci means "average coverage", and t means Procrustes similarity.
 
The columns are: popID means "population ID", indivID means "individual ID", L1 means the number of loci that have been covered, Ci means "average coverage", and t means Procrustes similarity.
PC1, PC2 means coordinates of first and second principal components.
+
PC1 and PC2 are the coordinates on the first and second principal components. You may notice that Ci and t are omitted in the coord files of reference samples. The reason is that reference samples use
 +
genotypes and do not have coverage information.
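
Because the set of columns differs between the two kinds of coord files, a reader that relies on the header line is convenient. Below is a minimal Python sketch, assuming a whitespace-delimited file whose first line is the header, whose first two columns are the IDs, and whose remaining columns are all numeric. The function name read_coord is hypothetical and is not part of LASER.

```python
def read_coord(lines):
    """Parse coord-file lines into a list of dicts keyed by the header names."""
    header = None
    rows = []
    for line in lines:
        fields = line.split()
        if not fields:
            continue  # skip blank lines
        if header is None:
            header = fields  # the first non-blank line names the columns
            continue
        row = dict(zip(header, fields))
        # Every column after popID and indivID is assumed to be numeric.
        for name in header[2:]:
            row[name] = float(row[name])
        rows.append(row)
    return rows


example = [
    "popID  indivID  L1    Ci        t        PC1      PC2",
    "YRI    NA19240  1735  0.404142  0.990264  59.8379  -45.2765",
]
for row in read_coord(example):
    print(row["indivID"], row["PC1"], row["PC2"])
```

Since the column names come from the header, the same sketch reads a reference coord file that lacks the Ci and t columns.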
    
== Site file ==
 
== Site file ==
Line 154: Line 153:  
= Advanced options =
 
= Advanced options =
   −
LASER has advanced options including (1) running parallel jobs; (2) increase ancestry inference using repeated runs; (3) generate PCA coordiates for genotypes.
+
LASER has advanced options including (1) parallel computing; (2) increasing ancestry inference accuracy using repeated runs; (3) generating PCA coordinates using genotypes.
See the [http://www.sph.umich.edu/csg/chaolong/LASER/LASER_Manual.pdf manual] for detailed information.
+
See the [http://www.sph.umich.edu/csg/chaolong/LASER/LASER_Manual.pdf LASER Manual] for detailed information.
    
= Contact  =
 
= Contact  =