Changes

From Genome Analysis Wiki
Jump to navigationJump to search
5,845 bytes added ,  18:33, 15 July 2015
Line 1: Line 1:  
LiftOver is a necesary step to bring all genetical analysis to the same reference build.  
 
LiftOver is a necesary step to bring all genetical analysis to the same reference build.  
Particularly, our current data are mainly in either NCBI build 36 (UCSC hg 18) or NCBI build 37 (UCSC hg19).
+
LiftOver can have three use cases:
Although lift over can be from higher build to lower build, we always recommend lift lower build to higher/current build.
     −
LiftOver is not hard. The easier way is to use UCSC liftOver tool to lift [http://genome.ucsc.edu/FAQ/FAQformat.html#format1 BED format] file to BED format file.
+
(1) [[#Lift genome positions | Convert genome position from one genome assembly to another genome assembly]]
With additional steps, we can also lift Merlin and PLINK data files.
     −
Besides introducing lift over genomic positions, lifting SNPs is also introduced.
+
In most scenarios, we have known genome positions in NCBI build 36 (UCSC hg 18) and hope to lift them over to NCBI build 37 (UCSC hg19).
   −
== Lift over using BED files ==
+
(2) [[#Lift dbSNP rs numbers | Convert dbSNP rs number from one build to another]]
 +
 
 +
(3) [[#Lift Merlin/PLINK format | Convert both genome position and dbSNP rs number over different versions]]
 +
 
 +
It is likely to see such type of data in Merlin/PLINK format.
 +
 
 +
We will explain the work flow for the above three cases. In the rest of this article,
 +
our example is to lift over from lower/older build to newer/higher build, as it is the common practice.
 +
 
 +
Using different tools, liftOver can be easy.
 +
For example, UCSC liftOver tool is able to lift [http://genome.ucsc.edu/FAQ/FAQformat.html#format1 BED format] file between builds.
 +
With our customized scripts, we can also lift rsNumber and Merlin/PLINK data files.
 +
 
 +
== Lift genome positions ==
 +
Genome positions are best represented in [http://genome.ucsc.edu/FAQ/FAQformat.html#format1 BED format]. UCSC provides tools to convert BED file from one genome assembly to another.
    
=== Binary liftOver tool ===
 
=== Binary liftOver tool ===
Download the [http://hgdownload.cse.ucsc.edu/admin/exe/linux.x86_64/liftOver liftOver binary] from UCSC and [http://hgdownload.cse.ucsc.edu/goldenPath/hg18/liftOver/hg18ToHg19.over.chain.gz hg18 to hg 19 chain file]
+
We need [http://hgdownload.cse.ucsc.edu/admin/exe/linux.x86_64/liftOver liftOver binary] from UCSC and [http://hgdownload.cse.ucsc.edu/goldenPath/hg18/liftOver/hg18ToHg19.over.chain.gz hg18 to hg 19 chain file].
   −
Provide BED format file (input.bed)
+
Provide BED format file (e.g. input.bed)
    
NOTE: Use the 'chr' before each chromosome name
 
NOTE: Use the 'chr' before each chromosome name
Line 25: Line 37:  
     liftOver input.bed hg18ToHg19.over.chain.gz output.bed unlifted.bed
 
     liftOver input.bed hg18ToHg19.over.chain.gz output.bed unlifted.bed
   −
unlifted file will contain all genomic positions that cannot be lifted. The reason for that varies. See [[#Various reasons that lift over could fail | Various reasons that lift over could fail]]
+
unlifted.bed file will contain all genome positions that cannot be lifted. The reason for that varies. See [[#Various reasons that lift over could fail | Various reasons that lift over could fail]]
    
=== Web interface ===
 
=== Web interface ===
 
Alternatively, you can lift over BED file in web interface
 
Alternatively, you can lift over BED file in web interface
 
at: [http://genome.ucsc.edu/cgi-bin/hgLiftOver Link]
 
at: [http://genome.ucsc.edu/cgi-bin/hgLiftOver Link]
Web interface can tell you why some genomic position cannot
+
Web interface can tell you why some genome position cannot
 
be lifted if you click "Explain failure messages"
 
be lifted if you click "Explain failure messages"
   −
== Lift Merlin format ==
+
== Lift dbSNP rs numbers ==
 +
rs number is release by dbSNP. UCSC also make their own copy from each dbSNP version. Be aware that the same version of dbSNP from these two centers are not the same.
 +
When we convert rs number from lower version to higher version, there are practically two ways.
 +
 
 +
=== Use RsMergeArch and SNPHistory ===
 +
It is necessary to quickly summarize how dbSNP merge/re-activate rs number:
 +
 
 +
# when different rs number are found to refer to the same SNP, then higher rs number will be merged to lower rs number, and the merging will be recorded in RsMergeArch.bcp.gz.
 +
# when rs number have to be retracted, rs number will be recorded in SNPHistory.bcp.gz
 +
# a retracted SNP can be [http://www.ncbi.nlm.nih.gov/books/NBK44496/#Schema.rs4823903_which_has_merged_into re-activated] in SNPHistory.bcp.gz by adding comment
 +
 
 +
With the above in mind, we are able to combine these two tables to obtain the relationship between older rs number and new rs number.
 +
We have developed a script (for internal use), named [http://genome.sph.umich.edu/wiki/LiftRsNumber.py liftRsNumber.py] for lift rs numbers between builds.
 +
This scripts require RsMergeArch.bcp.gz  and SNPHistory.bcp.gz, those can be found in [[#Resources | Resources]].
 +
 
 +
Example input:
 +
<pre>
 +
3000
 +
3001
 +
3002
 +
</pre>
 +
 
 +
Command:
 +
<pre>
 +
python liftRsNumber.py input.rs > output.rs
 +
</pre>
 +
 
 +
Example output:
 +
<pre>
 +
unchanged      3000
 +
lifted  2032
 +
unchanged      3002
 +
</pre>
 +
 
 +
== Lift Merlin/PLINK format ==
 +
In Merlin/PLINK .map files, each line contains both genome position and dbSNP rs number. Our goal here is to use both information to liftOver as many position as possible.
 +
There are 3 methods to liftOver and we recommend the first 2 method. The first method is common and applicable in most cases, and in our observations it lifts the most genome positions, however, it does not reflect the rs  number change between different dbSNP builds. The second method is more robust in the sense that each lifted rs number has valid genome position, as it lift over old rs number as the first step by using dbSNP data. The third method is not straigtforward, and we just briefly mention it.
 +
 
 +
=== Lift Merlin format ===
    
PLINK format and [http://www.sph.umich.edu/csg/abecasis/Merlin/tour/input_files.html Merlin format are nearly identical].  
 
PLINK format and [http://www.sph.umich.edu/csg/abecasis/Merlin/tour/input_files.html Merlin format are nearly identical].  
Line 43: Line 93:  
to obtain Merlin .map file.  
 
to obtain Merlin .map file.  
   −
== Lift PLINK format ==
+
=== Lift PLINK format ===
 
[http://pngu.mgh.harvard.edu/~purcell/plink/data.shtml PLINK] format usually referrs to .ped and .map files.  
 
[http://pngu.mgh.harvard.edu/~purcell/plink/data.shtml PLINK] format usually referrs to .ped and .map files.  
   −
We recommend split the jobs in several steps:
+
 
(1) convert .map to .bed file
+
==== Method 1 ====
 +
We mainly use UCSC LiftOver binary tools to help lift over.
 +
We have a script [[#Resources | liftMap.py]], however, it is recommended to understand the job step by step:
 +
 
 +
(1) Convert .map to .bed file
 +
 
 
By rearrange columns of .map file, we obtain a standard BED format file.
 
By rearrange columns of .map file, we obtain a standard BED format file.
   −
(2) liftOver .bed file
+
(2) LiftOver .bed file
Use method mentioned above to convert .bed file from one build to another.
+
 
 +
Use method mentioned [[#Lift genome positions | above]] to convert .bed file from one build to another.
 +
 
 +
(3) Convert lifted .bed file back to .map file
   −
(3) convert lifted .bed file back to .map file
   
Rearrange column of .map file to obtain .bed file in the new build.
 
Rearrange column of .map file to obtain .bed file in the new build.
   −
(4) modify .ped file
+
(4) Modify .ped file
 +
 
 
.ped file have many column files. By convention, the first six columns are family_id, person_id, father_id, mother_id, sex, and phenotype.
 
.ped file have many column files. By convention, the first six columns are family_id, person_id, father_id, mother_id, sex, and phenotype.
From the 7th column, there are two letters/digits representing a genotype at the certain marker. In step (2), as some genomic positions cannot
+
From the 7th column, there are two letters/digits representing a genotype at the certain marker. In step (2), as some genome positions cannot
 
be lifted to the new version, we need to drop their corresponding columns from .ped file to keep consistency. You can use PLINK --exclude those snps,
 
be lifted to the new version, we need to drop their corresponding columns from .ped file to keep consistency. You can use PLINK --exclude those snps,
 
see [http://pngu.mgh.harvard.edu/~purcell/plink/dataman.shtml#exclude Remove a subset of SNPs].
 
see [http://pngu.mgh.harvard.edu/~purcell/plink/dataman.shtml#exclude Remove a subset of SNPs].
    
(5) (optionally) change the rs number in the .map file
 
(5) (optionally) change the rs number in the .map file
 +
 
Similar to the human reference build, dbSNP also have different versions. You may consider change rs number from the old dbSNP version to new dbSNP version  
 
Similar to the human reference build, dbSNP also have different versions. You may consider change rs number from the old dbSNP version to new dbSNP version  
depending on your needs Such steps are described in [#Lift dbSNP rs numbers | Lift dbSNP rs numbers].
+
depending on your needs. Such steps are described in [[#Lift dbSNP rs numbers | Lift dbSNP rs numbers]].
   −
(6) (optionally) additional method to lift dbSNP postion
+
==== Method 2 ====
NCBI dbSNP team has provided a provisional map for converting the genomic position of a larget set dbSNP from NCBI build 36 to NCBI build 37.
+
The idea is to use [[#Resources |LiftRsNumber.py]] to convert old rs number to new rs number, use the data file b132_SNPChrPosOnRef_37_1.bcp.gz (a data file containing each dbSNP and its positions in NCBI build 37), and adjust .map and .ped files accordingly.  
In the second step, we have obtained unlifted genomic positions, so we can try to use the table to convert those unlfted dbSNPs.
  −
After this step, there are still some SNPs that cannot be lifted, and they are mostly located on non-reference chromosome.
     −
== Lift dbSNP rs numbers ==
+
(1) Extract and lift rs numbers
rs number is release by dbSNP. UCSC also make their own copy from each dbSNP version. Be aware that the same version of dbSNP from these two centers are not the same.
+
 
When we convert rs number from lower version to higher version, there are practically two ways.
+
Use the tools [[#Use RsMergeArch and SNPHistory | LiftRsNumber.py]] to lift the rs number in the map file from old build to new build.
 +
 
 +
(2) Lookup SNP positions from rs number
 +
 
 +
dbSNP provides a file [[#Resources | b132_SNPChrPosOnRef_37_1.bcp.gz]] which contains rsNumber, chromosome and its position.
 +
Use this file along with the new rsNumber obtained in the first step.
 +
In practice, some rs numbers do not exist in build 132, or not suitable to be considered ( e.g. they do not reside on human reference, or they are mapped to multiple locations, these scenarios are noted by the chromosome column with values like "AltOnly", "Multi", "NotOn", "PAR", "Un"), we can drop them in the liftover procedure.
 +
We will obtain the rs number and its position in the new build after this step.
 +
 
 +
(3) Lift .map file and .ped file
 +
 
 +
To lift over .map files, we can scan its content line by line, and skip those not lifted rs number.
 +
Accordingly, it is necessary to drop the un-lifted SNP genotypes from .ped file.
 +
 
 +
==== Method 3 ====
 +
NCBI dbSNP team has provided a [[ #Resources | provisional map ]] for converting the genome position of a larget set dbSNP from NCBI build 36 to NCBI build 37.
 +
In the second step, we have obtained unlifted genome positions, so we can try to use the table to convert those unlfted dbSNPs.
 +
After this step, there are still some SNPs that cannot be lifted, as they are mostly located on non-reference chromosome.
 +
Note: due to the limitation of the provisional map, some SNP can have multiple locations.
 +
For example, we cannot convert rs10000199 to chromosome 4, 7, 12.
 +
<pre>
 +
10000199 A/G 4 166142415 166142415 2 3 G + 4 165922965 165922965 2 3 G +
 +
10000199 A/G 7 4589694 4589694 2 3 C - 7 4623168 4623168 2 3 C -
 +
10000199 A/G 12 57008620 57008620 2 3 C - 12 58722353 58722353 2 3 C -
 +
10000199 A/G 5 156018406 156018406 2 3 C - 5 156085828 156085828 2 3 C -
 +
</pre>
 +
 
 +
We can dissect this method into steps:
 +
 
 +
(1) Remove invalid record in dbSNP provisional map.
 +
 
 +
Provisional map have duplicated rs number or the chromsome in the new build can be "Unable to map"(UN), we need to clean this table.
 +
 
 +
(2) Use provisional map to update .map file
   −
=== Use RsMergeArch and SNPHistory ===
+
By joining .map file and this provisional map, we can obtain the new genome position in the new build.
In short,  
+
Note: provisional map uses 1-based chromosomal index. Things will get tricker if we want to lift non-single site SNP e.g. AA/GG
 +
Since provisional map provides a range in this case, it is necessary to know the genome position of that single base provided in the .map file,
 +
and then we can look up the table, so it is not straigtforward.
   −
# when different rs number are found to refer to the same SNP, then higher rs number will be merged to lower rs number, and the merging will be recorded in RsMergedArch.bcp.gz.
+
(3) Adjust .map and .ped file
# when rs number have to be retracted, rs number will be recorded in SNPHistory.bcp.gz
     −
So we need to combine these two tables to obtain the relationship between older rs number and new rs number.
+
For those lifted dbSNP, we need to keep them in the .map files, otherwise, we need to delete them.
Luckily, we have a script for internal use. See liftRsNumber.py
+
Accordingly, we need to deleted SNP genotypes for those cannot be lifted.
   −
== Various reasons that lift over could fail ==
+
== Various reasons that lift over can fail ==
   −
=== Genomic position cannot be lifted ===
+
=== Genome position cannot be lifted ===
When a SNP resides in a contig that only exists in older reference build, liftOver cannot give it new genomic.
+
When a SNP resides in a contig that only exists in older reference build, liftOver cannot give it new genome.
    
You can try the following SNP (in BED format) in UCSC online liftOver site:
 
You can try the following SNP (in BED format) in UCSC online liftOver site:
Line 95: Line 186:  
=== SNP in higher build are located in non-referernce assembly ===
 
=== SNP in higher build are located in non-referernce assembly ===
 
Some SNP are not in autosomes or sex chromosomes in NCBI build 37. dbSNP does not include them.  
 
Some SNP are not in autosomes or sex chromosomes in NCBI build 37. dbSNP does not include them.  
You cannot use dbSNP database to lookup its genomic position by rs number.
+
You cannot use dbSNP database to lookup its genome position by rs number.
    
Take rs1006094 as an example:
 
Take rs1006094 as an example:
Line 101: Line 192:  
Thus it is probably not very useful to lift this SNP.
 
Thus it is probably not very useful to lift this SNP.
   −
=== Cannot find rs number in newer dbSNP build ===
+
=== rs number changed in newer dbSNP build ===
 
It is possible that new dbSNP build does not have certain rs numbers.
 
It is possible that new dbSNP build does not have certain rs numbers.
 
When dbSNp release new build, higher rs number may be merged to lower rs number because of those rs numbers are actually the same SNP.
 
When dbSNp release new build, higher rs number may be merged to lower rs number because of those rs numbers are actually the same SNP.
This merge process can be complicate. For short description, see [#Use RsMergeArch and SNPHistory | Use RsMergeArch and SNPHistory].  
+
This merge process can be complicate. For short description, see [[ #Use RsMergeArch and SNPHistory | Use RsMergeArch and SNPHistory ]].  
 
For detail, see:
 
For detail, see:
   Line 124: Line 215:  
* SNPs that are not mapped on the reference genome (GRCh37)
 
* SNPs that are not mapped on the reference genome (GRCh37)
   −
For UCSC release, see [#Resources | UCSC dbSNP track note]
+
For UCSC release, see [[ #Resources | UCSC dbSNP track note ]]
    
Use rs1054140 as an example:
 
Use rs1054140 as an example:
Line 142: Line 233:  
</pre>
 
</pre>
   −
== Resouces ==
+
== Resources ==
 +
* liftRsNumber.py [[liftRsNumber.py]] and its interal location: /net/fantasia/home/zhanxw/amd/analyze/verifyBamID/liftRsNumber.py
 +
* liftMap.py [[liftMap.py]]
 
* NCBI provisional map [ftp://ftp.ncbi.nih.gov:/snp/organisms/human_9606/misc/exchange/Remap_36_3_37_1.txt.gz file] and [ftp://ftp.ncbi.nih.gov:/snp/organisms/human_9606/misc/exchange/Remap_36_3_37_1.info info]
 
* NCBI provisional map [ftp://ftp.ncbi.nih.gov:/snp/organisms/human_9606/misc/exchange/Remap_36_3_37_1.txt.gz file] and [ftp://ftp.ncbi.nih.gov:/snp/organisms/human_9606/misc/exchange/Remap_36_3_37_1.info info]
* NCBI RgMergeArch file and [http://www.ncbi.nlm.nih.gov/SNP/snp_db_table_description.cgi?t=RsMergeArch schema]
+
* NCBI RgMergeArch [ftp://ftp.ncbi.nih.gov:/snp/organisms/human_9606/database/organism_data/RsMergeArch.bcp.gz file] and [http://www.ncbi.nlm.nih.gov/SNP/snp_db_table_description.cgi?t=RsMergeArch schema]
* NCBI SNPHistory file and [http://www.ncbi.nlm.nih.gov/SNP/snp_db_table_description.cgi?t=SNPHistory schema]
+
* NCBI SNPHistory [ftp://ftp.ncbi.nih.gov:/snp/organisms/human_9606/database/organism_data/SNPHistory.bcp.gz file] and [http://www.ncbi.nlm.nih.gov/SNP/snp_db_table_description.cgi?t=SNPHistory schema]
 +
* NCBI SNPChrPosOnRef build 132 [http://qbrc.swmed.edu/zhanxw/software/liftOver/b132_SNPChrPosOnRef_37_1.bcp.gz file] and [http://www.ncbi.nlm.nih.gov/SNP/snp_db_table_description.cgi?t=SNPChrPosOnRef schema]. If this link becomes unavailable, please consider using this updated file ([ftp://ftp.ncbi.nih.gov/snp/organisms/human_9606/database/organism_data/b144_SNPChrPosOnRef_107.bcp.gz link]).
 +
 
 
* How UCSC dbSNP differs from NCBI dbSNP [http://genomewiki.ucsc.edu/index.php/DbSNP_Track_Notes UCSC dbSNP track note]
 
* How UCSC dbSNP differs from NCBI dbSNP [http://genomewiki.ucsc.edu/index.php/DbSNP_Track_Notes UCSC dbSNP track note]
 
* The dbSNP mapping process [http://www.ncbi.nlm.nih.gov/books/NBK44455/ link]
 
* The dbSNP mapping process [http://www.ncbi.nlm.nih.gov/books/NBK44455/ link]
 
* NCBI dbSNP release 132 [ftp://ftp.ncbi.nih.gov:/snp/organisms/human_9606/VCF/v4.0/ByChromosomeNoGeno/00-All.vcf.gz 00-All.vcf.gz]
 
* NCBI dbSNP release 132 [ftp://ftp.ncbi.nih.gov:/snp/organisms/human_9606/VCF/v4.0/ByChromosomeNoGeno/00-All.vcf.gz 00-All.vcf.gz]
 
* UCSC dbSNP release 132 [http://hgdownload.cse.ucsc.edu/goldenPath/hg19/database/snp132.txt.gz snp132.txt.gz]
 
* UCSC dbSNP release 132 [http://hgdownload.cse.ucsc.edu/goldenPath/hg19/database/snp132.txt.gz snp132.txt.gz]
 +
 +
== Third party contribution ==
 +
 +
* liftOver for BEDPE format by [mailto://doug.phanstiel@gmail.com Doug] - [http://www.sharedproteomics.com/forum/showthread.php?2930-liftOverBedpe.py-Convert-genome-coordinates-for-bedpe-files&p=4854#post4854 liftOverBedpe.py].
    
== Acknowledge ==
 
== Acknowledge ==
255

edits

Navigation menu