From Genome Analysis Wiki
LiftOver
LiftOver is a necessary step to bring all genetic analyses to the same reference build.
=== Binary liftOver tool ===
Provide a BED format file (e.g. input.bed).
NOTE: Use the 'chr' prefix before each chromosome name.
liftOver input.bed hg18ToHg19.over.chain.gz output.bed unlifted.bed
The unlifted.bed file will contain all genomic positions that cannot be lifted. The reasons vary; see [[#Various reasons that lift over could fail | Various reasons that lift over could fail]].
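To get a quick overview of why positions ended up in unlifted.bed, the reasons can be tallied. The sketch below assumes that liftOver writes a comment line starting with '#' (for example "#Deleted in new") before each unlifted record; the function name and the example records are made up for illustration.

```python
from collections import Counter


def count_unlifted_reasons(lines):
    """Tally the '#'-prefixed reason lines found in an unlifted file.

    Assumption: each unlifted record is preceded by one comment line
    that starts with '#' and states the reason.
    """
    reasons = Counter()
    for line in lines:
        line = line.strip()
        if line.startswith("#"):
            reasons[line.lstrip("#").strip()] += 1
    return reasons


# Made-up records, only to show the expected layout of the file.
example = [
    "#Deleted in new",
    "chr1\t100\t101\tsnp1",
    "#Partially deleted in new",
    "chr2\t200\t201\tsnp2",
    "#Deleted in new",
    "chr3\t300\t301\tsnp3",
]

for reason, n in count_unlifted_reasons(example).most_common():
    print(n, reason)
```

For a real file, pass an open file handle, e.g. <code>count_unlifted_reasons(open("unlifted.bed"))</code>.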
=== Web interface ===
be lifted if you click "Explain failure messages"
== Lift dbSNP rs numbers ==
rs numbers are released by dbSNP. UCSC also makes its own copy of each dbSNP version. Be aware that the same dbSNP version distributed by these two centers is not identical.

When we convert rs numbers from a lower version to a higher version, there are practically two ways.

=== Use RsMergeArch and SNPHistory ===
It is necessary to quickly summarize how dbSNP merges and re-activates rs numbers:
# When different rs numbers are found to refer to the same SNP, the higher rs number is merged into the lower rs number, and the merge is recorded in RsMergeArch.bcp.gz.
# When an rs number has to be retracted, it is recorded in SNPHistory.bcp.gz.
# A retracted SNP can be [http://www.ncbi.nlm.nih.gov/books/NBK44496/#Schema.rs4823903_which_has_merged_into re-activated] in SNPHistory.bcp.gz by adding a comment.

With the above in mind, we can combine these two tables to obtain the relationship between older and newer rs numbers.

We have developed a script (for internal use), named liftRsNumber.py [/net /dumbo/net/dumbo/home/zhanxw/amd/analyze/verifyBamID/], for lifting rs numbers between builds.
This script requires RsMergeArch.bcp.gz and SNPHistory.bcp.gz, which can be found in [[#Resources | Resources]].

Example input:
<pre>
3115860
12124819
2229002
1130683
</pre>

Command:
<pre>python liftRsNumber.py input.rs > output.rs</pre>

Example output:
<pre>
unchanged 3115860
unchanged 12124819
lifted 2229002
lifted 1130683
</pre>

== Lift Merlin/PLINK format ==
In Merlin/PLINK .map files, each line contains both a genomic position and a dbSNP rs number. Our goal here is to use both pieces of information to liftOver as many positions as possible.

There are 3 methods to liftOver, and we recommend the first 2. The first method is common and lifts most genomic positions; however, it does not reflect the dbSNP build change. The second method is more robust in the sense that each lifted rs number has a valid genomic position, as it uses dbSNP as the data source.
The third method is not straightforward, and we only mention it briefly.

=== Lift Merlin format ===
PLINK format and [http://www.sph.umich.edu/csg/abecasis/Merlin/tour/input_files.html Merlin format] are nearly identical.
to obtain Merlin .map file.
=== Lift PLINK format ===
[http://pngu.mgh.harvard.edu/~purcell/plink/data.shtml PLINK] format usually refers to .ped and .map files.
==== Method 1 ====
We mainly use the UCSC LiftOver binary tool to lift over. We have a script, [[#Resources | liftMap.py]]; however, it is recommended to understand the job step by step:

(1) Convert .map to .bed file

By rearranging the columns of the .map file, we obtain a standard BED format file.
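The column rearrangement in step (1) can be sketched as follows. This is an illustration under stated assumptions, not the liftMap.py script itself: it assumes a 4-column PLINK .map line (chromosome, marker, genetic distance, base-pair position) with 1-based positions, and the function name and the example marker are made up.

```python
def map_line_to_bed(line):
    """Convert one 4-column PLINK .map line into a 4-column BED line.

    Assumed input columns: chromosome, marker name, genetic distance,
    1-based base-pair position.
    """
    chrom, marker, _genetic_dist, pos = line.split()
    pos = int(pos)
    # BED intervals are 0-based and half-open, so the 1-based position p
    # becomes the interval [p - 1, p). The 'chr' prefix is added because
    # liftOver expects it.
    return "chr%s\t%d\t%d\t%s" % (chrom, pos - 1, pos, marker)


# Made-up marker, for illustration only.
print(map_line_to_bed("1 snp1 0 1000"))
```

The genetic distance is not carried in the BED file; the marker name in the fourth column is what allows the lifted positions to be joined back to the original markers.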
(2) LiftOver .bed file

Use the method mentioned [[#Lift genomic positions | above]] to convert the .bed file from one build to another.

(3) Convert lifted .bed file back to .map file

Rearrange the columns of the lifted .bed file to obtain a .map file in the new build.
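Step (3) is the reverse rearrangement. The sketch below assumes the BED file has four columns (chromosome, start, end, marker name) and that the end coordinate is the 1-based position of the marker; the function name is hypothetical, and the genetic distance is written as a placeholder 0 because it is not stored in the BED file.

```python
def bed_line_to_map(line, genetic_dist="0"):
    """Convert one lifted 4-column BED line back into a PLINK .map line.

    Assumption: the BED end coordinate equals the 1-based marker position.
    """
    chrom, _start, end, marker = line.split()[:4]
    # Strip the 'chr' prefix that was required by liftOver.
    if chrom.startswith("chr"):
        chrom = chrom[3:]
    return "%s\t%s\t%s\t%s" % (chrom, marker, genetic_dist, end)


# Made-up lifted record, for illustration only.
print(bed_line_to_map("chr1\t1099\t1100\tsnp1"))
```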
(4) Modify .ped file

A .ped file has many columns. By convention, the first six columns are family_id, person_id, father_id, mother_id, sex, and phenotype.
Starting from the 7th column, each pair of letters/digits represents the genotype at one marker. In step (2), as some genomic positions cannot be lifted, the genotypes of those markers must be dropped from the .ped file.
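Dropping the genotypes of markers that were not lifted can be sketched as follows. The sketch assumes whitespace-separated alleles (two fields per marker after the first six columns) and that the genotype pairs appear in the same order as the lines of the original .map file; the function name and the example data are made up.

```python
def filter_ped_line(line, keep):
    """Keep only the genotype pairs of lifted markers in one .ped line.

    keep: list of booleans, one per marker in the original .map order.
    """
    fields = line.split()
    meta, alleles = fields[:6], fields[6:]
    if len(alleles) != 2 * len(keep):
        raise ValueError("expected two allele fields per marker")
    kept = []
    for i, flag in enumerate(keep):
        if flag:
            kept.extend(alleles[2 * i:2 * i + 2])
    return " ".join(meta + kept)


# Made-up example: three markers, the second one was not lifted.
old_markers = ["snp1", "snp2", "snp3"]
lifted = {"snp1", "snp3"}
keep = [m in lifted for m in old_markers]
print(filter_ped_line("F1 P1 0 0 1 2 A A C G T T", keep))
```

The new .map file has to list the surviving markers in the same order as the genotype pairs that remain in the .ped file.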
(5) (Optional) Change the rs numbers in the .map file

Similar to the human reference build, dbSNP also has different versions. Depending on your needs, you may consider changing rs numbers from the old dbSNP version to the new dbSNP version. Such steps are described in [[#Lift dbSNP rs numbers | Lift dbSNP rs numbers]].
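The rs number update in step (5) follows the merge and retraction rules of RsMergeArch and SNPHistory. The sketch below is not liftRsNumber.py; it only illustrates the idea under these assumptions: the merge table has already been loaded into a dictionary (higher rs number to the lower rs number it was merged into), the retracted and not re-activated rs numbers into a set, and merges are followed as a chain. The "retracted" label and all rs numbers in the example are made up.

```python
def lift_rs(rs, merged_into, retracted):
    """Return (status, rs_number) for one rs number.

    merged_into: dict mapping a merged rs number to the rs number it was
                 merged into (assumed to be loaded from RsMergeArch).
    retracted:   set of rs numbers that were retracted and not re-activated
                 (assumed to be loaded from SNPHistory).
    """
    current = rs
    seen = set()
    # Follow the merge chain; 'seen' guards against an accidental cycle.
    while current in merged_into and current not in seen:
        seen.add(current)
        current = merged_into[current]
    if current in retracted:
        return ("retracted", current)
    if current == rs:
        return ("unchanged", rs)
    return ("lifted", current)


# Made-up tables, for illustration only.
merged_into = {300: 200, 200: 100}
retracted = {400}
for rs in (100, 300, 400):
    print(*lift_rs(rs, merged_into, retracted))
```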
==== Method 2 ====
The idea is to use [[#Resources | LiftRsNumber.py]] to convert old rs numbers to new rs numbers, use the data file b132_SNPChrPosOnRef_37_1.bcp.gz (a data file containing each dbSNP and its position in NCBI build 37), and adjust the .map and .ped files accordingly.

(1) Extract and lift rs numbers

Use the tool [[#Use RsMergeArch and SNPHistory | LiftRsNumber.py]] to lift the rs numbers in the .map file from the old build to the new build.

(2) Look up SNP positions from rs numbers

dbSNP provides a file, [[#Resources | b132_SNPChrPosOnRef_37_1.bcp.gz]], which contains the rs number, chromosome, and position. Use this file along with the new rs numbers obtained in the first step. In practice, some rs numbers do not exist in build 132 or are not suitable to be considered (e.g. they do not reside on the human reference, or they are mapped to multiple locations; these scenarios are indicated by chromosome column values like "AltOnly", "Multi", "NotOn", "PAR", "Un"), so we can drop them in the liftover procedure. We will obtain the rs numbers and their positions in the new build after this step.

(3) Lift .map file and .ped file

To lift over .map files, we can scan the content line by line and skip the rs numbers that were not lifted. Accordingly, it is necessary to drop the un-lifted SNP genotypes from the .ped file.

==== Method 3 ====
The NCBI dbSNP team has provided a [[#Resources | provisional map]] for converting the genomic positions of a large set of dbSNP SNPs from NCBI build 36 to NCBI build 37.
In the second step, we have obtained unlifted genomic positions, so we can try to use the table to convert those unlifted dbSNPs.
After this step, there are still some SNPs that cannot be lifted, as they are mostly located on non-reference chromosomes.

Note: due to the limitation of the provisional map, some SNPs can have multiple locations. For example, we cannot convert rs10000199, which the map places on several chromosomes (4, 7, 12 and 5 in the rows below).
<pre>
10000199 A/G 4 166142415 166142415 2 3 G + 4 165922965 165922965 2 3 G +
10000199 A/G 7 4589694 4589694 2 3 C - 7 4623168 4623168 2 3 C -
10000199 A/G 12 57008620 57008620 2 3 C - 12 58722353 58722353 2 3 C -
10000199 A/G 5 156018406 156018406 2 3 C - 5 156085828 156085828 2 3 C -
</pre>

We can dissect this method into steps:
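One part of this method, keeping only the SNPs that the provisional map places at a single location, might be sketched as follows. The column layout is an assumption read off the example rows (rs number in the first field, the second chromosome and position in the 10th and 11th fields, taken here to be the build 37 location); the function name and the extra row for rs 999 are made up.

```python
from collections import defaultdict


def unique_new_locations(rows, rs_col=0, chrom_col=9, pos_col=10):
    """Map each rs number to its new location if that location is unique.

    Column indices are assumptions about the provisional map layout.
    rs numbers that appear with more than one target location are skipped.
    """
    targets = defaultdict(set)
    for row in rows:
        f = row.split()
        targets[f[rs_col]].add((f[chrom_col], int(f[pos_col])))
    return {rs: next(iter(locs))
            for rs, locs in targets.items() if len(locs) == 1}


rows = [
    "10000199 A/G 4 166142415 166142415 2 3 G + 4 165922965 165922965 2 3 G +",
    "10000199 A/G 7 4589694 4589694 2 3 C - 7 4623168 4623168 2 3 C -",
    "10000199 A/G 12 57008620 57008620 2 3 C - 12 58722353 58722353 2 3 C -",
    "10000199 A/G 5 156018406 156018406 2 3 C - 5 156085828 156085828 2 3 C -",
    # Made-up row with a single location, for illustration only.
    "999 A/G 1 100 100 2 3 A + 1 150 150 2 3 A +",
]
print(unique_new_locations(rows))
```

With these rows, rs10000199 is dropped because it has four candidate locations, and only the made-up rs 999 is kept.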
== Various reasons that lift over could fail ==
Thus it is probably not very useful to lift this SNP.
=== Cannot find rs number changed in newer dbSNP build ===
It is possible that a new dbSNP build does not have certain rs numbers.
When dbSNP releases a new build, a higher rs number may be merged into a lower rs number because those rs numbers actually refer to the same SNP.
</pre>
== Resources ==
* liftRsNumber.py [[Media: liftRsNumber.py]]
* liftMap.py [[Media: liftMap.py]]
* NCBI provisional map [ftp://ftp.ncbi.nih.gov:/snp/organisms/human_9606/misc/exchange/Remap_36_3_37_1.txt.gz file] and [ftp://ftp.ncbi.nih.gov:/snp/organisms/human_9606/misc/exchange/Remap_36_3_37_1.info info]
* NCBI RsMergeArch file and [http://www.ncbi.nlm.nih.gov/SNP/snp_db_table_description.cgi?t=RsMergeArch schema]