
From Genome Analysis Wiki

LiftOver

123 bytes added, 11:24, 10 August 2011
LiftOver has three use cases:
(1) [[#Lift genome positions | Convert genome positions from one genome assembly to another]]
In most scenarios, we have known genome positions in NCBI build 36 (UCSC hg18) and want to lift them over to NCBI build 37 (UCSC hg19).
(2) [[#Lift dbSNP rs numbers | Convert dbSNP rs numbers from one build to another]]
(3) [[#Lift Merlin/PLINK format | Convert both genome positions and dbSNP rs numbers between builds]]
Such data are usually stored in Merlin/PLINK format.
With our customized scripts, we can also lift rs numbers and Merlin/PLINK data files.
== Lift genome positions ==
Genome positions are best represented in [http://genome.ucsc.edu/FAQ/FAQformat.html#format1 BED format]. UCSC provides tools to convert a BED file from one genome assembly to another.
=== Binary liftOver tool ===
liftOver input.bed hg18ToHg19.over.chain.gz output.bed unlifted.bed
The unlifted.bed file will contain all genome positions that could not be lifted. The reasons vary; see [[#Various reasons that lift over could fail | Various reasons that lift over could fail]].
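To summarize why positions were not lifted, the unlifted file can be grouped by failure reason. The following is a minimal sketch, assuming that each unlifted record is preceded by a "#"-prefixed reason line (for example "#Deleted in new"), as the UCSC liftOver tool usually writes it. The function name and the marker names rsA and rsB are made up for illustration.

```python
from collections import defaultdict


def group_unlifted(lines):
    """Group unlifted BED records by the '#' reason line that precedes them."""
    reasons = defaultdict(list)
    current = "unknown"  # used if a record appears before any reason line
    for raw in lines:
        line = raw.strip()
        if not line:
            continue
        if line.startswith("#"):
            # Reason line, e.g. "#Deleted in new"
            current = line.lstrip("#").strip()
        else:
            # BED record: chrom, start, end, and optional further fields
            reasons[current].append(line.split())
    return dict(reasons)


# Illustrative input only; rsA and rsB are not real markers.
example = [
    "#Deleted in new",
    "chr1\t100\t101\trsA",
    "#Partially deleted in new",
    "chr2\t200\t201\trsB",
]
for reason, records in group_unlifted(example).items():
    print(reason, len(records))
```

With a real file, the same function can be called as group_unlifted(open("unlifted.bed")).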
=== Web interface ===
Alternatively, you can lift over a BED file using the web interface at [http://genome.ucsc.edu/cgi-bin/hgLiftOver Link].
The web interface can tell you why some genome positions cannot be lifted if you click "Explain failure messages".
== Lift Merlin/PLINK format ==
In Merlin/PLINK .map files, each line contains both a genome position and a dbSNP rs number. Our goal here is to use both pieces of information to lift over as many positions as possible.
There are three methods, and we recommend the first two. The first method is common and lifts most genome positions; however, it does not reflect changes between dbSNP builds. The second method is more robust in the sense that each lifted rs number has a valid genome position, because it uses dbSNP as its data source. The third method is not straightforward, so we only mention it briefly.
=== Lift Merlin format ===
(2) LiftOver .bed file
Use the method described [[#Lift genome positions | above]] to convert the .bed file from one build to another.
(3) Convert lifted .bed file back to .map file
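The conversion between .map and .bed lines can be sketched as follows. This is only an illustration, assuming a PLINK-style four-column .map file (chromosome, marker name, genetic distance, base-pair position) with 1-based positions, and a four-column BED file with a 0-based start and the marker name in the fourth column. The function names and the marker rs123 are hypothetical and are not taken from liftMap.py.

```python
def map_line_to_bed(line):
    """Turn one .map line into a BED line covering a single base."""
    chrom, marker, _genetic_distance, bp = line.split()
    pos = int(bp)
    # BED start is 0-based and end is exclusive, so a 1-based
    # position p becomes the interval [p - 1, p).
    # Numeric PLINK codes for sex chromosomes are not translated here.
    return f"chr{chrom}\t{pos - 1}\t{pos}\t{marker}"


def bed_line_to_map(line, genetic_distance="0"):
    """Turn one lifted BED line back into a .map line."""
    chrom, _start, end, marker = line.split()[:4]
    if chrom.startswith("chr"):
        chrom = chrom[3:]
    # The BED end coordinate equals the 1-based position.
    # The genetic distance is not carried through BED, so a
    # placeholder is written unless the caller supplies one.
    return f"{chrom}\t{marker}\t{genetic_distance}\t{end}"


print(map_line_to_bed("1 rs123 0 1000"))
print(bed_line_to_map("chr1\t999\t1000\trs123"))
```

Keeping the marker name in the fourth BED column is what allows lifted positions to be matched back to their rs numbers afterwards.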
.ped files have many columns. By convention, the first six columns are family_id, person_id, father_id, mother_id, sex, and phenotype.
From the 7th column onward, each pair of letters/digits represents the genotype at one marker. Because some genome positions cannot be lifted to the new build in step (2), we need to drop their corresponding columns from the .ped file to keep it consistent with the .map file. You can use the PLINK --exclude option to remove those SNPs;
see [http://pngu.mgh.harvard.edu/~purcell/plink/dataman.shtml#exclude Remove a subset of SNPs].
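PLINK --exclude is the route described above. Purely to illustrate the column arithmetic, here is a minimal sketch, assuming whitespace-separated fields with two allele fields per marker after the six leading columns. The function name and the marker names m1 to m3 are made up.

```python
def drop_markers_from_ped_line(line, markers, unlifted):
    """Remove the two allele columns of every unlifted marker.

    markers  -- marker names in .map order (one per genotype pair)
    unlifted -- set of marker names that could not be lifted
    """
    fields = line.split()
    kept = fields[:6]  # family_id, person_id, father_id, mother_id, sex, phenotype
    for i, marker in enumerate(markers):
        if marker not in unlifted:
            # Marker i occupies fields 6 + 2*i and 7 + 2*i.
            kept.extend(fields[6 + 2 * i: 8 + 2 * i])
    return " ".join(kept)


ped_line = "F1 P1 0 0 1 2 A A G T C C"
print(drop_markers_from_ped_line(ped_line, ["m1", "m2", "m3"], {"m2"}))
```

The same set of unlifted marker names must also be removed from the .map file so that both files stay in the same order.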
==== Method 3 ====
The NCBI dbSNP team has provided a [[#Resources | provisional map]] for converting the genome positions of a large set of dbSNP entries from NCBI build 36 to NCBI build 37. In the second step, we obtained the unlifted genome positions, so we can try to use this table to convert those unlifted dbSNP entries.
After this step, there are still some SNPs that cannot be lifted, mostly because they are located on non-reference chromosomes.
Note: due to the limitations of the provisional map, some SNPs can have multiple locations.
(2) Use provisional map to update .map file
By joining the .map file with the provisional map, we can obtain the genome positions in the new build.
Note: the provisional map uses 1-based chromosomal coordinates. Things get trickier if we want to lift a SNP that spans more than a single site, e.g. AA/GG.
Because the provisional map provides a range in this case, we need to know the genome position of the single base given in the .map file before we can look it up in the table, so this is not straightforward.
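The join itself can be sketched as below. This is a rough illustration under stated assumptions: the provisional map is taken to be already loaded into a hypothetical dictionary `remap` from rs number to a list of (chromosome, position) pairs in the new build, because the column layout of the NCBI file is not described here. Markers with no entry or with multiple locations are set aside rather than guessed.

```python
def update_map_records(map_records, remap):
    """Join .map records with a lookup table of new-build positions.

    map_records -- list of (chrom, marker, genetic_distance, bp) tuples
    remap       -- hypothetical dict: marker -> list of (chrom, bp) in the new build
    """
    updated, unresolved = [], []
    for chrom, marker, cm, bp in map_records:
        hits = remap.get(marker, [])
        if len(hits) == 1:
            new_chrom, new_bp = hits[0]
            updated.append((new_chrom, marker, cm, new_bp))
        else:
            # Missing from the table, or mapped to several locations.
            unresolved.append((chrom, marker, cm, bp))
    return updated, unresolved


# Illustrative values only; these markers and positions are made up.
records = [("1", "rsA", "0", 1000), ("2", "rsB", "0", 2000), ("3", "rsC", "0", 3000)]
table = {"rsA": [("1", 1100)], "rsB": [("2", 2100), ("5", 900)]}
lifted, leftover = update_map_records(records, table)
print(lifted)
print(leftover)
```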
== Various reasons that lift over could fail ==
=== Genome position cannot be lifted ===
When a SNP resides in a contig that exists only in the older reference build, liftOver cannot give it a new genome position.
You can try the following SNP (in BED format) on the UCSC online liftOver site:
=== SNPs in the higher build are located in a non-reference assembly ===
Some SNPs are not located on autosomes or sex chromosomes in NCBI build 37, and dbSNP does not include them.
You cannot use the dbSNP database to look up their genome positions by rs number.
Take rs1006094 as an example:
== Resources ==
* liftRsNumber.py [[Media: liftRsNumber.py]]
* liftMap.py [[Media: liftMap.py]]
* NCBI provisional map [ftp://ftp.ncbi.nih.gov:/snp/organisms/human_9606/misc/exchange/Remap_36_3_37_1.txt.gz file] and [ftp://ftp.ncbi.nih.gov:/snp/organisms/human_9606/misc/exchange/Remap_36_3_37_1.info info]
* NCBI RsMergeArch [ftp://ftp.ncbi.nih.gov:/snp/organisms/human_9606/database/organism_data/RsMergeArch.bcp.gz file] and [http://www.ncbi.nlm.nih.gov/SNP/snp_db_table_description.cgi?t=RsMergeArch schema]
* NCBI SNPHistory [ftp://ftp.ncbi.nih.gov:/snp/organisms/human_9606/database/organism_data/SNPHistory.bcp.gz file] and [http://www.ncbi.nlm.nih.gov/SNP/snp_db_table_description.cgi?t=SNPHistory schema]
* How UCSC dbSNP differs from NCBI dbSNP [http://genomewiki.ucsc.edu/index.php/DbSNP_Track_Notes UCSC dbSNP track note]
* The dbSNP mapping process [http://www.ncbi.nlm.nih.gov/books/NBK44455/ link]