Difference between revisions of "LiftOver"

Revision as of 11:49, 9 August 2011

LiftOver is a necessary step to bring all genetic analyses onto the same reference build. In particular, our current data are mainly in either NCBI build 36 (UCSC hg18) or NCBI build 37 (UCSC hg19). Although you can lift over from a higher build to a lower build, we always recommend lifting from the lower build to the higher (current) build.

LiftOver is not hard. The easiest way is to use the UCSC liftOver tool to lift a BED format file in one build to a BED format file in another. With additional steps, we can also lift Merlin and PLINK data files.

Besides lifting over genomic positions, this page also covers lifting SNP rs numbers.

Lift over using BED files

Binary liftOver tool

Download the liftOver binary from UCSC and the hg18 to hg19 chain file.

Provide a BED format file (input.bed).

NOTE: Use the 'chr' prefix before each chromosome name.

chr1    743267  743268  rs3115860
chr1    766408  766409  rs12124819
chr1    773885  773886  rs17160939

Run liftOver:

   liftOver input.bed hg18ToHg19.over.chain.gz output.bed unlifted.bed

The unlifted file (unlifted.bed) will contain all genomic positions that cannot be lifted. The reasons vary; see Various reasons that lift over could fail.
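To get a quick overview of why positions failed, the unlifted file can be tallied by reason. This is a minimal sketch, assuming the unlifted file follows the usual liftOver layout in which each failed record is preceded by a comment line starting with '#' that states the reason. The function name `summarize_unlifted` and the example records are hypothetical.

```python
from collections import Counter


def summarize_unlifted(lines):
    """Count failure reasons in a liftOver 'unlifted' file.

    Assumption: each failed BED record is preceded by a '#' comment
    line that gives the reason for the failure.
    """
    reasons = Counter()
    current = "unknown"  # used if a record appears before any comment
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith("#"):
            current = line.lstrip("#").strip()
        else:
            reasons[current] += 1
    return reasons


if __name__ == "__main__":
    # Made-up records, for illustration only.
    example = [
        "#Deleted in new",
        "chr1\t100\t101\tsnpA",
        "#Deleted in new",
        "chr2\t5\t6\tsnpB",
    ]
    for reason, count in summarize_unlifted(example).items():
        print(f"{reason}\t{count}")
```

On a real run you would pass the open file object for unlifted.bed instead of the example list.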

Web interface

Alternatively, you can lift over a BED file in the web interface at http://genome.ucsc.edu/cgi-bin/hgLiftOver. The web interface can tell you why some genomic positions cannot be lifted if you click "Explain failure messages".

Lift Merlin format

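PLINK format and Merlin format are nearly identical. The difference is in the .map file: a PLINK .map file has 4 columns (chromosome, marker ID, genetic distance, base-pair position), whereas the Merlin .map file produced below keeps only the first 3. We will show the lift over procedure for PLINK format; afterwards you can use: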

 awk '{print $1,$2,"\t",$3;}' PLINK.map > Merlin.map

to obtain the Merlin .map file.

Lift PLINK format

PLINK format usually refers to .ped and .map files.

We recommend splitting the job into several steps:

(1) Convert the .map file to a .bed file. By rearranging the columns of the .map file, we obtain a standard BED format file.

(2) Lift over the .bed file. Use the method described above to convert the .bed file from one build to another.

(3) Convert the lifted .bed file back to a .map file. Rearrange the columns of the lifted .bed file to obtain a .map file in the new build.

(4) Modify the .ped file. A .ped file has many columns. By convention, the first six columns are family_id, person_id, father_id, mother_id, sex, and phenotype. From the 7th column onward, each marker is represented by two letters/digits giving the genotype at that marker. Because some genomic positions could not be lifted to the new build in step (2), we need to drop their corresponding columns from the .ped file to keep it consistent with the new .map file. You can use PLINK --exclude to remove those SNPs; see Remove a subset of SNPs.

(5) (Optional) Change the rs numbers in the .map file. Like the human reference build, dbSNP also has different versions. You may consider changing rs numbers from the old dbSNP version to the new dbSNP version, depending on your needs. These steps are described in Lift dbSNP rs numbers.

(6) (Optional) Use an additional method to lift dbSNP positions. The NCBI dbSNP team has provided a provisional map for converting the genomic positions of a large set of dbSNP entries from NCBI build 36 to NCBI build 37. In step (2) we obtained the unlifted genomic positions, so we can try to use this table to convert those unlifted dbSNPs. After this step, there are still some SNPs that cannot be lifted; they are mostly located on non-reference chromosomes.
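Steps (1), (3), and (4) above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the script used internally: the function names are hypothetical, the .map file is assumed to have the 4 standard PLINK columns, and PLINK's numeric codes for sex chromosomes are not translated. Note that BED coordinates are 0-based and half-open, so a 1-based .map position p becomes start = p - 1, end = p.

```python
def map_to_bed(map_lines):
    """Step (1): PLINK .map rows -> BED rows (0-based start, 1-based end)."""
    bed = []
    for line in map_lines:
        fields = line.split()
        if len(fields) < 4:
            continue  # skip blank or malformed rows
        chrom, marker, _dist, pos = fields[:4]
        pos = int(pos)
        # liftOver expects the 'chr' prefix on chromosome names.
        bed.append(f"chr{chrom}\t{pos - 1}\t{pos}\t{marker}")
    return bed


def bed_to_map(lifted_bed_lines, old_map_lines):
    """Step (3): lifted BED rows -> PLINK .map rows in the new build.

    The genetic distance is copied from the old .map file by marker ID
    (falls back to 0 if the marker is not found).
    """
    dist = {}
    for line in old_map_lines:
        fields = line.split()
        if len(fields) >= 4:
            dist[fields[1]] = fields[2]
    new_map = []
    for line in lifted_bed_lines:
        fields = line.split()
        if len(fields) < 4:
            continue
        chrom, _start, end, marker = fields[:4]
        if chrom.startswith("chr"):
            chrom = chrom[3:]
        new_map.append(f"{chrom}\t{marker}\t{dist.get(marker, '0')}\t{end}")
    return new_map


def markers_to_exclude(old_map_lines, lifted_bed_lines):
    """Step (4): marker IDs in the old .map that are absent from the lifted BED."""
    lifted = set()
    for line in lifted_bed_lines:
        fields = line.split()
        if len(fields) >= 4:
            lifted.add(fields[3])
    missing = []
    for line in old_map_lines:
        fields = line.split()
        if len(fields) >= 4 and fields[1] not in lifted:
            missing.append(fields[1])
    return missing


if __name__ == "__main__":
    old_map = ["1 rs3115860 0 743268", "1 rs12124819 0 766409"]
    print("\n".join(map_to_bed(old_map)))
    # Made-up lifted coordinates, for illustration only.
    lifted = ["chr1\t1000\t1001\trs3115860"]
    print("\n".join(bed_to_map(lifted, old_map)))
    print("\n".join(markers_to_exclude(old_map, lifted)))
```

The list returned by `markers_to_exclude` can be written one ID per line to a file and passed to PLINK --exclude, as described in step (4).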

Lift dbSNP rs numbers

rs numbers are released by dbSNP. UCSC also makes its own copy of each dbSNP version. Be aware that the same dbSNP version from these two centers is not identical. When converting rs numbers from a lower version to a higher version, there are practically two ways.

Use RsMergeArch and SNPHistory

In short,

When different rs numbers are found to refer to the same SNP, the higher rs number is merged into the lower rs number, and the merge is recorded in RsMergeArch.bcp.gz.
When an rs number has to be retracted, it is recorded in SNPHistory.bcp.gz.

So we need to combine these two tables to obtain the relationship between older rs numbers and new rs numbers. Luckily, we have a script for internal use; see liftRsNumber.py.
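The idea of combining the two tables can be sketched as follows. This is not liftRsNumber.py itself, only a minimal illustration: it assumes the merge table has already been parsed into a dictionary mapping each merged (higher) rs number to the rs number it was merged into, and the history table into a set of retracted rs numbers. The function name and both data structures are hypothetical, and no particular column layout of the .bcp.gz files is assumed.

```python
def lift_rs_number(rs, merged_into, retracted):
    """Return the current rs number for `rs`, or None if it was retracted.

    merged_into: dict mapping a merged rs number to the rs number it was
                 merged into (hypothetical, pre-parsed merge table).
    retracted:   set of retracted rs numbers (hypothetical, pre-parsed
                 history table).
    """
    seen = set()
    # An rs number can be merged several times, so follow the chain.
    while rs in merged_into and rs not in seen:
        seen.add(rs)  # guard against accidental cycles in the input
        rs = merged_into[rs]
    if rs in retracted:
        return None
    return rs


if __name__ == "__main__":
    merged_into = {3001: 2032}  # rs3001 has been merged into rs2032
    retracted = set()
    print(lift_rs_number(3001, merged_into, retracted))
```

Following the chain rather than doing a single lookup matters because a target rs number may itself have been merged again in a later build.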

Various reasons that lift over could fail

Genomic position cannot be lifted

When a SNP resides in a contig that exists only in the older reference build, liftOver cannot give it a new genomic position.

You can try the following SNP (in BED format) in UCSC online liftOver site:

20 56737667 56737668 rs1073519

The error message will be: "Sequence intersects no chains"

SNPs in the higher build are located on a non-reference assembly

Some SNPs are not on autosomes or sex chromosomes in NCBI build 37, and dbSNP does not include them. You cannot use the dbSNP database to look up their genomic positions by rs number.

Take rs1006094 as an example: on the NCBI dbSNP webpage, this SNP is reported as "Mapped unambiguously on non-reference assembly only". Thus it is probably not very useful to lift this SNP.

Cannot find rs number in newer dbSNP build

It is possible that a new dbSNP build does not have certain rs numbers. When dbSNP releases a new build, higher rs numbers may be merged into lower rs numbers because those rs numbers actually refer to the same SNP. This merge process can be complicated. For a short description, see Use RsMergeArch and SNPHistory. For details, see:

Finding Specific Data in dbSNP’s FTP Files

Merging RefSNP Numbers and RefSNP Clusters

For example:

rs3001 has been merged into rs2032.

Different dbSNP build

NCBI released dbSNP132 (VCF format), and UCSC also has its own version of dbSNP132 (plain text). The two database files differ not only in file format, but also in content.

The NCBI release will not contain:

SNPs listed as microsatellites or named variations
SNPs with multibyte alleles and unknown (N) adjacent base pairs
SNPs that are not mapped on the reference genome (GRCh37)

For the UCSC release, see the UCSC dbSNP track note under Resources.

Use rs1054140 as an example:

NCBI dbSNP website gives 1 location: Link

NCBI dbSNP VCF file has NO record.

UCSC genome browser website gives 2 locations: Link

UCSC dbSNP file gives 2 locations:

721     chr10   17842693        17842694        rs1054140       0       +       T       T       A/T     genomic single  by-cluster,by-submitter ...
723     chr10   18089681        18089682        rs1054140       0       +       T       T       A/T     genomic single  by-cluster,by-submitter ...
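rs numbers that map to more than one location, like the one above, can be found by scanning the UCSC file. This is a minimal sketch, assuming tab-separated rows whose first five columns follow the layout shown above (bin, chromosome, start, end, rs number); the function name is hypothetical.

```python
from collections import defaultdict


def multi_location_rs(rows):
    """Return {rs number: sorted list of (chrom, start, end)} for every
    rs number that appears at more than one location.

    Assumption: rows are tab-separated and begin with the columns
    bin, chrom, chromStart, chromEnd, name.
    """
    locations = defaultdict(set)
    for line in rows:
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 5:
            continue  # skip blank or truncated rows
        chrom, start, end, name = fields[1], fields[2], fields[3], fields[4]
        locations[name].add((chrom, int(start), int(end)))
    return {rs: sorted(locs) for rs, locs in locations.items() if len(locs) > 1}


if __name__ == "__main__":
    rows = [
        "721\tchr10\t17842693\t17842694\trs1054140\t0\t+",
        "723\tchr10\t18089681\t18089682\trs1054140\t0\t+",
    ]
    for rs, locs in multi_location_rs(rows).items():
        print(rs, locs)
```

For the full snp132.txt.gz file you would pass a text-mode file handle (for example from `gzip.open(path, "rt")`) instead of the example list.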

Resources

NCBI provisional map file and info
NCBI RsMergeArch file and schema
NCBI SNPHistory file and schema
How UCSC dbSNP differs from NCBI dbSNP: UCSC dbSNP track note
The dbSNP mapping process link
NCBI dbSNP release 132 00-All.vcf.gz
UCSC dbSNP release 132 snp132.txt.gz

Acknowledgements

Hyun: provided a sample liftOver tool: [/net/wonderland/home/hmkang/prj/Sardinia/MetaboChip/scripts/j01-liftover-metabochip-positions.pl]
Alex: careful examination of the 0-based index in UCSC data files
Adrian: explanation of the SNPs omitted from the NCBI dbSNP file
Goncalo: all other support

Questions and Comments

Please contact Xiaowei Zhan.

@@ Line 1: / Line 1: @@
-LiftOver
+LiftOver is a necesary step to bring all genetical analysis to the same reference build.
-LiftOver is a necesary step to bring all genetical analysis back to the same reference build.
 Particularly, our current data are mainly in either NCBI build 36 (UCSC hg 18) or NCBI build 37 (UCSC hg19).
-Although lift over can be from high build to lower build, we always recommend lift lower build to higher/current build.
+Although lift over can be from higher build to lower build, we always recommend lift lower build to higher/current build.
 LiftOver is not hard. The easier way is to use UCSC liftOver tool to lift [http://genome.ucsc.edu/FAQ/FAQformat.html#format1 BED format] file to BED format file.
-With additional steps, we can also lift Merlin and PLINK format.
+With additional steps, we can also lift Merlin and PLINK data files.
+Besides introducing lift over genomic positions, lifting SNPs is also introduced.
 == Lift over using BED files ==
-.1 Binary liftOver tool
+=== Binary liftOver tool ===
 Download the [http://hgdownload.cse.ucsc.edu/admin/exe/linux.x86_64/liftOver liftOver binary] from UCSC and [http://hgdownload.cse.ucsc.edu/goldenPath/hg18/liftOver/hg18ToHg19.over.chain.gz hg18 to hg 19 chain file]
@@ Line 16: / Line 16: @@
 NOTE: Use the 'chr' before each chromosome name
+<pre>
 chr1    743267  743268  rs3115860
 chr1    766408  766409  rs12124819
 chr1    773885  773886  rs17160939
+</pre>
 Run liftOver:
      liftOver input.bed hg18ToHg19.over.chain.gz output.bed unlifted.bed
-.2 Web interface
+unlifted file will contain all genomic positions that cannot be lifted. The reason for that varies. See [[#Various reasons that lift over could fail | Various reasons that lift over could fail]]
+=== Web interface ===
 Alternatively, you can lift over BED file in web interface
-at: [http://genome.ucsc.edu/cgi-bin/hgLiftOver]
+at: [http://genome.ucsc.edu/cgi-bin/hgLiftOver Link]
 Web interface can tell you why some genomic position cannot
-be lifted.
+be lifted if you click "Explain failure messages"
+== Lift Merlin format ==
+PLINK format and [http://www.sph.umich.edu/csg/abecasis/Merlin/tour/input_files.html Merlin format are nearly identical].
+The difference is that Merlin .map file have 4 columns. We will show
+the lift over procedure for PLINK format, then you can use:
+  awk '{print $1,$2,"\t",$3;}' PLINK.map > Merlin.map
+to obtain Merlin .map file.
+== Lift PLINK format ==
+[http://pngu.mgh.harvard.edu/~purcell/plink/data.shtml PLINK] format usually referrs to .ped and .map files.
-. Lift Merlin format
-PLINK format and Merlin format are nearly identical, except the
-.map file.
-. Lift PLINK format
-PLINK format usually referrs to .ped and .map files.
 We recommend split the jobs in several steps:
 (1) convert .map to .bed file
+By rearrange columns of .map file, we obtain a standard BED format file.
 (2) liftOver .bed file
+Use method mentioned above to convert .bed file from one build to another.
 (3) convert lifted .bed file back to .map file
+Rearrange column of .map file to obtain .bed file in the new build.
 (4) modify .ped file
+.ped file have many column files. By convention, the first six columns are family_id, person_id, father_id, mother_id, sex, and phenotype.
+From the 7th column, there are two letters/digits representing a genotype at the certain marker. In step (2), as some genomic positions cannot
+be lifted to the new version, we need to drop their corresponding columns from .ped file to keep consistency. You can use PLINK --exclude those snps,
+see [http://pngu.mgh.harvard.edu/~purcell/plink/dataman.shtml#exclude Remove a subset of SNPs].
+(5) (optionally) change the rs number in the .map file
+Similar to the human reference build, dbSNP also have different versions. You may consider change rs number from the old dbSNP version to new dbSNP version
+depending on your needs Such steps are described in [#Lift dbSNP rs numbers | Lift dbSNP rs numbers].
+(6) (optionally) additional method to lift dbSNP postion
+NCBI dbSNP team has provided a provisional map for converting the genomic position of a larget set dbSNP from NCBI build 36 to NCBI build 37.
+In the second step, we have obtained unlifted genomic positions, so we can try to use the table to convert those unlfted dbSNPs.
+After this step, there are still some SNPs that cannot be lifted, and they are mostly located on non-reference chromosome.
+== Lift dbSNP rs numbers ==
+rs number is release by dbSNP. UCSC also make their own copy from each dbSNP version. Be aware that the same version of dbSNP from these two centers are not the same.
+When we convert rs number from lower version to higher version, there are practically two ways.
+=== Use RsMergeArch and SNPHistory ===
+In short,
+# when different rs number are found to refer to the same SNP, then higher rs number will be merged to lower rs number, and the merging will be recorded in RsMergedArch.bcp.gz.
+# when rs number have to be retracted, rs number will be recorded in SNPHistory.bcp.gz
+So we need to combine these two tables to obtain the relationship between older rs number and new rs number.
+Luckily, we have a script for internal use. See liftRsNumber.py
+== Various reasons that lift over could fail ==
+=== Genomic position cannot be lifted ===
+When a SNP resides in a contig that only exists in older reference build, liftOver cannot give it new genomic.
+You can try the following SNP (in BED format) in UCSC online liftOver site:
+56737667 56737668 rs1073519
+The error message will be: "Sequence intersects no chains"
-(5) (optionally) change the rs number in .map file to
+=== SNP in higher build are located in non-referernce assembly ===
-newer version
+Some SNP are not in autosomes or sex chromosomes in NCBI build 37. dbSNP does not include them.
+You cannot use dbSNP database to lookup its genomic position by rs number.
+Take rs1006094 as an example:
+In NCBI dbSNP webpage, this SNP is reported as "Mapped unambiguously on non-reference assembly only"
+Thus it is probably not very useful to lift this SNP.
-. Lift RS id numbers
+=== Cannot find rs number in newer dbSNP build ===
-RS number is release by dbSNP. UCSC also make their own copy from the dbSNP release. However, the same SNP build from these two centers are not the same.
+It is possible that new dbSNP build does not have certain rs numbers.
+When dbSNp release new build, higher rs number may be merged to lower rs number because of those rs numbers are actually the same SNP.
+This merge process can be complicate. For short description, see [#Use RsMergeArch and SNPHistory | Use RsMergeArch and SNPHistory].
+For detail, see:
-.1 Use dbSNP provided exchange file
+[http://www.ncbi.nlm.nih.gov/books/NBK44395/#FTP.do_you_have_a_table_of_merged_snps_s Finding Specific Data in dbSNP’s FTP Files]
-.2 Use the combination of RgMergeArch.bcp.gz and
+[http://www.ncbi.nlm.nih.gov/books/NBK44468/#Build.can_two_id_numbers_correspond_to_t Merging RefSNP Numbers and RefSNP Clusters]
-SNPHistory.bcp.gz
-. Why you cannot lift ?
-.1 genomic position cannot be lifted
-Possible reasons:
-That could happen if SNP position exists in old build but not in new build.
 For example:
-Try the following SNP (BED format) cannot be lifted:
-56737667 56737668 rs1073519
-.2 rs number cannot be lifted between build
-Possible reasons:
-When dbSNp release new build, some high rs number may be merged to low rs
-number because of those rs numbers are actually the same SNP.
-This merge process can be complicate. For detail, see:
-http://www.ncbi.nlm.nih.gov/books/NBK44395/#FTP.do_you_have_a_table_of_merge
-d_snps_s
-For example:
 rs3001 has merged to rs2032.
-.3. SNP in higher build are located in non-referernce assembly.
+=== Different dbSNP build ===
-For example:
+NCBI released dbSNP132 (VCF format), and UCSC also have their version of dbSNP132 (plain txt).
-rs1006094
+The two database files differ not only in file format, but in content.
-In NCBI dbSNP, this SNP is reported as "Mapped unambiguously on
-non-reference assembly only"
+For NCBI release, its [[#Resources| release]] will not contain:
-Thus it is probably not very useful to lift this SNP.
+* SNPs listed as microsatellites or named variations
+* SNPs with multibyte alleles and unknown (N) adjacent base pairs
+* SNPs that are not mapped on the reference genome (GRCh37)
+For UCSC release, see [#Resources | UCSC dbSNP track note]
+Use rs1054140 as an example:
+NCBI dbSNP website gives 1 location:
+[http://www.ncbi.nlm.nih.gov/projects/SNP/snp_ref.cgi?rs=1054140 Link]
+NCBI dbSNP VCF file has NO record.
+UCSC genome browser website gives 2 locations:
+[http://genome.ucsc.edu/cgi-bin/hgTracks?clade=mammal&org=Human&db=hg19&position=rs1054140&hgt.suggest=&hgt.suggestTrack=knownGene&pix=800&Submit=submit&hgsid=205770459&hgt.newJQuery=1 Link]
+UCSC dbSNP file give 2 locations:
+<pre>
+     chr10   17842693        17842694        rs1054140       0       +       T       T       A/T     genomic single  by-cluster,by-submitter ...
+     chr10   18089681        18089682        rs1054140       0       +       T       T       A/T     genomic single  by-cluster,by-submitter ...
+</pre>
-.4 different dbSNP list
+== Resouces ==
-. Different dbSNP database
+* NCBI provisional map [ftp://ftp.ncbi.nih.gov:/snp/organisms/human_9606/misc/exchange/Remap_36_3_37_1.txt.gz file] and [ftp://ftp.ncbi.nih.gov:/snp/organisms/human_9606/misc/exchange/Remap_36_3_37_1.info info]
-NCBI released dbSNP132 in VCF format, and UCSC also have their version of
+* NCBI RgMergeArch file and [http://www.ncbi.nlm.nih.gov/SNP/snp_db_table_description.cgi?t=RsMergeArch schema]
-dbSNP132 in plain txt format.
+* NCBI SNPHistory file and [http://www.ncbi.nlm.nih.gov/SNP/snp_db_table_description.cgi?t=SNPHistory schema]
-The two database files differ not only in file format, but in content as
+* How UCSC dbSNP differs from NCBI dbSNP [http://genomewiki.ucsc.edu/index.php/DbSNP_Track_Notes UCSC dbSNP track note]
-well.
+* The dbSNP mapping process [http://www.ncbi.nlm.nih.gov/books/NBK44455/ link]
-For example:
+* NCBI dbSNP release 132 [ftp://ftp.ncbi.nih.gov:/snp/organisms/human_9606/VCF/v4.0/ByChromosomeNoGeno/00-All.vcf.gz 00-All.vcf.gz]
-rs1054140
+* UCSC dbSNP release 132 [http://hgdownload.cse.ucsc.edu/goldenPath/hg19/database/snp132.txt.gz snp132.txt.gz]
-NCBI dbSNP website (showed 1 location):
+== Acknowledge ==
-http://www.ncbi.nlm.nih.gov/projects/SNP/snp_ref.cgi?rs=1054140
-UCSC genome browser website(showed 2 locations):
-http://genome.ucsc.edu/cgi-bin/hgTracks?clade=mammal&org=Human&db=hg19&posit
-ion=rs1054140&hgt.suggest=&hgt.suggestTrack=knownGene&pix=800&Submit=submit&
-hgsid=205770459&hgt.newJQuery=1
-NCBI dbSNP VCF file (no record)
-UCSC genome browser (2 locations):
-	chr10	17842693	17842694	rs1054140	0	+
-T	T	A/T	genomic	single	by-cluster,by-submitter	0.5	0
-unknown	exact	2	MultipleAlignments	8
-ABI,BCM-HGSC-SUB,HUMANGENOME_JCVI,ILLUMINA,KRIBB_YJKIM,LEE,SEQUENOM,WI_SSAHA
-SNP,	2	T,A,	1.000000,1.000000,	0.500000,0.500000,
-maf-5-some-pop,maf-5-all-pops
-	chr10	18089681	18089682	rs1054140	0	+
-T	T	A/T	genomic	single	by-cluster,by-submitter	0.5	0
-untranslated-3	exact	2	MultipleAlignments	8
-ABI,BCM-HGSC-SUB,HUMANGENOME_JCVI,ILLUMINA,KRIBB_YJKIM,LEE,SEQUENOM,WI_SSAHA
-SNP,	2	T,A,	1.000000,1.000000,	0.500000,0.500000,
-maf-5-some-pop,maf-5-all-pops
-e.g. rs3115860 in UCSC dbSNP 132 appeared twice, but does not appear in NCBI dbSNP 132.
+* Hyun: provides sample liftOver tool: [/net/wonderland/home/hmkang/prj/Sardinia/MetaboChip/scripts/j01-liftover-metabochip-positions.pl]
+* Alex: careful examines of 0-based index in UCSC data file
+* Adrian: explaination of SNPs omitted in NCBI dbSNP file
+* Goncalo: all other supports
-. Resouces
+== Questions and Comments ==
-BED format:
+Please contact [mailto:zhanxw@umich.edu Xiaowei Zhan].
-NCBI exchange file and schema:
-NCBI RgMergeArch file and schema:
-NCBI SNPHistory file and schema: