LiftOver

From Genome Analysis Wiki
Revision as of 17:11, 8 August 2011 by Zhanxw

LiftOver is a necessary step to bring all genetic analyses onto the same reference build. Our current data are mainly in either NCBI build 36 (UCSC hg18) or NCBI build 37 (UCSC hg19). Although coordinates can be lifted from a higher build to a lower build, we always recommend lifting from the lower build to the higher (current) build.

LiftOver is not hard. The easiest way is to use the UCSC liftOver tool to lift a BED format file to another BED format file. With additional steps, we can also lift Merlin and PLINK formats.

1. Lift over using BED files

1.1 Binary liftOver tool

Download the liftOver binary from UCSC, together with the hg18 to hg19 chain file.

Provide a BED format file (input.bed).

NOTE: Use the 'chr' prefix before each chromosome name.

   chr1 743267 743268 rs3115860
   chr1 766408 766409 rs12124819
   chr1 773885 773886 rs17160939
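The BED rows above can be produced from a list of SNP positions with a few lines of Python. This is a minimal sketch, not part of the liftOver tool: the helper name `to_bed_line` is hypothetical, and it assumes the input positions are 1-based, so that a SNP at position p becomes the BED interval (p - 1, p).

```python
def to_bed_line(chrom, pos, name):
    """Return one tab-separated BED line for a SNP at a 1-based position."""
    chrom = str(chrom)
    # liftOver chain files use UCSC names, so add the 'chr' prefix if missing
    if not chrom.startswith("chr"):
        chrom = "chr" + chrom
    pos = int(pos)
    # BED uses a 0-based start and an exclusive end
    return f"{chrom}\t{pos - 1}\t{pos}\t{name}"


# The three markers from the example, given as 1-based positions
markers = [
    ("1", 743268, "rs3115860"),
    ("1", 766409, "rs12124819"),
    ("chr1", 773886, "rs17160939"),
]
bed_lines = [to_bed_line(*m) for m in markers]
print("\n".join(bed_lines))
```

Writing `bed_lines` to input.bed, one line per marker, gives a file in the shape liftOver expects.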

Run liftOver:

   liftOver input.bed hg18ToHg19.over.chain.gz output.bed unlifted.bed

1.2 Web interface

Alternatively, you can lift over a BED file using the web interface at: [1]. The web interface can tell you why some genomic positions cannot be lifted.

2. Lift Merlin format

PLINK format and Merlin format are nearly identical, except for the .map file.

3. Lift PLINK format

PLINK format usually refers to .ped and .map files. We recommend splitting the job into several steps:

(1) convert the .map file to a .bed file (a UCSC BED file, not a PLINK binary .bed file)

(2) liftOver .bed file

(3) convert lifted .bed file back to .map file

(4) modify .ped file

(5) (optionally) change the rs number in .map file to newer version
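Steps (1), (3) and (4) can be sketched in Python as follows. This is an illustration under stated assumptions, not a reference implementation: the function names are hypothetical, the .map file is assumed to have four columns (chromosome, marker id, genetic distance, base-pair position), the .ped file is assumed to have six leading columns followed by two allele columns per marker, and chromosome names are assumed to become valid UCSC names once prefixed with 'chr' (numeric codes such as 23 for X would need translating first).

```python
def map_to_bed(map_lines):
    """Step (1): PLINK .map rows -> BED rows (1-based bp -> 0-based start)."""
    bed = []
    for line in map_lines:
        chrom, snp, _cm, bp = line.split()
        bp = int(bp)
        bed.append(f"chr{chrom}\t{bp - 1}\t{bp}\t{snp}")
    return bed


def bed_to_map(bed_lines):
    """Step (3): lifted BED rows -> PLINK .map rows; genetic distance set to 0."""
    out = []
    for line in bed_lines:
        chrom, _start, end, snp = line.split()[:4]
        if chrom.startswith("chr"):
            chrom = chrom[3:]
        out.append(f"{chrom}\t{snp}\t0\t{end}")
    return out


def filter_ped(ped_lines, old_ids, kept_ids):
    """Step (4): keep only the allele columns of markers that were lifted.

    old_ids is the marker order of the original .map file;
    kept_ids is the set of marker ids present in the lifted BED file.
    """
    keep = [i for i, snp in enumerate(old_ids) if snp in kept_ids]
    out = []
    for line in ped_lines:
        fields = line.split()
        fixed, geno = fields[:6], fields[6:]
        cols = []
        for i in keep:
            cols += geno[2 * i: 2 * i + 2]  # two alleles per marker
        out.append(" ".join(fixed + cols))
    return out


# Round trip without lifting, only to show the coordinate handling
map_rows = ["1 rs3115860 0 743268", "1 rs12124819 0 766409"]
print(map_to_bed(map_rows))
print(bed_to_map(map_to_bed(map_rows)))
# Drop the first marker from a one-person .ped row
print(filter_ped(["F1 I1 0 0 1 2 A A G G"], ["rs3115860", "rs12124819"], {"rs12124819"}))
```

The sketch keeps the new .map rows in the order of the lifted BED file, so they stay aligned with the remaining .ped columns. Markers lifted to contigs with unusual names would need extra handling that is not shown here.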


4. Lift RS id numbers

RS numbers are released by dbSNP. UCSC also makes its own copy from each dbSNP release. However, the same SNP build from these two centers is not identical.

4.1 Use dbSNP provided exchange file

4.2 Use the combination of RsMergeArch.bcp.gz and SNPHistory.bcp.gz

5. Why you cannot lift?

5.1 Genomic position cannot be lifted

Possible reasons: this can happen if the SNP position exists in the old build but not in the new build. For example, the following SNP (BED format) cannot be lifted:

   20 56737667 56737668 rs1073519
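The command-line tool also records failures: the unlifted file written by liftOver places a comment line starting with '#' before each record that could not be lifted. The sketch below pairs each record with the comment preceding it. The function name and the two marker rows are hypothetical, and the reason text is only illustrative of the kind of message to expect, not output from a real run.

```python
from collections import Counter


def unlifted_reasons(lines):
    """Pair each unlifted BED record with the '#' comment line before it."""
    reason = "unknown"
    pairs = []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith("#"):
            # Remember the most recent comment as the reason
            reason = line.lstrip("#").strip()
        else:
            fields = line.split()
            # Use the name column when present, else the whole record
            name = fields[3] if len(fields) > 3 else line
            pairs.append((name, reason))
    return pairs


# Hypothetical content of unlifted.bed (marker names are made up)
example = [
    "#Deleted in new",
    "chr1\t100\t101\tsnpA",
    "#Deleted in new",
    "chr2\t200\t201\tsnpB",
]
pairs = unlifted_reasons(example)
print(pairs)
print(Counter(reason for _, reason in pairs))
```

Counting the reasons gives a quick summary of why markers were dropped before deciding how to handle them.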

5.2 rs number cannot be lifted between builds

Possible reasons: when dbSNP releases a new build, some higher rs numbers may be merged into lower rs numbers because those rs numbers actually refer to the same SNP. This merge process can be complicated. For details, see: http://www.ncbi.nlm.nih.gov/books/NBK44395/#FTP.do_you_have_a_table_of_merged_snps_s

For example: rs3001 has been merged into rs2032.
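Because a merged rs number can itself be merged again in a later build, updating an rs number means following the chain of merges until it ends. The sketch below shows one way to do that. It assumes that the first two tab-separated columns of RsMergeArch.bcp.gz are the higher (retired) and lower (surviving) rs numbers without the "rs" prefix; check this against the dbSNP schema before relying on it. All function names are hypothetical, and the 9000001 to 9000003 rows are made-up values used only to show a two-step chain.

```python
import gzip


def load_merge_rows(path):
    """Read a gzip-compressed, tab-separated merge table into lists of fields."""
    with gzip.open(path, "rt") as handle:
        return [line.rstrip("\n").split("\t") for line in handle]


def build_merge_map(rows):
    """Map each retired rs number to the rs number it was merged into.

    Assumption: column 0 is the higher rs number, column 1 the lower one.
    """
    return {f"rs{r[0]}": f"rs{r[1]}" for r in rows}


def current_rs(rsid, merge_map):
    """Follow merges until reaching an rs number that was not merged further."""
    seen = set()
    while rsid in merge_map and rsid not in seen:
        seen.add(rsid)  # guard against accidental cycles in the table
        rsid = merge_map[rsid]
    return rsid


# rs3001 -> rs2032 is the example from the text; the other rows are made up
rows = [("3001", "2032"), ("9000003", "9000002"), ("9000002", "9000001")]
merge_map = build_merge_map(rows)
print(current_rs("rs3001", merge_map))
print(current_rs("rs9000003", merge_map))
print(current_rs("rs42", merge_map))
```

An rs number that does not appear in the table is returned unchanged, which is the right behavior for SNPs that were never merged.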

5.3 SNP in the higher build is located on a non-reference assembly

For example: rs1006094. In NCBI dbSNP, this SNP is reported as "Mapped unambiguously on non-reference assembly only". Thus it is probably not very useful to lift this SNP.

5.4 Different dbSNP databases

NCBI released dbSNP132 in VCF format, and UCSC also has its own version of dbSNP132 in plain text format. The two database files differ not only in file format but in content as well. For example: rs1054140

NCBI dbSNP website (shows 1 location): http://www.ncbi.nlm.nih.gov/projects/SNP/snp_ref.cgi?rs=1054140

UCSC genome browser website (shows 2 locations): http://genome.ucsc.edu/cgi-bin/hgTracks?clade=mammal&org=Human&db=hg19&position=rs1054140&hgt.suggest=&hgt.suggestTrack=knownGene&pix=800&Submit=submit&hgsid=205770459&hgt.newJQuery=1

NCBI dbSNP VCF file: no record

UCSC genome browser table (2 locations):

   721 chr10 17842693 17842694 rs1054140 0 + T T A/T genomic single by-cluster,by-submitter 0.5 0 unknown exact 2 MultipleAlignments 8 ABI,BCM-HGSC-SUB,HUMANGENOME_JCVI,ILLUMINA,KRIBB_YJKIM,LEE,SEQUENOM,WI_SSAHA SNP, 2 T,A, 1.000000,1.000000, 0.500000,0.500000, maf-5-some-pop,maf-5-all-pops
   723 chr10 18089681 18089682 rs1054140 0 + T T A/T genomic single by-cluster,by-submitter 0.5 0 untranslated-3 exact 2 MultipleAlignments 8 ABI,BCM-HGSC-SUB,HUMANGENOME_JCVI,ILLUMINA,KRIBB_YJKIM,LEE,SEQUENOM,WI_SSAHA SNP, 2 T,A, 1.000000,1.000000, 0.500000,0.500000, maf-5-some-pop,maf-5-all-pops

As another example, rs3115860 appears twice in UCSC dbSNP 132 but does not appear in NCBI dbSNP 132.
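Markers that map to more than one location, like the rs1054140 rows above, are worth flagging before lifting. A minimal sketch, assuming the table has been reduced to the four BED columns (chromosome, start, end, name) and with a hypothetical function name; the third row is a made-up single-location marker added for contrast:

```python
from collections import defaultdict


def multi_location_ids(bed_lines):
    """Return the ids that occur at more than one distinct location."""
    locations = defaultdict(set)
    for line in bed_lines:
        chrom, start, end, name = line.split()[:4]
        locations[name].add((chrom, int(start), int(end)))
    return {name: sorted(locs) for name, locs in locations.items() if len(locs) > 1}


rows = [
    "chr10 17842693 17842694 rs1054140",
    "chr10 18089681 18089682 rs1054140",
    "chr1 100 101 rsHypothetical",  # made-up marker with a single location
]
duplicates = multi_location_ids(rows)
print(duplicates)
```

Whether to drop such markers or keep one location is an analysis decision that this sketch does not make.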

6. Resources

BED format:

NCBI exchange file and schema:

NCBI RsMergeArch file and schema:

NCBI SNPHistory file and schema: