Changes

From Genome Analysis Wiki
123 bytes added ,  11:24, 10 August 2011
no edit summary
Line 2: Line 2:  
LiftOver can have three use cases:  
 
   −
(1) [[#Lift genomic positions | convert genomic position from one genome assembly to another genome assembly]]
+
(1) [[#Lift genome positions | convert genome position from one genome assembly to another genome assembly]]
In most scenarios, we have known genomic positions in NCBI build 36 (UCSC hg18) and want to lift them over to NCBI build 37 (UCSC hg19).
+
In most scenarios, we have known genome positions in NCBI build 36 (UCSC hg18) and want to lift them over to NCBI build 37 (UCSC hg19).
    
(2) [[#Lift dbSNP rs numbers | convert dbSNP rs number from one build to another]]
 
   −
(3) [[#Lift Merlin/PLINK format |convert both genomic position and dbSNP rs number over different versions]]
+
(3) [[#Lift Merlin/PLINK format |convert both genome position and dbSNP rs number over different versions]]
 
Data of this type are typically stored in Merlin/PLINK format.
   Line 17: Line 17:  
With our customized scripts, we can also lift rsNumber and Merlin/PLINK data files.  
 
   −
== Lift genomic positions ==
+
== Lift genome positions ==
Genomic positions are best represented in [http://genome.ucsc.edu/FAQ/FAQformat.html#format1 BED format]. UCSC provides tools to convert a BED file from one genome assembly to another.
+
Genome positions are best represented in [http://genome.ucsc.edu/FAQ/FAQformat.html#format1 BED format]. UCSC provides tools to convert a BED file from one genome assembly to another.
    
=== Binary liftOver tool ===
 
Line 35: Line 35:  
     liftOver input.bed hg18ToHg19.over.chain.gz output.bed unlifted.bed
 
   −
The unlifted.bed file will contain all genomic positions that cannot be lifted. The reasons vary; see [[#Various reasons that lift over could fail | Various reasons that lift over could fail]].
+
The unlifted.bed file will contain all genome positions that cannot be lifted. The reasons vary; see [[#Various reasons that lift over could fail | Various reasons that lift over could fail]].
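To see at a glance why positions ended up in unlifted.bed, the failure reasons can be tallied. The following is a minimal Python sketch, assuming each failed record in the unlifted file is preceded by a '#'-prefixed comment line stating the reason (for example "#Deleted in new"). The function name and the example records are hypothetical.

```python
from collections import Counter


def summarize_unlifted(lines):
    """Count failure reasons found in the lines of an unlifted file.

    Assumption: each failed record is preceded by a comment line that
    starts with '#' and states the reason for the failure.
    """
    reasons = Counter()
    for line in lines:
        line = line.strip()
        if line.startswith("#"):
            reasons[line.lstrip("#").strip()] += 1
    return reasons


# Hypothetical content of an unlifted file, for illustration only.
example = [
    "#Deleted in new",
    "chr1\t100\t101\tsnpA",
    "#Partially deleted in new",
    "chr2\t200\t201\tsnpB",
    "#Deleted in new",
    "chr3\t300\t301\tsnpC",
]

for reason, count in summarize_unlifted(example).most_common():
    print(reason, count)
```

With a real file, the same function can be called as summarize_unlifted(open("unlifted.bed")).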
    
=== Web interface ===
 
 
Alternatively, you can lift over a BED file in the web interface
 
at: [http://genome.ucsc.edu/cgi-bin/hgLiftOver Link]
The web interface can tell you why some genomic positions cannot
+
The web interface can tell you why some genome positions cannot
 
be lifted if you click "Explain failure messages".
   Line 80: Line 80:     
== Lift Merlin/PLINK format ==
 
In Merlin/PLINK .map files, each line contains both a genomic position and a dbSNP rs number. Our goal here is to use both pieces of information to lift over as many positions as possible.
+
In Merlin/PLINK .map files, each line contains both a genome position and a dbSNP rs number. Our goal here is to use both pieces of information to lift over as many positions as possible.
There are 3 methods to lift over, and we recommend the first 2. The first method is common and lifts most genomic positions; however, it does not reflect the dbSNP build change. The second method is more robust in the sense that each lifted rs number has a valid genomic position, as it uses dbSNP as its data source. The third method is not straightforward, so we only mention it briefly.
+
There are 3 methods to lift over, and we recommend the first 2. The first method is common and lifts most genome positions; however, it does not reflect the dbSNP build change. The second method is more robust in the sense that each lifted rs number has a valid genome position, as it uses dbSNP as its data source. The third method is not straightforward, so we only mention it briefly.
    
=== Lift Merlin format ===
 
Line 107: Line 107:  
(2) LiftOver .bed file
 
   −
Use the method described [[#Lift genomic positions | above]] to convert the .bed file from one build to another.
+
Use the method described [[#Lift genome positions | above]] to convert the .bed file from one build to another.
    
(3) Convert lifted .bed file back to .map file
 
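The conversion between .map and .bed in the steps above can be sketched in Python as follows. This is a minimal illustration, not the script used on this page. It assumes a four-column PLINK-style .map line (chromosome, SNP id, genetic distance, 1-based base-pair position), a "chr" prefix on BED chromosome names, and that the genetic distance can be reset to 0 on the way back. The function names and the example SNP id are hypothetical. Numeric codes for sex chromosomes (such as 23 for X) would need an extra translation step that is not shown.

```python
def map_line_to_bed(line):
    """Convert one four-column .map line to a four-column BED line.

    BED coordinates are 0-based and half-open, so a 1-based position p
    becomes the interval [p - 1, p).
    """
    chrom, snp, _genetic_distance, bp = line.split()
    bp = int(bp)
    return f"chr{chrom}\t{bp - 1}\t{bp}\t{snp}"


def bed_line_to_map(line):
    """Convert one lifted BED line back to a four-column .map line.

    Assumption: the genetic distance is not carried through the BED file,
    so it is written as 0 here.
    """
    chrom, _start, end, snp = line.split()[:4]
    if chrom.startswith("chr"):
        chrom = chrom[3:]
    return f"{chrom}\t{snp}\t0\t{end}"


# Hypothetical marker, for illustration only.
bed = map_line_to_bed("1 rsExample1 0 1000")
print(bed)
print(bed_line_to_map(bed))
```

Keeping the SNP id in the BED name column is what allows the lifted positions to be matched back to their markers afterwards.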
Line 116: Line 116:     
.ped files have many columns. By convention, the first six columns are family_id, person_id, father_id, mother_id, sex, and phenotype.
From the 7th column onward, each pair of letters/digits represents a genotype at a given marker. In step (2), as some genomic positions cannot
+
From the 7th column onward, each pair of letters/digits represents a genotype at a given marker. In step (2), as some genome positions cannot
 
be lifted to the new version, we need to drop their corresponding columns from the .ped file to keep it consistent. You can use the PLINK --exclude option to remove those SNPs;
 
see [http://pngu.mgh.harvard.edu/~purcell/plink/dataman.shtml#exclude Remove a subset of SNPs].
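A list of SNPs to exclude can be derived from the unlifted file. The sketch below is only an illustration. It assumes the unlifted file keeps the SNP id in the fourth BED column and marks failure reasons with '#'-prefixed comment lines. The function name and the example records are hypothetical.

```python
def unlifted_snp_ids(unlifted_lines):
    """Return the SNP ids (4th BED column) of records that were not lifted.

    Comment lines starting with '#' and blank lines are skipped.
    """
    ids = []
    for line in unlifted_lines:
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        fields = line.split()
        if len(fields) >= 4:
            ids.append(fields[3])
    return ids


# Hypothetical unlifted records, for illustration only.
example = [
    "#Deleted in new",
    "chr1\t100\t101\tsnpA",
    "#Deleted in new",
    "chr2\t200\t201\tsnpB",
]

# One SNP id per line is the layout expected by an exclude list.
print("\n".join(unlifted_snp_ids(example)))
```

Saving this output to a file (say exclude.txt, a hypothetical name) lets it be passed to PLINK, for example: plink --file mydata --exclude exclude.txt --recode --out mydata_lifted, where mydata and mydata_lifted are placeholder names.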
Line 145: Line 145:     
==== Method 3 ====
 
The NCBI dbSNP team has provided a [[#Resources | provisional map]] for converting the genomic positions of a large set of dbSNP entries from NCBI build 36 to NCBI build 37.
+
The NCBI dbSNP team has provided a [[#Resources | provisional map]] for converting the genome positions of a large set of dbSNP entries from NCBI build 36 to NCBI build 37.
In the second step, we obtained the unlifted genomic positions, so we can try to use the table to convert those unlifted dbSNPs.
+
In the second step, we obtained the unlifted genome positions, so we can try to use the table to convert those unlifted dbSNPs.
 
After this step, some SNPs still cannot be lifted, mostly because they are located on non-reference chromosomes.
 
Note: due to limitations of the provisional map, some SNPs can have multiple locations.
Line 165: Line 165:  
(2) Use provisional map to update .map file
 
   −
By joining the .map file with this provisional map, we can obtain the genomic position in the new build.
+
By joining the .map file with this provisional map, we can obtain the genome position in the new build.
 
Note: the provisional map uses 1-based chromosomal coordinates. Things get trickier if we want to lift a variant that is not a single-site SNP, e.g. AA/GG.
Since the provisional map provides a range in this case, it is necessary to know the genomic position of that single base provided in the .map file,
+
Since the provisional map provides a range in this case, it is necessary to know the genome position of that single base provided in the .map file,
 
and then we can look up the table, so it is not straightforward.
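The join described above can be sketched as a dictionary lookup keyed by rs number. This is a hedged illustration only: the column layout of the provisional map is not described on this page, so loading it is left out, and the remap dictionary below is a hypothetical, already-parsed stand-in that maps each rs number to a list of (chromosome, 1-based position) locations in the new build. SNPs with no location or with multiple locations are set aside rather than guessed.

```python
def update_map(map_rows, remap):
    """Update .map rows with new-build positions from a lookup table.

    map_rows: iterable of (chrom, snp, genetic_distance, old_position).
    remap:    dict of snp -> list of (new_chrom, new_position); this is a
              hypothetical stand-in for the parsed provisional map.
    Returns (updated_rows, unresolved_snps).
    """
    updated, unresolved = [], []
    for _chrom, snp, genetic_distance, _old_position in map_rows:
        locations = remap.get(snp, [])
        if len(locations) == 1:
            new_chrom, new_position = locations[0]
            updated.append((new_chrom, snp, genetic_distance, new_position))
        else:
            # Missing from the table, or mapped to several locations.
            unresolved.append(snp)
    return updated, unresolved


# Hypothetical data, for illustration only.
rows = [("1", "snpA", "0", 1000), ("2", "snpB", "0", 2000), ("3", "snpC", "0", 3000)]
table = {"snpA": [("1", 1100)], "snpB": [("2", 2100), ("5", 500)]}

updated, unresolved = update_map(rows, table)
print(updated)
print(unresolved)
```

Reporting the unresolved SNPs separately makes it possible to drop them from the .ped file or to inspect them by hand.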
   Line 177: Line 177:  
== Various reasons that lift over could fail ==
 
   −
=== Genomic position cannot be lifted ===
+
=== Genome position cannot be lifted ===
When a SNP resides in a contig that exists only in the older reference build, liftOver cannot assign it a new genomic position.
+
When a SNP resides in a contig that exists only in the older reference build, liftOver cannot assign it a new genome position.
    
You can try the following SNP (in BED format) on the UCSC online liftOver site:
Line 186: Line 186:  
=== SNP in higher build are located in non-reference assembly ===
 
Some SNPs are not on autosomes or sex chromosomes in NCBI build 37. dbSNP does not include them.
You cannot use the dbSNP database to look up their genomic positions by rs number.
+
You cannot use the dbSNP database to look up their genome positions by rs number.
    
Take rs1006094 as an example:
 
Line 234: Line 234:     
== Resources ==
 
* liftRsNumber.py [[Media: liftRsNumber.py]]
+
* liftRsNumber.py []
* liftMap.py [[Media: liftMap.py]
+
* liftMap.py []
 
* NCBI provisional map [ftp://ftp.ncbi.nih.gov:/snp/organisms/human_9606/misc/exchange/Remap_36_3_37_1.txt.gz file] and [ftp://ftp.ncbi.nih.gov:/snp/organisms/human_9606/misc/exchange/Remap_36_3_37_1.info info]
 
* NCBI RsMergeArch file and [http://www.ncbi.nlm.nih.gov/SNP/snp_db_table_description.cgi?t=RsMergeArch schema]
+
* NCBI RsMergeArch [ftp://ftp.ncbi.nih.gov:/snp/organisms/human_9606/database/organism_data/RsMergeArch.bcp.gz file] and [http://www.ncbi.nlm.nih.gov/SNP/snp_db_table_description.cgi?t=RsMergeArch schema]
* NCBI SNPHistory file and [http://www.ncbi.nlm.nih.gov/SNP/snp_db_table_description.cgi?t=SNPHistory schema]
+
* NCBI SNPHistory [ftp://ftp.ncbi.nih.gov:/snp/organisms/human_9606/database/organism_data/SNPHistory.bcp.gz file] and [http://www.ncbi.nlm.nih.gov/SNP/snp_db_table_description.cgi?t=SNPHistory schema]
 
* How UCSC dbSNP differs from NCBI dbSNP [http://genomewiki.ucsc.edu/index.php/DbSNP_Track_Notes UCSC dbSNP track note]
 
 
* The dbSNP mapping process [http://www.ncbi.nlm.nih.gov/books/NBK44455/ link]
 