Regions of high linkage disequilibrium (LD)

From Genome Analysis Wiki

There are regions of long-range, high linkage disequilibrium (LD) in the human genome [1][2]. These regions should be excluded when performing certain analyses, such as principal component analysis (PCA), on genotype data.

[Figure: High-ld-b38.png]


Here is a list of positions for GRCh38 (build 38). These positions are distributed with the plinkQC R package and originate from Anderson et al. (2010) [3].

Chr Start Stop
chr1 47761740 51761740
chr1 125169943 125170022
chr1 144106678 144106709
chr1 181955019 181955047
chr2 85919365 100517106
chr2 87416141 87416186
chr2 87417804 87417863
chr2 87418924 87418981
chr2 89917298 89917322
chr2 135275091 135275210
chr2 182427027 189427029
chr2 207609786 207609808
chr3 47483505 49987563
chr3 83368158 86868160
chr5 44464140 51168409
chr5 129636407 132636409
chr6 25391792 33424245
chr6 26726947 26726981
chr6 57788603 58453888
chr6 61109122 61357029
chr6 61424410 61424451
chr6 139637169 142137170
chr7 54964812 66897578
chr7 62182500 62277073
chr8 8105067 12105082
chr8 43025699 48924888
chr8 47303500 47317337
chr8 110918594 113918595
chr9 40365644 40365693
chr9 64198500 64200392
chr9 88958735 88959017
chr10 36671065 43184546
chr10 41693521 41885273
chr11 88127183 91127184
chr12 32955798 41319931
chr12 34639034 34639084
chr14 87391719 87391996
chr14 94658026 94658080
chr17 43159541 43159574
chr20 4031884 4032441
chr20 33948532 36438183
chr22 30060084 30060162
chr22 42980497 42980522

Here is a list of positions for GRCh37 (build 37).

[Figure: High-ld.png]

Chr Start Stop
1 48000000 52000000
2 86000000 100500000
2 134500000 138000000
2 183000000 190000000
3 47500000 50000000
3 83500000 87000000
3 89000000 97500000
5 44500000 50500000
5 98000000 100500000
5 129000000 132000000
5 135500000 138500000
6 25000000 35000000
6 57000000 64000000
6 140000000 142500000
7 55000000 66000000
8 7000000 13000000
8 43000000 50000000
8 112000000 115000000
10 37000000 43000000
11 46000000 57000000
11 87500000 90500000
12 33000000 40000000
12 109500000 112000000
20 32000000 34500000
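
The build 37 and build 38 tables have three columns, whereas the PLINK set-range format used with --make-set further down this page expects, as far as the PLINK documentation describes it, a fourth column holding a set ID. The following is a minimal sketch of adding one with awk. The file name regions-b37.tsv, the three-row excerpt and the generated IDs (hild1, hild2, ...) are assumptions for illustration and do not correspond to the IDs in the build 36 table.

```shell
# Hypothetical input: a short excerpt of the build 37 table, header included.
cat > regions-b37.tsv <<'EOF'
Chr Start Stop
1 48000000 52000000
2 86000000 100500000
6 25000000 35000000
EOF

# Skip the header line and append a made-up set ID as a fourth column.
awk 'NR > 1 && NF >= 3 { n++; print $1, $2, $3, "hild" n }' regions-b37.tsv > high-ld.txt

cat high-ld.txt
```

The resulting high-ld.txt has one region per line (chromosome, start, stop, ID) and no header.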


These positions are for NCBI build 36 (hg18).

Chr Start Stop ID
1 48060567 52060567 hild1
2 85941853 100407914 hild2
2 134382738 137882738 hild3
2 182882739 189882739 hild4
3 47500000 50000000 hild5
3 83500000 87000000 hild6
3 89000000 97500000 hild7
5 44500000 50500000 hild8
5 98000000 100500000 hild9
5 129000000 132000000 hild10
5 135500000 138500000 hild11
6 25500000 33500000 hild12
6 57000000 64000000 hild13
6 140000000 142500000 hild14
7 55193285 66193285 hild15
8 8000000 12000000 hild16
8 43000000 50000000 hild17
8 112000000 115000000 hild18
10 37000000 43000000 hild19
11 46000000 57000000 hild20
11 87500000 90500000 hild21
12 33000000 40000000 hild22
12 109521663 112021663 hild23
20 32000000 34500000 hild24
X 14150264 16650264 hild25
X 25650264 28650264 hild26
X 33150264 35650264 hild27
X 55133704 60500000 hild28
X 65133704 67633704 hild29
X 71633704 77580511 hild30
X 80080511 86080511 hild31
X 100580511 103080511 hild32
X 125602146 128102146 hild33
X 129102146 131602146 hild34

Excluding Regions With PLINK

You can remove these regions from a PED file using the following PLINK commands, assuming the list of regions is stored in a file named "high-ld.txt":

  plink --file mydata --make-set high-ld.txt --write-set --out hild
  plink --file mydata --exclude hild.set --recode --out mydatatrimmed
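
To make clear what the exclusion amounts to, here is a minimal awk sketch of the same selection. It is not PLINK's own implementation. It assumes a four-column MAP file (chromosome, variant ID, genetic distance, base-pair position) and a region file with chromosome, start and stop in the first three columns. The file names and variant IDs are made up, and treating both region bounds as inclusive is an assumption.

```shell
# Demo inputs (hypothetical): two regions and four variants.
cat > demo-high-ld.txt <<'EOF'
6 25500000 33500000 hild12
8 8000000 12000000 hild16
EOF

cat > demo.map <<'EOF'
6 rsA 0 25499999
6 rsB 0 30000000
8 rsC 0 12000000
8 rsD 0 43000000
EOF

# First file (NR == FNR): remember each region's chromosome and bounds.
# Second file: print the variant ID (column 2) when its position (column 4)
# lies inside any region on the same chromosome, bounds included.
awk '
  NR == FNR { chr[NR] = $1; lo[NR] = $2; hi[NR] = $3; n = NR; next }
  {
    for (i = 1; i <= n; i++)
      if ($1 == chr[i] && $4 + 0 >= lo[i] + 0 && $4 + 0 <= hi[i] + 0) { print $2; break }
  }
' demo-high-ld.txt demo.map > demo-exclude.txt

cat demo-exclude.txt
```

A plain list of variant IDs like demo-exclude.txt is the kind of file that --exclude accepts. PLINK 1.9 also documents a "range" modifier for --exclude that reads a region file directly; check the documentation for your installed version before relying on it.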

References

  1. Price et al. (2008) Long-Range LD Can Confound Genome Scans in Admixed Populations. Am. J. Hum. Genet. 83, 132-135.
  2. Weale M. (2010) Quality Control for Genome-Wide Association Studies. In: Michael R. Barnes and Gerome Breen (eds.), Genetic Variation: Methods and Protocols, Methods in Molecular Biology, vol. 628. Springer. DOI 10.1007/978-1-60327-367-1_19
  3. Anderson et al. (2010) Data quality control in genetic case-control association studies. Nat. Protoc. 5(9), 1564-1573.