So, I left regular readers on a geographical cliffhanger last week in my search for a decent gazetteer of the whole world which we can use to let pledge creators tell us where their pledges apply (and to let people search for pledges near them). No doubt you expect me to now say that I’ve done this and that this marvellous new feature is now up and running on PledgeBank.
Sadly not.
The best data I’ve been able to find is the GEOnet Names Server dumps from the US Department Of Knowing What Places Are Called (or “National Geospatial Intelligence Agency”, as they call themselves). They maintain a big database of all the places in the world (except for the United States, which task falls to the US Geological Survey), mostly, as I understand it, for military applications. Presumably the idea here is that if some US soldier finds himself sharing Hicksville, Iraq, with something he wants to blow up, he can whip out his satellite-telephone, dial 1-800-US-AIR-FORCE (“You Call: We Bomb”) and, once he’s outwitted the phone menu and call-center staff, can arrange an air-strike without having to know anything tedious like his coordinates (“I’m sorry, could you repeat that, please. Do you mean Hicksville, Iraq, or Hicksville, Alabama?”).
Now, when I last looked at this data, it was full of random and quite significant errors (~5km, for the locations of villages in England — much larger than we’d expected from the coordinate transforms from WGS84 to OSGB36 and the National Grid). For its intended application I suppose this just comes down to a question of how big a bomb you’re prepared to use; for PledgeBank, this is irritating, but not fatal, since all we want is approximate location data which is good enough to let users look for things in the same general area as them.
(There are alternative gazetteers but those I’ve looked at are either proprietary, derived from the GNS data, much smaller than GNS, share its problems while adding new ones of their own, or several of the above. That said, I remain open to alternative suggestions.)
So, the plan is simple: grab all the GNS data (717MB of it), import it into a big database, then let people select their country and type in their (nearest) town, and look up coordinates from that. What could be simpler? It turns out that they’ve even abandoned their rather quaint practice of inventing their own character sets for everything, and now use UTF-8 for (most of) the fields in the database.
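As a rough illustration of that plan, here is a minimal sketch using Python and SQLite. The table layout, the `build_db` and `lookup` helpers, the "UK" country code and the sample rows (with placeholder coordinates) are all assumptions made for the example, not the real PledgeBank schema or the real GNS column names.

```python
import sqlite3

# Invented sample rows: (country code, place name, latitude, longitude).
# The coordinates are placeholders, not values taken from GNS.
SAMPLE_ROWS = [
    ("UK", "Ely", 52.4, 0.3),
    ("UK", "Sutton", 51.0, -1.0),
    ("UK", "Sutton", 52.0, 0.0),
]

def build_db(rows):
    """Load (country, name, lat, lon) tuples into an in-memory database."""
    db = sqlite3.connect(":memory:")
    # NOCASE makes name comparisons case-insensitive for ASCII letters only.
    db.execute(
        "CREATE TABLE place ("
        "country TEXT, name TEXT COLLATE NOCASE, lat REAL, lon REAL)"
    )
    db.execute("CREATE INDEX place_lookup ON place (country, name)")
    db.executemany("INSERT INTO place VALUES (?, ?, ?, ?)", rows)
    return db

def lookup(db, country, name):
    """Return every (lat, lon) recorded for this name in this country."""
    cur = db.execute(
        "SELECT lat, lon FROM place WHERE country = ? AND name = ? "
        "ORDER BY lat, lon",
        (country, name),
    )
    return cur.fetchall()

db = build_db(SAMPLE_ROWS)
print(lookup(db, "UK", "ely"))
```

`lookup` returns a list rather than a single pair because nothing in the data guarantees that a name occurs only once per country.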
Unfortunately, at this point there’s a more serious problem. If somebody in the UK, say, types “Cambridge” as their location, then probably they’re talking about Cambridge, Cambridgeshire; but there’s a small chance that they might be talking about Cambridge, Gloucestershire (population ~1,700), and we’d look like total muppets if we confused the two. Generally, place names have a habit of nonuniqueness; for instance in the GNS data for the UK we have,
| Occurrences | Name |
|---|---|
| 18 | Sutton |
| 17 | Weston |
| 17 | Middleton |
| 16 | Newton |
| 15 | Preston |
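A tally like the one above can be produced with a few lines of Python. This is only a sketch: the `FULL_NAME` column name is an assumption about the header row of the tab-separated dump, and the sample rows are invented stand-ins with placeholder coordinates.

```python
import csv
import io
from collections import Counter

# Tiny invented sample standing in for one country's names file.
# Column names are assumed; coordinates are placeholders.
SAMPLE_TSV = (
    "FULL_NAME\tADM1\tLAT\tLONG\n"
    "Sutton\t00\t51.0\t-1.0\n"
    "Sutton\t00\t52.0\t0.0\n"
    "Sutton\t00\t53.0\t-2.0\n"
    "Weston\t00\t51.5\t-2.5\n"
    "Weston\t00\t52.5\t-1.5\n"
    "Ely\t00\t52.4\t0.3\n"
)

def count_names(tsv_file):
    """Count how many rows carry each place name."""
    reader = csv.DictReader(tsv_file, delimiter="\t")
    return Counter(row["FULL_NAME"] for row in reader)

def ambiguous_names(counts):
    """Return (occurrences, name) pairs for names used more than once,
    most frequent first, ties broken alphabetically."""
    dupes = [(n, name) for name, n in counts.items() if n > 1]
    return sorted(dupes, key=lambda pair: (-pair[0], pair[1]))

counts = count_names(io.StringIO(SAMPLE_TSV))
for n, name in ambiguous_names(counts):
    print(n, name)
```

For a real dump you would pass an open file (with `encoding="utf-8"` and `newline=""`) instead of the `io.StringIO` sample.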
Now, ideally we’d disambiguate these by asking which one of those they meant, using the name of the enclosing administrative region or some other piece of information the user might be expected to know as a qualifier. Sadly, though GNS nominally has this kind of hierarchical structure (see the ADM1 and 2 fields in this list), in practice many placenames are coded without any information on enclosing geographical region, but with ADM1 set to, for instance, “00 — United Kingdom (general)”.
(As an aside, we don’t actually need this stuff for the UK particularly, because we can ask users for their postcodes and do coordinate lookups from that. But the reason I’m starting by looking at the UK part of the gazetteer is that I know the UK’s geography better than that of, say, Congo or France or somewhere, so it’s easier to see what processing steps are required. Plus, privacy-conscious users may prefer to name a town rather than give their actual postcode, since this makes it much harder for us to lock on to them with our orbiting mind-control LASER satellites.)
So, my current job is to invent some plausible heuristic for annotating nonunique place-names, either by trying to guess the administrative region in which they live (probably not usually practical) or adding other qualifiers such as “near Gloucester” or whatever. I suspect this won’t work all that well, but it only has to be good enough. After all, I’m not going to be using the results to bomb anyone….
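To make the “near Gloucester” idea concrete, here is one possible sketch, not a finished heuristic: label each ambiguous place with the nearest place whose own name is unique. The `qualify` function, the choice of “unique name” as the test for a usable landmark, and the rough sample coordinates are all assumptions made for illustration.

```python
import math
from collections import Counter

def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in km on a sphere of radius 6371 km."""
    p1, p2 = math.radians(lat1), math.radians(lat2)
    dphi = p2 - p1
    dlmb = math.radians(lon2 - lon1)
    a = (math.sin(dphi / 2) ** 2
         + math.cos(p1) * math.cos(p2) * math.sin(dlmb / 2) ** 2)
    return 2 * 6371.0 * math.asin(math.sqrt(a))

def qualify(places):
    """Given (name, lat, lon) tuples, return one display label per place.

    Unique names are left alone; repeated names get "(near X)", where X
    is the closest place with a unique name (an assumed landmark rule).
    """
    counts = Counter(name for name, _, _ in places)
    landmarks = [p for p in places if counts[p[0]] == 1]
    labels = []
    for name, lat, lon in places:
        if counts[name] == 1 or not landmarks:
            labels.append(name)
            continue
        nearest = min(
            landmarks,
            key=lambda p: haversine_km(lat, lon, p[1], p[2]),
        )
        labels.append(f"{name} (near {nearest[0]})")
    return labels

# Rough, hand-rounded coordinates for illustration only.
PLACES = [
    ("Cambridge", 52.2, 0.1),
    ("Cambridge", 51.7, -2.4),
    ("Ely", 52.4, 0.3),
    ("Gloucester", 51.9, -2.2),
]
for label in qualify(PLACES):
    print(label)
```

Two obvious weaknesses: two places with the same name can end up with the same landmark, and a uniquely named hamlet is not much use as a qualifier, so a real version would want some measure of how well known the landmark is.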
Please do not forget that places (especially near borders) tend to have different names in different languages, e.g. Brussel (Dutch) = Bruxelles (French) for the bilingual city of Brussels, or Liège (French) = Luik (Dutch) = Lüttich (German).
Place names also tend to change over time: Beijing (modern international spelling) = Peking (former international spelling), Ceylon = Sri Lanka, …
Hi,
btw, the NASA worldwind file (BlueMarble-Placenames.zip) might
be interesting as well:
http://sourceforge.net/project/showfiles.php?group_id=69528
The format is explained here:
http://www.worldwindcentral.com/wiki/Placename_Format
Would be cool to have a converter to CSV for this…
M.
The placename data for “Blue Marble” is just the GEOnet names data in a binary format. You can download the same data in tab-separated-values text files from the National Geospatial Intelligence Agency (link to “names files for countries and territories”).
geo-coding – capture part of the pledger’s postcode, first 3 characters maybe, and country as a 2-letter code?
“N11, UK” lets people know it’s Southgate in North London, England, but doesn’t give enough information to pinpoint the pledger’s exact location.
It would save you having to store places in a huge static database.
Although, I’m not sure how it would scale worldwide.
I have a database of 27,213 towns, each assigned to a postal region (e.g. NP7 has 38 towns and villages in it), and I have the corresponding county as well.
You could therefore cross-reference between the town names you have and the county names I have.
I have the GNS database available behind a simple REST interface (similar to geocoder.us)
http://brainoff.com/geocoder/rest/
No, it doesn’t handle multiple placenames well (or at all), but that’s a future improvement.
Regarding spatial and naming inaccuracies, seems like wiki-fying the database could be fruitful.