… or, “how near is ‘nearby’?”
On PledgeBank we offer search and local alert features which will tell users about pledges which have been set up near them, the idea being that if somebody’s organising a street party in the next street over, you might well want to hear about it, but if it’s somebody a thousand miles away, you probably don’t.
At the moment we do this by gathering location data from pledge creators (either using postcodes, or location names via Gaze), and comparing it to search / alert locations using a fixed distance threshold — presently 20km (or about 12 miles). This works moderately well, but leads to complaints from Londoners of the form “why have I been sent a pledge which is TEN MILES away from me?” — the point being that, within London, people’s idea of how far away “nearby” things is is quite different from that of people who live in the countryside — they mean one tube stop, or a few minutes’ walk, or whatever. If you live in the countryside, “nearby” might be the nearest village or even the nearest town.
So, ages ago we decided that the solution to this was to find some population density data and use it to produce an estimate for what is “nearby”, defined as, “the radius around a point which contains at least N people”. That should capture the difference between rural areas and small and large towns.
(In fairness, the London issue could be solved by having the code understand north vs south of the river as a special case, and never showing North-Londoners pledges in South London. But that’s just nasty.)
Unfortunately the better solution requires finding population density data for the whole world, which is troublesome. There seem to be two widely-used datasets with global coverage: NASA SEDAC’s Gridded Population of the World, and Oak Ridge National Laboratory’s Landscan database. GPW is built from census data and information about the boundaries of each administrative unit for which the census data is recorded, and Landscan improves on this by using remote-sensing data such as the distribution of night-time lights, transport networks and so forth.
(Why, you might wonder, is Oak Ridge National Laboratory interested in such a thing? It is, apparently, “for estimating ambient populations at risk” from natural disasters and whatnot. That’s very worthy, but I can’t help but wonder whether the original motivation for this sort of work may have been a touch more sinister. But what do I know?)
Anyway, licence terms seem to mean that we can use the GPW data and we can’t use the Landscan data, which is a pity, since the GPW data is only really very good in its coverage of rich western countries which produce very detailed census returns on, e.g., a per-municipality basis. Where census returns are only available on the level of regions, the results are less good. Anyway, subject to that caveat, it seems to solve the problem. Here’s a map showing a selection of points, and the circles around them which contain about 200,000 people (that seems to be about the right value for N):
The API to access this will go into the Gaze interface, but it’s not live yet. I’ll document the RESTful API when it is.
One last note, which might be of use to people working with the GPW data in the future. GPW is a cell-based grid: each cell is a region lying between two lines of longitude and two lines of latitude, and within each cell three variables are defined: the population in the cell, the population density of the cell, and the land area of the cell. (This is one of those rare exceptions described in to Alvy Ray Smith’s rant, A Pixel Is Not A Little Square….) But note that the land area is not the surface area of the cell, and the population density is not the population divided by the surface area of the cell!
This becomes important in the case of small islands; for instance (a case I hit debugging the code) the Scilly Isles. The quoted population density for the Scilly Isles is rather high: somewhere between 100 and 200 persons/km2, but when integrating the population density to find the total population in an area, this is absolutely not the right value to use: the proper value there is the total population of a cell, divided by its total surface area. The reason for that is that when sampling from the grid to find the value of the integrand (the population density) you don’t know, a priori, whether the point you’re sampling at has landed on land or non-land, but the quoted population density assumes that you are always asking about the land. When integrating, the total population of each cell should be “smeared out” over the whole area of the cell. If you don’t do this then you will get very considerable overestimates of the population in regions which contain islands.