1. More on customary proximity

    And a follow-up to my last post: the population density and customary proximity APIs are now available in Gaze. The additional APIs are:

    get_population_density
    Parameters:

    lat
    WGS84 latitude, in decimal degrees
    lon
    WGS84 longitude, in decimal degrees

    Return an estimate of the population density at (lat, lon), in persons per square kilometer, as a decimal number followed by a line feed.

    get_radius_containing_population
    Parameters:

    lat
    WGS84 latitude, in decimal degrees
    lon
    WGS84 longitude, in decimal degrees
    number
    number of persons
    maximum
    largest radius returned, in kilometers; optional; default 150

    Return an estimate of the smallest radius around (lat, lon) containing at least number persons, or maximum, if that value is smaller, as a decimal number followed by a line feed.

    For instance,
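    a request to either function might be put together as in the sketch below. This is an illustration in Python rather than code taken from Gaze: the base URL and the f parameter are the ones described in the Gaze web service post, while the helper names (gaze_url, parse_number) and the sample coordinates are invented for the example. Nothing here touches the network; it only builds the URLs and shows how the one-line reply could be read.

```python
from urllib.parse import urlencode

GAZE_URL = "http://gaze.mysociety.org/gaze-rest"  # base URL from the Gaze post

def gaze_url(function, **params):
    """Build a GET URL that selects a Gaze function via the f parameter."""
    query = {"f": function}
    # Optional parameters that were not supplied are simply left out.
    query.update({k: v for k, v in params.items() if v is not None})
    return GAZE_URL + "?" + urlencode(query)

def parse_number(body):
    """Both calls reply with a decimal number followed by a line feed."""
    return float(body.strip())

# Sample coordinates (roughly central London), chosen only for illustration.
density_url = gaze_url("get_population_density", lat=51.5, lon=-0.12)
radius_url = gaze_url("get_radius_containing_population",
                      lat=51.5, lon=-0.12, number=200000, maximum=None)
```

    Leaving maximum unset relies on the documented default of 150 kilometres.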

    Enjoy! Questions and comments to chris@mysociety.org, please.

  2. Population density and customary proximity

    … or, “how near is ‘nearby’?”

    On PledgeBank we offer search and local alert features which will tell users about pledges which have been set up near them, the idea being that if somebody’s organising a street party in the next street over, you might well want to hear about it, but if it’s somebody a thousand miles away, you probably don’t.

    At the moment we do this by gathering location data from pledge creators (either using postcodes, or location names via Gaze), and comparing it to search / alert locations using a fixed distance threshold, presently 20km (or about 12 miles). This works moderately well, but leads to complaints from Londoners of the form “why have I been sent a pledge which is TEN MILES away from me?” The point is that, within London, people’s idea of how far away “nearby” things are is quite different from that of people who live in the countryside: they mean one tube stop, or a few minutes’ walk, or whatever. If you live in the countryside, “nearby” might be the nearest village or even the nearest town.
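    The fixed-threshold comparison can be sketched in a few lines of Python. This is not the PledgeBank code: the function names, the dictionary layout of a pledge and the sample coordinates are all assumptions, and the Earth is treated as a sphere, which is plenty accurate at a 20km scale.

```python
import math

EARTH_RADIUS_KM = 6371.0  # mean Earth radius; a spherical approximation

def great_circle_km(lat1, lon1, lat2, lon2):
    """Haversine distance between two WGS84 points, in kilometres."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlam = math.radians(lon2 - lon1)
    a = (math.sin(dphi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(dlam / 2) ** 2)
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))

def nearby(pledges, lat, lon, threshold_km=20.0):
    """Return the pledges lying within threshold_km of (lat, lon)."""
    return [p for p in pledges
            if great_circle_km(lat, lon, p["lat"], p["lon"]) <= threshold_km]

# Hypothetical pledges with approximate coordinates.
pledges = [
    {"title": "street party", "lat": 51.52, "lon": -0.10},  # central London
    {"title": "beach clean", "lat": 50.82, "lon": -0.14},   # south coast
]
found = nearby(pledges, 51.50, -0.12)
```

    With the same threshold everywhere, a searcher in London and one in a remote village get circles of identical size, which is exactly the complaint described above.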

    So, ages ago we decided that the solution to this was to find some population density data and use it to produce an estimate for what is “nearby”, defined as, “the radius around a point which contains at least N people”. That should capture the difference between rural areas and small and large towns.
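    As a rough sketch of that definition (and only a sketch: the function names, the cell layout and the sample figures are assumptions, not the code we run), one can treat each grid cell as a point at its centre, sort the cells by distance from the query point, and add up populations until the running total reaches N:

```python
import math

EARTH_RADIUS_KM = 6371.0  # spherical approximation

def great_circle_km(lat1, lon1, lat2, lon2):
    """Haversine distance between two points, in kilometres."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlam = math.radians(lon2 - lon1)
    a = (math.sin(dphi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(dlam / 2) ** 2)
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))

def radius_containing_population(lat, lon, number, cells, maximum=150.0):
    """Smallest radius (km) around (lat, lon) holding at least `number` people.

    `cells` is an iterable of (cell_lat, cell_lon, population); each cell is
    crudely treated as a point at its centre. Returns `maximum` if the
    target is not reached within that distance.
    """
    by_distance = sorted(
        (great_circle_km(lat, lon, clat, clon), pop) for clat, clon, pop in cells
    )
    total = 0.0
    for distance, pop in by_distance:
        if distance > maximum:
            break
        total += pop
        if total >= number:
            return distance
    return maximum

# Invented cells: two dense ones close together, one further away.
cells = [(51.5, -0.1, 150000), (51.5, 0.0, 100000), (52.5, 0.0, 500000)]
urban_radius = radius_containing_population(51.5, -0.1, 200000, cells)
```

    A denser neighbourhood reaches N sooner, so the radius shrinks in towns and grows in the countryside.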

    (In fairness, the London issue could be solved by having the code understand north vs south of the river as a special case, and never showing North-Londoners pledges in South London. But that’s just nasty.)

    Unfortunately the better solution requires finding population density data for the whole world, which is troublesome. There seem to be two widely-used datasets with global coverage: NASA SEDAC’s Gridded Population of the World, and Oak Ridge National Laboratory’s Landscan database. GPW is built from census data and information about the boundaries of each administrative unit for which the census data is recorded, and Landscan improves on this by using remote-sensing data such as the distribution of night-time lights, transport networks and so forth.

    (Why, you might wonder, is Oak Ridge National Laboratory interested in such a thing? It is, apparently, “for estimating ambient populations at risk” from natural disasters and whatnot. That’s very worthy, but I can’t help but wonder whether the original motivation for this sort of work may have been a touch more sinister. But what do I know?)

    Anyway, licence terms seem to mean that we can use the GPW data and we can’t use the Landscan data, which is a pity, since the GPW data is only really very good in its coverage of rich western countries which produce very detailed census returns on, e.g., a per-municipality basis. Where census returns are only available on the level of regions, the results are less good. Anyway, subject to that caveat, it seems to solve the problem. Here’s a map showing a selection of points, and the circles around them which contain about 200,000 people (that seems to be about the right value for N):

    Map showing example proximity circles

    The API to access this will go into the Gaze interface, but it’s not live yet. I’ll document the RESTful API when it is.

    One last note, which might be of use to people working with the GPW data in the future. GPW is a cell-based grid: each cell is a region lying between two lines of longitude and two lines of latitude, and within each cell three variables are defined: the population in the cell, the population density of the cell, and the land area of the cell. (This is one of those rare exceptions described in Alvy Ray Smith’s rant, A Pixel Is Not A Little Square….) But note that the land area is not the surface area of the cell, and the population density is not the population divided by the surface area of the cell!

    This becomes important in the case of small islands; for instance (a case I hit debugging the code) the Scilly Isles. The quoted population density for the Scilly Isles is rather high: somewhere between 100 and 200 persons/km², but when integrating the population density to find the total population in an area, this is absolutely not the right value to use: the proper value there is the total population of a cell, divided by its total surface area. The reason for that is that when sampling from the grid to find the value of the integrand (the population density) you don’t know, a priori, whether the point you’re sampling at has landed on land or non-land, but the quoted population density assumes that you are always asking about the land. When integrating, the total population of each cell should be “smeared out” over the whole area of the cell. If you don’t do this then you will get very considerable overestimates of the population in regions which contain islands.
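    To make the difference concrete, here is a small Python sketch. The cell size, population and land area are invented round numbers for an island-like cell (they are not GPW values), and the function names are mine; the only real content is the formula for the surface area of a latitude/longitude cell on a sphere.

```python
import math

EARTH_RADIUS_KM = 6371.0  # spherical approximation

def cell_surface_area_km2(lat_south, lat_north, lon_width_deg):
    """Surface area of a cell bounded by two parallels and two meridians."""
    return (EARTH_RADIUS_KM ** 2
            * math.radians(lon_width_deg)
            * (math.sin(math.radians(lat_north))
               - math.sin(math.radians(lat_south))))

def smeared_density(population, lat_south, lat_north, lon_width_deg):
    """Population spread over the whole cell, sea included."""
    return population / cell_surface_area_km2(lat_south, lat_north, lon_width_deg)

# Hypothetical quarter-degree cell that is mostly sea.
population = 2000.0
land_area_km2 = 16.0
area = cell_surface_area_km2(49.75, 50.0, 0.25)

land_density = population / land_area_km2               # the quoted figure
integrand = smeared_density(population, 49.75, 50.0, 0.25)  # the one to integrate
```

    Integrating land_density over the whole cell would count the sea as if it were as crowded as the island; integrating the smeared value gives back the cell's actual population.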

  3. How we learn to stop worrying and love statistics

    Just a brief one today. MORI has recently done a poll chiefly on the subject of Britain’s nuclear deterrent. Now, here at mySociety we don’t have any political views, so no comments on The Bomb itself; but MORI did ask another question which intrigued me:

    And which, if any of the things on this list have you done in the last two or three years?

    What                                                          How many (%)
    Presented my views to a local councillor or MP                14
    Written a letter to an editor                                 6
    Urged someone outside my family to vote                       16
    Urged someone to get in touch with a local councillor or MP   12
    Made a speech before an organised group                       11
    Been an officer of an organisation or club                    8
    Stood for public office                                       1
    Taken an active part in a political campaign                  3
    Helped on fund raising drives                                 20
    Voted in the last general election                            68

    So, 14% of British adults have “presented [their] views to” a councillor or MP in the past 2–3 years. I presume most people will have interpreted the question as including writing to their MPs; that gives us something like 6 million letter-writers over that period. On WriteToThem, about 75% of messages are for MPs, so if those 6 million people sent one letter each over the three years, that works out as about 2,000 messages/year/MP, or about ten per working day.
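    For anyone who wants to check the sums, the back-of-envelope calculation looks like this in Python. The 6 million and 75% figures are the ones quoted above; the number of MPs and the number of working days in a year are my assumptions and are marked as such.

```python
letter_writers = 6_000_000    # people, from the 14% figure above
share_to_mps = 0.75           # proportion of WriteToThem messages sent to MPs
years = 3                     # the poll asked about the last two or three years
mps = 646                     # assumption: seats in the Commons after the 2005 election
working_days_per_year = 230   # assumption: a rough count of working days

per_mp_per_year = letter_writers * share_to_mps / years / mps
per_mp_per_working_day = per_mp_per_year / working_days_per_year

print(round(per_mp_per_year), round(per_mp_per_working_day, 1))
```

    That comes out at a little over 2,000 messages per MP per year, or roughly ten per working day, which is where the figures above come from.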

    That’s a lot lower than typical estimates I’ve heard (~50/day/MP). Of course, the poll asked about people rather than letters, so doesn’t account for people sending several letters over the given time period. However, judging by the WriteToThem data, that’s not all that significant an effect:
    [Plot of number of letters per author in WriteToThem, image gone]
    — something like 90% of letters sent through WriteToThem to MPs and councillors are the only ones sent by that author. (Note that this measurement is quite crude; in particular, I have identified two letters as being from the same author if they share a common email address. Also, since we remove all personal data about authors from messages after a little while, it only shows a few weeks’ worth of data. A further complication is that if an MP or councillor responds by email and the constituent sends a further email, they’re likely to do it by replying to the email, so not showing up as a further communication on that plot.)

    Anyway, if the crude data from WriteToThem are characteristic of all mail received by councillors and MPs, then MORI’s estimate of the number of people communicating with their MPs seems pretty low. Thoughts?

  4. Starting on GiveItAway

    So, right now I’m working on the first draft of mySociety’s fifth ODPM-funded project, GiveItAway. The site will let users tell local charities about stuff they want to get rid of but which might still be useful; in its first draft we’re going to aim for the simplest possible interface, partly because that’s the sort of interface we like, and partly because there are already other sites which address this sort of problem. There’s no point in pouring effort into the thing if we can’t do anything better than existing competitors, after all. More later in the week, anyway.

  5. Gaze web service

    A very quick post to announce the launch of a public interface to our Gaze web gazetteer service. The motivation behind Gaze is collecting location information from users without using maps (a clunky approach with poor accessibility and licensing problems) or postcodes (which do not have universal coverage and have privacy issues as well as licensing problems). Instead the idea is to use place names to identify locations, even in the presence of ambiguity, alternate names, etc. We do this by providing a search service over a large gazetteer (2.2 million places and 3 million names), and supplying additional contextual information to disambiguate common place names. The API is very simple, with one major function and two other supporting ones.

    Anyway, without further ado, here is the API. Internally we use one based on RABX, but we’ve done a special “RESTful” API for everyone else. All requests should be HTTP GETs; all parameters must be in UTF-8; and all responses are in UTF-8 plain text or comma-separated values. All calls should be passed to the URL,

    http://gaze.mysociety.org/gaze-rest

    selecting a particular function by specifying the HTTP parameter f, for instance

    http://gaze.mysociety.org/gaze-rest?f=get_find_places_countries

    Available functions are:

    get_country_from_ip
    Parameters:

    ip
    IPv4 address of a host, in dotted-quad format

    Guess the country of location of a host from its IP address. The result of this call will be an ISO country code, followed by a line feed; or, if it was not possible to determine a country, a line feed on its own.

    get_find_places_countries
    No parameters.

    Return the list of countries for which the find_places call has a gazetteer available. The list is returned as a list of ISO country codes followed by line feeds.

    find_places
    Parameters:

    country
    ISO country code of country in which to search for places
    state
    state in which to search for places; presently this is only meaningful for country=US (United States), in which case it should be a conventional two-letter state code (AZ, CA, NY etc.); optional
    query
    query term input by the user; must be at least two characters long
    maxresults
    largest number of results to return, from 1 to 100 inclusive; optional; default 10
    minscore
    minimum match score of returned results, from 1 to 100 inclusive; optional; default 0

    Returns in CSV format (as defined by this internet draft) with a one-line header a list of the following fields:

    name
    name of the place described by this row
    in-qualifier
    blank, or the name of an administrative region in which this place lies (for instance, a county)
    near-qualifier
    blank, or a list of nearby places, separated by commas
    latitude
    WGS-84 latitude of place in decimal degrees, north-positive
    longitude
    WGS-84 longitude of place in decimal degrees, east-positive
    state
    blank, or containing state code for US
    score
    match score for this place, from 0 to 100 inclusive
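    As a sketch of how a client might use find_places (in Python, and purely illustrative), the code below builds a request URL and parses a reply. The helper names are made up, the sample reply is invented text shaped like the description above rather than real output from the service, and because the exact wording of the header line is not specified here the parser skips it and relies on the column order.

```python
import csv
from urllib.parse import urlencode

GAZE_URL = "http://gaze.mysociety.org/gaze-rest"

# Columns in the order listed above; these key names are my own choice.
FIELDS = ["name", "in_qualifier", "near_qualifier",
          "latitude", "longitude", "state", "score"]

def find_places_url(country, query, state=None, maxresults=None, minscore=None):
    """Build a find_places request, leaving out unset optional parameters."""
    params = {"f": "find_places", "country": country, "state": state,
              "query": query, "maxresults": maxresults, "minscore": minscore}
    return GAZE_URL + "?" + urlencode(
        {k: v for k, v in params.items() if v is not None})

def parse_places(body):
    """Turn the CSV reply into a list of dicts, skipping the header line."""
    rows = list(csv.reader(body.splitlines()))
    places = []
    for row in rows[1:]:
        place = dict(zip(FIELDS, row))
        place["latitude"] = float(place["latitude"])
        place["longitude"] = float(place["longitude"])
        place["score"] = int(place["score"])
        places.append(place)
    return places

# Invented reply, for illustration only.
sample = ('Name,In,Near,Latitude,Longitude,State,Score\r\n'
          '"Cambridge","Cambridgeshire","",52.2,0.12,"",100\r\n'
          '"Cambridge","Gloucestershire","Slimbridge, Dursley",51.73,-2.36,"",100\r\n')

url = find_places_url("GB", "Cambridge", maxresults=5)
places = parse_places(sample)
```

    A user interface would then show the name together with the in- and near-qualifiers so that the user can pick the place they meant.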

    Enjoy! Questions and comments to hello@mysociety.org, please.

    Update: we’ve now added the facilities for discovering population densities and “customary proximity” (as discussed in this post) to Gaze. The additional APIs are documented here.

  6. Electoral geography again

    So, it’s back to electoral geography for me, this time to get the new county and county electoral-division boundaries live on WriteToThem. This is a prerequisite for getting mail to county councillors working again after the election on May 5th, so we’re already three months behind the times. But more generally, electoral boundaries are revised all the time to account for changes of population within each ward, constituency and so forth; and at most (local and national) elections some set of boundary changes takes effect. So to keep WriteToThem running we need to incorporate such updates routinely.

    The way we handle electoral geography in general is to start with Ordnance Survey’s Boundary Line product, which, for each administrative or electoral area in Great Britain gives a polygon identifying that region. We then take a big list of all the postcodes in Britain (CodePoint) and figure out which polygons they lie in. Then when somebody comes along to WriteToThem and types in their postcode, we can figure out which ward, constituency etc. they are in, and tell them appropriate things about their representatives. (Technically this is a lie, of course, because postcodes represent regions, not points — we use the centroids of those regions — and each such region isn’t guaranteed to lie either wholly within or without all electoral and administrative regions. Unfortunately there isn’t a lot we can do about this beyond throwing our hands up and saying “oops, sorry”, so that’s what we do.)
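    The "which polygons does this point lie in" step is the classic point-in-polygon test. A minimal Python sketch using ray casting follows; it is not our import code, the area names, postcodes and coordinates are invented, and real Boundary Line areas can consist of several polygons, which this ignores.

```python
def point_in_polygon(x, y, polygon):
    """Ray casting: count how often a ray in the +x direction crosses an edge."""
    inside = False
    n = len(polygon)
    for i in range(n):
        x1, y1 = polygon[i]
        x2, y2 = polygon[(i + 1) % n]
        # Only edges that straddle the horizontal line through y can be crossed.
        if (y1 > y) != (y2 > y):
            x_cross = x1 + (y - y1) * (x2 - x1) / (y2 - y1)
            if x < x_cross:
                inside = not inside
    return inside

def areas_containing(x, y, areas):
    """Names of all areas whose polygon contains the point."""
    return [name for name, polygon in areas.items()
            if point_in_polygon(x, y, polygon)]

# Two made-up square wards side by side, in arbitrary grid units.
areas = {
    "Ward A": [(0, 0), (10, 0), (10, 10), (0, 10)],
    "Ward B": [(10, 0), (20, 0), (20, 10), (10, 10)],
}
# Made-up postcodes mapped to their centroids.
postcode_centroids = {"AB1 2CD": (3, 4), "EF3 4GH": (15, 5)}

lookup = {postcode: areas_containing(x, y, areas)
          for postcode, (x, y) in postcode_centroids.items()}
```

    Because only the centroid is tested, a postcode that straddles a boundary is assigned wholly to one side, which is the "oops, sorry" case mentioned above.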

    As an aside, outside Great Britain (that is, in Northern Ireland) we don’t have the same sort of data, so instead we rely on another field in the CodePoint data which gives, for each postcode centroid, the ONS ward code for the ward in which that point lies. From that ward code you can find the enclosing local authority area, local electoral area (in Northern Ireland local councils are elected by STV over multimember regions, rather than by first-past-the-post as in Great Britain) and constituency. Happily it turns out that all of those other regions are composed of whole numbers of wards; this happy state of affairs does not necessarily prevail elsewhere.

    Now, twice a year, a new edition of Boundary Line is issued, taking account of recent changes in electoral geography. Usually this happens in May and October, though the schedule has been known to slip. In principle this should be easy to deal with: load up the new copy of Boundary Line, pass all the postcodes through it, and hey presto.

    Life, of course, is rarely that simple, and this isn’t one of those occasions. When the boundaries of a region don’t change between one year and the next, we don’t want to make any alteration to that region in our database (which uses ID numbers to identify each area). More specifically, when a new revision of Boundary Line comes along, we want to ensure that — let’s say — Cambridge Constituency in the new revision is identified with Cambridge Constituency in the old version. Now, in principle, this should be easy, because each area in the data set, in the words of the manual,

    … carries a unique identifier AI; this is the same identifier that was supplied in the previous specification of Boundary-Line. The same AI attribute is associated with every component polygon forming part of an administrative unit, irrespective of the number of polygons.

    Now, the first time that we did this, we worked from a copy of the Boundary Line data supplied in the form of “ShapeFiles” (a format used in various proprietary GIS systems, and with which our local government partners were able to supply us without having to order it specially from Ordnance Survey). Unfortunately in the ShapeFile version, the allegedly unique administrative area IDs were, in fact, not unique. After discussion with Ordnance Survey it was concluded that this was a problem which affected the translation of the data from NTF (“National Transfer Format”, their own preferred format) into ShapeFile; and that the problem would be fixed in the next release.

    So, taking no chances, we decided we’d work from the NTF format in future, since that seems to be closer to the authoritative source of the data, and anyway the ShapeFile format isn’t at all well-documented (for instance, many of the field names for the metadata about each area differ from those described in the manual for Boundary Line). So I’ve written code to parse the (slightly bonkers, natch) NTF files and modified our import scripts to use this code, with a view to then being able to keep up-to-date with future boundary revisions without too much trouble.

    You will not be surprised, therefore, to hear that this has not worked out exactly as planned. Unfortunately it appears that the May 2005 NTF release of Boundary Line suffers exactly the same problems of non-uniqueness as did the previous ShapeFile release. So unless some cleverer solution presents itself, I’ll have to revive the hack we intended to use with the ShapeFile data — try to construct unique IDs for areas from their geometry, and hope that the exact coordinates of the polygon vertices for unchanged areas do not change between revisions. We shall see. But right now I’m mostly worrying about why my parser script runs out of memory on my 1GB computer after reading a couple of hundred megs of input data.
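    For the curious, the geometry hack might look something like the Python sketch below. It is a guess at the shape of the thing rather than the code itself: the function names and the choice of SHA-1 are assumptions, and it only works if unchanged areas really do keep bit-for-bit identical vertex coordinates between revisions.

```python
import hashlib

def canonical_ring(vertices):
    """Drop a repeated closing vertex and rotate to start at the smallest one."""
    ring = list(vertices)
    if len(ring) > 1 and ring[0] == ring[-1]:
        ring = ring[:-1]
    start = ring.index(min(ring))
    return ring[start:] + ring[:start]

def area_id(polygons):
    """Identifier for an area, derived from the vertices of all its polygons."""
    ring_digests = []
    for vertices in polygons:
        text = ";".join(f"{x!r},{y!r}" for x, y in canonical_ring(vertices))
        ring_digests.append(hashlib.sha1(text.encode("ascii")).hexdigest())
    # Sort so that the order in which polygons are listed does not matter.
    combined = "|".join(sorted(ring_digests))
    return hashlib.sha1(combined.encode("ascii")).hexdigest()

# The same rectangle, listed from a different starting vertex in the "new" data.
old_revision = [[(0, 0), (4, 0), (4, 3), (0, 3), (0, 0)]]
new_revision = [[(4, 3), (0, 3), (0, 0), (4, 0), (4, 3)]]
# A genuinely changed boundary.
moved = [[(0, 0), (4, 0), (4, 4), (0, 4), (0, 0)]]

unchanged = area_id(old_revision) == area_id(new_revision)
changed = area_id(old_revision) != area_id(moved)
```

    This does not cope with a ring being listed in the opposite direction, nor with coordinates that shift by a metre because somebody re-digitised a hedge, so it remains a hack.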

  7. More Geography

    So, I left regular readers on a geographical cliffhanger last week in my search for a decent gazetteer of the whole world which we can use to let pledge creators tell us where their pledges apply (and to let people search for pledges near them). No doubt you expect me to now say that I’ve done this and that this marvellous new feature is now up and running on PledgeBank.

    Sadly not.

    The best data I’ve been able to find is the GEOnet Names Server dumps from the US Department Of Knowing What Places Are Called (or “National Geospatial Intelligence Agency”, as they call themselves). They maintain a big database of all the places in the world (except for the United States, which task falls to the US Geological Survey), mostly, as I understand it, for military applications. Presumably the idea here is that if some US soldier finds himself sharing Hicksville, Iraq, with something he wants to blow up, he can whip out his satellite-telephone, dial 1-800-US-AIR-FORCE (“You Call: We Bomb”) and, once he’s outwitted the phone menu and call-center staff, can arrange an air-strike without having to know anything tedious like his coordinates (“I’m sorry, could you repeat that, please. Do you mean Hicksville, Iraq, or Hicksville, Alabama?”).

    Now, when I last looked at this data, it was full of random and quite significant errors (~5km, for the locations of villages in England — much larger than we’d expected from the coordinate transforms from WGS84 to OSGB36 and the National Grid). For its intended application I suppose this just comes down to a question of how big a bomb you’re prepared to use; for PledgeBank, this is irritating, but not fatal, since all we want is approximate location data which is good enough to let users look for things in the same general area as them.

    (There are alternative gazetteers but those I’ve looked at are either proprietary, derived from the GNS data, much smaller than GNS, share its problems while adding new ones of their own, or several of the above. That said, I remain open to alternative suggestions.)

    So, the plan is simple: grab all the GNS data (717MB of it), import it into a big database, then let people select their country and type in their (nearest) town, and look up coordinates from that. What could be simpler? It turns out that they’ve even abandoned their rather quaint practice of inventing their own character sets for everything, and now use UTF-8 for (most of) the fields in the database.

    Unfortunately, at this point there’s a more serious problem. If somebody in the UK, say, types “Cambridge” as their location, then probably they’re talking about Cambridge, Cambridgeshire; but there’s a small chance that they might be talking about Cambridge, Gloucestershire (population ~1,700), and we’d look like total muppets if we confused the two. Generally, place names have a habit of nonuniqueness; for instance in the GNS data for the UK we have,

    Occurrences Name
    18 Sutton
    17 Weston
    17 Middleton
    16 Newton
    15 Preston

    Now, ideally we’d disambiguate these by asking which one of those they meant, using the name of the enclosing administrative region or some other piece of information the user might be expected to know as a qualifier. Sadly, though GNS nominally has this kind of hierarchical structure (see the ADM1 and 2 fields in this list), in practice many placenames are coded without any information on enclosing geographical region, but with ADM1 set to, for instance, “00 — United Kingdom (general)”.

    (As an aside, we don’t actually need this stuff for the UK particularly, because we can ask users for their postcodes and do coordinate lookups from that. But the reason I’m starting by looking at the UK part of the gazetteer is that I know the UK’s geography better than that of, say, Congo or France or somewhere, so it’s easier to see what processing steps are required. Plus, privacy-conscious users may prefer to name a town rather than give their actual postcode, since this makes it much harder for us to lock on to them with our orbiting mind-control LASER satellites.)

    So, my current job is to invent some plausible heuristic for annotating nonunique place-names, either by trying to guess the administrative region in which they live (probably not usually practical) or adding other qualifiers such as “near Gloucester” or whatever. I suspect this won’t work all that well, but it only has to be good enough. After all, I’m not going to be using the results to bomb anyone….
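    One candidate heuristic, sketched in Python: label each ambiguous name with the nearest place whose name is unique. This is an illustration of the idea and not what will necessarily end up in the code; the function names are invented and the coordinates below are approximate values typed in by hand, not GNS data.

```python
import math
from collections import Counter

EARTH_RADIUS_KM = 6371.0  # spherical approximation

def great_circle_km(lat1, lon1, lat2, lon2):
    """Haversine distance between two points, in kilometres."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlam = math.radians(lon2 - lon1)
    a = (math.sin(dphi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(dlam / 2) ** 2)
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))

def qualify(places):
    """Add a "near X" qualifier to every name that occurs more than once.

    `places` is a list of (name, lat, lon); X is the closest place whose
    own name is unambiguous.
    """
    counts = Counter(name for name, _, _ in places)
    landmarks = [p for p in places if counts[p[0]] == 1]
    labels = []
    for name, lat, lon in places:
        if counts[name] == 1 or not landmarks:
            labels.append(name)
            continue
        nearest = min(landmarks,
                      key=lambda p: great_circle_km(lat, lon, p[1], p[2]))
        labels.append(f"{name} (near {nearest[0]})")
    return labels

# Approximate coordinates, for illustration only.
places = [
    ("Cambridge", 52.21, 0.12),    # Cambridgeshire
    ("Cambridge", 51.73, -2.36),   # Gloucestershire
    ("Ely", 52.40, 0.26),
    ("Gloucester", 51.86, -2.24),
]
labels = qualify(places)
```

    A real version would want to prefer well-known or populous places as qualifiers, since "near" some equally obscure hamlet does not help anybody.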

  8. Geography

    So, it’s my turn to write something here again. Ho-hum. Anyway, lately I’ve been working on adding the geographical features to PledgeBank which everyone thinks are there already: specifically, finding pledges which are nearby. To start with, we’re doing this for the UK only (because we already have the infrastructure to do postcode-to-coordinates lookups through MaPit), but the intention is to do this for the whole world as soon as we can, either by having users select their location through a gazetteer, or, where we can get the data, by using a similar postcode/zip-code/whatever-to-coordinates lookup. So that means we have to deal with places which might be anywhere on earth, which means dealing with latitudes and longitudes. And as anyone who’s dealt with this stuff knows, it’s very tedious to get this right. I’m afraid I’ve now spent too long reading about datum ellipsoids and Helmert Transforms to want to spend any time at all talking about them, so in the unlikely event that you’re interested, you’ll just have to read the (relevant parts of) code.

    Oh, and if anybody has good suggestions for a hierarchical gazetteer with world-wide coverage, I’d love to hear them. Similarly, does anybody know if this one is any good?

  9. More PledgeBank

    So, it’s my turn to write something here. Well, this’ll be short. As Francis mentions below, PledgeBank launched today, and everything’s gone reasonably smoothly, with the exception of some tedious PHP bug we haven’t tracked down yet. With any luck the new version of PHP will fix it, so there won’t be hours of painful debugging to do. (I could bore you for hours with my opinions of PHP — actually, I could probably shorten that quite a lot if I were allowed to swear, but this is a family-friendly ‘blog — but let’s just say that fixing PHP problems Is Not My Favourite Job.)

    Instead I’ll say what I’m doing right now, which is beginning to add geographical lookup to PledgeBank. At the moment we ask pledge authors for a country (though it can either be “UK” or “global” at the moment), and, if they’re in the UK, a postcode. The idea is to “georeference” (i.e. look up the coordinates of) the postcode, though we don’t actually do that yet. So I’m modifying the database a bit to store coordinates (as a latitude/longitude, so that we don’t have to write a separate case for every wacky national coordinate grid) and generalise the notion of “country” so that we can let non-Brits actually put in their own countries when they create pledges.

    Other things we’ve discovered today:

    • People are confused by the “(suspicious signer?)” link next to signatures on each pledge page — several people thought that we were reporting our suspicions of the signer. You probably think that’s stupid, but if so that’s only because you’re familiar with sites that have this sort of retroactive moderation button everywhere. Actually it’s us that’s being stupid and we’re going to remove it until we have a better way to implement it — at the moment we think we’re mostly on top of the occasional joke/abusive signature.
    • People are confused by the pledge signature confirmation mail, which currently reads (for instance),

      Please click on the link below to confirm your signature on the pledge at the bottom of this email.

      http://www.pledgebank.com/L/…

      The pledge reads:

      ‘Phil Booth will refuse to register for an ID card and will donate £10 to a legal defence fund but only if 10,000 other people will also make this same pledge.’

      — the PledgeBank.com team

      We got several emails from people saying “your site has got my name wrong — I’m not Phil Booth”. The point is that Phil Booth wrote the pledge, so it’s in his name; the email reflects that. But that’s not obvious to the signer, and since the only name in the body of the email isn’t theirs, they think it’s got it wrong and complain. (To be fair, only three out of ~1,100 did, but that’s still bad.) This is the sort of problem we need user testing to spot: none of us saw anything wrong with the text when we were testing it. So we need to reword that.

    • We’ve had several people email to say that they’d like to do versions of PledgeBank in their own countries, and we’d like to hear from anyone interested in localising the mySociety projects who has time, expertise or even just opinions to donate. If that’s you, please get in touch!

    And probably some other stuff, but I said this post would be short….

  10. PledgeBank Launch

    So, PledgeBank launched today. Already, you can sign pledges to have trees planted to compensate for your CO2 emissions, help clean up the banks of the River Taff, support local retailers, and one or two, err, more politically controversial ones. So, take a look, and sign — or, better, create — some pledges!