1. Time shifting

    So, we’ve been a bit quiet on this blog, but naturally busy. I just did my invoice and timesheet for last month, and remembered how bitty it has been. In one day I often do things to 3 websites, and that is just going by CVS commit messages – no doubt I handled emails for more. This makes it quite hard to summarise what has been happening, and also quite hard to measure how much time we spend maintaining each website.

    We’ve recently made a London version of PledgeBank, which I’ll remind Tom to explain about on the main news blog. It is a PledgeBank “microsite”, with a special query for the front page and all-pledges page that shows only pledges in Greater London. Greater London is conveniently almost exactly a circle of radius 25km with centre at 51.5N -0.1166667E. I worked that out by dividing the area (found on the Greater London Wikipedia page) by pi, taking the square root, and rounding up a bit.
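
    For the curious, the arithmetic goes like this (a quick sketch; the area figure is the one I took from Wikipedia at the time):

```python
import math

# Area of Greater London in km^2, as quoted on its Wikipedia page
# (an assumed figure for this sketch).
area_km2 = 1572

# Treat the region as a circle: area = pi * r^2, so r = sqrt(area / pi).
radius_km = math.sqrt(area_km2 / math.pi)
print(round(radius_km, 1))  # about 22.4

# ... which we round up a bit, to the 25km used in the microsite query.
radius_used_km = 25
```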

    Yesterday we launched a new call for proposals – head on over, and tell us your ideas for new civic websites. It is another WordPress modification, but this time to the very blog that you’re reading now. I made the form for submitting proposals anew. It creates a new low-privileged WordPress user by inserting directly into the database, and then calls the function wp_insert_post to create a post by them in a special category. The rest of the blogging software then gives us comments, RSS, search, email alerts and archiving for free.

    Meanwhile, Chris has written some monitoring software for our servers, to alert us to problems and potential problems. Perl modules do the tests: things like checking that there is enough disk space, and that the web servers are up. I’ve been tweaking it a bit, for example adding a test to watch for long-running PostgreSQL queries, which indicate a deadlock. We’ve got a problem in the PledgeBank SMS code which sometimes causes deadlocks, which we’re still debugging.
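
    The idea of that test, sketched in Python rather than the Perl we actually use (the row format and the 5-minute threshold here are made up for illustration):

```python
from datetime import datetime, timedelta

def long_running(rows, now, threshold=timedelta(minutes=5)):
    """Return (pid, query) for queries running longer than the threshold.

    `rows` are (pid, query_start, query) tuples, roughly the shape of
    data you would get from PostgreSQL's pg_stat_activity view.
    """
    return [(pid, query) for pid, started, query in rows
            if now - started > threshold]

now = datetime(2005, 9, 1, 12, 0, 0)
rows = [
    (101, now - timedelta(seconds=30), "SELECT ..."),          # fine
    (102, now - timedelta(minutes=42), "UPDATE pledges ..."),  # possibly deadlocked
]
print(long_running(rows, now))  # flags only pid 102
```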

  2. Postcodeine

    So, a silly post for today: Postcodeine. This is a British version of Ben Fry’s zipdecode, a “tool” for visualising the distribution of zipcodes in the United States. This is, as has been pointed out to me, wholly pointless, but it’s quite fun and writing it was an interesting exercise (it also taught me a little bit about AJAX, the web’s technology trend du jour). If you want the source code, it’s at the foot here; licence is the Affero GPL, as for all the other mySociety code.

    How it works: this is pretty obvious, but I might as well spell it out. The web page has four images on it: the big and small base maps, and two overlays. The back-end code is responsible for drawing sets of postcode locations into transparent PNGs, and when you type things in the text field, the src for each of the overlay images is changed. Panning the large map is done by issuing another request from Javascript to grab the mean location of all postcodes matching the given prefix (slightly hobbled, so that this isn’t a generalised postcode-to-coordinates oracle — sorry!); the rightmost pane, with a list of postcodes and their areas, is populated from another HTTP request. It could be done with an iframe but, as Paul Graham puts it, “Javascript works now”, so we might as well use that.
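
    The mean-location lookup amounts to something like this (a Python sketch with made-up sample points, not the real back-end code):

```python
def mean_location(postcodes, prefix):
    """Mean (lat, lon) of all postcodes starting with the given prefix.

    `postcodes` maps postcode -> (lat, lon). Plain averaging is fine
    over an area the size of the UK.
    """
    matches = [coords for pc, coords in postcodes.items()
               if pc.startswith(prefix)]
    if not matches:
        return None
    lats, lons = zip(*matches)
    return (sum(lats) / len(lats), sum(lons) / len(lons))

# Made-up sample points:
postcodes = {
    "CB4 1AA": (52.23, 0.12),
    "CB4 2BB": (52.21, 0.14),
    "OX1 1AA": (51.75, -1.26),
}
print(mean_location(postcodes, "CB4"))  # roughly (52.22, 0.13)
```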

    (I should say, by the way, that I wrote this in my copious spare time. It’s copyright mySociety because I don’t have the right to use the postcode database myself.)

  3. Population density and customary proximity

    … or, “how near is ‘nearby’?”

    On PledgeBank we offer search and local alert features which will tell users about pledges which have been set up near them, the idea being that if somebody’s organising a street party in the next street over, you might well want to hear about it, but if it’s somebody a thousand miles away, you probably don’t.

    At the moment we do this by gathering location data from pledge creators (either using postcodes, or location names via Gaze), and comparing it to search / alert locations using a fixed distance threshold — presently 20km (or about 12 miles). This works moderately well, but leads to complaints from Londoners of the form “why have I been sent a pledge which is TEN MILES away from me?” — the point being that, within London, people’s idea of how far away “nearby” things are is quite different from that of people who live in the countryside: they mean one tube stop, or a few minutes’ walk, or whatever. If you live in the countryside, “nearby” might be the nearest village or even the nearest town.

    So, ages ago we decided that the solution to this was to find some population density data and use it to produce an estimate for what is “nearby”, defined as, “the radius around a point which contains at least N people”. That should capture the difference between rural areas and small and large towns.
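
    That definition can be sketched on a toy flat grid like so (everything here is illustrative; the real version works over GPW cells on a spherical Earth):

```python
def nearby_radius(grid, cx, cy, n, max_radius=50):
    """Smallest radius (in cells) around (cx, cy) containing >= n people.

    `grid[y][x]` is the population of one grid cell; distance is measured
    between cell centres.
    """
    for r in range(1, max_radius + 1):
        total = sum(grid[y][x]
                    for y in range(len(grid))
                    for x in range(len(grid[0]))
                    if ((x - cx) ** 2 + (y - cy) ** 2) ** 0.5 <= r)
        if total >= n:
            return r
    return None

# A dense "town" cell in the middle of sparse "countryside":
grid = [[10] * 5 for _ in range(5)]
grid[2][2] = 1000

print(nearby_radius(grid, 2, 2, 1000))  # in town, "nearby" is small
print(nearby_radius(grid, 0, 0, 1000))  # in the countryside it grows
```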

    (In fairness, the London issue could be solved by having the code understand north vs south of the river as a special case, and never showing North-Londoners pledges in South London. But that’s just nasty.)

    Unfortunately the better solution requires finding population density data for the whole world, which is troublesome. There seem to be two widely-used datasets with global coverage: NASA SEDAC’s Gridded Population of the World, and Oak Ridge National Laboratory’s Landscan database. GPW is built from census data and information about the boundaries of each administrative unit for which the census data is recorded, and Landscan improves on this by using remote-sensing data such as the distribution of night-time lights, transport networks and so forth.

    (Why, you might wonder, is Oak Ridge National Laboratory interested in such a thing? It is, apparently, “for estimating ambient populations at risk” from natural disasters and whatnot. That’s very worthy, but I can’t help but wonder whether the original motivation for this sort of work may have been a touch more sinister. But what do I know?)

    Anyway, licence terms seem to mean that we can use the GPW data but not the Landscan data, which is a pity, since the GPW data is only really good in its coverage of rich western countries which produce very detailed census returns on, e.g., a per-municipality basis. Where census returns are only available at the level of regions, the results are less good. Still, subject to that caveat, it seems to solve the problem. Here’s a map showing a selection of points, and the circles around them which contain about 200,000 people (that seems to be about the right value for N):

    Map showing example proximity circles

    The API to access this will go into the Gaze interface, but it’s not live yet. I’ll document the RESTful API when it is.

    One last note, which might be of use to people working with the GPW data in the future. GPW is a cell-based grid: each cell is a region lying between two lines of longitude and two lines of latitude, and within each cell three variables are defined: the population in the cell, the population density of the cell, and the land area of the cell. (This is one of those rare exceptions described in Alvy Ray Smith’s rant, A Pixel Is Not A Little Square….) But note that the land area is not the surface area of the cell, and the population density is not the population divided by the surface area of the cell!

    This becomes important in the case of small islands; for instance (a case I hit debugging the code) the Scilly Isles. The quoted population density for the Scilly Isles is rather high: somewhere between 100 and 200 persons/km2, but when integrating the population density to find the total population in an area, this is absolutely not the right value to use: the proper value there is the total population of a cell, divided by its total surface area. The reason for that is that when sampling from the grid to find the value of the integrand (the population density) you don’t know, a priori, whether the point you’re sampling at has landed on land or non-land, but the quoted population density assumes that you are always asking about the land. When integrating, the total population of each cell should be “smeared out” over the whole area of the cell. If you don’t do this then you will get very considerable overestimates of the population in regions which contain islands.
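
    In code, the “smearing out” looks something like this (a sketch using a spherical-Earth area formula and made-up island figures; the variable names are mine, not GPW’s):

```python
import math

EARTH_RADIUS_KM = 6371.0

def cell_surface_area(lat0, lat1, lon0, lon1):
    """Surface area (km^2) of the cell between two parallels and meridians."""
    dlon = math.radians(lon1 - lon0)
    return (EARTH_RADIUS_KM ** 2 * dlon
            * (math.sin(math.radians(lat1)) - math.sin(math.radians(lat0))))

def density_for_integration(cell_population, lat0, lat1, lon0, lon1):
    """Population per km^2 smeared over the WHOLE cell, land and sea alike."""
    return cell_population / cell_surface_area(lat0, lat1, lon0, lon1)

# A 2.5-arc-minute cell that is mostly sea, containing a small island
# with 200 inhabitants and 1.5 km^2 of land (all figures made up):
lat0, lon0 = 49.9, -6.4
lat1, lon1 = lat0 + 2.5 / 60, lon0 + 2.5 / 60

area = cell_surface_area(lat0, lat1, lon0, lon1)                # ~14 km^2
quoted_land_density = 200 / 1.5                                 # ~133 /km^2
smeared = density_for_integration(200, lat0, lat1, lon0, lon1)  # ~14 /km^2
```

    Integrating the quoted land-only density over the whole cell would count nearly ten times too many people here; the smeared value is the right integrand.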

  4. Gaze web service

    A very quick post to announce the launch of a public interface to our Gaze web gazetteer service. The motivation behind Gaze is collecting location information from users without using maps (a clunky approach with poor accessibility and licensing problems) or postcodes (which do not have universal coverage and have privacy issues as well as licensing problems). Instead the idea is to use place names to identify locations, even in the presence of ambiguity, alternate names, etc. We do this by providing a search service over a large gazetteer (2.2 million places and 3 million names), and supplying additional contextual information to disambiguate common place names. The API is very simple, with one major function and two other supporting ones.

    Anyway, without further ado, here is the API. Internally we use one based on RABX, but we’ve done a special “RESTful” API for everyone else. All requests should be HTTP GETs; all parameters must be in UTF-8; and all responses are in UTF-8 plain text or comma-separated values. All calls should be passed to the URL,

    http://gaze.mysociety.org/gaze-rest

    selecting a particular function by specifying the HTTP parameter f, for instance

    http://gaze.mysociety.org/gaze-rest?f=get_find_places_countries

    Available functions are:

    get_country_from_ip
    Parameters:

    ip
    IPv4 address of a host, in dotted-quad format

    Guess the country of location of a host from its IP address. The result of this call will be an ISO country code, followed by a line feed; or, if it was not possible to determine a country, a line feed on its own.

    get_find_places_countries
    No parameters. Return the list of countries for which the find_places call has a gazetteer available. The list is returned as a list of ISO country codes followed by line feeds.

    find_places
    Parameters:

    country
    ISO country code of country in which to search for places
    state
    state in which to search for places; presently this is only meaningful for country=US (United States), in which case it should be a conventional two-letter state code (AZ, CA, NY etc.); optional
    query
    query term input by the user; must be at least two characters long
    maxresults
    largest number of results to return, from 1 to 100 inclusive; optional; default 10
    minscore
    minimum match score of returned results, from 1 to 100 inclusive; optional; default 0

    Returns, in CSV format (as defined by this internet draft) with a one-line header, a list of the following fields:

    name
    name of the place described by this row
    in-qualifier
    blank, or the name of an administrative region in which this place lies (for instance, a county)
    near-qualifier
    blank, or a list of nearby places, separated by commas
    latitude
    WGS-84 latitude of place in decimal degrees, north-positive
    longitude
    WGS-84 longitude of place in decimal degrees, east-positive
    state
    blank, or containing state code for US
    score
    match score for this place, from 0 to 100 inclusive
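
    Putting the pieces together, a find_places request could be built like this (a Python sketch; the endpoint and parameter names are as documented above, the example values are mine):

```python
from urllib.parse import urlencode

BASE = "http://gaze.mysociety.org/gaze-rest"

def find_places_url(country, query, maxresults=10):
    """Build the URL for a find_places call (parameters must be UTF-8)."""
    params = {"f": "find_places", "country": country,
              "query": query, "maxresults": maxresults}
    return BASE + "?" + urlencode(params)

url = find_places_url("GB", "Cambridge", maxresults=5)
print(url)

# The response is CSV with a one-line header,
#   name,in-qualifier,near-qualifier,latitude,longitude,state,score
# which you might fetch with urllib.request and parse with csv.DictReader.
```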

    Enjoy! Questions and comments to hello@mysociety.org, please.

    Update: we’ve now added the facilities for discovering population densities and “customary proximity” (as discussed in this post) to Gaze. The additional APIs are documented here.

  5. Placeopedia and YourHistoryHere

    mySociety yesterday launched a pair of Back o’ The Envelope projects based on Google Maps.

    Placeopedia.com — Connect Wikipedia articles with the places they represent

    YourHistoryHere.com — Share local and geographic history and trivia.

    There are a few things to say about both projects:

    1 – As is normal with mySociety projects, the code for these projects (excepting Google Maps) is open source. We hope that by providing a ready-made annotation system, people will find it easier to make their own publicly-authored layers of information.

    2 – Both sites syndicate their data under open source licenses, and in a location-queryable fashion. This is really important, as it allows for all types of nice local history to be syndicated to tourism sites, local community discussion boards, blogs and so on.

    3 – We’re calling them ‘Back o’ the Envelope’ to contrast them with the big, polished and time-consuming projects we run, like PledgeBank.com and WriteToThem.com.

  6. All the places in the world

    Lots of countries are gradually loading into one of our servers. There are 220MB of data covering 227 countries, with about 5,000,000 places altogether. With a global population of about 6 billion, that means the average “place” has 1,200 people living in it. For each place we have the latitude and longitude. (All this data comes from the US military.)

    Try it out by signing up for a local alert in any country. Let us know if you find any bugs, or have any problems or suggestions to make. Also, if you want access to this gazetteer as a web service, send us a mail.

    Currently it’s up to Uruguay; it’ll be a bit longer before we’ve finished the alphabet. It takes quite a while partly because of the volume of data and indices being built, partly because for places with the same name as each other it hunts for nearby towns to disambiguate, and partly because we didn’t optimise the Perl script. It won’t run very often.

  7. More Geography

    So, I left regular readers on a geographical cliffhanger last week in my search for a decent gazetteer of the whole world which we can use to let pledge creators tell us where their pledges apply (and to let people search for pledges near them). No doubt you expect me now to say that I’ve done this and that this marvellous new feature is up and running on PledgeBank.

    Sadly not.

    The best data I’ve been able to find is the GEOnet Names Server dumps from the US Department Of Knowing What Places Are Called (or “National Geospatial Intelligence Agency”, as they call themselves). They maintain a big database of all the places in the world (except for the United States, which task falls to the US Geological Survey), mostly, as I understand it, for military applications. Presumably the idea here is that if some US soldier finds himself sharing Hicksville, Iraq, with something he wants to blow up, he can whip out his satellite-telephone, dial 1-800-US-AIR-FORCE (“You Call: We Bomb”) and, once he’s outwitted the phone menu and call-center staff, can arrange an air-strike without having to know anything tedious like his coordinates (“I’m sorry, could you repeat that, please. Do you mean Hicksville, Iraq, or Hicksville, Alabama?”).

    Now, when I last looked at this data, it was full of random and quite significant errors (~5km, for the locations of villages in England — much larger than we’d expected from the coordinate transforms from WGS84 to OSGB36 and the National Grid). For its intended application I suppose this just comes down to a question of how big a bomb you’re prepared to use; for PledgeBank, this is irritating, but not fatal, since all we want is approximate location data which is good enough to let users look for things in the same general area as them.

    (There are alternative gazetteers but those I’ve looked at are either proprietary, derived from the GNS data, much smaller than GNS, share its problems while adding new ones of their own, or several of the above. That said, I remain open to alternative suggestions.)

    So, the plan is simple: grab all the GNS data (717MB of it), import it into a big database, then let people select their country and type in their (nearest) town, and look up coordinates from that. What could be simpler? It turns out that they’ve even abandoned their rather quaint practice of inventing their own character sets for everything, and now use UTF-8 for (most of) the fields in the database.

    Unfortunately, at this point there’s a more serious problem. If somebody in the UK, say, types “Cambridge” as their location, then probably they’re talking about Cambridge, Cambridgeshire; but there’s a small chance that they might be talking about Cambridge, Gloucestershire (population ~1,700), and we’d look like total muppets if we confused the two. Generally, place names have a habit of nonuniqueness; for instance in the GNS data for the UK we have,

    Occurrences Name
    18 Sutton
    17 Weston
    17 Middleton
    16 Newton
    15 Preston

    Now, ideally we’d disambiguate these by asking which one of those they meant, using the name of the enclosing administrative region or some other piece of information the user might be expected to know as a qualifier. Sadly, though GNS nominally has this kind of hierarchical structure (see the ADM1 and 2 fields in this list), in practice many placenames are coded without any information on enclosing geographical region, but with ADM1 set to, for instance, “00 — United Kingdom (general)”.

    (As an aside, we don’t actually need this stuff for the UK particularly, because we can ask users for their postcodes and do coordinate lookups from that. But the reason I’m starting by looking at the UK part of the gazetteer is that I know the UK’s geography better than that of, say, Congo or France or somewhere, so it’s easier to see what processing steps are required. Plus, privacy-conscious users may prefer to name a town rather than give their actual postcode, since this makes it much harder for us to lock on to them with our orbiting mind-control LASER satellites.)

    So, my current job is to invent some plausible heuristic for annotating nonunique place-names, either by trying to guess the administrative region in which they live (probably not usually practical) or adding other qualifiers such as “near Gloucester” or whatever. I suspect this won’t work all that well, but it only has to be good enough. After all, I’m not going to be using the results to bomb anyone….
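
    The sort of heuristic I have in mind might look like this (a toy Python sketch with a crude flat-earth distance and two hand-picked “big places”; the real thing has to work over the whole GNS database):

```python
import math

def distance_km(a, b):
    """Rough distance in km between two (lat, lon) points (flat-earth)."""
    mid_lat = math.radians((a[0] + b[0]) / 2)
    dx = (a[1] - b[1]) * 111.32 * math.cos(mid_lat)
    dy = (a[0] - b[0]) * 111.32
    return math.hypot(dx, dy)

def qualify(places, big_places):
    """Add a "near X" qualifier to every non-unique place name.

    `places` maps name -> list of (lat, lon); `big_places` is a list of
    (name, (lat, lon)) of larger towns to use as qualifiers.
    """
    out = {}
    for name, locations in places.items():
        if len(locations) == 1:
            out[name] = [(name, locations[0])]
        else:
            out[name] = [
                ("%s (near %s)" % (
                    name,
                    min(big_places, key=lambda t: distance_km(t[1], loc))[0]),
                 loc)
                for loc in locations]
    return out

# Two Cambridges, qualified by their nearest larger town:
places = {"Cambridge": [(52.2, 0.12), (51.75, -2.36)]}
big = [("Ely", (52.4, 0.26)), ("Gloucester", (51.86, -2.24))]
print(qualify(places, big))
```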

  8. Changes are in the works

    Well, I’m back from my holiday, suitably sunburned and (relatively) relaxed. As Francis mentions, I was off in the Mediterranean somewhere (Majorca, specifically) suffering from miserable internet withdrawal symptoms. I did manage to get IRC up-and-running over dialup for election night, though this turned out to be surprisingly expensive. For once I was grateful to my iBook, which did actually Just Work when plugged into the wall.

    Anyway, today’s job is sorting out the new Scottish constituency boundaries. Scotland’s Parliament was dissolved in 1707 on the passing of the Act of Union, to be reconstituted in 1999. The quid pro quo for the Scots was enhanced representation in the House of Commons; Scottish constituencies had, in 1998, an average of 55,000 electors, compared to 69,000 in England. This anomaly has now been corrected, reducing the number of constituencies in Scotland from 72 to 59; all but three of the latter have different boundaries.

    This means updating MaPit, the component we built to map postcodes into electoral geography, to deal with the new boundaries. Ideally the way that we’d do this is to wait for Ordnance Survey to ship us, via our friends in ODPM, the new revision of their Boundary-Line (TM, apparently) product, with the outlines of the new constituencies encoded in attractive machine-readable form, and feed it to our existing import scripts. (As so often in life, it’s not quite that simple, but you get the general idea.) In an ideal world, this would also contain all the changed boundaries of the English counties and their constituent county electoral divisions.

    However, this is not an ideal world, and though there is a new revision of Boundary-Line in the works, it hasn’t come out yet, so we have to construct the point-to-constituency mapping in some other way. Happily, at this stage of the boundary revision process, the constituency boundaries are coterminous with ward boundaries, so it’s possible to just lift the definitions of the new constituencies from the relevant Statutory Instrument and fix up the constituencies from the ward boundaries, which haven’t changed. This, sadly, has occasioned a bit of a hack to our code, because we generally don’t assume that electoral geography is hierarchically defined — because it isn’t.
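
    In essence the hack composes two lookups, postcode to ward and ward to new constituency (a sketch with invented identifiers, nothing like the real MaPit schema):

```python
# Existing postcode -> ward mapping (unchanged by the boundary review),
# plus the new constituencies lifted from the Statutory Instrument as
# lists of wards. Every identifier here is invented for illustration.
postcode_to_ward = {
    "EH1 1AA": "ward_castle",
    "EH8 8AA": "ward_holyrood",
}
constituency_wards = {
    "Edinburgh East": ["ward_holyrood", "ward_craigmillar"],
    "Edinburgh North and Leith": ["ward_castle"],
}

# Invert to ward -> constituency, then chain the two lookups.
ward_to_constituency = {ward: constituency
                        for constituency, wards in constituency_wards.items()
                        for ward in wards}

def constituency_for(postcode):
    return ward_to_constituency[postcode_to_ward[postcode]]

print(constituency_for("EH8 8AA"))  # Edinburgh East
```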

    (I don’t feel too bad about committing this hack, actually, because we’re likely to chuck the whole MaPit database and reconstruct it later in the year from OS data. When we built it originally, we did so from data in ESRI shapefile format; unfortunately, OS stuffed up the process of generating this from their own, internal and quite bonkers, NTF format, so the various area ID numbers in the database are not unique and not expected to be stable. We’d rather like stable ID numbers, so that we can cope gracefully with revisions to geography while maintaining continuity of, for instance, statistical data about MPs, so next time round we’re going to work from the NTF instead.)

    Sadly this Scottish hack doesn’t get us anywhere with the new county boundaries, and OS have told us that not all of the updated counties will be included in the forthcoming Boundary-Line revision. So it’ll be back to the tedious conversion of statutory instruments into SQL at some point in the near future, except that we’ll probably have to start building things up from parishes, rather than wards. Expect more anguished posts on this in the future.

    Meanwhile, Francis and Tom are collecting names and contact details for the new MPs. Tom tells me that this intake looks much more tech-savvy than the last, which could be good news from our (and everyone else’s) point of view. Hopefully WriteToThem will be cranking back into action — as far as MPs go, at least — fairly soon.