Whereas new sites are lovely, and I talk about Neighbourhood Fix-It improvements further down, there’s still quite a bit of work that needs to go into making sure our current sites are always up-to-date, working, and full of the joys of spring. Here’s a bit of what I’ve been up to recently, whilst everyone else chats about database upgrades, server memory, and statistics.
The elections last week meant much of WriteToThem has had to be switched off until we can add the new election results – that means the following aren’t currently contactable: the Scottish Parliament; the Welsh Assembly; every English metropolitan borough, unitary authority, and district council (bar seven); and every Scottish council. The fact that the electoral geography has changed a lot in Wales means there will almost certainly be complicated shenanigans for us in the near future so that our postcode lookup continues to return the correct results as much as possible.
Talking of postcode lookups, I also noticed yesterday that some Northern Ireland postcodes were returning incorrect results, which was caused by some out-of-date entries left lying around in our MaPit postcode-to-area database. Soon purged, but that led me to spot that Gerry Adams had been deleted from our database! Odd, I thought, and tracked it down to the fact that our internal CSV file of MLAs had lost its header line, and so poor Mr Adams was heroically taking its place. He should be back now.
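A missing header line of that sort is cheap to guard against. Here is a minimal sketch in Python, not our actual import code: the column names are hypothetical, since the real layout of the MLA file isn't shown here. It refuses to load a file whose first row isn't the expected header, rather than quietly treating the first member as one.

```python
import csv
import io

# Hypothetical column names; the real MLA file's layout is not given in this post.
EXPECTED_HEADER = ["name", "constituency", "party"]

def read_members(text):
    """Parse CSV text, refusing input whose first row is not the expected header."""
    reader = csv.reader(io.StringIO(text))
    header = next(reader, None)
    if header != EXPECTED_HEADER:
        # Without this check the first data row is silently consumed as the header.
        raise ValueError(f"unexpected header row: {header!r}")
    return [dict(zip(EXPECTED_HEADER, row)) for row in reader]
```

With a check like this, a headerless file fails loudly at import time instead of dropping whoever happens to be on the first line.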
A Catalan news article about PledgeBank brought a couple of requests for new countries to be added to our list on PledgeBank. We’re sticking to the ISO 3166-1 list of country codes, but the requests led us to spot that Jersey, Guernsey and the Isle of Man had been given full entry status in that list and so needed to be added to our own. I’m hoping the interest will lead to a Catalan translation of the site; we should hopefully also have Chinese and Belarusian soon, which will be great.
Neighbourhood Fix-It update
New features are still being added to Neighbourhood Fix-It.
Questionnaires are now being sent out to people who create problems four weeks after their problem is sent to the council, asking them to check the status of their problem and thereby keep the site up-to-date. Adding the questionnaire functionality threw up a number of bugs elsewhere – the worst of which was that we would be sending email alerts to people whether their alert had been confirmed or not. Thankfully, there hadn’t yet been any such alert, phew.
Lastly, the Fix-It RSS feeds now have GeoRSS too, which means you can easily plot them on a Google map.
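As a rough illustration of what the GeoRSS addition gives you, here is a small Python sketch that pulls coordinates out of a feed. It assumes the feed uses the GeoRSS Simple encoding (a `georss:point` element holding "latitude longitude") inside ordinary RSS 2.0 items; if the feed uses a different encoding, the element name would need changing.

```python
import xml.etree.ElementTree as ET

# Namespace of GeoRSS Simple; assumed to be the encoding the feed uses.
GEORSS_NS = "{http://www.georss.org/georss}"

def item_points(feed_xml):
    """Return (title, latitude, longitude) for every item carrying a georss:point."""
    root = ET.fromstring(feed_xml)
    points = []
    for item in root.iter("item"):
        point = item.find(GEORSS_NS + "point")
        if point is None or not point.text:
            continue  # item has no location attached
        lat, lon = (float(value) for value in point.text.split())
        points.append((item.findtext("title", default=""), lat, lon))
    return points
```

The resulting list of points is all a mapping library needs to drop markers on a map.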
So, a silly post for today: Postcodeine. This is a British version of Ben Fry’s zipdecode, a “tool” for visualising the distribution of zipcodes in the United States. This is, as has been pointed out to me, wholly pointless, but it’s quite fun and writing it was an interesting exercise (it also taught me a little bit about AJAX, the web’s technology trend du jour). If you want the source code, it’s at the foot here; licence is the Affero GPL, as for all the other mySociety code.
(I should say, by the way, that I wrote this in my copious spare time. It’s copyright mySociety because I don’t have the right to use the postcode database myself.)
… or, “how near is ‘nearby’?”
On PledgeBank we offer search and local alert features which will tell users about pledges which have been set up near them, the idea being that if somebody’s organising a street party in the next street over, you might well want to hear about it, but if it’s somebody a thousand miles away, you probably don’t.
At the moment we do this by gathering location data from pledge creators (either using postcodes, or location names via Gaze), and comparing it to search / alert locations using a fixed distance threshold, presently 20km (or about 12 miles). This works moderately well, but leads to complaints from Londoners of the form “why have I been sent a pledge which is TEN MILES away from me?” The point is that, within London, people’s idea of how far away “nearby” things are is quite different from that of people who live in the countryside: they mean one tube stop, or a few minutes’ walk, or whatever. If you live in the countryside, “nearby” might be the nearest village or even the nearest town.
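The fixed-threshold test is simple enough to sketch. This is an illustration in Python rather than the PledgeBank code: it assumes a spherical Earth and the haversine formula for great-circle distance, which is good enough at a 20km scale, and the function names are made up for the example.

```python
import math

EARTH_RADIUS_KM = 6371.0  # mean Earth radius; a spherical approximation
NEARBY_KM = 20.0          # the fixed threshold described above

def distance_km(lat1, lon1, lat2, lon2):
    """Great-circle distance between two points given in decimal degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlam = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlam / 2) ** 2
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))

def is_nearby(lat1, lon1, lat2, lon2, threshold_km=NEARBY_KM):
    """True if the two points are within the threshold distance of each other."""
    return distance_km(lat1, lon1, lat2, lon2) <= threshold_km
```

The weakness is the single `NEARBY_KM` constant: the same number has to serve both central London and the Highlands.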
So, ages ago we decided that the solution to this was to find some population density data and use it to produce an estimate for what is “nearby”, defined as, “the radius around a point which contains at least N people”. That should capture the difference between rural areas and small and large towns.
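To make the definition concrete, here is a sketch in Python of one way to compute such a radius. It assumes the population data has already been reduced to a list of (distance from the point, population) pairs, one per grid cell; that input format, the function name and the cap on the radius are all assumptions made for the example.

```python
def nearby_radius_km(cells, n_people, max_radius_km=500.0):
    """Smallest distance at which the cumulative population reaches n_people.

    cells: iterable of (distance_km_from_point, population) pairs.
    Returns max_radius_km if n_people is never reached within that distance.
    """
    total = 0.0
    for distance, population in sorted(cells):
        if distance > max_radius_km:
            break
        total += population
        if total >= n_people:
            return distance
    return max_radius_km
```

In a dense city the running total reaches N after a few cells, giving a small radius; in the countryside the loop has to walk much further out before it gets there.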
(In fairness, the London issue could be solved by having the code understand north vs south of the river as a special case, and never showing North-Londoners pledges in South London. But that’s just nasty.)
Unfortunately the better solution requires finding population density data for the whole world, which is troublesome. There seem to be two widely-used datasets with global coverage: NASA SEDAC’s Gridded Population of the World, and Oak Ridge National Laboratory’s Landscan database. GPW is built from census data and information about the boundaries of each administrative unit for which the census data is recorded, and Landscan improves on this by using remote-sensing data such as the distribution of night-time lights, transport networks and so forth.
(Why, you might wonder, is Oak Ridge National Laboratory interested in such a thing? It is, apparently, “for estimating ambient populations at risk” from natural disasters and whatnot. That’s very worthy, but I can’t help but wonder whether the original motivation for this sort of work may have been a touch more sinister. But what do I know?)
Anyway, licence terms seem to mean that we can use the GPW data and we can’t use the Landscan data, which is a pity, since the GPW data is only really very good in its coverage of rich western countries which produce very detailed census returns on, e.g., a per-municipality basis. Where census returns are only available on the level of regions, the results are less good. Anyway, subject to that caveat, it seems to solve the problem. Here’s a map showing a selection of points, and the circles around them which contain about 200,000 people (that seems to be about the right value for N):
The API to access this will go into the Gaze interface, but it’s not live yet. I’ll document the RESTful API when it is.
One last note, which might be of use to people working with the GPW data in the future. GPW is a cell-based grid: each cell is a region lying between two lines of longitude and two lines of latitude, and within each cell three variables are defined: the population in the cell, the population density of the cell, and the land area of the cell. (This is one of those rare exceptions described in Alvy Ray Smith’s rant, A Pixel Is Not A Little Square….) But note that the land area is not the surface area of the cell, and the population density is not the population divided by the surface area of the cell!
This becomes important in the case of small islands; for instance (a case I hit debugging the code) the Scilly Isles. The quoted population density for the Scilly Isles is rather high: somewhere between 100 and 200 persons/km², but when integrating the population density to find the total population in an area, this is absolutely not the right value to use: the proper value there is the total population of a cell, divided by its total surface area. The reason for that is that when sampling from the grid to find the value of the integrand (the population density) you don’t know, a priori, whether the point you’re sampling at has landed on land or non-land, but the quoted population density assumes that you are always asking about the land. When integrating, the total population of each cell should be “smeared out” over the whole area of the cell. If you don’t do this then you will get very considerable overestimates of the population in regions which contain islands.
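A small Python sketch of the distinction, under the assumption of a spherical Earth. The function names are invented for the example, and the numbers in the usage note are made up to show the scale of the error; they are not real GPW values.

```python
import math

EARTH_RADIUS_KM = 6371.0  # mean Earth radius; a spherical approximation

def cell_surface_area_km2(lat_south, lat_north, lon_west, lon_east):
    """Total surface area (land and sea) of a latitude/longitude cell on a sphere."""
    band = math.sin(math.radians(lat_north)) - math.sin(math.radians(lat_south))
    return EARTH_RADIUS_KM ** 2 * math.radians(lon_east - lon_west) * band

def integration_density(cell_population, surface_area_km2):
    """Density to integrate: the cell's population smeared over its whole surface."""
    return cell_population / surface_area_km2
```

Suppose a cell holds 2,000 people on 16 km² of land, in a cell of 400 km² overall. The quoted density is 125 persons/km², and integrating that over the cell gives 50,000 people; `integration_density(2000, 400)` gives 5 persons/km², which integrates back to the correct 2,000.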
Things have been quiet here recently, but are now getting busy again. Tom’s back from America, Chris is back from holiday, I’m better after being ill for most of last week.
Earlier in the week we finally managed to load new county boundaries into MaPit. So WriteToThem once again has county councils working. Please try it out with your postcode. Let us know of any problems.
This required lots of work from Chris, because a new version of BoundaryLine (from Ordnance Survey) has not yet been released with the updated boundaries. He’s done it using lists of the district council wards which make up the county electoral divisions.
These lists were taken from the Statutory Instruments. This has covered most postcodes, but there are still some where the boundaries were specified in text (walk along this river etc.) rather than wards. And we don’t have those.
The last couple of days I’ve been turning on lots of things to automate updating of WriteToThem. A cron job now grabs new data on councillors from GovEval once a day, and merges their changes with any changes we’ve made.
It’s automatically emailing GovEval with user submitted corrections to councillor data (the “Have you spotted a mistake in the above list?” link on WriteToThem). Hopefully this will create a virtuous feedback loop of ever improving data quality goodness. Or at least let us keep up with council by-elections without having to lift a finger.
Finally I’ve made it send a mail once a week to the mailing list where WriteToThem admins (mostly volunteers) hang out. This describes what needs doing – such as missing contact details to gather, or messages in the queue which need human attention.
Next up, wiring up the new screenscrapers Richard and Jonathan contributed last week, so the Welsh and London Assemblies automatically update…
A very quick post to announce the launch of a public interface to our Gaze web gazetteer service. The motivation behind Gaze is collecting location information from users without using maps (a clunky approach with poor accessibility and licensing problems) or postcodes (which do not have universal coverage and have privacy issues as well as licensing problems). Instead the idea is to use place names to identify locations, even in the presence of ambiguity, alternate names, etc. We do this by providing a search service over a large gazetteer (2.2 million places and 3 million names), and supplying additional contextual information to disambiguate common place names. The API is very simple, with one major function and two other supporting ones.
Anyway, without further ado, here is the API. Internally we use one based on RABX, but we’ve done a special “RESTful” API for everyone else. All requests should be HTTP GETs; all parameters must be in UTF-8; and all responses are in UTF-8 plain text or comma-separated values. All calls should be passed to the URL,
selecting a particular function by specifying the HTTP parameter f, for instance
Available functions are:
- IPv4 address of a host, in dotted-quad format
Guess the country of location of a host from its IP address. The result of this call will be an ISO country code, followed by a line feed; or, if it was not possible to determine a country, a line feed on its own.
- No parameters.
Return the list of countries for which the find_places call has a gazetteer available. The list is returned as a list of ISO country codes followed by line feeds.
- ISO country code of country in which to search for places
- state in which to search for places; presently this is only meaningful for country=US (United States), in which case it should be a conventional two-letter state code (AZ, CA, NY etc.); optional
- query term input by the user; must be at least two characters long
- largest number of results to return, from 1 to 100 inclusive; optional; default 10
- minimum match score of returned results, from 1 to 100 inclusive; optional; default 0
Returns a list of places in CSV format (as defined by this internet draft), with a one-line header and the following fields:
- name of the place described by this row
- blank, or the name of an administrative region in which this place lies (for instance, a county)
- blank, or a list of nearby places, separated by commas
- WGS-84 latitude of place in decimal degrees, north-positive
- WGS-84 longitude of place in decimal degrees, east-positive
- blank, or containing state code for US
- match score for this place, from 0 to 100 inclusive
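For anyone consuming the place-search results, here is a hedged sketch in Python of parsing a response of that shape. The header names in `SAMPLE` are assumptions (the real header line is whatever the service returns), and the place names are invented. The point worth noticing is that the "nearby places" field itself contains commas, so a proper CSV parser is needed rather than a naive split.

```python
import csv
import io

# Hypothetical header names and invented sample rows, for illustration only.
SAMPLE = (
    "Name,In,Near,Latitude,Longitude,State,Score\n"
    '"Exampleton","Exampleshire","Otherton, Thirdham",52.0,0.5,,100\n'
    '"Exampleton Parva",,,52.1,0.6,,60\n'
)

def parse_places(text, min_score=0):
    """Turn a CSV response into a list of dicts, dropping rows below min_score."""
    places = []
    for row in csv.DictReader(io.StringIO(text)):
        score = int(row["Score"])
        if score < min_score:
            continue
        places.append({
            "name": row["Name"],
            "in": row["In"] or None,          # blank means no enclosing region given
            "near": [p.strip() for p in row["Near"].split(",")] if row["Near"] else [],
            "lat": float(row["Latitude"]),
            "lon": float(row["Longitude"]),
            "state": row["State"] or None,    # only expected to be set for US places
            "score": score,
        })
    return places
```

The enclosing region and nearby places are what you would show to a user to let them pick between several places of the same name.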
Enjoy! Questions and comments to firstname.lastname@example.org, please.
So, it’s back to electoral geography for me, this time to get the new county and county electoral-division boundaries live on WriteToThem. This is a prerequisite for getting mail to county councillors working again after the election on May 5th, so we’re already three months behind the times. But more generally, electoral boundaries are revised all the time to account for changes of population within each ward, constituency and so forth; and at most (local and national) elections some set of boundary changes takes effect. So to keep WriteToThem running we need to incorporate such updates routinely.
The way we handle electoral geography in general is to start with Ordnance Survey’s Boundary Line product, which, for each administrative or electoral area in Great Britain gives a polygon identifying that region. We then take a big list of all the postcodes in Britain (CodePoint) and figure out which polygons they lie in. Then when somebody comes along to WriteToThem and types in their postcode, we can figure out which ward, constituency etc. they are in, and tell them appropriate things about their representatives. (Technically this is a lie, of course, because postcodes represent regions, not points — we use the centroids of those regions — and each such region isn’t guaranteed to lie either wholly within or without all electoral and administrative regions. Unfortunately there isn’t a lot we can do about this beyond throwing our hands up and saying “oops, sorry”, so that’s what we do.)
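The "figure out which polygons they lie in" step can be sketched as follows. This is an illustration in Python using the standard ray-casting test, not our import code; it treats each area as a single simple polygon with no holes, and ignores the question of points lying exactly on a boundary.

```python
def point_in_polygon(x, y, polygon):
    """Ray-casting test: polygon is a list of (x, y) vertices, not repeated at the end."""
    inside = False
    n = len(polygon)
    for i in range(n):
        x1, y1 = polygon[i]
        x2, y2 = polygon[(i + 1) % n]
        # Does this edge straddle the horizontal line through the point?
        if (y1 > y) != (y2 > y):
            x_cross = x1 + (y - y1) * (x2 - x1) / (y2 - y1)
            if x < x_cross:
                inside = not inside  # a ray heading right crosses this edge
    return inside

def areas_for_point(x, y, areas):
    """Return the IDs of every area whose polygon contains the point."""
    return [area_id for area_id, polygon in areas.items()
            if point_in_polygon(x, y, polygon)]
```

Running every postcode centroid through `areas_for_point` once, and storing the answers, is what makes the lookup at request time a simple database query.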
As an aside, outside Great Britain (that is, in Northern Ireland) we don’t have the same sort of data, so instead we rely on another field in the CodePoint data which gives, for each postcode centroid, the ONS ward code for the ward in which that point lies. From that ward code you can find the enclosing local authority area, local electoral area (in Northern Ireland local councils are elected by STV over multimember regions, rather than by first-past-the-post as in Great Britain) and constituency. Happily it turns out that all of those other regions are composed of whole numbers of wards; this happy state of affairs does not necessarily prevail elsewhere.
Now, twice a year, a new edition of Boundary Line is issued, taking account of recent changes in electoral geography. Usually this happens in May and October, though the schedule has been known to slip. In principle this should be easy to deal with: load up the new copy of Boundary Line, pass all the postcodes through it, and hey presto.
Life, of course, is rarely that simple, and this isn’t one of those occasions. When the boundaries of a region don’t change between one year and the next, we don’t want to make any alteration to that region in our database (which uses ID numbers to identify each area). More specifically, when a new revision of Boundary Line comes along, we want to ensure that — let’s say — Cambridge Constituency in the new revision is identified with Cambridge Constituency in the old version. Now, in principle, this should be easy, because each area in the data set, in the words of the manual,
… carries a unique identifier AI; this is the same identifier that was supplied in the previous specification of Boundary-Line. The same AI attribute is associated with every component polygon forming part of an administrative unit, irrespective of the number of polygons.
Now, the first time that we did this, we worked from a copy of the Boundary Line data supplied in the form of “ShapeFiles” (a format used in various proprietary GIS systems, and with which our local government partners were able to supply us without having to order it specially from Ordnance Survey). Unfortunately in the ShapeFile version, the allegedly unique administrative area IDs were, in fact, not unique. After discussion with Ordnance Survey it was concluded that this was a problem which affected the translation of the data from NTF (“National Transfer Format”, their own preferred format) into ShapeFile; and that the problem would be fixed in the next release.
So, taking no chances, we decided we’d work from the NTF format in future, since that seems to be closer to the authoritative source of the data, and anyway the ShapeFile format isn’t at all well-documented (for instance, many of the field names for the metadata about each area differ from those described in the manual for Boundary Line). So I’ve written code to parse the (slightly bonkers, natch) NTF files and modified our import scripts to use this code, with a view to then being able to keep up-to-date with future boundary revisions without too much trouble.
You will not be surprised, therefore, to hear that this has not worked out exactly as planned. Unfortunately it appears that the May 2005 NTF release of Boundary Line suffers exactly the same problems of non-uniqueness as did the previous ShapeFile release. So unless some cleverer solution presents itself, I’ll have to revive the hack we intended to use with the ShapeFile data — try to construct unique IDs for areas from their geometry, and hope that the exact coordinates of the polygon vertices for unchanged areas do not change between revisions. We shall see. But right now I’m mostly worrying about why my parser script runs out of memory on my 1GB computer after reading a couple of hundred megs of input data.
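For what it's worth, the geometry hack might look something like this. It is a sketch in Python under stated assumptions, not the code we will actually run: it assumes each area is a list of non-empty polygons with vertices as (easting, northing) tuples, that unchanged areas keep exactly the same coordinates and winding direction, and the function name is made up.

```python
import hashlib

def geometry_id(polygons):
    """Derive a stable ID for an area from the exact vertices of its polygons."""
    normalised = []
    for ring in polygons:
        ring = list(ring)
        if len(ring) > 1 and ring[0] == ring[-1]:
            ring = ring[:-1]  # drop the repeated closing vertex
        # Rotate so the ring starts at its smallest vertex, so that the
        # choice of starting point does not affect the ID.
        start = ring.index(min(ring))
        normalised.append(ring[start:] + ring[:start])
    normalised.sort()  # the order of component polygons should not matter either
    text = ";".join(",".join(f"{x!r} {y!r}" for x, y in ring) for ring in normalised)
    return hashlib.sha1(text.encode("ascii")).hexdigest()
```

Any change to a single vertex produces a different ID, which is exactly the fragility mentioned above: if the coordinates of unchanged areas are re-digitised between revisions, the scheme falls over.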
So, it’s my turn to write something here again. Ho-hum. Anyway, lately I’ve been working on adding the geographical features to PledgeBank which everyone thinks are there already: specifically, finding pledges which are nearby. To start with, we’re doing this for the UK only (because we already have the infrastructure to do postcode-to-coordinates lookups through MaPit), but the intention is to do this for the whole world as soon as we can, either by having users select their location through a gazetteer, or, where we can get the data, by using a similar postcode/zip-code/whatever-to-coordinates lookup. So that means we have to deal with places which might be anywhere on earth, which means dealing with latitudes and longitudes. And as anyone who’s dealt with this stuff knows, it’s very tedious to get this right. I’m afraid I’ve now spent too long reading about datum ellipsoids and Helmert Transforms to want to spend any time at all talking about them, so in the unlikely event that you’re interested, you’ll just have to read the (relevant parts of) code.
Oh, and if anybody has good suggestions for a hierarchical gazetteer with world-wide coverage, I’d love to hear them. Similarly, does anybody know if this one is any good?
So, it’s my turn to write something here. Well, this’ll be short. As Francis mentions below, PledgeBank launched today, and everything’s gone reasonably smoothly, with the exception of some tedious PHP bug we haven’t tracked down yet. With any luck the new version of PHP will fix it, so there won’t be hours of painful debugging to do. (I could bore you for hours with my opinions of PHP — actually, I could probably shorten that quite a lot if I were allowed to swear, but this is a family-friendly ‘blog — but let’s just say that fixing PHP problems Is Not My Favourite Job.)
Instead I’ll say what I’m doing right now, which is beginning to add geographical lookup to PledgeBank. At the moment we ask pledge authors for a country (though it can either be “UK” or “global” at the moment), and, if they’re in the UK, a postcode. The idea is to “georeference” (i.e. look up the coordinates of) the postcode, though we don’t actually do that yet. So I’m modifying the database a bit to store coordinates (as a latitude/longitude, so that we don’t have to write a separate case for every wacky national coordinate grid) and generalise the notion of “country” so that we can let non-Brits actually put in their own countries when they create pledges.
Other things we’ve discovered today:
- People are confused by the “(suspicious signer?)” link next to signatures on each pledge page — several people thought that we were reporting our suspicions of the signer. You probably think that’s stupid, but if so that’s only because you’re familiar with sites that have this sort of retroactive moderation button everywhere. Actually it’s us that’s being stupid and we’re going to remove it until we have a better way to implement it — at the moment we think we’re mostly on top of the occasional joke/abusive signature.
- People are confused by the pledge signature confirmation mail, which currently reads (for instance),
Please click on the link below to confirm your signature on the pledge at the bottom of this email.
The pledge reads:
‘Phil Booth will refuse to register for an ID card and will donate £10 to a legal defence fund but only if 10,000 other people will also make this same pledge.’
— the PledgeBank.com team
We got several emails from people saying “your site has got my name wrong — I’m not Phil Booth”. The point is that Phil Booth wrote the pledge, so it’s in his name; the email reflects that. But that’s not obvious to the signer, and since the only name in the body of the email isn’t theirs, they think it’s got it wrong and complain. (To be fair, only three out of ~1,100 did, but that’s still bad.) This is the sort of problem we need user testing to spot: none of us saw anything wrong with the text when we were testing it. So we need to reword that.
- We’ve had several people email to say that they’d like to do versions of PledgeBank in their own countries, and we’d like to hear from anyone interested in localising the mySociety projects who has time, expertise or even just opinions to donate. If that’s you, please get in touch!
And probably some other stuff, but I said this post would be short….
Well, I’m back from my holiday, suitably sunburned and (relatively) relaxed. As Francis mentions, I was off in the Mediterranean somewhere (Majorca, specifically) suffering from miserable internet withdrawal symptoms. I did manage to get IRC up-and-running over dialup for election night, though this turned out to be surprisingly expensive. For once I was grateful to my iBook, which did actually Just Work when plugged into the wall.
Anyway, today’s job is sorting out the new Scottish constituency boundaries. Scotland’s Parliament was dissolved in 1707 on the passing of the Act of Union, to be reconstituted in 1999. The quid pro quo for the Scots was enhanced representation in the House of Commons; Scottish constituencies had, in 1998, an average of 55,000 electors, compared to 69,000 in England. This anomaly has now been corrected, reducing the number of constituencies in Scotland from 72 to 59; all but three of the latter have different boundaries.
This means updating MaPit, the component we built to map postcodes into electoral geography, to deal with the new boundaries. Ideally the way that we’d do this is to wait for Ordnance Survey to ship us, via our friends in ODPM, the new revision of their Boundary-Line (TM, apparently) product, with the outlines of the new constituencies encoded in attractive machine-readable form, and feed it to our existing import scripts. (As so often in life, it’s not quite that simple, but you get the general idea.) In an ideal world, this would also contain all the changed boundaries of the English counties and their constituent county electoral divisions.
However, this is not an ideal world, and though there is a new revision of Boundary-Line in the works, it hasn’t come out yet, so we have to construct the point-to-constituency mapping in some other way. Happily, at this stage of the boundary revision process, the constituency boundaries are coterminous with ward boundaries, so it’s possible to just lift the definitions of the new constituencies from the relevant Statutory Instrument and fix up the constituencies from the ward boundaries, which haven’t changed. This, sadly, has occasioned a bit of a hack to our code, because we generally don’t assume that electoral geography is hierarchically defined — because it isn’t.
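The ward-based hack amounts to inverting a table. Here is a minimal Python sketch, with invented ward and constituency names and an obviously fake postcode; the real definitions come from the Statutory Instrument and the real code talks to the MaPit database rather than to dictionaries.

```python
def ward_to_constituency(definitions):
    """Invert {constituency: [ward, ...]} into {ward: constituency}.

    Raises ValueError if a ward is claimed by two constituencies, which
    would mean the definitions had been transcribed wrongly.
    """
    lookup = {}
    for constituency, wards in definitions.items():
        for ward in wards:
            if ward in lookup:
                raise ValueError(f"{ward!r} listed under both "
                                 f"{lookup[ward]!r} and {constituency!r}")
            lookup[ward] = constituency
    return lookup

def constituency_for_postcode(postcode, postcode_to_ward, lookup):
    """Find a postcode's constituency via the ward its centroid falls in."""
    return lookup[postcode_to_ward[postcode]]
```

This only works because, at this stage, every new constituency is a whole number of unchanged wards; it is not a general model of electoral geography.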
(I don’t feel too bad about committing this hack, actually, because we’re likely to chuck the whole MaPit database and reconstruct it later in the year from OS data. When we built it originally, we did so from data in ESRI shapefile format; unfortunately, OS stuffed up the process of generating this from their own, internal and quite bonkers, NTF format, so the various area ID numbers in the database are not unique and not expected to be stable. We’d rather like stable ID numbers, so that we can cope gracefully with revisions to geography while maintaining continuity of, for instance, statistical data about MPs, so next time round we’re going to work from the NTF instead.)
Sadly this Scottish hack doesn’t get us anywhere with the new county boundaries, and OS have told us that not all of the updated counties will be included in the forthcoming Boundary-Line revision. So it’ll be back to the tedious conversion of statutory instruments into SQL at some point in the near future, except that we’ll probably have to start building things up from parishes, rather than wards. Expect more anguished posts on this in the future.
Meanwhile, Francis and Tom are collecting names and contact details for the new MPs. Tom tells me that this intake looks much more tech-savvy than the last, which could be good news from our (and everyone else’s) point of view. Hopefully WriteToThem will be cranking back into action — as far as MPs go, at least — fairly soon.
We’ve just launched a testing version of FaxYourRepresentative. This is not a working site and not even a beta – because you cannot email representatives at the moment. What you can do, though, is practise sending messages – they’ll just be routed back to your own inbox so you can see that they’ve gone through.