Electoral geography again

So, it’s back to electoral geography for me, this time to get the new county and county electoral-division boundaries live on WriteToThem. This is a prerequisite for getting mail to county councillors working again after the election on May 5th, so we’re already three months behind the times. But more generally, electoral boundaries are revised all the time to account for changes of population within each ward, constituency and so forth; and at most (local and national) elections some set of boundary changes takes effect. So to keep WriteToThem running we need to incorporate such updates routinely.

The way we handle electoral geography in general is to start with Ordnance Survey’s Boundary Line product, which, for each administrative or electoral area in Great Britain gives a polygon identifying that region. We then take a big list of all the postcodes in Britain (CodePoint) and figure out which polygons they lie in. Then when somebody comes along to WriteToThem and types in their postcode, we can figure out which ward, constituency etc. they are in, and tell them appropriate things about their representatives. (Technically this is a lie, of course, because postcodes represent regions, not points — we use the centroids of those regions — and each such region isn’t guaranteed to lie either wholly within or without all electoral and administrative regions. Unfortunately there isn’t a lot we can do about this beyond throwing our hands up and saying “oops, sorry”, so that’s what we do.)

As an aside, outside Great Britain — that is, in Northern Ireland, we don’t have the same sort of data so instead we rely on another field in the CodePoint data which gives, for each postcode centroid, the ONS ward code for the ward in which that point lies. From that ward code you can find the enclosing local authority area, local electoral area — in Northern Ireland local councils are elected by STV over multimember regions, rather than by first-past-the-post as in Great Britain — and constituency. Happily it turns out that all of those other regions are composed of whole numbers of wards; this happy state of affairs does not necessarily prevail elsewhere.

Now, twice a year, a new edition of Boundary Line is issued, taking account of recent changes in electoral geography. Usually this happens in May and October, though the schedule has been known to slip. In principle this should be easy to deal with: load up the new copy of Boundary Line, pass all the postcodes through it, and hey presto.

Life, of course, is rarely that simple, and this isn’t one of those occasions. When the boundaries of a region don’t change between one year and the next, we don’t want to make any alteration to that region in our database (which uses ID numbers to identify each area). More specifically, when a new revision of Boundary Line comes along, we want to ensure that — let’s say — Cambridge Constituency in the new revision is identified with Cambridge Constituency in the old version. Now, in principle, this should be easy, because each area in the data set, in the words of the manual,

… carries a unique identifier AI; this is the same identifier that was supplied in the previous specification of Boundary- Line. The same AI attribute is associated with every component polygon forming part of an administrative unit, irrespective of the number of polygons.

Now, the first time that we did this, we worked from a copy of the Boundary Line data supplied in the form of “ShapeFiles” (a format used in various proprietary GIS systems, and with which our local government partners were able to supply us without having to order it specially from Ordnance Survey). Unfortunately in the ShapeFile version, the allegedly unique administrative area IDs were, in fact, not unique. After discussion with Ordnance Survey it was concluded that this was a problem which affected the translation of the data from NTF (“National Transfer Format”, their own preferred format) into ShapeFile; and that the problem would be fixed in the next release.

So, taking no chances, we decided we’d work from the NTF format in future, since that seems to be closer to the authoritative source of the data, and anyway the ShapeFile format isn’t at all well-documented (for instance, many of the field names for the metadata about each area differ from those described in the manual for Boundary Line). So I’ve written code to parse the (slightly bonkers, natch) NTF files and modified our import scripts to use this code, with a view to then being able to keep up-to-date with future boundary revisions without too much trouble.

You will not be surprised, therefore, to hear that this has not worked out exactly as planned. Unfortunately it appears that the May 2005 NTF release of Boundary Line suffers exactly the same problems of non-uniqueness as did the previous ShapeFile release. So unless some cleverer solution presents itself, I’ll have to revive the hack we intended to use with the ShapeFile data — try to construct unique IDs for areas from their geometry, and hope that the exact coordinates of the polygon vertices for unchanged areas do not change between revisions. We shall see. But right now I’m mostly worrying about why my parser script runs out of memory on my 1GB computer after reading a couple of hundred megs of input data.