  1. Under the bonnet

    Partly for our own internal documentation, and partly because it might be of interest to (some) readers, some notes on how the Number 10 petitions site works. On the face of it you’d imagine this would be very simple, but as usual it’s complicated by performance requirements (our design was motivated by the possibility that a large NGO might mail a very large mailing list inviting each member to sign a particular petition, so that we might have to sustain a very high signup rate for an extended period of time). Here’s a picture of the overall architecture:

    Diagram representing petitions site architecture

    (This style of illustration is unaccountably popular in the IT industry but unlike most examples of the genre, I’ve tried to arrange that this one actually contains some useful information. In particular I’ve tried to mark the direction of flow of information, and separate out the various protocols; as usual there are too many of the latter. The diagram is actually a slight lie because it misses out yet another layer of IPC—between the web server, apache, and the front-end FastCGI scripts.)

    Viewing petition pages is pretty conventional. Incoming HTTP requests reach a front-end cache (an instance of squid, one per web server, caching in memory only); squid passes them to the petition browsing scripts (written in perl, running under FastCGI) to display petition information. Those scripts consult the database for the current state of the relevant petition and pass it back to the proxy, and thence to the web client. This aspect of the site is not very challenging.

    Signing petitions is harder. The necessary steps are:

    • write a database record about the pending signature;
    • send the user an email containing a unique link to confirm their signature;
    • update the database record when the user clicks the link;
    • commit everything to stable storage; and finally
    • tell the user that their signature has been entered and confirmed.
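
    (To make the second and third steps concrete, here is a minimal sketch of a generic confirmation-token scheme in Python; the schema and URL are invented for illustration, and are not the site's own.)

    import hashlib, os, sqlite3

    db = sqlite3.connect("petitions.db")    # hypothetical database
    db.execute("""CREATE TABLE IF NOT EXISTS signature
                  (petition TEXT, email TEXT, token TEXT,
                   confirmed INTEGER, sent INTEGER)""")

    def record_pending(petition, email):
        # Step one: write the pending row; the unique token makes the link.
        token = hashlib.sha256(os.urandom(32)).hexdigest()
        db.execute("INSERT INTO signature VALUES (?, ?, ?, 0, NULL)",
                   (petition, email, token))
        db.commit()
        return "http://petitions.example.org/confirm/" + token

    def confirm(token):
        # Step three: the user clicked the link; mark the row confirmed.
        cur = db.execute("UPDATE signature SET confirmed = 1 WHERE token = ?",
                         (token,))
        db.commit()
        return cur.rowcount == 1            # False if the token was unknown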

    The conventional design for this would be to have the web script, when it processes the HTTP request for a new signature, submit a message for sending by a local mail server and write a row into the database and commit it, forcing the data out to disk. The mail server would then write the message into its spool directory and fsync it, forcing it out to disk too. Later it would pick the mail out of its queue and send it to a remote server, at which point it would be erased from the queue. Eventually the mail would arrive in the user’s inbox, at which point they would (presumably) click the link, resulting in another HTTP request which causes the web script to update the corresponding database row and commit the result. While this is admirably modular, it requires far more disk writes than are necessary to actually complete the task, which limits its potential speed. (In particular, there’s no reason to have a separate MTA spool directory, or for the MTA to make its own writes to that directory.)

    At times of high load, it is also extremely inefficient to do one commit per signature. It takes about as long to commit ten new or changed rows to the database as it does to commit one (because the time spent is determined by the disk seek time). Therefore, to achieve high performance, it is necessary to batch signatures. Unfortunately this is a real pain to implement, because all the common web programming models use one process per concurrent request, and it is inconvenient to share database handles between different processes. The correct answer to this problem would of course be to write the signup web script as a single-process multiplexing implementation, but that’s a bit painful (we’d have had to implement our own FastCGI wire protocol library, or alternatively an HTTP server) and the deadlines were fairly tight. So instead we have a single-process server, petsignupd, which accepts signup and confirmation requests from the front-end web scripts over a simple UDP protocol, and passes them to the database in batches every quarter of a second. In theory, therefore, users should see a maximum latency of a bit over 0.25s, but we should achieve close to the theoretical best throughput of incoming requests. (Benchmarking more-or-less bears this out.)
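
    (Purely as an illustration of the batching pattern, and emphatically not petsignupd itself, here is a sketch in Python of a single-process server that accepts requests over UDP and commits them in one transaction every quarter of a second; the wire format, port and table are all invented.)

    import select, socket, sqlite3, time

    BATCH_INTERVAL = 0.25   # seconds between database commits

    db = sqlite3.connect("petitions.db")            # hypothetical database
    db.execute("CREATE TABLE IF NOT EXISTS signer (petition TEXT, email TEXT)")

    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    sock.bind(("127.0.0.1", 9999))                  # port chosen arbitrarily

    pending = []                                    # requests since the last commit
    deadline = time.monotonic() + BATCH_INTERVAL

    while True:
        timeout = max(0.0, deadline - time.monotonic())
        readable, _, _ = select.select([sock], [], [], timeout)
        if readable:
            datagram, sender = sock.recvfrom(1024)
            # Invented wire format: "petition-ref email".
            petition, email = datagram.decode().split(" ", 1)
            pending.append((petition, email, sender))
        if time.monotonic() >= deadline:
            if pending:
                db.executemany("INSERT INTO signer (petition, email) VALUES (?, ?)",
                               [(p, e) for p, e, _ in pending])
                db.commit()                         # one disk sync for the whole batch
                for _, _, sender in pending:        # now tell each web script it's safe
                    sock.sendto(b"OK", sender)
                pending = []
            deadline = time.monotonic() + BATCH_INTERVAL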

    Sending the corresponding email is also a bit problematic. General-purpose MTAs are not optimised for this sort of situation, and (for instance) exim can’t keep up with the sustained signup rate we were aiming for even if you put all of its spool directories on a RAM disk and accept that you have to repopulate its outgoing queue in the event of a crash. The solution was to write petemaild, a small multiplexed SMTP sending server; unlike a general-purpose MTA this manages its queue in memory and communicates updates directly to the database (when a confirmation email is delivered or delivery fails permanently).
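
    (Again a sketch rather than the real thing: petemaild multiplexes many SMTP sessions at once, but the essential idea, an in-memory queue whose outcomes are written straight back to the database, looks roughly like this, reusing the invented schema from the sketch above.)

    import smtplib, sqlite3
    from email.message import EmailMessage

    db = sqlite3.connect("petitions.db")    # same hypothetical schema as above

    unsent = list(db.execute("SELECT rowid, email, token FROM signature"
                             " WHERE confirmed = 0 AND sent IS NULL"))

    for rowid, rcpt, token in unsent:
        msg = EmailMessage()
        msg["From"] = "petitions@example.org"           # placeholder sender
        msg["To"] = rcpt
        msg["Subject"] = "Please confirm your signature"
        msg.set_content("Click to confirm: "
                        "http://petitions.example.org/confirm/" + token)
        try:
            with smtplib.SMTP("localhost") as smtp:     # assumes a local smarthost
                smtp.send_message(msg)
            db.execute("UPDATE signature SET sent = 1 WHERE rowid = ?", (rowid,))
        except smtplib.SMTPRecipientsRefused:
            # Permanent failure: record it in the database, don't retry forever.
            db.execute("UPDATE signature SET sent = -1 WHERE rowid = ?", (rowid,))
        db.commit()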

    It’s unfortunate that such a complex system is required to fulfil such a simple requirement. If we’d been prepared to write the whole thing ourselves, from processing HTTP requests down to writing signatures out to files on disk, the picture above would look much simpler (and there would be fewer IPC boundaries at which things could go wrong). On the other hand the code itself would be a lot more complex, and there’d be a lot more of it. I don’t think I’d describe this design as a “reasonable” compromise, but it’s at least an adequate one.

  2. Ordnance Survey mashups day

    Last Friday Matthew and I went to the Ordnance Survey’s UK Geospatial Mash-up day. And a splendid time was had by all. Really this post is just a placeholder for a link to a copy of my presentation slides (not quite what I delivered, I’m afraid), but if anyone was there and has any questions they weren’t able to put to us in person, or wasn’t there and wished they had been, then the comments section awaits….

  3. How (not) to survive a Slashdotting

    So, PledgeBank got Slashdotted a couple of weeks ago when Mike Liveright’s $100 laptop pledge was linked from a post about the laptop. We didn’t cope very well.

    Unfortunately, PledgeBank is a pretty slow site. Generating the individual pledge page (done by mysociety/pb/web/ref-index.php) can take anything up to 150ms. That’s astonishingly slow, given the speed of a modern computer. What takes the time?

    It’s quite hard to benchmark pages on a running web server, but one approach that I’ve found useful in the past is to use an analogue of phase-sensitive detection. Conveniently enough, all the different components of the site — the webserver, the database and the PHP process — run as different users, so you can easily count up the CPU time being used by the different components during an interval. To benchmark a page, then, request it a few times and compute the amount of CPU time used during those requests. Then sleep for the same amount of time, and compute the amount of CPU time used by the various processes while you were sleeping. The difference between the values is an estimate of the amount of CPU time taken servicing your requests; by repeating this, a more accurate estimate can be obtained. Here are the results after a few hundred requests to http://www.pledgebank.com/100laptop, expressed as CPU time per request in ms:

    Subsystem      User CPU (ms)   System CPU (ms)
    apache         ~0              ~0
    PostgreSQL     55±9            6±4
    PHP            83±8            4±4

    (The code to do the measurements — Linux-specific, I’m afraid — is in mysociety/bin/psdbench.)
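
    (For readers without access to that script, here is a rough Python sketch of the same technique, to be run on the web server itself; the /proc parsing is Linux-specific, and the request count and URL are just examples. Unlike the table above, it lumps user and system time together.)

    import os, pwd, time, urllib.request

    TICK = os.sysconf("SC_CLK_TCK")         # clock ticks per second

    def cpu_seconds_by_uid():
        # Sum utime + stime (fields 14 and 15 of /proc/PID/stat) per UID.
        totals = {}
        for pid in filter(str.isdigit, os.listdir("/proc")):
            try:
                uid = os.stat("/proc/" + pid).st_uid
                fields = open("/proc/%s/stat" % pid).read().rsplit(")", 1)[1].split()
                ticks = int(fields[11]) + int(fields[12])   # utime, stime
                totals[uid] = totals.get(uid, 0.0) + ticks / TICK
            except OSError:
                pass                        # process exited while we looked
        return totals

    N = 100                                 # example request count
    before = cpu_seconds_by_uid()
    t0 = time.time()
    for _ in range(N):
        urllib.request.urlopen("http://www.pledgebank.com/100laptop").read()
    elapsed = time.time() - t0
    during = cpu_seconds_by_uid()

    time.sleep(elapsed)                     # now measure background activity
    after = cpu_seconds_by_uid()

    for uid, total in sorted(during.items()):
        busy = total - before.get(uid, 0.0)
        idle = after.get(uid, 0.0) - total
        if busy > 0:
            try:
                name = pwd.getpwuid(uid).pw_name
            except KeyError:
                name = str(uid)
            print("%-12s %6.1f ms/request" % (name, 1000.0 * (busy - idle) / N))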

    So that’s pretty shocking. Obviously if you spend 150ms of CPU time on generating a page, then the maximum rate at which you can serve users is 1,000 / 150 ≈ 7 requests/second per CPU, which is pretty miserable given that Slashdot can relatively easily drive 50 requests/second. But the really astonishing thing about these numbers is the ~83ms spent in the PHP interpreter. What’s it doing?

    The answer, it turns out, is… parsing PHP code! Benchmarking a page which consists only of this:

    <?
    /* ... */
    
    require_once '../conf/general';
    require_once '../../phplib/db.php';
    require_once '../../phplib/conditional.php';
    require_once '../phplib/pb.php';
    require_once '../phplib/fns.php';
    require_once '../phplib/pledge.php';
    require_once '../phplib/comments.php';
    require_once '../../phplib/utility.php';
    
    exit;
    ?>

    reveals that simply parsing the libraries we include in the page takes about 35ms per page view! PHP, of course, doesn’t parse the code once and then run the bytecode in a virtual machine for each page request, because that would be too much like a real programming language (and would also cut into Zend’s market for its “accelerator” product, which is just an implementation of this obvious idea for PHP).

    So this is bad news. The neatest approach to fixing this kind of performance problem is to stick a web cache like squid in front of the main web site; since the pledge page changes only when a user signs the pledge, or a new comment is posted, events which typically don’t occur anywhere near as frequently as the page is viewed, most hits ought to be servable from the cache, which can be done very quickly indeed. But it’s no good to allow the pledge page to just sit in cache for some fixed period of time (because that would be confusing to users who’ve just signed the pledge or written a comment, an effect familiar to readers of the countless “Movable Type” web logs which are adorned with warnings like, “Your comment may take a few seconds to appear — please don’t submit twice”). So to do this properly we have to modify the pledge page to handle a conditional GET (with an If-Modified-Since: or If-None-Match: header) and quickly return a “304 Not Modified” response to the cache if the page hasn’t changed. Unfortunately if PHP is going to take 35ms to process such a request (ignoring any time in the database), that still means only 20 to 30 requests/second, which is better but still not great.

    (For comparison, a mockup of a perl program to process conditional GETs for the pledge page can serve each one in about 3ms, which isn’t much more than the database queries it uses take on their own. Basically that’s because the perl interpreter only has to parse the code once, and then it runs in a loop accepting and processing requests on its own.)

    However, since we unfortunately don’t have time to rewrite the performance-critical bits of PledgeBank in a real language, the best we can do is to try to cut the number of lines of library code that the site has to parse on each page view. That’s reduced the optimal case for the pledge page — where the pledge has not changed — to this:

    <?
    /* ... */
    
    require_once '../conf/general';
    require_once '../../phplib/conditional.php';
    require_once '../../phplib/db.php';
    
    /* Short-circuit the conditional GET as soon as possible -- parsing the rest of
     * the includes is costly. */
    if (array_key_exists('ref', $_GET)
        && ($id = db_getOne('select id from pledges where ref = ?', $_GET['ref']))
        && cond_maybe_respond(intval(db_getOne('select extract(epoch from pledge_last_change_time(?))', $id))))
        exit();
    
    /* ... */
    ?>

    — that, and a rewrite of our database library so that it didn’t use the gigantic and buggy PEAR one, has got us up to somewhere between 60 and 100 reqs/sec, which while not great is enough that we should be able to cope with another similar Slashdotting.

    For other pages where interactivity isn’t so important, life is much easier: we can just emit a “Cache-Control: max-age=…” header, which tells squid that it can re-use that copy of the page for however long we specify. That means squid can serve such a page (the front page, for instance) at about 350 reqs/sec; unfortunately the front page isn’t all that important (most users come to PledgeBank for a specific pledge).

    There’s a subtlety to using squid in this kind of (“accelerator”) application which I hadn’t really thought about before. What page you get for a particular URL on PledgeBank (as on lots of other sites) varies based on the content of various headers sent by the user, such as cookies, preferred languages, etc.; for instance, if you have a login cookie, you’ll see a “log out” link which isn’t there if you’re an anonymous user. HTTP is set up to handle this kind of situation through the Vary: header, which the server sends to tell clients and proxies on which headers in the request the content of the response depends. So, if you have login cookies, you should say “Vary: Cookie”, and if you do content-negotiation for different languages, “Vary: Accept-Language” or whatever.

    PledgeBank has another problem. If the user doesn’t have a cookie saying which country they want to see pledges for, the site tries to guess, based on their IP address. Obviously that makes almost all PledgeBank pages potentially uncachable — the Vary: mechanism can’t express this dependency. That’s not a lot of help when your site gets featured on Slashdot!

    The (desperately ugly) solution? Patch squid to invent a header in each client request, X-GeoIP-Country:, which says which country the client’s IP address maps to, and then name that header in the Vary: header of the outgoing pledge pages. It’s horrid, but it seems to work.
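
    (The application’s half of that bargain is tiny; a sketch in Python rather than the site’s actual PHP, assuming the patched squid really does inject the header:)

    def app(environ, start_response):
        # The patched squid maps the client IP to a country and injects
        # X-GeoIP-Country before the request reaches us.
        country = environ.get("HTTP_X_GEOIP_COUNTRY", "GB")
        body = ("Pledges for country: %s\n" % country).encode()
        start_response("200 OK", [
            ("Content-Type", "text/plain"),
            # Name the invented header in Vary: so the cache keeps one
            # copy of the page per country (and per cookie set).
            ("Vary", "Cookie, X-GeoIP-Country"),
        ])
        return [body]

    if __name__ == "__main__":
        from wsgiref.simple_server import make_server
        make_server("127.0.0.1", 8000, app).serve_forever()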

  4. Travel-time Maps and their Uses

    This project became Mapumental. Please visit www.mapumental.com for details of our transit-time maps and the services we can offer.

    The work was funded and supported by the Department for Transport.

    (See also the separate post on methods.)

    Introduction

    Transport maps and timetables help people work out how to get from A to B using buses, trains and other forms of public transport. But what if you don’t yet know what journey you want to make? How can maps help then?

    This may seem a strange question to ask, but it is one we all face in several situations:

    • Where would I like to work?
    • Where would I like to live?
    • Where would I like to go on holiday?

    These are much more complicated questions than those about individual journeys, but one thing they all have in common is transport: can I get to and from the places I’m considering quickly and easily?

    The maps on this page show one way of answering that question. Using colours and contour lines they show how long it takes to travel between one particular place and every other place in the area, using public transport. They also show the areas from which no such journey is possible, because the services are not good enough.

    Examples

    Rail travel in Great Britain

    Our first example is about rail travel. The map below shows how long it takes to get from Cambridge Station to every other station in Great Britain, starting at seven o’clock on a weekday morning. This could be useful if you lived in Cambridge and were wondering where you might go for a long weekend away, not wanting to travel for more than four hours. We assume that people will take a taxi from a convenient train station to their destination, so long as that journey is no more than an hour. (Please see our methods page for a more detailed description of our assumptions.) You can click on any of the maps on this page to see a larger version.

    Map showing travel times by rail and taxi from Cambridge to other points in Great Britain, starting at 7 o'clock on a weekday morning

    © Crown copyright. All rights reserved. Department for Transport 100020237 2006

    The white contour lines are drawn at one-hour intervals, so the innermost, almost circular contour shows destinations up to an hour from Cambridge, the next concentric one two hours, and so forth. (Technically, the contours are “isochrones”, lines of constant time, just as the “isobars” on a weather map are lines of constant atmospheric pressure.) Places, such as Leeds, which are surrounded by little circular contours are more-accessible “islands” served by fast trains with infrequent stops; they are therefore quicker to reach than those in the surrounding areas, which require lengthy changes or journeys on slower services.

    The colour scale, shown on the top right of the map, is in hours of total travel time. Warm colours indicate short travel times—red for four hours or less, orange and yellow for four to eight hours—and cool colours longer journeys. The longest journey times of all, to destinations in the far north and west of Scotland, are over nineteen hours (remember that this includes waiting time, which could often be overnight). Areas with no colour at all, such as the area around Hawick in the Scottish Borders or the north-west coast of Scotland, cannot be reached at all by rail plus a taxi journey of up to one hour.

    The map shows that the fastest journeys are those along the East Coast Main Line north to Edinburgh, and those south to London, which is served by frequent fast trains. Everywhere in England can be reached within about seven hours (so by two o’clock in this example), and everywhere in Wales within about ten hours, though many of the fastest journeys to rural areas of mid-Wales will involve a long taxi journey. The urban areas of lowland Scotland are similarly well-connected, but areas further north are much less accessible, or even inaccessible where there are no stations within an hour’s drive of a given destination.

    Edinburgh

    The next map is for journeys starting from Edinburgh Waverley station, but otherwise it is the same as the Cambridge map.

    Map showing travel times by rail and taxi from Edinburgh to other points in Great Britain, starting at 7 o'clock on a weekday morning

    © Crown copyright. All rights reserved. Department for Transport 100020237 2006

    Here the East Coast Main Line is even more obvious than in the preceding map; it appears as a tendril of red and pink stretching south from Edinburgh down to London.

    Comparing car and train travel

    Again considering journeys starting from Cambridge, this map shows which parts of the country are quicker to get to by train (red and orange), and which by car (green and blue). Yellow and light orange show areas where there’s no great difference. This could be useful if you had limited access to a car and were planning where to go, or wanted to see whether it was worth hiring a car for a particular trip.

    Map showing the difference in journey times by rail and taxi, and by road alone, from Cambridge to other points in Great Britain, starting at 7 o'clock on a weekday morning

    © Crown copyright. All rights reserved. Department for Transport 100020237 2006

    This time, contours are drawn for each hour of difference in travel time. Note also that the scale is quite asymmetric: the most time you can save travelling by train is about two hours, but—for places which are difficult to reach by train—you can save six or seven hours travelling by road.

    From this map, journeys to London are quicker by train (the road travel model takes no account of traffic or urban areas, so it is pessimistic about the time saving) as are journeys to Leeds, Berwick, Edinburgh, Glasgow and other points served by trains on the East Coast Main Line. In the west of England, journeys to Exeter and thereabouts are quicker by rail, but all other journeys are quicker by road (largely because most westward journeys require a change at London or a slow cross-country train to Birmingham).

    (However, the model of car journey times is very simplistic, so these results should not be taken too seriously—we hope to extend the work with a more realistic model of driving times, which may substantially change the comparative results.)

    Bus and rail travel in Cambridge and surrounds

    Our second example is based on public transport in and around Cambridge, which chiefly means buses within town and from outlying villages into the city centre. Here we ask how early you must get up and leave your home to reach a particular place of work by public transport. In this example the destination we’ve selected is the University of Cambridge’s West Cambridge Site, which is also home to a number of commercial employers including Microsoft.

    Map of Cambridge showing times of departure to reach the West Cambridge site by 9 o'clock on a weekday morning

    © Crown copyright. All rights reserved. Department for Transport 100020237 2006

    This map shows the city centre of Cambridge, and the area around the destination, which is marked by a small black circle (middle left). Again, warm colours show short journey times—in this case, later departures—but the contour lines are drawn at intervals of ten minutes, rather than an hour, so that the innermost contour around the destination corresponds to a start time of about ten to nine. Again, uncoloured areas are those not served by any services; on this map these are all fields lying outside the city itself.

    Cambridge is a small city with a lot of bus services, so it is not very surprising that the whole of the city centre and much of the suburbs are within twenty to thirty minutes’ travel of the destination, even including waiting and walking time. Moving further out, though, the picture changes (selected villages are labelled for orientation):

    Map of Cambridge and surrounds showing times of departure to reach the West Cambridge site by 9 o'clock on a weekday morning

    © Crown copyright. All rights reserved. Department for Transport 100020237 2006

    The larger map clearly shows the differing levels of service to the various outlying villages within 5–10km of Cambridge. Areas which are connected to Cambridge by fast roads, such as the A14 (which runs through Fenstanton and Bar Hill, skirts Cambridge to the north, and continues east via Stow cum Quy, just south of Swaffham Bulbeck), are much better served than villages such as Reach, which lies well off the beaten track. Waterbeach and Great Shelford, to the north and south of the centre of Cambridge respectively, are also served by train services. Even on this scale there are a few habitations with no or only a limited bus service, from which it is not possible to reach the West Cambridge Site for 9 o’clock purely by public transport without a long walk or an overnight journey (for instance Rampton, to the north-east of Longstanton, or Childerley, to the north-west of Hardwick).

    Looking at a yet larger scale shows a similar pattern:

    Map of Cambridgeshire and surrounds showing times of departure to reach the West Cambridge site by 9 o'clock on a weekday morning

    © Crown copyright. All rights reserved. Department for Transport 100020237 2006

    This map shares the same colour scheme as the previous one—warm colours indicate short journeys, and cool colours long journeys—but the contours are at intervals of thirty minutes rather than ten. Towns such as Huntingdon, Newmarket and Ely are ideal commuter territory, as are some intermediate villages; but most outlying villages aren’t connected at all.

    London

    As a comparison with transport around Cambridge, we’ve also drawn maps for London, a much more densely-populated area with correspondingly better transport infrastructure (you will see that there are almost no inaccessible grey areas). Here the chosen destination is the Department for Transport (DfT) headquarters on Horseferry Road in Westminster. Starting with the most local map:

    Map of central London showing times of departure to reach the Department for Transport building by 9 o'clock on a weekday morning

    © Crown copyright. All rights reserved. Department for Transport 100020237 2006

    Again, warm colours indicate short journeys and cool ones longer journeys. On this map contours are shown at ten-minute intervals, so that the near-circular one around the destination indicates the area in which you can get to the DfT by leaving the house at about ten to nine in the morning.

    Moving slightly further out, nearby tube stations (St James’s Park, Westminster and Pimlico) and bus routes to the south and east are important. Further south there are islands of accessibility around mainline train stations such as Brixton and Clapham Junction. On a smaller-scale map, the tube and railway lines themselves show up as chains of such islands:

    Map of central London and suburbs showing times of departure to reach the Department for Transport building by 9 o'clock on a weekday morning

    © Crown copyright. All rights reserved. Department for Transport 100020237 2006

    On this scale, particularly ill-connected areas of London are clearly visible: Hackney, Richmond and Dulwich (the last despite a direct train service to Victoria) all require an eight o’clock start to arrive at the DfT for our nine o’clock deadline. Compare these to the region of good connectivity stretching south and south-west from the centre, along the Northern Line and the mainline rail line through Wimbledon, the District line to the west, and the Docklands Light Railway to the east. On this scale individual bus routes are not particularly evident, even though they are significant everywhere that rail or tube stations are too far away to walk to in the model; contrast this with the Cambridgeshire maps above.

    On this final map, of the whole Greater London area and surrounds, the contour lines are at half-hour intervals:

    Map of Greater London and surrounds showing times of departure to reach the Department for Transport by 9 o'clock on a weekday morning

    © Crown copyright. All rights reserved. Department for Transport 100020237 2006

    At this scale the suburbs of London appear to be arranged along a southwest–northeast axis, a result of good rail links to Surbiton and Twickenham (at the ends of the two red tendrils which stretch away from the centre to the lower left). Other rail lines, such as those to Bromley and Orpington in the southeast, are also visible, as are islands of short journey time such as Watford (northwest), Hersham and Esher (southwest); these surround individual locations with fast (c. 1 hour) journey times into central London.

    What next? Further work

    We are considering various possible extensions and improvements, and we’re keen to talk to anyone else interested in the work so far, or any of these ideas:

    Relating journey times to house prices
    An obvious application for travel-time data is people’s decisions about where to live. By comparing journey-time data for particular locations with house prices in nearby areas, it would be possible to tabulate the cheapest areas to live in within a certain travel time of a desired location, for instance a person’s place of work.
    Improving the road travel model to better reflect comparative rail and road journey times
    At the moment our rail travel model assumes that the final stage of each journey is by taxi, under the assumption of a uniform and isotropic road network. This is clearly inadequate; we would like to extend it using journey planning software (preferably including traffic and time-of-day effects) to produce more reliable travel time and modal comparison maps.
    Cost
    At the moment the maps are constructed on the assumption that users will always choose the quickest journey under the imposed constraints. Of course, in reality users also respond to other incentives, one of the most important of which is price. If we could obtain fare data we could show journey costs rather than times, use more realistic constraints (for instance, choosing the shortest journey cheaper than a certain amount), or compare the prices of tickets bought on the day with those bought at varying intervals before travel.
    Improving the readability of the maps
    The maps could be made much easier to understand. Improvements could be made by refinement of the colours used and by replacing the current (OS Explorer and 1:250,000) base maps with simpler ones showing sufficient information to allow the viewer to understand the placement of the map without extraneous detail such as building lines etc.
    Incorporating reliability information
    Presently we assume that all public transport services run according to their published timetables. This means that we are more optimistic about total travel times than regular users of those services would be, since they know how often services are late or cancelled. In some cases—for instance, train services—reliability data are published, but in general this is probably better handled by a statistical model. Both reliability and users’ tolerance for unpunctuality could be incorporated, so that users could state the tolerance they have for late arrival at their destination (for instance, no more than once a month) and journey times would be computed so as to meet that target on average.
    A real-time web service
    While our approach uses only published data and services (existing public-sector journey planners such as TransportDirect), the maps are quite slow to construct; for instance, the one of London above required about ten hours of computer time. Further, each one is relevant only to a particular destination or origin (and points near it). If we could speed up this process, we could generate maps on demand for visitors to a website; to do this we would need to work with organisations such as Transport for London and incorporate more specific information about services in each area, rather than relying entirely on outside services. This is an attractive idea, although it would be expensive to build.

    Please contact us if you have any questions or comments.

    Acknowledgments

    This work was funded by the Department for Transport, who also made it possible for us to use Ordnance Survey maps and data through their licence; without this assistance we would have had to pay expensive fees to use the underlying mapping data or to produce maps with no landmarks, which would be almost incomprehensible. DfT also gave us access to their National Public Transport Access Node database, which records the locations of train and tube stations and bus stops; without this it would have been difficult to produce any maps at all.

    The data we used

    Although the journey planning services and software we used were publicly accessible, almost none of the other data is available unless you pay for it, or your work falls under an existing licensing agreement. So while we set out to demonstrate how easily we could make travel-time maps from public data, very little of this work could be cheaply reproduced or extended without assistance from a government department.

    That’s unfortunate, because it means that innovative work by outsiders in this area can only go ahead if it’s explicitly sponsored by government. If all the data we’ve used had been available for free, somebody else might well have done what we’ve done years ago, at no cost to the taxpayer. We’d love it if others extended the work that we’ve done, but realistically there aren’t very many people in a position to do so cheaply.


  5. Travel time maps: methods

    This project became Mapumental. Please visit that site for details of our travel-time map services.
    The work was funded and supported by the Department for Transport.

    See also: the main travel-time maps report.

    ——————

    Clearly the travel time for a particular journey is a function of the origin and destination points, the time of travel (whether of departure or arrival), the modes used, and of decisions made by the traveller (for instance, whether to prefer faster or cheaper journeys). Our travel-time maps can only conveniently display a function of one point (the origin or destination), so we need to fix the other arguments so that the resulting function (a) describes journey times which are useful to users in some context, and (b) is practical to compute.

    We evaluate the travel-time function using the following published journey planners:

    RailPlanner
    This is off-the-shelf desktop software for planning rail journeys on the national rail network. It includes a full timetable for rail services in Great Britain.
    Transport Direct
    This is a website which provides uniform access to regional journey planner services operated by local councils and others in different parts of the country. It will give routes by all modes of public transport and road journeys throughout the country.
    Transport for London’s journey planner
    This is a website which provides the same service as TransportDirect for the London area.

    In each case custom programs were written to drive the journey planner. For RailPlanner, this took the form of a small Microsoft Windows program which simulates filling of the text fields for origin, destination and departure time in the RailPlanner application window, simulates clicking of the button to start a search, and extracts the results using the Windows debugging API. This allows queries for routes between rail stations to be computed quickly and conveniently; because it runs on the local machine, the search is fairly fast, and a few minutes is enough to compute the journey time from a given starting location to all other railway stations in the country.

    In the other two cases a conventional web “scraper” was written using the WWW::Mechanize perl extension. In these cases we ask for routes between postcodes, rather than between named bus stops, tube stations, etc. We chose this approach because, while there is a standard dictionary of bus stop (etc.) names in the National Public Transport Access Node database, we were not confident that those names would be interpreted unambiguously by the journey planner websites. By contrast, postcodes are completely unambiguous, and since most bus stops are in built-up areas, nearby postcodes can be used as placeholders for their locations; we offset the travel time by the assumed walking time from the origin postcode to the nearby interchange point. Admittedly this is not ideal!

    The RailPlanner driver is very fast: it can find the travel times from a single station to all the other (c. 2,800) stations in Great Britain in two or three minutes on a modest desktop PC. The web scrapers are much slower, partly because they have to call out over the network, partly because the multimodal route-planning problem is much harder than the rail-only one, and partly because those services are of course heavily used by people doing real journey planning rather than experiments. For TransportDirect, query times of tens of seconds for individual journeys are common. We limited the rate at which we sent queries to ensure that there was no adverse effect on the services for other users.

    Most journeys don’t start and finish at railway stations or bus stops, so we need to be able to calculate journey times between arbitrary points; these are used to estimate the time taken to get from an origin to join a public transport service, and from its destination to the final destination point.

    In the case of the railway travel times, we did this by assuming that users would continue from the destination station to their final destination by taxi, taking ten minutes to change mode. Taxi journey times were assumed to be a simple function of the distance travelled—we ignore the effects of the road network, and of traffic. To fit this model we simulated a large number of car journeys using off-the-shelf route-planning software, and then fitted a simple function to the journey times. For long distances we find that the cross-country speed is about 66km/h; while that seems slow, bear in mind that the distance in question is the as-the-crow-flies distance. We assume that people will tolerate a taxi journey of up to one hour, though obviously this is an arbitrary choice.
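
    (In code, that final-leg model comes down to something like the following sketch; the haversine formula for the crow-flies distance is a choice made here for illustration, and treating the ten-minute mode change as separate from the one-hour tolerance is an assumption.)

    import math

    MODE_CHANGE_MIN = 10.0      # minutes to leave the station and find a taxi
    TAXI_SPEED_KMH = 66.0       # fitted cross-country (crow-flies) speed
    MAX_TAXI_MIN = 60.0         # longest taxi ride we assume people accept

    def crow_flies_km(lat1, lon1, lat2, lon2):
        # Great-circle distance via the haversine formula.
        p1, p2 = math.radians(lat1), math.radians(lat2)
        dp, dl = p2 - p1, math.radians(lon2 - lon1)
        a = math.sin(dp / 2) ** 2 + math.cos(p1) * math.cos(p2) * math.sin(dl / 2) ** 2
        return 6371.0 * 2.0 * math.asin(math.sqrt(a))

    def taxi_minutes(station, destination):
        # Final-leg time in minutes, or None if the ride would exceed an hour.
        d = crow_flies_km(*station, *destination)
        ride = 60.0 * d / TAXI_SPEED_KMH
        if ride > MAX_TAXI_MIN:
            return None
        return MODE_CHANGE_MIN + ride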

    In the multimodal case, the TransportDirect and TfL journey planners will actually compute travel times for journeys starting and ending at arbitrary postcodes. However, it’s much more efficient to restrict our search to journeys starting at bus stops, tube stations etc., and then to compute travel times from arbitrary locations to the appropriate bus stop. In this case we assume that the user will begin their journey on foot, travelling at 1 m/s cross-country; this matches the assumption made by the journey planner for the time to (e.g.) walk between nearby bus stops. We let the journey planner site determine the time taken for the final component of the journey (typically, from a nearby bus stop to the fixed final destination), since there is no disadvantage to doing so. We assume that people will tolerate a walk of up to 15 minutes at the start of their journey.

    Therefore, in each case, we compute the maps as follows:

    • Fix an origin and start time (in the case of the rail maps) or a destination and arrival deadline (in the case of the multimodal maps), and choose a region of interest: in the case of the rail maps, this was the whole of Great Britain; in the case of the multimodal maps, a square of side 40km centred on the destination. The latter is an arbitrary choice; we could pick a larger area (slower to cover), or an area which doesn’t contain the destination, for instance to ask from what parts of London it is possible to commute to somewhere outside London.
    • Iterate over all railway stations (rail maps) or over all transport interchanges (multimodal) in the region of interest, and compute the earliest arrival for a departure after the start time (rail) or the latest departure for an arrival before the deadline (multimodal) and record it.
    • Now we have enough information to draw the map. Do this by iterating over a grid of points in the region of interest (choosing the spacing by the resolution of the map to be drawn); at each point we search for transport interchanges within an hour’s taxi journey (rail) or 15 minutes’ walking distance (multimodal) of that point, and choose the one which gives the shortest overall journey time, if any. Record this value.

    This generates a grid of points at which we know the journey time, or know that no journey is possible. From this it is trivial to draw a map in which each point is coloured according to journey time, or left uncoloured if no journey is possible. We choose the colours according to a standard scale, but adjust them using histogram equalisation so that each colour covers approximately the same area of the map.
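
    (A sketch of those last two steps under the assumptions above; the data structures are invented, and final_leg_minutes stands in for whichever model applies: taxi for the rail maps, walking for the multimodal ones.)

    import bisect

    def grid_times(interchanges, region, spacing, final_leg_minutes):
        # interchanges: {(x, y): best journey time in minutes at that stop}.
        # final_leg_minutes(stop, point) returns minutes, or None if out of range.
        (x0, y0), (x1, y1) = region
        grid = {}
        y = y0
        while y <= y1:
            x = x0
            while x <= x1:
                best = None
                for stop, t in interchanges.items():
                    leg = final_leg_minutes(stop, (x, y))
                    if leg is not None and (best is None or t + leg < best):
                        best = t + leg
                grid[(x, y)] = best     # None records "no journey possible"
                x += spacing
            y += spacing
        return grid

    def equalised_bands(grid, n_bands):
        # Histogram equalisation: colour by rank rather than raw value, so
        # each colour band covers roughly the same area of the map.
        reachable = sorted(t for t in grid.values() if t is not None)
        def band(t):
            r = bisect.bisect_left(reachable, t)
            return min(n_bands - 1, n_bands * r // len(reachable))
        return {p: (None if t is None else band(t)) for p, t in grid.items()}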

    Drawing the contours, which are essential to making the map comprehensible, is slightly more subtle, because the function to be contoured (the travel time) is not actually defined everywhere in the region of interest. We fix this up by extrapolating the values of the travel time outside the domain of the function, contouring the extrapolated function, and then clipping the contours against the domain of the function. Our extrapolation is a solution to Laplace’s equation, fixing its value to the value of the travel time on the boundary of its domain, and fixing the normal derivative at zero on the boundary of the region of interest. Though there is no real justification for this approach, it produces acceptable results.
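
    (Numerically the extrapolation is plain relaxation; a sketch with numpy, assuming the gridded travel times arrive as a 2-D array with NaN wherever no journey is possible:)

    import numpy as np

    def extrapolate(times, iterations=2000):
        # times: 2-D array of travel times, NaN where no journey is possible.
        known = ~np.isnan(times)
        u = np.where(known, times, np.nanmean(times))   # initial guess
        for _ in range(iterations):
            # Pad by edge-mirroring, which makes the normal derivative
            # vanish on the boundary of the region of interest.
            p = np.pad(u, 1, mode="edge")
            # Jacobi step: each cell becomes the mean of its four neighbours.
            u = (p[:-2, 1:-1] + p[2:, 1:-1] + p[1:-1, :-2] + p[1:-1, 2:]) / 4.0
            u[known] = times[known]     # hold the measured values fixed
        return u                        # contour this, then clip to `known`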

    For the national rail travel maps, the coloured fields and contours were generated using the University of Hawaii’s Generic Mapping Tools, which generate PostScript output; this was then manually composited with raster base-map data in an image-processing program. Unfortunately it is hard to produce satisfactory results by this means, so for the other maps a custom tool was developed to plot and contour maps directly into raster graphics files correctly aligned with the base map data. The overlays are rendered with an alpha of around 50% (adjusted by eye for contrast with the different base maps).

    Software used

    custom software in C, mingw and msys, Windows 2000, VMWare, RailPlanner
    For computing rail travel-times.
    custom software in perl, WWW::Mechanize
    For computing multimodal travel-times.
    GMT, custom software written in C, GIMP
    For producing maps.

    If you’d like a copy of the custom software we wrote, or if you have any other questions or comments, please email hello@mysociety.org. Our software is available under the terms of the GNU Affero GPL.

    Summary of data used

    Excludes timetable data supplied through RailPlanner and the journey planning websites.

    NaPTAN (National Public Transport Access Node database)
    For locations of railway stations, bus stops etc.
    CodePoint (Ordnance Survey database of postcode centroid locations)
    Postcodes were used to describe origin and destination points as input to the two web scrapers, as described above.
    1:25,000 Scale Colour Raster Maps (Ordnance Survey)
    For use as large-scale base mapping.
    1:250,000 Scale Colour Raster Maps (Ordnance Survey)
    For use as small-scale base mapping.

    These data were kindly supplied by agreement with the Department for Transport.

  6. Stats Is Fun!

    So, Francis and Matthew have been working hard to produce the 2005 stats for WriteToThem, which measure, among other things, how responsive MPs are to messages from their constituents. We’ve had a couple of questions about what the quoted confidence intervals for the stats actually mean, so some notes on that and also on conclusions that can be drawn from them:

    We get the responsiveness stats by sending each user of WriteToThem a questionnaire two weeks after their message is sent to their MP (or other representative), with a reminder after three weeks. The questionnaire asks them whether they’ve had a response from the MP, and, as a follow-up question, whether this is the first time they’ve ever contacted an elected representative.

    Now, an estimate of the MP’s true response rate (that is, the probability that a random letter from a member of the public will receive an answer within 2–3 weeks) is obviously the number of respondents answering “yes” to the first questionnaire question, divided by the total number of responses to the question. (Assuming, that is, that the people who answer the question at all are representative of all the people who write to a given MP, that they answer honestly, etc. etc.) So, we can regard this as being a bit like an opinion poll: we ask a sample of constituents whether they got an answer, and extrapolate to estimate the total response rate.

    Unfortunately, in many cases we don’t have very many responses to the questionnaire (either because an MP’s constituency isn’t very wired, so few constituents used our service; or because we didn’t have accurate contact details for a significant part of the year, so couldn’t put those people in touch with their MP; or because the MP was recently elected, so hasn’t had much time to accumulate data). Over the course of the year we sent MPs about 30,000 messages, giving an average of about fifty per MP, but the number of messages sent to individual MPs varied quite a bit:


    Histogram of the number of messages sent to each MP

    Obviously, the fewer messages we send to an MP, the fewer questionnaire responses about their performance we receive, and therefore the less accurate our estimate of how well they perform. To see why this is, consider a simple analogy. Imagine that a particular MP manages to respond to 50% of messages within 2–3 weeks. You can model that with the flip of a coin: for each message, flip the coin, and if it comes up heads, count that as a response received, and if tails, not received. Suppose you do that once: 50% of the time the coin comes up heads (MP did respond), and 50% of the time tails (MP did not respond). If there’s only one message, therefore, you will always see a response rate of 0% or of 100%. With two messages there are three possibilities: 25% of the time you’ll get two tails, 50% of the time one head and one tail, and 25% of the time two heads. So 50% of the time the estimated responsiveness is correct (1 / 2 = 50%), and 50% of the time it’s wrong. As the number of responses increases, accuracy improves; for instance, for an MP who has a response rate of 50% and 20 questionnaire responses, the probability that we’ll estimate various values for their responsiveness looks like this:

    Distribution of estimated response rates for an MP with a true response rate of 50% and 20 questionnaire responses

    Now the probability that our estimate will be exactly correct (10 “yes” answers out of 20 questionnaires) is only 18%, but the probability that we’ll get an answer quite close to the true value (between 8 and 12 “yes”s, or from 40% to 60%) is quite high: about 74%. It’s 95% certain that we’ll see between 6 and 14 “yes”s, corresponding to an estimated response rate between 30% and 70%. To put it another way, if all MPs had twenty questionnaire responses, and all had a responsiveness of 50%, then for about 2.5% of them (16 MPs) we’d estimate that their responsiveness was worse than 30%, and for another 16 that it was better than 70%. The [30%–70%] range is called a “95% confidence interval”, and is an indication of how sure we are about the statistics we are publishing. If you don’t pay attention to the confidence intervals you will get a misleading impression from these statistics—we show the confidence intervals for a reason!
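
    (The figures quoted here are easy to check directly from the binomial distribution, for instance with a few lines of Python:)

    from math import comb

    n, p = 20, 0.5
    prob = [comb(n, k) * p**k * (1 - p)**(n - k) for k in range(n + 1)]

    print("P(exactly 10 yes) = %.2f" % prob[10])         # about 0.18
    print("P(8..12 yes)      = %.2f" % sum(prob[8:13]))  # about 0.74
    print("P(6..14 yes)      = %.2f" % sum(prob[6:15]))  # about 0.96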

    (For a comparison, a commercial opinion poll would typically use a sample of between 1,000 and 2,000 people, giving rather narrower confidence intervals, typically of ±2–3%. We can’t do that for individual MPs because we don’t have enough data, but for the data on the responsiveness of different types of representatives we have lots of data and can produce much more accurate numbers. For instance, our estimate of the overall responsiveness of MPs, which is based on tens of thousands of questionnaire responses, is 63%, and our confidence interval on this is much smaller than 1%.)

    I’ll finish with something on the question of which parties’ MPs are best at answering constituents’ correspondence. Here’s a plot of our estimates of the response rates of MPs from some major parties:

    Graph showing, for each party, the cumulative distribution of estimated MP response rates

    (This ignores the Northern Ireland parties because (a) none of them has very many MPs, so we don’t have much data; and (b) I’d run out of colours and wouldn’t like to make an horrific faux pas. I’ve lumped together the Welsh and Scottish nationalists for much the same reasons. Apologies. The way to read the graph, by the way, is to consider each curve as telling you “what fraction of this party’s MPs have a responsiveness equal to or less than the value”.)

    So what does this tell us? Tories are better at answering their mail, and everybody else does about equally well. (If you want the gory details: the distributions may be compared using a Kolmogorov-Smirnov test, and only the distribution for Conservative MPs differs significantly from those of the other parties.)
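
    (If you want to repeat the comparison, the test is a standard one; the response-rate lists below are invented stand-ins, since the per-MP numbers aren’t reproduced in this post.)

    from scipy.stats import ks_2samp

    conservative = [0.81, 0.74, 0.66, 0.90, 0.72]   # invented example data
    labour       = [0.55, 0.62, 0.48, 0.70, 0.58]

    statistic, p_value = ks_2samp(conservative, labour)
    print("KS statistic %.2f, p-value %.3f" % (statistic, p_value))
    # A small p-value (say < 0.05) indicates the two distributions differ.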

  7. Rate-limiting service

    So, I’m about to start work on loading the next copy of the BoundaryLine electoral geography data into MaPit, which will give us 100% coverage of county councillors and fix some problems which we weren’t able to work around when we did this after last year’s election. But this is a tedious job and so I’m not going to talk about it now.

    Instead I’ll draw your attention to Ratty, our rate-limiting service, which is of general usefulness but (so far as we know) hasn’t been used by anyone outside mySociety. Our major use for it is in the anti-abuse rules in the WriteToThem back-end. I’m about to head out for lunch, so I won’t explain how the thing works in detail, but if this is the sort of thing that you think might be useful to you, leave a comment or drop a mail to hello@mysociety.org with any questions or comments. Like almost all our code, Ratty is licensed under the Affero GPL.

  8. Postcodeine

    So, a silly post for today: Postcodeine. This is a British version of Ben Fry’s zipdecode, a “tool” for visualising the distribution of zipcodes in the United States. This is, as has been pointed out to me, wholly pointless, but it’s quite fun and writing it was an interesting exercise (it also taught me a little bit about AJAX, the web’s technology trend du jour). If you want the source code, it’s at the foot here; licence is the Affero GPL, as for all the other mySociety code.

    How it works: this is pretty obvious, but I might as well spell it out. The web page has four images on it: the big and small base maps, and two overlays. The back-end code is responsible for drawing sets of postcode locations into transparent PNGs, and when you type things in the text field, the src for each of the overlay images is changed. Panning the large map is done by issuing another request from Javascript to grab the mean location of all postcodes matching the given prefix (slightly hobbled, so that this isn’t a generalised postcode-to-coordinates oracle — sorry!); the rightmost pane, with a list of postcodes and their areas, is populated from another HTTP request. It could be done with an iframe but, as Paul Graham puts it, “Javascript works now”, so we might as well use that.
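
    (The real back-end isn’t reproduced here, but the overlay-drawing step reduces to a few lines; this sketch uses Python with the Pillow imaging library, and the projection function to_pixel is assumed to exist.)

    from PIL import Image   # Pillow

    def overlay_png(postcodes, prefix, size, to_pixel, path):
        # postcodes: {postcode: (easting, northing)}; to_pixel maps a
        # location to (x, y) pixel coordinates on the base map.
        img = Image.new("RGBA", size, (0, 0, 0, 0))   # fully transparent
        for code, location in postcodes.items():
            if code.startswith(prefix):
                x, y = to_pixel(location)
                if 0 <= x < size[0] and 0 <= y < size[1]:
                    img.putpixel((x, y), (204, 0, 0, 255))   # opaque red dot
        img.save(path, "PNG")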

    (I should say, by the way, that I wrote this in my copious spare time. It’s copyright mySociety because I don’t have the right to use the postcode database myself.)

  9. HassleMe

    So, HassleMe launched today (despite mostly having been written just before Christmas). Good work by Etienne getting it all together. Today Matthew and I have been working on adding “instant-messenger” functionality to the site, which turns out to be a bit painful. Right now it seems like the most robust solution will be to use bitlbee, a proxy which allows you to interact with the various and wretched instant messenger protocols through the less varied and marginally less wretched IRC protocol.

    Integrating a website with instant messenger is an interesting problem. I’m not yet sure how much of the experience of building sites which send and receive email will carry over. We’ll see….