Partly for our own internal documentation, and partly because it might be of interest to (some) readers, some notes on how the Number 10 petitions site works. On the face of it you’d imagine this would be very simple, but as usual it’s complicated by performance requirements (our design was motivated by the possibility that a large NGO might mail a very large mailing list inviting each member to sign a particular petition, so that we might have to sustain a very high signup rate for an extended period of time). Here’s a picture of the overall architecture:
(This style of illustration is unaccountably popular in the IT industry but unlike most examples of the genre, I’ve tried to arrange that this one actually contains some useful information. In particular I’ve tried to mark the direction of flow of information, and separate out the various protocols; as usual there are too many of the latter. The diagram is actually a slight lie because it misses out yet another layer of IPC—between the web server, apache, and the front-end FastCGI scripts.)
Viewing petition pages is pretty conventional. Incoming HTTP requests reach a front-end cache (an instance of squid, one per web server, caching in memory only); squid passes them to the petition browsing scripts (written in perl, running under FastCGI) to display petition information. Those scripts consult the database for the current state of the relevant petition and pass it back to the proxy, and thence to the web client. This aspect of the site is not very challenging.
Signing petitions is harder. The necessary steps are:
- write a database record about the pending signature;
- send the user an email containing a unique link to confirm their signature;
- update the database record when the user clicks the link;
- commit everything to stable storage; and finally
- tell the user that their signature has been entered and confirmed.
The conventional design for this would be to have the web script, when it processes the HTTP request for a new signature, submit a message for sending by a local mail server and write a row into the database and commit it, forcing the data out to disk. The mail server would then write the message into its spool directory and fsync it, forcing it out to disk too; later it would pick the mail out of its queue and send it to a remote server, at which point it would be erased from the queue. Later still the mail would arrive in the user’s inbox, at which point the user would (presumably) click the link, resulting in another HTTP request which causes the web script to update the corresponding database row and commit the result. While this is admirably modular, it requires far more disk writes than necessary to actually complete the task, which limits its potential speed. (In particular, there’s no reason to have a separate MTA spool directory and for the MTA to make its own writes to that directory.)
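Whichever way the plumbing is arranged, each signature touches the database twice. A minimal sketch of those two touches, with made-up table and column names (the real schema differs):

    # Sketch of the two database writes per signature; the table and column
    # names are invented for illustration.
    use strict;
    use warnings;
    use DBI;

    my $dbh = DBI->connect('dbi:Pg:dbname=petitions', '', '', { RaiseError => 1 });
    my ($petition_id, $email) = (42, 'someone@example.com');   # stand-ins

    # 1. When the signup form is submitted: record a pending signature along
    #    with a random token to go in the confirmation link.
    my $token = join '', map { sprintf '%02x', int rand 256 } 1 .. 16;
    $dbh->do('insert into signer (petition_id, email, token, confirmed)
                values (?, ?, ?, false)', {}, $petition_id, $email, $token);

    # 2. When the user follows the link containing $token: mark it confirmed.
    $dbh->do('update signer set confirmed = true where token = ?', {}, $token);

Everything else in the design is about doing those writes, and the commits that make them durable, as cheaply as possible.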
At times of high load, it is also extremely inefficient to do one commit per signature. It takes about as long to commit ten new or changed rows to the database as it does to commit one (because the time spent is determined by the disk seek time). Therefore to achieve high performance it is necessary to batch signatures. Unfortunately this is a real pain to implement, because all the common web programming models use one process per concurrent request, and it is inconvenient to share database handles between different processes. The correct answer to this problem would of course be to write the signup web script as a single-process multiplexing implementation, but that’s a bit painful (we’d have had to implement our own FastCGI wire protocol library, or alternatively an HTTP server) and the deadlines were fairly tight. So instead we have a single-process server, petsignupd, which accepts signup and confirmation requests from the front-end web scripts over a simple UDP protocol, and passes them to the database in batches every quarter of a second. In theory, therefore, users should see a maximum added latency of a bit over 0.25s, but we should achieve close to the theoretical best throughput of incoming requests. (Benchmarking more-or-less bears this out.)
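To make the batching concrete, here is a stripped-down sketch of the idea; it is not the real petsignupd, and the port, wire format and table names are all invented:

    #!/usr/bin/perl
    # Sketch of a batching signup daemon: gather UDP requests for up to a
    # quarter of a second, then commit them to the database in one transaction.
    # Port, wire format and schema are invented; no acknowledgements are sent.
    use strict;
    use warnings;
    use IO::Socket::INET;
    use IO::Select;
    use Time::HiRes qw(time);
    use DBI;

    my $dbh = DBI->connect('dbi:Pg:dbname=petitions', '', '',
                           { AutoCommit => 0, RaiseError => 1 });
    my $sock = IO::Socket::INET->new(LocalPort => 9876, Proto => 'udp')
        or die "socket: $!";
    my $sel = IO::Select->new($sock);
    my $ins = $dbh->prepare(
        'insert into signer (petition_id, email, token) values (?, ?, ?)');

    while (1) {
        my @batch;
        my $deadline = time() + 0.25;
        # Collect whatever arrives during this quarter-second window.
        while ((my $left = $deadline - time()) > 0) {
            last unless $sel->can_read($left);
            my $buf;
            $sock->recv($buf, 1024) or next;
            push @batch, [ split /\t/, $buf, 3 ];   # "petition_id\temail\ttoken"
        }
        next unless @batch;
        $ins->execute(@$_) for @batch;
        $dbh->commit();   # one commit, and so roughly one disk seek, per batch
        # The real daemon would now send a UDP reply for each request so the
        # web script can tell the user their signature has been recorded.
    }

One transaction per batch is the point: the commit cost is paid once per quarter-second rather than once per signature.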
Sending the corresponding email is also a bit problematic. General-purpose MTAs are not optimised for this sort of situation, and (for instance) exim can’t keep up with the sustained signup rate we were aiming for even if you put all of its spool directories on a RAM disk and accept that you have to repopulate its outgoing queue in the event of a crash. The solution was to write petemaild, a small multiplexed SMTP sending server; unlike a general-purpose MTA this manages its queue in memory and communicates updates directly to the database (when a confirmation email is delivered or delivery fails permanently).
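Stripped of the multiplexing (and with delivery handed to a smarthost for brevity), the core idea is that the queue lives in memory and outcomes go straight back to the database; everything below (schema, addresses, host names) is made up:

    # Crude sketch of the petemaild idea, minus the multiplexing: read the
    # queue from the database into memory, deliver, and write the outcome
    # straight back.  Schema, addresses and the smarthost are invented.
    use strict;
    use warnings;
    use Net::SMTP;
    use DBI;

    my $dbh = DBI->connect('dbi:Pg:dbname=petitions', '', '', { RaiseError => 1 });
    my $queue = $dbh->selectall_arrayref(
        'select id, email, token from signer where emailsent is null');

    for my $row (@$queue) {
        my ($id, $email, $token) = @$row;
        my $smtp = Net::SMTP->new('smarthost.example.org') or next;  # retry later
        my $ok = $smtp->mail('petitions@example.org')
              && $smtp->to($email)
              && $smtp->data("Subject: Please confirm your signature\n\n"
                             . "http://petitions.example.org/confirm/$token\n");
        $smtp->quit();
        # The outcome is recorded directly in the database, with no spool
        # directory in between (real code would distinguish temporary from
        # permanent failures rather than treating every failure as final).
        $dbh->do('update signer set emailsent = ? where id = ?',
                 {}, $ok ? 'sent' : 'failed', $id);
    }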
It’s unfortunate that such a complex system is required to fulfil such a simple requirement. If we’d been prepared to write the whole thing ourselves, from processing HTTP requests down to writing signatures out to files on disk, the picture above would look much simpler (and there would be fewer IPC boundaries at which things could go wrong). On the other hand the code itself would be a lot more complex, and there’d be a lot more of it. I don’t think I’d describe this design as a “reasonable” compromise, but it’s at least an adequate one.
Last Friday Matthew and I went to the Ordnance Survey’s UK Geospatial Mash-up day. And a splendid time was had by all. Really this post is just a placeholder for a link to a copy of my presentation slides (not quite what I delivered, I’m afraid), but if anyone was there and has any questions they weren’t able to put to us in person, or wasn’t there and wished they had been, then the comments section awaits….
So, PledgeBank got Slashdotted a couple of weeks ago when Mike Liveright’s $100 laptop pledge was linked from a post about the laptop. We didn’t cope very well.
Unfortunately, PledgeBank is a pretty slow site. Generating the individual pledge page (done by mysociety/pb/web/ref-index.php) can take anything up to 150ms. That’s astonishingly slow, given the speed of a modern computer. What takes the time?
It’s quite hard to benchmark pages on a running web server, but one approach that I’ve found useful in the past is to use an analogue of phase-sensitive detection. Conveniently enough, all the different components of the site — the webserver, the database and the PHP process — run as different users, so you can easily count up the CPU time being used by the different components during an interval. To benchmark a page, then, request it a few times and compute the amount of CPU time used during those requests. Then sleep for the same amount of time, and compute the amount of CPU time used by the various processes while you were sleeping. The difference between the values is an estimate of the amount of CPU time taken servicing your requests; by repeating this, a more accurate estimate can be obtained. Here are the results after a few hundred requests to http://www.pledgebank.com/100laptop, expressed as CPU time per request in ms:
(The code to do the measurements — Linux-specific, I’m afraid — is in mysociety/bin/psdbench.)
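If you want to try the same trick, the sketch below shows the shape of the measurement (it is not psdbench itself; the user names are assumptions, and ps’s one-second resolution means you need a few hundred requests per run):

    #!/usr/bin/perl
    # Sketch of the measurement described above (not the real psdbench): total
    # up CPU time per component while making requests, do the same while idle,
    # and subtract.  Linux-specific; user names are assumptions.
    use strict;
    use warnings;
    use LWP::Simple qw(get);

    my @users = qw(www-data postgres phpcgi);   # web server, database, PHP

    sub cpu_seconds_for {
        my $user = shift;
        my $total = 0;
        # ps prints cumulative CPU time as [dd-]hh:mm:ss; add it all up.
        for (`ps -u $user -o cputime= 2>/dev/null`) {
            my ($d, $h, $m, $s) = /(?:(\d+)-)?(?:(\d+):)?(\d+):(\d+)/ or next;
            $total += ((($d || 0) * 24 + ($h || 0)) * 60 + $m) * 60 + $s;
        }
        return $total;
    }

    my $n = 200;
    my %before = map { $_ => cpu_seconds_for($_) } @users;
    my $t0 = time();
    get('http://www.pledgebank.com/100laptop') for 1 .. $n;
    my $elapsed = time() - $t0;
    my %active = map { $_ => cpu_seconds_for($_) - $before{$_} } @users;

    # Measure the background rate over the same wall-clock interval.
    %before = map { $_ => cpu_seconds_for($_) } @users;
    sleep($elapsed);
    my %idle = map { $_ => cpu_seconds_for($_) - $before{$_} } @users;

    printf "%-10s %6.1f ms/request\n", $_, 1000 * ($active{$_} - $idle{$_}) / $n
        for @users;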
So that’s pretty shocking. Obviously if you spend 150ms of CPU time on generating a page then the maximum rate at which you can serve users is 1,000 / 150 ≈ 7 requests/second/CPU, which is pretty miserable given that Slashdot can relatively easily drive 50 requests/second. But the really astonishing thing about these numbers is the ~83ms spent in the PHP interpreter. What’s it doing?
The answer, it turns out, is… parsing PHP code! Benchmarking a page which consists only of this:
    /* ... */
reveals that simply parsing the libraries we include in the page takes about 35ms per page view! PHP, of course, doesn’t parse the code once and then run the bytecode in a virtual machine for each page request, because that would be too much like a real programming language (and would also cut into Zend’s market for its “accelerator” product, which is just an implementation of this obvious idea for PHP).
So this is bad news. The neatest approach to fixing this kind of performance problem is to stick a web cache like squid in front of the main web site; since the pledge page changes only when a user signs the pledge, or a new comment is posted, events which typically don’t occur anywhere near as frequently as the page is viewed, most hits ought to be servable from the cache, which can be done very quickly indeed. But it’s no good to allow the pledge page to just sit in cache for some fixed period of time (because that would be confusing to users who’ve just signed the pledge or written a comment, an effect familiar to readers of the countless “Movable Type” web logs which are adorned with warnings like, “Your comment may take a few seconds to appear — please don’t submit twice”). So to do this properly we have to modify the pledge page to handle a conditional GET (with an If-Modified-Since: or If-None-Match: header) and quickly return a “304 Not Modified” response to the cache if the page hasn’t changed. Unfortunately if PHP is going to take 35ms to process such a request (ignoring any time in the database), that still means only 20 to 30 requests/second, which is better but still not great.
(For comparison, a mockup of a perl program to process conditional GETs for the pledge page can serve each one in about 3ms, which isn’t much more than the database queries it uses take on their own. Basically that’s because the perl interpreter only has to parse the code once, and then it runs in a loop accepting and processing requests on its own.)
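Something along these general lines, sketched rather than quoted (the query follows the PHP shown below, and error handling is omitted):

    #!/usr/bin/perl
    # Sketch of a long-running perl responder for conditional GETs on pledge
    # pages (not the actual mockup).  It parses its code once at startup and
    # then loops over FastCGI requests.
    use strict;
    use warnings;
    use CGI::Fast;
    use DBI;
    use HTTP::Date qw(str2time time2str);

    my $dbh = DBI->connect('dbi:Pg:dbname=pb', '', '', { RaiseError => 1 });
    my $st  = $dbh->prepare(
        'select extract(epoch from pledge_last_change_time(id))
            from pledges where ref = ?');

    while (my $q = CGI::Fast->new()) {
        $st->execute(scalar $q->param('ref'));
        my ($changed) = $st->fetchrow_array();
        my $ims = str2time($q->http('If-Modified-Since') || '');
        if (defined $changed && defined $ims && $changed <= $ims) {
            # Nothing has changed since the copy squid already holds.
            print "Status: 304 Not Modified\r\n\r\n";
        } else {
            print "Content-Type: text/html; charset=utf-8\r\n",
                  "Last-Modified: ", time2str($changed || time()), "\r\n\r\n";
            # ... generate the full page here ...
        }
    }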
However, since we unfortunately don’t have time to rewrite the performance-critical bits of PledgeBank in a real language, the best we can do is to try to cut the number of lines of library code that the site has to parse on each page view. That’s reduced the optimal case for the pledge page — where the pledge has not changed — to this:
    /* ... */
    /* Short-circuit the conditional GET as soon as possible -- parsing the rest of
     * the includes is costly. */
    if (array_key_exists('ref', $_GET)
            && ($id = db_getOne('select id from pledges where ref = ?', $_GET['ref']))
            && cond_maybe_respond(intval(db_getOne('select extract(epoch from pledge_last_change_time(?))', $id))))
        /* ... */
– that, and a rewrite of our database library so that it didn’t use the gigantic and buggy PEAR one, has got us up to somewhere between 60 and 100 reqs/sec, which while not great is enough that we should be able to cope with another similar Slashdotting.
For other pages where interactivity isn’t so important, life is much easier: we can just emit a “Cache-Control: max-age=…” header, which tells squid that it can re-use that copy of the page for however long we specify. That means squid can serve, for instance, the front page at about 350 reqs/sec; unfortunately the front page isn’t all that important (most users come to PledgeBank for a specific pledge).
There’s a subtlety to using squid in this kind of (“accelerator”) application which I hadn’t really thought about before. What page you get for a particular URL on PledgeBank (as on lots of other sites) varies based on the content of various headers sent by the user, such as cookies, preferred languages, etc.; for instance, if you have a login cookie, you’ll see a “log out” link which isn’t there if you’re an anonymous user. HTTP is set up to handle this kind of situation through the Vary: header, which the server sends to tell clients and proxies on which headers in the request the content of the response depends. So, if the page depends on login cookies, you should send “Vary: Cookie”, and if you do content-negotiation for different languages, “Vary: Accept-Language” or whatever.
PledgeBank has another problem. If the user doesn’t have a cookie saying which country they want to see pledges for, the site tries to guess, based on their IP address. Obviously that makes almost all PledgeBank pages potentially uncachable — the Vary: mechanism can’t express this dependency. That’s not a lot of help when your site gets featured on Slashdot!
The (desperately ugly) solution? Patch squid to invent a header in each client request, X-GeoIP-Country:, which says which country the client’s IP address maps to, and then name that header in the Vary: header of the outgoing pages. It’s horrid, but it seems to work.
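On the application side that amounts to something like the following (sketched in perl for brevity, though the real site does it in PHP; the cookie name and the ‘GB’ default are invented):

    use strict;
    use warnings;
    use CGI;

    my $q = CGI->new();
    # Prefer an explicit country cookie; otherwise fall back to the country
    # that squid derived from the client's IP address.  Cookie name and the
    # 'GB' default are invented for illustration.
    my $country = $q->cookie('country') || $q->http('X-GeoIP-Country') || 'GB';
    # Tell squid exactly which request headers this page's content depends on.
    print $q->header(-type => 'text/html',
                     -vary => 'Cookie, Accept-Language, X-GeoIP-Country');
    # ... go on to render the page for $country ...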
So, Francis and Matthew have been working hard to produce the 2005 stats for WriteToThem, which measure, among other things, how responsive MPs are to messages from their constituents. We’ve had a couple of questions about what the quoted confidence intervals for the stats actually mean, so some notes on that and also on conclusions that can be drawn from them:
We get the responsiveness stats by sending each user of WriteToThem a questionnaire two weeks after their message is sent to their MP (or other representative), with a reminder after three weeks. The questionnaire asks them whether they’ve had a response from the MP, and, as a follow-up question, whether this is the first time they’ve ever contacted an elected representative.
Now, an estimate of the MP’s true response rate (that is, the probability that a random letter from a member of the public will receive an answer within 2–3 weeks) is obviously the number of respondents answering “yes” to the first questionnaire question, divided by the total number of responses to the question. (Assuming, that is, that the people who answer the question at all are representative of all the people who write to a given MP, that they answer honestly, etc. etc.) So, we can regard this as being a bit like an opinion poll: we ask a sample of constituents whether they got an answer, and extrapolate to estimate the total response rate.
Unfortunately in many cases we don’t have very many responses to the questionnaire (either because an MP’s constituency isn’t very wired, so few constituents used our service; or because we didn’t have accurate contact details for a significant part of the year, so couldn’t put those people in touch with their MP; or because the MP was recently elected, so hasn’t had much time to accumulate data). Over the course of the year we sent MPs about 30,000 messages, giving an average of about fifty per MP, but the number of messages sent to individual MPs varied quite a bit:
Obviously, the fewer messages we send to an MP, the fewer questionnaire responses about their performance we receive, and therefore the less accurate our estimate of how well they perform. To see why this is, consider a simple analogy. Imagine that a particular MP manages to respond to 50% of messages within 2–3 weeks. You can model that with a flip of a coin: for each message, flip the coin, and if it comes up heads, count that as a response received, and if tails, not received. Suppose you do that once: 50% of the time the coin comes up heads (MP did respond), and 50% of the time tails (MP did not respond). If there’s only one message, therefore, you will always see a response rate of 0% or of 100%. With two messages there are three possibilities: 25% of the time you’ll get two tails, 50% of the time one head and one tail, and 25% of the time you’ll get two heads. So 50% of the time the estimated responsiveness is correct (1 / 2 = 50%), and 50% of the time it’s wrong. As the number of responses increases, accuracy improves; for instance, for an MP who has a response rate of 50% and 20 questionnaire responses, the probability that we’ll estimate various values for their responsiveness looks like this:
—now the probability that our estimate will be exactly correct (10 “yes” answers out of 20 questionnaires) is only 18%, but the probability that we’ll get an answer quite close to the true value (between 8 and 12 “yes”s, or from 40% to 60%) is quite high: about 74%. It’s 95% certain that we’ll see between 6 and 14 “yes”s, corresponding to an estimated response rate between 30% and 70%. To put it another way, if all MPs had twenty questionnaire responses, and all had a responsiveness of 50%, then for about 2.5% of them (16 MPs) we’d estimate that their responsiveness was worse than 30%, and for another 16 that it was better than 70%. The [30%–70%] range is called a “95% confidence interval”, and is an indication of how sure we are about the statistics we are publishing. If you don’t pay attention to the confidence intervals you will get a misleading impression from these statistics—we show the confidence intervals for a reason!
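If you want to check those figures, the arithmetic is just the binomial distribution; a quick sketch:

    #!/usr/bin/perl
    # The arithmetic behind the figures above: binomial probabilities for 20
    # questionnaire responses and a true response rate of 50%.
    use strict;
    use warnings;

    sub choose {
        my ($n, $k) = @_;
        my $c = 1;
        $c *= ($n - $_ + 1) / $_ for 1 .. $k;
        return $c;
    }
    sub binom {
        my ($n, $k, $p) = @_;
        return choose($n, $k) * $p ** $k * (1 - $p) ** ($n - $k);
    }

    my ($n, $p) = (20, 0.5);
    my $sum = sub { my $t = 0; $t += binom($n, $_, $p) for $_[0] .. $_[1]; $t };

    printf "P(exactly 10 yeses)          = %.3f\n", binom($n, 10, $p);  # ~0.176
    printf "P(8..12 yeses, i.e. 40-60%%)  = %.3f\n", $sum->(8, 12);      # ~0.737
    printf "P(6..14 yeses, i.e. 30-70%%)  = %.3f\n", $sum->(6, 14);      # ~0.959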
(For a comparison, a commercial opinion poll would typically use a sample of between 1,000 and 2,000 people, giving rather narrower confidence intervals, typically of ±2–3%. We can’t do that for individual MPs because we don’t have enough data, but for the data on the responsiveness of different types of representatives we have lots of data and can produce much more accurate numbers. For instance, our estimate of the overall responsiveness of MPs, which is based on tens of thousands of questionnaire responses, is 63%, and our confidence interval on this is much smaller than 1%.)
I’ll finish with something on the question of which parties’ MPs are best at answering constituents’ correspondence. Here’s a plot of our estimates of the response rates of MPs from some major parties:
(This ignores the Northern Ireland parties because (a) none of them has very many MPs, so we don’t have much data; and (b) I’d run out of colours and wouldn’t like to make an horrific faux pas. I’ve lumped together the Welsh and Scottish nationalists for much the same reasons. Apologies. The way to read the graph, by the way, is to consider each curve as telling you “what fraction of this party’s MPs have a responsiveness equal to or less than the value”.)
So what does this tell us? Tories are better at answering their mail, and everybody else does about equally well. (If you want the gory details, the distributions may be compared using a Kolmogorov-Smirnov test; only the distribution for Conservative MPs differs significantly from those of the other parties.)
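(For the very curious, a sketch of that sort of comparison, though not the code used for the published figures: the D statistic is the largest vertical gap between two empirical distribution curves of the kind plotted above, and 1.358 is the usual 5% coefficient.)

    # Sketch of a two-sample Kolmogorov-Smirnov comparison (not the code used
    # for the published figures).
    use strict;
    use warnings;
    use List::Util qw(max);

    sub ks_differs {
        my ($x, $y) = @_;                  # two array refs of response rates
        my @x = sort { $a <=> $b } @$x;
        my @y = sort { $a <=> $b } @$y;
        my ($i, $j, $d) = (0, 0, 0);
        # D is the largest gap between the two empirical distribution functions.
        while ($i < @x && $j < @y) {
            if ($x[$i] <= $y[$j]) { $i++ } else { $j++ }
            $d = max($d, abs($i / @x - $j / @y));
        }
        my $crit = 1.358 * sqrt((@x + @y) / (@x * @y));   # 5% critical value
        return ($d, $d > $crit);
    }

    # e.g. my ($D, $significant) = ks_differs(\@tory_rates, \@labour_rates);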
So, I’m about to start work on loading the next copy of the BoundaryLine electoral geography data into MaPit, which will give us 100% coverage of county councillors and fix some problems which we weren’t able to work around when we did this after last year’s election. But this is a tedious job and so I’m not going to talk about it now.
Instead I’ll draw your attention to Ratty, our rate-limiting service, which is of general usefulness but (so far as we know) hasn’t been used by anyone outside mySociety. Our major use for it is in the anti-abuse rules in the WriteToThem back-end. I’m about to head out for lunch, so I won’t explain how the thing works in detail, but if this is the sort of thing that you think might be useful to you, leave a comment or drop a mail to firstname.lastname@example.org with any questions or comments. Like almost all our code, Ratty is licensed under the Affero GPL.
So, a silly post for today: Postcodeine. This is a British version of Ben Fry’s zipdecode, a “tool” for visualising the distribution of zipcodes in the United States. This is, as has been pointed out to me, wholly pointless, but it’s quite fun and writing it was an interesting exercise (it also taught me a little bit about AJAX, the web’s technology trend du jour). If you want the source code, it’s at the foot of this post; the licence is the Affero GPL, as for all the other mySociety code.
(I should say, by the way, that I wrote this in my copious spare time. It’s copyright mySociety because I don’t have the right to use the postcode database myself.)
So, HassleMe launched today (despite mostly having been written just before Christmas). Good work by Etienne getting it all together. Today Matthew and I have been working on adding “instant-messenger” functionality to the site, which turns out to be a bit painful. Right now it seems like the most robust solution will be to use bitlbee, a proxy which allows you to interact with the various and wretched instant messenger protocols through the less varied and marginally less wretched IRC protocol.
Integrating a website with instant messenger is an interesting problem. I’m not yet sure how much of the experience of building sites which send and receive email will carry over. We’ll see….
And a follow-up to my last post: the population density and customary proximity APIs are now available in Gaze. The additional APIs are:
The first, which estimates population density, takes two parameters:

- WGS84 latitude, in decimal degrees
- WGS84 longitude, in decimal degrees

and returns an estimate of the population density at (lat, lon), in persons per square kilometer, as a decimal number followed by a line feed.

The second, which estimates “customary proximity”, takes four parameters:

- WGS84 latitude, in decimal degrees
- WGS84 longitude, in decimal degrees
- number of persons
- largest radius returned, in kilometers; optional; default 150

and returns an estimate of the smallest radius around (lat, lon) containing at least that number of persons, or that largest radius, if it is smaller, as a decimal number followed by a line feed.
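For example, fetching the population density over HTTP might look something like this; the endpoint URL and parameter names here are guesses for illustration rather than the documented interface:

    use strict;
    use warnings;
    use LWP::Simple qw(get);

    # Hypothetical request: the URL and parameter names are illustrative only;
    # consult the Gaze documentation for the real interface.
    my $density = get('http://gaze.mysociety.org/gaze-rest?f=get_population_density'
                      . ';lat=52.2053;lon=0.1218');
    if (defined $density) {
        chomp $density;
        print "population density: $density persons per square km\n";
    }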
Enjoy! Questions and comments to email@example.com, please.
… or, “how near is ‘nearby’?”
On PledgeBank we offer search and local alert features which will tell users about pledges which have been set up near them, the idea being that if somebody’s organising a street party in the next street over, you might well want to hear about it, but if it’s somebody a thousand miles away, you probably don’t.
At the moment we do this by gathering location data from pledge creators (either using postcodes, or location names via Gaze), and comparing it to search / alert locations using a fixed distance threshold — presently 20km (or about 12 miles). This works moderately well, but leads to complaints from Londoners of the form “why have I been sent a pledge which is TEN MILES away from me?” — the point being that, within London, people’s idea of how far away a “nearby” thing is differs quite a bit from that of people who live in the countryside — they mean one tube stop, or a few minutes’ walk, or whatever. If you live in the countryside, “nearby” might be the nearest village or even the nearest town.
So, ages ago we decided that the solution to this was to find some population density data and use it to produce an estimate for what is “nearby”, defined as, “the radius around a point which contains at least N people”. That should capture the difference between rural areas and small and large towns.
(In fairness, the London issue could be solved by having the code understand north vs south of the river as a special case, and never showing North-Londoners pledges in South London. But that’s just nasty.)
Unfortunately the better solution requires finding population density data for the whole world, which is troublesome. There seem to be two widely-used datasets with global coverage: NASA SEDAC’s Gridded Population of the World, and Oak Ridge National Laboratory’s Landscan database. GPW is built from census data and information about the boundaries of each administrative unit for which the census data is recorded, and Landscan improves on this by using remote-sensing data such as the distribution of night-time lights, transport networks and so forth.
(Why, you might wonder, is Oak Ridge National Laboratory interested in such a thing? It is, apparently, “for estimating ambient populations at risk” from natural disasters and whatnot. That’s very worthy, but I can’t help but wonder whether the original motivation for this sort of work may have been a touch more sinister. But what do I know?)
Anyway, licence terms seem to mean that we can use the GPW data and we can’t use the Landscan data, which is a pity, since the GPW data is only really very good in its coverage of rich western countries which produce very detailed census returns on, e.g., a per-municipality basis. Where census returns are only available on the level of regions, the results are less good. Anyway, subject to that caveat, it seems to solve the problem. Here’s a map showing a selection of points, and the circles around them which contain about 200,000 people (that seems to be about the right value for N):
The API to access this will go into the Gaze interface, but it’s not live yet. I’ll document the RESTful API when it is.
One last note, which might be of use to people working with the GPW data in the future. GPW is a cell-based grid: each cell is a region lying between two lines of longitude and two lines of latitude, and within each cell three variables are defined: the population in the cell, the population density of the cell, and the land area of the cell. (This is one of those rare exceptions to Alvy Ray Smith’s rant, A Pixel Is Not A Little Square….) But note that the land area is not the surface area of the cell, and the population density is not the population divided by the surface area of the cell!
This becomes important in the case of small islands; for instance (a case I hit debugging the code) the Scilly Isles. The quoted population density for the Scilly Isles is rather high: somewhere between 100 and 200 persons/km², but when integrating the population density to find the total population in an area, this is absolutely not the right value to use: the proper value there is the total population of a cell, divided by its total surface area. The reason for that is that when sampling from the grid to find the value of the integrand (the population density) you don’t know, a priori, whether the point you’re sampling at has landed on land or non-land, but the quoted population density assumes that you are always asking about the land. When integrating, the total population of each cell should be “smeared out” over the whole area of the cell. If you don’t do this then you will get very considerable overestimates of the population in regions which contain islands.
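In code, the distinction amounts to something like this (the cell values below are invented, and working out the overlap areas is left to the caller):

    # Sketch of the right way to sum population inside some region from a
    # GPW-style grid: smear each cell's population over its whole surface
    # area, rather than using the quoted (land-based) density.
    use strict;
    use warnings;

    # Each cell: { population => N, surface_area_km2 => A, overlap_km2 => O },
    # where overlap_km2 is how much of the cell lies inside the region of
    # interest.  The numbers here are invented for illustration.
    my @cells = (
        { population => 2100, surface_area_km2 => 85.3, overlap_km2 => 12.0 },
        { population =>    0, surface_area_km2 => 85.3, overlap_km2 => 85.3 },
    );

    my $total = 0;
    for my $cell (@cells) {
        # Not $cell->{quoted_density}: that is population / land area only.
        my $density = $cell->{population} / $cell->{surface_area_km2};
        $total += $density * $cell->{overlap_km2};
    }
    printf "estimated population in region: %.0f\n", $total;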