I’ve been working on PledgeBank quite a bit recently. As well as adding survey emails asking whether signers have done their pledge, and a feature for people to contact a pledge’s creator, I’ve been fixing numerous bugs that have sprung up along the way. For starters, people on the Isle of Man and the Channel Islands now get a much more helpful error if they try to enter their postcode anywhere on the site, rather than the confusing “postcode not recognised” they were getting previously.
Other errors I found turned out not to be with our code. The PledgeBank test suite (that we run before deploying the site to check it all still works) was throwing lots of warnings about “Parsing of undecoded UTF-8 will give garbage” when it got to the testing of our other language pages. Our code wasn’t doing anything special, and there were multiple places the warning came from – upgrading our libwww-perl removed one, and I’ve submitted bug reports to CPAN for the rest (having patched our copies locally – hooray for open source).
The Perl warnings were at least understandable, though. While tracking down why the site was having trouble sending a couple of emails, I discovered that we had a helper function splitting very long words up to help with word-wrapping – which, when applied to some Chinese text, was cutting a UTF-8 multibyte character in two and invalidating the text. No problem, I think, I simply have to add the “/u” modifier to PHP’s regular expression so that it matches characters and not bytes. This didn’t work, and after much playing I had to submit a bug report to PHP – apparently in PHP “non-space character followed by non-space character” isn’t the same as “two non-space characters in a row”…
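The underlying bug is easy to reproduce in any language that lets you slice bytes rather than characters; here’s a minimal sketch in Python (the function names are illustrative, not our actual helper):

```python
# Sketch of the word-wrapping bug: splitting a long "word" at a fixed
# byte offset can cut a UTF-8 multibyte character in half, producing
# invalid UTF-8. Splitting at a character offset is always safe.

def split_long_word_bytes(word: str, limit: int) -> list:
    """Buggy: splits the UTF-8 encoding at byte boundaries."""
    raw = word.encode("utf-8")
    return [raw[i:i + limit] for i in range(0, len(raw), limit)]

def split_long_word_chars(word: str, limit: int) -> list:
    """Fixed: splits at character boundaries, like PCRE with /u."""
    return [word[i:i + limit] for i in range(0, len(word), limit)]

word = "中文字"  # three characters, but nine UTF-8 bytes
bad = split_long_word_bytes(word, 4)
# bad[0] ends in the middle of a character and cannot be decoded as UTF-8
good = split_long_word_chars(word, 2)  # every chunk is valid text
```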
Unfortunately, PledgeBank is a pretty slow site. Generating the individual pledge page (done by mysociety/pb/web/ref-index.php) can take anything up to 150ms. That’s astonishingly slow, given the speed of a modern computer. What takes the time?
It’s quite hard to benchmark pages on a running web server, but one approach that I’ve found useful in the past is to use an analogue of phase-sensitive detection. Conveniently enough, all the different components of the site — the webserver, the database and the PHP process — run as different users, so you can easily count up the CPU time being used by the different components during an interval. To benchmark a page, then, request it a few times and compute the amount of CPU time used during those requests. Then sleep for the same amount of time, and compute the amount of CPU time used by the various processes while you were sleeping. The difference between the values is an estimate of the amount of CPU time taken servicing your requests; by repeating this, a more accurate estimate can be obtained. Here are the results after a few hundred requests to http://www.pledgebank.com/100laptop, expressed as CPU time per request in ms:
Subsystem    User    System
apache        ~0       ~0
PostgreSQL  55±9      6±4
PHP         83±8      4±4
(The code to do the measurements — Linux-specific, I’m afraid — is in mysociety/bin/psdbench.)
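The arithmetic behind the measurement is simple; here is a sketch of it in Python (this is not the actual psdbench script, and the sampling of per-user CPU time from /proc is left out):

```python
# Sketch of the "phase-sensitive" benchmark arithmetic: sample the
# cumulative CPU time used by a subsystem's user account before and
# after a burst of N requests, then again around an equal-length sleep.
# The difference between the two deltas estimates CPU time per request,
# with background activity cancelled out.

def cpu_per_request_ms(busy_before, busy_after,
                       idle_before, idle_after, n_requests):
    """CPU milliseconds per request for one subsystem.

    Each argument except n_requests is a cumulative CPU-seconds reading
    (e.g. summed over /proc/<pid>/stat for all of that user's processes).
    """
    during_requests = busy_after - busy_before   # requests + background
    during_sleep = idle_after - idle_before      # background only
    return (during_requests - during_sleep) * 1000.0 / n_requests

# e.g. if the PHP processes used 9.1s of CPU during 100 requests and
# 0.8s during an equal-length sleep, that's roughly 83 ms per request.
print(cpu_per_request_ms(0.0, 9.1, 0.0, 0.8, 100))
```

Repeating the request/sleep cycle and averaging is what narrows the ± error bars in the table above.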
So that’s pretty shocking. Obviously if you spend 150ms of CPU time on generating a page then the maximum rate at which you can serve users is ~1,000 / 150 requests/second/CPU, which is pretty miserable given that Slashdot can relatively easily drive 50 requests/second. But the really astonishing thing about these numbers is the ~83ms spent in the PHP interpreter. What’s it doing?
The answer, it turns out, is… parsing PHP code! Benchmarking a page which consists only of this:
<?
/* ... */
require_once '../conf/general';
require_once '../../phplib/db.php';
require_once '../../phplib/conditional.php';
require_once '../phplib/pb.php';
require_once '../phplib/fns.php';
require_once '../phplib/pledge.php';
require_once '../phplib/comments.php';
require_once '../../phplib/utility.php';
exit;
?>
reveals that simply parsing the libraries we include in the page takes about 35ms per page view! PHP, of course, doesn’t parse the code once and then run the bytecode in a virtual machine for each page request, because that would be too much like a real programming language (and would also cut into Zend’s market for its “accelerator” product, which is just an implementation of this obvious idea for PHP).
So this is bad news. The neatest approach to fixing this kind of performance problem is to stick a web cache like squid in front of the main web site; since the pledge page changes only when a user signs the pledge, or a new comment is posted, events which typically don’t occur anywhere near as frequently as the page is viewed, most hits ought to be servable from the cache, which can be done very quickly indeed. But it’s no good to allow the pledge page to just sit in cache for some fixed period of time (because that would be confusing to users who’ve just signed the pledge or written a comment, an effect familiar to readers of the countless “Movable Type” web logs which are adorned with warnings like, “Your comment may take a few seconds to appear — please don’t submit twice”). So to do this properly we have to modify the pledge page to handle a conditional GET (with an If-Modified-Since: or If-None-Match: header) and quickly return a “304 Not Modified” response to the cache if the page hasn’t changed. Unfortunately if PHP is going to take 35ms to process such a request (ignoring any time in the database), that still means only 20 to 30 requests/second, which is better but still not great.
(For comparison, a mockup of a perl program to process conditional GETs for the pledge page can serve each one in about 3ms, which isn’t much more than the database queries it uses take on their own. Basically that’s because the perl interpreter only has to parse the code once, and then it runs in a loop accepting and processing requests on its own.)
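The shape of such a conditional-GET handler is worth spelling out; here is a sketch in Python (the Perl mockup itself isn’t shown here, and these names are illustrative):

```python
# Sketch of answering a conditional GET: compare the client's
# If-Modified-Since value against the pledge's last-change time and
# return 304 Not Modified, without rendering anything, when the cached
# copy is still current.

from email.utils import formatdate, parsedate_to_datetime

def maybe_not_modified(request_headers, last_change_epoch):
    """Return a (status, headers) pair; 304 when the cache is fresh."""
    ims = request_headers.get("If-Modified-Since")
    if ims is not None:
        try:
            cached = parsedate_to_datetime(ims).timestamp()
        except (TypeError, ValueError):
            cached = None
        if cached is not None and last_change_epoch <= cached:
            return 304, {}
    return 200, {"Last-Modified": formatdate(last_change_epoch, usegmt=True)}

# A cache revalidating a copy made after the last signature gets a 304;
# a first visit, or a stale copy, gets the full page with Last-Modified.
```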
However, since we unfortunately don’t have time to rewrite the performance-critical bits of PledgeBank in a real language, the best we can do is to try to cut the number of lines of library code that the site has to parse on each page view. That’s reduced the optimal case for the pledge page — where the pledge has not changed — to this:
<?
/* ... */
require_once '../conf/general';
require_once '../../phplib/conditional.php';
require_once '../../phplib/db.php';

/* Short-circuit the conditional GET as soon as possible -- parsing the rest
 * of the includes is costly. */
if (array_key_exists('ref', $_GET)
        && ($id = db_getOne('select id from pledges where ref = ?', $_GET['ref']))
        && cond_maybe_respond(intval(db_getOne(
                'select extract(epoch from pledge_last_change_time(?))', $id))))
    exit();
/* ... */
?>
— that, and a rewrite of our database library so that it didn’t use the gigantic and buggy PEAR one, has got us up to somewhere between 60 and 100 reqs/sec, which while not great is enough that we should be able to cope with another similar Slashdotting.
For other pages where interactivity isn’t so important, life is much easier: we can just emit a “Cache-Control: max-age=…” header, which tells squid that it can re-use that copy of the page for however long we specify. That means squid can serve such a page, for instance the front page, at about 350 reqs/sec; unfortunately the front page isn’t all that important (most users come to PledgeBank for a specific pledge).
There’s a subtlety to using squid in this kind of (“accelerator”) application which I hadn’t really thought about before. What page you get for a particular URL on PledgeBank (as on lots of other sites) varies based on the content of various headers sent by the user, such as cookies, preferred languages, etc.; for instance, if you have a login cookie, you’ll see a “log out” link which isn’t there if you’re an anonymous user. HTTP is set up to handle this kind of situation through the Vary: header, which the server sends to tell clients and proxies on which headers in the request the content of the response depends. So, if you have login cookies, you should say, “Vary: Cookie”, and if you do content-negotiation for different languages, “Vary: Accept-Language” or whatever.
PledgeBank has another problem. If the user doesn’t have a cookie saying which country they want to see pledges for, the site tries to guess, based on their IP address. Obviously that makes almost all PledgeBank pages potentially uncachable — the Vary: mechanism can’t express this dependency. That’s not a lot of help when your site gets featured on Slashdot!
The (desperately ugly) solution? Patch squid to invent a header in each client request, X-GeoIP-Country:, which says which country the client’s IP address maps to, and then name that in the Vary: header of the outgoing pledges. It’s horrid, but it seems to work.
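The effect of that patch is easier to see outside squid’s source; here is a sketch in Python of what the two halves do (X-GeoIP-Country is our invented header; the others are standard HTTP):

```python
# Sketch of the accelerator trick: the proxy stamps each incoming
# request with a synthetic X-GeoIP-Country header derived from the
# client's IP address, and the application names that header in Vary:
# so the cache keeps separate copies per country.

def proxy_annotate(request_headers, country_for_ip):
    """What patched squid does: invent the header before forwarding."""
    annotated = dict(request_headers)
    annotated["X-GeoIP-Country"] = country_for_ip
    return annotated

def app_response_headers():
    """What the pledge pages send: cache per cookie, language, country."""
    return {"Vary": "Cookie, Accept-Language, X-GeoIP-Country"}

# Two requests for the same URL from a UK and a US address now carry
# different X-GeoIP-Country values, so the cache stores them separately
# instead of serving one country's guess to everybody.
```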
The meeting day voting application (vote often!) that we’ve been mentioning everywhere all week is a new departure for mySociety. In a frantic bid to catch up with the cool kids, it’s our first deployed Ruby on Rails application. This happened because Louise Crow, who kindly volunteered to make it (thanks Louise!), felt like learning Rails. We used to have a policy of using any language, as long as it was open source and began with the letter P (Python/Perl/PHP…). This has now been extended to the letter R!
You can browse the source code in our CVS repository. One interesting thing about Rails applications is that they are structured things, a deployable directory tree. So are mySociety applications.
For example, take a look at PledgeBank’s directory. It’s a mini, well defined filesystem – the ‘web’ directory is the meat of the stuff, but note also ‘web-admin’ for the administrator tools. Include files are tucked away in ‘perllib’ and ‘phplib’, while script files nestle under ‘bin’. We keep configuration files (analogous to the Windows Registry, or /etc on Unix) under ‘conf’. Database schema files live in ‘db’.
And a Rails application is much the same, but much, much more detailed. Some of those are extra directories which we also have, but only when we deploy, not in CVS (for example, log files). All in all they are surprisingly similar structures, which shows we’re either both on the right lines, or both on the same false trail.
Like making Frankenstein’s monster, poor Louise and I had to graft these two beasts together just to deploy this small application. For example, we have a standard configuration file format which we read from Perl, Python and PHP. The deploy system does useful things with it like check all entries are present, and generate the file for any sandbox from a template. To get round this, there’s an evil script, possibly the first time PHP has been used to make YAML. (And please don’t look at the thing that makes symlinks.)
We could have extended Rails to be able to read its configuration from our file format, but that would be a lot more work. And we could have discovered how to hack its log file system to write to the mySociety log file directory. But everything is so coupled, it doesn’t ever seem worth it. Any Rails apps we deploy will just have to be an even more confusing mass of directories, application trees inside application trees.
We decided to be particularly careful about the new WriteToThem statistics, and did lots of checks on the data. In particular, we wanted to make sure we didn’t unfairly impugn MPs for whom we had bad contact details. It is possible that for a period of time we thought we had their details but got them wrong: we were sending messages to an incorrect address, and their constituents were unknowingly, and unfairly, reporting them as unresponsive in the questionnaire.
So, I wrote a script which generates the statistics, and tries to spot such cases. During the 2005 period, it breaks each MP’s time up into intervals according to when we changed our contact details for them. We can do this because every change we make to contact details is recorded in dadem’s database (see the representatives_edited table).
I’m going to sound a bit like Donald Rumsfeld here, but bear with me. For each interval, we either have good or bad contact details, and we either know or we don’t know that they are good or bad. If we know that they’re bad (e.g. we have no details at all), then that interval isn’t a problem. No messages will have been sent, WriteToThem will have apologised to constituents trying to send messages, and no questionnaires will have been sent out. Any questionnaire results we have from good intervals can still be used and will be fine.
The case when we think we have contact details is harder. The script does some simple checks to work out if they were really valid. For example, if there have been at least a couple of questionnaire responses, and none were affirmative, then it is a bit suspicious. The script applies a threshold to the length of suspicious intervals, and outputs as “unknown” any MPs for whom it thinks there may have been a problem for long enough to matter.
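The heart of that check can be sketched in a few lines of Python (the threshold, the minimum response count and the category names here are illustrative, not the real script’s values):

```python
# Sketch of flagging suspicious contact-detail intervals: an interval
# where we *thought* we had good details, but every questionnaire
# response was negative, may mean the details were actually wrong.

SUSPICIOUS_DAYS_THRESHOLD = 30   # illustrative value only
MIN_RESPONSES = 2

def classify_interval(had_details, responses, length_days):
    """Return 'ok', 'known-bad' or 'suspicious' for one interval.

    responses is a list of booleans: True for an affirmative
    questionnaire answer ("yes, my MP replied"), False otherwise.
    """
    if not had_details:
        return "known-bad"       # no messages sent, nothing to impugn
    if (len(responses) >= MIN_RESPONSES and not any(responses)
            and length_days >= SUSPICIOUS_DAYS_THRESHOLD):
        return "suspicious"      # enough negative answers for long enough
    return "ok"

# e.g. 60 days with details on file and three "no" answers comes out
# 'suspicious', and that MP gets checked by hand rather than being
# marked unresponsive.
```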
Tom then heroically checked all those MPs. Some we’ve marked as “WriteToThem had possibly bad contact details for this MP” in the league table. For others, we managed to verify that the questionable email or fax that we had (either via the MP’s own website, or by ringing up their office) was actually good. The script then spits out, of all things, a PHP file, which you might find useful on your own websites. It contains the complete detailed results. Make sure you look at the “category” for each MP. That indicates if we had too little data, or bad contact details, amongst other things.
Why PHP? And why not update the stats in real time? We’ve decided to make new statistics just once a year. Firstly, this is much easier to describe: we can say, for example on TheyWorkForYou (where the responsiveness data also appears), that it is for the ‘year 2005’. Secondly, it lets us do the manual checking, so we are more confident about our data. Thirdly, it’s good for publicity to announce the new statistics as a news story. And finally, it is much easier to manage an unchanging text file (e.g. the PHP file), stored forever in CVS, than it would be to manage an ephemeral table in a database somewhere.
After all that, we mailed or faxed all the still-sitting MPs who scored 100% responsiveness, to congratulate them on a job well done. Greg Pope, Richard Page, Fraser Kemp, Thomas McAvoy, Bob Laxton, Mark Simmonds, Paul Stinchcombe, Dennis Turner, Nick Ainger, Alan Meale, Adrian Sanders, Tom Cox, Andrew Hunter, Robert Key, Andrew Selous, John Wilkinson, Paul Goodman, Gwyneth Dunwoody, David Evennett, Peter Atkinson, Andrew Bennett, George Young, Terry Lewis, Douglas Hogg, Patrick Cormack, Andrew Robathan, David Stewart, Colin Challen and Harry Barnes: to all these MPs and all your staff, congratulations! (The list includes some who are no longer MPs, for example those who stood down at the General Election.)
So, it’s my turn to write something here. Well, this’ll be short. As Francis mentions below, PledgeBank launched today, and everything’s gone reasonably smoothly, with the exception of some tedious PHP bug we haven’t tracked down yet. With any luck the new version of PHP will fix it, so there won’t be hours of painful debugging to do. (I could bore you for hours with my opinions of PHP — actually, I could probably shorten that quite a lot if I were allowed to swear, but this is a family-friendly ‘blog — but let’s just say that fixing PHP problems Is Not My Favourite Job.)
Instead I’ll say what I’m doing right now, which is beginning to add geographical lookup to PledgeBank. At the moment we ask pledge authors for a country (though it can either be “UK” or “global” at the moment), and, if they’re in the UK, a postcode. The idea is to “georeference” (i.e. look up the coordinates of) the postcode, though we don’t actually do that yet. So I’m modifying the database a bit to store coordinates (as a latitude/longitude, so that we don’t have to write a separate case for every wacky national coordinate grid) and generalise the notion of “country” so that we can let non-Brits actually put in their own countries when they create pledges.
Other things we’ve discovered today:
- People are confused by the “(suspicious signer?)” link next to signatures on each pledge page — several people thought that we were reporting our suspicions of the signer. You probably think that’s stupid, but if so that’s only because you’re familiar with sites that have this sort of retroactive moderation button everywhere. Actually it’s us that’s being stupid and we’re going to remove it until we have a better way to implement it — at the moment we think we’re mostly on top of the occasional joke/abusive signature.
- People are confused by the pledge signature confirmation mail, which currently reads (for instance),
Please click on the link below to confirm your signature on the pledge at the bottom of this email.
The pledge reads:
‘Phil Booth will refuse to register for an ID card and will donate £10 to a legal defence fund but only if 10,000 other people will also make this same pledge.’
— the PledgeBank.com team
We got several emails from people saying “your site has got my name wrong — I’m not Phil Booth”. The point is that Phil Booth wrote the pledge, so it’s in his name; the email reflects that. But that’s not obvious to the signer, and since the only name in the body of the email isn’t theirs, they think it’s got it wrong and complain. (To be fair, only three out of ~1,100 did, but that’s still bad.) This is the sort of problem we need user testing to spot: none of us saw anything wrong with the text when we were testing it. So we need to reword that.
- We’ve had several people email to say that they’d like to do versions of PledgeBank in their own countries, and we’d like to hear from anyone interested in localising the mySociety projects who has time, expertise or even just opinions to donate. If that’s you, please get in touch!
And probably some other stuff, but I said this post would be short….
Last night I should have gone to bed early, but these things being how they are I stayed up late having tea with my housemate and his friend. I wanted to get up early, because I knew a few things needed tidying before we started getting media coverage, so I set my alarm. I haven’t done that for work for years! So I’m a bit sleepy.
The most important early thing I did was make the front page featured pledges appear in a random order, for more fairness and serendipity. Late last night Chris had added code to fuzzily match pledges which somebody has typed in. It uses the database to count the number of common three-letter substrings, so if you type in “http://www.pledgebank.com/suirname” it gives a nice error page suggesting you go to “http://www.pledgebank.com/Suriname”. It’s pretty good, and all I had to do was tidy up the text a bit, and add it to the search page as well.
By that time everyone else was up, and the no2id people were publicising their pledge. We were all on IRC, and tailing various logfiles. There were quite a few minor tidy ups for us to make to the launch pledges that were made over the weekend, changing text and signup numbers for the creators a bit.
Someone spotted that the “all pledges” page had the wrong calculated count for one of the pledges. This was very odd, as it was right for all the others. I downloaded a fresh dump of the database to my local machine, where everything was fine. Meanwhile, Chris noticed the PHP server was crashing. After more investigation, we found a subtle bug was creating a corrupt PHP variable. Calling “gettype” on it caused the PHP process to stop with an error, and calling number_format crashed the whole thing. We’re still not sure quite what PHP bug caused this, and need to investigate it more. But we found a simple workaround which stopped it causing any more trouble.
You always find all the bugs when your traffic goes up! That’s why a staged beta, getting larger and larger (today is in many ways the next phase), is the way to go.
We’ve been spending the last few days adding a more comprehensive login/authentication system to the PledgeBank code. At the moment, PledgeBank checks your email address for every action that you take. In the new system you can still get it to email you if you like, or if you prefer you can set a password. It will also use session cookies to remember that you are logged in. The plan is to use the better login system to let pledge creators do more things, like email signers during the campaign, and upload a photo to go with their pledge.
This has taken quite a radical overhaul of the codebase, and the database schema. There’s now a “person” table, which is really just an email address. Chris has made a lovely elegant system, where you can just call “person_signon” in some PHP code. Then it goes away, and makes sure they are authenticated. This might be immediate, if they are already logged in. It might require a password, or it might require emailing them. Whichever way, when they come back (possibly via a link in an email), it restores the request and goes back to the page which required authentication.
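The deferred-request idea at the heart of this is worth sketching; here it is in Python (the function names, and using dicts in place of database tables, are illustrative, not the real person_signon API):

```python
# Sketch of the deferred-request idea behind person_signon: if the user
# isn't authenticated, stash the pending request under a random token,
# email the token to them, and replay the stashed request when the
# token comes back via the confirmation link.

import secrets

PENDING = {}      # token -> (email, request); a database table in real life
SESSIONS = set()  # emails with a live session cookie

def signon(email, request):
    """Proceed at once if authenticated, else stash and email a token."""
    if email in SESSIONS:
        return ("proceed", request)
    token = secrets.token_urlsafe(16)
    PENDING[token] = (email, request)
    return ("emailed-token", token)

def confirm(token):
    """User clicked the emailed link: log them in, restore the request."""
    email, request = PENDING.pop(token)
    SESSIONS.add(email)
    return ("proceed", request)
```

The nice property is that the page requiring authentication doesn’t care which path was taken: either way it eventually receives the original request.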
In total, this will almost be a net deletion of lines of code, when the existing token systems are fully removed. Meanwhile, I’m testing and debugging it like crazy. And we’ve got to work out how to deploy the code without breaking anyone mid-signing at the moment we upgrade it. Upgrading not just the engine but the transmission as well, while the car is running.
mySociety is looking for a third and final core developer. We’re looking for a usability-obsessed PHP developer – does this sound like anyone you know?