So, I’ve just had a shower and I’m waiting for Matthew and Tom to turn up. As time goes on, mySociety seems to get more geographically disparate, and I look forward to meeting my coworkers. Then for 1pm we’ll be heading to CB2 for the mySociety developers meeting. Feel free to come along any time afternoon or evening, whatever your skills or interest in mySociety.
I haven’t posted on here for ages, since October. I’ve been away on holiday quite a lot, and when I’ve not been away I’ve been busy, partly with systems administration. We’ve set up lots of servers in the last month for the E Petitions site. When you go from 3 servers to 7 servers, there’s another step change in sorting out systems administration tools. For example, I had to change the monitoring script so every server wouldn’t monitor every other. And I had to work out the quirks and bugs in the system we have for storing config files for different classes of server in CVS. Because we only had one class of server before.
I’ve also had to learn lots about server monitoring and load balancing. Things have slowed down a bit now, to maybe 10 hits per second. But a few weeks ago the road pricing petition was often getting 50 hits per second. I’ve never worked on a site with that level of traffic before. You find all the bugs in your code, all the missing indices in PostgreSQL, all the badly tweaked FastCGI parameters. I’ve been sucking knowledge off Chris like a sponge, so tools like strace and vmstat begin to become instinctive. Seemingly nobody offers a book or a course which teaches this stuff well – every server setup is different, everyone knows different ways to tune and profile. But maybe you can tell me different in the comments.
Louise has been busily working away on lots of things. Amongst them is a major change to WriteToThem, to let you write to all the members in a multi-member constituency in one go. For the last day or two, she’s been installing the WriteToThem test code on one of our servers, where before it has only run on my laptop. This will be fantastic – hopefully we can get Matthew to be bolder about making changes to WriteToThem, if he has a test script he can easily run (getting Matthew to be bold isn’t normally a problem, but he seems mildly less bold when it comes to the WriteToThem queue daemon).
Tom and I have also been busy on a second travel maps report for the DfT. More on that soon. Lots of running screen scraping jobs on TransportDirect which take days. On the subject of Tom, he seems to have got expert at “stacking meetings” next to each other. In one day last week he had 7 meetings!
Much of my August seems to have been absorbed with maintenance tasks.
For example, Chris and I spent a few days tightening up WriteToThem’s privacy. I made sure the privacy statement correctly describes what happens with backup files, and failed messages. I reduced the timeouts on how long we keep the body of failed messages. I made sure we delete old backup files of the WriteToThem database. I wrote scripts to run periodically to check that no bugs in our queueing daemon can accidentally mean we keep the body of messages for longer than we say. I added a cron job to delete Apache log files older than a month for all our sites. As AOL know to their cost, the only really private data is deleted data.
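As an illustration of the log-expiry job, here is a minimal sketch in Python. It is not the actual cron job; the 30-day cutoff is my reading of "a month", and the function names and the directory it is pointed at are assumptions.

```python
import time
from pathlib import Path

MAX_AGE_DAYS = 30  # assumption: "older than a month" approximated as 30 days


def old_files(directory, max_age_days=MAX_AGE_DAYS, now=None):
    """Return regular files in `directory` last modified before the cutoff."""
    now = time.time() if now is None else now
    cutoff = now - max_age_days * 86400
    return sorted(
        p for p in Path(directory).iterdir()
        if p.is_file() and p.stat().st_mtime < cutoff
    )


def delete_old_files(directory, max_age_days=MAX_AGE_DAYS, now=None):
    """Delete the files found by old_files() and return their paths."""
    doomed = old_files(directory, max_age_days, now)
    for path in doomed:
        path.unlink()
    return doomed
```

Run daily from cron against each site's log directory, something like this would keep the retention promise independent of how the logs are rotated.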
Earlier in the month, I handled some WriteToThem support email for the first time in ages. We get a couple of hundred messages a week, which Matthew mainly slogs through. It’s good for morale to do it, as we get quite a lot of praise mail. It is also hard work, as you realise how complicated even our simple site and the Internet are, and it leads to fixing bugs and improving text on the site. I made a few improvements to our administration tools, and things like the auto-responder if people reply to the questionnaire, to try and reduce the amount of support email, and make it easier to handle.
I did some more work on the geographically cascading pledges (like this prototype one), but I’m still not happy with them. In the end, I realised that it is the structure of the wording of the pledge that is the key problem. Our format of “I will A but only if N others will B” just isn’t easily adapted to get across that the pledge applies separately in different geographical areas. Working out how to fix that is one of the things we’ll brainstorm about in the Lake District (see below).
The last couple of days I’ve been configuring one of our new servers, which is called Balti, and getting the PledgeBank test harness working on it. Until now, it has only been run on my laptop. This is partly heading towards making a proper test harness for the ePetitions site, running on a server so we can properly test that nothing is broken before deploying a new version.
Matthew has wrapped up the TheyWorkForYou API now, and is working on Neighbourhood Fixit next. Chris has been doing lots more performance work for the e
Tom’s in Berlin at the moment, he gave a talk last night, and I think has been to see some people from Politik Digital. As we’ve been discussing on the mySociety email list, there’s an EU grant we’re likely to apply for in collaboration with them.
On Friday, we’re all going to the Lake District for a week, with some of the trustees and volunteers intermittently. We very conveniently and cheaply all work from home, so it’s good and necessary to meet up for a more sustained period of time at least once a year. Last year we were in Wales.
All mySociety’s servers are named after British food and drink – tea, cake, pimms etc. A couple of weeks ago we finally set up haggis.ukcod.org.uk as a server for little unofficial projects of all sorts. It is shiny, brand new, bullet fast and looking for people in the mySociety volunteer community to love it.
The first user is one of our most regular volunteers, Sam Smith, who has migrated and upgraded his never-quite-officially-launched site TheGovernmentSays.com. If you like the way TheyWorkForYou can email you when politicians talk about words or phrases of interest to you, then you will most likely find this indispensable – the same functionality but for government press releases and news from across the public sector.
If you’d be interested in an account on Haggis in order to do some mySocietyish work, just let us know.
Unfortunately, PledgeBank is a pretty slow site. Generating the individual pledge page (done by mysociety/pb/web/ref-index.php) can take anything up to 150ms. That’s astonishingly slow, given the speed of a modern computer. What takes the time?
It’s quite hard to benchmark pages on a running web server, but one approach that I’ve found useful in the past is to use an analogue of phase-sensitive detection. Conveniently enough, all the different components of the site — the webserver, the database and the PHP process — run as different users, so you can easily count up the CPU time being used by the different components during an interval. To benchmark a page, then, request it a few times and compute the amount of CPU time used during those requests. Then sleep for the same amount of time, and compute the amount of CPU time used by the various processes while you were sleeping. The difference between the values is an estimate of the amount of CPU time taken servicing your requests; by repeating this, a more accurate estimate can be obtained. Here are the results after a few hundred requests to http://www.pledgebank.com/100laptop, expressed as CPU time per request in ms:
Subsystem     User    System
apache        ~0      ~0
PostgreSQL    55±9    6±4
PHP           83±8    4±4
(The code to do the measurements — Linux-specific, I’m afraid — is in mysociety/bin/psdbench.)
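To make the idea concrete, here is a rough Python sketch of the same kind of measurement. It is not psdbench itself: the function names are my own, and reading /proc/<pid>/stat is just one plausible way of counting CPU time per user on Linux.

```python
import os
import statistics


def cpu_ticks_by_uid(proc="/proc"):
    """Sum user and system CPU ticks of live processes, grouped by owner uid.

    Linux-specific: reads utime and stime (fields 14 and 15) of /proc/<pid>/stat.
    """
    totals = {}
    for entry in os.listdir(proc):
        if not entry.isdigit():
            continue
        try:
            uid = os.stat(os.path.join(proc, entry)).st_uid
            with open(os.path.join(proc, entry, "stat")) as f:
                # The command name is parenthesised and may contain spaces,
                # so split on the last ")" before splitting on whitespace.
                fields = f.read().rsplit(")", 1)[1].split()
        except (OSError, IndexError):
            continue  # the process exited while we were looking at it
        user, system = totals.get(uid, (0, 0))
        totals[uid] = (user + int(fields[11]), system + int(fields[12]))
    return totals


def per_request_cost(busy_ms, idle_ms, requests_per_burst):
    """Estimate CPU ms per request from paired busy/idle measurements.

    busy_ms[i] is the CPU time one user's processes used while we made a
    burst of requests; idle_ms[i] is what they used during an equally long
    sleep straight afterwards. Returns (mean, standard error of the mean).
    """
    if len(busy_ms) != len(idle_ms) or len(busy_ms) < 2:
        raise ValueError("need at least two paired measurements")
    diffs = [(b - i) / requests_per_burst for b, i in zip(busy_ms, idle_ms)]
    return statistics.mean(diffs), statistics.stdev(diffs) / len(diffs) ** 0.5
```

Sampling cpu_ticks_by_uid() before and after each burst and each sleep, and converting ticks to milliseconds with os.sysconf("SC_CLK_TCK"), gives the inputs for per_request_cost(). One caveat of this particular sketch is that CPU time belonging to processes which exit between samples is lost.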
So that’s pretty shocking. Obviously if you spend 150ms of CPU time on generating a page then the maximum rate at which you can serve users is ~1,000 / 150 requests/second/CPU, which is pretty miserable given that Slashdot can relatively easily drive 50 requests/second. But the really astonishing thing about these numbers is the ~83ms spent in the PHP interpreter. What’s it doing?
The answer, it turns out, is… parsing PHP code! Benchmarking a page which consists only of this:
<?
/* ... */
require_once '../conf/general';
require_once '../../phplib/db.php';
require_once '../../phplib/conditional.php';
require_once '../phplib/pb.php';
require_once '../phplib/fns.php';
require_once '../phplib/pledge.php';
require_once '../phplib/comments.php';
require_once '../../phplib/utility.php';
exit;
?>
reveals that simply parsing the libraries we include in the page takes about 35ms per page view! PHP, of course, doesn’t parse the code once and then run the bytecode in a virtual machine for each page request, because that would be too much like a real programming language (and would also cut into Zend’s market for its “accelerator” product, which is just an implementation of this obvious idea for PHP).
So this is bad news. The neatest approach to fixing this kind of performance problem is to stick a web cache like squid in front of the main web site; since the pledge page changes only when a user signs the pledge, or a new comment is posted, events which typically don’t occur anywhere near as frequently as the page is viewed, most hits ought to be servable from the cache, which can be done very quickly indeed. But it’s no good to allow the pledge page to just sit in cache for some fixed period of time (because that would be confusing to users who’ve just signed the pledge or written a comment, an effect familiar to readers of the countless “Movable Type” web logs which are adorned with warnings like, “Your comment may take a few seconds to appear — please don’t submit twice”). So to do this properly we have to modify the pledge page to handle a conditional GET (with an If-Modified-Since: or If-None-Match: header) and quickly return a “304 Not Modified” response to the cache if the page hasn’t changed. Unfortunately if PHP is going to take 35ms to process such a request (ignoring any time in the database), that still means only 20 to 30 requests/second, which is better but still not great.
(For comparison, a mockup of a perl program to process conditional GETs for the pledge page can serve each one in about 3ms, which isn’t much more than the database queries it uses take on their own. Basically that’s because the perl interpreter only has to parse the code once, and then it runs in a loop accepting and processing requests on its own.)
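The throughput ceilings quoted above all come from the same back-of-envelope sum, which can be written out as a tiny Python sketch. The function names are mine, and the model assumes CPU time is the only bottleneck.

```python
def max_requests_per_second(cpu_ms_per_request, cpus=1):
    """Upper bound on throughput if each request burns this much CPU time."""
    return cpus * 1000.0 / cpu_ms_per_request


def cpus_needed(target_requests_per_second, cpu_ms_per_request):
    """CPUs required to sustain a target rate, ignoring all other overheads."""
    return target_requests_per_second * cpu_ms_per_request / 1000.0


for label, cost_ms in [("full page", 150), ("PHP parse only", 35), ("perl mockup", 3)]:
    print(f"{label}: {max_requests_per_second(cost_ms):.1f} requests/second/CPU")
print(f"CPUs for 50 requests/second at 150ms each: {cpus_needed(50, 150):.1f}")
```

That works out at roughly 6.7, 28.6 and 333.3 requests/second/CPU for the three cases, and 7.5 CPUs to keep up with a Slashdot-sized 50 requests/second at 150ms a page.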
However, since we unfortunately don’t have time to rewrite the performance-critical bits of PledgeBank in a real language, the best we can do is to try to cut the number of lines of library code that the site has to parse on each page view. That’s reduced the optimal case for the pledge page — where the pledge has not changed — to this:
<?
/* ... */
require_once '../conf/general';
require_once '../../phplib/conditional.php';
require_once '../../phplib/db.php';

/* Short-circuit the conditional GET as soon as possible -- parsing the rest of
 * the includes is costly. */
if (array_key_exists('ref', $_GET)
    && ($id = db_getOne('select id from pledges where ref = ?', $_GET['ref']))
    && cond_maybe_respond(intval(db_getOne('select extract(epoch from pledge_last_change_time(?))', $id))))
    exit();

/* ... */
?>
— that, and a rewrite of our database library so that it didn’t use the gigantic and buggy PEAR one, has got us up to somewhere between 60 and 100 reqs/sec, which while not great is enough that we should be able to cope with another similar Slashdotting.
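For readers who haven't met conditional GETs, the decision the short-circuit has to make looks roughly like the Python sketch below. It is a stand-in, not a translation of cond_maybe_respond: it handles only If-Modified-Since (not If-None-Match), and the function name and return shape are assumptions.

```python
from email.utils import formatdate, mktime_tz, parsedate_tz


def conditional_response(request_headers, last_change_epoch):
    """Return (status, headers) for a GET, honouring If-Modified-Since.

    last_change_epoch is when the page last changed, in seconds since the
    epoch (the role played by pledge_last_change_time in the PHP above).
    """
    headers = {"Last-Modified": formatdate(last_change_epoch, usegmt=True)}
    since = request_headers.get("If-Modified-Since")
    if since:
        parsed = parsedate_tz(since)  # None if the date can't be parsed
        if parsed is not None and int(last_change_epoch) <= mktime_tz(parsed):
            return 304, headers  # the cache's copy is still current
    return 200, headers  # caller goes on to render the full page
```

The point of doing this check first is that the 304 path needs only the last-change time, so none of the expensive page rendering (or, in PHP's case, library parsing) has to happen.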
For other pages where interactivity isn’t so important, life is much easier: we can just emit a “Cache-Control: max-age=…” header, which tells squid that it can re-use that copy of the page for however long we specify. That means squid can serve that page at about 350reqs/sec; unfortunately the front page isn’t all that important (most users come to PledgeBank for a specific pledge).
There’s a subtlety to using squid in this kind of (“accelerator”) application which I hadn’t really thought about before. What page you get for a particular URL on PledgeBank (as on lots of other sites) varies based on the content of various headers sent by the user, such as cookies, preferred languages, etc.; for instance, if you have a login cookie, you’ll see a “log out” link which isn’t there if you’re an anonymous user. HTTP is set up to handle this kind of situation through the Vary: header, which the server sends to tell clients and proxies on which headers in the request the content of the response depends. So, if you have login cookies, you should say, “Vary: Cookie”, and if you do content-negotiation for different languages, “Vary: Accept-Language” or whatever.
PledgeBank has another problem. If the user doesn’t have a cookie saying which country they want to see pledges for, the site tries to guess, based on their IP address. Obviously that makes almost all PledgeBank pages potentially uncachable — the Vary: mechanism can’t express this dependency. That’s not a lot of help when your site gets featured on Slashdot!
The (desperately ugly) solution? Patch squid to invent a header in each client request, X-GeoIP-Country:, which says which country the client’s IP address maps to, and then name that in the Vary: header of the outgoing pledges. It’s horrid, but it seems to work.
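To see why the invented header helps, here is a small Python model of how a cache picks a stored variant using Vary:. It is a simplification and not how squid is implemented internally; the function names and the IP-to-country lookup passed in are assumptions for illustration.

```python
def cache_key(url, request_headers, vary):
    """Build the key a cache would use for one variant of `url`.

    `vary` is the value of the response's Vary: header. Only the request
    headers it names take part in the key.
    """
    names = sorted(h.strip().lower() for h in vary.split(",") if h.strip())
    lowered = {k.lower(): v for k, v in request_headers.items()}
    return (url,) + tuple((n, lowered.get(n, "")) for n in names)


def add_geoip_header(request_headers, client_ip, country_for_ip):
    """Mimic the squid patch: stamp the request with the client's country.

    country_for_ip is a hypothetical lookup function (IP address -> code).
    """
    stamped = dict(request_headers)
    stamped["X-GeoIP-Country"] = country_for_ip(client_ip)
    return stamped
```

With "Vary: Cookie, X-GeoIP-Country", two anonymous visitors from the same country map to the same key and can share a cached page, while a visitor from another country gets a separate variant. Without the stamped header the cache has nothing to key on, because the client's IP address is not a request header.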
Chris and I are rapidly tiring ourselves out with server configuration. Well, I speak for myself, but he can correct for himself in the comments. We’re moving everything from the one old server it has run on for the last year and a quarter (very) onto two new identical servers (bitter and tea). And at the same time we’ve been putting everything in CVS – every little piece of configuration and cron job that there is, for the servers and for the sites.
It’s amazing how many systems there are, and how many things to worry about. Security, backups, redundancy (we’re not too hot at that – recommendations for PostgreSQL mirror/cluster/live-backup type things welcome), admin authorisation, SSL, templated /etc files, users, groups, packages, cron, anonymous cvs, web statistics, service monitoring… It just goes on and on and on.
And that’s without listing the sites – HearFromYourMP, PledgeBank, NotApathetic, WriteToThem, mySociety (.org). And services – services.mysociety.org, gaze.mysociety.org, secure.mysociety.org, debian.mysociety.org, cvs.mysociety.org, rt.mysociety.org (request tracker, Matthew has been setting up recently).
But when we’re done, everything about the containment of our applications will be configured in CVS. We can install a new server in a trice (honestly!). New developer sandboxes can be configured in a few seconds. Everything is logged and backed up.
We’re gradually doing all of the above, but we’re not done yet.
(Shh, don’t tell anyone, but this post is really just so the bots find debian.mysociety.org, but I’m going to try and fill it with some other content so you don’t think I’m being too rude)
Debian’s software “packaging” system provides a big database of all the open source software in the world, and makes another smaller database of all the software installed on your computer. We’re using it on our new servers, which the sites are gradually migrating to now. When you’ve got security updates, multiple machines, and complex software dependencies, you need it.
Unfortunately, though it seems like the Debian people have packaged nearly all the software in the world, sometimes they miss things. Normally we’d just install them using the old Unix configure/make/make install. This time we’ve decided to do it properly, and make our own Debian packages. You can find them at debian.mysociety.org.
The advantage of this is that we can find out where any file on the system came from. We can easily upgrade multiple machines, and check that they all have the same software installed. This makes it much less likely that there’ll be bugs when you go to a corner of one of the websites, and get an error because a perl module wasn’t installed.
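Checking that two machines carry the same software can be as simple as comparing their package lists. The Python sketch below assumes each list was dumped as "name version" lines, for instance with dpkg-query -W -f='${Package} ${Version}\n'; the function names are made up for the example.

```python
def parse_package_list(text):
    """Parse lines of 'name version' into a {name: version} dict."""
    packages = {}
    for line in text.splitlines():
        parts = line.split()
        if len(parts) == 2:
            packages[parts[0]] = parts[1]
    return packages


def package_differences(a, b):
    """Return {name: (version_on_a, version_on_b)} for every mismatch.

    A version of None means the package is not installed on that machine.
    """
    return {
        name: (a.get(name), b.get(name))
        for name in sorted(set(a) | set(b))
        if a.get(name) != b.get(name)
    }
```

An empty result means the two machines match; anything else is exactly the kind of missing perl module that would otherwise only show up as an error in some corner of a site.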
So far there are a few perl module .deb files in our repository, which the handy dh-make-perl builds easily from a perl module tarball. There’s also Xapian (a search engine library), which we use for quick lookups in Gaze (our gazetteer). That had already been packaged by the Xapian people, but for some reason I had to recompile it. Finally there’s one Python module, PyRTF, which makes RTF documents, which I just packaged (probably badly).
Anyway, this post is here to make sure anybody searching for python2.3-pyrtf on Google will find something…
We’re looking for ways to make it easier for volunteers to get involved in mySociety. Like everything in real life this is mostly a question of openness and policy, but there are also a few technical steps we think would make life easier. One of these is to make it easier for us to hand over a test web server to a volunteer or a group of volunteers to develop code on, play with and generally break. At the moment that’s quite hard to do, because we use apache and all our sites are hosted on one machine (yes really — computers are fast and memory is cheap, though in day-to-day life you’d never notice that, because most of the IT industry is involved in developing “technology” — meaning, “programs that don’t work yet” — that are designed to make your computer slow again: Microsoft Windows, Java, modern web browsers, etc. etc.). Apache is monolithic and if one user breaks the configuration of their test site they can bring down all the sites hosted on the machine. Also, apache isn’t very good at crossing security boundaries (arguably that’s a fault of UNIX generally), so unless we’re prepared to give all the volunteers root (not acceptable for policy reasons) they need to hassle us to get things done (not acceptable to them). Indeed, to save time and admin hassle, IVotedForYouBecause was developed and is hosted elsewhere.
So the idea is to strip away all this crap by running lots of instances of apache, and giving one to any group of volunteers who want to play with one of our sites — all running under their own unprivileged UID — and then direct requests from the outside world through to the appropriate internal apache server via a public-facing proxy. The design I’m envisaging looks something like this:
(Actually it’s not clear to me that that diagram conveys anything you won’t have understood anyway, but there we go.) For the front-end server I’m using Squid, which is balky and overcomplicated, but supports one very handy feature which is invaluable in this setup: external URL rewriting scripts, which can be used to redirect requests that come through the cache to other resources. The classic application of this is to redirect requests for advertising and other pointless content to local resources so that they don’t take up bandwidth or break your web browser; in this case we’ll be using it to rewrite requests for certain publicly-visible URLs (“http://fred.pledgebank.com/…” or whatever) into internal URLs which route to individual users’ apache servers (“http://127.0.0.1:8001/…”). One of the nice things about this is that it preserves the Host: header, and (with a further small hack) apache can be persuaded to pretend that requests weren’t proxied at all, so any back-end stuff that needs to know clients’ IP addresses (such as logging, etc.) can be used unmodified. On top of this, squid will cache responses (assuming that we aren’t lazy about the headers we emit on our own content), which may speed things up a bit for certain sites, though I suspect (with little evidence, and none I’m prepared to bore you with now) this won’t be very useful in practice for the types of sites we’re building.
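A rewriting helper of this kind can be very small. The Python sketch below assumes the classic redirector protocol, where squid writes one request per line with the URL as the first field and reads back either a replacement URL or a blank line for "leave it alone"; the exact line format differs between squid versions, so treat that as an assumption. The host-to-port table is hypothetical.

```python
from urllib.parse import urlsplit, urlunsplit

# Hypothetical mapping from public hostnames to per-volunteer back ends.
BACKENDS = {"fred.pledgebank.com": "127.0.0.1:8001"}


def rewrite(url, backends=BACKENDS):
    """Return the internal URL for `url`, or "" to leave it unchanged."""
    parts = urlsplit(url)
    target = backends.get((parts.hostname or "").lower())
    if target is None:
        return ""
    return urlunsplit((parts.scheme, target, parts.path, parts.query, parts.fragment))


def serve(lines, out, backends=BACKENDS):
    """Answer one request line at a time, flushing so squid isn't kept waiting."""
    for line in lines:
        fields = line.split()
        out.write((rewrite(fields[0], backends) if fields else "") + "\n")
        out.flush()
```

Calling serve(sys.stdin, sys.stdout) would turn this into a helper process. Adding a volunteer's sandbox is then one new entry in the table rather than an edit to a shared apache configuration.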
Another attractive feature of this scheme is that it means that we’re not tied to apache: we could use lighttpd or something, if we wanted to. I doubt that a technical reason to do that will arise in practice, but every minute I spend fighting apache configurations is a minute closer to chucking the bloody thing and picking another web server.
So, it’s the usual story: you start off trying to work around the brokenness in one bit of software, and then all sorts of exciting possibilities suddenly open up. At least, that’s one way to look at it.
We’ve just got our first proper server all to ourselves, and Chris Lightfoot and Francis Irving are moving existing stuff over to it. The main news, though, is that you’re invited to our…