1. WhatDoTheyKnow growing pains (and Ruby memory leaks)

    WhatDoTheyKnow keeps growing and growing, sucking people in from Google as its archive of maybe 8.5% of Freedom of Information requests gets more and more detailed.

    Graph of number of FOI requests made using WhatDoTheyKnow over time

    There’s round about 8GB of unfettered Government data in the core database, plus a whole bunch more for indexing and caching. For comparison, TheyWorkForYou (which now goes back to 1935) has 12GB. And it’s catching up on traffic also – WhatDoTheyKnow has about half as many visitors as TheyWorkForYou.

    Unfortunately, this newfound traffic has led to performance problems. You might have seen errors when using WhatDoTheyKnow in the last week or two. This post is firstly an apology for that. Thank you for your patience. Hopefully it is fixed now – do let us know if you still get problems. And secondly it is some techy stuff about debugging such problems in Ruby on Rails…

    When WhatDoTheyKnow started failing, we did the obvious things to start with – moving the database to a separate server, and moving some other services off the same server, to give WDTK more room to breathe. It still kept breaking.

    None of my server monitoring tools shed any very clear light as to the problem. I upgraded to the latest version of Passenger, the best Rails deployment tool I’ve seen yet. It’s pretty good, but still not mature enough for my liking. I was still getting the same problems with it, but reporting tools like passenger-memory-stats were really helpful.

    Eventually I worked out that it was to do with memory use of the Rails processes. Individual ones would leap up to 1GB, and never drop back down. If several did, the server (with 4GB of RAM) would start swapping and grind to a halt. The world of Ruby and Rails memory monitoring software is patchwork at best, and in the end I found the simplest tools the most useful. Here are some:

    • I found some Rails processes were getting jammed, and not dying even when I restarted Apache. I think in the end this was due to the Passenger spawning method, and our use of the Xapian Ruby module. Running Passenger in RailsSpawnMethod conservative mode made things much more robust.
    • Monit, which in a previous life had a job holding up vital structural pillars of buildings with duct tape, makes you feel dirty. Actually it is really useful. Given I couldn’t quickly fix the problem, Monit let me at least reduce the suffering for people trying to use the site meanwhile. Here’s the rule I used, which gives Apache a kick every time server memory use is too high. It was firing every 5 or 10 minutes…
      check system localhost
          if memory > 3500 MB then exec "/usr/sbin/apache2ctl graceful"
    • I found memory_profiler on a blog. It helps you find the kind of memory leak where you unintentionally continue to reference an object you don’t use any more, with a specialist subject of string objects. This led to a fix to do with declaring static arrays in classes vs. modules, which I still don’t really understand. But it wasn’t the cause of the big 1GB memory munching; there were no large enough leaks of this sort.
    • The record_memory function in WDTK’s application controller came from another blog. It’s handy as it shows you how much each request increases the system memory used by the Ruby process. With caveats, this was the best way for me to identify the most damaging requests (search results, and certain public body pages). And it also brought focus on the actual problem – the peak memory use during a request. That’s really important, because Ruby’s memory manager never returns memory to the operating system… The GB leaps in memory use were because of temporary memory used during certain requests, which the Ruby memory manager then never gives back afterwards.
    • I made a bunch of functions culminating in allocated_string_size_around_gc. This was really useful with the “just add lots of print statements and fiddle” school of debugging. Not everyone’s favourite school, but if your test code can’t catch it, one I often end up using (it gets really involved rarely enough that it doesn’t seem worth setting up an interactive debugger). It led me to various peak memory savings, such as calling “text.gsub!” rather than “text = text.gsub” while removing email addresses and private information from FOI request responses, which helps quite a bit when dealing with multi-megabyte attachments.
    • Finally, I used the overlooked debugging tool, and the one you should never rely on: common sense. That is, common sense informed by days of careful use of all the other tools. In order to quickly show text extracts when searching, WDTK stores the extracted attachment text in the database. A few of these attachments are quite large, and led to 50MB fields, often several of which were being loaded and processed in one page request. That this would cause a high peak of memory use only became obvious to me some time yesterday. I checked that that was the case, and this morning, I changed it to use the full text for indexing, but to keep at most 1MB for use in snippets. So sometimes now you won’t get a good search extract for queries, but it is rare, and it will at least still return the right result.
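
    The last change in that list can be sketched roughly as follows. This is an illustration in Python rather than WDTK’s actual Ruby code: the indexer would still be handed the full text, and only a capped prefix is stored for snippets. The names MAX_SNIPPET_BYTES and text_for_snippets are made up for this example, and measuring the 1MB cap in UTF-8 bytes is an assumption.

```python
MAX_SNIPPET_BYTES = 1024 * 1024  # assumed cap: 1MB, measured in UTF-8 bytes


def text_for_snippets(full_text, limit=MAX_SNIPPET_BYTES):
    """Return a prefix of full_text that encodes to at most `limit` bytes."""
    # Every character encodes to at least one byte, so the first `limit`
    # characters always cover the first `limit` bytes. Slicing before
    # encoding avoids copying a whole 50MB string just to discard most of it.
    head_chars = full_text[:limit]
    head_bytes = head_chars.encode("utf-8")
    if len(head_bytes) <= limit:
        return head_chars
    # errors="ignore" drops a multi-byte character split in two by the cut.
    return head_bytes[:limit].decode("utf-8", errors="ignore")
```

    Slicing first keeps the temporary copies small, which is the same peak-memory concern as the gsub! change above.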

    I’ve more work to do; I think there are quite a few other quick wins, all of which are making the site faster too. I’m quite happy that WhatDoTheyKnow also has a bunch more test code as a result of all this.

    On the other hand, what a disappointing disaster for open source languages beginning with P/R (as opposed to J). Yes, the help and tools were just about there to work it out, but would seem primitive if you’d used say Java’s Memory Analyzer. Indeed somebody over on StackOverflow suggested running your site in JRuby and using exactly that tool…

  2. How Mapumental works

    Here is a diagram of how the backend of Mapumental works. Take it in the spirit that Chris Lightfoot set when he made a similar diagram for the No. 10 petitions site – although many such diagrams are useless, hopefully this one contains useful information.

    (Click on the diagram for a large version)

    Below, I’ve explained what the main components are, and some interesting things about them.

    Everything can, at least in theory, run on lots of servers. Currently we are only actually using one server for web requests, because of problems with HAProxy. We’re running isodaemons on two different servers.

    Basic web application – it started out as raw Python, but the more Matthew hacks on it the more Django libraries he pulls in. Soon it’ll be indistinguishable from a Django app. When someone enters a new postcode, it adds it to the work queue in the PostgreSQL database, then refreshes waiting for the job to be finished. Then it displays the flash application (made by Stamen), set up to load the appropriate tile layers.

    Tile server and cache – This uses the Python-based TileCache, calling the Geospatial Data Abstraction Library (GDAL) to help render the tiles from points. It was originally written by Stamen, and expanded by mySociety. GDAL isn’t perfect; it doesn’t have fancy enough algorithms for my liking, e.g. using a median rather than a weighted mean.

    Isodaemons – These are controlled by a Python script, but the bulk of the code is custom written in C++. Slightly crazily, this can find the quickest route by public transport for each of 300,000 journeys from every station in the UK to a particular station, arriving at a particular time, in 10 to 30 seconds.

    I had no idea how to do this, but luckily I live in Cambridge, UK. It’s a city fit to bursting with computer scientists. Many of the jobs are dull, and need little computing, never mind science – like writing interface layers for SQL server. So if you have a really interesting problem it’s easy to get help!

    The universal advice was to use Dijkstra’s algorithm, which needed a bit of adaptation to work efficiently over space-time, rather than just space. Normally it is used for planning routes round a map, but public transport isn’t like that, you have to arrive in time for each particular train, so time affects what journeys you can take.
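
    To make the space-time idea concrete, here is a minimal sketch in Python. It is not the Mapumental code (which is C++), and it makes simplifying assumptions: the timetable is a flat list of direct connections, times are minutes since midnight, every connection arrives after it departs, and there is no interchange time or walking. The function name latest_departures and the data layout are invented for this example. Because all journeys end at one station by a fixed time, the search runs backwards from that station and records the latest time you can leave each other station.

```python
import heapq
from collections import defaultdict


def latest_departures(connections, target, arrive_by):
    """connections: iterable of (origin, depart, dest, arrive) tuples.

    Returns {station: latest departure time that still reaches `target`
    by `arrive_by`}. Stations that cannot make it are absent.
    """
    # Index connections by where they arrive, since we search backwards.
    into = defaultdict(list)
    for origin, depart, dest, arrive in connections:
        into[dest].append((origin, depart, arrive))

    best = {target: arrive_by}
    heap = [(-arrive_by, target)]  # negated times turn heapq into a max-heap
    while heap:
        neg_time, station = heapq.heappop(heap)
        time = -neg_time
        if time < best[station]:
            continue  # stale entry; a later departure was already found
        for origin, depart, arrive in into[station]:
            # The connection is usable only if it arrives in time to carry
            # on from `station`; keep it if it lets us leave `origin` later.
            if arrive <= time and depart > best.get(origin, float("-inf")):
                best[origin] = depart
                heapq.heappush(heap, (-depart, origin))
    return best
```

    The journey time from a station is then arrive_by minus its entry in the result. The only change from map-style Dijkstra is the "arrive <= time" test, which is where time restricts which journeys you can take.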

    I originally wrote it in Python, which was not only too slow, but used up far far too much RAM. It could never have loaded the whole dataset in. However, the old Python code is still run by the test script, to double check the C++ code against. It is also still used to make the binary timetable files, see below.

    Travel times, 1 binary file / postcode – I briefly attempted to insert 300,000 rows into PostgreSQL for each postcode looked up, but it was obvious it wasn’t going to scale. Going back to basics, it now just saves the time taken to travel to each station in a simple binary file – two bytes for each station, 600k in total. The tile server then does random access lookups into that file, as it renders each tile. It only needs to look up the values for the stations it knows are on/near the tile.
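
    A rough sketch of that file layout in Python, for illustration only. The real byte order, units and handling of unreachable stations aren’t described above, so the little-endian unsigned 16-bit minutes and the 0xFFFF "unreachable" marker here are assumptions, as are the function names.

```python
import struct

RECORD = struct.Struct("<H")  # assumed: little-endian unsigned 16-bit value
UNREACHABLE = 0xFFFF          # assumed sentinel for "no journey found"


def write_travel_times(path, minutes_by_station):
    """Write one two-byte record per station, in station-index order."""
    with open(path, "wb") as f:
        for minutes in minutes_by_station:
            value = UNREACHABLE if minutes is None else minutes
            f.write(RECORD.pack(value))


def read_travel_time(f, station_index):
    """Random-access lookup: seek straight to the station's record."""
    f.seek(station_index * RECORD.size)
    (value,) = RECORD.unpack(f.read(RECORD.size))
    return None if value == UNREACHABLE else value
```

    With 300,000 stations this gives the 600k file mentioned above, and a tile renderer only has to seek to the handful of records for the stations on or near its tile.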

    There’s various other bits:

    • cron jobs for sending out invites
    • converting timetable data from ATCO-CIF to the binary format
    • loading static layer data into the database
    • precaching every tile for static datasets
    • Squid, Apache and FastCGI all sit in front of the web applications
    • for speed, we cache the mapping background tiles from Cloudmade
    • when zoomed out, there is code to cull which stations are used to draw tiles
    • of course, a bunch of test code

    Thanks to everyone who helped make Mapumental, we couldn’t have done it without lots of clever people.

    I realise the above is a sketchy overview, so please ask questions in the comments, and I’ll do my best to answer them.

  3. Report submission edits

    A number of people report dog fouling through FixMyStreet, using slightly more… colloquial language. A number of councils have strict obscenity filters, blocking anything containing swearing. As I’m a pragmatist and not that interested in campaigning against councils blocking legitimate emails from their citizens (feel free!), FixMyStreet simply changes any “dog shit” reference to “dog poo”. This works well for everyone.

    Recently, the infamous Intellectual Property Manager from Portakabin™ Limited got in touch to complain about a couple of reports on FixMyStreet containing the words “portacabin” or “portaloo”. Again, as a pragmatist, I’m not really interested in whether users using trade marks or trade mark variants in a generic way on a problem report actually constitutes trade mark infringement (actually, I’d guess not), I just want legal people to go away and not waste our precious resources. So from now on, any report containing portakabin or similar will become [portable cabin], and portaloo will become [portable loo].

    For anyone who’s interested, this is accomplished through a simple regular expression, that looks for porta followed by 0 or more spaces, then cabin, kabin, or loo, and sticks “ble” in the middle.
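
    For illustration, here is roughly what such a substitution could look like in Python. This is a sketch of the idea, not the code FixMyStreet runs; the case-insensitive matching and the function name genericise are assumptions.

```python
import re

# "porta", then zero or more spaces, then cabin, kabin or loo.
PORTA = re.compile(r"porta *(cabin|kabin|loo)", re.IGNORECASE)


def genericise(text):
    """Replace porta-cabin/kabin/loo variants with bracketed generic terms."""
    def repl(match):
        # Both "cabin" and "kabin" become the generic spelling "cabin".
        noun = "loo" if match.group(1).lower() == "loo" else "cabin"
        return "[portable " + noun + "]"
    return PORTA.sub(repl, text)
```

    So genericise("A Portakabin and a porta loo") gives "A [portable cabin] and a [portable loo]".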

  4. RIP Angie Martin 1974-2009

    It is with overwhelming sadness that I write to tell our community that Angie Martin, mySociety’s fourth core developer, has died. She was taken from us by the cancer that she had been fighting since soon after we hired her less than two years ago.

    Possessed of an almost unbelievably upbeat personality, Angie brought not only her formidable Perl skills, but her blazing warmth of character to our team. In remission during our yearly retreat in January this year, she combined laughter with a typically tough line of questioning on ideas she thought insufficiently robust. With typical disregard for cool, her CV noted that she was “known to enjoy wrangling regular expressions on a Sunday Morning”. She didn’t see any contradiction between being a successful woman and a geek, throwing herself wholeheartedly into the Mac-toting, perlmonger ethos. She even brought her husband Tommy with her, who became a significant volunteer.

    Given her habit of plain speaking, it is pointless to pretend that Angie was able to make the contribution to mySociety’s users or codebase that she wanted to. What she achieved in terms of difficult coding during recovery from chemotherapy was incredible, breathtaking – but she wanted to change the world. It now falls to the rest of us, and our supporters, to live up to the expectations she embodied, to continue to push every day, using skills like those that she had to help people with everyday problems. We now have to ask ‘What would Angie do?’, as well as ‘What would Chris do?’. It is a lot to live up to.

    She was a mySociety core developer: I hope that meant as much to her as it meant for me to have her as one of my coders.  Remember and Respect.

    Updated: Angie changed her surname upon getting married, a couple of months ago. I have just read she wanted to be remembered as Angie Martin, and so I have made that change.

    Updated 21 7 2009: Tommy has just told me that those wishing to make memorial donations should send them to Hospice at Home.

  5. Register of Members’ Financial Interests

    As a new edition has just been released, and I’ve had to tweak the parser to cope with the new highlighting, it’s a good time to write a brief article on TheyWorkForYou’s handling of the House of Commons Register of Members’ Financial Interests (Register of Members’ Interests as was before the current edition). Way back in the day, a scraper/parser was written (by either Julian or Francis) that monitors the Register pages on www.parliament.uk for new editions, and downloads and broadly parses the HTML into machine-readable data. The XML produced can be found at http://ukparse.kforge.net/parldata/scrapedxml/regmem/ – TheyWorkForYou then pulls in this XML into its database, and makes the latest data available on every MP’s page.

    However, as it’s been scraping/parsing the Register since 2000, we can do more than that. Each MP’s page contains a link to a page giving the history of their entry in the Register – when things were added, removed, or changed. You can also view the differences between one edition of the Register and the next, or view a particular edition in a prettier form than the official site. There’s a central page containing everything Register-related at http://www.theyworkforyou.com/regmem/

  6. FixMyStreet iPhone app

    We’ve had reports that our FixMyStreet iPhone app is crashing on iPhone 3.0, and so have withdrawn it from the App Store until we are able to find out what’s wrong and fix it. I’m afraid I don’t know when that will be, as it’s all rather busy at present – if anyone has the skills and would like to volunteer to help, the code is available and should just import into Xcode. I can supply some crash logs too.

  7. TheyWorkForYou Redesign

    Richard Pope has been redesigning mySociety’s biggest site TheyWorkForYou.com for a couple of months.

    He’s done a heroic job, as has Matthew with his epic import of Hansard data from 1935 onwards.  TheyWorkForYou is a much better site for their combined work recently. We’ll be writing more on the historic stuff soon.

    There are a few things I’d like from you as a member of the mySociety community:

    1. Please say a big thanks to Richard. This was not an easy or relaxing task at all, and he’s done it brilliantly. Just check a Lords debate to see the attention to detail. We are a very lucky organisation to have him, as he’s always in demand.

    2. Please give some constructive criticism on how it could be even better (please note, focussing on design here, we already have a load of feature priorities to deliver).

    3. Anyone who could help supply a redesigned logo, or some nicely processed parliamentary-themed artwork to sit in the background grey-boxes on the homepage would be doing a very Good Deed for mySociety.

    And lastly, please do pledge to become a TheyWorkForYou Patron, so we can keep doing things like this in the future!

  8. April Fools’ Day Council changes

    They could perhaps have picked a better day, as it was quite serious – at the stroke of midnight on the 1st of April, 37 district councils and 7 county councils in England ceased to exist, replaced by 9 new unitary authorities. This means people in Durham, Northumberland, Cornwall, Shropshire, Wiltshire, Cheshire, and Bedfordshire only have one principal local authority to deal with now. The Wikipedia article on the changes has more information on the background to this change.

    Obviously this meant some work for WriteToThem and FixMyStreet, both of which require up-to-date local council information. Our database of voting areas, MaPit, has “generations”, so we can keep old areas around for various historical purposes. So firstly, I created a new generation and updated all the areas that weren’t affected to the new generation. Next, six of the new unitary authorities (all the counties except Cheshire and Bedfordshire, plus Bedford) share their boundaries and wards with the coterminous councils they’re replacing, so for them it was a simple matter of updating those councils to be unitary authorities.

    That left Bedfordshire and Cheshire. I created areas for the three new councils (Cheshire West and Chester, Cheshire East, and Central Bedfordshire), and transferred across the relevant wards from the old county councils – basically a manual process of working out the list of correct ward IDs.

    WriteToThem was now dealt with, but FixMyStreet needed a little more work. The councils that no longer existed had understandably disappeared from the all reports table, so I had to modify the function that fetches the list of councils to optionally return historical areas so they could be included. And lastly, FixMyStreet needs a way of mapping a point on a map to the relevant council. For this, it needs to know the area covered by a council, which was missing for the new authorities I’d manually created. Thankfully, each of the three new authorities is made up of the areas of either 2 or 3 district councils (e.g. Cheshire East is the area covered by Congleton, Macclesfield, and Crewe and Nantwich), so I just had to write a script that stuck those areas together to create the area of the new council. It all seems to work, and I’m sure our users will be in touch if it doesn’t 🙂

    So goodbye to Alnwick, Bedfordshire, Berwick-upon-Tweed, Blyth Valley, Bridgnorth, Caradon, Carrick, Castle Morpeth, Cheshire, Chester, Chester-le-Street, Congleton, Crewe and Nantwich, Derwentside, Durham City, Easington, Ellesmere Port and Neston, Kennet, Kerrier, Macclesfield, Mid Bedfordshire, North Cornwall, North Shropshire, North Wiltshire, Oswestry, Penwith, Restormel, Salisbury (which is getting a new town council), Sedgefield, Shrewsbury and Atcham, South Bedfordshire, South Shropshire, Teesdale, Tynedale, Vale Royal, Wansbeck, Wear Valley, and West Wiltshire. RIP.

  9. FixMyStreet RSS

    FixMyStreet has a lot of RSS feeds. There’s one for every one-tier council (170), one for every ward of every one-tier council (another 5,044), two for every two-tier (county and district) council (544), and two for every ward of every two-tier council (20,296) – two per two-tier council because you might want either problems reported to one council of a two-tier set-up in particular, or all reports within the council’s boundary.

    Then there’s an RSS feed every 162m across Great Britain in a big grid, returning all reports within a radius of that point, the radius by default being automatically determined by that point’s population density, but customisable to any distance if preferred. That’s, at a very rough approximation assuming Great Britain is a rectangle around its extremities, which it’s not, 19 million RSS feeds, lots of which will admittedly be very similar. 🙂
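
    As a back-of-the-envelope check of those numbers, here is the arithmetic in Python. The council and ward counts are the ones quoted above; the bounding rectangle of 500km by 1,000km is a round-number assumption made purely for this illustration, not a figure from FixMyStreet.

```python
# Feed counts quoted above for councils and wards.
one_tier_councils = 170
one_tier_wards = 5044
two_tier_councils = 544    # already counted at two feeds per council
two_tier_wards = 20296     # already counted at two feeds per ward
named_feeds = (one_tier_councils + one_tier_wards
               + two_tier_councils + two_tier_wards)

# Grid of feeds, one point every 162m.
GRID_SPACING_M = 162
ASSUMED_WIDTH_M = 500000    # assumption: rough east-west extent
ASSUMED_HEIGHT_M = 1000000  # assumption: rough north-south extent
grid_feeds = ((ASSUMED_WIDTH_M // GRID_SPACING_M)
              * (ASSUMED_HEIGHT_M // GRID_SPACING_M))

print(named_feeds)  # 26054
print(grid_feeds)   # on the order of 19 million under these assumptions
```

    The grid swamps everything else: the council and ward feeds are a rounding error next to it.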

    Every single one of those feeds can be subscribed to by email instead if that’s preferable to you, and they are all accessible through a simple interface at http://www.fixmystreet.com/alert.

    However, none of these RSS feeds was suitable for the person who emailed from a Neighbourhood Watch site and said that all they had was a postcode and they wanted to display a feed of reports from FixMyStreet. Given you could obviously look up a FixMyStreet map by postcode, it did seem odd that I hadn’t used the same code for the RSS feeds. Shortly thereafter, this anomaly was fixed, and if you now go to a URL of the form http://www.fixmystreet.com/rss/pc/postcode you will be redirected to the appropriate local reports feed for that postcode (I could say that adds another 1.7 million RSS feeds to our lot, but given they’re only redirects, that’s not strictly true). And after a couple more emails, I also added pubDate fields to the feeds which should make displaying in date order easier.

    It’s great to see our RSS feeds being used by other sites – other examples I’ve recently come across include Brent Council integrating FixMyStreet into their mapping portal (select Streets, then FixMyStreet), or the Albert Square and St Stephen’s Association listing the most recent Stockwell problems in their blog sidebar. If you’ve seen any notable examples, do leave them in the comments.

  10. PledgeBank Facebook application disabled

    Unfortunately, I’ve had to disable the PledgeBank Facebook application. It used to let you sign and share pledges from within Facebook.

    Facebook recently changed their platform (again!), breaking our code for sending success/failure messages. Obviously, it is no good signing up to a pledge if you don’t get informed when it succeeds.

    I tried to fix it, but couldn’t work out how to do so quickly. We don’t have the time and money at the moment to chase after this, so I’ve disabled the application entirely. Links to PledgeBank pages on Facebook now redirect to pledgebank.com.

    Hopefully it’ll be back one day – do send us emails if you miss it (or money if you have a large pledge that really needs it!). I think there may be a better solution with a simpler interface – the current application tried too hard to reimplement all of PledgeBank within Facebook. And besides, we should be supporting OpenSocial now it exists. It’s an open standard, Facebook isn’t.

    Technical details: We used infinite session keys to send notifications from cron jobs. Quite reasonably, this no longer works. However, I couldn’t find out what to use instead. I think Facebook should respect backwards compatibility of its APIs a lot more, and if it breaks them it should give clear instructions about what to use instead. This does put me off ever wanting to develop anything on their platform again.