WhatDoTheyKnow keeps growing and growing, sucking people in from Google as its archive of maybe 8.5% of Freedom of Information requests gets more and more detailed.
There’s around 8Gb of unfettered Government data in the core database, plus a whole bunch more for indexing and caching. For comparison, TheyWorkForYou (which now goes back to 1935) has 12Gb. And it’s catching up on traffic too – WhatDoTheyKnow now has about half as many visitors as TheyWorkForYou.
Unfortunately, this newfound traffic has led to performance problems. You might have seen errors when using WhatDoTheyKnow in the last week or two. This post is firstly an apology for that – thank you for your patience. Hopefully it is fixed now; do let us know if you still see problems. And secondly it is some techy stuff about debugging such problems in Ruby on Rails…
When WhatDoTheyKnow started failing, we did the obvious things to start with – moving the database to a separate server, and moving some other services off the same server, to give WDTK more room to breathe. It still kept breaking.
None of my server monitoring tools shed much light on the problem. I upgraded to the latest version of Passenger, the best Rails deployment tool I’ve seen yet. It’s pretty good, but still not mature enough for my liking. I was still getting the same problems with it, but reporting tools like passenger-memory-stats were really helpful.
Eventually I worked out that it was to do with the memory use of the Rails processes. Individual ones would leap up to 1Gb, and never drop back down. If several did that at once, the server (with 4Gb of RAM) would start swapping and grind to a halt. The world of Ruby and Rails memory monitoring software is patchwork at best, and in the end I found the simplest tools the most useful. Here are some:
- I found some Rails processes were getting jammed, and not dying even when I restarted Apache. I think in the end this was due to Passenger’s spawning method, and our use of the Xapian Ruby module. Running Passenger in RailsSpawnMethod conservative mode made things much more robust.
- Monit, which in a previous life had a job holding up vital structural pillars of buildings with duct tape, makes you feel dirty. Actually it is really useful. Given I couldn’t quickly fix the problem, Monit let me at least reduce the suffering for people trying to use the site meanwhile. Here’s the rule I used, which gives Apache a kick every time server memory use is too high. It was firing every 5 or 10 minutes…
check system localhost
    if memory > 3500 MB then exec "/usr/sbin/apache2ctl graceful"
- I found memory_profiler on a blog. It helps you find the kind of memory leak where you unintentionally keep a reference to an object you no longer use – its specialist subject is string objects. This led to a fix to do with declaring static arrays in classes vs. modules, which I still don’t really understand. But it wasn’t the cause of the big 1Gb memory munching; there were no leaks of this sort large enough.
- The record_memory function in WDTK’s application controller came from another blog. It’s handy because it shows how much each request increases the Ruby process’s memory use. With caveats, this was the best way for me to identify the most damaging requests (search results, and certain public body pages). It also brought focus onto the actual problem – peak memory use during a request. That’s really important, because Ruby’s memory manager never returns memory to the operating system… The 1Gb leaps in memory use were caused by temporary memory used during certain requests, which the Ruby memory manager then never frees.
- I made a bunch of functions culminating in allocated_string_size_around_gc. This was really useful combined with the “just add lots of print statements and fiddle” school of debugging. Not everyone’s favourite school, but one I often end up using when my test code can’t catch the problem (it rarely gets involved enough that setting up an interactive debugger seems worthwhile). It led me to various peak memory savings, such as calling text.gsub! rather than text = text.gsub while removing email addresses and private information from FOI request responses – which helps quite a bit when dealing with multi-megabyte attachments.
- Finally, I used the most overlooked debugging tool, and the one you should never rely on: common sense. That is, common sense informed by days of careful use of all the other tools. In order to quickly show text extracts when searching, WDTK stores the extracted attachment text in the database. A few of these attachments are quite large, leading to 50Mb fields, often several of which were being loaded and processed in one page request. That this would cause a high peak of memory use just became obvious to me some time yesterday. I checked that it was the case, and this morning I changed it to still use the full text for indexing, but to keep at most 1Mb for use in snippets. So sometimes now you won’t get a good search extract for queries, but that is rare, and it will at least still return the right result.
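For reference, switching Passenger’s spawn method as described in the first bullet is a single line of Apache configuration (Passenger 2.x syntax; check your version’s documentation):

```apache
# Passenger 2.x, Apache: the conservative spawner forks a fresh Ruby
# interpreter for each process instead of sharing a preloaded parent,
# which is safer with C extensions like the Xapian bindings.
RailsSpawnMethod conservative
```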
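The memory_profiler-style leak hunt in the list above boils down to counting live objects between garbage collections. Here is a minimal sketch in plain Ruby (my own illustration, not the actual memory_profiler code):

```ruby
# Count live String objects after a full GC. If this count keeps
# climbing across identical requests, something is holding references
# to strings it no longer needs.
def live_string_count
  GC.start
  count = 0
  ObjectSpace.each_object(String) { count += 1 }
  count
end

baseline = live_string_count
retained = []                                # simulated leak: long-lived array
1000.times { |i| retained << "request data #{i}" }
grown = live_string_count

puts "live strings: #{baseline} -> #{grown}"
```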
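The record_memory approach above can be as simple as reading the process’s resident set size from /proc before and after a request. A Linux-only sketch of the shape (the names here are mine, not WDTK’s actual code):

```ruby
# Resident set size of this process in kB, parsed from /proc/self/status.
# Linux only; other platforms need a different mechanism.
def resident_memory_kb
  File.read("/proc/self/status")[/^VmRSS:\s*(\d+)\s*kB/, 1].to_i
end

before_kb = resident_memory_kb
payload = "x" * (10 * 1024 * 1024)   # pretend a request built a 10Mb string
after_kb = resident_memory_kb

puts "request grew the process by #{after_kb - before_kb} kB"
```

Because MRI never hands memory back to the operating system, the delta you log here is effectively permanent – which is why peak usage per request is the number that matters.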
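The gsub!/gsub point is easy to demonstrate: String#gsub allocates a complete modified copy, while String#gsub! edits the receiver in place (and returns nil when nothing matched – a classic gotcha). On a multi-megabyte attachment the copy doubles peak memory:

```ruby
text = "contact someone@example.com for details"

copy = text.gsub(/\S+@\S+/, "[removed]")  # a whole new String is allocated
text.gsub!(/\S+@\S+/, "[removed]")        # mutates text in place, no full copy

puts copy               # "contact [removed] for details"
puts text.equal?(copy)  # false: same content, two separate objects
```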
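And the final fix – indexing the full extracted text but storing at most 1Mb for snippets – is essentially a one-liner. A sketch (the constant and method names are illustrative, not WDTK’s):

```ruby
MAX_SNIPPET_CHARS = 1024 * 1024   # assumed 1Mb cap on stored snippet text

# Truncate extracted attachment text before storing it for snippets.
# String#[] copies at most this many characters; short text is unchanged.
def snippet_text(full_text)
  full_text[0, MAX_SNIPPET_CHARS]
end

big_extract = "a" * (50 * 1024 * 1024)  # a 50Mb attachment's extracted text
puts snippet_text(big_extract).size     # 1048576
puts snippet_text("short").size         # 5
```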
I’ve more work to do; I think there are quite a few other quick wins, all of which are making the site faster too. I’m quite happy that WhatDoTheyKnow also has a bunch more test code as a result of all this.
On the other hand, what a disappointing disaster for open source languages beginning with P/R (as opposed to J). Yes, the help and tools were just about there to work it out, but they would seem primitive if you’d used, say, Java’s Memory Analyzer. Indeed somebody over on StackOverflow suggested running your site in JRuby and using exactly that tool…
One thing I should add that we did early on… Lots of the traffic was, as ever, search engine robots.
To try and minimise problems for real users, we slowed down Google’s bot with Webmaster Tools, and added a Crawl-delay to robots.txt to slow down Bing’s bot.
I also added lots of pages to robots.txt, to stop search engines crawling them. These were pages with no real new content, usually because they were pages for performing an action, e.g. the page to add an annotation to each request (/annotate/*).
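An illustrative robots.txt along those lines (msnbot was Bing’s crawler at the time; the delay value and paths are examples, not WDTK’s exact rules):

```text
User-agent: msnbot
Crawl-delay: 10

User-agent: *
Disallow: /annotate/
```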
I’m just undoing the crawl delays now…
I’m not sure why you claim this is a disaster for Perl, Python, PHP and other free software languages…
Are you using ferret for the search side?
I have seen that ferret can be very memory hungry and using the ferret gem in a rails app can make it bloat out.
Dump Rails, try Django.
Dave – I agree that the weak Ruby virtual machine was a large cause of my troubles. I’m not totally convinced the debugging tools for Perl, Python or PHP are substantially better than Ruby, even if they are slightly better. Certainly we had lots of trouble in the early days with, say, FastCGI and those languages.
Perhaps what I was trying to say was that things could be better organised across LAMP-style stacks in general. At the moment such stacks are lots of ad hoc programs wired together that you can sort of debug things with. I’d like a stable platform that simply tells you exactly the information you need to know.
Peter – We’re using Xapian and acts_as_xapian. I wrote the latter for WhatDoTheyKnow, because acts_as_solr wasn’t good enough. Details in this blog post http://www.mysociety.org/2008/07/17/acts_as_xapian/ – the Ruby Xapian bindings did cause problems with Passenger’s spawning model, which we had to make conservative. It doesn’t seem to cause memory trouble though (Xapian doesn’t allocate much memory, it uses memory mapped files and relies on the kernel for in-RAM speed caching).
Dan – Django was not as clearly good 2 years ago when we started writing WhatDoTheyKnow, although it is what I’d use if I started again now. Rails was genuinely innovative, and well marketed, with a community that has a lot going for it. On the other hand, I think Ruby, Rails, and their third-party libraries and tools are in general much less well engineered than the Python equivalents.
Obviously we’re not going to rewrite it now. That would be a crazy waste of time (second-system effect). It might look like a simple website, but inside it is surprisingly complicated – there is a lot going on that happens invisibly, and lots and lots of slowly accumulated knowledge now embodied in the source code.
You really should try JRuby just to see if it solves your memory issues. We make much more efficient use of memory than the standard implementation, and there are sites running JRuby on multi-GB heaps without any issues.
Thanks Charles for that vote for trying JRuby. I think I’m unnecessarily put off because I’m not used to Java (from the days when it wasn’t open source, so didn’t have nice Debian packages).
I’m also worried that it’ll be hard to set up an environment as nice as mod_ruby (e.g. running Rails as a particular Unix user).
I’ll give it a go if I get stuck again in future!
Francis,
You should read:
http://www.engineyard.com/blog/2009/thats-not-a-memory-leak-its-bloat/
And definitely reach for New Relic or Scout.
— michael
Is it possible to give the reason for using Ruby for this site? For example, could you not have used PHP and MySQL?
Was there a special reason for using Ruby for a production site like this?
I decided to try Ruby on Rails, mainly because I hoped it would make it easier for other people to contribute code. This is because it has a standard file structure, and way to run the code.
PHP/MySQL applications are very hard for people to configure and get going – e.g. there’s no standard way to do database migrations, etc.
Yes, I could have used Cake or Django – but 2.5 years ago they weren’t as obvious a choice as trying Rails. I think it was a good choice. I’ve learnt a lot about frameworks from using Rails. I don’t think it is especially good, but then again I haven’t found anything I think is better (Django is in some ways, not in others).
In practice, it hasn’t created more code contributions. Partly the Rails app is still hard to install (getting example data, setting up Xapian and various other bits). Partly it is just hard to get code contributions to web applications.