mySociety

mySociety Blog

Blog posts in the category Developers

Looking back: our experience of the Google Summer of Code

Posted in Components, Developers

Summer may seem like a long time ago, but despite the cold outside, we’ve been looking back over our participation in Google’s Summer of Code project. It’s almost enough to warm us up!

This post is an attempt to record the process from our point of view. We hope it will be useful for other organisations considering participating next year, and for students who want to know more about how the scheme works.

What is Google Summer of Code?

It’s a programme sponsored by Google’s philanthropic arm, giving students the chance to experience real-life coding on open source software.

The scheme is open to students all over the world, who are then paired up with open source organisations like us. The students gain paid work experience and mentoring; the organisations gain willing workers and some fresh new perspectives; the world gains some more open source code to use or develop further.

Everyone’s a winner, basically.

The beginnings

2012 was our first year on the programme: once we had been accepted on the scheme, we were given two student slots – the maximum allowed for a first-time organisation.

Given mySociety’s wide suite of codebases, there were several projects that could have benefited. We listed all our ideas, and let people apply for the ones they found appealing.

Goodness, there were a lot of applicants! It was very heartening to discover that there is such an enthusiastic community of young coders all around the world – even if it did take us a long time to sift through them all and make our choices.

You might remember our post back in May, when we announced that we’d made our choices. We were delighted to get working with Dominik from Germany and Chetan from India.

The project

As things turned out, our students ended up working on a project that wasn’t even on our original list: PopIt, our super-easy ‘people and positions’ software.

That’s because once we spoke to our chosen students, we realised they had the skills that could really help us forge ahead with this project – and once we discussed it with them, they were keen. So PopIt it was.

Logistics

Germany and India are a bit of a commute away, but fortunately development work can be managed remotely. We know this particularly well at mySociety: our core team work from home and are scattered across the UK.

The only difference here was the 6+ hour time difference between us and India: it was important to be rigorous about checking in at times when Chetan would be awake!

We communicated via IRC (instant chat), email, and occasionally Skype, and it all worked well.

Edmund, the team member chosen to be mentor, broke the required tasks down into big pieces so that the students would have realistic work units of several days each.

What was achieved

PopIt is primarily a tool for helping people create and run parliamentary monitoring websites (like TheyWorkForYou) with minimal coding knowledge/effort, though we anticipate that it will have many other uses too.

Our students spent the first half of the summer learning and improving the PopIt codebase. Once they were confident in it, they created their own sites using PopIt as a datasource to test the API, and, hopefully, create a valuable reference resource for the community.

Dominik added a migration tool to PopIt, which lets you upload data as a CSV. This means that you can start a site with a database of names, positions and dates at its heart – within seconds.

His test site was a professors’ database (the code is here and the site is here). Dom also wrote some helpful posts on the dev blog like this one.

Chetan created an image proxy that lets us serve images in a smart way that makes sense for APIs. His test site was for Indian representatives (here’s the code, and the site is here).

Neither site is being maintained now, which just confirms that it is harder to run a site than to start it. This is not a failing, though. The creation of these sites, along with Chetan and Dom’s feedback, helped us understand where improvements needed to be made. In the course of one summer, PopIt became much more mature.

Looking back on the Summer of Code

Edmund attended a follow-up ‘mentors’ summit’ at the Googleplex in California – he found it very helpful to compare notes with other organisations and find out what had worked best for them all, and he made some good contacts too.

Assuming we get the chance again, would we participate in 2013? Our experience was very positive, but as yet we are undecided, purely because of the fluid nature of our workflow: we don’t yet know whether time and resources will permit.

Obviously, we have enjoyed great benefits from the scheme, but that has depended on quite a bit of input from our side, and we need to be sure that we can provide the same level of input again.

Edmund has compiled a list of advice, from the practical (ask students to treat the placement like a full-time job; test coding skills before acceptance) to the desirable (a weekly blog post from participants; make sure you over-estimate the time you’ll spend mentoring). If you’re thinking of participating next year, he’d be happy to pass on his tips for ensuring that you, and your assigned students, get the best out of the Google Summer of Code. Just drop him a line.

Installing FixMyStreet and MapIt

Posted in Developers, FixMyStreet, Launches, MapIt, Technical

A photo of some graffiti saying "SIMPLE"

One of the projects we’ve been working on at mySociety recently is that of making it easier for people to set up new versions of our sites in other countries.  Something we’ve heard again and again is that for many people, setting up new web applications is a frustrating process, and that they would appreciate anything that would make it easier.

To address that, we’re pleased to announce that for both FixMyStreet and MapIt, we have created AMIs (Amazon Machine Images) with a default installation of each site.

You can use these AMIs to create a running version of one of these sites on an Amazon EC2 instance with just a couple of clicks. If you haven’t used Amazon Web Services before, then you can get a Micro instance free for a year to try this out.  (We have previously published an AMI for Alaveteli, which helped many people to get started with setting up their own Freedom of Information sites.)

Each AMI is created from an install script designed to be used on a clean installation of Debian squeeze or Ubuntu precise, so if you have a new server hosted elsewhere, you can use that script to similarly create a default installation of the site without being dependent on Amazon.

In addition, we’ve launched new sites with documentation for FixMyStreet and MapIt, which will tell you how to customize those sites and import data if you’ve created a running instance using one of the above methods.

These documentation sites also have improved instructions for completely manual installations of either site, for people who are comfortable with setting up PostgreSQL, Apache / nginx, PostGIS, etc.

Another notable change is that we’re now supporting running FixMyStreet and MapIt on nginx, as an alternative to Apache, using FastCGI and gunicorn respectively.

We hope that these changes make it easier than ever before to reuse our code, and set up new sites that help people fix things that matter to them.

Photo credit: duncan

Job Advert: Developers

Posted in Developers, Job adverts, Technical

This vacancy is now filled.

How would you like to be a coder in an organisation that is as determined to make a difference in the world as it is to be a truly high quality, engineer-led software team?

mySociety is that organisation. We’re a project of a registered charity, currently running award-winning civic and democratic websites like TheyWorkForYou.com and FixMyStreet.com, and we’re looking to grow our already-celebrated development team by several new members over the next six months.

We’re looking for people with at least two years’ experience (professional or keen amateur) in at least one of Python, Ruby, Perl, PHP, C++, JavaScript or Adobe Flex, and who have ambitions to learn more languages in the future.

We’re looking for developers willing to commit to full or mostly-full time positions (no freelancers, sorry) and who are up for a career change that will see them stay with us for a little while. You’ll get to work with volunteers, mix commercial and charitable projects, and travel far and wide. Plus, you can work from wherever you live (in the UK), and we pay salaries from £28k to £50k depending on skills.

Most of all, we’re looking for coders who look at the services we have built so far and think “I wish I’d been on that project”. Projects you’ll likely be working on over the next few months include (but are not limited to):

  • A/B testing and conversion tracking of our charitable sites
  • Commercial spinoffs from FixMyStreet
  • Mapumental
  • Enhancements to TheyWorkForYou and WhatDoTheyKnow
  • Commercial development for clients

And if you’ve any questions, please post them in the comments below so we can share the answers.

New features on MaPit

Posted in Developers, Launches, Technical

We’ve added a variety of new features to our postcode and point administrative area database, MaPit, in the past month – new data (Super Output Areas and Crown dependency postcodes), new functionality (more geographic functions, council shortcuts, and JSONP callback), and most interestingly for most people, a way of browsing all the data on the site.

  • Firstly, we have some new geographic functions to join touches – overlaps, covered, covers, and coverlaps. These do as you would expect, enabling you to see the areas that overlap, cover, or are covered by a particular area, optionally restricted to particular types of area. ‘coverlaps’ returns the areas either overlapped or covered by a chosen area – this might be useful for questions such as “Tell me all the Parliamentary constituencies fully or partly within the boundary of Manchester City Council” (three of those are entirely covered by the council, and two overlap another council, Salford or Trafford).
  • As you can see from that link, nearly everything on MaPit now has an HTML representation – just stick “.html” on the end of a JSON URI to see it. This makes it very easy to explore the data contained within MaPit, linking areas together and letting you view any area on Google Maps (e.g. Rutland Council on a map). It also means every postcode has a page.
  • From a discussion on our mailing list started by Paul Waring, we discovered that the NSPD – already used by us for Northern Ireland postcodes – also contains Crown dependency postcodes (the Channel Islands and the Isle of Man) – no location information is included, but it does mean that given something that looks like a Crown dependency postcode, we can now at least tell you if it’s a valid postcode or not for those areas.
  • Next, we now have all Lower and Middle Super Output Areas in the system; thanks go to our volunteer Anna for getting the CD and writing the import script. These are provided by ONS for small area statistics after the 2001 census, and it’s great that you can now trivially look up the SOA for a postcode, or see what SOAs are within a particular ward. Two areas are in MaPit for each LSOA and MSOA – one has a less accurate boundary than the other for quicker plotting, and we thought we might as well just load it all in. The licences on the CD (Conditions of supply of SOA boundaries and Ordnance Survey Output Area Licence) talk about a click-use licence, and a not very straightforward OS licence covering only those SOAs that might share part of a boundary with Boundary-Line (whichever ones those are), but ONS now use the Open Government Licence, Boundary-Line is included in OS OpenData, various councils have published their SOAs as open data (e.g. Warwickshire), and these areas should be publicly available under the same licences.
  • As the UK has a variety of different types of council, depending on where exactly you are, the postcode lookup now includes a shortcuts dictionary in its result, with two keys, “council” and “ward”. In one-tier areas, the values will simply be the IDs of that postcode’s council and ward (whether it’s a Metropolitan district, Unitary authority, London borough, or whatever); in two-tier areas, the values will again be dictionaries with keys “district” and “county”, pointing at the respective IDs. This should hopefully make lookups of councils easier.
  • Lastly, to enable use directly on other sites with JavaScript, MaPit now sends out an “Access-Control-Allow-Origin: *” header, and allows you to specify a JSON callback with a callback parameter (e.g. put “?callback=foo” at the end of your query to have the JSON results wrapped in a call to the foo() function). JSONP calls will always return a 200 response, to enable the JavaScript to access the contents – look for the “error” key to see if something went wrong.
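
For anyone consuming these results from code, here is a minimal Python sketch of coping with the two shapes of the “council” shortcut and with a JSONP-wrapped response. The function names are our own invention for illustration, and the exact wrapper text MaPit emits (assumed here to be callback-name, open bracket, JSON, close bracket) is an assumption rather than something taken from the MaPit source.

```python
import json
import re


def council_ids(shortcut):
    """Return every council ID named by a "council" shortcut value.

    One-tier areas give a single ID; two-tier areas give a dictionary
    of IDs, so we return a list in both cases.
    """
    if isinstance(shortcut, dict):
        return sorted(shortcut.values())
    return [shortcut]


def unwrap_jsonp(body, callback):
    """Strip an assumed `callback(...)` wrapper and decode the JSON inside.

    JSONP responses always arrive with a 200 status, so failure has to be
    detected by looking for the "error" key in the decoded result.
    """
    pattern = r"\s*%s\((.*)\)\s*;?\s*" % re.escape(callback)
    match = re.fullmatch(pattern, body, re.DOTALL)
    if match is None:
        raise ValueError("response is not wrapped in %s(...)" % callback)
    payload = json.loads(match.group(1))
    if isinstance(payload, dict) and "error" in payload:
        raise LookupError(payload["error"])
    return payload
```

Returning a list from council_ids means calling code never has to care which tier structure a postcode falls under.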

Phew! I hope you find this a useful resource for getting at administrative geographic data; please do let us know of any uses you make of the site.

Outlook attachments now viewable in WhatDoTheyKnow

Posted in Developers, News, Technical, WhatDoTheyKnow

When a bit of government forwards or attaches emails using Outlook, they get sent using a special, strange Microsoft email format. Up until now, WhatDoTheyKnow couldn’t decode it. You’d just see a weird attachment on the response to your Freedom of Information request, and probably not be able to do anything with it.

Peter Collingbourne got fed up with this, and luckily for us, he can code too. He forked our source code repository, and made a nice patch in his own copy of it.

He then told us about it, and I merged his changes into the main WhatDoTheyKnow code, tested them out on my laptop, then made them live. It all worked perfectly first time. Peter even added the new dependency on vpim to WhatDoTheyKnow conf/packages.

Now if you go to an Outlook attachment on WhatDoTheyKnow, such as this one, you’ll just see the files, and be able to download them, and view them as HTML as normal. They’ll also get indexed by the search (although I need to do a rebuild for that for it to work with old requests).

Thanks Peter!

If you want to have a go making an improvement to a mySociety site, you can get the code for most of them from our github repositories. For some sites, there’s an INSTALL.txt file explaining how to get a development environment set up. Let us know if you do anything – even incremental improvements to installation instructions are really useful. And new, useful, features like Peter’s are even more so.

What are the two sorts of Cloud infrastructure called?

Posted in Developers, Mapumental, Technical, Thoughts

I’ve been doing lots of research around “cloud computing” recently, so we can change how Mapumental works and take it out of private beta.

One thing that’s struck me is that there doesn’t seem to be a proper, industry standard name to distinguish what to me are two fundamentally different sorts of “cloud computing”. I’m focusing here entirely on cloud services for programmers (let’s leave what it means to end users or businesses for another day).

Here are my own names and descriptions of them:

1) Cloud hardware server provision (Cloud HSP)
Low level APIs for making and destroying (virtual) servers, and loading machine images onto them. e.g. Amazon Elastic Compute Cloud, Rackspace Cloud Servers, Eucalyptus’s EC2 bits. Basically, what Eucalyptus v 1.5 can do and what libcloud should do. (By analogy, this is the assembly language of cloud computing)

2) Cloud developer service provision (Cloud DSP)
A service that a developer accesses with one name and a simple API, and behind the scenes it scales for him, automatically. e.g. Amazon Simple Queue Service, Rackspace Cloud Files. (By analogy, this layer is the C programming language of cloud computing)

[as an aside, Google AppEngine is an interesting one. It is definitely in the Cloud DSP category, but I think it is larger than that - it is a whole set of APIs all in that category. Something like Google DataStore is a single Cloud DSP, albeit one apparently only accessible within AppEngine apps]

It’s possible to use a Cloud HSP (assembly language), along with a bunch of your own software or open source software, to build new Cloud DSPs (C code). Right now this is pretty hard – even quite well known open source distributed databases like CouchDB still need scripting to even make them replicate. The code that makes and destroys servers and gives the service one name needs manually stringing together with quite new bits of wire (things like scalr and Wackamole).

For this reason, I’m reluctant for mySociety to get into the “making our own Cloud DSP out of Cloud HSP” game. It feels to me like a suck of time, and like we wouldn’t be able to guarantee without lots of careful and expensive testing that it would scale. I’m more tempted to use the commercial Cloud DSP services where possible, even though they are proprietary. But use them via our own abstraction layer, so we can change as we need to. Of course, we have some C++ code (the public transport route finder), so will have to use the Cloud HSP API to get that going, perhaps with Amazon’s Auto Scaling. But it can jolly well use SQS and S3 to talk to other services.
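
To make the “own abstraction layer” idea concrete, here is a minimal Python sketch, not actual mySociety code. The class and function names (MessageQueue, InMemoryQueue, request_route) are hypothetical; the point is only that application code talks to a small interface, and a proprietary queue service would sit behind a second implementation of the same two methods.

```python
from abc import ABC, abstractmethod
from collections import deque
from typing import Optional


class MessageQueue(ABC):
    """The only queue API the rest of the application is allowed to see."""

    @abstractmethod
    def send(self, body: str) -> None:
        """Add one message to the queue."""

    @abstractmethod
    def receive(self) -> Optional[str]:
        """Return the oldest message, or None if the queue is empty."""


class InMemoryQueue(MessageQueue):
    """Stand-in backend for tests; a hosted service would get its own class."""

    def __init__(self) -> None:
        self._messages = deque()

    def send(self, body: str) -> None:
        self._messages.append(body)

    def receive(self) -> Optional[str]:
        return self._messages.popleft() if self._messages else None


def request_route(queue: MessageQueue, postcode: str) -> None:
    """Hypothetical application code: it never names a vendor."""
    queue.send("route:" + postcode)
```

Swapping providers then means writing one new subclass, rather than touching every caller.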

So, what do you think about the names Cloud HSP/DSP? Are there already existing names for the distinction that I’m making? Is it a useful distinction for you? Can you think of better names?

WhatDoTheyKnow growing pains (and Ruby memory leaks)

Posted in Developers, Technical, WhatDoTheyKnow

WhatDoTheyKnow keeps growing and growing, sucking people in from Google as its archive of maybe 8.5% of Freedom of Information requests gets more and more detailed.


(Graph: number of FOI requests made using WhatDoTheyKnow over time)

There’s round about 8GB of unfettered Government data in the core database, plus a whole bunch more for indexing and caching. For comparison, TheyWorkForYou (which now goes back to 1935) has 12GB. And it’s catching up on traffic also – WhatDoTheyKnow has about half as many visitors as TheyWorkForYou.

Unfortunately, this new found traffic has led to performance problems. You might have seen errors when using WhatDoTheyKnow in the last week or two. This post is firstly an apology for that. Thank you for your patience. Hopefully it is fixed now – do let us know if you get problems still. And secondly it is some techy stuff about debugging such problems in Ruby on Rails…

When WhatDoTheyKnow started failing, we did the obvious things to start with – moving the database to a separate server, and moving some other services off the same server, to give WDTK more room to breathe. It still kept breaking.

None of my server monitoring tools shed any very clear light as to the problem. I upgraded to the latest version of Passenger, the best Rails deployment tool I’ve seen yet. It’s pretty good, but still not mature enough for my liking. I was still getting the same problems with it, but reporting tools like passenger-memory-stats were really helpful.

Eventually I worked out that it was to do with memory use of the Rails processes. Individual ones would leap up to 1GB, and never drop back down. If several did, the server (with 4GB of RAM) would start swapping and grind to a halt. The world of Ruby and Rails memory monitoring software is patchwork at best, and in the end I found the simplest tools the most useful. Here are some:

  • I found some Rails processes were getting jammed, and not dying even when I restarted Apache. I think in the end this was due to the Passenger spawning method, and our use of the Xapian Ruby module. Running Passenger in RailsSpawnMethod conservative mode made things much more robust.
  • Monit, which in a previous life had a job holding up vital structural pillars of buildings with duct tape, makes you feel dirty. Actually it is really useful. Given I couldn’t quickly fix the problem, Monit let me at least reduce the suffering for people trying to use the site meanwhile. Here’s the rule I used, which gives Apache a kick every time server memory use is too high. It was firing every 5 or 10 minutes…
    check system localhost
        if memory > 3500 MB then exec "/usr/sbin/apache2ctl graceful"
  • I found memory_profiler on a blog. It helps you find the kind of memory leak where you unintentionally continue to reference an object you don’t use any more, and it specialises in string objects. This led to a fix to do with declaring static arrays in classes vs. modules, which I still don’t really understand. But it wasn’t the cause of the big 1GB memory munching: there were no leaks of this sort large enough.
  • The record_memory function in WDTK’s application controller came from another blog. It’s handy as it shows you how much each request increases the memory used by the Ruby process. With caveats, this was the best way for me to identify the most damaging requests (search results, and certain public body pages). And it also brought focus on the actual problem – the peak memory use during a request. That’s really important, because Ruby’s memory manager never returns memory to the operating system… The gigabyte leaps in memory use were because of temporary memory used during certain requests, which the Ruby memory manager then never frees later.
  • I made a bunch of functions culminating in allocated_string_size_around_gc. This was really useful in use with the “just add lots of print statements and fiddle” school of debugging. Not everyone’s favourite school, but if your test code can’t catch it, one I often end up using (it gets really involved rarely enough that it doesn’t seem worth setting up an interactive debugger). It led me to various peak memory savings, such as calling “text.gsub!” rather than “text = text.gsub” while removing email addresses and private information from FOI request responses, which helps quite a bit when dealing with multi-megabyte attachments.
  • Finally, I used the overlooked debugging tool, and the one you should never rely on, being common sense. That is, common sense informed by days of careful use of all the other tools. In order to quickly show text extracts when searching, WDTK stores the extracted attachment text in the database. A few of these attachments are quite large, and led to 50MB fields, often several of which were being loaded and processed in one page request. That this would cause a high peak of memory use became obvious to me some time yesterday. I checked that that was the case, and this morning, I changed it to use the full text for indexing, but to keep at most 1MB for use in snippets. So sometimes now you won’t get a good search extract for queries, but it is rare, and it will at least still return the right result.
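
Two of the ideas above are easy to show in miniature. This is a Python sketch of the same techniques, not the Ruby code in WDTK: the record_memory name is borrowed from the post, but the implementation, the snippet_text helper and the exact 1MB constant are our own assumptions. It uses the Unix-only resource module, whose ru_maxrss figure is the peak resident size of the process (kilobytes on Linux, bytes on macOS).

```python
import resource
from functools import wraps

SNIPPET_LIMIT_BYTES = 1024 * 1024  # assumed meaning of "at most 1MB"


def record_memory(handler):
    """Record how far a request handler pushes up the process's peak memory."""

    @wraps(handler)
    def wrapper(*args, **kwargs):
        before = resource.getrusage(resource.RUSAGE_SELF).ru_maxrss
        result = handler(*args, **kwargs)
        after = resource.getrusage(resource.RUSAGE_SELF).ru_maxrss
        # Peak usage never goes down, so this is zero or positive.
        wrapper.last_growth = after - before
        return result

    wrapper.last_growth = 0
    return wrapper


def snippet_text(full_text, limit=SNIPPET_LIMIT_BYTES):
    """Return the part of the text to store for search extracts.

    The full text is still what gets indexed; only the stored copy is
    capped. Cutting the UTF-8 bytes and ignoring decode errors drops a
    character that was split by the cut instead of corrupting it.
    """
    encoded = full_text.encode("utf-8")
    if len(encoded) <= limit:
        return full_text
    return encoded[:limit].decode("utf-8", errors="ignore")
```

Logging last_growth per request is a crude measure, but it points straight at the handlers whose temporary allocations set a new high-water mark.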

I’ve more work to do, I think there are quite a few other quick wins, all of which are making the site faster too. I’m quite happy that WhatDoTheyKnow also has a bunch more test code as a result of all this.

On the other hand, what a disappointing disaster for open source languages beginning with P/R (as opposed to J). Yes, the help and tools were just about there to work it out, but would seem primitive if you’d used say Java’s Memory Analyzer. Indeed somebody over on StackOverflow suggested running your site in JRuby and using exactly that tool…

How Mapumental works

Posted in Developers, Mapumental

Here is a diagram of how the backend of Mapumental works. Take it in the spirit that Chris Lightfoot set when he made a similar diagram for the No. 10 petitions site – although many such diagrams are useless, hopefully this one contains useful information.

If you haven’t seen Mapumental yet, first take a look at the video, and sign up for the private beta.

(Diagram: early architecture of the Mapumental backend)

Below, I’ve explained what the main components are, and some interesting things about them.

Everything can, at least in theory, run on lots of servers. Currently we are only actually using one server for web requests, because of problems with HAProxy. We’re running isodaemons on two different servers.

Basic web application – it started out as raw Python, but the more Matthew hacks on it the more Django libraries he pulls in. Soon it’ll be indistinguishable from a Django app. When someone enters a new postcode, it adds it to the work queue in the PostgreSQL database, then refreshes waiting for the job to be finished. Then it displays the flash application (made by Stamen), set up to load the appropriate tile layers.
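
The queue-then-poll pattern described here can be sketched in a few lines of Python. This is an illustration only: SQLite stands in for PostgreSQL so the example is self-contained, and the table, column and function names are all hypothetical rather than taken from the Mapumental code.

```python
import sqlite3


def make_queue(conn):
    """Create the (hypothetical) jobs table if it does not exist yet."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS jobs ("
        " postcode TEXT PRIMARY KEY,"
        " state TEXT NOT NULL DEFAULT 'queued')"
    )
    conn.commit()


def enqueue(conn, postcode):
    """Web app side: ask for a postcode; repeat requests are harmless."""
    # INSERT OR IGNORE is SQLite syntax; PostgreSQL would need its own form.
    conn.execute("INSERT OR IGNORE INTO jobs (postcode) VALUES (?)", (postcode,))
    conn.commit()


def claim_next(conn):
    """Worker side: take the oldest queued job, or None if there is none."""
    row = conn.execute(
        "SELECT postcode FROM jobs WHERE state = 'queued' ORDER BY rowid LIMIT 1"
    ).fetchone()
    if row is None:
        return None
    conn.execute("UPDATE jobs SET state = 'working' WHERE postcode = ?", row)
    conn.commit()
    return row[0]


def finish(conn, postcode):
    """Worker side: mark the job as complete."""
    conn.execute("UPDATE jobs SET state = 'done' WHERE postcode = ?", (postcode,))
    conn.commit()


def is_done(conn, postcode):
    """Web app side: what the refreshing page checks on each reload."""
    row = conn.execute(
        "SELECT state FROM jobs WHERE postcode = ?", (postcode,)
    ).fetchone()
    return row is not None and row[0] == "done"
```

With several workers, claiming a job would need to be a single atomic statement or a locked transaction; this sketch assumes one worker at a time.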

Tile server and cache – This uses the Python-based TileCache, calling Geospatial Data Abstraction Library (GDAL) to help render the tiles from points. It was originally written by Stamen, and expanded by mySociety. GDAL isn’t perfect, it doesn’t have fancy enough algorithms for my liking. e.g. Using a median rather than a weighted mean.

Isodaemons – These are controlled by a Python script, but the bulk of the code is custom written in C++. Slightly crazily, this can find the quickest route by public transport for each of 300,000 journeys from every station in the UK to a particular station, arriving at a particular time, in 10 to 30 seconds.

I had no idea how to do this, but luckily I live in Cambridge, UK. It’s a city fit to bursting with computer scientists. Many of the jobs are dull, and need little computing, never mind science – like writing interface layers for SQL server. So if you have a real interesting problem it’s easy to get help!

The universal advice was to use Dijkstra’s algorithm, which needed a bit of adaptation to work efficiently over space-time, rather than just space. Normally it is used for planning routes round a map, but public transport isn’t like that, you have to arrive in time for each particular train, so time affects what journeys you can take.
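
Here is a toy Python sketch of that adaptation, written for this post rather than lifted from the C++ isodaemon. It searches backwards from the destination: each station is labelled with the latest time you could leave it and still arrive by the deadline. The data layout (tuples of origin, destination, departure and arrival in minutes after midnight) is an assumption, and interchange times and walking are ignored.

```python
import heapq
from collections import defaultdict


def latest_departures(connections, target, arrive_by):
    """Return {station: latest departure time that still reaches target}.

    connections is an iterable of (origin, destination, depart, arrive)
    tuples, one per timetabled hop, with depart <= arrive.
    """
    incoming = defaultdict(list)
    for origin, destination, depart, arrive in connections:
        incoming[destination].append((origin, depart, arrive))

    latest = {target: arrive_by}
    heap = [(-arrive_by, target)]  # negate times so heapq acts as a max-heap
    while heap:
        negated, station = heapq.heappop(heap)
        time_here = -negated
        if time_here < latest[station]:
            continue  # stale entry: a later departure was already found
        for origin, depart, arrive in incoming[station]:
            # The hop is only usable if it gets in before we need to leave.
            if arrive <= time_here and depart > latest.get(origin, float("-inf")):
                latest[origin] = depart
                heapq.heappush(heap, (-depart, origin))
    return latest
```

The travel time for a station is then arrive_by minus its label, and stations missing from the result cannot reach the target in time at all.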

I originally wrote it in Python, which was not only too slow, but used up far far too much RAM. It could never have loaded the whole dataset in. However, the old Python code is still run by the test script, to double check the C++ code against. It is also still used to make the binary timetable files, see below.

Travel times, 1 binary file / postcode – I briefly attempted to insert 300,000 rows into PostgreSQL for each postcode looked up, but it was obvious it wasn’t going to scale. Going back to basics, it now just saves the time taken to travel to each station in a simple binary file – two bytes for each station, 600k in total. The tile server then does random access lookups into that file, as it renders each tile. It only needs to look up the values for the stations it knows are on/near the tile.
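
The file format is simple enough to sketch with Python’s struct module. The byte order (little-endian), the unit (minutes) and the function names below are assumptions for illustration; the post only says that each station gets two bytes.

```python
import struct

RECORD = struct.Struct("<H")  # one unsigned 16-bit value per station


def write_travel_times(path, minutes_by_station):
    """Write one two-byte record per station, in station-index order."""
    with open(path, "wb") as handle:
        for minutes in minutes_by_station:
            handle.write(RECORD.pack(minutes))


def read_travel_time(path, station_index):
    """Random-access lookup: seek straight to the station's record."""
    with open(path, "rb") as handle:
        handle.seek(station_index * RECORD.size)
        (minutes,) = RECORD.unpack(handle.read(RECORD.size))
    return minutes
```

At two bytes each, 300,000 stations come to 600,000 bytes per postcode, and a tile render only has to read the handful of records for stations on or near that tile.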

There’s various other bits:

Thanks to everyone who helped make Mapumental, we couldn’t have done it without lots of clever people.

I realise the above is a sketchy overview, so please ask questions in the comments, and I’ll do my best to answer them.

RIP Angie Martin 1974-2009

Posted in Developers, Events, News

It is with overwhelming sadness that I write to tell our community that Angie Martin, mySociety’s fourth core developer, has died. She was taken from us by the cancer that she had been fighting since soon after we hired her less than two years ago.

Possessed of an almost unbelievably upbeat personality, Angie brought not only her formidable Perl skills, but her blazing warmth of character to our team. In remission during our yearly retreat in January this year, she combined laughter with a typically tough line of questioning on ideas she thought insufficiently robust. With typical disregard for cool, her CV noted that she was “known to enjoy wrangling regular expressions on a Sunday Morning”. She didn’t see any contradiction between being a successful woman and a geek, throwing herself wholeheartedly into the Mac-toting, perlmonger ethos. She even brought her husband Tommy with her, who became a significant volunteer.

Given her habit of plain speaking, it is pointless to pretend that Angie was able to make the contribution to mySociety’s users or codebase that she wanted to. What she achieved in terms of difficult coding during recovery from chemotherapy was incredible, breathtaking – but she wanted to change the world. It now falls to the rest of us, and our supporters, to live up to the expectations she embodied, to continue to push every day, using skills like those that she had to help people with everyday problems. We now have to ask ‘What would Angie do?’, as well as ‘What would Chris do?’. It is a lot to live up to.

She was a mySociety core developer: I hope that meant as much to her as it meant for me to have her as one of my coders.  Remember and Respect.

Updated: Angie changed her surname upon getting married, a couple of months ago. I have just read she wanted to be remembered as Angie Martin, and so I have made that change. Read this tribute on the Lasso list.

Updated 21 7 2009: Tommy has just told me that those wishing to make memorial donations should send them to Hospice at Home.
