Here is a diagram of how the backend of Mapumental works. Take it in the spirit that Chris Lightfoot set when he made a similar diagram for the No. 10 petitions site – although many such diagrams are useless, hopefully this one contains useful information.
Below, I’ve explained what the main components are, and some interesting things about them.
Everything can, at least in theory, run on lots of servers. Currently we are only actually using one server for web requests, because of problems with HAProxy. We’re running isodaemons on two different servers.
Basic web application – it started out as raw Python, but the more Matthew hacks on it the more Django libraries he pulls in. Soon it’ll be indistinguishable from a Django app. When someone enters a new postcode, it adds it to the work queue in the PostgreSQL database, then keeps refreshing until the job has finished. Then it displays the Flash application (made by Stamen), set up to load the appropriate tile layers.
Tile server and cache – This uses the Python-based TileCache, calling the Geospatial Data Abstraction Library (GDAL) to help render the tiles from points. It was originally written by Stamen, and expanded by mySociety. GDAL isn’t perfect: it doesn’t have fancy enough algorithms for my liking, for example using a median rather than a weighted mean.
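To make the difference between a median and a weighted mean concrete, here is a minimal Python sketch. It is not GDAL's code and not what our tile server runs; the function names, the sample points and the choice of inverse-distance weighting are all assumptions made purely for illustration.

```python
import math
import statistics


def median_value(samples):
    """Median of the sample values, ignoring where the samples are."""
    return statistics.median(value for _x, _y, value in samples)


def inverse_distance_weighted_mean(samples, px, py, power=2.0):
    """Mean of the sample values, weighted so nearer samples count for more.

    `samples` is a list of (x, y, value) tuples; (px, py) is the pixel
    being coloured. The weighting scheme is an illustrative assumption.
    """
    numerator = 0.0
    denominator = 0.0
    for x, y, value in samples:
        distance = math.hypot(x - px, y - py)
        if distance == 0:
            # The pixel sits exactly on a sample, so use its value directly.
            return float(value)
        weight = 1.0 / distance ** power
        numerator += weight * value
        denominator += weight
    return numerator / denominator


# Hypothetical travel times (minutes) at three points along a line.
example_samples = [(0, 0, 10), (1, 0, 20), (10, 0, 90)]
example_median = median_value(example_samples)
example_weighted = inverse_distance_weighted_mean(example_samples, 0.5, 0)
```

The median gives the same answer wherever the pixel is, while the weighted mean shifts towards whichever samples are closest, which is why the choice matters when colouring a map.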
Isodaemons – These are controlled by a Python script, but the bulk of the code is custom written in C++. Slightly crazily, this can find the quickest route by public transport for each of 300,000 journeys from every station in the UK to a particular station, arriving at a particular time, in 10 to 30 seconds.
I had no idea how to do this, but luckily I live in Cambridge, UK. It’s a city fit to bursting with computer scientists. Many of the jobs are dull, and need little computing, never mind science – like writing interface layers for SQL Server. So if you have a really interesting problem it’s easy to get help!
The universal advice was to use Dijkstra’s algorithm, which needed a bit of adaptation to work efficiently over space-time, rather than just space. Normally it is used for planning routes round a map, but public transport isn’t like that: you have to arrive in time for each particular train, so time affects which journeys you can take.
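For the curious, the space-time idea can be sketched in a few lines of Python. This is a toy illustration and not the isodaemon's C++ code: it searches backwards from the destination and records, for each station, the latest time you could leave and still arrive by the deadline. The function name, the `(from, depart, to, arrive)` tuple layout, the minutes-since-midnight times and the lack of any interchange allowance are all assumptions of the sketch.

```python
import heapq
from collections import defaultdict


def latest_departures(services, target, arrive_by):
    """Latest time you can leave each station and still reach `target`
    by `arrive_by`. Times are minutes since midnight (an assumption).

    `services` is a list of (from_station, depart, to_station, arrive) hops.
    """
    # Index hops by the station they arrive at, because we search backwards.
    incoming = defaultdict(list)
    for origin, depart, dest, arrive in services:
        incoming[dest].append((origin, depart, arrive))

    latest = {target: arrive_by}
    # heapq is a min-heap, so store negated times to pop the latest first.
    heap = [(-arrive_by, target)]
    while heap:
        neg_time, station = heapq.heappop(heap)
        time_here = -neg_time
        if time_here < latest[station]:
            continue  # stale entry: a later departure was already found
        for origin, depart, arrive in incoming[station]:
            # A hop only helps if it arrives in time to continue onwards,
            # and if it leaves later than the best option found so far.
            if arrive <= time_here and depart > latest.get(origin, -1):
                latest[origin] = depart
                heapq.heappush(heap, (-depart, origin))
    return latest


# Hypothetical timetable: two trains A to B, two trains B to C.
example_services = [
    ("A", 480, "B", 500),
    ("A", 510, "B", 530),
    ("B", 505, "C", 540),
    ("B", 535, "C", 570),
]
example_result = latest_departures(example_services, "C", 545)
```

With a deadline of 545 at C, only the 505 train from B is usable, which in turn means leaving A at 480, so the journey from A takes 65 minutes. Stations that never appear in the result cannot reach the target in time.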
I originally wrote it in Python, which was not only too slow, but used up far, far too much RAM. It could never have loaded the whole dataset in. However, the old Python code is still run by the test script, to double check the C++ code against. It is also still used to make the binary timetable files (see below).
Travel times, 1 binary file / postcode – I briefly attempted to insert 300,000 rows into PostgreSQL for each postcode looked up, but it was obvious it wasn’t going to scale. Going back to basics, it now just saves the time taken to travel to each station in a simple binary file – two bytes for each station, 600k in total. The tile server then does random access lookups into that file, as it renders each tile. It only needs to look up the values for the stations it knows are on/near the tile.
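As a rough sketch of the two-bytes-per-station idea, here is how such a file could be written and read with random access in Python. The byte order, the use of an unsigned value, the station index scheme and the function names are assumptions for illustration; the real file format may differ in all of these.

```python
import os
import struct
import tempfile

# Assumed record layout: one little-endian unsigned 16-bit value per station.
RECORD = struct.Struct("<H")


def write_times(path, minutes_by_station):
    """Write one fixed-size record per station, in station-index order."""
    with open(path, "wb") as f:
        for minutes in minutes_by_station:
            f.write(RECORD.pack(minutes))


def read_time(path, station_index):
    """Seek straight to one station's record instead of reading the file."""
    with open(path, "rb") as f:
        f.seek(station_index * RECORD.size)
        (minutes,) = RECORD.unpack(f.read(RECORD.size))
    return minutes


# Small round trip using a temporary file and made-up travel times.
with tempfile.TemporaryDirectory() as tmp:
    example_path = os.path.join(tmp, "example.bin")
    write_times(example_path, [0, 12, 47, 90])
    example_third = read_time(example_path, 2)
    example_size = os.path.getsize(example_path)
```

Because every record is the same size, the position of station n is simply n times two bytes, so a tile only costs a handful of seeks. At two bytes each, 300,000 stations come to 600,000 bytes, matching the 600k figure above.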
There are various other bits:
- cron jobs for sending out invites
- converting timetable data from ATCO-CIF to the binary format
- loading static layer data into the database
- precaching every tile for static datasets
- Squid, Apache and FastCGI all sit in front of the web applications
- for speed, we cache the mapping background tiles from Cloudmade
- when zoomed out, there is code to cull which stations are used to draw tiles
- of course, a bunch of test code
Thanks to everyone who helped make Mapumental, we couldn’t have done it without lots of clever people.
I realise the above is a sketchy overview, so please ask questions in the comments, and I’ll do my best to answer them.
This week has been quite bitty. I’ve been doing more work on the Freedom of Information site, and have been getting into the swing of Ruby on Rails. Once you’ve learnt its conventions, it is quite (but not super) nice.
As far as languages are concerned, Ruby seems identical in all interesting respects to Python. It’s like learning Spanish and Italian. Both are super languages. Ruby has nice conventions like exclamation marks at the end of function names to indicate they alter the object, rather than return the value (e.g. .reverse!). But then Python has a cleaner syntax for function parameters. It is swings and roundabouts.
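For comparison, Python draws the same line between altering an object and returning a new value, just without the punctuation. A minimal illustration using only built-ins (the station names are made up):

```python
stations = ["Cambridge", "Ely", "Norwich"]

# reversed() leaves the original alone and gives back a new sequence.
reversed_copy = list(reversed(stations))

# list.reverse() alters the list in place and, by convention, returns None.
in_place_result = stations.reverse()
```

After this runs, `stations` itself is reversed and `in_place_result` is `None`, which is Python's way of signalling that the method changed the object rather than returning a value.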
Rails has lots of ways of doing things which we already have our own ways of doing for other sites. The advantage of relearning them is that other people know them too. So Louise was able to easily download and run the FOI site, and make some patches to it, which would have been much harder if it was done like our other sites. Making development easier is vital – for a long time I’ve wanted a web-based, cleverly forking, web application development wiki. But while I dream about that, it helps that Rails packages everything you need to run the app in one directory, in a standard way that quite a few people know how to use.
Other things… I’ve been helping Richard set up GroupsNearYou on our live servers, it should be ready for you to play with soon. It looks super nice, and is easy to use. I’ve had some work to do with recruitment. And catching up on general customer support email for TheyWorkForYou and PledgeBank. I’ve also been updating the systems administration documentation on our internal wiki, so others can work out how to run our servers.
The meeting day voting application (vote often!) that we’ve been mentioning everywhere all week is a new departure for mySociety. In a frantic bid to catch up with the cool kids, it’s our first deployed Ruby on Rails application. This happened because Louise Crow, who kindly volunteered to make it (thanks Louise!), felt like learning Rails. We used to have a policy of using any language, as long as it was open source and began with the letter P (Python/Perl/PHP…). This has now been extended to the letter R!
You can browse the source code in our CVS repository. One interesting thing about Rails applications is that each one is a structured thing: a deployable directory tree. So are mySociety applications.
For example, take a look at PledgeBank’s directory. It’s a mini, well defined filesystem – the ‘web’ directory is the meat of the stuff, but note also ‘web-admin’ for the administrator tools. Include files are tucked away in ‘perllib’ and ‘phplib’, while script files nestle under ‘bin’. We keep configuration files (analogous to the Windows Registry, or /etc on Unix) under ‘conf’. Database schema files live in ‘db’.
And a Rails application is much the same, but much, much more detailed. Some of those are extra directories which we also have, but only when we deploy, not in CVS (for example, log files). All in all they are surprisingly similar structures, which shows we’re either both on the right lines, or both on the same false trail.
Like making Frankenstein’s monster, poor Louise and I had to graft these two beasts together just to deploy this small application. For example, we have a standard configuration file format which we read from Perl, Python and PHP. The deploy system does useful things with it like check all entries are present, and generate the file for any sandbox from a template. To get round this, there’s an evil script, possibly the first time PHP has been used to make YAML. (And please don’t look at the thing that makes symlinks.)
We could have extended Rails to be able to read its configuration from our file format, but that would be a lot more work. And we could have discovered how to hack its log file system to write to the mySociety log file directory. But everything is so coupled, it doesn’t ever seem worth it. Any Rails apps we deploy will just have to be an even more confusing mass of directories, application trees inside application trees.
(Shh, don’t tell anyone, but this post is really just so the bots find debian.mysociety.org, but I’m going to try and fill it with some other content so you don’t think I’m being too rude)
Debian’s software “packaging” system provides a big database of all the open source software in the world, and makes another smaller database of all the software installed on your computer. We’re using it on our new servers, which the sites are gradually migrating to now. When you’ve got security updates, multiple machines, and complex software dependencies, you need it.
Unfortunately, though it seems like the Debian people have packaged nearly all the software in the world, sometimes they miss things. Normally we’d just install them using the old Unix configure/make/make install. This time we’ve decided to do it properly, and make our own Debian packages. You can find them at debian.mysociety.org.
The advantage of this is that we can find out where any file on the system came from. We can easily upgrade multiple machines, and check that they all have the same software installed. This makes it much less likely that you’ll go to a corner of one of the websites and get an error because a Perl module wasn’t installed.
So far there are a few Perl module .deb files in our repository, which the handy dh-make-perl builds easily from a Perl module tarball. There’s also Xapian (a search engine library), which we use for quick lookups in Gaze (our gazetteer). That had already been packaged by the Xapian people, but for some reason I had to recompile it. Finally there’s one Python module, PyRTF, which makes RTF documents, which I just packaged (probably badly).
Anyway, this post is here to make sure anybody searching for python2.3-pyrtf on Google will find something…
Slightly late; I was “hassle”d yesterday, but was at my school’s 21st anniversary, with Terry Waite, some other past students, all the current students, teachers past and present, and a lot of balloons. So I’ve been working some of today instead, which worked out quite well, as it was beautifully sunny yesterday and pouring with rain today.
I fixed a number of bugs in various places, including one that meant all pledges would expire a day early and various problems with the reporting abuse process. I also renamed NotApathetic’s “best of” page to “busiest”, as they’re not the same thing. 😉 The PledgeBank RSS feed of new pledges will hopefully be available from the live site soon (I’ve added the HTML to make the little orange icon appear in Firefox) – if you can’t wait until then.
I think Tom wants me to work next on user-defined flyers, which will involve adding to the poster generation code all the code we removed when we moved from using text in the POST to fetching it from the database. 🙂 Not sure of the details involved, so await direction. Looks like it might involve learning RTF generation in Python, though; hope that’s possible…