Here is a diagram of how the backend of Mapumental works. Take it in the spirit that Chris Lightfoot set when he made a similar diagram for the No. 10 petitions site – although many such diagrams are useless, hopefully this one contains useful information.
(Click on the diagram for a large version)
Below, I’ve explained what the main components are, and some interesting things about them.
Everything can, at least in theory, run on lots of servers. Currently we are only actually using one server for web requests, because of problems with HAProxy. We’re runnning isodaemons on two different servers.
Basic web application – it started out as raw Python, but the more Matthew hacks on it the more Django libraries he pulls in. Soon it’ll be indistinguishable from a Django app. When someone enters a new postcode, it adds it to the work queue in the PostgreSQL database, then refreshes waiting for the job to be finished. Then it displays the flash application (made by Stamen), set up to load the appropriate tile layers.
Tile server and cache – This uses the Python-based TileCache, calling Geospatial Data Abstraction Library (GDAL) to help render the tiles from points. It was originally written by Stamen, and expanded by mySociety. GDAL isn’t perfect, it doesn’t have fancy enough algorithms for my liking. e.g. Using a median rather than a weighted mean.
Isodaemons – These are controlled by a Python script, but the bulk of the code is custom written in C++. Slightly crazily, this can find the quickest route by public transport for each of 300,000 journeys from every station in the UK to a particular station, arriving at a particular time, in 10 to 30 seconds.
I had no idea how to do this, but luckily I live in Cambridge, UK. It’s a city fit to bursting with computer scientists. Many of the jobs are dull, and need little computing, never mind science – like writing interface layers for SQL server. So if you have a real interesting problem it’s easy to get help!
The universal advice was to use Dijkstra’s algorithm, which needed a bit of adaptation to work efficiently over space-time, rather than just space. Normally it is used for planning routes round a map, but public transport isn’t like that, you have to arrive in time for each particular train, so time affects what journeys you can take.
I originally wrote it in Python, which was not only too slow, but used up far far too much RAM. It could never have loaded the whole dataset in. However, the old Python code is still run by the test script, to double check the C++ code against. It is also still used to make the binary timetable files, see below.
Travel times, 1 binary file / postcode – I briefly attempted to insert 300,000 rows into PostgreSQL for each postcode looked up, but it was obvious it wasn’t going to scale. Going back to basics, it now just saves the time taken to travel to each station in a simple binary file – two bytes for each station, 600k in total. The tile server then does random access lookups into that file, as it renders each tile. It only needs to look up the values for the stations it knows are on/near the tile.
There’s various other bits:
- cron jobs for sending out invites
- converting timetable data from ATCO-CIF to the binary format
- loading static layer data into the database
- precaching every tile for static datasets
- Squid and Apache and FastCGI both sit in front of the web applications
- for speed, we cache the mapping background tiles from Cloudmade
- when zoomed out, there is code to cull which stations are used to draw tiles
- of course, a bunch of test code
Thanks to everyone who helped make Mapumental, we couldn’t have done it without lots of clever people.
I realise the above is a sketchy overview, so please ask questions in the comments, and I’ll do my best to answer them.
Debate pages that have at least one timestamped speech (such as the previously mentioned last week’s Prime Minister’s Questions) have a video fixed to the bottom right hand corner (if your browser is recent enough) showing that debate. While playing the video, the currently playing speech is highlighted with a yellow background, and you can start watching from any timestamped speech by clicking the “Watch this” link by any such speech. So how does all that work?
I’m very proud of this feature, I wasn’t sure it would be possible, and it’s very exciting. 🙂
Talking of our busy timestampers, I’ve also been busy making improvements (and fixing bugs) to the timestamping interface to make things easier for them. As well as warnings when it looks like two people are timestamping the same debate at the same time, various invisible things have been changed, such as using other people’s timestamps to make the start point for future timestamps on the same day more accurate. I also added a totaliser, using the Google Chart API, for which you simply have to provide image size and percentage complete.
Approaching 45% of our entire archive of video timestamped, with the totaliser approaching the chartreuse 🙂
Our video is streamed via progressive HTTP, using lighttpd and mod_flv_streaming. This works by having keyframe metadata at the start of the FLV (Flash video) file (we add ours using yamdi as that doesn’t load the whole file into memory first), which maps times within the video to byte positions within the file. When someone drags the position slider, or presses a skip button, the player actually changes the source of the video to something like file.flv?start=<byte position> which starts a new download from that point in the video. This means you can seek to parts of the video not yet downloaded, which is definitely a required feature.
The video is split up into programme chunks, according to BBC Parliament’s schedule, so each Oral Questions will (approximately) be its own video chunk, and the main debates will be a couple of chunks. By default, the video player will show a screengrab from the start of the video, as that’s all that’s available when it first loads (you have to load the start of the FLV file to fetch the keyframe metadata in order to move anywhere else 🙂 ). I wanted the player to show a relevant screengrab before you hit Play, so came up with the slightly messy workaround of setting the volume to 0, seeking and playing the video for under a second in order to start it from the new point and show the video, then stopping it and resetting the volume. It works most of the time 🙂
Some of our video chunks have jumps in them, due to problems in downloading the original WMV stream. The timestamping interface has a link for people to let us know of such problems, so that we can mark the relevant speeches as missing video and not have them be offered to future timestampers. One valiant volunteer, Tim, let us know about two such videos, but with the added oddity that if you let them play, they would happily carry on past their “end” point, but this made timestamping those speeches quite difficult.
I started investigating, firstly noting that both videos should have been 6 hours long, but were both listed as 1:20:24, which I thought was a bit of an odd coincidence. After reading the FLV file specification, it turned out that 32-bit millisecond timestamps in FLV are split into two – first the low 24 bits, then the high 8 bits. 2^24 = 16,777,216, which in milliseconds is 4 hours, 39 minutes, 37 seconds, which is pretty much exactly what the two videos’ durations were short by! All the timestamps in our FLV files were not setting the high byte, so after 4:39:37, they were wrapping round to 0 (and thus 6 hours became 1:20:24ish).
Our video processing consists of four major steps – the downloading script uses ffmpeg to convert each 75 minute chunk from WMV to MPEG; then nightly processing uses ffmpeg again to convert the right bits of these MPEG files to FLV, mencoder to join the relevant FLV files into one FLV chunk, then yamdi to add the metadata. My first try at a solution was to alter yamdi to increment the high byte itself, which fixed the duration display and let you seek to high times, but when you tried to go to e.g. 5 hours, the video started playing from the right point but the video thought it was playing from 20 minutes in. This would obviously confuse timestamping!
As the FLV files produced by ffmpeg were all under 75 minutes long, they couldn’t have the problem. It turned out we were running an old version of mencoder, and updating that and converting all our long video files fixed the problem. Phew 🙂
Join us later today for my third short technical talk on TheyWorkForYou video, where I’ll explain how our Flash application talks to the HTML and vice-versa to enable the “Watch this” and highlighting of speeches.
TheyWorkForYou video timestamping has been launched, over 40% of available speeches have already been timestamped, and (hopefully) all major bugs have been fixed, so I can now take a short breather and write this short series of more technical posts, looking at how the front end bits I wrote work and hang together.
Let’s start with the most obvious feature of video timestamping – the video player itself. 🙂 mySociety is an open-source shop, so it was great to discover that (nearly all of) Adobe Flex is available under the Mozilla Public Licence. This meant I could simply download the compiler and libraries, write some code and compile it into a working SWF Flash file without any worries (and you can do the same!).
To put a video component in the player is no harder than including an <mx:VideoDisplay> element – set the source of that, and you have yourself a video player, no worrying about stream type, bandwidth detection, or anything else. 🙂 You can then use a very useful feature called data binding to make lots of things trivial – for example, I simply set the value of a horizontal slider to be the current playing time of the video, and the slider is then automatically in the right place at all times. On the downside, VideoDisplay does appear to have a number of minor bugs (the most obvious one being where seeking can cause the video to become unresponsive and you have to refresh the page; it’s more than possible it’s a bug in my code, of course, but there are a couple of related bugs in Adobe’s bug tracker).
As well as the buttons, sliders and the video itself, the current MXML contains two fades (one to fade in the hover controls, one to fade them out), one time formatter (to format the display of the running time and duration), and three web services (to submit a timestamp result, delete a mistaken timestamp, and fetch an array of all existing timestamps for the current debate). These are all called from various places within the ActionScript when certain events happen (e.g. the Now button or the Oops button is clicked).
Compiling is a simple matter of running mxmlc on the mxml file, and out pops a SWF file. It’s all straightforward, although a bit awkward at first working again with a strongly-typed, compiled language after a long time with less strict ones 🙂 The documentation is good, but it can be hard to find – googling for [flex3 VideoDisplay] and the like has been quite common over the past few weeks.
Tomorrow I will talk about moving around within the videos and some bugs thrown up there, and then how the front end communicates with the video in order to highlight the currently playing speech – for example, have a look at last week’s Prime Minister’s Questions.
Over the weekend, and this morning, I’ve been updating the Public Whip and TheyWorkForYou database of MPs. This was much easier this year thanks, amazingly to Macromedia Flash. The BBC have a fantastic animated constituency map which is made in flash. When you click on a constituency it gives you the results. Now, one little know thing about flash is that it uses XML to communicate with the server. This means that any data it downloads must be in XML.
Further investigating reveals that you can download results data from URLs like http://news.bbc.co.uk/1/shared/vote2005/flash_map/resultdata/200.xml. So I wrote a script to download these and convert them to XML. May I present a list of all the new MPs. Now, just need to wait for them to start talking and voting so we can build up a record of them. Meanwhile, on to some mySociety work; fixing up WriteToThem…
Praise be to Macromedia!