It’s the first full working day for the new facility to annotate Freedom of Information (FOI) requests on WhatDoTheyKnow, and people have been hard at it.
Mr Ormerod points out that private information isn’t necessarily so private if someone has died, so perhaps the exemption the MOD used shouldn’t apply.
Trevor R Nunn has posted three annotations (e.g. this one) to show that his three FOI requests are being treated as one. The annotations facility is great for handling edge cases like this, which don’t happen often enough to be worth explicitly adding to the code, but need some mention.
And finally Edward Betts has processed the list of post boxes retrieved by FOI into a more structured data format, and posted up a link to it. Exactly the kind of collaboration I love to see!
And that’s just this morning!
One of the special pieces of magic in TheyWorkForYou is its email alerts, which send you mail whenever an MP says a word you care about in Parliament. Lots of sites these days have RSS, and lots have search, but surprisingly few offer search-based email alerts. My Mum trades shares on the Internet, setting her account to automatically buy and sell at threshold values. But she doesn’t have an RSS reader. So, it’s important to have email alerts.
So naturally, when we made WhatDoTheyKnow, search and search-based email alerts were pretty high up the list, to help people find new, interesting Freedom of Information requests. To implement this, I started out using acts_as_solr, a Ruby on Rails plugin for Solr, which is itself a REST-based layer on top of the search engine Lucene.
I found acts_as_solr all just that bit too complicated. In particular, when a feature (such as spelling correction) was missing, there were too many layers and too much XML for me to work out how to fix it. And I had lots of nasty code to make indexing happen offline – something I needed, as I want to safely store emails when they arrive, but do the risky indexing of PDFs and Word documents later.
The last straw was when I found that acts_as_solr didn’t have collapsing (analogous to GROUP BY in SQL). So I decided to bite the bullet and implement my own acts_as_xapian. Luckily there were already Xapian Ruby bindings, and also the fabulous Xapian email list to help me out, and it only took a day or two to write it and deploy it on the live site.
If you’re using Rails and need full text search, I recommend you have a look at acts_as_xapian. It’s easy to use, and has a diverse set of features. You can watch a video of me talking about WhatDoTheyKnow and acts_as_xapian at the London Ruby User Group last Monday.
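For a flavour of the interface, here is roughly what an indexed model and a query look like. This is a hedged sketch based on acts_as_xapian’s documented interface – the model name and fields are made up for illustration, not WhatDoTheyKnow’s real schema:

```ruby
# Hypothetical model; :texts are full-text fields, :values and :terms
# allow sorting/collapsing and filtering respectively.
class InfoRequest < ActiveRecord::Base
  acts_as_xapian :texts => [ :title, :description ],
                 :values => [ [ :created_at, 0, "created_at", :date ] ],
                 :terms => [ [ :status, 'S', "status" ] ]
end

# Indexing runs offline (e.g. from cron) with ActsAsXapian.update_index,
# and querying looks something like:
search = ActsAsXapian::Search.new([InfoRequest], "post boxes", :limit => 10)
search.results.each { |r| puts r[:model].title }
```

The offline indexing step is what made this a good fit here – the web request only queues the record, and the risky document parsing happens later.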
TheyWorkForYou now finds whenever an old version of Hansard is referenced (which they do by date and column number, e.g. Official Report, 29 February 2008, column 1425) and turns the citation into a link to a search for the speeches in that column on that date. This only really became feasible when we moved server, upgraded Xapian, and added date and column number metadata (among others), allowing much more advanced and focussed searching – the advanced search form gives some ideas. Perhaps in future we’ll be able to add some crowd-sourcing game to match the reference to the exact speech, much like our video matching (nearly 80% of our archive done!).
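The citation matching is, at heart, a regular expression. A hedged Ruby sketch of the idea – the pattern and the URL shape here are illustrative, not TheyWorkForYou’s actual code:

```ruby
require "cgi"

# Matches citations like "Official Report, 29 February 2008, column 1425"
CITATION = /Official Report,\s+(\d{1,2}\s+\w+\s+\d{4}),\s+column\s+(\d+)/

def linkify_hansard(text)
  text.gsub(CITATION) do |match|
    date, column = $1, $2
    # Link the whole citation to a search for that column on that date
    %(<a href="/search/?d=#{CGI.escape(date)}&c=#{column}">#{match}</a>)
  end
end
```

The date and column number then feed straight into the metadata-aware search mentioned above.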
Kudos to Google and Yahoo! for spotting this change within a couple of days, as they’re now so busy crawling everything for changes that they’re slowing the whole website down…
If you enter your postcode on TheyWorkForYou and it’s Scottish or Northern Irish, you’re now presented with your MSPs and MLAs as well as your MP, which makes sense given the site covers their Parliament and Assembly respectively. You also get an extra tab in the navigation linking through to Your MSPs or MLAs. In order to do this, I needed a quick way of determining if a postcode was Northern Irish or Scottish. Northern Ireland was easy, as all postcodes there begin with BT. I assumed Scotland was also easy, which turned out to be true apart from the TD postcode area, which straddles the border like a mail-sorting Niagara Falls. After some very dull investigation, I eventually worked out that e.g. most of TD15 is in England, but (amongst others) TD15 1X* is in Scotland, except for TD15 1XX which is apparently back in England. The final result was the postcode_is_scottish() function in postcode.inc, which (hopefully) correctly determines if a given postcode is Scottish or not – perhaps someone else will find it useful.
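The shape of the logic is easy to sketch. This is a condensed Ruby version of the idea (the real function is PHP, in postcode.inc, and handles many more border districts than this); the list of wholly-Scottish postcode areas is illustrative, and the TD15 rules are just the ones mentioned above:

```ruby
# Illustrative list of postcode areas assumed wholly Scottish
WHOLLY_SCOTTISH = %w[AB DD DG EH FK G HS IV KA KW KY ML PA PH ZE]

def postcode_is_scottish?(postcode)
  pc = postcode.upcase.gsub(/\s+/, "")
  area = pc[/\A[A-Z]{1,2}/]              # the leading letters, e.g. "TD"
  return true if WHOLLY_SCOTTISH.include?(area)
  return false unless area == "TD"
  # TD straddles the border. From the post: most of TD15 is English,
  # but TD15 1X* is Scottish -- except TD15 1XX, which is English again.
  if pc.start_with?("TD15")
    return false if pc == "TD151XX"
    return pc.start_with?("TD151X")
  end
  false  # other TD districts need their own rules, not reproduced here
end
```

Northern Ireland really is just a `start_with?("BT")` check, which is why it doesn’t appear above.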
Debate pages that have at least one timestamped speech (such as the previously mentioned last week’s Prime Minister’s Questions) have a video fixed to the bottom right-hand corner (if your browser is recent enough) showing that debate. While the video plays, the currently playing speech is highlighted with a yellow background, and you can start watching from any timestamped speech by clicking its “Watch this” link. So how does all that work?
I’m very proud of this feature; I wasn’t sure it would be possible, and it’s very exciting.
Talking of our busy timestampers, I’ve also been busy making improvements (and fixing bugs) to the timestamping interface to make things easier for them. As well as warnings when it looks like two people are timestamping the same debate at the same time, various invisible things have been changed, such as using other people’s timestamps to make the start point for future timestamps on the same day more accurate. I also added a totaliser, using the Google Chart API, for which you simply have to provide image size and percentage complete.
Approaching 45% of our entire archive of video timestamped, with the totaliser approaching the chartreuse
- The Flash player
- Highlighting the current speech
Our video is streamed via progressive HTTP, using lighttpd and mod_flv_streaming. This works by having keyframe metadata at the start of the FLV (Flash video) file (we add ours using yamdi as that doesn’t load the whole file into memory first), which maps times within the video to byte positions within the file. When someone drags the position slider, or presses a skip button, the player actually changes the source of the video to something like file.flv?start=<byte position> which starts a new download from that point in the video. This means you can seek to parts of the video not yet downloaded, which is definitely a required feature.
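The player-side half of that, sketched in Ruby for clarity: given the keyframe index from the FLV metadata (parallel arrays of times and byte positions, as yamdi writes them), map a requested seek time to the byte offset of the last keyframe at or before it, and that becomes the `start=` parameter. The data values here are made up, and this is an illustration of the technique, not mod_flv_streaming’s actual code:

```ruby
# Made-up keyframe index: times in seconds, byte positions in the file
KEYFRAME_TIMES = [0.0, 2.0, 4.0, 6.0, 8.0]
KEYFRAME_BYTES = [400, 90_210, 180_400, 270_998, 361_200]

# For a seek to t seconds, request file.flv?start=<this byte offset>
def start_offset(seconds)
  i = KEYFRAME_TIMES.rindex { |t| t <= seconds } || 0
  KEYFRAME_BYTES[i]
end
```

Because the lookup only needs the metadata, not the video data itself, the player can seek to parts of the file it hasn’t downloaded yet.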
The video is split up into programme chunks, according to BBC Parliament’s schedule, so each Oral Questions will (approximately) be its own video chunk, and the main debates will be a couple of chunks. By default, the video player shows a screengrab from the start of the video, as that’s all that’s available when it first loads (you have to load the start of the FLV file to fetch the keyframe metadata in order to move anywhere else). I wanted the player to show a relevant screengrab before you hit Play, so I came up with the slightly messy workaround of setting the volume to 0, seeking and playing the video for under a second to start it from the new point and show the frame, then stopping it and resetting the volume. It works most of the time.
Some of our video chunks have jumps in them, due to problems in downloading the original WMV stream. The timestamping interface has a link for people to let us know of such problems, so that we can mark the relevant speeches as missing video and not have them be offered to future timestampers. One valiant volunteer, Tim, let us know about two such videos, with the added oddity that if you let them play, they would happily carry on past their “end” point – which made timestamping those speeches quite difficult.
I started investigating, firstly noting that both videos should have been 6 hours long, but were both listed as 1:20:24, which I thought was a bit of an odd coincidence. After reading the FLV file specification, it turned out that 32-bit millisecond timestamps in FLV are split into two – first the low 24 bits, then the high 8 bits. 2^24 = 16,777,216, which in milliseconds is 4 hours, 39 minutes, 37 seconds, which is pretty much exactly what the two videos’ durations were short by! None of the timestamps in our FLV files were setting the high byte, so after 4:39:37 they were wrapping round to 0 (and thus 6 hours became 1:20:24ish).
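The wrap-around arithmetic from that paragraph, worked through in Ruby:

```ruby
WRAP_MS = 2**24   # low 24 bits of an FLV timestamp: 16,777,216 ms

def hms(ms)
  s = ms / 1000
  format("%d:%02d:%02d", s / 3600, (s % 3600) / 60, s % 60)
end

# 2^24 ms is where a timestamp with no high byte wraps back to 0:
hms(WRAP_MS)                 # => "4:39:37"

# So a six-hour video appears to be:
six_hours = 6 * 3600 * 1000
hms(six_hours % WRAP_MS)     # => "1:20:22" -- the "1:20:24ish" above
```

(The couple of seconds’ discrepancy from the listed 1:20:24 is just the chunk not being exactly six hours long.)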
Our video processing consists of four major steps – the downloading script uses ffmpeg to convert each 75 minute chunk from WMV to MPEG; then nightly processing uses ffmpeg again to convert the right bits of these MPEG files to FLV, mencoder to join the relevant FLV files into one FLV chunk, then yamdi to add the metadata. My first try at a solution was to alter yamdi to increment the high byte itself, which fixed the duration display and let you seek to high times, but when you tried to go to e.g. 5 hours, the video started playing from the right point but the video thought it was playing from 20 minutes in. This would obviously confuse timestamping!
As the FLV files produced by ffmpeg were all under 75 minutes long, they couldn’t have the problem. It turned out we were running an old version of mencoder; updating it and reconverting all our long video files fixed the problem. Phew!
Join us later today for my third short technical talk on TheyWorkForYou video, where I’ll explain how our Flash application talks to the HTML and vice-versa to enable the “Watch this” and highlighting of speeches.
TheyWorkForYou video timestamping has been launched, over 40% of available speeches have already been timestamped, and (hopefully) all major bugs have been fixed, so I can now take a short breather and write this short series of more technical posts, looking at how the front end bits I wrote work and hang together.
Let’s start with the most obvious feature of video timestamping – the video player itself. mySociety is an open-source shop, so it was great to discover that (nearly all of) Adobe Flex is available under the Mozilla Public Licence. This meant I could simply download the compiler and libraries, write some code and compile it into a working SWF Flash file without any worries (and you can do the same!).
To put a video component in the player is no harder than including an <mx:VideoDisplay> element – set the source of that, and you have yourself a video player, no worrying about stream type, bandwidth detection, or anything else. You can then use a very useful feature called data binding to make lots of things trivial – for example, I simply set the value of a horizontal slider to be the current playing time of the video, and the slider is then automatically in the right place at all times. On the downside, VideoDisplay does appear to have a number of minor bugs (the most obvious one being where seeking can cause the video to become unresponsive and you have to refresh the page; it’s more than possible it’s a bug in my code, of course, but there are a couple of related bugs in Adobe’s bug tracker).
As well as the buttons, sliders and the video itself, the current MXML contains two fades (one to fade in the hover controls, one to fade them out), one time formatter (to format the display of the running time and duration), and three web services (to submit a timestamp result, delete a mistaken timestamp, and fetch an array of all existing timestamps for the current debate). These are all called from various places within the ActionScript when certain events happen (e.g. the Now button or the Oops button is clicked).
Compiling is a simple matter of running mxmlc on the MXML file, and out pops a SWF file. It’s all straightforward, although it was a bit awkward at first working again with a strongly-typed, compiled language after a long time with less strict ones. The documentation is good, but it can be hard to find – googling for [flex3 VideoDisplay] and the like has been quite common over the past few weeks.
Tomorrow I will talk about moving around within the videos and some bugs thrown up there, and then how the front end communicates with the video in order to highlight the currently playing speech – for example, have a look at last week’s Prime Minister’s Questions.
I only met Harry Metcalfe a few weeks ago, when he was volunteering for the Open Rights Group.
Since then he’s dazzled us with his completely single-handed production of TellThemWhatYouThink, a site which draws most central government consultations together in one place. The main benefit for citizens is that you can get email or RSS alerts when a department decides to hold a consultation on an issue you care about. As usual, it also starts with horrid, nonstandard data in a zillion formats and turns it into nice structured data for everyone else to play with too.
It’s not a mySociety project, he’s just using our friends and family server, but he’s made it look more like a mySociety site than even we could manage. Kudos Harry!
PS Harry, along with both of the last two new volunteers to do major pieces of coding for mySociety, is in the third year of his PhD. I think I can see a pattern emerging…
Chris Lightfoot died a year ago today (or yesterday, by a few minutes).
I’m just sitting here reading the very first emails I ever got from him, back in 2003. Within the first few mails he’s invented and hacked up the idea that is now Richard Pope’s PlanningAlerts.com; coded and developed the idea that persuaded YouGov to donate vast amounts of free polling data to form PoliticalSurvey2005.com (a wider understanding of which would greatly help in the US election, if only the methodology were applied there); and in this post he’s foreseen the Google Maps mashup craze and offered it on a plate to the Ordnance Survey to pioneer, two years before Google started.
The invention and brilliance come so thick and fast reading these mails that I now realise I’d persuaded myself over the year that I’d mis-remembered quite how insanely creative he was, trying to correct for rose-tinted lenses. But he was a proper, bona fide, no-holds-barred cantankerous genius. Most days I think about Chris at least once: I try to make sure we live up to his standards (he wouldn’t have tolerated my use of ‘But’ at the start of the last sentence, for example). Reading these mails tonight drives home the scale of what we all lost, amongst our friends, on the Internet and in society at large. It aches to contemplate.
You may remember that back in 2006 mySociety published some maps showing how long it took to commute places via public transport.
We’ve just made some more which have some lovely new features we reckon you’ll probably like a lot.
If you’d like to see more maps like this in your area, please ask your local transport authority to get in touch with us, or nudge these people.
PS As always, Francis Irving remains a genius.