TheyWorkForYou now spots every reference to an earlier edition of Hansard, which speakers cite by date and column number (e.g. Official Report, 29 February 2008, column 1425), and turns the citation into a link to a search for the speeches in that column on that date. This only really became feasible when we moved server, upgraded Xapian, and added date and column number metadata (among others), allowing much more advanced and focussed searching; the advanced search form gives some ideas. Perhaps in future we’ll be able to add some crowd-sourcing game to match the reference to the exact speech, much like our video matching (nearly 80% of our archive done!). 🙂
Kudos to Google and Yahoo! for spotting this change within a couple of days, as they’re now so busy crawling everything for changes that they’re slowing the whole website down… 😉
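To make the citation linking concrete, here is a minimal Python sketch of the idea, not the site's actual code. It only handles the citation format quoted in the post; the search URL (`SEARCH_URL`) and the `date:`/`column:` query syntax are assumptions made up for illustration, as are the function names.

```python
import re
from datetime import datetime
from urllib.parse import urlencode

# Hypothetical search endpoint; the real site's URL scheme is not given in the post.
SEARCH_URL = "https://example.org/search/"

# Matches citations shaped like "Official Report, 29 February 2008, column 1425".
CITATION = re.compile(
    r"Official Report,\s+(?P<date>\d{1,2} [A-Z][a-z]+ \d{4}),\s+column\s+(?P<column>\d+)"
)


def citation_to_link(match):
    """Wrap one matched citation in a link to a date-and-column search."""
    try:
        # %B expects an English month name under the default locale.
        day = datetime.strptime(match.group("date"), "%d %B %Y")
    except ValueError:
        # Not a real date (e.g. "31 February"): leave the text untouched.
        return match.group(0)
    # Assumed query syntax: one term for the date, one for the column.
    query = f"date:{day:%Y%m%d} column:{match.group('column')}"
    url = SEARCH_URL + "?" + urlencode({"q": query})
    return f'<a href="{url}">{match.group(0)}</a>'


def link_citations(text):
    """Replace every recognised citation in text with a search link."""
    return CITATION.sub(citation_to_link, text)
```

Calling `link_citations` on a speech leaves everything except the recognised citations unchanged, so it can be run over text that contains no references at all.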
It seems Yahoo parses the robots file in an interesting way. Firstly, our file was wrong for not including the Disallow lines for Yahoo as well, which explains why Yahoo was crawling /user. It doesn't explain why it was still crawling so quickly, though.
Matthew’s patch to our robots.txt:
https://secure.mysociety.org/cvstrac/chngview?cn=12284
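On the Disallow point: under the usual robots.txt matching rules, a crawler that finds a group naming it specifically follows that group instead of the `*` group, so any Disallow lines have to be repeated there. The sketch below checks this with Python's standard `urllib.robotparser`. The file contents are hypothetical (this is not the real TheyWorkForYou robots.txt, and the `Crawl-delay` value is invented); `Slurp` is the user-agent token Yahoo's crawler uses.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt: the Slurp group has no Disallow lines of its own.
ROBOTS_TXT = """\
User-agent: *
Disallow: /user/

User-agent: Slurp
Crawl-delay: 5
"""

# Same file with the Disallow line repeated inside the Slurp group.
FIXED_ROBOTS_TXT = ROBOTS_TXT + "Disallow: /user/\n"


def allowed(agent, path, robots_txt=ROBOTS_TXT):
    """Return True if the given user agent may fetch path under robots_txt."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(agent, path)
```

With the first file, `allowed("Slurp", "/user/")` should come back true while any crawler without its own group is blocked; with `FIXED_ROBOTS_TXT`, Slurp is blocked too.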
For what it’s worth, quotations in the Scottish Parliament text on TheyWorkForYou should mostly link to the quoted text by using a different mechanism – the parser tries to match substrings of the quotation in the speeches from the particular day that’s referenced. This isn’t generally useful in the same way, of course, but I thought it worked surprisingly well, and is perhaps worth mentioning in this context.
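As a rough illustration of the substring idea (a sketch only, not the Scottish Parliament parser itself), the Python below slides a window of a few words along the quotation and counts how many of those word runs appear in each speech from the referenced day. The function names, the five-word window and the shape of the `speeches` mapping are all assumptions.

```python
import re


def normalise(text):
    """Lower-case and drop punctuation so small differences don't block a match."""
    return " ".join(re.findall(r"[a-z0-9']+", text.lower()))


def best_matching_speech(quotation, speeches, window=5):
    """Return the id of the speech sharing the most word runs with the quotation.

    speeches maps a speech id to its text; returns None if nothing matches.
    """
    words = normalise(quotation).split()
    if not words:
        return None
    if len(words) <= window:
        chunks = [" ".join(words)]
    else:
        chunks = [" ".join(words[i:i + window])
                  for i in range(len(words) - window + 1)]

    best_id, best_score = None, 0
    for speech_id, text in speeches.items():
        # Pad with spaces so chunks only match on whole-word boundaries.
        haystack = f" {normalise(text)} "
        score = sum(1 for chunk in chunks if f" {chunk} " in haystack)
        if score > best_score:
            best_id, best_score = speech_id, score
    return best_id
```

Scoring on overlapping runs rather than the whole quotation means a quote that was trimmed or lightly reworded can still land on the right speech.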
“Perhaps in future we’ll be able to add some crowd-sourcing game to match the reference to the exact speech”
I don’t think you need to do this (in most cases, anyway) based on the experience of automatic matching of quotations in the SP parser.
The problem is there is no quote – they simply say “I refer the hon. Gentleman to the previous answer I gave [Official Report, 29 February 2008, column 1425]” – so there’s no way to know automatically which speech/answer within that column they mean (it could try to match e.g. speaker name, even subject, but that’s hard graft for not much gain).