-
We’ve used machine learning to make practical improvements in the search on CAPE – our local government climate information portal.
The site contains hundreds of documents and climate action plans from different councils, and they’re all searchable.
One aim of this project is to make it easier for everyone to find the climate information they need: councils, for example, can learn from each other’s work, and people can easily pull together a picture of what is planned across the country.
The problem is that these documents often use different terms to talk about the same basic ideas – meaning that using the search function requires an expert understanding of which different keywords to search for in combination.
Using machine learning, we’ve now made it so the search will automatically include related terms. We’ve also improved the accessibility of individual documents by highlighting which key concepts are discussed in the document.
How machine learning helps
We’re already using machine learning techniques as part of our work clustering similar councils based on emissions profile, but we hadn’t previously looked at how machine learning approaches could be applied to big databases of text like CAPE.
As part of our funding from Quadrature Climate Foundation, we were supported to take part in the Faculty Fellowship – where people transitioning from academic to industrial data science jobs are partnered with organisations looking to explore how machine learning can benefit their work.
Louis Davidson joined us for six weeks as part of this programme. After a bit of exploration of the data, we decided on a project looking at this problem of improving the search, as there was a clear way a machine learning solution could be applied: using a language model to identify key concepts that were present across all the documents. You can watch Louis’ end-of-project presentation on YouTube.
Moving from similar words to similar concepts
Louis took the documents we had and used a language model (in this case, BERT) to produce ‘embeddings’ for all the phrases they contained.
When a language model is trained on large amounts of text, the internal shape of the model changes so that pieces of text with similar meanings end up ‘closer’ to each other inside the model. An ‘embedding’ is a series of numbers that represents this location. By looking at the distance between embeddings, we can identify groups of terms with similar meanings. While a more basic text similarity approach would say that ‘bat’ and ‘bag’ are very similar, a model that sorts based on meaning would identify that ‘bat’ and ‘owl’ are more similar.
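The difference between spelling-based and meaning-based similarity can be sketched in a few lines of Python. This is an illustration only: the three-number vectors below are made up by hand for the example, whereas real BERT embeddings are much longer and come from the model itself. The function names are also just for this sketch.

```python
from difflib import SequenceMatcher
from math import sqrt
from typing import Sequence

# Hand-made toy "embeddings" (assumption: invented for illustration,
# not output from BERT or any other real model).
TOY_EMBEDDINGS = {
    "bat": [0.9, 0.8, 0.1],
    "owl": [0.8, 0.9, 0.2],
    "bag": [0.1, 0.2, 0.9],
}


def spelling_similarity(a: str, b: str) -> float:
    """Character-level similarity between 0 and 1, based only on spelling."""
    return SequenceMatcher(None, a, b).ratio()


def cosine_similarity(u: Sequence[float], v: Sequence[float]) -> float:
    """Cosine of the angle between two vectors (closer to 1.0 = more alike)."""
    dot = sum(x * y for x, y in zip(u, v))
    norm_u = sqrt(sum(x * x for x in u))
    norm_v = sqrt(sum(y * y for y in v))
    return dot / (norm_u * norm_v)


def meaning_similarity(a: str, b: str) -> float:
    """Similarity of two words according to the toy embeddings above."""
    return cosine_similarity(TOY_EMBEDDINGS[a], TOY_EMBEDDINGS[b])


if __name__ == "__main__":
    for pair in [("bat", "bag"), ("bat", "owl")]:
        print(pair, round(spelling_similarity(*pair), 2),
              round(meaning_similarity(*pair), 2))
```

With these toy numbers, the spelling comparison rates ‘bat’ and ‘bag’ as the closer pair, while the embedding comparison rates ‘bat’ and ‘owl’ as closer.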
This means that without needing to re-train the model (because you’re not really concerned with what the model was originally trained to do), you can explore the similarities between concepts.
There are approaches to this that store a “vector database” of these embeddings which can be directly searched – but we’ve gone for a simpler approach that doesn’t require a big change to how CAPE was already working.
Using the documents we have, we automatically identified common concepts that are found across a range of documents, along with the original groups of words that relate to each concept, and then manually selected which of those concepts to use.
When a search is made we now consult this list of similar phrases, and search for these at the same time. This gives us a practical way of improving our existing processes without adding new technical requirements when adding new documents or searching the database.
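A minimal sketch of this kind of lookup, assuming the concept list is held as a simple mapping from a concept to its related phrases. The concept groups and the `expand_query` name are invented for illustration and are not CAPE’s actual data or code.

```python
# Hypothetical concept list. In the project described above, the groups
# came from the embeddings and were then checked by hand; these entries
# are invented examples.
RELATED_PHRASES = {
    "active travel": ["walking", "cycling", "cycle lanes"],
    "retrofit": ["insulation", "home energy efficiency"],
}


def expand_query(query):
    """Return the original query plus any related phrases to search for."""
    cleaned = query.strip().lower()
    terms = [cleaned]
    for concept, phrases in RELATED_PHRASES.items():
        group = [concept] + phrases
        if cleaned in group:
            # Add every other phrase from the same concept group.
            terms.extend(p for p in group if p != cleaned)
    return terms


if __name__ == "__main__":
    print(expand_query("Cycling"))
    print(expand_query("parking"))
```

Because the expansion is just a list lookup at search time, nothing about how documents are stored or indexed has to change.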
Because we now have this list of common concepts, we also pre-search for them so that each document can link to the places where a concept is discussed within it. This makes the contents of individual documents more visible, and makes it easier to quickly find the parts that are relevant to what you are interested in.
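As a rough illustration of the pre-searching step, the sketch below scans one document’s text for every phrase in a concept list and records where each concept turns up. The concept list, the sample text and the `find_concept_mentions` name are all assumptions made for the example, not the code running on CAPE.

```python
import re


def find_concept_mentions(text, concepts):
    """Map each concept to the character offsets where its phrases appear."""
    mentions = {}
    for concept, phrases in concepts.items():
        offsets = []
        for phrase in phrases:
            # Match the phrase as whole words, ignoring case.
            pattern = r"\b" + re.escape(phrase) + r"\b"
            offsets.extend(
                m.start() for m in re.finditer(pattern, text, re.IGNORECASE)
            )
        if offsets:
            mentions[concept] = sorted(offsets)
    return mentions


if __name__ == "__main__":
    # Invented concept list and document text, for illustration only.
    concepts = {
        "active travel": ["active travel", "walking", "cycling", "cycle lanes"],
        "retrofit": ["insulation"],
    }
    text = "We will expand cycle lanes. Cycling and walking routes come first."
    for concept, offsets in find_concept_mentions(text, concepts).items():
        print(concept, offsets)
```

The offsets could then be turned into links to the relevant part of each document; concepts that never appear are simply left out.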
Potential of machine learning for mySociety
Our other websites, like TheyWorkForYou and WhatDoTheyKnow, similarly hold large amounts of text that this kind of semantic search can make more accessible, and we can already see how it might be useful to those relying on data around climate and the environment. WhatDoTheyKnow in particular has huge amounts of environmental information fragmented across replies to hundreds of different authorities.
Generative AI and machine learning have huge potential to help us make the information we hold more accessible. At the same time, we need to understand how to incorporate new techniques into our services in a way that is sustainable over time.
Through experiments like this with CAPE, we are learning how to think about machine learning, which of our problems it applies to, and which new skills we need to work with it. Thanks to Louis for his work, and to his Faculty advisors for their support on this project.
Image: Ravaly on Unsplash.
-
There’s a common theme to a lot of mySociety sites: enter your postcode, see something that relates to you.
From FaxYourMP—the mySociety project so old it predates mySociety itself (paradox!)—through to TheyWorkForYou, FixMyStreet, and WriteToThem, as well as a few of our commercial projects like Mapumental and Better Care, we’ve discovered that asking for a visitor’s location is a super effective way of unlocking clear, relevant information for them to act on.
So perhaps it shouldn’t have come as a surprise that, while doing some regular monitoring of traffic on this website, we noticed a fairly significant number of people attempting to search for things like postcodes, MP names, and the topics of recent debates.
Random sample of search terms, July–December 2017:
- animal sentience
- corbyn
- germany
- CR0 2RH
- theresa may
- facebook
- EN3 5PB
- fire
- ruth davidson
- HG5 0UH
- eu withdrawal bill
- diane abbott
By default, the search box on this site delivered results from our blog post archive (it goes all the way back to 2004 don’t you know!)… which is pretty much what you’d expect if you know how we do things here at mySociety. We have this centralised website to talk about ourselves as an organisation; then each of our projects such as TheyWorkForYou or FixMyStreet is its own separate site.
But, looking at these search terms, it was pretty clear that an awful lot of people don’t know that… and, when you think about it, why should they?
The most obvious solution would just have been to direct visitors towards the individual sites, so they could repeat their searches there. Job done.
But we figured, why inconvenience you? If you’ve made it this far, we owe it to you to get you the information you need as quickly as possible.
Handily, we’ve got rather good at detecting valid postcodes when our users enter them, so programmatically noticing when a user was searching for a location wasn’t hard. And equally handily, TheyWorkForYou offers a powerful API that lets developers exchange a user’s postcode for detailed data about the boundaries and representatives at that location.
What do you get when you combine the two? Automatic search suggestions for TheyWorkForYou, FixMyStreet, and WriteToThem, when you enter your postcode on www.mysociety.org.
The search page is also aware of the most frequently searched-for MPs on our site, and will offer a direct link to their TheyWorkForYou profile if you search for their names.
And finally, if you search for something other than a postcode, we give you a single-click way to repeat your search on TheyWorkForYou, opening up decades of parliamentary transcripts to you.
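The routing described above can be sketched roughly as follows. This is not the code behind the site’s search page: the postcode pattern is a simplified check of the general shape of a UK postcode (it does not confirm that the postcode exists), and the `FREQUENT_MPS` set is a hypothetical stand-in for the list of most-searched MPs.

```python
import re

# Simplified shape of a UK postcode, e.g. "CR0 2RH" or "EN35PB".
# Assumption: a loose format check only, not a lookup of real postcodes.
POSTCODE_RE = re.compile(r"^[A-Z]{1,2}[0-9][0-9A-Z]?\s*[0-9][A-Z]{2}$")

# Hypothetical stand-in for the list of frequently searched-for MPs.
FREQUENT_MPS = {"theresa may", "diane abbott"}


def classify_search(term):
    """Decide which kind of suggestion a search term should trigger."""
    cleaned = " ".join(term.split())  # collapse stray whitespace
    if POSTCODE_RE.match(cleaned.upper()):
        return "postcode"  # suggest the location-based sites
    if cleaned.lower() in FREQUENT_MPS:
        return "mp"  # suggest the MP's profile page
    return "other"  # offer to repeat the search on TheyWorkForYou


if __name__ == "__main__":
    for term in ["CR0 2RH", "theresa may", "eu withdrawal bill"]:
        print(term, "->", classify_search(term))
```

Checking for a postcode first keeps the cheapest and least ambiguous test at the front, with the general-purpose fallback last.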
It’s not a big, glamorous feature. But it’s something we know will come in useful for the few hundred people who search our site every week—possibly without their ever noticing this little bit of hand-holding as we steer them across to the site they didn’t even know they wanted. And most importantly, it should introduce a few more people to the wealth of data we hold about the decision-makers in their lives.
Header image, Flickr user Plenuntje, CC BY-SA 2.0
-
Fifty years ago, in 1964, the causal link between smoking and lung cancer was confirmed by the Surgeon General in the US.
That year saw many debates in Parliament on topics that have since become very familiar: whether the tax on cigarettes should be raised, whether cigarettes should be advertised on television, whether smoking should be allowed in public places, and whether warnings should be printed on packets.
Rich and fascinating stuff for any social historian – and it’s all on TheyWorkForYou.com.
Hansard is an archive
Hansard, the official record of Parliament, is a huge historic archive, and whatever your sphere of interest, it is bound to have been debated at some point.
Browsing through past debates is a fascinating way of learning what the nation was feeling: worries, celebrations, causes for sorrow – all are recorded here.
How to use TheyWorkForYou to browse historic debates
TheyWorkForYou contains masses of historic information: House of Commons debates back to 1935, for example, and details of MPs going back to around 1806. You can see exactly what the site covers here.
There are various ways to search or browse the content. Start with the search box on the homepage – it looks like this:
You can do a simple search right from this page, or choose ‘more options’ below the search box to refine your search.
We’ll look at those advanced options later, but let’s see what happens when you input a simple search term like ‘smoking’.
Here (above) are my search results, with my keyword helpfully highlighted.
By default, search results are presented in reverse chronological order, with the most recent results first. If you are particularly interested in historical mentions, you may wish to see the older mentions first.
That’s easy – just click on the word ‘oldest’ after ‘sorted by date’:
You’ll notice a few other options here:
- ‘Sort by relevance’ orders your results with the most relevant ones first, as discerned by our search engine. This will give you those speeches with the most mentions of your keyword ahead of those where it is only mentioned once or twice.
- ‘Show use by person’ displays a list of people who have mentioned your keyword, with the most frequent users at the top. This can be fascinating for games such as “who has apologised the most?” or “who has mentioned kittens most often?”
Click through any of the names, and you’ll see all the speeches where that person mentioned your keyword.
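To make the sorting options concrete, here is a small sketch over a handful of invented speeches. It assumes that relevance is simply the number of times the keyword appears, as the description suggests; the ranking used by the site’s real search engine is likely to be more involved, and none of the names below come from TheyWorkForYou’s code.

```python
from collections import Counter

# Invented sample data, for illustration only.
SPEECHES = [
    {"speaker": "Member A", "year": 1964,
     "text": "Smoking in public places; smoking on trains."},
    {"speaker": "Member B", "year": 1971,
     "text": "A warning about smoking on every packet."},
    {"speaker": "Member A", "year": 1998,
     "text": "Smoking, smoking, and again smoking."},
]


def mentions(speech, keyword):
    """Count how many times the keyword appears in one speech."""
    return speech["text"].lower().count(keyword.lower())


def sort_oldest_first(speeches):
    """Order speeches by date, earliest first."""
    return sorted(speeches, key=lambda s: s["year"])


def sort_by_relevance(speeches, keyword):
    """Order speeches with the most mentions of the keyword first."""
    return sorted(speeches, key=lambda s: mentions(s, keyword), reverse=True)


def use_by_person(speeches, keyword):
    """Total mentions per speaker, most frequent user first."""
    totals = Counter()
    for speech in speeches:
        count = mentions(speech, keyword)
        if count:
            totals[speech["speaker"]] += count
    return totals.most_common()


if __name__ == "__main__":
    print([s["year"] for s in sort_by_relevance(SPEECHES, "smoking")])
    print(use_by_person(SPEECHES, "smoking"))
```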
Advanced search
That’s a good start – but what if there are too many search results, and you need some way to refine them? You’ll notice from my screenshots above that there are (at the time of writing) over 10,000 mentions of smoking.
That’s where Advanced Search comes in. You can access it from a few places:
- The ‘more options’ link right next to the search box on search results pages (see image below)
- The ‘more options’ link below the search box on the homepage (see image below)
- Or just navigate directly to our dedicated Advanced Search page (see image below)
Whichever way you arrive at it, the Advanced Search page helps you really get to the content you’re interested in.
The pink box on the right gives you some tips for effective searching.
For example, just as with Google, you can search for exact phrases by putting your search term within quotation marks. Otherwise, your results will contain every speech where all your words are mentioned, even if they’re not together. For phrases like “high street”, this could make a real difference.
Even if you are only searching for a single word, you can put it in quotation marks to switch off ‘stemming’. Without the quotation marks, a search for the word ‘house’ will also return results containing ‘houses’, ‘housing’ and ‘housed’.
You can exclude words too: this can be useful for minimising the number of irrelevant results. So, for example, you might want to find information about the town of Barking, but find that many of your results are debates about dogs. Simply enter the search term “barking” -dogs. The minus sign excludes the word from your search.
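The search syntax described above (quoted phrases, plain words and minus-sign exclusions) can be illustrated with a short sketch. It is a simplified model rather than TheyWorkForYou’s actual query parser: it uses straight quotation marks, does not model stemming at all, and matches unquoted words only as whole words.

```python
import re


def parse_query(query):
    """Split a query into exact phrases, plain terms and excluded words."""
    exact = [p.lower() for p in re.findall(r'"([^"]+)"', query)]
    remainder = re.sub(r'"[^"]*"', " ", query)  # drop the quoted parts
    terms, excluded = [], []
    for word in remainder.split():
        if word.startswith("-") and len(word) > 1:
            excluded.append(word[1:].lower())
        elif word != "-":
            terms.append(word.lower())
    return {"exact": exact, "terms": terms, "excluded": excluded}


def matches(text, parsed):
    """True if the text satisfies every part of the parsed query."""
    lowered = text.lower()
    words = set(re.findall(r"[a-z0-9']+", lowered))
    if any(word in words for word in parsed["excluded"]):
        return False
    for phrase in parsed["exact"]:
        # The whole phrase must appear together, in order.
        if not re.search(r"\b" + re.escape(phrase) + r"\b", lowered):
            return False
    # Every remaining word must appear somewhere in the text.
    return all(term in words for term in parsed["terms"])


if __name__ == "__main__":
    parsed = parse_query('"barking" -dogs')
    print(parsed)
    print(matches("The town of Barking needs a new station.", parsed))
    print(matches("Barking dogs kept residents awake.", parsed))
```

Here the first sample sentence passes and the second is rejected because it contains the excluded word.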
In the main body of the page, you’ll also see options to restrict your search to certain dates, a specific speaker, a department, a section (eg Scottish Parliament or Northern Ireland Assembly) or even a political party.
Get stuck in
The best way to see what you can find is to dig in and give it a go. If your search doesn’t work for you the first time, you can always refine it until it does.
Let us know if you find anything interesting!
Image: National Archives (No Known Restrictions)