We’ve used machine learning to make practical improvements in the search on CAPE – our local government climate information portal.
The site contains hundreds of documents and climate action plans from different councils, and they’re all searchable.
One aim of this project is to make it easier for everyone to find the climate information they need: councils, for example, can learn from each other’s work, and people can easily pull together a picture of what is planned across the country.
The problem is that these documents often use different terms to talk about the same basic ideas – meaning that using the search function requires an expert understanding of which different keywords to search for in combination.
Using machine learning, we’ve now made it so the search will automatically include related terms. We’ve also improved the accessibility of individual documents by highlighting which key concepts are discussed in the document.
How machine learning helps
We’re already using machine learning techniques as part of our work clustering similar councils based on emissions profile, but we hadn’t previously looked at how machine learning approaches could be applied to big databases of text like CAPE.
As part of our funding from Quadrature Climate Foundation, we were supported to take part in the Faculty Fellowship – where people transitioning from academic to industrial data science jobs are partnered with organisations looking to explore how machine learning can benefit their work.
Louis Davidson joined us for six weeks as part of this programme. After a bit of exploration of the data, we decided on a project looking at this problem of improving the search, as there was a clear way a machine learning solution could be applied: using a language model to identify key concepts that were present across all the documents. You can watch Louis’ end of project presentation on YouTube.
Moving from similar words to similar concepts
Louis took the documents we had and used a language model (in this case, BERT) to produce ‘embeddings’ for all the phrases they contained.
When language models are trained on large amounts of text, this changes the internal shape of the model so that texts with similar meanings end up ‘closer’ to each other inside the model. An ‘embedding’ is a series of numbers that represents this location. By looking at the distance between embeddings, we can identify groups of terms with similar meanings. While a more basic text-similarity approach would say that ‘bat’ and ‘bag’ are very similar, a model that sorts based on meaning would identify that ‘bat’ and ‘owl’ are more similar.
This means that without needing to re-train the model (because you’re not really concerned with what the model was originally trained to do), you can explore the similarities between concepts.
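The idea of comparing meanings by distance can be sketched in a few lines. The three-dimensional vectors below are invented for illustration (real models like BERT produce vectors with hundreds of dimensions), but the cosine-similarity calculation is the standard one:

```python
import math

# Toy three-dimensional "embeddings". Real models such as BERT produce
# vectors with hundreds of dimensions, but the principle is identical.
# These numbers are invented purely for illustration.
EMBEDDINGS = {
    "bat": [0.9, 0.8, 0.1],  # animal-ish direction
    "owl": [0.8, 0.9, 0.2],  # animal-ish direction
    "bag": [0.1, 0.2, 0.9],  # everyday-object direction
}

def cosine_similarity(a, b):
    """Similarity of two vectors: 1.0 means identical direction."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

# Meaning-based similarity: 'bat' sits closer to 'owl' than to 'bag',
# even though 'bat' and 'bag' differ by only one letter.
assert cosine_similarity(EMBEDDINGS["bat"], EMBEDDINGS["owl"]) > \
       cosine_similarity(EMBEDDINGS["bat"], EMBEDDINGS["bag"])
```

A spell-checker-style comparison of letters would rank ‘bag’ as the closest match to ‘bat’; comparing embedding directions instead ranks ‘owl’ closer.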
There are approaches to this that store a “vector database” of these embeddings which can be directly searched – but we’ve gone for a simpler approach that doesn’t require a big change to how CAPE was already working.
Using the documents we have, we automatically identified common concepts found across a range of documents, manually selected a group of them, and kept the original groups of words that relate to each concept.
When a search is made we now consult this list of similar phrases, and search for these at the same time. This gives us a practical way of improving our existing processes without adding new technical requirements when adding new documents or searching the database.
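This phrase-expansion step can be sketched as a simple lookup table. The concept names and phrase groups below are invented examples, not CAPE’s actual curated list, and `expand_query` is a hypothetical helper:

```python
# Invented example concepts -- CAPE's real list was derived from
# clustered embeddings plus manual curation.
CONCEPT_PHRASES = {
    "active travel": ["active travel", "cycling", "walking", "cycle lanes"],
    "retrofit": ["retrofit", "home insulation", "energy efficiency measures"],
}

def expand_query(query):
    """Return the set of phrases to search for alongside the user's query.

    If the query matches a phrase belonging to a known concept, every
    phrase in that concept's group is added to the search.
    """
    terms = {query}
    for phrases in CONCEPT_PHRASES.values():
        if query.lower() in (p.lower() for p in phrases):
            terms.update(phrases)
    return terms

# Searching for 'cycling' also searches the related phrases.
assert expand_query("cycling") == {
    "cycling", "active travel", "walking", "cycle lanes"
}
```

Because the expansion is just a precomputed list consulted at query time, no new infrastructure (such as a vector database) is needed when documents are added or searched.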
Because we now have this list of common concepts, we also pre-search for them, providing links from each document to the places where each concept is discussed. This makes the contents of individual documents more visible, so it is easier to quickly spot relevant passages depending on what you are interested in.
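The pre-searching step could look something like the sketch below, where `concept_index` (a hypothetical helper, not CAPE’s actual code) records the positions in a document where any of a concept’s phrases occur, so a page can link straight to those passages:

```python
def concept_index(document_text, concept_phrases):
    """Map each concept to the character offsets where any of its
    phrases appear in the document (case-insensitive)."""
    text = document_text.lower()
    index = {}
    for concept, phrases in concept_phrases.items():
        positions = []
        for phrase in phrases:
            start = text.find(phrase.lower())
            while start != -1:
                positions.append(start)
                start = text.find(phrase.lower(), start + 1)
        if positions:
            index[concept] = sorted(positions)
    return index

# Invented example data for illustration.
phrases = {"retrofit": ["retrofit", "home insulation"]}
doc = "The council will fund home insulation. Retrofit grants open in May."

# Two passages discuss the 'retrofit' concept in this sample document.
assert len(concept_index(doc, phrases)["retrofit"]) == 2
```

In practice this index would be computed once per document when it is added, so showing “where is this concept discussed?” links costs nothing at page-load time.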
Potential of machine learning for mySociety
Our other websites, like TheyWorkForYou and WhatDoTheyKnow, similarly have a large amount of text that this kind of semantic search can make more accessible — and we can already see how they might be useful to those relying on data around climate and the environment. WhatDoTheyKnow in particular holds huge amounts of environmental information fragmented across replies to hundreds of different authorities.
Generative AI and machine learning have huge potential to help us make the information we hold more accessible. At the same time, we need to understand how to incorporate new techniques into our services in a way that is sustainable over time.
Through experiments like this with CAPE, we are learning how to think about machine learning, which of our problems it applies to, and what new skills we need to work with it. Thanks to Louis for his work on this project, and to his Faculty advisors for their support.
Sign up for climate updates from mySociety
Image: Ravaly on Unsplash.