-
Most discussion and usage of LLMs focuses on high-profile closed models such as OpenAI’s ChatGPT family and Google’s Gemini, which are widely available and integrated into a range of existing products and services.
Because these are closed models, access and hosting are controlled by the companies that create them. This presents a dilemma for civic tech organisations that believe in open source: important parts of their processes can disappear into black boxes beyond their control. These services may work well and be affordable today, but they create new risks. Specific models might become unavailable, pricing might change, and relying on them means lock-in to specific providers.
Open LLM models provide an alternative approach. In a familiar issue from open source licensing, there are different ways in which a model can be ‘open’. Open weights models have the final structure of the model released and can be run on your own hardware (Meta’s Llama models are an example of this). Fully open models also release the underlying (openly licensed) training data, along with the recipes and evaluation systems used in their training. AI2’s OLMo family of models and the Swiss AI Initiative’s recent Apertus model are examples of these. Somewhere in between are approaches like IBM’s Granite models, where the model is released as open weights and the training data was licensed for training (addressing copyright issues) but is not publicly accessible.
What are weights? Basically, a model can be understood as a big network of connections, where the ‘weights’ are how strong (and influential) each connection is. What happens in the training process is a refinement of these weights as a result of exposure to the training data. The weights at the end of the process are the trained model, and can be shared and used by others. But if you also have the training data and process, you can recreate the model step by step, with a clear audit trail of what’s in it.
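As a toy sketch of the idea (with made-up numbers, and nothing like the scale or structure of a real model), a single unit in such a network just combines its inputs according to its connection strengths. Training nudges those numbers, and sharing the trained model means sharing them:

```python
# Toy illustration of 'weights': one unit combining inputs by
# learned connection strengths. The weights here are invented.
def unit_output(inputs, weights):
    return sum(i * w for i, w in zip(inputs, weights))

weights = [0.8, -1.2, 0.3]  # these numbers ARE the (tiny) model
print(unit_output([1.0, 0.5, 2.0], weights))
```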
Any kind of open weights model is practically appealing because it unlocks new ways to work with private data without sharing it with third parties, and creates more flexibility around infrastructure. For instance, we currently use a fine-tuned version of Llama to help flag immigration correspondence in WhatDoTheyKnow.
Fully open models are ethically appealing because they avoid the issues of models that have been trained on copyrighted data. Their existence is a challenge to an AI policy debate in which countries must trade off the rights of creators against the benefits of AI as sold by a handful of companies. They fit well with our open source ethos – and understanding more about how to use them practically helps give us options to improve our own services, and contribute to wider arguments about responsible use of AI.
This blog post is a write-up of several practical experiments using the 7B parameter variant of OLMo-2, both locally on a laptop GPU and remotely using Hugging Face’s inference endpoints.
Using OLMo-2 locally
Our purpose in running something locally is to be able to process sensitive information that should not leave our infrastructure. In this case, using OLMo-2 to create human-readable representations of clusters from WriteToThem survey responses. While users are asked not to include personal information in this survey, enough do that we need to treat the basic dataset as having personal information that should not be shared.
We used llama-cpp (and the associated Python bindings) to run the local model. An alternative local approach is to use Ollama to run a local server. The reason for using llama-cpp in this case is that Ollama doesn’t always seem to pick up that less well known models can use ‘tools’ correctly (which is required for structured data output). Another benefit of having it run in process rather than as a separate server is that the script can turn the resource-intensive part on and off (although there’s a corresponding start-up time), rather than needing a separate server process to run.
Setting up the libraries
Installing llama-cpp in a way that can use the GPU is not straightforward. This set of instructions for Windows 11/Nvidia GPU mostly worked for me. I additionally needed to add an extra DLL directory before importing from llama_cpp because there’s a DLL folder that the library wasn’t yet referencing.
Big picture, WheelNext is a project to try and make installing correct versions of the library easier across different OS/GPU combinations. In the meantime, setting up a local machine is a bit fiddly.
Downloading model information
Llama-cpp uses GGUF files, which pack all the weights into a single file. There are libraries to convert from the Transformers format, but model publishers often make GGUF versions available on Hugging Face directly.
Downloading the model can be done using the huggingface_hub command line tool (here run via uv).
uvx --from huggingface-hub hf download allenai/OLMo-2-1124-7B-Instruct-GGUF olmo-2-1124-7B-instruct-Q4_0.gguf --local-dir models
This pulls down a quantised version, which has the same number of parameters, but with the values of the weights significantly rounded down. This tends to cost much less in quality than it saves in file/memory size (why? Broadly, high fidelity is useful during training, where the weights are adjusted in small shifts, but once you have something working the general structure is good enough) – and it fits the model just inside my laptop GPU’s memory.
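As a rough back-of-envelope check (ignoring quantisation block overheads and the memory needed for context and activations), the saving looks like this:

```python
# Approximate weight storage for a 7B-parameter model at different
# precisions. Real GGUF files carry some per-block metadata on top,
# so treat these as ballpark figures only.
def approx_size_gb(n_params: float, bits_per_weight: float) -> float:
    return n_params * bits_per_weight / 8 / 1e9

print(f"fp16 release: ~{approx_size_gb(7e9, 16):.1f} GB")  # ~14.0 GB
print(f"Q4 quantised: ~{approx_size_gb(7e9, 4):.1f} GB")   # ~3.5 GB
```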
This download can also just be done in code:
from functools import lru_cache

from llama_cpp import Llama

@lru_cache
def get_llm():
    # Download (cached after the first run) and load the quantised model
    return Llama.from_pretrained(
        repo_id="allenai/OLMo-2-1124-7B-Instruct-GGUF",
        filename="olmo-2-1124-7B-instruct-Q4_0.gguf",
    )
Structured data output
To get structured data out of the model, PydanticAI can be used with Outlines to query the llama-cpp model.
This:
- makes it easier to define Pydantic data structures that should be returned.
- makes it easier to swap between local/remote models by swapping the model passed to the agent, but otherwise using a common API.
Hosted OLMo-2 model
An advantage of any open weights model is being able to run it on a range of infrastructure (and being able to change the infrastructure later).
In this case, I had a use case where we wanted to do transformations on already public data (the appropriateness of linking to a specific Wikipedia page from a specific sentence in a parliamentary debate) – and so there was no privacy/security issue for the purposes of the experiment. We are doing further exploration about how we can make this kind of use compliant with our wider legal and privacy commitments.
Because OLMo-2 is not a commonly used model, there isn’t an inference service that offers it directly as an option (which would be most efficient – as you’re being charged for tokens while the underlying infrastructure is shared between many users). Instead, you need to create a private server that can manage the model.
Creating an endpoint
Hugging Face Inference Endpoints is the approach I used here – that lets you provision an endpoint connected to a specific model. I’m using the same model as I used locally.
Depending on the properties of the model, the minimum GPU required will be suggested. This model came out at about $0.80 an hour; running the 13B parameter version of the model was about $2 an hour. There are options to run on AWS, Azure and Google Cloud in different regions (although processing data in the EU/UK is a requirement for us, which limits some of the GPU options).
The scale-to-zero time is adjustable down to about 15 minutes, and it takes a few minutes to load up again from this state. In principle, if the access token is scoped correctly, the huggingface_hub library can handle pausing and unpausing the endpoint (or even creating one programmatically), if more control is wanted.
Structured data output
This endpoint works well using some of the example Hugging Face connections for PydanticAI. Something I had to adjust was adding an adapter that rewrites complex JSON schemas (e.g. anything with multiple model types, enums, etc) from using ‘$defs’ references into a plain inline structure, because the Hugging Face text-generation-inference interface can’t handle them.
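The shape of that kind of adapter (an illustrative reimplementation, not the exact code we use) is to walk the schema and replace each ‘$ref’ with an inline copy of its definition:

```python
# Illustrative sketch: inline '$defs' references in a JSON schema so
# the result is one nested structure with no '$ref' entries.
# Assumes no recursive definitions (those cannot be fully inlined).
def inline_defs(schema: dict) -> dict:
    defs = schema.get("$defs", {})

    def resolve(node):
        if isinstance(node, dict):
            if "$ref" in node:
                # '$ref' values look like '#/$defs/Name'
                name = node["$ref"].split("/")[-1]
                return resolve(defs[name])
            return {k: resolve(v) for k, v in node.items() if k != "$defs"}
        if isinstance(node, list):
            return [resolve(v) for v in node]
        return node

    return resolve(schema)

schema = {
    "$defs": {"Person": {"type": "object", "properties": {"name": {"type": "string"}}}},
    "type": "object",
    "properties": {"author": {"$ref": "#/$defs/Person"}},
}
print(inline_defs(schema))
```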
I have an example of creating a model that Pydantic AI will accept here – the missing config bits are a token associated with the account and the url of the endpoint created.
So in principle this means we can have an endpoint that gives us access to a GPU based model for an hour a day at a reasonable price – while we could at a later point swap out to use a local model without adjusting the general logic of the application. This is well suited to our current anticipated uses in batched backend processes, but would be less efficient if it needed to be responsive around the clock.
Reflecting on the results
Compared to previous projects using the OpenAI API, a key thing to note is that it is slower and more fiddly on the infrastructure at hand. I was only using the 7B parameter model, while the 32B parameter model is the one that evaluates closest to GPT-4o mini. As such, prompts needed to be a bit more detailed about what was required. Similarly, a combination of the hardware and not being able to run queries in parallel over wider infrastructure means the process takes longer.
But this is also like comparing cake to a well balanced meal – the benefits of an open model are not just philosophical but practical. With a bit more work on the prompt you can get useful results on a laptop with no dependency on third-party services. That brings into scope a range of use cases that OpenAI is not suitable for.
Even where, as in the Wikipedia example, there are no privacy issues in using OpenAI, making it easy to swap in an open model makes it much easier to evaluate its effect. It will now be relatively straightforward to substitute OLMo-2 into PydanticAI flows using other models and get a baseline feel for effectiveness. Even where you might choose a closed model in a specific instance, it is very useful to work in a way that avoids lock-in to that model, so you could switch away in future.
Similarly, having a working process for a non-mainstream model like OLMo-2 makes it easier to explore other models like Apertus. As this has been trained on a wider range of non-English languages, it could provide a more dependable component in LLM integration with the core Alaveteli software – which powers Freedom of Information platforms across a range of languages.
Understanding open models as a practical approach helps us contribute more widely to policy conversations around AI – and to the question of where trade-offs and impacts are inherent to the nature of the technology, and where they are a consequence of how it is currently controlled and produced.
Open models are always likely to lag slightly behind the frontier models, but they are already incredibly useful technologies compared to what was possible a few years ago. We want to understand more about how we can practically make use of these models – and help make sure the future of LLMs is shaped by ethical considerations about their training and use, rather than accepted on the terms of the dominant tech giants.
Header image: Photo by Zhang Zi Han on Unsplash
-
Recently we wrote about why we’re now listing APPGs in TheyWorkForYou. This blog post goes into more detail about the technical process we use to gather who is a member of an APPG.
We have two methods of getting the memberships of APPGs. The first is finding whether a list is already published on the group’s website. The second is using Parliament’s rules to ask the APPG contact for the list. So we need to: a) find all the APPG websites; b) see if they publish members lists; c) if not, ask for the list; and d) get those lists into a consistent format.
Data that is fragmented and not in the format we want is a fairly common civic tech problem. The solution is to write a ‘scraper’ that reads the content of a website and has a process for converting it to a more structured format.
This works well when dealing with only a few sources (e.g. the memberships of the UK’s parliaments only needs a few different scrapers), or where a common format is being used (e.g. many local government websites use similar providers). In the case of APPGs, there is no common template being used. We just have a set of a few hundred websites that may (or may not) contain a list of names.
Rather than a traditional scraper, we have built an agentic AI/LLM approach that can more flexibly extract memberships from websites. The end result is a tool with a careful sequencing of manual and automated steps, injecting human review in structured ways. Rather than adding an “AI makes mistakes” disclaimer, we built a structured process that checks elements efficiently one group at a time, and can lock off errors before proceeding to the next stage. This was also an experiment in using LLMs to write scraper tools, as well as some of the tools needed for the manual review steps.
Practically, this was an effective way of getting the information we needed that turned a very hard problem into one that we can dependably run regularly. It also suggests more generally useful ways of approaching fragmented data problems (more on this at the end of the post).
Building agentic approaches
An ‘agent’ is often poorly defined, but broadly it’s a language model interface that is given tools (specific functions), a task, and an output data structure, and it loops between these until it gives a result.
To build agentic functions, we used the PydanticAI framework, which acts as a connector between the prompt, input data, the data structure of the output data, functions the agent has access to, and any bespoke validation of the results. The end result is a function that accepts structured input, and returns structured output, relatively painlessly.
Although this example uses OpenAI’s GPT models, in future experiments we will use the PydanticAI approach to connect to open source models (the framework is designed to be model-agnostic). In principle this means that this project could in future switch the underlying provider used.
Process
Step 1: Writing a scraper
The first thing we needed to do was to get the official data from Parliament’s APPG register into a more structured form.
You can see an example of this page for the Africa APPG. This is a good task for a traditional scraper, but would also have been a fiddly problem. Using ChatGPT, we gave it an extract of the HTML, and asked for a Pydantic data structure and a script to convert the data. This worked pretty well, with some tweaking of the format over time. When errors emerged in different APPGs, passing the error and an understanding of what should have happened back to the Copilot agent (using a Claude model) led to working fixes. In using the coding agent, the key decision was which parts of the project to be opinionated about – this has mostly meant being very explicit about data structures (and validation to ensure they’re correct), and more relaxed about the pipes that connect things up.
Step 2: Adding categories to APPGs
From the official data, we only know whether an APPG is a country or subject area group. We want to make the list a bit more explorable by breaking this down into categories.
In the spirit of experimenting with LLMs, we copied all subject area APPGs’ names and purpose statements into one of OpenAI’s reasoning models and asked for 10–20 sub-categories. It came back with 20, and they looked reasonable.
We then created a small functionless agent interface, giving it the title and purpose of a specific APPG, and returning a list of potential categories (preferring one, but allowing all that seem relevant).
Spot-checking these, they seem reasonable, and for the purpose of breaking down the big list a bit, this is a good step up. It means we can quickly see the APPGs that are likely to be relevant to environmental matters.
Step 3: Finding missing websites
Some APPGs list their external website – some do not. Here we use AI tools as part of the workflow, to find those missing sites (which may not exist).
We created an agent function with access to a web search tool (Tavily), a function to check if a URL is valid, and a prompt to help identify the correct site. This creates a loop that searches for and identifies a good candidate for the website.
At this point, there is a manual check that prompts the user to review each site one-by-one before confirming it as a valid site. 45/74 sites identified in the first wave were valid. Invalid websites were news articles, APPGs in other parliaments, or sites for previous iterations of that APPG.
This is not comprehensive and we and our volunteers found some more manually after the fact – but it is an interesting trial in finding data starting only with a search engine.
Step 4: Find published members
The final step is to get a list of members (if published) off these websites. We need a really flexible approach for this. Names might be in a structured list, but they can also be in one paragraph. They might be on a members page, the home page, or spread over three pages. There is no consistency to fall back on.
Here, we created an agent with a function that can fetch a web page and convert it to markdown. Using this recursively, the prompt instructs the agent to find the most relevant page (in some cases pages) that could contain membership information, and return a data structure of the members (MPs, Lords, Other). This returned over 5,000 names in the data format provided.
The big risk at this point is that, having been asked for a list of MPs, it makes some up. The validation we use for this is to check that each name in the list is present within the HTML content of the page it was extracted from. If there’s an error, it runs again, and will give up rather than use an incorrect list. There is some possibility of misinterpretation – but this prevents outright fabrication. Errors flagged here tended to be cases where the LLM had fixed formatting, meaning the text no longer matched exactly against the page.
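A minimal version of that anti-fabrication check (a sketch with invented names, not our production code) is just a normalised substring test per extracted name:

```python
# Sketch: every extracted name must literally appear in the source
# page, after collapsing whitespace and case. Names that cannot be
# found are returned for re-running or rejection.
import re

def names_not_present(names: list[str], page_text: str) -> list[str]:
    haystack = re.sub(r"\s+", " ", page_text).lower()
    return [n for n in names if re.sub(r"\s+", " ", n).lower() not in haystack]

page = "Members: Jane Example MP, John\n  Sample MP"
print(names_not_present(["Jane Example MP", "Madeup Person MP"], page))
```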
The key problem here is one that a human would have too: some APPG lists are out of date. Here I added an extra flag detecting a list containing people who had left Parliament, which then needed a manual review. In other cases, the agent was sometimes picking up lists that were not membership lists – we made some adjustments to the prompt after it picked up attendees at an AGM, which is not wrong, but incomplete.
Step 5: Manual data
As our main blog post talks about, we then needed to contact APPGs directly for lists that were not published. This presented a new problem: what we got back was a combination of spreadsheets and emails with different levels of detail – some including party details in other columns, some not.
Our solution was to have a Google Doc that just has each list formatted under a heading with the APPG title – we could just copy and paste information into this.
This file is then downloaded as markdown and converted into a list of names. There are a few tweaks to clean up leading numbers, and identify the name component of each line. Again, this step was substantially written via prompt: given examples of the problem data, the LLM would create regular expressions to clean the data into the basic list of names we needed.
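The kind of regular expression produced looks roughly like this (an illustrative reconstruction, with made-up sample lines, not the generated code itself):

```python
# Illustrative sketch: strip leading list numbering/bullets and
# trailing bracketed details from pasted lines, keeping the name.
import re

LINE_PATTERN = re.compile(r"^\s*(?:\d+[.)]\s*|[-*]\s*)?(?P<name>[^(\[]+)")

def extract_name(line: str) -> str:
    match = LINE_PATTERN.match(line)
    return match.group("name").strip() if match else ""

lines = ["1. Jane Example MP (Labour)", "- John Sample MP", "2) Pat Test MP [Con]"]
print([extract_name(l) for l in lines])
```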
Step 6: Tidy members information
What we want to do next is get from a list of names to a list of TheyWorkForYou unique IDs.
We have a library that helps reconcile names to IDs, but a challenge here is that there are a huge range of spelling mistakes (sometimes to an extent where you could not actually work out the correct MP).
What we needed was a quick tool to compare the input name against our list of known names and suggest near matches. Here we again turned to the coding agent, posing the problem, providing some snippets to interact with our existing library, and letting it craft a command line interface.
This fairly quickly gave a good interface for reviewing spelling problems (later refined to auto-match below a certain threshold). This helper tool is not especially complicated, but as something with a clear input and output, isolated from the rest of the flow, it was a good candidate for testing Copilot’s ability to create the function. In choosing what to spend time on, this would not otherwise have been a priority – but it brought a useful feature into scope.
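The core of such a helper is available in the Python standard library; a minimal sketch (with invented names and an assumed similarity threshold, not our actual tool) might be:

```python
# Sketch: suggest a near match for a misspelled name against known
# names, auto-accepting only above a high similarity score and
# otherwise flagging for human review.
from difflib import SequenceMatcher, get_close_matches

KNOWN_NAMES = ["Jane Example", "John Sample", "Pat Test"]  # invented examples

def suggest(name: str, cutoff: float = 0.8):
    matches = get_close_matches(name, KNOWN_NAMES, n=1, cutoff=cutoff)
    if not matches:
        return None
    best = matches[0]
    score = SequenceMatcher(None, name, best).ratio()
    # Auto-match when very close; otherwise a human should review
    return best if score >= 0.9 else (best, "needs review")

print(suggest("Jane Exmaple"))
```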
Result
The end result of this process is fairly effective – with a series of steps we can repeat every six weeks when a new APPG register is released to check for new webpages for new APPGs, or to recheck previously scanned pages.
The efficient sequencing of steps means that manual review happens on similar tasks in sequence, rather than checking each APPG through all steps.
In general, I’m pretty happy with the results: it made possible a project that would otherwise have required a big (and, for participants, fairly boring) crowdsourcing effort.
One of the problems we have to deal with a lot is fragmented public data, when relevant data is scattered all over the place and is a lot of work to bring back together. Here we found AI tools that were both useful in discovery of a component of the data, and in reconciling to a common standard.
The “AI scrapes then verifies content is present” approach worked well here but would struggle with more complex problems. For instance, if we really needed to be sure we were extracting a correct party label alongside a name, knowing that ‘Labour’ was present on the page wouldn’t be as helpful.
Building on this, the AI-written scraper code worked pretty well. If properly sandboxed (pydantic-ai has support for running python in a sandbox using pyodide), transformation code could be written to convert data between different sets of headers without running the data itself through an LLM to convert it. This potentially helps with some of the fragmented data problems of reconciling compatible but different schemas. LLM-involved approaches have a real potential to create new datasets through easier discovery and joining of data.
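In the simplest cases, that generated transformation code is little more than a header mapping, which can then run over the whole file as plain, auditable code (a sketch with invented headers):

```python
# Illustrative sketch: once an LLM has proposed a mapping between two
# schemas, applying it is ordinary code that never sends the rows
# themselves through a model.
HEADER_MAP = {"Member name": "name", "Political party": "party"}

def remap_row(row: dict) -> dict:
    return {HEADER_MAP[k]: v for k, v in row.items() if k in HEADER_MAP}

row = {"Member name": "Jane Example MP", "Political party": "Independent", "Notes": "n/a"}
print(remap_row(row))
```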
This is a way we can use new technology to make a dataset possible, but also it would be much easier if Parliament gathered and published this in the first place. The equivalent Cross Party Groups in the Scottish Parliament just make a downloadable file of all memberships in their open data portal. We need to think about how new technological approaches are not just propping up bad transparency – but part of encouraging better transparency all the way upstream.
Header image: Photo by Susan Holt Simpson on Unsplash
-
The government is making a significant investment into AI in public services, and systems are changing apace.
AI is increasingly being deployed in every department of government, both national and local, and often through systems procured from external contractors.
In a recent article for Public Technology, mySociety’s Chief Executive Louise Crow flags that we urgently need to update our transparency and accountability mechanisms to keep pace with the automation of state decision-making.
This rapid adoption needs scrutiny: not only because significant amounts of money are being spent; but also because we’re looking at a new generation of digital systems in which the rules of operation are, by their very nature, opaque.
To see Louise’s thoughts on what needs to change, and why, as this new technological era unfolds, read the full piece here.
If you find it of interest, you may also wish to watch this recent event at the Institute for Government, The Freedom of Information Act at 25, where Louise was one of six speakers reflecting on the future of transparency in the UK.
—
Image: Alex Socra
-
mySociety was founded on one seismic technological change: the arrival of the internet, bringing radical new possibilities to the ways in which we engage with democracy.
Now we’re seeing a second upheaval, just as potentially explosive: the wide adoption of generative AI and machine learning tools — particular kinds of artificial intelligence — not least by the UK government, who have made a commitment to see AI “mainlined into the veins of the nation”.
From the visible and novel, like ‘AI bot’ MPs; to the hidden and less-interrogated, like the algorithms that drive decision-making around benefits; to the new capabilities around working with large text datasets that we ourselves are experimenting with at mySociety: artificial intelligence is changing the way democracy works.
We’ve been thinking about AI for some time, as have our colleagues around the world — TICTeC 2025 had a strong strand of pro-democracy organisations showcasing how they are using new technologies to hold authorities to account and support public engagement; alongside developers showing the tools that aim to make the government more responsive.
AI is coming to democracy, whether we like it or not. In many places, it’s already here.
But there are implementations in which it can be highly beneficial to us all; and ways in which it can present a clear and present danger to democracy.
It benefits everyone if there is a high level of understanding of both the challenges and the opportunities of AI in government. Democratic decision makers need to understand digital tech in order to legislate effectively around it, and to develop and procure it effectively.
This is not just so that they can deliver services more efficiently, but also to ensure that they retain the legitimacy of democratic government by using tech and AI in a way that ensures transparency and accountability, preserves public trust and allows the public to understand and participate in the decisions that affect their lives.
Reflections for our time
Over the next few months, we’ll be sharing our own thoughts and experience — alongside invited guest writers who are thinking about how AI interacts with democratic processes and institutions, and how to make that better — in a series of short pieces.
These will examine the different ways that AI is affecting the things we care about here at mySociety:
- Transparent, informed, responsive democratic institutions
- Politicians and public servants who work for the public interest
- Democratic equality for citizens: equal access to information, representation and voice
- A flourishing civil society ecosystem
- The effective and principled use of digital technologies
- Action from politicians to match the evidence of the climate crisis and the level of public concern
- Better communication between politicians and the public, creating space for climate action.
Stay informed
If you’d like to get updates in your inbox, make sure you’ve checked ‘artificial intelligence’ as an interest on our newsletter sign-up form (if you already receive our newsletters, don’t worry – so long as you use the same email address, this will just update your preferences. Just make sure you’ve ticked everything you’re interested in).
By also completing the ‘how do you identify yourself’ section, you’ll help us send you the most relevant material: that means guidance if you work in government or build tech; data and our analysis if you’re a researcher; tools for holding authorities to account if you are an individual or work in civil society, and so on.
—
Image: Adi Goldstein
-
If you were one of the 100+ people who joined us for today’s webinar, you’ll already know it was hugely informative and timely.
We packed three fascinating speakers into a single hour-long session on using FOI to understand AI-based decision making by public authorities. Each brought so many insights that, even if you were there, you may wish to watch it all over again.
Fortunately, you can! We’ve uploaded the video to YouTube, and you can also access Morgan’s slides on Google Slides, here and Jake’s as a PDF, here (Jake actually wasn’t able to display his slides, so this gives you the chance to view them alongside his presentation, should you wish).
Morgan Currie of the University of Edinburgh kicked things off with a look at her research ‘Algorithmic Accountability in the UK’, and especially how opaque the Department for Work and Pensions (DWP)’s use of automation for fraud detection has been over the years.
Morgan explains the techniques used to gain more scrutiny of these decision-making and risk assessment processes, with much of the research based on analysing FOI requests made by others on WhatDoTheyKnow, which of course are public for everyone to see.
Secondly, in a pre-recorded session, Gabriel Geiger from Lighthouse Reports gave an overview of their Suspicion Machines Investigation which delves into the use of AI across different European welfare systems. Shockingly, but sadly not surprisingly, the investigation found code that was predicting which recipients of benefits are most likely to be committing fraud, with an inbuilt bias against minoritised people, women and parents — multiplied for anyone who falls into more than one of those categories.
Gabriel also outlined a useful three-tiered approach to this type of investigation, which others will be able to learn from when instigating similar research projects.
Our third speaker was Jake Hurfurt of Big Brother Watch, who spoke of the decreasing transparency of our public bodies when it comes to AI-based systems, and the root causes of it: a lack of technical expertise among smaller authorities and the contracting of technology from private suppliers. Jake was in equal parts eloquent and fear-inducing about what this means for individuals who want to understand the decisions that have been made about them, and hold authorities accountable — but he also has concrete suggestions as to how the law must be reformed to reflect the times we live in.
The session rounded off with a brief opportunity to ask questions, which you can also watch in the video.
Presented in collaboration with our fellow transparency organisations Access Info Europe and FragDenStaat, this session was an output of the ATI Community of Practice.
—
Image: Michael Dziedzic
-
I’ve written before about how we’re thinking about “low resource” use of Large Language Models (LLMs) — and where some of the benefits of LLMs can be captured without entering the “dependent on external API” vs “need new infrastructure to run internally” trade-offs.
One of the use cases we have for LLMs is categorisation: across parliamentary data in TheyWorkForYou and FOI data in WhatDoTheyKnow, we have a lot of unstructured text that it would be useful to assign structured labels to, for either public-facing or internal processes.
This blog post is a write-up of an experiment (working title: RuleBox) that uses LLMs to create classification rules, which can then be run on traditional computing infrastructure. This allows large-scale text categorisation to run quickly and cheaply, without ongoing dependence on external APIs.
Categorising Early Day Motions
We have a big dataset of parliamentary Early Day Motions (EDMs), which are formally ‘draft motions’ for parliamentary discussion, but effectively work as an internal petition tool where MPs can signal their interest in, or support for, different areas.
For our tools like the Local Intelligence Hub (LIH), we highlight a few EDMs as relevant to indicating whether an MP has a special interest in an area of climate/environmental work. We want to keep these more up to date, and to have a pipeline that’s flexible for future versions of the LIH that might focus on different sectors. We want to be able to tag existing and new EDMs depending on whether they relate to climate/environmental matters, or other domains of interest.
A very simple approach would just be to plug into the OpenAI API and store some categories each day, but this gives us a dependency and an ongoing cost. What we’ve experimented with instead is an approach where we use the OpenAI API to bootstrap a process: we’ve used the commercial LLM to add categories to a limited set of data, and then seen how we can use that to create rules to categorise the rest.
Machine learning and text classification
Regular expressions and text processing rules
The “traditional” way of classifying lots of text automatically is to use text matching or regular expressions.
Regular expressions are a special format for defining when a set of text matches a pattern (which might be “contains one of these words” or “find the thing that is structured like an email address”).
The advantage of this approach is that you can see the rules you’ve added, and the underlying technical implementations are by now extremely fast. The disadvantage is that you might need to add a lot of edge cases manually, and regular expression syntax is not always easy to read.
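As a minimal illustration (the keywords and label here are invented for this sketch, not taken from our actual ruleset), a keyword rule using Python’s standard “re” module might look like:

```python
import re

# A hypothetical keyword rule for an "environment" label:
# match any of a handful of climate-related words, case-insensitively.
ENVIRONMENT_RULE = re.compile(
    r"\b(climate|emissions?|renewable|biodiversity)\b",
    re.IGNORECASE,
)

def matches_environment(text: str) -> bool:
    """Return True if the text triggers the keyword rule."""
    return ENVIRONMENT_RULE.search(text) is not None

print(matches_environment("That this House notes rising carbon emissions"))  # True
print(matches_environment("That this House congratulates the local team"))   # False
```

Rules like this are trivially inspectable, but each new edge case means another pattern added by hand.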
Machine learning
The use of “normal” machine learning provides a new tool. Here, models that have already been trained on a big dataset of the language are then fine-tuned to map input texts to provided categories.
The theory of what is happening here is that in order to accurately “predict the next word”, language models need to have developed internal structures that map to different flows and structures in the text. As such, if you cut off the final “predicting the next word” step, and replace it with a “what category” step, those internal structures can be usefully repurposed to this task.
As such, machine learning based text classifiers can be more flexible. They are picking up patterns like “this flavour of word is in proximity to this flavour of word” that would be difficult to manually code for. The downside is that they are a black box: it is hard to understand why the model made a particular classification decision. They are also more resource intensive and slower to categorise large datasets, but still fundamentally possible to run on traditional hardware.
LLMs
The next wave is LLMs, which take the same basic concept and massively increase the data and the size of the model. Here, rather than replacing the “next word” step, the LLM is trained on datasets that contain both instructions and the results of following those instructions. This makes zero-shot classification possible: without retraining, a model can be given a text and a list of labels, and it can assign the appropriate label.
This remains a (now massive) black box, but errors in category assignment can be reduced by adjusting the instructions. The new downsides compared to smaller machine learning models are that the much larger model size hugely increases the cost of self-hosting, and creates dependencies on the external companies providing models. If you use proprietary models (which are regularly updated and deprecated) this creates problems for reproducible processes.
RuleBox approach
The RuleBox approach combines aspects of both. One of the things that LLMs are quite good at is writing code to solve stated problems. Here we’re doing a version of that: providing text and a category, and asking the model to produce a set of regular expressions that should assign that category.
This has its own set of pros and cons: you are still bound by the underlying limitation of regular expressions, that they match on the text itself rather than the vibes of the text (which language models are better at). But you have massively reduced the labour needed to create the huge set of rules, and once you have them they can be applied at speed on traditional hardware.
This is part of a focus on “low resource” use of LLMs – where we want to think about where we can get the most value out of new technology, in a way that avoids dependence on external providers or hugely increased capacity requirements.
The process
We used an OpenAI-based process to assign labels to a set of 2,000 EDMs (1,000 each for a training and a validation dataset).
We then created a basic structure for holding regular expression rules, using Pydantic for the underlying data structure of the collection. Each rule holds either a list of regexes combined with AND (all must match) or with OR (at least one must match), plus optional NOT rules that negate a positive match.
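A sketch of that structure, using stdlib dataclasses in place of Pydantic (field names and example patterns are illustrative, not the project’s actual schema):

```python
import re
from dataclasses import dataclass, field

@dataclass
class Rule:
    """One labelling rule: AND/OR over patterns, with optional NOT patterns."""
    label: str
    mode: str = "OR"  # "AND": all patterns must match; "OR": any may match
    patterns: list[str] = field(default_factory=list)
    not_patterns: list[str] = field(default_factory=list)  # negate a positive match

    def matches(self, text: str) -> bool:
        hits = [re.search(p, text, re.IGNORECASE) for p in self.patterns]
        positive = all(hits) if self.mode == "AND" else any(hits)
        if positive and any(re.search(p, text, re.IGNORECASE) for p in self.not_patterns):
            return False
        return bool(positive)

rule = Rule(
    label="environment",
    mode="OR",
    patterns=[r"\bclimate\b", r"\bflood(ing)?\b"],
    not_patterns=[r"\bflood of (letters|emails)\b"],
)
print(rule.matches("calls for urgent action on climate change"))   # True
print(rule.matches("notes the flood of letters received by MPs"))  # False
```

The NOT patterns are what let the refinement loop carve exceptions out of rules that are otherwise too broad.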
Once we have the holder for a set of rules, and a dataset with ground-truth labels, we can calculate mismatches between what the rules say and the correct result. Running this in a loop, with steps that query an LLM, refines the ruleset.
The steps are:
- Calculate mismatches between ground truth labels, and assigned labels: finding both missing labels and incorrect labels.
- AI: For each missing label, create a new regex rule that would assign the correct label.
- AI: For each incorrect label, adjust and replace the regex rules that triggered this label.
- Repeat until no missing or incorrect labels.
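The control flow of the steps above can be sketched as a runnable toy. The propose_rule stub stands in for the LLM call (which in the real system writes a regex); here it just escapes the first word of the text, purely to show the loop mechanics, and for brevity this version only handles the “missing label” half of the process:

```python
import re

def apply_rules(rules, text):
    """Return the set of labels whose pattern matches the text."""
    return {label for label, pattern in rules
            if re.search(pattern, text, re.IGNORECASE)}

def propose_rule(text, label):
    """Stub for the LLM call: invent a pattern guaranteed to match this text."""
    return (label, re.escape(text.split()[0]))

def refine(rules, dataset, max_rounds=10):
    """Loop until the rules reproduce the ground-truth labels (or we give up)."""
    for _ in range(max_rounds):
        missing = [(text, l) for text, labels in dataset
                   for l in labels - apply_rules(rules, text)]
        if not missing:
            break
        for text, label in missing:
            rules.append(propose_rule(text, label))
    return rules

dataset = [("Flooding on the high street", {"environment"}),
           ("Climate targets for 2030", {"environment"}),
           ("Library opening hours", set())]
rules = refine([], dataset)
for text, labels in dataset:
    assert apply_rules(rules, text) == labels
```

The real loop also repairs rules that fire incorrectly, which is where the NOT patterns and rule replacement come in.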
PydanticAI is used to interface with the OpenAI API. This includes not just using Pydantic to validate the returned data structure, but extra validation that the resulting rules actually match the text that was input. So, for instance, if a rule generated to assign a label to a piece of text fails to match that text, the failure is passed back to trigger a retry.
The initial attempt got stuck in a loop, creating rules that were too general and then trying to narrow them down. At that point we cut the categories down to just a few we were really interested in; once that performed better, we expanded out to eight categories where it felt like a keyword approach should perform reasonably well (or at least successfully generate rules). This ended up with 1,500 regular expressions assigning eight categories.
Applying the rules
Once we have the rules, we know they work for the training dataset, but how useful are they in general?
Using the validation dataset, we can see the following differences:
- Correct labels: 230
- Missing labels: 73
- Incorrect labels: 41
- Items where no labels were assigned: 808 / 1000 total items
Reviewing these, the incorrect labels generally felt fair enough: these tended to be examples that contained obvious environment-related keywords, but were part of longer lists where the labelling process did not judge the environment to be one of the focuses of the text. The missing labels were more of a problem: 33 of them were environmental ones. Expanding the training data should improve this, but there is always going to be a long tail that’s missed.
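Treating each assigned label as a prediction, the validation numbers above translate into rough precision and recall figures:

```python
correct, missing, incorrect = 230, 73, 41

precision = correct / (correct + incorrect)  # of labels assigned, how many were right
recall = correct / (correct + missing)       # of true labels, how many were found

print(f"precision: {precision:.2f}")  # 0.85
print(f"recall: {recall:.2f}")        # 0.76
```

This matches the qualitative picture: the rules are fairly trustworthy when they fire, but miss a long tail of items they should have caught.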
Something else we experimented with at this stage was moving the rule-applying process from Python to Rust (using an LLM to translate a basic version of the Python mechanics). This cut the time taken to categorise 13,000 EDMs from 2 minutes to 4 seconds. The benefit isn’t just speed on this dataset, but that much more complicated rulesets would not cause a big slowdown.
What have we learned
In general this is an approach worth investigating further as a bridge between several useful features: with it, we are able to translate an initial high-intensity use of LLMs into a process that runs fast on traditional hardware, and, importantly, is not a black box in terms of how it assigns labels.
It doesn’t completely carry over the benefits of LLMs: it is better for smaller, more precise categories, and really needs a good theory of why a keyword approach would be a sensible way of categorising something. It might be a good transitional approach for a few years while options stabilise around more open models with lower resource requirements.
Next steps
The next steps on this are to expand the training data a bit and start seeing if we can practically make use of the categories assigned, or if the accuracy causes problems.
Depending on how this goes, we can revisit the initial experiment code and tidy it up into a more general classification tool. This could tackle other suitable classification problems we have, and we could make the tool more widely available. An advantage of this kind of approach (as with our previous work around vector search) is that it is the kind of project where “a technically-minded volunteer helped us to create a tool” can help organisations without creating significant new dependencies or infrastructure requirements.
We also want to think about where hybrid approaches might be useful. For instance, in these datasets, most items are not labelled at all. A fast first pass that identifies potential items could then switch to an LLM approach to knock out false positives from the data. Similarly, once we have a smaller pool of environmentally-linked items, further subclassification using LLMs is much more viable.
Our general approach is to identify the things that LLMs can do uniquely well, and build them into overall processes that tame some of the things that worry us about AI in general. Here, a focused use of LLMs has resulted in new processes that are both fast and inspectable. For more about our approach, read our AI framework.
Photo by Marc Sendra Martorell on Unsplash
-
TICTeC, our Impacts of Civic Technology conference, has been running since 2015. Over the years, we’ve seen shifts within both tech and democracy that have been reflected as priority topics: from the foundational (and evergreen) question of ‘how can you assess the value of civic technology if you don’t measure its impacts?’, to the rise of authoritarian ‘strong man’ leaders across the world, to a surge of enthusiasm for what blockchain can do around civic tech.
As each of these topics rise to the top of the civic tech community consciousness, TICTeC has provided a natural place to air questions, concerns and solutions.
This year, of course, the foundation-shaking issue is AI. Compared to 2024, when the technology was just beginning to be applied in our field, there’s been a maturing of the discussion, and much more concrete engagement with both the opportunities and the challenges that AI brings around government, truth, trust and delivery.
Our job is to make sure we steer towards the good — or, to phrase it in alignment with mySociety’s own aims, to examine how to engage critically and transparently with AI to create a fair and safe society.
AI across TICTeC 2025
The theme of AI was woven through the conference: where it wasn’t the primary topic itself, it coloured our thinking and had relevance everywhere.
Sessions dealing primarily with AI could be divided into three broad angles:
- Since AI is already making inroads into governance systems, how can we ensure it is used well?
- How have AI’s capabilities been harnessed to make civic tech tools, improve functionality or increase efficiency, and how’s that going?
- Can tools counter the problems that AI presents around truth and trust?
Let’s look at each of these in turn.
AI and democratic governance
Both of our keynote speakers were keen to point out the need for oversight and citizen participation as AI is rapidly adopted across government systems.
Marietje Schaake, whose presentation you can rewatch here, warned of the dangers of private tech firms holding more power than our constitutional democracies, thanks to the limitless profits to be made from this new technology; while Fernanda Campagnucci (presentation here) advocated for citizens to be allowed into the decision-making processes not just around governance itself, but in the making of the tools that facilitate it.
We also heard from the people at the frontline of governance. An instructive session from Westminster Foundation For Democracy and the Hellenic Parliament (not recorded) quizzed participants on how comfortable they would be in easing the administrative burden of parliaments by allowing AI to help categorise, filter and even answer letters from citizens. Would our opinion change if we knew, for example, that there was a backlog of 40,000 messages to representatives?
In a session deeply rooted in the realities of running a local authority during a period of tech acceleration, Manchester City Council explained that in a city where 450,000 people don’t even use the internet, it is crucial to ensure AI is being used ethically and to communicate how it affects citizens’ lives: “Whether or not you choose to interact with AI there’s no way of opting out – AI based decision making is happening around you.”
Three speakers from the Civic Tech Field Guide laid out the case for audits on how AI is being used in your own community, showing how anyone can do it, and Felix Sieker from Bertelsmann Stiftung made a strong argument for public AI, with proper accountability and democratic oversight, rather than the power being concentrated in a handful of private firms — something that is already being developed in several different forms, including by Mozilla.
MIT GOV/LAB ran a workshop (not recorded) in which we could chat with a simulation of a person from the future about the effects of a climate policy, then decide whether or not we would implement that policy once we had a human account of its results. This is part of ongoing research into helping to break deadlocks in policy decision-making.
How AI is already being used in civic tech
Both Code for Pakistan and Tainan Sprout showed how they’ve deployed AI to allow citizens to query dense policy documentation and get answers that are easy to understand.
Demos talked about the work they’ve been doing around a new AI-powered digital deliberation process called Waves, hoping to ‘do democracy differently’ in our current crisis of mistrust.
Dealing with AI and misinformation
Camino Rojo from Google Spain showcased new tools, some of which are shortly to be rolled out, to help counteract misinformation. In particular, these allow users to check whether or not media displayed in search results was artificially generated. At the moment, the onus lies with the image generator to provide this information. Strict guidelines apply, in particular, to those advertising around sensitive areas such as elections.
AI and mySociety
In the final session of the conference, we presented the various ways that we’ve been exploring how AI can support mySociety’s work. You can rewatch this session in full here.
We have been guided by our own AI framework, in which we set out the six ethical principles to which we adhere when adopting this (or any) new technology. In essence, these can be boiled down to the single sentence: “We should use AI solutions when they are the best way of solving significant problems, are compatible with our wider ethical principles and reputation, and can be sustainably integrated into our work.”
In other words, we are not working backwards from the existence of AI to see what we could do with it, but approaching from the question of what we want to achieve, and then examining whether AI would aid us to do so more efficiently.
In this session you can discover how we’ve used AI to more effectively deal with problems in bulk, and make information easier for everyone to access across our work in Transparency; hear thoughts on how, for our work in Democracy, and especially the recent WhoFundsThem project, we’ve found that a human approach is sometimes needed — but that there are some tasks that AI can make easier here.
For the future we’re thinking about AI as it might apply to WriteToThem not to burden representatives with more mail, but perhaps communications of a higher quality.
Overall, we’re keeping a wary eye open for how AI will almost certainly be (and already is?) muddying the ability to trust the provenance of information — especially given that mySociety is essentially a ‘resupplier’ of data from public authorities and Parliament.
In a LinkedIn post, our Democracy Lead Alex got at the core of the challenges ahead of us all in the civic tech field, when he said: “Different kinds of technologies make different kinds of futures easier – and what we’re trying to do with pro-democratic tech is to make democratic futures easier. But the opposite is obviously [possible], and AI has arrived at the right time to merge aesthetically and ideologically with authoritarian regimes.
“A core to the spirit of civic tech is persuasion by demonstration – and to me TICTeC is a wonderful distillation of that spirit of both imagining better things, and doing the work to show what’s possible.”
And on that thought, we will roll up our sleeves and work towards the version of the future that is better for everyone.
—
We’re leading the conversation on AI and democratic decision making —
and we need your help.
mySociety was founded more than two decades ago to help democratic governance deliver on the raised expectations of the internet era.
We are in a period in which the relationship between tech and government is more entangled and fraught than ever. We’re stepping up, but we can only do so with your support. Please do consider making a donation.
-
Artificial intelligence and machine learning seem to be everywhere at the moment – every day there’s a new story about the latest smart assistant, self-driving car or the impending takeover of the world by robots. With FixMyStreet having recently reached one million reports, I started wondering what kind of fun things could be done with that dataset.
Inspired by a recent post that generated UK place names using a neural network, I thought I’d dip my toes in the deep learning sea and apply the same technique to FixMyStreet reports. Predictably enough the results are a bit weird.
I took the titles from all the public reports on fixmystreet.com as the training data, and left the training process to run overnight. The number crunching was pretty slow, and the calculations had barely reached 5% by the morning. I suspect the training set was a bit too large at over a million entries, but the end result still gives enough to work with.
The training process produces checkpoints along the way, which you can use to see how the learning is progressing. After 1,000 iterations the model was starting to be aware that it should use words, but didn’t really know how to spell them:
Mertricolbes Ice does thrown campryings Sunky riking proper, badger verwappefing cars off uping is! Finst Knmp Lyghimes Jn fence Moadle bridge is one descemjop
After 15,000 iterations it’s starting to get the hang of real words, though still struggling to form coherent sentences:
Untaxed cacistance. Broken Surface in ARRUIGARDUR. Widdy movering Cracked already nail some house height avenue. Light not worky I large pot hole Dumped shood road nod at street. Grim Dog man Ongorently obstructing sofas. This birgs. Serious Dirches
After 68,000 iterations there seems to be enough confusion in the training data that things start to go south again with the default parameters:
Urgely councille at jnc swept arobley men. They whention to public bend to street? For traffic light not working
Tweaking the ‘temperature’ of the sampling process produces increasingly sensible results:
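For readers unfamiliar with the term: “temperature” rescales the model’s raw next-character scores before sampling, so low values concentrate probability on the likeliest characters (safer, more repetitive text) and high values flatten the distribution (more surprising output). A minimal sketch of temperature sampling, not the actual char-rnn code:

```python
import math
import random

def sample_with_temperature(logits, temperature=1.0, rng=random):
    """Sample an index from raw model scores, rescaled by temperature."""
    scaled = [l / temperature for l in logits]
    m = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    probs = [e / total for e in exps]
    # Draw from the resulting distribution.
    r = rng.random()
    cumulative = 0.0
    for i, p in enumerate(probs):
        cumulative += p
        if r < cumulative:
            return i
    return len(probs) - 1

# With a very low temperature, the highest-scoring option wins almost every time.
logits = [2.0, 1.0, 0.1]
picks = [sample_with_temperature(logits, temperature=0.1) for _ in range(100)]
print(picks.count(0))
```

At temperature 1.0 the same scores would give the other options a real chance, which is where the stranger report titles come from.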
Large crumbling on pavement Potholes all overgrown for deep pothole Very van causing the road Very deep potholes on pavement Weeds on the pavement Several potholes in the road Rubbish Dumped on the road markings Potholes on three away surface blocking my peride garden of the pavement Potholes and rubbish bags on pavement Poor road sign damaged Poor street lights not working Dog mess in can on road bollard on pavement A large potholes and street light post in middle of road
As well as plenty of variations on the most popular titles:
Pot hole Pot hole on pavement Pot holes and pavement around Pot holes needings to path Pothole Pothole dark Pothole in road Pothole/Damaged to to weeks Potholes Potholes all overgrown for deep pothole Potholes in Cavation Close Potholes in lamp post Out Potholes in right stop lines sign Potholes on Knothendabout Street Light Street Lighting Street light Street light fence the entranch to Parver close Street light not working Street light not working develter Street light out opposite 82/00 Tood Street lights Street lights not working in manham wall post Street lights on path Street lights out
It also seems to do quite well at making up road names that don’t exist in any of the original reports (or in reality):
Street Light Out - 605 Ridington Road Signs left on qualing Road, Leave SE2234 4 Phiphest Park Road Hasnyleys Rd Apton flytipping on Willour Lane The road U6!
Here are a few of my favourites for their sheer absurdity:
Huge pothole signs Lack of rubbish Wheelie car Keep Potholes Mattress left on cars Ant flat in the middle of road Flytipping goon! Pothole on the trees Abandoned rubbish in lane approaching badger toward Way ockgatton trees Overgrown bush Is broken - life of the road. Poo car Road missing Missing dog fouling - under traffic lights
Aside from perhaps generating realistic-looking reports for demo/development sites, I don’t know if this has any practical application for FixMyStreet, but it was fun to see what kind of thing is possible with not much work.
—
Image: Scott Lynch (CC by/2.0)