Recently we wrote about why we’re now listing APPGs in TheyWorkForYou. This blog post goes into more detail about the technical process we use to gather APPG membership lists.
We have two methods of getting the memberships of APPGs. The first is finding whether a group already publishes its list on its website. The second is using Parliament’s rules to ask the APPG contact for the list. So we need to: a) find all the APPG websites; b) see if they publish members lists; c) if not, ask for the list; and d) get those lists into a consistent format.
Data that is fragmented and not in the format we want is a fairly common civic tech problem. The solution is to write a ‘scraper’ that reads the content of a website and has a process for converting it to a more structured format.
This works well when dealing with only a few sources (e.g. the memberships of the UK’s parliaments need only a few different scrapers), or where a common format is being used (e.g. many local government websites use similar providers). In the case of APPGs, there is no common template being used. We just have a set of a few hundred websites that may (or may not) contain a list of names.
Rather than a traditional scraper, we have built an agentic AI/LLM approach that can more flexibly extract memberships from websites. The end result is a tool with a careful sequencing of manual and automated steps, injecting human review in structured ways. Rather than relying on an “AI makes mistakes” disclaimer, we built a structured process that checks elements efficiently one group at a time, and can lock off errors before proceeding to the next stage. This was also an experiment in using LLMs to write scraper tools, as well as some of the tools needed for the manual review steps.
Practically, this was an effective way of getting the information we needed that turned a very hard problem into one that we can dependably run regularly. It also suggests more generally useful ways of approaching fragmented data problems (more on this at the end of the post).
Building agentic approaches
An ‘agent’ is often poorly defined, but broadly it’s a language model interface that is given tools (specific functions), a task, and an output data structure, and loops between these until it produces a result.
To build agentic functions, we used the PydanticAI framework, which acts as a connector between the prompt, the input data, the output data structure, the functions the agent has access to, and any bespoke validation of the results. The end result is a function that accepts structured input, and returns structured output, relatively painlessly.
Although this project uses OpenAI’s GPT models, in future experiments we can use the PydanticAI approach to connect to open source models (the framework is designed to be model-agnostic). In principle this means the project could switch the underlying provider in future.
Process
Step 1: Writing a scraper
The first thing we needed to do was to get the official data from Parliament’s APPG register into a more structured form.
You can see an example of this page for the Africa APPG. This is a good task for a traditional scraper, but would also have been a fiddly problem. Using ChatGPT, we gave it an extract of the HTML, and asked for a Pydantic data structure and script to convert the data. This worked pretty well, with some tweaking to the format over time. When errors emerged in different APPGs – passing the error and an understanding of what should have happened back to the Copilot agent (using a Claude model) led to working fixes. In using the coding agent the key decision was deciding which bit of the project to be opinionated about – and this has mostly meant being very explicit about data structures (and validation to ensure they’re correct), and more relaxed about the pipes that connect things up.
Step 2: Adding categories to APPGs
From the official data, we only know whether an APPG is a country or subject area group. We want to make the list a bit more explorable by breaking it down into categories.
In the spirit of experimenting with LLMs, we copied all subject area APPGs’ names and purpose statements into one of OpenAI’s reasoning models and asked for 10–20 sub-categories. It came back with 20, and they looked reasonable.
We then created a small functionless agent interface, giving it the title and purpose of a specific APPG, and returning a list of potential categories (preferring one, but allowing all that seem relevant).
Spot-checking these, they seem reasonable, and for the purpose of breaking down the big list a bit this is a good step up. It means we can quickly see the APPGs that are likely to be relevant to, say, environmental matters.
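As a sketch, the output structure for such a categorising agent might use a `Literal` type so that validation rejects any invented category (the category names below are placeholders, not our real list):

```python
from typing import Literal

from pydantic import BaseModel, Field

# Placeholder sub-categories; the real ~20 came from a reasoning
# model pass over the APPG names and purpose statements.
Category = Literal["Health", "Environment", "Trade", "Culture"]


class CategoryAssignment(BaseModel):
    preferred: Category
    also_relevant: list[Category] = Field(default_factory=list)


# The agent is given a title and purpose and must return this
# structure; anything outside the fixed list fails validation.
result = CategoryAssignment(preferred="Environment", also_relevant=["Health"])
```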
Step 3: Finding missing websites
Some APPGs list their external website in the register – some do not. Here we use AI tools as part of the workflow to find those missing sites (which may not exist).
We created an agent function with access to a web search tool (Tavily), a function to check whether a URL is valid, and a prompt to help identify the correct site. This creates a loop that searches for and identifies a good candidate for the website.
At this point, there is a manual check that prompts the user to review each site one-by-one before confirming it as a valid site. 45/74 sites identified in the first wave were valid. Invalid websites were news articles, APPGs in other parliaments, or sites for previous iterations of that APPG.
This is not comprehensive and we and our volunteers found some more manually after the fact – but it is an interesting trial in finding data starting only with a search engine.
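One of the tools such an agent needs is a cheap validity check before anything gets fetched. A minimal structural version might look like this (a real tool would likely also make an HTTP request to confirm the page exists):

```python
from urllib.parse import urlparse


def plausible_url(url: str) -> bool:
    """Structural check an agent tool can run before a real HTTP
    fetch: the URL must have an http(s) scheme and a host."""
    parsed = urlparse(url)
    return parsed.scheme in ("http", "https") and bool(parsed.netloc)
```

Cheap checks like this stop the agent looping on obviously broken candidates before spending a slower network request on them.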
Step 4: Find published members
The final step is to get a list of members (if published) off these websites. We need a really flexible approach for this. Names might be in a structured list, but they can also be in one paragraph. They might be on a members page, the home page, or spread over three pages. There is no consistency to fall back on.
Here, we created an agent with a function that can fetch a web page and convert it to markdown. The prompt instructs the agent to use this recursively to find the most relevant page (or in some cases pages) that could contain membership information, and to return a data structure of the members (MPs, Lords, Other). This returned over 5,000 names in the data format provided.
The big risk at this point is that, having been asked for a list of MPs, it makes some up. The validation we use for this is to check that each name in the list is present within the HTML content of the page it was extracted from. If there’s an error, it runs again, and will give up rather than use an incorrect list. There is some possibility for misinterpretation – but this prevents outright fabrication. Errors flagged here tended to be cases where the LLM had fixed formatting, meaning the text no longer matched exactly against the page.
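The core of that validation can be very small. Roughly, a check like the following (collapsing whitespace is an assumption about the normalisation needed; the real check may differ):

```python
import re


def missing_names(names: list[str], page_html: str) -> list[str]:
    """Return extracted names that cannot be found verbatim in the
    page content, after collapsing runs of whitespace. An empty
    result means the extraction passes; anything else triggers a
    retry rather than accepting a possibly fabricated list."""
    flat = re.sub(r"\s+", " ", page_html)
    return [n for n in names if re.sub(r"\s+", " ", n.strip()) not in flat]
```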
The key problem here is one that a human would have too – some APPG lists are out of date. Here I added an extra flag that detects lists containing people who have left Parliament, which then need a manual review. The agent was also sometimes picking up lists that were not membership lists: we made some adjustments to the prompt after it picked up attendees at an AGM – which is not wrong, but incomplete.
Step 5: Manual data
As our main blog post talks about, we then needed to contact APPGs directly for lists that were not published. This presented a new problem: what we got back was a combination of spreadsheets and emails with different levels of detail – some including party details in other columns, some not.
Our solution was a Google Doc with each list formatted under a heading with the APPG title – we could simply copy and paste information into it.
This file is then downloaded as markdown and converted into a list of names. There are a few tweaks to clean up leading numbers, and to identify the name component of each line. Again, this step was substantially written via prompt – giving the LLM examples of the problem data so it could create regular expressions to clean it into the basic list of names we needed.
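The sort of generated cleanup looked roughly like this (the patterns are illustrative, not our exact ones):

```python
import re


def extract_name(line: str) -> str:
    """Strip leading list numbering or bullets, and a trailing
    bracketed annotation such as a role or party, from one pasted
    line of a membership list."""
    line = re.sub(r"^\s*(?:\d+[.)]|[-*•])\s*", "", line)  # "1. " or "- "
    line = re.sub(r"\s*\([^)]*\)\s*$", "", line)          # "(Chair)" etc.
    return line.strip()
```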
Step 6: Tidy members information
What we want to do next is get from a list of names to a list of TheyWorkForYou unique IDs.
We have a library that helps reconcile names to IDs, but a challenge here is that there is a huge range of spelling mistakes (sometimes to the extent that you could not actually work out the correct MP).
What we needed was a quick tool to compare the input name against our list of known names and suggest near matches. Here we again turned to the coding agent, posing the problem, providing some snippets to interact with our existing library, and letting it craft a command line interface.
This fairly quickly gave a good interface for reviewing spelling problems (which was later refined to auto-match below a certain threshold). This helper tool is not especially complicated, but as something with a clear input and output, isolated from the rest of the flow, it was a good candidate for testing using Copilot to create the function. In choosing what to spend time on, this would not otherwise have been a priority – but it brought a useful feature into scope.
Result
The end result of this process is fairly effective – with a series of steps we can repeat every six weeks when a new APPG register is released to check for new webpages for new APPGs, or to recheck previously scanned pages.
The efficient sequencing of steps means that manual review happens on similar tasks in sequence, rather than checking each APPG through all steps.
In general, I’m pretty happy with the results: this made possible a project that would otherwise have required a big (and, for participants, fairly boring) crowdsourcing effort.
One of the problems we have to deal with a lot is fragmented public data, where relevant data is scattered all over the place and takes a lot of work to bring back together. Here we found AI tools useful both in discovering a component of the data, and in reconciling it to a common standard.
The “AI scrapes then verifies content is present” approach worked well here but would struggle with more complex problems. For instance, if we really needed to be sure we were extracting a correct party label alongside a name, knowing that ‘Labour’ was present on the page wouldn’t be as helpful.
Building on this, the AI-written scraper code worked pretty well. If properly sandboxed (pydantic-ai has support for running python in a sandbox using pyodide), transformation code could be written to convert data between different sets of headers without running the data itself through an LLM to convert it. This potentially helps with some of the fragmented data problems of reconciling compatible but different schemas. LLM-involved approaches have a real potential to create new datasets through easier discovery and joining of data.
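The transformation code in question can be trivial – for example, a generated mapping between one source’s column headers and a canonical schema (the names here are invented for illustration):

```python
# A hypothetical LLM-generated header mapping: only this code runs
# over the data; the data itself never passes through a model.
HEADER_MAP = {
    "Member name": "name",
    "Political party": "party",
    "House": "chamber",
}


def to_canonical(row: dict[str, str]) -> dict[str, str]:
    """Rename recognised columns to the canonical schema, dropping
    anything unmapped."""
    return {HEADER_MAP[k]: v for k, v in row.items() if k in HEADER_MAP}
```

Because the mapping is plain code with a clear input and output, it can be reviewed and tested once, then run over the data as often as needed.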
This is a way we can use new technology to make a dataset possible, but it would also be much easier if Parliament gathered and published this in the first place. The equivalent Cross Party Groups in the Scottish Parliament just make a downloadable file of all memberships available in their open data portal. We need to think about how new technological approaches are not just propping up bad transparency – but part of encouraging better transparency all the way upstream.
Header image: Photo by Susan Holt Simpson on Unsplash