How can we tell when AI is actually the right tool for the job?

Generative AI is good at solving some kinds of problems, and bad at solving others. With the rush to apply AI approaches across the public and private sectors, we want to encourage people to use the right tool for the right problem. This blog post proposes a simple test for working out whether an AI application is genuinely beneficial for the job in hand.

Generative AI has no concept of truth. It is designed to create outputs that are internally consistent, and these may or may not coincide with the truth, depending on how well the training data and context align with the question. By now, we’ve all heard examples of false-positive hallucinations, where an AI asserts that something exists or was said because doing so is internally consistent with the question, but it turns out not to be true. Depending on the application, unchecked errors like this can have catastrophic effects, which means that validation of outputs is essential.

How to assess your project for AI suitability

In our recent Shifting Landscapes report, we shared a simple matrix that helps to assess how useful it is to apply an AI approach to any given problem.

It asks how hard/expensive it currently is to produce a solution without AI, and how hard/expensive it is to verify that the solution is correct, with four potential outcomes:

|  | Producing a solution is cheap/easy | Producing a solution is hard/expensive |
| --- | --- | --- |
| Verifying the solution is cheap/easy | Weak AI benefits (which may increase at scale) | Significant AI benefits |
| Verifying the solution is hard/expensive | Get a human to do it | Break down the verification problem (and repeat) |
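For those who think in code, the whole matrix collapses into one small decision function. This is purely an illustration (the function and argument names are ours, not from the report):

```python
def ai_suitability(producing_is_cheap: bool, verifying_is_cheap: bool) -> str:
    """Map the two cost axes of the matrix onto its four outcomes."""
    if verifying_is_cheap:
        if producing_is_cheap:
            return "Weak AI benefits (which may increase at scale)"
        return "Significant AI benefits"
    if producing_is_cheap:
        return "Get a human to do it"
    return "Break down the verification problem (and repeat)"
```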

Let’s look at each possible outcome in turn:

1. Weak AI benefits (which may increase at scale)
producing a solution is cheap / verifying the solution is cheap

This applies to tasks where AI tools might help people work more efficiently, but where the resulting impact or time savings are not significant. Over time, or with mass use, the benefits might increase.

Examples here include tasks like letter-writing and making summaries of documents or transcripts. If AI can do the initial grunt work, a human can take over and make tweaks to the output, nominally saving some time.

In our own field of civic tech, we can see this kind of tool being used to help people navigate bureaucracy: it might help format letters to representatives, or make effective appeals when FOI requests are refused.

Cheap processes at scale can also unlock new collective benefits. For instance, MuckRock uses LLMs to extract information and success/fail status from individual FOI responses. Doing this manually is easy for any single request, but creating a useful dataset across the entire corpus would require a lot of people doing the work. An AI approach drops the cost further: the benefit for each individual request is small, but collectively it creates useful data.
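We don’t know the details of MuckRock’s pipeline, so the following is only a sketch of the general pattern: per-item LLM classification that is cheap enough to run across a whole corpus. It assumes the official openai Python package; the model name and prompt are placeholders.

```python
import json

from openai import OpenAI  # assumes the official openai package (>= 1.0)

client = OpenAI()  # reads OPENAI_API_KEY from the environment

SYSTEM_PROMPT = (
    "You will be given the text of a response to a Freedom of Information "
    'request. Reply with JSON: {"status": "granted" | "denied" | "partial", '
    '"reason": "one sentence"}.'
)

def classify_foi_response(text: str) -> dict:
    """Label one FOI response: a small benefit per item, useful in aggregate."""
    reply = client.chat.completions.create(
        model="gpt-4o-mini",  # placeholder; any JSON-capable model works
        response_format={"type": "json_object"},  # ask for strict JSON back
        messages=[
            {"role": "system", "content": SYSTEM_PROMPT},
            {"role": "user", "content": text},
        ],
    )
    return json.loads(reply.choices[0].message.content)
```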

As we note in our AI Framework, we have to recognise that a large number of small uses can build up into a negative effect. For instance, AI-created objections to planning applications might overwhelm a system that was built for a world in which there are higher hurdles to lodging an objection.

2. Significant AI benefits
producing a solution is expensive / verifying the solution is cheap

In this scenario, we’re thinking of situations where it is harder for a human to create a credible solution than it is to check if the outputs are valid. Conceiving a solution might be hard because it requires specialised knowledge, such as coding, or significant time and resources, like the analysis of a huge dataset; but it would be easy for a human to see whether or not the solution is working as intended.

One of the biggest practical uses of AI so far has been in coding, because coding problems fit so well into this category. The structure of computer code is often formally checkable (at least for syntax errors), and there is often a relatively short turnaround between “having code” and “checking the code is effective”. This isn’t to say that all coding fits in this box, but enough of it does that a clearly productive set of tools exists.
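To make “cheap verification” concrete, here is a minimal sketch of two such layers for LLM-generated Python: a near-free syntax check, followed by the project’s existing tests. The file paths are hypothetical, and pytest is assumed to be installed.

```python
import ast
import subprocess
import sys

def cheap_checks(generated_code: str, module_path: str, test_path: str) -> bool:
    """Two layers of cheap verification for LLM-generated Python code."""
    # Layer 1: a formal syntax check costs milliseconds.
    try:
        ast.parse(generated_code)
    except SyntaxError as err:
        print(f"Rejected at the syntax stage: {err}")
        return False

    # Layer 2: write the code out and run the project's existing tests on it.
    with open(module_path, "w") as f:
        f.write(generated_code)
    result = subprocess.run([sys.executable, "-m", "pytest", test_path])
    return result.returncode == 0
```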

There are strong potential benefits here because an expensive process can be made cheaper, while the quality of the output can be checked through relatively cheap verification methods.

Applications in this segment can be impactful even where access to models is relatively expensive, because a small number of LLM users can have a big effect through the products that emerge.

3. Get a human to do it
producing a solution is cheap / verifying the solution is expensive

Some LLM processes produce outputs that cannot be quickly verified by automatic or human means.

Here, using an LLM for the initial solution might be less effective than having a human do it from the start. While tweaking an email that contains slightly poor wording is a cheap correction, adjusting a multi-page report written by an LLM (involving fact checking, correction, restructuring, and so on) might take more effort than simply having someone write the original work.

When humans approach a piece of work like this, the production and verification processes happen pretty much at the same time, because the skills required to produce the work are the same ones needed to judge whether it is valid.

“Use a human” is often most clearly the sensible approach for projects that need a high level of accuracy and confidence in the material produced. For example, we talked to OpenFun about their LawTrace site, which brings together legislative information in Taiwan. They made a point of choosing not to use AI at all in this project. Having accurate information was far more important to users than any convenience AI could introduce.

4. Break down the verification problem
producing a solution is expensive / verifying the solution is expensive

Sometimes solutions are expensive for a combination of reasons, and this can justify investment in trying to split the verification problem into smaller, more tractable pieces.

A sequence of different checks on LLM output can move a problem towards being a strong use of AI: the time needed to produce the solution drops dramatically, while the verification costs remain manageable.

As an example, our APPG scraper sits in this category. We wanted to get accurate lists of parliamentary group memberships from dozens of different websites. Our original idea was that we would need to use a crowdsourcing approach, because we thought an LLM would be vulnerable to inventing lists of MPs. 

But after some consideration, we invested time in a step where we could verify with code whether or not the names extracted were actually listed on the relevant sites. We can see a similar example in the public consensus platform Pol.is, where category descriptions are linked back to concrete sources to make double checking easier.
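The core of that verification step can be surprisingly small. The sketch below is a paraphrase of the idea rather than our actual scraper code; a real version would normalise whitespace and handle name variants, and it assumes the requests library for fetching pages.

```python
import requests

def missing_names(url: str, extracted_names: list[str]) -> list[str]:
    """Return any LLM-extracted names that do NOT appear verbatim on the page.

    An empty list means the extraction passed this check; anything else
    flags a possible hallucination for human attention.
    """
    page_text = requests.get(url, timeout=30).text
    return [name for name in extracted_names if name not in page_text]
```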

Similarly, you might find that aspects of your problem (if not the whole problem) are appropriate for mechanical checking. Could LLM-written code make a custom verification process easier? Can a series of automatic/human checks be made more efficient with a clear verification workflow? Each individual improvement moves your project closer to being a potentially strong use of AI.
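As a sketch of what such a workflow can look like (all names here are ours, not from any particular project): each automatic check filters the output, and only the leftovers reach the expensive human queue.

```python
from typing import Callable, Iterable

def triage(items: Iterable, checks: list[Callable]) -> tuple[list, list]:
    """Run cheap automatic checks first; queue only the failures for a human.

    `checks` is a list of functions that return True when an item passes.
    Every check you manage to automate shrinks the expensive human queue.
    """
    passed, needs_human = [], []
    for item in items:
        if all(check(item) for check in checks):
            passed.append(item)
        else:
            needs_human.append(item)
    return passed, needs_human
```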

Investment in the verification process might move the problem closer to having weak/strong AI benefits, where outputs can be derisked through cheap quality checks — but you’ll only know through systematically breaking it down in this way. 

We hope that, by sharing this matrix, we will encourage more thoughtful deployments of AI technology in governments and beyond. Please feel free to share it with those who will find it useful.

This blog post has been adapted from our report Shifting Landscapes – A practical guide to pro-democratic tech.

Image: Leo Lau & Digit (CC-BY 4.0)