Running open LLM models

Most discussion and usage of LLMs is focused on high-profile closed models such as OpenAI’s ChatGPT family and Google’s Gemini – which are widely available and integrated into a range of existing products and services.

Because these are closed models, access and hosting is controlled by the companies that create them. This presents a dilemma for civic tech organisations that believe in open source: important parts of their processes can disappear into black boxes beyond their control. These services may work well and be affordable today, but they create new risks – specific models might become unavailable, pricing might change, and building on them means lock-in to specific providers.

Open LLM models provide an alternative approach. In a familiar issue from open source licensing, there are different ways in which a model can be ‘open’. Open weights models have the final structure of the model released, and can be run on your own hardware (Meta’s Llama models are an example of this). Fully open models additionally release the underlying (openly licensed) training data, as well as the recipes and evaluation systems used in training. AI2’s OLMo family of models and the Swiss AI Initiative’s recent Apertus model are examples of these. Somewhere in between are approaches like IBM’s Granite models, where the model is released as open weights and the training data was licensed to train on (addressing copyright issues) but is not publicly accessible.

What are weights? Basically a model can be understood as a big network of connections – where the ‘weights’ are how strong (and influential) a connection is. What’s happening in the training process is a refinement of these weights as a result of being exposed to the training data. The weights at the end of the process are the trained model, and can be shared and used by others. But if you also have the training data and process, you can recreate the model step-by-step, with a clear audit trail of what’s in it.
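To make this concrete, here is a toy illustration (not from the original post): a ‘model’ with a single weight, trained by repeated small adjustments until it fits the data. The final value of the weight is the trained model.

```python
# A one-weight 'network': predict y from x as w * x.
# Training nudges w towards the data; the final w is the trained model.
data = [(1.0, 3.0), (2.0, 6.0), (3.0, 9.0)]  # underlying rule: y = 3x

w = 0.0    # initial weight
lr = 0.01  # learning rate: size of each small adjustment
for _ in range(1000):
    for x, y in data:
        error = w * x - y
        w -= lr * error * x  # gradient step on the squared error

print(round(w, 2))  # converges towards 3.0
```

Real models do this across billions of weights at once, but the principle – exposure to data gradually refining connection strengths – is the same.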

Any kind of open weights model is practically appealing because it unlocks new ways to work with private data without sharing it with third parties, and creates more flexibility around infrastructure. For instance, we currently use a fine-tuned version of Llama to help flag immigration correspondence in WhatDoTheyKnow.

Fully open models are ethically appealing because they avoid the issues of models that have been trained on copyrighted data. Their existence is a challenge to an AI policy debate where countries must trade off the rights of creators against the benefits of AI as sold by a handful of companies. They fit well with our open source ethos – and understanding more about how to use them practically helps give us options to improve our own services, and contribute to wider arguments about responsible use of AI.

This blog post is a write-up of several practical experiments in using the 7B parameter version of OLMo-2, both locally on a laptop GPU and remotely using Hugging Face’s inference endpoints.

Using OLMo-2 locally

Our purpose in running something locally is to be able to process sensitive information that should not leave our infrastructure. In this case, using OLMo-2 to create human-readable representations of clusters from WriteToThem survey responses. While users are asked not to include personal information in this survey, enough do that we need to treat the basic dataset as having personal information that should not be shared.

We used llama-cpp (and the associated Python bindings) to run the local model. An alternative local approach is to use ollama to run a local server. The reason for using llama-cpp in this case is that ollama doesn’t always seem to pick up that less well known models can use ‘tools’ correctly (which is required for structured data output). Another benefit of having it run in process rather than as a separate server is that the script can turn the resource-intensive part on and off (although there’s a corresponding start-up time) rather than needing a separate server process to run.

Setting up the libraries

Installing llama-cpp in a way that can use the GPU is not straightforward. This set of instructions for Windows 11/Nvidia GPU mostly worked for me. I additionally needed to add an extra DLL directory before importing from llama_cpp because there’s a DLL folder that the library wasn’t yet referencing. 

Big picture, WheelNext is a project to try and make installing correct versions of the library easier across different OS/GPU combinations. In the meantime, setting up a local machine is a bit fiddly.
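For reference, llama-cpp-python’s documented install pattern passes CMake flags through pip to enable a GPU build – the exact flags vary by library version and GPU, so check the current install docs:

```shell
# Build llama-cpp-python with CUDA support (Linux/macOS shell syntax;
# on Windows, set the environment variable before running pip)
CMAKE_ARGS="-DGGML_CUDA=on" pip install llama-cpp-python
```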

Downloading model information

Llama-cpp uses GGUF files – which package all the weights in a single file. There are libraries to convert from the transformers format – but GGUF versions are often made available by model publishers on Hugging Face.

Downloading the model can be done using the huggingface_hub command line tool (here run via uv).

uvx --from huggingface-hub hf download allenai/OLMo-2-1124-7B-Instruct-GGUF olmo-2-1124-7B-instruct-Q4_0.gguf --local-dir models

This is pulling down a quantised version – which has the same number of parameters, but with the values of the weights rounded down to a much lower precision. This tends to reduce quality much less than it reduces file/memory size (why? Broadly, high fidelity is useful during training, where weights are adjusted in small shifts – but once a model is working, the general structure is good enough) – and it fits the model just inside the capacity of my laptop’s GPU.
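As a back-of-the-envelope sketch of why this matters for fitting in GPU memory (figures are rough, and ignore runtime overheads such as the KV cache):

```python
# Approximate weight-storage cost of a 7B-parameter model at different precisions.
PARAMS = 7_000_000_000

def weights_size_gb(bits_per_weight: float) -> float:
    """Storage for the weights alone, in GiB."""
    return PARAMS * bits_per_weight / 8 / 1024**3

fp16_gb = weights_size_gb(16)  # as typically distributed in transformers format
q4_gb = weights_size_gb(4.5)   # Q4_0: ~4 bits per weight plus per-block scale factors

print(f"fp16: {fp16_gb:.1f} GiB, Q4_0: ~{q4_gb:.1f} GiB")
```

Roughly 13 GiB at fp16 versus under 4 GiB quantised – the difference between needing datacentre hardware and fitting on a laptop GPU.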

This download can also just be done in code:

from functools import lru_cache

from llama_cpp import Llama


@lru_cache
def get_llm():
    return Llama.from_pretrained(
        repo_id="allenai/OLMo-2-1124-7B-Instruct-GGUF",
        filename="olmo-2-1124-7B-instruct-Q4_0.gguf",
    )

Structured data output

To get structured data out of the model, Pydantic AI can be used with Outlines to query the llama-cpp model.

This:

  • makes it easier to define Pydantic data structures that should be returned.
  • makes it easier to swap between local/remote models by swapping the model passed to the agent, but otherwise using a common API.
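As a sketch of what that looks like (the model name and fields here are hypothetical, and the agent wiring should be checked against Pydantic AI’s current documentation):

```python
from pydantic import BaseModel, Field

class ClusterSummary(BaseModel):
    """Hypothetical output structure for describing a survey-response cluster."""
    label: str = Field(description="Short human-readable name for the cluster")
    summary: str = Field(description="One-paragraph description of the common theme")

# With Pydantic AI, the same structure can be requested from either a local or
# a remote model by passing it as the agent's output type, e.g. (sketch):
#
#   agent = Agent(model, output_type=ClusterSummary)
#   result = agent.run_sync("Summarise these survey responses: ...")
#   result.output  # -> a validated ClusterSummary instance

# The JSON schema the model's output is constrained to:
print(ClusterSummary.model_json_schema()["required"])
```

Because only the model passed to the agent changes, the surrounding data structures and prompts stay identical between local and hosted runs.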

Hosted OLMo-2 model

An advantage of any open weights model is being able to run it on a range of infrastructure (and being able to change the infrastructure later). 

In this case, I had a use case where we wanted to do transformations on already public data (the appropriateness of linking to a specific Wikipedia page from a specific sentence in a parliamentary debate)  – and so there was no privacy/security issue for the purposes of the experiment. We are doing further exploration about how we can make this kind of use compliant with our wider legal and privacy commitments. 

Because OLMo-2 is not a commonly used model, there isn’t an inference service that offers it directly as an option (which would be most efficient – as you’re being charged for tokens while the underlying infrastructure is shared between many users). Instead, you need to create a private server that can manage the model. 

Creating an endpoint

Hugging Face Inference Endpoints is the approach I used here – it lets you provision an endpoint connected to a specific model. I’m using the same model as I used locally.

Depending on the properties of the model, the minimum GPU required will be suggested. This model came out at about $0.80 an hour; running the 13B parameter version was about $2 an hour. There are options to run on AWS, Azure and Google Cloud in different regions (although processing data in the EU/UK is a requirement for us, which limits some of the GPU options).

The scale-to-zero time is adjustable down to about 15 minutes, and it takes a few minutes for the endpoint to load up again from zero. In principle, if the access token is scoped correctly, the huggingface_hub library can handle pausing and unpausing the endpoint (or even programmatically creating one) if more control is wanted.
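As a sketch of that kind of control (the endpoint name and token are placeholders; `get_inference_endpoint` and the `pause`/`resume`/`wait` methods are part of huggingface_hub’s Inference Endpoints API, but check the current docs for details):

```python
def wake_endpoint(name: str, token: str) -> str:
    """Resume a paused endpoint and block until it is ready, returning its URL."""
    from huggingface_hub import get_inference_endpoint

    endpoint = get_inference_endpoint(name, token=token)
    endpoint.resume()
    endpoint.wait()  # polls until the endpoint reports it is running
    return endpoint.url

def sleep_endpoint(name: str, token: str) -> None:
    """Pause the endpoint so it stops accruing charges."""
    from huggingface_hub import get_inference_endpoint

    get_inference_endpoint(name, token=token).pause()
```

A batch job could call the first function at the start of a run and the second at the end, so the GPU only costs money while it is actually in use.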

Structured data output

This endpoint works well using some of the example Hugging Face connections for Pydantic AI. Something I had to adjust was adding an adapter to flatten complex JSON schemas (e.g. anything with multiple model types, enums, etc) from using ‘$defs’ references to a plain inline structure, because the Hugging Face text-generation-inference interface can’t handle them.
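A minimal version of that kind of adapter might look like this (a sketch, not the exact code used – it inlines `#/$defs/...` references, and does not handle recursive schemas):

```python
import copy

def inline_defs(schema: dict) -> dict:
    """Return a copy of a JSON schema with '#/$defs/...' references replaced
    by the referenced sub-schemas, and the '$defs' table removed."""
    schema = copy.deepcopy(schema)
    defs = schema.pop("$defs", {})

    def resolve(node):
        if isinstance(node, dict):
            ref = node.get("$ref", "")
            if ref.startswith("#/$defs/"):
                target = resolve(copy.deepcopy(defs[ref.rsplit("/", 1)[-1]]))
                # keep any sibling keys (e.g. 'description') next to the inlined schema
                merged = {k: v for k, v in node.items() if k != "$ref"}
                merged.update(target)
                return merged
            return {k: resolve(v) for k, v in node.items()}
        if isinstance(node, list):
            return [resolve(v) for v in node]
        return node

    return resolve(schema)

schema = {
    "$defs": {"Status": {"type": "string", "enum": ["ok", "error"]}},
    "type": "object",
    "properties": {"status": {"$ref": "#/$defs/Status"}},
}
print(inline_defs(schema)["properties"]["status"])
# -> {'type': 'string', 'enum': ['ok', 'error']}
```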

I have an example of creating a model that Pydantic AI will accept here – the missing config bits are a token associated with the account and the url of the endpoint created. 

So in principle this means we can have an endpoint that gives us access to a GPU-based model for an hour a day at a reasonable price – while we could at a later point swap in a local model without adjusting the general logic of the application. This is well suited to our current anticipated uses in batched backend processes, but would be less efficient if it needed to be responsive around the clock.

Reflecting on the results

Compared to previous projects using the OpenAI API, a key thing to note is that this approach is slower and more fiddly on the infrastructure at hand. I was only using the 7B parameter model, while the 32B parameter model is the one that evaluates closer to GPT-4o mini. As such, prompts needed to be a bit more detailed about what was required. Similarly, a combination of the hardware and not being able to run queries in parallel over wider infrastructure means the process takes longer.

But this is also like comparing cake to a well balanced meal – the benefits of an open model are not just philosophical but practical. With a bit more work on the prompt you can get useful results on a laptop with no dependency on third-party services. That brings into scope a range of use cases that OpenAI is not suitable for. 

Even where, as in the Wikipedia example, there are no privacy issues in using OpenAI, making it easy to swap in an open model makes it much easier to evaluate the effect of doing so. It will now be relatively straightforward to substitute OLMo-2 into Pydantic AI flows using other models and get a baseline feeling for effectiveness. Even where you might choose a closed model in a specific instance, it is very useful to work in a way that means you are not locked in to that model and could switch away in future.

Similarly, having a working process for a non-mainstream model like OLMo-2 makes it easier to explore other models like Apertus. As this has been trained on a wider range of non-English languages, it could provide a more dependable component in LLM integration with the core Alaveteli software – which powers Freedom of Information platforms across a range of languages.

Understanding open models as a practical approach helps us contribute more widely to policy conversations around AI – including which trade-offs and impacts are inherent to the nature of the technology, and which are a consequence of how it is currently controlled and produced.

Open models are always likely to lag slightly behind the frontier models, but they are already incredibly useful technologies compared to what was possible a few years ago. We want to understand more about how we can practically make use of these models – and help make sure the future of LLMs is shaped by ethical considerations about their training and use, rather than accepting them on the terms of the dominant tech giants.

Header image: Photo by Zhang Zi Han on Unsplash