This is a more technical blog post, written as a companion to our recent blog post about local climate data. Read on if you’re interested in the tools and approaches we’re using in the Climate team to analyse and publish data.
How we’re handling common data analysis and data publishing tasks.
Generally we do all our data analysis in Python and Jupyter notebooks. While we have some analysis using R, we have more Python developers and projects, so this makes it easier for analysis code to be shared and understood between analysis and production projects.
We have a standard repository template for working with data processing and analysis. It follows the same basic idea as the cookiecutter data science approach (and steals some of its folder structure): each small project should live in a separate repository.
The template defines a folder structure, and standard config files for development in Docker and VS Code. A shared data_common library provides a base Docker image (so that new repos are faster to get up and running), along with common tools and utilities for dataset management that are shared between projects. This includes helpers for managing dataset releases, and for working with our charting theme. The use of Docker means that the development environment and the GitHub Actions environment can be kept in sync – and so processes can easily be shifted to a scheduled task as a GitHub Action.
The advantage of this common library approach is that it is easy to update the set of common tools from within each new project. Because each project is pegged to a specific commit of the common library, new projects get the benefit of advances, while old projects do not need to be updated all the time to keep working.
This process can run end-to-end in GitHub – where the repository is created in GitHub, Codespaces can be used for development, automated testing and building happen with GitHub Actions and the data is published through GitHub Pages. The use of GitHub Actions especially means testing and validation of the data can live on GitHub’s infrastructure, rather than requiring additional work for each small project on our servers.
One of the goals of this data management process is to make it easy to take a dataset we’ve built for our purposes, and make it easily accessible for re-use by others.
The data_common library contains a dataset command line tool – which automates the creation of various config files, publishing, and validation of our data.
Rather than reinventing the wheel, we use the frictionless data standard as a way of describing the data. A repo will hold one or more data packages, which are a collection of data resources (generally a CSV table). The dataset tool detects changes to the data resources, and updates the config files. Changes between config files can then be used for automated version changes.
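To make the change detection idea concrete, here is a minimal sketch rather than the dataset tool’s actual implementation. It assumes each resource is fingerprinted with a hash stored in a data-package-style descriptor; the function names and the example package are hypothetical.

```python
import hashlib


def resource_hash(csv_bytes: bytes) -> str:
    """Return a stable fingerprint of a resource's contents."""
    return hashlib.sha256(csv_bytes).hexdigest()


def build_descriptor(name: str, resources: dict[str, bytes]) -> dict:
    """Build a minimal data-package-style descriptor for in-memory CSVs.

    The keys used here (name, resources, path, format, hash) follow the
    general shape of a frictionless data package, but this is an
    illustrative subset, not our real config files.
    """
    return {
        "name": name,
        "resources": [
            {
                "name": res_name,
                "path": f"{res_name}.csv",
                "format": "csv",
                "hash": resource_hash(content),
            }
            for res_name, content in sorted(resources.items())
        ],
    }


def changed_resources(old: dict, new: dict) -> list[str]:
    """List resources that are new, or whose hash differs from the old descriptor."""
    old_hashes = {r["name"]: r["hash"] for r in old["resources"]}
    return [
        r["name"] for r in new["resources"] if old_hashes.get(r["name"]) != r["hash"]
    ]


# Hypothetical example: one extra row is added to a resource.
v1 = build_descriptor("example_package", {"authorities": b"code,name\nABC,Example\n"})
v2 = build_descriptor(
    "example_package", {"authorities": b"code,name\nABC,Example\nDEF,Other\n"}
)
print(changed_resources(v1, v2))  # ['authorities']
```

Comparing stored hashes means the check does not need the previous version of the data itself, only the previous config file.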
Leaning on the frictionless standard for basic validation that the structure is right, we use pytest to run additional tests on the data itself. This means we define a set of rules that the dataset should pass (eg ‘all cells in this column contain a value’), and if it doesn’t, the dataset will not validate and will fail to build.
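As an illustration of the kind of rule described above, this is a small pytest-style sketch of an ‘all cells in this column contain a value’ test. The column names and sample data are made up for the example, and it reads from an in-memory string rather than a real resource file.

```python
import csv
import io

# Hypothetical sample standing in for a real CSV resource.
SAMPLE_CSV = """local-authority-code,official-name
ABC,Example Council
DEF,Another Council
"""


def load_rows(text: str) -> list[dict[str, str]]:
    """Parse CSV text into a list of row dictionaries."""
    return list(csv.DictReader(io.StringIO(text)))


def empty_cells(rows: list[dict[str, str]], column: str) -> list[int]:
    """Return the 1-based data row numbers where the column is blank or missing."""
    return [
        number
        for number, row in enumerate(rows, start=1)
        if not (row.get(column) or "").strip()
    ]


def test_official_name_always_present():
    """pytest collects this function; a failure stops the dataset building."""
    rows = load_rows(SAMPLE_CSV)
    assert empty_cells(rows, "official-name") == []
```

Returning the offending row numbers, rather than just True or False, makes the failure message more useful when someone else’s change breaks the rule.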
This is especially important because we have datasets that are fed by automated processes, read external Google Sheets, or accept input from other organisations. The local authority codes dataset has a number of tests to check authorities haven’t been unexpectedly deleted, that the start date and end dates make sense, and that only certain kinds of authorities can be designated as the county council or combined authority overlapping with a different authority. This means that when someone submits a change to the source dataset, we can have a certain amount of faith that the dataset is being improved because the automated testing is checking that nothing is obviously broken.
The automated versioning approach means the defined structure of a resource is also a form of automated testing. Generally following the semver rules for frictionless data (with the exception that adding a new column after the last column is not a major change), the dataset tool will try to determine if a change from the previous version is a MAJOR (backward compatibility breaking), MINOR (new resource, row or column), or PATCH (correcting errors) change. Generally, we want to avoid major changes, and the automated action will throw an error if this happens. If a major change is required, this can be done manually. The fact that external users of the file can peg their usage to a particular major version means that changes can be made knowing nothing is immediately going to break (even if data may become more stale in the long run).
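The versioning rules above can be sketched roughly as follows. This is a simplified illustration, not the dataset tool’s real logic: it only looks at column names and row counts, and the function names are hypothetical.

```python
def classify_change(
    old_fields: list[str],
    new_fields: list[str],
    old_row_count: int,
    new_row_count: int,
) -> str:
    """Guess the semver bump implied by a change to a single resource."""
    # If the old columns are no longer an unchanged prefix of the new ones,
    # a column was removed, renamed, reordered or inserted mid-table.
    # That breaks existing users of the file: MAJOR.
    if new_fields[: len(old_fields)] != old_fields:
        return "MAJOR"
    # New columns added after the last column, or new rows: MINOR.
    if len(new_fields) > len(old_fields) or new_row_count > old_row_count:
        return "MINOR"
    # In this sketch, anything else is treated as a correction: PATCH.
    return "PATCH"


def next_version(current: str, bump: str) -> str:
    """Apply a MAJOR, MINOR or PATCH bump to a 'major.minor.patch' string."""
    major, minor, patch = (int(part) for part in current.split("."))
    if bump == "MAJOR":
        return f"{major + 1}.0.0"
    if bump == "MINOR":
        return f"{major}.{minor + 1}.0"
    return f"{major}.{minor}.{patch + 1}"


bump = classify_change(["code", "name"], ["code", "name", "nation"], 10, 10)
print(bump, next_version("1.2.3", bump))  # MINOR 1.3.0
```

An automated action built on this could then refuse to publish when the result is MAJOR, leaving that decision to a person.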
Data publishing and accessibility
The frictionless standard allows an optional description for each data column. We make this required, so that each column needs to have been given a human-readable description for the dataset to validate successfully. Internally, this is useful as a way of enforcing documentation (and making sure you really understand what units a column is in), and it means that it is much easier for external users to understand what is going on.
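A check like this is simple to express. The sketch below assumes a table schema held as a dictionary with a list of fields, each with a name and an optional description; the example schema and function name are hypothetical.

```python
def fields_missing_description(schema: dict) -> list[str]:
    """Return the names of fields with no (or only a blank) description."""
    return [
        field["name"]
        for field in schema.get("fields", [])
        if not (field.get("description") or "").strip()
    ]


# Hypothetical schema: the second column has not been documented yet.
example_schema = {
    "fields": [
        {
            "name": "local-authority-code",
            "type": "string",
            "description": "Short code identifying the authority.",
        },
        {"name": "emissions", "type": "number"},
    ]
}

missing = fields_missing_description(example_schema)
print(missing)  # ['emissions']
```

Validation can then fail whenever the returned list is not empty.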
Previously, we were uploading the CSVs to GitHub repositories and leaving it as that – but GitHub isn’t friendly to non-developers, and clicking a CSV file opens it up in the browser rather than downloading it.
To help make data more accessible, we now publish a small site for each repo using GitHub Pages, which allows small static sites to be built from the contents of a repository (the EveryPolitician project also used this approach). This means we can have fuller documentation of the data, better analytics on access, sign-posting to surveys, and better sign-posted links to downloading multiple versions of the data.
The automated deployment means we can also very easily create Excel files that package together all resources in a package into the same file, and include the metadata about the dataset, as well as information on how users can tell us how they’re using it.
Publishing in an Excel format acknowledges a practical reality that lots of people work in Excel. CSVs don’t always load nicely in Excel, and since Excel files can contain multiple sheets, we can add a cover page that makes it easier to use and understand our data by packaging all the explanations inside the file. We still produce both CSVs and XLSX files – and can now do so with very little work.
For developers who are interested in making automated use of the data, we also provide a small package that can be used in Python or as a CLI tool to fetch the data, and instructions on the download page on how to use it.
At mySociety Towers, we’re fans of Datasette, a tool for exploring datasets. Simon Willison recently released Datasette Lite, a version that runs entirely in the browser. That means that just by publishing our data as a SQLite file, we can add a link so that people can explore a dataset without leaving the browser. You can even create shareable links for queries: for example, all current local authorities in Scotland, or local authorities in the most deprived quintile. This lets us do some very rapid prototyping of what a data service might look like, just by packaging up some of the data using our new approach.
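As a rough sketch of what ‘publishing our data as a SQLite file’ involves, the example below loads CSV text into a SQLite table using only the standard library. The table name, sample data and URL are hypothetical, and the link helper assumes Datasette Lite’s `?url=` parameter for loading a remote database file.

```python
import csv
import io
import sqlite3
from urllib.parse import quote


def csv_to_sqlite(csv_text: str, table: str, db_path: str) -> int:
    """Load CSV text into a table of text columns; return the row count."""
    rows = list(csv.reader(io.StringIO(csv_text)))
    header, data = rows[0], rows[1:]
    columns = ", ".join(f'"{name}" TEXT' for name in header)
    placeholders = ", ".join("?" for _ in header)
    con = sqlite3.connect(db_path)
    try:
        with con:  # commits on success, rolls back on error
            con.execute(f'DROP TABLE IF EXISTS "{table}"')
            con.execute(f'CREATE TABLE "{table}" ({columns})')
            con.executemany(f'INSERT INTO "{table}" VALUES ({placeholders})', data)
        return con.execute(f'SELECT COUNT(*) FROM "{table}"').fetchone()[0]
    finally:
        con.close()


def datasette_lite_link(db_url: str) -> str:
    """Build a Datasette Lite link for a publicly hosted SQLite file.

    Assumes the `?url=` query parameter; the file also needs to be served
    in a way the browser is allowed to fetch.
    """
    return "https://lite.datasette.io/?url=" + quote(db_url, safe="")


sample = "code,nation\nABC,Scotland\nDEF,Wales\n"
print(csv_to_sqlite(sample, "authorities", ":memory:"))  # 2
print(datasette_lite_link("https://example.org/data.sqlite"))
```

Storing every column as text keeps the sketch short; a fuller version could use the column types from the dataset’s schema.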
Something in use in a few of our repos is the ability to automatically deploy analysis of the dataset when it is updated.
Analysis of the dataset can be designed in a Jupyter notebook (including tables and charts) – and this can be re-run and published on the same GitHub Pages deploy as the data itself. For instance, the UK Composite Rural Urban Classification produces this analysis. For the moment, this is just replacing previous automatic README creation – but in principle makes it easy for us to create simple, self-updating public charts and analysis of whatever we like.
Bringing it all back together and keeping people up to date with changes
The one downside of all these datasets living in different repositories is making them easy to discover. To help out with this, we add all data packages to our data.mysociety.org catalogue (itself a Jekyll site that updates via GitHub Actions) and have started a lightweight data announcement email list. If you have got this far, and want to see more of our data in future – sign up!
One of the things we want to do as part of our Climate programme is help build an ecosystem of data around local authorities and climate data.
We have a goal of reducing the carbon emissions that are within the control of local authorities, and we want to help people build tools and services that further that ambition.
We want to do more to actively encourage people to use our data, and to understand if there are any data gaps we can help fill to make everyone’s work easier.
So, have we already built something you think might be useful? We can help you use it.
Also, if there’s a dataset that would help you, but you don’t have the data skills required to take it further, we might be able to help build it! Does MapIt almost meet your needs but not quite? Let’s talk about it!
You can email us, or we are experimenting with running some drop-in hours where you can talk through a data problem with one of the team.
You can also sign up to our Climate newsletter to find out more about any future work we do to help grow this ecosystem.
Making our existing data more accessible
Through our previous expertise in local authority data, and in building the Climate Action Plan Explorer, we have gathered a lot of data that can overcome common challenges in new projects.
- A swiss-army knife/skeleton key/useful spreadsheet that lists all current local authorities, and helps transform data between different lookups.
- MapIt: an API that can take postcodes and tell you which local authority they’re in (and much more!). Free for low-traffic charitable projects.
- Datasets of which authorities have published climate action plans.
- Datasets of which authorities have published net zero dates, and their scopes.
- A massive 1GB zip of all the climate plans we know about.
- A measure of local deprivation across the whole UK.
- A simplified version of the BEIS local authority emissions data.
- Measures of similarity between all local authorities (emissions, deprivation, distance, rural/urban and then all of those things together).
All of this data (plus more) can be found on our data portal.
We’ve also been working to make our data more accessible and explorable (example):
- Datasets now have good descriptions of what is in each column.
- Datasets can be downloaded as Excel files.
- Datasets can be previewed online using Datasette Lite.
- Datasets come with basic instructions on how to automatically download updated versions of the data.
If you think you can build something new out of this data, we can help you out!
Building more data
There are a lot of datasets we think we can make more of: for example, as part of our prototyping research we did some basic analysis of how we might use Energy Performance Certificate data (for home energy in general, and specific renting analysis).
But before we just start making data, we want to make sure we’re making data that is useful to people and that can help people tell stories, and build websites and tools. If there’s a dataset you need, where you think the raw elements already exist, get in touch. We might be able to help you out.
If you are using our data, please tell us you’re using our data
We really believe in the benefit of making our work open so that others can find and build on it. The big drawback is that the easier we make our data to access, the less we know about who is using it.
This is a problem, because ultimately our climate work is funded by organisations who would like to know what is happening because of our work. The more we know about what is useful about the data, and what you’re using it for, the better we can make the case to continue producing it.
Each download page has a survey that you can fill out to tell us about how you use the data. We’re also always happy to receive emails!
Stay updated about everything
Our work growing the ecosystem also includes events and campaigning activity. If you want to stay up to date with everything we do around climate, you can sign up to our newsletter.
Image: Emma Gossett