This is a more technical blog post in companion to our recent blog about local climate data. Read on if you’re interested in the tools and approaches we’re using in the Climate team to analyse and publish data.
How we’re handling common data analysis and data publishing tasks.
Generally we do all our data analysis in Python and Jupyter notebooks. While we have some analysis using R, we have more Python developers and projects, so this makes it easier for analysis code to be shared and understood between analysis and production projects.
Following the same basic ideas as (and stealing some folder structure from) the cookiecutter data science approach that each small project should live in a separate repository, we have a standard repository template for working with data processing and analysis.
The template defines a folder structure, and standard config files for development in Docker and VS Code. A shared data_common library builds a base Docker image (for faster access to new repos), and common tools and utilities that are shared between projects for dataset management. This includes helpers for managing dataset releases, and for working with our charting theme. The use of Docker means that the development environment and the GitHub Actions environment can be kept in sync – and so processes can easily be shifted to a scheduled task as a GitHub Action.
The advantage of this common library approach is that it is easy to update the set of common tools from each new project, but because each project is pegged to a commit of the common library, new projects get the benefit of advances, while old projects do not need to be updated all the time to keep working.
This process can run end-to-end in GitHub – where the repository is created in GitHub, Codespaces can be used for development, automated testing and building happens with GitHub Actions and the data is published through GitHub Pages. The use of GitHub Actions especially means testing and validation of the data can live on Github’s infrastructure, rather than requiring additional work for each small project on our servers.
One of the goals of this data management process is to make it easy to take a dataset we’ve built for our purposes, and make it easily accessible for re-use by others.
The data_common library contains a
datasetcommand line tool – which automates the creation of various config files, publishing, and validation of our data.
Rather than reinventing the wheel, we use the frictionless data standard as a way of describing the data. A repo will hold one or more data packages, which are a collection of data resources (generally a CSV table). The dataset tool detects changes to the data resources, and updates the config files. Changes between config files can then be used for automated version changes.
Leaning on the frictionless standard for basic validation that the structure is right, we use pytest to run additional tests on the data itself. This means we define a set of rules that the dataset should pass (eg ‘all cells in this column contain a value’), and if it doesn’t, the dataset will not validate and will fail to build.
This is especially important because we have datasets that are fed by automated processes, read external Google Sheets, or accept input from other organisations. The local authority codes dataset has a number of tests to check authorities haven’t been unexpectedly deleted, that the start date and end dates make sense, and that only certain kinds of authorities can be designated as the county council or combined authority overlapping with a different authority. This means that when someone submits a change to the source dataset, we can have a certain amount of faith that the dataset is being improved because the automated testing is checking that nothing is obviously broken.
The automated versioning approach means the defined structure of a resource is also a form of automated testing. Generally following the semver rules for frictionless data (exception that adding a new column after the last column is not a major change), the dataset tool will try and determine if a change from the previous version is a MAJOR (backward compatibility breaking), MINOR (new resource, row or column), or PATCH (correcting errors) change. Generally, we want to avoid major changes, and the automated action will throw an error if this happens. If a major change is required, this can be done manually. The fact that external users of the file can peg their usage to a particular major version means that changes can be made knowing nothing is immediately going to break (even if data may become more stale in the long run).
Data publishing and accessibility
The frictionless standard allows an optional description for each data column. We make this required, so that each column needs to have been given a human readable description for the dataset to validate successfully. Internally, this is useful as enforcing documentation (and making sure you really understand what units a column is in), and means that it is much easier for external users to understand what is going on.
Previously, we were uploading the CSVs to GitHub repositories and leaving it as that – but GitHub isn’t friendly to non-developers, and clicking a CSV file opens it up in the browser rather than downloading it.
To help make data more accessible, we now publish a small GitHub Pages site for each repo, which allows small static sites to be built from the contents of a repository (the EveryPolitician project also used this approach). This means we can have fuller documentation of the data, better analytics on access, sign-posting to surveys, and better sign-posted links to downloading multiple versions of the data.
The automated deployment means we can also very easily create Excel files that packages together all resources in a package into the same file, and include the meta-data information about the dataset, as well as information about how they can tell us about how they’re using it.
Publishing in an Excel format acknowledges a practical reality that lots of people work in Excel. CSVs don’t always load nicely in Excel, and since Excel files can contain multiple sheets, we can add a cover page that makes it easier to use and understand our data by packaging all the explanations inside the file. We still produce both CSVs and XLSX files – and can now do so with very little work.
For developers who are interested in making automated use of the data, we also provide a small package that can be used in Python or as a CLI tool to fetch the data, and instructions on the download page on how to use it.
At mySociety Towers, we’re fans of Datasette, a tool for exploring datasets. Simon Willison recently released Datasette Lite, a version that runs entirely in the browser. That means that just by publishing our data as a SQLite file, we can add a link so that people can explore a dataset without leaving the browser. You can even create shareable links for queries: for example, all current local authorities in Scotland, or local authorities in the most deprived quintile. This lets us do some very rapid prototyping of what a data service might look like, just by packaging up some of the data using our new approach.
Something in use in a few of our repos is the ability to automatically deploy analysis of the dataset when it is updated.
Analysis of the dataset can be designed in a Jupyter notebook (including tables and charts) – and this can be re-run and published on the same GitHub Pages deploy as the data itself. For instance, the UK Composite Rural Urban Classification produces this analysis. For the moment, this is just replacing previous automatic README creation – but in principle makes it easy for us to create simple, self-updating public charts and analysis of whatever we like.
Bringing it all back together and keeping people to up to date with changes
The one downside of all these datasets living in different repositories is making them easy to discover. To help out with this, we add all data packages to our data.mysociety.org catalogue (itself a Jekyll site that updates via GitHub Actions) and have started a lightweight data announcement email list. If you have got this far, and want to see more of our data in future – sign up!
If you visit FixMyStreet and suddenly start seeing spots, don’t rush to your optician: it’s just another feature to help you, and the council, when you make a report.
In our last two blog posts we announced Buckinghamshire and Bath & NE Somerset councils’ adoption of FixMyStreet Pro, and looked at how this integrated with existing council software. It’s the latter which has brought on this sudden rash.
At the moment, you’ll only see such dots in areas where the council has adopted FixMyStreet Pro, and gone for the ‘asset locations’ option: take a look at the Bath & NE Somerset installation to see them in action.
What is an asset?
mySociety developer Struan explains all.
Councils refer to ‘assets’; in layman’s language these are things like roads, streetlights, grit bins, dog poo bins and trees. These assets are normally stored in an asset management system that tracks problems, and once hooked up, FixMyStreet Pro can deposit users’ reports directly into that system.
Most asset management systems will have an entry for each asset and probably some location data for them too. This means that we can plot them on a map, and we can also include details about the asset.
When you make a report, for example a broken streetlight, you’ll be able to quickly and easily specify that precise light on the map, making things a faster for you. And there’s no need for the average citizen to ever know this, but we can then include the council’s internal ID for the streetlight in the report, which then also speeds things up for the council.
So, how do we get these assets on to the map? Here’s the technical part:
The council will either have a map server with a set of asset layers on it that we can use, or they’ll provide us with files containing details of the assets and we’ll host them on our own map server.
The map server then lets you ask for all the streetlights in an area and sends back some XML with a location for each streetlight and any associated data, such as the lamppost number. Each collection of objects is called a layer, mostly because that’s how mapping software uses them. It has a layer for the map and then adds any other features on top of this in layers.
Will these dots clutter up the map for users who are trying to make a report about something else?
Not at all.
With a bit of configuration in FixMyStreet, we associate report categories with asset layers so we only show the assets on the map when the relevant category is selected.
We can also snap problem reports to any nearby asset which is handy for things like street lights as it doesn’t make sense to send a report about a broken street light with no associated light.
Watch this space
And what’s coming up?
We’re working to add data from roadworks.org, so that when a user clicks on a road we’ll be able to tell them if roadworks are happening in the near future, which might have a bearing on whether they want to report the problem — for example there’s no point in reporting a pothole if the whole road is due to be resurfaced the next week.
Then we’ll also be looking at roads overseen by TfL. The issue with these is that while they are always within a council area, the council doesn’t have the responsibility of maintaining them, so we want to change where the report is going rather than just adding in more data. There’s also the added complication of things like “what if the issue is being reported on a council-maintained bridge that goes over a TFL road”.
There’s always something to keep the FixMyStreet developers busy… we’ll make sure we keep you updated as these new innovations are added.
From a council and interested in knowing more? Visit the FixMyStreet Pro website
All mySociety websites have strong security: when you think about some of the data we’re entrusted with (people’s private correspondence with their MPs, through WriteToThem, is perhaps the most extreme example, but many of our websites also rely on us storing your email address and other personal information) then you’ll easily understand why robust privacy and security measures are built into all our systems from the very beginning.
We’ve recently upped these even more for FixMyStreet. Like everyone else, we’ve been checking our systems and policies ahead of the implementation of the new General Data Protection Regulation in May, and this helped us see a few areas where we could tighten things up.
A common request from our users is that we remove their name from a report they made on FixMyStreet: either they didn’t realise that it would be published on the site, or they’ve changed their mind about it. Note that when you submit your report, there’s a box which you can uncheck if you would like your report to be anonymous:
FixMyStreet remembers your preference and applies it the next time you make a report.
In any case, now users can anonymise their own reports, either singly or all at once. When you’re logged in, just go to any of your reports and click ‘hide my name’. You’ll see both options:
Security for users was already very good, but with the following improvements it can be considered excellent!
- All passwords are now checked against a list of the 577,000 most common choices, and any that appear in this list are not allowed.
- Passwords must now also be of a minimum length.
- If you change your password, you have to input the previous one in order to authorise the change. Those who haven’t previously used a password (since it is possible to make a report without creating an account), will receive a confirmation email to ensure the request has come from the email address given.
- FixMyStreet passwords are hashed with an algorithm called bcrypt, which has a built in ‘work factor’ that can be increased as computers get faster. We’ve bumped this up.
- Admins can now log a user out of all their sessions. This could be useful for example in the case of a user who has logged in via a public computer and is concerned that others may be able to access their account; or for staff admin who share devices.
You may know Dr Ben Goldacre from his ‘Bad Science’ and ‘Bad Pharma’ campaigns, which fight misinformation around medicine. Ben has just launched his latest project, the AllTrials Transparency Index — and mySociety helped with the website side of things.
The AllTrials campaign focuses around the fact that a shockingly large proportion of clinical trials do not have their results publicly published.
Not only does this devalue the time, goodwill and even potential risk put in by participants, but there are also issues around bias. Those trials published tend to be the ones which show positive results: if that doesn’t sound like such a terrible situation to you, try playing this game from the Economist magazine, which graphically depicts the problems with skewed coverage. At worst, such selective publication can be dangerous, or lead to poor choices from bodies making medical purchasing decisions.
Transparency and data visualisation are two areas where mySociety has a long history, and so it came to be that we fashioned the deceptively simple AllTrials Transparency Index site, on which anyone can browse the transparency index of the world’s major drug companies, and dive in deeper to the data to see how it was compiled. The source data is free for others to download too, so anyone can integrate it into other projects.
AllTrials are also tracking whether companies register new trials:
That’s the part that anyone can understand — and now, notes for the more technically-inclined who may be wondering how we took the data on each drugs company and presented it in a way that can be quickly and easily taken in.
This is an ongoing campaign with a commitment to future audits, so we wanted to make it easy for the AllTrials team to update the site and republish the source data each time they do.
It’s a static Jekyll site. We wrote a custom plug-in to parse a CSV and produce a page for each company within that CSV, as well as creating some summary data that feeds into the graphs on the front page.
This data is then pulled from the CSV, and D3 is employed to build the graphs and insert them into the generated pages.
The end result is a site that looks good and which can automatically update whenever the underlying data changes. We hope we’ll have played a small part in helping to ensure that it does — and for the better.
You might already be enjoying one of the usability improvements that FixMyStreet version 2.0 has brought, though it’s possible that you haven’t given it much thought.
But for FixMyStreet, we hadn’t given much thought to letting you filter reports by more than one dimension, until Oxfordshire County Council suggested that it would be a useful feature.
For quite some time, you’d been able to filter by category and status (“Show me all pothole reports” or “Show me all ‘unfixed’ reports”), but this new functionality is more flexible.
You can now select multiple categories and multiple statuses simultaneously (“show me all pothole and graffiti reports that are unfixed or in progress”) — and all through the power of tickboxes.
If you’re a non-technical person, that’s all you need to know: just enjoy the additional flexibility next time you visit FixMyStreet. But if you are a coder, you might like to read more about how we achieved this feature: for you, Matthew has written about it over on the FixMyStreet Platform blog.
If you’ve used FixMyStreet recently — either to make a report, or as a member of a council who receives the reports — you might have noticed that the site’s automated emails are looking a lot more swish.
Where previously those emails were plain text, we’ve now upgraded to HTML, with all the design possibilities that this implies.
It’s all part of the improvements ushered in by FixMyStreet Version 2.0, which we listed, in full, in our recent blog post. If you’d like a little more technical detail about some of the thought and solutions that went into this switch to HTML, Matthew has obliged with in a blog post over on FixMyStreet.org.
FixMyStreet has been around for nearly nine years, letting people report things and optionally include a photo; the upshot of which is we currently have a 143GB collection of photographs of potholes, graffiti, dog poo, and much more. 🙂
For almost all that time, attaching a photo has been through HTML’s standard file input form; it works, but that’s about all you can say for it – it’s quite ugly and unfriendly.
We have always wanted to improve this situation – we have a ticket in our ticketing system, Display thumbnail of photo before submitting it, that says it dates from 2012, and it was probably in our previous system even before that – but it never quite made it above other priorities, or when it was looked at, browser support just made it too tricky to consider.
Here’s a short animation of FixMyStreet’s new photo upload, which also allows you to upload multiple photos:
For the user, the only difference from the current interface is that the photo field has been moved higher up the form, so that photos can be uploading while you are filling out the rest of the form.
Personally, I think this benefit is the largest one, above the ability to add multiple photos at once, or the preview function. Some of our users are on slow connections – looking at the logs I see some uploads taking nearly a minute – so being able to put that process into the background hopefully speeds up the submission and makes the whole thing much nicer to use.
When creating a new report, it can sometimes happen that you fill in the form, include a photo, and submit, only for the server to reject your report for some reason not caught client-side. When that happens, the form needs to be shown again, with everything the user has already entered prefilled.
There are various reasons why this might happen; perhaps your browser doesn’t support the HTML5 required attribute (thanks Safari, though actually we do work around that); perhaps you’ve provided an incorrect password.
However, browsers don’t remember file inputs, and as we’ve seen, photo upload can take some time. From FixMyStreet’s beginnings, we recognised that re-uploading is a pain, so we’ve always had a mechanism whereby an uploaded photo would be stored server side, even if the form had errors, and only an ID for the photo was passed back to the browser so that the user could hopefully resubmit much more quickly.
This also helped with reports coming in from external sources like mobile phone apps or Flickr, which might come with a photo already attached but still need other information, such as location.
Of course there were edge cases and things to tidy up along the way, but if the form hadn’t taken into account the user experience of error edge cases from the start, or worse, had assumed all client checks were enough, then nine years down the line my job would have been a lot harder.
Anyway, long story short, adding photos to your FixMyStreet reports is now a smoother process, and you should try it out.