1. Publishing and analysing data: our workflow

    This is a more technical companion piece to our recent blog post about local climate data. Read on if you're interested in the tools and approaches we're using in the Climate team to analyse and publish data.

    How we’re handling common data analysis and data publishing tasks.

    Generally we do all our data analysis in Python and Jupyter notebooks. While we have some analysis using R, we have more Python developers and projects, so this makes it easier for analysis code to be shared and understood between analysis and production projects. 

    Following the same basic idea as the cookiecutter data science approach (and borrowing some of its folder structure), namely that each small project should live in its own repository, we have a standard repository template for data processing and analysis work.

    The template defines a folder structure, and standard config files for development in Docker and VS Code. A shared data_common library provides a base Docker image (for faster setup of new repos), along with common tools and utilities for dataset management that are shared between projects. These include helpers for managing dataset releases, and for working with our charting theme. Using Docker means that the development environment and the GitHub Actions environment can be kept in sync, so processes can easily be shifted to a scheduled task as a GitHub Action.

    The advantage of this common library approach is that the set of common tools can easily be updated from each new project; and because each project is pinned to a specific commit of the common library, new projects get the benefit of advances while old projects do not need constant updating to keep working.

    This process can run end to end in GitHub: the repository is created on GitHub, Codespaces can be used for development, automated testing and building happen in GitHub Actions, and the data is published through GitHub Pages. Using GitHub Actions in particular means testing and validation of the data can live on GitHub's infrastructure, rather than requiring additional work on our servers for each small project.

    Dataset management

    One of the goals of this data management process is to make it easy to take a dataset we’ve built for our purposes, and make it easily accessible for re-use by others. 

    The data_common library contains a dataset command line tool, which automates the creation of various config files, as well as the publishing and validation of our data.

    Rather than reinventing the wheel, we use the Frictionless Data standard as a way of describing the data. A repo will hold one or more data packages, each a collection of data resources (generally a CSV table). The dataset tool detects changes to the data resources and updates the config files. Differences between config files can then be used to drive automated version changes.
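
    As a rough illustration (not our actual tool, and with invented file names), the underlying frictionless Python library can describe and validate data directly:

    ```python
    # A rough sketch using the frictionless library directly; our dataset tool
    # wraps steps like these, and the file names here are invented.
    from frictionless import describe, validate

    # Infer schema and metadata for a CSV resource, and save the config file.
    resource = describe("local_authorities.csv")
    resource.to_json("local_authorities.resource.json")

    # Later, check the data still matches its declared structure.
    report = validate("datapackage.json")
    print(report.valid)  # False if the data has drifted from its description
    ```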

    Screenshot of the CLI --help options for the dataset tool.

    Data integrity

    Leaning on the frictionless standard for basic validation that the structure is right, we use pytest to run additional tests on the data itself. This means we define a set of rules that the dataset should pass (eg ‘all cells in this column contain a value’), and if it doesn’t, the dataset will not validate and will fail to build. 
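
    A simplified sketch of what one of these tests looks like (file and column names are invented for illustration):

    ```python
    # A simplified example of the kind of rule we encode as a pytest test
    # (file and column names are invented for illustration).
    import pandas as pd

    def test_all_rows_have_a_code():
        df = pd.read_csv("data/local_authorities.csv")
        # 'All cells in this column contain a value'
        assert df["local-authority-code"].notna().all()

    def test_codes_are_unique():
        df = pd.read_csv("data/local_authorities.csv")
        assert df["local-authority-code"].is_unique
    ```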

    This is especially important because we have datasets that are fed by automated processes, read from external Google Sheets, or accept input from other organisations. The local authority codes dataset has a number of tests to check that authorities haven't been unexpectedly deleted, that the start dates and end dates make sense, and that only certain kinds of authority can be designated as the county council or combined authority overlapping with a different authority. This means that when someone submits a change to the source dataset, we can have a certain amount of faith that the dataset is being improved, because the automated testing checks that nothing is obviously broken.

    The automated versioning approach means the defined structure of a resource is also a form of automated testing. Generally following the semver rules for frictionless data (with the exception that adding a new column after the last column is not a major change), the dataset tool will try to determine whether a change from the previous version is MAJOR (breaking backward compatibility), MINOR (a new resource, row or column), or PATCH (correcting errors). Generally we want to avoid major changes, and the automated action will throw an error if one is detected. If a major change is required, it can be made manually. Because external users of the file can pin their usage to a particular major version, changes can be made knowing nothing is immediately going to break (even if data may become more stale in the long run).
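
    As a very rough sketch of that logic (the real dataset tool is more thorough than this):

    ```python
    def version_bump(old_columns: list[str], new_columns: list[str]) -> str:
        """Classify a schema change. The real tool also treats added rows or
        resources as MINOR and pure data corrections as PATCH; this sketch
        only looks at the column lists."""
        if new_columns[: len(old_columns)] == old_columns:
            # Existing columns untouched: appending after the last column
            # is the one addition that doesn't count as a breaking change.
            return "MINOR" if len(new_columns) > len(old_columns) else "PATCH"
        # Removed, renamed or reordered columns break existing consumers.
        return "MAJOR"

    assert version_bump(["code", "name"], ["code", "name", "nation"]) == "MINOR"
    assert version_bump(["code", "name"], ["name", "code"]) == "MAJOR"
    ```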

    Screenshot of example pytest tests, showing a check that an authority has been assigned a nation

    Data publishing and accessibility

    The frictionless standard allows an optional description for each data column. We make this required, so that each column must have a human-readable description before the dataset will validate successfully. Internally, this is useful because it enforces documentation (and makes sure you really understand what units a column is in); it also makes it much easier for external users to understand what is going on.
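
    The check itself is straightforward; a minimal sketch of how it could be done with the frictionless library (the real check lives in our dataset tool, and the file name is invented):

    ```python
    # A minimal sketch of enforcing the "every column must have a description"
    # rule with the frictionless library.
    from frictionless import Package

    package = Package("datapackage.json")
    for resource in package.resources:
        for field in resource.schema.fields:
            assert field.description, (
                f"{resource.name}.{field.name} is missing a description"
            )
    ```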

    Previously, we were uploading the CSVs to GitHub repositories and leaving it at that – but GitHub isn't friendly to non-developers, and clicking a CSV file opens it in the browser rather than downloading it.

    To help make data more accessible, we now publish a small GitHub Pages site for each repo; GitHub Pages allows small static sites to be built from the contents of a repository (the EveryPolitician project also used this approach). This means we can have fuller documentation of the data, better analytics on access, sign-posting to surveys, and clearer links for downloading multiple versions of the data.

    Screenshot of data descriptions of the local authorities dataset

    The automated deployment also means we can very easily create Excel files that package together all the resources of a data package into a single file, including the metadata about the dataset, as well as information about how users can tell us how they're using it.

    Publishing in an Excel format acknowledges the practical reality that lots of people work in Excel. CSVs don't always load nicely in Excel, and since Excel files can contain multiple sheets, we can add a cover page that packages all the explanations inside the file, making our data easier to use and understand. We still produce both CSV and XLSX files – and can now do so with very little work.
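
    As an indication of how little work this can be, here is roughly how a multi-sheet Excel file with a cover page can be put together with pandas (column and file names are invented; our real export is generated from the frictionless metadata):

    ```python
    # A simplified sketch of building a multi-sheet Excel file: a cover sheet
    # of column descriptions, then one sheet per resource.
    import pandas as pd

    descriptions = pd.DataFrame(
        {
            "column": ["local-authority-code", "official-name"],
            "description": ["Three letter authority code", "Full official name"],
        }
    )
    data = pd.read_csv("local_authorities.csv")

    with pd.ExcelWriter("local_authorities.xlsx") as writer:
        descriptions.to_excel(writer, sheet_name="About this data", index=False)
        data.to_excel(writer, sheet_name="local_authorities", index=False)
    ```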

    Screenshot of a downloadable Excel file showing different sheets and descriptions

    For developers who are interested in making automated use of the data, we also provide a small package that can be used in Python or as a CLI tool to fetch the data, with instructions on the download page on how to use it.
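
    We won't duplicate those instructions here, but because the data ends up as plain CSV files on GitHub Pages, at its simplest fetching it needs nothing more than pandas (the URL below is a made-up example; the real one appears on each dataset's download page):

    ```python
    # Fetching a published CSV directly. This URL is a made-up example.
    import pandas as pd

    df = pd.read_csv("https://mysociety.github.io/example-dataset/data/example.csv")
    print(df.head())
    ```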

    Screenshot of the command line download instructions for a dataset

    At mySociety Towers, we’re fans of Datasette, a tool for exploring datasets. Simon Willison recently released Datasette Lite, a version that runs entirely in the browser. That means that just by publishing our data as a SQLite file, we can add a link so that people can explore a dataset without leaving the browser. You can even create shareable links for queries: for example, all current local authorities in Scotland, or local authorities in the most deprived quintile. This lets us do some very rapid prototyping of what a data service might look like, just by packaging up some of the data using our new approach.
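
    Producing that SQLite file is itself very little work; a minimal sketch using Simon Willison's sqlite-utils library (file and table names invented):

    ```python
    # A minimal sketch of turning a CSV into a SQLite file for Datasette Lite.
    import csv
    import sqlite_utils

    db = sqlite_utils.Database("local_authorities.db")
    with open("local_authorities.csv", newline="") as f:
        db["local_authorities"].insert_all(csv.DictReader(f))
    ```

    Datasette Lite can then be pointed at the published database file via a URL parameter, which is all those shareable links are doing.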

    Screenshot of Datasette Lite showing a query of authorities in Scotland

    Data analysis

    Something in use in a few of our repos is the ability to automatically deploy analysis of the dataset when it is updated. 

    Analysis of the dataset can be designed in a Jupyter notebook (including tables and charts), and this can be re-run and published on the same GitHub Pages deploy as the data itself. For instance, the UK Composite Rural Urban Classification produces this analysis. For the moment, this just replaces our previous automatic README creation – but in principle it makes it easy for us to create simple, self-updating public charts and analysis of whatever we like.
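
    As an indication of the kind of step involved (not necessarily our exact setup), a notebook can be re-run and rendered to HTML with nbconvert, which a GitHub Action can do on each update (paths are illustrative):

    ```python
    # Re-run a notebook and render it to HTML with nbconvert.
    import nbformat
    from nbconvert import HTMLExporter
    from nbconvert.preprocessors import ExecutePreprocessor

    nb = nbformat.read("analysis.ipynb", as_version=4)
    ExecutePreprocessor(timeout=600).preprocess(nb, {"metadata": {"path": "."}})
    body, _ = HTMLExporter().from_notebook_node(nb)
    with open("analysis.html", "w") as f:
        f.write(body)
    ```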

    Bringing it all back together and keeping people up to date with changes

    The one downside of all these datasets living in different repositories is that they are harder to discover. To help with this, we add all data packages to our data.mysociety.org catalogue (itself a Jekyll site that updates via GitHub Actions) and have started a lightweight data announcement email list. If you have got this far and want to see more of our data in future – sign up!

    Image: Sigmund

  2. Asking questions in public: the Alaveteli experiments

    Suppose we sent an automated tweet every time someone made a successful Freedom of Information request on WhatDoTheyKnow — would it bring more visitors to the site?

    And, if you get a response to your first FOI request, does it mean you are more likely to make a second one?

    These, and many more, are the kind of questions that emerge as we refine the advice that we’re offering partner organisations.

    Our Freedom of Information platform Alaveteli underpins Freedom of Information sites all around the world. When we first launched it, our only priorities were to make the code work, and to make that code as easy as possible to implement. But, as a community emerged around Alaveteli, we realised that we’d all be better off if we shared advice, successes and ideas.

    And that’s where we began to encounter questions.

    Some of them, like how to get more users, or how to understand where users come from, are common to anyone running a website.

    Others are unique to our partner structure, in which effectively anyone in any part of the world may pick up the Alaveteli code and start their own site. In theory, we might know very little more than that a site is running, although we’ll always try to make contact and let the implementers know what help we can offer them.

    There were so many questions that we soon saw the need to keep them all in one place. At mySociety, we're accustomed to using GitHub for anything resembling a to-do list (as well as for its primary purposes: GitHub was designed to store code, allow multiple people to work on that code, and let them raise or review issues with it), and so we created a slightly unusual repo, Alaveteli-experiments.

    Screenshot of the Alaveteli Experiments repo, showing a table of experiments and summaries of their results

    This approach also gives us the benefit of transparency. Anyone can visit that repo and see what questions we are asking, how we intend to find the answers, and the results as they come in. What's more, anyone who has (or opens) a GitHub account will also be able to add their own comments.

    Have a browse and you’ll come across experiments like this one and this one, which attempt to answer the questions with which we opened this post.

    Some of the experiments, like this one to analyse whether people click the ‘similar requests’ links in the sidebar, we’re running on our own site, WhatDoTheyKnow. Others, such as this one about the successful requests listed on every Alaveteli site’s homepage, are being conducted on our partners’ sites.

    Our aims are to find out more about how to bring more users to all Alaveteli sites, how to encourage browsing visitors to become people who make requests, and how to turn one-off requesters into people who come back and make another — and then pass all that on to our partners.

    We hope you'll find plenty of interest on there. We reckon it's all relevant to anyone running an FOI website, and in many cases to anyone wondering how best to improve a site's effectiveness. And we're very happy to hear your ideas, too: if we've missed some obvious experiment, or you've thought of something that would be really interesting to know through this kind of research, you're welcome to let us know.

    You can open your own ticket on the repo, suggest it in the Alaveteli community mailing list, or email Alaveteli Partnerships Manager Gemma.



    Image: Sandia Labs (CC by-nc-nd/2.0)

  3. Co-brands, code additions and pull requests

    More and more people are starting to build websites to help people become more powerful in their civic and democratic lives. Some of these are built on codebases that mySociety has created, which is great. There are some things which we would love to happen when you take our code and re-use it.

    We want people using our code to keep it as up to date as they can, so that they gain the benefits of any changes made to the code by us or by other users. There are a few reasons for this:

    • You can co-brand the site without breaking anything.

    Dave, one of our developers, explains how you do this. “So suppose, instead of calling it FixMyStreet, you want to call it FixMyBorchester with a Borchester logo. Obviously this is a very real requirement, because people want to rebrand. One very feasible (but wrong, as you’ll see…) way of doing this is downloading the FixMyStreet code, finding the bit that paints the FixMyStreet logo, and replacing it with the words <h1>FixMyBorchester</h1> and an image. This would work, in so far as the FixMyBorchester branding would appear on the site.

    But if you then saved and committed your change to git and passed it back to us as a pull request, we would reject it. This is for the obvious reason that if we didn’t, the next time we deployed FixMyStreet in the UK it would have your logo on it.

    However, say we suddenly discover there is a bug in FixMyStreet. For (a bizarre) example, say someone puts the number 0 in instead of a postcode and the site returns a huge picture of a kitten. We love kittens, but that’s not what the site is trying to do. So, we make some fixes to the code to reject zeros, commit them, update the repo, and the fix is now there on the master branch. We write to everyone saying “really everyone, update to the latest (most up-to-date) place on the master branch”. And you think, “yeah OK!” and you download the latest version.

    If you just download it and copy it into place, you’re going to lose your FixMyBorchester changes, because there’s a more recent version of that file from us that hasn’t got them. If you did a “git pull” (which roughly means, “git! get me the latest version of master branch”) then git will refuse because there’s a conflict on that file.

    So, instead of inserting your FixMyBorchester stuff over ours, which can’t work, you make a new directory in the right place called ‘FixMyBorchester‘, put your stuff in there and switch the FixMyStreet config — which knows this is something people want to do — to use that cobrand. Any templates FixMyStreet finds in there will now be used instead of ours. You can now safely update the codebase from our repo from time to time and FixMyStreet and git will never damage your templates, because they are in a place it doesn’t mess with.”

    • You can add new features.

    Dave continues. “Say when someone uses FixMyBorchester it’s essential that you have their Twitter handle, because every time a problem is updated, FixMyBorchester direct-tweets them a kitten for fun. Right now there is no capacity to store a Twitter handle for a user in FixMyStreet.

    You simply add a column to the users table in your database, and add some code for accepting that Twitter handle when you register, sending the kittens, and so on. That’s new code that isn’t in FixMyStreet at all. But sooner or later you’ll need to put at least one line into the main FixMyStreet program code to make this happen. As soon as you do that, you have the same problem we had before, only this time it’s in code, not in an HTML template.

    What we would encourage you to do is put all your new code in a branch that we can look at, and maybe set it to run only if there’s a config setting that says USE_TWITTER=true. That way any implementation that doesn’t want to use Twitter, which is, at this point, every other FixMyStreet installation in the world, won’t be affected by it. You send that to us as a pull request, and a developer checks it’s not breaking anything, is up to scratch in quality, and has good test coverage. Then we’ll accept it.

    Even though currently nobody else in the world wants your Twitter feature, it’s not breaking anything, and it’s now in the repo, so you can automatically update from our master when we change bits of our files, and the installation/overwrite/git-pull will work. Plus anyone who does decide they want this feature will now be able to enable it and use it.”
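
    To make that concrete, here is a generic sketch of the feature-flag pattern Dave describes. It's written in Python purely for illustration (FixMyStreet itself is a Perl codebase, and every name here is invented):

    ```python
    # A generic sketch of gating a new feature behind a config setting.
    import os
    from typing import Optional

    # The flag defaults to off, so installations that never set it see no change.
    USE_TWITTER = os.environ.get("USE_TWITTER", "false").lower() == "true"

    def send_email(address: str, update: str) -> None:
        print(f"email to {address}: {update}")  # stand-in for the real mailer

    def send_kitten_tweet(handle: str, update: str) -> None:
        print(f"tweet to @{handle}: {update} (with kitten)")  # stand-in

    def notify_user(email: str, twitter: Optional[str], update: str) -> None:
        send_email(email, update)            # existing behaviour, unchanged
        if USE_TWITTER and twitter:          # new behaviour, opt-in via config
            send_kitten_tweet(twitter, update)
    ```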

    And all of this helps everyone using the code; you have a secure website that can be patched and updated each time we release something, other people have access to features you’ve built and vice versa. And overall, the project becomes more feature rich.

    Please do make changes and push them back to the main codebase!

    Image credit: US Coast Guard CC BY-NC-ND

  4. FixMyStreet’s been redesigned

    FixMyStreet, our site for reporting things like potholes and broken street lights, has had something of a major redesign, kindly supported in part by Kasabi. With the help of Supercool, we have overhauled the look of the site, bringing it up to date and making the most of some lovely maps. And as with any mySociety project, we’d really appreciate your feedback on how we can make it ever more usable.

    The biggest change to the new FixMyStreet is the use of responsive design, where the website adapts to fit within the environment in which it’s being viewed. The main difference on FixMyStreet, besides the obvious navigation changes, is that in a small-screen environment, the reporting process changes to have a full-screen map and a confirmation step, which we thought would be preferable on small touchscreens and other mobiles. There are some technical details at the end of this post.

    Along with the design, we’ve made a number of other improvements along the way. For example, something that’s been requested for a long time: we now auto-rotate photos on upload, if we can, and we’re storing whatever is provided rather than only a shrunken version. It’s interesting that most photos include correct orientation information, but some clearly do not (e.g. the BlackBerry 9800).

    We have many things we’d still like to do, as a couple of items from our GitHub repository show. Firstly, it would be good if the FixMyStreet alert page could have something similar to what we’ve done on Barnet’s planning alerts service, providing a configurable circle for the potential alert area. We are also going to add faceted search to the area pages, allowing you to see only reports in a particular category, or within a certain time period.

    Regarding native phone apps – whilst the new design does hopefully work well on mobile phones, we understand that native apps are still useful for a number of reasons (not least the fact that photo upload is still not possible from a mobile web app on an iPhone). We have not had time to update our apps, but will be doing so in the near future to bring them more in line with the redesign, and hopefully to improve them generally as well.

    The redesign is not the only news about FixMyStreet today

    As part of our new DIY mySociety project, we are today publishing an easy-to-read guide for people interested in using the FixMyStreet software to run versions of FixMyStreet outside of Britain. We are calling the newly upgraded, more re-usable open source code the FixMyStreet Platform.

    This is the first milestone in a major effort to upgrade the FixMyStreet Platform code to make it easier and more flexible to run in other countries. This effort started last year, and today we are formally encouraging people to join our new mailing list at the new FixMyStreet Platform homepage.

    Coming soon: a major upgrade to FixMyStreet for Councils

    As part of our redesign work, we’ve spoken to a load of different councils about what they might want or need, too. We’re now taking that knowledge, combining it with this redesign, and preparing to relaunch a substantially upgraded FixMyStreet for Councils product. If you’re interested in that, drop us a line.

    Kasabi: Our Data is now in the Datastore

    Finally, we are also now pushing details of reports entered on FixMyStreet to Kasabi’s data store as linked open data; you can find details of this dataset on their site. Let us know if it’s useful to you, or if we can do anything differently to help you.

    Technical details

    For the web developers amongst you: we have a base stylesheet for everyone, and a second stylesheet that is only included if your browser width is 48em or above (an em is a unit of measurement dependent on your font size), or, via a conditional comment, if you’re running Internet Explorer 6–8 (as those browsers don’t handle the modern CSS needed to do this properly, we assume they’ll want the larger styles). This second stylesheet has slight differences below and above 61em. Whilst everything should continue to work without JavaScript, as FixMyStreet’s map-based reporting has since 2007, where JavaScript is enabled it allows us to provide the full-screen map you can see at large screen sizes, and the adjusted process you see at smaller resolutions.

    We originally used Modernizr.mq() in our JavaScript, but found that, due to the way it works (adding content to the end of the document), it can cause issues with e.g. data() set on other elements; so we switched to detecting which CSS is being applied at the time.

    On a mobile, you can see that the site navigation is at the end of the document, with a skip to navigation link at the top. On a desktop browser, you’ll note that visually the navigation is now at the top. In both cases, the HTML is the same, with the navigation placed after the main content, so that it hopefully loads and appears first. We are using display: table-caption and caption-side: top in the desktop stylesheet in order to rearrange the content visually (as explained by Jeremy Keith), a simple yet powerful technique.

    From a performance point of view, on the front page of the site we’re using yepnope (you can get it separately or as part of Modernizr) so that the map JavaScript downloads in the background whilst you’re there, meaning the subsequent map page is hopefully quicker to load. I’m also adding a second tile server today – not because our current one isn’t coping (it is), but just in case something should happen to our main one – we already have redundancy in our postcode/area server MapIt and our population density service Gaze.

    If you have any technical questions about the design, please do ask in the comments and I’ll do my best to answer.