Creating datasets from FOI data

Responses obtained from a widespread FOI project can be difficult to analyse until they are sorted into neat datasets. Sorting allows you to make valid comparisons, pull out accurate statistics and ultimately ensure your findings are meaningful.

In the third seminar of our Using Freedom of Information for Campaigning and Advocacy series, we heard from two speakers: Maya Esslemont from After Exploitation explained how to prepare for an FOI project to ensure you get the best results possible (and what to do if you don’t); and Kay Achenbach from the Open Data Institute explained the problems with ‘messy’ data, and how to fix them.

You can watch the video here, or read the detailed report below.

Preparing for an FOI project

After Exploitation is a non-profit organisation using varied data sources, including FOI requests, to track the hidden outcomes of modern slavery in the UK.

Maya explained that they often stitch together data from different sources to uncover new insights on modern slavery. She began with a case study of some recent work, using WhatDoTheyKnow to help them understand the longer-term outcomes after survivors report instances of trafficking. It was an excellent example of how much work needs to be done before sending your requests if you are to get the results you need.

In this case, After Exploitation were keen to understand whether there is any truth in widely-held assumptions around why human trafficking cases are dropped before they are resolved: it’s often thought that there are factors such as the survivors themselves not engaging with the police, perhaps because of a nervousness around authorities.

But what are these assumptions based upon? The actual information was not publicly available, so no one could know whether cases were being dropped because of low police resource, a lack of awareness or more nuanced factors. Until the data could be gathered and analysed, the perceptions would persist, perhaps erroneously.

Before starting, After Exploitation thought carefully about the audience for their findings and their ultimate aims: in this case the audience would be mostly the media, with the aim of correcting the record if the results flew in the face of what was expected; but they knew that the data would also be of use to practitioners. For example, charities could use it to see which areas to target regionally for training and other types of intervention.

They placed FOI requests with police forces across the country, making sure to ask for data using the crime codes employed by the forces themselves: were cases dropped because of ‘lack of evidence’? Had they been given a status of ‘reported’ without ever becoming an official crime record?

The project had a good outcome: while some requests had to go to internal review, ultimately over 80% of the forces responded with quality data. The findings were worthwhile, too: general perceptions did indeed prove to be wrong and there was no indication that ‘no suspect identified’ was a result of the victim’s lack of involvement. The resulting story was able to challenge the general narrative.

So, how can After Exploitation’s learnings be applied to the work of other organisations or campaigns?

Maya says:

  • Planning, rather than analysis, is the majority of the work;
  • Identify the need and purpose before you even start to pick which authorities to send requests to;
  • Be clear who the audience for your findings is;
  • Consult with other stakeholders to make sure your parameters are really clear.

Planning

Before you even begin, make sure your project isn’t asking for data that has already been collected and is in the public domain — this might seem obvious but it’s easy to overlook. Check other people’s FOI requests (you can do this by searching on WhatDoTheyKnow); look for reports, research, inspectorate/watchdog outputs, and data released as part of parliamentary inquiries.

That said, even if you do find previous data, there is sometimes value in requesting more up to date or more detailed information with a new set of FOI requests. If you see a national report collating data from every council for example, you could do an FOI project asking every council for a more detailed breakdown of what is happening in their region.

But before sending a batch of requests to multiple authorities, ask yourself if there is a centralised source for your data. If so, then just one FOI request might be enough: for example, homelessness data is already collected by the Department for Levelling Up, Housing and Communities, in which case one request to them would save time for both you and more than 300 public authorities.

Another question to ask before starting your project is “what is the social need?” Does this need justify the resource you will expend? Mass FOI projects are a significant time commitment, but the utility might not be limited to your organisation: perhaps you can also identify a social benefit if the data would be of use to other groups, academics or journalists.

Define your intended audience: will the data you gather be of interest to them? Do you have a sense of what they want? For example, MPs often like to see localised data that applies to their constituencies. Journalists like big numbers and case studies. If you think your findings are important but might have limited appeal, you could consider including an extra question to gather details that you don’t need for your own purposes, but which could provide a hook.

Next, will the data that you gather actually be suitable for the analysis you want to perform? To avoid time-consuming mistakes, make sure the data you’ll receive is broken down in the way that you need. As an example, suppose you wanted to ask local authorities for details of programmes offered to children in different age bands: you might receive data from one council that has offerings for children ‘under 18 months’ and another ‘under two years old’ — and where units differ, they are difficult to compare and contrast. Be really precise in your wording so there’s no mismatch, especially if your request is going to a lot of authorities.
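
If, despite careful wording, you still receive responses in mismatched units, you can sometimes map them onto a common scale before comparing. Here is a minimal sketch in Python; the band wordings and cut-offs are invented for illustration.

```python
# Hypothetical mapping from each council's own age-band wording to a
# common upper limit in months, so differently-worded responses can be
# compared on the same scale.
BAND_TO_MONTHS = {
    "under 18 months": 18,
    "under two years old": 24,
    "under 2": 24,
}

def upper_limit_months(band):
    """Return the band's upper age limit in months, or None if unrecognised."""
    return BAND_TO_MONTHS.get(band.strip().lower())

responses = [("Council A", "Under 18 months"), ("Council B", "Under two years old")]
for council, band in responses:
    print(council, upper_limit_months(band))  # 18 and 24: comparable, but not equal
```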

Consider, too, whether you can get enough responses to make your data meaningful: in opinion polling, a sample of around 2,000 people is commonly treated as representative of the population as a whole. Decide how many responses you ideally need for your purposes — and, in a scenario where not all authorities respond, the minimum you can work with.
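
As a rough way to sanity-check those numbers, you could model a few response-rate scenarios before sending anything. The figures below are invented for illustration.

```python
# A minimal sketch with made-up figures: estimate how many responses to
# expect at different response rates, and compare against the minimum
# you have decided you can work with.
authorities_contacted = 343  # e.g. contacting every local authority
minimum_needed = 200         # your own threshold for meaningful analysis

for rate in (0.5, 0.65, 0.8):
    expected = round(authorities_contacted * rate)
    verdict = "enough" if expected >= minimum_needed else "too few"
    print(f"{rate:.0%} response rate -> {expected} responses ({verdict})")
```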

You might want to contact other groups or organisations who could be interested in the same data, and ask if there are details that would be useful to their work.

As suggested in Maya’s case study, try to use existing measurements where you can: if you shape your requests to the methodology the authorities themselves use to collect the information, such as KPIs or their own metrics of success, these will be much easier for them to supply.

If you’re not sure what these metrics are, you can sometimes access internal guidance by googling the name of the authority plus ‘guidance’. Alternatively, submit scoping requests to a handful of authorities to ask how they measure success, etc.

At this stage it’s also useful to decide what quality of data you will include or exclude. For example, if you ask about training materials and one authority says they offer training, but don’t include the actual materials, do you include it in your figures? The more authorities you ask, the more ambiguities like this you’ll normally encounter.

Think about where and how you will log the data as it comes in. Maya recommended WhatDoTheyKnow Projects as a good tool for extracting data. Whatever you use, you should consider accessibility: can your platform be accessed by everyone you’re working with, across different communities? Especially if you are working with volunteers, it’s important to remember that not everyone has a laptop.

Also consider the security of the platform: how much this matters will depend on how sensitive the data is, but recognise that Google Sheets and many other platforms store the data in the cloud, where it could be more vulnerable to abuse.

After Exploitation take great pains to ensure that their data is accurate. They recommend that each response is assessed by two different people, making sure that everyone knows the criteria so they’re applied consistently; and doing regular spot checks on a handful of cases to make sure they are all logged in the same way and there’s no duplicate logging.

This is time-intensive and arduous, but if you have other stakeholders they might be able to help with the data checking: for example, knowing that they would eventually place the story with the BBC, After Exploitation were happy to hand this task over to the broadcaster’s in-house data checkers.
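
Some of this consistency checking can be semi-automated. The sketch below, with invented reference numbers and status labels, flags responses that two reviewers coded differently, plus duplicate entries in a combined log, so human effort can be focused where it matters.

```python
from collections import Counter

# Each reviewer's log: FOI request reference -> how they coded the response.
reviewer_a = {"req-001": "data supplied", "req-002": "refused s.12", "req-003": "data supplied"}
reviewer_b = {"req-001": "data supplied", "req-002": "data supplied", "req-003": "data supplied"}

# Flag responses the two reviewers coded differently, for a manual spot check.
disagreements = {
    ref: (code, reviewer_b.get(ref))
    for ref, code in reviewer_a.items()
    if reviewer_b.get(ref) != code
}
print("Needs a second look:", disagreements)

# Flag references logged more than once in the combined log.
combined_log = ["req-001", "req-002", "req-002", "req-003"]
duplicates = [ref for ref, n in Counter(combined_log).items() if n > 1]
print("Duplicate entries:", duplicates)
```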

What if things go wrong?

If you’ve done all the planning suggested above, it’s less likely that your project will go awry, but even if it does, Maya says that there’s always something you can do.

No or few responses: ask yourself whether you have the capacity to chase missing or late replies and, if you still get no response, to refer the authority to the ICO. If not, consider prioritising the bodies that are most relevant to your work, eg the biggest authorities or those in areas with the densest populations; but unless you cover them all, be prepared to defend against accusations that not every authority had a fair hearing.

If you know your requests were well worded but you’re not getting many responses — perhaps because you’re dealing with a contentious issue, or simply because the authorities are cash-strapped — you could shift to measuring the types of responses you get. If authorities aren’t able to answer the question, this can often be just as revealing.
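
Tallying response types needs nothing more elaborate than a counter. A minimal sketch, with invented status labels:

```python
from collections import Counter

# When substantive answers are thin on the ground, the distribution of
# response types can itself be the finding, e.g. how many authorities
# say they don't hold the data at all.
statuses = [
    "data supplied", "information not held", "refused: cost limit",
    "information not held", "no response", "refused: cost limit",
]
for status, count in Counter(statuses).most_common():
    print(f"{status}: {count}")
```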

Responses that don’t tell you what you set out to understand: consider whether there are any alternative angles in the data you do have: are there any additional themes, particularly in any free text fields? Or try a new round of requests asking for more detailed information.

Responses don’t cover the whole country: If you can’t get data from everywhere, could you narrow down to just one area and still have useful findings? Even the most basic data can set the scene for other researchers or organisations to build on: you can put it out and outline the limitations.

Results

The impact of gathering data through FOI can be massively powerful, as After Exploitation’s work shows. They have revealed the wrongful detention of thousands of potential victims of human trafficking when the government were denying it could happen; opened the debate about locking up vulnerable people; and uncovered the flawed decision making in the Home Office on modern slavery cases. It was only through FOI requests that all this information came into the public domain and was picked up by mainstream media.

Combining different sources of data to create datasets

Kay Achenbach is a data trainer on the Open Data Institute’s learning team; the ODI works with government and companies to create a world where data works for everyone.

Kay shared a case study from the medical field, in which an algorithm was being designed to quickly assess high numbers of chest x-rays. The aim was to automate the process so that people identified as needing intervention would be sent to specialists right away.

The developers wanted to make sure the algorithm wasn’t biased against particular demographic groups: a common issue with algorithms built on existing data, which can contain previously undetected biases.

The test material was a set of x-rays from a diverse population that had already been examined by specialists. They ran the x-rays through the algorithm to see whether the diagnoses it produced matched those made by the human doctors.

The doctors’ assessments came from three different datasets which, combined, comprised data from more than 700,000 real patients. As soon as you combine datasets from different sources, you are likely to come across discrepancies which can make analysis difficult.

In this case, one dataset had diagnoses for 14 different diseases and another had 15 — and of these, only eight overlapped. The only label that could reliably be compared across all the sources was “no finding”, applied when the patient is healthy. That limitation determined what the algorithm could be asked to do.
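
A quick way to spot this kind of limitation before committing to an analysis is to compute the overlap between the label sets. A minimal sketch, with invented disease labels:

```python
# Only labels present in every source can be compared directly;
# everything else must be dropped or handled separately.
dataset_1 = {"no finding", "pneumonia", "edema", "cardiomegaly", "fracture"}
dataset_2 = {"no finding", "pneumonia", "edema", "nodule", "mass"}

comparable = dataset_1 & dataset_2  # labels present in both
unshared = dataset_1 ^ dataset_2    # labels that cannot be compared
print("Safe to compare:", sorted(comparable))
print("Drop or handle separately:", sorted(unshared))
```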

Other fields were problematic in various ways: only one of the three sources contained data on ethnicity; one source only contained data on the sickest patients; another was from a hospital that only takes patients with diseases that they are studying, meaning there were zero “no finding” labels. Two of the sources contained no socio-economic data. Sex was self-reported in two of the sources, but assigned by clinicians in the other, which could also affect outcomes.

The advice from all this is to look carefully at each dataset before you combine them, and to consider what the combined result would actually represent. In short: does it reflect real life?

Ultimately the researchers found that the algorithm was reflecting existing biases: it was much more likely to under-diagnose patients from minority groups, making more mistakes with female patients, the under-20s, Black people, and those from low socio-economic groups. The bias was compounded for those in more than one of these groups.

Cleaning up datasets

Once you’ve obtained your datasets from different FOI requests, you’re highly likely to find mismatches in the data that can make comparisons difficult or even impossible — but cleaning up the data can help.

For example, in spreadsheets you might discover empty fields, text in a numbers column, rows shifted, dates written in a variety of formats, different wording for the same thing, columns without titles, typos and so on.
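
Many of these fixes can be scripted. Below is a minimal sketch using Python’s pandas library (version 2.0 or later for the mixed date parsing); the table, column names and category values are invented for illustration.

```python
import pandas as pd

# An invented FOI response table exhibiting the problems described above.
df = pd.DataFrame({
    "authority": ["Council A", "Council B", "Council C", "Council C"],
    "date": ["2023-01-05", "05/01/2023", "5 Jan 2023", "5 Jan 2023"],
    "cases": ["12", "n/a", "7", "7"],
    "outcome": ["Refused", "refused ", "REFUSED", "REFUSED"],
})

df["date"] = pd.to_datetime(df["date"], format="mixed", dayfirst=True)  # unify date formats
df["cases"] = pd.to_numeric(df["cases"], errors="coerce")  # non-numbers in a numeric column become NaN
df["outcome"] = df["outcome"].str.strip().str.lower()      # same wording for the same thing
df = df.drop_duplicates()                                  # remove rows logged twice

print(df)
```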

Kay introduced OpenRefine, a free tool originally developed at Google, which will solve many of the issues of messy data, and pointed out that the ODI has a free tutorial on how to use it, which you can find here.