When working with data that you didn’t set out to gather, you have to be careful to think about what the data actually means, rather than what it seems to be saying. As an example, one of the “interesting” side effects of FixMyStreet is a database of places people have reported dog poop (or “dog fouling” as it tends to be called academically). We now have over 20,000 locations across the UK where nature’s call has been both heard and reported.
My first thought when learning about this data was “that’s a lot of dog poop!” but it turns out 20,000 dog poops is not a lot of dog poop at all. There are an estimated 8.5 million dogs in the UK. Assuming (on average) each one poops once a day, they’ll produce over 3.1 billion poops a year.
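That back-of-the-envelope sum is easy to check. Here is a minimal sketch using only the figures quoted above; the once-a-day rate is the same rough assumption, and the variable names are just for illustration.

```python
# Figures quoted in the post; one poop per dog per day is a rough assumption.
DOGS_IN_UK = 8_500_000
POOPS_PER_DOG_PER_DAY = 1
REPORTS = 20_000

poops_per_day = DOGS_IN_UK * POOPS_PER_DOG_PER_DAY
poops_per_year = poops_per_day * 365

# How all of our reports compare with a single day's estimated output.
reports_vs_one_day = REPORTS / poops_per_day

print(f"{poops_per_year:,} poops a year")
print(f"All reports equal {reports_vs_one_day:.2%} of one day's output")
```

On these numbers, every report we have ever received adds up to roughly a quarter of one percent of a single day's estimated output.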
So actually, 20,000 poops over nine years is nothing compared to the amount of pooping going on. But just because our data is a drop in the bucket doesn’t mean we can’t learn interesting things from it. The first question to ask is whether we have a representative sample of where all this dog fouling is going on. The answer, sadly, is no. But the reasons for that answer raise further questions – which is interesting!
When you map the location of dog poo complaints in England against the Index of Multiple Deprivation, you get this:
This tells us that reports about dog fouling are roughly parabolic – there are more in areas in the middle of the deprivation scale than in those that are either very deprived or very not.
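A tally like the one behind that chart takes only a few lines. The sketch below assumes each report has already been matched to an IMD decile (1 = most deprived, 10 = least); the `report_deciles` list is made-up illustrative data, not FixMyStreet data.

```python
from collections import Counter

# Hypothetical input: one IMD decile per report (1 = most deprived, 10 = least).
# These values are invented to show the shape of the calculation.
report_deciles = [1, 2, 2, 3, 3, 3, 4, 4, 4, 4, 5, 5, 5, 5, 5,
                  6, 6, 6, 6, 7, 7, 7, 8, 8, 9, 10]

counts = Counter(report_deciles)

# Reports per decile, in decile order, with zero for any empty decile.
reports_by_decile = [counts.get(d, 0) for d in range(1, 11)]

# The decile with the most reports.
peak_decile = max(range(1, 11), key=lambda d: counts.get(d, 0))

# Crude text histogram, one row per decile.
for decile, n in enumerate(reports_by_decile, start=1):
    print(f"{decile:>2} {'#' * n}")
```

With real data, a peak somewhere in the middle deciles rather than at either end is what a "roughly parabolic" pattern looks like in the counts.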
This is interesting because when Keep Britain Tidy actually went out into the world and checked (p. 14), they found this:
This graph tells a very different story, where dog fouling gets worse the more deprived the area. But why is this? And why doesn’t our data tell the same story?
One reason we would expect more dog poop in the most deprived areas is that the most deprived areas are more urban. Taking the same IMD deciles and using the ONS’s RUC categories to apply an eight-point ‘ruralness’ scale (where 1 is ‘Urban major conurbation’ and 8 is ‘Rural village and dispersed in a sparse setting’) lets us see the average ‘ruralness’ of each decile. While this shows that deprivation is spread across urban and rural areas, the most deprived areas tend to be more urban.
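The averaging step described above might look something like this. It is only a sketch: the `areas` rows are hypothetical (decile, ruralness) pairs invented for illustration, and it assumes each small area already carries an IMD decile and a RUC category converted to the 1 to 8 scale.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical rows: (IMD decile, ruralness score from 1 to 8) for each area.
# Invented values, chosen only to show the grouping and averaging.
areas = [
    (1, 1), (1, 2), (1, 1),
    (5, 3), (5, 5), (5, 4),
    (10, 4), (10, 6), (10, 5),
]

# Collect the ruralness scores for each decile.
scores_by_decile = defaultdict(list)
for decile, ruralness in areas:
    scores_by_decile[decile].append(ruralness)

# Average ruralness per decile, in decile order.
mean_ruralness = {d: mean(s) for d, s in sorted(scores_by_decile.items())}

for decile, score in mean_ruralness.items():
    print(f"decile {decile:>2}: average ruralness {score:.2f}")
```

A lower average means a more urban decile, so a low figure for decile 1 is what "the most deprived areas tend to be more urban" looks like in this form.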
As urban areas have fewer natural places to dispose of dog waste, and the most deprived areas are more urban, we would expect the most deprived areas to have more dog fouling. We also know that measures that contribute to IMD scores (such as crime levels) are related to trust and social cohesion in an area. When social cohesion is lower, we would expect more dog fouling because owners feel less observed and are less concerned with the opinion of neighbours. The real world increase reported by the Keep Britain Tidy survey supports these relationships.
The drop-off in our reported data compared to the real world can be explained by features of the general model for understanding FixMyStreet reports – some measures of deprivation are correlated with increased reports (because they relate to more problems) and others with decreased reports (because they hurt the ability or inclination of people to report). We would also expect areas with worse deprivation to have fewer reports because of disengagement with civic structures.
Quickly checking the English dog fouling data (so only 17,103 dog poops) against the same model confirms that significant relationships exist for the same deprivation indexes as in the global dataset. The largest effect size for a measure of deprivation is for health – as health deprivation in an area goes up, reports of dog fouling increase.
What this tells us is that our dog data (and probably our data more generally) is clipped in areas of the highest deprivation. We’re not getting as many reports as the physical survey would suggest and so our data has very real limits in identifying the areas worst affected by a problem.
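One rough way to see that clipping is to compare each area type's share of reports with its share of fouling found by a physical survey. The sketch below is not the model used for this analysis. Every number is invented, the three bands are a simplification of the deciles, and it assumes the bands are of equal size.

```python
# Hypothetical figures per deprivation band, made up for illustration only.
reports = {"most deprived": 300, "middle": 500, "least deprived": 200}
surveyed_sites_with_fouling = {"most deprived": 50, "middle": 30, "least deprived": 20}

total_reports = sum(reports.values())
total_fouled_sites = sum(surveyed_sites_with_fouling.values())

# Ratio of report share to survey share for each band.
# Below 1 means fewer reports than the survey would lead us to expect.
reporting_ratio = {}
for band in reports:
    report_share = reports[band] / total_reports
    survey_share = surveyed_sites_with_fouling[band] / total_fouled_sites
    reporting_ratio[band] = report_share / survey_share

for band, ratio in reporting_ratio.items():
    print(f"{band}: {ratio:.2f}")
```

In this toy example the most deprived band comes out well below 1, which is the kind of gap between reports and reality described above.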
This is a lesson in being careful about interpreting datasets you pick up off the ground – if you used this data to conclude the most deprived areas had a similar dog poop problem to the least deprived areas, you would be wrong. Because we have an independent source of the real world rate of problems, we can see there is a mismatch between the distribution of reports and reality. Using this independent data on ‘actual problems’ for one of our categories makes us more aware that there is negative pressure on reports in highly deprived areas.
If you’d like to learn more about the history of dealing with dog poo on the street (and who wouldn’t want to learn more about that!) – I’ve very generously gone into more detail here.
Index of Multiple Deprivation: an index that combines thirty-seven indicators from seven domains (income, health, crime, etc.) to provide a single figure for an area that is indicative of its level of deprivation relative to other areas.
On urban areas having fewer natural places to dispose of dog waste: this is relative. Rural areas still have problems with bagged dog poo (“the ghastly dog poo bauble” hanging from branches – as MP Anne Main put it). There is also a risk to the health of cows from dog fouling in farmland – so there are unique rural dog poo problems.
On crime, trust and social cohesion: Ross et al. found “People who report living in neighborhoods with high levels of crime, vandalism, graffiti, danger, noise, and drugs are more mistrusting. The sense of powerlessness, which is common in such neighborhoods, amplifies the effect of neighborhood disorder on mistrust.”
Header image: https://www.flickr.com/photos/scottlowe/3931408440/
I teach a data analysis module at The Open University. Would we be able to use this (and the FixMyStreet analysis) as an example of a data investigation? And would you be willing to share the technical details of how you did this (e.g. source code for the analysis)?
Hi Neil – will drop you an email.