Artificial intelligence and machine learning seem to be everywhere at the moment – every day there’s a new story about the latest smart assistant, self-driving car or the impending take over of the world by robots. With FixMyStreet having recently reached one million reports, I started wondering what kind of fun things could be done with that dataset.
Inspired by a recent post that generated UK place names using a neural network, I thought I’d dip my toes in the deep learning sea and apply the same technique to FixMyStreet reports. Predictably enough the results are a bit weird.
I took the titles from all the public reports on fixmystreet.com as the training data, and left the training process to run overnight. The number crunching was pretty slow and the calculations had barely reached 5% in the morning. I suspect the training set was a bit too large, at over 1M entries, but end result still gives enough to work with.
The training process produces checkpoints along the way, which you can use to see how the learning is progressing. After 1000 iterations the model was starting to be aware that it should use words, but didn’t really know how to spell them:
Mertricolbes Ice does thrown campryings Sunky riking proper, badger verwappefing cars off uping is! Finst Knmp Lyghimes Jn fence Moadle bridge is one descemjop
After 15000 iterations it’s starting to get the hang of real words, though still struggling to form coherent sentences.
Untaxed cacistance. Broken Surface in ARRUIGARDUR. Widdy movering Cracked already nail some house height avenue. Light not worky I large pot hole Dumped shood road nod at street. Grim Dog man Ongorently obstructing sofas. This birgs. Serious Dirches
After 68000 iterations there seems to be enough confusion in the training data that things start to go south again with the default parameters:
Urgely councille at jnc swept arobley men. They whention to public bend to street? For traffic light not working
Tweaking the ‘temperature’ of the sampling process produces increasingly sensible results:
Large crumbling on pavement Potholes all overgrown for deep pothole Very van causing the road Very deep potholes on pavement Weeds on the pavement Several potholes in the road Rubbish Dumped on the road markings Potholes on three away surface blocking my peride garden of the pavement Potholes and rubbish bags on pavement Poor road sign damaged Poor street lights not working Dog mess in can on road bollard on pavement A large potholes and street light post in middle of road
As well as plenty of variations on the most popular titles:
Pot hole Pot hole on pavement Pot holes and pavement around Pot holes needings to path Pothole Pothole dark Pothole in road Pothole/Damaged to to weeks Potholes Potholes all overgrown for deep pothole Potholes in Cavation Close Potholes in lamp post Out Potholes in right stop lines sign Potholes on Knothendabout Street Light Street Lighting Street light Street light fence the entranch to Parver close Street light not working Street light not working develter Street light out opposite 82/00 Tood Street lights Street lights not working in manham wall post Street lights on path Street lights out
It also seems to do quite well at making up road names that don’t exist in any of the original reports (or in reality):
Street Light Out - 605 Ridington Road Signs left on qualing Road, Leave SE2234 4 Phiphest Park Road Hasnyleys Rd Apton flytipping on Willour Lane The road U6!
Here are a few of my favourites for their sheer absurdity:
Huge pothole signs Lack of rubbish Wheelie car Keep Potholes Mattress left on cars Ant flat in the middle of road Flytipping goon! Pothole on the trees Abandoned rubbish in lane approaching badger toward Way ockgatton trees Overgrown bush Is broken - life of the road. Poo car Road missing Missing dog fouling - under traffic lights
Aside from perhaps generating realistic-looking reports for demo/development sites I don’t know if this has any practical application for FixMyStreet, but it was fun to see what kind of thing is possible with not much work.
Help us innovate in Civic TechnologyDonate now
We’ve just released Alaveteli 0.26! Here are some of the highlights.
Request page design update
After some research in to where people enter the site we decided to revamp the request pages to give a better first impression.
We’ve used the “action bar” pattern from the authority pages to move the request actions to a neater drop-down menu. We’ve also promoted the “follow” button to help other types of users interact with the site.
Since lots of users are entering an Alaveteli on the request pages, it might not be obvious that they too can ask for information. We’ve now made an obvious link to the new request flow from the sidebar of the request pages to emphasise this.
The correspondence bubbles have had a bit of a makeover too. Its now a lot more obvious how to link to a particular piece of correspondence, and we’ve tidied the header so that its a little clearer who’s saying what.
The listing of similar requests in the request page sidebar has been improved after observing they were useful to users.
Also in design-world we’ve added the more modern request status icons, made the search interfaces more consistent and helped prevent blank searches on the “Find an authority” page.
Admin UI Improvements
As an Alaveteli grows it can get trickier to keep an eye on everything that’s happening on the site.
We’ve now added a new comments list so that admins can catch offensive or spam comments sooner.
For the same reasons, we’ve added sorting to the users list and made banned users more obvious.
The CSV import page layout and inline documentation has also been updated.
The new statistics page adds contributor leaderboards to help admins identify users as potential volunteers, as well as a graph showing when site admins hide things to improve the transparency of the site.
Extra search powers
Conversion tracking improvements
The full list of highlights and upgrade notes for this release is in the changelog.
Thanks again to everyone who’s contributed!
If you’ve used FixMyStreet recently — either to make a report, or as a member of a council who receives the reports — you might have noticed that the site’s automated emails are looking a lot more swish.
Where previously those emails were plain text, we’ve now upgraded to HTML, with all the design possibilities that this implies.
It’s all part of the improvements ushered in by FixMyStreet Version 2.0, which we listed, in full, in our recent blog post. If you’d like a little more technical detail about some of the thought and solutions that went into this switch to HTML, Matthew has obliged with in a blog post over on FixMyStreet.org.
mySociety’s EveryPolitician project aims to make data available on every politician in the world. It’s going well: we’re already sharing data on the politicians from nearly every country on the planet. That’s over 68,652 people and 2.9 million individual pieces of data, numbers which will be out of date almost as soon as you’ve read them. Naturally, the width and depth of that data varies from country to country, depending on the sources available — but that’s a topic for another blog post.
Today the EveryPolitician team would like to introduce you to its busiest member, who is blogging at EveryPolitician bot. A bot is an automated agent — a robot, no less, albeit one crafted entirely in software.
First, some background on why we need our little bot.
Because there’s so much to do
One of the obvious challenges of such a big mission is keeping on top of it all. We’re constantly adding and updating the data; it’s in no way a static dataset. Here are examples — by no means exhaustive — of circumstances that can lead to data changes:
- Legislatures change en masse, because of elections, etc.
We try to know when countries’ governments are due to change because that’s the kind of thing we’re interested in anyway (remember mySociety helps run websites for parliamentary monitoring organisations, such as Mzalendo in Kenya). But even anticipated changes are rarely straightforward, not least because there’s always a lag between a legislature changing and the data about its new members becoming available, especially from official national sources.
- Legislatures change en masse, unexpectedly
Not all sweeping changes are planned. There are coups and revolutions and other unscheduled or premature ends-of-term.
- Politicians retire
Or die, or change their names or titles, or switch party or faction.
- New parties emerge
Or the existing ones change their names, or form coalitions.
- Areas change
There are good reasons (better representation) and bad reasons (gerrymandering) why the areas in constituency-based systems often change. By way of a timely example, our UK readers probably know that the wards have changed for the forthcoming elections, and that mySociety built a handy tool that tells you what ward you’re in.
- Existing data gets refined
Played Gender Balance recently? Behind that is a dataset that keeps being updated (whenever there are new politicians) but which is itself a source of constantly-updating data for us.
- Someone in Russia updates the wikipedia page about a politician in Japan
Wikidata is the database underlying projects like Wikipedia, so by monitoring all the politicians we have that are also in there, we get a constant stream of updates. For example, within a few hours of someone adding it, we knew that the Russian transliteration of 安倍晋三’s name was Синдзо Абэ — that’s Shinzo Abe, in case you can’t read kanji or Cyrillic script. (If you’re wondering, whenever our sources conflict, we moderate in favour of local context.)
- New data sources become available
Our data comes from an ever-increasing number of sources, commonly more than one for any given legislature (the politicians’ twitter handles are often found in a different online place from their dates of birth, for example). We always welcome more contributions — if you think you’ve got new sources for the country you live in, please let us know.
- New old data becomes available
We collect historic data too — not just the politicians in the current term. For some countries we’ve already got data going back decades. Sources for data like this can sometimes be hard to find; slowly but surely new ones keeping turning up.
So, with all this sort of thing going on, it’s too much to expect a small team of humans to manage it all. Which is where our bot comes in.
We’ve automated many of our processes: scraping, collecting, checking changes, submitting them for inclusion — so the humans can concentrate on what they do best (which is understanding things, and making informed decisions). In technical terms, our bot handles most things in an event-driven way. It springs into action when triggered by a notification. Often that will be a webhook (for example, a scraper finishes getting data so it issues a webhook to let the bot know), although the bot also follows a schedule of regular tasks too. Computers are great for running repetitive tasks and making quantitative comparisons, and a lot of the work that needs to be done with our ever-changing data fits such a description.
The interconnectedness of all the different tasks the bot performs is complex. We originally thought we’d document that in one go — there’s a beautiful diagram waiting to be drawn, that’s for sure — but it soon became clear this was going to be a big job. Too big. Not only is the bot’s total activity complicated because there are a lot of interdependencies, but it’s always changing: the developers are frequently adding to the variety of tasks the bot is doing for us.
So in the end we realised we should just let the bot speak for itself, and describe task-by-task some of the things it does. Broken down like this it’s easier to follow.
We know not everybody will be interested, which is fine: the EveryPolitician data is useful for all sorts of people — journalists, researchers, parliamentary monitors, activists, parliamentarians themselves, and many more — and if you’re such a person you don’t need to know about how we’re making it happen. But if you’re technically-minded — and especially if you’re a developer who uses GitHub but hasn’t yet used the GitHub API as thoroughly as we’ve needed to, or are looking for ways to manage always-shifting data sets like ours — then we hope you’ll find what the bot says both informative and useful.
The bot is already a few days into blogging — its first post was “I am a busy bot”, but you can see all the others on its own Medium page. You can also follow it on Twitter as @everypolitician. Of course, its true home, where all the real work is done, is the everypoliticianbot account on GitHub.
Images: CC-BY-SA from the EveryPolitician bot’s very own scrapbook.
- Legislatures change en masse, because of elections, etc.
Last year, when we were helping to develop YourNextMP, the candidate-crowdsourcing platform for the General Election, we made what seemed like an obvious decision.
We decided to use PopIt as the site’s datastore — the place from which it could draw information about representatives: their names, positions, et cetera. We’d been developing PopIt as a solution for parliamentary monitoring sites, but we reckoned it would also be a good fit for YourNextMP.
That turned out to be the wrong choice.
YourNextMP was up and running in time for the election, but at the cost of many hours of intensive development as we tried to make PopIt do what was needed for the site.
Once you’ve got an established site in production, changing the database it uses isn’t something you do lightly. But on returning to the codebase to develop it for international reuse, we had to admit that, in the words of mySociety developer Mark Longair, PopIt was “actually causing more problems than it was solving”. It was time to unpick the code and take a different approach.
Mark explains just what it took to decide to change course in this way, over on his own blog.
The post contains quite a bit of technical detail, but it’s also an interesting read for anyone who’s interested in when, and why, it’s sometimes best to question the decisions you’ve made.
FixMyStreet has been around for nearly nine years, letting people report things and optionally include a photo; the upshot of which is we currently have a 143GB collection of photographs of potholes, graffiti, dog poo, and much more. 🙂
For almost all that time, attaching a photo has been through HTML’s standard file input form; it works, but that’s about all you can say for it – it’s quite ugly and unfriendly.
We have always wanted to improve this situation – we have a ticket in our ticketing system, Display thumbnail of photo before submitting it, that says it dates from 2012, and it was probably in our previous system even before that – but it never quite made it above other priorities, or when it was looked at, browser support just made it too tricky to consider.
Here’s a short animation of FixMyStreet’s new photo upload, which also allows you to upload multiple photos:
For the user, the only difference from the current interface is that the photo field has been moved higher up the form, so that photos can be uploading while you are filling out the rest of the form.
Personally, I think this benefit is the largest one, above the ability to add multiple photos at once, or the preview function. Some of our users are on slow connections – looking at the logs I see some uploads taking nearly a minute – so being able to put that process into the background hopefully speeds up the submission and makes the whole thing much nicer to use.
When creating a new report, it can sometimes happen that you fill in the form, include a photo, and submit, only for the server to reject your report for some reason not caught client-side. When that happens, the form needs to be shown again, with everything the user has already entered prefilled.
There are various reasons why this might happen; perhaps your browser doesn’t support the HTML5 required attribute (thanks Safari, though actually we do work around that); perhaps you’ve provided an incorrect password.
However, browsers don’t remember file inputs, and as we’ve seen, photo upload can take some time. From FixMyStreet’s beginnings, we recognised that re-uploading is a pain, so we’ve always had a mechanism whereby an uploaded photo would be stored server side, even if the form had errors, and only an ID for the photo was passed back to the browser so that the user could hopefully resubmit much more quickly.
This also helped with reports coming in from external sources like mobile phone apps or Flickr, which might come with a photo already attached but still need other information, such as location.
Of course there were edge cases and things to tidy up along the way, but if the form hadn’t taken into account the user experience of error edge cases from the start, or worse, had assumed all client checks were enough, then nine years down the line my job would have been a lot harder.
Anyway, long story short, adding photos to your FixMyStreet reports is now a smoother process, and you should try it out.
A few of mySociety’s developers are at DjangoCon Europe in Cardiff this week – do say hello 🙂 As a contribution to the conference, what follows is a technical look (with bunny GIFs) into an issue we had recently with serving large amounts of data in one of our Django-based projects, MapIt, how it was dealt with, and some ideas and suggestions for using streaming HTTP responses in your own projects.
MapIt is a Django application and project for mapping geographical points or postcodes to administrative areas, that can be used standalone or within a Django project. Our UK installation powers many of our own and others’ projects; Global MapIt is an installation of the software that uses all the administrative and political boundaries from OpenStreetMap.
A few months ago, one of our servers fell over, due to running entirely out of memory.
Looking into what had caused this, it was a request for
/areas/O08, information on every “level 8” boundary in Global MapIt. This turned out to be just under 200,000 rows from one table of the database, along with associated data in other tables. Most uses of Global MapIt are for point lookups, returning only the few areas covering a particular latitude and longitude; it was rare for someone to ask for all the areas, but previously MapIt must have managed to respond within the server’s resources (indeed, the HTML version of that page had been requested okay earlier that day, though had taken a long time to generate).
resourcemodule, I manually ran through the steps of this particular view, running
print resource.getrusage(resource.RUSAGE_SELF).ru_maxrss / 1024after each step to see how much memory was being used. Starting off with only 50Mb, it ended up using 1875Mb (500Mb fetching and creating a lookup of associated identifiers for each area, 675Mb attaching those identifiers to their areas (this runs the query that fetches all the areas), 400Mb creating a dictionary of the areas for output, and 250Mb dumping the dictionary as JSON).
The associated identifiers were added in Python code because doing the join in the database (with e.g.
select_related) was far too slow, but I clearly needed a way to make this request using less memory. There’s no reason why this request should not be able to work, but it shouldn’t be loading everything into memory, only to then output it all to the client asking for it. We want to stream the data from the database to the client as JSON as it arrives; we want in some way to use Django’s StreamingHTTPResponse.
The first straightforward step was to sort the areas list in the database, not in code, as doing it in code meant all the results needed to be loaded into memory first. I then tweaked our JSONP middleware so that it could cope when given a StreamingHTTPResponse as well as an HTTPResponse. The next step was to use the json module’s
iterencodefunction to have it output a generator of the JSON data, rather than one giant dump of the encoded data. We’re still supporting Django 1.4 until it end-of-lifes, so I included workarounds in this for the possibility of StreamingHTTPResponse not being available (though then if you’re running an installation with lots of areas, you may be in trouble!).
But having a StreamingHTTPResponse is not enough if something in the process consumes the generator, and as we’re outputting a dictionary, when I pass that dictionary to the json’s
iterencode, it will suck everything into memory upon creation, only then iterating for the output – not much use! I need a way to have it be able to iterate over a dictionary…
The solution was to invent the iterdict, which is a subclass of dict that isn’t actually a dict, but only puts an iterable (of key/value tuples) on items and iteritems. This tricks python’s JSON module into being able to iterate over such a “dictionary”, producing dictionary output but not requiring the dict to be created in memory; just what we want.
I then made sure that the whole request workflow was lazy and evaluated nothing until it would reach the end of the chain and be streamed to the client. I also stored the associated identifiers on the area directly in another iterator, not via an intermediary of (in the end) unneeded objects that just take up more memory.
I could now look at the new memory usage. Starting at 50Mb again, it added 140Mb attaching the associated codes to the areas, and actually streaming the output took about 25Mb. That was it 🙂 Whilst it took a while to start returning data, it also let the data stream to the client when the database was ready, rather than wait for all the data to be returned to Django first.
But I was not done. Doing the above then revealed a couple of bugs in Django itself. We have GZip middleware switched on, and it turned out that if your StreamingHTTPResponse contained any Unicode data, it would not work with any middleware that set Content-Encoding, such as GZip. I submitted a bug report and patch to Django, and my fix was incorporated into Django 1.8. A workaround in earlier Django versions is to run your iterator through
map(smart_bytes, content)before it is output (that’s six’s iterator version of map, for Python 2/3 compatibility).
Now GZip responses were working, I saw that the size of these responses was actually larger than not having the GZip middleware switched on?! I tracked this down to the constant flushing the middleware was doing, again submitted a bug report and patch to Django, which also made it into 1.8. The earlier version workaround is to have a patched local copy of the middleware.
Lastly, in all the above, I’ve ignored the HTML version of our JSON output. This contains just as many rows, is just as big an output, and could just as easily cripple our server. But sadly, Django templates do not act as generators, they read in all the data for output. So what MapIt does here is a bit of a hack – it has in its main template a “!!!DATA!!!” placeholder, and creates an iterator out of the template before/after that placeholder, and one compiled template for each row of the results.
Now Django 1.8 is out, the alternate Jinja2 templating system supports a
generate()function to render a template iteratively, which would be a cleaner way of dealing with the issue (though the templates would need to be translated to Jinja2, of course, and it would be more awkward to support less than 1.8). Alternatively, creating a generator version of Django’s Template.render() is Django ticket #13910, and it might be interesting to work on that at the Django sprint later this week.
Using a StreamingHTTPResponse is an easy way to output large amounts of data with Django, without taking up lots of memory, though I found it does involve a slightly different style of programming thinking. Make sure you have plenty of tests, as ever 🙂 Streaming JSON was mostly straightforward, though needed some creative encouragement when wanting to output a dictionary; if you’re after HTML streaming and are using Django 1.8, you may want to investigate Jinja2 templates now that they’re directly supported.
[ I apologise in the above for every mistaken use of generator instead of iterator, or vice-versa; at least the code runs okay 🙂 ]
In this post, we want to explain a bit more about why we spent time and effort on making them, when normally we advocate for mobile websites.
Plus, for our technical audience, we’ll explain some of the tools we used in the build.
The why bit
When we redesigned FixMyStreet last year, one of the goals was to provide a first class experience for people using the website on their mobile phone browsers. We’re pretty happy with the result.
We’re also believers in only building an app if it offers you something a mobile website can’t. There are lots of reasons for that: for a start, apps have to be installed, adding a hurdle at which people may just give up. They’re time-consuming to build, and you probably can’t cater for every type of phone out there.
Given that, why did we spend time on making a mobile app? Well, firstly, potholes aren’t always in places with a good mobile signal, so we wanted to provide a way to start reporting problems offline and then complete them later.
Secondly, when you’re on the move you often get interrupted, so you might well start reporting a problem and then become distracted. When that happens, reports in browsers may get lost, so we wanted an app that would save it as you went along, and allow you to come back to it later.
And the how bit (with added technical details, for the interested)
Having decided to build an app the next question is how to build it. The main decision is whether to use the native tools for each phone operating system, or to use one of the various toolkits that allow you to re-use code across multiple operating systems.
We quickly decided on using Phonegap, a cross platform toolkit, for several reasons: we’d already used Phonegap successfully in the past to build an app for Channel 4’s Great British Property Scandal (that won an award, so clearly something went right) and for Züri Wie Neu in Zurich, so it was an easy decision to use it again.
We decided to focus on apps for Android and iOS, as these are the two most popular operating systems. Even with this limitation, there is a lot of variety in the size and capability of devices that could be running the app – think iPads and tablets – but we decided to focus primarily on providing a good experience for people using phone-sized devices. This decision was partly informed by the resources we have at hand, but the main decider was that we mostly expect the app to be used on phones.
There was one big challenge: the functionality that allows you to take photos in-app. We just couldn’t get it to work with older versions of Android – and it’s still not really adequate. We just hope most people are updating their operating systems! Later versions of Android (and iOS) were considerably less frustrating, and perhaps an earlier decision to focus on these first would have led to a shorter development process.
On balance though? We’d still advocate a mobile-optimised browser site almost every time. But sometimes circumstances dictate – like they did for FixMyStreet – that you really need an app.
We’d give you the same advice, too, if you asked us. And we’d happily build you an app, or a mobile-friendly site, whichever was more suitable.
What are your plans for late April? If you’re a civic coder, a campaigner or activist from anywhere in the world, hold everything: we want to see you in Santiago, Chile, for the first international PoplusCon.
Poplus is a project which aims to bring together those working in the digital democracy arena – groups or individuals – so that we can share our code and thus operate more efficiently.
We’re right at the beginning of what we hope will grow into a worldwide initiative. If you’d like to get involved, now is the time.
Together with Poplus’ co-founders, Ciudadano Inteligente, we will be running a two-day conference in Santiago on the 29th and 30th of April. It is free to attend, and we can even provide travel grants for those who qualify.
Note (June 2016): This post is now slightly out of date. FixMyTransport is no longer running, though all of the other APIs and tools listed are still available.
There is also one significant addition which developers should find useful: EveryPolitician, which provides data on all current politicians around the world (and, in the future, we hope, all past ones too). See more here.
Much of what we do here at mySociety relies on Open Data, so naturally we support Open Data Day. In case you haven’t come across this event before, here’s the low-down:
Open Data Day is a gathering of citizens in cities around the world to write applications, liberate data, create visualizations and publish analyses using open public data to show support for and encourage the adoption open data policies by the world’s local, regional and national governments.
If you’re planning on being a part of Open Data Day, you may find some of mySociety’s feeds, tools and APIs useful. This post attempts to put them all in one place. (more…)