Here is a diagram of how the backend of Mapumental works. Take it in the spirit that Chris Lightfoot set when he made a similar diagram for the No. 10 petitions site – although many such diagrams are useless, hopefully this one contains useful information.
(Click on the diagram for a large version)
Below, I’ve explained what the main components are, and some interesting things about them.
Everything can, at least in theory, run on lots of servers. Currently we are only actually using one server for web requests, because of problems with HAProxy. We’re runnning isodaemons on two different servers.
Basic web application – it started out as raw Python, but the more Matthew hacks on it the more Django libraries he pulls in. Soon it’ll be indistinguishable from a Django app. When someone enters a new postcode, it adds it to the work queue in the PostgreSQL database, then refreshes waiting for the job to be finished. Then it displays the flash application (made by Stamen), set up to load the appropriate tile layers.
Tile server and cache – This uses the Python-based TileCache, calling Geospatial Data Abstraction Library (GDAL) to help render the tiles from points. It was originally written by Stamen, and expanded by mySociety. GDAL isn’t perfect, it doesn’t have fancy enough algorithms for my liking. e.g. Using a median rather than a weighted mean.
Isodaemons – These are controlled by a Python script, but the bulk of the code is custom written in C++. Slightly crazily, this can find the quickest route by public transport for each of 300,000 journeys from every station in the UK to a particular station, arriving at a particular time, in 10 to 30 seconds.
I had no idea how to do this, but luckily I live in Cambridge, UK. It’s a city fit to bursting with computer scientists. Many of the jobs are dull, and need little computing, never mind science – like writing interface layers for SQL server. So if you have a real interesting problem it’s easy to get help!
The universal advice was to use Dijkstra’s algorithm, which needed a bit of adaptation to work efficiently over space-time, rather than just space. Normally it is used for planning routes round a map, but public transport isn’t like that, you have to arrive in time for each particular train, so time affects what journeys you can take.
I originally wrote it in Python, which was not only too slow, but used up far far too much RAM. It could never have loaded the whole dataset in. However, the old Python code is still run by the test script, to double check the C++ code against. It is also still used to make the binary timetable files, see below.
Travel times, 1 binary file / postcode – I briefly attempted to insert 300,000 rows into PostgreSQL for each postcode looked up, but it was obvious it wasn’t going to scale. Going back to basics, it now just saves the time taken to travel to each station in a simple binary file – two bytes for each station, 600k in total. The tile server then does random access lookups into that file, as it renders each tile. It only needs to look up the values for the stations it knows are on/near the tile.
There’s various other bits:
- cron jobs for sending out invites
- converting timetable data from ATCO-CIF to the binary format
- loading static layer data into the database
- precaching every tile for static datasets
- Squid and Apache and FastCGI both sit in front of the web applications
- for speed, we cache the mapping background tiles from Cloudmade
- when zoomed out, there is code to cull which stations are used to draw tiles
- of course, a bunch of test code
Thanks to everyone who helped make Mapumental, we couldn’t have done it without lots of clever people.
I realise the above is a sketchy overview, so please ask questions in the comments, and I’ll do my best to answer them.
mySociety would never have been able to make Mapumental in the way we did if it wasn’t for the help of San Franciso-based geovisualisation gurus Stamen. They came up with the brilliant idea of sliders instead of static contours lines, they built the flash front end, and, crucially, they helped make sure all the contours had just the right degree of splodginess for a satisfyingly splodgy user experience.
We’ve been hinting for a while about a secret project that we’re working on, and today I’m pleased to be able to take the wraps off Mapumental. It’s currently in Private Beta but invites are starting to flow out.
Built with support from Channel 4’s 4IP programme, Mapumental is the culmination of an ambition mySociety has had for some time – to take the nation’s bus, train, tram, tube and boat timetables and turn them into a service that does vastly more than imagined by traditional journey planners.
In its first iteration it’s specially tuned to help you work out where else you might live if you want an easy commute to work.
Francis Irving, the genius who made it all work, will post on the immense technical challenge overcome, soon. My thanks go massively to him; to Stamen, for their lovely UI, and to Matthew, for being brilliant as always.
Words don’t really do Mapumental justice, so please just watch the video 🙂 Update: Now available here in HD too
Also new: We’ve just set up a TheyWorkForYou Patrons pledge to help support the growth and improvement of that site. I can neither confirm nor deny that pledgees might get invites more quickly than otherwise 😉
See also: the main travel-time maps report.
Our newly released travel time maps are currently shooting round the internet. It was great fun making them, and you might like to have a go too – there are plenty of public datasets you could overlay on the same base maps, using the same flash app (source code). There are a few notes about how we made them on the page itself, and the associated real time page. For a far more interesting view of the development process, read Tom Carden from Stamen’s account.
The most interesting blog post I’ve seen to come from this is Whitehall staff have no life by Simon Dickson, who was inspired by the maps to think about the destruction of social capital caused by commuting. “Whitehall staff on all but the highest salaries can’t expect to live anywhere near their work, and hence can’t expect to have any kind of a social (capital) life.”
You may remember that back in 2006 mySociety published some maps showing how long it took to commute places via public transport.
We’ve just made some more which have some lovely new features we reckon you’ll probably like a lot.
If you’d like to see more maps like this in your area, please ask your local transport authority to get in touch with us, or nudge these people 🙂
PS As always, Francis Irving remains a genius.
The next mySociety Disruptive Tech Talk is a week today at 7.30pm at the London Knowledge Lab on Emerald St.
This time we have Steve Coast, founder of Open Street Map. When Open Street Map started a few years ago, I thought it would never take off. Earlier this year I accidentally went to their conference in Manchester, and was blown away. There’s a whole community of active people, collaboratively building a vector map of not just the whole country, but the whole world. And it is very usable now – for example, my home town of Cambridge is extremely high quality.
If you’re interested in mapping, or in how to organise communities that disrupt with technology, then come along. But please sign up as the last event was full to capacity! It’s free.
So, we’ve been a bit quiet on this blog, but naturally busy. I just did my invoice and timesheet for last month, and remembered how bitty it has been. In one day I often do things to 3 websites, and that is just CVS commit messages – no doubt I handled emails for more. This makes it quite hard to summarise what has been happening, and also quite hard to measure how much time we spend maintaining each website.
We’ve recently made a London version of PledgeBank, which I’ll remind Tom to explain about on the main news blog. It is a PledgeBank “microsite”, with a special query for the front page and all pledges page that shows only pledges in Greater London. Which is conveniently almost exactly a circle radius 25km with centre at 51.5N -0.1166667E. I worked that out by dividing the area (found on the Greater London Wikipedia page) by pi and taking the square root And rounding up a bit.
Yesterday we launched a new call for proposals – head on over, and tell us your ideas for new civic websites. It is another WordPress modification, but this time to the very blog that you’re reading now. The form for submitting proposals I made anew, It creates a new WordPress low-privileged user by directly inserting into the database, and then calls the function wp_insert_post to create a post by them in a special category. The rest of the blogging software then trivially does comments, RSS, search, email alerts and archiving.
Meanwhile, Chris has written some monitoring software for our servers, to alert us of problems and potential problems. Perl modules do the tests, things like enough disk space and that web servers that are up. I’ve been tweaking it a bit, for example adding a test to watch for long-running PostgreSQL queries which indicate a deadlock. We’ve got a problem in the PledgeBank SMS code which causes deadlocks sometimes, which we’re still debugging.
So, a silly post for today: Postcodeine. This is a British version of Ben Fry’s zipdecode, a “tool” for visualising the distribution of zipcodes in the United States. This is, as has been pointed out to me, wholly pointless, but it’s quite fun and writing it was an interesting exercise (it also taught me a little bit about AJAX, the web’s technology trend du jour). If you want the source code, it’s at the foot here; licence is the Affero GPL, as for all the other mySociety code.
(I should say, by the way, that I wrote this in my copious spare time. It’s copyright mySociety because I don’t have the right to use the postcode database myself.)
… or, “how near is ‘nearby’?”
On PledgeBank we offer search and local alert features which will tell users about pledges which have been set up near them, the idea being that if somebody’s organising a street party in the next street over, you might well want to hear about it, but if it’s somebody a thousand miles away, you probably don’t.
At the moment we do this by gathering location data from pledge creators (either using postcodes, or location names via Gaze), and comparing it to search / alert locations using a fixed distance threshold — presently 20km (or about 12 miles). This works moderately well, but leads to complaints from Londoners of the form “why have I been sent a pledge which is TEN MILES away from me?” — the point being that, within London, people’s idea of how far away “nearby” things is is quite different from that of people who live in the countryside — they mean one tube stop, or a few minutes’ walk, or whatever. If you live in the countryside, “nearby” might be the nearest village or even the nearest town.
So, ages ago we decided that the solution to this was to find some population density data and use it to produce an estimate for what is “nearby”, defined as, “the radius around a point which contains at least N people”. That should capture the difference between rural areas and small and large towns.
(In fairness, the London issue could be solved by having the code understand north vs south of the river as a special case, and never showing North-Londoners pledges in South London. But that’s just nasty.)
Unfortunately the better solution requires finding population density data for the whole world, which is troublesome. There seem to be two widely-used datasets with global coverage: NASA SEDAC’s Gridded Population of the World, and Oak Ridge National Laboratory’s Landscan database. GPW is built from census data and information about the boundaries of each administrative unit for which the census data is recorded, and Landscan improves on this by using remote-sensing data such as the distribution of night-time lights, transport networks and so forth.
(Why, you might wonder, is Oak Ridge National Laboratory interested in such a thing? It is, apparently, “for estimating ambient populations at risk” from natural disasters and whatnot. That’s very worthy, but I can’t help but wonder whether the original motivation for this sort of work may have been a touch more sinister. But what do I know?)
Anyway, licence terms seem to mean that we can use the GPW data and we can’t use the Landscan data, which is a pity, since the GPW data is only really very good in its coverage of rich western countries which produce very detailed census returns on, e.g., a per-municipality basis. Where census returns are only available on the level of regions, the results are less good. Anyway, subject to that caveat, it seems to solve the problem. Here’s a map showing a selection of points, and the circles around them which contain about 200,000 people (that seems to be about the right value for N):
The API to access this will go into the Gaze interface, but it’s not live yet. I’ll document the RESTful API when it is.
One last note, which might be of use to people working with the GPW data in the future. GPW is a cell-based grid: each cell is a region lying between two lines of longitude and two lines of latitude, and within each cell three variables are defined: the population in the cell, the population density of the cell, and the land area of the cell. (This is one of those rare exceptions described in to Alvy Ray Smith’s rant, A Pixel Is Not A Little Square….) But note that the land area is not the surface area of the cell, and the population density is not the population divided by the surface area of the cell!
This becomes important in the case of small islands; for instance (a case I hit debugging the code) the Scilly Isles. The quoted population density for the Scilly Isles is rather high: somewhere between 100 and 200 persons/km2, but when integrating the population density to find the total population in an area, this is absolutely not the right value to use: the proper value there is the total population of a cell, divided by its total surface area. The reason for that is that when sampling from the grid to find the value of the integrand (the population density) you don’t know, a priori, whether the point you’re sampling at has landed on land or non-land, but the quoted population density assumes that you are always asking about the land. When integrating, the total population of each cell should be “smeared out” over the whole area of the cell. If you don’t do this then you will get very considerable overestimates of the population in regions which contain islands.