Here is a diagram of how the backend of Mapumental works. Take it in the spirit that Chris Lightfoot set when he made a similar diagram for the No. 10 petitions site – although many such diagrams are useless, hopefully this one contains useful information.
(Click on the diagram for a large version)
Below, I’ve explained what the main components are, and some interesting things about them.
Everything can, at least in theory, run on lots of servers. Currently we are only actually using one server for web requests, because of problems with HAProxy. We’re runnning isodaemons on two different servers.
Basic web application – it started out as raw Python, but the more Matthew hacks on it the more Django libraries he pulls in. Soon it’ll be indistinguishable from a Django app. When someone enters a new postcode, it adds it to the work queue in the PostgreSQL database, then refreshes waiting for the job to be finished. Then it displays the flash application (made by Stamen), set up to load the appropriate tile layers.
Tile server and cache – This uses the Python-based TileCache, calling Geospatial Data Abstraction Library (GDAL) to help render the tiles from points. It was originally written by Stamen, and expanded by mySociety. GDAL isn’t perfect, it doesn’t have fancy enough algorithms for my liking. e.g. Using a median rather than a weighted mean.
Isodaemons – These are controlled by a Python script, but the bulk of the code is custom written in C++. Slightly crazily, this can find the quickest route by public transport for each of 300,000 journeys from every station in the UK to a particular station, arriving at a particular time, in 10 to 30 seconds.
I had no idea how to do this, but luckily I live in Cambridge, UK. It’s a city fit to bursting with computer scientists. Many of the jobs are dull, and need little computing, never mind science – like writing interface layers for SQL server. So if you have a real interesting problem it’s easy to get help!
The universal advice was to use Dijkstra’s algorithm, which needed a bit of adaptation to work efficiently over space-time, rather than just space. Normally it is used for planning routes round a map, but public transport isn’t like that, you have to arrive in time for each particular train, so time affects what journeys you can take.
I originally wrote it in Python, which was not only too slow, but used up far far too much RAM. It could never have loaded the whole dataset in. However, the old Python code is still run by the test script, to double check the C++ code against. It is also still used to make the binary timetable files, see below.
Travel times, 1 binary file / postcode – I briefly attempted to insert 300,000 rows into PostgreSQL for each postcode looked up, but it was obvious it wasn’t going to scale. Going back to basics, it now just saves the time taken to travel to each station in a simple binary file – two bytes for each station, 600k in total. The tile server then does random access lookups into that file, as it renders each tile. It only needs to look up the values for the stations it knows are on/near the tile.
There’s various other bits:
- cron jobs for sending out invites
- converting timetable data from ATCO-CIF to the binary format
- loading static layer data into the database
- precaching every tile for static datasets
- Squid and Apache and FastCGI both sit in front of the web applications
- for speed, we cache the mapping background tiles from Cloudmade
- when zoomed out, there is code to cull which stations are used to draw tiles
- of course, a bunch of test code
Thanks to everyone who helped make Mapumental, we couldn’t have done it without lots of clever people.
I realise the above is a sketchy overview, so please ask questions in the comments, and I’ll do my best to answer them.
Very interesting. I have a couple of questions. How big is a tileset for a single postcode? Do you build all the tiles in a tileset up front or do you build them as they are requested? And what’s your policy for deleting old tilesets and travel times? (Or are you yet to run out of disk space?)
What tool do you use to create the “how mapumental works” diagram?
Gareth:
The public transport tiles for a particular postcode / time are only generated for the places people have browsed. They are then cached in the “Cached tiles” blue circle on the diagram.
Looking at the caches, most are about 2Mb to 8Mb in size, although some are much larger if the person scrolled about a lot.
The housing and scenicness tiles we’ve precached for most zoom levels for the whole of the UK. They come out at 2.3G for housing, and
We haven’t yet run out of disk space – there is currently 8 terabytes free on it, so I’m not worried yet!
I suspect the storage we use will change, as we make it scale onto larger numbers of machines. I’m not convinced NFS will perform.
David – I used Open Office Draw, in its standalone version oodraw. It turns out it actually has connectors which clip onto objects, a facility I couldn’t find in Inscape, but it’s likely I was just being dumb.
Interesting… care to elaborate on the problems you’re having with HAProxy? Some session affinity issue?
It would often drop one server, and then not start using it again in a mysterious way that I couldn’t track down. I also found it surprising that it didn’t have a dynamic load balancer, which would adjust which server it sent requests to based on how loaded they were.
I’m planning on trying both nginx and Amazon’s Elastic Load Balancer to see if they do better.
and while we are here, when are you going to send out more invites to use the beta 🙂
Oh yes, sorry! I’ve just sent out another 100 invites.
We’re currently working on changes to make it scale well so it can come out of beta, but it’ll be a while before its ready I’m afraid.
We’ll use the invites to help test as we change things for that.
Hi Francis,
This is really beautiful and useful mapping! Nice one.
One question/suggestion – how about applying this to cycling? The (also Cambridge-based) guys over at http://www.CycleStreets.net are doing a fine job on using the OpenCycleMap network to apply routing algorithms, and there are an awful lot of people in the country who’d love a “find me a place to live where I could cycle to work” – especially if (as cyclestreets do) it’s not distance based but based on quietness and the national cycle network.
I’d be interested in lending a hand if you’re keen on that…
Cheers
Simon
Yup, I know the cyclestreets people! As well as cycling, car travel would be useful for lots of people too (maybe we can reduce car commute time).
We use a special route finding algorithm to generate all the routes, but probably not that bad to code on top of the cyclestreets systems. We’re generalising everything so it is in separate components so will be easier to add soon.
Do you know the cyclestreets codebase, and have access to their data? If you can get that now, then will be in a good position to integrate with mapumental later.
Francis,
We do have a currently-undocumented XML interface to the planner which could be used, but the number of queries you would be sending would collapse our single server 🙂 It’s on our to-do and strategy lists to make a formal API.
We plan to open-source the CycleStreets code in due course (there’s a lot of refactoring still going on, and audits needed first). Once that’s done I imagine you should then be able to install a local instance and fire away your own queries.
However, I suspect it will still be too slow though, because the engine is necessarily optimised for detailed street results rather than an overall rough time/distance result. I do wonder whether a more general-purpose road router would approximate sufficiently though.
It’s definitely an area which would be interesting to collaborate on though. (Time in particular is the main issue at present!)
Hi Francis,
how do you plan to keep information updated? For example I imagine with additional funding, some places have the potential to become much more picturesque, others to go downhill. It’s an issue I’m grappling with myself at th emoment because I’m hoping to create a visual way to understand services for parents and disabled people by location, however my struggle at the moment is how to collect, but also maintain this information which seems to change frequently!
Celine – we’re currently reworking most of the code, splitting Mapumental into three separate components.
One of those is a contour rendering tile layer which will have the facility to manage many such layers. So for a new version of data, we can make a new layer with a new date, then flip the application to use it.
So yes, you have to plan for updates to the original data!
Hi Francis et al,
Mapumental really does rock and I can’t wait for it to come out of beta (and perhaps have more functionality such as dual centres, more specific house price data etc).
Any idea when that will be?
Thanks very much
Tony
A friend of mine has been relocated to an office in Central London as the company he works centralised operations. Clearly the highly paid individual who took the decision had little regard for the cost of transport, living etc. As a result my friend is looking for a new job not involve a lengthy commute into London. It would be nice to have a tool companies could use to match annual salary to affordable places to live, or to areas in the UK with the right skills match (Leeds and Chester both have lots of banking back office functions and are easy commutes to London for meetings etc).