A few of mySociety’s developers are at DjangoCon Europe in Cardiff this week – do say hello 🙂 As a contribution to the conference, what follows is a technical look (with bunny GIFs) into an issue we had recently with serving large amounts of data in one of our Django-based projects, MapIt, how it was dealt with, and some ideas and suggestions for using streaming HTTP responses in your own projects.
MapIt is a Django application and project for mapping geographical points or postcodes to administrative areas, that can be used standalone or within a Django project. Our UK installation powers many of our own and others’ projects; Global MapIt is an installation of the software that uses all the administrative and political boundaries from OpenStreetMap.
A few months ago, one of our servers fell over, due to running entirely out of memory.
Looking into what had caused this, I found it was a request for /areas/O08, information on every “level 8” boundary in Global MapIt. This turned out to be just under 200,000 rows from one table of the database, along with associated data in other tables. Most uses of Global MapIt are for point lookups, returning only the few areas covering a particular latitude and longitude; it was rare for someone to ask for all the areas, but previously MapIt must have managed to respond within the server’s resources (indeed, the HTML version of that page had been requested okay earlier that day, though it had taken a long time to generate).
Using Python’s resource module, I manually ran through the steps of this particular view, running print resource.getrusage(resource.RUSAGE_SELF).ru_maxrss / 1024 after each step to see how much memory was being used. Starting off with only 50MB, it ended up using 1875MB: 500MB fetching and creating a lookup of associated identifiers for each area, 675MB attaching those identifiers to their areas (this runs the query that fetches all the areas), 400MB creating a dictionary of the areas for output, and 250MB dumping the dictionary as JSON.
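A minimal sketch of that kind of step-by-step measurement, written for Python 3 (the line above uses the Python 2 print statement). The helper name peak_mb and the list-building workload are hypothetical stand-ins, and the division by 1024 assumes Linux, where ru_maxrss is reported in kilobytes (macOS reports bytes).

```python
import resource


def peak_mb():
    """Peak resident memory of this process so far, in MB (assumes Linux units)."""
    return resource.getrusage(resource.RUSAGE_SELF).ru_maxrss / 1024


before = peak_mb()
# Hypothetical stand-in for one step of the view: build 200,000 rows in memory.
rows = [{"id": i, "name": "area %d" % i} for i in range(200000)]
after = peak_mb()

print("peak before: %.0f MB, after: %.0f MB" % (before, after))
```

Because ru_maxrss is a high-water mark, it only ever grows, so the difference between two readings shows how much a step raised the peak rather than how much it currently holds.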
The associated identifiers were added in Python code because doing the join in the database (with e.g. select_related) was far too slow, but I clearly needed a way to make this request using less memory. There’s no reason why this request should not be able to work, but it shouldn’t be loading everything into memory, only to then output it all to the client asking for it. We want to stream the data from the database to the client as JSON as it arrives; we want in some way to use Django’s StreamingHttpResponse.
The first straightforward step was to sort the areas list in the database, not in code, as doing it in code meant all the results needed to be loaded into memory first. I then tweaked our JSONP middleware so that it could cope when given a StreamingHttpResponse as well as an HttpResponse. The next step was to use the iterencode method of the json module’s JSONEncoder to have it output a generator of the JSON data, rather than one giant dump of the encoded data. We’re still supporting Django 1.4 until it reaches end of life, so I included workarounds in this for the possibility of StreamingHttpResponse not being available (though then if you’re running an installation with lots of areas, you may be in trouble!).
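As a rough illustration of those two steps together, here is a sketch using only the standard library. The jsonp helper and the callback name handleAreas are hypothetical, and this is not MapIt’s actual middleware; it only shows that JSON chunks can be wrapped without first being joined into one string.

```python
import itertools
import json

data = {"areas": [{"id": i, "name": "Area %d" % i} for i in range(3)]}

# iterencode() yields the JSON text in small string chunks, not one big string.
chunks = json.JSONEncoder().iterencode(data)


def jsonp(callback, body):
    """Wrap an iterator of JSON chunks as JSONP without consuming it up front."""
    return itertools.chain([callback + "("], body, [")"])


wrapped = jsonp("handleAreas", chunks)
first = next(wrapped)      # the callback prefix arrives before any JSON is encoded
rest = "".join(wrapped)    # a real response would send these chunks one by one
```

In a Django view, an iterator like wrapped is the kind of thing that would be handed to a streaming response instead of a finished string.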
But having a StreamingHttpResponse is not enough if something in the process consumes the generator. As we’re outputting a dictionary, I have to build that dictionary before I can pass it to the json module’s iterencode, which sucks everything into memory upon creation, only then iterating for the output – not much use! I need a way to have it be able to iterate over a dictionary…
The solution was to invent the iterdict, which is a subclass of dict that isn’t actually a dict, but only puts an iterable (of key/value tuples) on items and iteritems. This tricks Python’s JSON module into being able to iterate over such a “dictionary”, producing dictionary output but not requiring the dict to be created in memory; just what we want.
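The idea can be sketched as below. This is a Python 3 reconstruction under assumptions, not MapIt’s actual class: the name IterDict is hypothetical, and it leans on a CPython implementation detail, namely that JSONEncoder.iterencode uses the pure-Python encoder, which checks isinstance(o, dict), tests the dict for truthiness, and then loops over o.items().

```python
import json


class IterDict(dict):
    """A dict subclass that stays empty and serves its pairs from an iterable."""

    def __init__(self, pairs):
        super().__init__()
        self._pairs = pairs

    def __bool__(self):
        # The pure-Python encoder emits '{}' immediately for a falsy dict.
        return True

    def items(self):
        return self._pairs


pulled = []


def area_pairs():
    """Hypothetical stand-in for rows arriving from the database."""
    for i in range(3):
        pulled.append(i)
        yield str(i), {"name": "Area %d" % i}


chunks = json.JSONEncoder().iterencode(IterDict(area_pairs()))
pulled_before = list(pulled)   # nothing has been read from the source yet
body = "".join(chunks)         # a real response would stream these chunks
```

Note that json.dumps may take the C encoder path, which is not guaranteed to call the overridden items(), so a trick like this is best kept to iterencode and covered by tests.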
I then made sure that the whole request workflow was lazy and evaluated nothing until it would reach the end of the chain and be streamed to the client. I also stored the associated identifiers on the area directly in another iterator, not via an intermediary of (in the end) unneeded objects that just take up more memory.
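To make “lazy all the way down” concrete, here is a small sketch with generators. The function names, the codes lookup and the log list are all hypothetical; the point is only that building the chain does no work, and each row is fetched when the consumer asks for it.

```python
log = []


def fetch_areas():
    """Hypothetical stand-in for a lazily evaluated database query."""
    for i in range(1, 4):
        log.append("fetched %d" % i)
        yield {"id": i}


def attach_codes(areas, codes_by_id):
    """Add the associated codes to each area as it passes through."""
    for area in areas:
        area["codes"] = codes_by_id.get(area["id"], {})
        yield area


codes = {1: {"iso": "AA"}, 3: {"iso": "CC"}}
pipeline = attach_codes(fetch_areas(), codes)

log_before = list(log)      # building the pipeline has fetched nothing
first = next(pipeline)      # pulling one item fetches exactly one row
log_after_one = list(log)
```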
I could now look at the new memory usage. Starting at 50MB again, it added 140MB attaching the associated codes to the areas, and actually streaming the output took about 25MB. That was it 🙂 Whilst it took a while to start returning data, it also let the data stream to the client when the database was ready, rather than waiting for all the data to be returned to Django first.
But I was not done. Doing the above then revealed a couple of bugs in Django itself. We have GZip middleware switched on, and it turned out that if your StreamingHttpResponse contained any Unicode data, it would not work with any middleware that set Content-Encoding, such as GZip. I submitted a bug report and patch to Django, and my fix was incorporated into Django 1.8. A workaround in earlier Django versions is to run your iterator through map(smart_bytes, content) before it is output (that’s six’s iterator version of map, for Python 2/3 compatibility).
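A sketch of the shape of that workaround, kept free of Django so it runs on its own. The to_bytes function here is a hypothetical stand-in for Django’s smart_bytes (from django.utils.encoding), and the example assumes Python 3, where the built-in map is already lazy.

```python
def to_bytes(chunk, encoding="utf-8"):
    """Hypothetical stand-in for smart_bytes: encode text, pass bytes through."""
    return chunk.encode(encoding) if isinstance(chunk, str) else chunk


content = iter(['{"name": "', "Cañas", '"}'])

# map() returns an iterator on Python 3, so chunks are encoded only as they are read.
encoded = map(to_bytes, content)
body = b"".join(encoded)
```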
Now that GZip responses were working, I saw that these responses were actually larger than with the GZip middleware switched off?! I tracked this down to the constant flushing the middleware was doing, and again submitted a bug report and patch to Django, which also made it into 1.8. The workaround for earlier versions is to have a patched local copy of the middleware.
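The general effect is easy to reproduce with plain zlib. This is an illustration only, not Django’s middleware code, and it assumes the flushing behaves like a sync flush after every chunk: each flush ends the current compressed block and adds a marker, so lots of tiny chunks cost far more than letting the compressor buffer.

```python
import zlib

# 1,000 small chunks, similar in spirit to one streamed row at a time.
chunks = [('{"id": %d}, ' % i).encode("ascii") for i in range(1000)]


def compress(chunks, flush_each):
    """Compress chunks as one stream, optionally sync-flushing after each one."""
    comp = zlib.compressobj()
    out = []
    for chunk in chunks:
        out.append(comp.compress(chunk))
        if flush_each:
            out.append(comp.flush(zlib.Z_SYNC_FLUSH))
    out.append(comp.flush())  # finish the stream
    return b"".join(out)


flushed = compress(chunks, flush_each=True)
buffered = compress(chunks, flush_each=False)

print("raw:", len(b"".join(chunks)))
print("flushed after every chunk:", len(flushed))
print("flushed only at the end:", len(buffered))
```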
Lastly, in all the above, I’ve ignored the HTML version of our JSON output. This contains just as many rows, is just as big an output, and could just as easily cripple our server. But sadly, Django templates do not act as generators; they read in all the data for output. So what MapIt does here is a bit of a hack – it has in its main template a “!!!DATA!!!” placeholder, and creates an iterator out of the template before/after that placeholder, and one compiled template for each row of the results.
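The placeholder trick can be sketched with the standard library alone. Here string.Template stands in for a compiled Django template, and PAGE, ROW and stream_page are hypothetical names; MapIt’s real code works with Django’s template engine rather than this.

```python
import html
from string import Template

# Page rendered once, with a marker where the (large) row data belongs.
PAGE = "<html><body><ul>\n!!!DATA!!!</ul></body></html>\n"
ROW = Template("<li>$name</li>\n")


def stream_page(rows):
    """Yield the page head, then one rendered row at a time, then the tail."""
    head, tail = PAGE.split("!!!DATA!!!")
    yield head
    for row in rows:
        yield ROW.substitute(name=html.escape(row["name"]))
    yield tail


output = "".join(stream_page(iter([{"name": "A & B"}, {"name": "C"}])))
```

Only one row is ever rendered at a time, so memory use stays flat however many rows the iterator produces.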
Now that Django 1.8 is out, the alternative Jinja2 templating system supports a generate() method to render a template iteratively, which would be a cleaner way of dealing with the issue (though the templates would need to be translated to Jinja2, of course, and it would be more awkward to support versions before 1.8). Alternatively, creating a generator version of Django’s Template.render() is Django ticket #13910, and it might be interesting to work on that at the Django sprint later this week.
Using a StreamingHttpResponse is an easy way to output large amounts of data with Django, without taking up lots of memory, though I found it does involve a slightly different way of thinking about your code. Make sure you have plenty of tests, as ever 🙂 Streaming JSON was mostly straightforward, though it needed some creative encouragement when wanting to output a dictionary; if you’re after HTML streaming and are using Django 1.8, you may want to investigate Jinja2 templates now that they’re directly supported.
[ I apologise in the above for every mistaken use of generator instead of iterator, or vice-versa; at least the code runs okay 🙂 ]