Friday 11 a.m.–11:15 a.m.

Big data in little laptop: a streaming story in Python (featuring toolz)

Juan Nunez-Iglesias

Python contains nice primitives for streaming data processing. @mrocklin's Toolz library extends them to enable gorgeous, concise, memory-frugal code. I'll present an intuitive approach to streaming analysis in Python, starting with a "hello world"-level example, moving through image correction and streaming sklearn classifiers, and finally analysing a complete genome in a few minutes.


In my brief experience people rarely take this [streaming] route. They use single-threaded in-memory Python until it breaks, and then seek out Big Data Infrastructure like Hadoop/Spark at relatively high productivity overhead. — Matt Rocklin

That quote succinctly summarises my computational life, right up until recent months. I doubt I'm alone, so with this talk I hope to convert a few more SciPythonistas to the Streaming, Out-of-core Way.

In traditional programming models, you pass a function some data, the function processes the data, and then it returns the result. But in streaming programs, a function processes some of the data, yields the processed chunk, then downstream functions deal with that chunk, then the original function receives a bit more, and so on... All these things are going on at the same time! How can one keep them straight?
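The interleaving described above can be sketched with two chained generators (the function names here are illustrative, not from the talk):

```python
def double(numbers):
    # Process one item at a time and hand it downstream immediately.
    for n in numbers:
        yield n * 2

def running_total(numbers):
    # Consume the upstream generator lazily, one chunk at a time.
    total = 0
    for n in numbers:
        total += n
        yield total

# Nothing is computed until the stream is consumed; each input item
# flows through both functions before the next one is read.
stream = running_total(double(iter([1, 2, 3])))
print(list(stream))  # [2, 6, 12]
```

Neither function ever holds the whole dataset: `double` and `running_total` each touch a single element at a time, which is exactly what makes this pattern memory-frugal.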

For many years, I didn't. I mostly avoided Python's iterators and generators in favour of numpy arrays and pandas DataFrames. Whenever these structures got too big, I manually chunked them and distributed them on a compute cluster. But Matt Rocklin's blog posts on this topic opened my eyes to the utility and elegance of streaming data analysis and the need for libraries to support iterators as input.

The Python language contains nice primitives for streaming data processing, and these can be combined with Matt's Toolz library to generate gorgeous, concise code that is extremely memory-efficient.
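As a small taste of that style, here is a word-count pipeline using Toolz's real `pipe`, `frequencies`, and `concat`; the example data is my own, and the block falls back to tiny stdlib stand-ins so the sketch runs even where toolz is not installed:

```python
try:
    from toolz import pipe, frequencies, concat
except ImportError:
    # Minimal stdlib stand-ins with the same behaviour.
    from itertools import chain
    from collections import Counter

    def pipe(data, *funcs):
        for f in funcs:
            data = f(data)
        return data

    frequencies = Counter

    def concat(seqs):
        return chain.from_iterable(seqs)

lines = ["to be or not to be", "that is the question"]
counts = pipe(
    lines,
    lambda ls: (ln.split() for ln in ls),  # lazily tokenise each line
    concat,                                # flatten into one word stream
    frequencies,                           # count words in a single pass
)
print(counts["to"])  # 2
```

Because every stage is lazy, `lines` could just as well be a file object streaming gigabytes of text; only `frequencies` accumulates anything, and it holds one counter per distinct word rather than the whole input.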

I'll present streaming data analysis in Python from the ground up, starting with a "hello world"-level example, moving through image illumination correction and streaming versions of scikit-learn classifiers, and finally analysing a full human genome in a few minutes.