News from the Blosc ecosystem: introducing Bloscpack

Valentin Haenel

Sat 24 11:50 a.m.–12:10 p.m. in Dupreel

Abstract

It has now been four years since Francesc Alted gave the keynote at EuroSciPy 2009, presenting the starving CPU problem and the Blosc compressor. Many things have happened since then: hardware has become faster, new use cases have emerged, and the Blosc ecosystem is beginning to flourish.

One of these new use cases is persisting Blosc-compressed data buffers to disk, either to save the compressed in-memory representation for future use or to accelerate I/O. In this talk I will focus on the Bloscpack project, which addresses exactly that use case. In particular, I will explain why the naive approach of just serializing Blosc-compressed buffers does not suffice and why I invented a new binary format, the Bloscpack format. I will also describe the features this binary format brings, such as the ability to decompress only the desired chunks, the inclusion of a metadata section in the file, and the checksumming features that can help to detect silent data corruption.
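
To make these three features concrete, here is a minimal sketch of a toy chunked container. It is not the Bloscpack format and does not use the Bloscpack API: the byte layout, the `TOYP` magic string, and the names `pack`, `read_metadata` and `unpack_chunk` are all hypothetical, and zlib from the Python standard library stands in for Blosc. It only illustrates how an offset table allows a single chunk to be decompressed, how a metadata section can travel with the data, and how a per-chunk checksum exposes corruption.

```python
import json
import struct
import zlib

# Hypothetical toy layout (NOT the real Bloscpack layout):
#   header  : magic, number of chunks, metadata length
#   metadata: UTF-8 encoded JSON
#   offsets : one absolute 64-bit offset per chunk
#   chunks  : [compressed length, adler32 checksum, compressed bytes] each
MAGIC = b"TOYP"
HEADER = struct.Struct("<4sII")
CHUNK_HEADER = struct.Struct("<II")


def pack(data, chunk_size, metadata):
    """Split data into chunks, compress each one, and build the container."""
    meta = json.dumps(metadata).encode("utf-8")
    records = []
    for start in range(0, len(data), chunk_size):
        comp = zlib.compress(data[start:start + chunk_size])
        # The checksum covers the compressed bytes of this chunk only.
        records.append(CHUNK_HEADER.pack(len(comp), zlib.adler32(comp)) + comp)
    # Absolute position of each chunk record, so a reader can seek to it.
    pos = HEADER.size + len(meta) + 8 * len(records)
    offsets = []
    for record in records:
        offsets.append(pos)
        pos += len(record)
    table = struct.pack("<%dQ" % len(offsets), *offsets)
    header = HEADER.pack(MAGIC, len(records), len(meta))
    return header + meta + table + b"".join(records)


def _read_header(blob):
    magic, n_chunks, meta_len = HEADER.unpack_from(blob, 0)
    if magic != MAGIC:
        raise ValueError("not a toy container")
    return n_chunks, meta_len


def read_metadata(blob):
    """Return the metadata without touching any compressed chunk."""
    _, meta_len = _read_header(blob)
    start = HEADER.size
    return json.loads(blob[start:start + meta_len].decode("utf-8"))


def unpack_chunk(blob, index):
    """Verify and decompress a single chunk, located via the offset table."""
    n_chunks, meta_len = _read_header(blob)
    if not 0 <= index < n_chunks:
        raise IndexError("chunk index out of range")
    table_start = HEADER.size + meta_len
    (offset,) = struct.unpack_from("<Q", blob, table_start + 8 * index)
    comp_len, checksum = CHUNK_HEADER.unpack_from(blob, offset)
    start = offset + CHUNK_HEADER.size
    comp = blob[start:start + comp_len]
    if zlib.adler32(comp) != checksum:
        raise ValueError("checksum mismatch in chunk %d" % index)
    return zlib.decompress(comp)
```

With a container built by `pack(data, 4096, {"dtype": "uint8"})`, calling `unpack_chunk(blob, 1)` decompresses only the second 4096-byte chunk, and a flipped byte inside a chunk raises `ValueError` instead of silently returning wrong data.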

Additionally, I will show how one can use Bloscpack to serialize NumPy arrays and compare this, in terms of both features and performance, with the well-known and established NPY/NPZ formats. Lastly, I will show some real-world applications: the role that Bloscpack plays as superchunk in Continuum's BLZ (Blaze) format and an application of Bloscpack to improve the persistent disk caching offered by the Joblib framework.