Are you performing experiments? Then structure your data as late as possible.


Imagine you perform some experiments, be it lab experiments, business studies or algorithm benchmarking. You will probably manipulate a number of factors and record the outcome of the experiment for particular factor combinations.

So how would you store these results? My naive solution was to create a dict of a dict of a dict ... and so on, especially since Python offers a great support for dicts. And I would store the results in a pickle file.

However, I noticed that this is leads to very brittle structures. You often have to change factor names or combination types as you go. This leads a lot of Russian dolls of dicts that are incompatible with each other. And the most annoying thing is the churn in the data-producing code as well as in the data-postprocessing code (e.g. visualisation).


The solution is, in hindsight, really simple: Structure your data as late as possible. The data-producing code should record data in the Pandas' long format. Then you are free to change the factors and factor combinations as you wish, the data-producing code stays the same. You now only have to change the data-postprocessing code.

Talk overview

In this talk, I will show a simple example of algorithm benchmarking. I will demonstrate how rigid structures get repeatedly broken over time and how one can alleviate this problem by using Pandas' long format. I will also present a short overview of relevant Pandas functions.


In a way, my call for late structuring resonates with this blog post.