Are you performing experiments? Then structure your data as late as possible.

Problem

Imagine you perform some experiments, be it lab experiments, business studies or algorithm benchmarking. You will probably manipulate a number of factors and record the outcome of the experiment for particular factor combinations.

So how would you store these results? My naive solution was to create a dict of a dict of a dict ... and so on, especially since Python offers a great support for dicts. And I would store the results in a pickle file.

However, I noticed that this is leads to very brittle structures. You often have to change factor names or combination types as you go. This leads a lot of Russian dolls of dicts that are incompatible with each other. And the most annoying thing is the churn in the data-producing code as well as in the data-postprocessing code (e.g. visualisation).

Solution

The solution is, in hindsight, really simple: Structure your data as late as possible. The data-producing code should record data in the Pandas' long format. Then you are free to change the factors and factor combinations as you wish, the data-producing code stays the same. You now only have to change the data-postprocessing code.

Talk overview

In this talk, I will show a simple example of algorithm benchmarking. I will demonstrate how rigid structures get repeatedly broken over time and how one can alleviate this problem by using Pandas' long format. I will also present a short overview of relevant Pandas functions.

References

In a way, my call for late structuring resonates with this blog post.