Imagine you perform some experiments, be it lab experiments, business studies or algorithm benchmarking. You will probably manipulate a number of factors and record the outcome of the experiment for particular factor combinations.
So how would you store these results? My naive solution was to create a dict of a dict of a dict ... and so on, especially since Python offers great support for dicts. And I would store the results in a pickle file.
However, I noticed that this leads to very brittle structures. You often have to change factor names or combination types as you go. This leads to a lot of Russian dolls of dicts that are incompatible with each other. And the most annoying thing is the churn in the data-producing code as well as in the data-postprocessing code (e.g. visualisation).
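To make the brittleness concrete, here is a minimal sketch of the nested-dict approach. The algorithm names, input sizes, runtimes and the helper mean_runtime are all hypothetical and chosen only for illustration; they are not taken from the talk.

```python
# Hypothetical benchmark results: results[algorithm][input_size] -> runtime in seconds.
results = {
    "quicksort": {1000: 0.012, 10000: 0.15},
    "mergesort": {1000: 0.015, 10000: 0.17},
}


def mean_runtime(results, algorithm):
    """Average runtime over all input sizes; assumes exactly two nesting levels."""
    runs = results[algorithm]
    return sum(runs.values()) / len(runs)


# Adding a new factor (here: the input distribution) forces another nesting level ...
results_v2 = {
    "quicksort": {"random": {1000: 0.012, 10000: 0.15}},
    "mergesort": {"random": {1000: 0.015, 10000: 0.17}},
}

# ... and the old post-processing code breaks, because it now sums dicts instead of floats.
try:
    mean_runtime(results_v2, "quicksort")
    broke = False
except TypeError:
    broke = True
```

Every consumer of the structure has to know the exact nesting order, so a single new factor ripples through both the producing and the post-processing code.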
The solution is, in hindsight, really simple: Structure your data as late as possible. The data-producing code should record data in pandas' long format, i.e. one row per measurement with one column per factor and one column for the outcome. Then you are free to change the factors and factor combinations as you wish, and the data-producing code stays the same. You now only have to change the data-postprocessing code.
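A minimal sketch of recording in long format might look as follows. The record helper, the column names and the numbers are hypothetical; the only pandas calls used are the DataFrame constructor and pivot_table.

```python
import pandas as pd

records = []  # one flat dict per measurement, no nesting


def record(**row):
    """Hypothetical helper: append one measurement with arbitrary factor columns."""
    records.append(row)


record(algorithm="quicksort", input_size=1000, runtime_s=0.012)
record(algorithm="quicksort", input_size=10000, runtime_s=0.15)
record(algorithm="mergesort", input_size=1000, runtime_s=0.015)
record(algorithm="mergesort", input_size=10000, runtime_s=0.17)

# Long format: each row is one measurement, each factor is a column.
df = pd.DataFrame(records)

# Structure is imposed only at post-processing time, e.g. a size-by-algorithm table.
wide = df.pivot_table(index="input_size", columns="algorithm", values="runtime_s")
```

Adding a new factor later only means passing one more keyword to record; rows recorded without it simply get a missing value in that column, and only the post-processing step decides how to group or pivot.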
In this talk, I will show a simple example of algorithm benchmarking. I will demonstrate how rigid structures get repeatedly broken over time and how one can alleviate this problem by using Pandas' long format. I will also present a short overview of relevant Pandas functions.
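Which functions the talk covers is not stated here, so the following is only an assumed selection of pandas functions that are commonly used with long-format data: melt to convert an existing wide table into long format, groupby to aggregate over factors, and pivot to build a table for display. The data is hypothetical.

```python
import pandas as pd

# A hypothetical wide table, e.g. exported from an older nested structure.
wide = pd.DataFrame({
    "input_size": [1000, 10000],
    "quicksort": [0.012, 0.15],
    "mergesort": [0.015, 0.17],
})

# melt: wide -> long, turning the algorithm columns into a single factor column.
long = wide.melt(id_vars="input_size", var_name="algorithm", value_name="runtime_s")

# groupby: aggregate the outcome over any chosen subset of factors.
mean_by_algorithm = long.groupby("algorithm")["runtime_s"].mean()

# pivot: long -> wide, only when a table or plot needs that shape.
table = long.pivot(index="input_size", columns="algorithm", values="runtime_s")
```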
In a way, my call for late structuring resonates with this blog post.