EuroScipy

Cambridge, UK - 27-30 August 2014

Simplify and Popularize Scientific Data Management Using RQL

Alain Leufroy

Abstract

Multimodal and multi-source data are a growing concern for scientists, owing to the increase in dataset size and availability. Scientists routinely deal with results coming from experimental data, simulation data, or existing databases (e.g. Open Data).

Python tools and libraries already exist to exploit each of these kinds of data (scipy, numpy, pandas, scikit-learn), but combining several of them is not easy without adding a layer of complexity.

The two major steps when working with complex scientific data are:

  • modeling the related meta-data (e.g. experiment descriptors such as subject's age in medical studies) that reflect the high-level business logic;
  • importing and storing data in a database.

A common choice is to use a relational database, which allows complex and in-depth queries via the SQL language. A model targeting a relational database usually induces an overhead: it forces the user to manipulate objects that are not directly related to the original data model, but rather to the underlying physical structure of the relational database.

We present here Python tools which avoid this overhead by using (1) the YAMS schema library [1] for data modeling and (2) the RQL query language [2]. This language brings several advantages compared to SQL, including human-readable queries closer to the business logic, type inference and abstraction of the database structure. Both tools allow the scientist to import massive data from several sources with a few lines of Python, and do not require advanced technical skills. They allow the scientist to stay focused on data analysis rather than waste time on data access machinery. The tools will be illustrated with Brainomics, a medical data project.

Let's see an example:

"Query all the scans of male subjects"

Any X WHERE X is Scan, X concerns S, S is Subject, S gender "male"
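
For comparison, here is a minimal sketch of the same question asked in plain SQL, using Python's standard sqlite3 module. The table layout is an assumption made for illustration only (the tables subject, scan and scan_concerns_subject are hypothetical, not the Brainomics schema or the tables that RQL actually generates). It shows the kind of overhead described above: the "concerns" relation becomes a join table that the user must know about and traverse explicitly.

```python
import sqlite3


def build_demo_database():
    """Create an in-memory database with a hypothetical layout and toy rows."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(
        """
        CREATE TABLE subject (id INTEGER PRIMARY KEY, gender TEXT NOT NULL);
        CREATE TABLE scan (id INTEGER PRIMARY KEY, label TEXT NOT NULL);
        -- Hypothetical join table standing in for the "concerns" relation.
        CREATE TABLE scan_concerns_subject (
            scan_id INTEGER NOT NULL REFERENCES scan(id),
            subject_id INTEGER NOT NULL REFERENCES subject(id)
        );
        INSERT INTO subject (id, gender) VALUES (1, 'male'), (2, 'female');
        INSERT INTO scan (id, label) VALUES (10, 'T1'), (11, 'fMRI'), (12, 'T1');
        INSERT INTO scan_concerns_subject (scan_id, subject_id)
            VALUES (10, 1), (11, 2), (12, 1);
        """
    )
    return conn


def scans_of_male_subjects(conn):
    """Return the ids of scans linked to a male subject, ordered by id."""
    rows = conn.execute(
        """
        SELECT DISTINCT scan.id
        FROM scan
        JOIN scan_concerns_subject AS rel ON rel.scan_id = scan.id
        JOIN subject ON subject.id = rel.subject_id
        WHERE subject.gender = 'male'
        ORDER BY scan.id
        """
    ).fetchall()
    return [row[0] for row in rows]


if __name__ == "__main__":
    connection = build_demo_database()
    print(scans_of_male_subjects(connection))  # prints [10, 12]
```

The SQL version has to name the join table and both join conditions, whereas the RQL query only names the entities and the relation between them.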

See more examples on the Brainomics website.

[1] Yet Another Magic Schema

[2] Relation Query Language

Brainomics is an open-source solution to manage brain imaging datasets, genomic datasets and associated meta-data. This project is supported by grants from the French National Research Agency (ANR IA BRAINOMICS; ANR-10-BINF-04).
