
EuroSciPy 2013

Brussels, Belgium - August 21-24 2013

Distributed High-Dimensional Regression with Shared Memory for Neuroimaging-Genetic Studies

Benoit Da Mota

Sat 24 11:30 a.m.–11:50 a.m. in Dupreel

Abstract

Problem setting and issues

In the 1990s, the Human Genome Project started sequencing and mapping the nucleotides of a human reference genome. This project is undoubtedly one of the biggest recent achievements of international scientific collaboration. Soon our genome might be part of our medical records, with promises of new cures. What will be the next breakthrough? Possibly a better comprehension of the human brain, as many efforts are ongoing with the Human Brain Project and the BRAIN Initiative. This opens a new challenge: how does genetics influence brain function? The integration of genetic information with neuroimaging data promises to significantly improve our understanding of both normal and pathological variability of the brain. It should lead to the development of biomarkers and, in the future, to personalized medicine via gene therapy. The statistical analysis of such high-dimensional and complex data is carried out with increasingly sophisticated techniques and represents a great computational challenge.

Functional neuroimaging-genetic datasets consist of:

  • genotype measurements on genetic variables, such as Single Nucleotide Polymorphisms (SNPs),

  • 3D brain images that represent the amount of functional activation in response to a certain task (very noisy data).

The standard approach for such studies is the Mass Univariate Linear Model [ste10], which considers each (SNP, voxel) pair independently and tests the significance of the correlation between the two traits. With 50,000 voxels and 500,000 SNPs, there are 25 billion possible pairs, many of which will appear correlated by chance: this multiple comparisons problem leads to very low sensitivity. The toy sketch below illustrates the pairwise computation.
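As an illustration, under the simplifying assumption that each univariate test reduces to the correlation between a z-scored SNP and a z-scored voxel signal, the whole grid of pairwise statistics can be obtained with a single matrix product (toy sizes, random data; variable names are ours):

    # Toy sketch of the mass univariate approach: one correlation per
    # (SNP, voxel) pair, computed for all pairs at once.
    import numpy as np

    n_subjects, n_snps, n_voxels = 100, 2000, 500
    snps = np.random.randn(n_subjects, n_snps)
    voxels = np.random.randn(n_subjects, n_voxels)

    # z-score each column; all pairwise correlations are then one dot product
    zs = (snps - snps.mean(axis=0)) / snps.std(axis=0)
    zv = (voxels - voxels.mean(axis=0)) / voxels.std(axis=0)
    corr = np.dot(zs.T, zv) / n_subjects      # shape (n_snps, n_voxels)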

A more realistic goal is to look at how all the genetic variants together can predict the brain activity, in order to assess their influence. This work describes how we tackle the computational challenges underlying such questions: we want to use all the CPU cores with a limited amount of memory per core (about 2GB) in an HPC context. Memory is shared between processes using memmap, an efficient vectorized ridge regression is described, and the computation is distributed with a MapReduce approach.

Preliminary choices

  1. Neuroimaging dimension reduction

In functional neuroimaging, brain atlases can be used to provide a simpler representation of the data by considering signal averages within groups of neighboring voxels. If those groups do not overlap and every voxel belongs to exactly one group, the term parcel is employed and the atlas is called a parcellation. In this work, we restrict ourselves to parcellations because: (i) it is a simple and easily interpretable method, and (ii) by reducing the number of descriptors, it reduces the multiple comparisons problem [thi06]. For this purpose we use the Ward hierarchical clustering of Scikit-learn, as sketched below.
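A minimal sketch of such a parcellation with Scikit-learn's Ward clustering (in recent versions, AgglomerativeClustering with linkage='ward'); the grid size, number of parcels and variable names are toy values, not the exact pipeline used in this work:

    # Hedged sketch: Ward clustering of neighboring voxels into parcels, then
    # parcel signals as within-parcel averages of the voxel signals.
    import numpy as np
    from sklearn.cluster import AgglomerativeClustering
    from sklearn.feature_extraction.image import grid_to_graph

    n_subjects, nx, ny, nz = 50, 10, 10, 10
    images = np.random.randn(n_subjects, nx, ny, nz)
    X = images.reshape(n_subjects, -1)            # (subjects, voxels)

    # spatial connectivity so that each parcel is a group of neighboring voxels
    connectivity = grid_to_graph(nx, ny, nz)
    ward = AgglomerativeClustering(n_clusters=100, linkage="ward",
                                   connectivity=connectivity)
    labels = ward.fit(X.T).labels_                # one parcel label per voxel

    # parcel signals: average of the voxel signals within each parcel
    Y = np.column_stack([X[:, labels == k].mean(axis=1)
                         for k in range(labels.max() + 1)])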

  2. Multivariate models: ridge regression

Gains in sensitivity might be provided by multivariate models in which the joint variability of several genetic variables is considered simultaneously. However, the cost of a single fit is high due to the size of the data; moreover, permutation testing [and01] is necessary to assess the statistical significance (p-values) of the results of such procedures in the absence of analytical tests. In this work we consider the simplest approach, ridge regression [koh11], which is powerful for detecting multivariate associations between large variable sets but does not enforce sparsity in the solution. Our starting point is the RidgeCV of Scikit-learn.

In this setting, we have 1,000 parcels and approximately 500,000 SNPs observed in about 1,500 subjects. The outer cross-validation loop makes 10 iterations (to measure the prediction) and the inner loop chooses between 5 penalization parameters. A permutation test with 10,000 iterations raises the total number of ridge regressions to 600 million!

The computational burden is huge, but the memory limitation is an even more constraining problem. In the inner cross-validation loop there are about 1,000 subjects, so the matrix of genetic measurements already takes 1.9GB in single precision.
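A quick back-of-the-envelope check of that figure:

    # ~1,000 subjects x ~500,000 SNPs stored as float32 (4 bytes per value)
    n_subjects, n_snps, bytes_per_value = 1000, 500000, 4
    print(n_subjects * n_snps * bytes_per_value / 2.0 ** 30)   # ~1.86 GiB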

Our approach

  1. Vectorized ridge regression

The main idea is to vectorize the ridge regression to take advantage of the floating point SIMD instructions of modern CPUs. To compute the coefficients of the ridge regression (Beta), we compute the SVD of X (where X is the matrix of genetic measurements) with the eigensolver of SciPy; the problem can then be rewritten as

    Beta_{i,j} = V diag( s / (s^2 + lambda_j) ) U^T y_i,    with X = U diag(s) V^T (SVD of X)

where y_i is the signal of parcel i, lambda_j is the j-th penalization parameter, s the vector of singular values, and U, V the singular vectors of X. The vectorized version is then almost straightforward to obtain, avoiding the loops over i and j (see the sketch below).
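A hedged sketch of the vectorized solver, assuming X of shape (subjects, SNPs), Y of shape (subjects, parcels) and a list of penalization parameters; the function name and shapes are ours, not the exact implementation:

    # Ridge regression via SVD, vectorized over all parcels (i) and all
    # penalization parameters (j) at once.
    import numpy as np
    from scipy import linalg

    def ridge_svd_all(X, Y, lambdas):
        """Return ridge coefficients of shape (n_snps, n_parcels, n_lambdas)."""
        U, s, Vt = linalg.svd(X, full_matrices=False)   # X: (subjects, snps)
        UTy = np.dot(U.T, Y)               # reused for every parcel and lambda
        # d[k, j] = s_k / (s_k ** 2 + lambda_j)
        d = s[:, np.newaxis] / (s[:, np.newaxis] ** 2 + np.asarray(lambdas))
        # broadcast over parcels (i) and penalties (j), then contract with V
        scaled = UTy[:, :, np.newaxis] * d[:, np.newaxis, :]
        B = np.dot(Vt.T, scaled.reshape(len(s), -1))
        return B.reshape(Vt.shape[1], Y.shape[1], len(lambdas))

Predictions for the j-th penalization parameter are then np.dot(X_test, B[:, :, j]).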

  2. Optimal caching of sub-results and memmap for shared memory

Solving the ridge with the SVD algorithm has a big advantage for a permutation test: the computation of SVD(X) (and some dot products) depends neither on the parcels nor on the permutations, so it is reused 10 million times (1,000 parcels x 10,000 permutations). In the end, most of the runtime is spent in NumPy dot products (~85%) and in the computation of the 60 SVDs.

We first used joblib to manage result caching, but in a multicore context the memory constraints forced us to move to memmap. With file caching, the system gives us an efficient mechanism to share memory between processes (a minimal sketch follows this paragraph). Accesses to the memmap are very predictable and, in theory, the computation leaves enough time to load the data from disk, so we tried to prefetch the data. The result is quite disappointing: the prefetching process was unable to load the data during the computations. We attribute this to saturation of the memory bus.
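A minimal sketch of this sharing mechanism with numpy.memmap and multiprocessing; the file path and array size are illustrative and far smaller than the real matrices:

    # Each worker re-opens the same memory-mapped file read-only; the operating
    # system keeps a single copy of the pages in its file cache.
    import numpy as np
    from multiprocessing import Pool

    PATH = "/tmp/svd_cache.dat"          # hypothetical cache file
    SHAPE = (1000, 5000)                 # toy size

    def worker(bounds):
        lo, hi = bounds
        data = np.memmap(PATH, dtype=np.float32, mode="r", shape=SHAPE)
        return float(data[lo:hi].sum())  # stands in for the ridge computations

    if __name__ == "__main__":
        data = np.memmap(PATH, dtype=np.float32, mode="w+", shape=SHAPE)
        data[:] = 1.0
        data.flush()

        chunks = [(i, i + 100) for i in range(0, SHAPE[0], 100)]
        with Pool(4) as pool:
            print(sum(pool.map(worker, chunks)))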

We tried to solve our problem with multi-threaded MKL (OpenMP), but solving 12 dot products in parallel is far more efficient than solving a single one with 12 cores. All computations are in single precision (32 bits), and dot products are computed in place or with tiling techniques to control memory consumption, which slightly improves performance. The memory saving is obvious, since NumPy does not have to pass the full matrices to the underlying linear algebra library. For the product of a matrix A by a matrix B, there are two cases:

  • The case without loss of precision: A has many rows (the same idea applies if B has many columns)

    for i in range(0, A.shape[0], tile_size):
        result[i:i + tile_size, :] = np.dot(A[i:i + tile_size, :], B)
    
  • The case with loss of precision: B has many rows (the same idea applies if A has many columns)

    for i in range(0, B.shape[0], tile_size):
        result[:, :] += np.dot(A[:, i:i + tile_size], B[i:i + tile_size, :])
    

For our application, the accuracy loss is negligible; this trick alleviates some constraints on the library and accounts for part of the speedup. The other part comes from the size of the tiles and the architecture of modern computers (CPU caches and memory bandwidth).
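A runnable comparison of the two tiling strategies on small random matrices (sizes and tile_size are illustrative); the second variant accumulates float32 partial sums, which is where the small loss of precision comes from:

    import numpy as np

    rng = np.random.RandomState(0)
    A = rng.randn(2000, 300).astype(np.float32)
    B = rng.randn(300, 400).astype(np.float32)
    tile_size = 128

    # case 1: tile over the rows of A, no accumulation
    out1 = np.empty((A.shape[0], B.shape[1]), dtype=np.float32)
    for i in range(0, A.shape[0], tile_size):
        out1[i:i + tile_size, :] = np.dot(A[i:i + tile_size, :], B)

    # case 2: tile over the inner dimension, float32 partial sums accumulate
    out2 = np.zeros((A.shape[0], B.shape[1]), dtype=np.float32)
    for i in range(0, B.shape[0], tile_size):
        out2 += np.dot(A[:, i:i + tile_size], B[i:i + tile_size, :])

    ref = np.dot(A.astype(np.float64), B.astype(np.float64))
    print(np.abs(out1 - ref).max(), np.abs(out2 - ref).max())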

In summary, the results of the SVDs are shared through memmaps that are accessed in parallel, and we tune the number of ridge regressions solved at the same time to ensure that the system keeps the memmaps in cache.

[Figure: memory profile]

  3. The distributed part: MapReduce-like

We exploit data locality so that each node works on only one inner cross-validation loop. This choice limits the number of SVDs to load and keep in cache to 6 (~12GB) and allows fitting the model for 1,000 targets and 5 penalization parameters at the same time. If needed, we can limit a node to a single SVD to save more memory or to scale to bigger problems. The rest of the MapReduce algorithm is quite simple (a sketch of the reducer follows the list):

  • the map performs the computation of the statistical model (on both the original and the permuted data), and

  • the reducer computes the mean scores and the p-values from the distribution under the null hypothesis.
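A minimal sketch of the reducer's p-value computation, assuming the mappers return one mean score per parcel on the original data and one score per parcel for each permutation (names and shapes are ours):

    import numpy as np

    def permutation_p_values(score_original, scores_permuted):
        """score_original: (n_parcels,) mean prediction score on the true data.
        scores_permuted: (n_permutations, n_parcels) scores under the null."""
        n_perm = scores_permuted.shape[0]
        # permutation p-value, with +1 to avoid reporting p = 0
        return (1.0 + (scores_permuted >= score_original).sum(axis=0)) \
               / (1.0 + n_perm)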

Results

To solve our setting, we estimate the total runtime at approximately 50,000 core-hours, i.e. 4-5 days on our 480-core HPC cluster, whereas the RidgeCV of Scikit-learn either runs out of memory or takes millions of core-hours on smaller problems. Coming back to our application, the best explained variance (R2) found on a parcel is 0.07, which means that we predict only a small fraction of the brain activity. The current genetic measurements are probably too far from our target, and we need a better comprehension and new advances in both domains to get better results, but the (Python) tool is ready.


[and01] M. J. Anderson and J. Robinson. Permutation tests for linear models, Australian and New Zealand Journal of Statistics, (43):75--88, 2001.

[koh11] O. Kohannim et al. Boosting power to detect genetic associations in imaging using multi-locus, genome-wide scans and ridge regression, In Biomedical Imaging, 2011.

[ste10] J. L. Stein et al. Voxelwise genome-wide association study (vGWAS), Neuroimage, 53(3):1160--1174, Nov 2010.

[thi06] B. Thirion et al. Dealing with the shortcomings of spatial normalization: multi-subject parcellation of fMRI datasets. Human Brain Mapping 27, 2006.
