The curse of imbalanced data sets refers to data sets in which the number of samples in one class is much smaller than in the others. This issue is often encountered in real-world applications such as medical imaging (e.g., cancer detection) and fraud detection. Under such conditions, machine learning algorithms learn sub-optimal models that generally favor the class with the largest number of samples.
In this talk, we will present the imbalanced-learn package, which implements some of the state-of-the-art algorithms for tackling the class imbalance problem, and which is part of the scikit-learn-contrib project. scikit-learn includes a tremendous set of pre-processing methods (i.e., transformers, standardizers, etc.) to optimally train machine learning algorithms. However, it currently provides no estimator to reduce or generate samples. Therefore, imbalanced-learn introduces a new type of estimator, named sampler, aimed at resampling a data set whenever desired. The samplers are fully compatible with the current scikit-learn API and expose the following main methods, inspired from scikit-learn: (i) fit, (ii) sample, and (iii) fit_sample. Additionally, a Pipeline class is inherited from scikit-learn, permitting samplers to be incorporated in the usual classification pipeline, as illustrated in the sketch below. During the talk, we will also present the key parameters shared by all the samplers.
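To make this concrete, here is a minimal sketch of stand-alone and pipelined sampler usage. It assumes the 0.x-era interface described above (fit/sample/fit_sample); note that later imbalanced-learn releases renamed fit_sample to fit_resample. The toy data set and parameter values are purely illustrative.

```python
# Minimal sketch of the sampler API, assuming the 0.x-era interface
# described above (later releases rename fit_sample to fit_resample).
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from imblearn.over_sampling import RandomOverSampler
from imblearn.pipeline import Pipeline

# Toy data set with a 9:1 class imbalance (illustrative values).
X, y = make_classification(n_samples=1000, weights=[0.9, 0.1],
                           random_state=42)

# Stand-alone usage: fit the sampler on (X, y), then resample.
ros = RandomOverSampler(random_state=42)
X_res, y_res = ros.fit_sample(X, y)

# Pipeline usage: the sampler is applied during fit only, so the
# classifier trains on the resampled data while predict/score run
# on the original, untouched samples.
pipe = Pipeline([('sampler', RandomOverSampler(random_state=42)),
                 ('clf', LogisticRegression())])
pipe.fit(X, y)
print(pipe.score(X, y))
```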
Regarding the data science aspect of this talk, we will highlight the distinctive characteristics of the different algorithm families: (i) over-sampling, (ii) controlled under-sampling, (iii) cleaning under-sampling, (iv) combination of over-sampling and cleaning under-sampling, and (v) ensemble samplers; the sketch after this paragraph maps one sampler from the package onto each family.
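As an illustrative (and non-exhaustive) mapping of these families onto the package, the sketch below instantiates one sampler per category; class names are taken from imbalanced-learn 0.x, and exact signatures may vary across versions.

```python
# One sampler per family, as an illustrative mapping:
from imblearn.over_sampling import SMOTE                # (i) over-sampling
from imblearn.under_sampling import RandomUnderSampler  # (ii) controlled under-sampling
from imblearn.under_sampling import TomekLinks          # (iii) cleaning under-sampling
from imblearn.combine import SMOTETomek                 # (iv) over-sampling + cleaning
from imblearn.ensemble import EasyEnsemble              # (v) ensemble sampler

from sklearn.datasets import make_classification

X, y = make_classification(n_samples=1000, weights=[0.9, 0.1],
                           random_state=0)

# The first four samplers each return a single resampled data set ...
for sampler in (SMOTE(), RandomUnderSampler(), TomekLinks(), SMOTETomek()):
    X_res, y_res = sampler.fit_sample(X, y)
    print(type(sampler).__name__, X_res.shape)

# ... while the ensemble sampler returns several balanced subsets.
X_sets, y_sets = EasyEnsemble(random_state=0).fit_sample(X, y)
print('EasyEnsemble', X_sets.shape)  # (n_subsets, n_samples, n_features)
```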
In addition, we will briefly present a couple of examples in which the package has been used on real-world data sets.
Our package is still under heavy development, and we aim to improve the following points: