The curse of imbalanced data sets refers to data sets in which the number of samples in one class is much smaller than in the others. This issue is often encountered in real-world applications such as medical imaging (e.g., cancer detection) and fraud detection. Under such conditions, machine learning algorithms learn sub-optimal models that generally favor the class with the largest number of samples.
In this talk, we will present the imbalanced-learn package, which implements some of the state-of-the-art algorithms for tackling the class imbalance problem, and which is part of the scikit-learn-contrib project. scikit-learn includes a tremendous set of pre-processing methods (i.e., transformers, standardizers, etc.) to optimally train machine learning algorithms. However, it currently provides no estimator to reduce or generate samples. Therefore, imbalanced-learn introduces a new type of estimator, named sampler, aimed at resampling a data set whenever desired. The samplers are fully compatible with the current scikit-learn API and expose the following main methods, inspired from scikit-learn: (i) fit, (ii) sample, and (iii) fit_sample. Additionally, a Pipeline class is inherited from scikit-learn, permitting samplers to be incorporated in the usual classification pipeline, as illustrated in the sketch below. During the talk, we will also present the key parameters shared by all the samplers.
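To make this concrete, here is a minimal sketch of stand-alone and pipelined sampler usage. It assumes the 0.x-era interface described above (fit/sample/fit_sample); note that later imbalanced-learn releases renamed fit_sample to fit_resample. The toy data set and parameter values are purely illustrative.

```python
# Minimal sketch of the sampler API, assuming the 0.x-era interface
# described above (later releases rename fit_sample to fit_resample).
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from imblearn.over_sampling import RandomOverSampler
from imblearn.pipeline import Pipeline

# Toy data set with a 9:1 class imbalance (illustrative values).
X, y = make_classification(n_samples=1000, weights=[0.9, 0.1],
                           random_state=42)

# Stand-alone usage: fit the sampler on (X, y), then resample.
ros = RandomOverSampler(random_state=42)
X_res, y_res = ros.fit_sample(X, y)

# Pipeline usage: the sampler is applied during fit only, so the
# classifier trains on the resampled data while predict/score run
# on the original, untouched samples.
pipe = Pipeline([('sampler', RandomOverSampler(random_state=42)),
                 ('clf', LogisticRegression())])
pipe.fit(X, y)
print(pipe.score(X, y))
```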
Regarding the data science aspect of this talk, we will highlight the distinctive characteristics of the different algorithm families: (i) over-sampling, (ii) controlled under-sampling, (iii) cleaning under-sampling, (iv) combination of over-sampling and cleaning under-sampling, and (v) ensemble samplers; the sketch after this paragraph maps one sampler from the package onto each family.
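As an illustrative (and non-exhaustive) mapping of these families onto the package, the sketch below instantiates one sampler per category; class names are taken from imbalanced-learn 0.x, and exact signatures may vary across versions.

```python
# One sampler per family, as an illustrative mapping:
from imblearn.over_sampling import SMOTE                # (i) over-sampling
from imblearn.under_sampling import RandomUnderSampler  # (ii) controlled under-sampling
from imblearn.under_sampling import TomekLinks          # (iii) cleaning under-sampling
from imblearn.combine import SMOTETomek                 # (iv) over-sampling + cleaning
from imblearn.ensemble import EasyEnsemble              # (v) ensemble sampler

from sklearn.datasets import make_classification

X, y = make_classification(n_samples=1000, weights=[0.9, 0.1],
                           random_state=0)

# The first four samplers each return a single resampled data set ...
for sampler in (SMOTE(), RandomUnderSampler(), TomekLinks(), SMOTETomek()):
    X_res, y_res = sampler.fit_sample(X, y)
    print(type(sampler).__name__, X_res.shape)

# ... while the ensemble sampler returns several balanced subsets.
X_sets, y_sets = EasyEnsemble(random_state=0).fit_sample(X, y)
print('EasyEnsemble', X_sets.shape)  # (n_subsets, n_samples, n_features)
```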
In addition, we will briefly present a couple of examples in which the package has been used on real-world data sets.
Our package is still under heavy development, and we aim to improve the following points: