EuroSciPy 2014


Cambridge, UK - 27-30 August 2014

Generalizing nonparametric regression methods in Python

Jason Rudy


What statisticians used to call model selection is now often a component of model fitting. Machine learning gives us techniques for fitting not only the coefficients of our models, but also the parametric forms of the models themselves, to our data sets. Examples include random forests, multivariate adaptive regression splines, and Gaussian process regression. For computational simplicity, most such techniques are designed to minimize the Euclidean distance, or some penalized variant of it, between predicted and observed responses. The minimum-Euclidean-distance solution corresponds to a maximum likelihood estimate under the assumption of independent, normally distributed residuals with constant variance. However, there are many types of data sets for which residuals should be treated as non-Gaussian. Statisticians often address such problems using the framework of generalized linear models (GLMs). I will explain how the mathematical framework of GLMs can be combined with commonly used machine learning algorithms to perform nonparametric regression under more general residual distributions, and I will demonstrate prototypes based on the scikit-learn interface.
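As a rough illustration of the idea, and not necessarily the approach taken in the talk's prototypes, one way to combine the two frameworks is to reuse the iteratively reweighted least squares (IRLS) loop of a GLM and hand its weighted least-squares step to any scikit-learn regressor that accepts `sample_weight`. The sketch below assumes a Poisson response with a log link and simulated data; the helper name `fit_poisson_irls` and all parameter choices are hypothetical.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor


def fit_poisson_irls(X, y, make_regressor, n_iter=10):
    """Hypothetical helper: IRLS for a Poisson response with a log link.

    The weighted least-squares step of each iteration is delegated to a
    fresh scikit-learn regressor built by `make_regressor`.
    """
    y = np.asarray(y, dtype=float)
    # Start from the constant model on the link scale: eta = log(mean(y)).
    eta = np.full(y.shape, np.log(y.mean()))
    model = None
    for _ in range(n_iter):
        mu = np.exp(eta)             # current mean estimate
        z = eta + (y - mu) / mu      # working response for the log link
        w = mu                       # working weights for the Poisson family
        model = make_regressor()
        model.fit(X, z, sample_weight=w)
        # Clip the linear predictor to guard against overflow in exp().
        eta = np.clip(model.predict(X), -20.0, 20.0)
    return model


# Simulated count data (an assumption made purely for illustration).
rng = np.random.default_rng(0)
X = rng.uniform(-2.0, 2.0, size=(500, 1))
true_rate = np.exp(1.0 + np.sin(2.0 * X[:, 0]))
y = rng.poisson(true_rate)

model = fit_poisson_irls(
    X, y, lambda: DecisionTreeRegressor(max_depth=3, random_state=0)
)
# The regressor predicts on the link scale, so exponentiate to get rates.
predicted_rate = np.exp(model.predict(X))
```

Because the regressor is fitted on the link scale, the predicted rates are always positive, which a plain least-squares fit to the raw counts would not guarantee. Swapping the tree for another regressor that supports `sample_weight` only requires changing the factory passed as `make_regressor`.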