In machine learning, linear regression is applied to predict an outcome (called the dependent variable) as a function of one or more predictors (called independent variables), which are correlated with the outcome. For example, the weight of students in a class can be predicted with two variables — age, height — which are correlated with weight. In a linear function this relation can be represented as:
weight = c + b1*age + b2*height
c, b1, b2 are parameters to be estimated from training data.
In this kind of regression two important assumptions are made: a) that the outcome is a continuous variable and b)that it is normally distributed.
However, in reality, this is not the case all the time. Outcomes are not always normally distributed, nor are they always have to be continuous variables.
Here are just a few of many examples where these assumptions are violated:
Since ordinary linear regression is not suitable in those instances, what are the alternatives? Here are some options:
In cases such as #1 and #2 above, if the outcome/dependent variable is binary or categorical, machine learning classification models should work fine. L ogistic Regression for binary and Random Forest for multi-class classification are two frequently applied algorithms in the machine learning world.
That leaves us with two following situations where neither ordinary linear regression nor classification algorithms will work:
1) Count outcome
2) Continuous but skewed outcome
This is where the Generalized Linear Models (GLM) come handy (aside: it’s g eneralized linear models, NOT general linear model which refers to conventional OLS regression). GLMs are a class of models that are applied in cases where linear regression isn’t applicable or fail to make appropriate predictions.
A GLM consists of three components:
There are several great packages in R and Python to implement GLM but below is an implementation using
statmodels library in Python.
Note that if you are into R programming language, be sure to check out this example from a Princeton researcher .
In summary, in this article, we’ve discussed that ordinary linear regression is applied if the outcome is a continuous variable and is normally distributed. However, there are cases where these two assumptions do not hold true. In those situations a suite of Generalized Linear Models is applied. A GLM has three elements: random, systematic and link function which need to be specified in each model implementation.
Hope this was useful, you can follow me on Twitter for updates and new article alerts.