We’ll cover the following 4 topics:
To go straight to the Python code that shows how to test for normality, scroll down to the section named Example .
The data set used in the article can be downloaded from this link .
Normality means that your data follows the normal distribution. Specifically, each value y_i in Y is a ‘realization’ of some normally distributed random variable N(µ_i , σ_i) as follows:
While building a linear regression model, one assumes that Y depends on a matrix of regression variables X. This makes Y conditionally normal on X. If X =[x_1, x_2, …, x_n] are jointly normal, then µ = f( X ) is a normally distributed vector, and so is Y, as follows:
Several statistical techniques and models assume that the underlying data is normally distributed.
I’ll give below three such situations where normality rears its head:
Bear in mind that even if the errors are not normally distributed, the OLS estimator is still the BLUE i.e. B est L inear U nbiased E stimator for the model as long as E(ϵ)=0, and all other requirements of OLSR are satisfied.
Several statistical tests are available to test the degree to which your data deviates from normality, and if the deviation is statistically significant .
In this article, we’ll look at moment based measures, namely Skewness and Kurtosis , and the statistical tests of significance, namely Omnibus K² and Jarque — Bera , that are based on these measures.
Skewness lets you test by how much the overall shape of a distribution deviates from the shape of the normal distribution.
The following figures illustrate skewed distributions.
The moment based definition of Skewness is as follows:
Skewness is defined as the third standardized central moment, of the random variable of the probability distribution.
The formula for skewness of the population is show below:
In practice, we can estimate the skewness in the population by calculating skewness for a sample. For the sample, we cheat a little by assuming that the random variable is uniformly distributed, so the probability of each y_i in the sample is 1/n and the third, central, sample moment becomes 1/n times a simple summation over all (y_i —y_bar)³.
Skewness is very sensitive to the parameters of the probability distribution.
The following figure illustrates the skewness of the Poisson distribution ’s P robability M ass F unction for various values of the event rate parameter λ:
Why does skewness of Poisson’s PMF reduce for large event rates?For large values of λ, the Poisson distribution’s PMF approaches the Normal distribution’s PMF with mean and variance = λ. That is, Poisson(λ) → N(λ, λ), as λ → ∞. Therefore, it’s no coincidence what are seeing in the above figure.
As λ → ∞, skewness of the Poisson distribution tends to the skewness of the normal distribution, namely 0.
There are other measures of Skewness also, for example:
Kurtosis is a measure of how differently shaped are the tails of a distribution as compared to the tails of the normal distribution. While skewness focuses on the overall shape, Kurtosis focuses on the tail shape.
Kurtosis is defined as follows:
Kurtosis is the fourth standardized central moment, of the random variable of the probability distribution.
The formula for Kurtosis is as follows:
For a sample, excess Kurtosis is estimated by dividing the fourth central sample moment by the fourth power of the sample standard deviation, and subtracting 3.0, as follows:
Here is an excellent image from Wikipedia Commons that shows the Excess Kurtosis of various distributions. I have super-imposed a magnified version of the tails in the top left side of the image:
While Skewness and Kurtosis quantify the amount of departure from normality, one would want to know if the departure is statistically significant . The following two tests let us do just that:
In both tests, we start with the following hypotheses:
The Omnibus test combines the random variables for Skewness and Kurtosis into a single test statistic as follows:
Probability distribution of the test statistic:In the above formula, the functions Z1() and Z2() are meant to make the random variables g1 and g2 approximately normally distributed. Which in turn makes their sum of squares approximately Chi-squared(2) distributed , thereby making the statistic of the Omnibus K-squared approximately Chi-squared(2) distributed under the assumption that null hypothesis is true , i.e. the data is normally distributed.
The test statistic for this test is as follows:
Probability distribution of the test statistic:The test statistic is the scaled sum of squares of random variables g1 and g2 that are each approximately normally distributed, thereby making the JB test statistic approximately Chi-squared(2) distributed , under the assumption that the null hypothesis is true.
We’ll use the following data set from the U.S. Bureau of Labor Statistics , to illustrate the application of normality tests:
Here are the first few rows of the data set:
You can download the data from this link .
Let’s fit the following OLS regression model to this data set:
Wages = β_0 + β_1 * Year + ϵ
Yearis the regression a.k.a. explanatory variable
is the intercept of regression,β_1
is the coefficient of regression, andϵ is the unexplained regression error
We’ll use Python libraries pandas and statsmodels to read the data, and to build and train our OLS model for this data.
Let’s start with importing the required packages:
import pandas as pdimport statsmodels.formula.api as smfimport statsmodels.stats.api as smsfrom statsmodels.compat import lzipimport matplotlib.pyplot as pltfrom statsmodels.graphics.tsaplots import plot_acf
Read the data into the pandas data frame:
df = pd.read_csv('wages_and_salaries_1984_2019_us_bls_CXU900000LB1203M.csv', header=0)
Plot Wages against Year :
fig = plt.figure()plt.xlabel('Year')plt.ylabel('Wages and Salaries (USD)')fig.suptitle('Wages and Salaries before taxes. All US workers')wages, = plt.plot(df['Year'], df['Wages'], 'go-', label='Wages and Salaries')plt.legend(handles=[wages])plt.show()
Create the regression expression in patsy syntax. In the following expression, we are telling statsmodels that Wages is the response variable and Year is the regression variable. statsmodels will automatically add an intercept to the regression equation.
expr = 'Wages ~ Year'
Configure the OLS regression model by passing the model expression, and train the model on the data set, all in one step:
olsr_results = smf.ols(expr, df).fit()
Print the model summary:
In the following output, I have called out the areas that bode well and bode badly for our OLS model’s suitability for the data:
Following are a few things to note from the results:
name = ['Omnibus K-squared test', 'Chi-squared(2) p-value']#Pass the residual errors of the regression into the test
test = sms.omni_normtest(olsr_results.resid)lzip(name, test)#Prints out the following:
> [('Omnibus K-squared test', 1.2194658631806088), ('Chi-squared(2) p-value', 0.5434960003061313)]name = ['Jarque-Bera test', 'Chi-squared(2) p-value', 'Skewness', 'Kurtosis']test = sms.jarque_bera(olsr_results.resid)lzip(name, test)#Prints out the following:
[('Jarque-Bera test', 1.109353094606092), ('Chi-squared(2) p-value', 0.5742579764509973), ('Skewness', 0.26780140709870015), ('Kurtosis', 2.3116476989966737)]
Now for the bad part:Both the Durbin-Watson test and the Condition number of the residuals indicates auto-correlation in the residuals, particularly at lag 1.
We can easily confirm this via the ACF plot of the residuals:
plot_acf(olsr_results.resid, title='ACF of residual errors')plt.show()
This presents a problem for us: One of the fundamental requirements of Classical Linear Regression Models is that the residual errors should not be auto-correlated. In this case they most certainly are so. Which means that the OLS estimator may have under-estimated the variance in the training data, which in turn means that it’s predictions will be off by a large amount.
Simply put, the OLS estimator is no longer BLUE ( B est L inear U nbaised E stimator) for the model. Bummer!
The auto-correlation of residual errors points to a possibility that our model was incorrectly chosen, or incorrectly configured. Particularly,
In such cases, your choice is between accepting the sub optimal-ness of the chosen model, and addressing the above two reasons for sub optimality.
Thanks for reading! If you liked this article, please follow me at Sachin Date to receive tips, how-tos and programming advice on time series analysis and forecasting.