Unlike other types of statistical models, Time Series models assume dependent observations and argue yesterday's influences today in a meaningful way. In this post, I train a SARIMA model to predict the monthly water levels in Lake Erie. I’ll walk through the pipeline, explaining what to do and why we do it.
#load the raw dataset mydata<-read.csv(“monthly-lake-erie-levels-1921–19.csv”) mydata_new<-mydata[-601,]#remove the last blank row Lake_Erie_Levels_ts<-ts(mydata_new$Monthly.Lake.Erie.Levels.1921…1970,frequency = 12,start=c(1921,1))#turn into a TS model # Plot the Dependent Variable plot(Lake_Erie_Levels_ts,xlab=”Date”,ylab=”Levels”, main=”Figure 1: Monthly Lake Erie Levels from 1921 to 1970")
As a rule of thumb, we should eyeball the raw data for outliers, trend, seasonality, and sharp discontinuity.
From Figure 1 , data points move up and down within a certain range, suggesting constant variance.
The mean moves along with the horizontal line (time) regularly every 12 lags, indicating a cyclical pattern. We will do a spectral analysis to determine the exact seasonality.
No sharp discontinuity.
Let’s split the data into two sets: training and test.
#We train a model using 90% of the data, aka. 1921.01–1965.12 training = window(Lake_Erie_Levels_ts,start=c(1921,1), end=c(1965,12)) #10% for the test set, aka. 1966.1-1970.12 test = window(Lake_Erie_Levels_ts,start=c(1966,1), end=c(1970,12))#test: 1966.1–1970.12; 10% as the test
Let’s apply a spectral analysis to determine the seasonality.
install.packages(“TSA”) require(TSA) periodogram(Lake_Erie_Levels_ts);abline(h=0);axis(1,at=0.084)
Spectral analysis suggests a period of 1/0.084 ≈11.9, which is pretty close to 12.
We decide the seasonality is 12, which makes sense since we have monthly data.
library(tseries, quietly = T) adf.test(training) Augmented Dickey-Fuller Test data: training Dickey-Fuller = -2.0622, Lag order = 8, p-value = 0.552 alternative hypothesis: stationary
To test for stationarity, we conduct an Augmented Dickey-Fuller Test.
As it turns out, the data is not stationary.
So, let’s take differencing to make the data stationary. Starting from a first differencing:
training_diff1 <- diff(training,1) ts.plot(training_diff1,ylab= expression(paste(nable,y))) var(training_diff1) adf.test(training_diff1)
Great! The dataset is stationary now. The mean stabilized over time with only a few significant spikes.
Just play the safe card, let’s check if a second differencing is necessary.
training_diff2 <- diff(training,2) var(training_diff2) adf.test(training_diff2)
After the second differencing, the model is still stationary but with an increased variance, which suggests over-differencing.
library(PerformanceAnalytics) chart.ACFplus(training_diff1, maxlag = 84, elementcolor = “black”)
To determine the potential relationships between lags, we plot the ACF and PACF plots after the first differencing.
The two plots suggest a strong seasonality component because ACFs die off regularly. In other words, we need to include a seasonality component.
training_diff1_seasonal12<-diff(training_diff1,12) plot.ts(training_diff1_seasonal12) chart.ACFplus(training_diff1_seasonal12, maxlag = 84, elementcolor = “gray”)
It becomes stationary with only a few spikes around the year 1940.
Based on the ACF and PACF patterns, we tentatively come up with the following seasonal models:
1. ACF cuts off after lag 1 & PACF is tailing off → SMA(1), or P=0,Q=1.
2. A spike of ACF at lag 12 & PACF decay exponentially with seasonal lags at 12, 24, 36, and others → SMA(1), or P=0 and Q=1.
3. ACF and PACF both are tailing off in seasonal lags → SARMA(3,1), or P=3,Q=1
1. PACF cuts off after lag 1 → AR(1), or p=1,q=0 2. PACF cuts off after lag 1 & ACF cuts off after lag 1 → p=q=1 3. PACF cuts off after lag 3 & ACF cuts off after lag 1 → p=3, q=1
In total, we have the following 6 potential models with seasonal and non-seasonal components listed:
########### use auto.arima to find the best model automatically
model_auto= auto.arima(training, stepwise=FALSE, approximation=FALSE)#this function takes a lot of time to run!!!!
summary(model_auto)#ARIMA(1,1,2)(2,0,0) with AIC=670.27 AICc=670.45 BIC=695.31###### auto selection
summary(model_auto_1)# ARIMA(1,1,0)(2,0,0) with AIC=672.23 AICc=672.31 BIC=688.91
The built-in function ( auto.arima ) generates a model as a reference point, comparing to the hand-picked model.
model_1 = arima(training, order = c(1,1,0),seasonal = list(order=c(0,1,1),period =12));model_1#AIC=553.74 model_2 = arima(training, order = c(1,1,1),seasonal = list(order=c(0,1,1),period =12));model_2#AIC=555.48 model_3 = arima(training, order = c(3,1,1),seasonal = list(order=c(0,1,1),period =12));model_3#AIC=548.71model_4 = arima(training, order = c(1,1,0),seasonal = list(order=c(3,1,1),period =12));model_4#AIC=556.94 model_5 = arima(training, order = c(2,1,1),seasonal = list(order=c(3,1,1),period =12));model_6#AIC=558.24 model_6 = arima(training, order = c(3,1,0),seasonal = list(order=c(3,1,1),period =12));model_7#AIC=553.14
We run these six functions and obtain the AIC results. It seems that Model_3 is the winner, which is consistent with the result suggested by ACF/PACF.
ts.plot(residuals(model_3))# a few spikes observed. # normality check shapiro.test(residuals(model_3)) qqnorm(residuals(model_3));qqline(residuals(model_3),col =”blue”)#skewed to the right as can be seen from the distribution that a few data points stand above the abline.
According to Shapiro-Wilk normality test and Q-Q plot, residuals are not normally distributed.
chart.ACFplus(residuals(model_3), maxlag = 48, elementcolor = “gray”)
From the ACF and PACF plots, most residual points fall within the 95% confidence interval with a few significant spikes. Residuals follow a pattern of white noise.
Box.test(residuals(model_3),lag=22,type = c(“Ljung-Box”),fitdf=4) Box-Ljung test data: residuals(model_3) X-squared = 23.657, df = 18, p-value = 0.1666
Residuals pass the linear dependency test after passing the Ljung-Box test.
#install.packages(“GeneCycle”) library(“GeneCycle”) fisher.g.test(residuals(model_3)) 0.436583
Residuals follow a Gaussian white noise distribution after Fisher’s test.
mypred <- predict(model_3, n.ahead=120) Upper_tr = mypred$pred + 1.96*mypred$se Lower_tr = mypred$pred — 1.96*mypred$se ts.plot(training, xlab=”Year”,ylab=”Level”,main = “Forecast”) lines(Upper_tr,lty=2,col=”gray”) lines(Lower_tr,lty=2,col=”gray”) points(mypred$pred,col=”blue”) lines(test,col=”red”)#plot test data