A Time Series Analysis of Lake Erie From 1921 to 1970 Using A SARIMA Model

Unlike many other statistical models, time series models assume dependent observations: they argue that yesterday influences today in a meaningful way. In this post, I train a SARIMA model to predict the monthly water levels in Lake Erie. I’ll walk through the pipeline, explaining what to do at each step and why.

  1. Load, Clean, and Transform The Dataset
#load the raw dataset
mydata_new<-mydata[-601,]#remove the last blank row
Lake_Erie_Levels_ts<-ts(mydata_new$Monthly.Lake.Erie.Levels.1921...1970,frequency = 12,start=c(1921,1))#turn the column into a ts object
# Plot the Dependent Variable
ts.plot(Lake_Erie_Levels_ts,ylab="Level",main="Figure 1: Monthly Lake Erie Levels from 1921 to 1970")

As a rule of thumb, we should eyeball the raw data for outliers, trend, seasonality, and sharp discontinuity.

From Figure 1, the data points move up and down within a fairly constant range, suggesting constant variance.

The mean rises and falls regularly along the time axis, repeating roughly every 12 lags, which indicates a cyclical pattern. We will run a spectral analysis to determine the exact seasonality.

There is no sharp discontinuity.

2. Preliminary Preparations: Data Split and Check For Seasonality and Stationarity

Let’s split the data into two sets: training and test.

#We train the model on 90% of the data: 1921.01-1965.12
training = window(Lake_Erie_Levels_ts,start=c(1921,1), end=c(1965,12))
#The remaining 10% is held out as the test set: 1966.01-1970.12
test = window(Lake_Erie_Levels_ts,start=c(1966,1), end=c(1970,12))

Let’s apply a spectral analysis to determine the seasonality.


Spectral analysis suggests a period of 1/0.084 ≈ 11.9, which is pretty close to 12.

We decide the seasonality is 12, which makes sense since we have monthly data.
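The spectral step can be sketched as follows; this is a minimal example using base R’s `spec.pgram` periodogram on the raw series (the exact smoothing options behind the post’s figure are unknown, so treat this as an assumption):

```r
# Periodogram of the training series; drop the ts attributes so that
# frequencies are reported in cycles per observation (i.e., per month)
spec <- spec.pgram(as.numeric(training), log = "no", plot = FALSE)
peak_freq <- spec$freq[which.max(spec$spec)]
1 / peak_freq  # estimated period in months; the post reads 1/0.084 off its plot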

To test for stationarity, we conduct an Augmented Dickey-Fuller test.

library(tseries, quietly = T)
adf.test(training)

	Augmented Dickey-Fuller Test

data:  training
Dickey-Fuller = -2.0622, Lag order = 8, p-value = 0.552
alternative hypothesis: stationary

With a p-value of 0.552, we fail to reject the null hypothesis of a unit root: the data is not stationary.

So, let’s difference the data to make it stationary, starting with a first difference:

training_diff1 <- diff(training,1)
ts.plot(training_diff1,ylab= expression(paste(nabla,y)))

Great! The series is stationary now: the mean stabilizes over time, with only a few significant spikes.

Just to play it safe, let’s check whether a second difference is necessary.

training_diff2 <- diff(training, differences = 2)#note: diff(training,2) would take a lag-2 difference, not a second difference

After the second difference, the series is still stationary but the variance increases, which suggests over-differencing.
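The over-differencing claim can be checked numerically with a quick sketch (reusing `training` from above): the variance should drop after the first difference and rise again if we difference once more.

```r
# Variance before and after differencing; a rising variance after the
# second difference is a classic sign of over-differencing
var(training)
var(diff(training, differences = 1))
var(diff(training, differences = 2))
```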

3. Model Construction and Selection

library(PerformanceAnalytics)#provides chart.ACFplus
chart.ACFplus(training_diff1, maxlag = 84, elementcolor = "black")

To determine the potential relationships between lags, we plot the ACF and PACF plots after the first differencing.

The two plots suggest a strong seasonal component: the ACF spikes recur regularly at the seasonal lags rather than dying off. In other words, we need to account for seasonality, so we take a seasonal difference at lag 12.

training_diff1_seasonal12 <- diff(training_diff1, lag = 12)#seasonal difference at lag 12
chart.ACFplus(training_diff1_seasonal12, maxlag = 84, elementcolor = "gray")

It becomes stationary with only a few spikes around the year 1940.

Based on the ACF and PACF patterns, we tentatively come up with the following seasonal models:

1. ACF cuts off after seasonal lag 1 (lag 12) & PACF tails off → SMA(1), or P=0, Q=1.
2. A spike in the ACF at lag 12 & the PACF decaying exponentially at seasonal lags 12, 24, 36, and so on → again SMA(1), or P=0 and Q=1.
3. ACF and PACF both tailing off at seasonal lags → SARMA(3,1), or P=3, Q=1.

Non-seasonal models:

1. PACF cuts off after lag 1 → AR(1), or p=1, q=0
2. PACF cuts off after lag 1 & ACF cuts off after lag 1 → p=q=1
3. PACF cuts off after lag 3 & ACF cuts off after lag 1 → p=3, q=1

Combining the seasonal and non-seasonal components, we have the following 6 candidate models:

  1. Model 1: SARIMA(1,1,0)*(0,1,1)12
  2. Model 2: SARIMA(1,1,1)*(0,1,1)12
  3. Model 3: SARIMA(3,1,1)*(0,1,1)12
  4. Model 4: SARIMA(1,1,0)*(3,1,1)12
  5. Model 5: SARIMA(1,1,1)*(3,1,1)12
  6. Model 6: SARIMA(3,1,1)*(3,1,1)12
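For reference, each candidate above is an instance of the general seasonal model. Writing $B$ for the backshift operator, a SARIMA$(p,d,q)\times(P,D,Q)_{12}$ model is

$$\phi(B)\,\Phi(B^{12})\,(1-B)^{d}\,(1-B^{12})^{D}\,y_t = \theta(B)\,\Theta(B^{12})\,\varepsilon_t,$$

where $\phi$ and $\theta$ are the non-seasonal AR and MA polynomials of orders $p$ and $q$, $\Phi$ and $\Theta$ are their seasonal counterparts of orders $P$ and $Q$, and $\varepsilon_t$ is white noise. All six candidates share $d = D = 1$, matching the one regular and one seasonal difference taken above.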

4. Model Comparison

########### use auto.arima to find the best model automatically
model_auto = auto.arima(training, stepwise=FALSE, approximation=FALSE)#exhaustive search; this takes a long time to run
summary(model_auto)#ARIMA(1,1,2)(2,0,0)[12] with AIC=670.27 AICc=670.45 BIC=695.31
###### default stepwise selection
model_auto_1 = auto.arima(training)
summary(model_auto_1)# ARIMA(1,1,0)(2,0,0)[12] with AIC=672.23 AICc=672.31 BIC=688.91

The built-in function ( auto.arima ) generates a model that serves as a reference point for comparison with the hand-picked models.

model_1 = arima(training, order = c(1,1,0),seasonal = list(order=c(0,1,1),period =12));model_1#AIC=553.74
model_2 = arima(training, order = c(1,1,1),seasonal = list(order=c(0,1,1),period =12));model_2#AIC=555.48
model_3 = arima(training, order = c(3,1,1),seasonal = list(order=c(0,1,1),period =12));model_3#AIC=548.71
model_4 = arima(training, order = c(1,1,0),seasonal = list(order=c(3,1,1),period =12));model_4#AIC=556.94
model_5 = arima(training, order = c(1,1,1),seasonal = list(order=c(3,1,1),period =12));model_5#AIC=558.24
model_6 = arima(training, order = c(3,1,1),seasonal = list(order=c(3,1,1),period =12));model_6#AIC=553.14

We fit these six models and compare their AICs. Model_3 has the lowest AIC (548.71) and is the winner, which is consistent with the ACF/PACF analysis.

5. Model Diagnostic

ts.plot(residuals(model_3))# a few spikes observed
# normality check
shapiro.test(residuals(model_3))
qqnorm(residuals(model_3));qqline(residuals(model_3),col ="blue")#the distribution is right-skewed: several points sit above the reference line

According to the Shapiro-Wilk normality test and the Q-Q plot, the residuals are not normally distributed.

chart.ACFplus(residuals(model_3), maxlag = 48, elementcolor = "gray")

From the ACF and PACF plots, most residual autocorrelations fall within the 95% confidence bounds, with only a few significant spikes. The residuals are consistent with white noise.

Box.test(residuals(model_3),lag=22,type = c("Ljung-Box"),fitdf=4)

	Box-Ljung test

data:  residuals(model_3)
X-squared = 23.657, df = 18, p-value = 0.1666

With p = 0.1666 > 0.05, the residuals pass the Ljung-Box test: there is no significant remaining linear dependency.

library(GeneCycle)#provides fisher.g.test
fisher.g.test(residuals(model_3))
[1] 0.436583

With a p-value of 0.44, Fisher’s g-test does not reject the hypothesis that the residuals are Gaussian white noise.

6. Prediction

mypred <- predict(model_3, n.ahead=120)
Upper_tr = mypred$pred + 1.96*mypred$se
Lower_tr = mypred$pred - 1.96*mypred$se
ts.plot(training, xlab="Year",ylab="Level",main = "Forecast")
lines(test,col="red")#plot test data
lines(mypred$pred,col="blue")#plot point forecasts
lines(Upper_tr,col="blue",lty=2)#95% upper bound
lines(Lower_tr,col="blue",lty=2)#95% lower bound
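To quantify out-of-sample accuracy, a simple sketch (reusing `model_3` and `test` from above, not part of the original post) compares the forecasts against the 60 held-out test months via RMSE:

```r
# Forecast exactly as many months as the test set covers (60), then
# measure the root-mean-squared error against the held-out observations
pred_test <- predict(model_3, n.ahead = length(test))
rmse <- sqrt(mean((test - pred_test$pred)^2))
rmse
```

The same comparison could use MAE or MAPE; RMSE is chosen here only because it penalizes the occasional large miss most heavily.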