Jan 10 · 6 min read
Time Series is a class of data science problems where the primary values of interest are a series of data points measured over a period of time. This notebook aims to provide the basic building blocks of some of the more modern algorithms / techniques (and data!) for solving these types of problems.
Is ARIMA the first thing you think of when you hear about time series? It might be time to explore other methodologies. Many modern techniques are being actively developed, and some of them outperform traditional ARIMA models. We’ll look at some of these models and apply them to stock market data to predict price.
Models explored in this notebook:
Auto ARIMAX
Facebook Prophet
Let's get some data!
!pip install pmdarima

import lightgbm as lgb
import numpy as np
import pandas as pd
# Newer releases of this package are published as `prophet` (from prophet import Prophet)
from fbprophet import Prophet
from matplotlib import pyplot as plt
from pmdarima import auto_arima
from sklearn.metrics import mean_absolute_error, mean_squared_error

# Fix the random seed for reproducibility
myfavouritenumber = 23
seed = myfavouritenumber
np.random.seed(seed)
The dataset used is stock market data of the Nifty-50 index from NSE (National Stock Exchange) India over the last 20 years (2000–2019).
The historic VWAP (Volume Weighted Average Price) is the target variable to predict. VWAP is a trading benchmark used by traders that gives the average price the stock has traded at throughout the day, based on both volume and price.
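As a rough illustration of how a volume-weighted average differs from a plain average, here is a minimal sketch in plain Python. The function name `vwap` and the example trades are hypothetical and are not taken from the dataset, which already ships a precomputed VWAP column.

```python
def vwap(trades):
    """Volume-weighted average price for a list of (price, volume) trades."""
    total_value = sum(price * volume for price, volume in trades)
    total_volume = sum(volume for _, volume in trades)
    if total_volume == 0:
        raise ValueError("VWAP is undefined when no volume was traded")
    return total_value / total_volume


# Hypothetical intraday trades as (price, volume) pairs
example_trades = [(100.0, 10), (102.0, 30), (101.0, 60)]
print(vwap(example_trades))
```

The simple average of the three prices is 101.0, while the VWAP comes out slightly higher (about 101.2) because most of the volume traded at 101 and 102.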
Read more about the dataset: https://www.kaggle.com/rohanrao/nifty50-stock-market-data
The stock used is BAJAJFINSV.
Reading the market data of BAJAJFINSV stock and preparing a training dataset and validation dataset.
df = pd.read_csv("/kaggle/input/nifty50-stock-market-data/BAJAJFINSV.csv")
df.set_index("Date", drop=False, inplace=True)
df.head()
Plotting the target variable VWAP over time.
df.VWAP.plot(figsize=(14, 7))
Almost every time series problem will have some external features or some internal feature engineering to help the model.
Let’s add some basic features, such as lag values of the available numeric features, which are widely used for time series problems. Since we need to predict the price of the stock for a given day, we cannot use the feature values of that same day, because they will be unavailable at actual inference time. Instead, we use statistics such as the mean and standard deviation of their lagged values.
We will use three sets of lagged values: one looking back 3 days, one looking back 7 days, and one looking back 30 days, the latter two as proxies for last-week and last-month metrics.
df.reset_index(drop=True, inplace=True)
lag_features = ["High", "Low", "Volume", "Turnover", "Trades"]
window1 = 3
window2 = 7
window3 = 30

# Rolling windows over the raw features
df_rolled_3d = df[lag_features].rolling(window=window1, min_periods=0)
df_rolled_7d = df[lag_features].rolling(window=window2, min_periods=0)
df_rolled_30d = df[lag_features].rolling(window=window3, min_periods=0)

# shift(1) so that each row only sees statistics from earlier days
df_mean_3d = df_rolled_3d.mean().shift(1).reset_index().astype(np.float32)
df_mean_7d = df_rolled_7d.mean().shift(1).reset_index().astype(np.float32)
df_mean_30d = df_rolled_30d.mean().shift(1).reset_index().astype(np.float32)

df_std_3d = df_rolled_3d.std().shift(1).reset_index().astype(np.float32)
df_std_7d = df_rolled_7d.std().shift(1).reset_index().astype(np.float32)
df_std_30d = df_rolled_30d.std().shift(1).reset_index().astype(np.float32)

for feature in lag_features:
    df[f"{feature}_mean_lag{window1}"] = df_mean_3d[feature]
    df[f"{feature}_mean_lag{window2}"] = df_mean_7d[feature]
    df[f"{feature}_mean_lag{window3}"] = df_mean_30d[feature]

    df[f"{feature}_std_lag{window1}"] = df_std_3d[feature]
    df[f"{feature}_std_lag{window2}"] = df_std_7d[feature]
    df[f"{feature}_std_lag{window3}"] = df_std_30d[feature]

# Fill the gaps left by shifting with column means (numeric columns only,
# since recent pandas versions raise on non-numeric columns)
df.fillna(df.mean(numeric_only=True), inplace=True)
df.set_index("Date", drop=False, inplace=True)
df.head()
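To make the role of `shift(1)` in the feature code above more concrete, here is a small plain-Python sketch of the same idea: each day's feature is the mean of up to `window` values strictly before that day. The helper name `lagged_rolling_mean` and the sample values are hypothetical, and this is only meant to mirror the behaviour of `rolling(window, min_periods=0).mean().shift(1)`, not to replace it.

```python
def lagged_rolling_mean(values, window):
    """For each position t, average the (up to) `window` values strictly before t."""
    features = []
    for t in range(len(values)):
        history = values[max(0, t - window):t]  # excludes values[t] itself
        # The first day has no history, so its feature is missing
        features.append(sum(history) / len(history) if history else None)
    return features


# Hypothetical daily volumes
volumes = [10, 20, 30, 40, 50]
print(lagged_rolling_mean(volumes, window=3))
```

The feature for the last day is the mean of 20, 30 and 40. The value 50 observed on that day never enters its own feature, which is what keeps same-day information out of the model.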
For boosting models, it is very useful to add datetime features such as hour, day, and month, as applicable, to give the model information about the time component in the data. Time series models do not strictly require this information, but we pass it anyway so that all models are compared on exactly the same set of features.
df.Date = pd.to_datetime(df.Date, format="%Y-%m-%d")
df["month"] = df.Date.dt.month
# Note: `.dt.week` was removed in pandas 2.0; there, use df.Date.dt.isocalendar().week.astype(int)
df["week"] = df.Date.dt.week
df["day"] = df.Date.dt.day
df["day_of_week"] = df.Date.dt.dayofweek
Splitting the data into train and validation along with features.
# .copy() so forecast columns can be added later without a SettingWithCopyWarning
df_train = df[df.Date < "2019"].copy()
df_valid = df[df.Date >= "2019"].copy()

exogenous_features = [
    "High_mean_lag3", "High_std_lag3", "Low_mean_lag3", "Low_std_lag3",
    "Volume_mean_lag3", "Volume_std_lag3", "Turnover_mean_lag3", "Turnover_std_lag3",
    "Trades_mean_lag3", "Trades_std_lag3",
    "High_mean_lag7", "High_std_lag7", "Low_mean_lag7", "Low_std_lag7",
    "Volume_mean_lag7", "Volume_std_lag7", "Turnover_mean_lag7", "Turnover_std_lag7",
    "Trades_mean_lag7", "Trades_std_lag7",
    "High_mean_lag30", "High_std_lag30", "Low_mean_lag30", "Low_std_lag30",
    "Volume_mean_lag30", "Volume_std_lag30", "Turnover_mean_lag30", "Turnover_std_lag30",
    "Trades_mean_lag30", "Trades_std_lag30",
    "month", "week", "day", "day_of_week",
]
The additional features supplied to time series problems are called exogenous regressors.
ARIMA (Auto Regressive Integrated Moving Average) models explain a given time series based on its own past values, that is, its own lags and the lagged forecast errors, so that the resulting equation can be used to forecast future values.
ARIMA models require certain input parameters: p for the AR(p) part, q for the MA(q) part and d for the I(d) part. Thankfully, there is an automatic process by which these parameters can be chosen which is called Auto ARIMA.
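To give a feel for what the p and d parameters mean, here is a minimal plain-Python sketch of an ARIMA(1, 1, 0)-style forecast: difference the series once (d = 1), fit a single autoregressive coefficient on the differences by least squares (p = 1, no intercept, no MA part), and undo the differencing. The function names and the price series are hypothetical, and this is a simplification for illustration, not how pmdarima fits its models.

```python
def difference(series):
    """First-order differencing: the I(d) part with d = 1."""
    return [cur - prev for prev, cur in zip(series, series[1:])]


def fit_ar1(series):
    """Least-squares estimate of phi in x[t] = phi * x[t-1] + error (no intercept)."""
    numerator = sum(cur * prev for prev, cur in zip(series, series[1:]))
    denominator = sum(prev * prev for prev in series[:-1])
    return numerator / denominator


def forecast_next(series):
    """One-step-ahead forecast from a differenced AR(1) model."""
    diffs = difference(series)
    phi = fit_ar1(diffs)               # the AR(p) part with p = 1
    next_diff = phi * diffs[-1]        # predicted next change
    return series[-1] + next_diff      # undo the differencing


# Hypothetical prices whose daily changes halve each day: 8, 4, 2, 1
prices = [100.0, 108.0, 112.0, 114.0, 115.0]
print(forecast_next(prices))
```

Here the estimated coefficient is 0.5, so the model expects the next change to be half of the last one.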
When exogenous regressors are used with ARIMA it is commonly called ARIMAX.
Read more about ARIMA:
https://en.wikipedia.org/wiki/Autoregressive_integrated_moving_average
# Note: recent pmdarima releases take the regressors as `X=` instead of `exogenous=`
model = auto_arima(df_train.VWAP, exogenous=df_train[exogenous_features], trace=True, error_action="ignore", suppress_warnings=True)
# auto_arima already returns a fitted model; this refits the selected order on the same data
model.fit(df_train.VWAP, exogenous=df_train[exogenous_features])

forecast = model.predict(n_periods=len(df_valid), exogenous=df_valid[exogenous_features])
df_valid["Forecast_ARIMAX"] = forecast
The best ARIMA model is ARIMA(2, 0, 1), which has the lowest AIC.
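For context, AIC (Akaike Information Criterion) trades off goodness of fit against model complexity: AIC = 2k - 2 ln(L), where k is the number of estimated parameters and L is the maximized likelihood. A small sketch of how candidates would be compared follows; the model names and log-likelihood values are made up purely for illustration and are not results from this notebook.

```python
def aic(log_likelihood, n_params):
    """Akaike Information Criterion: lower is better."""
    return 2 * n_params - 2 * log_likelihood


# Hypothetical candidates: name -> (maximized log-likelihood, number of parameters)
candidates = {
    "model_a": (-520.0, 2),
    "model_b": (-510.0, 4),
}
scores = {name: aic(ll, k) for name, (ll, k) in candidates.items()}
best = min(scores, key=scores.get)
print(scores, best)
```

In this made-up case the larger model wins because its better fit outweighs the penalty for two extra parameters.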
df_valid[["VWAP", "Forecast_ARIMAX"]].plot(figsize=(14, 7))
print("RMSE of Auto ARIMAX:", np.sqrt(mean_squared_error(df_valid.VWAP, df_valid.Forecast_ARIMAX)))
print("\nMAE of Auto ARIMAX:", mean_absolute_error(df_valid.VWAP, df_valid.Forecast_ARIMAX))
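For reference, the two metrics printed above can be computed by hand. The sketch below uses plain Python and hypothetical numbers, not values from the validation set. RMSE squares the errors before averaging, so it penalizes large misses more heavily than MAE does.

```python
import math


def mae(actual, predicted):
    """Mean absolute error."""
    return sum(abs(a - p) for a, p in zip(actual, predicted)) / len(actual)


def rmse(actual, predicted):
    """Root mean squared error."""
    return math.sqrt(sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual))


# Hypothetical actual prices and forecasts
actual = [100.0, 102.0, 105.0, 103.0]
predicted = [101.0, 100.0, 105.0, 107.0]
print("MAE:", mae(actual, predicted))
print("RMSE:", rmse(actual, predicted))
```

With these numbers the single miss of 4 pulls the RMSE noticeably above the MAE.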
The Auto ARIMAX model seems to do a fairly good job in predicting the stock price given data till the previous day. Can other models beat this benchmark?
Prophet is an open-source time series model developed by Facebook. It was released in early 2017. An excerpt from the homepage:
Prophet is a procedure for forecasting time series data based on an additive model where non-linear trends are fit with yearly, weekly, and daily seasonality, plus holiday effects. It works best with time series that have strong seasonal effects and several seasons of historical data. Prophet is robust to missing data and shifts in the trend, and typically handles outliers well.
Read more about Prophet: https://facebook.github.io/prophet/
Note that the default parameters are used for Prophet. They can be tuned to improve the results.
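The "additive model" mentioned in the excerpt can be written out roughly as below. The symbols are assumptions chosen here for illustration: g(t) is the trend, s(t) the seasonal components, h(t) the holiday effects, x_j(t) the extra regressors added with `add_regressor` (additive by default), and ε_t the error term.

```latex
y(t) = g(t) + s(t) + h(t) + \sum_{j} \beta_j \, x_j(t) + \varepsilon_t
```

Each term is estimated jointly, and the forecast is the sum of the fitted components.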
model_fbp = Prophet()
for feature in exogenous_features:
    model_fbp.add_regressor(feature)

model_fbp.fit(df_train[["Date", "VWAP"] + exogenous_features].rename(columns={"Date": "ds", "VWAP": "y"}))

forecast = model_fbp.predict(df_valid[["Date", "VWAP"] + exogenous_features].rename(columns={"Date": "ds"}))
df_valid["Forecast_Prophet"] = forecast.yhat.values

model_fbp.plot_components(forecast)
df_valid[["VWAP", "Forecast_ARIMAX", "Forecast_Prophet"]].plot(figsize=(14, 7))
print("RMSE of Auto ARIMAX:", np.sqrt(mean_squared_error(df_valid.VWAP, df_valid.Forecast_ARIMAX)))
print("RMSE of Prophet:", np.sqrt(mean_squared_error(df_valid.VWAP, df_valid.Forecast_Prophet)))
print("\nMAE of Auto ARIMAX:", mean_absolute_error(df_valid.VWAP, df_valid.Forecast_ARIMAX))
print("MAE of Prophet:", mean_absolute_error(df_valid.VWAP, df_valid.Forecast_Prophet))