Thousands of Models in One: The Most Powerful Machine-Learning Stock-Prediction Framework!

1

SG (stacked generalization) means training a model whose job is to combine the outputs of several other models. Concretely: we first train a number of different models, using different algorithms or other variations; we then treat those models' outputs as a new dataset, i.e. the trained models' predictions become the inputs for training one further model, which produces the final output.

2

http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.56.1533

1. Build multiple different models (with different learning algorithms, different hyperparameters, or different features) to make predictions.

2. Train a "meta-model" (or "blending model") that learns how to combine the predictions of these multiple models, producing a single, robust best prediction for a regression or classification task.
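The two steps above can be sketched with scikit-learn. This is a hypothetical toy example: the synthetic data, the choice of base models, and the `cross_val_predict` fold scheme are illustrative only, not the walk-forward setup used later in this article.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.tree import DecisionTreeRegressor
from sklearn.model_selection import cross_val_predict

rng = np.random.RandomState(0)
X = rng.randn(500, 5)
y = X[:, 0] + np.sign(X[:, 1]) * X[:, 2] + 0.1 * rng.randn(500)

# Step 1: base models with different inductive biases
base_models = {'linear': LinearRegression(),
               'tree': DecisionTreeRegressor(max_depth=4, random_state=0)}

# The base models' predictions become the meta-model's inputs; out-of-fold
# predictions keep the meta-model from learning on in-sample fits
meta_X = np.column_stack([cross_val_predict(m, X, y, cv=5)
                          for m in base_models.values()])

# Step 2: the meta-model learns how to weight the base predictions
meta_model = LinearRegression().fit(meta_X, y)
print(dict(zip(base_models, meta_model.coef_.round(2))))
```

The meta-model typically gives more weight to whichever base model captures structure the others miss, here the tree for the non-linear term and the regression for the linear one.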

3

4

1. Out-of-sample training

2、Non-Negativity

But if a model is only useful when its predictions are consistently wrong, it is probably a model we would rather not trust.
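One way to encode that intuition is to constrain the ensemble weights to be non-negative, so a model that only "helps" when its sign is flipped gets a weight of zero rather than a negative one. A minimal sketch with synthetic data, using `LassoCV(positive=True)` (the same estimator this article uses for its ensemble later):

```python
import numpy as np
from sklearn.linear_model import LassoCV

rng = np.random.RandomState(1)
y = rng.randn(1000)
good_pred = y + 0.5 * rng.randn(1000)       # noisy but directionally right
inverted_pred = -y + 0.5 * rng.randn(1000)  # only "useful" if we flip its sign
preds = np.column_stack([good_pred, inverted_pred])

# Non-negative ensemble weights: the inverted model is dropped (weight ~0)
# rather than being held with a negative weight
ens = LassoCV(positive=True, cv=5).fit(preds, y)
print(ens.coef_)
```

With the constraint, the contrarian model's coefficient is driven to zero instead of the ensemble learning to bet against it.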

5

```
import numpy as np
import pandas as pd
import pandas_datareader as pdr
pd.core.common.is_list_like = pd.api.types.is_list_like  # compatibility shim for older pandas_datareader
%matplotlib inline

def get_symbols(symbols, data_source, begin_date=None, end_date=None):
    out = pd.DataFrame()
    for symbol in symbols:
        # the download line was missing; reconstructed assuming quandl's adjusted price fields
        df = pdr.DataReader(symbol, data_source, begin_date, end_date)\
            [['AdjOpen','AdjHigh','AdjLow','AdjClose','AdjVolume']].reset_index()
        df.columns = ['date','open','high','low','close','volume'] # my convention: always lowercase
        df['symbol'] = symbol # add a new column which contains the symbol so we can keep multiple symbols in the same dataframe
        df = df.set_index(['date','symbol'])
        out = pd.concat([out,df],axis=0) # stacks on top of previously collected data
    return out.sort_index()

idx = get_symbols(['AAPL','CSCO','MSFT','INTC'],data_source='quandl',begin_date='2012-01-01',end_date=None).index
# note, we're only using quandl prices to generate a realistic multi-index of dates and symbols

num_obs = len(idx)
split = int(num_obs*.80)

## First, create factors hidden within feature set
hidden_factor_1 = pd.Series(np.random.randn(num_obs),index=idx)
hidden_factor_2 = pd.Series(np.random.randn(num_obs),index=idx)
hidden_factor_3 = pd.Series(np.random.randn(num_obs),index=idx)
hidden_factor_4 = pd.Series(np.random.randn(num_obs),index=idx)

## Next, generate outcome variable y that is related to these hidden factors
y = (0.5*hidden_factor_1 + 0.5*hidden_factor_2 +  # factors linearly related to outcome
     hidden_factor_3 * np.sign(hidden_factor_4) + hidden_factor_4*np.sign(hidden_factor_3) +  # factors with non-linear relationships
     pd.Series(np.random.randn(num_obs),index=idx)).rename('y')  # noise

## Generate features which contain a mix of one or more hidden factors plus noise and bias
f1 = 0.25*hidden_factor_1  +  pd.Series(np.random.randn(num_obs),index=idx) + 0.5
f2 = 0.5*hidden_factor_1  +  pd.Series(np.random.randn(num_obs),index=idx) - 0.5
f3 = 0.25*hidden_factor_2  +  pd.Series(np.random.randn(num_obs),index=idx) + 2.0
f4 = 0.5*hidden_factor_2  +  pd.Series(np.random.randn(num_obs),index=idx) - 2.0
f5 = 0.25*hidden_factor_1 + 0.25*hidden_factor_2  +  pd.Series(np.random.randn(num_obs),index=idx)
f6 = 0.25*hidden_factor_3  +  pd.Series(np.random.randn(num_obs),index=idx) + 0.5
f7 = 0.5*hidden_factor_3  +  pd.Series(np.random.randn(num_obs),index=idx) - 0.5
f8 = 0.25*hidden_factor_4  +  pd.Series(np.random.randn(num_obs),index=idx) + 2.0
f9 = 0.5*hidden_factor_4  +  pd.Series(np.random.randn(num_obs),index=idx) - 2.0
f10 = hidden_factor_3 + hidden_factor_4  +  pd.Series(np.random.randn(num_obs),index=idx)

## From these features, create an X dataframe
X = pd.concat([f1.rename('f1'),f2.rename('f2'),f3.rename('f3'),f4.rename('f4'),f5.rename('f5'),
               f6.rename('f6'),f7.rename('f7'),f8.rename('f8'),f9.rename('f9'),f10.rename('f10')],axis=1)
```

6

1. The distributions of the features and the target variable.

```
ax = X.plot.kde(legend=True,xlim=(-5,5),color=['green']*5+['orange']*5,title='Distributions - Features and Target')
y.plot.kde(ax=ax,legend=True,linestyle='--',color='red') # target, drawn on the same axes
```

2. A simple univariate regression of each of the ten features against the target variable.

```
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt

sns.set(style="dark")

# Set up the matplotlib figure
fig, axes = plt.subplots(4, 3, figsize=(8, 6), sharex=True, sharey=True)

# Rotate the starting point around the cubehelix hue circle
for ax, s in zip(axes.flat, range(10)):
    cmap = sns.cubehelix_palette(start=s, light=1, as_cmap=True)
    x = X.iloc[:,s]
    sns.regplot(x=x, y=y, fit_reg=True, marker=',', scatter_kws={'s':1}, ax=ax, color='salmon')
    ax.set(xlim=(-5, 5), ylim=(-5, 5))
    ax.text(x=0, y=0, s=x.name.upper(), color='black',
            ha='center', va='center', family='sans-serif', fontsize=20)

fig.tight_layout()
fig.suptitle("Univariate Regressions for Features", y=1.05, fontsize=20)
```

3. A clustermap showing the correlations between features.

```
from scipy.cluster import hierarchy
from scipy.spatial import distance

corr_matrix = X.corr()
correlations_array = np.asarray(corr_matrix)
# the linkage lines were garbled; reconstructed here as average-linkage clustering
# over the rows/columns of the correlation matrix
row_linkage = hierarchy.linkage(distance.pdist(correlations_array), method='average')
col_linkage = hierarchy.linkage(distance.pdist(correlations_array.T), method='average')
g = sns.clustermap(corr_matrix, row_linkage=row_linkage, col_linkage=col_linkage,
                   row_cluster=True, col_cluster=True, figsize=(5,5), cmap='Greens', center=0.5)
plt.setp(g.ax_heatmap.yaxis.get_majorticklabels(), rotation=0)
plt.show()
label_order = corr_matrix.iloc[:,g.dendrogram_row.reordered_ind].columns
```

7

1. A set of simple linear regression models.

2. An ensemble of tree models, in this case using the ExtraTrees algorithm.

```
from sklearn.base import clone
from sklearn.linear_model import LinearRegression

def make_walkforward_model(features, outcome, algo=LinearRegression()):
    recalc_dates = features.resample('Q',level='date').mean().index.values[:-1]

    ## Train models
    models = pd.Series(index=recalc_dates)
    for date in recalc_dates:
        X_train = features.xs(slice(None,date),level='date',drop_level=False)
        y_train = outcome.xs(slice(None,date),level='date',drop_level=False)
        #print(f'Train with data prior to: {date} ({y_train.count()} obs)')

        model = clone(algo)
        model.fit(X_train,y_train)
        models.loc[date] = model

    begin_dates = models.index
    end_dates = models.index[1:].append(pd.to_datetime(['2099-12-31']))

    ## Generate OUT OF SAMPLE walk-forward predictions
    predictions = pd.Series(index=features.index)
    for i,model in enumerate(models): # loop through each model in the collection
        #print(f'Using model trained on {begin_dates[i]}, Predict from: {begin_dates[i]} to: {end_dates[i]}')
        X = features.xs(slice(begin_dates[i],end_dates[i]),level='date',drop_level=False)
        p = pd.Series(model.predict(X),index=X.index)
        predictions.loc[X.index] = p

    return models, predictions
```
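The two mechanics doing the heavy lifting in `make_walkforward_model` are the quarterly refit dates from `resample('Q', level='date')` and the slicing of the `(date, symbol)` MultiIndex via `.xs`. A self-contained toy illustration (the dates, symbols, and feature values here are hypothetical):

```python
import numpy as np
import pandas as pd

dates = pd.date_range('2020-01-01', '2020-12-31', freq='B')
idx = pd.MultiIndex.from_product([dates, ['AAPL', 'MSFT']],
                                 names=['date', 'symbol'])
features = pd.DataFrame({'f1': np.arange(len(idx), dtype=float)}, index=idx)

# Quarter-end refit dates, dropping the last (nothing out-of-sample after it)
recalc_dates = features.resample('Q', level='date').mean().index.values[:-1]
print(recalc_dates)

# Training slice: all rows up to and including the first refit date,
# across every symbol, keeping the full (date, symbol) index intact
X_train = features.xs(slice(None, recalc_dates[0]), level='date', drop_level=False)
print(X_train.index.get_level_values('date').max())
```

Each model is therefore fit only on data available as of its refit date, and then used to predict the quarter that follows, which is what makes the predictions genuinely out-of-sample.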

```
from sklearn.linear_model import LinearRegression
from sklearn.ensemble import ExtraTreesRegressor

linear_models,linear_preds = make_walkforward_model(X,y,algo=LinearRegression())
tree_models,tree_preds = make_walkforward_model(X,y,algo=ExtraTreesRegressor())
```

```
print("Models:")
print(linear_models.head())
print()
print("Predictions:")
print(linear_preds.dropna().head())
```
```
Models:
2012-03-31    LinearRegression(copy_X=True, fit_intercept=Tr...
2012-06-30    LinearRegression(copy_X=True, fit_intercept=Tr...
2012-09-30    LinearRegression(copy_X=True, fit_intercept=Tr...
2012-12-31    LinearRegression(copy_X=True, fit_intercept=Tr...
2013-03-31    LinearRegression(copy_X=True, fit_intercept=Tr...
dtype: object

Predictions:
date        symbol
2012-04-02  AAPL     -0.786846
            CSCO     -1.518537
            INTC      0.145496
            MSFT     -0.677892
2012-04-03  AAPL      0.403579
dtype: float64
```

```
pd.DataFrame([model.coef_ for model in linear_models],
             columns=X.columns,index=linear_models.index).plot(title='Weighting Coefficients for \nLinear Model')
```

```
from sklearn.metrics import r2_score, mean_absolute_error

def calc_scorecard(y_pred, y_true):

    def make_df(y_pred, y_true):
        y_pred.name = 'y_pred'
        y_true.name = 'y_true'

        df = pd.concat([y_pred,y_true],axis=1).dropna()

        df['sign_pred'] = df.y_pred.apply(np.sign)
        df['sign_true'] = df.y_true.apply(np.sign)
        df['is_correct'] = 0
        df.loc[df.sign_pred * df.sign_true > 0 ,'is_correct'] = 1 # only registers 1 when prediction was made AND it was correct
        df['is_incorrect'] = 0
        df.loc[df.sign_pred * df.sign_true < 0,'is_incorrect'] = 1 # only registers 1 when prediction was made AND it was wrong
        df['is_predicted'] = df.is_correct + df.is_incorrect
        df['result'] = df.sign_pred * df.y_true
        return df

    df = make_df(y_pred, y_true)

    scorecard = pd.Series()
    # building block metrics
    scorecard.loc['RSQ'] = r2_score(df.y_true,df.y_pred)
    scorecard.loc['MAE'] = mean_absolute_error(df.y_true,df.y_pred)
    scorecard.loc['directional_accuracy'] = df.is_correct.sum()*1. / (df.is_predicted.sum()*1.)*100
    scorecard.loc['edge'] = df.result.mean()
    scorecard.loc['noise'] = df.y_pred.diff().abs().mean()
    # derived metrics
    scorecard.loc['edge_to_noise'] = scorecard.loc['edge'] / scorecard.loc['noise']
    scorecard.loc['edge_to_mae'] = scorecard.loc['edge'] / scorecard.loc['MAE']
    return scorecard

calc_scorecard(y_pred=linear_preds,y_true=y).rename('Linear')
```
```
RSQ                      0.027760
MAE                      1.784532
directional_accuracy    53.788634
edge                     0.278431
noise                    0.530620
edge_to_noise            0.524727
edge_to_mae              0.156024
Name: Linear, dtype: float64
```
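The less familiar scorecard metrics, `directional_accuracy` and `edge`, can be checked by hand on a tiny made-up example. (The scorecard itself compares `sign_pred * sign_true > 0` so that zero predictions count as "no prediction"; the simplified comparison below assumes all signs are non-zero.)

```python
import numpy as np
import pandas as pd

y_pred = pd.Series([1.0, -2.0, 0.5, -1.0])  # hypothetical predictions
y_true = pd.Series([2.0, -1.0, -0.5, 3.0])  # hypothetical outcomes

sign_pred = np.sign(y_pred)
sign_true = np.sign(y_true)

# share of predictions whose direction matched the realized outcome
directional_accuracy = (sign_pred == sign_true).mean() * 100  # 2 of 4 -> 50.0

# 'edge': average payoff of a unit position taken in the predicted direction
edge = (sign_pred * y_true).mean()  # (2 + 1 - 0.5 - 3) / 4 = -0.125

print(directional_accuracy, edge)
```

So a model can be right more than half the time and still have negative edge if its wrong calls land on the largest moves, which is why the scorecard reports both.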

```
def scores_over_time(y_pred, y_true):
    df = pd.concat([y_pred,y_true],axis=1).dropna().reset_index().set_index('date')
    scores = df.resample('A').apply(lambda df: calc_scorecard(df[y_pred.name],df[y_true.name]))
    return scores

scores_by_year = scores_over_time(y_pred=linear_preds,y_true=y)
print(scores_by_year.tail(3).T)
scores_by_year['edge_to_mae'].plot(title='Prediction Edge vs. MAE')
```
```
date                  2016-12-31  2017-12-31  2018-12-31
RSQ                     0.028539    0.017950   -0.006912
MAE                     1.777784    1.726196    1.779631
directional_accuracy   55.853175   52.104208   53.813559
edge                    0.274354    0.254843    0.254830
noise                   0.514929    0.502813    0.503823
edge_to_noise           0.532799    0.506835    0.505792
edge_to_mae             0.154323    0.147633    0.143193
```

8

```
from sklearn.linear_model import LassoCV

def prepare_Xy(X_raw, y_raw):
    '''Utility function to drop any samples without both valid X and y values'''
    Xy = X_raw.join(y_raw).replace({np.inf:None,-np.inf:None}).dropna()
    X = Xy.iloc[:,:-1]
    y = Xy.iloc[:,-1]
    return X, y

X_ens, y_ens = prepare_Xy(X_raw=pd.concat([linear_preds.rename('linear'),tree_preds.rename('tree')],axis=1),
                          y_raw=y)

ensemble_models,ensemble_preds = make_walkforward_model(X_ens,y_ens,algo=LassoCV(positive=True))
ensemble_preds = ensemble_preds.rename('ensemble')
print(ensemble_preds.dropna().head())
```
```
date        symbol
2012-07-02  AAPL      0.464468
            CSCO      0.238618
            INTC     -0.008967
            MSFT      0.864243
2012-07-03  AAPL      0.437890
Name: ensemble, dtype: float64
```

```
pd.DataFrame([model.coef_ for model in ensemble_models],
             columns=X_ens.columns,index=ensemble_models.index).plot(title='Weighting Coefficients for \nSimple Two-Model Ensemble')
```

9

```
# calculate scores for each model
score_ens = calc_scorecard(y_pred=ensemble_preds,y_true=y_ens).rename('Ensemble')
score_linear = calc_scorecard(y_pred=linear_preds,y_true=y_ens).rename('Linear')
score_tree = calc_scorecard(y_pred=tree_preds,y_true=y_ens).rename('Tree')

scores = pd.concat([score_linear,score_tree,score_ens],axis=1)
scores.loc['edge_to_noise'].plot.bar(color='grey',legend=True)
scores.loc['edge'].plot(color='green',legend=True)
scores.loc['noise'].plot(color='red',legend=True)

plt.show()
print(scores)
```

1. The ensemble is more effective overall than any of the base models.

2. All of the models appear to improve over time, as they accumulate more data to train on.

3. The ensemble also appears more consistent over time. Just as a diversified stock portfolio should be less volatile than the individual stocks within it, a diversified collection of models tends to perform more steadily over time.
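The portfolio analogy can be verified numerically: averaging two models with independent, equal-sized errors shrinks the error's standard deviation by roughly a factor of √2. This is a hypothetical simulation, not this article's data:

```python
import numpy as np

rng = np.random.RandomState(42)
n = 100_000
y = rng.randn(n)
model_a = y + rng.randn(n)  # unbiased, unit-variance error
model_b = y + rng.randn(n)  # independent error of the same size
ensemble = 0.5 * (model_a + model_b)

# the ensemble's error std is ~1/sqrt(2) of a single model's
print(np.std(model_a - y), np.std(ensemble - y))
```

The reduction holds only to the extent that the base models' errors are uncorrelated, which is exactly why diverse model types help an ensemble.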

1. More model types: add SVMs, deep-learning models, regularized regressions, and dimensionality-reduction models.

2. More hyperparameter combinations: try several sets of hyperparameters for a given algorithm.

3. Orthogonal feature sets: try training the base models on different subsets of the features. Limiting each base model to a suitably small number of features helps avoid the "curse of dimensionality".
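The third idea can be sketched by giving each base model a disjoint slice of the feature columns and letting a meta-model combine the specialists. The data and subsets below are hypothetical, and for brevity the base predictions are in-sample; in practice they should be walk-forward predictions as in this article:

```python
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.RandomState(0)
X = rng.randn(1000, 10)
y = X @ rng.randn(10) + 0.1 * rng.randn(1000)

# Each base model sees only a disjoint subset of the features
subsets = [list(range(0, 5)), list(range(5, 10))]
base_preds = np.column_stack([
    LinearRegression().fit(X[:, cols], y).predict(X[:, cols])
    for cols in subsets])

# The meta-model combines the subset specialists
meta = LinearRegression().fit(base_preds, y)
print(meta.coef_.round(2))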

Article No. 43 of 2020