Predicting the end: the ROC story cloze task

Can you teach common sense to an NLP model? And can you design a dataset that tests it?

Sep 17 · 4 min read

In the ROC story cloze task, an NLP model receives a four-sentence story context and must pick the more plausible of two candidate endings. Both the training and evaluation stories are crowdsourced through MTurk and depict everyday life, but they differ in form: the roughly 50,000 training stories are complete, coherent five-sentence stories, whereas the validation set contains 1,871 four-sentence contexts, each paired with one right and one wrong ending. The research goal is not just to pick the correct ending but to build models that can understand and generate “common sense” relations between everyday events.
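To make the two data formats concrete, here is a minimal sketch of what one training item and one validation item might look like. The story text below is invented purely for illustration; it is not taken from the actual corpus, and the field names are my own assumption about a reasonable layout.

```python
# Hypothetical illustration of the two data formats (story text invented,
# not from the real ROC corpus; field names are assumptions).
train_example = {
    # Training stories are complete five-sentence stories.
    "sentences": [
        "Tom bought a new bike.",
        "He rode it to work every day.",
        "One morning the tire went flat.",
        "He walked the bike to a repair shop.",
        "By evening he was riding home again.",
    ]
}

val_example = {
    # Validation items pair a four-sentence context with two candidate
    # endings, exactly one of which is correct.
    "context": [
        "Tom bought a new bike.",
        "He rode it to work every day.",
        "One morning the tire went flat.",
        "He walked the bike to a repair shop.",
    ],
    "endings": [
        "By evening he was riding home again.",
        "Tom decided he hated bicycles forever.",
    ],
    "label": 0,  # index of the correct ending
}
```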

ROC validation example.

When Mostafazadeh et al. released the corpus in 2016, they found that humans could identify the correct story ending with 100% accuracy. However, their best performing model (the deep structured semantic model) could only achieve a 7.2 point improvement (58.5% accuracy) over constant-choose-first. Mostafazadeh et al. hypothesized that richer semantic representations were needed to complete the task, and hoped the NLP community would develop models that generalize well to new concepts and situations.

In this article, we discuss the paper “Pay Attention to the Ending: Strong Neural Baselines for the ROC Story Cloze Task” by Cai et al., which

  1. achieves an accuracy of 74.7% by only training on the validation set
  2. achieves an accuracy of 72.5% by only training on story endings

In other words, a model can pick the correct ending without ever reading the story context! The paper also finds that humans who haven’t read the story contexts can distinguish between right and wrong endings with 78% accuracy.

The Model

Cai et al. train bi-directional LSTMs to encode sequences of words or sentences, and test different modeling variations for scoring endings and encoding the plot.

In EncWords, a bi-LSTM RNN encodes each word of a sentence, and the sentence representation is obtained by adding the forward and backward hidden states at each word and then averaging across words.

Here f_j and b_j are the forward and backward LSTM hidden states at word j of sentence w_i; the resulting sentence representation is S_i = (1/n) Σ_j (f_j + b_j).
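The EncWords pooling step can be sketched in plain Python. This is a minimal illustration that assumes the bi-LSTM hidden states have already been computed (here passed in as lists of floats); the function name is mine, not the paper's.

```python
def enc_words_pool(forward_states, backward_states):
    """EncWords pooling sketch: add the forward and backward hidden
    state at each word, then average across words to get one
    sentence vector. Hidden states are assumed precomputed."""
    assert len(forward_states) == len(backward_states)
    n = len(forward_states)          # number of words
    dim = len(forward_states[0])     # hidden-state dimension
    # Element-wise sum of forward and backward state at each word.
    summed = [[f[d] + b[d] for d in range(dim)]
              for f, b in zip(forward_states, backward_states)]
    # Average across words, dimension by dimension.
    return [sum(word[d] for word in summed) / n for d in range(dim)]
```

For example, with two words and 2-dimensional states, `enc_words_pool([[1.0, 2.0], [3.0, 4.0]], [[0.0, 1.0], [1.0, 0.0]])` averages the per-word sums `[1, 3]` and `[4, 4]` to give `[2.5, 3.5]`.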

In EncSents, a bi-LSTM RNN encodes a sequence of sentence representations, and the sequence representation is obtained by adding the final hidden vectors of the forward and backward passes.

F_{-1} and B_{-1} are the final hidden vectors of the forward pass F and the backward pass B, respectively; the sequence representation is F_{-1} + B_{-1}.
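The EncSents combination step is even simpler than EncWords: only the final hidden vectors are used. Again a stdlib-only sketch assuming the hidden states are precomputed, with a function name of my choosing.

```python
def enc_sents_pool(forward_states, backward_states):
    """EncSents combination sketch: represent the whole sequence by
    the element-wise sum of the FINAL forward hidden vector and the
    FINAL backward hidden vector (F_{-1} + B_{-1})."""
    f_last = forward_states[-1]   # last hidden state of forward pass
    b_last = backward_states[-1]  # last hidden state of backward pass
    return [x + y for x, y in zip(f_last, b_last)]
```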

EncWords and EncSents are used to build two models of multi-sentence representation.

  1. Hier: each sentence is encoded with EncWords, and the resulting sentence representations are encoded with EncSents.
  2. Flat: the sentences are concatenated into one long “mega-sentence,” to which EncWords is applied directly.
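The difference between the two compositions can be sketched with placeholder encoders. The toy encoders in the usage below (`len` and `sum`) stand in for EncWords and EncSents and are not real bi-LSTMs; the structure of the composition is the point.

```python
def hier_encode(sentences, enc_words, enc_sents):
    """Hier sketch: encode each sentence with EncWords, then encode
    the sequence of sentence vectors with EncSents."""
    sentence_vecs = [enc_words(s) for s in sentences]
    return enc_sents(sentence_vecs)

def flat_encode(sentences, enc_words):
    """Flat sketch: concatenate all sentences into one long
    'mega-sentence' and apply EncWords to it directly."""
    mega = [word for sentence in sentences for word in sentence]
    return enc_words(mega)
```

With toy encoders, `hier_encode([["a", "b"], ["c"]], len, sum)` computes per-sentence lengths `[2, 1]` and sums them, while `flat_encode([["a", "b"], ["c"]], len)` measures the concatenated sequence directly; both give 3 here, but in general the two compositions produce different representations.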

With encoders for words, sentences, and multi-sentence sequences in place, Cai et al. test two ways of scoring endings against the 4-sentence contexts.

  1. PlotEnd: the 4-sentence context is encoded separately from the ending, and the two encodings are scored together. The resulting representation is in the form <Enc(plot), Enc(end)>
  2. Story: the plot and ending are encoded together and then scored. The resulting representation is in the form <Enc(plot+end)>
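The two scoring schemes can also be sketched with placeholders. As before, the toy `encode` and scorer functions in the usage are stand-ins of my own, not the paper's networks.

```python
def plot_end_score(context, ending, encode, score_pair):
    """PlotEnd sketch: encode the 4-sentence plot and the ending
    separately, then score the pair <Enc(plot), Enc(end)>."""
    return score_pair(encode(context), encode(ending))

def story_score(context, ending, encode, score_single):
    """Story sketch: encode plot and ending together as one sequence,
    then score <Enc(plot+end)>."""
    return score_single(encode(context + ending))
```

For instance, with `encode = len` and trivial scorers, both schemes reduce to counting the five sentences of a completed story, though the real models score learned vector representations.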

Furthermore, they test an attention mechanism conditioned on the representation of the candidate ending. A feed-forward network takes the final representation of the story and outputs a score for the likelihood of the story, and training minimizes a hinge loss between the scores of the correct and incorrect endings.
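The training objective is a standard pairwise hinge loss, which can be written in a few lines. The margin value below is a common default of my own choosing; the article does not state the margin the authors used.

```python
def hinge_loss(score_correct, score_wrong, margin=1.0):
    """Pairwise hinge loss sketch: the loss is zero once the correct
    ending's score beats the wrong ending's score by at least
    `margin`; otherwise it grows linearly with the violation."""
    return max(0.0, margin - score_correct + score_wrong)
```

For example, `hinge_loss(3.0, 1.0)` is 0.0 (the margin is satisfied), while `hinge_loss(1.0, 1.0)` is 1.0 (a tie still incurs the full margin as loss).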


Without attention, the Story scoring outperforms the PlotEnd scoring across both Hier and Flat encodings. However, adding attention is especially useful in PlotEnd + Hier, which achieves an accuracy (74.7%) close to the state-of-the-art result (75.2%).

However, much of the model’s performance is due to biases in the task rather than commonsense reasoning. When the same PlotEnd + Hier model is trained and evaluated on the story endings alone, it still achieves 72.5% accuracy. Similarly, a human annotator shown 100 ending pairs selected the more likely ending 78% of the time without reading the stories. For example, “I practice all the time now” is more likely than “I hope I drop the batons.” Although some common sense is involved in selecting the correct ending, plot-based reasoning is not needed. Cai et al. note that even simpler heuristics can provide several percentage points’ improvement over constant-choose-first.

  • sentiment: picking the ending with more positive sentiment score improves constant-choose-first by 7 pp
  • negation: picking the ending with fewer negations provides a 4 pp improvement
  • length: picking the longer ending provides a 2 pp improvement
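Two of these heuristics are easy to demonstrate in code. The sketch below applies negation count and then length as a tie-breaker; the tiny negation list is an illustrative assumption, not the paper's lexicon, and a real sentiment heuristic would additionally need a sentiment scorer.

```python
def heuristic_pick(ending_a, ending_b,
                   negations=("not", "n't", "never", "no ")):
    """Context-free heuristic sketch: prefer the ending with fewer
    negation markers; break ties by picking the longer ending.
    (Negation list is an illustrative assumption, and naive substring
    counting will overcount overlapping markers.)"""
    def neg_count(ending):
        return sum(ending.lower().count(n) for n in negations)
    na, nb = neg_count(ending_a), neg_count(ending_b)
    if na != nb:
        return ending_a if na < nb else ending_b  # fewer negations wins
    return ending_a if len(ending_a) >= len(ending_b) else ending_b
```

That such context-free rules beat constant-choose-first at all is exactly the bias the authors are highlighting.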

Furthermore, some common words appear exclusively in correct or exclusively in incorrect endings.

Some words are exclusive to correct or incorrect endings

These results highlight that appropriate baselines are needed to account for biases in NLP tasks: constant-choose-first accuracy is an overly generous baseline! A model that learns “common sense” ending prediction based on story plots must outperform a “sense-less” model that predicts endings without knowing the story plot.