In the ROC Story Cloze task, an NLP model receives a four-sentence story context and must pick the more plausible of two candidate story endings. Both the training and evaluation stories are crowdsourced through MTurk and describe everyday, non-fictional life, but they differ in form: the 50,000 training stories are complete, coherent five-sentence stories, whereas the validation set contains 1,871 four-sentence contexts, each paired with one right and one wrong ending. The research goal is not just to pick the correct ending, as in the cloze task itself, but to design models that can understand and generate “common sense” relations.
When Mostafazadeh et al. released the corpus in 2016, they found that humans could identify the correct story ending with 100% accuracy. However, their best-performing model (the deep structured semantic model) reached only 58.5% accuracy, a 7.2-point improvement over the constant-choose-first baseline. Mostafazadeh et al. hypothesized that richer semantic representations were needed to complete the task, and hoped the NLP community would develop models that generalize well to new concepts and situations.
In this article, we discuss the paper “Pay Attention to the Ending: Strong Neural Baselines for the ROC Story Cloze Task” by Cai et al., which shows that a model that looks only at the candidate endings, never at the story context, comes close to state-of-the-art accuracy.
In other words, the model can predict the story ending without having read the story! The paper also finds that humans who haven’t read story contexts can distinguish between right and wrong endings with 78% accuracy.
Cai et al. train bi-directional LSTMs to encode sequences of words or sentences, and test different modeling variations for scoring endings and encoding the plot.
In EncWords, a bi-LSTM RNN is trained to encode each word of a sentence, and the sentence representation is obtained by adding the forward and backward vectors at each word and then averaging across words.
In EncSents, a bi-LSTM RNN is trained to encode a sequence of sentence representations by adding the final hidden vector of the forward and backward pass.
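The two pooling schemes can be sketched as follows (a minimal NumPy sketch, assuming the bi-LSTM hidden states have already been computed; the array names are ours, not the paper's):

```python
import numpy as np

def enc_words(fwd_states, bwd_states):
    """EncWords pooling: sum the forward and backward bi-LSTM vectors
    at each word, then average across words to get a sentence vector."""
    # fwd_states, bwd_states: (num_words, hidden_dim) arrays
    return (fwd_states + bwd_states).mean(axis=0)

def enc_sents(fwd_final, bwd_final):
    """EncSents pooling: add the final hidden vectors of the forward
    and backward passes over a sequence of sentence representations."""
    # fwd_final, bwd_final: (hidden_dim,) arrays
    return fwd_final + bwd_final
```

Both schemes produce a single fixed-size vector regardless of sequence length, which is what lets the later scoring networks operate on whole stories.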
EncWords and EncSents are composed to build two multi-sentence representations: Flat, which applies EncWords to the concatenated word sequence of all the sentences, and Hier, which applies EncSents to the per-sentence EncWords representations.
Once words, sentences, and multi-sentence sequences can be encoded, Cai et al. test two ways of scoring candidate endings against the four-sentence contexts: Story, which encodes the context and ending together as a single five-sentence story, and PlotEnd, which encodes the plot and the ending separately.
They also test an attention mechanism conditioned on the representation of the candidate ending. A feed-forward network takes the final story representation and outputs a plausibility score; training minimizes a hinge loss between the scores of the correct and incorrect endings.
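A minimal sketch of ending-conditioned attention and the pairwise hinge loss. The dot-product scoring inside the attention and the margin value are our assumptions; the paper's exact parameterization may differ:

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def attend(context_states, ending_vec):
    """Weight each context position by its affinity to the candidate
    ending (dot-product scoring is an assumption), then summarize."""
    weights = softmax(context_states @ ending_vec)  # (num_positions,)
    return weights @ context_states                 # (hidden_dim,)

def hinge_loss(score_correct, score_wrong, margin=1.0):
    """Zero once the correct ending outscores the wrong one by the margin."""
    return max(0.0, margin - score_correct + score_wrong)
```

The hinge loss only pushes the scores apart until the margin is met, so the model is trained to rank endings rather than to predict absolute plausibilities.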
Without attention, Story scoring outperforms PlotEnd scoring across both the Hier and Flat encodings. Adding attention, however, is especially useful for PlotEnd + Hier, which achieves 74.7% accuracy, close to the state-of-the-art result of 75.2%.
However, much of the model’s performance is due to biases in the task rather than commonsense reasoning. When the same PlotEnd + Hier model is run on just the story endings, it achieves 72.5% accuracy. Similarly, when a human annotator was given 100 ending pairs, he was able to select the more likely ending 78% of the time without reading the story. For example, “I practice all the time now” is more likely than “I hope I drop the batons.” Although some common sense is involved in selecting the correct ending, no plot-based reasoning is needed. Cai et al. note that even simpler heuristics provide several percentage points of improvement over constant-choose-first.
Furthermore, some common words appear exclusively in correct or exclusively in incorrect endings.
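This kind of lexical bias is easy to probe. A toy sketch (the example endings below are hypothetical; the real analysis runs over the full validation set):

```python
# Toy candidate endings, grouped by correctness (hypothetical examples).
correct_endings = ["i practice all the time now", "she was thrilled"]
wrong_endings = ["i hope i drop the batons", "she was devastated"]

def exclusive_words(group_a, group_b):
    """Words that appear in group_a's endings but never in group_b's."""
    a = {w for s in group_a for w in s.split()}
    b = {w for s in group_b for w in s.split()}
    return a - b

only_in_correct = exclusive_words(correct_endings, wrong_endings)
```

A classifier can exploit any word that lands exclusively in one group, which is exactly how an endings-only model gains accuracy without ever seeing the plot.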
These results highlight that appropriate baselines are needed to account for biases in NLP tasks: beating constant-choose-first is too low a bar. A model that learns “common sense” ending prediction from story plots must outperform a “sense-less” model that predicts endings without ever seeing the plot.