Instead of obtaining human rankings of system output (expensive) and comparing them with metric-based rankings, the idea is to modify existing references with specific transformations and examine the scores that various metrics assign to such corruptions. Our experiments dealt with three broad categories: meaning-altering, meaning-preserving, and fluency-disrupting. Here is an example of the three types of corruption:

  • Original Sentence: "A man is playing a guitar."
  • Meaning-Altering: "A man is not playing a guitar."
  • Meaning-Preserving: "A guitar is being played by a man."
  • Fluency Disrupting: "A man a guitar is playing."
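A minimal sketch of how two of these corruption types might be generated. The actual scripts rely on SICK metadata and are more involved; the function names and the simple heuristics below (negating the first copula, shuffling interior words) are purely illustrative:

```python
import random

def meaning_altering(sentence):
    # Illustrative heuristic: insert "not" after the first copula/auxiliary.
    words = sentence.split()
    for i, w in enumerate(words):
        if w in ("is", "are", "was", "were"):
            return " ".join(words[:i + 1] + ["not"] + words[i + 1:])
    return None  # no simple negation site found

def fluency_disrupting(sentence, seed=0):
    # Illustrative heuristic: shuffle the words after the first one,
    # keeping the final period in place, to break word order.
    words = sentence.rstrip(".").split()
    head, body = words[0], words[1:]
    random.Random(seed).shuffle(body)
    return " ".join([head] + body) + "."

print(meaning_altering("A man is playing a guitar."))
# -> "A man is not playing a guitar."
```

A meaning-preserving corruption (e.g. active-to-passive rewriting) needs syntactic analysis and is not reducible to a one-line heuristic, which is one reason the SICK metadata is used.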


Our experiments were done on the SICK dataset. In addition to being a standard corpus within the semantic textual similarity community, this dataset was chosen because its sentences contain common sentence transformation patterns. SICK was built from the 8K Image-Flickr dataset and the SemEval 2012 STS MSR-video description corpus, so we used those datasets to find reference sentences for the sentences in SICK.
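One plausible way to recover references is to match a SICK sentence against the caption groups of the source corpora and take the remaining captions of the matched image or video as references. This is a sketch under assumed data structures (a list of caption groups, one group per image/video); the actual matching procedure may differ:

```python
def normalize(s):
    # Case-fold, trim, drop the final period, and collapse whitespace
    # so near-identical captions compare equal.
    return " ".join(s.lower().strip().rstrip(".").split())

def find_references(sick_sentence, caption_groups):
    # caption_groups: list of lists of captions describing the same item
    # (assumed layout for the 8K Image-Flickr / MSR-video corpora).
    key = normalize(sick_sentence)
    for group in caption_groups:
        if any(normalize(c) == key for c in group):
            # The sibling captions in the group serve as references.
            return [c for c in group if normalize(c) != key]
    return []
```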

At the moment, unit tests can be run only on the SICK dataset, as our corruption scripts rely on its metadata to produce some of the corruptions.


We developed scripts for, and report results on, the following metrics:

  • CIDEr
  • BLEU
  • TERp

The metrics are tested on how accurately they assign a better score to the original sentence than to its corruption, or, in the case of meaning-preserving corruptions, how often the two sentences receive close scores (within 15% of each other).
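The pass criterion can be summarized as follows. This is a sketch, not the evaluation harness itself: `metric` stands for any scoring function (CIDEr, BLEU, or negated TERp, so that higher is always better), and the function name and 15% tolerance handling are illustrative:

```python
def corruption_test_passes(metric, original, corruption, references, kind, tol=0.15):
    # metric(hypothesis, references) -> float, higher is better.
    s_orig = metric(original, references)
    s_corr = metric(corruption, references)
    if kind == "meaning-preserving":
        # The two scores should be within 15% of each other.
        return abs(s_orig - s_corr) <= tol * max(abs(s_orig), abs(s_corr), 1e-9)
    # Meaning-altering / fluency-disrupting: the original should score higher.
    return s_orig > s_corr
```

Accuracy for each corruption type is then the fraction of (original, corruption) pairs for which this check passes.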