Instead of obtaining human rankings of system outputs (which is expensive) and comparing them with the metric-based ranking, the idea is to modify existing references with specific transformations and examine the scores that various metrics assign to such corruptions. Our experiments dealt with three broad categories: meaning-altering, meaning-preserving, and fluency-disrupting corruptions. An illustrative example of the three corruption types is given below.
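As a minimal sketch of the setup (the corruption functions below are hypothetical placeholders rather than the project's actual scripts, and NLTK's sentence-level BLEU stands in for whichever metric is being probed):

```python
# Score corrupted variants of a reference against the original reference
# with an automatic metric. The corruptions are toy illustrations of the
# three categories; sentence-level BLEU is used only as a stand-in metric.
import random
from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction

reference = "a man is playing a guitar"

def meaning_altering(sentence):
    # insert a negation, which reverses the meaning
    tokens = sentence.split()
    return " ".join(tokens[:3] + ["not"] + tokens[3:])

def meaning_preserving(sentence):
    # a synonym substitution that keeps the meaning intact
    return sentence.replace("man", "guy")

def fluency_disrupting(sentence):
    # shuffle the word order: content words survive, fluency does not
    tokens = sentence.split()
    random.shuffle(tokens)
    return " ".join(tokens)

smooth = SmoothingFunction().method1
for name, corrupt in [("meaning-altering", meaning_altering),
                      ("meaning-preserving", meaning_preserving),
                      ("fluency-disrupting", fluency_disrupting)]:
    candidate = corrupt(reference)
    score = sentence_bleu([reference.split()], candidate.split(),
                          smoothing_function=smooth)
    print(f"{name:20s} {score:.3f}  {candidate}")
```

Even in this toy example, a surface-overlap metric can score the meaning-altering variant at or above the meaning-preserving one, which is exactly the kind of behaviour these corruption tests are designed to expose.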
Our experiments were done on the SICK dataset. In addition to being a standard corpus within the semantic textual similarity community, this dataset was chosen because its sentences contain common sentence transformation patterns. SICK was built from the 8K Image-Flickr dataset and the SemEval 2012 STS MSR-video description corpus, so we used those two corpora to find the reference sentences for the sentences in SICK.
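For illustration, this reference lookup can be thought of as a fuzzy match of each SICK sentence against the caption pool of the source corpora. The sketch below shows the idea only, not the exact matching procedure we used; the caption file layout is an assumption.

```python
# Sketch: recover reference candidates for a SICK sentence by fuzzy-matching
# it against the source captions. A Flickr8k-style file is assumed, one
# "<image_id>#<n>\t<caption>" entry per line; adjust to the real layout.
import difflib

def load_captions(path):
    captions = []
    with open(path, encoding="utf-8") as f:
        for line in f:
            _, caption = line.rstrip("\n").split("\t", 1)
            captions.append(caption.lower())
    return captions

def find_references(sick_sentence, captions, n=3, cutoff=0.6):
    # the closest captions by string similarity serve as reference candidates
    return difflib.get_close_matches(sick_sentence.lower(), captions,
                                     n=n, cutoff=cutoff)
```

A similarity cutoff rather than exact lookup is shown because SICK sentences were normalised when the dataset was built, so exact matches against the original captions are rare.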
At the moment, unit tests can only be run on the SICK dataset, as our corruption scripts use its metadata to produce some of the corruptions.
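To make the dependence on SICK metadata concrete, here is a hedged sketch: the column names follow the published SICK tab-separated layout but should be treated as assumptions, and the selection logic is illustrative rather than the scripts' actual behaviour.

```python
# Sketch: use SICK pair metadata as ready-made corruption material.
# Assumed tab-separated columns: pair_ID, sentence_A, sentence_B,
# relatedness_score, entailment_judgment (verify against your SICK file).
import csv

def sick_pairs(path="SICK_train.txt"):
    with open(path, encoding="utf-8") as f:
        yield from csv.DictReader(f, delimiter="\t")

def meaning_altering_pairs(path="SICK_train.txt"):
    # CONTRADICTION pairs pair a reference with a meaning-altering variant
    return [(r["sentence_A"], r["sentence_B"])
            for r in sick_pairs(path)
            if r["entailment_judgment"] == "CONTRADICTION"]
```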
We reported results and developed scripts for the following metrics: