In the interest of designing better metrics for language generation tasks such as machine translation and paraphrasing, we have developed a unit testing framework. Given a dataset, a metric, and a set of corruptions, our code takes the corpus, generates corrupted sentences, and checks whether the metric can identify the true sentence given a set of references.
For example, if the sentence is "A man is playing a harp" and the corruption is subject negation, then the corrupted sentence will be "There is no man playing a harp". A metric passes the test if it assigns a higher score to the original sentence, since the original is more semantically similar to the reference sentences.
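The test loop described above can be sketched as follows. This is a minimal illustration, not the framework's actual API: `overlap_metric` is a toy token-overlap score standing in for a real metric such as BLEU, and `negate_subject` is a hand-written stand-in for the corruption generator; all names and example data here are hypothetical.

```python
def overlap_metric(candidate, references):
    """Toy metric: fraction of candidate tokens found in any reference."""
    cand_tokens = candidate.lower().split()
    ref_tokens = {tok for ref in references for tok in ref.lower().split()}
    return sum(tok in ref_tokens for tok in cand_tokens) / len(cand_tokens)

def negate_subject(sentence):
    """Toy corruption: rewrite 'A <subject> is <rest>' as 'There is no <subject> <rest>'."""
    words = sentence.split()
    return "There is no " + words[1] + " " + " ".join(words[3:])

def run_test(metric, corruption, sentence, references):
    """A metric passes if it scores the original sentence above its corruption."""
    original_score = metric(sentence, references)
    corrupted_score = metric(corruption(sentence), references)
    return original_score > corrupted_score

references = ["A man plays a harp", "Someone is playing the harp"]
sentence = "A man is playing a harp"
print(run_test(overlap_metric, negate_subject, sentence, references))  # True
```

Even this crude overlap metric passes the subject-negation test on this example, because the inserted negation tokens ("There", "no") do not appear in the references; the framework applies the same original-vs-corrupted comparison across a whole corpus and a suite of corruption types.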