To support the design of better metrics for language generation tasks such as machine translation and paraphrasing, we have developed a unit testing framework. Given a dataset, a metric, and a set of corruptions, our code generates a corrupted version of each sentence in the corpus and checks whether the metric, scoring both versions against a set of reference sentences, can identify the original.

For example, if the sentence is "A man is playing a harp" and the corruption is "negated subject", then the corrupted sentence will be "There is no man playing a harp". A metric passes this test if it assigns a better score to the original sentence than to the corrupted one, since the original is semantically closer to the reference sentences.
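
The pass/fail check described above can be sketched in a few lines of Python. This is a minimal illustration and not the framework's actual API: the names `unigram_precision`, `passes_unit_test`, and `accuracy` are hypothetical, the toy metric is a stand-in for a real one, the two reference sentences are invented for the example, and "better" is assumed to mean a higher score.

```python
from collections import Counter
from typing import Callable, List, Tuple

# A metric maps (candidate, references) to a score; higher is assumed better.
Metric = Callable[[str, List[str]], float]


def unigram_precision(candidate: str, references: List[str]) -> float:
    """Toy stand-in metric: best clipped unigram precision over the references."""
    cand = Counter(candidate.lower().split())
    total = sum(cand.values())
    if total == 0:
        return 0.0
    best = 0.0
    for ref in references:
        ref_counts = Counter(ref.lower().split())
        # Counter intersection keeps the minimum count of each shared token.
        overlap = sum((cand & ref_counts).values())
        best = max(best, overlap / total)
    return best


def passes_unit_test(metric: Metric, original: str, corrupted: str,
                     references: List[str]) -> bool:
    """The metric passes if it scores the original strictly above the corruption."""
    return metric(original, references) > metric(corrupted, references)


def accuracy(metric: Metric, cases: List[Tuple[str, str, List[str]]]) -> float:
    """Fraction of (original, corrupted, references) cases the metric passes."""
    if not cases:
        return 0.0
    passed = sum(passes_unit_test(metric, o, c, r) for o, c, r in cases)
    return passed / len(cases)


# The example from the text; the references are hypothetical.
original = "A man is playing a harp"
corrupted = "There is no man playing a harp"
references = ["A man plays a harp", "A man is playing the harp"]

result = passes_unit_test(unigram_precision, original, corrupted, references)
print(result)
```

With these invented references the toy metric happens to prefer the original sentence. Averaging the pass/fail outcome over every sentence in a corpus, as `accuracy` does, gives one number per corruption type for each metric.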

  • Free software: Apache v2.0 license

Boag, William, Renan Campos, Kate Saenko, and Anna Rumshisky. "MUTT: Metric Unit TesTing for Language Generation Tasks." Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), Berlin, Germany, August 7-12, 2016.