
In NLP, what is the underlying logic of the different evaluation metrics BLEU, METEOR, ROUGE, and CIDEr?


Translated from: https://www.zhihu.com/question/304798594

These metrics are all used to evaluate the quality of generated text against references. The general approach is to measure the similarity between a candidate text (usually produced by a machine) and one or more reference texts (usually written by humans). Their typical use cases differ slightly: BLEU, METEOR, and ROUGE are generally used for machine translation, while CIDEr is generally used for image captioning.

  1. BLEU
    The general idea of BLEU is to measure the overlap between the n-grams of the candidate text and those of the reference texts (from unigrams up to 4-grams in practice). The higher the overlap, the higher the translation quality. N-grams of different lengths are used because unigram precision measures the accuracy of word-level translation, while higher-order n-grams measure the fluency of the sentence.
    This metric only considers precision: it cares about how many n-grams in the candidate sentence are correct (i.e., appear in the reference sentences), and does not consider recall (i.e., n-grams in the reference sentences that never appear in the candidate). The original paper suggests using up to 4 references per sentence in the test set, so as to reduce the impact of linguistic diversity. In addition, a "brevity penalty" punishes candidate sentences that are too short (an overly short candidate usually means something is missing, which is low recall).
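    To make this concrete, here is a minimal BLEU sketch in Python. It assumes whitespace tokenization and applies no smoothing, so it is illustrative only; production implementations such as SacreBLEU handle both.

```python
import math
from collections import Counter

def ngrams(tokens, n):
    """All contiguous n-grams of a token list, as tuples."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def bleu(candidate, references, max_n=4):
    """Geometric mean of clipped n-gram precisions times a brevity penalty."""
    log_precisions = []
    for n in range(1, max_n + 1):
        cand_counts = Counter(ngrams(candidate, n))
        # Clip each candidate n-gram count by its maximum count in any one reference.
        max_ref_counts = Counter()
        for ref in references:
            for gram, count in Counter(ngrams(ref, n)).items():
                max_ref_counts[gram] = max(max_ref_counts[gram], count)
        clipped = sum(min(c, max_ref_counts[g]) for g, c in cand_counts.items())
        if clipped == 0:
            return 0.0  # one zero precision zeroes the whole geometric mean
        log_precisions.append(math.log(clipped / sum(cand_counts.values())))
    # Brevity penalty: punish candidates shorter than the closest-length reference.
    c = len(candidate)
    r = min((abs(len(ref) - c), len(ref)) for ref in references)[1]
    bp = 1.0 if c > r else math.exp(1 - r / c)
    return bp * math.exp(sum(log_precisions) / max_n)

cand = "the cat sat on the mat".split()
refs = ["the cat is on the mat".split(),
        "there is a cat on the mat".split()]
print(round(bleu(cand, refs, max_n=2), 4))  # 0.7071 on this toy pair
```

    Note that with toy-length sentences the 4-gram precision is often zero, which zeroes the geometric mean; that is why the usage line above evaluates only up to bigrams.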
  2. METEOR
    The main idea of METEOR is that a translation can be correct yet fail to match the reference exactly, for example when a synonym is used. The match set is therefore expanded with knowledge sources such as WordNet. Word forms are also considered: words sharing the same stem count as partial matches and receive partial credit (for example, translating "likes" as "like" is better than producing an unrelated word). To evaluate sentence fluency, METEOR uses the concept of a chunk (after the candidate and reference translations are aligned, each maximal run of contiguous, identically ordered matched words forms a chunk). Fewer chunks mean a longer average chunk length, which means the word order of the candidate agrees more with the reference. Finally, both recall and precision are considered, and an F-score is used as the final value.
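    Below is a heavily simplified sketch: it keeps METEOR's F-mean and chunk penalty but, as an assumption for illustration, matches only exact tokens with a greedy alignment. The real metric also aligns stems and WordNet synonyms and searches for the alignment with the fewest chunks.

```python
def meteor_sketch(candidate, reference):
    """Simplified METEOR: exact token matches only (no WordNet synonyms,
    no stemming), with a greedy left-to-right alignment."""
    used = set()
    alignment = []                      # (candidate_index, reference_index)
    for i, tok in enumerate(candidate):
        for j, ref_tok in enumerate(reference):
            if j not in used and tok == ref_tok:
                used.add(j)
                alignment.append((i, j))
                break
    m = len(alignment)
    if m == 0:
        return 0.0
    precision, recall = m / len(candidate), m / len(reference)
    # F-mean weighted 9:1 toward recall, as in the original METEOR paper.
    f_mean = 10 * precision * recall / (recall + 9 * precision)
    # A chunk is a maximal run of matches contiguous in both sentences;
    # fewer chunks means the word order agrees better.
    chunks = 1
    for (i1, j1), (i2, j2) in zip(alignment, alignment[1:]):
        if not (i2 == i1 + 1 and j2 == j1 + 1):
            chunks += 1
    penalty = 0.5 * (chunks / m) ** 3
    return f_mean * (1 - penalty)

print(round(meteor_sketch("the cat sat on the mat".split(),
                          "the cat is on the mat".split()), 4))  # 0.8067
```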
  3. ROUGE
    ROUGE is nearly the mirror image of BLEU: BLEU computes only precision, while ROUGE computes only recall.
    With NMT (Neural Machine Translation), the output is usually fluent, but the model sometimes translates blindly, for example changing names or numbers, or dropping half a sentence. This is very common.
    Hence the idea of ignoring fluency and looking only at recall (how many n-grams of the reference translation appear in the candidate translation), which reveals whether the NMT system has dropped content (missed translations lead to low recall).
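    A minimal ROUGE-N sketch (whitespace tokenization assumed; the ROUGE-L variant based on the longest common subsequence is not shown):

```python
from collections import Counter

def rouge_n(candidate, reference, n=2):
    """ROUGE-N sketch: recall, i.e. the fraction of the reference's
    n-grams (with multiplicity) that also appear in the candidate."""
    def grams(tokens):
        return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))
    ref_grams, cand_grams = grams(reference), grams(candidate)
    overlap = sum(min(count, cand_grams[gram]) for gram, count in ref_grams.items())
    total = sum(ref_grams.values())
    return overlap / total if total else 0.0

reference = "the agreement was signed yesterday in Geneva".split()
candidate = "the agreement was signed".split()       # half the sentence dropped
print(round(rouge_n(candidate, reference, n=1), 4))  # 4/7, about 0.5714: low recall
```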
  4. CIDEr
    CIDEr treats each sentence as a document and computes the cosine similarity of TF-IDF vectors (except that a term is an n-gram rather than a single word). This yields the similarity between the candidate sentence and the reference sentences; the similarities for n-grams of different lengths are then averaged to get the final score. The advantage is that TF-IDF gives different n-grams different weights, because n-grams that are common across the whole corpus carry less information. The main point in evaluating image caption generation is whether the model has captured the key information. For example, if the content of a picture is "a person swimming in a pool during the day", the most critical information is "swimming"; as long as the generated caption includes it, missing some other detail (such as "daytime") hardly matters, so an operation that down-weights non-key words is needed. In machine translation the output must be faithful to the source, so multiple translations of the same sentence should be paraphrases of each other and contain the same information; but multiple captions of the same image need not be paraphrases of each other, because different captions can cover different amounts of image detail, and both a detailed and a rough description can be correct.
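    A rough sketch of the computation, under two simplifying assumptions: IDF is estimated from whatever caption corpus is passed in, and the Gaussian length penalty of the CIDEr-D variant is omitted.

```python
import math
from collections import Counter

def cider_sketch(candidate, references, corpus, max_n=4):
    """For each n, build TF-IDF vectors over n-grams (IDF from `corpus`),
    average the cosine similarity of the candidate against each reference,
    then average over n = 1..max_n."""
    def grams(tokens, n):
        return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

    num_docs = len(corpus)
    score = 0.0
    for n in range(1, max_n + 1):
        # Document frequency of each n-gram across the corpus.
        df = Counter()
        for doc in corpus:
            df.update(set(grams(doc, n)))

        def tfidf(tokens):
            counts = grams(tokens, n)
            total = max(sum(counts.values()), 1)
            # N-grams never seen in the corpus are simply dropped here.
            return {g: (c / total) * math.log(num_docs / df[g])
                    for g, c in counts.items() if df.get(g)}

        def cosine(u, v):
            dot = sum(w * v[g] for g, w in u.items() if g in v)
            nu = math.sqrt(sum(w * w for w in u.values()))
            nv = math.sqrt(sum(w * w for w in v.values()))
            return dot / (nu * nv) if nu and nv else 0.0

        cand_vec = tfidf(candidate)
        score += sum(cosine(cand_vec, tfidf(r)) for r in references) / len(references)
    return score / max_n

corpus = [c.split() for c in [
    "a man swimming in a pool",
    "a person swims in the pool during the day",
    "a dog running on the beach",
    "two children playing football",
]]
refs = [corpus[0], corpus[1]]          # references for one image
on_topic = "a person swimming in a pool".split()
off_topic = "a dog running on the beach".split()
# The on-topic caption shares the informative (high-IDF) n-grams and scores higher.
print(cider_sketch(on_topic, refs, corpus, max_n=2) >
      cider_sketch(off_topic, refs, corpus, max_n=2))   # True
```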