Machine Translation (MT) Engine Quality Evaluation
Machine translation engines require “training”, and the metric most commonly used to judge whether further training is needed is the BLEU score. The metric signals the point at which the model reaches its best achievable quality. Training beyond that point would not make sense and would result in overtraining.
Although the BLEU score may work well when evaluating a single engine, it is a tricky metric when used for comparison purposes. For instance, the same BLEU score means very different things in Google AutoML and Microsoft Custom Translator, so much so that a direct comparison would not make sense.
The BLEU score is too rough a measure for a direct comparison between engines and their output.
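And there’s more: the BLEU score of many engines can be significantly inflated. This is because the usual method of splitting the dataset into a 90% training subset and a 10% test subset does not work well with language data, where identical or near-identical sentences can end up on both sides of the split.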
To get a fair estimate of quality, you would need to purge from the evaluated output everything that matches the training set at 100%, because the neural model remembers very well what it has already seen. To improve the estimate further, you would also need to purge sentences that are close to 100% matches.
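As a minimal sketch of the first purge step, assume the training and evaluation data are available as lists of (source, target) sentence pairs. The function names `normalize` and `purge_exact_matches`, the normalisation rules (lowercasing, collapsing whitespace), and the sample sentences are illustrative assumptions, not part of any particular MT platform.

```python
import re

def normalize(text):
    """Lowercase and collapse whitespace so trivial variants compare equal."""
    return re.sub(r"\s+", " ", text.strip().lower())

def purge_exact_matches(train_pairs, test_pairs):
    """Keep only test pairs whose source sentence is absent from the training set."""
    seen = {normalize(src) for src, _ in train_pairs}
    return [(src, tgt) for src, tgt in test_pairs if normalize(src) not in seen]

# Hypothetical sample data, for illustration only.
train = [("Press the start button.", "Drücken Sie die Starttaste."),
         ("Close the lid.", "Schließen Sie den Deckel.")]
test = [("Press the  START button.", "Drücken Sie die Starttaste."),
        ("Remove the filter.", "Entfernen Sie den Filter.")]

clean_test = purge_exact_matches(train, test)
print(clean_test)
```

Only the second test pair survives here, because the first differs from a training sentence only in case and spacing. Scoring the engine on `clean_test` rather than on `test` avoids rewarding it for sentences it has simply memorised.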
And ultimately, to further increase the estimate’s precision, all high fuzzy matches should be purged from the training set, a task which is anything but trivial and cannot be performed with a traditional CAT tool (they simply were not made for that).
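The fuzzy-match purge can be sketched in a similar spirit. The example below uses the character-based similarity ratio from Python's standard `difflib` module as a stand-in for a CAT-style fuzzy match score. That substitution, the 0.85 threshold, the function names, and the sample sentences are all assumptions made for illustration.

```python
from difflib import SequenceMatcher

def max_similarity(sentence, references):
    """Highest similarity ratio (0.0 to 1.0) between a sentence and any reference."""
    return max(
        (SequenceMatcher(None, sentence.lower(), ref.lower()).ratio()
         for ref in references),
        default=0.0,
    )

def purge_fuzzy_matches(candidates, references, threshold=0.85):
    """Drop every candidate that resembles some reference at or above the threshold."""
    return [c for c in candidates if max_similarity(c, references) < threshold]

# Hypothetical sample data, for illustration only.
train_sources = ["Press the start button.",
                 "Close the lid carefully.",
                 "Replace the battery every year."]
test_sources = ["Press the stop button.", "Remove the filter."]

# Remove training sentences that are high fuzzy matches of test sentences.
kept = purge_fuzzy_matches(train_sources, test_sources)
print(kept)
```

This compares every candidate with every reference, so the cost grows with the product of the two set sizes. On a corpus of realistic size some form of indexing or pre-filtering would be needed, which is part of why the task is far from trivial.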
In short, to evaluate your MT output properly, you should use a more nuanced translation quality metric than the BLEU score.
At ASAP Globalizers we are all too aware of the quality issues and possible commercial consequences of a poor evaluation. Drawing on our experience, we can help you choose the MT engine best suited to the language combination you need. We can also help you understand how best to train it and how to evaluate its quality throughout the process.
Please book your consultation to explore viable alternatives.
## Why BLEU Scores are often inflated