Technical clean ups

Preparing your data for MT use

Have you ever heard the expression “garbage in, garbage out”? Unlike in the circular economy, where garbage can be transformed into valuable output, in Machine Translation and AI in general the results depend heavily on the quality of your input, be it data or algorithms.

Training an engine on “dirty” data corrupts the engine and defeats any attempt to increase translation efficiency through machine translation. The bottom line is that good engine training requires clean data. Data cleaning is as important as data accumulation, if not more so.

Unfortunately, many public or private repositories contain a significant proportion of unclean data and are, as such, not optimal for training a customized, quality engine.

Some recent data banks providing COVID-19 data for machine learning have forcefully demonstrated the importance of this primary tenet. Engines were trained on data containing, among other things, keywords indiscriminately crawled from the Web, with an abundance of words like “plague” and “black death” and extensive references to biblical expressions. Was this appropriate for a COVID-19 medical text written in our times? Obviously not, but a machine cannot judge. The machine depends on your judgment and on the quality of your data processes.

Data quality is so essential that even a small proportion of dirty data (less than 10%) can corrupt the trained engine and its output.

So, how do you make sure that the training corpus is of good quality?

ASAP Globalizers has developed tools to help you improve your data quality. We base our judgment on sound metrics and can measure incremental gains.

There are various approaches to data quality and different ways to achieve it, and the approach you choose needs to tie in with your specific data strategy and processes.
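To make this more concrete, here is a minimal sketch in Python of a few generic heuristics that are commonly applied when cleaning a parallel (source/target) corpus: whitespace and Unicode normalization, removal of empty segments, detection of untranslated segments, deduplication, and a length-ratio check. It is an illustration only and does not describe ASAP Globalizers' tools. The function names `normalize` and `clean_parallel_corpus`, the rejection labels, and the `max_length_ratio` threshold of 3.0 are all assumptions made for this example.

```python
import unicodedata


def normalize(segment: str) -> str:
    """Apply Unicode NFC normalization and collapse runs of whitespace."""
    return " ".join(unicodedata.normalize("NFC", segment).split())


def clean_parallel_corpus(pairs, max_length_ratio=3.0):
    """Split (source, target) pairs into kept and rejected lists.

    Hypothetical helper for illustration. Rejected entries carry the
    reason as a third element. The threshold is an assumed default and
    should be tuned per language pair.
    """
    kept, rejected, seen = [], [], set()
    for source, target in pairs:
        src, tgt = normalize(source), normalize(target)
        if not src or not tgt:
            reason = "empty segment"
        elif src == tgt:
            # Identical source and target usually means untranslated text,
            # although names and codes can legitimately be identical.
            reason = "untranslated"
        elif (src, tgt) in seen:
            reason = "duplicate"
        elif max(len(src), len(tgt)) / min(len(src), len(tgt)) > max_length_ratio:
            # A very lopsided character count suggests a misaligned pair.
            reason = "length mismatch"
        else:
            reason = None

        if reason:
            rejected.append((src, tgt, reason))
        else:
            seen.add((src, tgt))
            kept.append((src, tgt))
    return kept, rejected


if __name__ == "__main__":
    sample = [
        ("Hello  world", "Bonjour le monde"),
        ("Hello world", "Bonjour le monde"),
        ("", "Vide"),
        ("OK", "OK"),
        ("Hi", "Ceci est une phrase beaucoup trop longue"),
    ]
    kept, rejected = clean_parallel_corpus(sample)
    print(f"kept {len(kept)}, rejected {len(rejected)}")
    for src, tgt, reason in rejected:
        print(f"{reason}: {src!r} / {tgt!r}")
```

Dividing the number of rejected pairs by the total gives a rough measure of how dirty a corpus is, which can be tracked from one cleaning pass to the next. Rule-based checks like these only catch mechanical problems; domain mismatches such as the COVID-19 example above call for additional, content-aware review.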

Do you want better-quality data for better automation results? @@book a consultation@@ to learn how to achieve this and which customized solution we can offer.