Data crawling

Data collection through Data Crawling and Data Scraping

To train a Machine Translation engine, you need data, and that data can be obtained in different ways.

You can start with publicly available parallel corpora. They are either sold commercially or made available by international organisations, such as the European Union. They are a good start, but public corpora may not align with your company’s specific communication rules regarding terminology, syntax or style.

The best option is to train your engines using existing bilingual company assets (e.g. files in XLIFF format or translation memories). If your company has them, they offer great leverage for quality output and the best possible foundation for training a company-specific or company- and domain-specific MT engine. If your company has not yet capitalised on its linguistic assets, then the next best thing to do is to align your existing content. This can be achieved, for example, by crawling and scraping your intranet, your help guides or whatever content is relevant to the area or domain you want to machine-translate.
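As an illustration of how existing XLIFF files can be turned into training pairs, here is a minimal Python sketch. It assumes XLIFF 1.2 files (XLIFF 2.x uses a different namespace and structure); the function name `extract_pairs` and the sample content are hypothetical and only serve the example.

```python
import xml.etree.ElementTree as ET

# Assumption: the files are XLIFF 1.2. XLIFF 2.x needs a different namespace.
XLIFF_NS = {"x": "urn:oasis:names:tc:xliff:document:1.2"}


def extract_pairs(xliff_text):
    """Return (source, target) text pairs from an XLIFF 1.2 string."""
    root = ET.fromstring(xliff_text)
    pairs = []
    for unit in root.iterfind(".//x:trans-unit", XLIFF_NS):
        source = unit.find("x:source", XLIFF_NS)
        target = unit.find("x:target", XLIFF_NS)
        if source is None or target is None:
            continue  # skip units that have no translation yet
        # itertext() flattens inline markup so only the plain text remains.
        src = "".join(source.itertext()).strip()
        tgt = "".join(target.itertext()).strip()
        if src and tgt:
            pairs.append((src, tgt))
    return pairs


# Hypothetical sample document, used only to demonstrate the function.
SAMPLE = """<?xml version="1.0"?>
<xliff version="1.2" xmlns="urn:oasis:names:tc:xliff:document:1.2">
  <file source-language="en" target-language="de" datatype="plaintext" original="demo">
    <body>
      <trans-unit id="1"><source>Save the file.</source><target>Speichern Sie die Datei.</target></trans-unit>
      <trans-unit id="2"><source>Untranslated</source></trans-unit>
    </body>
  </file>
</xliff>"""

if __name__ == "__main__":
    for src, tgt in extract_pairs(SAMPLE):
        print(src, "|", tgt)
```

Units without a target are skipped, so only segments that were actually translated end up in the training data.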

Should you not be able to leverage internal documentation or legacy sites, the next best option is to scrape data from the Internet by crawling publicly available websites. This can be done manually or through bots. Although it sounds simple in principle, its execution can be tricky and requires specific expertise.

  1. First, you need to make sure that the content you crawl is of appropriate quality. To achieve this, you must be able to identify and exclude, manually or automatically, all machine-produced content, because training your engine on machine-produced output only amplifies its mistakes.
  2. Second, you need some technical knowledge to do this thoroughly. AJAX websites, for example, cannot be fully crawled with the wget command, because wget does not execute the JavaScript that loads their content.
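One common way to find candidate parallel pages while crawling is to read the `hreflang` alternate links that many multilingual sites declare in their HTML head. The Python sketch below shows the idea under stated assumptions: the site exposes these links in its static HTML (a JavaScript-rendered page may not), and the names `AlternateLinkParser` and `find_language_versions`, the sample markup and the example.com URL are hypothetical. It does not judge content quality or detect machine-produced text.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class AlternateLinkParser(HTMLParser):
    """Collects <link rel="alternate" hreflang="..."> entries from a page."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.alternates = {}  # language code -> absolute URL

    def handle_starttag(self, tag, attrs):
        if tag != "link":
            return
        attr = dict(attrs)
        rel = (attr.get("rel") or "").lower().split()
        lang = attr.get("hreflang")
        href = attr.get("href")
        if "alternate" in rel and lang and href:
            # Resolve relative links against the URL of the crawled page.
            self.alternates[lang.lower()] = urljoin(self.base_url, href)


def find_language_versions(html, base_url):
    """Return a dict mapping language codes to the URLs of page versions."""
    parser = AlternateLinkParser(base_url)
    parser.feed(html)
    return parser.alternates


# Hypothetical page source, used only to demonstrate the function.
SAMPLE_HTML = """<html><head>
<link rel="alternate" hreflang="en" href="/en/help/">
<link rel="alternate" hreflang="de" href="/de/hilfe/">
<link rel="stylesheet" href="/style.css">
</head><body><p>Help</p></body></html>"""

if __name__ == "__main__":
    versions = find_language_versions(SAMPLE_HTML, "https://example.com/en/help/")
    for lang, url in sorted(versions.items()):
        print(lang, url)
```

Pages paired this way still need to be aligned at segment level and checked for quality before they are used for training.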

At ASAP Globalizers, we have the experience and the competence to crawl, evaluate and automatically align such data to build a solid foundation for your MT training requirements. Through our international network of offices and diverse workforce, we can support many languages natively.

If reading this very short presentation made you curious about what you could achieve with a better-trained engine, you are welcome to book a consultation to discuss your data accumulation issues and requirements and explore possible solutions.