14 April 2016

The Future of Language Resources for Machine Translation (LR4MT)

In a recent brief survey of language service suppliers (LSPs), LT-Innovate asked the translation industry how it sees the future of language data availability, specifically for machine translation (MT). The results provide food for thought when it comes to planning for the improved usability of digital text resources in the years ahead. It looks as if new developments in MT technology will go hand in hand with a growing need for the right data.

First, statistical machine translation is clearly on most LSPs’ radar screens. Hard data on the actual size of the user market for MT systems is not available today, nor is information on who uses which free or paid online services in their everyday work. But everyone who responded to our survey says they will be “using” MT in the next 2 to 3 years.

Preparing for this transition is therefore vital for the nascent language data resource sector.

Overall, 30% of our LSP respondents reckon that the data they will need to prime their MT engines will come from their clients, 78% will use their in-house translation memories and similar assets, and 70% will try to find third-party sources from outside their immediate business nexus.

15% of them would be prepared to buy such data, 70% will crawl the web, and a total of 83% expect that more free resources will become available.


Judging by our current findings from mapping publicly available LR4MT in Europe, their chances of finding the relevant resources easily look relatively small. Sharing language data is not a high-visibility phenomenon so far.

However, 39% said they did not have the necessary engineering resources in-house to transform the content they might find into viable MT data. This suggests there could be a small market for cleaning and aligning language data harvested from the web or from well-known repositories.
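To give a rough idea of what such cleaning involves, here is a minimal Python sketch, not a description of any respondent's pipeline. It assumes the harvested data is already paired as (source, target) segments; the function names and the `max_len_ratio` threshold are hypothetical choices made for illustration. It strips leftover markup, normalizes whitespace, and drops empty, untranslated, badly mismatched, and duplicate pairs.

```python
import re
import unicodedata

TAG_RE = re.compile(r"<[^>]+>")   # leftover HTML/XML tags
SPACE_RE = re.compile(r"\s+")     # runs of whitespace


def normalize(text):
    """Normalize Unicode, strip markup, and collapse whitespace."""
    text = unicodedata.normalize("NFC", text)
    text = TAG_RE.sub(" ", text)
    return SPACE_RE.sub(" ", text).strip()


def clean_pairs(pairs, max_len_ratio=3.0):
    """Filter (source, target) segment pairs with simple heuristics.

    max_len_ratio is an illustrative threshold, not a recommended value.
    """
    seen = set()
    cleaned = []
    for source, target in pairs:
        source, target = normalize(source), normalize(target)
        if not source or not target:
            continue  # one side is empty
        if source == target:
            continue  # probably untranslated
        ratio = max(len(source), len(target)) / min(len(source), len(target))
        if ratio > max_len_ratio:
            continue  # lengths too different to be a plausible translation
        if (source, target) in seen:
            continue  # exact duplicate
        seen.add((source, target))
        cleaned.append((source, target))
    return cleaned


raw = [
    ("<p>Hello  world</p>", "Bonjour le monde"),
    ("Hello world", "Bonjour le monde"),
    ("", "Vide"),
    ("Yes", "Oui, absolument, sans aucun doute possible"),
    ("Copyright 2016", "Copyright 2016"),
]
print(clean_pairs(raw))
```

Real pipelines typically add language identification and proper sentence alignment on top of heuristics like these, which is where the engineering effort mentioned above tends to go.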

When it comes to the desired quality criteria for usable language resources, by far the most important criterion (84%) was, unsurprisingly, domain relevance. Indeed, small customer- or domain-specific language models for MT are typically considered to outperform general models by a very large factor. This suggests that some serious effort will need to go into pinpointing domain relevance in any language resource supply platform, rather than relying, say, on volume as a virtue in itself.
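As a simple illustration of what "pinpointing domain relevance" could mean in practice, the sketch below scores each segment by the share of its words found in a domain term list and ranks segments accordingly. This is an assumption made for illustration only: the term list, the function names, and the scoring rule are hypothetical, and real platforms would more likely rely on metadata or trained classifiers.

```python
import re

TOKEN_RE = re.compile(r"\w+")


def tokens(text):
    """Split text into lowercase word tokens."""
    return [t.lower() for t in TOKEN_RE.findall(text)]


def domain_relevance(segment, domain_terms):
    """Return the fraction of tokens in the segment that are domain terms."""
    words = tokens(segment)
    if not words:
        return 0.0
    hits = sum(1 for w in words if w in domain_terms)
    return hits / len(words)


def rank_by_relevance(segments, domain_terms):
    """Order segments from most to least domain-relevant."""
    return sorted(
        segments,
        key=lambda s: domain_relevance(s, domain_terms),
        reverse=True,
    )


# Hypothetical term list for a finance-related domain.
finance_terms = {"invoice", "payment", "tax"}
segments = ["Nice weather today", "Payment of the invoice"]
print(rank_by_relevance(segments, finance_terms))
```

A score like this makes the point in the text concrete: a small set of highly relevant segments can be selected from a large pool, instead of taking everything on the grounds that more data is better.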


Appropriately, there was also considerable emphasis on leveraging the semantic characteristics of language data needed for MT. Semantically enriched data, as proposed by such EC-funded projects as LIDER (a Linked Open Data-based ecosystem of free, interlinked, and semantically interoperable language resources) and BabelNet (a multilingual dictionary underpinned by a rich semantic network), clearly has potential as a future resource. We therefore need to examine the fastest and most efficient way to transform this potential technology stack into an operational reality. We can also expect to hear much more from Coreon about multilingual knowledge management as a fundamental business tool.

So what can we expect for a more effective and efficient deployment of LR4MT? In general, respondents are looking towards new hybrid models of machine translation involving the integration of transfer/grammar and semantic modules into the plain vanilla statistical model as it exists today. This suggests that language technology and data resource quality will need to evolve closely in parallel.

They also expect deep learning to be applied to MT, together with such processes as continuous retraining during the MT post-editing phase. In other words, we are just at the beginning of a new cycle of more artificial intelligence-driven MT systems that will be able to learn as they go and extract even more value from relevant data resources. But as one respondent pointed out, the ultimate litmus test for the value of translation resource data is whether or not the original translation is any good. Tools to tame the elusive beast of rapid translation quality evaluation will still need to be part of the mix.

What specific needs for or constraints on MT data resources do you foresee in Europe in the near future? Tell us here or respond to our survey.

Jo Céline
