There's been some interest in translation technologies here at the lab over the last year, so I want to post some media items that came up in the last few weeks.
First is this video on Google Translate, which explains how it works at a very high level. It makes it sound so much easier than it really is!
I find it interesting that Google Translate treats each pair of languages as a separate problem. For example, English->French learns from documents that exist in both English and French, and Spanish->Hindi learns from documents that exist in both of those languages. The video mentions that, as a result, quality is lower on certain language pairs.
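To get a feel for why parallel documents are enough to learn translations, here's a deliberately tiny sketch. It is not how Google Translate actually works (real systems use large corpora and proper alignment models); it just counts which word pairs co-occur across a toy English/French parallel corpus and picks the most frequent partner.

```python
from collections import Counter
from itertools import product

# Toy parallel corpus: (English sentence, French sentence) pairs.
# A crude stand-in for the aligned documents a real system mines.
corpus = [
    ("the cat", "le chat"),
    ("the dog", "le chien"),
    ("a cat", "un chat"),
]

# Count how often each (English word, French word) pair appears
# together in an aligned sentence pair.
cooc = Counter()
for en, fr in corpus:
    for e, f in product(en.split(), fr.split()):
        cooc[(e, f)] += 1

def best_translation(word):
    """Guess the French word that co-occurs most often with `word`."""
    candidates = {f: c for (e, f), c in cooc.items() if e == word}
    return max(candidates, key=candidates.get)

print(best_translation("cat"))  # -> chat
```

Even with three sentence pairs, "cat" pairs with "chat" twice but with "le" and "un" only once each, so the counts alone pick out the right word. That's the core intuition behind learning from parallel text: with enough aligned data, the noise averages out.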
I wonder if there's an approach where each language can be translated into some universal knowledge representation, which can then be translated back into any language. This is, of course, much more difficult, because we'd have to manually translate lots of text into this new representation in order to bootstrap the machine learning algorithms. But if someone could get that to work, the machine would not only do the actual translation, but also figure out the "meaning" of a sentence or phrase.
The following article gives me hope that with better machine learning techniques, we can get good translations with less training data than Google is using:
In this example, the source is a dead language with no translated texts whatsoever. Yet a computer program was able to translate it by finding patterns in the letters and words and comparing them to a known language (in this case, Hebrew).
Admittedly, the researchers knew in this case that Ugaritic was related to Hebrew. Still, it demonstrates that with the right algorithm, we might not need the millions of examples that Google uses.
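To illustrate the flavor of this kind of statistical decipherment, here's a toy sketch. It is far simpler than the actual Ugaritic work: instead of a real dead language, the "unknown" text is just English under a made-up substitution cipher, and we align symbols purely by frequency rank against the known text. The cipher alphabet and sample sentence are my own invented example.

```python
from collections import Counter

# Hypothetical setup: the "known related language" is plain English,
# and the "unknown script" is the same text under a substitution cipher.
known = "the quick brown fox jumps over the lazy dog the end the the"
cipher = str.maketrans("abcdefghijklmnopqrstuvwxyz",
                       "qwertyuiopasdfghjklzxcvbnm")
unknown = known.translate(cipher)

def freq_ranks(text):
    """Letters of `text`, most frequent first (spaces ignored)."""
    counts = Counter(c for c in text if c != " ")
    return [c for c, _ in counts.most_common()]

# Align symbols by frequency rank to guess a decoding table,
# then apply it to the unknown text.
mapping = dict(zip(freq_ranks(unknown), freq_ranks(known)))
decoded = "".join(mapping.get(c, c) for c in unknown)
print(decoded)  # recovers the original sentence
```

Real decipherment is vastly harder (the two languages differ, word boundaries and morphology must be inferred), but the principle is the same: statistical regularities shared between a known and an unknown language can substitute for parallel translation texts.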