06/03/2009

Supercomputed Revival of Lost Languages

"Every 14 days a language dies. By 2100, more than half of the more than 7,000 languages spoken on Earth—many of them not yet recorded—may disappear, taking with them a wealth of knowledge about history, culture, the natural environment, and the human brain." (Source: National Geographic)

Perhaps languages that have been lost are not irretrievably so. Even when the words in a text have lost their meaning, they still retain their order, and that order leaves only a limited number of possible meanings for each word. If a word occurs often enough in the text, only one meaning may remain, because assigning it any other meaning would almost certainly cause inconsistencies.

In the future, using advanced AI, it might be possible to relearn languages that have been forgotten. Suppose that, as a first step, a supercomputer filled in the meanings of the words of every sentence in a text more or less at random. More or less, because it could have certain inclinations, such as pairing shorter words in one language with shorter words in the other, since simpler, more common words are usually shorter: in many languages, for instance, the equivalent of the word "the" is only a few characters long.
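To make the "more or less at random" idea concrete, here is a minimal sketch in Python. The function names and the weighting formula are my own assumptions for illustration; any rule that favours meanings of similar length would serve the same purpose.

```python
import random

def length_weights(unknown_word, candidates):
    """Weight each candidate meaning by how close its length is to the unknown word's."""
    # Assumed formula: weight 1 for equal lengths, shrinking as the difference grows.
    return [1.0 / (1 + abs(len(unknown_word) - len(c))) for c in candidates]

def biased_guess(unknown_word, candidates, rng=random):
    """Pick one candidate at random, favouring candidates of similar length."""
    weights = length_weights(unknown_word, candidates)
    return rng.choices(candidates, weights=weights, k=1)[0]
```

With an unknown two-letter word and the candidates "of" and "mountain", the short candidate is seven times as likely to be chosen, but the long one is never ruled out.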

Suppose it then checked, after every sentence, whether the sentence makes sense: if it does, it moves on to the next sentence and repeats the process. One way of doing this would be to use search engines: if a fragment of a sentence returns no search results after it has been stripped of adjectives and adverbs and each remaining word has been replaced by its most straightforward synonym, it probably does not make sense. If the adverbs and adjectives, placed before or after the words they relate to, do not return any results either, then they are probably incorrect too. (Note that to get results for an entire fragment on search engines today, the fragment must be put in quotes.)
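The check described above could be sketched roughly as follows. This is only an illustration: a small in-memory list of sentences stands in for the search engine, the set of adjectives and adverbs is supplied by hand, and the synonym replacement step is left out. All names here are hypothetical.

```python
def strip_modifiers(words, modifiers):
    """Drop the adjectives and adverbs, leaving the bare fragment."""
    return [w for w in words if w not in modifiers]

def phrase_occurs(fragment, corpus):
    """Exact-phrase match over the corpus, like a quoted search query."""
    n = len(fragment)
    if n == 0:
        return False
    for sentence in corpus:
        tokens = sentence.lower().split()
        # Slide a window of the fragment's length over the sentence.
        for i in range(len(tokens) - n + 1):
            if tokens[i:i + n] == fragment:
                return True
    return False

def makes_sense(candidate, corpus, modifiers):
    """A candidate translation passes if its bare fragment occurs in the corpus."""
    core = strip_modifiers(candidate.lower().split(), modifiers)
    return phrase_occurs(core, corpus)
```

For example, with a corpus containing "the king built a temple" and "great" listed as a modifier, the candidate "the great king built a temple" passes, while "temple the built king" does not.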

If it encounters a word similar, but not identical, to a word it already has, it can try all the different forms of that word and, if that does not work, words related to it or similar in meaning; if these possibilities are exhausted, the supercomputer goes back to random guesses.
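Spotting a word that is similar, but not identical, to one already deciphered could look something like this. The sketch uses get_close_matches from Python's standard difflib module; the lexicon and the 0.6 similarity cutoff are assumptions made for the example, and working out which form of the known word to try is left open.

```python
from difflib import get_close_matches

def similar_known_words(word, lexicon, cutoff=0.6):
    """Return (known_word, meaning) pairs for known words that resemble `word`."""
    # Exclude the word itself: only similar-but-not-identical words are of interest.
    known = [w for w in lexicon if w != word]
    matches = get_close_matches(word, known, n=3, cutoff=cutoff)
    return [(w, lexicon[w]) for w in matches]
```

If the returned list is empty, there is nothing to build on, and the program falls back to random guesses.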

So far, the supercomputer can still fill in each sentence with any meaning at all (that is, any that makes sense), as long as the translation has the same number of words as the original. However, when it sees the same word twice, it will (usually) be forced to review its possible meanings; as it sees the same word more and more often, the possible meanings will be narrowed until only one or a few remain.
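The narrowing step can be sketched as a simple intersection of sets. Here each observation is assumed to be a pair: an unknown word, and the set of meanings that made sense for it in one particular sentence. The function name and data layout are hypothetical.

```python
def narrow_meanings(observations):
    """Keep, for each word, only the meanings that fit every sentence it occurs in."""
    possible = {}
    for word, plausible in observations:
        if word in possible:
            # Seen before: a meaning survives only if it also fits this sentence.
            possible[word] &= set(plausible)
        else:
            possible[word] = set(plausible)
    return possible
```

A word seen three times with the plausible meanings {king, river, house}, {king, house} and {king, temple} is left with "king" alone. An empty set would mean that no single meaning fits every occurrence, so earlier guesses have to be revisited.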

The more text there is available in the language, the fewer words will be left with an unknown meaning. Of course, if too little text is available, translating it will be almost impossible.
