Linguists' best guess is that humans currently speak about 6,900 different languages. More than half the global population communicates using just a handful of them: Chinese, English, Hindi, Spanish, and Russian. In fact, 95 percent of people communicate using just 100 languages.
The other tongues are much less common. Indeed, linguists estimate that about a third of the world's languages are spoken by fewer than 1,000 people and are in danger of dying out within the next 100 years or so. With them will go the unique cultural heritage they embody—stories, phrases, jokes, herbal remedies, and even unique emotions.
It's easy to think that machine learning can help. The problem is that machine translation relies on huge annotated data sets to ply its trade. These data sets consist of vast corpora of books, articles, and websites that have been manually translated into other languages. Such a corpus acts like a Rosetta Stone for machine-learning algorithms, and the bigger the data set, the better they learn.
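The parallel-corpus idea can be sketched with a toy example: given even a few aligned sentence pairs, simple co-occurrence statistics begin to reveal word-level translations. The sentence pairs and scoring below are illustrative inventions, not part of any real translation system, and real models are far more sophisticated—but the principle that more aligned data yields better alignments is the same.

```python
from collections import Counter

# Toy parallel corpus: English sentences aligned with French
# translations. Real corpora contain millions of such pairs.
parallel_corpus = [
    ("the cat sleeps", "le chat dort"),
    ("the dog sleeps", "le chien dort"),
    ("the cat eats", "le chat mange"),
    ("the dog eats", "le chien mange"),
]

# Count how often each (source word, target word) pair co-occurs in
# aligned sentences, and how often each target word appears overall.
cooccurrence = Counter()
target_counts = Counter()
for src, tgt in parallel_corpus:
    for t in tgt.split():
        target_counts[t] += 1
        for s in src.split():
            cooccurrence[(s, t)] += 1

def best_translation(word):
    """Guess a translation: the target word whose occurrences
    overlap most consistently with the source word's."""
    scores = {t: cooccurrence[(word, t)] / target_counts[t]
              for t in target_counts}
    return max(scores, key=scores.get)

print(best_translation("cat"))     # chat
print(best_translation("sleeps"))  # dort
```

This naive co-occurrence scoring needs many aligned examples before the counts become reliable—which is exactly the data that endangered languages lack.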
But these huge data sets simply do not exist for most languages. That's why machine translation works for only a tiny fraction of the most common tongues. Google Translate, for example, speaks only about 90 languages.
So an important challenge for linguists is to find a way to automatically analyze less common languages to better understand them.
Read more: MIT Technology Review