English Leads In Speech Recognition, But Not For Long
As many as 1.5 billion people in the world speak English, including those who speak it as a second language. That may sound like a lot, but it means four out of every five people do not speak English. Any speech recognition or natural language technology built primarily for English speakers therefore misses out on 5.9 billion potential customers. That is a big opportunity, but with 6,500 spoken languages still in use throughout the world, it is also a very big challenge.
Speech technology has solid roots in American research. The first recognized work in speech modeling began at Bell Labs in the 1930s. That work eventually produced limited-vocabulary digit recognition systems by the early 1950s. Over the next several decades, speech research took off at many major American universities, and in the early 1970s DARPA began funding research at schools throughout the country. Given this history, it should be no surprise that English remains one of the most heavily researched and vetted languages in machine learning, even today (although other languages are quickly catching up).
Despite this head start, speech systems still struggle with variation within the English-speaking world. It is a well-known problem that many speech recognition systems handle standard American English best. Videos abound of people with accents arguing with their devices. Pronunciation shifts with dialect from region to region, and systems struggle both with those shifts and with colloquialisms or local words. Building language models that encompass every variation of English would be incredibly difficult, in both complexity and efficiency. Many speech systems rely heavily on context to resolve confusion, and even more have moved toward model bundling to handle these regional issues.
It is important to note that English is not even a particularly difficult language to process. Tonal languages, like Cantonese, can be especially challenging for speech recognition systems. Even when the audio is recognized correctly, a verbatim transcription of Cantonese can be difficult for natural language systems because of formal differences between Cantonese and Mandarin. Languages such as German can also be challenging for speech and language systems because of unusual conventions, such as the capitalization of all nouns.
None of this is meant to imply that languages other than English lack good coverage. Heavy investment in speech and language research has produced excellent results in many other languages, and some have even begun to see word error rates dip below those commonly seen with English. Among these heavily researched languages is Arabic, which has received strong funding in recent years from another DARPA initiative called GALE (Global Autonomous Language Exploitation). French, Japanese, and Chinese have also yielded high-accuracy models, and these improvements will continue to accelerate.
As large companies continue to push speech and language systems into the consumer world, linguistic coverage and depth will grow rapidly. Within our lifetimes, I expect to see performant, high-accuracy speech and language models for nearly all of the world's major languages and common dialects. While English may dominate the landscape today, soon there will be no linguistic divide. Machines will easily understand us, translate our words, and connect every person and every machine by voice.