Source: Teaching the Google Assistant to be Multilingual from Google Research
Posted by Johan Schalkwyk, VP and Ignacio Lopez Moreno, Engineer, Google Speech
Multilingual households are becoming increasingly common, with several sources indicating that multilingual speakers already outnumber monolingual counterparts, and that this number will continue to grow. With this large and increasing population of multilingual users, it is more important than ever that Google develop products that can support multiple languages simultaneously to better serve our users.
Today, we’re launching multilingual support for the Google Assistant, which enables users to jump between two different languages across queries, without having to go back to their language settings. Once users select two of the supported languages (English, Spanish, French, German, Italian and Japanese), they can speak to the Assistant in either language from then on, and the Assistant will respond in kind. Previously, users had to choose a single language setting for the Assistant, changing their settings each time they wanted to use another language, but now, it’s a simple, hands-free experience for multilingual households.
The Google Assistant is now able to identify the language, interpret the query and provide a response using the right language without the user having to touch the Assistant settings.
Getting this to work, however, was not a simple feat. In fact, this was a multi-year effort that involved solving a lot of challenging problems. In the end, we broke the problem down into three discrete parts: Identifying Multiple Languages, Understanding Multiple Languages and Optimizing Multilingual Recognition for Google Assistant users.
Identifying Multiple Languages
People have the ability to recognize when someone is speaking another language, even if they do not speak the language themselves, just by paying attention to the acoustics of the speech (intonation, phonetic registry, etc.). However, defining a computational framework for automatic spoken language recognition is challenging, even with the help of full automatic speech recognition systems¹. In 2013, Google started working on spoken language identification (LangID) technology using deep neural networks. Today, our state-of-the-art LangID models can distinguish between pairs of languages in over 2000 alternative language pairs, using recurrent neural networks, a family of neural networks which are particularly successful for sequence modeling problems, such as those in speech recognition, voice detection, speaker recognition and others. One of the challenges we ran into was working with larger sets of audio: getting models that can automatically understand multiple languages at scale, and hitting a quality standard that allowed those models to work properly.
Understanding Multiple Languages
To understand more than one language at once, multiple processes need to be run in parallel, each producing incremental results, allowing the Assistant not only to identify the language in which the query is spoken but also to parse the query to create an actionable command. For example, even for a monolingual environment, if a user asks to “set an alarm for 6pm”, the Google Assistant must understand that “set an alarm” implies opening the clock app, fill in the explicit parameter “6pm”, and additionally infer that the alarm should be set for today. Making this work for any given pair of supported languages is a challenge, as the Assistant executes the same work it does for the monolingual case, but now must additionally enable LangID, and not just one but two monolingual speech recognition systems simultaneously (we’ll explain more about the current two language limitation later in this post).
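The alarm example above can be sketched in a few lines of Python. This is only an illustration of the three steps named in the text (recognize the intent, fill the explicit time parameter, infer "today"); the function name `parse_alarm`, the regular expression and the returned dictionary keys are assumptions made for this sketch, not the Assistant's actual parser.

```python
import re
from datetime import datetime

# Hypothetical pattern covering only the phrasing used in the example query.
ALARM_PATTERN = re.compile(r"set an alarm for (\d{1,2})\s*(am|pm)", re.IGNORECASE)


def parse_alarm(query, now):
    """Turn an alarm query into an actionable command, or return None."""
    match = ALARM_PATTERN.search(query)
    if match is None:
        return None
    # Explicit parameter: convert the 12-hour clock value to a 24-hour value.
    hour = int(match.group(1)) % 12
    if match.group(2).lower() == "pm":
        hour += 12
    # Implicit parameter: no date was given, so assume today's date.
    when = now.replace(hour=hour, minute=0, second=0, microsecond=0)
    return {"intent": "set_alarm", "app": "clock", "when": when}


command = parse_alarm("set an alarm for 6pm", datetime(2018, 8, 30, 9, 15))
print(command)
```

The sketch always resolves the time to the current day, as in the example; a real system would also have to decide what to do when that time has already passed.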
Importantly, the Google Assistant and other services that are referenced in the user’s query asynchronously generate real-time incremental results that need to be evaluated in a matter of milliseconds. This is accomplished with the help of an additional algorithm that ranks the transcription hypotheses provided by each of the two speech recognition systems using the probabilities of the candidate languages produced by LangID, our confidence in the transcription and the user’s preferences (such as favorite artists, for example).
Schematic of our multilingual speech recognition system used by the Google Assistant versus the standard monolingual speech recognition system. A ranking algorithm is used to select the best recognition hypotheses from two monolingual speech recognizers using relevant information about the user and the incremental LangID results.
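To make the ranking step more concrete, here is a minimal sketch that combines the three signals mentioned above: the LangID probability of each candidate language, the recognizer's confidence in its transcription, and the user's preferences. The scoring formula (a product plus a fixed bonus), the `Hypothesis` class and the `preference_bonus` parameter are assumptions chosen for illustration; the post does not describe how the real ranker weighs these signals.

```python
from dataclasses import dataclass


@dataclass
class Hypothesis:
    language: str          # language of the recognizer that produced the text
    text: str              # transcription hypothesis
    asr_confidence: float  # recognizer's own confidence, in [0, 1]


def rank_hypotheses(hypotheses, langid_probs, preferred_terms, preference_bonus=0.1):
    """Sort hypotheses from best to worst under an assumed scoring rule."""

    def score(hyp):
        # Assumed rule: LangID probability times transcription confidence.
        value = langid_probs.get(hyp.language, 0.0) * hyp.asr_confidence
        # Assumed rule: small bonus when the text mentions a user preference,
        # such as a favorite artist.
        if any(term.lower() in hyp.text.lower() for term in preferred_terms):
            value += preference_bonus
        return value

    return sorted(hypotheses, key=score, reverse=True)


candidates = [
    Hypothesis("en", "play shakira", 0.8),
    Hypothesis("es", "plei chaquira", 0.5),
]
best = rank_hypotheses(candidates, {"en": 0.7, "es": 0.3}, {"Shakira"})[0]
print(best.language, best.text)
```

Because the results are incremental, a ranker of this kind would be re-run each time a recognizer or LangID emits an updated result.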
When the user stops speaking, the model has not only determined what language was being spoken, but also what was said. Of course, this process requires a sophisticated architecture that comes with an increased processing cost and the possibility of introducing unnecessary latency.
Optimizing Multilingual Recognition
To minimize these undesirable effects, the faster the system can make a decision about which language is being spoken, the better. If the system becomes certain of the language being spoken before the user finishes a query, then it will stop running the user’s speech through the losing recognizer and discard the losing hypothesis, thus lowering the processing cost and reducing any potential latency. With this in mind, we saw several ways of optimizing the system.
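The early-commit idea can be sketched as follows. The simulation assumes that LangID emits one posterior per audio chunk and that a fixed `commit_threshold` decides when the system is "certain"; both the threshold value and the function name `run_with_early_commit` are hypothetical. The counter stands in for the work each recognizer would do.

```python
def run_with_early_commit(languages, posteriors, commit_threshold=0.9):
    """Simulate two recognizers running until LangID is confident.

    `posteriors` holds one dict of language -> probability per audio chunk.
    Returns the committed language (or None) and how many chunks each
    recognizer had to decode.
    """
    active = set(languages)
    chunks_decoded = {lang: 0 for lang in languages}
    committed = None
    for posterior in posteriors:
        for lang in active:
            chunks_decoded[lang] += 1  # stand-in for decoding this chunk
        if committed is None:
            best = max(active, key=lambda lang: posterior.get(lang, 0.0))
            if posterior.get(best, 0.0) >= commit_threshold:
                committed = best
                active = {best}  # stop the losing recognizer from here on
    return committed, chunks_decoded


stream = [
    {"en": 0.50, "es": 0.50},
    {"en": 0.70, "es": 0.30},
    {"en": 0.93, "es": 0.07},
    {"en": 0.97, "es": 0.03},
    {"en": 0.99, "es": 0.01},
]
print(run_with_early_commit(["en", "es"], stream))
```

In this run the losing recognizer decodes three chunks instead of five, which is the saving in processing cost the paragraph describes.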
One use case we considered was that people normally use the same language throughout their query (which is also the language users generally want to hear back from the Assistant), with the exception of asking about entities with names in different languages. This means that, in most cases, focusing on the first part of the query allows the Assistant to make a preliminary guess of the language being spoken, even in sentences containing entities in a different language. With this early identification, the task is simplified by switching to a single monolingual speech recognizer, as we do for monolingual queries. Making a quick decision about how and when to commit to a single language, however, requires a final technological twist: specifically, we use a random forest technique that combines multiple contextual signals, such as the type of device being used, the number of speech hypotheses found, how often we receive similar hypotheses, the uncertainty of the individual speech recognizers, and how frequently each language is used.
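As a rough illustration of how a random forest can turn several contextual signals into a single commit decision, the sketch below trains a deliberately tiny forest in pure Python. Everything specific here is an assumption: the three numeric features, the made-up training rows, and the simplification that each tree is a one-split "stump" on one randomly chosen feature (production random forests use deeper trees and random feature subsets at every split). It shows the technique in general, not the model used by the Assistant.

```python
import random
from collections import Counter

# Hypothetical features, in order:
# [hypothesis_agreement, recognizer_confidence_gap, language_usage_share]
# All rows are invented for illustration; label True means "commit now".
TRAIN = [
    ([0.90, 0.70, 0.80], True),
    ([0.80, 0.60, 0.90], True),
    ([0.85, 0.75, 0.70], True),
    ([0.95, 0.65, 0.75], True),
    ([0.30, 0.10, 0.50], False),
    ([0.40, 0.20, 0.40], False),
    ([0.20, 0.15, 0.55], False),
    ([0.35, 0.05, 0.45], False),
]


def bootstrap(rows, rng):
    """Sample with replacement, retrying until both labels are present."""
    while True:
        sample = [rng.choice(rows) for _ in rows]
        if len({label for _, label in sample}) == 2:
            return sample


def stump_predict(stump, x):
    feature, threshold, above_means_commit = stump
    return (x[feature] > threshold) == above_means_commit


def fit_stump(sample, feature):
    """Pick the threshold and direction with the fewest errors on one feature."""
    values = sorted({x[feature] for x, _ in sample})
    best_errors, best = len(sample) + 1, (feature, values[0], True)
    for lo, hi in zip(values, values[1:]):
        for above_means_commit in (True, False):
            stump = (feature, (lo + hi) / 2, above_means_commit)
            errors = sum(stump_predict(stump, x) != label for x, label in sample)
            if errors < best_errors:
                best_errors, best = errors, stump
    return best


def fit_forest(rows, n_trees=25, seed=0):
    rng = random.Random(seed)
    n_features = len(rows[0][0])
    forest = []
    for _ in range(n_trees):
        sample = bootstrap(rows, rng)        # bagging: resample the rows
        feature = rng.randrange(n_features)  # random feature for this tree
        forest.append(fit_stump(sample, feature))
    return forest


def should_commit(forest, x):
    """Majority vote across all trees."""
    votes = Counter(stump_predict(stump, x) for stump in forest)
    return votes[True] > votes[False]


forest = fit_forest(TRAIN)
print(should_commit(forest, [0.99, 0.90, 0.95]))
print(should_commit(forest, [0.10, 0.00, 0.30]))
```

Categorical signals from the text, such as the type of device, would need to be encoded numerically before they could be used by a forest like this one.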
An additional way we simplified and improved the quality of the system was to limit the list of candidate languages users can select. Users can choose two languages out of the six that our Home devices currently support, which will allow us to support the majority of our multilingual speakers. As we continue to improve our technology, however, we hope to tackle trilingual support next, knowing that this will further enhance the experience of our growing user base.
Bilingual to Trilingual
From the beginning, our goal has been to make the Assistant naturally conversational for all users. Multilingual support has been a highly requested feature, and it’s something our team set its sights on years ago. But there aren’t just a lot of bilingual speakers around the globe today; we also want to make life a little easier for trilingual users, or families that live in homes where more than two languages are spoken.
With today’s update, we’re on the right track, and it was made possible by our advanced machine learning, our speech and language recognition technologies, and our team’s commitment to refining our LangID model. We’re now working to teach the Google Assistant how to process more than two languages simultaneously, and to add more supported languages in the future. Stay tuned!
1 It is typically acknowledged that spoken language recognition is remarkably more challenging than text-based language identification, where relatively simple techniques based on dictionaries can do a good job. The time/frequency patterns of spoken words are difficult to compare, spoken words can be more difficult to delimit as they can be spoken without pause and at different paces, and microphones may record background noise in addition to speech.