A team of researchers in India has devised a system for translating words into a different language and making it appear that a speaker's lips are moving in sync with that language.
Automatic Face-to-Face Translation, as described in this October 2019 paper, is an advance over text-to-text or speech-to-speech translation, because it not only translates speech, but also provides a lip-synced facial image.
To understand how this works, check out the demonstration video below, created by the researchers. At the 6:38 mark, you'll see a video clip of the late Princess Diana in a 1995 interview with journalist Martin Bashir, explaining, "I'd like to be a queen of people's hearts, in people's hearts, but I don't see myself being a queen of this country."
A moment later, you'll see her uttering the same quote in Hindi — with her lips moving, as if she actually spoke that language.
"Communicating effectively across language barriers has always been a major aspiration for humans all over the world," Prajwal K.R., a graduate student in computer science at the International Institute of Information Technology in Hyderabad, India, explains via email. He's the lead author of the paper, along with his colleague Rudrabha Mukhopadhyay.
"Today, the internet is filled with talking face videos: YouTube (300 hours uploaded per day), online lectures, video conferencing, movies, TV shows and so on," Prajwal, who goes by his given name, writes. "Current translation systems can only generate a translated speech output or textual subtitles for such video content. They do not handle the visual component. As a result, when the translated speech is overlaid on the video, the lip movements are out of sync with the audio.
"Thus, we build upon the speech-to-speech translation systems and propose a pipeline that can take a video of a person speaking in a source language and output a video of the same speaker speaking in a target language, such that the voice style and lip movements match the target language speech," Prajwal says. "By doing so, the translation system becomes holistic, and as shown by our human evaluations in this paper, significantly improves the user experience in creating and consuming translated audio-visual content."
Face-to-Face Translation requires a number of complex feats. "Given a video of a person speaking, we have two major information streams to translate: the visual and the speech information," he explains. They accomplish this in several major steps. "The system first transcribes the sentences in the speech using automatic speech recognition (ASR). This is the same technology that is used in voice assistants (Google Assistant, for example) in mobile devices." Next, the transcribed sentences are translated to the desired language using Neural Machine Translation models, and then the translation is converted to spoken words with a text-to-speech synthesizer — the same technology that digital assistants use.
Finally, a technology called LipGAN corrects the lip movements in the original video to match the translated speech.
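The four stages described above can be sketched as a simple pipeline. This is an illustrative Python sketch, not the researchers' actual code: each function is a stub standing in for a real model (automatic speech recognition, neural machine translation, text-to-speech synthesis, and LipGAN), and all names and data containers here are assumptions made for the example.

```python
from dataclasses import dataclass

# Hypothetical container; the real system operates on audio waveforms and
# video frames, which strings stand in for here.
@dataclass
class Video:
    frames: list   # face frames of the speaker
    audio: str     # spoken sentence (placeholder for the waveform)

# Stub stages — illustrative names, not the paper's API.
def asr(audio):                 # 1. speech -> source-language text
    return audio

def nmt(text, target_lang):     # 2. source text -> target-language text
    return f"[{target_lang}] {text}"

def tts(text):                  # 3. target text -> synthesized speech
    return f"speech({text})"

def lipgan(frames, speech):     # 4. re-sync the lips to the new speech
    return [f"{f}+synced" for f in frames]

def face_to_face_translate(video, target_lang):
    """Compose the four stages into the full pipeline."""
    text = asr(video.audio)
    translated = nmt(text, target_lang)
    speech = tts(translated)
    frames = lipgan(video.frames, speech)
    return Video(frames=frames, audio=speech)

clip = Video(frames=["f0", "f1"], audio="hello world")
out = face_to_face_translate(clip, "hi")
print(out.audio)  # speech([hi] hello world)
```

The point of the sketch is the composition: the first three stages are off-the-shelf speech technology, and only the final LipGAN stage, which edits the video itself, is the paper's new contribution.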
"Thus, we get a fully translated video with lip synchronization as well," Prajwal explains.
"LipGAN is the key novel contribution of our paper. This is what brings the visual modality into the picture. It is most important as it corrects the lip synchronization in the final video, which significantly improves the user experience."