A team of researchers in India has devised a system for translating words into a different language and making it appear that a speaker's lips are moving in sync with that language.
Automatic Face-to-Face Translation, as described in this October 2019 paper, is an advance over text-to-text or speech-to-speech translation, because it not only translates speech, but also provides a lip-synced facial image.
To understand how this works, check out the demonstration video below, created by the researchers. At the 6:38 mark, you'll see a video clip of the late Princess Diana in a 1995 interview with journalist Martin Bashir, explaining, "I'd like to be a queen of people's hearts, in people's hearts, but I don't see myself being a queen of this country."
A moment later, you'll see her uttering the same quote in Hindi — with her lips moving, as if she actually spoke that language.
"Communicating effectively across language barriers has always been a major aspiration for humans all over the world," Prajwal K.R., a graduate student in computer science at the International Institute of Information Technology in Hyderabad, India, explains via email. He's the lead author of the paper, along with his colleague Rudrabha Mukhopadhyay.
"Today, the internet is filled with talking face videos: YouTube (300 hours uploaded per day), online lectures, video conferencing, movies, TV shows and so on," Prajwal, who goes by his given name, writes. "Current translation systems can only generate a translated speech output or textual subtitles for such video content. They do not handle the visual component. As a result, the translated speech when overlaid on the video, the lip movements would be out of sync with the audio.
"Thus, we build upon the speech-to-speech translation systems and propose a pipeline that can take a video of a person speaking in a source language and output a video of the same speaker speaking in a target language such that the voice style and lip movements matches the target language speech," Prajwal says. "By doing so, the translation system becomes holistic, and as shown by our human evaluations in this paper, significantly improves the user experience in creating and consuming translated audio-visual content."
Face-to-Face Translation requires a number of complex feats. "Given a video of a person speaking, we have two major information streams to translate: the visual and the speech information," he explains. They accomplish this in several major steps. "The system first transcribes the sentences in the speech using automatic speech recognition (ASR). This is the same technology that is used in voice assistants (Google Assistant, for example) in mobile devices." Next, the transcribed sentences are translated to the desired language using Neural Machine Translation models, and then the translation is converted to spoken words with a text-to-speech synthesizer — the same technology that digital assistants use.
Finally, a technology called LipGAN corrects the lip movements in the original video to match the translated speech.
"Thus, we get a fully translated video with lip synchronization as well," Prajwal explains.
"LipGAN is the key novel contribution of our paper. This is what brings the visual modality into the picture. It is most important as it corrects the lip synchronization in the final video, which significantly improves the user experience."
The Intent Is Not Deception, But Knowledge Sharing
An article published Jan. 24, 2020, in New Scientist described the breakthrough as a "deepfake," a term for videos in which faces have been swapped or digitally altered with the help of artificial intelligence, often to create a misleading impression, as this BBC story explained. But Prajwal maintains that's an incorrect portrayal of Face-to-Face Translation, which isn't intended to deceive, but rather to make translated speech easier to follow.
"Our work is primarily targeted at broadening the scope of the existing translation systems to handle video content," he explains. "This is a software created with a motivation to improve the user experience and break down language barriers across video content. It opens up a very wide range of applications and improves the accessibility of millions of videos online."
The biggest challenge in making face-to-face translation work was the face generation module. "Current methods to create lip-sync videos were not able to generate faces with desired poses, making it difficult to paste the generated face into the target video," Prajwal says. "We incorporated a 'pose prior' as an input to our LipGAN model, and as a result, we can generate an accurate lip-synced face in the desired target pose that can be seamlessly blended into the target video."
The researchers envision Face-to-Face Translation being utilized in translating movies and video calls between two people who each speak a different language. "Making digital characters in animated films sing/speak is also demonstrated in our video," Prajwal notes.
In addition, he foresees the system being used to help students across the globe understand online lecture videos in other languages. "Millions of foreign language students across the globe cannot understand excellent educational content available online, because it is in English," he explains.
"Further, in a country like India with 22 official languages, our system can, in the future, translate TV news content into different local languages with accurate lip-sync of the news anchors. The list of applications thus applies to any sort of talking face video content, that needs to be made more accessible across languages."
Though Prajwal and his colleagues intend for their breakthrough to be used in positive ways, the capability to put foreign words in a speaker's mouth concerns one prominent U.S. cybersecurity expert, who fears that altered videos will become increasingly difficult to detect.
"If you look at the video, you can tell if you look closely, the mouth has got some blurriness," says Anne Toomey McKenna, a Distinguished Scholar of Cyberlaw and Policy at Penn State University's Dickinson Law, and a professor at the university's Institute for Computational and Data Sciences, in an email interview. "That will continue to be minimized as the algorithms continue to improve. That will become less and less discernable to the human eye."
McKenna, for example, imagines how an altered video of MSNBC commentator Rachel Maddow might be used to influence elections in other countries, by "relaying information that's inaccurate and the opposite of what she said."
Prajwal is concerned about possible misuse of altered videos as well but thinks that precautions can be developed to guard against such scenarios, and that the positive potential for increasing international understanding outweighs the risks of Automatic Face-to-Face Translation. (On the beneficial side, this blog post envisions translating Greta Thunberg's speech at the U.N. climate summit in September 2019 into a variety of different languages used in India.)
"Every powerful piece of technology can be used for a massive amount of good, and also have ill-effects," Prajwal notes. "Our work is, in fact, a translation system that can handle video content. Content translated by an algorithm is definitely 'not real,' but this translated content is essential for people who do not understand a particular language. Further, at the current stage, such automatically translated content is easily recognizable by algorithms and viewers. Simultaneously, active research is being conducted to recognize such altered content. We believe that the collective effort of responsible use, strict regulations, and research advances in detecting misuse can ensure a positive future for this technology."