M+E Daily

HITS 2023: AppTek Touts the Advantages of Fully Automatic AI Dubbing

Artificial intelligence (AI) and machine learning (ML) company AppTek highlighted the advantages of video content makers using fully automatic AI dubbing on May 23 at the Hollywood Innovation and Transformation Summit (HITS) at The Culver Theater, during the session “Automatic Dubbing for Hollywood and Beyond.”

The presentation featured a demonstration of speaker-adaptive AI dubbing across multiple languages that automatically retains the speaker characteristics and timing of the original speakers when applied to a target language.

The company provided a look under the hood at how the technology works, the quality tiers available within automatic dubbing and how to apply them across media archives, as well as a look at what to expect, and what not to expect, from the technology going forward.

AppTek specialises in AI and ML for human language technologies, Kyle Maddock, SVP, sales and marketing at AppTek, pointed out at the start of the session.

“So that’s automatic speech recognition or machine translation, natural language understanding, and text to speech technologies,” he noted.

“So what do we mean by automatic dubbing? Traditionally, when you think about dubbing, a lot of it has to do with lip movements and there’s computer vision tasks that we’ve seen that do this,” he told attendees.

“What we’re really going to be talking about is re-voicing: going from one source language into a target language while maintaining the same speaker voices and stylisation,” he said.

He played a roughly two-minute demonstration of fully automatic dubbing, telling attendees to watch it and “listen to the two different speakers and notice how their voices change and the timing stays with them.”

He then demonstrated how the technology works when implemented in an automatic workflow, using a clip from the classic film Casablanca and taking it from its original English into German. He went step by step through the process, beginning with the original 25-second source clip and then applying speech separation.

“Speech separation serves three components,” he said, explaining: “First, by isolating the speech, we’re able to input that into the automatic speech recognition and get higher accuracy of that output. The next is we’re able to isolate the speaker voices and then reapply them at a later step. And the third is we’re able to retain all of the background elements: music, dog barking, everything else, retain the background audio, and then overlay” the new speech on top of that.
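In rough Python terms, those three roles might look like the minimal sketch below. The audio here is synthetic and the separation itself is assumed to come from a trained neural separator; nothing in the snippet reflects AppTek’s actual implementation.

```python
import numpy as np

SR = 16_000                       # sample rate in Hz (assumed)
rng = np.random.default_rng(0)

# Stand-ins for the two stems a learned source separator would return.
t = np.arange(SR * 5) / SR
speech = (0.3 * np.sin(2 * np.pi * 220 * t)).astype(np.float32)
background = (0.1 * rng.standard_normal(SR * 5)).astype(np.float32)

# 1) The isolated speech stem goes to speech recognition for higher accuracy.
asr_input = speech

# 2) The same stem supplies clean per-speaker samples to reapply later.
voice_reference = speech[: 2 * SR]

# 3) The background (music, dog barking, everything else) is kept untouched,
#    and the newly synthesized target-language speech is overlaid on top.
dubbed_speech = speech            # placeholder for the synthesized German audio
final_mix = np.clip(background + dubbed_speech, -1.0, 1.0)
```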

The next step starts with automatic speech recognition. “So we convert the speech into text and we timeline it [and] we incorporate individual punctuation,” he said. When there is a speaker change, “what we’re able to do is actually segment the two different speakers and then we go through it and we label them.”
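The output of this stage is essentially a list of time-stamped, punctuated, speaker-labeled segments. A sketch of what such a structure could look like follows; the field names are illustrative, not AppTek’s actual schema, and the timings and dialogue are invented placeholders.

```python
from dataclasses import dataclass

@dataclass
class Segment:
    """One recognized, speaker-attributed stretch of dialogue."""
    start: float   # segment start, in seconds
    end: float     # segment end, in seconds
    speaker: str   # label assigned when a speaker change is detected
    text: str      # punctuated transcript of the segment

# Shape of the result for a two-speaker clip (dialogue elided).
segments = [
    Segment(start=0.0, end=3.2, speaker="SPEAKER_1", text="..."),
    Segment(start=3.4, end=7.9, speaker="SPEAKER_2", text="..."),
]
```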

The next step was machine translation, “because when you think about translation, it kind of comes in a bit of a black box in a sense,” he said, noting “it doesn’t really understand a lot of the real-world context.” For example, if Spanish is being spoken, is it European Spanish or Spanish as it’s spoken in one of the Latin American countries? For this demo, AppTek went with “informal stylings of the German language” because the characters speaking in the film knew each other, he said.
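Because the translation engine cannot infer that context on its own, it has to be supplied explicitly. A hypothetical request might carry metadata along these lines (the keys are invented for illustration, not a real AppTek API):

```python
# Hypothetical per-segment translation request; keys are illustrative only.
mt_request = {
    "source_lang": "en",
    "target_lang": "de",
    "formality": "informal",   # the characters in the scene know each other
    # A Spanish target would also need a locale choice, e.g. "es-ES" vs. "es-419".
}
```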

One part of the dialogue in the scene was lengthy and if AppTek used the actual text, “I would’ve had to squeeze that thing in, push it up rapidly, increase that speaking rate and it would’ve sounded very unnatural,” he explained. “So what we’ve incorporated is something called isometric machine translation,” he said, adding: “What we do is we look at that original sentence … [and] we count a character length of that. And what we do is produce a translation that better fits” and we can “now keep that very natural flow to speech.”
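A toy version of that length matching is easy to sketch: count the source’s characters and re-rank candidate translations by how closely they match. Real isometric MT biases the model toward length-compliant output during generation rather than re-ranking afterward, and the Casablanca line and German variants below are purely illustrative.

```python
def pick_isometric(source: str, candidates: list[str]) -> str:
    """Pick the candidate whose character count is closest to the source's,
    so the dubbed line fits the original speaking time."""
    return min(candidates, key=lambda c: abs(len(c) - len(source)))

source = "Here's looking at you, kid."         # 27 characters
candidates = [
    "Ich seh' dir in die Augen, Kleines.",     # 35 characters
    "Schau mir in die Augen, Kleines.",        # 32 characters
]
print(pick_isometric(source, candidates))      # -> the shorter, closer-fitting variant
```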

Next in the demo was adaptive speech synthesis, in which “we train about a thousand different speakers” so that, “between male, female and all these different speaking styles, you get a compiled model and, with that compiled model, all I need is a two-second vector from that source language and then I can reapply that to the target language and make that voice sound the same,” he explained.

He added: “Any of your voices would work through this same type of process.”
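The interface implied by that description is simple: roughly two seconds of the source speaker go in, a fixed-size voice vector comes out, and the target-language synthesizer is conditioned on it. The sketch below uses a toy spectral fingerprint in place of a trained speaker encoder, and a silent stub in place of the compiled multi-speaker model.

```python
import numpy as np

def speaker_embedding(audio: np.ndarray, sr: int = 16_000) -> np.ndarray:
    """Reduce ~2 seconds of source speech to a fixed-size speaker vector.

    Toy stand-in: a real system uses a neural speaker encoder trained on
    many voices; only the shape of the interface matters here.
    """
    clip = audio[: 2 * sr]
    spec = np.abs(np.fft.rfft(clip, n=1024))       # coarse spectral profile
    return spec / (np.linalg.norm(spec) + 1e-8)    # unit-normalized vector

def synthesize(text: str, voice: np.ndarray, sr: int = 16_000) -> np.ndarray:
    """Target-language TTS conditioned on the speaker vector (stub)."""
    # A real compiled model would generate speech in the cloned voice;
    # this placeholder just returns one second of silence.
    return np.zeros(sr, dtype=np.float32)
```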

AppTek then goes in and “we take those segments; we reapply all those individual voices into those time limits [and] now our next step is speech timing and placement.”
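Reusing the Segment sketch from above, placement can be approximated as dropping each synthesized clip back onto a silent track at its original timestamp. The hard trim below is a crude stand-in for the gentler rate adjustments a production system would make.

```python
import numpy as np

def place_segments(segments, dubbed_clips, total_seconds, sr=16_000):
    """Lay each synthesized clip onto a silent track at its segment's start time.

    `segments` pairs one-to-one with `dubbed_clips`; overruns are trimmed here,
    where a real system would time-stretch or lean on isometric MT upstream.
    """
    track = np.zeros(int(total_seconds * sr), dtype=np.float32)
    for seg, clip in zip(segments, dubbed_clips):
        start = int(seg.start * sr)
        end = min(start + len(clip), len(track))
        track[start:end] = clip[: end - start]
    return track
```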

He then moved on to explain quality tiers, pointing out that this is “something to think about when you’re incorporating this into your different workflows: What are the different levels of quality we can expect?” After do-it-yourself models come four AppTek tiers, each adding features that enhance quality. The only offering at a higher tier would be in-studio professional dubbing using professional voice actors.

Next, Dr. Volker Steinbiss, managing director of AppTek GmbH, invited everybody “to be excited but moderately excited … and be excited for a few years because this is a process.” He conceded: “It’s kind of a journey and we have to [do] a few things still.”

He warned about the Gartner hype cycle, which he said “typically starts with a technology trigger [where] you hear [about] a technology at a conference and you say, ‘Oh gee, you know, that basically solves a few of my biggest problems, right?’ And then you just start talking with your colleagues and then you read it in the media” and ask others if they’ve heard about it. “And everybody gets excited and it’s a positive feedback [loop], right? And, after a while, everybody is so excited that the excitement doesn’t relate well with the reality anymore. And some people find out and say, ‘Oh, it’s not so good actually. You know, it doesn’t solve all our problems and we have to do something.’”

Then everybody is “really annoyed … [and] you go down into this trough of disillusionment,” he went on to say. But “you ignore the hype – but you also ignore this frustration down there,” he said. “You have done your homework. For example, regarding automatic dubbing: You have made sure that your people are trained, that you got the data to train the machine learning…. You did basically everything right. You go up the slope of enlightenment and then, after a while, you reach the plateau of productivity and that’s basically where the researchers say, ‘You know, this is not interesting anymore. It’s just boring stuff that works all the time.’”

He added: “Technology doesn’t care what you think. It just gets better all the time and it ignores how it’s being viewed.” Then it crosses a line where the tech gets good and “people start talking about it and getting excited and when it crosses the second line, where it’s really, really good and improving, it is not so important anymore,” he said.

When AppTek demonstrated this technology for the first time, about a year ago in London, he recalled, “everybody was talking about it…. It’s super technology.”

But he said: “Don’t be fooled. We have to do some homework.”

The Hollywood Innovation and Transformation Summit event was produced by MESA in association with the Hollywood IT Society (HITS) and presented by Amazon Studios Technology, with sponsorship by Fortinet, Genpact, Prime Focus Technologies, Signiant, Softtek, Convergent, Gracenote, Altman Solon, AppTek, Ascendion, CoreSite, EPAM, MicroStrategy, Veritone, CDSA, EIDR and PDG Consulting.