For the documentary "The Deepfake Project" the journalists and tech students ventured into new territory. The Vision: producing convincing deepfakes of German politicians. Within a few weeks. On a technical level, this meant pioneering work – and a constant "trial and error", because the technology, which has been talked about since 2017, is not yet fully developed.
Nameer Alkhayyat was responsible for the production of the deepfakes. Together with the team, he spent nearly eight weeks collecting and evaluating data and training AI models. The goal: voice and lip movements that fit together coherently in German. "At first we thought production would be easier. But it was a really difficult process," the tech student said.
For lip syncing, the team quickly came up with a technical solution. "You need a good video in which the person looks straight into the camera and doesn't move their head too much," says Merlin Sugue, who was the link between tech and journalism. The fake was then built into this video.
"You just overlay one area of the video, the mouth, the rest remains the same".
Faking the voices of politicians and celebrities convincingly was more difficult. The software currently available does indeed create high-quality voice fakes in English. However, the algorithms are not yet optimized for German. On top of that came individual accents or dialects, as in the case of Dorothee Bär. "It would have taken weeks to train the model ourselves," says Nameer Alkhayyat. For the production of the voice samples, the team brought on board Resemble AI, a Canadian company that specializes in AI-generated voices.
In the past, when no AI was used to produce artificial speech, individual phonemes had to be strung together to form words or sentences. This involved tedious, sometimes manual labor, and the results often sounded choppy. These problems are now solved by machines. With the help of so-called deep learning, audio files and the corresponding transcripts are fed into a kind of neural network. The network then trains itself. It recognizes speech patterns and can independently produce new sentences. The results form a coherent sequence and sound increasingly fluid. Zohaib Ahmed, founder and CEO of Resemble AI:
"The beauty of deep learning is that there are hundreds of thousands or even millions of parameters that we as humans wouldn't even recognize."
For the production of our own artificial voices, speech samples were "clipped" for each test candidate. For this purpose, 600 individual sentences were cut out with video editing programs and sent to Resemble AI together with the corresponding transcripts.
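The team did this by hand in video editing software; as a hedged illustration, the same clipping step could also be scripted, for example with ffmpeg, given a list of timestamps and transcripts. All file names, timestamps and sentences below are invented for the example.

```python
# Hedged illustration only: the team cut its roughly 600 sentences manually.
# This script shows how the clipping could be automated with ffmpeg, given
# start/end timestamps and the matching transcript for each sentence.
import csv
import subprocess

# (start, end, transcript) for each sentence to extract from the source video
sentences = [
    ("00:01:12.0", "00:01:16.5", "Example sentence one."),
    ("00:02:03.2", "00:02:07.8", "Example sentence two."),
]

with open("transcripts.csv", "w", newline="", encoding="utf-8") as f:
    writer = csv.writer(f)
    for i, (start, end, text) in enumerate(sentences):
        clip = f"clip_{i:04d}.wav"
        # Extract only the audio of this sentence as an uncompressed WAV file.
        subprocess.run([
            "ffmpeg", "-y", "-i", "interview.mp4",
            "-ss", start, "-to", end,
            "-vn", "-acodec", "pcm_s16le", "-ar", "22050", clip,
        ], check=True)
        writer.writerow([clip, text])
```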
However, producing the voices is comparatively delicate. Problems are mainly caused by the source material: background noise, for example, and the amount of audio available. The more material there is, the more accurately the neural network can produce a uniform, artificial voice.
The effort involved varies greatly from voice to voice. Training a language that deep learning has not yet seen requires up to twelve hours of raw material. If patterns of that language have already been learned, however, a new voice can be produced much faster.
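That shortcut is essentially transfer learning: instead of starting from scratch, a model that already knows the language's patterns is fine-tuned on the new speaker's much smaller data set. A minimal sketch, reusing the hypothetical TinyTTS class from the earlier example and an assumed checkpoint file:

```python
# Hedged sketch of the transfer-learning idea; TinyTTS is the toy class defined
# in the earlier example, and the checkpoint file name is an assumption.
import torch

model = TinyTTS()
model.load_state_dict(torch.load("german_base_model.pt"))   # pretrained on hours of German

# Freeze the text encoder: the language patterns are already learned,
# only the mapping to the new speaker's voice still needs to be adapted.
for param in model.embed.parameters():
    param.requires_grad = False
for param in model.encoder.parameters():
    param.requires_grad = False

optimizer = torch.optim.Adam(
    (p for p in model.parameters() if p.requires_grad), lr=1e-4
)
# ...then run the same training loop as before, on far less audio.
```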
For the project's website, the Free Tech team had their own avatar created. The British AI company Synthesia supported this endeavor.
Victor Riparbelli, CEO and co-founder: "With our approach, it is possible to create video content entirely in your own browser. You don't need cameras, a studio or an actor." The classic way to create an avatar is very involved: you would have to hire an actor, have them speak different sentences in front of the camera, edit the video files on a computer and then publish them. Synthesia replaces these steps with code. You simply enter the desired text on the platform, choose a suitable actor and have an almost perfect result within a few minutes.
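The article does not show what that looks like in code; as a rough sketch, a request to a video-generation service of this kind might boil down to sending text and a chosen presenter to an HTTP endpoint. The URL and field names below are assumptions for illustration and are not taken from Synthesia's documented API.

```python
# Rough illustration of "replacing these steps with code": send the desired
# text and a chosen presenter to a video-generation service over HTTP.
# Endpoint, credential and field names are assumptions made for this example.
import requests

API_KEY = "your-api-key"                                    # hypothetical credential
BASE_URL = "https://api.example-avatar-service.com/v1/videos"

response = requests.post(
    BASE_URL,
    headers={"Authorization": API_KEY},
    json={
        "script": "Hello, this text will be spoken by the avatar.",
        "avatar": "presenter-01",                           # which on-screen actor to use
        "background": "white",
    },
    timeout=30,
)
response.raise_for_status()
video_id = response.json()["id"]
print("Video queued, id:", video_id)                        # the clip is ready minutes later
```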
If you want to create your own avatar, however, you can't avoid the classic way. For our avatar, actress and erotic model Micaela Schäfer read predetermined sentences in English in front of a camera. She was recorded in front of a green screen so that the avatar could later be placed in front of any background. Synthesia fed the videos into a neural network, which learned both Micaela Schäfer's facial expressions and her lip movements. A short time later, the avatar was ready to use. Schäfer's original voice could not be used, as an artificially generated version of it would not have been of sufficient quality.
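The green screen is what makes the later background swap possible; a minimal chroma-key sketch with OpenCV (file names and the green color range are assumptions) shows the basic idea.

```python
# Minimal chroma-key sketch: pixels close to the screen's green are treated as
# transparent and replaced with an arbitrary background.
# File names and the green range are assumptions chosen for this example.
import cv2
import numpy as np

frame = cv2.imread("avatar_greenscreen_frame.png")      # frame shot on green screen
background = cv2.imread("new_background.png")
background = cv2.resize(background, (frame.shape[1], frame.shape[0]))

# Mark everything within this HSV range as "green screen".
hsv = cv2.cvtColor(frame, cv2.COLOR_BGR2HSV)
mask = cv2.inRange(hsv, np.array([40, 80, 80]), np.array([80, 255, 255]))

# Keep the person from the original frame, take the rest from the new background.
person = cv2.bitwise_and(frame, frame, mask=cv2.bitwise_not(mask))
backdrop = cv2.bitwise_and(background, background, mask=mask)
cv2.imwrite("composited_frame.png", cv2.add(person, backdrop))
```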