#002 Deepfakes – The Creation and Detection of Deepfakes: A Survey

Highlights: What is real, really? Deepfake technology has advanced into the mainstream, making it difficult to prove the authenticity of videos and faces. Even though the technology itself isn’t inherently harmful, its unethical use has victimized many people across the globe.

In this blog post, we will learn about the underlying algorithms of these so-called Deepfakes, based on Generative Adversarial Networks (GANs). We will also explore the creation and detection of Deepfakes in the real world and how current architectures can be improved through research. So let’s begin!

Tutorial Overview

  1. Introduction
  2. Types Of Deepfakes & Attack Models
  3. Creating Deepfakes Using Neural Networks
  4. Reenactment Deepfakes

1. Introduction

What Is A Deep Fake?

Deep Learning + Fake = Deepfake

Any content generated by Artificial Intelligence that looks authentic to the human eye can be called a Deepfake. The process of producing Deepfakes involves the generation and manipulation of human images.

Today, Deepfakes are used in many domains and media such as forensics, finance, and healthcare. Take, for example, the ongoing trend of creating music videos with the face of the famous actor, Nicolas Cage; or, shopping applications that involve virtual try-on for clothes; or, realistic video dubbing of foreign films.

The Good & The Bad Of Deepfakes

There are numerous positive applications of Deepfakes, especially some of the entertaining ones. However, Deepfake technology has also come into the limelight due to its unethical and malicious aspects.

One such incident in 2017 involved a Reddit user who used deep learning to swap the faces of celebrities into pornographic videos and posted them online. This caused a major uproar in the media and a number of other Deepfake videos began to surface. Even Buzzfeed, in 2018, released a Deepfake video wherein former US President Barack Obama is talking about Deepfakes. This video was created using FakeApp, a Reddit user’s personal software project.

Over time, the creation of Deepfake videos has raised concerns over identity theft, impersonation, and the spread of misinformation on social media. Have a look at the trust chart for Deepfakes, as shown in the image below.

In recent times, the academic discussion around Deepfakes has grown with over 250 research papers published since 2018, as compared to just 3 in 2017.

Going ahead in this tutorial, we will understand the threats pertaining to Deepfake technology and the ways to mitigate them. Before that, we will study how Deepfakes are created and detected, and the current state of the technology. We will also learn about the basic design of Deepfake architectures. Finally, we’ll study how the defender can gain an advantage in the attacker-defender game.

Let us begin by going into detail about some of the categories of Deepfakes.

2. Types Of Deepfakes & Attack Models

Deepfakes are different from adversarial machine learning. While adversarial machine learning involves fooling a machine with maliciously crafted inputs, Deepfakes involve generating content that fools humans, not machines.

In simple words, Deepfake can be defined as –

“Believable media generated by a deep neural network.”

Broadly, there are four categories all Deepfakes can be divided into:

  1. Reenactment
  2. Replacement
  3. Editing
  4. Synthesis

Have a look at the image below that shows some of the different types of Deepfakes. The source identity is denoted by \(s \) and the target identity is represented by \(t \). In addition, \(x_{s} \) and \(x_{t} \) signify the images of these identities. The Deepfake generated from \(s \) and \(t \) is written as \(x_{g} \).


Reenactment

A Reenactment Deepfake is where \(x_{s} \) is used to drive the expression, mouth, gaze, pose, or body of \(x_{t} \).

  • Expression: This is the most common form of reenactment wherein \(x_{s} \) drives the expression of \(x_{t} \). These technologies drive the target’s mouth as well as pose, providing a wide range of flexibility. Even performances of actors in movies can be tweaked in post-production using this technology.
  • Mouth: Here, either \(x_{s} \) or an audio input \(a_{s} \) drives the mouth of \(x_{t} \). This type of reenactment enables realistic ‘dubbing’ of videos into another language.
  • Gaze: In this type of reenactment, the direction of \(x_{t} \)’s eyes and the position of the eyelids are driven by those of \(x_{s} \), thereby helping maintain eye contact during video interviews.
  • Pose: The head position of \(x_{t} \) is driven by \(x_{s} \), in this case. This type of reenactment is used for improving facial recognition software and for face frontalization.
  • Body: This is also known as pose transfer or human pose synthesis. It functions similarly to the above reenactments, with the only difference being that here, the pose of \(x_{t} \)’s body is being driven.

One of the dangerous aspects of Reenactment Deepfakes is that they give an attacker the tools to defame an individual or meddle with a piece of evidence. Impersonating an identity, spreading misinformation, tampering with surveillance footage, and generating embarrassing content for blackmail are also by-products of such Reenactment Deepfakes. These incidents can be either online or offline. Such applications of Reenactment Deepfakes can be categorized as The Attack Model Based On Reenactment Deepfakes.

Replacement 

A Replacement Deepfake is where the content of \(x_{t} \) is replaced with that of \(x_{s}\), preserving the identity of \(s \). There are two ways in which we can create a Replacement Deepfake.

  • By transfer: In this case, the content of \(x_{t} \) is replaced with that of \(x_{s} \). For example, in the fashion industry, Facial Transfer can visualize an individual in different outfits.
  • By swap: Here, the content transferred to \(x_{t} \) from \(x_{s} \) is driven by \(x_{t} \). For example, the Face Swap application is used to generate fun internet memes by swapping an individual’s face with that of a famous celebrity.

Just like Reenactment Deepfakes, Replacement Deepfakes are also notorious for their malicious applications. From revenge porn to fake political statements to spreading misinformation, these kinds of Deepfakes, if not used ethically, can become tools for defamation, blackmail, and propaganda. Such applications of Replacement Deepfakes can be categorized as The Attack Model Based On Replacement Deepfakes.

Editing

An Editing Deepfake is where the attributes of \(x_{t} \) are added, altered, or removed. For example, FaceApp provides this feature to users to alter their appearance for entertainment purposes. Be it changing a target’s clothes, facial hair, age, weight, or even ethnicity, Editing Deepfakes find application in cases where there is a defined target.

The Attack Model Based On Editing Deepfakes involves applications in misleading politics and defamation, wherein a sick leader is made to appear healthy, or sexual predators create dynamic profiles online to carry out malicious activities.

Synthesis

Synthesis is when a deepfake \(x_{g} \) is created with no target in mind. Using techniques such as Human Face and Body Synthesis, royalty-free stock footage can be created for movies as well as games.

The Attack Model Based On Synthesis, similar to that of Editing Deepfakes, involves the creation of fake online personas.

Among all the above different types of Deepfakes, Synthesis and Editing Deepfakes are active research topics. However, malicious applications of Reenactment and Replacement Deepfakes are a matter of great concern as these can become devastating tools for an attacker to gain control over an innocent’s identity.

Let us move ahead and learn how these Deepfakes can be created, the mathematics behind them, and the commonly faced challenges in their development.

3. Creating Deepfakes Using Neural Networks

The most popular method for creating Deepfakes is a combination of generative networks and encoder-decoder networks. Before we study the different variations and combinations of these networks, let us first study Loss Functions in the context of Deepfake creation.

Loss Functions 

Loss functions vary according to the learning objective. Take, for example, training a model \(M \) as an \(n \)-class classifier: \(M \)’s output will be \(y \in \mathbb{R}^{n} \), which is a probability vector. In order to train \(M \), forward propagation is performed and we obtain \(y'=M(x) \). Next, we compute the cross-entropy loss \(\mathcal{L}_{CE} \) by comparing \(y' \) to the ground-truth label \(y \). We then perform back-propagation to update the weights with this training signal.

The total loss over the entire training set \(X \), represented by \(\mathcal{L}_{CE} \), can be calculated as shown below.

$$ \mathcal{L}_{C E}=-\sum_{i=1}^{|X|} \sum_{c=1}^{n} y_{i}[c] \log \left(y_{i}^{\prime}[c]\right) $$

In the above expression, the predicted probability of \(x_{i} \) belonging to the \(c \)-th class is represented by \(y_{i}'[c] \).
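For a quick illustration of this loss outside any particular Deepfake network, here is a minimal NumPy sketch; the toy labels and predictions are made up purely for the example.

```python
import numpy as np

def cross_entropy(y_true, y_pred, eps=1e-12):
    """Cross-entropy loss summed over a batch.

    y_true: (N, n) one-hot ground-truth labels
    y_pred: (N, n) predicted probability vectors (rows sum to 1)
    """
    y_pred = np.clip(y_pred, eps, 1.0)   # avoid log(0)
    return -np.sum(y_true * np.log(y_pred))

# Toy example: 2 samples, 3 classes
y_true = np.array([[1, 0, 0], [0, 1, 0]], dtype=float)
y_pred = np.array([[0.7, 0.2, 0.1], [0.1, 0.8, 0.1]])
print(cross_entropy(y_true, y_pred))  # approx. 0.357 + 0.223
```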

There are many other loss functions that are popularly used in Deepfake networks. These include the \(L_{1} \) and \(L_{2} \) norms, \(\mathcal{L}_{1}=\left\|x-x_{g}\right\|_{1} \) and \(\mathcal{L}_{2}=\left\|x-x_{g}\right\|_{2}^{2} \).

However, \(L_{1} \) and \(L_{2} \) losses require paired images (e.g., of \(s \) and \(t \) with the same expression). They also don’t perform well when large offsets exist between the images, such as different poses or facial features. Such a scenario occurs in Reenactment Deepfakes when \(x_{t} \) has a different pose than \(x_{s} \), which is reflected in \(x_{g} \); in the end, we would like \(x_{g} \) to match the appearance of \(x_{t} \).

In the case of unaligned images, a common approach is to pass both the images through a Perceptual Model and measure the difference between the layer’s activations using Feature Maps.

This kind of loss is called the Perceptual Loss \(\left(\mathcal{L}_{\text{perc}}\right) \) and is widely used in image generation tasks. \(\mathcal{L}_{\text{perc}} \) is used in the creation of Deepfakes and is calculated using a face recognition network such as VGGFace. The idea behind using \(\mathcal{L}_{\text{perc}} \) is that the Feature Maps, i.e., the inner layers’ activations of the Perceptual Model, act as a normalized representation of \(x \) in the context of how the model was trained. Thus, when we measure the distance between the Feature Maps of two different images, we are measuring their semantic difference, which means how similar their noses and other finer details are.

Similarly, there is also a feature matching loss, represented by \(\mathcal{L}_{FM} \), that uses the last output of the network. The intuition behind \(\mathcal{L}_{FM} \) is that we consider the high-level semantics captured by the last layer of the perceptual model, such as the general shape and textures of the head.
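Here is a minimal PyTorch sketch of how such losses could be computed. It assumes torchvision ≥ 0.13 and uses an ImageNet-pretrained VGG16 purely as a stand-in perceptual model (the face networks mentioned above, such as VGGFace, are not part of torchvision); input normalization is omitted for brevity.

```python
import torch
import torch.nn.functional as F
from torchvision import models

# Stand-in perceptual model: ImageNet VGG16 (illustrative only).
vgg = models.vgg16(weights=models.VGG16_Weights.DEFAULT).features.eval()
for p in vgg.parameters():
    p.requires_grad_(False)

def feature_map(x, layer_idx=16):
    """Return the activations of the perceptual model at an inner layer."""
    for i, layer in enumerate(vgg):
        x = layer(x)
        if i == layer_idx:
            break
    return x

def perceptual_loss(x, x_g):
    """L_perc: distance between inner feature maps of x and x_g."""
    return F.l1_loss(feature_map(x), feature_map(x_g))

def pixel_losses(x, x_g):
    """Plain pixel-wise L1 and L2 losses for comparison."""
    return F.l1_loss(x, x_g), F.mse_loss(x, x_g)

# Toy usage with random images in place of real/generated faces
x, x_g = torch.rand(1, 3, 224, 224), torch.rand(1, 3, 224, 224)
print(perceptual_loss(x, x_g), pixel_losses(x, x_g))
```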

Now that we have understood the kinds of loss functions used to create Deepfakes, let’s learn the various combinations and variations of generative networks that are used in the creation of Deepfakes.

Use Of Generative Neural Networks

There are six different kinds of networks that are combined in various ways and used in the creation of Deepfakes.

  1. Encoder-Decoder Networks
  2. Convolutional Neural Networks (CNN)
  3. Generative Adversarial Networks (GAN)
  4. Image-To-Image Translation (pix2pix)
  5. Cycle GAN
  6. Recurrent Neural Network (RNN)

Have a look at the image below that shows the pictorial representation of 5 of these networks.

  1. Encoder-Decoder Networks (ED): As the name suggests, an ED includes at least two networks – an Encoder \(En \) and a Decoder \(De \). It has layers that are narrow towards its center, which forces the network to summarize the observed concepts when it is trained as \(De(En(x))=x_{g} \). This summary of \(x \), drawn from its distribution \(X \), is referred to as an Encoding or Embedding and is represented as \(En(x)=e \). Here, \(E=En(X) \) is referred to as the ‘latent space’.

    Deepfake technologies often employ multiple encoders or decoders and manipulate the encodings in order to influence the output \(x_{g} \). In case the encoder and decoder are symmetrical and the network is trained with the objective \(De(En(x))=x \), the network is termed an ‘Autoencoder’, wherein the output is simply a reconstruction of \(x \), denoted by \(\hat{x} \).
    A Variational Autoencoder (VAE) is another special kind of ED wherein the encoder learns the posterior distribution of the decoder, given the distribution \(X \). Interestingly, VAEs generate content more efficiently than Autoencoders. This is because the concepts in the latent space are disentangled, due to which the encodings respond better to interpolation and modification.
  2. Convolutional Neural Network (CNN): The convolutional layer in a CNN learns filters that shift over the input and form an abstract feature map as the output. As the network gets deeper, CNN makes use of up-sampling layers to increase dimensionality and pooling layers to reduce dimensionality. All these layers of convolution, pooling, and up-sampling help build better imagery.

    Contrary to fully-connected dense networks, CNNs learn pattern hierarchies in the data and turn out to be much more efficient at handling imagery.
  3. Generative Adversarial Networks (GAN): In 2014, Goodfellow et al. proposed the first GAN, which consists of two neural networks working against each other – a Generator \(G \) and a Discriminator \(D \). The Generator \(G \) creates fake samples, \(x_{g} \), with the aim of fooling the Discriminator \(D \). On the other hand, the Discriminator \(D \) learns to differentiate between real samples, \(x \in X \), and fake samples, \(x_{g}=G(z) \), where \(z \sim \mathcal{N} \) is noise sampled from a normal distribution.
    An adversarial loss is used to train the Generator \(G \) and the Discriminator \(D \), respectively, as shown below.

    $$ \begin{gathered} \mathcal{L}_{adv}(D)=\max_{D}\,\log D(x)+\log (1-D(G(z))) \\ \mathcal{L}_{adv}(G)=\min_{G}\,\log (1-D(G(z))) \end{gathered} $$

    The idea of this zero-sum game is that, through it, the Generator \(G \) learns how to generate samples that are indistinguishable from the original distribution. Once trained, the Discriminator \(D \) is discarded and \(G \) is used to generate content. A minimal sketch of this training loop is shown right after this list.

    In the case of imagery, such an approach produces photo-realistic images. There have been many proposals made to improve the performance of GANs. Two of the most popular image translation frameworks that work on the fundamental principles of GANs are Image-To-Image Translation, commonly known as pix2pix, and Cycle GAN.
  4. Image-To-Image Translation (pix2pix): This framework allows paired translations from one image domain to another. Here, the Generator \(G \) generates the image \(x_{g} \) using a visual context \(x_{c} \) as an input. The Discriminator \(D \) differentiates between \(\left(x, x_{c}\right) \) and \(\left(x_{g}, x_{c}\right) \).

    The Generator \(G \) can also be seen as an ED CNN with skip connections from \(E n \) to \(De \). This is called a U-Net and it enables the Generator \(G \) to produce high-fidelity imagery by bypassing the compression layers when needed.

    pix2pixHD is another framework proposed for especially generating high-resolution imagery with better fidelity.
  5. Cycle GAN: This type of Generative Adversarial Network improves on pix2pix in the sense that it enables image translation through unpaired training. The network converts images from one domain to another and then back again, forming a cycle consisting of two GANs. A cycle consistency loss, represented by \(\left(\mathcal{L}_{\text{cyc}}\right) \), ensures that an image mapped to the other domain and back is recovered, which removes the need for paired data.
  6. Recurrent Neural Networks (RNN): An RNN can take care of sequential as well as variable-length data. It remembers its internal state after processing \(x^{(i-1)} \) and can use it to process \(x^{(i)} \) and so on. These types of neural networks can handle audio and video exceptionally well, which is why they are used in Deepfake creation. Further advanced versions of RNNs include Long Short-Term Memory (LSTM) and Gated Recurrent Units (GRU).
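To make the adversarial loss above concrete, here is a minimal, self-contained PyTorch sketch of one GAN training step. The tiny fully-connected generator and discriminator are toy choices for illustration only, not the architectures used in any particular Deepfake system, and the generator update uses the common non-saturating variant of the loss.

```python
import torch
import torch.nn as nn

# Toy generator and discriminator (illustrative only).
G = nn.Sequential(nn.Linear(64, 256), nn.ReLU(), nn.Linear(256, 784), nn.Tanh())
D = nn.Sequential(nn.Linear(784, 256), nn.LeakyReLU(0.2), nn.Linear(256, 1), nn.Sigmoid())

opt_G = torch.optim.Adam(G.parameters(), lr=2e-4)
opt_D = torch.optim.Adam(D.parameters(), lr=2e-4)
bce = nn.BCELoss()

def train_step(x_real):
    batch = x_real.size(0)
    real, fake = torch.ones(batch, 1), torch.zeros(batch, 1)

    # Discriminator step: maximize log D(x) + log(1 - D(G(z)))
    z = torch.randn(batch, 64)
    x_g = G(z).detach()                      # do not backpropagate into G here
    loss_D = bce(D(x_real), real) + bce(D(x_g), fake)
    opt_D.zero_grad(); loss_D.backward(); opt_D.step()

    # Generator step: fool D into labelling G(z) as real (non-saturating loss)
    z = torch.randn(batch, 64)
    loss_G = bce(D(G(z)), real)
    opt_G.zero_grad(); loss_G.backward(); opt_G.step()
    return loss_D.item(), loss_G.item()

# Example: a batch of 16 flattened 28x28 "images" with values in [-1, 1]
print(train_step(torch.rand(16, 784) * 2 - 1))
```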

Feature Representations

There are multiple approaches that many Deepfake architectures use to capture and manipulate \(s \) and \(t \) ‘s facial structure, pose, and expression.

One of the ways to do this is through Facial Action Coding System (FACS) wherein we measure each face’s taxonomized Action Units (AU).

In another approach, monocular reconstruction is performed in order to obtain a 3D morphable model of the head, referred to as a 3DMM, from a 2D image. Herein, the pose and the expression are parameterized by a set of vectors and matrices. These parameters are then used to produce a 3D rendering of the head itself. In some cases, a UV map of the head or the body is also used, which gives the network a better understanding of the shape’s orientation.

Image segmentation is another approach that helps the network differentiate between the face, hair, and other bodily concepts. ‘Key Points’ or ‘Landmarks’, a set of defined positions on the face or body, are one of the most common representations. These positions can be efficiently tracked using open-source computer vision libraries and are often presented to the networks as 2D images with a Gaussian point at each landmark, as sketched below.
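Here is a minimal NumPy sketch of that last representation: rendering already-detected 2D landmark coordinates as a Gaussian heatmap image. The landmark positions below are made up; detecting them in the first place would be done with a library such as dlib or MediaPipe.

```python
import numpy as np

def landmark_heatmap(landmarks, height, width, sigma=2.0):
    """Render 2D landmarks as a single-channel image with a Gaussian blob
    centred on each (x, y) landmark position."""
    ys, xs = np.mgrid[0:height, 0:width]
    heatmap = np.zeros((height, width), dtype=np.float32)
    for (x, y) in landmarks:
        blob = np.exp(-((xs - x) ** 2 + (ys - y) ** 2) / (2 * sigma ** 2))
        heatmap = np.maximum(heatmap, blob)   # keep the strongest response per pixel
    return heatmap

# Toy example: three hypothetical landmark positions in a 64x64 image
hm = landmark_heatmap([(20, 30), (44, 30), (32, 48)], 64, 64)
print(hm.shape, hm.max())
```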

Facial boundaries and body skeletons are also good ways to capture and manipulate the facial structure and pose in Deepfake architectures.

For audio, however, the most popular approach is to split the signal into segments. Then, the Mel-Cepstral Coefficients (MCC) are measured for each segment in order to capture the dominant voice frequencies.
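As an illustration, here is how a closely related representation, the mel-frequency cepstral coefficients (MFCCs), could be extracted with librosa; the audio file name is hypothetical and the frame and hop sizes are just common choices for speech.

```python
import librosa

# Hypothetical driving audio; 16 kHz is a common sampling rate for speech.
y, sr = librosa.load("driver_speech.wav", sr=16000)

# 13 coefficients per ~25 ms frame with a 10 ms hop.
mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13,
                            n_fft=int(0.025 * sr), hop_length=int(0.010 * sr))
print(mfcc.shape)  # (13, number_of_frames)
```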

Feature representation is an important aspect of Deepfake architecture. But how are these Deepfakes actually created? Let’s get into the fundamentals of Deepfake creation.

Deepfake Creation Basics

Reenactment Deepfakes and Face Swap Deepfakes usually follow a particular process, or a variation of it, to generate \(x_{g} \). In this process, \(x \) is passed through a pipeline that performs the following actions.

  • Detects and crops the face
  • Extracts intermediate representations
  • Generates a new face based on a driving signal, say, another face
  • Blends the generated face back into the target frame

Have a look at the pictorial representation of the above process in the image below.
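In code, the pipeline can be sketched at a very high level as follows; every helper here (detect_and_crop, encode, generate, blend) is a hypothetical placeholder for whichever face detector, encoder, generator, and blending routine a given system actually uses.

```python
def create_deepfake_frame(target_frame, driving_signal,
                          detect_and_crop, encode, generate, blend):
    """One pass of the generic Deepfake pipeline described above.
    All four callables are placeholders for system-specific components."""
    face, bbox = detect_and_crop(target_frame)            # 1. detect and crop the face
    representation = encode(face)                         # 2. intermediate representation
    new_face = generate(representation, driving_signal)   # 3. new face from a driving signal
    return blend(target_frame, new_face, bbox)            # 4. blend back into the target frame

# Toy usage with trivial stand-ins for the four components
out = create_deepfake_frame(
    "target_frame", "driving_signal",
    detect_and_crop=lambda frame: ("face_crop", (0, 0, 10, 10)),
    encode=lambda face: "representation",
    generate=lambda rep, drv: "generated_face",
    blend=lambda frame, new_face, bbox: "blended_frame")
print(out)
```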


Overall, there are 6 approaches for driving an image.

  1. Let a network work on the image directly and then, perform the mapping itself.
  2. Train an ED network to disentangle the identity from the expression, followed by modifying or swapping the encodings of the target before passing it through the decoder.
  3. Add an additional encoding, say an AU or embedding, before passing it to the decoder.
  4. Convert the intermediate facial or body representation to the desired expression or identity before generation.
  5. Use the optical flow field from subsequent frames in a source video in order to drive the generator.
  6. Create a composite of the original content, such as hair and scene, combining it with the 3D rendering of the warped image or the generated content. Then, pass this composite through another network, such as a pix2pix network, in order to make the output more realistic.

Generalization

When we create a Deepfake network, it is very hard to achieve an identity-agnostic model. This is due to the fact that the model learns correlations between \(s \) and \(t \) during the training period. Generally speaking, Deepfake networks are trained or even designed to work only when a specific target or identity is given.

Let’s see how we can generalize the creation of a Deepfake.

Assume \(E \) is a model that extracts features from \(x \), and let \(M \) be a trained model that performs Replacement or Reenactment. Now, we can categorize the generalization of Deepfakes into three main models, listed below and sketched in code right after the list.

  1. One-To-One: This model uses a specific identity to drive another specific identity, as expressed below.

    \(x_{g}=M_{t}\left(E_{s}\left(x_{s}\right)\right) \) 
  2. Many-To-One: This model uses any identity to drive a specific identity, as expressed below.

    \(x_{g}=M_{t}\left(E\left(x_{s}\right)\right) \)
  3. Many-To-Many: This model uses any identity to drive any identity, as expressed below.

    \(x_{g}=M\left(E_{1}\left(x_{s}\right), E_{2}\left(x_{t}\right)\right) \)
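The three modes above are simply different ways of composing encoders and generators; the following sketch restates them in code, with every function name hypothetical and identity functions standing in for trained networks.

```python
# Hypothetical extractors (E, E_s, E1, E2) and generators (M, M_t);
# the names mirror the notation above and do not refer to any real library.

def one_to_one(x_s, E_s, M_t):
    """A specific source identity drives a specific target: x_g = M_t(E_s(x_s))."""
    return M_t(E_s(x_s))

def many_to_one(x_s, E, M_t):
    """Any source identity drives a specific target: x_g = M_t(E(x_s))."""
    return M_t(E(x_s))

def many_to_many(x_s, x_t, E1, E2, M):
    """Any source identity drives any target identity: x_g = M(E1(x_s), E2(x_t))."""
    return M(E1(x_s), E2(x_t))

# Toy usage with identity functions in place of trained networks
identity = lambda v: v
print(many_to_many("x_s", "x_t", identity, identity, lambda a, b: (a, b)))
```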

Challenges 

Every task comes with its set of challenges. Even the creation of Deepfakes has its own challenges, especially when it comes to creating realistic Deepfakes.

  • Generalization: High-quality images of a specific identity require a large number of samples of that identity. This is due to the fact that Generative networks, being data-driven, reflect the training data in their outputs. Additionally, a large dataset of the driver is usually much more accessible than that of the victim. This has resulted in research looking to minimize the amount of training data needed and enable the execution of a trained model on new target and source identities.
  • Paired Training: Training a neural network also sometimes involves presenting the desired output to the model for each input. Such kind of pairing of data is tedious and impractical, especially in the case of training on multiple identities and actions. There are three ways to overcome this challenge, which many Deepfake networks employ.
    • Using selected frames from the same video of \(t \) to train the Deepfake network in a self-supervised manner
    • Using unpaired networks such as Cycle GANs
    • Using the encodings of an ED network
  • Identity Leakage: When we train a model on a single input identity, or on multiple identities where the data is paired within the same identity, we face the problem of identity leakage. In such a scenario, the identity of the driver, such as \(s \) in Reenactment, is partially transferred to \(x_{g} \). This problem can be addressed with proposed concepts like attention mechanisms, few-shot learning, disentanglement, boundary conversions, and AdaIN, or by using skip connections to carry the relevant information to the generator.
  • Occlusions: When a part of \(x_{s} \) or \(x_{t} \) is obstructed by hands, hair, glasses, or any other item, we face the problem of occlusions. Another type of occlusion occurs when the eye or mouth region is hidden or changes dynamically. This results in artifacts such as unnaturally cropped regions or inconsistent facial features. One way to mitigate such occlusions is to perform segmentation and in-painting on the obstructed areas.
  • Temporal Coherence: Another challenge that we might face during the creation of Deepfake videos is the flickering or jittering of individual frames. This occurs because a typical Deepfake network works on each frame without the context of the preceding frames. Some researchers suggest providing this context to \(G \) and \(D \), implementing temporal coherence losses, using RNNs, or combining these approaches; a minimal sketch of one such loss follows this list.
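As an illustration of the last point, here is one simple way a temporal coherence loss could be written; this is a generic formulation assumed for the sketch, not the specific loss of any cited paper. It penalizes the difference between frame-to-frame changes in the generated clip and those in a reference clip.

```python
import torch
import torch.nn.functional as F

def temporal_coherence_loss(generated, reference):
    """Generic temporal coherence penalty.

    generated, reference: tensors of shape (T, C, H, W) holding T consecutive frames.
    Frame-to-frame changes in the generated clip are encouraged to match
    the frame-to-frame changes in the reference (e.g., driving) clip.
    """
    gen_diff = generated[1:] - generated[:-1]     # motion in the generated clip
    ref_diff = reference[1:] - reference[:-1]     # motion in the reference clip
    return F.l1_loss(gen_diff, ref_diff)

# Example with random 8-frame clips of 3x64x64 images
print(temporal_coherence_loss(torch.rand(8, 3, 64, 64), torch.rand(8, 3, 64, 64)))
```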

Now that we have studied the different types of Deepfakes, let us understand specifically how Reenactment Deepfakes work.

4. Reenactment Deepfakes

Deep Learning-based Reenactment Deepfakes can be organized according to their class of identity generalization as summarized in the table below. Have a look.

Let us review this categorization in a chronological manner and identify the significance of each approach.

Expression Reenactment 

In 2003, some researchers morphed models of 3D-scanned heads. It was later shown, in 2005, that this can be done without a 3D model, by warping the image while matching similar textures. Further, between 2015 and 2018, Thies et al. proposed using 3D parametric models to achieve high-quality, real-time results with depth-sensing and ordinary cameras.

As is clear, Expression Reenactment has been around since long before the popularization of Deepfakes. By turning an identity into a puppet, it gives attackers the flexibility to achieve the desired impact.

Nevertheless, modern deep learning approaches are the simplest way to generate believable content. Let’s look at the different approaches within Expression Reenactment.

In the following segment, the references are taken directly from the paper.

  1. One-to-One
    (Identity to Identity)

    In 2017, the authors of [176] proposed a Cycle GAN for facial reenactment that does not need data pairing. The two domains were video frames of \(s \) and \(t \). However, the authors noted that, in order to avoid artifacts in \(x_{g} \), both domains must share a similar distribution of poses and expressions.

    Bansal et al., in 2018, proposed Recycle GAN [15], a generic translation network based on CycleGAN, which improves temporal coherence and mitigates artifacts by including next-frame predictor networks for each domain. The authors employed facial reenactment by training their network for translating facial landmarks of \(x_{s} \) into portraits of \(x_{t} \).
  2. Many-to-One
    (Multiple Identities to a Single Identity)

    CVAE-GAN, a conditional VAE-GAN, was proposed by the authors of [16] in 2017. Herein, the generator is conditioned on an attribute vector or class label. But, manual attribute morphing is required to perform reenactment with CVAE-GAN. This is done by interpolating the latent variables, such as between target poses.

    In 2018, a large number of published source-identity-agnostic models proposed different methods to decouple \(s \) from \(t \). Let’s understand a couple of these approaches.
    • Facial Boundary Conversion: Here, the structure of the facial boundaries of the source is converted to that of the target and then passed through the generator [174]. Using ‘ReenactGAN’, the authors transformed the source’s boundary \(b \) to the target’s face shape \(b_{t} \) before generating \(x_{g} \) with a pix2pix-like generator.
    • Temporal GANs: Here, the authors of [162] proposed a temporal GAN that generates videos while disentangling the motion from the content. This so-called ‘MoCoGAN’ improved the temporal coherence of Deepfake videos.

      Each frame is generated from a target expression label \(z_{c} \) and a motion embedding \(z_{M}^{(i)} \) for the \(i \)-th frame, obtained from a noise-seeded RNN. MoCoGAN uses two discriminators – one for per-frame realism and the other for temporal coherence over the last \(T \) frames.

      Vid2Vid is another framework proposed by the authors of [169]. This framework is quite similar to pix2pix but for videos. Vid2Vid considers the temporal aspect by generating each frame based on the last \(L \) source and generated frames. It also performs next-frame occlusion prediction due to moving objects, by considering optical flow. Compared to MoCoGAN, this Vid2Vid framework is much more practical as the Deepfake is driven by \(x_{s} \), such as an actor, instead of crafted labels.

      This was taken one step further by the authors of [83], who achieved complete facial reenactment (gaze, blinking, pose, mouth, etc.) with just one minute of training video. In their proposed method, they used monocular reconstruction to extract the source and the target’s 3D facial models from 2D images. After that, they transferred the facial pose and expression of the source’s 3D model to the target, for each frame. Finally, they produced \(x_{g} \) with a modified pix2pix framework, using the last 11 frames of rendered targets, heads, UV maps, and gaze masks as the input.
  3. Many-to-Many
    (Multiple Identities to Multiple Identities)

    As a first-time attempt, the authors of [124], in 2017, utilized a Conditional GAN (CGAN) to create identity agnostic models.

    First, they extracted the inner-face regions as \(\left(x_{t}, x_{s}\right) \). Then, these regions were passed to an ED in order to produce \(x_{g} \), subject to the \(\mathcal{L} \) and \(\mathcal{L}_{adv} \) losses.

    A common problem faced with the use of CGAN was that the training data had to be paired such that images of different identities had to be combined with the same expression.

    The authors of [190] reenacted full portraits at low resolutions by decoupling the identities using a conditional adversarial autoencoder. With this, they managed to disentangle the identity from the expression in the latent space. Still, their approach was limited to driving \(x_{t} \) with discrete AU expression labels (fixed expressions) that capture \(x_{s} \).

    A similar label-based reenactment was presented in the evaluation of StarGAN [29], whose architecture is similar to that of a CycleGAN but for \(N \) domains such as poses, expressions, etc.

    In 2018, GATH was proposed by the authors of [127]. GATH could drive \(x_{t} \) using continuous action units (AUs), extracted from \(x_{s} \), as input. The use of continuous AUs enables smoother reenactments than previous approaches ([29], [124], [190]). In this approach, the generator is an ED network trained on the loss signals of three other networks: a Discriminator, an Identity Classifier, and a pre-trained AU Estimator. The classifier shares its hidden weights with the discriminator in order to disentangle the identity from the expressions.

Reenactment is one of the many approaches used for creating ‘believable’ Deepfakes. The world of Deepfakes has many advantages and equally many, if not more, disadvantages. The ethical use of Deepfake technology can push Deep Learning into a truly advanced era of human face and expression editing. This tutorial was aimed at teaching you the positive aspects of this technology and warning you of its possible malicious applications. As researchers and believers in technological advancement, we must take responsibility for our actions and not let the dark side of technology rule our minds. With this thought, let’s close this chapter and revise what we learned today.


Creating And Detecting Deepfakes

  • Deepfake is a combination of the phrases ‘Deep Learning’ and ‘Fake’
  • Deepfake is basically AI-generated content that looks real
  • There are four categories of Deepfakes – Reenactment, Replacement, Editing & Synthesis
  • Reenactment involves impersonating identity
  • Replacement involves swapping of faces
  • Editing & Synthesis involve altering appearances or creating entirely new ones
  • Most Deepfakes are created using combinations or variations of various generative networks
  • There are multiple Generative Neural Networks used to create Deepfakes – Encoder-Decoder (ED), CNN, GAN (pix2pix and Cycle GAN) and RNN
  • Deepfake creation has many challenges such as Generalization, Paired Training, Identity Leakage, Occlusions, and Temporal Coherence
  • Reenactment is of three types – One-to-One, Many-to-One, and Many-to-Many in terms of source and target identities
  • Deepfake has its advantages but also has some seriously malicious applications

Summary

That’s it for today, folks! We hope you enjoyed learning about the famous (or infamous) topic of Deepfakes. It is very interesting that such technologies are surfacing more and more in modern times. Maybe right now, we are using it negatively more than positively. But, we surely hope that the right use of such technologies can help Machine Learning and Deep Learning advance to newer heights. Do try to read more about Deepfakes and how to create them. Which celebrity would you like to swap faces with? Let us know in the comment section. We’ll meet you in our next tutorial. Until then, keep it real! 🙂