AI & Art with Cali Rezo (2): Generative models

To start off with this series of articles on the AI & Art project I did in collaboration with Cali Rezo, we’ll discuss some common generative models and how we applied them to her artwork to create new images in a “Cali-like” style.

The first idea we had when we started the project was to apply AI to image generation. So, I looked into well-known generative models, namely Variational Autoencoders (VAEs) and Generative Adversarial Networks (GANs).

In this article, I’ll recall the main ideas behind these two types of neural networks and show the results we got by applying them to our dataset.

Variational Autoencoders (VAEs)

First things first: what are plain Autoencoders (AEs)?

Autoencoders are a type of neural network specifically designed for efficient data coding. They consist of two parts: an encoder and a decoder. The gist is to create a compressed representation of the input with the encoder (by dimensionality reduction) and then, with the decoder, reconstruct an output that matches the input as closely as possible.

Diagram of a simple Autoencoder with its two main components: the encoder and the decoder. The input is embedded into its representation in the latent space and then an output is reconstructed. The model tries to minimize the difference between the generated output and the initial input.

The nice thing is that you are not required to use the entire model every time:

  • by feeding inputs to the encoder and stopping at the end of the first phase, you get a small version of your input that represents it well – if your model is good enough and well-trained, that is!
  • by skipping the first phase and feeding already compressed inputs to the decoder, you can generate new outputs of the same form as the inputs, but not present in your original dataset
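To make this concrete, here is a minimal sketch of the encode/decode round trip, assuming a purely linear Autoencoder written in plain NumPy on toy data (real models would of course use deeper, non-linear layers):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 16))   # toy dataset: 200 "flattened images" of 16 pixels

d_latent = 4                     # latent space smaller than the input space
W_enc = rng.normal(scale=0.1, size=(16, d_latent))
W_dec = rng.normal(scale=0.1, size=(d_latent, 16))

lr = 0.05
for step in range(500):
    Z = X @ W_enc                # encoder: compress to the latent representation
    X_hat = Z @ W_dec            # decoder: reconstruct the input
    err = X_hat - X
    loss = np.mean(err ** 2)     # reconstruction error: no labels needed
    # gradient descent on the mean squared reconstruction error
    grad_dec = Z.T @ err * (2 / err.size)
    grad_enc = X.T @ (err @ W_dec.T) * (2 / err.size)
    W_dec -= lr * grad_dec
    W_enc -= lr * grad_enc
```

Once trained, running `X @ W_enc` alone gives you the compressed version of an input, and `Z @ W_dec` alone turns any latent point back into an output – the two "half uses" described above.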

Another interesting property of AEs is that they are unsupervised learning models: you only need your inputs as reference, with no prepared labels. Indeed, at the end of each training step, the error the model made can be measured by comparing the original input with the reconstruction that the encoder-decoder chain produced. Therefore, you don’t need to split your initial dataset into train and test inputs and you don’t need to manually label anything prior to training your model.

The middle, compressed version of your input is usually called its “latent representation”.

Note: here, I automatically assumed that the latent dimension was smaller than the input dimension, which is why I talked about compression. In theory, however, you can have a larger dimension at the end of your encoder… the only problem being that this will probably lead your encoder to learn, at best, the identity function – i.e., it will be able to copy exactly the input, which is quite useless. But some experiments have shown that encoding to a larger space can still create representations with some interesting features.

You may be thinking that, overall, using your encoder and decoder one after the other is not very useful: after all, you’ll only get back what you put in, so what’s the point? Well, a cool variation on AEs is the Denoising AE. Suppose you have a bunch of images that have been corrupted in some way (for example, there is noise on your images) and you want to get back the “clean” images. A Denoising AE can do this: you feed it training images corrupted the same way and compare its output to their “clean” version; this way, the model adjusts its weights to “denoise” the images. When you then give it your corrupted dataset, it should be able to do its magic and retrieve the “clean” content.
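As a rough illustration of the idea (not our actual denoiser), here is a tiny linear denoising Autoencoder in NumPy, trained on synthetic low-rank data: it reconstructs from the corrupted input but is scored against the clean target.

```python
import numpy as np

rng = np.random.default_rng(1)
# synthetic "clean" data living on a 4-dimensional subspace of a 16-d space
X_clean = rng.normal(size=(200, 4)) @ rng.normal(size=(4, 16))
X_noisy = X_clean + 0.5 * rng.normal(size=X_clean.shape)  # corrupted copies

W_enc = rng.normal(scale=0.1, size=(16, 4))
W_dec = rng.normal(scale=0.1, size=(4, 16))

lr = 0.01
for step in range(2000):
    Z = X_noisy @ W_enc              # encode the *noisy* input...
    X_hat = Z @ W_dec
    err = X_hat - X_clean            # ...but compare to the *clean* target
    loss = np.mean(err ** 2)
    grad_dec = Z.T @ err * (2 / err.size)
    grad_enc = X_noisy.T @ (err @ W_dec.T) * (2 / err.size)
    W_dec -= lr * grad_dec
    W_enc -= lr * grad_enc
```

Because the latent space is too small to store the noise, the model has to keep only the underlying structure – which is exactly what denoising means here.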

What about Variational Autoencoders?

Now, sadly, Autoencoders are a bit limited. To put it simply and avoid diving into too much mathematics, the issue is that the latent space that the compressed inputs live in can be complex and hard to interpolate in. So, picking out new compressed inputs from it to generate brand new outputs might not be an easy task!

For example, it is possible that, after processing your various training inputs, you have mapped a few relevant clusters in your latent space… but you have no idea what the rest of the space looks like! So, you cannot actually generate anything other than what you already saw during training, which is kind of silly.

Our training inputs can be mapped to points in the latent space and reconverted back to outputs, no problem. But, if the latent space is complex, we only have some knowledge about the immediate surroundings of our few mapped points (the purple zones), and we have no idea what the rest of the space looks like. How, then, could we pick a point from it at random and decode it into an output?

So, how do Variational Autoencoders (VAEs) differ from classical Autoencoders? Well, they help us solve our problem by assuming a specific form for the compressed inputs: basically, you suppose that your latent space is nice – that it is, by design, easy to sample from. In other words, you won’t have trouble interpolating between whatever points you discovered while training, and you can generate new outputs from points in the latent space that weren’t produced by your initial dataset.

To do so, from a mathematical point of view, VAEs consider that your latent representations follow a Gaussian distribution and can therefore be summed up by only two parameters: a mean and a standard deviation.
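In code, this boils down to two small ingredients: sampling a latent point via the "reparameterization trick", and penalizing the distance to a standard normal with a KL-divergence term. A minimal NumPy sketch (the function names are mine, not from our implementation):

```python
import numpy as np

def reparameterize(mu, log_var, rng):
    """Sample z ~ N(mu, sigma^2) as mu + sigma * eps, with eps ~ N(0, 1).

    Writing the sample this way keeps it differentiable w.r.t. mu and log_var.
    """
    eps = rng.normal(size=np.shape(mu))
    return mu + np.exp(0.5 * log_var) * eps

def kl_to_standard_normal(mu, log_var):
    """KL(N(mu, sigma^2) || N(0, 1)), summed over the latent dimensions.

    This term is zero exactly when the encoder outputs mu = 0, sigma = 1,
    i.e. when the latent distribution is the standard normal we want.
    """
    return -0.5 * np.sum(1 + log_var - mu ** 2 - np.exp(log_var))
```

During training, this KL term is added to the reconstruction error, which is what keeps the latent space "nice" enough to sample from afterwards.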

Generative Adversarial Networks (GANs)

The problem with VAEs is that, even if they are really nice in theory, they make a very strong assumption about the latent space. Thus, they infer an overly simplified version of our inputs.

Generative Adversarial Networks follow a different path: as their name indicates, these generative models work by setting two adversaries against each other. More precisely, GANs rely on two neural networks, a generator and a discriminator, that are trained in parallel, each trying to fool the other. While the generator tries to create new samples that look real, the discriminator works hard to distinguish fake outputs from real ones. So, both networks compete in this zero-sum game and do their best to increase the loss of their adversary – i.e., the generator focuses on tricking the discriminator and, conversely, the discriminator is hellbent on exposing the truth.
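This zero-sum game can be written down as two small loss functions. Here is a sketch in NumPy, where `d_real` and `d_fake` are assumed to be the discriminator's outputs (probabilities of "real") on real and generated samples:

```python
import numpy as np

def discriminator_loss(d_real, d_fake):
    # the discriminator wants D(real) -> 1 and D(fake) -> 0
    return -np.mean(np.log(d_real) + np.log(1 - d_fake))

def generator_loss(d_fake):
    # the generator wants the discriminator to answer 1 on its fakes
    return -np.mean(np.log(d_fake))
```

Each training step then alternates: update the discriminator to lower its own loss, then update the generator to lower its own – which, by construction, raises the discriminator's.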

In a GAN, a generator and a discriminator battle endlessly to win a game: whoever tricks the other best!

By letting its two players duke it out rather than assuming properties of the latent space, a GAN allows for more complex – but also less interpretable – loss functions that depend dynamically on the dataset, and can therefore avoid the oversimplification problem we face with VAEs.

Last year, GANs got a lot of publicity after NVIDIA presented a “style-based architecture for GANs” capable of creating quite incredible fake face pictures such as these:

Set of face pictures generated by NVIDIA’s “style-based architecture for GAN” model (image taken from the reference paper published by T. Karras, S. Laine and T. Aila in March 2019)

The authors have now published the reference paper on arXiv.

Getting deeper with Deep Convolutional GANs (DCGANs)…

Basic GANs can be simple fully-connected networks. However, research has shown that, when you deal with images, you’re usually better off using convolutional neural networks (CNNs for short).

For a DCGAN, the generator usually starts off with blank noise as a seed and gradually learns to create better outputs through a deconvolutional network; the discriminator, for its part, is a convolutional network.
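To give an idea of how the generator "grows" an image from its seed, here is the usual output-size bookkeeping for transposed convolution ("deconvolution") layers, with kernel/stride/padding values that are common in DCGANs (an assumption for illustration, not necessarily our exact model):

```python
def deconv_out(size, kernel=4, stride=2, padding=1):
    # output size of a transposed convolution layer:
    # out = (in - 1) * stride - 2 * padding + kernel
    return (size - 1) * stride - 2 * padding + kernel

# a seed reshaped to a 4x4 feature map is doubled at each layer: 4 -> 8 -> 16 -> 32
sizes = [4]
for _ in range(3):
    sizes.append(deconv_out(sizes[-1]))
```

So each layer doubles the spatial resolution while (typically) halving the number of channels, until the generator reaches the target image size.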

Our results

We decided to train all three models (a simple GAN, a VAE and a DCGAN); but, compared to the other two, the simple GAN did not give very interesting results.

The “mode collapse” issue

Even though it worked on the famous MNIST dataset, our GAN model wasn’t powerful enough for the project at hand.

On the simple MNIST images, we got okay results after training for a while. For example, here is a small animation of what our model could produce as it was gradually learning from the dataset (we saved one image every 2,000 epochs and trained for a total of 100,000 epochs):

It is not a perfect generative model but, after training, the GAN does produce some recognizable characters.

On the other hand, with our own dataset, results were not as good… We ran into a known issue with GANs called “mode collapse”. As explained in this article, it can happen that the space you draw samples from is clustered into just a few groups, or “modes”. In that case, the generator can keep switching from one mode to another to trick the discriminator… but, this way, it only learns to generate part of the space. In other words, if it manages to trap the discriminator with one pattern, it will just stick with it forever.

This unintended behavior led our GAN to generate… this.

Not the most interesting generative model, to be honest.

Thanks to a lucky seed, I guess, we got one GAN model that generated interesting images, as you can see below (top row). However, those were quite noisy. Hence the need for a denoiser… for example, the Denoising Autoencoder we talked about earlier! After training one to clean up Cali’s images, we could use it on the outputs of our GAN to get somewhat better results (bottom row):

Still, this is not excellent and we lose a lot of information; we realized that simple GANs might not be the right way to go for our project. There are a few solutions to try and avoid this “mode collapse” (mentioned in the article), mostly relying on forward or backward sample study so that the discriminator doesn’t get fooled by the “cluster jump trick”. But this is not a trivial problem to solve, and we decided to focus instead on the VAE and the DCGAN, which quickly yielded far better results.

More promising results with the VAE and DCGAN!

Compared to our GAN, our other two models produced some really interesting output.

Cali was even able to take some of the generated images and rework them to create new pieces – she’ll talk about this in more depth in the article focused on her experience of the project.

Since we are working with images, our VAE relies on convolutional layers in both its encoder and decoder. The implementation is based on this article by Felix Mohr. The DCGAN is inspired by the examples in Aymeric Damien’s Github repository.

Just as a reference, here is the sort of images we gave our models to train on (created by Cali Rezo):

The animations below show how the 2 models train – just like we did with the MNIST dataset for the GAN, by regularly generating images during the training phase, we can check that they actually learn to produce better outputs:

Training evolution of the VAE model (trained for 500 steps, generating 1 image every 10 steps)

Training evolution of the DCGAN model (trained for 500 steps, generating 1 image every 10 steps)

We can already see that the VAE and the DCGAN give very different images! As some people have pointed out, VAEs often give “blurry” results. And it seems to take the VAE a bit more time before it manages to produce convincing outputs, whereas the DCGAN quickly makes something that is not too bad.

Now that our models are trained, we can ask them to generate as many images as we want! To do so, we only need to feed them new “seeds”: the decoder of the VAE transforms each seed from the latent dimension to the output one, and the generator of the GAN uses it to create an image capable of fooling the discriminator. We also add some post-processing to improve the outputs – for these two models, it mainly consists of forcing a black-and-white image and taking care of the contours.
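As an example, the "force a black-and-white image" step of such a post-processing can be as simple as thresholding the grayscale output (the threshold value below is illustrative, not our exact pipeline):

```python
import numpy as np

def force_black_and_white(img, threshold=0.5):
    """Map a grayscale image (values in [0, 1]) to pure black (0) and white (1)."""
    return (np.asarray(img) > threshold).astype(float)

blurry = np.array([[0.1, 0.45],
                   [0.62, 0.9]])
crisp = force_black_and_white(blurry)   # [[0., 0.], [1., 1.]]
```

The contour clean-up is a separate step; a median or morphological filter is a typical choice for that kind of touch-up.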

If we apply the post-processing function to the outputs of our VAE, we reduce blurriness significantly!

Finally, here are a bunch of images generated by the VAE (top) and the DCGAN (bottom):

A set of 30 images generated by our VAE from random seeds

A set of 30 images generated by our DCGAN from random seeds

We see that the VAE is able to find very interesting patterns in Cali’s images; in particular, it picks up on the “eye” hole present in many reference inputs. The DCGAN is not as good at capturing this pattern and tends to create big black blobs. In contrast, the VAE is less prone to outputting just a black square, even if it sometimes does.

Future developments

Of course, these results could be further improved by:

  • having more training examples (600 is a small number compared to most AI datasets)
  • creating a more complex architecture (with more layers, or larger ones)
  • adding further components like regularization or dropout (to avoid overfitting and produce more “inventive” patterns)

However, all these improvements would also make training more computationally expensive, hence requiring more powerful hardware than we had at our disposal.

It is worth noting that, today, a lot of services offer affordable or even free online computing power. If you’re interested in running your own AI models and think of doing it in the cloud, you can check out this video by Siraj Raval – who I already mentioned a long time ago but is still, to me, a great teacher for ML and AI stuff! But since the video is from 2017, keep in mind that things have changed since then and this is only a first reference…

Going further with VAEs and GANs?

We’ve just seen that DCGANs are a nice improvement over basic GANs for image generation… but many other variations of these generative models have been devised!

For example, some people try to take the best of both worlds and merge it into “VAE-GAN” models – the idea is to use your VAE’s decoder as a generator and plug a discriminator after it, so that you can (hopefully) learn more complex loss functions and avoid the oversimplification of VAEs.

And there are loads of other GAN models with funky names that test different loss functions to stabilize and regularize the mathematical computations.

VAEs are also under scrutiny by many teams but, so far, it has been said more than once that they are not as good as (well-tuned) GANs for image generation, mostly because the oversimplification they imply produces blurry images.

    1. Cali Rezo’s website:
    2. Wikimedia Foundation, “Autoencoder”, March 2019. [Online; last accessed 21-April-2019].
    3. Wikimedia Foundation, “Unsupervised learning”, April 2019. [Online; last accessed 21-April-2019].
    4. I. Shafkat, “Intuitively Understanding Variational Autoencoders”, February 2018. [Online; last accessed 21-April-2019].
    5. Wikimedia Foundation, “Generative adversarial network”, April 2019. [Online; last accessed 21-April-2019].
    6. Wikimedia Foundation, “Zero-sum game”, February 2019. [Online; last accessed 23-April-2019].
    7. T. Karras, S. Laine and T. Aila, “A Style-Based Generator Architecture for Generative Adversarial Networks”, March 2019. [Online; last accessed 27-April-2019].
    8. F. Mohr, “Teaching a Variational Autoencoder (VAE) to draw MNIST characters”, October 2017. [Online; last accessed 24-April-2019].
    9. Reference page of the MNIST dataset:
    10. A. Nibali, “Mode collapse in GANs”, January 2017. [Online; last accessed 26-April-2019].
    11. A. Damien’s Github repository:
    12. S. Raval, “How to Train Your Models in the Cloud”, May 2017. [Online; last accessed 25-April-2019].
    13. CE Kan, “What The Heck Are VAE-GANs?”, August 2018. [Online; last accessed 27-April-2019].
    14. H. Heidenreich, “What The Heck Are VAE-GANs?”, August 2018. [Online; last accessed 27-April-2019].
