AI & Art with Cali Rezo (3): Analytical models

In the last article, we focused on VAEs and GANs for image generation. This time, we’ll talk about analyzing images and trying to identify classes. We will also take this opportunity to talk about the usual traps and limits of AI classification.

To create analytical models, Cali and I decided on a set of “painting series” for our different examples, then trained some models to group the images into these classes.

In total, we had 9 groups for 3 selection criteria:

As you can see, some images can be assigned to multiple groups, as long as the groups are not for the same criterion; for example, a circular shape is both a “circular” and a “1 stroke” shape. Our goal was to study how well-known classification algorithms would perform on this dataset, with each model training focusing on one criterion at a time.

Note: in this article (and in Cali’s work in general, by the way!), when we talk about “eyes” we refer to areas of white that are completely enclosed in black. This is why most circular shapes are considered to be in the “no eye” category.

While labeling the training images, Cali came across something interesting: sometimes, only someone with insight into how the painting was made can pick the right category. In particular, some circular shapes might have been a “1 stroke” in theory but be partly outside the canvas and, therefore, look like a “2 strokes” painting. She’ll talk about this more in depth in the fifth article of the series, but it makes you wonder: how could a computer ever come up with this sort of intuition?

Some basics about AI classification

How to learn?

Machine learning classification can be broadly separated into 3 categories:

  • supervised learning: when you train your network, you provide it input-output pairs to learn the correct mapping
  • unsupervised learning: the model is only trained with input examples (but no corresponding output) and therefore performs self-organization and clustering on those inputs to create its own labels and groups
  • semi-supervised learning: this technique mixes the two above by using partly labelled data; the model tries to take advantage of the small amount of data that is labelled to make better predictions and “fill the holes” on the rest

Semi-supervised learning can be a nice compromise between supervised and unsupervised learning because it usually allows the model to perform better than it would without any label but doesn’t require as much preparation time.

What about those labels?

So, we know that models perform better when you give them labeled data. There is, however, a small issue with this: how do you actually set those categories? Well, the answer is usually: with a very careful human agent.

In other words: in order for our AI models to learn well, we first need to ask humans to prepare the data carefully for them. That’s a pity. Does it mean that this “machine learning” is not as autonomous as people want us to believe? Yes and no.

It is true that, when you give it enough data, an AI model can be excellent at finding patterns and interactions between the various items. But it is also true that, in particular for classification, there is a huge pre-processing step that must be performed by humans to compile, clean, sort and organize training examples.

At this point, I’d like to mention a growing trend of these past years: asking the public to label data. Although I’m not fundamentally opposed to the idea, I feel it is not always obvious to most people that we are actually having them do a data science engineer’s work for a small wage – or even for free… Be it through CAPTCHAs or actual missions offered by big players like Google or Amazon, it is now common to have people all over the world give some of their time to feed great big databases. Once again, it is not really an issue in itself; to me, the problem is more that AI is not always well understood by the public, and that we often hear those models will soon be able to do anything and replace us… while ignoring how neatly prepared the data need to be for them to even work!

There is, of course, much to say about the place of AI in our society today, and I don’t want to gloss over this in a few paragraphs. In the final article of this series, I will go a bit more in depth into this topic and share with you some of my latest thoughts on the matter. But the key thing is that, I believe, given how omnipresent AI already is – and will continue to be – in our daily lives, all of us need to think about what it represents and be aware of the power but also the limits of AI.

And as proof of that, let us take a look at some very simple machine learning models that may fool you into thinking they are good, when in fact they are just excellent parrots.

A basic clustering unsupervised classification

To begin with, let’s keep our labels aside for a while and try to do unsupervised learning with the simplest and the most common model for it: K-means clustering.

What is K-means clustering?

The idea of this algorithm is to choose a number of clusters beforehand and initialize as many “cluster centers” as needed at random. Then, you go through all of your data: for each item, you compute its “distance” to each of the cluster centers and assign it to the group whose center is closest; finally, you update the position of that cluster center to account for the new point and move on to the next item in your dataset. Most of the time, you stop after a given number of iterations or when the cluster centers don’t move anymore.

Now, when I talk about “distance”, here, I refer to common mathematical objects such as the Euclidean distance or the taxicab distance. Since computing those requires numbers, you need the characteristic properties of your inputs to be expressed as numbers, usually called “features”. For many datasets, you have to do a pre-processing step of feature extraction to transform your inputs’ properties into these numerical values.
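The algorithm described above can be sketched in a few lines of Python with NumPy (a minimal illustration, not production code – real-world libraries like scikit-learn use smarter initialization and optimizations):

```python
import numpy as np

def kmeans(points, k, n_iters=100, seed=0):
    """Minimal K-means: returns (centers, labels) for a (n, d) array of points."""
    rng = np.random.default_rng(seed)
    # Initialize the cluster centers on k random points of the dataset.
    centers = points[rng.choice(len(points), size=k, replace=False)]
    for _ in range(n_iters):
        # Assign each point to its closest center (Euclidean distance).
        dists = np.linalg.norm(points[:, None, :] - centers[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Move each center to the mean of the points assigned to it.
        new_centers = np.array([
            points[labels == i].mean(axis=0) if np.any(labels == i) else centers[i]
            for i in range(k)
        ])
        # Stop early when the cluster centers don't move anymore.
        if np.allclose(new_centers, centers):
            break
        centers = new_centers
    return centers, labels
```

Note that this version updates the centers once per pass over the dataset (the “batch” variant), which is the most common implementation; updating after every single point, as described above, is the “online” variant of the same idea.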

This small GIF gives you a good intuition of how the algorithm works:

K-means clustering on a 3-cluster dataset, with the evolution of the cluster centers step by step

You can see that, since we force the K-means clustering algorithm to think in terms of 3 clusters, even at the very beginning – when it has randomly put two cluster centers very close to each other – it is already trying to make two groups (the “blues” and the “reds”) out of those. But it eventually realizes that it should move one of its cluster centers up and to the left to have each blob matched to one cluster.

Applying this to the project at hand

How can we use K-means clustering in our specific case? As we’ve said before, the first thing is that we need to get characteristic properties for our items (namely, our images) as numbers.

While we could simply use the value of the pixels in our 100×100 images, doing so would result in a set of 10 000 features for each item… and therefore a 10 000-dimension space to analyze! Given how hard that is to picture, we preferred instead to compute two features for each of our examples – this way, we can plot the clusters in the 2D plane like in the previous animation:

  • blackness: the percentage of black pixels
  • eye-fill: the relative size of the “eye”, if there is one (otherwise the eye-fill is 0)
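These two features could be computed along these lines, assuming the images are stored as binary arrays where 1 marks a black pixel (a sketch of the idea, not Cali’s and my exact pre-processing):

```python
import numpy as np
from scipy import ndimage

def extract_features(img):
    """img: 2D array of 0/1 values, where 1 means a black pixel.
    Returns (blackness, eye_fill) -- the two features described above."""
    # Blackness: the percentage of black pixels.
    blackness = img.mean()
    # Label the connected white regions; an "eye" is a white region
    # that does not touch the border (i.e. completely enclosed in black).
    labels, n = ndimage.label(img == 0)
    border = set(labels[0]) | set(labels[-1]) | set(labels[:, 0]) | set(labels[:, -1])
    eye_sizes = [(labels == i).sum() for i in range(1, n + 1) if i not in border]
    # Eye-fill: the relative size of the biggest eye (0 if there is none).
    eye_fill = max(eye_sizes, default=0) / img.size
    return blackness, eye_fill
```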

With these 2 criteria, when we use K-means clustering with the Euclidean distance, the algorithm ends up with the two following clusters:

Resulting clusters after applying the K-means clustering algorithm to our dataset

This roughly means that the algorithm has separated between the images with a large eye and/or a lot of black around it (class 0, in blue), and the ones with about as much white as black and no big eye in the painting (class 1, in orange).

Although this is not a silly idea, if we look at some of the images and their position in this 2D space, we might be a bit surprised and confused by the class 0/class 1 separation… Look at some matches below:

Correspondence between some points in our 2D space and their images

While the images on the right could be grouped as “square” quite easily by the human eye, for the ones in the middle it is not always obvious why they should be in one group rather than the other.

Here, we see the limitation of both:

  • our features: clearly, even if they can serve as a first crude filter, “blackness” and “eye-fill” don’t encompass all relevant characteristics of Cali’s images
  • the model: making groups by simply looking at the Euclidean distance is not enough!

Switching to supervised classification: first, a simple algorithm…

Then, we added labels back in, but we tried to keep it simple by using a basic classification algorithm called K-Nearest Neighbours (or KNN). It is very easy to implement, but quite limited too.

The basics of the KNN algorithm

The idea of this algorithm is to classify a point by computing the distance between this point and all the others, then looking at the most common label among its K closest neighbors (K being an integer specified by the programmer). Just as with the K-means clustering algorithm, you need your inputs’ features to be numbers.

Let’s take a small example to understand it before applying it to our own project. Suppose you have 40 dots separated in two sets: blues and reds. They are spread out in the 2D plane but the two groups overlap a bit.

We will assume – arbitrarily – that the blues are the “good” (or “positive”) ones and the reds are the “bad” (or “negative”) ones.

We can use a KNN algorithm to determine the best “blue region” and “red region”, in other words the areas that contain the points closest to our “blues” or our “reds”. This 2D plot shows the results of our algorithm – the inner color of the circles is their real class and the background color is the class predicted by the KNN:

Plot of the “blue region” and “red region” as predicted by our KNN algorithm (the inner color of circles is their real class and the background color is their predicted class)

The classification is not perfect: some blue points are in the red zone and vice-versa. Still, it is pretty good: although the model is simple and the clusters quite mixed together, it manages quite well.

Under the hood, the process is quite straightforward. First, we gave the KNN our 40 labeled dots and a fine grid of points to assign to a group. Then, the KNN followed these steps, for each point in the grid:

  • it computed the Euclidean distance between the point and all the points in the dataset
  • it sorted the resulting list to have the closest points (its neighbors) at the beginning of the list
  • it looked at the K first neighbors and checked which group each of them was assigned to
  • it took the most common class out of the K results and added itself to this class in the list of predictions
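These four steps translate almost directly into code; here is a minimal sketch (the function name and exact structure are mine, not the implementation we actually used):

```python
import numpy as np
from collections import Counter

def knn_predict(train_pts, train_labels, query, k=5):
    """Predict the label of `query` with the K-Nearest Neighbours rule."""
    # 1. Euclidean distance between the query and all the points in the dataset.
    dists = np.linalg.norm(train_pts - query, axis=1)
    # 2. Sort to have the closest points (the neighbours) first.
    order = np.argsort(dists)
    # 3. Look at the groups of the K first neighbours...
    neighbour_labels = [train_labels[i] for i in order[:k]]
    # 4. ...and take the most common class among them.
    return Counter(neighbour_labels).most_common(1)[0][0]
```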

Now that we have our two regions, if we get new data, we will simply need to compute features for those new items, plot them in the 2D plane and see which region they fall in to label them.

Evaluating the performance of our model

Here, we have two categories: blues and reds. This is a case of binary classification.

To know how well our KNN does, we can rely on many classification metrics: accuracy, true positive rate, false positive rate, true negative rate, false negative rate, precision, recall, F1-score… For example, the accuracy is the percentage of data that was correctly labeled by the model (i.e.: the portion of predictions that match the real classes).

A lot of those metrics can usually be aggregated into what is called a confusion matrix. For our previous example, this gives us the following table (with the other aforementioned metrics, derived from the various counts, on the right):

Even if accuracy is the most common metric for AI model evaluation – and one of the most intuitive – the others are also useful and sometimes better at assessing the performance of the model (e.g. if our classes are imbalanced, meaning that one of the two sets contains way more points than the other one).
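As an illustration, the confusion-matrix counts and a few of the derived metrics for binary classification can be computed like this (a hypothetical helper, just to make the definitions concrete):

```python
import numpy as np

def binary_metrics(y_true, y_pred):
    """Confusion-matrix counts and derived metrics for binary labels
    (1 = positive/"blue", 0 = negative/"red")."""
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    tp = int(np.sum((y_true == 1) & (y_pred == 1)))  # true positives
    tn = int(np.sum((y_true == 0) & (y_pred == 0)))  # true negatives
    fp = int(np.sum((y_true == 0) & (y_pred == 1)))  # false positives
    fn = int(np.sum((y_true == 1) & (y_pred == 0)))  # false negatives
    accuracy = (tp + tn) / len(y_true)  # portion of predictions matching the real classes
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return {"tp": tp, "tn": tn, "fp": fp, "fn": fn, "accuracy": accuracy,
            "precision": precision, "recall": recall, "f1": f1}
```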

On our project’s dataset

Alright, let’s now see what a KNN could do with Cali’s images. We will keep the same two features, “blackness” and “eye-fill” (even though they are not optimal) and we will work on our “shape” criterion. This means we stay in the case of a binary classification with two possible labels: “square” (in blue) or “circular” (in red).

If we use the KNN algorithm on our dataset, here are the two regions we get:

Overall, this is a great classification! The KNN was able to create a “blue region” and a “red region” that give an excellent accuracy: about 99%! And all the other metrics would show the same thing.

There is a small issue, though: the model has fallen into the famous pitfall of overfitting. Given the weird shape of the regions, we can safely assume that if we give it new inputs, it won’t label them all properly: in other words, it is very good on the data it has already seen, but not that great at generalizing to new data.

Once again, this is proof that our features could be massively improved – and that perhaps we could add one or two more – and that simple models are ok but not good enough for complex data like ours.

… before upgrading to real AI with CNNs!

An overview of Convolutional Neural Networks

Convolutional Neural Networks (or CNNs) are a type of neural network that is currently the top solution for image analysis. They draw inspiration from Hubel and Wiesel’s studies of the animal visual cortex in the 1960s: the idea is to mimic how we animals are capable of spotting things in an image. In particular, CNNs are very good at extracting patterns, even if they are translated around the image.

I don’t intend to write a comprehensive explanation of Convolutional Neural Networks here, but this article by Sumit Saha does it well.

Here is, however, a rough overview of CNNs. These classification networks are composed of 3 parts:

  • features extraction layers
  • classification layers
  • a loss layer

The classification layers are simply fully-connected layers and the loss layer is usually the cross entropy between the predictions and the real labels. The feature extraction layers are a bit more complex, since they are composed of a convolution layer, a pooling layer and sometimes a regularization layer.

The convolution layers are the building block of the network; their role is to apply a sliding window over the whole input to find a pattern in whatever position it appears – in math terms, these “windows” are small matrices, or kernels, that you multiply a zone of the input image by. The pooling layers aggregate the results of the convolution layers to reduce the size of the next layers (which loses some information but also lowers the computation time). Finally, a regularization layer can be added to help avoid overfitting.

Another nice thing with CNNs is that their architecture greatly lowers the number of weights compared to a fully-connected network of equivalent size – this is called “weight sharing”. Basically, this is because each kernel is reused across the whole image: the various tiles are more powerful than a simple neuron but don’t need as many parameters.

Although they are really cool for the case at-hand, Convolutional Neural Networks have two drawbacks:

  • setting all their hyperparameters (number and size of layers, size of the kernels, padding or stride…) properly is not easy
  • they are quite slow to train

Nevertheless, here, they are the best choice for our project! So we implemented a basic CNN (inspired by the TensorFlow tutorial on this topic) with a few feature extraction layers, a dense layer for classification, some dropout and finally the cross entropy as loss function.
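To give you an idea, a small CNN like the one just described could be assembled with Keras along these lines (the layer counts, kernel sizes and dropout rate here are illustrative assumptions, not our exact architecture):

```python
import tensorflow as tf

def build_shapes_model():
    """A small CNN sketch for 100x100 grayscale images and 2 classes
    ("square" / "circular"). Hyperparameters are illustrative only."""
    inputs = tf.keras.Input(shape=(100, 100, 1))
    # Feature extraction: convolution + pooling blocks
    x = tf.keras.layers.Conv2D(32, 5, activation="relu")(inputs)
    x = tf.keras.layers.MaxPooling2D(2)(x)
    x = tf.keras.layers.Conv2D(64, 5, activation="relu")(x)
    x = tf.keras.layers.MaxPooling2D(2)(x)
    # Classification: a dense layer, with some dropout as regularization
    x = tf.keras.layers.Flatten()(x)
    x = tf.keras.layers.Dense(128, activation="relu")(x)
    x = tf.keras.layers.Dropout(0.4)(x)
    outputs = tf.keras.layers.Dense(2, activation="softmax")(x)
    model = tf.keras.Model(inputs, outputs)
    # Loss layer: cross entropy between the predictions and the real labels
    model.compile(optimizer="adam",
                  loss="sparse_categorical_crossentropy",
                  metrics=["accuracy"])
    return model
```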

Preparing the datasets – and doing a super quick analysis on-the-go

As mentioned earlier, we are going to try various networks, one for each of our 3 criteria. They will be called the “shapes model”, the “eyes model” and the “strokes model”, respectively.

Now, to train a neural network, remember that we need to split our data into two sets: the training and the test sets. This way, we will be able to detect if we are overfitting and assess the performance of our model correctly.

Note: if you want to optimize your model, you sometimes add a “hyperparameter tuning” step that requires you to split your data in three: a training, a validation and a test set.

This means that we will have 3 big datasets, each split into training and test sets:

  • the “shapes” training/test sets: 200/59 images
  • the “eyes” training/test sets: 370/111 images
  • the “strokes” training/test sets: 440/87 images
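Such a split can be done in one call with scikit-learn; this sketch uses dummy stand-ins for the real “shapes” data (the `stratify` option, which keeps the class repartition similar in both sets, is an assumption about how you might want to split):

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Dummy stand-ins for the 259 "shapes" images and their binary labels.
images = np.zeros((259, 100, 100))
labels = np.array([0, 1] * 130)[:259]

# Hold out 59 images as the test set, keeping the class balance similar.
X_train, X_test, y_train, y_test = train_test_split(
    images, labels, test_size=59, stratify=labels, random_state=0)
```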

And each of these datasets have a given number of classes:

  • the “shapes” dataset has 2 labels:
    • label 0: square
    • label 1: circular
  • the “eyes” dataset has 3 labels:
    • label 0: no eye
    • label 1: 1 eye
    • label 2: 2+ eyes
  • the “strokes” dataset has 4 labels:
    • label 0: 1 stroke
    • label 1: 2 strokes
    • label 2: 3 strokes
    • label 3: 4+ strokes

We can perform a very simple and quick analysis of our data to check the class repartition – this will tell us whether our categories are imbalanced or whether they all contain roughly as many items.

“Shapes” dataset
“Eyes” dataset
“Strokes” dataset

These bar plots show us that our training sets are quite balanced – apart from the last one, where the first class has a bit fewer images. As for the test sets, it doesn’t really matter if they are imbalanced or not since we’ll be looking at each image one by one anyway!

This is good because this tells us 2 things:

  1. common metrics that don’t necessarily deal with imbalanced classes will work: for example, accuracy should be relevant
  2. our “baseline accuracy”, i.e. the accuracy we would get by always guessing the most common label, matches our intuition:
  • for the binary classification on the “shapes” criterion, it would be 50%
  • for the 3-labels classification on the “eyes” criterion, it would be about 33%
  • for the 4-labels classification on the “strokes” criterion, it would be around 30%, which is fairly close to the 25% of a perfectly balanced dataset
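Computing this baseline accuracy is straightforward: it is simply the proportion of the most common label in the dataset. For example:

```python
from collections import Counter

def baseline_accuracy(labels):
    """Accuracy obtained by always predicting the most common label."""
    counts = Counter(labels)
    return max(counts.values()) / len(labels)
```

On a perfectly balanced dataset with n classes, this is exactly 1/n; any imbalance pushes it up, which is why our slightly imbalanced “strokes” dataset gives around 30% rather than 25%.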

Training our network

In order for the model to learn, you pass it the images and matching labels from the training set; it looks at all of them and tunes its weights to reduce its loss as much as possible – a number representing how far off the real labels the model’s predictions were at this round.

Here is an example of our training loss for the “shapes model” throughout training:

Evolution of the loss of our “shapes model” CNN throughout training

Note: the loss is low at the very beginning because, at that point, the network has only looked at a few batches; it hasn’t seen a lot of examples yet and can easily overfit on those to get a falsely excellent loss.

The results of our 3 models

To keep on working on binary classification, the first model we trained was the “shapes model”. This CNN tries to predict if an image falls into the “square” or “circular” category.

We get the following confusion matrix and interesting metrics:

Overall, we did manage to improve our accuracy compared to the baseline (remember, it is around 50%), even if the model isn’t excellent.

Sadly, the two other models – the “eyes model” and the “strokes model” – showed roughly the same behavior: they increased the accuracy a bit from their baselines, but were not successful in going over 50% accuracy… it seems the other criteria were too hard for our AI models to grasp!

Taking a look at the model’s errors

But let’s go back to our “shapes model” for a moment.

If we take a look at some of the images that were wrongly classified by the CNN, we sort of understand why it can be difficult for the network to distinguish between the “square” and “circular” classes for them:

Some examples of images wrongly classified by the CNN; we show both the predicted and the true label (0 is “square”, 1 is “circular”)

Looking at the two top-left images that were truly labeled “circular” and “square” by Cali respectively, we notice that the difference is quite subtle and even a human eye could have trouble picking a category with absolute certainty. Thus it is not very surprising that our model predicted the wrong class for those examples…

What is worth mentioning here is that we are not aware of the uncertainty of our model when it makes its predictions. For example, for these two images: did it output the wrong classes with absolute certainty, therefore proving it learnt the wrong thing, or was it in doubt and randomly picked one label (because we forced it to choose)? If it is the latter, then the model is actually a bit like me: while Cali has insight into what she wanted and can decide on a category, I am a mere enthusiastic amateur who has trouble choosing a label in this case! We will talk more about this problem of model uncertainty in the next article, when we discuss the explainability of AI models.
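To make the idea concrete, the softmax output of a classifier can be read as a (rough) confidence score, and very different levels of certainty can hide behind the same predicted class. The logit values below are made up for illustration:

```python
import numpy as np

def softmax(logits):
    """Turn the network's raw outputs (logits) into probabilities."""
    z = logits - logits.max()  # shift for numerical stability
    e = np.exp(z)
    return e / e.sum()

# Two hypothetical outputs for a "square (0) vs circular (1)" image:
confident = softmax(np.array([4.0, -2.0]))  # ~[0.998, 0.002]
hesitant = softmax(np.array([0.1, -0.1]))   # ~[0.55, 0.45]
```

Both vectors predict class 0, but the first model is essentially certain while the second is basically flipping a coin – a distinction that a bare predicted label hides entirely.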

What’s next?

So our conclusion is that, according to this – crude, I’ll admit – analysis, classifying Cali’s paintings is no trivial task and it is not truly possible for an AI model to perform well on this! Notice, however, that classifying abstract art is sometimes hard for humans too… in a way, it is to be expected that an AI model has trouble with this!

In the fourth article of this series, I will talk a bit about a current issue that is arising in AI: the explainability of neural networks and machine learning models – or, more precisely, the lack thereof. In particular, we will focus on how this affects the industrial uses of AI and how people are trying to remedy it.

As I said earlier in this article, we will also talk about AI model uncertainty and explore various questions: how can we assess how confident a model is in its predictions? What tools do we have at our disposal to simulate doubt in AI? How important is uncertainty in industrial applications?

  1. Cali Rezo’s website:
  2. Wikimedia Foundation, “Supervised learning”, April 2019. [Online; last access 5-May-2019].
  3. Wikimedia Foundation, “Unsupervised learning”, April 2019. [Online; last access 5-May-2019].
  4. Wikimedia Foundation, “Semi-supervised learning”, April 2019. [Online; last access 5-May-2019].
  5. Wikimedia Foundation, “k-means clustering”, April 2019. [Online; last access 5-May-2019].
  6. Wikimedia Foundation, “Euclidean distance”, April 2019. [Online; last access 5-May-2019].
  7. Wikimedia Foundation, “Taxicab geometry”, April 2019. [Online; last access 5-May-2019].
  8. G. Seif, “The 5 Clustering Algorithms Data Scientists Need to Know”, February 2018. [Online; last access 5-May-2019].
  9. Wikimedia Foundation, “k-nearest neighbors algorithm”, April 2019. [Online; last access 5-May-2019].
  10. Wikimedia Foundation, “Confusion matrix”, February 2019. [Online; last access 5-May-2019].
  11. K. Ping Shung, “Accuracy, Precision, Recall or F1?”, March 2018. [Online; last access 5-May-2019].
  12. Wikimedia Foundation, “Convolutional neural network”, May 2019. [Online; last access 5-May-2019].
  13. S. Saha, “A Comprehensive Guide to Convolutional Neural Networks – the ELI5 way”, December 2018. [Online; last access 5-May-2019].
  14. R. Gómez, “Understanding Categorical Cross-Entropy Loss, Binary Cross-Entropy Loss, Softmax Loss, Logistic Loss, Focal Loss and all those confusing names”, May 2018. [Online; last access 5-May-2019].
  15. TensorFlow’s doc, “Build a Convolutional Neural Network using Estimators”. [Online; last access 5-May-2019].

3 thoughts on “AI & Art with Cali Rezo (3): Analytical models”

  1. “not truly possible for an AI model to perform well on this”? Ever since a computer system beat the best Go player, I no longer believe in these impossibilities. And, after all, that was also a matter of black and white shapes 🙂

    1. Yep, I probably should have said: “our models are not capable of it”! It is true that classifying these images is not simple, and that we couldn’t devote as much time to it as the folks at Google did with AlphaGo 😉
