This article is also available on Medium.
Have you ever had tremendous accuracy on your machine learning model in just half an hour? Or conversely, have you ever struggled for days with no idea whether you were moving forward or backward?
Whenever you’re working on an AI project, your goal will be to give some data to an algorithm so it can gradually “learn” by adjusting its internal parameters in order to match as closely as possible the result you asked for. There are a lot of unspoken assumptions in this sentence: what does “adjusting” mean? What are those internal parameters? Who says it’s “close enough”?…
Each of those would require several articles, but I want to talk a bit more about the last point here. It is related to the notion of metric. When you start working on your project, you should decide how you are going to evaluate the performance of your model(s); this is done by choosing a specific metric (i.e. a mathematical formula) that will be applied to all your data and model predictions to measure this “closeness”. The list of possible metrics is set by the type of problem you’re solving (regression, classification, multi-classification, clustering…), the data you have, the salient points you want to study, etc. All in all: choosing the metric is one of the key moments in a ML project, and one of the main reasons a data scientist is there in the first place!
But now, here’s another question: after you’ve experimented with dozens of settings, hundreds of hyper-parameters and tens of algorithms, and you’ve applied your metric to measure performance, how do you know if this performance is actually good? Or in other words: what should you compare your results against to make sure they’re correct and interesting?
The context: a little personal anecdote
In a recent article, I talked about the importance of data in machine learning and how having a somewhat biased training dataset can completely overthrow a project. In particular, a very unbalanced dataset can have pretty catastrophic consequences on your work if you’re not careful.
I myself was confronted with this issue a few years ago, during an engineering-studies-related internship in ML. I had the chance to work on the Scamdoc product (more on this in an old article, if you wish 😉 ), an AI-based classifier that helps you separate “good” from “scamulous” emails and domain names. I won’t dive into all the details of the project itself but rather focus on a problem we had with our training data.
When they contacted me to participate in the Scamdoc development, the men behind this idea, Anthony and Jean-Baptiste, had already been running another website for quite a while: www.signal-arnaques.com. This participative site lets users from all around the world report scams and it is now a huge search engine that lists over 250 000 scams of various types: phishing, job scams, fraudulent sites… Of course, this big database was an essential source of data for the creation of their new ML tool.
The problem? By definition, this database listed scam emails and scam domain names. It had been slowly filled with bad or at least suspicious stuff… and there were virtually no “good” emails or domains in it. We wanted to train a classifier to distinguish “good” from “bad” but we were missing examples for the first category. Therefore when we first fed this to our model, it ended up having the following expected but hardly valid behavior: “just say it’s scam all the time, that should work”.
Now, I’m only talking about the very first initial trials. Since then, we have obviously leveraged various solutions to mitigate this issue and used enough data of each kind during training to achieve correct predictions – and actually great performance, eventually 🙂
My point is that a naive or inexperienced data scientist could have looked at our results from this first afternoon in awe and said: “your model has an accuracy of over 95%!”… without realizing it was no better than a very dumb approach (namely: always give the same answer). This is because he/she would lack the notion of “baseline performance”.
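The trap from this first afternoon is easy to reproduce. Here is a minimal sketch with hypothetical numbers (a made-up dataset that is 95% scams, not the actual Scamdoc data):

```python
# Hypothetical toy dataset: 95 scam samples, 5 legitimate ones.
labels = ["scam"] * 95 + ["legit"] * 5

# The "dumb" model: always answer "scam", no learning involved.
predictions = ["scam"] * len(labels)

# Accuracy looks impressive... but only because the dataset is unbalanced.
accuracy = sum(p == y for p, y in zip(predictions, labels)) / len(labels)
print(f"Accuracy: {accuracy:.0%}")  # prints "Accuracy: 95%"
```

The 95% figure says nothing about the model here; it only mirrors the class imbalance of the data.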
What are baseline results?
This idea of getting initial results using the “dumb” or “naive” approach is called “computing your baseline results”. It is a core step in every machine learning project. Why? Because it ensures that you have a point of comparison for all your future tests (the “not-naive” ones), which is a simple solution computed in the context of the problem.
For example, saying that “random is a 50/50% chance” is often inexact. In data science, random depends on the distribution of your training dataset; if you have 30 photos of dogs and 70 photos of cats, then a random guess for the animal (that gives either “dog” or “cat”) is more likely to match cats than dogs. Similarly, the zero-rule algorithm, the “always give the same answer” process described above, would reflect this imbalance.
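These naive strategies can be compared with a quick back-of-the-envelope computation. A minimal sketch, using the hypothetical 30 dogs / 70 cats split from above:

```python
# Class distribution of the hypothetical training set: 30% dogs, 70% cats.
p_dog, p_cat = 0.3, 0.7

# A uniform coin flip (50/50) is right half the time, whatever the data.
uniform_acc = 0.5 * p_dog + 0.5 * p_cat     # 0.50

# A guess drawn from the training distribution matches cats more often.
distribution_acc = p_dog ** 2 + p_cat ** 2  # 0.09 + 0.49 = 0.58

# The zero-rule (always answer "cat", the majority class) does best.
zero_rule_acc = p_cat                       # 0.70

print(uniform_acc, distribution_acc, zero_rule_acc)
```

Notice how the “random” baselines already disagree with the theoretical 50% as soon as the classes are unbalanced: this is exactly why the baseline must be computed on your own dataset.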
Basically, by applying those very direct computations to your dataset, you will be able to establish the base performance. Then, let’s hope your models do better than this! Whenever you find a more accurate model, you can refine your comparison by taking this one as baseline and then try to beat this new reference.
As explained in this great article by J. Brownlee, having a poor baseline performance is no reason to worry. It simply means that your problem is inherently quite hard to solve. On the other hand, if your complex models are unable to beat this baseline, then there’s an issue. This shows that your tweaks and optimizations are not enough to improve on the naive guess; you might need more data, another type of model, different parameters… or even to completely reframe the problem!
Examples of algorithms to compute initial baseline results – and the importance of explainability
When you are working on a regression problem, you can take the mean or the median of your initial data as the baseline result.
For classification (or multi-classification), the zero-rule or a simple guess at random can be basic approaches.
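Both kinds of baselines are simple enough to compute by hand. Here is a sketch using only the Python standard library and made-up toy data; if you use scikit-learn, its DummyRegressor and DummyClassifier classes implement these same strategies:

```python
import statistics
from collections import Counter

# Hypothetical toy data for illustration.
regression_targets = [3.0, 5.0, 7.0, 100.0]
classification_labels = ["cat", "cat", "dog", "cat"]

# Regression baselines: predict a single constant for every sample.
mean_baseline = statistics.mean(regression_targets)      # 28.75, pulled up by the outlier
median_baseline = statistics.median(regression_targets)  # 6.0, more robust

# Classification baseline: the zero-rule, i.e. the most frequent class.
zero_rule_prediction = Counter(classification_labels).most_common(1)[0][0]
print(mean_baseline, median_baseline, zero_rule_prediction)  # 28.75 6.0 cat
```

The mean/median choice is itself instructive: if your targets contain outliers, the median baseline will likely be the harder one to beat.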
Some data scientists prefer to use slightly more advanced techniques to compute their baseline results. For example, Vivek likes the Naive Bayes algorithm because it is a bit more accurate than the previous naive techniques but still has a very interesting feature for baseline models: explainability.
Complex models can be hard, if not impossible, to understand, acting as a “black box” that simply takes in samples and spits out predictions with no way of determining its process. On the other hand, more basic algorithms like linear regression, decision trees or Naive Bayes classifiers are easy enough to interpret. This is very important, especially in real-life applications, because it allows data scientists and decision makers to reflect upon the results. While an undecipherable model gives you predictions that you are forced to accept or reject fully, “explainable” models are more interesting decision-wise since they let you examine the process that led to the result. Also, you can possibly correct some mistakes or nudge the model in the right direction according to some external rules so the training is more accurate given the real-world situation.
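To illustrate this explainability, here is a minimal from-scratch multinomial Naive Bayes on a tiny hand-made corpus (all data hypothetical). Every word’s contribution to a class score is a simple smoothed count ratio that you can print and inspect, which is exactly what makes such a baseline easy to discuss with decision makers:

```python
from collections import Counter
import math

# Tiny hand-made corpus, for illustration only.
train = [
    ("win money now", "scam"),
    ("claim your prize money", "scam"),
    ("meeting at noon", "ok"),
    ("lunch meeting tomorrow", "ok"),
]

# Count word occurrences per class (multinomial Naive Bayes).
word_counts = {"scam": Counter(), "ok": Counter()}
class_counts = Counter()
for text, label in train:
    class_counts[label] += 1
    word_counts[label].update(text.split())

vocab = {w for counts in word_counts.values() for w in counts}

def log_likelihoods(text):
    """Per-class log-probabilities; each word's contribution is inspectable."""
    scores = {}
    for label in class_counts:
        total = sum(word_counts[label].values())
        # Start from the class prior...
        score = math.log(class_counts[label] / sum(class_counts.values()))
        for w in text.split():
            # ...then add each word's smoothed likelihood (Laplace: +1 / +|vocab|).
            score += math.log((word_counts[label][w] + 1) / (total + len(vocab)))
        scores[label] = score
    return scores

scores = log_likelihoods("win a prize")
prediction = max(scores, key=scores.get)
print(prediction)  # "scam" wins on this toy corpus
```

Unlike a deep network, you can trace the prediction back word by word (“win” and “prize” only ever appeared in scam messages), correct a suspicious count, or argue about it with a domain expert.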
Baseline results are essential to have a point of comparison during the more complex trials. Thanks to those, you are sure to take into account the particularities and difficulties of the problem at hand instead of blindly referring to some theoretical absolute values. This can be crucial when your training dataset is poorly balanced – churn rates, hardware fault detection or spam filtering are just a few examples of this issue in the industry.
Initial baseline results should be as easy to interpret as possible, and there are some well-known techniques to establish the starting baseline performance. Then, the models you carefully parametrize and train will hopefully do better, according to the metric you chose, and you can gradually refine your training process by iteratively beating each new baseline.
To go further…
- J. Brownlee, “How To Get Baseline Results And Why They Matter” (https://machinelearningmastery.com/how-to-get-baseline-results-and-why-they-matter/), November 2014. [Online; last access 13-03-2021]
- Vivek, “Machine Learning — Baseline helps in comprehensibility” (https://firstname.lastname@example.org/machine-learning-why-baseline-is-important-cc63c857a56d), December 2018. [Online; last access 13-03-2021]
- msg, “Zero rule algorithm” (https://machinelearningcatalogue.com/algorithm/alg_zero-rule.html). [Online; last access 13-03-2021]