Bias in data (scientist) = bias in ML

Because data science is mainly about… data and data scientists, surprisingly 😉

This article is also available on Medium.

The field of machine learning (ML) is an ever-growing trend that aims at creating programs “as intelligent as us humans”. However, the word “intelligence” is, in truth, quite an overstatement – and experts in the domain have admitted that it should not have stuck. To be honest, machine learning allows you to solve very narrow questions pretty well, but it does not resemble our human way of thinking. Rather, it is about giving a program the ability to extract patterns from a set of data, in order to approximate a function that can solve the problem at hand.

As I’ve mentioned in previous blog posts about AI, it all relies on this initial set of data (called the “training dataset”). A machine learning algorithm is basically a blank slate, like a small infant that looks at hundreds of examples to learn right from wrong – and to slowly make the connection as to what it means to be right or wrong! This is the training phase during which the algorithm tweaks its thousands of little knobs to test out millions of combinations: the algorithm searches for the combination that gives the correct answer in as many cases as possible when looking at the training dataset examples. Ultimately, the goal of machine learning is to have a somewhat “able” program that can predict a credible answer for a new example – i.e. an example that is not in the training dataset but in a totally different (and not overlapping) test dataset. The type of answer can be quite different from ML algorithm to algorithm (it can be the price of a house, the type of animal in a picture, the evolution of stock prices…).

Still, this prediction must be considered and used with caution:

  • first: one algorithm always answers the same type of questions. If you train your ML program to distinguish between photos of cats and dogs, it will not be able to predict what weather we’ll have tomorrow. This idea of having one program capable of solving multiple questions is commonly known as artificial general intelligence (AGI) – many researchers believe it can’t be achieved (either ever, or at least not for a very long time).
  • second: there is always a degree of confidence for the answer. Since the algorithm has never been told an actual precise rule and had to infer from the initial set of data the most logical patterns, it has no way of asserting with absolute certainty that the prediction is exact. Instead, it needs to provide both the answer and the matching confidence.

Note: sadly, real-world applications often lack this part. We are seeing more and more examples of AI applied to our daily lives where the result of the algorithm is taken as a perfect and completely trustworthy scientific fact, such that it can be used for life-changing decisions… even though it has unexpected and unwanted consequences.

  • third: it all depends on the training dataset. This is what I want to focus on today: because the ML algorithm knows nothing besides the set of data it is fed during its training phase (this dataset is its entire reality), whatever twists, errors or biases are present in the training dataset will inevitably reappear in the logic of the algorithm once trained.

Suppose you decide to tell a child during her whole childhood that vanilla ice cream is dangerous and chocolate ice cream is the best flavor ever. As an adult, she will likely flee from vanilla and desperately ask for chocolate because she has been infused with this idea for years. She has been told and shown this “fact” so consistently that, in her world, there is no question and no need to even take a step back: vanilla is bad and chocolate is good, period.

But what does it mean for ML? What does it mean to have bias in your data? How far can these biases reach in our lives as AI starts to soak in more and more places?

Disclaimer: this article is more focused on starting a conversation and bouncing ideas around. I did not go into all the math details and stat tools (for example, when discussing “significantly different” values, I did not compute the p-value and only looked at some global trends or means).

A case study: 2018 Chicago salaries

For this article, I’ve decided to work on an open-source dataset published by the city of Chicago in 2018 that gives the name, job title, job department and annual or hourly salary of over 30,000 employees (some work full-time, others part-time). It is available on the famous Kaggle data-science platform, compiled into a single CSV file from the official source of open payrolls.

I wanted to know if I could spot the infamous women vs. men salary inequality. I had three specific questions I wanted to work on for this project:

  • could I indeed observe a salary gap between men and women in the data?
  • after training my algorithm, if I fed it the exact same profile where only the gender differed, would predictions be similar or significantly different? (meaning: would my AI learn inequality?)
  • if there was a gap, was it possible to “force” the algorithm to learn a more equal set of rules by “rebalancing” the training dataset?

The full code (with the datasets and a readme file) is available as a zip archive over here.

A preliminary data analysis

I conducted my data cleaning, data transformation and data analysis using the famous Python libraries pandas and matplotlib. The Kaggle CSV file contains the data for 33,183 people.

Since the original data does not contain the gender, I’ve used common lists of male and female first names to get the most probable gender for each based on their name (here are the two lists, for men and for women, that I’ve used). This way, I could add a new categorical column to the dataset with either the “male” or “female” value. I’ve decided to drop the rows for which I could not determine the gender – I was left with 28,837 samples. This is not a lot for a solid machine learning project, but it’s alright for the purpose of this post.
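To make the gender-guessing step more concrete, here is a minimal sketch of the idea (not the exact code from the archive). The tiny name sets, the `name` column and the "LAST, FIRST" name format are assumptions made for illustration only.

```python
import pandas as pd

# Hypothetical, tiny stand-ins for the two first-name lists.
MALE_NAMES = {"john", "michael", "david"}
FEMALE_NAMES = {"mary", "linda", "susan"}


def guess_gender(full_name):
    """Return "male", "female" or None for a name written as "LAST, FIRST ..."."""
    parts = full_name.split(",")
    if len(parts) < 2 or not parts[1].strip():
        return None  # no first name to look up
    first_name = parts[1].split()[0].lower()
    if first_name in MALE_NAMES:
        return "male"
    if first_name in FEMALE_NAMES:
        return "female"
    return None  # unknown or ambiguous first name


# Made-up rows, only here to exercise the function.
employees = pd.DataFrame(
    {"name": ["DOE, JOHN A", "SMITH, MARY", "LEE, ALEX J", "BROWN, LINDA K"]}
)
employees["gender"] = employees["name"].map(guess_gender)

# Drop the rows for which no gender could be determined.
labeled = employees.dropna(subset=["gender"]).reset_index(drop=True)
print(labeled)
```

Note that a lookup like this silently drops people with rare or ambiguous first names, which is itself a possible source of bias in what remains.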

I also converted all salaries to an annual salary: for salaries given as an hourly rate, I multiplied the rate by the average number of hours worked in a week, and then by 52.
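As a rough sketch, the conversion could look like this in pandas. The column names (`annual_salary`, `hourly_rate`, `typical_hours`) and the numbers are made up, and I assume here that the weekly hours are available per row.

```python
import pandas as pd

WEEKS_PER_YEAR = 52

# Made-up rows: one salaried employee and two hourly employees.
payroll = pd.DataFrame({
    "annual_salary": [90000.0, None, None],
    "hourly_rate": [None, 20.0, 35.0],
    "typical_hours": [None, 20.0, 40.0],  # hours worked per week
})

# hourly rate x weekly hours x 52 weeks gives an annual equivalent
from_hourly = payroll["hourly_rate"] * payroll["typical_hours"] * WEEKS_PER_YEAR

# Keep the published annual salary when there is one, else use the estimate.
payroll["annual_salary"] = payroll["annual_salary"].fillna(from_hourly)
print(payroll["annual_salary"].tolist())
```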

At that point, I could plot a histogram of salaries showing the distribution by gender (woman or man):

This shows us a few things:

  • the dataset is not balanced: we have about 75% men and 25% women
  • the mean salary for men is above the mean for women ($86,590 versus $70,376)
  • but overall women’s and men’s salaries follow similar trends
  • there are 2 groups: lower salaries are distributed roughly the same between men and women but higher ones show a difference – we can check that the first group corresponds to the part-time jobs while the second one corresponds to the full-time jobs
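The numbers behind these observations come from simple group-by statistics. Here is a minimal sketch on a made-up four-row table; the column names are assumptions and the values are not the Chicago figures.

```python
import pandas as pd

# Made-up rows, only to show the computations.
staff = pd.DataFrame({
    "gender": ["male", "male", "male", "female"],
    "schedule": ["full-time", "full-time", "part-time", "full-time"],
    "annual_salary": [90000.0, 80000.0, 20000.0, 70000.0],
})

# Share of each gender in the dataset (is it balanced?).
gender_share = staff["gender"].value_counts(normalize=True)

# Mean salary per gender, overall and within each schedule group.
mean_by_gender = staff.groupby("gender")["annual_salary"].mean()
mean_by_group = staff.groupby(["schedule", "gender"])["annual_salary"].mean()

print(gender_share, mean_by_gender, mean_by_group, sep="\n\n")
```

Looking at the means within each schedule group matters: an overall gap can come partly from one gender being over-represented among part-time jobs.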

Training two basic models: a linear regression and a random forest

Using the Python module scikit-learn, we can easily instantiate and train common machine learning models: regressions, decision trees, ensemble models or even neural networks for deep learning!

Here, I’ve decided to stick to two basic models:

  1. the linear regression: it uses numerical features to predict a numerical value – it is pretty easy to interpret because it assigns a coefficient to each feature, which (when features are on comparable scales) tells us how much each one weighs in the prediction; but it is very (even too?) simple, especially when dealing with categorical features as is the case here
  2. the random forest: it is an ensemble model (which means it trains many decision trees on random subsets of the data and averages their predictions) and such models are known for giving good results – as with the linear regression, we can quite easily extract the feature importances and interpret the result, but it is much better at handling categorical features

My target column is the annual salary (it’s the value I want to predict) and my features are: the job title, the job department, whether the job is full- or part-time and the person’s gender.

The training is fast because I don’t have a lot of data and I use simple models. Once I’ve trained my models, I can predict the annual salary for a bunch of made-up examples – I’ll simply draw random “profiles” of people and see what salary the algorithm predicts for them. For example, I can simulate the computation for a male police officer working full-time and I’ll get a result of roughly $87,500 per year with the linear regression and roughly $82,660 per year with the random forest. By comparison, for a female police officer working full-time, we get a salary of ~$79,105 with the linear regression and ~$82,650 with the random forest.
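For readers who want to try this “same profile, different gender” test, here is a self-contained sketch with scikit-learn. It is not the code from the archive: the data is synthetic (with a 5% gap built in on purpose), the column names are made up, and encoding each category as an integer with `OrdinalEncoder` is an assumption on my part.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import OrdinalEncoder

FEATURES = ["job_title", "department", "full_time", "gender"]
CATEGORICAL = ["job_title", "department", "gender"]


def make_toy_payroll(n_rows=400, seed=0):
    """Synthetic payroll with a built-in gender gap (made-up numbers)."""
    rng = np.random.default_rng(seed)
    base = {"POLICE OFFICER": 85000, "CLERK": 50000, "ENGINEER": 95000}
    dept = {"POLICE OFFICER": "POLICE", "CLERK": "FINANCE", "ENGINEER": "WATER"}
    title = pd.Series(rng.choice(list(base), size=n_rows))
    gender = rng.choice(["male", "female"], size=n_rows, p=[0.75, 0.25])
    full_time = rng.choice([1, 0], size=n_rows, p=[0.8, 0.2])
    salary = (
        title.map(base)
        * np.where(full_time == 1, 1.0, 0.4)       # part-time earns less
        * np.where(gender == "female", 0.95, 1.0)  # the bias we plant
        + rng.normal(0, 2000, size=n_rows)
    )
    return pd.DataFrame({
        "job_title": title,
        "department": title.map(dept),
        "full_time": full_time,
        "gender": gender,
        "annual_salary": salary,
    })


def build_model(regressor):
    """Integer-encode the categorical columns, then fit the regressor."""
    encode = ColumnTransformer(
        [("categories", OrdinalEncoder(), CATEGORICAL)],
        remainder="passthrough",  # full_time is already 0/1
    )
    return make_pipeline(encode, regressor)


train = make_toy_payroll()
models = {
    "linear regression": build_model(LinearRegression()),
    "random forest": build_model(
        RandomForestRegressor(n_estimators=50, random_state=0)
    ),
}
for model in models.values():
    model.fit(train[FEATURES], train["annual_salary"])

# Two profiles that differ only by gender.
profile = pd.DataFrame([{
    "job_title": "POLICE OFFICER", "department": "POLICE",
    "full_time": 1, "gender": "male",
}])
twin = profile.assign(gender="female")

predictions = {}
for name, model in models.items():
    predictions[name] = {
        "male": float(model.predict(profile[FEATURES])[0]),
        "female": float(model.predict(twin[FEATURES])[0]),
    }
    print(name, {k: round(v) for k, v in predictions[name].items()})
```

Because the gap is planted in the synthetic data, any difference between the two predictions here only shows that a model can pick it up; it says nothing about the real payroll.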

So, wait, what? One model tells us salaries are clearly lower depending on the gender while the other predicts about the same value… What’s the truth? Well, this is only one example – to get somewhat usable results, we need to work on a larger population and generate thousands of profiles. This way, we’ll be able to do some statistical analysis and get trends, like we did on the original training dataset.
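Generating a population of random profiles and comparing the mean prediction per gender can be sketched as follows. Everything here is illustrative: `sample_profiles` is a helper I made up, and `toy_predict` is a hand-written stand-in for a trained model’s `predict` method, with a 5% gap hard-coded so that there is something to measure.

```python
import numpy as np
import pandas as pd


def sample_profiles(reference, n_samples, female_share=0.25, seed=0):
    """Draw each feature independently from the values seen in `reference`."""
    rng = np.random.default_rng(seed)
    profiles = pd.DataFrame({
        column: rng.choice(
            reference[column].drop_duplicates().to_numpy(), size=n_samples
        )
        for column in ["job_title", "full_time"]
    })
    profiles["gender"] = rng.choice(
        ["male", "female"], size=n_samples, p=[1 - female_share, female_share]
    )
    return profiles


def toy_predict(profiles):
    """Hypothetical stand-in for model.predict(), with a planted 5% gap."""
    base = profiles["job_title"].map({"POLICE OFFICER": 85000, "CLERK": 50000})
    hours = np.where(profiles["full_time"], 1.0, 0.4)
    penalty = np.where(profiles["gender"] == "female", 0.95, 1.0)
    return base * hours * penalty


# Made-up reference values for the features.
reference = pd.DataFrame({
    "job_title": ["POLICE OFFICER", "CLERK"],
    "full_time": [True, False],
})

profiles = sample_profiles(reference, n_samples=40_000)
profiles["predicted_salary"] = toy_predict(profiles)

# Population-level view: how many profiles per gender, and their mean prediction.
summary = profiles.groupby("gender")["predicted_salary"].agg(["count", "mean"])
print(summary.round(0))
```

With a real model, you would replace `toy_predict(profiles)` by the model’s own prediction call on the same table.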

Stepping down from the high horse, aka the bias in the data scientist

The problem with the linear regression is that it actually exaggerates our results way too much. Remember how I wanted to show a gap in salary between men and women? Well, if I wanted to convince a board of decision makers of that, I’d simply use the linear regression model that clearly demonstrates this. Just look at the features’ coefficients (that represent their importance):

Feature | Importance
Job title | -9.3747
Job department | 72.792
Full-time/Part-time job | -66,825
Gender | 8,389.9

Isn’t that amazing? Gender is evidently among the most important features and has a huge impact on the predictions! But, now, let’s look at the distribution of predicted annual salaries, grouped by gender as before:

See the problem? This distribution looks nothing like the original one. It is an over-simplification of our data that only took 2 features into account: the gender and the full-/part-time job. Why? Because linear regression does not handle categorical features well, and it is not able to make sense of our job title and job department features. For this model, only real numerical values or boolean (0/1) values make sense. So it only focused on the two boolean features it had and practically disregarded the rest. This makes a powerful but invalid argument for our case.

Using this model would be misleading because, from a scientific point of view, it is not valid in the studied situation. It would lead to exaggerated conclusions and eventually heavily biased judgments. This example is proof that a data scientist can re-arrange numbers and stats in more or less rigorous ways to put certain ideas forward… this “tainting of evidence” can be voluntary or not (especially when the bias goes your way and supports your initial instincts, like here). So whenever you add data science to your projects, make sure you have properly studied your use case and chosen valid tools.

Going back on the path of reason… and finding the bias in the data

Clearly, the linear regression should not be used. The random forest, however, shows much better results in terms of distribution:

Here, we get something similar to the initial statistical distribution, with the same 3 characteristics: two groups (full-/part-time jobs), globally similar trends, but slightly higher salaries for men than for women.

Feature importances show us that the gender is not very significant in the predictions and that most of the difference comes from the job title and the full-/part-time job features:

Feature | Importance
Job title | 42.7%
Job department | 10.7%
Full-time/Part-time job | 44.5%
Gender | 2.1%

This is related to our previous observation about the two groups of salaries: those two features have the most influence on salaries because they determine which group you fall in. Plus, the job title is still mainly what sets the salary – which is quite reassuring!
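With scikit-learn, these percentages can be read from the fitted forest’s `feature_importances_` attribute. The sketch below uses synthetic, already integer-encoded data in which gender only has a small effect; the column names and coefficients are invented, so the printed values will not match the table above.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(0)
n_rows = 300

# Synthetic features, already encoded as integers (made-up data).
X = pd.DataFrame({
    "job_title": rng.integers(0, 3, size=n_rows),
    "full_time": rng.integers(0, 2, size=n_rows),
    "gender": rng.integers(0, 2, size=n_rows),
})
# Salary driven mostly by job title and schedule, only slightly by gender.
y = (
    40000
    + 20000 * X["job_title"]
    + 30000 * X["full_time"]
    + 2000 * X["gender"]
    + rng.normal(0, 1000, size=n_rows)
)

forest = RandomForestRegressor(n_estimators=50, random_state=0).fit(X, y)

# Impurity-based importances: one value per feature, summing to 1.
importances = pd.Series(
    forest.feature_importances_, index=X.columns
).sort_values(ascending=False)
print((importances * 100).round(1))
```

A small importance does not mean “no effect”: here the gender term still shifts every prediction by about $2,000, which is the kind of systematic gap that only shows up at population scale.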

But our model learnt well: it reproduced the training dataset trends… including the bias! Note that we’re talking about statistical trends at a large scale; so although on a single example it might give equal salaries for a man and a woman with similar profiles, the model more generally tends to predict slightly lower salaries for women (for 40,000 samples with the initial 75%/25% male/female split, we get a mean annual salary of ~$55,020 for men and ~$52,148 for women).

“Rebalancing” the data

The final question is: if we had balanced data where men and women with similar qualifications had equal salaries, would the algorithm have this bias? To simulate this balanced data (that we sadly don’t have…), we’ll simply copy the salaries we have for men, apply a bit of randomness to avoid having the exact same values and take a third of the results to keep our 75%/25% gender split. This is very basic and would not hold for an actual research case study, but it’s enough for us and indeed gives us the expected new distribution with almost identical histograms for men and women:

What is interesting is that if we train our random forest exactly as we did before, we now get the same mean salary for women and men on our 40,000 samples (which is about $51,400):

We haven’t changed anything in our training process or our model settings; the only thing we’ve changed is the data. And after this modification, we see that our conclusions are completely altered! This shows the importance of the training dataset in the predictions an ML model can make.
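The rebalancing trick described above (copy the men’s rows, jitter the salaries, keep a third of them as “women”) can be sketched like this. The `rebalance` helper, the 1% noise level and the toy salaries are my own illustrative choices, not the values used for the figures in this post.

```python
import numpy as np
import pandas as pd


def rebalance(data, noise_scale=0.01, seed=0):
    """Replace the women's rows with jittered copies of a third of the men's rows."""
    rng = np.random.default_rng(seed)
    men = data[data["gender"] == "male"]
    # One synthetic woman for every three men keeps the 75%/25% split.
    women = men.sample(n=len(men) // 3, random_state=seed).copy()
    women["gender"] = "female"
    # Small multiplicative noise so the copies are not exact duplicates.
    women["annual_salary"] *= 1 + rng.normal(0, noise_scale, size=len(women))
    return pd.concat([men, women], ignore_index=True)


# Made-up, biased starting point: women earn less on average.
rng = np.random.default_rng(1)
original = pd.DataFrame({
    "gender": ["male"] * 300 + ["female"] * 100,
    "annual_salary": np.concatenate([
        rng.normal(86000, 5000, size=300),
        rng.normal(70000, 5000, size=100),
    ]),
})

balanced = rebalance(original)
means = balanced.groupby("gender")["annual_salary"].mean()
print(means.round(0))
```

In a real dataset the copied rows would carry all the other columns along (job title, department, schedule), so the synthetic women get the same mix of jobs as the men they were copied from.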

Reflecting upon the results

Now, what does it all mean? Our analysis was “quick and dirty” and on too little data to consider it serious research. However, we can still make a few useful remarks and drive the point home:

  • an ML model is only a mathematical function, so it is not inherently biased; on the other hand, the data scientist chooses a model from the big toolbox of models we’ve developed over the years and they have to be careful to pick one that is suited to the problem – otherwise, it might exaggerate or understate the results and introduce flaws or inaccuracies in the project
  • also, your data can contain biased trends, unwanted patterns and unexpected behaviors that will be reproduced by your model and may go unnoticed
  • no matter if your data is biased or not, you’re still facing the “black box” issue so characteristic of AI: because of the way machine learning works (by deriving its own set of mathematical rules from the patterns in your training dataset), it is usually hard – if not impossible – to know exactly how the algorithm came to make a prediction. Even though I’ve mentioned that the linear regression and random forest models are quite easy to interpret, remember that we still can’t ask the algorithm to take us through its learning and understanding process…

We are at an age where we put AI everywhere: transportation, banks, housing, healthcare, security, education… The list of sectors where we add machine learning to help with everyday work keeps on growing! Though this is not bad per se, we need to be careful and really understand the benefits and the drawbacks of this technology. AI is still fairly new and scientists have discovered amazing tools so far; it is time to also discuss the ethics, limitations and unexpected consequences of this field so it does not escape us. Luckily, people have already started to ponder this question and the real risks of AI.

If we don’t make the effort to check what we’re actually putting in our algorithms and how we’re using them, we might further deepen inequalities and biases that are already present in our modern societies: by blindly relying on this false idea that “machine learning algorithms are impartial”, we will silently (and perhaps unknowingly) embed those biases at the root of every decision in the near future. However, it does not have to be this way: by being always more informed, uplifted by advances and critical of excesses, all of us together can build a safer and more equal society for tomorrow.
