Note: This article presents the goals of LabGenius, the company I worked at, as well as my role during this internship and an overview of the projects I worked on. Next week, the second and final article will dive more into the tools and technologies I used, and into the other skills I developed during this experience.
Nowadays, data science and AI technologies are everywhere, and these new tools are profoundly changing our understanding of major fields. Biology is one of them: although many assays and lab experiments must still be designed and carried out by humans, an important part of the job also consists in examining and analyzing the resulting data. Hence, our current knowledge in statistics, data analysis and machine learning can be put to good use to help scientists improve their overall process. Moreover, given the amazing accomplishments of machine learning these past few years and the immense amount of work the community has put into the domain, it is now possible and relevant to apply it to sensitive topics such as health.
LabGenius: mixing biology & AI in a new way
LabGenius is a biotech start-up specialized in drug discovery that uses AI to drive the search for new proteins. Founded in 2015, it has grown a lot since then and is now a team of roughly 25 brilliant specialists from various fields: biology, robotics, data science, software development, marketing…
Their aim is to “harness evolution with AI”, in other words to use the power of AI to focus the attention of protein engineers on the most promising proteins. Indeed, because the underlying biological mechanisms are so complex, even though molecular biologists can roughly shape viable “candidates”, it is not yet possible to design a de novo protein with an exact set of desired properties in a lab; biologists need iterative cycles of trial and error to gradually refine the structure and eventually identify a candidate drug for their specific goal. LabGenius is working on using AI to orient the improvements for the next cycle, focusing on the changes that are most likely to result in a positive outcome. With this technique, they won’t need to browse the whole combinatorial space of possible candidates (which is simply intractable in practice!).
My role in the company
At first, my internship was solely about data science, but I quickly joined the Software Engineering team. It was really great to work in both teams at once because, to me, data science and software development offer very different yet complementary time scales:
- data science is usually a lengthy process consisting of many iterations of hypotheses, implementations, optimizations, discussions and reshaping; this leads to a sense of ownership of the projects and an in-depth understanding of the concepts, but it can also be a source of frustration if, for example, a hardware issue slows down training and prevents you from proceeding to the analysis
- software engineering, on the other hand, was for me more a series of one-off projects: in two or three days, I would participate in or lead the UI mockup, build a first prototype and submit it to the biologists or data scientists for feedback; though maintenance is never truly over and I cannot consider the tools “done”, the core of each project was set up in a short time with immediate results, and the users’ follow-up suggestions only require small tweaks over an indeterminate period of time
This is why having these two types of missions was beneficial: they are very complementary in terms of instant payoff and long-term rewards.
Another difference was that, beforehand, I had more skills and experience in software engineering than in data science – specifically when applied to protein analysis. I could therefore contribute directly and very early on to LabGenius’ tools and apps, which allowed me to be productive while I was still learning the biology required for the data science projects.
An overview of my projects
Because of non-disclosure agreements, I won’t be able to give all the details about the projects I worked on. But I can still give you an idea of the concepts I discovered and used during those 6 months both in data science and in software development.
Statistics on protein sequences
The first project I worked on when I arrived was about applying point mutation statistics and basic math tools to protein sequence analysis.
As you might be aware, there are 20 natural amino acids; they all share a common backbone, and a specific side chain determines which one each acid is exactly. These acids are linked together into long chains called “proteins”.
This means that a protein can be represented as a string of characters taken from a list of 20 possibilities: A, T, G, C, E, S… On the other hand, proteins can be tested for a whole set of properties in the lab – LabGenius is interested in several, be it protein stability, immunogenicity or fitness for specific drug functions.
Let’s say that for a given property, we label as “positive” (resp. “negative”) proteins that do (resp. don’t) have this property.
The question we were interested in at that point was: given a set of proteins that have been labelled beforehand, is it possible to identify statistically relevant character-spot combinations? For example, based on the knowledge we have from our dataset, can we say that an A in first position indicates a positive label? Or, to the contrary, is an E in 47th position a hint for a negative label? Or are those combinations simply not statistically relevant?
To answer this question, we followed a few steps:
- first, we did a “positive”/”negative” count for each combination (in other words, we checked how many times it appeared in sequences labelled “positive” – the “successes” – and how many times in sequences labelled “negative” – the “failures”)
- second, we used beta distributions to transform those successes and failures into a probability of being positively labelled
- third, we chose a rule to decide on whether or not the probability was statistically different from the baseline
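The first two steps above can be sketched in a few lines of Python. This is a minimal illustration only (the actual data handling and decision rule at LabGenius were more involved), using a `Beta(1 + successes, 1 + failures)` posterior, i.e. a uniform prior:

```python
from collections import defaultdict

def count_combinations(sequences, labels):
    """For each (position, amino acid) combination, count how many times it
    appears in "positive" sequences (successes) and "negative" ones (failures)."""
    counts = defaultdict(lambda: [0, 0])  # (pos, acid) -> [successes, failures]
    for seq, label in zip(sequences, labels):
        for pos, acid in enumerate(seq):
            counts[(pos, acid)][0 if label == "positive" else 1] += 1
    return counts

def positive_probability(successes, failures):
    """Mean of the Beta(1 + successes, 1 + failures) posterior: the estimated
    probability that a sequence with this combination is labelled "positive"."""
    return (1 + successes) / (2 + successes + failures)

# Toy dataset (made-up sequences, for illustration only):
sequences = ["GAS", "GTS", "SAS", "STS"]
labels = ["positive", "positive", "negative", "negative"]
counts = count_combinations(sequences, labels)
print(counts[(0, "G")])                         # [2, 0]
print(positive_probability(*counts[(0, "G")]))  # (1+2)/(2+2+0) = 0.75
```

The third step then compares each such probability against the baseline rate of “positive” labels in the whole dataset.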
I won’t dive into too much detail about beta distributions here, but the important thing to note is that they are completely characterized by two numbers – here, the successes and the failures. We can observe two basic behaviors of a beta distribution by looking at its PDF as we toy around with the ratio of successes to failures and the total number of observations:
- as the ratio of successes to failures increases, the peak of the PDF shifts from 0 to 1, which corresponds to the fact that a “positive” label is more and more probable for this input
- for a constant ratio, as we increase the total number of observations, the spread of the PDF decreases; in other words, we are more and more certain of the result
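Both behaviors can be checked numerically with the closed-form mean and variance of a Beta(α, β) distribution, without plotting or sampling (this is a generic property of the distribution, not LabGenius-specific code):

```python
import math

def beta_mean_std(successes, failures):
    """Closed-form mean and standard deviation of Beta(successes, failures)."""
    a, b = successes, failures
    mean = a / (a + b)
    var = a * b / ((a + b) ** 2 * (a + b + 1))
    return mean, math.sqrt(var)

# Behavior 1: a higher success ratio shifts the distribution towards 1.
print(beta_mean_std(2, 8)[0])   # 0.2
print(beta_mean_std(8, 2)[0])   # 0.8

# Behavior 2: same ratio, more observations -> smaller spread (more certainty).
print(beta_mean_std(5, 5)[1])    # ~0.151
print(beta_mean_std(50, 50)[1])  # ~0.050
```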
This allowed us to produce some nice plots that clearly show the relevant acid-spot combinations given our dataset:
In this diagram, we have an example where having a G in first position indicates a “positive” label whereas having an S indicates a “negative” label.
Because we had some interesting results with single-spot examination, we decided to pursue the project further and study acid-spot pair combinations. By applying the same overall technique and making sure that neither of the acid-spots in the pair is relevant on its own, we were able to create the same sort of graphs for pairs:
Here, having a D in first position is not interesting on its own, but having a D combined with a P two spots further is!
The project could go even further if we studied combinations of 3, 4, 5 acids… However, even though it is a nice theoretical approach to the problem, it might be limited by a biological phenomenon called “epistasis”. To put it simply, epistasis makes understanding relationships between acids in a protein harder than just looking at linear combinations. Indeed, since proteins fold in 3D space, acids that are “far away” in the sequence may actually be very close to each other “in real life”.
This is why we also devoted some time to comparing this “manual” approach to a neural network implementation; as opposed to basic counting and beta distributions, those models could hopefully capture these more complex relationships. It did give some interesting (and, fortunately, consistent!) results but presented us with another problem: black-box models and AI explainability. I’ve already talked about it in a previous article over here.
All in all, this project was a really nice introduction to protein analysis and a neat way of learning more about statistics and distributions.
Genetic algorithms & Multi-objective optimization
At LabGenius, scientists are interested in proteins that are optimal for several characteristics at once; they don’t just want a protein that can perform well against a particular disease, they also want one that binds in the right place, does not trigger unwanted reactions…
This means that you cannot optimize for one objective while completely disregarding the rest – because these goals conflict, you need specific tools to treat this problem. The ideal case where several objectives can all be optimized “in harmony” is rarely the one you get in real life; to get a feel for why, let’s take a small example. Suppose a factory has a production time, a production cost and a matching profit for the items it produces; its production is also subject to given constraints.
If you want to minimize the production cost and time, then it’s quite obvious that the “ideal” solution (mathematically speaking) is to do nothing. On the other hand, if you want to minimize the production time while maximizing the profit, then it will be trickier:
- firstly, there is no clear optimal solution because, in theory, the profit could grow infinitely large
- secondly, because of the production constraints, you cannot even reach certain sub-optimal compromise solutions
This means that there is necessarily a trade-off: you need to sacrifice something in one objective to improve the other one.
Another important thing to note is that you usually don’t compute a single optimal solution but rather a set of equivalent solutions, called a “Pareto Front” – then, a human expert has to cherry-pick the best one from this set by providing additional information or deciding on weights for the objectives.
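To make the notion concrete, here is a minimal sketch (not LabGenius code) of how a Pareto front can be extracted from a set of candidate solutions, assuming every objective is to be minimized:

```python
def dominates(p, q):
    """p dominates q if p is at least as good on every objective
    (here: lower is better) and strictly better on at least one."""
    return all(a <= b for a, b in zip(p, q)) and any(a < b for a, b in zip(p, q))

def pareto_front(points):
    """Keep only the non-dominated points: the Pareto front."""
    return [p for p in points if not any(dominates(q, p) for q in points)]

# (production time, production cost) pairs -- minimize both:
candidates = [(1, 9), (2, 7), (3, 8), (4, 4), (6, 3), (7, 5)]
print(pareto_front(candidates))  # [(1, 9), (2, 7), (4, 4), (6, 3)]
```

`(3, 8)` is dropped because `(2, 7)` beats it on both objectives; the four remaining points are “equivalent” trade-offs among which an expert must choose.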
However, it is not always easy to compare sets of solutions: comparing single-objective solutions (i.e. single points) is straightforward, since you simply need to check their values along each axis; but comparing multi-objective solutions means comparing groups of points, so how exactly can we do that? Ideally, a Pareto Front should have both the best “correctness” (be as close as possible to the theoretical optimum) and the best “diversity” (have as many different points as possible, to offer the human expert more choices to pick from).
The data science team is currently exploring how to use multi-objective optimization at LabGenius. It is still in research stage at the moment but they are studying various methods and in particular how genetic algorithms can help compute a set of relevant solutions. During this project, my task was to get a better understanding of multi-objective optimization and genetic algorithms; I then had to do R&D on scientific articles to identify relevant comparison metrics and to come up with a standardized process to compare Pareto Fronts.
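One metric commonly found in the multi-objective literature for this kind of comparison is the hypervolume indicator; the sketch below (an illustration, not necessarily one of the metrics we retained) computes it for the two-objective minimization case:

```python
def hypervolume_2d(front, ref):
    """Hypervolume of a 2-objective minimization front: the area dominated
    by the front and bounded by the reference point `ref`. A larger value
    means a front that is closer to the ideal and/or more spread out.
    Assumes the points in `front` are mutually non-dominated."""
    pts = sorted(p for p in front if p[0] < ref[0] and p[1] < ref[1])
    area, prev_y = 0.0, ref[1]
    for x, y in pts:  # sweep left to right, adding one rectangle per point
        area += (ref[0] - x) * (prev_y - y)
        prev_y = y
    return area

# Comparing two fronts against the same reference point:
ref = (10, 10)
front_a = [(1, 8), (3, 5), (6, 2)]
front_b = [(2, 7), (5, 4)]
print(hypervolume_2d(front_a, ref))  # 51.0
print(hypervolume_2d(front_b, ref))  # 39.0 -> front_a is the better front
```

A single number makes fronts easy to rank, but it blends correctness and diversity together, which is exactly why several complementary metrics are usually studied side by side.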
AI model uncertainty
I’ve mentioned uncertainty measures in AI models previously. At LabGenius, this is still under research but it could be a nice way of increasing models’ accuracy and of identifying out-of-data examples.
For now, we have focused on epistemic uncertainty (the so-called “model uncertainty”), which comes from the model itself (be it because you chose the wrong type of network altogether, or because your weights could be tuned in lots of different ways that all predict your data well). It can be reduced by adding more training data.
Note: We’ve therefore left aleatoric uncertainty (or “data uncertainty”) on the side so far.
As explained in the aforementioned article, the nice thing with the uncertainty implementation we tried out is that it is quite easy to add to a model. For epistemic uncertainty, you just need to:
- have dropout in your model both during training and testing
- run inputs several times through your “dropout-ed” model and use the various results to compute the uncertainty of the model for this specific input (this follows the Monte Carlo logic)
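As an illustration of the Monte Carlo logic, here is a toy, pure-Python version with a single linear “layer” (in practice this would be a Keras model whose Dropout layers are kept active at prediction time, e.g. by calling the model with `training=True`):

```python
import random
import statistics

def predict_with_dropout(weights, features, p_drop=0.5):
    """One stochastic forward pass: each weight is dropped with probability
    p_drop and the survivors are rescaled (inverted-dropout convention)."""
    kept = [w / (1 - p_drop) if random.random() > p_drop else 0.0
            for w in weights]
    return sum(w * x for w, x in zip(kept, features))

def mc_dropout_predict(weights, features, n_trials=50):
    """Monte Carlo dropout: run the stochastic model N times and use the
    spread of the outputs as an epistemic-uncertainty estimate."""
    outputs = [predict_with_dropout(weights, features) for _ in range(n_trials)]
    return statistics.mean(outputs), statistics.stdev(outputs)

random.seed(0)
mean, std = mc_dropout_predict([1.0, 2.0, 3.0], [1.0, 1.0, 1.0])
# std > 0: the stochastic passes disagree, and that disagreement is
# precisely what quantifies the model's uncertainty for this input
```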
The project focused on the usage of various uncertainty measures and on the comparison of two implementations of model uncertainty:
- in the first case, we had one model with dropout and asked it to perform N times on each input
- in the second case, we had N models with different random seeds and asked each to perform 1 time on each input
For each case, we did N=50 trials and computed the standard deviation, the entropy and the mutual information of our N results for each input. We had three types of inputs: positively labelled ones, negatively labelled ones and random sequences that represented out-of-data examples.
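These three measures are cheap to compute once you have the N predicted probabilities. The following sketch (binary case, natural log, illustrative only) shows one common way to define them:

```python
import math
import statistics

def binary_entropy(p, eps=1e-12):
    """Entropy of a Bernoulli(p) prediction, clamped to avoid log(0)."""
    p = min(max(p, eps), 1 - eps)
    return -(p * math.log(p) + (1 - p) * math.log(1 - p))

def uncertainty_metrics(probs):
    """Summarize N stochastic predictions (probabilities of the "positive"
    class for one input) with three uncertainty measures."""
    mean_p = statistics.mean(probs)
    return {
        "std": statistics.stdev(probs),
        "entropy": binary_entropy(mean_p),  # predictive entropy
        # mutual information: how much the N passes disagree with each other
        "mutual_information": binary_entropy(mean_p)
                              - statistics.mean(binary_entropy(p) for p in probs),
    }

confident = uncertainty_metrics([0.9, 0.92, 0.88, 0.91])  # consistent passes
unsure = uncertainty_metrics([0.2, 0.8, 0.4, 0.6])        # disagreeing passes
print(unsure["mutual_information"] > confident["mutual_information"])  # True
```

Intuitively, a random out-of-data sequence should look like the second case: the N passes scatter, and all three measures rise.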
To be honest, our tests suggest that the difference between those 2 approaches is not that huge in our case. Still, it is possible to do a bit of optimization, given our results.
Basically, it seems like the “1 model-N trials” approach is more confident in general, which is good for the “positive” and “negative” labels but indicates too low an uncertainty on random examples. On the other hand, the “N models-1 trial each” approach is able to show quite a high uncertainty on random examples, but it also has fewer “positive” and “negative” labels with a low uncertainty.
Perhaps doing more trials would mitigate these results and perhaps implementing uncertainty with other techniques than dropout could also modify our conclusions but, at the moment, we felt like it was better to stick with the “1 model-N trials” approach because it is less computationally intensive.
Immunogenicity classification model
Another project I worked on was about using basic classifiers (from scikit-learn) and simple neural networks (from TensorFlow/Keras) for the classification of a specific type of data: binary “immunogenic”/”non-immunogenic” labels.
Immunogenicity is an important property for proteins in drug discovery because it impacts how the body will react to the protein: an immunogenic protein will trigger a response from the immune system and thus be “rejected” by the patient. This is why LabGenius is interested in seeing if we can use AI to identify non-immunogenic proteins in the current candidates.
I was tasked with comparing the results from some simple classifiers: after a quick analysis of our training/test sets, I applied a bunch of models to this binary classification.
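I cannot share the actual features or models, but as an illustration: before any classifier can use them, sequences must be encoded numerically, and a minimal common scheme is one-hot encoding:

```python
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 natural amino acids

def one_hot_encode(sequence):
    """Turn a protein sequence into a flat 0/1 feature vector (one 20-wide
    block per position), usable by any scikit-learn classifier or as the
    input of a simple neural network."""
    vector = []
    for acid in sequence:
        block = [0] * len(AMINO_ACIDS)
        block[AMINO_ACIDS.index(acid)] = 1
        vector.extend(block)
    return vector

print(len(one_hot_encode("GAS")))  # 3 positions x 20 acids = 60 features
```

Fixed-length vectors like these can then be fed to any of the usual suspects (logistic regression, random forests, small dense networks, …) and the resulting scores compared on a held-out test set.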
It is still an ongoing project but so far, we have had interesting results and data scientists are optimistic; it seems like we could train pretty efficient models to link a protein sequence to a reaction from the immune system.
Apps and tools for biologists, data scientists, marketers
The other big part in my role at LabGenius was to be an intern software developer. This is how I got the chance to participate in the development of many tools for the company.
Since LabGenius is currently growing at a fast pace, biologists, data scientists and marketers continuously discover the need for new in-house tools that can facilitate their everyday work. This can be an adapted database browser to find specific proteins in the lab more easily; or protein design inference software that relies on AI; or interfaces for tracking protein production; or a marketing dashboard to quickly glance at LabGenius’ influence across several social networks.
In total, I helped develop 6 apps – 5 of which are now in production at LabGenius – that are used by many people in the company. The projects were mostly about web development. Depending on the tool, I worked either as a front-end or a back-end engineer (or as both on some occasions). This gave me a complete overview of the projects and improved my skills with many web frameworks and cloud-related technologies: ReactJS, Flask/Bottle, the Google Cloud Platform tools, Docker…
Note: More on these technologies in the following article!
My own little Python lib: “Biotools”
At the end of my internship, at my own initiative, I also took some time to gather all my learnings and scripts into one place to create the Biotools Python package. This small library is meant to be a user-friendly interface to fetch, preprocess and do statistical analysis on data, to use genetic algorithms for sequence optimization, and to provide easy-to-use methods for TensorFlow model training and prediction.
This Python package also has documentation generated with Sphinx, which automatically creates an API reference listing all the methods and classes in the library along with their inputs, outputs and matching types. The documentation also contains a full tutorial showing how to use the utilities offered by the library.
Hopefully, this package could serve as a building block for future data science tools and libraries at LabGenius!
Until next time…
As you can see, this internship was an opportunity to take a look at lots of fields: biology, computer science, maths, marketing… I also had the chance to have various mentors and therefore worked on a whole set of skills both in machine learning and data science, and in software development.
Next time, I’ll talk more about the tools and technologies I’ve used and also about all the other skills I improved during this internship in terms of networking, teaching, documenting, etc.
- LabGenius’ website: https://www.labgeni.us/
- Wikimedia Foundation, “Amino acid” (https://en.wikipedia.org/wiki/Amino_acid), September 2019. [Online; last access 27-September-2019].
- Wikimedia Foundation, “Beta distribution” (https://en.wikipedia.org/wiki/Beta_distribution), September 2019. [Online; last access 27-September-2019].
- Wikimedia Foundation, “Epistasis” (https://en.wikipedia.org/wiki/Epistasis), September 2019. [Online; last access 27-September-2019].