My end-of-studies internship at LabGenius! (2)

Last week, I posted a first article about my internship at LabGenius, a London-based biotech startup. Today, let’s continue with a focus on the tools and various skills this experience helped me develop.


Note: This article is the second one in the series. If you want to learn more about the company I worked at, LabGenius, and about the role I had over there, please check out the first one.

Tools & Technologies

Quick peek at the tools

Just to give you a rough idea (and perhaps have you play a guessing game?), here is a snapshot of the logos of the tools I worked with during these 6 months:

Some are very well-known, like Python, TensorFlow or ReactJS; but others were a bit more specific to our projects, such as DEAP.

Python

Python is, to me, the go-to programming language nowadays. It is a high-level, general-purpose language that is easy to write, and with all the packages the community has developed, it can now be used in so many ways that you can do virtually anything with it: quick scripting, desktop GUI applications, scientific computing, statistics and data analysis, machine learning, web servers…

Python is now in the top 5 of programming languages, along with Java and C. As explained in this article by Mindfire Solutions, it allows for both object-oriented and functional programming, is supported by most operating systems and offers an impressive range of features between the standard library and the additional packages. The fact that it uses duck typing can be seen as either a strength or a weakness depending on the programmer: it lets you do some crazy stuff with your variables, but it requires that you control your data flow carefully. Still, it is an excellent tool for complex software development or maintenance because it provides you with plenty of user-friendly interfaces for solving difficult problems: AI architecture design, data analysis and visualization, web applications and many more!
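
To make the duck-typing point a bit more concrete, here is a tiny hedged sketch (the Circle and Square classes are made-up examples): any object exposing an area() method works, whatever its declared type.

```python
# Minimal illustration of duck typing: total_area() never checks types,
# it only relies on each object "quacking" like a shape (having .area()).

class Circle:
    def __init__(self, radius):
        self.radius = radius

    def area(self):
        return 3.14159 * self.radius ** 2


class Square:
    def __init__(self, side):
        self.side = side

    def area(self):
        return self.side ** 2


def total_area(shapes):
    return sum(shape.area() for shape in shapes)


print(total_area([Circle(1.0), Square(2.0)]))  # 7.14159
```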

However, it’s worth noting that, just like any language, it is not equally well-suited to every one of those tasks. An important point is that this language is interpreted and not compiled, as opposed to C, Fortran, Go… This means that it will always be a bit slower than those on its own. Also, its Global Interpreter Lock (which effectively keeps the interpreter single-threaded) prevents you from doing true parallelization out of the box; the libraries that do implement it actually rely on routines written in low-level languages like C, or even on hardware-optimized BLAS routines.
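
To hedge that last point a little: you can still get real parallelism for CPU-bound work in plain Python by using separate processes rather than threads, for instance with the standard multiprocessing module. A minimal sketch:

```python
# CPU-bound work parallelized across processes; each worker runs its own
# interpreter, so the Global Interpreter Lock is not a bottleneck here.
from multiprocessing import Pool


def square(x):
    return x * x


if __name__ == "__main__":
    with Pool(processes=4) as pool:
        print(pool.map(square, range(10)))  # [0, 1, 4, 9, ..., 81]
```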

So, while it does not necessarily win the benchmark contests in terms of pure code execution time, it still manages to steal the spotlight because of how it lowers the overall code writing time and the incredibly wide diversity of applications you can use it for.

DEAP

DEAP (or Distributed Evolutionary Algorithms in Python) is an open-source Python package designed to make implementing genetic algorithms easy. Thanks to this library, you can quickly model a single- or multi-objective problem and solve it using the evolutionary tools we discussed in the last article.

Note: The small Python lib I offered in my article on genetic algorithms a long time ago had the benefit of being very simple to use, but it was also very limited! DEAP is quite flexible and can adapt to many problems. Plus, it is compatible with and optimized for parallel computation.

The overall philosophy is to prepare a DEAP “toolbox” that contains all the objects that will be useful to implement your problem: the definition of an individual, your objective(s), your constraint(s), your selection/cross-over/mutation algorithms… You then use this toolbox during the evolution phase by picking the necessary bits from it at each step and eventually end up with the (hopefully!) optimized evolved population.

Note: Under the hood, DEAP relies on a key-value pair system with dictionaries so that the ready-to-use evolution algorithms offered in the package can extract the right tool when needed. For example, the term “individual” represents a specific object in DEAP’s framework and is used as a base block throughout most of the computation.

A DEAP script usually consists of 3 parts:

  1. initialize the toolbox, i.e. define your problem
  2. have your starting population evolve for a given number of generations
  3. analyze your results, for example by looking at a few of the best individuals or at the final Pareto front

If you’re interested, I encourage you to check out their tutorials, which are really cool and well-written.
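
To give a concrete flavour of this three-step structure, here is a minimal sketch loosely based on the classic “OneMax” problem from DEAP’s own tutorials (maximize the number of ones in a bit string). It only illustrates the toolbox philosophy, not the actual problems we worked on at LabGenius.

```python
import random
from deap import base, creator, tools, algorithms

# 1. Initialize the toolbox, i.e. define the problem (single-objective maximization).
creator.create("FitnessMax", base.Fitness, weights=(1.0,))
creator.create("Individual", list, fitness=creator.FitnessMax)

toolbox = base.Toolbox()
toolbox.register("attr_bool", random.randint, 0, 1)
toolbox.register("individual", tools.initRepeat, creator.Individual,
                 toolbox.attr_bool, n=50)
toolbox.register("population", tools.initRepeat, list, toolbox.individual)

def eval_one_max(individual):
    return (sum(individual),)  # DEAP expects fitness values as a tuple

toolbox.register("evaluate", eval_one_max)
toolbox.register("mate", tools.cxTwoPoint)
toolbox.register("mutate", tools.mutFlipBit, indpb=0.05)
toolbox.register("select", tools.selTournament, tournsize=3)

# 2. Let the starting population evolve for a given number of generations.
population = toolbox.population(n=100)
population, _ = algorithms.eaSimple(population, toolbox, cxpb=0.5, mutpb=0.2,
                                    ngen=40, verbose=False)

# 3. Analyze the results, e.g. by looking at a few of the best individuals.
for best in tools.selBest(population, k=3):
    print(sum(best), best.fitness.values)
```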

TensorFlow: Google’s framework for AI

TensorFlow (TF) is an open-source tool developed by Google. It was first released in 2015 and has since been offered to the community. It is meant to help data scientists create, train and study AI models in depth. It has APIs in several languages (Python, C++…), is cross-platform, runs on CPUs or GPUs and provides engineers with a large palette of resources for machine learning in either a production or a research context. The goal is ideally to abstract away all the complexity and allow user-friendly yet efficient implementation of AI architectures.

The core idea of TensorFlow (at least TF 1.0…) is to use data-flow graphs:

Schematic representation of a TensorFlow data graph (image by Google.com)

To define an AI model with this tool, you compose a pipeline of operations acting on “tensors”. Those are placeholders that will eventually be filled with your data when the network is actually executed. It is worth noting, however, that the newer concept of “eager execution” (introduced in 2017) offers another style of TF programming where operations are executed immediately, without building a static graph first. This other mode can be very useful for debugging, but once the code works and is ready, it is often better to convert it to a graph to save, port or efficiently run it.
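
As a rough illustration of the graph-based mode, here is a minimal sketch assuming the TF 1.x API (placeholders and sessions), where the graph is first described and only executed afterwards:

```python
import tensorflow as tf  # assumes TensorFlow 1.x

# Build the graph: nothing is computed yet, we only describe the operations.
a = tf.placeholder(tf.float32, shape=(None,), name="a")
b = tf.placeholder(tf.float32, shape=(None,), name="b")
c = a * b + 1.0

# Execute the graph: the placeholders are only filled with data at run time.
with tf.Session() as sess:
    result = sess.run(c, feed_dict={a: [1.0, 2.0], b: [3.0, 4.0]})
    print(result)  # [4. 9.]
```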

I’ve already talked a bit about TensorFlow, the high-level interface Keras and the possible novelties TF 2.0 may introduce in a previous article (scroll down a bit if you’re just interested in the TF section…).

Cloud computing

Broadly speaking, cloud computing refers to using online services to run software, taking advantage of remote data storage or computing power without having to manage the hardware ourselves.

Even though companies like IBM already offered such services in the 1960s, the term “cloud” only appeared in a research paper in 1996; and in truth, it wasn’t until Web 2.0 in 2003 that it could truly be used. The term “cloud” was brought to public knowledge in 2006. Nowadays, partly because of how hardware has developed (in particular with GPUs), cloud computing has grown dramatically. Many big companies now offer some cloud computing service (be it Google’s GCP, Amazon’s AWS, IBM Cloud…). These tools are not specifically dedicated to AI, but the computing power required by this type of algorithm is usually so large that data scientists can greatly benefit from these services.

There are various forms of service models (Infrastructure as a Service, Platform as a Service, Software as a Service…). In companies, cloud computing is usually used for three types of tasks: automation of data science analyses, AI model training and web application deployment. In my case, we focused on the Platform as a Service (PaaS) model. In other words, we first request (and rent) a virtual machine; we can then log into it, set up a virtual environment with some libraries, run our own programs and even tweak some of the environment variables; but we do not need to handle the underlying hardware complexity.

For AI model training, the advantage of this PaaS model is that we can run our training on a machine with specific base software and hardware settings (e.g. a Linux virtual instance running on a 4-GPU machine) that is often more efficient than our own setup. However, we often run into an issue: because these instances have a predefined amount of memory for the CPU and the GPU respectively, if you want to take advantage of the computing power of the GPUs, you might actually need to reduce the amount of data fed at each step (or you will suffer a resource exhaustion error). You then face the common “efficiency vs. memory” trade-off: in many high-performance computing projects (and AI very quickly falls into this category when you scale up to large datasets), you need to balance the size of your data and model against the actual performance of the training. On the one hand, you want to process as much data as possible in one step, to reduce the total number of required steps and therefore speed up the training; on the other hand, if you ask the GPU to handle too much data at once, you exhaust the memory limit and the program simply crashes. Therefore, you have to carefully search for the optimal batch size for a specific model in order to make the most of the hardware.
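
In practice, this search for a workable batch size can even be scripted. Here is a hedged sketch (find_workable_batch_size is a hypothetical helper, not something we necessarily used at LabGenius) assuming a Keras model and TensorFlow’s ResourceExhaustedError:

```python
import tensorflow as tf


def find_workable_batch_size(model, x, y, start=1024, minimum=8):
    """Hypothetical helper: halve the batch size until one epoch fits on the GPU."""
    batch_size = start
    while batch_size >= minimum:
        try:
            model.fit(x, y, batch_size=batch_size, epochs=1, verbose=0)
            return batch_size
        except tf.errors.ResourceExhaustedError:
            batch_size //= 2  # GPU memory exhausted: try a smaller batch
    raise RuntimeError("No batch size fits in GPU memory")
```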

The company also uses cloud tools for web application hosting and serving. This is a simple way of allowing the rest of the company to test a tool in development or production mode. Thanks to this, the application server (which receives the incoming requests and goes through databases or specific computation functions) can run continuously, so that people can use the web service whenever they want to.

A handy tool when doing cloud deployment is Docker. Thanks to this technology, when you send your application code to the virtual machine, you can also send a specific file with it to define the entire context in which your app will run: a virtual environment to activate, a set of environment variables, the libraries that are required or even a series of checks to run when the application is updated. By integrating these unit tests directly, we can make sure that we do not introduce any breaking change: when we send an update to the cloud version, it automatically checks that all previous features still work fine and warns us otherwise. To further secure the application, all of these services also let you implement a login mechanism with specific credentials to ensure that only members of the company can access the application.

ReactJS

ReactJS is a JavaScript library that was developed by Jordan Walke (a software engineer from Facebook) and launched in 2013. Since then, it has become open-source and is now maintained both by Facebook and by the community.

This project was born out of Facebook’s need to deal with code maintenance issues and an increasing number of features. When Facebook bought Instagram in 2012, the early prototype of an efficient reactive rendering engine suddenly looked like a great idea, and it was quickly turned into the first version of ReactJS.

The library’s top advantages, in the eyes of many developers (me included), are that:

  1. it allows a programmer to create a full application quickly
  2. it is simple to learn, to test and to deploy
  3. it gives you a lot of freedom in how you structure your codebase because it relies on numerous packages and dependencies
  4. it is very efficient thanks to a “reactive rendering” (hence its name…)

If you want to learn more about this library and how to use it in practice, I will soon do a series of articles on the topic with a “hands-on” tutorial, like I recently did for Golang. If you can’t wait, there is already a repository on my GitHub to go with it that has PDF and Markdown versions of the tutorial.

Here is, however, a quick overview of ReactJS’ core features.

Reactive Rendering

One of the core features of ReactJS is its huge performance gain in browser rendering updates compared to a basic solution. The library is meant to re-render only the parts of the page that need to be, after data has changed for example. This is possible thanks to the use of a virtual DOM: prior to updating the browser, ReactJS computes the differences that require an update and only changes those elements in the page. All this happens without any overhead for the developer, who can still write the code as if the whole page were re-rendered.

Using JSX for a Component-Driven Philosophy

This decoupling between the items that need to be updated and the rest of the page is made possible thanks to ReactJS using a syntax extension of the JavaScript language: JSX.

Before libraries and frameworks like ReactJS appeared, with the base web development languages that are HTML, JavaScript and CSS, programming followed more of a technology-driven philosophy: web pages were built gradually in “layers”, with the markup, the logic and the styling separated both in files and in languages. You would write the skeleton of your page in HTML, add dynamism through JavaScript and eventually style your elements with CSS.

Because it mainly uses JSX, ReactJS changes our frame of mind by adopting another design principle based on components. JSX is a mix of JavaScript and HTML-like markup that allows you to build whole elements at once, in one file: this way, we centralize the markup and the logic in the same place.

Here is a metaphorical representation of the difference between a tech-driven philosophy and a component-driven philosophy:

Technology-driven approach: each “layer” is treated separately and markup and logic are independent

Component-driven approach: markup and logic are centralized into one component

Components are also great for reusability: by creating little self-contained “chunks” for our app, we (hopefully) have general enough components that we can reuse them somewhere else in the app very easily. For example, a form page may have various input fields but the underlying logic is always the same (namely: enter data, optionally validate it, then submit it and usually redirect to another page). One “form” component can thus be created to encapsulate this whole block of logic, plus the corresponding actual displayed HTML elements (with some styling if need be), and the component will be fed and “specialized” depending on the data it receives upon creation.

Page Component Hierarchy & One-Way Data Flow

In ReactJS, pages are composed of containers with simple components inside them. This clear hierarchy allows us to create our website mock-up rapidly. For example, if our initial 2D UI mock-up is well-defined, it is quite straightforward to go from it to its ReactJS equivalent because we simply need to reproduce the layout:

  • each element in the page will be one ReactJS component stored in one file
  • containers can directly import their inner components from their own files (using common JS imports)
  • styling can be applied globally, or per component thanks to dedicated packages

ReactJS therefore defines a parent-child hierarchy in our page.

This relates to how data flow is managed in ReactJS. As opposed to other comparable tools (see the remark at the end of this section), ReactJS relies on a one-way data flow, with parent components pushing data to their children through props and the children answering back through callback functions. This behavior is often summed up by the phrase: “properties flow down; actions flow up”.

To bypass this “locality” of the data flow, it is possible to use state managers. These are useful when one or more components share state with another component or need to mutate another component’s state. The goal of a state manager is to create a single source of truth instead of lifting state up and down with props and callbacks. We can then fetch or update data from/to any component in our page without having to implement a far-fetched and overly complex chain of callbacks through the whole hierarchy.

Component State & Lifecycle

Finally, ReactJS has an important feature: stateful components with a neatly prepared and user-friendly lifecycle. We have seen before that our page is built from many autonomous components (apart from direct interaction with their parent or global interaction through state managers). In particular, we said that only the components that need to be re-rendered change on a data update.

But how is this update event tracked in a component? In ReactJS, components have a local state that contains the data they need to display and that triggers a targeted re-render whenever it is updated. The props passed down from the parent also have a say in the component’s lifecycle since they, too, can trigger a re-render when they change and the new values are passed again to the child component.

This component lifecycle ensures that the displayed data is always up to date, and the official ReactJS documentation provides us with a nice diagram of the components’ lifecycle:

Flask (+Bottle)

Flask is a micro web framework meant to design small applications easily in Python. As indicated on Flask’s first documentation page, being “micro” means being simple and extensible: the framework does not impose any specific architecture, database or other tools and lets you build your application the way you prefer.

While this means we have to handle more things ourselves, Flask offers many plugins to help us, and this gives us a lot of freedom in how we shape our web application. Many extensions have been proposed to answer the most common needs: database abstraction (with ORMs for example), form validation, user authentication…

The strength of Flask comes from its integrated debugging and testing systems, its great documentation, its really nice learning curve and its numerous extensions. It is straightforward for developers to learn, partly thanks to:

  • its extensive use of templates, which allow us to create HTML pages with dynamic data entry points that are fed by the server upon URL access
  • the ease with which you can define server endpoints for your app

Bottle is another similar small lightweight web framework that is really easy to use.
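
To make the “ease of defining endpoints” point concrete, here is a minimal Flask sketch (the /health and /predict routes and their payloads are purely illustrative, not our actual endpoints):

```python
from flask import Flask, jsonify, request

app = Flask(__name__)


@app.route("/health")
def health():
    # A simple endpoint the rest of the team can ping to check the service is up.
    return jsonify(status="ok")


@app.route("/predict", methods=["POST"])
def predict():
    # Purely illustrative: echo back the JSON payload sent by the client.
    payload = request.get_json()
    return jsonify(received=payload)


if __name__ == "__main__":
    app.run(debug=True)  # Flask's integrated debugger and auto-reloader
```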

Other tools

In the software team, we also used Git, a version-control system designed to coordinate work in a software team. By tracking the changes on all the files in a folder and handling multiple versions that can be compared and merged back together, it makes the life of programmers easier, supports non-linear development and ensures data integrity inside a project. The online platform GitHub is one of the most famous code-sharing platforms: it builds on Git and allows for easy sharing of a repository between multiple coders.

On another topic, LabGenius (and in particular the software team) relies on two main management tools: Slack and Trello.

Slack is a direct messaging service that facilitates the communication inside the company. LabGenius has defined many channels each devoted to a specific topic that help organize the conversations and keep the right people up to date with the projects they are part of.

In the software team, we use Trello to manage and track the progression of the various projects under development. This management software works by creating cards to represent project tasks and placing them in a progress state (usually “To-do”, “Doing”, “Under review” or “Done”). It gives us a clear overview of the current state of a project and lists all the tasks to be discussed or checked out as well as the ones that have been validated. It also helps keep the people who work on the project informed and assign everyone their tasks so that work is well distributed across the team. Finally, you can use color tags to link a card to one or more roles (e.g. “Front-end”, “Back-end”, “API”…).

Sharing the knowledge

At LabGenius, the desire to share knowledge both externally and internally is anchored deep in the company’s values. People are always open about their job and will manage to find some time to bring their expertise to your project if need be.

Of course, there is the usual documentation of the projects you work on so that the tools can actually be used by the end users after they’ve been developed. It was really interesting to write down “readme”s for our various apps… but I decided to take it one step further and to make small video tutorials! Honestly, it was quite fun to do these little “demo tutorials”; the idea was to give users a better feel for the tool and to guide them through all the features they would be able to play with. Listing all the elements I wanted to present, preparing the script, recording the screen and the voice and finally editing everything together was a neat way of revisiting the project with a fresh eye, trying to put myself in the shoes of someone who’s just discovering the tool.

But I also jumped on the “sharing knowledge” train and took this opportunity to improve my teaching skills by passing on some software engineering knowledge to other members of LabGenius’ team. Be it in the tech team or amongst the biologists, a few colleagues gradually started to take an interest in how the applications they were working with were actually coded.

To answer this “how does it look under the hood?” question, my manager pushed the software team to share its knowledge through presentations and by scheduling pair programming sessions into new app development, where a software engineer would team up with another member of LabGenius to implement some features together.

Note: I also prepared the ReactJS tutorial I will be posting as a series of articles very soon (see the next section for more details) and tested it on some volunteers who were interested in learning more about this user-friendly JavaScript library for quick website prototyping.

Moreover, a very cool event that takes place almost every Friday at LabGenius is the “tech talk”. At lunchtime, one person from the company presents for roughly an hour on a topic that is related to the work that they do at the moment, or a technology they spotted that could be of use, or just something that they’re keen on getting across to the rest of the team. I, myself, talked about unsupervised machine learning and how it might be interesting in the case of protein analysis or engineering.

Between all of that and actually preparing my internship report and oral presentation, my time at LabGenius taught me a lot about science, but about “talking science” too; and that is something I’m always glad to improve on!

Networking: meeting new people from the world of AI

In addition to improving my technical skills both in mathematics and programming, this internship also taught me a lot about interpersonal skills and networking in the workplace.

As part of the start-up’s team-building experience, LabGenius organizes a company lunch every Wednesday. It brings all employees together, and these events are often an opportunity to invite someone from outside LabGenius to share their knowledge. During the course of my internship, we met CEOs and engineers from other companies in the healthcare domain, but also people and talent specialists, for example.

Being able to meet such diverse people was a real plus because it helped me grasp the current state of our knowledge of AI applications in biology and investigate many project ideas at LabGenius.

I was also lucky enough to attend the CogX 2019 Festival, which took place in the heart of London from the 10th to the 12th of June and brought together AI and technology experts from all around the world. Be it in the context of mathematics, physics, finance, education, ethics, politics, climate change or medicine, hundreds of speakers presented applications of AI and discussed related questions over two and a half days.

CognitionX was founded by Charlie Muirhead and Tabitha Goldstaub; it aims at connecting the AI community all over the world and explaining this emerging technology’s accomplishments to the public through various events, such as the CogX festivals.

Organized by the CognitionX company, CogX is now one of the largest AI events in the world; the goal is to showcase pioneering ideas in the field and to reflect upon the latest discoveries and their possible consequences.

I had the chance to see world-renowned AI specialists, mathematicians and writers (for example Stuart Russell, Sarah Dillon, Conrad Wolfram or Oliver Morton) and to explore many ideas in a very short timeframe. A very interesting point of this festival is that it gathers people from academia with entrepreneurs and investors; it gathers engineers with journalists; and it gathers a whole palette of views on AI. It is also the perfect moment to discover more about the projects currently being researched in a number of companies and universities.

The only downside, in my opinion, is that in some talks the target audience is too broad and the allocated time too short for the speakers to be able to truly detail their research; as a data scientist, you sometimes feel like they only scratched the surface.

Another core goal of the CogX festival is to extend your network and get in touch with other people in the field of AI. Through a dedicated phone app as well as face-to-face booth meet-ups, the whole event revolves around the idea of helping the growth of AI world-wide both in terms of projects and participants.

A colleague from LabGenius and I also attended the “AI For Women” dinner event on Monday evening, which was hosted by Google and organized by a company they fund called X.

To conclude

This internship was a great experience that taught me a lot in terms of maths, programming, workflow management, biology and many other fields. I met a great team of amazing scientists who are working hard to help the world and who are beginning to be a big deal in the field; I’m happy I got to be there at that time and that I’ll keep in touch with these people.

And of course: a huge thank you to everyone at LabGenius for making me feel at home, and special thanks to Eyal, Katya, Eddie, Sam and Matthew for guiding me and collaborating with me during these 6 months!

By the way, if you’re interested in learning more about LabGenius, they will be at BioEurope 2019 in Hamburg next month where there will be various workshops and panels about the latest discoveries in the world of biotech and pharmaceuticals. Make sure to check it out!

 

And now, it’s time for new adventures. What exactly? I don’t know. But I’m up for the challenge and I’m sure I’ll manage to find new projects that are just as exciting as this one was!

REFERENCES
  1. LabGenius’ website: https://www.labgeni.us/
  2. DEAP’s documentation: https://deap.readthedocs.io/en/master/
  3. TensorFlow’s website: https://www.tensorflow.org/
  4. Docker’s website: https://www.docker.com/
  5. ReactJS’ website: https://reactjs.org/
  6. Flask’s documentation: https://flask.palletsprojects.com/en/master/
  7. Bottle’s documentation: https://bottlepy.org/docs/dev/
  8. CognitionX’s website: https://cognitionx.com/
  9. BioEurope 2019’s website: https://ebdgroup.knect365.com/bioeurope/?_ga=2.158519866.12972747.1571040172-926145317.1571040172
  10. Mindfire Solutions, “Python: 7 Important Reasons Why You Should Use Python” (https://medium.com/@mindfiresolutions.usa/python-7-important-reasons-why-you-should-use-python-5801a98a0d0b). October 2017. [Online; last accessed 07-October-2019].
