An in silico hope for biology: machine learning
How EMBL scientists are using machine learning to advance biology
“I’m excited by the problems EMBL biologists want me to help them solve using image analysis!” exclaims Anna Kreshuk with a smile. Kreshuk is one of many researchers across EMBL’s sites who use machine learning to solve problems in biology. Just months after starting as a group leader at EMBL, she has a growing list of collaborators who want to use her methods to automatically extract information from microscopy images.
After a degree in mathematics, Kreshuk worked for three years at CERN as a scientific programmer before pursuing a PhD in machine learning. Since the completion of her PhD in 2012, the field of machine learning has exploded.
What is machine learning?
Babies start learning the moment they’re born. Whether it’s holding a spoon or mastering French irregular verbs, we learn by taking in new information and improve through repetition. But the ability to learn and improve at a task is not confined to humans or animals: computers can do it too.
Machine learning brings together statistics and computer science so that computers can learn to perform a specific task without being explicitly programmed to do so. For a computer to learn, it needs to have some initial data on how to do a specific task. The computer finds statistical patterns in the data that enable it to establish an algorithm by which future data will be sorted. The more useful data the machine has access to over time, the more finely tuned its algorithm will become and the more accurate its decisions will be. The ultimate goal of machine learning is for the algorithm to be able to generalise beyond the information it has seen and successfully interpret new data.
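The learn-then-generalise idea can be sketched with about the simplest model there is: a straight line fitted to a few example points and then applied to a value it has never seen. The data and all names below are made up for illustration and are not taken from any EMBL project.

```python
def fit_line(xs, ys):
    """Least-squares fit of y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    covariance = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    variance = sum((x - mean_x) ** 2 for x in xs)
    slope = covariance / variance
    intercept = mean_y - slope * mean_x
    return slope, intercept


# Toy training data (hypothetical) that happens to follow y = 2x + 1.
train_x = [0.0, 1.0, 2.0, 3.0]
train_y = [1.0, 3.0, 5.0, 7.0]

# "Learning": estimate the pattern from the examples.
slope, intercept = fit_line(train_x, train_y)

# "Generalising": apply the learned pattern to an input not seen in training.
new_x = 10.0
prediction = slope * new_x + intercept
```

Real machine-learning models have far more than two parameters, but the workflow is the same: estimate parameters from examples, then judge the model by how well it handles new data.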
Machine learning is already widely applied: whether it’s filtering spam emails, autocorrecting your texting mistakes, or suggesting what movie to watch next, you probably benefit from machine learning dozens of times a day without knowing it.
Machine-learning algorithms don’t use a specific set of instructions to accomplish a particular task. Instead, the machine learns how to perform a task by using large amounts of data and learning the data’s internal structure. Machine learning can work in multiple ways. The simplest way is supervised – you show the system examples of what you want, and it learns the characteristics that will help it find such examples again. These methods are widely used to classify data or make predictions. At the other end of the spectrum are unsupervised methods, where the machine finds motifs in the underlying structure of the data and uses them to cluster the data into categories.
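As a rough illustration of the two ends of that spectrum, the sketch below uses a nearest-centroid classifier as a stand-in for supervised learning and a one-dimensional two-cluster k-means as a stand-in for unsupervised learning. The measurements (imagined cell diameters in micrometres) and all function names are hypothetical.

```python
def train_centroids(examples):
    """Supervised: average the labelled examples of each class."""
    sums, counts = {}, {}
    for value, label in examples:
        sums[label] = sums.get(label, 0.0) + value
        counts[label] = counts.get(label, 0) + 1
    return {label: sums[label] / counts[label] for label in sums}


def classify(value, centroids):
    """Assign a new value to the class with the nearest centroid."""
    return min(centroids, key=lambda label: abs(value - centroids[label]))


def two_means(values, rounds=10):
    """Unsupervised: split unlabelled values into two clusters.

    Assumes the values are not all identical, so neither cluster is empty.
    """
    low, high = min(values), max(values)
    for _ in range(rounds):
        group_low = [v for v in values if abs(v - low) <= abs(v - high)]
        group_high = [v for v in values if abs(v - low) > abs(v - high)]
        low = sum(group_low) / len(group_low)
        high = sum(group_high) / len(group_high)
    return sorted(group_low), sorted(group_high)


# Supervised: the examples come with labels supplied by a person.
labelled = [(8.0, "small"), (9.0, "small"), (10.0, "small"),
            (19.0, "large"), (20.0, "large"), (21.0, "large")]
centroids = train_centroids(labelled)
guess = classify(11.5, centroids)

# Unsupervised: the same numbers without labels; the groups are discovered.
clusters = two_means([8.0, 21.0, 9.0, 19.0, 10.0, 20.0])
```

In the supervised case the categories are defined by whoever labelled the examples; in the unsupervised case the algorithm proposes the grouping and a person has to decide what, if anything, it means.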
Machine-learning techniques often involve the use of artificial neural networks (ANNs), which consist of a set of nodes – known as artificial neurons – with connections between them. In the above image of an ANN, the white circles represent artificial neurons. The lines are connections from the output of one artificial neuron to the input of another. In this case, the network has four nodes in the input layer, six so-called hidden nodes and two nodes in the output layer.
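A network of that size is small enough to write out in a few lines. The sketch below passes one input through a four-six-two network; it assumes the six hidden nodes form a single, fully connected layer and uses a sigmoid activation and random, untrained weights, none of which is specified in the description above.

```python
import math
import random


def make_layer(n_inputs, n_neurons, rng):
    """One weight per incoming connection, plus a bias, for each neuron."""
    return [{"weights": [rng.uniform(-1, 1) for _ in range(n_inputs)],
             "bias": rng.uniform(-1, 1)}
            for _ in range(n_neurons)]


def activate(layer, inputs):
    """Each neuron sums its weighted inputs and squashes the total to (0, 1)."""
    outputs = []
    for neuron in layer:
        total = neuron["bias"] + sum(
            w * x for w, x in zip(neuron["weights"], inputs))
        outputs.append(1.0 / (1.0 + math.exp(-total)))
    return outputs


rng = random.Random(42)               # fixed seed so the run is repeatable
hidden_layer = make_layer(4, 6, rng)  # 4 inputs feed 6 hidden neurons
output_layer = make_layer(6, 2, rng)  # 6 hidden neurons feed 2 outputs

signal = [0.2, 0.7, 0.1, 0.9]         # arbitrary example input
result = activate(output_layer, activate(hidden_layer, signal))

# With full connectivity there are 4*6 + 6*2 connections in total.
n_connections = sum(len(n["weights"]) for n in hidden_layer + output_layer)
```

Because the weights here are random, the two output values carry no meaning yet. Training is the process of adjusting those weights so that the outputs become useful.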
One class of ANNs is the deep neural network (DNN). In a DNN, instead of a single layer between the input and output layers, there are many interconnected hidden layers. Starting from the output end, a back-propagation algorithm works back through the layers, adjusting the mathematical weight given to each of the connections in the network until the final result matches the output of the training data.
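The weight-adjusting step can be sketched on a deliberately tiny network. The example below assumes one hidden layer of three sigmoid neurons, a squared-error measure and the classic XOR toy problem as training data; these choices are illustrative and much smaller than any real DNN.

```python
import math
import random


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def forward(x, w_hidden, w_out):
    """Return hidden activations and the output; last weight in a row is a bias."""
    hidden = [sigmoid(sum(w * v for w, v in zip(row, x + [1.0])))
              for row in w_hidden]
    output = sigmoid(sum(w * v for w, v in zip(w_out, hidden + [1.0])))
    return hidden, output


def mean_squared_error(data, w_hidden, w_out):
    return sum((forward(x, w_hidden, w_out)[1] - y) ** 2
               for x, y in data) / len(data)


def train(data, n_hidden=3, epochs=2000, rate=0.5, seed=0):
    rng = random.Random(seed)
    n_in = len(data[0][0])
    w_hidden = [[rng.uniform(-1, 1) for _ in range(n_in + 1)]
                for _ in range(n_hidden)]
    w_out = [rng.uniform(-1, 1) for _ in range(n_hidden + 1)]
    initial = mean_squared_error(data, w_hidden, w_out)
    for _ in range(epochs):
        for x, y in data:
            hidden, output = forward(x, w_hidden, w_out)
            # Error signal at the output neuron.
            delta_out = (output - y) * output * (1 - output)
            # Pass the error signal back to each hidden neuron.
            delta_hidden = [delta_out * w_out[j] * hidden[j] * (1 - hidden[j])
                            for j in range(n_hidden)]
            # Nudge every weight against its share of the error.
            for j, h in enumerate(hidden + [1.0]):
                w_out[j] -= rate * delta_out * h
            for j in range(n_hidden):
                for i, v in enumerate(x + [1.0]):
                    w_hidden[j][i] -= rate * delta_hidden[j] * v
    return initial, mean_squared_error(data, w_hidden, w_out)


XOR = [([0.0, 0.0], 0.0), ([0.0, 1.0], 1.0),
       ([1.0, 0.0], 1.0), ([1.0, 1.0], 0.0)]
initial_loss, final_loss = train(XOR)
```

Comparing `initial_loss` with `final_loss` shows whether the repeated small corrections have brought the network's output closer to the training targets. How close it gets depends on the starting weights, the learning rate and the number of passes over the data.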
A specialised type of DNN is a convolutional neural network (CNN). When using a non-convolutional DNN for image analysis, each neuron in the first layer takes the whole image as its input. In a CNN, by contrast, individual neurons do not respond to the whole image, but only to a restricted region of it called the receptive field. This reduces the complexity of the neural network while still allowing it to outperform other types of neural network on image analysis tasks.
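The receptive-field idea can be illustrated by sliding a small filter across an image, so that each output value depends only on a small patch of pixels. The sketch below uses a hand-written two-by-two filter on a made-up four-by-four image; in a real CNN the filter values would be learned during training, and the operation shown is the cross-correlation that deep-learning libraries conventionally call a convolution.

```python
def cross_correlate(image, kernel):
    """Slide the kernel over the image; each output sees only a small patch."""
    k_rows, k_cols = len(kernel), len(kernel[0])
    out_rows = len(image) - k_rows + 1
    out_cols = len(image[0]) - k_cols + 1
    output = []
    for r in range(out_rows):
        row = []
        for c in range(out_cols):
            row.append(sum(kernel[i][j] * image[r + i][c + j]
                           for i in range(k_rows) for j in range(k_cols)))
        output.append(row)
    return output


# Toy image: dark on the left, bright on the right.
image = [[0, 0, 1, 1],
         [0, 0, 1, 1],
         [0, 0, 1, 1],
         [0, 0, 1, 1]]

# Hand-set filter that responds where brightness increases from left to right.
edge_kernel = [[-1, 1],
               [-1, 1]]

feature_map = cross_correlate(image, edge_kernel)

# A neuron that sees the whole image needs one weight per pixel;
# the filter reuses the same few weights at every position.
weights_whole_image = len(image) * len(image[0])
weights_filter = len(edge_kernel) * len(edge_kernel[0])
```

The feature map is non-zero only in the middle column, where the dark-to-bright edge sits. The filter needs 4 weights rather than the 16 a whole-image neuron would need here, and the gap widens rapidly as images get larger.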
One of the most popular applications of machine learning is in image analysis. Jonas Hartmann, a PhD student in the Gilmour group, is interested in using image analysis to understand how the cells in a tissue interact. “I’m fascinated to observe how cells come together and create new behaviours that you couldn’t easily see in a reductionist way,” he explains. To understand how this works, Hartmann studies the zebrafish posterior lateral line primordium (pLLP), a group of about 100 cells in the zebrafish embryo that move collectively, differentiate, and make different shapes. Hartmann wants to learn how such processes are integrated and coordinated within a tissue. To do so, he is building an atlas.
“The idea is to build a cell atlas where you have a reference measurement that you can use as a coordinate system, allowing you to superimpose other measurements. You can then map all your information together and see the relationships between the different features.” To make his atlas, Hartmann used microscopy images of the pLLP with both the cell membranes (the reference measurement) and one of many other proteins of interest highlighted. An example of one such protein is actin – a filament-forming protein involved in cell movement and changes in cell shape. Hartmann applied visual filters and feature-extraction techniques to segment each of the cells in the tissue and numerically describe their shapes. Finally, he used machine learning to find the relationship between the reference measurement (the membranes) and the measurement of interest (e.g. actin) to create the atlas.
Machine-learning methods fall into the category of ‘narrow artificial intelligence’: given a narrowly defined task and the right training data, machines are able to learn how to perform specific tasks as well as, or in some cases better than, humans can. Also – an especially appealing feature to some – machines can work non-stop.
Following cells in real time
“In today’s world, I think everybody wants to work better and faster,” says Rajwinder Singh, a PhD student in the Hufnagel group. Singh is studying the early stages of cell differentiation in mouse embryos. During the first stages of embryonic development, all the cells are the same, but when the embryo undergoes the transition from the 8- to the 16-cell stage, its cells start to differentiate. When a cell divides at this stage, the two daughter cells that form are slightly different to each other because they will belong to different kinds of tissues. When this happens, Singh extracts the two daughter cells to see how their patterns of gene expression differ. Unfortunately for him, it’s impossible to predict when these cell divisions will happen, so he needs to sit at the microscope for five or six hours, waiting. He’s therefore keen to teach a computer how to recognise this event.
In collaboration with Kreshuk, Singh plans to teach the machine to segment the image in real time. By providing enough images of the kind of cell divisions he’s interested in – which occur radially and below the surface – the machine will be able to learn exactly what Singh is looking for. When acquiring the data in real time, it will be able to determine whether a particular cell division is an event of interest.
Machine-learning algorithms are being used around the world every day, filtering your spam emails or recognising faces on Facebook. But it’s almost impossible to understand how a machine makes a particular decision or prediction – a concept known as uninterpretability.
Once the data is fed into the machine, the input nodes start abstracting it, passing the information forward and connecting it to the different nodes of the system, which in the case of a deep network may exist in very large numbers. As Kreshuk puts it, “The calculations are happening in a multidimensional space. Even if you can soak in all the parameters and imagine what it’s doing with them – because it’s not doing anything complicated – there are just too many of them. It’s very hard to interpret what’s going on inside.”
That situation might soon change, however. “For image analysis applications, uninterpretability can still be alright,” continues Kreshuk. “But in clinical applications, for example, it’s different. There are a lot of people working on making the black box more interpretable and understanding what drives and influences these decisions. Everyone wants to know, and I think we’ll see a breakthrough in this direction in the next few years.”
Lara Urban, a PhD student in the Stegle group at EMBL-EBI, is applying CNNs to human genome data to predict splicing patterns – changes in the way genetic information is used to make proteins, which allow a single gene to code for more than one protein.
In her project, Urban wanted to assess which patterns in the genome are important for predicting splicing, and to see if DNA methylation – the addition of a specific chemical group to the DNA molecule – plays a role in splicing. Urban used different machine-learning methods to tackle this problem. “It depends on what you want to do,” she says. “If it’s about making predictions, many machine-learning models work well enough, but to find novel DNA motifs that influence splicing patterns, convolutional neural networks are the perfect tool.”
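One way to see why convolutional filters suit motif finding is that a filter sliding along a DNA sequence behaves like a motif scanner. The sketch below is not Urban's model: it assumes a one-hot encoding of the four bases and uses a single hand-set filter for the GT dinucleotide found at the start of most introns, whereas a trained CNN would learn many filters from data. The sequence is made up.

```python
BASES = "ACGT"


def one_hot(sequence):
    """Encode each base as four numbers, with a 1 marking which base it is."""
    return [[1.0 if base == b else 0.0 for b in BASES] for base in sequence]


def scan(sequence, motif_filter):
    """Slide the filter along the sequence and score every position."""
    encoded = one_hot(sequence)
    width = len(motif_filter)
    scores = []
    for start in range(len(encoded) - width + 1):
        score = sum(motif_filter[i][j] * encoded[start + i][j]
                    for i in range(width) for j in range(len(BASES)))
        scores.append(score)
    return scores


# Hand-set filter (illustrative): scores 2.0 only on an exact "GT" match.
gt_filter = one_hot("GT")

scores = scan("ACGTAAGGTC", gt_filter)
hits = [position for position, score in enumerate(scores) if score == 2.0]
```

In a trained network the logic runs the other way round: the filter values are adjusted during training, and reading them back afterwards can suggest which sequence patterns the model found informative.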
Urban is an ecologist by training, and although she currently works on cancer genomics, she hopes to apply her machine-learning skills to ecology one day. “I’d like to apply machine learning to the genomes of endangered species to investigate their susceptibility to various diseases,” she explains. “Or see how the genome changes as a result of evolutionary or environmental pressures.”
Kreshuk and her group hope to put an end to one of the most time-consuming parts of biological research. “I want to remove the bottlenecks that exist in biological image analysis pipelines. I want to enable people to do more ambitious experiments, to do things that just take too long to do by hand. Things that people are not even planning because it would take too much time!” she says. “That way, they’ll be able to think about more interesting things and have the freedom to be truly creative.”