Machine learning to identify and prioritise drug targets
Open Targets is using artificial intelligence and machine learning to identify and prioritise drug targets
Artificial intelligence (AI) and machine learning (ML) have applications across all areas of drug discovery and development. Open Targets – a drug discovery consortium that brings together expertise from EMBL’s European Bioinformatics Institute (EMBL-EBI) and six other partners – focuses on drug target identification and prioritisation. AI/ML supports this work by helping Open Targets predict promising targets more efficiently.
The code and data from the Open Targets Platform and Open Targets Genetics are publicly available. This enables their reuse in AI applications, such as identifying novel target-disease associations using machine learning, building knowledge graphs to aid drug discovery, and benchmarking new computational methods to prioritise drug targets.
In this article, we take a look at some of the approaches to AI/ML that Open Targets has explored and implemented, including predicting the genes most likely to be associated with a disease based on Genome Wide Association Studies (GWAS) loci, using knowledge graphs to explore scientific literature, and classifying the reasons why clinical trials stop early.
Using machine learning to prioritise causal genes at GWAS loci
Predictive machine learning approaches are most effective when significant amounts of well-defined training data are available to answer specific questions. Open Targets therefore concentrates on applications of machine learning for which data suited to a specific question are available.
Genetic evidence is often collected through GWAS, which connect genetic variants to a disease. However, linking the implicated variant to a specific drug target is a challenge, especially since most variants identified in GWAS are in non-coding regions of the genome. Open Targets Genetics was created to systematically connect GWAS associations to the likely disease-causing gene.
To address this challenge, Open Targets researchers created the locus-to-gene (L2G) method, implemented in Open Targets Genetics. L2G prioritises and scores likely causal genes at each GWAS locus based on the relative strength of genetic and functional genomics features. The underlying machine learning model, built with XGBoost, was trained on a gold-standard set of GWAS loci for which there is high confidence in the gene mediating the association. The L2G scores are the main source of common genetic evidence in the Open Targets Platform.
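As a rough illustration of scoring genes relative to the others at the same locus, the sketch below scales each feature by the strongest value seen at that locus and ranks genes by a weighted sum. It is not the L2G model: the feature names, weights and example values are hypothetical, and the hand-set weights merely stand in for what a trained classifier such as XGBoost would learn from the gold-standard loci.

```python
from collections import defaultdict

# Hypothetical feature weights; in a trained model these would be learned.
WEIGHTS = {"distance_score": 0.5, "eqtl_score": 0.3, "coding_score": 0.2}


def rank_genes_by_locus(rows):
    """Rank candidate genes within each locus by a locus-relative weighted score.

    rows: iterable of dicts with keys "locus", "gene" and one key per feature.
    Returns {locus: [(gene, score), ...]} sorted from highest to lowest score.
    """
    by_locus = defaultdict(list)
    for row in rows:
        by_locus[row["locus"]].append(row)

    ranked = {}
    for locus, genes in by_locus.items():
        # Strongest value of each feature at this locus, used for scaling.
        best = {f: max(g[f] for g in genes) for f in WEIGHTS}
        scored = []
        for g in genes:
            score = sum(
                w * (g[f] / best[f] if best[f] > 0 else 0.0)
                for f, w in WEIGHTS.items()
            )
            scored.append((g["gene"], round(score, 3)))
        ranked[locus] = sorted(scored, key=lambda pair: pair[1], reverse=True)
    return ranked


# Toy input: two candidate genes at one made-up locus.
example_rows = [
    {"locus": "locus_1", "gene": "GENE_A",
     "distance_score": 0.9, "eqtl_score": 0.8, "coding_score": 0.0},
    {"locus": "locus_1", "gene": "GENE_B",
     "distance_score": 0.3, "eqtl_score": 0.4, "coding_score": 0.0},
]
top = rank_genes_by_locus(example_rows)
```

In this toy case GENE_A ranks first because it has the strongest value for every informative feature at the locus. Scaling within the locus means a gene is judged against its neighbours rather than against genes elsewhere in the genome.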
Knowledge graphs to explore scientific literature
Another AI/ML approach Open Targets researchers have used is the creation of knowledge graphs. These are useful when integrating heterogeneous data from different sources. Knowledge graphs provide a visual representation of relationships between entities, and may help to infer previously unknown links.
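To make the idea concrete, a knowledge graph can be stored as a list of (subject, relation, object) triples and queried for indirect connections. The minimal sketch below proposes candidate gene-disease links through a shared drug. All entity names, relation names and the inference rule are hypothetical and chosen only for illustration.

```python
from collections import defaultdict

# Hypothetical triples: (subject, relation, object).
TRIPLES = [
    ("DRUG_X", "targets", "GENE_A"),
    ("DRUG_X", "treats", "DISEASE_1"),
    ("DRUG_Y", "targets", "GENE_B"),
    ("GENE_B", "associated_with", "DISEASE_2"),
]


def build_index(triples):
    """Index triples as {subject: {relation: set(objects)}}."""
    index = defaultdict(lambda: defaultdict(set))
    for subj, rel, obj in triples:
        index[subj][rel].add(obj)
    return index


def candidate_gene_disease_links(triples):
    """Suggest gene-disease pairs that are connected only indirectly,
    via a drug that targets the gene and treats the disease."""
    index = build_index(triples)
    known = {(s, o) for s, r, o in triples if r == "associated_with"}
    candidates = set()
    for relations in index.values():
        for gene in relations.get("targets", ()):
            for disease in relations.get("treats", ()):
                if (gene, disease) not in known:
                    candidates.add((gene, disease))
    return sorted(candidates)


links = candidate_gene_disease_links(TRIPLES)
```

Here the only suggested link is between GENE_A and DISEASE_1, because DRUG_X both targets that gene and treats that disease while no direct association is recorded.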
The LIterature coNcept Knowledgebase (LINK) is a knowledge graph that was previously used in the Open Targets Platform. LINK uses natural language processing (NLP) of PubMed abstracts to extract key concepts and relationships between a defined set of entities: genes, diseases, and drugs. The LINK library, including a pipeline, API, and web interface, allowed users to explore half a billion relations between these entities, with the aim of creating a comprehensive graph of biomedical knowledge.
LINK has since been replaced by another ML pipeline developed by Europe PMC, which uses named entity recognition to identify when targets, diseases, and drugs are mentioned together in published articles, including open access full text.
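Once entities have been tagged in each article, finding those that are mentioned together comes down to counting co-occurrences. The sketch below assumes a hypothetical input format, in which each article is a dict of already-recognised targets and diseases. It does not reflect the actual Europe PMC pipeline or its output schema.

```python
from collections import Counter
from itertools import product


def count_cooccurrences(articles):
    """Count how many articles mention each (target, disease) pair together.

    articles: iterable of dicts with "targets" and "diseases" lists, as might
    be produced by an upstream named entity recognition step.
    """
    counts = Counter()
    for article in articles:
        # Use sets so a pair is counted at most once per article.
        targets = set(article.get("targets", []))
        diseases = set(article.get("diseases", []))
        for pair in product(sorted(targets), sorted(diseases)):
            counts[pair] += 1
    return counts


# Toy annotations for three made-up articles.
tagged_articles = [
    {"targets": ["GENE_A", "GENE_A"], "diseases": ["DISEASE_1"]},
    {"targets": ["GENE_A", "GENE_B"], "diseases": ["DISEASE_1"]},
    {"targets": ["GENE_B"], "diseases": []},
]
pair_counts = count_cooccurrences(tagged_articles)
```

Counting each pair once per article keeps a single paper that repeats a gene name many times from dominating the totals.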
Using this information, the Word2Vec model, an NLP technique that represents words as numerical vectors, enables Open Targets to infer relationships between these entities. The results of this analysis are presented in the Bibliography widget on the Open Targets Platform, allowing users to explore these relationships.
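Word2Vec represents each term as a vector, and one common way to compare two terms is the cosine similarity of their vectors. The sketch below shows only that comparison step. The three-dimensional vectors and entity names are made up, whereas a real model would learn much longer vectors from text, and whether Open Targets uses this exact measure is an assumption.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


def most_similar(query, embeddings, top_n=2):
    """Return the top_n entities closest to `query`, most similar first."""
    scores = [
        (name, cosine_similarity(embeddings[query], vector))
        for name, vector in embeddings.items()
        if name != query
    ]
    return sorted(scores, key=lambda pair: pair[1], reverse=True)[:top_n]


# Made-up embeddings for illustration only.
toy_embeddings = {
    "GENE_A": [1.0, 0.0, 0.0],
    "DISEASE_1": [0.8, 0.6, 0.0],
    "DRUG_X": [0.0, 0.0, 1.0],
}
neighbours = most_similar("GENE_A", toy_embeddings)
```

With these toy vectors, DISEASE_1 is ranked closer to GENE_A than DRUG_X is, because its vector points in a similar direction.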
What AI/ML can tell us about why clinical trials stop
Almost 80% of clinical trials fail due to a lack of efficacy or unpredicted safety issues. Analysing why clinical trials fail could help reduce these high attrition rates and inform target prioritisation.
A recently completed Open Targets project systematically assessed why clinical trials stopped early. The project applied NLP to the free-text stop reason listed on ClinicalTrials.gov and classified the reasons into 17 broad categories.
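The classification task itself can be pictured with a much simpler stand-in. The sketch below assigns a free-text stop reason to a category using keyword rules. The category names and keywords are hypothetical and are not the 17 categories used in the project, which relied on NLP rather than hand-written rules.

```python
# Hypothetical categories and trigger keywords, checked in order.
CATEGORY_KEYWORDS = [
    ("safety", ["adverse", "toxicity", "safety"]),
    ("lack_of_efficacy", ["efficacy", "futility", "no benefit"]),
    ("enrollment", ["enrollment", "enrolment", "recruit", "accrual"]),
    ("business", ["funding", "sponsor decision", "business"]),
]


def classify_stop_reason(reason):
    """Return the first category whose keywords appear in the reason text,
    or "other" if none match."""
    text = reason.lower()
    for category, keywords in CATEGORY_KEYWORDS:
        if any(keyword in text for keyword in keywords):
            return category
    return "other"


examples = [
    "Slow accrual of participants",
    "Terminated for futility at interim analysis",
    "Unexpected adverse events",
    "Study completed",
]
labels = [classify_stop_reason(reason) for reason in examples]
```

Keyword rules like these are brittle: a phrase such as "no safety concerns" would be mislabelled, which is one reason a trained language model is a better fit for free text.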
When browsing the clinical trials evidence in the Open Targets ChEMBL widget, users can view both the category and the original reason a clinical trial stopped early. The categories include negative, neutral, and positive reasons, which are reflected in the scoring of the evidence.
When contrasting the clinical trial stop reasons with the available genetic evidence for the therapy under investigation, Open Targets researchers found that trials are more likely to stop due to lack of efficacy when there is little evidence from human genetics or animal models.
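One common way to quantify a contrast like this is an odds ratio computed from a two-by-two table. The sketch below shows the arithmetic only. The counts are invented for illustration and are not results from the Open Targets analysis, and the choice of statistic is an assumption.

```python
def odds_ratio(a, b, c, d):
    """Odds ratio for a 2x2 table laid out as:

                             stopped for efficacy   other outcome
        no genetic support            a                   b
        genetic support               c                   d
    """
    if b == 0 or c == 0:
        raise ValueError("odds ratio is undefined when b or c is zero")
    return (a * d) / (b * c)


# Invented counts, for illustration only.
example_ratio = odds_ratio(a=30, b=70, c=10, d=90)
```

With these invented counts the ratio is about 3.86, meaning the odds of stopping for lack of efficacy would be nearly four times higher for trials without genetic support. A ratio of 1 would indicate no difference between the two groups.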
The overarching question Open Targets wants to answer is: can we predict the best targets for new, safe, and effective therapies?
Some Open Targets projects apply machine learning to identify the characteristics of a ‘good’ target in different therapeutic areas, but this work is hampered by the lack of gold-standard targets from which to learn.
“This is a very complicated question, for which there isn’t enough training data to answer accurately using a machine learning approach,” said Ian Dunham, Director of Open Targets. “Instead, we have broken down the question into smaller components, for which we do have the data.”
“Ultimately, the predictive power of machine learning is dependent on high-quality data — so we need to keep generating and sharing it,” he concluded.