Case study: AlphaFold uses open data and AI to discover the 3D protein universe

Open data stored at EMBL-EBI played a pivotal role in the development of the AlphaFold AI.

Decorative image showing a nuclear pore complex AlphaFold prediction on blue and green background. — Credit: Nuclear pore complex prediction by AlphaFold. Edited by Karen Arnott/EMBL-EBI. Background image from Adobe Stock Images.

Challenge: Predicting how proteins fold

Proteins make up all living things. There are millions of proteins and each one has a unique shape, also called a structure. But protein structures can be elusive, and it can take years of experimental work to determine them.

Since 1958, when John Kendrew determined the world’s first protein structure by decoding myoglobin – a protein found in the heart and skeletal muscle tissue of mammals – almost 200,000 protein structures have been demonstrated using experimental methods. But this is a drop in the ocean, compared to the over 200 million known proteins.

John Kendrew (left) and a myoglobin structure available in the Protein Data Bank in Europe (PDBe) (right). Credit: Wikipedia and PDBe. PMID: 28795725.

To address this challenge, scientists looked towards computational methods, to help predict protein structures. How a protein folds itself into its unique shape is one of biology’s greatest mysteries – and it remains unsolved to this day.

Why are protein structures important?

Protein structures help us understand how proteins work and what function they fulfil. This, in turn, helps researchers develop hypotheses about how to control or modify a protein’s function. For example, in the case of proteins linked to diseases, the protein’s structure is useful for developing drugs to treat the disease. Similarly, many diseases happen when proteins don’t fold correctly.

In 1994, computational biologist John Moult set up the Critical Assessment of protein Structure Prediction (CASP), an international competition where groups of researchers develop computational models to predict protein folding. For many years, computational methods struggled to reach the confidence threshold of the more accurate, but time-consuming, experimental methods. But 2020 marked a shift in the tide.

At the end of the CASP14 competition that year, the judges announced to a stunned audience that the AlphaFold 2 system developed by London-based AI company DeepMind had achieved a level of accuracy comparable to experimental methods. AlphaFold 2’s average was less than one tenth of a nanometer off the positions that were determined by experiments. This had never been achieved before.

Solution: AlphaFold – built on open data

AlphaFold 2 is an attention-based neural network system, trained end-to-end, meaning all of the different steps of the process are simultaneously trained instead of sequentially. This gives more flexibility to the network, allowing it to dynamically learn interactions between non-neighboring nodes. AlphaFold 2 also uses evolutionarily related protein sequences, multiple sequence alignment, and a representation of amino acid residue pairs to refine its predictions.

Additionally, AlphaFold 2 offers two reliability metrics which provide confidence estimates in the predicted protein structure. This tells scientists how accurate the predictions are likely to be.

“Public data were essential to the development of AlphaFold.”

John Jumper, Senior Staff Research Scientist at DeepMind & AlphaFold Team Lead

Like any AI method, AlphaFold required a lot of training data to ‘learn’ how to make accurate predictions. DeepMind trained AlphaFold on publicly-available data, such as those managed and supported by EMBL’s European Bioinformatics Institute. DeepMind used, among other data sources:

Experimentally-determined protein structures from the Protein Data Bank
Protein sequences and annotations from UniProt
Metagenomics data from MGnify

“Public data were essential to the development of AlphaFold,” said John Jumper, Senior Staff Research Scientist at DeepMind and AlphaFold Team Lead. “The careful curation of such large data resources, representing the collective output of an entire subfield of biology, is exactly what enables our machine learning models to generalise well across such a huge range of proteins, enabling further breakthroughs in machine learning in other scientific areas.

“We expect that the expansion of metagenomic efforts like MGnify will be very important for increasing both AlphaFold accuracy and the impact of predicted structures on discovering novel enzymes in metagenomic samples.

“In addition to training resources, we found the annotations provided important to understanding and debugging model performance during the development of AlphaFold. Particularly, the residue annotations in UniProt were essential in establishing the link between AlphaFold confidence and protein disorder. Before consulting these UniProt disorder annotations, we had no idea how to interpret the long stretches of low confidence that would accompany many AlphaFold predictions.”

What followed: Sharing AlphaFold predictions with the world

During the months following DeepMind’s success at CASP14, there was endless speculation about what the company would do with its revolutionary AI, specifically whether it would make the code open source for the community to use.

In 2021, DeepMind announced that not only would it publish the AlphaFold 2 source code, but it would also make its predicted structures freely-available to all through the AlphaFold Protein Structure Database, a collaborative project with EMBL-EBI.

“EMBL’s support in developing the AlphaFold Database was crucial; it significantly amplified the impact and reach of AlphaFold predictions across the global scientific community.”

John Jumper, Senior Staff Research Scientist at DeepMind & AlphaFold Team Lead

DeepMind approached EMBL-EBI to support the development of the database because of the institute’s decades of experience in managing the world’s biological data. EMBL-EBI could ensure the AlphaFold Database would fulfil the requirements of the scientific community. It could also support the integration of AlphaFold prediction with other molecular databases, to help give researchers the most comprehensive information possible on the protein of their choice.

“EMBL-EBI champions open data, so when DeepMind approached us about developing a database for AlphaFold predictions, we were quick to respond,” explained Sameer Velankar, EMBL-EBI Team Leader for the Protein Data Bank in Europe. “Our technical and scientific teams worked together and we managed to develop, test, and launch the database in record time.”

EMBL-EBI’s Protein Structure Database in Europe (PDBe) team worked closely with DeepMind to develop the AlphaFold database, which launched with approximately 350,000 protein structure predictions, including the vast majority of human proteins – a key dataset for healthcare research.

EMBL-EBI’s contribution:

Co-development of the AlphaFold Database
Data standards – to ensure high quality
Data curation – to make predictions easy to find and analyse
Data integration – cross linking AlphaFold predictions in other biomolecular data resources

And this was only the beginning. The DeepMind and EMBL-EBI teams continued to work together to update the database. One of the challenges they faced was the sheer size of the dataset, and the need to quickly scale up to make millions more structure predictions available.

In January 2022, they added protein structure predictions for 17 organisms on the World Health Organisation’s neglected tropical diseases list and 10 organisms on its antimicrobial resistance list, bringing the total number of predictions in the database up to almost 1 million.

Exceeding expectations: 200 million protein structure predictions

Just one year after the launch, in a gargantuan effort, EMBL-EBI and DeepMind released a major update, covering over 200 million protein structure predictions. This includes almost all ‘known’ proteins that have sequences in the UniProt database.

This meant that for the first time ever, scientists had a comprehensive view of the protein universe in 3D, and a prediction for almost every protein that has ever been catalogued. In early 2022, the number of researchers accessing the database since launch was up to to nearly one million, with these users viewing 3 million protein structures.

AlphaFold database in numbers

200 million protein structure predictions
1 million organisms
2 million users in 190 countries
AlphaFold Nature papers cited more than 28,000 times

*between July 2021 – January 2023

“The popularity and growth of the AlphaFold Database is testament to the success of the collaboration between DeepMind and EMBL. It shows us a glimpse of the power of multidisciplinary science.”

Edith Heard, Director General of EMBL

EMBL-EBI has also helped to integrate the AlphaFold 2 protein structure predictions into other public data resources, so the global scientific community can use this additional data source to gain new insights and answer pressing questions. Making AlphaFold predictions FAIR (Findable, Accessible, Interoperable and Reproducible) is essential for maximising the impact of the data.

AlphaFold data flow

AlphaFold was trained on UniProt, MGnify, RCSB PDB, Uniclust, BFD.

AlphaFold data feeds into a large number of data resources, including UniProt, PDBe-KB, InterPro, Ensembl, Open Targets, 3D-Beacons, Swiss-Model Repository, Jalview, RCSB PDB, MobiDB, GrainGenes, STRING.

Only the beginning: AlphaFold’s impact so far

The AlphaFold system and database have disrupted the way biology is done, and have stimulated new directions in the field of structural biology.

AlphaFold predictions have been used, among other things, to:

Fight antibiotic resistance
Advance research into both cures and vaccines for diseases, from leishmaniasis, to malaria
Understand the nuclear pore complex
Accelerate the fight against plastic pollution
Understand and protect honey bees
Develop bioinformatics tools, such as Foldseek and Dali, which enable users to search for entries similar to a given protein

“Scientists build on the shoulders of giants. In fact, most often, those shoulders are data,” said Janet Thornton, Director Emeritus at EMBL-EBI. “Having these millions of structure predictions will change the face of biology. This is useful to medicine, agriculture, biotech, everything – it’s just fantastic.”

Video credit: DeepMind.

Case study: AlphaFold uses open data and AI to discover the 3D protein universe