Millions of previously uncharacterised proteins in the UniProt database have now been annotated – given a functionally relevant name – using a natural language processing model. This model was developed by a collaboration between EMBL’s European Bioinformatics Institute (EMBL-EBI) and Google Research.

What is natural language processing?

Natural language processing uses artificial intelligence and deep learning methods to analyse natural language and speech. One example is training deep learning models to automatically extract information from scientific literature.

UniProt is the world’s leading freely accessible resource for protein sequence and functional information. The database currently stores information on over 200 million protein sequences. However, over 50 million of these had no annotation and were labelled as uncharacterised proteins.

ProtNLM – developed by the team at Google Research – is a natural language processing model that accurately predicts descriptions of protein function directly from a protein’s amino acid sequence. The model was trained to associate amino acid sequences from the UniProt database with the English-language words, phrases and sentences used to describe protein function.

The result is a robust natural language processing model that can be applied to unannotated protein sequences to generate novel descriptions with high accuracy. These updated protein annotations have now been added to the UniProt database.

AI improves data resources

AI is revolutionising the life sciences at an extraordinary rate. DeepMind’s AlphaFold AI system has transformed protein science as we know it, but many of EMBL-EBI’s data resources have also benefited from the addition of AI models. Examples include using machine learning to assign functions to genes in newly annotated Ensembl genomes and to add protein function information to the Pfam database. Now UniProt joins this list of databases getting a machine learning update.

“There were millions of proteins in the UniProt database that were classed as uncharacterised, meaning we didn’t have names for them,” said Maria Martin, Team Leader in the Protein Function Development Team at EMBL-EBI. “Thanks to the input of machine learning models, most of these now have names and functional information. We’re hoping to build on this in the future to add more functional attributes to the proteins listed in UniProt. We have a huge amount of gene ontology data at EMBL-EBI and these can also be used to train machine learning models to help provide protein function information.”

“The team at UniProt is absolutely top-notch, and we couldn’t be more pleased to contribute back to these amazing resources,” said Max Bileschi, Staff Research Software Engineer and Manager at Google Brain. “We’re helping millions of people do their research, and that’s something I never thought I’d say.”

Vital manual data curation

“EMBL-EBI’s database curation experts have been working closely with Google Research to help them to understand how proteins in the UniProt database are named and help them to distinguish what is an acceptable annotation,” said Sandra Orchard, Team Leader in the Protein Function Content Team at EMBL-EBI. “Getting an accurate protein name is vital as this is the first thing people look for when they are searching for their protein of interest.”

The protein sequence and functional information stored in the UniProt knowledgebase (UniProtKB) comes from two main sources: UniProtKB/Swiss-Prot, where protein sequences are manually annotated and reviewed by a team of expert curators, and UniProtKB/TrEMBL, where entries are partly annotated by automatic systems that use UniProtKB/Swiss-Prot entries as templates for propagating annotation. ProtNLM has helped to generate new annotations for those proteins in UniProtKB/TrEMBL that could not be annotated using the usual automatic annotation systems.
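For readers who think in code, the order of precedence described above can be sketched roughly as follows. This is an illustration only, not UniProt's or Google Research's actual pipeline: the `Entry` class, the `annotate` function, the toy rule table, the placeholder model and the accessions are all hypothetical names made up for this sketch.

```python
from dataclasses import dataclass
from typing import Callable, Optional, Tuple


@dataclass
class Entry:
    accession: str   # made-up identifiers, not real UniProt accessions
    sequence: str
    reviewed: bool   # True mimics a manually curated (Swiss-Prot-style) entry
    name: str = "Uncharacterized protein"  # placeholder label for unnamed entries


def annotate(
    entry: Entry,
    rule_based: Callable[[str], Optional[str]],
    model: Callable[[str], str],
) -> Tuple[str, str]:
    """Return (name, source): curated names first, then rules, then the model."""
    if entry.reviewed:
        return entry.name, "curated"
    rule_name = rule_based(entry.sequence)
    if rule_name is not None:
        return rule_name, "automatic"
    # Only sequences the automatic systems could not name reach the model.
    return model(entry.sequence), "model"


# Toy stand-ins (assumptions): a lookup table plays the rule-based system,
# and a fixed string plays the sequence-to-text model.
KNOWN = {"MKTAYIAK": "Example enzyme"}


def toy_rules(sequence: str) -> Optional[str]:
    return KNOWN.get(sequence)


def toy_model(sequence: str) -> str:
    return "Model-predicted name (placeholder)"


entries = [
    Entry("X00001", "MSTNPKPQ", reviewed=True, name="Curated example protein"),
    Entry("X00002", "MKTAYIAK", reviewed=False),
    Entry("X00003", "MGSSHHHH", reviewed=False),
]
results = [annotate(e, toy_rules, toy_model) for e in entries]
```

In this sketch only the third entry falls through to the model, mirroring the article's point that ProtNLM covers the proteins the existing automatic systems could not name.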

“One of the challenges we faced was that very little information is known about many of the proteins that were being annotated by ProtNLM,” said Elena Speretta, Senior Scientific Database Curator at EMBL-EBI. “I worked very closely with the researchers at Google to help evaluate the predictions generated by this AI model. The annotations it produces are overall remarkably good. ProtNLM is a powerful tool that will enable many researchers to explore the functional significance of so many biological sequences for the first time.”

Find out more about ProtNLM on the UniProt website.

Funding

This work was supported by the National Human Genome Research Institute (NHGRI), Office of Director (OD/DPCPSI/ODSS), National Institute of Allergy and Infectious Diseases (NIAID), National Institute on Aging (NIA), National Institute of General Medical Sciences (NIGMS), National Institute of Diabetes and Digestive and Kidney Diseases (NIDDK), National Eye Institute (NEI), National Cancer Institute (NCI), National Heart, Lung, and Blood Institute (NHLBI) of the National Institutes of Health under Award Number [U24HG007822] (the content is solely the responsibility of the authors and does not necessarily represent the official views of the National Institutes of Health).
