Deep learning models help predict protein function
Deep learning models can improve protein annotations and have helped expand the Pfam database
Our protein family database – Pfam – is used by a diverse range of researchers across the globe. Open access to the protein family data stored in Pfam has helped experimental biologists understand protein function, aided structural biologists’ insights into protein structure, given computational biologists rapid access to protein sequence information, and let evolutionary biologists trace the origins of proteins.
Pfam gives researchers access to vital protein annotations, structures, and multiple sequence alignments. It is a resource widely used to classify protein sequences into phylogenies and identify domains – functional regions – to provide insights into protein function.
With help from new deep learning models, Pfam has increased the protein sequence annotation and function data available within the database by an unprecedented amount. Research published in the journal Nature Biotechnology demonstrates how deep learning methods developed by Google Research could be trained using data from Pfam to accurately annotate many previously undescribed protein domains, shedding light on potential protein function. The new data has expanded Pfam to an extent that would have taken several years to achieve manually.
Deep learning and protein function
“Initially I was rather sceptical about using deep learning to reproduce the protein families within Pfam. Then I started collaborating more closely with Lucy Colwell and her team at Google Research and my scepticism quickly changed to excitement for the potential of these methods to improve our ability to classify sequences into domains and families,” said Alex Bateman, Senior Team Leader of Protein Sequence Resources at EMBL-EBI. “These models exceed my expectations. They’re not just copying the data already in Pfam, they’re able to learn from the data and find new information that is yet to be discovered. What this gives us is the ability to expand the Pfam collection and potentially that of other resources using these same deep learning methods.”
By combining deep learning models with existing methods to add new data into Pfam, the researchers were able to expand the database by almost 10%. This exceeds all expansion efforts made to the database over the last decade. The deep learning methods were also able to predict the function of 360 human proteins that had no previous annotation data available in Pfam.
Using additional protein family predictions generated by the Google Research team’s neural networks – a series of algorithms that looks for underlying structure in the sequences of protein domains and families – the researchers created a supplement to Pfam called Pfam-N, where N stands for network. Pfam-N adds a further 6.8 million protein sequences to the Pfam database.
“We’re also now building on these established deep learning methods to expand the information in the database even further,” said Bateman. “We’re changing the way the existing deep learning model works so that we can call multiple protein domains at once. This new update to the database should be ready very soon.”
“My personal view is that there’s still a lot of scope to improve the deep learning models we’re currently using,” Bateman added. “We’re in the early days of this and I’m very hopeful for what it will mean for the future classification of protein families. This may even be something that will get solved in the next five years.”
Find out more
Find out more about Pfam’s collaboration with Google Research and get a detailed introduction to Pfam-N in this Xfam blog post.
This work is funded by the Wellcome Trust as part of a Biomedical Resources grant awarded to the Pfam database.