Deciphering the data deluge: how large language models are transforming scientific data curation

Large language models are changing the way we carry out scientific data curation, annotation, and research, setting the stage for a more efficient understanding of scientific literature.

In a world inundated with data, curating valuable information has never been more challenging, or more important. From academic papers to scientific databases, the deluge of new information can be overwhelming, leaving researchers in a constant struggle to keep up. However, a groundbreaking innovation in artificial intelligence is helping to transform the data curation landscape: large language models (LLMs) such as those behind ChatGPT. Powered by sophisticated deep-learning algorithms, these models are revolutionising how we streamline and curate massive volumes of data.

Here we look at some of the ways researchers at EMBL’s European Bioinformatics Institute (EMBL-EBI) are taking advantage of LLMs to aid their data curation processes. From automating the summary and annotation of academic papers to assisting with ontology mapping, LLMs are not just aiding human curators but also have the potential to enhance the quality of the data EMBL-EBI provides to its users.

What are large language models (LLMs)?

LLMs are a type of artificial intelligence system trained on vast amounts of textual data. By processing and learning from this data, these models can generate coherent and contextually relevant text across a wide range of topics. LLMs can understand and produce human-like text, making them valuable tools for tasks such as content creation, answering questions, and natural language understanding.

Transforming data curation

Academic papers are being published at an unprecedented pace, and the challenge of pulling out relevant information has never been greater. Andrew Green, ARISE Fellow at EMBL-EBI, has been using LLMs to streamline data curation for RNAcentral, EMBL-EBI’s database for non-coding RNAs.

To do this, Green has developed a tool that scrapes scientific articles for sentences that mention specific RNA identifiers. These sentences are then fed into GPT-3.5, which generates concise, coherent summaries about the RNA of interest. The summaries describe key details such as the RNA’s functions, its involvement in diseases, and the organisms in which it has been studied.
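
The first two steps described above, finding the sentences that mention an identifier and packaging them for a model, can be sketched in a few lines of Python. This is a minimal illustration, not the RNAcentral tool itself: the function names, the prompt wording, the identifier `RNAX1`, and the sample text are all invented for the example, and the call to the language model is deliberately left out.

```python
import re


def sentences_mentioning(text: str, identifier: str) -> list[str]:
    """Return the sentences in `text` that mention `identifier` as a whole token."""
    # Naive split on ., ! or ? followed by whitespace. Real articles would
    # need a proper sentence tokenizer (abbreviations, decimals, and so on).
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    # Reject matches that sit inside a longer word or hyphenated name.
    pattern = re.compile(
        rf"(?<![\w-]){re.escape(identifier)}(?![\w-])", re.IGNORECASE
    )
    return [s for s in sentences if pattern.search(s)]


def build_summary_prompt(identifier: str, sentences: list[str]) -> str:
    """Assemble a prompt that asks for a summary citing the numbered sentences."""
    numbered = "\n".join(f"[{i}] {s}" for i, s in enumerate(sentences, start=1))
    return (
        f"Summarise what the following sentences say about {identifier}. "
        "Use only the sentences provided and cite them by number.\n\n"
        f"{numbered}"
    )


# Invented sample text with a made-up identifier, for illustration only.
article = (
    "RNAX1 is expressed in the liver. "
    "The authors also studied unrelated genes. "
    "Loss of RNAX1 changed cell growth."
)
hits = sentences_mentioning(article, "RNAX1")
prompt = build_summary_prompt("RNAX1", hits)
```

Numbering the sentences in the prompt is one simple way to let a generated summary point back to its sources, which is in the spirit of the clickable citations the summaries carry.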

“One of the intriguing features of using LLMs is the accuracy and contextual understanding they bring into the summarisation process,” said Green. “We’ve seen the model accurately decipher acronyms in a given context and even self-correct its errors when asked to fact-check its summaries.”

To ensure the summaries generated are robust, they go through multiple rounds of validation, and are then rated for quality before appearing in the RNAcentral database. The summaries serve as quick references for scientists to better understand a particular RNA, and also include clickable citations to the original articles on Europe PMC.

“It’s crucial to remember that LLMs don’t inherently know the difference between what’s real and what’s fabricated,” added Green. “In the scientific community, where factual accuracy is paramount, this could be a major concern. Models can sometimes ‘hallucinate’ details that aren’t in the original text. To mitigate this, we have put multiple validation rounds in place. This, combined with constant human oversight, ensures that the information presented is both accurate and reliable.”

At the heart of this approach is an automated method for extracting and summarising valuable information from a multitude of academic articles, which means the work can also be applied to many other EMBL-EBI resources. Once fully developed and implemented, this automated curation process will aid the work of many of EMBL-EBI’s curators, acting as a first filter in the lengthy process of data collection and interpretation.

Accelerating annotation

Another aspect of the EMBL-EBI database pipelines that can benefit from LLMs is data annotation. Melanie Vollmar is an ARISE Fellow at EMBL-EBI with a strong background in structural biology and a growing expertise in machine learning. As part of her fellowship, she is looking at how to fully automate the extraction of functional information about proteins from academic papers using LLMs.

Her project focuses on gathering structural information from the Protein Data Bank in Europe (PDBe) and supplementing it with related academic publications from Europe PMC. This curated information is then mined for specific functional details, which are mapped back onto the protein sequences listed in UniProt.

Until now, curating literature for functional annotations has followed a manual approach supplemented by traditional text-mining methods. LLMs, designed to grasp the intricacies of human language, can parse vast amounts of scientific literature, weigh contrasting opinions, and generate complex text-based outputs.

This capability can usher in a new era of data enrichment, as these models help to extract more detailed and contextually rich information from existing biological literature at an accelerated pace. At no point is such a model intended to replace the human biocurator, who is still required to provide a critical view of the output.

“With automation, not only do we increase the pace at which we can annotate data, but we also enrich the quality of that data, offering a more comprehensive resource for our users,” said Vollmar. “Our focus now is on protein structures, but the beauty of our approach is its adaptability: the methods we’re developing could easily be transplanted onto other types of biological data, elevating the annotation process across the board.”

Fine-tuning existing LLMs

Europe PMC is EMBL-EBI’s home of scientific literature, and after many years of serving the scientific community, the resource remains an intuitive and powerful search tool to help users stay on the cutting edge of science. Many of the database’s functionalities rely on literature curation, which involves scanning through dense academic material to extract essential information.

Santosh Tirunagari, Senior Machine Learning Developer at EMBL-EBI, is using LLMs to accelerate the curation of scientific literature within Europe PMC. He and others in the team have developed specialised named entity recognition models, which are fine-tuned versions of existing LLMs. These tools are designed to automatically identify critical scientific entities such as genes, proteins, diseases, and chemicals in research papers and patents.

Using this approach helps to side-step the high computational costs of developing a language model from scratch, which could require dozens of GPUs and extensive training time. By concentrating on the fine-tuning phase, Tirunagari has been able to adapt these powerful language models to specific tasks relevant to scientific curation. This maximises efficiency while achieving high levels of accuracy.

In one of his models, Tirunagari also uses an innovative ‘human-in-the-loop’ methodology for model training. Beginning with a limited dataset, the fine-tuned models undergo further adjustments using additional scientific papers. Human curators then verify the model’s findings, enabling an iterative feedback loop that continually improves the model’s accuracy.
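
The shape of such a feedback loop can be sketched as follows. This is a toy illustration only: the "model" here is a simple lookup table standing in for a fine-tuned language model, the curator is simulated by a table of correct answers, and every function name is hypothetical rather than taken from the Europe PMC pipeline.

```python
def train(labelled: dict[str, str]) -> dict[str, str]:
    """Toy 'training': the model is just a copy of the labelled examples.

    This is a stand-in for fine-tuning, not a real learning algorithm.
    """
    return dict(labelled)


def predict(model: dict[str, str], terms: list[str]) -> dict[str, str]:
    """Label each term, falling back to 'unknown' for unseen terms."""
    return {term: model.get(term, "unknown") for term in terms}


def curator_review(predictions: dict[str, str], gold: dict[str, str]) -> dict[str, str]:
    """Simulated human curator: returns the verified label for each prediction."""
    return {term: gold[term] for term in predictions}


def human_in_the_loop(seed, batches, gold):
    """Predict on each batch, have a curator verify, then retrain on the result."""
    labelled = dict(seed)
    model = train(labelled)
    accuracy_per_round = []
    for batch in batches:
        predictions = predict(model, batch)
        verified = curator_review(predictions, gold)
        correct = sum(predictions[t] == verified[t] for t in batch)
        accuracy_per_round.append(correct / len(batch))
        labelled.update(verified)   # curator-verified labels join the training set
        model = train(labelled)     # the next round starts from the improved model
    return model, accuracy_per_round


gold = {"BRCA1": "gene", "insulin": "protein", "asthma": "disease", "aspirin": "chemical"}
seed = {"BRCA1": "gene"}  # the limited starting dataset
batches = [["BRCA1", "insulin"], ["insulin", "asthma"], ["asthma", "aspirin", "insulin"]]
model, accuracy_per_round = human_in_the_loop(seed, batches, gold)
```

The point of the sketch is the order of operations: the model proposes, a person verifies, and only verified labels are fed back into training.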

“Large language models have been a game-changer in our efforts to automate the complex task of scientific curation. By fine-tuning these models, we’ve been able to develop highly specialised tools that can sift through vast amounts of scientific literature and patents to identify key entities such as genes, organisms, proteins, and diseases with impressive accuracy,” said Tirunagari. “This not only accelerates our work but also opens up new possibilities for collaborations, like our ongoing partnership with Open Targets to use these models to aid drug discovery.”

A novel approach to ontology mapping

Ontologies are structured, hierarchical classifications that are widely used for standardising terms such as disease names. Current practices for ontology mapping rely heavily on manual curation, making it a time-consuming and error-prone task. To address these issues, Kirill Tsukanov, Senior Bioinformatician at EMBL-EBI, has developed a new method for ontology mapping using openly available, GPT-based language models.

What is ontology mapping?

Ontology mapping is when you have data in one format and you want to convert it to another standard format so that it can be combined with other data. For example, if one database uses “heart disease” and another uses “cardiovascular disorder,” ontology mapping would help align these terms so the databases can work together.

The new method integrates EMBL-EBI’s Ontology Lookup Service (OLS) with GPT-3.5, which evaluates the relevance of the ontology terms that OLS suggests. Rather than generating ontology identifiers from scratch, the GPT model is tasked with grading existing candidate mappings. This workflow enables the system to map about 20% more terms than existing methods while retaining the same accuracy.
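
The idea of grading candidates rather than generating identifiers can be illustrated with a short sketch. Everything here is an assumption made for the example: the candidate list is hard-coded where a lookup service would supply it, the `EX:` identifiers are placeholders rather than real ontology IDs, the threshold is arbitrary, and the grader is a simple word-overlap score standing in for the language model.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass(frozen=True)
class Candidate:
    term_id: str  # placeholder identifier, not a real ontology ID
    label: str


def token_overlap_grader(query: str, candidate: Candidate) -> float:
    """Stand-in grader: Jaccard overlap of lower-cased words, between 0 and 1.

    In the workflow described above a language model does the grading;
    this function only mimics the interface.
    """
    q = set(query.lower().split())
    c = set(candidate.label.lower().split())
    union = q | c
    return len(q & c) / len(union) if union else 0.0


def best_mapping(
    query: str,
    candidates: list[Candidate],
    grade: Callable[[str, Candidate], float],
    threshold: float = 0.5,
) -> Optional[Candidate]:
    """Return the highest-graded candidate, or None if none reaches the threshold."""
    if not candidates:
        return None
    scored = [(grade(query, c), c) for c in candidates]
    score, best = max(scored, key=lambda pair: pair[0])
    return best if score >= threshold else None


# Hard-coded candidates standing in for the results of a lookup service.
candidates = [
    Candidate("EX:0001", "heart disease"),
    Candidate("EX:0002", "liver disease"),
    Candidate("EX:0003", "heart failure"),
]
match = best_mapping("chronic heart disease", candidates, token_overlap_grader)
no_match = best_mapping("broken arm", candidates, token_overlap_grader)
```

Because the grader can only choose among terms that already exist, it cannot invent an identifier. Returning no mapping when every grade is low leaves the hard cases for a human curator.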

“Our prototype already shows immense promise,” said Tsukanov. “The integration of GPT models helps us overcome the limitations of existing systems, increasing the speed of ontology mapping. The application of LLMs in our research is not just innovative; it’s transformative. These models are helping us bridge the gap between raw, unstructured information and actionable, standardised data.”

“While LLMs like GPT-3.5 have proven to be invaluable in tasks like ontology mapping, they present an intriguing challenge,” continued Tsukanov. “These models don’t inherently know the difference between fact and fiction. Recognising this, we’ve been careful to integrate additional layers of validation and are exploring the use of open-source, stable models that can be fine-tuned specifically for our ontological needs. The goal is to have a tool that not only understands human language but aligns that understanding with the precise, standardised terms in our ontologies.”

The project is still in development, but Tsukanov plans to test the stability of other LLMs to further improve the new system. The ultimate goal is to create a universally applicable library that serves as a foundation for ontology mapping across different EMBL-EBI initiatives.

Large language models: a catalyst for change

The advent of LLMs such as GPT represents a pivotal moment not only in the field of artificial intelligence but also in how we handle, curate, and understand enormous volumes of data. The success stories above show that, while not without their challenges, LLMs hold immense promise for making our data-rich world more understandable, accessible, and usable.

There are obstacles to overcome. One of the foremost concerns is data integrity and trustworthiness: because LLMs are trained on massive datasets, there is a risk of perpetuating inaccuracies or biases present in the source data. This is particularly critical in scientific applications, where incorrect or biased information could have far-reaching implications. Additionally, the automated nature of LLMs could lead to unintended consequences, such as the omission of nuanced insights that human experts might catch, thereby affecting the quality and reliability of curated data.

Given these complexities, it’s crucial to integrate ethical considerations into the design, implementation, and ongoing management of LLMs in scientific data curation. To address these challenges, our researchers have implemented multi-layered verification frameworks for data curated by LLMs. Regular updates to the LLMs themselves, coupled with continuous feedback loops with human curators, allow for ongoing refinement of the models, reducing the likelihood of errors over time.

As LLMs become an increasingly integral part of scientific research, vigilance in maintaining data quality remains a top priority. As the technology matures and as we get better at integrating human expertise with machine capabilities, these challenges are likely to diminish. Ultimately, LLMs have the potential to act as powerful catalysts in the evolution of data curation and scientific research, propelling us into an era where data can not only inform but also enhance our pursuit of understanding.

Tags: bioinformatics, embl-ebi, genomics, open data

EMBLetc.

Online Magazine of the European Molecular Biology Laboratory
