The EOSC-Life project brought together European research infrastructures that make data Findable, Accessible, Interoperable and Reusable (FAIR). The project laid the foundation for creating an open, digital, and collaborative space for biological and medical research.

As EOSC-Life comes to an end, work package lead Helen Parkinson, who is also a Team Leader at EMBL-EBI, shares her take-aways for making informatics projects more sustainable. 

What is sustainability when it comes to informatics projects?

Organisations, labs, and individuals start lots of projects, but not everything they build can or should be sustained long-term. Here, sustainability refers to understanding a project’s life cycle from the very beginning and monitoring it throughout. This applies to the life sciences and to informatics projects more widely. 

It’s easy to start things; it’s much harder to sustain them. This is a common problem in data, tool, and service delivery. As part of EOSC-Life, research infrastructures recently pooled their expertise in a paper that describes a set of core sustainability principles which anyone can use. 

How does EMBL-EBI approach sustainability?

EMBL-EBI provides the world’s largest suite of data resources and tools for the life sciences, and each one of these has a life cycle that we monitor closely. Everything we develop has to be maintained, so we have to consider code updates, security patches, integration of new data types to support the user community, and a lot more. 

We don’t retire things very often, and when we do, it’s usually because the tech used to generate the data has been superseded, or the data type is no longer needed, or there are better ways to deliver it. Sometimes the architecture itself just gets old. For example, code tends to have an eight- to ten-year life cycle, and this is getting shorter. But rewriting code takes time, so you have to allocate resources to it.

Retiring a data resource – ArrayExpress

After two decades in use, EMBL-EBI retired the ArrayExpress interface, which enabled researchers to access sequencing and microarray data from functional genomics studies. ArrayExpress had already undergone significant refactoring to support sequencing-based expression studies which are gradually replacing microarray experiments. The BioStudies data resource was created to serve as a more general infrastructure for multi-omics data.

To ensure continuity, the team migrated the data into the BioStudies data resource, creating a dedicated ArrayExpress collection. This meant users could continue to access the data in question. This way, the technical underpinnings of the ArrayExpress database could be retired, saving resources.

“The main challenge in this project was ensuring uninterrupted access for our users, both data submitters and consumers,” explained Ugis Sarkans, Technical Team Leader at EMBL-EBI. “For a period of time before the retirement, both the ‘old’ ArrayExpress and the ArrayExpress collection in BioStudies were run in parallel, enabling users to provide feedback and adjust their data access patterns.”

When is the right time to think about sustainability?

You have to think about sustainability at the very start of the project. Consider whether the project is going to end when the funding ends or if you need a sustainability strategy. You can’t easily retrofit a strategy, so you need to plan and budget for it. As a project progresses, you should always have sustainability at the back of your mind, because things can change, and you might need to adapt. 

It’s hard to predict the future, and right now we’re projecting into a fast-moving AI space. One of the recommendations in the paper is to be prepared, be agile, and act in a timely way. Always look at your landscape and users. Horizon scanning is absolutely essential. And you can’t only do it when your review cycles come around; it’s something you have to be doing all the time. 

What are some of the major obstacles to making projects sustainable?

One of the major issues is what happens to a resource or a tool when the funding runs out. Capacity is another obstacle, especially as there is a shortage of people who have deep knowledge of their specialist scientific areas and the mindset to deliver data over the long term. One of the paper’s recommendations is to plan for acute and future training needs. 

Another blocker is data harmonisation. A good example is human cohort data. Each cohort uses a coding system which might come from the local health service or could be bespoke. It’s easy to re-analyse a single dataset, but the real insights come when you can run analyses on data from many systems and countries. For example, if you want to know which genetic variants are associated with a particular disease, the more data you have, the better. You also have to distil the data into knowledge: make it accessible, add quality indicators, update it to the latest genome assembly – and this is where harmonisation and integration come in.  

My team at EMBL-EBI provides toolkits that enable access to technologies, allowing researchers to harmonise and map datasets. This is some of the expertise we brought to the EOSC-Life project. 
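
To make the harmonisation problem concrete, here is a minimal Python sketch of mapping cohort-specific codes onto shared terms before pooling the data. It is an illustration only, not the EMBL-EBI toolkits mentioned above: the cohort names, local codes, shared terms, and function names are all hypothetical, and real mappings would typically target a curated ontology rather than a hand-written table.

```python
from collections import Counter

# Hypothetical mapping tables: each cohort's local code -> a shared term.
# All codes and labels here are invented for illustration.
COHORT_MAPPINGS = {
    "cohort_a": {"D01": "type 2 diabetes", "H10": "hypertension"},
    "cohort_b": {"t2d": "type 2 diabetes", "htn": "hypertension"},
}


def harmonise(cohort, records):
    """Translate one cohort's (participant_id, local_code) records.

    Returns (mapped, unmapped). Codes with no known mapping are kept
    aside for curation instead of being silently dropped.
    """
    mapping = COHORT_MAPPINGS[cohort]
    mapped, unmapped = [], []
    for participant_id, local_code in records:
        term = mapping.get(local_code)
        if term is None:
            unmapped.append((participant_id, local_code))
        else:
            mapped.append((participant_id, term))
    return mapped, unmapped


def pool(datasets):
    """Count shared terms across all cohorts and collect unmapped codes."""
    counts = Counter()
    needs_curation = []
    for cohort, records in datasets.items():
        mapped, unmapped = harmonise(cohort, records)
        counts.update(term for _, term in mapped)
        needs_curation.extend((cohort, pid, code) for pid, code in unmapped)
    return counts, needs_curation


# Toy input: two cohorts that record the same conditions under different codes.
datasets = {
    "cohort_a": [("p1", "D01"), ("p2", "H10"), ("p3", "X99")],
    "cohort_b": [("p1", "t2d"), ("p2", "t2d")],
}
counts, needs_curation = pool(datasets)
print(dict(counts))
print(needs_curation)
```

Once the codes are mapped, records from both cohorts can be counted together, while the unrecognised code is flagged for a curator to review rather than lost.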

Sustainability through training

Training materials such as webinars can help others use and reuse your data, tools, and resources. Below are a few things to consider when thinking about training:

  • When creating training materials, it’s important to make them FAIR (Findable, Accessible, Interoperable and Reusable). 
  • Consider a combination of live and on-demand training, enabling users to learn at their own pace. 
  • Remember to keep training materials up to date and retire them when they are no longer relevant. 
  • Training activities can be a sustainer and driver of the communities involved in informatics projects – they bring together experts and novices, reinforce expertise, and develop capacity. 
The EMBL-EBI Training team supported the EOSC-Life project by organising and facilitating training activities. The EOSC-Life training activities are available on the project website.

How do you know when to end a project?

You can’t expect your resources to remain stable and unchanged forever. You have to update and innovate, but you also have to know when to stop doing something. Sometimes, the reason is a paradigm shift in data generation technology. Other times, it’s a drop in usage.

I always find it fascinating to check the citations for our data resources, including the GWAS Catalog and the PGS Catalog, and see what they are being used for. If people are still using your product to innovate, that’s a sign you should continue to sustain it.
