Quantitative monitoring of nucleotide sequence data from genetic resources in context of their citation in the scientific literature
GigaScience 29 December 2021
Biodiversity loss is one of the biggest catastrophes facing humanity, and time to address it is running out. But to conserve biodiversity, we must first understand it, and this relies on open science and open data.
By Guy Cochrane, Team Leader for Data Coordination and Archiving at EMBL-EBI and Head of the European Nucleotide Archive
The UN Biodiversity Conference and the 15th Conference of the Parties to the Convention on Biological Diversity (COP15), set to take place in 2022, will see the adoption of the post-2020 global biodiversity framework – a roadmap to conserve and restore biodiversity during the next decade.
One key point in the negotiations will be access and benefit-sharing for genomic sequencing data, also referred to as digital sequence information (DSI). In other words, how can we make sure that the countries that are rich in biodiversity can benefit from the research and discoveries that their biodiversity enables?
With megaprojects such as the Darwin Tree of Life, African BioGenome Project, and Earth BioGenome Project sequencing hundreds of new species every day, the question of how the data is subsequently made available to the scientific community is more pertinent than ever.
DSI is crucial for the life sciences, particularly in our explorations of life’s mechanisms, and in applications such as drug discovery, new product development, and food security. Currently, a vast amount of DSI is openly accessible, meaning it’s freely and easily available for anyone to access and analyse, using public data resources, like the ones managed by EMBL’s European Bioinformatics Institute (EMBL-EBI).
The open nature of these huge data libraries of biological information collected from all over the world speeds up scientific discovery. However, in discussions regarding who benefits from biodiversity data, there can sometimes be an assumption that countries that are rich in biodiversity but less well-off economically are mainly data producers, while higher-income and less biodiverse countries are mainly data consumers. This belief creates a dichotomy between provider and user, and assumes that the flow of information is unidirectional. This assumption could be damaging for the free flow of data going forward.
But the situation may not be as straightforward as that. To explore patterns of application of genomics and usage of the resulting data, we analysed the flow of DSI from a global perspective in a study recently published in the journal GigaScience.
Alongside colleagues from IPK Leibniz Institute and the Leibniz Institute DSMZ-German Collection of Microorganisms and Cell Cultures, we extracted and linked data records from the European Nucleotide Archive (ENA) – one of the world’s leading open databases for nucleotide sequence data – to citations in open access scientific publications aggregated at Europe PubMed Central (Europe PMC).
The aim was to create a tool that offers insight into how many datasets exist in ENA that represent organisms from any given country, and how much data from other countries each country uses.
The analysis linked eight million ENA data records with scientific publications that referenced them. The method has limitations: it covers only a sample of the records in ENA and Europe PMC, and these records go beyond biodiversity data. However, the analysis still offers some interesting insights.
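As a rough illustration of the linking idea, and not the study's actual pipeline, the sketch below joins a handful of made-up sequence records to made-up citations by accession number and tallies the resulting country-to-country flows. The accessions, the country values, and the assumption that a citing publication can be assigned a single "user" country are all hypothetical.

```python
from collections import Counter

# Hypothetical ENA-style records: accession -> country of origin of the sample.
# None marks a record with no usable country metadata.
records = {
    "AB000001": "Malaysia",
    "AB000002": "Brazil",
    "AB000003": None,
}

# Hypothetical citations: (accession cited, country assigned to the citing paper).
citations = [
    ("AB000001", "Germany"),
    ("AB000001", "Malaysia"),
    ("AB000002", "Malaysia"),
    ("AB000003", "Kenya"),   # record exists but has no country of origin
    ("ZZ999999", "Kenya"),   # accession not in our sample of records
]


def link_flows(records, citations):
    """Count (origin country, user country) pairs for citations that can be linked."""
    flows = Counter()
    for accession, user_country in citations:
        origin = records.get(accession)
        if origin is None:
            # Unknown accession or missing country metadata: cannot be traced.
            continue
        flows[(origin, user_country)] += 1
    return flows


flows = link_flows(records, citations)
for (origin, user), n in sorted(flows.items()):
    print(f"{origin} -> {user}: {n}")
```

Citations that point at records without a country of origin drop out of the tally, which is one reason incomplete metadata limits what an analysis like this can show.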
Following the open data philosophy, the analysis and dataset are available to explore in the WiLDSI Data Portal. Anyone interested can take a deep dive and explore questions such as “which countries use genomic data?”, “which countries (or groups of countries) share genomic data?” and more. We hope that other researchers will explore the dataset and identify further applications for it.
The key finding of the analysis is that the flow of DSI available in open databases is much more complex than previously thought. Genomic data sharing is not a one-way process, from one set of countries to another, but a complex web of usage.
The analysis also showed that countries across the economic spectrum can be heavy producers and/or heavy users of genomic data. To take one example, in the case of the biodiverse country Malaysia, there seems to be a good balance between production and consumption of DSI. The analysis found that Malaysia uses data from 68 countries, and that data produced in Malaysia are used in 59 countries across the economic spectrum.
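Figures of this kind can be read as counts of distinct partner countries. The sketch below shows one way such counts could be derived from a list of (origin country, user country) pairs. The pairs are invented toy data, and leaving out domestic use (a country citing its own data) is an assumption made here for illustration, not a statement about how the study counted.

```python
from collections import defaultdict


def partner_counts(flows):
    """Return {country: (n countries it uses data from, n countries using its data)}."""
    uses_from = defaultdict(set)  # user country -> set of origin countries
    used_by = defaultdict(set)    # origin country -> set of user countries
    for origin, user in flows:
        if origin == user:
            continue  # assumption: domestic use is not counted as a partner
        uses_from[user].add(origin)
        used_by[origin].add(user)
    countries = set(uses_from) | set(used_by)
    return {
        c: (len(uses_from.get(c, set())), len(used_by.get(c, set())))
        for c in countries
    }


# Invented toy flows: (origin country, user country).
flows = [
    ("Malaysia", "Germany"),
    ("Brazil", "Malaysia"),
    ("Malaysia", "Malaysia"),
    ("Kenya", "Malaysia"),
    ("Malaysia", "Kenya"),
]

counts = partner_counts(flows)
for country in sorted(counts):
    uses, used = counts[country]
    print(f"{country}: uses data from {uses}, data used by {used}")
```

A country whose two numbers are of similar size, as reported for Malaysia, is both a provider and a user in this picture.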
While there are certainly some countries where data usage is higher than production, in general, the picture is well balanced. This suggests that the open model for DSI sharing is heavily used and beneficial for the global research community. Easy access to data from all over the world benefits scientists in all countries, regardless of economic status.
A second important finding is that better metadata is required in open databases. Many of the datasets available in the ENA can’t be traced geographically or temporally, which somewhat limits their usefulness for future research.
In biodiversity research, being able to trace a sample to its country of origin could be crucial. Similarly, in the case of a global pandemic, spatio-temporal metadata is essential for understanding the pathogen, its evolution, and spread.
To address this need for better metadata, the International Nucleotide Sequence Database Consortium (INSDC), which includes the ENA, has recently announced that from 2022, it will make spatio-temporal metadata mandatory for new submissions. The change aims to enrich the scientific value of the data, especially for scientists working in the areas of infectious disease, biodiversity, and ecology.
We hope that the WiLDSI analysis will serve as a useful tool in discussions about the future of genomic data sharing, and we would like to encourage interested parties to explore the WiLDSI Data Portal. A second study published in GigaScience delves deeper into the wider policy implications of the findings.
Mechanisms for genomic data sharing must be equitable for producers and users alike, regardless of where in the world they are located. The fact that researchers from low- and middle-income countries are asking for additional safeguards, such as the option to keep data private for a period of time, highlights the shortcomings of the current economic and academic systems. These shortcomings absolutely need to be addressed if we, as a species, want to reap the benefits of research and development on a global scale. But in a world where data is essential for advancing discovery, any barriers to data access are likely to slow down scientific progress on a global scale. If scientific data is not FAIR (Findable, Accessible, Interoperable, and Reusable), then our best hopes of making discoveries that solve global challenges like the loss of biodiversity are just castles in the sky.