Repurposing bioinformatics tools to tackle the pandemic
Bioinformaticians at EMBL-EBI and beyond are adapting computational tools to investigate coronavirus genomes and proteins
In March 2020, the World Health Organization declared COVID-19 a pandemic. Since then, COVID-19 has become overwhelmingly present in our lives, and researchers are working tirelessly to understand the virus. Within the field of bioinformatics, many have been expanding, adapting, and developing data analysis tools to study coronaviruses.
Viruses: so good at being bad
Swine flu, Ebola, HIV/AIDS, SARS, MERS, and now COVID-19: all these epidemics from modern history have one thing in common – they are caused by viruses. What makes viruses so good at threatening human health?
Viruses are a prolific and diverse group of minute, highly efficient infecting machines: where life thrives, viruses follow. From humans and other mammals to birds, fish, insects, plants, and even bacteria: no organism is off the hook when it comes to viral infections. Viruses come in various shapes, structures, and sizes, honed by natural selection to hijack the cells of their hosts to produce more viral copies.
The basic composition of a virus is quite simple: some genetic material in the form of DNA or RNA encased in a protective shell of proteins, sometimes coated with lipids. Since viruses rely on their host’s cells for fundamental processes of life such as reproduction, many scientists do not consider them to be alive. Viruses carry their own set of genes that contain instructions for making viral proteins, and they hijack the cell machinery to ensure the production and dissemination of new viral particles. Once a virus has entered the system, it can be very difficult to get rid of it.
The special case of coronaviruses
Coronaviruses are a group of relatively large RNA viruses that infect birds and mammals, including humans, causing various symptoms and diseases. In humans, coronavirus infections generally have no measurable effect or cause only mild symptoms, such as those of a common cold. As most coronaviruses are commonly found in non-human animals, scientists have historically known little about the way they could impact human health. However, the 21st century has seen the rise of three coronavirus epidemics, propelling biomedical research in a new direction.
Biomolecular research produces tremendous amounts of data that cannot reasonably be processed using manual methods. That’s when bioinformatics – a combination of biology, computer science, and mathematics – comes to the rescue with an abundance of powerful analytical tools.
Empowering coronavirus research with bioinformatics
The start of the current pandemic saw a rapid increase in SARS-CoV-2-related research. Scientists around the world are scrutinising the virus’s genome and its evolution, tracking the pandemic and searching for the virus’s weaknesses.
For such research, it’s crucial to make genomic, molecular, medical, and epidemiological data easy to access, analyse, and visualise for scientists and healthcare professionals. In light of the sheer amount of data that researchers produce, many fundamental questions in coronavirus research can only be tackled with the help of bioinformatics.
Bioinformatics tools offer the power, speed, and accuracy to boost discovery and inform efforts to produce effective treatments and vaccines. EMBL’s European Bioinformatics Institute (EMBL-EBI) is a leading centre for biomolecular data, and a major contributor to such efforts. With over 40 data resources and many more data analysis tools, EMBL-EBI stores, shares, and analyses data produced by life scientists around the world.
“EMBL-EBI champions open access bioinformatics resources, so we already had a suite of powerful data analysis tools at our disposal before the start of the pandemic,” says Amonida Zadissa, Senior Scientific Services Officer at EMBL-EBI. “The speed at which the virus has spread has pushed us to adapt our tools to SARS-CoV-2. I can’t imagine addressing this pandemic without bioinformatics and open access resources.”
Adapting existing tools
A team of researchers at EMBL-EBI, the European Virus Bioinformatics Center at Friedrich Schiller University Jena in Germany, and collaborators reviewed the bioinformatics tools available to date that can accelerate coronavirus research. Their work, published in Briefings in Bioinformatics, shows how careful adaptation of existing tools can provide answers to a wide range of biological, medical, and epidemiological questions about SARS-CoV-2 and COVID-19. As described in the study, Rob Finn, Microbiome Informatics Team Leader at EMBL-EBI, worked with his team to adapt one of their flagship online resources, MGnify, ordinarily used to piece together bacterial genomes from a broad range of environmental samples.
“Normally, our tools are designed to analyse DNA sequences. However, we can use the same sort of approaches to look at RNA and piece short RNA fragments together to work out if coronaviruses are present in a sample,” says Finn. “This project is very much driven by scientific curiosity, even though it might prove useful in the future. We want to know what coronaviruses are out there.”
“The idea was also to trace the evolutionary history of coronaviruses across hosts, and see how closely related the viruses we detect are to SARS-CoV-2,” says Alexandre Almeida, Postdoctoral Fellow in the Microbiome Informatics Team. “Close relatives of coronaviruses that can infect humans are more likely to follow that evolutionary path and switch hosts from animals to humans.”
Almeida led the project and applied the VIRify pipeline – a new virus identification tool developed within MGnify – for the detection of coronaviruses in clinical and environmental samples. As a proof of concept, the tool was able to recover a complete SARS-CoV-2 genome from a human lung sample collected in Wuhan, China. Other scientists have developed a similar tool and detected thousands of coronavirus genomes in a variety of samples, discovering new strains. They have even generalised their method, paving the way for virus discovery across virus families.
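To give a flavour of what sequence-based detection involves, here is a minimal Python sketch. It is not the VIRify pipeline or the method used by any of the teams mentioned above; it simply flags sequencing reads that share many short subsequences (k-mers) with a reference sequence. The toy sequences, the function names, the k-mer length, and the threshold are all illustrative assumptions.

```python
def kmers(seq, k):
    """Return the set of all length-k substrings of seq."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def matching_fraction(read, reference_kmers, k):
    """Fraction of the read's distinct k-mers that also occur in the reference."""
    read_kmers = kmers(read, k)
    if not read_kmers:  # read shorter than k: nothing to compare
        return 0.0
    return len(read_kmers & reference_kmers) / len(read_kmers)


def flag_reads(reads, reference, k=8, threshold=0.5):
    """Keep reads whose k-mer overlap with the reference meets the threshold."""
    reference_kmers = kmers(reference, k)
    return [
        read for read in reads
        if matching_fraction(read, reference_kmers, k) >= threshold
    ]


# Made-up toy sequences for illustration only, not real viral data.
REFERENCE = "AUGGCUAGCUUACGGAUCCGAUUAGCGCUAAGGCUUACG"
READS = [
    "GCUUACGGAUCCGAUUAG",  # taken from the middle of REFERENCE
    "CCCCCCCCCCCCCCCCCC",  # unrelated sequence
]

if __name__ == "__main__":
    for read in flag_reads(READS, REFERENCE):
        print("possible match:", read)
```

Real pipelines have to cope with sequencing errors, divergent strains, and vast data volumes, so they rely on far more sophisticated methods than exact k-mer matching.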
Bioinformatic analysis of SARS-CoV-2 data
Once a coronavirus is detected in a sample, its genome and proteins can be sequenced and analysed. Knowing the viral diversity and the molecular characteristics of each strain will help refine treatments and contribute to vaccine development efforts. Several bioinformatics resources and tools hosted at EMBL-EBI and collaborating institutes now have entire sections dedicated to coronaviruses, allowing users to further investigate their coronavirus sequencing data.
For example, Ensembl has launched a SARS-CoV-2 genome browser, where users can download the SARS-CoV-2 reference genome. Rfam, the database of RNA families, has a COVID-19 resource page that allows users to annotate coronavirus RNA and predict secondary structures. UniProt, the comprehensive resource for protein sequence and annotation data, has also launched a COVID-19 portal, which provides the latest information on viral and human proteins relevant to the disease.
Open access transforms coronavirus research
When COVID-19 research efforts started ramping up, it became obvious that scientists, healthcare professionals, and public health advisors needed an open access centralised resource where data relating to SARS-CoV-2 and COVID-19 could be accessed. For this reason, EMBL-EBI set up the COVID-19 Data Portal, a single place where researchers can upload, share, and access data.
“Bringing together all the research data we hold at EMBL-EBI is a good start, but it wasn’t enough,” says Zadissa. “We coordinate with other biomolecular centres that hold relevant data and, more importantly, we try to make the data useful downstream, for this pandemic and potentially for future diseases. Although COVID-19 is at the centre of attention right now, we need to be prepared for future epidemics.”
An international team of scientists, including Pedro Beltrao’s group at EMBL-EBI, has looked for common weaknesses among the three coronaviruses that threaten human health. They compared SARS-CoV-1, MERS-CoV, and SARS-CoV-2 at the molecular level, including how their proteins interact with human proteins and where they localise in their host cells. In this study, published in the journal Science, they identified potential drug targets common to all three viruses, and a selection of drugs that could be repurposed for COVID-19 treatment.
“Our work involves collaborations with institutes worldwide. It would never be possible without freely accessible resources like the ones hosted at EMBL-EBI,” says Beltrao. “We will upload our data to the COVID-19 Data Portal and hope that it can be used in this pandemic and future outbreaks too.”
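The idea of looking for targets shared by all three viruses can be illustrated with a simple set intersection. The Python sketch below is a simplification under stated assumptions, not the analysis performed in the Science study: the host protein names are hypothetical placeholders, and each virus is reduced to a plain set of human proteins that its proteins interact with.

```python
# Hypothetical placeholder data: virus -> human proteins it interacts with.
# These names are invented for illustration and do not come from the study.
INTERACTIONS = {
    "SARS-CoV-1": {"HOST_A", "HOST_B", "HOST_C"},
    "MERS-CoV": {"HOST_B", "HOST_C", "HOST_D"},
    "SARS-CoV-2": {"HOST_B", "HOST_C", "HOST_E"},
}


def shared_targets(interactions):
    """Return the host proteins that appear in every virus's interaction set."""
    sets = list(interactions.values())
    if not sets:  # no viruses supplied: nothing can be shared
        return set()
    return set.intersection(*sets)


if __name__ == "__main__":
    print(sorted(shared_targets(INTERACTIONS)))  # prints ['HOST_B', 'HOST_C']
```

A host protein that shows up for every virus is a candidate pan-viral target, although in practice such candidates would still need experimental validation.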
In its first six months, the COVID-19 Data Portal saw nearly 3 million web requests and thousands of data submissions. Over 300 institutions from 30 countries have deposited data and the portal now offers open access to over 180,000 scientific publication records relating to the COVID-19 outbreak.
This rapid accumulation of data is a testament to what science can achieve in very little time, but more importantly acts as a model for how to share infectious disease data in the future. Collaboration between countries and disciplines, and reusing and adapting data infrastructure hold the key to helping us understand, monitor, and stop other infectious diseases in the future.
This post was originally published on EMBL-EBI News
HUFSKY, F., et al. (2020). Computational strategies to combat COVID-19: Useful tools to accelerate SARS-CoV-2 and Coronavirus research. Briefings in Bioinformatics. Published online 4 November 2020; DOI: 10.1093/bib/bbaa232
EDGAR, R.C., et al. (2020). Petabase-scale sequence alignment catalyses viral discovery. bioRxiv. Published online 10 August 2020; DOI: 10.1101/2020.08.07.241729
GORDON, D.E., et al. (2020). Comparative Host-Coronavirus Protein Interaction Networks Reveal Pan-Viral Disease Mechanisms. Science. Published online 15 October 2020; DOI: 10.1126/science.abe9403