How open data is changing our pursuit of discovery
In the 16th Century, a cabinet of curiosities (in German: Wunderkammer) was a popular way to show off one’s private collection of extraordinary objects. Animal specimens, skeletons, minerals, unusual handmade objects and intriguing antiquities from the New World could all be revealed with a flourish, and rouse in visitors a keen sense of curiosity in this new Age of Wonder.
Over time, cabinets of curiosities morphed into modern museums. Both feed two profoundly human tendencies: curiosity, and the desire to collect and preserve knowledge. These same tendencies are driving a sea change in science: disruptive technology, a tsunami of data and the democratisation of access. Now, curious visitors to the European Bioinformatics Institute (EMBL-EBI) data resources can gaze into a new kind of menagerie. They can explore the wonders of the molecular world, carefully tended in public databases that anyone can access.
A collector’s view
One of the Enlightenment’s many obsessions was crafting a meticulous directory of the living world. That enthusiasm was revived with the discovery of the genetic code as the common language that is helping us unlock some of the mysteries shared by all life. Instead of cataloguing only the visible world, scientists can now also sequence DNA from millions of species and enter the information into databases. The result is a “living” catalogue, open to everyone, that can help us make sense of our world.
So how has the database changed the way we collect and make sense of data? “The first obvious difference between a cabinet of curiosities and a database is the content,” explains Jee-Hyub Kim, Data Miner and Alumnus of EMBL-EBI. “On the one hand, a collection of physical objects makes you feel something straight away. Just imagine what it must have felt like for someone who may have never even seen the ocean, to see and touch a starfish or coral – these objects must have looked so alien!
“It’s difficult to create this sort of rapport with something as intangible as data. That’s why you need a good interface and visualisation tools – to allow the user to explore and interact with a dataset or a digital object.”
Visualisation is a very powerful thing: being able to “see” connections inspires people to keep exploring. The people who build EMBL-EBI data resources understand this, and strive to build interfaces and visualisations that keep researchers engaged.
One example is the Protein Data Bank in Europe (PDBe), a resource for collecting, organising and disseminating data on macromolecular structures, such as proteins. Apart from being a central repository for scientists studying proteins, PDBe allows users to see and interact with digital, 3D models of proteins. A recent update to PDBe takes these gigantic files and reduces their size by a factor of up to 500. This means that if you want to see how a molecule interacts with a protein, you can bring up a visualisation on any Internet-connected device, including phones and tablets. This allows many more people throughout the world to explore PDBe and use it as a learning tool.
With virtual reality (VR) becoming widely available to consumers, one could easily imagine a biology lesson in which students use VR headsets to explore life at every scale: from molecular to cell, to organ, to system, to organism.
A multidimensional cabinet
“Traditional cabinets of curiosities organised items by type, so in a sense they were like an ontology of shapes, because they classified artefacts according to what they looked like,” says Chuck Cook, Scientific Services Manager at EMBL-EBI. “You could draw a parallel with the modern database, which organises biological data resources in a similar way – into categories. In the database, information and categories are interlinked, so in a way the database is like a ‘smart’ or multidimensional cabinet of curiosities.”
Even as technology advances at breakneck pace, science is more accessible than ever. The sublime combination of high-tech visualisation, collaborative software programming and open data is truly democratising biology.
“With a traditional cabinet of curiosities, the collector was the ultimate authority,” adds Andy Yates, Team Leader of Ensembl Genomics Technology Infrastructure at EMBL-EBI. “EMBL-EBI keeps its ‘collections’, or services, open to researchers everywhere. In doing so, we’re making the contents – and ourselves – open to reanalysis and review. It’s a necessary move if we want our resources to be truly useful.
“We work within the scientific community, and that means we are open to critique – at a speed that would have been unthinkable even 15 years ago. Previously, we would have probably published the latest version of the Ensembl genome browser on a CD, sent it off in the post and that was it. There was no instant feedback, no self-regulation, nothing. It’s only in the past few years that this kind of openness has become workable, due to improved communication channels, and now it is actually expected by our users.”
Opening up the cabinet
“Data accessibility is crucial for anybody doing science, which is a massive change,” continues Yates. “Cabinets of curiosities were private collections with limited accessibility. Some owners opened their doors to the public, but it was still only a small number of artefacts – the most peculiar ones – that were on display. Most things were indexed and locked away.”
EMBL-EBI hosts many tens of petabytes of data. A big part of the work that goes on there relates to making datasets easy to find. Without indexing, there is no way of knowing what is in a database, or how it got there. Indexing is as central to public data resources today as it was to early collections.
Data curation and annotation activities are intense at EMBL-EBI. Once you have generated a sequence, you can identify a specific gene. Then, you have to search that gene against a huge amount of pre-existing data. The curation process includes labelling and describing datasets consistently. This allows any researcher to discover and make use of the data for their own experiments. It also helps research communities build knowledge and make connections between different studies and disciplines. Without descriptions – also called metadata – samples and sequences are cast adrift in a sea of data. “Without metadata, exploring a database is like wandering through the basement of the Louvre blindfolded, hoping you’ll find the Mona Lisa,” says Yates.
To be useful, research datasets must be put into context and linked to the paper that describes them. To make these hard-earned datasets reusable by other scientists, text miners and data curators at EMBL-EBI carefully check data submissions. This ensures they meet the necessary requirements.
EMBL-EBI works hard to develop tools according to FAIR Guidelines. The FAIR data movement aims to make research data Findable, Accessible, Interoperable and Re-usable. It has gained momentum in the life sciences. Hundreds of thousands of scientists throughout the world are generating diverse datasets of all sizes. FAIR data articulates the central needs of data-driven research, as access to accurate information is the basis of good hypotheses.
So how is data availability changing the way we answer scientific questions? According to Chuck Cook, “people are going to become more dependent on big data, and scientists who can’t use big data will be left behind professionally. As we become more specialised, running isolated experiments is becoming more difficult. To delve deeper into research, we will need to collaborate with people from lots of different backgrounds. And to do that we need a common language – that’s something we are actively working on.”
“Biologists have to turn into programmers, to a certain extent,” agrees Yates. “That’s how the scientific questions are changing. The researcher will come up with a hypothesis and then prove or disprove it through data mining of large data resources. That requires some degree of programmatic knowledge. The questions may be similar, but they can be much more complex. We will still repeat, and repeat, and repeat our questions and analysis, gently refining the answers we get.”
From discovery to application
“The time it takes to go from scientific discovery to application is becoming much shorter,” adds Rob Finn, Team Leader for Sequence Families and EMBL-EBI’s Metagenomics resource. “This is partly because the data is connected, so you get the whole biological context rather than just looking at one thing in isolation. That means you’re better informed to design your next experiment.”
Finn is no stranger to exploration. He is involved with data from the Tara Oceans expedition, which sailed a research schooner more than 300,000 kilometres. Scientists on the voyage systematically collected samples of plankton from all the world’s oceans. They then shipped them back to land for DNA sequencing and analysis.
So far, this modern version of the HMS Challenger has led to the discovery of over 40 million new genes, many, possibly most, of which belong to unknown organisms. It may be many years before we fully understand all these new sequences.
The planktonic worlds revealed by the Tara expedition are a precious treasure in EMBL-EBI’s modern cabinet of curiosities. As scientists begin to analyse these datasets on a massive scale, they are revealing profound, new insights. The work helps us understand the invisible ecosystems that support the global food chain.
“Sequencing the samples from Tara lets us ‘see’ some of the diversity of life in the oceans,” continues Finn. “The first set of 40 million genes identified in Tara Oceans samples are mainly prokaryotes – bacterial species we haven’t seen before. But in the second wave of data, we have identified over 117 million eukaryote genes so far – and there is still a long way to go. There’s a huge amount of genetic data to study out there. What do all these genes do, what species do they belong to? How does it all fit into the bigger picture? Those are the really intriguing questions we’ll be exploring for years to come.”
Mapping the life sciences
In light of this ever-growing influx of data, what are the big challenges facing biology in the coming years? “The big change I see in biology is that molecular scientists now have the capacity to look genome-wide and species-wide,” says Janet Thornton, Director Emeritus of EMBL-EBI and Senior Scientist. “Before open data, a scientist worked on one protein, gene or experimental system, possibly for their entire career. Seeing the bigger picture was practically impossible. Today, we can make genome-wide and species-wide observations.
“This shift also poses the biggest challenge, which is that, despite the unity of biology (in that all living systems are coded by the genetic code), truly important discoveries in biology still lie within the nitty-gritty details. In genomics, we have seen the impact of technological development to drive innovation. Certainly, recent developments in imaging for cell biology will allow researchers to develop high-throughput experiments that change the questions we can ask.
The nitty-gritty details
“High-throughput experiments open all sorts of doors. Because of them, you don’t necessarily look at each piece of data in the same way because it’s part of this mammoth ‘whole’ that is ‘evolved life on earth’. This means you skip over the details. My worry is that, despite this potential ‘pot of gold at the end of the rainbow’, we will still need to look closely at these gruesome details to understand many fundamental questions, such as why do organisms age?
“Biology is still in the discovery phase, and slowly moving into the theoretical explanation phase. As always, our science will follow the ‘Map, Quantify and Model’ roadmap. It’s like before the world was mapped – we are only just properly mapping biology now. Initiatives like the Human Cell Atlas are very good examples of all the missing details we still need to understand before we begin to explain how things work. The next step will be to translate this knowledge into everyday areas, such as medicine, agriculture and biodiversity.”
Much like the collectors who set up the first cabinets of curiosities, scientists are still meticulously cataloguing everything they learn about the form and function of life. But at EMBL-EBI, the work is about more than just recording and describing data. Linking it all up to facilitate further discovery is another area of intense focus.
By working with users, helping set standards and curating data, EMBL-EBI creates resources that other scientists can build on. Programmers and scientists at the institute also develop a broad range of analytic tools. These include complex machine-learning methods and computational models for testing hypotheses, or simply satisfying curiosity, and applying new knowledge to real-world questions.
EMBL-EBI’s ‘smart’ cabinet of curiosities spans all of molecular biology. From microbes to population-scale, genome-wide studies, the institute makes data from biological discoveries open and accessible to anyone with an Internet connection and a curious mind.