For a positive culture change in life science research
Keeping an accurate and complete record of the data flow during a research project is an important part of a researcher’s job.
Being consistent and thorough with documentation is the starting point for making data FAIR, and the study reproducible.
Data and code are the foundations of research findings, whereas journal publications are “advertisements”.
Policy changes from institutions, funders, journals, and supporting mechanisms are the drivers for the adoption of open science.
David, R., Rybina, A., Burel, J. M., Heriche, J. K., Audergon, P., Boiten, J. W., … & Gribbon, P. (2023). “Be sustainable”: EOSC‐Life recommendations for implementation of FAIR principles in life science data handling. The EMBO journal, 42(23), e115008.
Sarkans, U., Chiu, W., Collinson, L., Darrow, M. C., Ellenberg, J., Grunwald, D., … & Brazma, A. (2021). REMBI: Recommended Metadata for Biological Images—enabling reuse of microscopy data in biology. Nature methods, 18(12), 1418-1422.
Paul-Gilloteaux, P., Tosi, S., Hériché, J. K., Gaignard, A., Ménager, H., Marée, R., … & Colombelli, J. (2021). Bioimage analysis workflows: community resources to navigate through a complex ecosystem. F1000Research, 10, 320.
McMurry, J. A., Juty, N., Blomberg, N., Burdett, T., Conlin, T., Conte, N., … & Parkinson, H. (2017). Identifiers for the 21st century: How to design, provision, and reuse persistent identifiers to maximize utility and impact of life science data. PLoS biology, 15(6), e2001414.
Jupp, S., Malone, J., Burdett, T., Heriche, J. K., Williams, E., Ellenberg, J., … & Rustici, G. (2016). The cellular microscopy phenotype ontology. Journal of biomedical semantics, 7, 1-8.
Aleksic, J., Alexa, A., Attwood, T. K., Hong, N. C., Dahlö, M., Davey, R., … & Vieira, B. M. (2015). An open science peer review oath. F1000Research, 3, 271.
Patwardhan, A., Ashton, A., Brandt, R., Butcher, S., Carzaniga, R., Chiu, W., … & Kleywegt, G. J. (2014). A 3D cellular context for the macromolecular world. Nature structural & molecular biology, 21(10), 841-845.
Smedley, D., Schofield, P., Chen, C. K., Aidinis, V., Ainali, C., Bard, J., … & Hancock, J. M. (2010). Finding and sharing: new approaches to registries of databases and services for the biomedical sciences. Database, 2010.
00:00:10:20 – 00:00:34:20
Anandhi:
Welcome to the Knowledge Catalyst, a podcast where we see what open science in practice looks like. This is a space where we share how we can catalyze and accelerate discovery through openness, transparency and collaboration. From community experts, we learn how good scientific practices and FAIR principles can propel science forward. I’m your co-host, Anandhi Iyappan, Open Data and Metadata Standards Officer at EMBL.
00:00:34:22 – 00:01:04:18
Anandhi:
Today we will be discussing the advancements in open science in the imaging field. Imaging research generates massive datasets with immense potential, but without proper sharing, metadata and accessibility, much of this data remains underutilized. Open science aims to change this by making research more transparent, reproducible and collaborative. As interesting as this may seem, it comes with various challenges, such as: how do we balance openness with data complexity?
00:01:04:20 – 00:01:28:06
Anandhi:
How do we ensure that imaging data is truly reusable? Today, we’ll dive into the evolution of open science in imaging research: the challenges, the breakthroughs, and the path forward. I’m your host Anandhi, Open Data Science Officer at EMBL. And joining me for this conversation is someone who has been deeply involved in shaping open data science policies.
00:01:28:08 – 00:01:51:22
Anandhi:
I would like to introduce to you Jean-Karim, a researcher and strong advocate of open science, who provides computational support for the Cell Biology and Biophysics Unit at EMBL. JK, as we fondly call him, has been promoting open science at EMBL for many years and was part of the team that drafted the open data policies.
He has built databases, developed platforms for sharing biological data, and has also played a crucial role in advocating for transparency in research. Together, we’ll unpack the evolution of open science: where it started, the hurdles we faced, and where it’s headed next. Let’s get started. Welcome to the podcast, Jean-Karim.
JK:
Thank you for having me here.
Anandhi:
You’re very welcome.
00:02:14:22 – 00:02:41:00
Anandhi:
I have given you the introduction. Would you like to still add a bit more about yourself, your research, and also how your interest in research came about?
JK:
Well, my interest in research goes a very long time back. I knew from about the age of 11 that I wanted to study biology and that I wanted to find new things related to biology.
So that led me to research. But at the same time, I had always been interested in computers, and eventually I switched from the wet lab to, if you want, the more computational things. And that was around the time when we started having full genome sequences. At the time I was working with Drosophila, doing some classical genetics, and then we had RNAi and the genome.
And I realized things had changed and we needed to leverage bioinformatics tools to make use of all the sequence data. And that’s also how I came into open science in a way, because at the time, the pioneers of genome sequencing were releasing the data to the public almost as soon as it was out of the sequencing machines.
And that was, to me, something that I thought everybody should be doing, especially when I started working with microscope images. I thought, why not do it? Because there is so much potential. The argument I often heard was “someone is going to scoop me”, but it’s very unlikely that someone will address exactly the same question using your own data, especially if they don’t have all the context.
00:04:08:06 – 00:04:51:10
Anandhi:
Since you touched on your early research with Drosophila, and now on imaging, would you give us your first-hand experience of how you did research, and how you transitioned from wet lab towards dry lab? What were your earlier experiences compared to how it is today?
JK:
Well, the transition, as I said, was motivated by this realization that we had massive amounts of genomic data and that we needed to leverage it, first to basically design sequence-based reagents. Well, I hadn’t really done much with computers since I was basically a teenager, or a little bit later in school.
00:04:51:10 – 00:05:16:05
JK:
But so I started picking it up in parallel with my fly work. And at the time everybody was doing bioinformatics using Perl, so that’s what I picked up. And then, after that, I decided I should actually learn to do things a little bit more professionally, and I moved to the Sanger Institute to learn.
00:05:16:05 – 00:05:47:11
JK:
And it’s also through that move that my connection to EMBL came about, because we were collaborating on this Mitocheck project at the time, and here at EMBL people were generating massive amounts of image data: actually, movies of cells in culture.
Anandhi:
So, about the thought behind open science: I’m sure when you started your research at EMBL, open science as a concept would not have been widely discussed.
So how did that thought come into your mind?
JK:
Yeah, it was actually discussed a bit, because since I was also working on some computational biology aspects of the image data, I was advocating for making all these massive amounts of images public as soon as we had quality-controlled them. And there was, as I said, some resistance.
But at the end of the project, and it had always been part of the project to actually release as much information to the public, that’s why we built a database and a website allowing people to browse all the project’s data: not only the RNAi screen data with the images, but also the proteomics that was going on, and also some of the other aspects.
So we presented, let’s say, some sort of digested view of the whole project’s data, and then we made the original data available. Also, at the time there was no repository for images, so people basically had to just ask for the data. And until about last year, I was still getting requests for the data.
00:07:13:08 – 00:07:35:07
Anandhi:
So how was the interest among your colleagues when you were discussing this thought process of wanting to deposit the data? I mean, create a database or a repository.
JK:
So the database was part of the project, so it was actually something that would be useful to people, and it has been used in many different projects afterwards.
And we are still actually using some of that data, as recently as yesterday, for preparing a course. So the question was not so much about making things public, because that was planned. What has evolved, I think, from that time is that we would include more things: for example, something we hadn’t realized people could be interested in, the training set for our classifier.
And a couple of years ago, or maybe a little bit more, someone asked for that data, but we never kept it. We didn’t actually think at the time that somebody might want to do this.
Anandhi:
Could you expand a little bit on the training set for the classifier? Because you’re talking to an audience who might not be familiar with it.
00:08:25:01 – 00:08:58:14
JK:
So when we have massive amounts of image data, and in that case those were cells undergoing mitosis, we wanted to know which cells were undergoing mitosis and which were not. And so for that, we trained machine learning algorithms to distinguish between the different forms of cells. And for building this classifier, we needed to annotate some of the cells, to assign them manually to different categories, so that the algorithm could learn how to distinguish them.
And this training set, basically the original images with the annotated cells, was not kept, but someone, more than ten years after the end of the project, was interested in it.
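To make the idea of a training set and classifier concrete for listeners, here is a minimal sketch in Python. The features (cell area, brightness), labels and values are invented for illustration; this is not the Mitocheck code, just the general pattern of learning class centroids from manually annotated examples.

```python
# Illustrative sketch: training a tiny classifier from manually annotated cells.
# Feature names and values are invented for this example.
from statistics import mean

# Each annotated cell: hypothetical features (area, brightness) plus a label
# assigned manually by an expert. Together these form the "training set".
training_set = [
    ((210.0, 0.80), "mitotic"),
    ((190.0, 0.75), "mitotic"),
    ((400.0, 0.30), "interphase"),
    ((420.0, 0.35), "interphase"),
]

def train_nearest_centroid(samples):
    """Compute one mean feature vector (centroid) per class."""
    by_label = {}
    for features, label in samples:
        by_label.setdefault(label, []).append(features)
    return {
        label: tuple(mean(dim) for dim in zip(*vectors))
        for label, vectors in by_label.items()
    }

def classify(centroids, features):
    """Assign the class whose centroid is closest (squared Euclidean distance)."""
    def dist(centroid):
        return sum((a - b) ** 2 for a, b in zip(features, centroid))
    return min(centroids, key=lambda label: dist(centroids[label]))

centroids = train_nearest_centroid(training_set)
print(classify(centroids, (200.0, 0.78)))  # lands near the mitotic centroid
```

Real pipelines use richer features and stronger models, but the principle is the same: without the annotated examples, the classifier cannot be retrained or audited, which is why losing the training set mattered.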
Anandhi:
Maybe let’s take a little step back, and I will ask you a more generic question, because you spoke about data repositories and creating a platform for data to be open.
From your work in imaging, you see there is a necessity. But more generally, in modern science, why do you think open science as a concept is so important, especially in the field of computational biology? In imaging, like I said, it’s obvious. But in general, what would be your view?
JK: Well, I don’t think it’s so obvious actually.
00:09:46:21 – 00:10:15:23
JK:
And some people are still not convinced. I mean, at the present time, I think it’s become really important to be completely transparent about what you’ve been doing. There are many reasons. One is you want to be able to show what you’ve done, and not necessarily only if it leads to a paper. There are many contributions that don’t necessarily fit in a paper.
00:10:15:23 – 00:10:45:18
JK:
And actually, my view of publishing is that a paper is just paid advertisement for your work, but the actual important information is not in the paper. And I’ve actually been trying to reuse image data, for example, that have been made public, and it’s never been possible to fully use that data, because there is always information that’s missing, and it’s not even in the paper.
00:10:45:20 – 00:11:11:04
JK:
So the paper is not worth very much beyond advertising what you’ve done.
Anandhi:
So basically you’re saying that journals should take the same view of research, that the quality of papers submitted to them should be as good as the research that is being done, for us to have truly open science?
JK:
I think that the journals should just basically be running paid advertisements. We don’t actually need peer review, because it’s actually, most of the time, not very useful. It’s full of conflicts of interest, because the people who are most able to evaluate your work are either your collaborators or your direct competitors. So that’s basically a built-in conflict of interest. And then, in the paper, you don’t necessarily have access to all the data.
00:11:40:06 – 00:12:09:12
JK:
So as a reviewer, I myself now reject all papers if I don’t have the data and the code, as a matter of principle, because otherwise there is no point.
Anandhi:
And do you think these are also the main reasons why people are against open science as a concept, because they have issues sharing their data with journals? There is a demand from the community, and some are open to submitting data, but some of them cannot, so they see open science as a barrier to research.
JK:
I mean, well, yeah. Traditionally, people see making data public, and properly documenting their data, their data handling and the processing of the data, as a chore that someone asks them to do. And a lot of people still don’t see that as part of their job as a scientist.
00:12:35:20 – 00:13:00:16
JK:
And that’s the main problem. I think it’s a cultural shift that people haven’t really realized yet.
Anandhi:
So, as a community of open science advocates, how do you think we can instill this as part of the research culture, and not have it seen as something that needs to be done additionally? What steps do you think need to be taken to address this?
00:13:00:16 – 00:13:32:01
JK:
Well, I’ve been convinced over the years that the carrot-only approach doesn’t work, so we need the stick. And as long as the funders and all the powers that be don’t want to enforce proper open science policies, things are only going to evolve slowly, based on a limited number of advocates, basically.
Anandhi:
Another important issue that I wanted to discuss with you is reproducibility, which is something very close to this topic. I think most people in research will relate to what I’m going to ask you. It’s probably rather direct, or even obvious, but maybe from your own experience: how does open science help in addressing reproducibility?
JK:
Are we talking about reproducibility or replicability?
Anandhi:
I would say reproducibility.
JK:
So you mean this in the sense of reaching the same conclusion as the original study, and not replicating the numerical values to the decimal. Yeah. So I think it’s important, yes, because if you don’t have all the information, or at least as much as possible, you can’t necessarily always form an informed opinion. And especially on the computational aspects: if you can’t rerun the code, you don’t know all the implementation details that actually matter.
00:14:26:20 – 00:15:02:10
Anandhi:
And also for our audience, I realized that maybe we should give a little bit of an overview of what reproducibility and replicability are, so they understand what we’re talking about.
JK:
Yeah. So that’s kind of what I alluded to. Those two terms are sometimes mixed up and used interchangeably. For me, reproducibility is reaching the same conclusions as a study, but not necessarily the precise numerical details, and replicability is actually performing the study again and reaching exactly the same numerical values and conclusions.
00:15:02:10 – 00:15:27:00
JK:
That is much harder, but it’s also, I think, of less value to me, because what matters generally is the conclusion, and not necessarily the precise numerical values that you might come across. Plus, many of those studies, or some of the algorithms used, have inherent stochasticity built in, so you will never get exactly the same result.
00:15:27:00 – 00:15:51:19
JK:
But the main conclusion or conclusions of a particular study should hold no matter what you do.
Anandhi:
Do you think open science alone can address the problem of reproducibility, or do you think there are other measures that need to be taken?
JK:
It’s part of it, in the sense that I believe it will basically make things much more auditable.
00:15:51:21 – 00:16:15:24
JK:
And one of my concerns at the moment is that you can actually ask an AI to write you a full paper. I’ve generated, for example, microscopy images myself that I’ve shown to people, and they haven’t been able to distinguish them from real microscopy-generated images. So you can generate the figures, you can generate the text.
00:16:15:24 – 00:16:38:11
JK:
And so basically, at the press of a button, you can have a paper that, on its own, is indistinguishable from a real study. Now, if you can provide the data and the code and the metadata about all of that, I’m not saying you can’t fake it, but it’s actually a whole other level.
00:16:38:16 – 00:17:02:15
JK:
And at that point, it might actually be easier to do the experiments than to try to fake it with all the metadata and everything. So we can say AI is open-science friendly to some extent, in the sense that it will make us realize that if we want to believe a study, again, the paper doesn’t count.
00:17:02:19 – 00:17:29:22
JK:
We need the data.
Anandhi:
Since you mentioned, in a previous question, that a researcher should see it as part of their job to ensure that they document data properly and capture the right provenance: could you list the best practices that one needs to follow in order to share the code properly, and to share the data in a proper manner?
00:17:29:23 – 00:17:59:15
Anandhi:
What do you think would be the first, say, ten things they should do as part of best practices?
JK:
Well, it’s basically documenting things with the same attention to detail with which they would normally write up their experiments in their lab book. When you do wet lab experiments, you normally write down all the details in your lab book, and people should take the same care with computational things.
00:17:59:17 – 00:18:33:03
JK:
And that’s basically where things start becoming difficult. There are a whole bunch of things; there is no one particular rule. But document everything, and make sure that you don’t lose this documentation, basically the metadata, as you move along your research project and as you transform the data, so that you are able to track back from a data point in a figure in the paper all the way back to the samples.
00:18:33:03 – 00:18:55:15
JK:
And, let’s say, the instruments that have been used. And all this chain should be uninterrupted.
Anandhi:
Would you recommend any open source tools or software that help in capturing this?
JK:
You can do it low-tech or you can do it high-tech. I’ve been advocating the low-tech approach because I think it’s easier.
00:18:55:20 – 00:19:24:07
JK:
So it’s a matter, as I said, of writing things down. The main advice would be to always be consistent in the way you write things down: always using the same word for designating the same thing, and not being ambiguous. And then there are plenty of tricks that you can use: some structured formats, tables.
00:19:24:09 – 00:19:55:13
JK:
However, an important aspect is that things should normally be what we call machine-actionable. So not only should a program be able to read what you’ve written, your documentation, but it should be able to make some decisions based on that.
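As a small illustration of what "machine-actionable" means in practice, here is a sketch in Python where a program not only reads the documentation but decides, from it, whether a dataset is ready to process. The field names below are invented for this example and are not a community standard:

```python
# Illustrative sketch: machine-actionable metadata. A program reads the
# documentation and makes a decision based on it. Field names are invented.
import json

record = json.loads("""
{
  "sample": "HeLa",
  "channel": "GFP-tubulin",
  "pixel_size_um": 0.25,
  "acquisition_date": "2024-03-01"
}
""")

# Fields this hypothetical pipeline requires before it will touch the data.
REQUIRED = ["sample", "channel", "pixel_size_um", "acquisition_date"]

def check_metadata(meta):
    """Decide programmatically whether a dataset is complete enough to process."""
    missing = [key for key in REQUIRED if key not in meta]
    return (len(missing) == 0, missing)

ok, missing = check_metadata(record)
print("ready to process" if ok else f"incomplete, missing: {missing}")
```

The point is the decision step: free text in a lab book can only be read by a human, while a structured record like this one can gate an analysis pipeline automatically.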
Anandhi:
JK, in your long-standing experience in research, would you mind sharing a nice example, a stellar example, where you saw that open data has led to some kind of breakthrough in research?
00:19:55:13 – 00:20:27:16
Anandhi:
Any small or big example?
JK:
Well, I think, historically, one of the most important ones was the human genome data. I mean, even before that, the C. elegans genome and the fly genome data were released even before the papers came out. So people had access to that data, and that made things move very fast, because you could actually start doing things without delay.
00:20:27:18 – 00:20:59:22
JK:
You had access to vital information for some research projects; you didn’t have to wait two years until the paper was published for access to that data.
Anandhi:
What would you advise the community, and this is going a little bit away from the topic we were discussing: when people are dealing with sensitive data, how do you think they could still make the data openly accessible? They have their own ethical concerns, their own reservations as to why they can’t share the data.
00:20:59:22 – 00:21:37:12
JK:
Well, this so-called sensitive data is a bit of a minefield, because there are so many layers of technical aspects and also regulations. I’ve been involved a little bit at the European level in this. But the default position, for example, is for medical or clinical data to be locked up in hospitals, although legally it could be processed outside, or it could be more open.
00:21:37:14 – 00:22:01:18
JK:
But generally people don’t know what is allowed and not allowed, so they lock up everything so that there is no problem. So some simplification and clarification, and maybe also training people in what is allowed and doable with sensitive data, and in what exactly qualifies as sensitive data, would already go a long way.
00:22:01:20 – 00:22:35:19
JK:
I’ve seen projects, for example, where they wanted to use clinical data from a particular source, and it took basically about one year of lawyer-to-lawyer discussions before the researchers could actually access the data.
Anandhi:
And that is definitely not taking science forward.
JK:
Yeah, because of that. And also, the problem now is, if you want to combine data from multiple sources: imagine you have to have a team of lawyers talking to multiple teams of lawyers across Europe so that you can bring all your data together.
00:22:35:24 – 00:23:05:15
Anandhi:
Yeah, that’s a complex problem to solve. Moving on from open science, I would now like to venture into imaging datasets and the problems, or the complexities, that come with them. We know that imaging datasets are definitely massive. What for you is the first challenge in storing them, in being able to share them, or even in being able to reuse datasets as big as imaging produces?
00:23:05:17 – 00:23:35:01
JK:
So what I’ve realized over the years is that your big data can be my small data; it’s all relative, based on your capacity: your technical capacity, your skills. And so there is really nothing to be afraid of in terms of large amounts of data. You just need to be prepared. So you need, of course, to have the space.
00:23:35:01 – 00:24:00:04
JK:
You need to have the compute capability. But what I found also is that the larger the dataset, the more people are aware of being disciplined and of trying to do things properly. And so most of the problems with data backups come from small to medium-sized studies, because there people are actually not careful at all.
00:24:00:06 – 00:24:23:07
Anandhi:
So do you think there is discipline? I mean, can you say that the discipline in the imaging field for capturing data, documenting it and capturing the metadata is as prevalent as in other communities that you’ve worked with?
JK:
There has been some progress, but I think most of the community is still not there.
00:24:23:07 – 00:24:51:01
JK:
And, as I said, a lot of these things are done on a small scale, and so people basically do all these things manually. And when you do a lot of things manually, you just make mistakes. I mean, writing the same name multiple times, you’re going to make typos, and sometimes you will use a capital letter at the beginning of a word, sometimes not.
00:24:51:03 – 00:25:12:05
JK:
And when you try to put things in a file name, for example, you try to put some information there: sometimes you will put it at the beginning, sometimes at the end. And that’s basically a disaster if you want to computationally process those things, because it’s not consistent.
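A small sketch of why this consistency matters computationally: with one agreed naming convention, a program can extract the information encoded in a file name, while an inconsistent name simply fails to parse. The convention below (date_sample_channel_site.tif) is invented for this example:

```python
# Illustrative sketch: a fixed file-naming convention makes names parseable.
# The convention (date_sample_channel_site.tif) is invented for this example.
import re

PATTERN = re.compile(
    r"^(?P<date>\d{8})_(?P<sample>[a-z0-9]+)_(?P<channel>[a-z0-9]+)_s(?P<site>\d+)\.tif$"
)

def parse_name(filename):
    """Return the metadata encoded in a compliant file name, or None otherwise."""
    match = PATTERN.match(filename)
    return match.groupdict() if match else None

print(parse_name("20240301_hela_gfp_s01.tif"))  # parses into fields
print(parse_name("GFP-hela-1st March.tif"))     # None: inconsistent, unusable
```

This is the low-tech approach in code form: no database or ontology required, just one rule applied the same way every time.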
Anandhi:
Would you in this case recommend the use of ontologies, to be able to standardize?
00:25:12:07 – 00:25:39:19
JK:
Well, if you want to standardize at least the descriptions, the vocabulary, you could start with just a simple list of words, or a structured list of the words that you want to use. But yes, ontologies have the nice advantage that the terms themselves are semantically related, so you can do things with them. But that’s already quite advanced.
00:25:39:19 – 00:26:08:21
JK:
I would just say, if you’re prepared to do this, go ahead. But otherwise, you don’t necessarily have to go that far. As long as the information is there, we can always take it from there.
Anandhi:
I’m not sure about this, but how do issues like data privacy, confidentiality, or any kind of proprietary imaging-based data impact open science?
00:26:08:21 – 00:26:39:18
JK:
So from the microscopy side of things, the main obstacle is the insistence of the microscope vendors on using their own proprietary formats. So one of the first things we usually do when we get images out of a microscope is convert them to some sort of more open format. I can understand some of the historical reasons for using proprietary formats.
00:26:39:18 – 00:27:03:02
JK:
I don’t think they are valid nowadays. And yeah, it’s really hard to get companies to say: well, if you save your data in an open format, we can do things with it, and that saves us some work.
Anandhi:
I’m curious to know, because I’m new to imaging. So could you actually, for people like me, tell me: how do you process an image?
00:27:03:02 – 00:27:22:19
Anandhi:
Okay, once you look into the microscope, from there, how does that become processed data? Could you take us through that journey?
JK:
Well, yeah. I mean, so you start basically by putting your sample under the microscope, and then you start looking at it. From then on, you typically have some automated acquisition. So it depends a little bit.
00:27:22:19 – 00:27:46:21
JK:
Sometimes it’s live samples over time, so images are taken over several minutes. But sometimes it’s a fixed sample, so it’s dead, and you just press the button and acquire one image. And from then on, there are different things people can do. They could just look at it and describe it qualitatively.
00:27:46:23 – 00:28:08:04
JK:
But very often people want to extract some numbers out of that, and that’s where a whole bunch of computational methods come in. So you might want to count how many cells you have in your field of view, and you run a particular program that identifies all the cells and then counts them all for you.
00:28:08:04 – 00:28:32:19
JK:
Sometimes you want to actually do more and characterize them, so you actually measure their size, or how bright they are for a particular marker, and so on.
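As a rough illustration of the counting step JK describes, here is a toy sketch that thresholds a tiny synthetic "image" and counts connected bright regions as cells. Real pipelines use dedicated segmentation tools on actual microscopy data; this only shows the principle:

```python
# Illustrative sketch of cell counting: threshold pixel values, then count
# connected bright regions. The tiny synthetic image stands in for real data.

# 0 = background, higher values = brighter pixels (two bright blobs below).
image = [
    [0, 9, 9, 0, 0, 0],
    [0, 9, 9, 0, 0, 0],
    [0, 0, 0, 0, 8, 8],
    [0, 0, 0, 0, 8, 8],
]

def count_cells(img, threshold=5):
    """Count 4-connected components of pixels brighter than `threshold`."""
    h, w = len(img), len(img[0])
    seen = [[False] * w for _ in range(h)]
    count = 0
    for y in range(h):
        for x in range(w):
            if img[y][x] > threshold and not seen[y][x]:
                count += 1          # found a new cell
                stack = [(y, x)]    # flood-fill all of its pixels
                while stack:
                    cy, cx = stack.pop()
                    if (0 <= cy < h and 0 <= cx < w
                            and img[cy][cx] > threshold and not seen[cy][cx]):
                        seen[cy][cx] = True
                        stack.extend([(cy + 1, cx), (cy - 1, cx),
                                      (cy, cx + 1), (cy, cx - 1)])
    return count

print(count_cells(image))  # 2
```

Measuring size or brightness per cell follows the same pattern: once a connected region is identified, you can sum its pixels (area) or average their intensities (marker brightness).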
Anandhi:
Do you also work with different types of imaging, like fluorescence? Could you also say a little bit about that?
JK:
Yeah. So, actually, I have historically been more involved in the light microscopy field than in electron microscopy.
00:28:32:19 – 00:29:00:17
JK:
So I will leave that aside. In light microscopy, a lot has been done with fluorescence, especially since the advent of green fluorescent protein and all its derivatives, and of fluorophores that you can basically tag onto any kind of molecule in the cell. So that allows you to see a particular molecule or entity, sometimes in real time.
00:29:00:19 – 00:29:26:01
Anandhi:
And once you get those datasets, the next step would be analysis.
JK:
Yeah. So again, you could just describe, and there are still people who discover very new things by just describing qualitatively what they see going on. But otherwise, you might typically want to quantitate differences between treatments, for example.
00:29:26:01 – 00:29:49:12
JK:
So: is this thing brighter here than there, or do we have more cells here than there? All these things, again, involve all kinds of computational processing.
Anandhi:
We touched upon machine learning, and maybe you could connect that with this question: how do you see AI and machine learning impacting the openness of imaging data?
00:29:49:14 – 00:30:15:13
JK:
Well, AI is the new term for machine learning, basically, and before that, I think it was called statistical learning. So those methods have been used forever. But neural networks, more specifically, have been used for imaging for basically the last 12 years.
00:30:15:18 – 00:30:40:16
Anandhi:
So one last question for you from my end would be, what can be done to encourage more scientists to contribute to open source imaging projects? What should push them or what would motivate them to be part of such projects?
JK:
As I said, the current approach only takes us so far, and so I think we need the stick.
00:30:40:18 – 00:31:19:13
JK:
So they need to be made aware that this is not up for debate. This is what they have to do.
Anandhi:
And is there a way to incentivize scientists to share their work openly rather than keeping it closed?
JK:
Well, that’s again the carrot side. We could do more there by basically having more recognition, maybe also putting in more resources, like actually having money in the grants to do proper data management, maybe support infrastructures, maybe also support people for doing it.
00:31:19:15 – 00:31:41:08
JK:
But support people are only going to take you part of the way, because the rest is down to the researchers themselves. So they also need to be trained. And again, it has to be clear that this is a requirement.
Anandhi:
Thank you very much for this insightful podcast.
JK:
Thank you.
Anandhi:
Thank you for listening to the Knowledge Catalyst.
00:31:41:08 – 00:31:53:13
Anandhi:
This is your co-host Anandhi. Looking forward to the next chat.
Dr. Jean-Karim Heriché provides computational support for the Cell Biology and Biophysics Unit at the EMBL. An expert in open imaging data, he has been an early proponent of data management and open science at EMBL.
Please contact Victoria Yan: victoria.yan@embl.de
Production: Victoria Yan, Anandhi Iyappan
Audio Technician and Editing: Felix Fischer
Original music: Sergio Alcaide, Felix Fischer
Graphics: Holly Joynes
Web Design: Victoria Yan, Szymon Kasprzyk
Photography: Kinga Lubowiecka