Big (Protein) Data to Knowledge

A new Big Data to Knowledge (BD2K) project draws on crowdsourcing, cloud technologies and clinical cohorts to transform protein data to knowledge.

  • New ‘Big Data to Knowledge’ project by UCLA, Scripps and EMBL-EBI
  • Goal is to integrate different data types, on multiple scales
  • Outcomes will have applications for BD2K in the US and ELIXIR in Europe

A new Big Data to Knowledge (BD2K) project launched today by the University of California, Los Angeles (UCLA), the Scripps Research Institute and EMBL-EBI is set to integrate proteomics, metabolomics, variation and molecular pathway data as part of an efficient, global digital ecosystem for biomedical research. Drawing on cardiovascular data from two major cohorts, the ‘Protein Data to Knowledge’ platform will incorporate cloud technologies, crowd-sourced annotation, text mining and multi-scale clinical data modelling.

One of the biggest challenges in biomedical research is trying to compare apples and oranges: different types of experiments generate different types of data that are very difficult to match up. In addition, the volumes of data produced in high-throughput experiments and clinical research can be extremely difficult to manage.

“Like everyone working on BD2K, our goal is to revolutionise how we address the universal challenge of discerning meaning from unruly data,” says Peipei Ping of the National Heart, Lung and Blood Institute (NHLBI) Proteomics Center at UCLA, who is leading Protein Data to Knowledge.

Part of the solution is to use everything in the scientific community’s arsenal to give structure to these large and heterogeneous datasets. The project partners will create a platform that allows scientists working in all domains to add information to datasets alongside annotations by experts. The platform will be inclusive in many ways, for example making it easier for people to query the data using language particular to their area of science.

“We want to make it easier for the wider scientific community to participate in making sense of Big Data,” says Henning Hermjakob, who leads Molecular Systems data resources at EMBL-EBI. “We have a unique opportunity with our partners in the US to build on the success of ProteomeXchange, building relationships and capturing knowledge generated by people working in sometimes unexpected areas. This isn’t just about making a better proteomics resource – it’s about making good data more discoverable and easier to mine in the biomedical literature.”

One goal of Protein Data to Knowledge is to visualise biological pathways and networks at multiple scales, which will make it easier to identify relationships between drugs and diseases. To achieve this, the project partners will create cloud-based methods to integrate proteomics and metabolomics data, and display them in useful ways.

“The future of biomedical research is about assimilating data across biological scales from molecules to populations,” said Philip E Bourne, NIH Associate Director for Data Science, in a press release from the NIH earlier this month.

“We’re really aiming to transform our research culture from one where Big Data are in the limited domain of the computationally privileged, to one where they are democratised for use by the entire research community,” says Ping. “Today we are kicking off a big effort to build a federated architecture of community-supported tools for enhancing data management, integration and analysis.”

“Collectively, projects in BD2K and in ELIXIR can start to make big data work for research on all levels, including healthcare,” adds Hermjakob. “Sharing knowledge between these major initiatives will help us work efficiently and keep pace with demand, so we can enable the discovery of solutions that benefit everyone.”

About BD2K

The ability to harvest the wealth of information contained in biomedical big data will advance our understanding of human health and disease; however, a lack of appropriate tools, poor data accessibility and insufficient training are major impediments to rapid translational impact. To meet this challenge, the National Institutes of Health (NIH) launched the Big Data to Knowledge (BD2K) initiative in 2012. BD2K is a trans-NIH initiative established to enable biomedical research as a digital research enterprise, to facilitate discovery and support new knowledge, and to maximize community engagement. Overall, the focus of the BD2K initiative is the development of innovative and transforming approaches as well as tools for making big data and data science a more prominent component of biomedical research.


About ELIXIR

The goal of ELIXIR is to orchestrate the collection, quality control and archiving of large amounts of biological data produced by life science experiments. Some of these datasets are highly specialised and would previously only have been available to researchers within the country in which they were generated. For the first time, ELIXIR is creating an infrastructure (a kind of highway system) that integrates research data from all corners of Europe and ensures seamless service provision that is easily accessible to all. In this way, open access to these rapidly expanding and critical datasets will facilitate discoveries that benefit humankind.

This post was originally published on EMBL-EBI News.

Tags: bioinformatics, database, embl-ebi, partnerships

