Shedding light on rare diseases: open data and model organisms

Rare diseases are a worldwide healthcare challenge, and one way of understanding and managing them is through model organisms.

Decorative image showing mice silhouettes in different colours.
Credit: Karen Arnott/EMBL-EBI

By Tudor Groza, Team Leader, EMBL-EBI

A rare disease is a condition that affects fewer than one in 2,000 people. There are approximately 9,000 documented rare disorders (with some studies counting over 10,000). Collectively, rare diseases affect at least one in 16 people, which represents a significant part of the population.

More than 80% of rare diseases have a genetic component, and these conditions are disabling and often life-limiting. They’re also expensive to manage, and have a devastating impact on patients, their families, and the healthcare system. Even more sobering is the fact that one third of children with a rare disease die before their fifth birthday.

An increasing priority 

Serving the needs of children with rare diseases has been recognised by the World Health Organisation (WHO) and the European Commission as a critical unmet medical, social, and human rights need, and a global public health priority. Receiving an early and accurate diagnosis is imperative for informed care in medicine, and yet patients and families living with suspected rare diseases will often spend more than five years on a diagnostic odyssey. This usually entails multiple specialist visits and invasive testing, and can carry significant societal and personal costs.

Recent advances and remaining challenges

Advances in genomic sequencing, coupled with the adoption of standardised terminologies and an increasing willingness to share data internationally, have laid the foundation for progress in this field over the last few years. One example is the 100,000 Genomes Project by Genomics England. This groundbreaking initiative has already resulted in improved diagnostic efficiency for monogenic diseases, and the ability to diagnose a critically ill child by rapid whole-genome sequencing in only seven hours.

What are monogenic diseases?

Monogenic diseases are caused by variation in a single gene. Examples include sickle-cell anaemia, cystic fibrosis and Huntington’s disease.

Advanced sequencing techniques often uncover new genomic variants in a gene that is suspected to play a role in the underlying condition of a patient. The quest is then to prove the relationship between the new variant and the condition. Establishing such ‘proof’ can be quite challenging. But additional experimentation using model organisms can often provide a key piece of the puzzle.

Model organisms and open data

Over many decades, organisms such as mice, fruit flies, and roundworms have become classic models for studying human biology and disease. Such ‘model organisms’ reproduce relatively quickly and can be readily studied in laboratory settings. As such, scientists use model organisms to conduct experiments that cannot be performed with humans. A typical experiment involves altering a gene of interest in a model organism and checking whether the resulting phenotype – or trait – mirrors the phenotype encountered in humans. For example, altering the gene FGFR3 in mice leads to a shorter than normal stature.

Model organisms also provide key information on therapeutic strategies that cannot be obtained by alternative methods, and accelerate the development of new therapeutic options for not only rare diseases, but also other areas such as cancer.

“Unlocking new knowledge from model organisms, through the clarity of the phenotypic lens, is set to transform rare diseases discovery, diagnosis and care in a hitherto unprecedented manner.”

– Gareth Baynam, Medical Director at the Rare Care Centre, Perth Children’s Hospital; Chair of the Interdisciplinary Scientific Committee, International Rare Diseases Research Consortium

The continuous development of curated databases and computational platforms has also contributed to the success of using model organisms. Efforts such as the Monarch Initiative, the Mouse Genome Database, the Zebrafish Model Organism Database and FlyBase, for instance, have pioneered approaches to document and publish the data generated in the process of studying model organisms in a computer-readable manner. This has enabled a standardised way of sharing knowledge, in addition to the development of algorithms that combine data from multiple species and humans to generate hypotheses in a clinical setting.
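To give a flavour of how such cross-species algorithms can generate hypotheses, here is a minimal Python sketch that ranks candidate genes by how closely the phenotypes seen in model organisms overlap with those of a patient. It is an illustration only, not the method used by any of the resources named above: the function names, the plain-text phenotype labels, and the placeholder gene GENE_B are all assumptions made for the example, and real platforms rely on ontology-aware similarity measures rather than simple set overlap.

```python
def jaccard(a: set[str], b: set[str]) -> float:
    """Fraction of phenotype terms shared by both sets (0.0 to 1.0)."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def rank_candidate_genes(
    patient_terms: set[str],
    model_phenotypes: dict[str, set[str]],
) -> list[tuple[str, float]]:
    """Score each gene by phenotype overlap with the patient, best first."""
    scores = [
        (gene, jaccard(patient_terms, terms))
        for gene, terms in model_phenotypes.items()
    ]
    return sorted(scores, key=lambda pair: pair[1], reverse=True)


# Hypothetical, already-harmonised phenotype labels (not real database records).
# Only the link between FGFR3 and short stature is taken from the article.
patient = {"short stature", "shortened limbs"}
models = {
    "FGFR3": {"short stature", "shortened limbs", "abnormal skull shape"},
    "GENE_B": {"abnormal heart rhythm"},  # made-up placeholder gene
}

ranking = rank_candidate_genes(patient, models)
for gene, score in ranking:
    print(f"{gene}: {score:.2f}")
```

The sketch only works because both the patient and the model phenotypes are described with the same vocabulary, which is why standardised, computer-readable terminologies matter so much in practice.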

“Making rare disease knowledge openly available and interoperable across species and data types is pivotal for expediting rare disease diagnostics, research, and care. Fundamentally, model organisms fill a critical knowledge gap where we have little or no evidence from human data sources.”

– Melissa Haendel, Marsico Chair in Data Science, University of Colorado Anschutz Medical Campus; Program Director at The Monarch Initiative

In this context, EMBL-EBI plays an important role by leading, via its Phenomics team, two international initiatives that support the advancement of biomedical science using mouse models, with direct applicability to rare diseases: the Patient-Derived Cancer Models (PDCM) Finder project and the Mouse Phenotype Informatics Infrastructure (MPI2) consortium. The value of the mouse as a model organism comes from the fact that its developmental, physiological, biochemical, and behavioural features are similar to those of humans.

EMBL-EBI coordinates the data side of the MPI2 and PDCM Finder initiatives, ensuring the data generated by the projects are openly available and easily accessible to all. In addition, EMBL-EBI genomic and model-derived phenotypes are used by international rare disease platforms like the European Joint Programme on Rare Diseases, hence providing functional information to enhance patient data.

PDCM Finder

Some cancers, such as childhood cancers, are considered rare because they are relatively uncommon. Unfortunately, their rarity also makes them harder to diagnose. Patient-derived cancer models (PDCMs) are models of cancer where the tissue or cells from a patient’s tumour are implanted into an immunodeficient mouse or are reproduced in cell lines or self-organised three-dimensional tissue cultures. 

In recent years, PDCMs have become essential tools in both cancer research and preclinical studies. Academic and commercial organisations have invested significantly in the generation and characterisation of these models. As these models gain much of their value through reuse and integration, there is a compelling need for PDCM datasets to adhere to the FAIR data principles of Findability, Accessibility, Interoperability, and Reusability. This is the critical contribution of EMBL-EBI to the wider PDCM community. 

PDCM Finder is the largest resource of its kind in the world. It standardises, harmonises, and integrates the complex and diverse data associated with PDCMs, making them more discoverable and more useful. The project provides access to 4,592 PDX models, 1,520 cell lines and 108 organoid models from 27 international providers from the cancer research community. It acts as a unified entry point for research and clinical communities to search and compare PDCMs and their associated data, including frequently mutated genes, diagnoses, drug treatments, and sequence data.

International Mouse Phenotyping Consortium

MPI2 represents the data collection, quality control, analysis and publishing arm of the International Mouse Phenotyping Consortium (IMPC), which is an effort to create and characterise a collection of mice containing a null mutation in every gene in the mouse genome. A null mutation is a mutation that results in a complete loss of function of a particular gene.

The latest IMPC data release covered 8,267 genes and over 90 million data points. The resource supports the identification of new mouse models of rare and common human diseases, new gene functions and the development of novel methodological approaches that form the basis of new gene-disease associations.

The data published by these resources are a key component in a complex ecosystem of databases that is evolving to deliver systematic analyses at the cellular, organism and population levels. Two leading examples are the Monarch Initiative, which connects phenotypes to genotypes across species, and the Illuminating the Druggable Genome programme, which sheds light on unannotated proteins in commonly drug-targeted protein families.

With direct applicability to informing mechanistic studies of human disease, in 2021 the IMPC described 1,696 models of known human disease-gene associations. This knowledge is also made available to the clinical community via the Exomiser software, a critical component of the ISO-accredited diagnostic pipeline for the UK’s 100,000 Genomes Project. IMPC data are also directly integrated into the interpretation of results for National Health Service patients in the UK and in US-based sequencing programs such as the Undiagnosed Disease Programme and Network. By making the data as widely and freely available as possible, we hope to contribute to the understanding, diagnosis and treatment of rare diseases worldwide.

Tags: bioinformatics, data science, data service, data sharing, database, embl-ebi, open data, rare disease

