A new annotation initiative, published in the journal Nature Biotechnology, presents a roadmap for integrating some of these previously unexplored gene segments into human genome databases.

As part of the initiative, researchers from 20 institutions worldwide aim to add more than 7,200 unrecognised translated regions within the genome, some of which may code for ‘missing’ proteins, to the major human genome databases.

Unexplored regions of the human genome

Thousands of small translated open reading frames (ORFs) – short spans of DNA between a start and a stop codon – have been reported in the human genome over the past few years using ribosome profiling, or Ribo-seq. This experimental technique determines which parts of a messenger RNA (mRNA) ribosomes interact with, and thereby indicates which regions are translated. These data could be of vital importance to ongoing efforts to decipher the human genome sequence, as the annotation of translated sequences remains difficult.
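To make the definition of an ORF concrete, the short Python sketch below scans a DNA string for stretches that begin with a start codon and run, in the same reading frame, to the next stop codon. It is only a toy illustration under simplifying assumptions: it looks at the forward strand only, treats ATG as the sole start codon, and ignores splicing. It is not the method used by Ribo-seq or by the researchers in this initiative, and the function name `find_orfs` and its `min_codons` parameter are hypothetical names chosen for this example.

```python
# Toy ORF scanner: forward strand only, ATG as the only start codon.
# Illustrative sketch; not the pipeline used by the annotation initiative.

STOP_CODONS = {"TAA", "TAG", "TGA"}
START_CODON = "ATG"


def find_orfs(seq, min_codons=2):
    """Return a sorted list of (start, end, orf_sequence) tuples.

    start is 0-based, end is exclusive, and the stop codon is included.
    min_codons counts every codon in the ORF, including start and stop.
    """
    seq = seq.upper()
    orfs = []
    for frame in range(3):              # three forward reading frames
        start = None                    # position of the current open start codon
        for i in range(frame, len(seq) - 2, 3):
            codon = seq[i:i + 3]
            if start is None and codon == START_CODON:
                start = i               # open a new ORF at the first ATG
            elif start is not None and codon in STOP_CODONS:
                end = i + 3             # close the ORF at the in-frame stop
                if (end - start) // 3 >= min_codons:
                    orfs.append((start, end, seq[start:end]))
                start = None
    return sorted(orfs)


if __name__ == "__main__":
    for start, end, orf in find_orfs("CCATGAAATGACC"):
        print(start, end, orf)
```

Sequence alone only shows where translation could occur. Ribo-seq adds experimental evidence of where ribosomes actually sit on the mRNA, which is why the catalogue described here is built from Ribo-seq data rather than from scans like this one.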

Scientists have traditionally gained confidence in the identification of protein-coding regions in genes by comparing DNA sequences from multiple species. This usually works well because the most important coding regions have been preserved during animal evolution for millions of years. However, this method has a drawback: relatively young coding regions – for example, those that arose during the evolution of primates – fall through the cracks and are therefore missing from the databases.

To further complicate matters, it is now known that translation does not necessarily lead to the production of stable protein molecules, and that this process can also mediate cell physiology through alternative modes of function such as gene regulation.

An ORF roadmap

For this project, a collective of scientists started by gathering published information on translated sequences that had been discovered using Ribo-seq. They then assembled these data into a standardised catalogue, mapped to the reference gene annotation produced jointly by Ensembl and GENCODE. This was no small feat: data obtained in a wide variety of ways by different laboratories had to be combined in a uniform format.

This project was co-led by Jorge Ruiz-Orera from the Max Delbrück Center for Molecular Medicine in the Helmholtz Association (MDC), Sebastiaan van Heesch from the Princess Máxima Center for Pediatric Oncology, Jonathan Mudge from EMBL’s European Bioinformatics Institute (EMBL-EBI), and John Prensner from the Broad Institute of MIT and Harvard.

“It is especially remarkable that most of these 7,200 ORFs are exclusive to primates and might represent evolutionary innovations unique to our species,” said Ruiz-Orera. “This shows how these elements can provide important hints of what makes us human.”

“It’s tremendously exciting to enable the research community in this way,” said van Heesch. “At this time, we really can’t say that all of these things really are human proteins, but we can say that something unexplored is happening across the human genome and that the world should be paying attention.”

Revising human genome databases

This international initiative intends to carry on the work from this project and revise the human genome databases used by scientists worldwide. Ensembl and GENCODE are the first to configure this ORF catalogue as a component of their reference gene annotation databases. This work is also supported by the protein annotation databases produced by UniProt Knowledgebase, the Human Proteome Project (HPP), and the HUGO Gene Nomenclature Committee (HGNC).

“For too long, the scientific community has been mostly left in the dark about these ORFs,” said Mudge. “We’re very proud that our work will enable researchers across the world to study them. This is the point at which they enter the mainstream of genomic and medical science – an effort which we expect to have wide-ranging ripple effects.”

“These ORFs almost certainly will be contributing factors to many human traits and diseases, both rare diseases and common ones such as cancer,” said Prensner. “The challenge is now to figure out which ones have which roles in which diseases.”

Funding

AF, JMM and PF are supported by the Wellcome Trust [Grant number 108749/Z/15/Z], the National Human Genome Research Institute of the National Institutes of Health under award number 2U41HG007234, and the European Molecular Biology Laboratory.
