Most genomes are only as useful as their annotation – a process that identifies the genes present in a genome and what roles they fulfil. For researchers studying bacteria, genome annotation provides the context required to understand how bacteria adapt to become resistant to treatment, be more virulent, or spread faster. Genome annotation also allows us to understand how bacteria evolve and quickly adapt in order to survive. 

But to truly unlock such insights, researchers need to annotate the genomes of an entire bacterial population, not just single individuals. The standard way of annotating genomes is to do it one genome at a time. This worked well in the early days of genome sequencing, but as hundreds of thousands of genomes become available, new problems arise. These include redundancy – because the same things get annotated over and over again – inconsistency, and the need to scale up the performance of software tools. 

Because existing tools predict genes one genome at a time, they cannot draw on the wealth of genomic data available for each species to produce a pangenome – the entire set of genes present within a species. A pangenome is essential for understanding genomic variation at the species level.

From tens to thousands of genomes

The ggCaller software, jointly developed by the Lees group at EMBL-EBI and Nicholas Croucher’s group at Imperial College London, is the first to accurately annotate thousands of bacterial genomes all at once. To do so, it uses a deep learning model and genome graphs – a way of succinctly representing many thousands of genomes at once. ggCaller can also cluster genes from large numbers of bacterial genomes into a pangenome.
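To give a rough sense of why a graph can represent many genomes succinctly, the toy sketch below indexes short DNA words (k-mers) and records which genomes contain each one, so sequence shared between genomes is stored only once. It is an illustrative assumption, not ggCaller's implementation: the function name, the toy sequences and the choice of k are all hypothetical, and the sketch captures only the nodes of such a graph, not its edges.

```python
from collections import defaultdict
from typing import Dict, Set


def build_coloured_kmer_index(genomes: Dict[str, str], k: int = 4) -> Dict[str, Set[str]]:
    """Map each k-mer to the set of genomes ("colours") that contain it.

    Hypothetical helper for illustration only; real genome-graph tools
    also store edges between k-mers and compact them for efficiency.
    """
    index: Dict[str, Set[str]] = defaultdict(set)
    for name, sequence in genomes.items():
        # Slide a window of length k along the sequence.
        for start in range(len(sequence) - k + 1):
            index[sequence[start:start + k]].add(name)
    return dict(index)


# Two made-up sequences that differ only in their final base.
toy_genomes = {"genome_1": "ATGGCGTA", "genome_2": "ATGGCGTT"}
index = build_coloured_kmer_index(toy_genomes, k=4)

# k-mers present in both genomes are stored once, tagged with two colours.
shared = [kmer for kmer, colours in index.items() if len(colours) == 2]
print(len(index), len(shared))  # 6 4
```

In this toy case the two genomes contain ten k-mer occurrences between them, but only six distinct k-mers need to be stored, four of which are shared.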

Explore ggCaller

“The pangenome consists of so-called core genes and accessory genes,” explained John Lees, Research Group Leader at EMBL-EBI. “Core genes are present in all individuals in a species and are usually essential for life. Accessory genes only appear in some members of the population and historically have been harder to analyse, because they’re harder to accurately cluster. However, we know that accessory genes are an important way in which bacteria can adapt to vaccines and antibiotic treatments, so we believe they are important to study. ggCaller brings us one step closer to accurate reconstructions of bacterial pangenomes, which are important for tracking and understanding dangerous pathogens.”
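The distinction between core and accessory genes can be sketched in a few lines of code. The example below is a minimal illustration, assuming genes have already been clustered across genomes; the function name, the isolate names and the gene lists are hypothetical toy data, not ggCaller output.

```python
from typing import Dict, Set, Tuple


def split_pangenome(
    gene_presence: Dict[str, Set[str]], core_threshold: float = 1.0
) -> Tuple[Set[str], Set[str]]:
    """Split a pangenome into core and accessory genes.

    gene_presence maps a genome name to the set of gene clusters found in it.
    A gene is "core" if the fraction of genomes carrying it is at least
    core_threshold (1.0 means present in every genome).
    """
    n_genomes = len(gene_presence)
    if n_genomes == 0:
        return set(), set()

    # Count how many genomes carry each gene.
    counts: Dict[str, int] = {}
    for genes in gene_presence.values():
        for gene in genes:
            counts[gene] = counts.get(gene, 0) + 1

    core = {gene for gene, count in counts.items() if count / n_genomes >= core_threshold}
    accessory = set(counts) - core
    return core, accessory


# Toy data: three isolates, two of which carry one extra gene each.
isolates = {
    "isolate_A": {"dnaA", "gyrB", "ermB"},
    "isolate_B": {"dnaA", "gyrB"},
    "isolate_C": {"dnaA", "gyrB", "mefA"},
}
core, accessory = split_pangenome(isolates)
print(sorted(core))       # ['dnaA', 'gyrB']
print(sorted(accessory))  # ['ermB', 'mefA']
```

The pangenome is the union of both sets. The hard part in practice, as the quote notes, is the clustering step that decides which genes in different genomes count as the same gene, which this sketch takes as given.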

Understanding pneumococcal disease 

Streptococcus pneumoniae is a bacterium that causes pneumonia and meningitis. It resides asymptomatically in the upper respiratory tract of healthy carriers, but can wreak havoc in people with weak immune systems, including children and the elderly. The S. pneumoniae genome contains on average just over 2 million DNA base pairs, a core set of 1,500 genes, and a further 6,500 accessory genes. Because S. pneumoniae can ‘grab’ DNA from its environment and add it to its chromosome – a process known as horizontal gene transfer – its genetic material varies greatly between the roughly 1,000 known strains. Some of these genes cause resistance to antibiotics. 

To test ggCaller, the Lees group used a collection of S. pneumoniae DNA sequences to functionally annotate significant associations with macrolide resistance. They identified key resistance determinants that were missed by annotation methods that rely on a single reference genome. This showed that ggCaller can leverage genetic context to find genes relevant to antimicrobial resistance.

Another tool in the box

“We wanted to create a tool that other researchers could use to analyse their data, and that was quick enough to match existing gene prediction tools,” said Sam Horsfield, PhD student at Imperial College London and EMBL-EBI. “We hope that ggCaller will be a useful addition to the genome annotation toolbox, allowing researchers to analyse a larger number of genomes in one go.”

“I am currently working on a model for bacterial population genetics. As an input to my model, I needed a reliable pangenome annotation. I used ggCaller to make this analysis. The tool was fast and straightforward to run, providing me with high-quality results to test my model,” said Leonie Lorenz, Predoctoral Fellow in the Lees group, who was not involved in the development of ggCaller. 

ggCaller will be used to enable future projects running as part of EMBL’s infection biology transversal theme, which aims to characterise pathogen interactions with the host to tackle infection and antimicrobial resistance. This is part of EMBL’s scientific programme ‘Molecules to Ecosystems’. 

Funding

This work was made possible with support from the MRC Centre for Global Infectious Disease Analysis (Studentship Grant Ref: MR/S502388/1), jointly funded by the UK Medical Research Council and the UK Foreign, Commonwealth & Development Office, and is also part of the EDCTP2 programme supported by the European Union. Funding also came from the UK Medical Research Council and Department for International Development (grants MR/R015600/1 and MR/T016434/1), and from a Sir Henry Dale Fellowship jointly funded by Wellcome and the Royal Society (grant 104169/Z/14/A). The work was also supported by the European Molecular Biology Laboratory.
