Ensembl 110 and Ensembl Genomes 57 have introduced in-house prokaryotic gene annotation across genomes available in Ensembl Bacteria

Since its inception, Ensembl Bacteria has imported user-submitted annotations from the International Nucleotide Sequence Database Collaboration (INSDC) for prokaryotic genomes. In-house gene annotation produced by Ensembl remained largely confined to vertebrate genomes, with the occasional metazoan or microbial genome annotated as a test. Recently, however, Ensembl Bacteria has been able to establish a robust and scalable pipeline that produces consistent annotation for prokaryotes. The Ensembl team believes that a consistent set of annotations will go a long way towards enabling better comparisons between genomes and the computation of pangenomes. Furthermore, with a growing volume of prokaryotic assemblies being submitted to INSDC without accompanying gene annotation, a robust approach will allow Ensembl Bacteria to make a meaningful contribution in this area.

Through a collaboration between the Microbiome Informatics and Ensembl microbial groups at EMBL-EBI, a common pipeline has been developed to annotate both isolate genomes and metagenome-assembled genomes (commonly referred to as MAGs) in bacteria. The pipeline uses Prokka for gene calling, followed by cmscan, InterProScan and EggNOG tools to bolster the functional annotation, and Codetta to screen for alternative genetic codes in a scalable manner. This approach will also make it easier to extend the annotation framework to features such as operons, pathways and biosynthetic gene clusters in future releases.

The team has deployed this annotation framework on all 31,332 genomes hosted in Ensembl Bacteria, and is making a phased transition to the new annotation. In release 110, they have replaced the annotations of all but 115 genomes. These 115 are key species whose annotation has been used in pan-taxonomic comparative analysis in Ensembl for some time, and they will remain unchanged for the next few releases.

The team has also taken this opportunity to implement a systematic naming scheme for the genes in Ensembl Bacteria based on rules provided by the Global Alliance for Genomics & Health (GA4GH). They took five facts about each gene, listed below, and hashed them using the SHA-512 algorithm. They then used the first 15 characters of this checksum, prepended with “ENSB:”, as the gene identifier. The greatest benefit of such a system is that identical genes can be identified unambiguously and referred to by the same identifier, even when alternative gene prediction tools are used.

  1. NCBI taxon identifier of species
  2. GA4GH checksum (sha512t24u) of the DNA sequence the CDS is on (SHA-512 digest truncated to 24 bytes and base64url-encoded)
  3. Start of CDS
  4. End of CDS
  5. Strand


Example:

{1500254}:{ga4gh.SQ:crH5Li56HcBu--dIy9nRCP-VELBcbjJ2}:{3254255}:{3257189}:{-}
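
The scheme can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the Ensembl implementation: `sha512t24u` follows the published GA4GH definition, but the helper names `serialize_gene_key` and `gene_identifier` are hypothetical, the brace-wrapped, colon-separated layout is simply copied from the example above, and the post does not say how the SHA-512 checksum is rendered as text before its first 15 characters are taken (base64url is assumed here).

```python
import base64
import hashlib


def sha512t24u(data: bytes) -> str:
    """GA4GH truncated digest: SHA-512, first 24 bytes, base64url-encoded."""
    digest = hashlib.sha512(data).digest()
    return base64.urlsafe_b64encode(digest[:24]).decode("ascii")


def serialize_gene_key(taxon_id, seq_digest, start, end, strand) -> str:
    """Join the five facts into one string (hypothetical helper).

    Assumption: the braces and colons in the example are literal characters.
    """
    fields = (taxon_id, f"ga4gh.SQ:{seq_digest}", start, end, strand)
    return ":".join("{" + str(field) + "}" for field in fields)


def gene_identifier(gene_key: str) -> str:
    """Derive an 'ENSB:'-prefixed identifier from the key (hypothetical helper).

    Assumption: the SHA-512 checksum is base64url-encoded before the first
    15 characters are taken; the real text encoding may differ.
    """
    checksum = hashlib.sha512(gene_key.encode("utf-8")).digest()
    encoded = base64.urlsafe_b64encode(checksum).decode("ascii")
    return "ENSB:" + encoded[:15]


# Toy usage: a made-up four-base "sequence" and made-up CDS coordinates.
seq_digest = sha512t24u(b"ACGT")
key = serialize_gene_key(1500254, seq_digest, 1, 4, "+")
print(key)
print(gene_identifier(key))
```

Because the identifier depends only on the taxon, the sequence digest and the CDS coordinates, two annotation tools that predict the same CDS on the same sequence obtain the same identifier, which is the property the naming scheme is designed to provide.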
