Efficient sequence alignment against millions of prokaryotic genomes with LexicMap
Nature Biotechnology 10 September 2025
10.1038/s41587-025-02812-8
By making the world’s microbial DNA easier to explore, LexicMap helps researchers track outbreaks, study antibiotic resistance, and understand microbial diversity
 
A new sequence alignment tool, LexicMap, lets scientists search for a DNA sequence against millions of bacterial and archaeal genomes in minutes.
Open-access databases such as the European Nucleotide Archive (ENA) contain over 2.4 million bacterial genomes, and this number continues to grow rapidly. Until now, searching these vast resources has been slow and computationally demanding, limiting scientists’ ability to track antibiotic resistance, study outbreaks, or explore microbial diversity.
A paper published in the journal Nature Biotechnology introduces a new algorithm called LexicMap. By using an innovative method to index genetic data, LexicMap enables researchers to quickly search for DNA sequences or mutations across the world’s growing DNA databases. This opens up new opportunities in epidemiology, ecology, and evolutionary biology.
“Evolution gradually changes genes through mutation, so biologists often want to scan through all the world’s DNA data to look for matches and how they differ through mutations,” said Zamin Iqbal, Professor of Algorithmic and Microbial Genomics at the University of Bath and visiting Group Leader at EMBL-EBI. “As the data explosion has outstripped our algorithms, we have had to live with search engines that search a fraction of our data.”
Over the last decade, the team behind LexicMap have been developing high-quality data resources for the research community and, in parallel, improved search algorithms for microbial DNA. They also work as part of a global consortium, AllTheBacteria, to assemble and annotate all 2.4 million bacterial and archaeal genomes in the ENA. LexicMap is the first alignment algorithm that can search all these data rapidly and with a low computational burden.
“Google search is a routine part of modern life, and we cannot imagine dealing with the internet without it,” said Wei Shen, Associate Professor at Chongqing Medical University and former visiting scientist at EMBL-EBI. “Alignment to a DNA database is the biology equivalent of Google search, and LexicMap now makes that scalable to the full volume of global bacterial data. If you have found a new drug resistance gene, you might want to know how prevalent it is amongst bacteria, and now you can search through the world’s data for it in just a few minutes.”
By making microbial genomes easier to search, LexicMap opens up new possibilities for research and public health.
“Having the ability to search all publicly available bacterial genomes in minutes changes what’s possible,” said John Lees, Group Leader at EMBL-EBI. “If you’re developing a new antibiotic and discover a resistance mutation, you need to know how common it is in the real world. Now, for the first time, you can search over 2 million genomes – the entire global collection – in minutes to find out.”
The LexicMap tool has already been integrated into the AllTheBacteria project, which curates and indexes high-quality assemblies of all known bacterial genomes. This gives researchers an easy way to explore one of the largest collections of microbial DNA ever assembled.
During his time at EMBL-EBI, Wei Shen, the lead author on this study, received support through the EMBL Sabbatical Visitor Fellowships. These fellowships offer researchers the opportunity to spend time at EMBL, collaborate with experts, and work on projects that benefit from EMBL’s world-class facilities and resources. They are designed to foster international collaboration, drive scientific innovation, and support researchers in advancing their work.
This study was supported by the National Natural Science Foundation of China (82341112), a China Scholarship Council scholarship (202308500105 to W.S.), an EMBL Visitor/Sabbatical Programme fellowship, the Remarkable Innovation-Clinical Research Project, the Joint Project of Pinnacle Disciplinary Group (to W.S.), and the Kuanren Talents Program of The Second Affiliated Hospital of Chongqing Medical University.