Online Magazine of the European Molecular Biology
The story of Clustal: democratising sequence alignments
We caught up with 2023 Lennart Philipson Award winner Desmond Higgins for a chat about his time at EMBL, his research developing sequence alignment tools, and 20th-century bioinformatics.
In the late 1980s, whenever Des Higgins received a request from a fellow scientist to share Clustal – the groundbreaking software he had developed – he would send it out on a floppy disk via the postal service. Over a period of four years, he estimates, he sent two or three hundred copies. Then, in 1992, Higgins put a new version of the program on the EMBL file server, set up by EMBL IT group leader Roy Omond.
“And overnight, there were 400 downloads,” said Higgins. “It was a wonderful feeling.”
A lifelong learner and problem solver, Higgins joined EMBL in 1990, well before the bioinformatics boom had taken off. An avid reader of science books and a collector of wild spiders as a child, he had earlier arrived at Trinity College, Dublin, to study biology and fallen in love with computation while pursuing a PhD in zoology.
“Bioinformatics in the 1980s was mainly done by researchers working on their own or by very small groups, and as a sideline rather than a main focus,” said Higgins. “Even the word ‘bioinformatics’ wasn’t much used up until 1985.”
However, that changed when large-scale sequencing became popular. As Higgins recalls: “Once people started sequencing genomes, you couldn’t make use of the data without bioinformatics. And so, between 1990 and the year 2000, bioinformatics went from being a minor field to being of fundamental importance.”
Two other developments aided the growing popularity of bioinformatics. The first was the Human Genome Project, launched in 1990 and (mostly) completed in 2003, when it became the first effort to sequence more than 90% of the human genome. The second was the recognition by pharmaceutical companies in the 1990s of the commercial potential of mining early data from human genomes, as well as from other biological datasets.
At this crucial moment in the development of the field, EMBL was well situated to take a leading role. The EMBL Nucleotide Sequence Database, the world’s first nucleotide sequence database, was established in the early 1980s at the Data Library of EMBL Heidelberg, which later developed into the European Bioinformatics Institute (EMBL-EBI) in Hinxton. When Higgins joined the EMBL Data Library, it was being led by Graham Cameron (who developed the concept for EMBL-EBI and later became its associate director).
“EMBL was one of the main places in Europe that championed bioinformatics,” said Higgins. “It had one of the biggest collections of bioinformaticists in the world. We had very good computer facilities, and everyone had a computer on their desk connected to the EMBL mainframe computers.” According to Higgins, the scientists were also fully connected to the internet, something that set them apart from most other scientists on the planet at the time.
“It felt pioneering,” Higgins said. “It felt like we were doing something new and important and even if the rest of the world didn’t think so, they would soon realise it – because what we were doing was about to become essential.”
A crucial collaboration
One of the things Higgins appreciated most about EMBL was the academic freedom it afforded its researchers. In this environment, the problem that he turned his attention to was one he had already been working on before he came to EMBL – that of multiple sequence alignments.
Aligning or comparing short DNA, RNA, or protein sequences can give scientists a wealth of interesting biological information. One of the most important applications is in the field of phylogenetics – figuring out how organisms are related to each other in the evolutionary tree by comparing their genetic codes. Another application is in working out the function of an unknown protein by comparing its sequence to that of known proteins. As Higgins explains, “It is useful to be able to pile sequences on top of each other to look for which regions are conserved and which regions are variable.”
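To make the idea of "piling sequences on top of each other" concrete, here is a minimal Python sketch. It does not perform an alignment and is not how Clustal works; it assumes the sequences are already aligned and simply scores how conserved each column is. The sequences, the function name `column_conservation`, and the scoring rule are all made up for illustration.

```python
from collections import Counter

# Toy, already-aligned protein fragments (invented for illustration).
# '-' marks a gap inserted by an alignment step that is not shown here.
alignment = [
    "MKTAYIAK",
    "MKTA-IAR",
    "MKSAYLAK",
]

def column_conservation(seqs):
    """For each column, return the fraction of sequences that share
    the most common residue (gaps never count as a match)."""
    scores = []
    for column in zip(*seqs):  # walk the "pile" column by column
        residues = [c for c in column if c != "-"]
        top_count = Counter(residues).most_common(1)[0][1] if residues else 0
        scores.append(top_count / len(seqs))
    return scores

scores = column_conservation(alignment)

# Columns where every sequence has the same residue are fully conserved.
conserved = [i for i, s in enumerate(scores) if s == 1.0]
print("Per-column conservation:", [round(s, 2) for s in scores])
print("Fully conserved columns:", conserved)
```

In this toy case, the fully conserved columns are the positions where all three fragments agree, while the remaining columns are the variable regions Higgins describes.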
However, when researchers needed to make such alignments in the 1980s, there were no easily available methods for doing so, and scientists often ended up doing them manually using word processing software – a time-consuming and error-prone process. Towards the end of the decade, quite a few researchers, including Higgins, created and released programs to speed up or automate this process.
“But to use these, you needed a mainframe computer,” explained Higgins. “You had to work in an institute that had one and you had to know how to use it. These were big expensive boxes that required a whole computer lab to run them.”
While EMBL had good mainframe computers, most molecular biologists in the world didn’t have easy access to them. What Higgins wanted most, therefore, was to make multiple sequence alignment work on the IBM PCs and Apple Macintosh computers that most scientists already had on their desks for writing manuscripts. The result was ‘Clustal’ – one of the first multiple sequence alignment programs that didn’t require a mainframe to run.
“I figured out how to make multiple alignments work on these tiny little computers,” said Higgins. “It meant that now anyone could make their own multiple alignments in their offices.”
It was at this stage that he ran into Toby Gibson, a team leader at EMBL who was then a staff scientist in Patrick Argos’s research group. Gibson often had to do multiple sequence alignments for his work, and had been using the manual method up until then. “And I said, ‘I’ve got a program that can do this. Would you like to try it?’” Higgins recalled.
While excited by Higgins’s program, Gibson was nevertheless sceptical about some aspects. And so Julie Thompson, then a programmer working with Gibson and now a senior scientist at the Institute of Genetics and Molecular and Cellular Biology (IGBMC) in Strasbourg, France, took up the task of modernising the Clustal package that Higgins had created, and making it more sensitive and accurate for protein alignments.
“EMBL was a wonderful place to collaborate,” said Higgins. “People were free to take on new collaborations quickly. Also, there were seminars and workshops happening constantly, so you got to meet new people all the time.”
The three scientists continued to meet over coffee or beer in Heidelberg, and the program was finally finished around 1994. The team described the updated software, which they called Clustal W, in a paper published in the journal Nucleic Acids Research. Higgins had moved to EMBL-EBI by then, and this was the very first paper published from that institute.
A revolution in multiple sequence alignments
The success of Clustal W exceeded all expectations. According to a 2014 analysis by Nature, the 1994 paper introducing Clustal W was then the most highly cited bioinformatics paper of all time, and the 10th most cited paper across all scientific fields. At its height, the program was used many thousands of times every day around the world, by everyone from undergraduate students to senior bioinformaticians. It enabled advances in fields as diverse as evolutionary biology, cancer research, and vaccine design.
Thompson later created a graphical user interface for Clustal, making the program even easier to use and accessible to more scientists worldwide. The scientists described this version of the program – Clustal X – in a 1997 paper, which the same Nature analysis found to be the 28th most cited paper across all fields, and the fourth most highly cited bioinformatics paper of all time.
Ease of use was one of the guiding principles for Higgins, Gibson, and Thompson. “When I first made the program, I wanted to ensure that you could use the program without having to read the manual,” said Higgins. “We wanted it to be simple enough that undergraduates could use it in practicals or other scientists could use it without having to be trained in bioinformatics.”
The other major ideal was accessibility. “We never charged for Clustal,” said Higgins. “The concept of open access did not exist in those days, but the software was effectively open access, because it was free to use and we gave away the source code.”
The end of an era
Leaving EMBL in 1996, Higgins moved to University College Cork in Ireland, where he taught biochemistry from 1997 to 2003. He was Professor of Bioinformatics at University College Dublin until his retirement last year. With his retirement, Clustal is no longer in active development, but its last released version – Clustal Omega – continues to be available to the world via EMBL-EBI.
“Omega is the last letter of the Greek alphabet,” said Higgins, adding that many new programs for multiple alignment have come up in recent years, including MAFFT, also hosted by EMBL-EBI. “Life goes on,” he added philosophically.
In recognition of his indisputable contributions to the field of bioinformatics research, Higgins was awarded the 2023 Lennart Philipson Award. The awards will be presented as part of the EMBL World Alumni Day celebration, which will take place at EMBL Heidelberg on 7 July 2023.