11 June 2018

Tags: core content, digital, guidelines, information architecture, taxonomy

On taxonomy

Indulge me, if you will, with a lengthy zoological introduction.

In the 1700s, the Swedish zoologist Carl Linnaeus devised a system for classifying all the organisms of the Earth. Each species received a unique name comprising two parts, following Latin grammatical construction.

The mice in the EMBL labs? Mus musculus (Latin for “muscular mouse”). The frogs? Xenopus laevis (meaning “strange foot not-heavy”). My personal favourite at EMBL: Ambystoma mexicanum (“Mexican blunt mouth”), the axolotl.

The two names form the smallest grouping of biological classification – the genus and the species – which fit in the larger groupings of Kingdom, Phylum, Class, Order, Family, Genus, Species. (For example, a dog is classified as Animalia, Chordata, Mammalia, Carnivora, Canidae, Canis lupus. I remember my school biology teacher, the indefatigable Miss Walsh, teaching us the mnemonic Kids Prefer Cheese Over Fresh Green Spinach before exams.)

The names can be based on words or names that are not Latin – as long as they follow Latin grammatical constructions – and zoologists often have fun breaking these conventions. (Personal favourites include species of Carabid beetle in the genus Agra: that would be Agra cadabra, Agra phobia, and Agra vation).

*Agra grace*, a species of Carabid beetle. It shares a genus with Agra memnon, presumably the king of all beetles (Image: Wikimedia commons)

In biology, taxonomy uses unique names to build a coherent structure that reflects evolutionary relationships. Names matter.

What is a web taxonomy?

A web taxonomy is the grouping of categories, tags and keywords that describe content on a website. It can include types of content (news, event, podcast), content topics (cancer, genetics, biochemistry), and more specific tags (Krebs cycle, kinase, glial cell).

A taxonomy helps to position content in the correct place, informing site structure, navigation and the structure of URLs. Vocabularies of more descriptive tags can keep a structure more flexible, so that lots of types of content, tagged correctly, can easily be found and displayed together.

Taxonomy is the glue that holds diverse types of content together.

Why does EMBL need a web taxonomy?

At EMBL, we have started a number of projects to help us build better websites. We are improving embl.org and the intranet, and will have numerous inter-related microsites down the line. A robust web taxonomy will play a key role in these projects, for the following reasons:

To help people to find things

Content that is correctly and consistently tagged with a universally-used taxonomy will appear in useful, relevant places on EMBL web domains. The tagging helps to design and refine search algorithms, so the most relevant content will appear when a user searches for a keyword.

To structure content

A taxonomy can help us to find gaps in our content (“Hey! We don’t have a page about the axolotls!”) and fit new articles into an existing structure. It makes it easier to build effective landing pages, or topic pages, by automatically pulling related content from different sources and displaying it in one place.

To structure navigation

One option is to build a website’s structure using its taxonomy as the foundation. The content-management system Drupal, for example, collects categories, tags, or metadata into “vocabularies” that allow the site builder to connect, relate and classify a website’s content.

To personalise or customise web experiences

To personalise an experience, you can configure a Content Management System only to deliver content relevant to a particular user, location, time zone, etc. Alternatively, you can give the user the option to personalise their experience by directly choosing to display only the topics or tags they are most interested in.

The EMBL taxonomy project

I’ve made a start on building a better taxonomy for EMBL. Though this project will start on the web, the intention is to use taxonomy more broadly to describe all of our content across various media.

Using the web crawler tool Screaming Frog (watch a demo here) I collected the top 1000 or so URLs on embl.de and news.embl.de, and split them into their component parts.

As URLs are often built either directly from the titles of pages, or by an editor manually writing the URL structure, I reasoned that words appearing in a page’s URL could be used as a proxy to describe the content of that page.

The idea is to try to categorise these URL fragments in a way that makes sense for EMBL. The exercise should give us insight into how best to tag our content in future.

A taxonomy workshop

At a recent workshop in Heidelberg, we in the Stratcomm team familiarised ourselves with EMBL’s existing taxonomies. EBI has the EDAM ontology, for example, which “provides a set of concepts with preferred terms and synonyms, definitions, and some additional information – organised into a simple and intuitive hierarchy for convenient use.”

The ontology – published online as a dynamic, evolving reference – is a consistent way to describe data, which helps to structure databases and to find data quickly. It’s a useful tool.

Another taxonomy used at EMBL can be found in the back of the Research in brief 2017 document. The matrix on page 34 groups activities at EMBL by research unit and biological theme. As overlapping themes can span different research groups, it has various redundancies built in. And that’s okay.

A practical advantage of this taxonomy is that is has already been signed off by EMBL group leaders, and so we can incorporate its lessons into any new structure.

A card-sorting exercise

Card-sorting is fun! “WTF?” is my favourite category (Image: Cian O’Luanaigh)

At the workshop, we split into teams to sort several hundred URL fragments from the news site and intranet into categories. Each team came up with their own category names, further splitting large categories into sub-categories.

I wanted to see if different teams would come up with similar category headings. How many ways are there to sort the information? As it turned out, the teams agreed on many categories and features, not least the “Kill list” – the fact that not all URL fragments make good tags.

More from the card-sorting exercise (Image: Cian O’Luanaigh)

Below are some examples of categories that came out of the card sort. Note this is not a final structure for our new taxonomy – just a starting point to further refine.

Examples of possible categories for EMBL content

Scientific services

Methods and Techniques

Scientific disciplines

Themes

Genetics
Metabolomics
Biological processes
Biological entities

Events

Internal
External

Learning and Development

Training
Library
GitLab and tools for computing

People

Staff
Students
External to EMBL
Press

Policy

Locations

Sites
Facilities
Laboratories

Departments

Units

IT / computing

Web
IT Tools

Publications

Scientific publications
General interest publications

Human Resources

Services

Outreach

Industry

Funding

[Kill list] – words that do not work as tags or are too specific to be useful in sorting

Several features were common to all three groups:

People – Internal vs external roles

Locations – Sites and Facilities should be separate and reflected in the taxonomy

Science – Methods, Concepts, Processes

Kill list – Words that do not work as tags

Next steps

Content audit

The first task is to do a full audit of EMBL’s websites and create a list of every webpage and what is on it. For each piece of content we have to decide whether to keep it as it is, improve it, or delete it. This will give us a good idea of what we have and what we are missing.

Deploy the taxonomy

We will have to tag all content in our new systems using the new taxonomy. The first step is Bynder, where we will use the new taxonomy to organize images. Then, we will use the same taxonomy for each microsite we create, and eventually apply the new taxonomy to embl.org and new iterations of the intranet.

Curate the tags

I used to work at a magazine, where journalists would tag their articles with relevant keywords. The style and number of keywords used was a matter of personal taste: some would add one or two general categories (“Physics”, “Maths” etc) and others would go to town on detail (“endoplasmic reticulum, ATP kinase” etc). A colleague was tasked with cleaning up the tags list, and, to his horror, found over 1500 different keywords with no clear structure. (Serious offences include “Steven Hawking”, “Steven Hawkins”, “Steve Hawkings” – you’ll know the correct spelling is Stephen Hawking).

Even with a clear taxonomy and structure, it must be curated and looked after, in order to stay useful.

In the coming weeks I’ll be building a website to document our preferred names and synonyms for a taxonomy for EMBL. Stay tuned!

EMBL Communications