Darwin Tree of Life at EMBL-EBI: reaching the first major milestone
Our researchers highlight their contributions to the Darwin Tree of Life project and how new genome annotations help to further biodiversity research
The Darwin Tree of Life (DToL) project is a collaborative effort to sequence, assemble, and annotate the genomes of all 70,000 eukaryotic species found in the UK and Ireland. Researchers at EMBL’s European Bioinformatics Institute (EMBL-EBI) are supporting the project by storing and annotating the sequenced genomes, and making these data openly available through the DToL Data Portal.
Now that the EMBL-EBI DToL researchers have hit a first big milestone in the project – putting together genome annotations for 100 new species – we take a look at the challenges they have faced so far and their future plans within the project.
The DToL Data Portal
The DToL Data Portal serves both the scientific community and the public by showcasing the huge range of data generated by the project. It pulls together the sampling carried out by the different DToL project partners across the UK, all of the genome assemblies produced at the Wellcome Sanger Institute, and the annotation work conducted by the Ensembl team at EMBL-EBI.
The Data Portal itself was developed by Alexey Sokolov’s team at EMBL-EBI. It hosts all of the data from the DToL project and gives users open access to genome assemblies and annotations. The portal also has a tracking feature that lets users follow the sequencing progress of their species of interest, and a phylogeny browser for navigating the tree of life for the species currently in the portal.
“We’re constantly improving the Data Portal so the scientific community can get the most out of the Darwin Tree of Life data,” said Alexey Sokolov, Project Lead at EMBL-EBI. “The Portal currently allows users to track the status of their species of interest and we are working to make this process more detailed. Hopefully, in the future, we will include a sign-up for notifications about the status of particular species.”
The European Nucleotide Archive (ENA) team plays a vital role in making the DToL data open access and freely available to the scientific community. They are also working to ensure the long-term storage of the DToL data in a standardised way. To do this, the ENA is improving metadata standards for biodiversity data such as those generated through the DToL project. This includes making spatio-temporal metadata (where and when a sample was collected) mandatory for all new data submissions, which also enriches the scientific value of the data.
“The ENA is adapting its data submission process to meet the needs of researchers working on global biodiversity projects such as Darwin Tree of Life,” said Josie Burgin, Bioinformatics Project Manager at EMBL-EBI. “Making these data open, findable, and reusable relies on having rich and relevant metadata.”
Genome annotations for biodiversity research
Rapid access to the genome annotations produced from the DToL project will have a huge impact on global biodiversity research by opening new doors for scientists in this field. Lepidoptera – butterflies and moths – and Hymenoptera – bees, wasps, and ants – are among the first groups for which the Ensembl team has completed DToL genome annotations. Furthering our understanding of Hymenoptera genomes could help in the fight to prevent the devastating global decline of wild bee species.
“The annotation work that Ensembl does has scaled up massively to keep up with the data generated from the DToL project,” said Peter Harrison, Genome Analysis Team Leader at EMBL-EBI. “Our first big push was to get the genome annotations for Lepidoptera and Hymenoptera out to help researchers with their global conservation efforts. These were also all annotated in a matter of days which is an incredible turnaround compared to what we were able to do previously.”
The next big DToL challenge faced by the Ensembl team will be the arrival of several new plant species needing genome annotations. Plant genomes are often very different from animal genomes; their genes and introns are, on average, much smaller. This creates problems for Ensembl’s existing pipelines, whose settings are optimised for an expected gene size. Some plant genomes are also gigantic – up to 40 times bigger than the human genome – which makes them tricky to work with and means they need much more data storage.
“Things start to move more quickly once we have pipelines set up to run a particular group of species. It’s initially a very involved process and somebody has to test and check every step to make sure everything looks consistent,” said Fergal Martin, Eukaryotic Annotation Team Leader at EMBL-EBI. “Now we are at a stage where our pipelines will produce good genome annotations in an extremely short timeframe for the species we have cracked. For example Lepidoptera and Hymenoptera; we’ve put together a lot of genome annotations for these species and so it’s much more straightforward to create new genome annotations when more bees or butterflies start to come our way.”
About the Darwin Tree of Life Project
The Darwin Tree of Life Project is an ambitious programme to sequence, assemble, and openly publish the genomes of over 70,000 species of animals, plants, fungi, and protists in Britain and Ireland. The Project contributes to the global mission to sequence all life – the Earth BioGenome Project. The genomic data generated will revolutionise bioscience forever, facilitating research into evolution and biology, conservation of biodiversity, and the development of new biomaterials and pharmaceuticals.
The Darwin Tree of Life Project is being undertaken by a consortium of ten Partners: the Earlham Institute, EMBL’s European Bioinformatics Institute (EMBL-EBI), Marine Biological Association, Natural History Museum, Royal Botanic Garden Edinburgh, Royal Botanic Gardens Kew, University of Cambridge, University of Oxford, University of Edinburgh, and the Wellcome Sanger Institute.