Piecing together the best reference genome
Researchers reveal the best technology for assembling reference genomes
Researchers in the international Genome 10K (G10K) Consortium, including those at The Rockefeller University, the Wellcome Sanger Institute, the University of Cambridge, and EMBL’s European Bioinformatics Institute (EMBL-EBI), have established the most accurate way to assemble reference genomes to date.
To create a reference genome, researchers first sequence DNA from the organism of interest in short pieces. A big challenge in genome assembly is piecing these sequence fragments back together correctly. The reassembled sequence is the reference genome, which can be used to answer fundamental questions about biology, disease, and biodiversity.
This research, published in the journal Nature, evaluates the most accurate method to date for assembling reference genomes. The study uses examples from 16 vertebrate species to show that modern long-read sequencing methods are crucial for maximising genome quality. Using these new methods, the researchers have also been able to correct substantial errors within some well-established reference genomes.
Access new genomes
The Vertebrate Genomes Project (VGP), part of G10K, is an international effort to generate reference genomes for all 70,000 vertebrate species. As part of the project, EMBL-EBI’s Ensembl team are helping to generate annotations for these new genomes, and will make the data freely available to scientists through the Ensembl VGP page and the Ensembl genome browser.
“We want to get these new high-quality genome annotations out there for the entire scientific community to use,” says Paul Flicek, Associate Director and Head of Genes, Genomes and Variation Services at EMBL-EBI. “These genomes are a starting point for many new discoveries. They can be used to answer fundamental questions in biology and disease, to identify species at risk of extinction, and will ultimately help preserve genetic information about life on Earth.”
Accurate assemblies require long reads
“When I was asked to take on leadership of G10K in 2015, I emphasised the need to bring on more partners and to work on approaches that produce the highest-quality data possible. It was taking months per gene for my students and postdocs to correct gene structure in assembled genome sequences for their scientific experiments, and was causing errors in our biological studies,” says Erich Jarvis, lead of the VGP sequencing hub at The Rockefeller University, Chair of G10K, and a Howard Hughes Medical Institute Investigator. “For me this was not only a practical mission, but a moral imperative.”
Making a genome assembly is much like solving a puzzle: lots of DNA sequences are pieced together to create a reference genome. Also like a puzzle, the larger – or in this case longer – the pieces are, the easier it is to solve. This study confirms that long-read sequencing technologies are essential for creating the most accurate reference genomes possible.
“Previous methods of making a genome assembly involved chopping the genome into tiny bits, sequencing these, and then trying to reassemble the pieces. The problem is that these very small bits are often difficult to accurately reconstruct, partly because genomes are very repetitive,” says Fergal Martin, Vertebrate Annotation Coordinator at EMBL-EBI. “Using cutting-edge genomics, in particular advances in long-read sequencing and genome assembly workflows, we can get past many of these problems and create a reference genome more true to the biological reality.”
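For readers who want to see the repeat problem in miniature, here is a toy sketch in Python. It is not the VGP assembly pipeline or any method from the study: the sequence, the names and the read lengths are all invented for illustration, and reads are assumed to be error-free. It builds a 40-base "genome" containing the same 8-base stretch twice, then asks what fraction of reads of a given length could have come from more than one place.

```python
from collections import Counter


def ambiguous_read_fraction(genome: str, read_len: int) -> float:
    """Fraction of error-free reads of length read_len whose sequence
    occurs at more than one position in the genome."""
    # Take one read starting at every possible position.
    reads = [genome[i:i + read_len] for i in range(len(genome) - read_len + 1)]
    counts = Counter(reads)
    # A read is ambiguous if an identical read starts somewhere else too.
    ambiguous = sum(1 for read in reads if counts[read] > 1)
    return ambiguous / len(reads)


# Hypothetical toy sequence: three unique stretches separated by
# two copies of the same 8-base repeat.
LEFT = "ATCGGCTA"
REPEAT = "TTAGGCAT"
MIDDLE = "CCGATAGC"
RIGHT = "GACTTGCA"
TOY_GENOME = LEFT + REPEAT + MIDDLE + REPEAT + RIGHT

short_fraction = ambiguous_read_fraction(TOY_GENOME, 4)   # shorter than the repeat
long_fraction = ambiguous_read_fraction(TOY_GENOME, 12)   # longer than the repeat

print(f"4-base reads that fit in more than one place:  {short_fraction:.0%}")
print(f"12-base reads that fit in more than one place: {long_fraction:.0%}")
```

In this toy case, some of the 4-base reads fall entirely inside the repeat and cannot be placed with confidence, while every 12-base read spans the repeat into unique sequence on at least one side and so has a single possible position. Real genomes contain repeats thousands of bases long, which is why read length matters so much in practice.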
The VGP is one of many large-scale projects producing reference genome assemblies that meet a specific minimum quality standard. Others include the Human Pangenome Project and the Darwin Tree of Life.
This post was originally published on EMBL-EBI News