Ensembl has incorporated a vast amount of knowledge into a fully annotated reference human genome, GRCh38. This work builds on the release of a new assembly by the Genome Reference Consortium, and provides a solid foundation for future genomics research.
Every day, new discoveries in genetics are added to the reference genomes that underpin research into disease and evolution. To provide the most accurate reference for human genomics, the Genome Reference Consortium (GRC) releases a new assembly every few years, gathering its ongoing, careful efforts to fix errors and add missing sequence into one major release. Once the new reference assembly is available, Ensembl bioinformaticians add the latest genomic data and analyses, giving the community a comprehensive picture that annotates essentially every understandable feature in the genome.
“When the reference human genome is used in clinical or diagnostic settings, the accuracy of the assembly will be paramount,” says Paul Flicek, Head of Vertebrate Genomics at EMBL-EBI. “All across Europe and in other parts of the world, genomics is moving into the clinic, including in the UK, which has recently committed to spending over £100 million. These efforts need a strong foundation that is as accurate as possible: the reference genome assembly, and the annotation that makes it useful.”
It is no small matter for genomics researchers to shift their reference from one genome assembly to another: the sequence alone is around 3 billion base pairs long (the contiguous length is 3.4 Gb), and there is a vast amount of data describing each region. The new assembly in Ensembl marks the first step in a shift for research consortia such as Blueprint and the 1000 Genomes Project, which will have easier access to the most up-to-date genomic information.
Updates to the human genome are driven by new discoveries that improve our understanding of difficult regions (for example, those with large gaps) and of clinically relevant regions, such as the paralogs of SRGAP2, a gene important in cortex development.
To update information on the new reference human genome, the Ensembl team started with nearly half a million proteins and over 200,000 cDNAs aligned to the new assembly, then filtered them to produce a much smaller set that could be used to predict genes. Ensembl’s evidence-based computational annotation was combined with manual gene annotation from the Wellcome Trust Sanger Institute’s HAVANA team to produce the GENCODE version 20 gene set with thousands of genes – and that was just one part of the work. The new Ensembl release incorporates a host of new data used to identify gene-regulation sequences, updated gene expression and variation data, and new data types such as models of centromere sequences.
In addition to making the resource more useful for clinical research, the Ensembl team has updated all whole-genome alignments with other species, so researchers can gain a better understanding of how different features evolved. They also used the new human gene set to generate orthologs to all other available species in Ensembl, making the exploration of gene relationships across species more accurate.
Ensembl now offers several improved software tools. The new BLAST interface makes it easy to track genome alignment jobs that are running concurrently, and the ever-useful Variant Effect Predictor (VEP) determines effects of variants on genes, transcripts, proteins, regulatory regions and phenotypes.
“We’ve taken the latest human genome sequence and run it throughout our analyses to provide the highest possible quality data,” says Ensembl’s Bronwen Aken. “We’ve aimed for the most correct gene set that relies on the underlying assembly from the GRC, and are proud to have delivered a resource that will substantially improve genomics research in the coming years.”