Ten years after the end of the 1000 Genomes Project, brand new insights emerge from its sample set, providing a more complete view of human genetic variation than ever before
New analysis of the 1000 Genomes sample set yields brand new insights, providing a more complete view of human genetic variation than ever before. Credit: Daniela Velasco/EMBL
Summary
The 1000 Genomes Project (2007–2015) collected DNA samples from diverse human populations across five continents to analyse human genetic variation worldwide.
Using advanced sequencing technologies, scientists have now mapped genomic variation in over 1,000 individuals from the project, offering new insights into human biology.
In a complementary study, researchers assembled nearly complete genome sequences for 65 individuals, enabling detailed analyses of complex regions such as centromeres.
These new datasets represent one of the most comprehensive overviews of the human genome to date and will enhance our understanding of genetic diversity across populations.
Structural variations mapped through these datasets are a common class of genetic variant and play a major role in many diseases, including cancer. The maps provide a reference that future clinical studies can use to understand what goes wrong under disease conditions.
Completed in 2003, the Human Genome Project gave us the first sequence of the human genome, albeit based on DNA from a small handful of people. Building upon its success, the 1000 Genomes Project was conceived in 2007. The project began with the ambitious aim of sequencing 1,000 human genomes and exceeded it, publishing results gleaned from over 2,500 individuals of varying ancestries in 2015. Together, these projects have contributed to much of our knowledge about the genetics that make us unique and underlie our biology.
Now, 10 years down the road, EMBL scientists and their collaborators have revealed exciting new insights into human biology through deeper analysis of samples from this vast resource, using methods and technologies not available a decade ago. The resulting datasets, shared in two back-to-back publications in the journal Nature, constitute what may be the most complete overview of the human genome to date.
“About 15 years ago, most human genome sequencing relied on ‘reads’ from small stretches of DNA – not enough to piece together a full genome, but sufficient to allow studies of genetic variation in larger parts of the genome,” said Jan Korbel, Group Leader and Interim Head at EMBL Heidelberg, and co-senior author of the new studies. “However, since about five years ago, it has become possible to routinely sequence human genomes with new commercially available technologies that can decode much longer stretches of DNA, allowing us to assemble the full genome of individuals and assess all parts of the genome for genetic variation.”
These technologies are collectively known as long-read sequencing methods, and EMBL scientists have used them to improve our understanding of cancer development and for environmental research. “We wanted to take advantage of the power of these new transformative sequencing techniques to learn more about human genetic variation,” said Korbel.
Genetic variations – differences in DNA sequence between individuals – help make each of us unique and play an important role in health and disease. While such variations can take the form of small differences, e.g. in one or a few ‘letters’ of the genetic code, they can also be much more profound, with entire long stretches of DNA being deleted, inverted, repeated, or added in certain individuals.
It is now known that such ‘structural’ variations are not only common but also play a major role in many genetic diseases, including cancer. ‘Maps’ of such variation across the human genome are also highly relevant clinically, as they serve as a reference to understand what goes wrong under disease conditions.
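The four kinds of structural change described above (a stretch of DNA being deleted, inverted, repeated, or added) can be illustrated on a toy sequence. The following is a minimal sketch for intuition only: the nine-letter reference string, the coordinates, and the function names are all made up for this example, and real structural variants span anywhere from dozens to millions of letters. The inversion here is modelled as a reverse complement, which is the usual convention for double-stranded DNA.

```python
# Toy illustration of structural variant types on a short DNA string.
# All names and coordinates are hypothetical; positions are 0-based, end-exclusive.

COMPLEMENT = str.maketrans("ACGT", "TGCA")


def delete(seq: str, start: int, end: int) -> str:
    """Remove the stretch seq[start:end]."""
    return seq[:start] + seq[end:]


def invert(seq: str, start: int, end: int) -> str:
    """Replace seq[start:end] with its reverse complement."""
    flipped = seq[start:end].translate(COMPLEMENT)[::-1]
    return seq[:start] + flipped + seq[end:]


def duplicate(seq: str, start: int, end: int) -> str:
    """Repeat the stretch seq[start:end] directly after itself."""
    return seq[:end] + seq[start:end] + seq[end:]


def insert(seq: str, pos: int, new: str) -> str:
    """Add a new stretch of DNA at position pos."""
    return seq[:pos] + new + seq[pos:]


reference = "AAACCGTTT"

print(delete(reference, 3, 6))          # AAATTT
print(invert(reference, 3, 6))          # AAACGGTTT
print(duplicate(reference, 3, 6))       # AAACCGCCGTTT
print(insert(reference, 3, "GATTACA"))  # AAAGATTACACCGTTT
```

Each call changes a whole block of letters at once, which is what sets structural variants apart from single-letter differences.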
The two new studies use long-read sequencing technologies to dive deeper into such structural variations across the genome. For both studies, the Korbel Group teamed up with the lab of Tobias Marschall at Heinrich Heine University Düsseldorf, Germany, which is composed of experts in genome data science.
Enhancing the human pangenome
The first study looks at 1,019 genomes from the 1000 Genomes Project dataset, spread across 26 populations from five continents. Using long-read sequencing methods and teaming up with Siegfried Schloissnig from the Institute of Molecular Pathology (IMP) Vienna, Austria, the researchers created detailed maps of structural variations across the genomes of these individuals. Besides generating new biological knowledge, the new data allowed them to expand the 44-genome reference graph published by the Human Pangenome Reference Project in 2023 more than twentyfold.
For this study, the researchers also collaborated with Ewan Birney’s team and Sarah Hunt at EMBL-EBI, as well as Bernardo Rodríguez-Martín from the Centre for Genomic Regulation (CRG), Spain, among others.
“The original 1000 Genomes Project created a map of genome locations that are variable in the human population, and this enabled us to systematically search for regions associated with common diseases,” said Hunt. “That first map was built from short variants, but we already know of cases where longer variants are associated with disease. The new map from this study is more precise and deeper than other structural variant maps created so far and will enable us to seek new disease links.”
The second study uses a much smaller sample set of only 65 individuals but combines several powerful sequencing methods to put together genomes that are more complete than any sequenced before. For several chromosomes, the researchers assembled end-to-end sequences, a remarkable feat considering that human chromosomes can be hundreds of millions of base pairs (i.e. ‘letters’) long. This study was carried out in collaboration with researchers from several leading US institutes, who together formed part of the Human Genome Structural Variation Consortium.
“The Human Genome Structural Variation Consortium brings together people who are experts in different techniques and genomic areas and shows the power of international collaboration to drive discovery,” said Hunt. “This work reveals new biological insights by shining a light on parts of the genome we could not previously see and has created a toolkit for the analysis of further genomes.”
Korbel believes the studies strongly complement each other. “One study uses less sequencing power, but a much larger cohort. The other uses a smaller cohort, but much more sequencing power per sample. This led to complementary conclusions,” he said.
Such complete datasets have tremendous clinical relevance, since they serve as references against which genetic variations in disease can be identified and checked. In an additional experiment, the researchers showed that using the larger dataset of 1,019 genomes as a reference significantly improved the accuracy of identifying disease-associated variants compared to previous methods.
The datasets also yielded interesting new biological insights. For example, the study with 1,019 samples helped elucidate a new mechanism by which transposons – sometimes called ‘jumping genes’ – can help move stretches of DNA to new locations within the genome, giving rise to new variants. The 65-genome dataset, on the other hand, helped scientists understand certain sections of the genome that are very difficult to study using traditional methods, such as centromeres. Centromeres are the spots where two strands of the chromosome attach to each other when cells divide (forming the well-known X-shape), and disruptions in them have been linked to many disorders, including immune disorders and cancer.
“These two studies underscore the crucial role of repetitive DNA in shaping the human genome, uncovering a reservoir of genetic variation within regions that were largely missed in previous reference datasets due to their repetitive and complex nature,” said Bernardo Rodríguez-Martín, former member of the Korbel group, now Group Leader at the CRG and co-senior author of one of the studies.
A new resource for genome biologists
The new datasets have been made publicly available to researchers worldwide to analyse and use. The studies also spurred innovation in the form of new genomic analysis methods, which the scientists created to analyse data at a scale much greater than previous studies had attempted.
“Through these studies, we have created a comprehensive and medically-relevant resource that can now be used by researchers everywhere to better understand the origins of human genomic variation, and see how it is affected by a plethora of different factors,” said Tobias Marschall, Professor at Heinrich Heine University Düsseldorf and co-senior author on the two studies. “This is a great example of collaborative research opening up new vistas in genomic science and a step towards a more complete human pangenome.”