Everyone’s genome is different, and such variations can often drive disease. Copy number variation (CNV) occurs when the number of copies of a particular gene or genomic element varies from one individual to the next. These genetic variations can lead to many complex diseases, including neurological disorders such as schizophrenia and autism.

Researchers at EMBL’s European Bioinformatics Institute (EMBL-EBI) have developed a novel method that uses genome-wide sequencing data to analyse CNVs and the impact that they can have on common and complex traits.

If you think about the human genome as a book that uses only four letters, single letter changes – spelling errors – commonly occur. It’s also possible for whole sections of the book to be removed or duplicated, giving you many copies of the same page or paragraph, sometimes inserted into random parts of the book. If you think of these sections as regions of the genome, this is CNV.

The bioinformatic analysis of CNV in the genome can be tricky due to noise and variation commonly found across genome sequencing datasets. In their research published in the journal Cell Genomics, the scientists describe a new method – CNest – for the analysis of CNV, available as open source software.

The researchers also demonstrated, for the first time, the potential of the CNest method for performing systematic, large-scale CNV genome-wide association studies (CNwas). Using data from the UK Biobank, they were able to identify many novel CNVs associated with common traits and human disease risk.

Tackling genomic waves 

A challenge for large scale CNV discovery is the variability you get when analysing genomic sequencing data. This is often caused by the experimental techniques used for collecting DNA samples. This variation gives rise to noise across the genome often referred to as ‘genomic waves’. CNest aims to tackle this issue by using a large-scale normalisation strategy for CNV data analysis and a straightforward linear model for genome wide discovery.
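
As a loose illustration of the two ideas named above (normalising read depth across many samples, then fitting a simple linear model), here is a toy Python sketch. It is not CNest's implementation: the function names, the median-based normalisation and the numbers are all assumptions made for illustration only.

```python
from statistics import mean, median


def normalise_depth(depth):
    """Toy normalisation (an assumption, not CNest's actual procedure).

    depth: one list per sample, holding read depth for each genomic region.
    Returns each sample's depth relative to the cohort median per region.
    """
    # Step 1: scale each sample by its own median depth, so that samples
    # sequenced more deeply do not look like they carry extra copies.
    scaled = [[d / median(row) for d in row] for row in depth]
    # Step 2: divide each region by its median across all samples, which
    # removes region-specific bias shared by the whole cohort.
    n_regions = len(depth[0])
    region_medians = [median(row[j] for row in scaled) for j in range(n_regions)]
    return [[row[j] / region_medians[j] for j in range(n_regions)] for row in scaled]


def ols_slope(x, y):
    """Slope of an ordinary least-squares line y = a + b * x."""
    mx, my = mean(x), mean(y)
    sxx = sum((xi - mx) ** 2 for xi in x)
    sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    return sxy / sxx


# Hypothetical data: 4 samples x 3 regions; the third sample has extra
# depth in the second region, mimicking a duplication.
depth = [
    [100, 200, 50],
    [200, 400, 100],
    [100, 300, 50],
    [50, 100, 25],
]
trait = [1.0, 1.2, 2.1, 0.9]  # made-up quantitative trait values

ratios = normalise_depth(depth)
region_signal = [row[1] for row in ratios]  # copy ratio at the second region
slope = ols_slope(region_signal, trait)
print(ratios[2][1], slope)
```

In this toy cohort the third sample stands out with a copy ratio of 1.5 at the second region, and the positive slope reflects that the made-up trait is higher in that sample. A real analysis would involve far more samples, covariates and proper significance testing.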

“Analysing genomic sequencing data is never smooth due to differences from genome to genome or genomic waves. This means researchers struggle to make statistical associations for CNV data,” said Ewan Birney, Deputy Director General of EMBL and Director of EMBL-EBI. “CNest is, from our perspective, the first robust whole genome CNV association method that works genome-wide. We were able to develop a way to deal with genomic waves as well as pull CNV data alongside SNP variation data into one framework.”

UK Biobank genome wide association 

As a proof of concept, the researchers tested the capability of CNest using genome wide association studies (GWAS) found in the GWAS Catalog to identify genomic variants statistically associated with a risk for a disease. They ran CNest and CNwas over 200,000 exomes within UK Biobank data covering 78 human traits. This allowed them to identify over 800 genetic associations that are likely to contribute to different diseases as a result of CNV. 

They also investigated the correlation between single nucleotide polymorphisms (SNPs) – a common form of genetic variation – and disease-associated CNVs. 

“It’s well known that SNPs are commonly associated with CNVs; where you find CNV in the genome you frequently find a SNP marking its location,” said Tomas Fitzgerald, Research Scientist at EMBL-EBI. “By comparing CNV and SNP association signals across the same traits and samples in the UK Biobank data we were able to classify these SNP-CNV associations into different categories. It is exciting to see that many of the CNV associations we make have complementary evidence from standard SNP GWAS approaches on the same samples and traits.” 

“Encouragingly, we were able to detect a substantial number of novel CNV associations that could not be detected using SNPs and a further category of association where it was possible to better define the underlying function change by linking SNPs to CNVs,” added Fitzgerald. “The full integration and joint modelling of SNPs and CNVs for human trait association studies is an exciting area for further research and we are excited to see how this field develops in the near future, and hope to see CNwas approaches become mainstream during genome wide association testing projects.”

This research has broad implications for the genetics community: it shows that CNwas can be rolled out across multiple cohorts and traits to find novel genetic associations, which in some cases can have direct relevance to healthcare. Bringing CNwas into a framework similar to those most often used in SNP GWAS has great potential not only for uncovering novel associations but also for improving the prediction of how genetic variation modifies the risk of human disease.

CNest is available as open source software. It is easy to run across different infrastructures and conforms to the GA4GH standards for responsible genomic data sharing.

This study, along with other work across EMBL to produce open source software that benefits other researchers, is a vital pillar of the Data Sciences Plans within EMBL’s Molecules to Ecosystems Programme. These plans aim to ensure that data generated as part of the programme are expertly curated, managed, and shared with researchers everywhere.

By furthering our understanding of human diseases, this work also sits within the Human Ecosystems theme of the Programme that aims to take advantage of rapidly expanding human datasets to explore human phenotypes.
