The genome in the cloud
Since the completion of the Human Genome Project in 2001, technological advances have made sequencing genomes much easier, quicker and cheaper, fuelling an explosion in sequencing projects. Today, genomics is well into the era of ‘big data’, with genomics datasets often containing hundreds of terabytes (10¹⁴ bytes) of information.
The rise of big genomic data offers many scientific opportunities, but also creates new problems, as Jan Korbel, Group Leader in the Genome Biology Unit at EMBL Heidelberg, describes in a new commentary paper authored with an international team of scientists and published today in Nature.
Korbel’s research focuses on genetic variation, especially genetic changes leading to cancer, and relies on computational and experimental techniques. While the majority of current cancer genetic studies assess the 1% of the genome comprising genes, a main research interest of the Korbel group is in studying genetic alterations within ‘intergenic’ regions that drive cancer. As this approach looks at much more of the genome than gene-focused studies, it requires analysis of larger amounts of data. This challenge is exemplified by the Pan-Cancer Analysis of Whole Genomes (PCAWG) project, co-led by Korbel, which brings together nearly 1 petabyte (10¹⁵ bytes) of genome sequencing data from more than 2000 cancer patients.
The problem is not a shortage of data but accessing and analysing it. Genome datasets from cancer patients are typically stored in so-called ‘controlled access’ data archives, such as the European Genome-phenome Archive (EGA). These repositories, however, are ‘static’, says Korbel, meaning that the datasets need to be downloaded to a researcher’s institution before they can be further analysed or integrated with other types of data to address biomedically relevant research questions. “With massive datasets, this can take many months and may be unfeasible altogether depending on the institution’s network bandwidth and computational processing capacities,” says Korbel. “It’s a severe limitation for cancer research, blocking scientists from replicating and building on prior work.”
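To put the "many months" in perspective, here is a rough back-of-the-envelope sketch of how long a PCAWG-sized download would take. The link speeds are hypothetical values chosen for illustration, not figures from the commentary, and the calculation assumes the link runs at full speed the whole time, which real transfers rarely achieve.

```python
PETABYTE_BYTES = 10**15  # "nearly 1 petabyte", the PCAWG figure quoted in the article


def transfer_days(size_bytes: int, link_bits_per_second: float) -> float:
    """Days needed to move size_bytes over a link at full, sustained speed."""
    seconds = size_bytes * 8 / link_bits_per_second
    return seconds / 86_400  # seconds in a day


# Hypothetical institutional link speeds, for illustration only.
for label, bps in [("100 Mbit/s", 100e6), ("1 Gbit/s", 1e9), ("10 Gbit/s", 10e9)]:
    print(f"{label:>10}: {transfer_days(PETABYTE_BYTES, bps):7.1f} days")
```

Even under these idealised assumptions, a dedicated 1 Gbit/s connection would need about three months to pull down one petabyte, before any analysis could begin.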
With data stored in one of the various commercial cloud services on offer from companies such as Amazon Web Services, or on academic community clouds, researchers can analyse vast datasets without first downloading them to their institutions, saving time and money that would otherwise need to be spent on maintaining them locally. Cloud computing also allows researchers to draw on the processing power of distributed computers to significantly speed up analysis without purchasing new equipment for computationally laborious tasks. A large portion of the data from PCAWG, for example, will be analysed through cloud computing using both academic community and commercial cloud providers, thanks to new computational frameworks currently being built.
One concern about using cloud computing revolves around the privacy of people who have supplied genetic samples for studies. However, cloud services are now typically as secure as regular institutional data centres, which has diminished this worry: earlier this year, the US National Institutes of Health lifted a 2007 ban on uploading their genomic data into cloud storage. Korbel predicts that the coming months and years will see a big upswing in the use of cloud computing for genomics research, with academic cloud services, such as the EMBL-EBI Embassy Cloud, and commercial cloud providers including Amazon becoming a crucial component of the infrastructure for pursuing research in human genetics.
Yet there remain issues to resolve. One is who should pay for cloud services. Korbel and colleagues urge funding agencies to take on this responsibility given the central role cloud services are predicted to play in future research. Another issue relates to the differing privacy, ethical and normative policies and regulations in Europe, the US, and elsewhere. Some European countries may prefer that patient data remain within their jurisdiction so that they fall under European privacy laws, and not US laws, which apply once a US-based cloud provider is used. Normative and bioethical aspects of patient genome analysis, including in the context of cloud computing, are another specific focus of Korbel’s research, which is being pursued via an inter-disciplinary collaboration with Fruzsina Molnár-Gábor from Heidelberg University faculty of law in a project funded by the Heidelberg Academy of Sciences and Humanities.
Achieving wide acceptance of genomic cloud computing in Europe may require the deployment of European-wide as well as more regional cloud services, both academic and commercial. “The establishment of novel powerful cloud computing frameworks enabling us to store, share and analyse data across borders will open up new avenues in cancer research,” says Korbel. “These new initiatives will factor in developments in science and policy for the distribution and sharing of patients’ sensitive genetic data, ensuring a safe environment to serve the interests of both sample donors and researchers.”
- Press release from the Ontario Institute for Cancer Research
- Nature editorial – Cloud cover, Nature, 8 July 2015
- Related Nature article – European labs set sights on continent-wide computing cloud, Nature, 8 July 2015
- PCAWG and cloud computing – The sky's the limit, EMBLetc, 3 July 2015
- The Cloud, dbGaP and the NIH – NIH Data Science blog, 27 March 2015
- European Genome-phenome Archive