The Pan-Cancer Analysis of Whole Genomes (PCAWG) project is starting to use an innovative, high-performance computational infrastructure, kindly donated by Fujitsu and Intel to facilitate the analysis of cancer genomes.
The PCAWG project brings together whole-genome sequencing data from the International Cancer Genome Consortium (ICGC) and The Cancer Genome Atlas (TCGA) projects, describing over 2000 tumour and matched control samples that cover more than 30 cancer entities. The combined dataset requires one petabyte (10^15 bytes) of storage – about the size of a 2000-year-long mp3 playlist. The new technology will make it easier for researchers to access and analyse these data.
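The playlist comparison can be checked with a little arithmetic. The sketch below is illustrative only: it assumes a constant mp3 bitrate of 128 kbit/s and a year of 365.25 days, neither of which is stated in the article, and the function name is hypothetical.

```python
# Rough check of the "petabyte as an mp3 playlist" comparison.
# Assumptions (not from the article): 128 kbit/s mp3 encoding, 365.25-day year.

PETABYTE_BYTES = 10**15
SECONDS_PER_YEAR = 365.25 * 24 * 3600
ASSUMED_BITRATE_KBPS = 128


def playlist_years(total_bytes: float, bitrate_kbps: float) -> float:
    """Return how many years of continuous audio fit into total_bytes."""
    bytes_per_second = bitrate_kbps * 1000 / 8  # kilobits/s -> bytes/s
    playing_time_seconds = total_bytes / bytes_per_second
    return playing_time_seconds / SECONDS_PER_YEAR


years = playlist_years(PETABYTE_BYTES, ASSUMED_BITRATE_KBPS)
print(f"One petabyte holds roughly {years:.0f} years of audio")
```

Under these assumptions the result lands close to the 2000 years quoted above; a higher assumed bitrate would shorten the playlist proportionally.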
Cancer research is generating ever-larger volumes of data, and efforts like the 100,000 Genomes Project in the UK will produce staggering quantities of information that will require analysis. Storing and sharing such ‘big data’ efficiently is a major challenge for everyone involved. Centralised databases make this crucial information available, but downloading large datasets and performing integrative analyses require more technical infrastructure than most research institutions can afford. The Fujitsu and Intel infrastructure will help biologists and bioinformaticians at the German Cancer Research Centre (DKFZ) and EMBL analyse and share a wealth of cancer genomic data more efficiently.
“We need to change how scientists access and analyse data. The data cannot be transferred to the analyst anymore: it is the analyst who needs to be able to come to the data,” explains Jan Korbel, who leads the project from EMBL Heidelberg.
PCAWG brings together the research expertise of several academic institutions, the computational infrastructure capacities of Fujitsu, the large-scale network expertise of Intel and SAP, and the software engineering expertise of biobyte solutions GmbH to build this high-performance, practical and efficient computing system. After the pilot phase, the infrastructure should become available to scientists from other institutions, allowing them to work on this large dataset remotely.
The project is unfolding in three phases. During the first phase, the genomic data were uploaded to seven academic computer centres worldwide, each holding a subset of the data. These subsets will be shared using the EMBL-EBI Embassy Cloud and high-performance computing centres at the University of Chicago, the Electronics and Telecommunications Research Institute in Seoul, the University of California in Santa Cruz, the University of Tokyo and the DKFZ. The Fujitsu PRIMERGY HPC cluster was added to the existing Heidelberg Center for Personalized Oncology at DKFZ (DKFZ-HIPO), almost doubling its compute capacity.
Now, in the second phase, these ‘academic community clouds’ are running three computational pipelines on the individual subsets to identify genetic variants, including cancer-specific mutations. This work is being performed in a standardised, consistent manner across all the samples. In the final phase, more than 700 scientists worldwide will access the dataset remotely to perform biomedical analyses pertinent to their own research interests.
Studying ‘big data’ in the life sciences poses a big challenge to organisations of all sizes, which often lack adequate infrastructure. EMBL-EBI’s Embassy Cloud is a shared, secure, high-performance workspace providing direct access to privately hosted datasets, public and managed-access datasets and the institute’s powerful computing resources. To learn more about this cloud service, contact firstname.lastname@example.org
“The Fujitsu-Intel infrastructure will significantly accelerate crucial computational analyses,” explains Roland Eils, who leads the project for the DKFZ. “We expect this will further our knowledge on the onset of mutations in cancer genomes, improve our understanding of how they drive tumour development, and reveal yet unknown vulnerabilities of the tumour cells that can potentially be exploited in novel treatment strategies.”