Multimodal Open Data Integration Support

At your side to solve your daily data management and data analysis challenges

Data Analysis

While Galaxy is the most accessible and efficient option for standard analyses, more advanced statistical modeling or visualization usually requires specialized code, which can be written and executed in R on our RStudio Server instance.

Workflow modeling with Galaxy and other Workflow Management Systems (WMS), which improves analysis automation and reproducibility, is also within our area of expertise. We can provide advice and support to researchers at every level, from beginner to advanced.

Galaxy
RStudio on JupyterHub
WMS & Support
Super Computer & Software

Galaxy is a web app for performing reproducible data analyses in a user-friendly graphical interface.

Accessibility

Everyone at EMBL has access to our Galaxy instance. Login happens with your EMBL credentials.
Galaxy can be used by anyone, but it is a particular asset for bench scientists with little computing experience, as it makes it easy to run the most commonly used bioinformatics software.

Bioinformatics software

A variety of bioinformatics tools are available in a few clicks, including the most widely used NGS, proteomics, and image data software. More tools can be deployed on demand: do not hesitate to contact us if you need one.

Computing is performed on our HPC cluster

Resource-intensive jobs launched with Galaxy are automatically executed on EMBL’s high-performance computing infrastructure (maintained by ITS). This means no additional hassle for researchers who have never used an HPC cluster before.

Data access with FAIR principles

A quota of 200 GB is allocated to each user. We encourage users to download useful analysis results to their group share as they are produced, and we expect users to clean and purge unneeded data from their history to recover disk space.

Galaxy is not to be used as storage, and we cannot guarantee that data will be kept long-term.

Upload data

This is the quickest option, but we do not recommend it for large files or for files that are already stored on your group share, as it counts against your quota unnecessarily and may duplicate data.

Access your group share data

The data available on your group share at /g/<groupname> can be linked directly to your Galaxy data library.
This has to be done by an admin. To do so, please open a request with us, including the list of files that need to be made available. This avoids unnecessary data duplication, which saves your group's resources, such as disk space.
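
To prepare the list of files for such a request, something like the following could be used. This is only a sketch: the group name `mygroup`, the folder `project_x`, the `*.fastq.gz` pattern, and the output file name are hypothetical placeholders, not paths documented on this page.

```shell
# list_share_files DIR PATTERN
# Print the sorted paths of regular files under DIR whose names match PATTERN.
list_share_files() {
    find "$1" -type f -name "$2" | sort
}

# Hypothetical example: replace the group name, folder and pattern with your own.
SHARE_DIR="/g/mygroup/project_x"
if [ -d "$SHARE_DIR" ]; then
    list_share_files "$SHARE_DIR" '*.fastq.gz' > files_for_galaxy.txt
    echo "Wrote $(wc -l < files_for_galaxy.txt) paths to files_for_galaxy.txt"
fi
```

The resulting text file can then be attached to the request.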

From and to LabID

Connections have been established between our Galaxy instance and LabID, our data management platform. Datasets can be transferred from LabID to Galaxy in a few clicks (and without data duplication). Sending data back from Galaxy to LabID is currently in beta testing. This makes it possible to store Galaxy analysis results permanently, reference them in lab notes, and link them to samples, annotations, protocols, reagents, etc.

Get Support

Galaxy@EMBL

Local Galaxy instance

Galaxy Training Network

Collection of tutorials developed and maintained by the worldwide Galaxy community

Galaxy Chat

Internal chatroom for our Galaxy users, to get advice and troubleshoot issues

RStudio – sometimes now referred to as Posit™ Workbench – is a powerful Integrated Development Environment for R, the go-to programming language for bioinformaticians and statisticians aiming to extract valuable information from experimental data.

Accessibility

Everyone at EMBL can access RStudio via JupyterHub (at https://jupyterhub.embl.de). Login happens with your EMBL credentials.

Get Support

RStudio on JupyterHub @EMBL

Local RStudio instance

RStudio Server Chat

Internal chatroom for our RStudio users, to get advice and troubleshoot issues

To achieve better automation and reproducibility of analyses, we strongly encourage the use of analysis workflows and Workflow Management Systems (WMS).

Galaxy

Next Generation Sequencing (NGS) data analysis

We assist less computer-savvy colleagues with their standard NGS data analysis (RNA-seq, ChIP-seq, ATAC-seq, HiC, scRNA-seq…) by providing ready-to-use Galaxy workflows.
Non-standard analysis workflows have to be developed by you. Nevertheless, we can teach you the basics of Galaxy so that you can assemble your own workflow in no time.

Our expertise in domains other than NGS is limited, but we can still help you assemble your own workflow.

Training

MODIS regularly provides internal training, and the Galaxy Training Network provides online material for self-paced learning. This covers a wide range of domains, including sequencing, microscopy, proteomics, metabolomics, etc.

Command-line based WMS

For bioinformaticians proficient with command line tools, we advise looking into command-line based WMS. The most commonly used at EMBL are Nextflow and Snakemake*.

(*) We cannot recommend one WMS over another. Snakemake and Nextflow are both powerful tools, and other WMS exist. Picking the right tool is a hot topic in life sciences: many aspects have to be considered, and the choice is ultimately up to you. That said, we at MODIS have more expertise with Nextflow.

(GB Unit) Custom analysis & long-term collaboration

When your group is part of the GB Unit, we can provide further support and collaborate on workflow development. This can mean, for example, developing a custom Galaxy or Nextflow workflow for you. It can also mean developing a Nextflow workflow together with bioinformaticians in your group, in order to teach them best practices for software development with git and for modular workflow development.

Get Support

We maintain a super computer named Seneca, which we use to run RStudio Server. It can be accessed via ssh and is connected to your group share. It can be used to run basic Unix commands and lightweight processing.

Specifications

Dell PowerEdge R7425
64 cores capable of 128 concurrent threads (2x AMD EPYC 7601, 2.20 GHz/2.7 GHz, 32C/64T, 64M cache, 180 W, DDR4-2666)
2 TB RAM (32x 64 GB LRDIMM, 2666 MT/s)
3.2 TB local flash storage (/tmpdata)

Accessibility

Everyone at EMBL has access to Seneca. Login happens remotely via ssh to seneca.embl.de (when connected to the EMBL network).

Cluster access

Seneca is configured as a SLURM submit host and can therefore be used to submit cluster jobs, in the same way as login01.cluster.embl.de or login02.cluster.embl.de. Find more information on the ITS Cluster Wiki.
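
As an illustration, submitting a small job from a submit host could look like the sketch below. The job name, file names, and resource values are assumptions chosen for the example, not ITS recommendations, and your cluster may require further options (such as a partition), so check the ITS Cluster Wiki before relying on it.

```shell
# Meant to be run on a submit host, e.g. after: ssh <username>@seneca.embl.de
# Write a minimal SLURM batch script. All values below are illustrative.
cat > hello_job.sh <<'EOF'
#!/bin/bash
#SBATCH --job-name=hello
#SBATCH --time=00:05:00
#SBATCH --mem=1G
#SBATCH --cpus-per-task=1
#SBATCH --output=hello_%j.out

echo "Running on $(hostname)"
EOF

# Submit only where the SLURM client tools are installed.
if command -v sbatch >/dev/null 2>&1; then
    sbatch hello_job.sh
    squeue -u "$USER"   # list your pending and running jobs
else
    echo "sbatch not found: run this on a SLURM submit host" >&2
fi
```

In the `--output` file name, `%j` is replaced by the job ID, so each submission writes to its own log file.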

Software

Most software and software versions are managed with EasyBuild, the software framework used and maintained by ITS. Software is compiled specifically for the platform it runs on and is therefore optimised. A specific version of a piece of software, compiled with a specific toolchain, is referred to as an environment module. Users load modules into their own environment on demand with the module command. Loading a module also loads all the software dependencies it needs.

Modules basics

EasyBuild builds software modules. The module command-line tool (we use Lmod) lets you interact with them: load them into your environment, list the available and/or loaded ones, etc.

List available modules

module avail lists all modules.
module avail <string> lists all modules with <string> in their name (case-insensitive), e.g. module avail python returns the Python and IPython modules, among others.
module spider and module spider <string> do a similar job.
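
In practice, a search could look like the sketch below. The search term is just an example, and the guard around the commands is an addition for safety: it only checks that the module command exists in the current shell before calling it.

```shell
# Example search term: replace with the software you are looking for.
QUERY="python"

if command -v module >/dev/null 2>&1; then
    module avail "$QUERY"     # modules whose name contains the term
    module spider "$QUERY"    # similar search across the module tree
else
    echo "module command not found: run this in a shell where Lmod is set up" >&2
fi
```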

Load a module

module load <module_name> [<module2_name> ...] loads the given module(s), e.g. module load Python/3.10.8-GCCcore-12.2.0 SciPy-bundle/2023.02-gfbf-2022b loads both Python and SciPy. Find names with the avail or spider commands.

When possible, load matching toolchain versions, i.e. versions that have been compiled with the same toolchain.

NB: When loading multiple modules and hitting a dependency conflict, the last loaded module wins, i.e. the last module that needs the dependency dictates the loaded version of said dependency.

List loaded modules

module list lists all the loaded modules.

Even after explicitly loading a single module, the list may contain multiple modules. This is because loading a module means loading the given one and all the modules it depends on. For example, loading R-bundle-Bioconductor/3.16-foss-2022b-R-4.2.2 effectively loads R, Bioconductor, and 123 other dependencies.

Unload module(s)

module unload <module_name> unloads a given module and all the dependencies that are no longer needed.
module purge unloads all the loaded modules.
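
Put together, a typical session could look like the following sketch. The module names are the ones used as examples on this page, and the versions actually installed may differ, so check with module avail first. The guard and the `python3 --version` check are additions for illustration.

```shell
# Module names taken from the examples on this page; adjust to what is installed.
PY_MODULE="Python/3.10.8-GCCcore-12.2.0"
SCIPY_MODULE="SciPy-bundle/2023.02-gfbf-2022b"

if command -v module >/dev/null 2>&1; then
    module purge                              # start from a clean environment
    module load "$PY_MODULE" "$SCIPY_MODULE"  # load both, with their dependencies
    module list                               # show everything that got loaded
    python3 --version                         # check which Python is now on PATH
    module unload "$SCIPY_MODULE"             # drop one module
    module purge                              # unload everything that is left
else
    echo "module command not found: run this in a shell where Lmod is set up" >&2
fi
```

Starting with module purge is a design choice rather than a requirement: it avoids mixing the new modules with whatever was loaded earlier in the session.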

Note on installing new software

You cannot install your own software with EasyBuild*.

When you identify a piece of software that is not available, you can ask us or IT Services to install it. On our side, installing should not take long, provided that either (1) an official EasyBuild recipe exists, or (2) the install procedure is standard and follows best practices.

As an alternative, you may also use virtual environment managers (like conda), but we provide only limited support for them.

* You could in fact maintain your own EasyBuild install, but this is advanced usage and out of the scope of this document.

Limitations and good practices

The machine running RStudio Server is powerful, but it is a shared resource accessible to all EMBL scientists. Be mindful of others.

Do not run resource intensive jobs on this machine or they will be killed.
