Genome Biology Computational Support

At your side to solve your daily data management and NGS data analysis challenges

Data Analysis

While Galaxy is the most accessible and efficient option for standard analyses, more advanced statistical modeling or visualization usually requires specialized code, which can be written and executed in R on our Posit™ Workbench instance.

Workflow modeling with Galaxy and other Workflow Management Systems (WMS), to achieve better analysis automation and reproducibility, is also in our area of expertise, and we can provide advice and support to researchers at any level, from beginner to advanced.

Galaxy

Galaxy is a web app for running reproducible data analyses through a user-friendly graphical interface.

Accessibility

Everyone at EMBL has access to our Galaxy instance. Login happens with your EMBL credentials.
Galaxy can be used by anyone, but it is a particularly valuable asset for bench scientists with little computing experience, as it makes it easy to run the most commonly used bioinformatics software.

Bioinformatics software

A variety of bioinformatics tools are available in a few clicks, including the most widely used NGS, proteomics, and image data software. More can be deployed on demand: if you need a specific tool, do not hesitate to contact us.

Computing is performed on our HPC cluster

Resource-intensive jobs launched with Galaxy are automatically executed on EMBL’s high-performance computing infrastructure (maintained by ITS). This means no additional hassle for researchers who have never used an HPC cluster before.

Data access with FAIR principles

A quota of 200 GB is allocated to each user. We encourage users to download useful analysis results to their group share as they are produced, and we expect users to clean and purge unneeded data from their history to recover disk space.

Galaxy is not to be used as storage, and we cannot guarantee that data will be kept long term.

Upload data

This is the quickest option, but we do not recommend it for bigger files and/or files that are already stored on your group share, as this will unnecessarily consume your quota and potentially duplicate data.

Access your group share data

The data available on your group share at /g/<groupname> can be linked directly to your Galaxy data library.
This has to be done by an admin. To do so, please open a request with us, including the list of files that need to be made available. This avoids unnecessary data duplication, which saves your group resources (disk space and ?)

From and to LabID

Connections have been established between our Galaxy instance and LabID, our data management platform. Datasets can be transferred from LabID to Galaxy in a few clicks (and without data duplication). Sending data back from Galaxy to LabID is currently in beta testing. This makes it possible to store Galaxy analysis results permanently and to reference them in lab notes, linking them to samples, annotations, protocols, reagents, etc.

Get Support

Galaxy Chat

Internal chatroom for our Galaxy users, to get advice and troubleshoot issues

Posit™ Workbench

Posit™ Workbench (the new name of RStudio Server) is a powerful Integrated Development Environment for R, the go-to programming language for bioinformaticians and statisticians aiming to extract valuable information from experimental data.

Accessibility

Everyone at EMBL has access to our Workbench instance. Login happens with your EMBL credentials.

Data access

Workbench has access to the EMBL file system, including your group share, so you can access data directly by referring to its path (e.g. on your group share).

Limitations and good practices

The machine running Workbench is powerful but is a shared resource accessible by all EMBL scientists. Be mindful of others.

Each session is limited to 40 GB of memory. Please refrain from opening multiple sessions at once. We will kill your sessions if they jeopardize the work of others.
Resource-intensive jobs have to be run on the cluster. This is especially true for parallelized jobs using multiple cores and a lot of memory.

R (and Bioconductor) versions

The different R versions that are made available to you have been optimised to run on our infrastructure. The ones available in Workbench are the same versions available via command-line on Seneca as well as on the cluster.
The specific version to be used can be selected from the UI. Each R version is tied to a specific version of Bioconductor. All the available versions can be listed in a terminal session from Seneca or from the cluster (module avail R-bundle-Bioconductor).

Note on installing new versions

You cannot install your own version of R and use it on Workbench.

All versions are handled with the software framework used and maintained by ITS (Easybuild). New versions can be installed by us or ITS, provided the install recipe has been released by Easybuild and is available in their GitHub repository.

We also advise against installing your own R version locally or on Seneca with e.g. conda because this will critically limit the reproducibility of your analysis.

Library and package install

We encourage you to play around and install as many libraries as you want, however please consider the following:

1. Properly configure the install location of libraries

The default install location for libraries is your home (~), whose disk space is limited by a quota.

By installing too many libraries, you will eventually hit the quota and start experiencing disk space errors. You can circumvent this issue by configuring R to install libraries somewhere else, for example in your group share. To achieve this, please create a .Renviron file in your home folder (~/.Renviron) and set the R_LIBS_USER variable:

R_LIBS_USER="/g/‹your_group›/‹your_username›/R-libs/%p/%V"
# Which resolves to /g/‹your_group›/‹your_username›/R-libs/x86_64-pc-linux-gnu/4.2.1 

Make sure to use the variables %p and %V (resolved respectively as the platform name and the R version) so that R maintains a separate library install folder for each version. This is important to avoid dependency conflicts when changing R versions.

See also: https://git.embl.de/-/snippets/7
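The configuration step above could be scripted roughly as follows. This is only a sketch: set_r_libs_user is a hypothetical helper name (not an existing tool), and the library root passed to it is a placeholder you need to adapt to your own group and username. It appends the R_LIBS_USER line only if the file does not define one yet, and leaves %p and %V literal so that R expands them itself.

```shell
# Append R_LIBS_USER to an .Renviron file unless it is already set there.
# Usage: set_r_libs_user <renviron_file> <library_root>
# (set_r_libs_user is a hypothetical helper name used for illustration)
set_r_libs_user() {
  renviron=$1
  lib_root=$2
  # Create the file if it is missing so grep does not fail on a fresh account
  touch "$renviron"
  if grep -q '^R_LIBS_USER=' "$renviron"; then
    echo "R_LIBS_USER already set in $renviron, leaving it untouched" >&2
    return 0
  fi
  # %%p and %%V print as literal %p and %V, which R resolves at startup
  printf 'R_LIBS_USER="%s/%%p/%%V"\n' "$lib_root" >> "$renviron"
}

# Example call (placeholders to adapt):
# set_r_libs_user ~/.Renviron "/g/<your_group>/<your_username>/R-libs"
```

After restarting your R session, calling .libPaths() in R should show whether the new location is picked up.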

2. Do not update all pre-installed Bioconductor libraries

Please update a pre-installed library only when needed (i.e. when you are solving a dependency issue or when you know a newer version has a critical bug fix).

As explained above, R comes with the Bioconductor bundle and therefore has an extensive list of pre-installed libraries. Each is pinned to a specific version (the one listed in the Easybuild recipe). Updating all libraries, as sometimes advised by R, is not recommended here. It will download and install a newer version within your library folder; however, the newer version will not be compiled in the optimised way our pre-installed one was, and will therefore run less efficiently.

3. Library installs are specific to a given R version

Each R version comes with its own install of libraries. Therefore, installing many libraries for many R versions will increase the disk space you use accordingly.

Troubleshooting

Exceeding home disk space quota

Problem: Workbench (RStudio) stores session information in the user’s home directory at /home/<username>. This can lead to issues when hitting the disk quota (50 GB).

Solution: Move the ~/.local/share/rstudio directory to another disk without quota (preferably on Seneca, or to your group share).

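# Create a personal directory on the target disk (no error if it already exists)
mkdir -p "/tmpdata/$USER"
# Move the session data there and leave a symlink at the original location
mv ~/.local/share/rstudio "/tmpdata/$USER/rstudio" && ln -s "/tmpdata/$USER/rstudio" ~/.local/share/rstudio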
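A slightly more defensive variant of the same move-and-symlink approach is sketched below. relocate_dir is a hypothetical helper name introduced for illustration; it refuses to run if the source is already a symlink or if the destination already exists, so it can be re-run without side effects. It is probably safest to use it while no Workbench session is active, although that is an assumption rather than a documented requirement.

```shell
# Move a directory to a new location and leave a symlink behind.
# Usage: relocate_dir <source_dir> <destination_dir>
# (relocate_dir is a hypothetical helper name used for illustration)
relocate_dir() {
  src=$1
  dest=$2
  if [ -L "$src" ]; then
    echo "$src is already a symlink, nothing to do" >&2
    return 0
  fi
  if [ ! -d "$src" ]; then
    echo "$src does not exist" >&2
    return 1
  fi
  if [ -e "$dest" ]; then
    echo "$dest already exists, refusing to overwrite" >&2
    return 1
  fi
  # Make sure the parent of the destination exists before moving
  mkdir -p "$(dirname "$dest")"
  mv "$src" "$dest" && ln -s "$dest" "$src"
}

# Example call, mirroring the commands above:
# relocate_dir ~/.local/share/rstudio "/tmpdata/$USER/rstudio"
```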

Conversion of RMarkdown to HTML or PDF fails

Problem: Conversion of RMarkdown (.Rmd) files to HTML or PDF fails with errors related to X11 display or Invalid argument.

Solution: Create or update your ~/.Rprofile to add the following line:

options(bitmapType='cairo')

If, specifically, you want to generate PNG images inside HTML output, you can also use the following Rmd preamble:

---
output:
  html_document:
    dev: CairoPNG
---

Use a specific version of Python

Problem: An R package relying on a specific/outdated Python version produces dependency conflicts with our default Python version.

Solution: Use the use_python(), use_virtualenv(), or use_condaenv() functions of the reticulate package. For example:

library(reticulate)
use_condaenv('my-project')

Get Support

Workbench Chat

Internal chatroom for our Workbench users, to get advice and troubleshoot issues

WMS & Support

To achieve better automation and reproducibility of analyses, we strongly encourage the use of analysis workflows and Workflow Management Systems (WMS).

Galaxy

Next Generation Sequencing (NGS) data analysis

We assist less computer-savvy colleagues with their standard NGS data analysis (RNA-seq, ChIP-seq, ATAC-seq, HiC, scRNA-seq…) by providing ready-to-use Galaxy workflows.
Non-standard analysis workflows have to be developed by you; nevertheless, we can teach you the basics of Galaxy so that you can assemble your own workflow in no time.

Our expertise in domains other than NGS is limited; however, we can help you assemble your own workflow.

Training

GBCS regularly provides internal training, and the Galaxy Training Network provides live material for learning on your own. This covers a wide range of domains, including sequencing, microscopy, proteomics, metabolomics, etc.

Command-line based WMS

For bioinformaticians proficient with command line tools, we advise looking into command-line based WMS. The most commonly used at EMBL are Nextflow and Snakemake*.

(*) We cannot recommend one WMS over another. Snakemake and Nextflow are both powerful tools, and other WMS exist as well. Picking the right tool is a hot topic in life sciences; many aspects are to be considered, and the choice is ultimately up to you. However, we at GBCS have more expertise with Nextflow.

(GB Unit) Custom analysis & long-term collaboration

When your group is part of the GB Unit, we can provide further support and collaborate on workflow development. For example, this can mean developing a custom Galaxy or Nextflow workflow, or collaborating on the development of a Nextflow workflow with bioinformaticians in your group in order to teach them best practices of software development with git and of modular workflow development.

Get Support

Super Computer & Software

We maintain a super computer named Seneca, which we use to run the Posit™ Workbench instance. This computer can be accessed via ssh and is connected to your group share. It can be used to run basic Unix commands and lightweight processing.

Specifications
  • Dell PowerEdge R7425
  • 64 cores capable of 128 concurrent threads (2x AMD EPYC 7601 2.20 GHz/2.7 GHz, 32C/64T, 64M Cache (180W) DDR4-2666)
  • 2 TB RAM (32x 64GB LRDIMM, 2666MT/s)
  • 3.2 TB local storage Flash Disk (/tmpdata)

Accessibility

Everyone at EMBL has access to Seneca. Login happens remotely via ssh to seneca.embl.de (when connected to the EMBL network).

Cluster access

Seneca is configured as a SLURM submit host and can therefore be used to submit cluster jobs, just like login01.cluster.embl.de or login02.cluster.embl.de. Find more information on the ITS Cluster Wiki.
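As an illustration of what submitting from Seneca might look like, the sketch below writes a minimal SLURM batch script. write_job_script is a hypothetical helper name, and the resource values and the command in the example call are placeholders. The #SBATCH options used (--job-name, --cpus-per-task, --mem, --time, --output) are standard SLURM options, but the cluster may require additional settings such as a partition, so please check the ITS Cluster Wiki for the actual requirements.

```shell
# Write a minimal SLURM batch script to a file.
# Usage: write_job_script <file> <job_name> <cpus> <memory> <time_limit> <command>
# (write_job_script is a hypothetical helper name used for illustration)
write_job_script() {
  file=$1
  job_name=$2
  cpus=$3
  mem=$4
  time_limit=$5
  cmd=$6
  # %j in the output file name is replaced by SLURM with the job ID
  cat > "$file" <<EOF
#!/bin/bash
#SBATCH --job-name=${job_name}
#SBATCH --cpus-per-task=${cpus}
#SBATCH --mem=${mem}
#SBATCH --time=${time_limit}
#SBATCH --output=${job_name}-%j.log

${cmd}
EOF
}

# Example call (all values are placeholders):
# write_job_script job.sh my-analysis 8 16G 02:00:00 "Rscript analysis.R"
```

The resulting file can then be submitted with sbatch job.sh, and squeue -u "$USER" lists your pending and running jobs.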

Software

The majority of software and their versions are handled with Easybuild, the software framework used and maintained by ITS. Software is specifically compiled against the platform it’s running on and is therefore optimised. A specific version of a software, compiled by a specific toolchain, is referred to as an environment module. Modules are loaded into the user environment on demand, by the users themselves, using the module command. Loading a given module also loads all the software dependencies it needs.

Modules basics

Easybuild builds software modules. These are managed with the module command-line tool (we use Lmod), which is typically used to load modules into your environment, list the existing and/or loaded ones, etc.

List available modules
  • module avail lists all modules.
  • module avail <string> lists all modules with <string> in their name (case insensitive), e.g. module avail python returns Python and IPython modules, etc.
  • module spider and module spider <string> do a similar job.

Load a module
  • module load <module_name> [<module2_name> ...] loads the given module(s), e.g. module load Python/3.10.8-GCCcore-12.2.0 SciPy-bundle/2023.02-gfbf-2022b loads both Python and SciPy. Find names with the avail or spider commands.

When possible, load matching toolchain versions, i.e. versions that have been compiled with the same toolchain.

NB: When loading multiple modules and hitting a dependency conflict, the last loaded module wins, i.e. the last module that needs the dependency dictates the loaded version of said dependency.
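To make the toolchain part of a module name easier to spot, the sketch below splits a name into its components. parse_module is a hypothetical helper name, and the parsing assumes the name/version-toolchain-toolchainversion layout seen in the examples on this page; module names that do not follow this layout (for instance ones without a toolchain part) will not be split correctly. It only displays the toolchain, it does not decide whether two toolchains are compatible.

```shell
# Split a module name into name, version and toolchain.
# Usage: parse_module <module_name>
# (parse_module is a hypothetical helper name used for illustration;
#  it assumes the layout name/version-toolchain-toolchainversion[-suffix])
parse_module() {
  full=$1
  name=${full%%/*}        # everything before the first "/"
  rest=${full#*/}         # everything after the first "/"
  version=${rest%%-*}     # software version, up to the first "-"
  tail=${rest#*-}         # toolchain plus optional version suffix
  tc_name=${tail%%-*}     # toolchain name, e.g. foss or GCCcore
  after=${tail#*-}
  tc_version=${after%%-*} # toolchain version, e.g. 2022b
  printf 'name=%s version=%s toolchain=%s-%s\n' \
    "$name" "$version" "$tc_name" "$tc_version"
}

# Example calls with module names taken from this page:
# parse_module Python/3.10.8-GCCcore-12.2.0
# parse_module SciPy-bundle/2023.02-gfbf-2022b
```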

List loaded modules
  • module list lists all the loaded modules.

Even after explicitly loading a single module, the list may contain multiple modules. This is because loading a module means loading the given one and all the modules it depends on. For example, loading R-bundle-Bioconductor/3.16-foss-2022b-R-4.2.2 effectively loads R, Bioconductor, as well as 123 other dependencies.

Unload module(s)

  • module unload <module_name> unloads a given module and all obsolete dependencies.
  • module purge unloads all the loaded modules.
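The commands above can be combined into a small routine that always starts from a clean environment, which may help keep analyses reproducible. This is a sketch only: load_clean is a hypothetical helper name, and it relies solely on the module purge, module load and module list commands described above.

```shell
# Start from a clean module environment, then load the requested modules.
# Usage: load_clean <module_name> [<module2_name> ...]
# (load_clean is a hypothetical helper name used for illustration)
load_clean() {
  # The module command only exists where Lmod is set up
  if ! command -v module >/dev/null 2>&1; then
    echo "module command not found" >&2
    return 1
  fi
  module purge                 # unload everything currently loaded
  module load "$@" || return 1 # load the requested modules and their dependencies
  module list                  # show what ended up in the environment
}

# Example call with a module name taken from this page:
# load_clean R-bundle-Bioconductor/3.16-foss-2022b-R-4.2.2
```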

Note on installing new software

You cannot install your own software with Easybuild*.

When you identify a piece of software that is not available, you can request its installation from us or from IT Services. On our side, installing should not take long, provided that either (1) an official Easybuild recipe exists, or (2) the install procedure is standard and follows best practices.

As an alternative, you may also use virtual environment managers (like conda), but we provide only limited support for them.

* Effectively you could maintain your own Easybuild install, but this is advanced usage and out of scope of this document.

Limitations and good practices

The machine running Workbench is powerful but is a shared resource accessible by all EMBL scientists. Be mindful of others.

Do not run resource intensive jobs on this machine or they will be killed.
