
Genome Biology Computational Support

At your side to solve your daily data management and NGS data analysis challenges

Data Management

We develop Lab Integrated Data (LabID), our open-source solution that helps scientists solve modern data management problems in practice. We work in close collaboration with EMBL core facilities as well as IT Services to provide the most advanced framework for open science.

Lab Integrated Data

LabID embeds the FAIR principles [1] at its core and helps you gather exhaustive information about the different entities constituting a research project (lab inventory, samples, assays, datasets, protocols and your personal electronic lab notes) throughout the experimental and analysis routines.
The data, the metadata, and the relationships between the entities are recorded and organized to make them findable, accessible, interoperable and reusable by yourself and others, working towards open science.

We work in close collaboration with core facilities to establish bridges between LabID, data acquisition facilities, and data repositories.

[1] Wilkinson, M. D. et al. The FAIR Guiding Principles for scientific data management and stewardship. Sci. Data 3:160018 doi: 10.1038/sdata.2016.18 (2016).

Accessibility

All EMBL groups and individuals have access to our EMBL-wide LabID instance. You can log in with your EMBL credentials. For security reasons, a VPN connection is required to connect from outside EMBL.

How do I join LabID?

You can use LabID by simply logging in with your usual EMBL credentials. If your group has never registered datasets in LabID, a few setup steps have to be completed to enable access to your group share, which is required for data registration. Please get in touch with us so that we can address the following technical aspects (the agreement of the group leader is required):

  • The LabID UNIX user has to be added to your UNIX group
  • A LabID data repository directory has to be created on your group share. This is where all files managed by LabID will be stored securely.
  • The LabID data repository structure (dropbox and data storage) has to be generated
  • Your new data repository directory has to be registered in LabID as a known storage volume
What happens when I leave EMBL (user)?

All your data remains available on the LabID server: we never delete data.

You can easily export a copy of your data (lab stocks, samples, datasets) to Excel spreadsheets. Your ELN notes (experiments) can also be exported to PDF and/or HTML formats. And of course, you can keep using LabID at your new home as it is free and open-source.

Before leaving, be nice and tidy up:

  • Make sure to share your items with e.g. your group leader or the lab manager
  • Link your samples to protocols and annotate them. This is particularly important if you later want support to publish your data.
  • Make sure to QC flag your samples and datasets. Failed datasets can be removed from disk to save space.
  • Archive to tape all projects/datasets that do not need to remain live
What happens when I leave EMBL (group leader)?

In addition to the above explanations, you need to get in touch with us as your group share(s) will be retired at some point. This needs to be well planned.

Permission system

LabID implements a powerful permission system that lets you fine-tune access to your data.
By default, items are created with view, edit & delete permissions for the author, and view & edit permissions for the author’s group. This applies to all models except Electronic Lab Notebook notes where, by default, no edit permission is granted to the group.
The permissions can be altered by the owner at any time. It is also possible for us (admins) to alter permissions, but we will not do so on behalf of authors who are still active.
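
As a minimal sketch, the defaults described above can be modelled as follows. The function name and the `"eln_note"` label are hypothetical and are not part of the LabID API, and the sketch assumes that the group keeps view permission on lab notes; only the defaults themselves come from the text above.

```python
from typing import Dict, Set

# Hypothetical label for Electronic Lab Notebook notes; not a LabID identifier.
ELN_NOTE = "eln_note"


def default_permissions(model: str) -> Dict[str, Set[str]]:
    """Return the default permissions granted when an item is created."""
    author = {"view", "edit", "delete"}
    # The author's group can view and edit by default, except for ELN notes,
    # where no edit permission is granted (view is assumed to remain).
    group = {"view"} if model == ELN_NOTE else {"view", "edit"}
    return {"author": author, "group": group}


if __name__ == "__main__":
    print(default_permissions("sample"))
    print(default_permissions(ELN_NOTE))
```

These are only the creation-time defaults; the owner can change them afterwards.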

How do I access the data of alumni?

When a colleague has left EMBL, we can enable access to their material. Such actions should be requested by the group leader, and a meeting should be set up to discuss details. Please note that for intellectual property reasons, we will never grant edit access to someone’s lab notes.

Dataset access

Datasets registered in LabID are stored on your group share but owned by the LabID UNIX user to protect them from unwanted modifications (rename, move or delete).
They can be searched for and browsed with the web UI, and they also remain readable from disk at all times, so that you can easily access them with the tool of your choice (Galaxy, command-line tools, R, WMS, etc.).

Dataset registration

Registering datasets simply means handing the datasets over to LabID and providing additional information about samples and assays, in particular when registering raw (or primary) data. Several entry points exist for registration, depending on the data source.
On our side, we make sure all datasets entering LabID are automatically registered in the ITS DM app too.

What is a dataset?

«Dataset» is a broad concept that refers to any file, set of files, folder, or set of folders describing an atomic unit of biological data (obtained from a particular sample under particular conditions and/or treatment). Atomic means that the whole set of files/folders should be considered together for data analysis, i.e. it is not possible to analyse only part of it. This distinguishes datasets from Studies or Projects, which group multiple datasets deriving from multiple samples (or technical/biological replicates) and potentially acquired on different days, with different instruments and/or conditions. In LabID, a dataset may be used in different studies & projects.


A file (or a folder, etc.) containing data is referred to, in LabID, as a datafile. In many cases, a single datafile is enough to constitute a dataset, i.e. this single entity contains all parts of a meaningful unit of data about the studied biological material. Conversely, in other cases, the meaningful unit (=dataset) is a combination of datafiles, e.g. paired-end fastq files.

Examples of datasets in genomics

  • In single-end sequencing, a dataset is composed of a single file containing all short DNA sequences.
  • Conversely, in paired-end sequencing, the meaningful unit of data is a combination of two files, each containing a part of the information.
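
The two cases above can be illustrated with a short sketch that groups fastq datafiles into datasets. The file naming convention (`<sample>_R1.fastq.gz` / `<sample>_R2.fastq.gz` for paired-end reads, `<sample>.fastq.gz` for single-end reads) is an assumption made for illustration, and the function is hypothetical; it is not how LabID itself detects datasets.

```python
import re
from collections import defaultdict
from typing import Dict, List

# Assumed naming convention (illustration only):
#   paired-end: "<sample>_R1.fastq.gz" and "<sample>_R2.fastq.gz"
#   single-end: "<sample>.fastq.gz"
PAIRED = re.compile(r"^(?P<sample>.+)_R[12]\.fastq\.gz$")
SINGLE = re.compile(r"^(?P<sample>.+)\.fastq\.gz$")


def group_into_datasets(datafiles: List[str]) -> Dict[str, List[str]]:
    """Group datafile names into datasets, one dataset per sample prefix."""
    datasets: Dict[str, List[str]] = defaultdict(list)
    for name in sorted(datafiles):
        # Try the paired-end pattern first, then fall back to single-end.
        match = PAIRED.match(name) or SINGLE.match(name)
        if match is None:
            continue  # not a fastq file under the assumed convention
        datasets[match.group("sample")].append(name)
    return dict(datasets)


if __name__ == "__main__":
    files = ["s1_R2.fastq.gz", "s1_R1.fastq.gz", "s2.fastq.gz"]
    for sample, members in group_into_datasets(files).items():
        print(sample, members)
```

Here the paired-end sample yields one dataset made of two datafiles, while the single-end sample yields one dataset made of a single datafile.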

In summary, the definition of a dataset depends on the context and should be carefully considered before registering the data. While it is fairly obvious in the case of sequencing, it may become trickier in microscopy, where users may acquire a single dataset during an imaging session (e.g. a Light Microscopy Screen, cryo-EM/ET session or a SeqFISH spatial omics assay) or multiple datasets belonging to different projects (e.g. in situ expression in different specimens). In case of doubt, get in touch with us to discuss your situation before registering your data.

Registration from trusted sources e.g. GeneCore

When data is acquired by a trusted provider, we have established data transfer mechanisms (or are working on them). Additional input from the data owner (you) remains necessary to review and complete the registration of the datasets received from the trusted provider.

Genomics Core Facility

GeneCore sends us the data they have sequenced for you. You immediately receive an email to proceed to the registration form that has been pre-filled with information available at GeneCore.
This form allows us to capture additional assay information and the relationships between entities (e.g. dataset-sample relationships).

Other Facilities

We are currently establishing additional bridges between LabID and other trusted sources at EMBL. Currently, data obtained from other supported technologies* (light & electron microscopy imaging) have to be manually registered.

(*) apart from supported technologies, the generic assay model can be used to register basic information about an assay, e.g. for proteomics. Albeit elementary, this importantly allows you to maintain relationships between your samples and your datasets, while we develop the appropriate models for these additional assays.

Manual registration

We have established a procedure to streamline manual data registration. The key points during this process are for us to (1) obtain UNIX rights on the datasets, and (2) obtain additional information about how the data was acquired.

Registration from the user dropbox

Each LabID user has their own dropbox on their group share where they can deposit data to be registered. We enforce this initial copy into the dropbox to ensure you have access permissions on the data you are about to register under your name in LabID.

Registration from other places

With the help of the ITS DM app (DMA), we hope to limit superfluous data movements and duplication, while ensuring data authorship and traceability.

DMA data pull: This service allows you to pull the data directly from the machine it was created on (e.g. the computer attached to the microscope) to its final location on your group share.
Data Pull will (1) register the data in the DMA, (2) move it to the indicated location, and (3) hand it over to LabID; all in a single operation.

DMA data handover: Similar to a pull, but used when the data is at an arbitrary location on the NFS (using handover may require a special role).
After pulling or handing over the data, you still need to follow up with dataset registration within LabID.

Dataset deletion

Permissions on the registered datasets are acquired by the LabID user so that they are protected from direct deletion (i.e. from command line or using the file explorers).

They can be deleted only from LabID. The data deletion policy differs for managed datasets (i.e. datasets automatically registered from trusted providers) and non-managed datasets (all other datasets). LabID will always let you delete non-managed datasets, while managed datasets can only be deleted if there is a copy on tape (archive) or the dataset QC is set to failed (in which case we strongly suggest that you also provide a short explanation in the description).
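
The deletion policy above boils down to a small decision rule. The following sketch expresses it as a plain function; the function and parameter names are hypothetical and do not correspond to the LabID code base.

```python
def can_delete(managed: bool, archived_on_tape: bool, qc_failed: bool) -> bool:
    """Apply the deletion policy described in the text to one dataset."""
    if not managed:
        # Non-managed datasets can always be deleted from LabID.
        return True
    # Managed datasets need a tape copy or a failed QC flag.
    return archived_on_tape or qc_failed


if __name__ == "__main__":
    print(can_delete(managed=False, archived_on_tape=False, qc_failed=False))
    print(can_delete(managed=True, archived_on_tape=False, qc_failed=False))
    print(can_delete(managed=True, archived_on_tape=True, qc_failed=False))
```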

Data publishing (to EBI repos)

Releasing the data to the public when publishing a scientific paper is mandatory. This is often a tedious process because it happens late, when the data and metadata may have been scattered over several years. With LabID as a central source of truth, it becomes routine.

Currently, this process is readily available for sequencing data using the LabID CLI (study is exported as a MAGE-TAB document). For other data types, pipelines still need to be developed; still, the CLI offers tabular MAGE-TAB like export of all the datasets linked to a study.

To export your study into a valid document, the following should happen (use batch edit features to readily perform those):

  • Link all (and only) the datasets to submit to a single study. Make sure this study has a clear name and description (those will end up as-is on the repository side) and relevant design terms
  • Make sure all linked samples are well annotated and linked to protocols (use the study context filter). Note that LabID only exports the protocols’ summaries: the description has rich text formatting and may embed tables & images that can’t be exported to MAGE-TAB. A protocol summary should consist of a few sentences only and expose the key aspects of the protocol. When following a commercial protocol (kit) or a published protocol, it is enough to refer to it (e.g. with product details or PubMed ID).
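
The checklist above could be verified before export with something like the sketch below. It works on a plain dictionary whose keys (`name`, `description`, `design_terms`, `datasets`, `samples`, `protocols`) are hypothetical and chosen for illustration only; they do not reflect the LabID API or the CLI export format.

```python
from typing import Any, Dict, List


def export_problems(study: Dict[str, Any]) -> List[str]:
    """List what is still missing before a study can be exported."""
    problems: List[str] = []
    # The study needs a name, a description and design terms.
    for field in ("name", "description", "design_terms"):
        if not study.get(field):
            problems.append(f"study is missing {field}")
    # At least one dataset must be linked to the study.
    if not study.get("datasets"):
        problems.append("no dataset is linked to the study")
    # Every linked sample must be linked to at least one protocol.
    for sample in study.get("samples", []):
        if not sample.get("protocols"):
            problems.append(f"sample {sample.get('name')} has no linked protocol")
    return problems


if __name__ == "__main__":
    draft = {"name": "My study", "datasets": ["ds1"],
             "samples": [{"name": "s1", "protocols": []}]}
    for problem in export_problems(draft):
        print(problem)
```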

We encourage you to contact us so that we can assist you in this process.

Technology stack & repositories

LabID is a combination of a database server exposing a powerful Application Programming Interface (API), a web user interface (UI) and a command line interface (CLI).
The server is developed in Python using Django and uses a PostgreSQL database. The UI is a Vue.js application. The CLI is written in Python.

Documentation

Knowledgebase of our project introducing the concepts, the features and how to use them

Slack

Slack workspace to discuss LabID, request features, and report & troubleshoot issues

Chat with us (EMBL)

EMBL chatroom to discuss LabID, request features, and report & troubleshoot issues

DMA Integration

The Data Management Application (DMA) developed by ITS provides different user interfaces (Web, API & CLI) to perform common data manipulations such as transfer between storage systems, ownership changes, long-term archiving, sharing, etc.

LabID uses the DMA API to automate data registration, sharing*, archiving and retrieval* (*: under development).

What data is registered in the DMA?

LabID manages data of different kinds, including lab inventories, samples, notes and experimentally derived data. Only the experimentally derived data, commonly referred to as datasets, is registered in the DMA.

Dataset registration

All datasets registered in LabID are automatically registered in the DMA. The DMA identification number is available (and visible) as a dataset annotation.

Dataset access

Datasets registered via LabID belong to the LabID technical user and have to be managed from the LabID UI. However, they remain visible in the DMA under your group section. Any operation (archiving, sharing…) should be performed through LabID.

Dataset sharing

LabID will automatically forward your dataset sharing actions to the DMA. This feature is not yet available but is under active development.

Dataset archiving/restoration

LabID allows you to archive entire projects, studies or a list of datasets. At EMBL, this request is forwarded to the DMA. LabID keeps monitoring the progress on the DMA side and sends you an email when the entire set of datasets has been archived. You can then safely remove all the archived files from the main storage in a single click (from the Archive detail page).

In LabID, restoring datasets from tape can be requested on one or multiple datasets at once. Similarly to the archiving procedure, LabID creates as many restore jobs as needed on the DMA side and keeps monitoring their completion status. You’ll get an email once all requested datasets have been restored. Note that the datasets are automatically restored to their primary location. In case the primary location is not free, the dataset is restored next to the primary location with name <dataset_name>_YYMMDD. A third option is to give an alternative location where all datasets should be restored. The restore feature is not yet available and is under active implementation.
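
The three restore destinations described above can be sketched as a small path computation. This is an illustration only: the function name is hypothetical, "not free" is interpreted here as "the path already exists", and placing the dataset directly inside the alternative location is an assumption.

```python
from datetime import date
from pathlib import Path
from typing import Optional


def restore_target(primary: Path, today: date,
                   alternative_dir: Optional[Path] = None) -> Path:
    """Return the path where a dataset would be restored."""
    if alternative_dir is not None:
        # Third option: a user-supplied location for all restored datasets.
        return alternative_dir / primary.name
    if not primary.exists():
        # Default: restore to the primary location when it is free.
        return primary
    # Otherwise restore next to it as <dataset_name>_YYMMDD.
    return primary.with_name(f"{primary.name}_{today:%y%m%d}")


if __name__ == "__main__":
    # Hypothetical path, used for illustration only.
    print(restore_target(Path("/tmp/labid_demo/run1"), date.today()))
```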

Dataset deletion

Dataset deletion should only happen through the LabID interface.

Potential conflict and limitation of the integration of the DMA with LabID

It is not possible to register a “dataset in dataset”. It is therefore very important to think about what your datasets are before registration. For example, if you register a folder containing both raw data and analysed data, it will not be possible later to register the raw and analysed datasets separately (as they should be).
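
To check for this situation before registering, a path can be compared against the locations of already registered datasets. The sketch below is a hypothetical helper, not part of LabID or the DMA, and the example paths are made up.

```python
from pathlib import PurePosixPath
from typing import Iterable, Optional


def nesting_conflict(candidate: str, registered: Iterable[str]) -> Optional[str]:
    """Return the registered path that conflicts with candidate, if any."""
    new = PurePosixPath(candidate)
    for existing in registered:
        old = PurePosixPath(existing)
        # Conflict if the paths are equal, or if one contains the other.
        if new == old or old in new.parents or new in old.parents:
            return existing
    return None


if __name__ == "__main__":
    registered = ["/g/group/run1"]  # made-up path
    print(nesting_conflict("/g/group/run1/raw", registered))
    print(nesting_conflict("/g/group/run2", registered))
```

Comparing path components rather than string prefixes avoids false positives such as `run1` versus `run10`.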

Galaxy Integration

We maintain the EMBL Galaxy instance as described here. Datasets stored in LabID can readily be made available in our local Galaxy, without data duplication.
We describe how to do this in the Galaxy Sync hands-on section of our LabID training session.

Although LabID will not complain, data stored on tier2 cannot be processed in Galaxy. This is because the HPC does not have access to tier2.
