Snakemake profile – 1: Getting started with snakemake – Bioinformatics Services

Snakemake profile – 1: Getting started with snakemake

Alfred Hitchcock

This blog post is the first of a series on creating snakemake profiles. Some content is taken directly from the snakemake manual. It is assumed that the reader has some basic knowledge of snakemake, even though I start from the very beginning.

Introduction

Adapting Snakemake to a particular environment can entail many flags and options. Therefore, since Snakemake 4.1, it is possible to specify a configuration profile to be used to obtain default options:

#!/usr/bin/bash
snakemake --profile myprofileFolder

In this section, I am going to explain how to create a profile. The profile folder will contain all the configuration parameters needed to successfully run your pipeline. Of note, no cluster.json file will be used, since it has been deprecated since snakemake v4.1. The profile folder is expected to contain a file config.yaml that defines default values for the snakemake command-line arguments and job resources.
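To give you an idea of where we are heading, here is a minimal, hypothetical config.yaml (the actual profile will be built step by step in the next posts). Each key is simply the long name of a snakemake command-line option, together with the default value you want:

```yaml
# myprofileFolder/config.yaml -- hypothetical example, not used in this post
cores: 1              # equivalent to passing --cores=1
printshellcmds: true  # equivalent to passing --printshellcmds
```

With such a file in place, snakemake --profile myprofileFolder behaves as if those options had been typed on the command line.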

Folder structure

Let’s start by creating a folder having the right structure:

#!/usr/bin/bash
# Create the folder containing the files needed for this tutorial
mkdir snakemake-profile-demo
# Enter the created folder
cd snakemake-profile-demo
# Create an empty file that will contain the snakemake code
touch snakeFile
# Create toy input files
mkdir inputs
echo "toto" > inputs/hello.txt
echo "totoBis" > inputs/helloBis.txt
# Create an empty folder to create a conda environment
# This is done to make sure that you use the same snakemake version as I do
mkdir envs
touch envs/environment.yaml
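After these commands, the folder structure should look like this:

```
snakemake-profile-demo/
├── snakeFile
├── inputs/
│   ├── hello.txt
│   └── helloBis.txt
└── envs/
    └── environment.yaml
```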

Creating the conda environment

For more information about the YAML format, see the official YAML documentation. For the next step, you need to have conda installed on your computer. Copy the following content into envs/environment.yaml (the indentation consists of two spaces):

channels:
  - bioconda
dependencies:
  - snakemake-minimal=6.15.1

Then execute the following command to create a conda environment containing snakemake v6.15.1:

#!/usr/bin/bash
conda env create -p envs/smake --file envs/environment.yaml

The conda environment creation should display:

Collecting package metadata (repodata.json): done
Solving environment: done
Downloading and Extracting Packages
libgomp-11.2.0       | 428 KB    | ############################### | 100%
libgfortran-ng-11.2. | 19 KB     | ############################### | 100%
libgfortran5-11.2.0  | 1.7 MB    | ############################### | 100%
libstdcxx-ng-11.2.0  | 4.2 MB    | ############################### | 100%
libgcc-ng-11.2.0     | 906 KB    | ############################### | 100%
Preparing transaction: done
Verifying transaction: done
Executing transaction: done
#
# To activate this environment, use
#
#     $ conda activate /g/romebioinfo/tmp/snakemake-profile-demo/envs/smake
#
# To deactivate an active environment, use
#
#     $ conda deactivate

The -p option indicates the path to the conda environment, which we call smake, and the --file option gives the path to the file describing what we want to include in the environment. This step is important to make sure that we will use the same configuration.

Creating the first rule

Let’s not activate smake for the moment. Instead, let’s write our first rule in snakeFile. Copy the following content (here I use two spaces for indentation):

rule printContent:
  input:
    file1="inputs/hello.txt",
    file2="inputs/helloBis.txt"
  output:
    file1="results/hello-content.txt",
    file2="results/helloBis-content.txt"
  shell:
    """
    cat {input.file1} > {output.file1}
    cat {input.file2} > {output.file2}
    """

Here we wrote a rule called printContent that takes two files as input (hello.txt and helloBis.txt) and prints their content into two other files (hello-content.txt and helloBis-content.txt) written in the results/ folder. This folder will be automatically created by snakemake. Let’s try to execute this file and see what happens:

#!/usr/bin/bash
# Activate the conda environment
conda activate envs/smake
# Run the snakeFile (if --cores is not defined, nothing will happen: try removing it)
snakemake --snakefile snakeFile --cores=1

You should obtain the following message:

Building DAG of jobs...
Using shell: /usr/bin/bash
Provided cores: 1 (use --cores to define parallelism)
Rules claiming more threads will be scaled down.
Job stats:
job             count    min threads    max threads
------------  -------  -------------  -------------
printContent        1              1              1
total               1              1              1
Select jobs to execute...
[Thu Mar  3 15:19:56 2022]
rule printContent:
    input: inputs/hello.txt, inputs/helloBis.txt
    output: results/hello-content.txt, results/helloBis-content.txt
    jobid: 0
    resources: tmpdir=/tmp
[Thu Mar  3 15:19:56 2022]
Finished job 0.
1 of 1 steps (100%) done
Complete log: .snakemake/log/2022-03-03T151955.846643.snakemake.log

Congratulations, you have executed your first snakemake pipeline. Now check that the files hello-content.txt and helloBis-content.txt have been created in the results/ folder:

#!/usr/bin/bash
ls results/*

That should give:

results/helloBis-content.txt  results/hello-content.txt
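Under the hood, what snakemake just did amounts to the following plain shell commands (a sketch for illustration only; snakemake creates results/ itself and additionally tracks whether the outputs are up to date):

```shell
#!/usr/bin/bash
# Recreate the toy inputs, then do by hand what the printContent rule does
mkdir -p inputs results
echo "toto" > inputs/hello.txt
echo "totoBis" > inputs/helloBis.txt
# The shell section of the rule: copy each input's content to results/
cat inputs/hello.txt > results/hello-content.txt
cat inputs/helloBis.txt > results/helloBis-content.txt
```

The difference is that snakemake only reruns these commands when an output is missing or out of date, as the next step demonstrates.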

Execute the snakemake command again (your conda environment should still be active):

#!/usr/bin/bash
# Run the snakeFile 
snakemake --snakefile snakeFile --cores=1

You should see:

Building DAG of jobs...
Nothing to be done (all requested files are present and up to date).
Complete log: snakemake-profile-demo/.snakemake/log/2022-03-03T152616.378173.snakemake.log

This is expected. Since you created hello-content.txt and helloBis-content.txt, snakemake considers that there is nothing to be done. To continue this tutorial, delete the results/ folder:

#!/usr/bin/bash
rm -r results/

Getting rid of the hard-coded paths

You might have noticed the main limitation of what we have done so far: we had to hard-code the path to each input file. This is fine if you have only a few of them, but if you have hundreds or thousands, you are in trouble. This is where wildcards come into play. You can use them as variables that can take several values. Here, we want to replace the strings hello and helloBis with a wildcard that we will call sampleName. snakemake requires wildcards to be enclosed in curly braces:

rule printContent:
  input:
    file1="inputs/{sampleName}.txt",
    file2="inputs/{sampleName}.txt"
  output:
    file1="results/{sampleName}.txt",
    file2="results/{sampleName}.txt"
  shell:
    """
    cat {input.file1} > {output.file1}
    cat {input.file2} > {output.file2}
    """

In the code above, we can see that the two lines of the input section and the two lines of the output section are the same. We can therefore simplify the code as follows:

rule printContent:
  input:
    file1="inputs/{sampleName}.txt"
  output:
    file1="results/{sampleName}.txt"
  shell:
    """
    cat {input.file1} > {output.file1}
    cat {input.file2} > {output.file2}
    """

You can now observe that the file1 variable is no longer useful and that the second line of the shell script is no longer relevant either ({input.file2} does not exist anymore). Below, I therefore remove file1 everywhere, as well as the second line of the shell script:

rule printContent:
  input:
    "inputs/{sampleName}.txt"
  output:
    "results/{sampleName}.txt"
  shell:
    """
    cat {input} > {output}
    """

The code is much shorter now. Replace the content of snakeFile with this rule and run the snakemake command (your conda environment should be active; if it is not, just run conda activate envs/smake; to deactivate it, run conda deactivate):

#!/usr/bin/bash
# Run the snakeFile .
snakemake --snakefile snakeFile --cores=1

You might have wondered: “How is snakemake going to know what to put in the sampleName wildcard?”. And indeed, this message tells you that it does not know:

Building DAG of jobs...
WorkflowError:
Target rules may not contain wildcards. Please specify concrete files or a rule without wildcards at the command line, or have a rule without wildcards at the very top of your workflow (e.g. the typical "rule all" which just collects all results you want to generate in the end)

Then let’s do the most obvious thing and define the sampleName variable! SPOILER: snakemake is written in Python, so if you do not know Python (like me), you will end up googling basic things like “for loop in python”. But do not worry: this works pretty well, and the more you do it, the more you learn!

sampleName=["hello", "helloBis"]
rule printContent:
  input:
    "inputs/{sampleName}.txt"
  output:
    "results/{sampleName}.txt"
  shell:
    """
    cat {input} > {output}
    """

Add the top line to your snakeFile and run the snakemake command again:

#!/usr/bin/bash
# Run the snakeFile .
snakemake --snakefile snakeFile --cores=1

Are you getting the same WorkflowError message? SPOILER: snakemake is not intuitive! You get the same error because the logic of snakemake is a bit particular. You first need to define the files to produce in a rule all section. Be careful: even though you want to list “output files” there, you need the input keyword. Then you define the values of sampleName with the expand function:

rule all:
  input:
    expand("results/{sampleName}.txt", sampleName=["hello", "helloBis"])
rule printContent:
  input:
    "inputs/{sampleName}.txt"
  output:
    "results/{sampleName}.txt"
  shell:
    """
    cat {input} > {output}
    """

After replacing the content of your snakeFile with the rule all version above, run snakemake:

#!/usr/bin/bash
# Run the snakeFile .
snakemake --snakefile snakeFile --cores=1

You should get:

Building DAG of jobs...
Using shell: /usr/bin/bash
Provided cores: 1 (use --cores to define parallelism)
Rules claiming more threads will be scaled down.
Job stats:
job             count    min threads    max threads
------------  -------  -------------  -------------
all                 1              1              1
printContent        2              1              1
total               3              1              1
Select jobs to execute...
[Thu Mar  3 16:44:09 2022]
rule printContent:
    input: inputs/hello.txt
    output: results/hello.txt
    jobid: 1
    wildcards: sampleName=hello
    resources: tmpdir=/tmp
[Thu Mar  3 16:44:09 2022]
Finished job 1.
1 of 3 steps (33%) done
Select jobs to execute...
[Thu Mar  3 16:44:09 2022]
rule printContent:
    input: inputs/helloBis.txt
    output: results/helloBis.txt
    jobid: 2
    wildcards: sampleName=helloBis
    resources: tmpdir=/tmp
[Thu Mar  3 16:44:09 2022]
Finished job 2.
2 of 3 steps (67%) done
Select jobs to execute...
[Thu Mar  3 16:44:09 2022]
localrule all:
    input: results/hello.txt, results/helloBis.txt
    jobid: 0
    resources: tmpdir=/tmp
[Thu Mar  3 16:44:09 2022]
Finished job 0.
3 of 3 steps (100%) done
Complete log: /g/romebioinfo/tmp/snakemake-profile-demo/.snakemake/log/2022-03-03T164408.691275.snakemake.log

This is quite different from what we got before. First look at the table at the beginning (on the left below) and compare it with the previous one (on the right below):

job             count    min threads    max threads      job             count    min threads    max threads
------------  -------  -------------  -------------      ------------  -------  -------------  -------------
all                 1              1              1
printContent        2              1              1      printContent        1              1              1
total               3              1              1      total               1              1              1

First, you can see that all is considered a job (the job that verifies that all the files are produced). Secondly, printContent has 2 jobs instead of 1. This is the key to snakemake: it processed each file (inputs/hello.txt and inputs/helloBis.txt) in a separate job. This is what is called parallelization! In the next post, we will see how to submit the jobs to a cluster. Once the jobs are all submitted to your HPC, inputs/hello.txt and inputs/helloBis.txt can be processed at the same time, i.e. in parallel. One immediately understands the advantage of using a workflow manager such as snakemake: when you have hundreds of files, it saves you a lot of time! Note also that snakemake outputs a summary of each rule that is processed by the workflow:

rule printContent:
    input: inputs/hello.txt
    output: results/hello.txt
    jobid: 1
    wildcards: sampleName=hello
    resources: tmpdir=/tmp

This enables you to verify your parameters, which will be important in the next posts. Finally, when a job has finished, in other words when the output file of a rule has been produced, snakemake indicates the jobid (which you can find in the above summary) and the number of remaining jobs:

[Thu Mar  3 16:44:09 2022]
Finished job 1.
1 of 3 steps (33%) done
Select jobs to execute...
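To make the idea of parallelization concrete: with --cores=2, snakemake could run the two printContent jobs concurrently, much like backgrounded jobs in plain shell (a sketch for illustration only, not something you need to type):

```shell
#!/usr/bin/bash
# Recreate the toy inputs
mkdir -p inputs results
echo "toto" > inputs/hello.txt
echo "totoBis" > inputs/helloBis.txt
# Launch both copies at the same time (& runs each command in the background),
# then wait for both to finish -- this is what "in parallel" means here
cat inputs/hello.txt > results/hello.txt &
cat inputs/helloBis.txt > results/helloBis.txt &
wait
```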

Conclusion

As you can imagine, there is a lot more to say, but this first post only aims at introducing the concepts needed to understand the following ones. Next week, we are going to talk about snakemake options (the ones you can give on the command line when you type snakemake --snakefile snakeFile --cores=1). To get an idea of the number of options available, run snakemake --help.

Stay tuned and see you next week!
