Snakemake profile – 2: Reducing command-line options with profile

[Image: Profile of a woman with blue hair, by Pablo Picasso]

A profile is a folder that contains all the configuration parameters needed to successfully run your pipeline. Of note, if you have used a cluster.json file before, be aware that this mechanism has been deprecated in favour of profiles.
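
To give you an idea of where we are heading, here is a rough sketch of what the demo project will look like by the end of this post (the folder and file names are the ones used in this and the previous post):

snakemake-profile-demo/
├── snakeFile              # the pipeline rules
├── envs/
│   └── environment.yaml   # pinned snakemake version
├── inputs/                # toy input files
├── profile/
│   └── config.yaml        # one entry per snakemake option
└── results/               # created when the pipeline runs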

Preparation of files (if you skipped the first post)

Run the following script to create the folder structure:

#!/usr/bin/bash
# Create the folder containing the files needed for this tutorial
mkdir snakemake-profile-demo
# Enter the created folder
cd snakemake-profile-demo
# Create an empty file containing the snakemake code
touch snakeFile
# Create toy input files
mkdir inputs
echo "toto" > inputs/hello.txt
echo "totoBis" > inputs/helloBis.txt
# Create a folder that will hold the conda environment definition file
# This is done to make sure that you use the same snakemake version as I do
mkdir envs
touch envs/environment.yaml

Copy the following content to snakeFile:

rule all:
  input:
    expand("results/{sampleName}.txt", sampleName=["hello", "helloBis"])
rule printContent:
  input:
    "inputs/{sampleName}.txt"
  output:
    "results/{sampleName}.txt"
  shell:
    """
    cat {input} > {output}
    """

Copy the following content to envs/environment.yaml:

channels:
  - bioconda
dependencies:
  - snakemake-minimal=6.15.1

Create and activate the conda environment:

#!/usr/bin/bash
conda env create -p envs/smake --file envs/environment.yaml
conda activate envs/smake
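
To double-check that the activated environment provides the version pinned in envs/environment.yaml, you can print it:

#!/usr/bin/bash
# Print the snakemake version; it should match the pinned 6.15.1
snakemake --version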

Test the pipeline:

#!/usr/bin/bash
snakemake --snakefile snakeFile --cores=1
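
If everything went well, a results/ folder was created and each output file simply mirrors its input:

#!/usr/bin/bash
# Each result should contain the content of the matching input file
cat results/hello.txt     # prints "toto"
cat results/helloBis.txt  # prints "totoBis"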

Snakemake options

In this section, I am going to detail the process of creating a profile. The complexity will increase progressively, and we will need to add rules to the snakeFile along the way. First, create a config.yaml inside a profile folder:

#!/usr/bin/bash
# Create the folder containing the configuration file, it can be named differently
mkdir profile
# Create a config.yaml that will contain all the configuration parameters
touch profile/config.yaml

The first thing we are going to do is define some general snakemake parameters. To get a complete list of them, try snakemake --help. The choice of parameters is subjective and depends on what you want to achieve; however, I find the ones below pretty useful on a daily basis. Let’s start with the parameters that we have already used. Add the following content to profile/config.yaml:

---
snakefile: snakeFile
cores: 1

The --- at the beginning of the file marks the start of a YAML document. It is not mandatory in our case; it is just a convention. Now delete the results folder and run snakemake again:

#!/usr/bin/bash
# Delete the results/ folder if present
rm -r results/
# Run snakemake with a dry run mode (option -n)
snakemake --profile profile/ -n

A dry run means that the snakemake pipeline will be evaluated but no files will be produced. You should obtain:

Building DAG of jobs...
Job stats:
job             count    min threads    max threads
------------  -------  -------------  -------------
all                 1              1              1
printContent        2              1              1
total               3              1              1
[Fri Mar  4 08:44:12 2022]
rule printContent:
    input: inputs/helloBis.txt
    output: results/helloBis.txt
    jobid: 2
    wildcards: sampleName=helloBis
    resources: tmpdir=/tmp
[Fri Mar  4 08:44:12 2022]
rule printContent:
    input: inputs/hello.txt
    output: results/hello.txt
    jobid: 1
    wildcards: sampleName=hello
    resources: tmpdir=/tmp
[Fri Mar  4 08:44:12 2022]
localrule all:
    input: results/hello.txt, results/helloBis.txt
    jobid: 0
    resources: tmpdir=/tmp
Job stats:
job             count    min threads    max threads
------------  -------  -------------  -------------
all                 1              1              1
printContent        2              1              1
total               3              1              1
This was a dry-run (flag -n). The order of jobs does not reflect the order of execution.

As you can see, we were able to reduce the snakemake call from snakemake --snakefile snakeFile --cores=1 to snakemake --profile profile/. The profile thus lets you define all the snakemake options (and more) in one place.
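
Note that flags given on the command line still work alongside the profile: we keep passing -n by hand, and to the best of my knowledge an explicit option such as --cores takes precedence over the value stored in config.yaml:

#!/usr/bin/bash
# The profile provides the defaults; explicit flags can still be added on top
snakemake --profile profile/ --cores 2 -n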

Let’s now add more options to profile/config.yaml:

---
snakefile: snakeFile
cores: 1
latency-wait: 60
reason: True
show-failed-logs: True
keep-going: True
printshellcmds: True
rerun-incomplete: True
restart-times: 3

latency-wait is useful because your file system can sometimes be “slower” than snakemake: even though an output file has been created, snakemake might not see it yet. The default value is 5 seconds; I usually set it to 60.

If a job fails, for whatever reason, it is possible to ask snakemake to try it again by setting restart-times (here, up to 3 retries); rerun-incomplete asks snakemake to re-run any job whose output was left incomplete, for instance after an interruption. A job is run either because the file it produces does not exist yet, or because a file on which the job depends (i.e. an input file) has not been created yet either. Indeed, the point of using snakemake is to write pipelines: you design a series of jobs that depend on one another.

You can see why a job is triggered by setting reason to True. show-failed-logs will display the logs of failed jobs. keep-going tells snakemake to continue with independent jobs if one fails; in other words, snakemake will run as many jobs as it can before terminating the pipeline. printshellcmds will print the code that you wrote in the shell section of your rules. Finally, with experience, you will notice that even if you define the resources needed for each job well (covered in the next post), the process can be prone to hiccups. By setting rerun-incomplete and restart-times, you minimize the chance of your pipeline failing, even when it is well coded.
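
If you want to see keep-going and restart-times in action, one possibility (a sketch for illustration only, not part of the tutorial pipeline) is to append a rule that always fails to snakeFile and add results/failing.txt to the inputs of rule all: the printContent jobs, which do not depend on it, still complete, and snakemake retries the failing job three times before giving up.

# Hypothetical rule, for illustration only: it always exits with an error
rule alwaysFails:
  output:
    "results/failing.txt"
  shell:
    """
    echo "this rule is meant to fail" && exit 1
    """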

Now replace the content of profile/config.yaml with the expanded configuration above and perform a dry run:

#!/usr/bin/bash
# Run snakemake with a dry run mode (option -n)
snakemake --profile profile/ -n

You can see below that the cat instruction now appears in your terminal, with the sampleName wildcard replaced by the actual values:

Building DAG of jobs...
Job stats:
job             count    min threads    max threads
------------  -------  -------------  -------------
all                 1              1              1
printContent        2              1              1
total               3              1              1
[Fri Mar  4 09:34:13 2022]
rule printContent:
    input: inputs/helloBis.txt
    output: results/helloBis.txt
    jobid: 2
    reason: Missing output files: results/helloBis.txt
    wildcards: sampleName=helloBis
    resources: tmpdir=/tmp
    cat inputs/helloBis.txt > results/helloBis.txt
[Fri Mar  4 09:34:13 2022]
rule printContent:
    input: inputs/hello.txt
    output: results/hello.txt
    jobid: 1
    reason: Missing output files: results/hello.txt
    wildcards: sampleName=hello
    resources: tmpdir=/tmp
    cat inputs/hello.txt > results/hello.txt
[Fri Mar  4 09:34:13 2022]
localrule all:
    input: results/hello.txt, results/helloBis.txt
    jobid: 0
    reason: Input files updated by another job: results/helloBis.txt, results/hello.txt
    resources: tmpdir=/tmp
Job stats:
job             count    min threads    max threads
------------  -------  -------------  -------------
all                 1              1              1
printContent        2              1              1
total               3              1              1
This was a dry-run (flag -n). The order of jobs does not reflect the order of execution.

Overall, we reduced the snakemake command from snakemake --snakefile snakeFile --cores 1 --latency-wait 60 --restart-times 3 --rerun-incomplete --reason --show-failed-logs --keep-going --printshellcmds to the much shorter call snakemake --profile profile/.
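
As a side note, if I recall correctly, snakemake also searches for profiles by name in ~/.config/snakemake/ (and /etc/xdg/snakemake), so a profile you reuse across projects can be installed there once and referenced without a path; the folder name myProfile below is arbitrary:

#!/usr/bin/bash
# Hypothetical global install of the profile; "myProfile" is an arbitrary name
mkdir -p ~/.config/snakemake/myProfile
cp profile/config.yaml ~/.config/snakemake/myProfile/
# Snakemake looks the name up in ~/.config/snakemake/ when it is not a local path
snakemake --profile myProfile -n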

Next week, we will see how to submit your jobs to a cluster. Stay tuned! (Next post)
