Snakemake profile – 5: Handling memory and timeout errors – Bioinformatics Services



In the previous posts, we saw how to get started with snakemake, reduce command-line options, submit your jobs to a cluster, and define resources and threads. However, with what was covered so far, if one of your jobs fails because it uses more memory or time than requested, snakemake will not be able to stop and display a proper error message. It will just hang. In this post, I will show how to correct this.

Create a new project

Create the following folder structure:

#!/usr/bin/bash
# Create the folder containing the files needed for this tutorial
mkdir snakemake-profile-memoryTime
# Enter the created folder
cd snakemake-profile-memoryTime
# Create an empty file containing the snakemake code
touch snakeFile
# Create a folder and an empty file for the conda environment definition
# This is done to make sure that you use the same snakemake version as I do
mkdir envs
touch envs/environment.yaml
# Create a folder and an empty file for the profile
mkdir profile
touch profile/config.yaml

Copy the following content to envs/environment.yaml (the indentations consist of two spaces):

channels:
  - bioconda
dependencies:
  - snakemake-minimal=6.15.1

Then execute the following commands to create and use a conda environment containing snakemake v6.15.1:

#!/usr/bin/bash
conda env create -p envs/smake --file envs/environment.yaml
conda activate envs/smake

Handling out-of-memory errors

Let’s create an out_of_memory rule for which we are sure to use more memory than requested. Copy the following content to snakeFile:

onstart:
    print("##### Test out of memory and timeout #####\n") 
    print("\t Creating jobs output subfolders...\n")
    shell("mkdir -p jobs/out_of_memory")
rule all:
    input:
        "results/big.txt"
rule out_of_memory:
    output:
        "results/big.txt",
    threads: 1
    shell:
        """
        for i in `seq 1000000`; do echo $i; done | sort -n | tail > {output}
        """

As you can see, there is no input in the out_of_memory rule. Do not worry, this is not a problem. You will face many cases where you need to create a rule before any files are present in your project, for instance when you process and analyze public data retrieved from the web. The out_of_memory rule generates a large sequence of integers and sorts it; since sort buffers its input in memory, the job can easily exceed a small memory request.
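At a much smaller (and faster) scale, the rule's command can be reproduced directly in a terminal; the `for` loop around `seq` is equivalent to a plain `seq` call, and the full-size run simply makes `sort` buffer far more lines:

```shell
#!/usr/bin/bash
# Miniature version of the out_of_memory command: generate integers,
# sort them numerically and keep the largest ones
seq 100000 | sort -n | tail -n 3
# prints:
# 99998
# 99999
# 100000
```

With `seq 1000000` and a 50 MB memory cap, the buffering step inside `sort` is what triggers the OOM kill.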

Copy the following content to profile/config.yaml:

---
snakefile: snakeFile
latency-wait: 60
reason: True
show-failed-logs: True
keep-going: True
printshellcmds: True
# Cluster submission
jobname: "{rule}.{jobid}"              # Provide a custom name for the jobscript that is submitted to the cluster.
max-jobs-per-second: 1                 #Maximal number of cluster/drmaa jobs per second, default is 10, fractions allowed.
max-status-checks-per-second: 10       #Maximal number of job status checks per second, default is 10
jobs: 400                              #Use at most N CPU cluster/cloud jobs in parallel.
cluster: "sbatch --output=\"jobs/{rule}/slurm_%x_%j.out\" --error=\"jobs/{rule}/slurm_%x_%j.log\" --mem={resources.mem_mb} --time={resources.runtime}"
# Job resources
set-resources:
  - out_of_memory:mem_mb=50
  - out_of_memory:runtime=00:03:00
# For some reason, runtime values need quotes to be read by snakemake
default-resources:
  - mem_mb=500
  - runtime="00:01:00"
# Define the number of threads used by rules
set-threads:
  - out_of_memory=1

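The runtime values in the profile use Slurm's HH:MM:SS format. As a quick sanity check, a small bash sketch (with an illustrative value) can convert such a string to seconds:

```shell
#!/usr/bin/bash
# Convert a Slurm HH:MM:SS time string to seconds
t="00:03:00"
IFS=: read -r h m s <<< "$t"
# 10# forces base 10 so leading zeros are not parsed as octal
echo $(( 10#$h * 3600 + 10#$m * 60 + 10#$s ))   # prints 180
```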
The out_of_memory rule needs more memory than the 50 MB we grant it in set-resources, so it will fail with an out-of-memory error from Slurm (OUT_OF_MEMORY). Snakemake should properly detect this error and shut down. Of note, since we know that out_of_memory will trigger an error, I removed the rerun-incomplete and restart-times options used in the previous tutorials from the profile. Perform a real run:

#!/usr/bin/bash
snakemake --profile profile/

Verify the content of the log file:

#!/usr/bin/bash
more jobs/out_of_memory/slurm_*log

At the end of the log file, you should see that Slurm correctly detected the error:

slurmstepd: error: Detected 4 oom-kill event(s) in StepId=35290704.batch cgroup. Some of your processes may have been killed by the cgroup out-of-memory handler.

Create a rule for timeout errors

Let’s now add a second rule to our snakeFile to test the out of time error:

onstart:
    print("##### Test out of memory and timeout #####\n") 
    print("\t Creating jobs output subfolders...\n")
    shell("mkdir -p jobs/out_of_memory")
    shell("mkdir -p jobs/out_of_time")
rule all:
    input:
        "results/big.txt",
        "results/time.txt"
rule out_of_memory:
    output:
        "results/big.txt",
    threads: 1
    shell:
        """
        for i in `seq 1000000`; do echo $i; done | sort -n | tail > {output}
        """
rule out_of_time:
    output:
        "results/time.txt",
    threads: 1
    shell:
        """
        sleep 100s
        echo "hello" > {output}
        """

Do not forget to update the onstart and rule all sections accordingly, as shown above. We then add the matching resources to profile/config.yaml:

---
snakefile: snakeFile
latency-wait: 60
reason: True
show-failed-logs: True
keep-going: True
printshellcmds: True
# Cluster submission
jobname: "{rule}.{jobid}"              # Provide a custom name for the jobscript that is submitted to the cluster.
max-jobs-per-second: 1                 #Maximal number of cluster/drmaa jobs per second, default is 10, fractions allowed.
max-status-checks-per-second: 10       #Maximal number of job status checks per second, default is 10
jobs: 400                              #Use at most N CPU cluster/cloud jobs in parallel.
cluster: "sbatch --output=\"jobs/{rule}/slurm_%x_%j.out\" --error=\"jobs/{rule}/slurm_%x_%j.log\" --mem={resources.mem_mb} --time={resources.runtime}"
# Job resources
set-resources:
  - out_of_memory:mem_mb=50
  - out_of_memory:runtime=00:03:00
  - out_of_time:mem_mb=100
  - out_of_time:runtime=00:00:10
# For some reason, runtime values need quotes to be read by snakemake
default-resources:
  - mem_mb=500
  - runtime="00:01:00"
# Define the number of threads used by rules
set-threads:
  - out_of_memory=1
  - out_of_time=1

Run the pipeline

The out_of_time rule sleeps for 100 seconds but requests only a 10-second time slot on the cluster. Let's see if snakemake can catch the error:

#!/usr/bin/bash
rm -r jobs/
rm -r results/
snakemake --profile profile/

You will see that even though all your jobs have finished running, snakemake does not catch the out-of-time error and does not stop. Use "Ctrl+c" to end snakemake. However, if you check the log of the out_of_time rule, there is an error:

#!/usr/bin/bash
more jobs/out_of_time/slurm*log

You should see at the end of the log:

slurmstepd: error: *** JOB 40212447 ON smer30-4 CANCELLED AT 2022-05-09T22:41:41 DUE TO TIME LIMIT ***

Handling timeout errors

In order for snakemake to display the error and stop, we have to add a script to our profile. Create the file profile/status-sacct.sh and add the following content:

#!/usr/bin/env bash
# Check status of Slurm job
jobid="$1"
if [[ "$jobid" == Submitted ]]
then
  echo smk-simple-slurm: Invalid job ID: "$jobid" >&2
  echo smk-simple-slurm: Did you remember to add the flag --parsable to your sbatch call? >&2
  exit 1
fi
output=`sacct -j "$jobid" --format State --noheader | head -n 1 | awk '{print $1}'`
if [[ $output =~ ^(COMPLETED).* ]]
then
  echo success
elif [[ $output =~ ^(RUNNING|PENDING|COMPLETING|CONFIGURING|SUSPENDED).* ]]
then
  echo running
else
  echo failed
fi
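The script maps the first word of sacct's State column onto the three outcomes snakemake understands: success, running, or failed. The branch logic can be exercised locally without a cluster by hard-coding a state value (TIMEOUT is what Slurm reports for a job killed by its time limit):

```shell
#!/usr/bin/env bash
# Simulate the state mapping of status-sacct.sh with a hard-coded State value
output="TIMEOUT"
if [[ $output =~ ^(COMPLETED).* ]]; then
  echo success
elif [[ $output =~ ^(RUNNING|PENDING|COMPLETING|CONFIGURING|SUSPENDED).* ]]; then
  echo running
else
  # TIMEOUT, OUT_OF_MEMORY, FAILED, CANCELLED, ... all fall through to here
  echo failed   # prints failed
fi
```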

If your cluster does not have sacct installed, ask your admin or look for an alternative status script. Important: you need to make status-sacct.sh executable:

#!/usr/bin/bash
rm -r jobs/
rm -r results/
chmod +x profile/status-sacct.sh
snakemake --profile profile/

snakemake still fails to detect the error, because nothing in the profile tells it to use profile/status-sacct.sh yet: the script is never called, and even if it were, sbatch does not return a bare job ID that the script could query. The solution is to add the --parsable option to the cluster command and a new cluster-status: "./profile/status-sacct.sh" entry to profile/config.yaml:

---
snakefile: snakeFile
latency-wait: 60
reason: True
show-failed-logs: True
keep-going: True
printshellcmds: True
# Cluster submission
jobname: "{rule}.{jobid}"              # Provide a custom name for the jobscript that is submitted to the cluster.
max-jobs-per-second: 1                 #Maximal number of cluster/drmaa jobs per second, default is 10, fractions allowed.
max-status-checks-per-second: 10       #Maximal number of job status checks per second, default is 10
jobs: 400                              #Use at most N CPU cluster/cloud jobs in parallel.
cluster: "sbatch --output=\"jobs/{rule}/slurm_%x_%j.out\" --error=\"jobs/{rule}/slurm_%x_%j.log\" --mem={resources.mem_mb} --time={resources.runtime} --parsable"
cluster-status: "./profile/status-sacct.sh" # Used to detect failures such as timeouts; do not forget to chmod +x
# Job resources
set-resources:
  - out_of_memory:mem_mb=50
  - out_of_memory:runtime=00:03:00
  - out_of_time:mem_mb=100
  - out_of_time:runtime=00:00:10
# For some reason, runtime values need quotes to be read by snakemake
default-resources:
  - mem_mb=500
  - runtime="00:01:00"
# Define the number of threads used by rules
set-threads:
  - out_of_memory=1
  - out_of_time=1
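To see why --parsable matters: without it, sbatch prints a full sentence such as "Submitted batch job 40212728", and that whole sentence becomes the external job ID snakemake hands to the status script, which then sees the word "Submitted" as its first argument (the case the script explicitly guards against). With --parsable, sbatch prints only the bare numeric ID. A small string-manipulation sketch (the job ID is illustrative):

```shell
#!/usr/bin/bash
msg="Submitted batch job 40212728"   # typical sbatch output without --parsable
echo "${msg%% *}"   # first word, what the status script would receive: Submitted
echo "${msg##* }"   # with --parsable, sbatch prints only this bare ID: 40212728
```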

Run the following commands:

#!/usr/bin/bash
rm -r jobs/
rm -r results/
snakemake --profile profile/

In your terminal, you should see:

Building DAG of jobs...
Using shell: /usr/bin/bash
Provided cluster nodes: 400
Job stats:
job              count    min threads    max threads
-------------  -------  -------------  -------------
all                  1              1              1
out_of_memory        1              1              1
out_of_time          1              1              1
total                3              1              1
##### Test out of memory and timeout #####
         Creating jobs output subfolders...
mkdir -p jobs/out_of_memory
mkdir -p jobs/out_of_time
Select jobs to execute...
[Mon May  9 22:53:01 2022]
rule out_of_time:
    output: results/time.txt
    jobid: 2
    reason: Missing output files: results/time.txt
    resources: mem_mb=100, disk_mb=1000, tmpdir=/tmp, runtime=00:00:10
        sleep 100s
        echo "hello" > results/time.txt
Submitted job 2 with external jobid '40212727'.
[Mon May  9 22:53:01 2022]
rule out_of_memory:
    output: results/big.txt
    jobid: 1
    reason: Missing output files: results/big.txt
    resources: mem_mb=50, disk_mb=1000, tmpdir=/tmp, runtime=00:03:00
        for i in `seq 1000000`; do echo $i; done | sort -n | tail > results/big.txt
Submitted job 1 with external jobid '40212728'.
[Mon May  9 22:53:11 2022]
Error in rule out_of_memory:
    jobid: 1
    output: results/big.txt
    shell:
        for i in `seq 1000000`; do echo $i; done | sort -n | tail > results/big.txt
        (one of the commands exited with non-zero exit code; note that snakemake uses bash strict mode!)
    cluster_jobid: 40212728
Error executing rule out_of_memory on cluster (jobid: 1, external: 40212728, jobscript: /g/romebioinfo/tmp/snakemake-profile-memoryTime/.snakemake/tmp.cpiviakq/out_of_memory.1). For error details see the cluster log and the log files of the involved rule(s).
Job failed, going on with independent jobs.
[Mon May  9 22:54:22 2022]
Error in rule out_of_time:
    jobid: 2
    output: results/time.txt
    shell:
        sleep 100s
        echo "hello" > results/time.txt
        (one of the commands exited with non-zero exit code; note that snakemake uses bash strict mode!)
    cluster_jobid: 40212727
Error executing rule out_of_time on cluster (jobid: 2, external: 40212727, jobscript: /g/romebioinfo/tmp/snakemake-profile-memoryTime/.snakemake/tmp.cpiviakq/out_of_time.2). For error details see the cluster log and the log files of the involved rule(s).
Job failed, going on with independent jobs.
Exiting because a job execution failed. Look above for error message
Complete log: /g/romebioinfo/tmp/snakemake-profile-memoryTime/.snakemake/log/2022-05-09T225300.853803.snakemake.log

Conclusion

You now know how to handle memory and timeout errors. In the next post, we will see how to use Singularity. Stay tuned!
