{"id":470,"date":"2022-05-09T21:08:39","date_gmt":"2022-05-09T21:08:39","guid":{"rendered":"https:\/\/www.embl.org\/groups\/bioinformatics-rome\/?p=470"},"modified":"2022-05-09T21:08:40","modified_gmt":"2022-05-09T21:08:40","slug":"snakemake-profile-5-handling-memory-and-timeout-errors","status":"publish","type":"post","link":"https:\/\/www.embl.org\/groups\/bioinformatics-rome\/blog\/2022\/05\/snakemake-profile-5-handling-memory-and-timeout-errors\/","title":{"rendered":"Snakemake profile &#8211; 5: Handling memory and timeout errors"},"content":{"rendered":"\n<div style=\"height:29px\" aria-hidden=\"true\" class=\"wp-block-spacer\"><\/div>\n\n\n\n<p>In the previous posts, we saw how to&nbsp;<a rel=\"noreferrer noopener\" href=\"https:\/\/www.embl.org\/groups\/bioinformatics-rome\/blog\/2022\/03\/snakemake-profile-1-getting-started-with-snakemake\/\" target=\"_blank\">get started with snakemake<\/a>,&nbsp;<a rel=\"noreferrer noopener\" href=\"https:\/\/www.embl.org\/groups\/bioinformatics-rome\/blog\/2022\/03\/snakemake-profile-2-reducing-command-line-options-with-profile\/\" target=\"_blank\">reduce command-line options<\/a>, <a rel=\"noreferrer noopener\" href=\"https:\/\/www.embl.org\/groups\/bioinformatics-rome\/blog\/2022\/04\/snakemake-profile-3-cluster-submission-defining-parameters\/\" target=\"_blank\">submit your jobs to a cluster<\/a> and <a rel=\"noreferrer noopener\" href=\"https:\/\/www.embl.org\/groups\/bioinformatics-rome\/blog\/2022\/04\/snakemake-profile-4-defining-resources-and-threads\/\" target=\"_blank\">define resources and threads<\/a>. However, if one of your jobs fails because it uses more memory or time than requested, then with what was covered so far, snakemake will not be able to stop and display a proper error message. It will just hang. 
In this post, I will show how to correct this.<\/p>\n\n\n\n<div style=\"height:29px\" aria-hidden=\"true\" class=\"wp-block-spacer\"><\/div>\n\n\n\n<h2 class=\"wp-block-heading\">Create a new project<\/h2>\n\n\n\n<p>Create the following folder structure:<\/p>\n\n\n\n<pre class=\"wp-block-code\"><code>#!\/usr\/bin\/bash\n\n# Create the folder containing the files needed for this tutorial\nmkdir snakemake-profile-memoryTime\n\n# Enter the created folder\ncd snakemake-profile-memoryTime\n\n# Create an empty file that will contain the snakemake code\ntouch snakeFile\n\n# Create an empty folder to create a conda environment\n# This is done to make sure that you use the same snakemake version as I do\nmkdir envs\ntouch envs\/environment.yaml\n\n# Create an empty folder to create a profile\nmkdir profile\ntouch profile\/config.yaml\n<\/code><\/pre>\n\n\n\n<p>Copy the following content to <code>envs\/environment.yaml<\/code> (the indentations consist of two spaces):<\/p>\n\n\n\n<pre class=\"wp-block-code\"><code>channels:\n  - bioconda\ndependencies:\n  - snakemake-minimal=6.15.1\n<\/code><\/pre>\n\n\n\n<p>Then execute the following commands to create and use a conda environment containing snakemake v6.15.1:<\/p>\n\n\n\n<pre class=\"wp-block-code\"><code>#!\/usr\/bin\/bash\n\nconda env create -p envs\/smake --file envs\/environment.yaml\nconda activate envs\/smake\n<\/code><\/pre>\n\n\n\n<div style=\"height:29px\" aria-hidden=\"true\" class=\"wp-block-spacer\"><\/div>\n\n\n\n<h2 class=\"wp-block-heading\">Handling out-of-memory errors<\/h2>\n\n\n\n<p>Let&#8217;s create an <code>out_of_memory<\/code> rule that is sure to use more memory than requested. 
Copy the following content to <code>snakeFile<\/code>:<\/p>\n\n\n\n<pre class=\"wp-block-code\"><code>onstart:\n    print(\"##### Test out of memory and timeout #####\\n\") \n    print(\"\\t Creating jobs output subfolders...\\n\")\n    shell(\"mkdir -p jobs\/out_of_memory\")\n\nrule all:\n    input:\n        \"results\/big.txt\"\n\nrule out_of_memory:\n    output:\n        \"results\/big.txt\",\n    threads: 1\n    shell:\n        \"\"\"\n        for i in `seq 1000000`; do echo $i; done | sort -n | tail &gt; {output}\n        \"\"\"\n<\/code><\/pre>\n\n\n\n<p>As you can see, there is no <code>input<\/code> in the <code>out_of_memory<\/code> rule. Do not worry, this is not a problem. You will face many cases where you need to create a rule before any files are present in your project. For instance, you may need to process and analyze public data that you retrieve from the web. The <code>out_of_memory<\/code> rule was taken from this <a href=\"https:\/\/github.com\/jdblischak\/smk-simple-slurm\/tree\/main\/examples\/out-of-memory\" target=\"_blank\" rel=\"noreferrer noopener\">example<\/a>. This rule attempts to sort a large sequence of integers (here, the numbers 1 to 1,000,000). 
<\/p>\n\n\n\n<p>Copy the following content to <code>profile\/config.yaml<\/code>:<\/p>\n\n\n\n<pre class=\"wp-block-code\"><code>---\n\nsnakefile: snakeFile\n\nlatency-wait: 60\nreason: True\nshow-failed-logs: True\nkeep-going: True\nprintshellcmds: True\n\n# Cluster submission\njobname: \"{rule}.{jobid}\"              # Provide a custom name for the jobscript that is submitted to the cluster.\nmax-jobs-per-second: 1                 #Maximal number of cluster\/drmaa jobs per second, default is 10, fractions allowed.\nmax-status-checks-per-second: 10       #Maximal number of job status checks per second, default is 10\njobs: 400                              #Use at most N CPU cluster\/cloud jobs in parallel.\ncluster: \"sbatch --output=\\\"jobs\/{rule}\/slurm_%x_%j.out\\\" --error=\\\"jobs\/{rule}\/slurm_%x_%j.log\\\" --mem={resources.mem_mb} --time={resources.runtime}\"\n\n# Job resources\nset-resources:\n  - out_of_memory:mem_mb=50\n  - out_of_memory:runtime=00:03:00\n    \n# For some reasons time needs quotes to be read by snakemake\ndefault-resources:\n  - mem_mb=500\n  - runtime=\"00:01:00\"\n  \n# Define the number of threads used by rules\nset-threads:\n  - out_of_memory=1\n<\/code><\/pre>\n\n\n\n<p>The <code>out_of_memory<\/code> rule needs about 100 MB of memory, but we request only 50 MB for it in <code>set-resources<\/code>. It will therefore fail with an out-of-memory error from Slurm (OUT_OF_MEMORY). Snakemake should properly detect this error and shut down. Note that, if you followed the previous tutorials, I removed the <code>rerun-incomplete<\/code> and <code>restart-times<\/code> options from the profile, since we know that <code>out_of_memory<\/code> will trigger an error. 
Perform a <em>real<\/em> run:<\/p>\n\n\n\n<pre class=\"wp-block-code\"><code>#!\/usr\/bin\/bash\n\nsnakemake --profile profile\/\n<\/code><\/pre>\n\n\n\n<p>Verify the content of the log file:<\/p>\n\n\n\n<pre class=\"wp-block-code\"><code>#!\/usr\/bin\/bash\n\nmore jobs\/out_of_memory\/slurm_*log\n<\/code><\/pre>\n\n\n\n<p>At the end of the log file, you should see that slurm correctly handled the error:<\/p>\n\n\n\n<pre class=\"wp-block-code\"><code>slurmstepd: error: Detected 4 oom-kill event(s) in StepId=35290704.batch cgroup. Some of your processes may have been killed by the cgroup out-of-memory handler.\n<\/code><\/pre>\n\n\n\n<div style=\"height:29px\" aria-hidden=\"true\" class=\"wp-block-spacer\"><\/div>\n\n\n\n<h2 class=\"wp-block-heading\">Create a rule for the time-out errors<\/h2>\n\n\n\n<p>Let&#8217;s now add a second rule to our <code>snakeFile<\/code> to test the out of time error:<\/p>\n\n\n\n<pre class=\"wp-block-code\"><code>onstart:\n    print(\"##### Test out of memory and timeout #####\\n\") \n    print(\"\\t Creating jobs output subfolders...\\n\")\n    shell(\"mkdir -p jobs\/out_of_memory\")\n    shell(\"mkdir -p jobs\/out_of_time\")\n\nrule all:\n    input:\n        \"results\/big.txt\",\n        \"results\/time.txt\"\n\nrule out_of_memory:\n    output:\n        \"results\/big.txt\",\n    threads: 1\n    shell:\n        \"\"\"\n        for i in `seq 1000000`; do echo $i; done | sort -n | tail &gt; {output}\n        \"\"\"\n\nrule out_of_time:\n    output:\n        \"results\/time.txt\",\n    threads: 1\n    shell:\n        \"\"\"\n        sleep 100s\n        echo \"hello\" &gt; {output}\n        \"\"\"\n<\/code><\/pre>\n\n\n\n<p>Do not forget to fill the <code>onstart<\/code> and <code>rule all<\/code> sections appropriately. 
Add the corresponding resources to <code>profile\/config.yaml<\/code>:<\/p>\n\n\n\n<pre class=\"wp-block-code\"><code>---\n\nsnakefile: snakeFile\n\nlatency-wait: 60\nreason: True\nshow-failed-logs: True\nkeep-going: True\nprintshellcmds: True\n\n# Cluster submission\njobname: \"{rule}.{jobid}\"              # Provide a custom name for the jobscript that is submitted to the cluster.\nmax-jobs-per-second: 1                 #Maximal number of cluster\/drmaa jobs per second, default is 10, fractions allowed.\nmax-status-checks-per-second: 10       #Maximal number of job status checks per second, default is 10\njobs: 400                              #Use at most N CPU cluster\/cloud jobs in parallel.\ncluster: \"sbatch --output=\\\"jobs\/{rule}\/slurm_%x_%j.out\\\" --error=\\\"jobs\/{rule}\/slurm_%x_%j.log\\\" --mem={resources.mem_mb} --time={resources.runtime}\"\n\n# Job resources\nset-resources:\n  - out_of_memory:mem_mb=50\n  - out_of_memory:runtime=00:03:00\n  - out_of_time:mem_mb=100\n  - out_of_time:runtime=00:00:10\n    \n# For some reasons time needs quotes to be read by snakemake\ndefault-resources:\n  - mem_mb=500\n  - runtime=\"00:01:00\"\n  \n# Define the number of threads used by rules\nset-threads:\n  - out_of_memory=1\n  - out_of_time=1\n<\/code><\/pre>\n\n\n\n<div style=\"height:29px\" aria-hidden=\"true\" class=\"wp-block-spacer\"><\/div>\n\n\n\n<h2 class=\"wp-block-heading\">Run the pipeline<\/h2>\n\n\n\n<p>The <code>out_of_time<\/code> rule sleeps for 100 seconds but requests only 10 seconds on the cluster. Let&#8217;s see if snakemake can catch the error:<\/p>\n\n\n\n<pre class=\"wp-block-code\"><code>#!\/usr\/bin\/bash\n\nrm -r jobs\/\nrm -r results\/\nsnakemake --profile profile\/\n<\/code><\/pre>\n\n\n\n<p>You will see that, even after all your jobs have finished running, snakemake does not catch the &#8220;out-of-time&#8221; error and <strong>does not stop<\/strong>. Use &#8220;Ctrl+c&#8221; to stop snakemake. 
However, if you check the log of the <code>out_of_time<\/code> rule, there is an error:<\/p>\n\n\n\n<pre class=\"wp-block-code\"><code>#!\/usr\/bin\/bash\n\nmore jobs\/out_of_time\/slurm*log\n<\/code><\/pre>\n\n\n\n<p>You should see at the end of the log:<\/p>\n\n\n\n<pre class=\"wp-block-code\"><code>slurmstepd: error: *** JOB 40212447 ON smer30-4 CANCELLED AT 2022-05-09T22:41:41 DUE TO TIME LIMIT ***<\/code><\/pre>\n\n\n\n<div style=\"height:29px\" aria-hidden=\"true\" class=\"wp-block-spacer\"><\/div>\n\n\n\n<h2 class=\"wp-block-heading\">Handling time-out-errors<\/h2>\n\n\n\n<p>In order for snakemake to display the error and stop, we have to add a script to our profile. Create the file <code>profile\/status-sacct.sh<\/code> and add the following content:<\/p>\n\n\n\n<pre class=\"wp-block-code\"><code>#!\/usr\/bin\/env bash\n\n# Check status of Slurm job\n\njobid=\"$1\"\n\nif &#91;&#91; \"$jobid\" == Submitted ]]\nthen\n  echo smk-simple-slurm: Invalid job ID: \"$jobid\" &gt;&amp;2\n  echo smk-simple-slurm: Did you remember to add the flag --parsable to your sbatch call? &gt;&amp;2\n  exit 1\nfi\n\noutput=`sacct -j \"$jobid\" --format State --noheader | head -n 1 | awk '{print $1}'`\n\nif &#91;&#91; $output =~ ^(COMPLETED).* ]]\nthen\n  echo success\nelif &#91;&#91; $output =~ ^(RUNNING|PENDING|COMPLETING|CONFIGURING|SUSPENDED).* ]]\nthen\n  echo running\nelse\n  echo failed\nfi\n<\/code><\/pre>\n\n\n\n<p>If your cluster does not have <code>sacct<\/code> installed, you can ask your admin or look for alternatives <a href=\"https:\/\/github.com\/jdblischak\/smk-simple-slurm\/tree\/main\/examples\/timeout\" target=\"_blank\" rel=\"noreferrer noopener\">here<\/a>. 
<strong>Important<\/strong>: you need to make <code>status-sacct.sh<\/code> executable:<\/p>\n\n\n\n<pre class=\"wp-block-code\"><code>#!\/usr\/bin\/bash\n\nrm -r jobs\/\nrm -r results\/\nchmod +x profile\/status-sacct.sh\nsnakemake --profile profile\/\n<\/code><\/pre>\n\n\n\n<p>Snakemake still fails to detect the error. This is because the profile does not yet reference the <code>profile\/status-sacct.sh<\/code> script, so the script is never called and cannot &#8220;communicate&#8221; with the cluster. The solution is to add the <code>--parsable<\/code> option to the <code>cluster<\/code> section, so that <code>sbatch<\/code> prints the job ID in a machine-readable form, and a new <code>cluster-status: \".\/profile\/status-sacct.sh\"<\/code> section to <code>profile\/config.yaml<\/code>:<\/p>\n\n\n\n<pre class=\"wp-block-code\"><code>---\n\nsnakefile: snakeFile\n\nlatency-wait: 60\nreason: True\nshow-failed-logs: True\nkeep-going: True\nprintshellcmds: True\n\n# Cluster submission\njobname: \"{rule}.{jobid}\"              # Provide a custom name for the jobscript that is submitted to the cluster.\nmax-jobs-per-second: 1                 #Maximal number of cluster\/drmaa jobs per second, default is 10, fractions allowed.\nmax-status-checks-per-second: 10       #Maximal number of job status checks per second, default is 10\njobs: 400                              #Use at most N CPU cluster\/cloud jobs in parallel.\ncluster: \"sbatch --output=\\\"jobs\/{rule}\/slurm_%x_%j.out\\\" --error=\\\"jobs\/{rule}\/slurm_%x_%j.log\\\" --mem={resources.mem_mb} --time={resources.runtime} --parsable\"\ncluster-status: \".\/profile\/status-sacct.sh\" #  Use to handle timeout exception, do not forget to chmod +x\n\n\n# Job resources\nset-resources:\n  - out_of_memory:mem_mb=50\n  - out_of_memory:runtime=00:03:00\n  - out_of_time:mem_mb=100\n  - out_of_time:runtime=00:00:10\n    \n# For some reasons time needs quotes to be read by snakemake\ndefault-resources:\n  - mem_mb=500\n  - runtime=\"00:01:00\"\n  \n# Define the number of threads used by 
rules\nset-threads:\n  - out_of_memory=1\n  - out_of_time=1\n<\/code><\/pre>\n\n\n\n<p>Run the following commands:<\/p>\n\n\n\n<pre class=\"wp-block-code\"><code>#!\/usr\/bin\/bash\n\nrm -r jobs\/\nrm -r results\/\nsnakemake --profile profile\/\n<\/code><\/pre>\n\n\n\n<p>In your terminal, you should see:<\/p>\n\n\n\n<pre class=\"wp-block-code\"><code>Building DAG of jobs...\nUsing shell: \/usr\/bin\/bash\nProvided cluster nodes: 400\nJob stats:\njob              count    min threads    max threads\n-------------  -------  -------------  -------------\nall                  1              1              1\nout_of_memory        1              1              1\nout_of_time          1              1              1\ntotal                3              1              1\n\n##### Test out of memory and timeout #####\n\n         Creating jobs output subfolders...\n\nmkdir -p jobs\/out_of_memory\nmkdir -p jobs\/out_of_time\nSelect jobs to execute...\n\n&#91;Mon May  9 22:53:01 2022]\nrule out_of_time:\n    output: results\/time.txt\n    jobid: 2\n    reason: Missing output files: results\/time.txt\n    resources: mem_mb=100, disk_mb=1000, tmpdir=\/tmp, runtime=00:00:10\n\n\n        sleep 100s\n        echo \"hello\" &gt; results\/time.txt\n\nSubmitted job 2 with external jobid '40212727'.\n\n&#91;Mon May  9 22:53:01 2022]\nrule out_of_memory:\n    output: results\/big.txt\n    jobid: 1\n    reason: Missing output files: results\/big.txt\n    resources: mem_mb=50, disk_mb=1000, tmpdir=\/tmp, runtime=00:03:00\n\n\n        for i in `seq 1000000`; do echo $i; done | sort -n | tail &gt; results\/big.txt\n\nSubmitted job 1 with external jobid '40212728'.\n&#91;Mon May  9 22:53:11 2022]\nError in rule out_of_memory:\n    jobid: 1\n    output: results\/big.txt\n    shell:\n\n        for i in `seq 1000000`; do echo $i; done | sort -n | tail &gt; results\/big.txt\n\n        (one of the commands exited with non-zero exit code; note that snakemake uses bash strict mode!)\n    
cluster_jobid: 40212728\n\nError executing rule out_of_memory on cluster (jobid: 1, external: 40212728, jobscript: \/g\/romebioinfo\/tmp\/snakemake-profile-memoryTime\/.snakemake\/tmp.cpiviakq\/out_of_memory.1). For error details see the cluster log and the log files of the involved rule(s).\nJob failed, going on with independent jobs.\n&#91;Mon May  9 22:54:22 2022]\nError in rule out_of_time:\n    jobid: 2\n    output: results\/time.txt\n    shell:\n\n        sleep 100s\n        echo \"hello\" &gt; results\/time.txt\n\n        (one of the commands exited with non-zero exit code; note that snakemake uses bash strict mode!)\n    cluster_jobid: 40212727\n\nError executing rule out_of_time on cluster (jobid: 2, external: 40212727, jobscript: \/g\/romebioinfo\/tmp\/snakemake-profile-memoryTime\/.snakemake\/tmp.cpiviakq\/out_of_time.2). For error details see the cluster log and the log files of the involved rule(s).\nJob failed, going on with independent jobs.\nExiting because a job execution failed. Look above for error message\nComplete log: \/g\/romebioinfo\/tmp\/snakemake-profile-memoryTime\/.snakemake\/log\/2022-05-09T225300.853803.snakemake.log\n\n<\/code><\/pre>\n\n\n\n<div style=\"height:29px\" aria-hidden=\"true\" class=\"wp-block-spacer\"><\/div>\n\n\n\n<h2 class=\"wp-block-heading\">Conclusion<\/h2>\n\n\n\n<p>You now know how to handle memory and time errors. In the next post, we will see how to use singularity. Stay tuned!<\/p>\n","protected":false},"excerpt":{"rendered":"<p>In the previous posts, we saw how to&nbsp;get started with snakemake,&nbsp;reduce command-line options, submit your jobs to a cluster and define resources and threads. 
However if one of your jobs fails because it uses more memory or time than requested, with what was covered so far, snakemake will&hellip;<\/p>\n","protected":false},"author":5,"featured_media":502,"comment_status":"closed","ping_status":"closed","sticky":false,"template":"","format":"standard","meta":{"_acf_changed":false,"footnotes":""},"categories":[4096],"tags":[4100,4098],"embl_taxonomy":[],"class_list":["post-470","post","type-post","status-publish","format-standard","has-post-thumbnail","hentry","category-technical","tag-profile","tag-snakemake"],"acf":[],"embl_taxonomy_terms":[],"featured_image_src":"https:\/\/www.embl.org\/groups\/bioinformatics-rome\/wp-content\/uploads\/2022\/05\/Botticelli.jpg","_links":{"self":[{"href":"https:\/\/www.embl.org\/groups\/bioinformatics-rome\/wp-json\/wp\/v2\/posts\/470"}],"collection":[{"href":"https:\/\/www.embl.org\/groups\/bioinformatics-rome\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/www.embl.org\/groups\/bioinformatics-rome\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/www.embl.org\/groups\/bioinformatics-rome\/wp-json\/wp\/v2\/users\/5"}],"replies":[{"embeddable":true,"href":"https:\/\/www.embl.org\/groups\/bioinformatics-rome\/wp-json\/wp\/v2\/comments?post=470"}],"version-history":[{"count":12,"href":"https:\/\/www.embl.org\/groups\/bioinformatics-rome\/wp-json\/wp\/v2\/posts\/470\/revisions"}],"predecessor-version":[{"id":504,"href":"https:\/\/www.embl.org\/groups\/bioinformatics-rome\/wp-json\/wp\/v2\/posts\/470\/revisions\/504"}],"wp:featuredmedia":[{"embeddable":true,"href":"https:\/\/www.embl.org\/groups\/bioinformatics-rome\/wp-json\/wp\/v2\/media\/502"}],"wp:attachment":[{"href":"https:\/\/www.embl.org\/groups\/bioinformatics-rome\/wp-json\/wp\/v2\/media?parent=470"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/www.embl.org\/groups\/bioinformatics-rome\/wp-json\/wp\/v2\/categories?post=470"},{"taxonomy":"post_tag","embeddable":true,"href
":"https:\/\/www.embl.org\/groups\/bioinformatics-rome\/wp-json\/wp\/v2\/tags?post=470"},{"taxonomy":"embl_taxonomy","embeddable":true,"href":"https:\/\/www.embl.org\/groups\/bioinformatics-rome\/wp-json\/wp\/v2\/embl_taxonomy?post=470"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}