{"id":288,"date":"2022-04-07T10:04:43","date_gmt":"2022-04-07T10:04:43","guid":{"rendered":"https:\/\/www.embl.org\/groups\/bioinformatics-rome\/?p=288"},"modified":"2022-04-07T10:04:44","modified_gmt":"2022-04-07T10:04:44","slug":"snakemake-profile-3-cluster-submission-defining-parameters","status":"publish","type":"post","link":"https:\/\/www.embl.org\/groups\/bioinformatics-rome\/blog\/2022\/04\/snakemake-profile-3-cluster-submission-defining-parameters\/","title":{"rendered":"Snakemake profile &#8211; 3: Cluster submission &#8211; Defining parameters"},"content":{"rendered":"\n<div style=\"height:23px\" aria-hidden=\"true\" class=\"wp-block-spacer\"><\/div>\n\n\n\n<p>The power of snakemake lies in parallelization: on a cluster, jobs can be processed in parallel automatically. To make this work, you need to define how snakemake will handle the job submission process. If you already followed the first two posts (<a href=\"https:\/\/www.embl.org\/groups\/bioinformatics-rome\/blog\/2022\/03\/snakemake-profile-1-getting-started-with-snakemake\/\" target=\"_blank\" rel=\"noreferrer noopener\">1<\/a>,<a href=\"https:\/\/www.embl.org\/groups\/bioinformatics-rome\/blog\/2022\/03\/snakemake-profile-2-reducing-command-line-options-with-profile\/\" target=\"_blank\" rel=\"noreferrer noopener\">2<\/a>), you can skip the first section. <\/p>\n\n\n\n<div style=\"height:22px\" aria-hidden=\"true\" class=\"wp-block-spacer\"><\/div>\n\n\n\n<h2 class=\"wp-block-heading\">Preparation of files<\/h2>\n\n\n\n<p>For more details about the steps described in this section, see the previous posts. 
Run the following script to create the folder structure:<\/p>\n\n\n\n<pre class=\"wp-block-code\"><code>#!\/usr\/bin\/bash\n\n# Create the folder containing the files needed for this tutorial\nmkdir snakemake-profile-demo\n\n# Enter the created folder\ncd snakemake-profile-demo\n\n# Create an empty file that will contain the snakemake code\ntouch snakeFile\n\n# Create toy input files\nmkdir inputs\necho \"toto\" &gt; inputs\/hello.txt\necho \"totoBis\" &gt; inputs\/helloBis.txt\n\n# Create the folder containing the configuration file (it can be named differently)\nmkdir profile\n# Create a config.yaml that will contain all the configuration parameters\ntouch profile\/config.yaml\n\n# Create a folder for the conda environment\n# This ensures that you use the same snakemake version as I do\nmkdir envs\ntouch envs\/environment.yaml<\/code><\/pre>\n\n\n\n<p>Copy the following content to&nbsp;<code>snakeFile<\/code>:<\/p>\n\n\n\n<pre class=\"wp-block-code\"><code>rule all:\n  input:\n    expand(\"results\/{sampleName}.txt\", sampleName=&#91;\"hello\", \"helloBis\"])\n\nrule printContent:\n  input:\n    \"inputs\/{sampleName}.txt\"\n  output:\n    \"results\/{sampleName}.txt\"\n  shell:\n    \"\"\"\n    cat {input} &gt; {output}\n    \"\"\"<\/code><\/pre>\n\n\n\n<p>Copy the following content to&nbsp;<code>envs\/environment.yaml<\/code>:<\/p>\n\n\n\n<pre class=\"wp-block-code\"><code>channels:\n  - bioconda\ndependencies:\n  - snakemake-minimal=6.15.1<\/code><\/pre>\n\n\n\n<p>Copy the following content to&nbsp;<code>profile\/config.yaml<\/code>:<\/p>\n\n\n\n<pre class=\"wp-block-code\"><code>---\nsnakefile: snakeFile\ncores: 1\nlatency-wait: 60\nreason: True\nshow-failed-logs: True\nkeep-going: True\nprintshellcmds: True\nrerun-incomplete: True\nrestart-times: 3<\/code><\/pre>\n\n\n\n<p>Create and activate the conda environment:<\/p>\n\n\n\n<pre class=\"wp-block-code\"><code>#!\/usr\/bin\/bash\nconda env create -p envs\/smake --file envs\/environment.yaml\nconda 
activate envs\/smake<\/code><\/pre>\n\n\n\n<div style=\"height:22px\" aria-hidden=\"true\" class=\"wp-block-spacer\"><\/div>\n\n\n\n<h2 class=\"wp-block-heading\">Defining parameters<\/h2>\n\n\n\n<p>Add the <code>cluster submission<\/code> section at the bottom of <code>profile\/config.yaml<\/code>:<\/p>\n\n\n\n<pre class=\"wp-block-code\"><code>---\n\nsnakefile: snakeFile\ncores: 1\n\nlatency-wait: 60\nreason: True\nshow-failed-logs: True\nkeep-going: True\nprintshellcmds: True\nrerun-incomplete: True\nrestart-times: 3\n\n# Cluster submission\njobname: \"{rule}.{jobid}\"              # Provide a custom name for the jobscript that is submitted to the cluster.\nmax-jobs-per-second: 1                 # Maximal number of cluster\/drmaa jobs per second; default is 10, fractions allowed.\nmax-status-checks-per-second: 10       # Maximal number of job status checks per second; default is 10.\njobs: 400                              # Use at most N CPU cluster\/cloud jobs in parallel.\n<\/code><\/pre>\n\n\n\n<p><code>jobname<\/code> has the default value of &#8220;snakejob.{name}.{jobid}.sh&#8221;; I made it shorter in the code above. The last thing to do is to define how the cluster will handle the jobs. This is system-specific, and the choice of options is subjective.<\/p>\n\n\n\n<p>In this section, <strong>I will show how to define the options on a <a href=\"https:\/\/slurm.schedmd.com\/sbatch.html\" target=\"_blank\" rel=\"noreferrer noopener\">slurm<\/a> system<\/strong>. Please adapt the code to your system. For a complete list of options, check <code>sbatch --help<\/code>. 
A minimal setup would consist of: <\/p>\n\n\n\n<pre class=\"wp-block-code\"><code>cluster: \"sbatch --output=\\\"jobs\/{rule}\/slurm_%x_%j.out\\\" --error=\\\"jobs\/{rule}\/slurm_%x_%j.log\\\"\"<\/code><\/pre>\n\n\n\n<p>This instruction tells slurm to write the console output to a file such as &#8220;jobs\/printContent\/slurm_printContent.1_355014.out&#8221; and any errors to &#8220;jobs\/printContent\/slurm_printContent.1_355014.log&#8221;. The <code>{rule}<\/code> wildcard has been replaced by <code>printContent<\/code>; <code>%x<\/code> is a slurm variable corresponding to the job name (which we defined as &#8220;{rule}.{jobid}&#8221;); and <code>%j<\/code> is a slurm variable corresponding to the job ID assigned by the cluster. Add this line to <code>profile\/config.yaml<\/code>:<\/p>\n\n\n\n<pre class=\"wp-block-code\"><code>---\n\nsnakefile: snakeFile\ncores: 1\n\nlatency-wait: 60\nreason: True\nshow-failed-logs: True\nkeep-going: True\nprintshellcmds: True\nrerun-incomplete: True\nrestart-times: 3\n\n# Cluster submission\njobname: \"{rule}.{jobid}\"              # Provide a custom name for the jobscript that is submitted to the cluster.\nmax-jobs-per-second: 1                 # Maximal number of cluster\/drmaa jobs per second; default is 10, fractions allowed.\nmax-status-checks-per-second: 10       # Maximal number of job status checks per second; default is 10.\njobs: 400                              # Use at most N CPU cluster\/cloud jobs in parallel.\ncluster: \"sbatch --output=\\\"jobs\/{rule}\/slurm_%x_%j.out\\\" --error=\\\"jobs\/{rule}\/slurm_%x_%j.log\\\"\"\n<\/code><\/pre>\n\n\n\n<p>We need to create the <code>jobs\/{rule}<\/code> folders when snakemake runs. 
We can use an <code>onstart<\/code> section in <code>snakeFile<\/code> that will trigger instructions when the pipeline is loaded:<\/p>\n\n\n\n<pre class=\"wp-block-code\"><code>onstart:\n    print(\"##### Creating profile pipeline #####\\n\") \n    print(\"\\t Creating jobs output subfolders...\\n\")\n    shell(\"mkdir -p jobs\/printContent\")\n\nrule all:\n  input:\n    expand(\"results\/{sampleName}.txt\", sampleName=&#91;\"hello\", \"helloBis\"])\n\nrule printContent:\n  input:\n    \"inputs\/{sampleName}.txt\"\n  output:\n    \"results\/{sampleName}.txt\"\n  shell:\n    \"\"\"\n    cat {input} &gt; {output}\n    \"\"\"\n<\/code><\/pre>\n\n\n\n<p>First, perform a dry run to verify that everything works, and then run the pipeline <em>per se<\/em>:<\/p>\n\n\n\n<pre class=\"wp-block-code\"><code>#!\/usr\/bin\/bash\n\n# If you have not already, activate the environment\nconda activate envs\/smake\n\n# Perform a dry run\nsnakemake --profile profile\/ -n\n\n# Run the pipeline\nsnakemake --profile profile\/\n<\/code><\/pre>\n\n\n\n<p>Verify that the two <code>printContent<\/code> jobs are indeed running on your cluster. On slurm, try <code>squeue -i10 --user myusername<\/code>. You will also notice messages in your console such as <code>Submitted job 1 with external jobid 'Submitted batch job 35248057'<\/code>.<br>Now verify that the files were created in the jobs folder:<\/p>\n\n\n\n<pre class=\"wp-block-code\"><code>#!\/usr\/bin\/bash\n\nls jobs\/printContent\/*\nmore jobs\/printContent\/*\n<\/code><\/pre>\n\n\n\n<p>As the command of the rule <code>printContent<\/code> does not fail, you should get empty <code>slurm_printContent.[1-2]_[0-9]+.out<\/code> files. 
The <code>.log<\/code> files should contain what was printed on your console during the run:<\/p>\n\n\n\n<pre class=\"wp-block-code\"><code>Building DAG of jobs...\nUsing shell: \/usr\/bin\/bash\nProvided cores: 2\nRules claiming more threads will be scaled down.\nSelect jobs to execute...\n\n&#91;Fri Mar  4 14:37:34 2022]\nrule printContent:\n    input: inputs\/hello.txt\n    output: results\/hello.txt\n    jobid: 0\n    wildcards: sampleName=hello\n    resources: mem_mb=1000, disk_mb=1000, tmpdir=\/scratch\/jobs\/35248057\n\n\n    cat inputs\/hello.txt &gt; results\/hello.txt\n    \n&#91;Fri Mar  4 14:37:35 2022]\nFinished job 0.\n1 of 1 steps (100%) done\n<\/code><\/pre>\n\n\n\n<p>Above, you can see that two new pieces of information appear under <code>resources<\/code>: <code>mem_mb<\/code> and <code>disk_mb<\/code>. These specify the amount of RAM and disk space available to the job; the value of 1000 was assigned by default.<\/p>\n\n\n\n<p>Next week, we will see how to define these resources in <code>profile\/config.yaml<\/code>. Stay tuned!<\/p>\n","protected":false},"excerpt":{"rendered":"<p>The power of snakemake lies in parallelization: on a cluster, jobs can be processed in parallel automatically. To make this work, you need to define how snakemake will handle the job submission process. 
If you already followed the first two posts (1,2), you can skip the first section.&hellip;<\/p>\n","protected":false},"author":5,"featured_media":310,"comment_status":"closed","ping_status":"closed","sticky":false,"template":"","format":"standard","meta":{"_acf_changed":false,"footnotes":""},"categories":[4096],"tags":[4100,4098],"embl_taxonomy":[],"class_list":["post-288","post","type-post","status-publish","format-standard","has-post-thumbnail","hentry","category-technical","tag-profile","tag-snakemake"],"acf":[],"embl_taxonomy_terms":[],"featured_image_src":"https:\/\/www.embl.org\/groups\/bioinformatics-rome\/wp-content\/uploads\/2022\/04\/lucianfreud.jpeg","_links":{"self":[{"href":"https:\/\/www.embl.org\/groups\/bioinformatics-rome\/wp-json\/wp\/v2\/posts\/288"}],"collection":[{"href":"https:\/\/www.embl.org\/groups\/bioinformatics-rome\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/www.embl.org\/groups\/bioinformatics-rome\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/www.embl.org\/groups\/bioinformatics-rome\/wp-json\/wp\/v2\/users\/5"}],"replies":[{"embeddable":true,"href":"https:\/\/www.embl.org\/groups\/bioinformatics-rome\/wp-json\/wp\/v2\/comments?post=288"}],"version-history":[{"count":11,"href":"https:\/\/www.embl.org\/groups\/bioinformatics-rome\/wp-json\/wp\/v2\/posts\/288\/revisions"}],"predecessor-version":[{"id":312,"href":"https:\/\/www.embl.org\/groups\/bioinformatics-rome\/wp-json\/wp\/v2\/posts\/288\/revisions\/312"}],"wp:featuredmedia":[{"embeddable":true,"href":"https:\/\/www.embl.org\/groups\/bioinformatics-rome\/wp-json\/wp\/v2\/media\/310"}],"wp:attachment":[{"href":"https:\/\/www.embl.org\/groups\/bioinformatics-rome\/wp-json\/wp\/v2\/media?parent=288"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/www.embl.org\/groups\/bioinformatics-rome\/wp-json\/wp\/v2\/categories?post=288"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/www.embl.org\/groups\/bioinformat
ics-rome\/wp-json\/wp\/v2\/tags?post=288"},{"taxonomy":"embl_taxonomy","embeddable":true,"href":"https:\/\/www.embl.org\/groups\/bioinformatics-rome\/wp-json\/wp\/v2\/embl_taxonomy?post=288"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}