{"id":216,"date":"2022-03-24T17:03:40","date_gmt":"2022-03-24T17:03:40","guid":{"rendered":"https:\/\/www.embl.org\/groups\/bioinformatics-rome\/?p=216"},"modified":"2022-04-07T10:06:19","modified_gmt":"2022-04-07T10:06:19","slug":"snakemake-profile-1-getting-started-with-snakemake","status":"publish","type":"post","link":"https:\/\/www.embl.org\/groups\/bioinformatics-rome\/blog\/2022\/03\/snakemake-profile-1-getting-started-with-snakemake\/","title":{"rendered":"Snakemake profile &#8211; 1: Getting started with snakemake"},"content":{"rendered":"\n<p>This blog post is the first of a series on creating snakemake profiles. Some content was directly copied from the <a rel=\"noreferrer noopener\" href=\"\" target=\"_blank\">snakemake manual<\/a>. I assume that the reader is familiar with the basic concepts of snakemake, even though I start from the very beginning.<\/p>\n\n\n\n<h2 class=\"wp-block-heading\">Introduction<\/h2>\n\n\n\n<p>Adapting Snakemake to a particular environment can entail many flags and options. Therefore, since Snakemake 4.1, it is possible to specify a configuration profile to be used to obtain default options:<\/p>\n\n\n\n<pre class=\"wp-block-code\"><code>#!\/usr\/bin\/bash\n\nsnakemake --profile myprofileFolder<\/code><\/pre>\n\n\n\n<p>In this section, I am going to explain how to create a <code>profile<\/code>. The <code>profile<\/code> folder will contain all the configuration parameters needed to successfully run your pipeline. Of note, no <code>cluster.json<\/code> file will be used, as it has been <a rel=\"noreferrer noopener\" href=\"https:\/\/snakemake.readthedocs.io\/en\/stable\/snakefiles\/configuration.html?highlight=cluster.json#cluster-configuration-deprecated\" target=\"_blank\">deprecated<\/a> since snakemake v4.1. 
The profile folder is expected to contain a file <code>config.yaml<\/code> that defines default values for the Snakemake command-line arguments and job resources.<\/p>\n\n\n\n<h2 class=\"wp-block-heading\">Folder structure<\/h2>\n\n\n\n<p>Let&#8217;s start by creating a folder with the right structure:<\/p>\n\n\n\n<pre class=\"wp-block-code\"><code>#!\/usr\/bin\/bash\n\n# Create the folder containing the files needed for this tutorial\nmkdir snakemake-profile-demo\n\n# Enter the created folder\ncd snakemake-profile-demo\n\n# Create an empty file that will contain the snakemake code\ntouch snakeFile\n\n# Create toy input files\nmkdir inputs\necho \"toto\" &gt; inputs\/hello.txt\necho \"totoBis\" &gt; inputs\/helloBis.txt\n\n# Create a folder and an empty file to define a conda environment\n# This is done to make sure that you use the same snakemake version as I do\nmkdir envs\ntouch envs\/environment.yaml<\/code><\/pre>\n\n\n\n<h2 class=\"wp-block-heading\">Creating the conda environment<\/h2>\n\n\n\n<p>For more information about the <code>YAML<\/code> format see <a rel=\"noreferrer noopener\" href=\"https:\/\/www.cloudbees.com\/blog\/yaml-tutorial-everything-you-need-get-started\" target=\"_blank\">here<\/a>. For the next step, you need to have <a rel=\"noreferrer noopener\" href=\"https:\/\/docs.conda.io\/projects\/conda\/en\/latest\/user-guide\/install\/index.html\" target=\"_blank\">conda<\/a> installed on your computer. 
Copy the following content to <code>envs\/environment.yaml<\/code> (the indentations consist of two spaces):<\/p>\n\n\n\n<pre class=\"wp-block-code\"><code>channels:\n  - bioconda\ndependencies:\n  - snakemake-minimal=6.15.1<\/code><\/pre>\n\n\n\n<p>Then execute the following command to create a conda environment containing snakemake v6.15.1:<\/p>\n\n\n\n<pre class=\"wp-block-code\"><code>#!\/usr\/bin\/bash\n\nconda env create -p envs\/smake --file envs\/environment.yaml<\/code><\/pre>\n\n\n\n<p>The conda environment creation should display:<\/p>\n\n\n\n<pre class=\"wp-block-code\"><code>Collecting package metadata (repodata.json): done\nSolving environment: done\n\nDownloading and Extracting Packages\nlibgomp-11.2.0       | 428 KB    | \n#######################################################\n########## | 100% \nlibgfortran-ng-11.2. | 19 KB     |\n#######################################################\n########## | 100% \nlibgfortran5-11.2.0  | 1.7 MB    | \n#######################################################\n########## | 100% \nlibstdcxx-ng-11.2.0  | 4.2 MB    |\n#######################################################\n########## | 100% \nlibgcc-ng-11.2.0     | 906 KB    | \n#######################################################\n########## | 100% \nPreparing transaction: done\nVerifying transaction: done\nExecuting transaction: done\n#\n# To activate this environment, use\n#\n#     $ conda activate \/g\/romebioinfo\/tmp\/snakemake-profile-demo\/envs\/smake\n#\n# To deactivate an active environment, use\n#\n#     $ conda deactivate<\/code><\/pre>\n\n\n\n<p>The option <code>-p<\/code> indicates the path to the conda environment that we call <code>smake<\/code> and the <code>--file<\/code> option gives the path to the file containing what we want to include in the environment. This step is important to make sure that we will use the same configuration. 
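<\/p>\n\n\n\n<p>To check that the environment contains the expected version without activating it, something like the following sketch should work. It assumes that your conda installation provides the <code>conda run<\/code> subcommand and that you are still in the <code>snakemake-profile-demo<\/code> folder; it is an optional check, not one of the tutorial steps:<\/p>\n\n\n\n<pre class=\"wp-block-code\"><code>#!\/usr\/bin\/bash\n\n# Assumption: the \"conda run\" subcommand is available in your conda installation\n# Print the snakemake version installed in envs\/smake without activating it\nconda run -p envs\/smake snakemake --version<\/code><\/pre>\n\n\n\n<p>The reported version should match the one pinned in <code>envs\/environment.yaml<\/code>.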
<\/p>\n\n\n\n<h2 class=\"wp-block-heading\">Creating the first rule<\/h2>\n\n\n\n<p>Let&#8217;s not activate <code>smake<\/code> for the moment. Rather, let&#8217;s write our first rule in <code>snakeFile<\/code>. Copy the following content (here I use two spaces as indentation):<\/p>\n\n\n\n<pre class=\"wp-block-code\"><code>rule printContent:\n  input:\n    file1=\"inputs\/hello.txt\",\n    file2=\"inputs\/helloBis.txt\"\n  output:\n    file1=\"results\/hello-content.txt\",\n    file2=\"results\/helloBis-content.txt\"\n  shell:\n    \"\"\"\n    cat {input.file1} &gt; {output.file1}\n    cat {input.file2} &gt; {output.file2}\n    \"\"\"<\/code><\/pre>\n\n\n\n<p>Here we wrote a rule called <code>printContent<\/code> that takes two files as input (hello.txt and helloBis.txt), and that prints their content to two other files (hello-content.txt and helloBis-content.txt) that are written in the <code>results\/<\/code> folder. This folder will be automatically created by snakemake. Let&#8217;s try to execute this file and see what happens:<\/p>\n\n\n\n<pre class=\"wp-block-code\"><code>#!\/usr\/bin\/bash\n\n# Activate the conda environment\nconda activate envs\/smake\n\n# Run the snakeFile (if --cores is not defined, nothing will happen: try removing it)\nsnakemake --snakefile snakeFile --cores=1<\/code><\/pre>\n\n\n\n<p>You should obtain the following message:<\/p>\n\n\n\n<pre class=\"wp-block-code\"><code>Building DAG of jobs...\nUsing shell: \/usr\/bin\/bash\nProvided cores: 1 (use --cores to define parallelism)\nRules claiming more threads will be scaled down.\nJob stats:\njob             count    min threads    max threads\n------------  -------  -------------  -------------\nprintContent        1              1              1\ntotal               1              1              1\n\nSelect jobs to execute...\n\n&#91;Thu Mar  3 15:19:56 2022]\nrule printContent:\n    input: inputs\/hello.txt, inputs\/helloBis.txt\n    output: results\/hello-content.txt, 
results\/helloBis-content.txt\n    jobid: 0\n    resources: tmpdir=\/tmp\n\n&#91;Thu Mar  3 15:19:56 2022]\nFinished job 0.\n1 of 1 steps (100%) done\nComplete log: .snakemake\/log\/2022-03-03T151955.846643.snakemake.log<\/code><\/pre>\n\n\n\n<p>Congratulations, you have executed your first snakemake pipeline. Now check that the files hello-content.txt and helloBis-content.txt have been created in the <code>results\/<\/code> folder:<\/p>\n\n\n\n<pre class=\"wp-block-code\"><code>#!\/usr\/bin\/bash\n\nls results\/*<\/code><\/pre>\n\n\n\n<p>That should give:<\/p>\n\n\n\n<pre class=\"wp-block-code\"><code>results\/helloBis-content.txt  results\/hello-content.txt<\/code><\/pre>\n\n\n\n<p>Execute the snakemake command again (your conda environment should still be active):<\/p>\n\n\n\n<pre class=\"wp-block-code\"><code>#!\/usr\/bin\/bash\n\n# Run the snakeFile \nsnakemake --snakefile snakeFile --cores=1<\/code><\/pre>\n\n\n\n<p>You should see:<\/p>\n\n\n\n<pre class=\"wp-block-code\"><code>Building DAG of jobs...\nNothing to be done (all requested files are present and up to date).\nComplete log: snakemake-profile-demo\/.snakemake\/log\/2022-03-03T152616.378173.snakemake.log<\/code><\/pre>\n\n\n\n<p>This is expected. Since hello-content.txt and helloBis-content.txt already exist, snakemake reports <code>Nothing to be done<\/code>. To continue this tutorial, delete the <code>results\/<\/code> folder:<\/p>\n\n\n\n<pre class=\"wp-block-code\"><code>#!\/usr\/bin\/bash\n\nrm -r results\/<\/code><\/pre>\n\n\n\n<h2 class=\"wp-block-heading\">Getting rid of the hard-coded paths<\/h2>\n\n\n\n<p>You might have noticed the main limitation in what we have done so far. We had to hard code the path to each input file. This is fine if you have only a few of them, but if you have hundreds or thousands, you are in trouble. This is where <code>wildcards<\/code> come into play. You can use them as variables that can take several values. 
Here, we want to replace the strings <code>hello<\/code> and <code>helloBis<\/code> with a wildcard that we will call <code>sampleName<\/code>. snakemake requires wildcards to be enclosed in curly braces:<\/p>\n\n\n\n<pre class=\"wp-block-code\"><code>rule printContent:\n  input:\n    file1=\"inputs\/{sampleName}.txt\",\n    file2=\"inputs\/{sampleName}.txt\"\n  output:\n    file1=\"results\/{sampleName}.txt\",\n    file2=\"results\/{sampleName}.txt\"\n  shell:\n    \"\"\"\n    cat {input.file1} &gt; {output.file1}\n    cat {input.file2} &gt; {output.file2}\n    \"\"\"<\/code><\/pre>\n\n\n\n<p>In the code above, we can see that the two lines of the <code>input<\/code> section and the two lines of the <code>output<\/code> section are the same. We can therefore simplify the code as follows:<\/p>\n\n\n\n<pre class=\"wp-block-code\"><code>rule printContent:\n  input:\n    file1=\"inputs\/{sampleName}.txt\"\n  output:\n    file1=\"results\/{sampleName}.txt\"\n  shell:\n    \"\"\"\n    cat {input.file1} &gt; {output.file1}\n    cat {input.file2} &gt; {output.file2}\n    \"\"\"<\/code><\/pre>\n\n\n\n<p>You can now observe that the <code>file1<\/code> variable is not useful anymore and that the second line of the shell script is not relevant either. Below I therefore remove <code>file1<\/code> everywhere, as well as the second line of the shell script:<\/p>\n\n\n\n<pre class=\"wp-block-code\"><code>rule printContent:\n  input:\n    \"inputs\/{sampleName}.txt\"\n  output:\n    \"results\/{sampleName}.txt\"\n  shell:\n    \"\"\"\n    cat {input} &gt; {output}\n    \"\"\"<\/code><\/pre>\n\n\n\n<p>The code is much shorter now. Replace the content of <code>snakeFile<\/code> with this rule and run the snakemake command (your conda environment should be active. In case it is not, just do <code>conda activate envs\/smake<\/code>. 
To deactivate it, do <code>conda deactivate<\/code>):<\/p>\n\n\n\n<pre class=\"wp-block-code\"><code>#!\/usr\/bin\/bash\n\n# Run the snakeFile .\nsnakemake --snakefile snakeFile --cores=1<\/code><\/pre>\n\n\n\n<p>You might have thought: &#8220;How is snakemake going to know what to put in the <code>sampleName<\/code> wildcard?&#8221;. And indeed, this message tells you that it does not know:<\/p>\n\n\n\n<pre class=\"wp-block-code\"><code>Building DAG of jobs...\nWorkflowError:\nTarget rules may not contain wildcards. Please specify concrete files or a rule without wildcards at the command line, or have a rule without wildcards at the very top of your workflow (e.g. the typical \"rule all\" which just collects all results you want to generate in the end)<\/code><\/pre>\n\n\n\n<p>Then let&#8217;s do the most obvious thing: define the sampleName variable! SPOILER: snakemake is written in Python, so if you do not know Python (like me), you will end up googling stupid things like &#8220;for loop in python&#8221;. But do not worry, this works pretty well, and the more you do it, the more you learn!<\/p>\n\n\n\n<pre class=\"wp-block-code\"><code>sampleName=&#91;\"hello\", \"helloBis\"]\n\nrule printContent:\n  input:\n    \"inputs\/{sampleName}.txt\"\n  output:\n    \"results\/{sampleName}.txt\"\n  shell:\n    \"\"\"\n    cat {input} &gt; {output}\n    \"\"\"<\/code><\/pre>\n\n\n\n<p>Add the top line to your <code>snakeFile<\/code> and run the snakemake command again:<\/p>\n\n\n\n<pre class=\"wp-block-code\"><code>#!\/usr\/bin\/bash\n\n# Run the snakeFile .\nsnakemake --snakefile snakeFile --cores=1<\/code><\/pre>\n\n\n\n<p>Are you getting the same WorkflowError message? SPOILER: snakemake is not intuitive! You get the same error because the logic of snakemake is a bit peculiar: a Python variable that happens to have the same name as a wildcard does not give that wildcard any value. You first need to define the files to produce in a <code>rule all<\/code> section. 
Be careful, even if you want to define &#8220;output files&#8221; there, you need the &#8220;input&#8221; keyword. Then you need to define the values of <code>sampleName<\/code> with the <code>expand<\/code> function:<\/p>\n\n\n\n<pre class=\"wp-block-code\"><code>rule all:\n  input:\n    expand(\"results\/{sampleName}.txt\", sampleName=&#91;\"hello\", \"helloBis\"])\n\nrule printContent:\n  input:\n    \"inputs\/{sampleName}.txt\"\n  output:\n    \"results\/{sampleName}.txt\"\n  shell:\n    \"\"\"\n    cat {input} &gt; {output}\n    \"\"\"<\/code><\/pre>\n\n\n\n<p>After replacing the content of your <code>snakeFile<\/code> with the above code, run snakemake:<\/p>\n\n\n\n<pre class=\"wp-block-code\"><code>#!\/usr\/bin\/bash\n\n# Run the snakeFile .\nsnakemake --snakefile snakeFile --cores=1<\/code><\/pre>\n\n\n\n<p>You should get:<\/p>\n\n\n\n<pre class=\"wp-block-code\"><code>Building DAG of jobs...\nUsing shell: \/usr\/bin\/bash\nProvided cores: 1 (use --cores to define parallelism)\nRules claiming more threads will be scaled down.\nJob stats:\njob             count    min threads    max threads\n------------  -------  -------------  -------------\nall                 1              1              1\nprintContent        2              1              1\ntotal               3              1              1\n\nSelect jobs to execute...\n\n&#91;Thu Mar  3 16:44:09 2022]\nrule printContent:\n    input: inputs\/hello.txt\n    output: results\/hello.txt\n    jobid: 1\n    wildcards: sampleName=hello\n    resources: tmpdir=\/tmp\n\n&#91;Thu Mar  3 16:44:09 2022]\nFinished job 1.\n1 of 3 steps (33%) done\nSelect jobs to execute...\n\n&#91;Thu Mar  3 16:44:09 2022]\nrule printContent:\n    input: inputs\/helloBis.txt\n    output: results\/helloBis.txt\n    jobid: 2\n    wildcards: sampleName=helloBis\n    resources: tmpdir=\/tmp\n\n&#91;Thu Mar  3 16:44:09 2022]\nFinished job 2.\n2 of 3 steps (67%) done\nSelect jobs to execute...\n\n&#91;Thu Mar  3 16:44:09 2022]\nlocalrule 
all:\n    input: results\/hello.txt, results\/helloBis.txt\n    jobid: 0\n    resources: tmpdir=\/tmp\n\n&#91;Thu Mar  3 16:44:09 2022]\nFinished job 0.\n3 of 3 steps (100%) done\nComplete log: \/g\/romebioinfo\/tmp\/snakemake-profile-demo\/.snakemake\/log\/2022-03-03T164408.691275.snakemake.log<\/code><\/pre>\n\n\n\n<p>This is quite different from what we got before. First look at the table at the beginning (on the left below) and compare it with the previous one (on the right below):<\/p>\n\n\n\n<pre class=\"wp-block-code\"><code>job             count    min threads    max threads        job             count    min threads    max threads\n------------  -------  -------------  -------------        ------------  -------  -------------  -------------\nall                 1              1              1\nprintContent        2              1              1        printContent        1              1              1\ntotal               3              1              1        total               1              1              1<\/code><\/pre>\n\n\n\n<p>First, you can see that <code>all<\/code> is considered as a job (the job that verifies that all the files are produced). Secondly, <code>printContent<\/code> has 2 jobs instead of 1. This is the key to snakemake: it processes each file (inputs\/hello.txt and inputs\/helloBis.txt) in a separate job. Because these jobs are independent of each other, they can be run in parallel. In the next post, we will see how to submit the jobs to a cluster. Once the jobs are all submitted to your HPC cluster, it can process inputs\/hello.txt and inputs\/helloBis.txt at the same time, i.e. in parallel. The advantage of using a workflow manager such as snakemake is then obvious: when you have hundreds of files, it saves you a lot of time! 
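<\/p>\n\n\n\n<p>You do not need a cluster to try this out. As a minimal sketch, assuming that your computer has at least two cores, raising <code>--cores<\/code> allows snakemake to run both <code>printContent<\/code> jobs at the same time:<\/p>\n\n\n\n<pre class=\"wp-block-code\"><code>#!\/usr\/bin\/bash\n\n# Delete the previous results, otherwise there is nothing to be done\nrm -rf results\/\n\n# Allow up to two jobs to run at the same time (assumes at least two cores)\nsnakemake --snakefile snakeFile --cores=2<\/code><\/pre>\n\n\n\n<p>With such small files the difference in run time will not be noticeable, but the job table should still list two <code>printContent<\/code> jobs.<\/p>\n\n\n\n<p>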
Note also that snakemake outputs a summary of each job that is run by the workflow:<\/p>\n\n\n\n<pre class=\"wp-block-code\"><code>rule printContent:\n    input: inputs\/hello.txt\n    output: results\/hello.txt\n    jobid: 1\n    wildcards: sampleName=hello\n    resources: tmpdir=\/tmp<\/code><\/pre>\n\n\n\n<p>This lets you check the parameters of each job. This is going to be important in the next posts. Finally, when a job has finished, in other words when the output file of a rule is produced, snakemake indicates the jobid (that you can find in the above summary) and the number of steps completed so far:<\/p>\n\n\n\n<pre class=\"wp-block-code\"><code>&#91;Thu Mar  3 16:44:09 2022]\nFinished job 1.\n1 of 3 steps (33%) done\nSelect jobs to execute...<\/code><\/pre>\n\n\n\n<h2 class=\"wp-block-heading\">Conclusion<\/h2>\n\n\n\n<p>As you can imagine, there is a lot more to say, but this first post only aims at giving the concepts needed to understand the following ones. Next week, we are going to talk about snakemake options (the ones you can pass on the command line when you type <code>snakemake --snakefile snakeFile --cores=1<\/code>). To get an idea of the number of options available, run <code>snakemake --help<\/code>. <\/p>\n\n\n\n<p>Stay tuned and see you next week! (<a href=\"https:\/\/www.embl.org\/groups\/bioinformatics-rome\/blog\/2022\/03\/snakemake-profile-2-reducing-command-line-options-with-profile\/\">next post<\/a>)<\/p>\n","protected":false},"excerpt":{"rendered":"<p>This blog post is the first of a series on creating snakemake profiles. Some content was directly copied from the snakemake manual. I assume that the reader is familiar with the basic concepts of snakemake, even though I start from the very beginning. 
Introduction Adapting Snakemake to a particular&hellip;<\/p>\n","protected":false},"author":5,"featured_media":148,"comment_status":"closed","ping_status":"closed","sticky":false,"template":"","format":"standard","meta":{"_acf_changed":false,"footnotes":""},"categories":[4096],"tags":[4102,4466,4100,4098],"embl_taxonomy":[],"class_list":["post-216","post","type-post","status-publish","format-standard","has-post-thumbnail","hentry","category-technical","tag-beginner","tag-introduction","tag-profile","tag-snakemake"],"acf":[],"embl_taxonomy_terms":[],"featured_image_src":"https:\/\/www.embl.org\/groups\/bioinformatics-rome\/wp-content\/uploads\/2022\/02\/hitchcock.jpeg","_links":{"self":[{"href":"https:\/\/www.embl.org\/groups\/bioinformatics-rome\/wp-json\/wp\/v2\/posts\/216"}],"collection":[{"href":"https:\/\/www.embl.org\/groups\/bioinformatics-rome\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/www.embl.org\/groups\/bioinformatics-rome\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/www.embl.org\/groups\/bioinformatics-rome\/wp-json\/wp\/v2\/users\/5"}],"replies":[{"embeddable":true,"href":"https:\/\/www.embl.org\/groups\/bioinformatics-rome\/wp-json\/wp\/v2\/comments?post=216"}],"version-history":[{"count":17,"href":"https:\/\/www.embl.org\/groups\/bioinformatics-rome\/wp-json\/wp\/v2\/posts\/216\/revisions"}],"predecessor-version":[{"id":284,"href":"https:\/\/www.embl.org\/groups\/bioinformatics-rome\/wp-json\/wp\/v2\/posts\/216\/revisions\/284"}],"wp:featuredmedia":[{"embeddable":true,"href":"https:\/\/www.embl.org\/groups\/bioinformatics-rome\/wp-json\/wp\/v2\/media\/148"}],"wp:attachment":[{"href":"https:\/\/www.embl.org\/groups\/bioinformatics-rome\/wp-json\/wp\/v2\/media?parent=216"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/www.embl.org\/groups\/bioinformatics-rome\/wp-json\/wp\/v2\/categories?post=216"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/www.embl.org\/groups\/bioi
nformatics-rome\/wp-json\/wp\/v2\/tags?post=216"},{"taxonomy":"embl_taxonomy","embeddable":true,"href":"https:\/\/www.embl.org\/groups\/bioinformatics-rome\/wp-json\/wp\/v2\/embl_taxonomy?post=216"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}