{"id":3138,"date":"2022-10-15T09:10:07","date_gmt":"2022-10-15T09:10:07","guid":{"rendered":"https:\/\/www.embl.org\/groups\/bioinformatics-rome\/?p=3138"},"modified":"2022-10-15T09:16:48","modified_gmt":"2022-10-15T09:16:48","slug":"an-example-of-what-the-hell-error-in-snakemake","status":"publish","type":"post","link":"https:\/\/www.embl.org\/groups\/bioinformatics-rome\/blog\/2022\/10\/an-example-of-what-the-hell-error-in-snakemake\/","title":{"rendered":"An example of what-the-hell error in Snakemake"},"content":{"rendered":"\n<div style=\"height:20px\" aria-hidden=\"true\" class=\"wp-block-spacer\"><\/div>\n\n\n\n<p>This post describes a bug I ran into recently that led me to use constraints on wildcards. The error is not intuitive and can be disconcerting at first if you are not used to Snakemake&#8217;s logic. Below I describe, step by step, how to build a minimal working example that generates the error, what I did to isolate the problem, and finally how I solved it.<\/p>\n\n\n\n<div style=\"height:21px\" aria-hidden=\"true\" class=\"wp-block-spacer\"><\/div>\n\n\n\n<h2 class=\"wp-block-heading\">Preparing the files<\/h2>\n\n\n\n<p>Run the following script to create the folder structure:<\/p>\n\n\n\n<pre class=\"wp-block-code\"><code>#!\/usr\/bin\/bash\n\n## Create the project and environment folders\nmkdir demo-constraints\ncd demo-constraints\nmkdir envs\n\n## Create the empty Snakefile and configuration file\ntouch Snakefile\ntouch config.yaml\n<\/code><\/pre>\n\n\n\n<p>For more information about the <code>YAML<\/code> format see <a href=\"https:\/\/www.cloudbees.com\/blog\/yaml-tutorial-everything-you-need-get-started\">here<\/a>. For the next step, you need to have <a href=\"https:\/\/mamba.readthedocs.io\/en\/latest\/installation.html\">mamba<\/a> installed on your computer. 
Run the commands:<\/p>\n\n\n\n<pre class=\"wp-block-code\"><code>mamba create -p envs\/smake\nmamba activate envs\/smake\nmamba install snakemake=7.15.2\n<\/code><\/pre>\n\n\n\n<p>You can verify that the correct version of Snakemake has been installed with:<\/p>\n\n\n\n<pre class=\"wp-block-code\"><code>#!\/usr\/bin\/bash\n\nsnakemake --version\n<\/code><\/pre>\n\n\n\n<div style=\"height:21px\" aria-hidden=\"true\" class=\"wp-block-spacer\"><\/div>\n\n\n\n<h2 class=\"wp-block-heading\">The workflow<\/h2>\n\n\n\n<p>In this workflow, we will code two rules to download fastq files from <a href=\"https:\/\/www.encodeproject.org\/\">Encode<\/a>. The URLs will be retrieved from the config file as the information about the species and the associated protocols. Add the following content to <code>config.yaml<\/code>:<\/p>\n\n\n\n<pre class=\"wp-block-code\"><code>testDatasets:\n  technique: &#91;\"RNASeq\", \"ChIPSeq\", \"ATACSeq\"]\n  organism: &#91;\"Mus_musculus\", \"Homo_sapiens\"]\n  RNASeq:\n    Mus_musculus:\n      nameSingleEnd: \"limbPolyAPlus\"\n      singleEnded: \"https:\/\/www.encodeproject.org\/files\/ENCFF678XFK\/@@download\/ENCFF678XFK.fastq.gz\"\n      namePairedEnd: \"CD4PolyAPlus\"\n      pairedEnded1: \"https:\/\/www.encodeproject.org\/files\/ENCFF817RHL\/@@download\/ENCFF817RHL.fastq.gz\"\n      pairedEnded2: \"https:\/\/www.encodeproject.org\/files\/ENCFF533VQX\/@@download\/ENCFF533VQX.fastq.gz\"\n    Homo_sapiens:\n      nameSingleEnd: \"GM12878PolyAPlus\"\n      singleEnded: \"https:\/\/www.encodeproject.org\/files\/ENCFF729YAX\/@@download\/ENCFF729YAX.fastq.gz\"\n      namePairedEnd: \"adrenalGlandPolyAPlus\"\n      pairedEnded1: \"https:\/\/www.encodeproject.org\/files\/ENCFF028DUO\/@@download\/ENCFF028DUO.fastq.gz\"\n      pairedEnded2: \"https:\/\/www.encodeproject.org\/files\/ENCFF470RWW\/@@download\/ENCFF470RWW.fastq.gz\"\n  ChIPSeq:\n    Mus_musculus:\n      nameSingleEnd: \"H3K27acMacro\"\n      singleEnded: 
\"https:\/\/www.encodeproject.org\/files\/ENCFF937IMG\/@@download\/ENCFF937IMG.fastq.gz\"\n      namePairedEnd: \"H3K27me3Patski\"\n      pairedEnded1: \"https:\/\/www.encodeproject.org\/files\/ENCFF090PQE\/@@download\/ENCFF090PQE.fastq.gz\"\n      pairedEnded2: \"https:\/\/www.encodeproject.org\/files\/ENCFF362BSH\/@@download\/ENCFF362BSH.fastq.gz\"\n    Homo_sapiens:\n      nameSingleEnd: \"H3K36me3BlaER1\"\n      singleEnded: \"https:\/\/www.encodeproject.org\/files\/ENCFF354RIC\/@@download\/ENCFF354RIC.fastq.gz\"\n      namePairedEnd: \"BMI1MCF7\"\n      pairedEnded1: \"https:\/\/www.encodeproject.org\/files\/ENCFF825HMN\/@@download\/ENCFF825HMN.fastq.gz\"\n      pairedEnded2: \"https:\/\/www.encodeproject.org\/files\/ENCFF240ZBS\/@@download\/ENCFF240ZBS.fastq.gz\"\n  ATACSeq:\n    Mus_musculus:\n      nameSingleEnd: \"ATACErythroblast\"\n      singleEnded: \"https:\/\/www.encodeproject.org\/files\/ENCFF535AXY\/@@download\/ENCFF535AXY.fastq.gz\"\n      namePairedEnd: \"ATACPatski\"\n      pairedEnded1: \"https:\/\/www.encodeproject.org\/files\/ENCFF600AUS\/@@download\/ENCFF600AUS.fastq.gz\"\n      pairedEnded2: \"https:\/\/www.encodeproject.org\/files\/ENCFF273QXE\/@@download\/ENCFF273QXE.fastq.gz\"\n    Homo_sapiens:\n      nameSingleEnd: \"ATACA549\"\n      singleEnded: \"https:\/\/www.encodeproject.org\/files\/ENCFF022FVS\/@@download\/ENCFF022FVS.fastq.gz\"\n      namePairedEnd: \"ATACCD4positive\"\n      pairedEnded1: \"https:\/\/www.encodeproject.org\/files\/ENCFF526FOQ\/@@download\/ENCFF526FOQ.fastq.gz\"\n      pairedEnded2: \"https:\/\/www.encodeproject.org\/files\/ENCFF310QJA\/@@download\/ENCFF310QJA.fastq.gz\"\n<\/code><\/pre>\n\n\n\n<p>Define the workflow in <code>Snakefile<\/code>:<\/p>\n\n\n\n<pre class=\"wp-block-code\"><code>import pandas as pd\n\nconfigfile: \"config.yaml\"\n\nonstart:\n    print(\"##### DOWNLOAD FASTQ FILES #####\\n\") \n\n\n###############################################################################\n# Creating input 
table\n###############################################################################\n\n# Build the table of test datasets to download\nsamplesData = &#91;]\n\nfor tech in config&#91;\"testDatasets\"]&#91;\"technique\"]:\n  for org in config&#91;\"testDatasets\"]&#91;\"organism\"]:\n    pathSingle = config&#91;\"testDatasets\"]&#91;tech]&#91;org]&#91;\"singleEnded\"]\n    nameSingle = config&#91;\"testDatasets\"]&#91;tech]&#91;org]&#91;\"nameSingleEnd\"]\n    pathPaired1 = config&#91;\"testDatasets\"]&#91;tech]&#91;org]&#91;\"pairedEnded1\"]\n    pathPaired2 = config&#91;\"testDatasets\"]&#91;tech]&#91;org]&#91;\"pairedEnded2\"]\n    namePaired = config&#91;\"testDatasets\"]&#91;tech]&#91;org]&#91;\"namePairedEnd\"]\n    samplesData.append(&#91;nameSingle, tech, org, \"single\", pathSingle, \"NA\"])\n    samplesData.append(&#91;namePaired, tech, org, \"paired\", pathPaired1, pathPaired2])\n\ndf = pd.DataFrame(samplesData)\ndf.rename(columns={0: 'samples', 1: 'library_strategy', 2: 'organism', 3: 'library_layout', 4: 'link1', 5: 'link2'}, inplace=True)\n\n\n###############################################################################\n# Variables definition\n###############################################################################\n\n# Splitting the table into single or paired end experiments\n\nindex_single = df&#91;'library_layout'] == 'single'\nindex_paired = df&#91;'library_layout'] == 'paired'\ndf_single = df&#91;index_single]\ndf_paired = df&#91;index_paired]\n\n# Output files names\n\nSINGLESAMPLES = df_single&#91;'samples'].tolist()\nPAIREDSAMPLES = df_paired&#91;'samples'].tolist()\n\n# For Retrieving links to download sra files\n\nsamples_single_forlinks = pd.DataFrame(df_single).set_index(\"samples\",drop=False)\nsamples_paired_forlinks = pd.DataFrame(df_paired).set_index(\"samples\",drop=False)\n\n# Technique names\nSINGLETECH = df_single&#91;'library_strategy'].tolist()\nPAIREDTECH = df_paired&#91;'library_strategy'].tolist()\n\n## Species 
name\nSPECIESSINGLE = df_single&#91;'organism'].tolist()\nSPECIESPAIRED = df_paired&#91;'organism'].tolist()\n\n## Layout names\nLAYOUTSINGLE = df_single&#91;'library_layout'].tolist()\nLAYOUTPAIRED = df_paired&#91;'library_layout'].tolist()\n\n\n############\n# Rule all\n############\n\nrule all:\n  input:\n    expand(\"results\/{speciessingle}\/fastq\/{techniquesingle}\/{layoutsingle}\/allchrom\/{samplenamesingle}.fastq.gz\", zip, speciessingle=SPECIESSINGLE, techniquesingle=SINGLETECH, layoutsingle=LAYOUTSINGLE, samplenamesingle=SINGLESAMPLES),\n    expand(\"results\/{speciespaired}\/fastq\/{techniquepaired}\/{layoutpaired}\/allchrom\/{samplenamepaired}_1.fastq.gz\", zip, speciespaired=SPECIESPAIRED, techniquepaired=PAIREDTECH, layoutpaired=LAYOUTPAIRED, samplenamepaired=PAIREDSAMPLES),\n    expand(\"results\/{speciespaired}\/fastq\/{techniquepaired}\/{layoutpaired}\/allchrom\/{samplenamepaired}_2.fastq.gz\", zip, speciespaired=SPECIESPAIRED, techniquepaired=PAIREDTECH, layoutpaired=LAYOUTPAIRED, samplenamepaired=PAIREDSAMPLES)\n\n\nrule download_fastq_single:\n  output:\n    \"results\/{speciessingle}\/fastq\/{techniquesingle}\/{layoutsingle}\/allchrom\/{samplenamesingle}.fastq.gz\"\n  params:\n    outputdirectory = lambda wildcards: f\"results\/{wildcards.speciessingle}\/fastq\/{wildcards.techniquesingle}\/{wildcards.layoutsingle}\/fastq\/allchrom\",\n    linksingle = lambda wildcards: samples_single_forlinks.loc&#91;wildcards.samplenamesingle, \"link1\"]\n  threads: 1\n  shell:\n    \"\"\"\n    echo \"Downloading {params.linksingle}\"\n    wget --directory-prefix={params.outputdirectory} {params.linksingle}\n    sleep 10s\n    FILENAME=`basename {params.linksingle}`\n    mv {params.outputdirectory}\/$FILENAME {output}  \n    \"\"\"\n\nrule download_fastq_paired:\n  output:\n    pair1 = \"results\/{speciespaired}\/fastq\/{techniquepaired}\/{layoutpaired}\/allchrom\/{samplenamepaired}_1.fastq.gz\",\n    pair2 = 
\"results\/{speciespaired}\/fastq\/{techniquepaired}\/{layoutpaired}\/allchrom\/{samplenamepaired}_2.fastq.gz\"\n  threads: 1\n  params:\n    outputdirectory = lambda wildcards: f\"results\/{wildcards.speciespaired}\/fastq\/{wildcards.techniquepaired}\/{wildcards.layoutpaired}\/fastq\/allchrom\",\n    linkpair1 = lambda wildcards: samples_paired_forlinks.loc&#91;wildcards.samplenamepaired, \"link1\"],\n    linkpair2 = lambda wildcards: samples_paired_forlinks.loc&#91;wildcards.samplenamepaired, \"link2\"]\n  shell:\n    \"\"\"\n    echo \"Downloading {params.linkpair1} and {params.linkpair2}\"\n    wget --directory-prefix={params.outputdirectory} {params.linkpair1}\n    wget --directory-prefix={params.outputdirectory} {params.linkpair2}\n    sleep 10s\n    FILENAME1=`basename {params.linkpair1}`\n    FILENAME2=`basename {params.linkpair2}`\n    mv {params.outputdirectory}\/$FILENAME1 {output.pair1}\n    mv {params.outputdirectory}\/$FILENAME2 {output.pair2}\n    \"\"\"\n<\/code><\/pre>\n\n\n\n<div style=\"height:21px\" aria-hidden=\"true\" class=\"wp-block-spacer\"><\/div>\n\n\n\n<h2 class=\"wp-block-heading\">The workflow explained step by step<\/h2>\n\n\n\n<p>The Snakefile starts with:<\/p>\n\n\n\n<pre class=\"wp-block-code\"><code>import pandas as pd\n\nconfigfile: \"config.yaml\"\n\nonstart:\n    print(\"##### DOWNLOAD FASTQ FILES #####\\n\") \n<\/code><\/pre>\n\n\n\n<p>These commands first import the python module <code>pandas<\/code> that will be used with the prefix <code>pd<\/code>. The path to the configuration file is then indicated with <code>configfile: \"config.yaml\"<\/code>. The configuration file (see above) contains information about the different techniques that were used to generate the data (<code>technique: [\"RNASeq\", \"ChIPSeq\", \"ATACSeq\"]<\/code>), the organism from which the data were generated (<code>organism: [\"Mus_musculus\", \"Homo_sapiens\"]<\/code>) and the URLs are further organised according to the techniques and organisms. 
The <code>onstart<\/code> section prints a message in the terminal when the workflow starts.<\/p>\n\n\n\n<p>The <code>Creating input table<\/code> section builds a pandas <code>DataFrame<\/code> from <code>config.yaml<\/code> with the following information:<\/p>\n\n\n\n<div style=\"height:11px\" aria-hidden=\"true\" class=\"wp-block-spacer\"><\/div>\n\n\n\n<figure class=\"wp-block-table\"><table><tbody><tr><td>samples <\/td><td>library_strategy <\/td><td>organism <\/td><td>library_layout <\/td><td>link1 <\/td><td>link2<\/td><\/tr><tr><td>limbPolyAPlus <\/td><td>RNASeq<\/td><td>Mus_musculus <\/td><td>single <\/td><td>https:\/\/www.encodeproject.org\/files\/ENCFF678XF\u2026 <\/td><td>NA<\/td><\/tr><tr><td>CD4PolyAPlus <\/td><td>RNASeq <\/td><td>Mus_musculus <\/td><td>paired<\/td><td>https:\/\/www.encodeproject.org\/files\/ENCFF817RH\u2026 <\/td><td>https:\/\/www.encodeproject.org\/files\/ENCFF533VQ\u2026<\/td><\/tr><tr><td>GM12878PolyAPlus <\/td><td>RNASeq <\/td><td>Homo_sapiens <\/td><td>single<\/td><td>https:\/\/www.encodeproject.org\/files\/ENCFF729YA\u2026 <\/td><td>NA<\/td><\/tr><tr><td>adrenalGlandPolyAPlus <\/td><td>RNASeq <\/td><td>Homo_sapiens <\/td><td>paired<\/td><td>https:\/\/www.encodeproject.org\/files\/ENCFF028DU\u2026 <\/td><td>https:\/\/www.encodeproject.org\/files\/ENCFF470RW\u2026<\/td><\/tr><tr><td>H3K27acMacro <\/td><td>ChIPSeq <\/td><td>Mus_musculus <\/td><td>single<\/td><td>https:\/\/www.encodeproject.org\/files\/ENCFF937IM\u2026 <\/td><td>NA<\/td><\/tr><tr><td>H3K27me3Patski <\/td><td>ChIPSeq <\/td><td>Mus_musculus <\/td><td>paired<\/td><td>https:\/\/www.encodeproject.org\/files\/ENCFF090PQ\u2026 <\/td><td>https:\/\/www.encodeproject.org\/files\/ENCFF362BS\u2026<\/td><\/tr><tr><td>H3K36me3BlaER1 <\/td><td>ChIPSeq <\/td><td>Homo_sapiens <\/td><td>single<\/td><td>https:\/\/www.encodeproject.org\/files\/ENCFF354RI\u2026 <\/td><td>NA<\/td><\/tr><tr><td>BMI1MCF7 <\/td><td>ChIPSeq <\/td><td>Homo_sapiens 
<\/td><td>paired<\/td><td>https:\/\/www.encodeproject.org\/files\/ENCFF825HM\u2026<\/td><td>https:\/\/www.encodeproject.org\/files\/ENCFF240ZB\u2026<\/td><\/tr><tr><td>ATACErythroblast<\/td><td>ATACSeq <\/td><td>Mus_musculus <\/td><td>single<\/td><td>https:\/\/www.encodeproject.org\/files\/ENCFF535AX\u2026 <\/td><td>NA<\/td><\/tr><tr><td>ATACPatski <\/td><td>ATACSeq <\/td><td>Mus_musculus <\/td><td>paired<\/td><td>https:\/\/www.encodeproject.org\/files\/ENCFF600AU\u2026<\/td><td>https:\/\/www.encodeproject.org\/files\/ENCFF273QX\u2026<\/td><\/tr><tr><td>ATACA549<\/td><td>ATACSeq<\/td><td>Homo_sapiens <\/td><td>single <\/td><td>https:\/\/www.encodeproject.org\/files\/ENCFF022FV\u2026 <\/td><td>NA<\/td><\/tr><tr><td>ATACCD4positive <\/td><td>ATACSeq <\/td><td>Homo_sapiens <\/td><td>paired<\/td><td>https:\/\/www.encodeproject.org\/files\/ENCFF526FO\u2026 <\/td><td>https:\/\/www.encodeproject.org\/files\/ENCFF310QJ\u2026<\/td><\/tr><\/tbody><\/table><\/figure>\n\n\n\n<div style=\"height:11px\" aria-hidden=\"true\" class=\"wp-block-spacer\"><\/div>\n\n\n\n<p>The rows are then separated into two tables of <code>single<\/code> and <code>paired<\/code> samples with:<\/p>\n\n\n\n<pre class=\"wp-block-code\"><code># Splitting the table into single or paired end experiments\n\nindex_single = df&#91;'library_layout'] == 'single'\nindex_paired = df&#91;'library_layout'] == 'paired'\ndf_single = df&#91;index_single]\ndf_paired = df&#91;index_paired]\n<\/code><\/pre>\n\n\n\n<p>The data frames are then indexed by sample name (the <code>samples<\/code> column) so that the URLs can be retrieved for a given sample, and the lists used to build the output paths are created:<\/p>\n\n\n\n<pre class=\"wp-block-code\"><code># Output files names\n\nSINGLESAMPLES = df_single&#91;'samples'].tolist()\nPAIREDSAMPLES = df_paired&#91;'samples'].tolist()\n\n# For Retrieving links to download sra files\n\nsamples_single_forlinks = 
pd.DataFrame(df_single).set_index(\"samples\",drop=False)\nsamples_paired_forlinks = pd.DataFrame(df_paired).set_index(\"samples\",drop=False)\n\n# Technique names\nSINGLETECH = df_single&#91;'library_strategy'].tolist()\nPAIREDTECH = df_paired&#91;'library_strategy'].tolist()\n\n## Species name\nSPECIESSINGLE = df_single&#91;'organism'].tolist()\nSPECIESPAIRED = df_paired&#91;'organism'].tolist()\n\n## Layout names\nLAYOUTSINGLE = df_single&#91;'library_layout'].tolist()\nLAYOUTPAIRED = df_paired&#91;'library_layout'].tolist()\n<\/code><\/pre>\n\n\n\n<p>Finally, <code>rule all<\/code> contains the files to be created and the rules <code>download_fastq_single<\/code>\/<code>download_fastq_paired<\/code> use the indexed data frames <code>samples_single_forlinks<\/code>\/<code>samples_paired_forlinks<\/code> to download the fastq files.<\/p>\n\n\n\n<div style=\"height:21px\" aria-hidden=\"true\" class=\"wp-block-spacer\"><\/div>\n\n\n\n<h2 class=\"wp-block-heading\">Running the pipeline and error<\/h2>\n\n\n\n<p>To dry-run the pipeline use:<\/p>\n\n\n\n<pre class=\"wp-block-code\"><code>#!\/usr\/bin\/bash\n\nsnakemake --cores 1 -n\n<\/code><\/pre>\n\n\n\n<p>You should see the error:<\/p>\n\n\n\n<pre class=\"wp-block-code\"><code>AmbiguousRuleException:\nRules download_fastq_paired and download_fastq_single are ambiguous for the file results\/Mus_musculus\/fastq\/RNASeq\/paired\/allchrom\/CD4PolyAPlus_1.fastq.gz.\nConsider starting rule output with a unique prefix, constrain your wildcards, or use the ruleorder directive.\nWildcards:\n    download_fastq_paired: layoutpaired=paired,samplenamepaired=CD4PolyAPlus,speciespaired=Mus_musculus,techniquepaired=RNASeq\n    download_fastq_single: layoutsingle=paired,samplenamesingle=CD4PolyAPlus_1,speciessingle=Mus_musculus,techniquesingle=RNASeq\nExpected input files:\n    download_fastq_paired: \n    download_fastq_single: Expected output files:\n    download_fastq_paired: 
results\/Mus_musculus\/fastq\/RNASeq\/paired\/allchrom\/CD4PolyAPlus_1.fastq.gz results\/Mus_musculus\/fastq\/RNASeq\/paired\/allchrom\/CD4PolyAPlus_2.fastq.gz\n    download_fastq_single: results\/Mus_musculus\/fastq\/RNASeq\/paired\/allchrom\/CD4PolyAPlus_1.fastq.gz\n<\/code><\/pre>\n\n\n\n<p>The first thing to notice is that both rules, even though they are built on two different data frames, claim the same file (<code>results\/Mus_musculus\/fastq\/RNASeq\/paired\/allchrom\/CD4PolyAPlus_1.fastq.gz<\/code>). Looking at the <code>Wildcards<\/code> section, you will notice that the rule dedicated to the single-ended samples is now receiving <code>paired<\/code> values (<code>layoutsingle=paired<\/code>). More interestingly, the <code>samplenamesingle<\/code> wildcard of the <code>download_fastq_single<\/code> rule does not hold a valid sample name: the <code>_1<\/code> suffix has been added to it (<code>samplenamesingle=CD4PolyAPlus_1<\/code>). This suffix is only present in the output of <code>download_fastq_paired<\/code>. Notice also that the wildcard <code>samplenamepaired=CD4PolyAPlus<\/code> is correct, as are the expected output files of <code>download_fastq_paired<\/code> (<code>results\/Mus_musculus\/fastq\/RNASeq\/paired\/allchrom\/CD4PolyAPlus_1.fastq.gz<\/code>\/<code>results\/Mus_musculus\/fastq\/RNASeq\/paired\/allchrom\/CD4PolyAPlus_2.fastq.gz<\/code>).<\/p>\n\n\n\n<p>=&gt; <strong>The question is: how could the paired samples be injected into <code>download_fastq_single<\/code>? Why does the &#8216;_1&#8217; suffix suddenly appear in that rule?<\/strong><\/p>\n\n\n\n<div style=\"height:21px\" aria-hidden=\"true\" class=\"wp-block-spacer\"><\/div>\n\n\n\n<h2 class=\"wp-block-heading\">Simplifying the workflow to isolate the problem<\/h2>\n\n\n\n<p>Remove everything related to the paired rule to generate only the files of <code>download_fastq_single<\/code>. Then simplify the rule to a minimal working example. 
Copy the following content into a <code>Snakefile2<\/code>:<\/p>\n\n\n\n<pre class=\"wp-block-code\"><code>import pandas as pd\n\nconfigfile: \"config.yaml\"\n\nonstart:\n    print(\"##### DOWNLOAD FASTQ FILES #####\\n\") \n\n\n###############################################################################\n# Creating input table\n###############################################################################\n\n# Build the table of test datasets to download\nsamplesData = &#91;]\n\nfor tech in config&#91;\"testDatasets\"]&#91;\"technique\"]:\n  for org in config&#91;\"testDatasets\"]&#91;\"organism\"]:\n    pathSingle = config&#91;\"testDatasets\"]&#91;tech]&#91;org]&#91;\"singleEnded\"]\n    nameSingle = config&#91;\"testDatasets\"]&#91;tech]&#91;org]&#91;\"nameSingleEnd\"]\n    pathPaired1 = config&#91;\"testDatasets\"]&#91;tech]&#91;org]&#91;\"pairedEnded1\"]\n    pathPaired2 = config&#91;\"testDatasets\"]&#91;tech]&#91;org]&#91;\"pairedEnded2\"]\n    namePaired = config&#91;\"testDatasets\"]&#91;tech]&#91;org]&#91;\"namePairedEnd\"]\n    samplesData.append(&#91;nameSingle, tech, org, \"single\", pathSingle, \"NA\"])\n    samplesData.append(&#91;namePaired, tech, org, \"paired\", pathPaired1, pathPaired2])\n\ndf = pd.DataFrame(samplesData)\ndf.rename(columns={0: 'samples', 1: 'library_strategy', 2: 'organism', 3: 'library_layout', 4: 'link1', 5: 'link2'}, inplace=True)\n\n\n###############################################################################\n# Variables definition\n###############################################################################\n\n# Splitting the table into single or paired end experiments\nindex_single = df&#91;'library_layout'] == 'single'\ndf_single = df&#91;index_single]\n\n# Output files names\nSINGLESAMPLES = df_single&#91;'samples'].tolist()\n\n# For Retrieving links to download sra files\nsamples_single_forlinks = pd.DataFrame(df_single).set_index(\"samples\",drop=False)\n\n# Technique names\nSINGLETECH = 
df_single&#91;'library_strategy'].tolist()\n\n## Species name\nSPECIESSINGLE = df_single&#91;'organism'].tolist()\n\n## Layout names\nLAYOUTSINGLE = df_single&#91;'library_layout'].tolist()\n\n############\n# Rule all\n############\n\nrule all:\n  input:\n    expand(\"results\/{speciessingle}\/fastq\/{techniquesingle}\/{layoutsingle}\/allchrom\/{samplenamesingle}.fastq.gz\", zip, speciessingle=SPECIESSINGLE, techniquesingle=SINGLETECH, layoutsingle=LAYOUTSINGLE, samplenamesingle=SINGLESAMPLES)\n\nrule download_fastq_single:\n  output:\n    \"results\/{speciessingle}\/fastq\/{techniquesingle}\/{layoutsingle}\/allchrom\/{samplenamesingle}.fastq.gz\"\n  threads: 1\n  shell:\n    \"echo 'hello' > {output}\"\n<\/code><\/pre>\n\n\n\n<p>Perform a dry-run:<\/p>\n\n\n\n<pre class=\"wp-block-code\"><code>#!\/usr\/bin\/bash\n\nsnakemake --cores 1 --snakefile Snakefile2 -n\n<\/code><\/pre>\n\n\n\n<p>You should obtain:<\/p>\n\n\n\n<pre class=\"wp-block-code\"><code>Building DAG of jobs...\nJob stats:\njob                      count    min threads    max threads\n---------------------  -------  -------------  -------------\nall                          1              1              1\ndownload_fastq_single        6              1              1\ntotal                        7              1              1\n\n\n&#91;Thu Oct 13 15:46:16 2022]\nrule download_fastq_single:\n    output: results\/Homo_sapiens\/fastq\/RNASeq\/single\/allchrom\/GM12878PolyAPlus.fastq.gz\n    jobid: 2\n    reason: Missing output files: results\/Homo_sapiens\/fastq\/RNASeq\/single\/allchrom\/GM12878PolyAPlus.fastq.gz\n    wildcards: speciessingle=Homo_sapiens, techniquesingle=RNASeq, layoutsingle=single, samplenamesingle=GM12878PolyAPlus\n    resources: tmpdir=\/tmp\n\n\n&#91;Thu Oct 13 15:46:16 2022]\nrule download_fastq_single:\n    output: results\/Mus_musculus\/fastq\/ChIPSeq\/single\/allchrom\/H3K27acMacro.fastq.gz\n    jobid: 3\n    reason: Missing output files: 
results\/Mus_musculus\/fastq\/ChIPSeq\/single\/allchrom\/H3K27acMacro.fastq.gz\n    wildcards: speciessingle=Mus_musculus, techniquesingle=ChIPSeq, layoutsingle=single, samplenamesingle=H3K27acMacro\n    resources: tmpdir=\/tmp\n\n\n&#91;Thu Oct 13 15:46:16 2022]\nrule download_fastq_single:\n    output: results\/Homo_sapiens\/fastq\/ChIPSeq\/single\/allchrom\/H3K36me3BlaER1.fastq.gz\n    jobid: 4\n    reason: Missing output files: results\/Homo_sapiens\/fastq\/ChIPSeq\/single\/allchrom\/H3K36me3BlaER1.fastq.gz\n    wildcards: speciessingle=Homo_sapiens, techniquesingle=ChIPSeq, layoutsingle=single, samplenamesingle=H3K36me3BlaER1\n    resources: tmpdir=\/tmp\n\n\n&#91;Thu Oct 13 15:46:16 2022]\nrule download_fastq_single:\n    output: results\/Mus_musculus\/fastq\/ATACSeq\/single\/allchrom\/ATACErythroblast.fastq.gz\n    jobid: 5\n    reason: Missing output files: results\/Mus_musculus\/fastq\/ATACSeq\/single\/allchrom\/ATACErythroblast.fastq.gz\n    wildcards: speciessingle=Mus_musculus, techniquesingle=ATACSeq, layoutsingle=single, samplenamesingle=ATACErythroblast\n    resources: tmpdir=\/tmp\n\n\n&#91;Thu Oct 13 15:46:16 2022]\nrule download_fastq_single:\n    output: results\/Homo_sapiens\/fastq\/ATACSeq\/single\/allchrom\/ATACA549.fastq.gz\n    jobid: 6\n    reason: Missing output files: results\/Homo_sapiens\/fastq\/ATACSeq\/single\/allchrom\/ATACA549.fastq.gz\n    wildcards: speciessingle=Homo_sapiens, techniquesingle=ATACSeq, layoutsingle=single, samplenamesingle=ATACA549\n    resources: tmpdir=\/tmp\n\n\n&#91;Thu Oct 13 15:46:16 2022]\nrule download_fastq_single:\n    output: results\/Mus_musculus\/fastq\/RNASeq\/single\/allchrom\/limbPolyAPlus.fastq.gz\n    jobid: 1\n    reason: Missing output files: results\/Mus_musculus\/fastq\/RNASeq\/single\/allchrom\/limbPolyAPlus.fastq.gz\n    wildcards: speciessingle=Mus_musculus, techniquesingle=RNASeq, layoutsingle=single, samplenamesingle=limbPolyAPlus\n    resources: tmpdir=\/tmp\n\n\n&#91;Thu Oct 13 15:46:16 
2022]\nlocalrule all:\n    input: results\/Mus_musculus\/fastq\/RNASeq\/single\/allchrom\/limbPolyAPlus.fastq.gz, results\/Homo_sapiens\/fastq\/RNASeq\/single\/allchrom\/GM12878PolyAPlus.fastq.gz, results\/Mus_musculus\/fastq\/ChIPSeq\/single\/allchrom\/H3K27acMacro.fastq.gz, results\/Homo_sapiens\/fastq\/ChIPSeq\/single\/allchrom\/H3K36me3BlaER1.fastq.gz, results\/Mus_musculus\/fastq\/ATACSeq\/single\/allchrom\/ATACErythroblast.fastq.gz, results\/Homo_sapiens\/fastq\/ATACSeq\/single\/allchrom\/ATACA549.fastq.gz\n    jobid: 0\n    reason: Input files updated by another job: results\/Mus_musculus\/fastq\/RNASeq\/single\/allchrom\/limbPolyAPlus.fastq.gz, results\/Homo_sapiens\/fastq\/ATACSeq\/single\/allchrom\/ATACA549.fastq.gz, results\/Homo_sapiens\/fastq\/RNASeq\/single\/allchrom\/GM12878PolyAPlus.fastq.gz, results\/Homo_sapiens\/fastq\/ChIPSeq\/single\/allchrom\/H3K36me3BlaER1.fastq.gz, results\/Mus_musculus\/fastq\/ChIPSeq\/single\/allchrom\/H3K27acMacro.fastq.gz, results\/Mus_musculus\/fastq\/ATACSeq\/single\/allchrom\/ATACErythroblast.fastq.gz\n    resources: tmpdir=\/tmp\n\nJob stats:\njob                      count    min threads    max threads\n---------------------  -------  -------------  -------------\nall                          1              1              1\ndownload_fastq_single        6              1              1\ntotal                        7              1              1\n\nReasons:\n    (check individual jobs above for details)\n    input files updated by another job:\n        all\n    missing output files:\n        download_fastq_single\n\nThis was a dry-run (flag -n). The order of jobs does not reflect the order of execution.\n<\/code><\/pre>\n\n\n\n<p>You can see above that the <code>single<\/code> samples are processed correctly. We can deduce that the problem comes from the <code>download_fastq_paired<\/code> rule. Let&#8217;s now add the paired rule back, but with a minimal structure, to reduce the number of possible sources of error. 
We want first to see if the error could come from the <code>output<\/code> section of the rule. Copy the following content into <code>Snakefile3<\/code>:<\/p>\n\n\n\n<pre class=\"wp-block-code\"><code>import pandas as pd\n\nconfigfile: \"config.yaml\"\n\nonstart:\n    print(\"##### DOWNLOAD FASTQ FILES #####\\n\") \n\n\n###############################################################################\n# Creating input table\n###############################################################################\n\n# Build the table of test datasets to download\nsamplesData = &#91;]\n\nfor tech in config&#91;\"testDatasets\"]&#91;\"technique\"]:\n  for org in config&#91;\"testDatasets\"]&#91;\"organism\"]:\n    pathSingle = config&#91;\"testDatasets\"]&#91;tech]&#91;org]&#91;\"singleEnded\"]\n    nameSingle = config&#91;\"testDatasets\"]&#91;tech]&#91;org]&#91;\"nameSingleEnd\"]\n    pathPaired1 = config&#91;\"testDatasets\"]&#91;tech]&#91;org]&#91;\"pairedEnded1\"]\n    pathPaired2 = config&#91;\"testDatasets\"]&#91;tech]&#91;org]&#91;\"pairedEnded2\"]\n    namePaired = config&#91;\"testDatasets\"]&#91;tech]&#91;org]&#91;\"namePairedEnd\"]\n    samplesData.append(&#91;nameSingle, tech, org, \"single\", pathSingle, \"NA\"])\n    samplesData.append(&#91;namePaired, tech, org, \"paired\", pathPaired1, pathPaired2])\n\ndf = pd.DataFrame(samplesData)\ndf.rename(columns={0: 'samples', 1: 'library_strategy', 2: 'organism', 3: 'library_layout', 4: 'link1', 5: 'link2'}, inplace=True)\n\n\n###############################################################################\n# Variables definition\n###############################################################################\n\n# Splitting the table into single or paired end experiments\n\nindex_single = df&#91;'library_layout'] == 'single'\nindex_paired = df&#91;'library_layout'] == 'paired'\ndf_single = df&#91;index_single]\ndf_paired = df&#91;index_paired]\n\n# Output files names\n\nSINGLESAMPLES = 
df_single&#91;'samples'].tolist()\nPAIREDSAMPLES = df_paired&#91;'samples'].tolist()\n\n# For Retrieving links to download sra files\n\nsamples_single_forlinks = pd.DataFrame(df_single).set_index(\"samples\",drop=False)\nsamples_paired_forlinks = pd.DataFrame(df_paired).set_index(\"samples\",drop=False)\n\n# Technique names\nSINGLETECH = df_single&#91;'library_strategy'].tolist()\nPAIREDTECH = df_paired&#91;'library_strategy'].tolist()\n\n## Species name\nSPECIESSINGLE = df_single&#91;'organism'].tolist()\nSPECIESPAIRED = df_paired&#91;'organism'].tolist()\n\n## Layout names\nLAYOUTSINGLE = df_single&#91;'library_layout'].tolist()\nLAYOUTPAIRED = df_paired&#91;'library_layout'].tolist()\n\n\n############\n# Rule all\n############\n\nrule all:\n  input:\n    expand(\"results\/{speciessingle}\/fastq\/{techniquesingle}\/{layoutsingle}\/allchrom\/{samplenamesingle}.fastq.gz\", zip, speciessingle=SPECIESSINGLE, techniquesingle=SINGLETECH, layoutsingle=LAYOUTSINGLE, samplenamesingle=SINGLESAMPLES),\n    expand(\"results\/{speciespaired}\/fastq\/{techniquepaired}\/{layoutpaired}\/allchrom\/{samplenamepaired}_1.fastq.gz\", zip, speciespaired=SPECIESPAIRED, techniquepaired=PAIREDTECH, layoutpaired=LAYOUTPAIRED, samplenamepaired=PAIREDSAMPLES)\n\n\nrule download_fastq_single:\n  output:\n    \"results\/{speciessingle}\/fastq\/{techniquesingle}\/{layoutsingle}\/allchrom\/{samplenamesingle}.fastq.gz\"\n  threads: 1\n  shell:\n    \"echo 'hello' > {output}\"\n\nrule download_fastq_paired:\n  output:\n    \"results\/{speciespaired}\/fastq\/{techniquepaired}\/{layoutpaired}\/allchrom\/{samplenamepaired}_1.fastq.gz\"\n  threads: 1\n  shell:\n    \"echo 'hello' > {output}\"\n<\/code><\/pre>\n\n\n\n<p>Perform a dry-run:<\/p>\n\n\n\n<pre class=\"wp-block-code\"><code>#!\/usr\/bin\/bash\n\nsnakemake --cores 1 --snakefile Snakefile3 -n\n<\/code><\/pre>\n\n\n\n<p>You can see that the error is back:<\/p>\n\n\n\n<pre class=\"wp-block-code\"><code>Building DAG of 
jobs...\nAmbiguousRuleException:\nRules download_fastq_paired and download_fastq_single are ambiguous for the file results\/Mus_musculus\/fastq\/RNASeq\/paired\/allchrom\/CD4PolyAPlus_1.fastq.gz.\nConsider starting rule output with a unique prefix, constrain your wildcards, or use the ruleorder directive.\nWildcards:\n    download_fastq_paired: layoutpaired=paired,samplenamepaired=CD4PolyAPlus,speciespaired=Mus_musculus,techniquepaired=RNASeq\n    download_fastq_single: layoutsingle=paired,samplenamesingle=CD4PolyAPlus_1,speciessingle=Mus_musculus,techniquesingle=RNASeq\nExpected input files:\n    download_fastq_paired: \n    download_fastq_single: Expected output files:\n    download_fastq_paired: results\/Mus_musculus\/fastq\/RNASeq\/paired\/allchrom\/CD4PolyAPlus_1.fastq.gz\n    download_fastq_single: results\/Mus_musculus\/fastq\/RNASeq\/paired\/allchrom\/CD4PolyAPlus_1.fastq.gz\n<\/code><\/pre>\n\n\n\n<p>We can now be confident that the problem comes from the line <code>\"results\/{speciespaired}\/fastq\/{techniquepaired}\/{layoutpaired}\/allchrom\/{samplenamepaired}_1.fastq.gz\"<\/code>. 
To be even more sure about it, let&#8217;s change this line (use a <code>Snakefile4<\/code>):<\/p>\n\n\n\n<pre class=\"wp-block-code\"><code>import pandas as pd\n\nconfigfile: \"config.yaml\"\n\nonstart:\n    print(\"##### DOWNLOAD FASTQ FILES #####\\n\") \n\n\n###############################################################################\n# Creating input table\n###############################################################################\n\n# Build the table of test datasets to download\nsamplesData = &#91;]\n\nfor tech in config&#91;\"testDatasets\"]&#91;\"technique\"]:\n  for org in config&#91;\"testDatasets\"]&#91;\"organism\"]:\n    pathSingle = config&#91;\"testDatasets\"]&#91;tech]&#91;org]&#91;\"singleEnded\"]\n    nameSingle = config&#91;\"testDatasets\"]&#91;tech]&#91;org]&#91;\"nameSingleEnd\"]\n    pathPaired1 = config&#91;\"testDatasets\"]&#91;tech]&#91;org]&#91;\"pairedEnded1\"]\n    pathPaired2 = config&#91;\"testDatasets\"]&#91;tech]&#91;org]&#91;\"pairedEnded2\"]\n    namePaired = config&#91;\"testDatasets\"]&#91;tech]&#91;org]&#91;\"namePairedEnd\"]\n    samplesData.append(&#91;nameSingle, tech, org, \"single\", pathSingle, \"NA\"])\n    samplesData.append(&#91;namePaired, tech, org, \"paired\", pathPaired1, pathPaired2])\n\ndf = pd.DataFrame(samplesData)\ndf.rename(columns={0: 'samples', 1: 'library_strategy', 2: 'organism', 3: 'library_layout', 4: 'link1', 5: 'link2'}, inplace=True)\n\n\n###############################################################################\n# Variables definition\n###############################################################################\n\n# Splitting the table into single or paired end experiments\n\nindex_single = df&#91;'library_layout'] == 'single'\nindex_paired = df&#91;'library_layout'] == 'paired'\ndf_single = df&#91;index_single]\ndf_paired = df&#91;index_paired]\n\n# Output files names\n\nSINGLESAMPLES = df_single&#91;'samples'].tolist()\nPAIREDSAMPLES = df_paired&#91;'samples'].tolist()\n\n# For 
Retrieving links to download sra files\n\nsamples_single_forlinks = pd.DataFrame(df_single).set_index(\"samples\",drop=False)\nsamples_paired_forlinks = pd.DataFrame(df_paired).set_index(\"samples\",drop=False)\n\n# Technique names\nSINGLETECH = df_single&#91;'library_strategy'].tolist()\nPAIREDTECH = df_paired&#91;'library_strategy'].tolist()\n\n## Species name\nSPECIESSINGLE = df_single&#91;'organism'].tolist()\nSPECIESPAIRED = df_paired&#91;'organism'].tolist()\n\n## Layout names\nLAYOUTSINGLE = df_single&#91;'library_layout'].tolist()\nLAYOUTPAIRED = df_paired&#91;'library_layout'].tolist()\n\n\n############\n# Rule all\n############\n\nrule all:\n  input:\n    expand(\"results\/{speciessingle}\/fastq\/{techniquesingle}\/{layoutsingle}\/allchrom\/{samplenamesingle}.fastq.gz\", zip, speciessingle=SPECIESSINGLE, techniquesingle=SINGLETECH, layoutsingle=LAYOUTSINGLE, samplenamesingle=SINGLESAMPLES),\n    \"results\/test\/test.txt\"\n\n\nrule download_fastq_single:\n  output:\n    \"results\/{speciessingle}\/fastq\/{techniquesingle}\/{layoutsingle}\/allchrom\/{samplenamesingle}.fastq.gz\"\n  threads: 1\n  shell:\n    \"echo 'hello' > {output}\"\n\nrule download_fastq_paired:\n  output:\n    \"results\/test\/test.txt\"\n  threads: 1\n  shell:\n    \"echo 'hello' > {output}\"\n<\/code><\/pre>\n\n\n\n<p>Perform a dry-run (<code>snakemake --cores 1 --snakefile Snakefile4 -n<\/code>) and you will notice that the error is gone. We can now be sure that the problem is coming from the output section of <code>download_fastq_paired<\/code> and more precisely from its wildcards. 
To isolate which wildcard is problematic, let&#8217;s test the ones from the path and the one from the file name separately.<\/p>\n\n\n\n<p>Copy this content into <code>Snakefile5<\/code> and perform a dry-run (<code>snakemake --cores 1 --snakefile Snakefile5 -n<\/code>):<\/p>\n\n\n\n<pre class=\"wp-block-code\"><code>import pandas as pd\n\nconfigfile: \"config.yaml\"\n\nonstart:\n    print(\"##### DOWNLOAD FASTQ FILES #####\\n\") \n\n\n###############################################################################\n# Creating input table\n###############################################################################\n\n# Build the table of test datasets to download\nsamplesData = &#91;]\n\nfor tech in config&#91;\"testDatasets\"]&#91;\"technique\"]:\n  for org in config&#91;\"testDatasets\"]&#91;\"organism\"]:\n    pathSingle = config&#91;\"testDatasets\"]&#91;tech]&#91;org]&#91;\"singleEnded\"]\n    nameSingle = config&#91;\"testDatasets\"]&#91;tech]&#91;org]&#91;\"nameSingleEnd\"]\n    pathPaired1 = config&#91;\"testDatasets\"]&#91;tech]&#91;org]&#91;\"pairedEnded1\"]\n    pathPaired2 = config&#91;\"testDatasets\"]&#91;tech]&#91;org]&#91;\"pairedEnded2\"]\n    namePaired = config&#91;\"testDatasets\"]&#91;tech]&#91;org]&#91;\"namePairedEnd\"]\n    samplesData.append(&#91;nameSingle, tech, org, \"single\", pathSingle, \"NA\"])\n    samplesData.append(&#91;namePaired, tech, org, \"paired\", pathPaired1, pathPaired2])\n\ndf = pd.DataFrame(samplesData)\ndf.rename(columns={0: 'samples', 1: 'library_strategy', 2: 'organism', 3: 'library_layout', 4: 'link1', 5: 'link2'}, inplace=True)\n\n\n###############################################################################\n# Variables definition\n###############################################################################\n\n# Splitting the table into single or paired end experiments\n\nindex_single = df&#91;'library_layout'] == 'single'\nindex_paired = df&#91;'library_layout'] == 'paired'\ndf_single = 
df&#91;index_single]\ndf_paired = df&#91;index_paired]\n\n# Output files names\n\nSINGLESAMPLES = df_single&#91;'samples'].tolist()\nPAIREDSAMPLES = df_paired&#91;'samples'].tolist()\n\n# For Retrieving links to download sra files\n\nsamples_single_forlinks = pd.DataFrame(df_single).set_index(\"samples\",drop=False)\nsamples_paired_forlinks = pd.DataFrame(df_paired).set_index(\"samples\",drop=False)\n\n# Technique names\nSINGLETECH = df_single&#91;'library_strategy'].tolist()\nPAIREDTECH = df_paired&#91;'library_strategy'].tolist()\n\n## Species name\nSPECIESSINGLE = df_single&#91;'organism'].tolist()\nSPECIESPAIRED = df_paired&#91;'organism'].tolist()\n\n## Layout names\nLAYOUTSINGLE = df_single&#91;'library_layout'].tolist()\nLAYOUTPAIRED = df_paired&#91;'library_layout'].tolist()\n\n\n############\n# Rule all\n############\n\nrule all:\n  input:\n    expand(\"results\/{speciessingle}\/fastq\/{techniquesingle}\/{layoutsingle}\/allchrom\/{samplenamesingle}.fastq.gz\", zip, speciessingle=SPECIESSINGLE, techniquesingle=SINGLETECH, layoutsingle=LAYOUTSINGLE, samplenamesingle=SINGLESAMPLES),\n    expand(\"results\/{speciespaired}\/fastq\/{techniquepaired}\/{layoutpaired}\/allchrom\/test_1.txt\", zip, speciespaired=SPECIESPAIRED, techniquepaired=PAIREDTECH, layoutpaired=LAYOUTPAIRED, samplenamepaired=PAIREDSAMPLES)\n\n\nrule download_fastq_single:\n  output:\n    \"results\/{speciessingle}\/fastq\/{techniquesingle}\/{layoutsingle}\/allchrom\/{samplenamesingle}.fastq.gz\"\n  threads: 1\n  shell:\n    \"echo 'hello' > {output}\"\n\nrule download_fastq_paired:\n  output:\n    \"results\/{speciespaired}\/fastq\/{techniquepaired}\/{layoutpaired}\/allchrom\/test_1.txt\"\n  threads: 1\n  shell:\n    \"echo 'hello' > {output}\"\n<\/code><\/pre>\n\n\n\n<p>Since the error was not generated, we can think that the problem is coming from the file name. 
Let&#8217;s change <code>test_1.txt<\/code> to <code>test_1.fastq.gz<\/code> (<code>cp Snakefile5 Snakefile6<\/code>, modify the file and run <code>snakemake --cores 1 --snakefile Snakefile6 -n<\/code>). You should obtain:<\/p>\n\n\n\n<pre class=\"wp-block-code\"><code>AmbiguousRuleException:\nRules download_fastq_paired and download_fastq_single are ambiguous for the file results\/Mus_musculus\/fastq\/RNASeq\/paired\/allchrom\/test_1.fastq.gz.\nConsider starting rule output with a unique prefix, constrain your wildcards, or use the ruleorder directive.\nWildcards:\n    download_fastq_paired: layoutpaired=paired,speciespaired=Mus_musculus,techniquepaired=RNASeq\n    download_fastq_single: layoutsingle=paired,samplenamesingle=test_1,speciessingle=Mus_musculus,techniquesingle=RNASeq\nExpected input files:\n    download_fastq_paired: \n    download_fastq_single: Expected output files:\n    download_fastq_paired: results\/Mus_musculus\/fastq\/RNASeq\/paired\/allchrom\/test_1.fastq.gz\n    download_fastq_single: results\/Mus_musculus\/fastq\/RNASeq\/paired\/allchrom\/test_1.fastq.gz\n<\/code><\/pre>\n\n\n\n<div style=\"height:21px\" aria-hidden=\"true\" class=\"wp-block-spacer\"><\/div>\n\n\n\n<h2 class=\"wp-block-heading\">Interpretation<\/h2>\n\n\n\n<p>You can see that the error observed before is back. This means that when two output patterns share the same structure (<code>test_1.fastq.gz<\/code> and <code>{samplenamesingle}.fastq.gz<\/code>), Snakemake, which works backward from the target files to the rules, cannot decide which rule should produce the file. 
Where one could see two different structures in:<\/p>\n\n\n\n<p>1) &#8220;results\/{speciessingle}\/fastq\/{techniquesingle}\/{layoutsingle}\/allchrom\/<br>{samplenamesingle}.fastq.gz&#8221;<br>2) &#8220;results\/{speciespaired}\/fastq\/{techniquepaired}\/{layoutpaired}\/allchrom\/<br>{samplenamepaired}_1.fastq.gz&#8221;<\/p>\n\n\n\n<p>Snakemake seems to interpret them as the single pattern: <code>folder1\/folder2\/folder3\/folder4\/folder5\/folder6\/filename.fastq.gz<\/code>. It then checks whether a target matching the pattern of <code>download_fastq_paired<\/code> also matches the output of another rule (here <code>download_fastq_single<\/code>) and assigns the value of <code>filename<\/code> to the wildcards of that other rule. By default a wildcard matches any non-empty string (the regular expression <code>.+<\/code>), so <code>{samplenamesingle}<\/code> can also take the value <code>CD4PolyAPlus_1<\/code>, as the <code>Wildcards<\/code> section of the error message shows.<\/p>\n\n\n\n<p>The Snakemake manual gives this example to explain how wildcards can become ambiguous:<\/p>\n\n\n\n<p><em>&#8220;Multiple wildcards in one filename can cause ambiguity. Consider the pattern {dataset}.{group}.txt and assume that a file 101.B.normal.txt is available. It is not clear whether dataset=101.B and group=normal or dataset=101 and group=B.normal in this case.&#8221;<\/em><\/p>\n\n\n\n<div style=\"height:21px\" aria-hidden=\"true\" class=\"wp-block-spacer\"><\/div>\n\n\n\n<h2 class=\"wp-block-heading\">Fixing the problem<\/h2>\n\n\n\n<p>You might have noticed in the <code>AmbiguousRuleException<\/code> the message <code>Consider starting rule output with a unique prefix, constrain your wildcards, or use the ruleorder directive<\/code>. Snakemake offers a way to <a href=\"download_fastq_paired\">constrain wildcards<\/a> with the keyword <code>wildcard_constraints<\/code>. The trick here is to use a regular expression that prevents the wildcards from containing an underscore. 
Add the following code to each rule of <code>Snakefile<\/code> and perform a dry-run (be careful about the indentation when you paste the code below):<\/p>\n\n\n\n<pre class=\"wp-block-code\"><code>&#91;...]\n\nrule download_fastq_single:\n&#91;...]\n  wildcard_constraints:\n    samplenamesingle=\"&#91;0-9A-Za-z]+\"\n&#91;...]\n\nrule download_fastq_paired:\n&#91;...]\n  wildcard_constraints:\n    samplenamepaired=\"&#91;0-9A-Za-z]+\"\n&#91;...]\n<\/code><\/pre>\n\n\n\n<div style=\"height:21px\" aria-hidden=\"true\" class=\"wp-block-spacer\"><\/div>\n\n\n\n<h2 class=\"wp-block-heading\">Conclusion<\/h2>\n\n\n\n<p>Keeping in mind that Snakemake looks at the targets first and then works backward to determine the rules that create them is key to understanding some of the errors that one may encounter. We have seen here that the origin of the problem impacting <code>download_fastq_single<\/code> was actually a downstream rule. Since our example was limited to two rules, it was not very complicated to figure out. This can become really difficult when working with a large number of rules interconnected in multiple ways. I would say that the key in such a case is to simplify the DAG step by step until the problematic rule can be spotted.<\/p>\n","protected":false},"excerpt":{"rendered":"<p>I want this post to describe a bug that I got recently that led me to use constraints on wildcards. The error was not intuitive and can be very disturbing at first if one is not used to the Snakemake logic. 
Below I will describe step by step how to build a minimal working example [&hellip;]<\/p>\n","protected":false},"author":5,"featured_media":3190,"comment_status":"closed","ping_status":"closed","sticky":false,"template":"","format":"standard","meta":{"_acf_changed":false,"footnotes":""},"categories":[4096],"tags":[4098],"embl_taxonomy":[],"class_list":["post-3138","post","type-post","status-publish","format-standard","has-post-thumbnail","hentry","category-technical","tag-snakemake"],"acf":[],"embl_taxonomy_terms":[],"featured_image_src":"https:\/\/www.embl.org\/groups\/bioinformatics-rome\/wp-content\/uploads\/2022\/10\/munch-resize.jpg","_links":{"self":[{"href":"https:\/\/www.embl.org\/groups\/bioinformatics-rome\/wp-json\/wp\/v2\/posts\/3138"}],"collection":[{"href":"https:\/\/www.embl.org\/groups\/bioinformatics-rome\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/www.embl.org\/groups\/bioinformatics-rome\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/www.embl.org\/groups\/bioinformatics-rome\/wp-json\/wp\/v2\/users\/5"}],"replies":[{"embeddable":true,"href":"https:\/\/www.embl.org\/groups\/bioinformatics-rome\/wp-json\/wp\/v2\/comments?post=3138"}],"version-history":[{"count":18,"href":"https:\/\/www.embl.org\/groups\/bioinformatics-rome\/wp-json\/wp\/v2\/posts\/3138\/revisions"}],"predecessor-version":[{"id":3662,"href":"https:\/\/www.embl.org\/groups\/bioinformatics-rome\/wp-json\/wp\/v2\/posts\/3138\/revisions\/3662"}],"wp:featuredmedia":[{"embeddable":true,"href":"https:\/\/www.embl.org\/groups\/bioinformatics-rome\/wp-json\/wp\/v2\/media\/3190"}],"wp:attachment":[{"href":"https:\/\/www.embl.org\/groups\/bioinformatics-rome\/wp-json\/wp\/v2\/media?parent=3138"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/www.embl.org\/groups\/bioinformatics-rome\/wp-json\/wp\/v2\/categories?post=3138"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/www.embl.org\/groups\/bioinformatics-rome\/wp-j
son\/wp\/v2\/tags?post=3138"},{"taxonomy":"embl_taxonomy","embeddable":true,"href":"https:\/\/www.embl.org\/groups\/bioinformatics-rome\/wp-json\/wp\/v2\/embl_taxonomy?post=3138"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}