
Redesign the scaling tasks guide. #616

Merged · 6 commits · Jul 19, 2024

1 change: 1 addition & 0 deletions docs/source/changes.md
@@ -7,6 +7,7 @@ releases are available on [PyPI](https://pypi.org/project/pytask) and

## 0.5.1 - 2024-xx-xx

- {pull}`616` redesigns the guide on "Scaling Tasks".
- {pull}`617` fixes an interaction with provisional nodes and `@mark.persist`.
- {pull}`618` ensures that `root_dir` of `DirectoryNode` is created before the task is
executed.
83 changes: 83 additions & 0 deletions docs/source/how_to_guides/bp_complex_task_repetitions.md
@@ -0,0 +1,83 @@
# Complex task repetitions

{doc}`Task repetitions <../tutorials/repeating_tasks_with_different_inputs>` are amazing
if you want to execute lots of tasks while not repeating yourself in code.

But, in any bigger project, repetitions can become hard to maintain because there are
multiple layers or dimensions of repetition.

Here you find some tips on how to set up your project so that adding new dimensions and
extending existing ones becomes much easier.

## Example

You can write multiple loops around a task function where each loop stands for a
different dimension. A dimension might represent the datasets or the model
specifications used to analyze them, as in the following example. The task arguments
are derived from the dimensions.

```{literalinclude} ../../../docs_src/how_to_guides/bp_complex_task_repetitions/example.py
---
caption: task_example.py
---
```

There is nothing wrong with using nested loops for simpler projects. But projects often
grow over time and you run into the following problems.

- When you add a new task, you need to duplicate the nested loops in another module, as
  illustrated by the sketch after this list.
- When you add a dimension, you need to touch multiple files in your project and add
  another loop and level of indentation.

## Solution

The main idea is quickly explained: we first formalize the dimensions into objects and
then combine them in one object, so that we only have to iterate over instances of this
object in a single loop.

We start by defining the dimensions using {class}`~typing.NamedTuple` or
{func}`~dataclasses.dataclass`.

Then, we define the object that holds both pieces of information together and, for lack
of a better name, we call it an experiment.

```{literalinclude} ../../../docs_src/how_to_guides/bp_complex_task_repetitions/experiment.py
---
caption: config.py
---
```

A few points are worth noting.

- The names within each dimension need to be unique, so that combining them into the
  name of the experiment yields a unique and descriptive id.
- Dimensions might need more attributes than just a name, such as paths or other
  arguments for the task. Add them as needed, as sketched after this list.
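
As a sketch of the second point (not part of the included `config.py`), a dimension
with more attributes could also be written as a frozen dataclass; the fields `formula`
and `seed` are invented for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Model:
    """A model specification with extra attributes beyond its name."""

    name: str
    formula: str  # the model formula passed to the estimation, purely illustrative
    seed: int = 0  # a random seed used when fitting the model, purely illustrative
```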

Next, we will use these newly defined data structures and see how our tasks change when
we use them.

```{literalinclude} ../../../docs_src/how_to_guides/bp_complex_task_repetitions/example_improved.py
---
caption: task_example.py
---
```

As you see, we replaced the nested loops with a single loop over the experiments, and
the names and paths are now derived from the experiment and its dimensions instead of
being rebuilt inside the loops.
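
To see the payoff, here is the same hypothetical `task_plot.py` from above rewritten
with `EXPERIMENTS`; one flat loop replaces the duplicated nested loops. This is a
sketch that assumes `BLD` is also importable from `myproject.config`.

```python
# task_plot.py -- hypothetical companion module, now based on EXPERIMENTS.
from pathlib import Path
from typing import Annotated

from myproject.config import BLD
from myproject.config import EXPERIMENTS
from pytask import Product
from pytask import task

for experiment in EXPERIMENTS:

    @task(id=experiment.name)
    def task_plot_model(
        path_to_model: Path = experiment.path,
        path_to_plot: Annotated[Path, Product] = BLD / f"{experiment.name}.png",
    ) -> None: ...
```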

## Using the `DataCatalog`

## Adding another dimension

## Adding another level

## Executing a subset

## Grouping and aggregating

## Extending repetitions

Some parametrized tasks are costly to run in terms of computing power, memory, or time.
Users often extend repetitions, which triggers all repetitions to be rerun. In this
case, use the {func}`@pytask.mark.persist <pytask.mark.persist>` decorator, which is
explained in more detail in this {doc}`tutorial <../tutorials/making_tasks_persist>`.
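
A minimal sketch of how this could look in the improved example above, assuming `mark`
is imported from `pytask`:

```python
from pathlib import Path
from typing import Annotated

from myproject.config import EXPERIMENTS
from pytask import Product
from pytask import mark
from pytask import task

for experiment in EXPERIMENTS:

    @mark.persist  # reuse existing products instead of rerunning; see the tutorial
    @task(id=experiment.name)
    def task_fit_model(
        path_to_data: Path = experiment.dataset.path,
        path_to_model: Annotated[Path, Product] = experiment.path,
    ) -> None: ...
```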
101 changes: 0 additions & 101 deletions docs/source/how_to_guides/bp_scaling_tasks.md

This file was deleted.

2 changes: 1 addition & 1 deletion docs/source/how_to_guides/bp_structure_of_task_files.md
@@ -14,7 +14,7 @@ are looking for orientation or inspiration, here are some tips.
module is for.

```{seealso}
The only exception might be for {doc}`repetitions <bp_scaling_tasks>`.
The only exception might be for {doc}`repetitions <bp_complex_task_repetitions>`.
```

- The purpose of the task function is to handle IO operations like loading and saving
2 changes: 1 addition & 1 deletion docs/source/how_to_guides/index.md
@@ -42,5 +42,5 @@ maxdepth: 1
bp_structure_of_a_research_project
bp_structure_of_task_files
bp_templates_and_projects
bp_scaling_tasks
bp_complex_task_repetitions
```
@@ -291,7 +291,8 @@ for id_, kwargs in ID_TO_KWARGS.items():
def task_create_random_data(i, produces): ...
```

The {doc}`best-practices guide on parametrizations <../how_to_guides/bp_scaling_tasks>`
The
{doc}`best-practices guide on parametrizations <../how_to_guides/bp_complex_task_repetitions>`
goes into even more detail on how to scale parametrizations.

## A warning on globals
19 changes: 19 additions & 0 deletions docs_src/how_to_guides/bp_complex_task_repetitions/example.py
@@ -0,0 +1,19 @@
from pathlib import Path
from typing import Annotated

from pytask import Product
from pytask import task

SRC = Path(__file__).parent
BLD = SRC / "bld"


for data_name in ("a", "b", "c"):
    for model_name in ("ols", "logit", "linear_prob"):

        @task(id=f"{model_name}-{data_name}")
        def task_fit_model(
            path_to_data: Path = SRC / f"{data_name}.pkl",
            path_to_model: Annotated[Path, Product] = BLD
            / f"{data_name}-{model_name}.pkl",
        ) -> None: ...
14 changes: 14 additions & 0 deletions docs_src/how_to_guides/bp_complex_task_repetitions/example_improved.py
@@ -0,0 +1,14 @@
from pathlib import Path
from typing import Annotated

from myproject.config import EXPERIMENTS
from pytask import Product
from pytask import task

for experiment in EXPERIMENTS:

    @task(id=experiment.name)
    def task_fit_model(
        path_to_data: Path = experiment.dataset.path,
        path_to_model: Annotated[Path, Product] = experiment.path,
    ) -> None: ...
37 changes: 37 additions & 0 deletions docs_src/how_to_guides/bp_complex_task_repetitions/experiment.py
@@ -0,0 +1,37 @@
from pathlib import Path
from typing import NamedTuple

SRC = Path(__file__).parent
BLD = SRC / "bld"


class Dataset(NamedTuple):
    name: str

    @property
    def path(self) -> Path:
        return SRC / f"{self.name}.pkl"


class Model(NamedTuple):
    name: str


DATASETS = [Dataset("a"), Dataset("b"), Dataset("c")]
MODELS = [Model("ols"), Model("logit"), Model("linear_prob")]


class Experiment(NamedTuple):
    dataset: Dataset
    model: Model

    @property
    def name(self) -> str:
        return f"{self.model.name}-{self.dataset.name}"

    @property
    def path(self) -> Path:
        return BLD / f"{self.name}.pkl"


EXPERIMENTS = [Experiment(dataset, model) for dataset in DATASETS for model in MODELS]
20 changes: 0 additions & 20 deletions docs_src/how_to_guides/bp_scaling_tasks_1.py

This file was deleted.

39 changes: 0 additions & 39 deletions docs_src/how_to_guides/bp_scaling_tasks_2.py

This file was deleted.

18 changes: 0 additions & 18 deletions docs_src/how_to_guides/bp_scaling_tasks_3.py

This file was deleted.
