
Redesign the scaling tasks guide. #616

Merged · 6 commits · Jul 19, 2024

1 change: 1 addition & 0 deletions docs/source/changes.md
@@ -7,6 +7,7 @@ releases are available on [PyPI](https://pypi.org/project/pytask) and

## 0.5.1 - 2024-xx-xx

- {pull}`616` redesigns the guide on "Scaling Tasks".
- {pull}`617` fixes an interaction with provisional nodes and `@mark.persist`.
- {pull}`618` ensures that `root_dir` of `DirectoryNode` is created before the task is
executed.
83 changes: 83 additions & 0 deletions docs/source/how_to_guides/bp_complex_task_repetitions.md
@@ -0,0 +1,83 @@
# Complex task repetitions

{doc}`Task repetitions <../tutorials/repeating_tasks_with_different_inputs>` are amazing
if you want to execute lots of tasks while not repeating yourself in code.

But, in any bigger project, repetitions can become hard to maintain because there are
multiple layers or dimensions of repetition.

Here you find some tips on how to set up your project so that adding new dimensions and
extending existing ones becomes much easier.

## Example

You can write multiple loops around a task function where each loop stands for a
different dimension. A dimension might represent the datasets or the model
specifications used to analyze them, as in the following example. The task arguments
are derived from the dimensions.

```{literalinclude} ../../../docs_src/how_to_guides/bp_complex_task_repetitions/example.py
---
caption: task_example.py
---
```

There is nothing wrong with using nested loops for simpler projects. But projects often
grow over time and you run into the following problems.

- When you add a new task, you need to duplicate the nested loops in another module, as
  illustrated by the sketch after this list.
- When you add a dimension, you need to touch multiple files in your project and add
  another loop and level of indentation.

## Solution

The main idea is quickly explained: we first formalize the dimensions into objects and
then combine them in one object, so that we only have to iterate over instances of this
object in a single loop.

We start by defining the dimensions using {class}`~typing.NamedTuple` or
{func}`~dataclasses.dataclass`.

Then, we define the object that holds both pieces of information together and, for lack
of a better name, we call it an experiment.

```{literalinclude} ../../../docs_src/how_to_guides/bp_complex_task_repetitions/experiment.py
---
caption: config.py
---
```

A few points are worth noting.

- The names within each dimension need to be unique, so that combining them into the
  name of the experiment yields a unique and descriptive id.
- Dimensions might need more attributes than just a name, such as paths or other
  arguments for the task. Add them as needed, as sketched after this list.
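
As a sketch of the second point (not part of the included `config.py`), a dimension
with more attributes could also be written as a frozen dataclass; the fields `formula`
and `seed` are invented for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Model:
    """A model specification with extra attributes beyond its name."""

    name: str
    formula: str  # the model formula passed to the estimation, purely illustrative
    seed: int = 0  # a random seed used when fitting the model, purely illustrative
```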

Next, we will use these newly defined data structures and see how our tasks change when
we use them.

```{literalinclude} ../../../docs_src/how_to_guides/bp_complex_task_repetitions/example_improved.py
---
caption: task_example.py
---
```

As you see, we replaced the nested loops with a single loop over the experiments, and
the names and paths are now derived from the experiment and its dimensions instead of
being rebuilt inside the loops.
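
To see the payoff, here is the same hypothetical `task_plot.py` from above rewritten
with `EXPERIMENTS`; one flat loop replaces the duplicated nested loops. This is a
sketch that assumes `BLD` is also importable from `myproject.config`.

```python
# task_plot.py -- hypothetical companion module, now based on EXPERIMENTS.
from pathlib import Path
from typing import Annotated

from myproject.config import BLD
from myproject.config import EXPERIMENTS
from pytask import Product
from pytask import task

for experiment in EXPERIMENTS:

    @task(id=experiment.name)
    def task_plot_model(
        path_to_model: Path = experiment.path,
        path_to_plot: Annotated[Path, Product] = BLD / f"{experiment.name}.png",
    ) -> None: ...
```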

## Using the `DataCatalog`

## Adding another dimension

## Adding another level

## Executing a subset

## Grouping and aggregating

## Extending repetitions

Some parametrized tasks are costly to run in terms of computing power, memory, or time.
Users often extend repetitions, which triggers all repetitions to be rerun. In this
case, use the {func}`@pytask.mark.persist <pytask.mark.persist>` decorator, which is
explained in more detail in this {doc}`tutorial <../tutorials/making_tasks_persist>`.
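
A minimal sketch of how this could look in the improved example above, assuming `mark`
is imported from `pytask`:

```python
from pathlib import Path
from typing import Annotated

from myproject.config import EXPERIMENTS
from pytask import Product
from pytask import mark
from pytask import task

for experiment in EXPERIMENTS:

    @mark.persist  # reuse existing products instead of rerunning; see the tutorial
    @task(id=experiment.name)
    def task_fit_model(
        path_to_data: Path = experiment.dataset.path,
        path_to_model: Annotated[Path, Product] = experiment.path,
    ) -> None: ...
```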
101 changes: 0 additions & 101 deletions docs/source/how_to_guides/bp_scaling_tasks.md

This file was deleted.

2 changes: 1 addition & 1 deletion docs/source/how_to_guides/bp_structure_of_task_files.md
@@ -14,7 +14,7 @@ are looking for orientation or inspiration, here are some tips.
module is for.

```{seealso}
The only exception might be for {doc}`repetitions <bp_scaling_tasks>`.
The only exception might be for {doc}`repetitions <bp_complex_task_repetitions>`.
```

- The purpose of the task function is to handle IO operations like loading and saving
2 changes: 1 addition & 1 deletion docs/source/how_to_guides/index.md
@@ -42,5 +42,5 @@ maxdepth: 1
bp_structure_of_a_research_project
bp_structure_of_task_files
bp_templates_and_projects
bp_scaling_tasks
bp_complex_task_repetitions
```
@@ -291,7 +291,8 @@ for id_, kwargs in ID_TO_KWARGS.items():
def task_create_random_data(i, produces): ...
```

The {doc}`best-practices guide on parametrizations <../how_to_guides/bp_scaling_tasks>`
The
{doc}`best-practices guide on parametrizations <../how_to_guides/bp_complex_task_repetitions>`
goes into even more detail on how to scale parametrizations.

## A warning on globals
19 changes: 19 additions & 0 deletions docs_src/how_to_guides/bp_complex_task_repetitions/example.py
@@ -0,0 +1,19 @@
from pathlib import Path
from typing import Annotated

from pytask import Product
from pytask import task

SRC = Path(__file__).parent
BLD = SRC / "bld"


for data_name in ("a", "b", "c"):
    for model_name in ("ols", "logit", "linear_prob"):

        @task(id=f"{model_name}-{data_name}")
        def task_fit_model(
            path_to_data: Path = SRC / f"{data_name}.pkl",
            path_to_model: Annotated[Path, Product] = BLD
            / f"{data_name}-{model_name}.pkl",
        ) -> None: ...
14 changes: 14 additions & 0 deletions docs_src/how_to_guides/bp_complex_task_repetitions/example_improved.py
@@ -0,0 +1,14 @@
from pathlib import Path
from typing import Annotated

from myproject.config import EXPERIMENTS
from pytask import Product
from pytask import task

for experiment in EXPERIMENTS:

    @task(id=experiment.name)
    def task_fit_model(
        path_to_data: Path = experiment.dataset.path,
        path_to_model: Annotated[Path, Product] = experiment.path,
    ) -> None: ...
37 changes: 37 additions & 0 deletions docs_src/how_to_guides/bp_complex_task_repetitions/experiment.py
@@ -0,0 +1,37 @@
from pathlib import Path
from typing import NamedTuple

SRC = Path(__file__).parent
BLD = SRC / "bld"


class Dataset(NamedTuple):
    name: str

    @property
    def path(self) -> Path:
        return SRC / f"{self.name}.pkl"


class Model(NamedTuple):
    name: str


DATASETS = [Dataset("a"), Dataset("b"), Dataset("c")]
MODELS = [Model("ols"), Model("logit"), Model("linear_prob")]


class Experiment(NamedTuple):
    dataset: Dataset
    model: Model

    @property
    def name(self) -> str:
        return f"{self.model.name}-{self.dataset.name}"

    @property
    def path(self) -> Path:
        return BLD / f"{self.name}.pkl"


EXPERIMENTS = [Experiment(dataset, model) for dataset in DATASETS for model in MODELS]
20 changes: 0 additions & 20 deletions docs_src/how_to_guides/bp_scaling_tasks_1.py

This file was deleted.

39 changes: 0 additions & 39 deletions docs_src/how_to_guides/bp_scaling_tasks_2.py

This file was deleted.

18 changes: 0 additions & 18 deletions docs_src/how_to_guides/bp_scaling_tasks_3.py

This file was deleted.
