diff --git a/docs/source/changes.md b/docs/source/changes.md
index 48c57983..6b13c8ec 100644
--- a/docs/source/changes.md
+++ b/docs/source/changes.md
@@ -7,6 +7,7 @@ releases are available on [PyPI](https://pypi.org/project/pytask) and
 
 ## 0.5.1 - 2024-xx-xx
 
+- {pull}`616` redesigns the guide on "Scaling Tasks".
 - {pull}`617` fixes an interaction with provisional nodes and `@mark.persist`.
 - {pull}`618` ensures that `root_dir` of `DirectoryNode` is created before the task
   is executed.
diff --git a/docs/source/how_to_guides/bp_complex_task_repetitions.md b/docs/source/how_to_guides/bp_complex_task_repetitions.md
new file mode 100644
index 00000000..68e44569
--- /dev/null
+++ b/docs/source/how_to_guides/bp_complex_task_repetitions.md
@@ -0,0 +1,83 @@
+# Complex task repetitions
+
+{doc}`Task repetitions <../tutorials/repeating_tasks_with_different_inputs>` are amazing
+if you want to execute lots of tasks while not repeating yourself in code.
+
+But, in any bigger project, repetitions can become hard to maintain because there are
+multiple layers or dimensions of repetition.
+
+Here you find some tips on how to set up your project such that adding new dimensions
+and extending existing ones becomes much easier.
+
+## Example
+
+You can write multiple loops around a task function where each loop stands for a
+different dimension. A dimension might represent different datasets or the model
+specifications to analyze the datasets, as in the following example. The task arguments
+are derived from the dimensions.
+
+```{literalinclude} ../../../docs_src/how_to_guides/bp_complex_task_repetitions/example.py
+---
+caption: task_example.py
+---
+```
+
+There is nothing wrong with using nested loops in simpler projects. But projects often
+grow over time, and then you run into these problems:
+
+- When you add a new task, you need to duplicate the nested loops in another module.
+- When you add a dimension, you need to touch multiple files in your project and add
+  another loop and level of indentation.
+
+## Solution
+
+The main idea of the solution is quickly explained. We will, first, formalize dimensions
+into objects and, second, combine them in one object such that we only have to iterate
+over instances of this object in a single loop.
+
+We will start by defining the dimensions using {class}`~typing.NamedTuple` or
+{func}`~dataclasses.dataclass`.
+
+Then, we will define the object that holds both pieces of information together and, for
+lack of a better name, we will call it an experiment.
+
+```{literalinclude} ../../../docs_src/how_to_guides/bp_complex_task_repetitions/experiment.py
+---
+caption: config.py
+---
+```
+
+Two things are worth noting:
+
+- The names within each dimension need to be unique, so that combining them for the name
+  of the experiment yields a unique and descriptive id.
+- Dimensions might need more attributes than just a name, like paths or other arguments
+  for the task. Add them.
+
+Next, we use these newly defined data structures and see how the task changes.
+
+```{literalinclude} ../../../docs_src/how_to_guides/bp_complex_task_repetitions/example_improved.py
+---
+caption: task_example.py
+---
+```
+
+As you see, we replaced the nested loops with a single loop over the experiments and
+derive all task arguments from the experiment. Adding a dimension or a level now only
+requires changes in `config.py`.
+
+## Using the `DataCatalog`
+
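+The `path` property on `Experiment` manages the product paths explicitly, but you can
+also hand that job to a {class}`~pytask.DataCatalog` and use the experiment name as the
+catalog key. The following is a minimal sketch, assuming you add
+`data_catalog = DataCatalog()` to `config.py`; keys that are not registered explicitly
+are stored by the catalog as pickle files.
+
+```python
+# Content of task_example.py. A sketch; data_catalog is assumed to live in config.py.
+from pathlib import Path
+from typing import Annotated
+from typing import Any
+
+from myproject.config import EXPERIMENTS
+from myproject.config import data_catalog
+from pytask import task
+
+for experiment in EXPERIMENTS:
+
+    @task(id=experiment.name)
+    def task_fit_model(
+        path_to_data: Path = experiment.dataset.path,
+    ) -> Annotated[Any, data_catalog[experiment.name]]:
+        # Return the fitted model; pytask stores it in the catalog.
+        ...
+```
+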
+## Adding another dimension
+
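+With formalized dimensions, a new dimension is one new class and one more loop in the
+definition of `EXPERIMENTS`; the task module keeps its single loop. Here is a sketch
+that extends the `config.py` from above with a hypothetical `Method` dimension for the
+fitting method:
+
+```python
+# Content of config.py. A sketch; the Method dimension is hypothetical. Dataset,
+# Model, DATASETS, MODELS, and BLD are the objects already defined above.
+from pathlib import Path
+from typing import NamedTuple
+
+
+class Method(NamedTuple):
+    name: str
+
+
+METHODS = [Method("ml"), Method("bayesian")]
+
+
+class Experiment(NamedTuple):
+    dataset: Dataset
+    method: Method
+    model: Model
+
+    @property
+    def name(self) -> str:
+        return f"{self.model.name}-{self.dataset.name}-{self.method.name}"
+
+    @property
+    def path(self) -> Path:
+        return BLD / f"{self.name}.pkl"
+
+
+EXPERIMENTS = [
+    Experiment(dataset, method, model)
+    for dataset in DATASETS
+    for method in METHODS
+    for model in MODELS
+]
+```
+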
+## Adding another level
+
+## Executing a subset
+
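+Unique and descriptive experiment names pay off as soon as you want to execute only a
+part of the project. Since the name is the task id, you can select a single experiment
+or a whole slice of one dimension with `pytask -k`. For example, with the ids from
+`task_example.py` above:
+
+```console
+# Run a single experiment.
+pytask -k ols-a
+
+# Run all experiments involving the ols model.
+pytask -k ols
+```
+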
+## Grouping and aggregating
+
+## Extending repetitions
+
+Some parametrized tasks are costly to run - costly in terms of computing power, memory,
+or time. When users extend such repetitions, all repetitions are often rerun. Thus, use
+the {func}`@pytask.mark.persist <pytask.mark.persist>` decorator, which is explained in
+more detail in this {doc}`tutorial <../tutorials/making_tasks_persist>`.
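+As a sketch, assuming the experiment setup from above, the decorator slots into the
+loop like this:
+
+```python
+# Content of task_example.py. Persist repetitions that are costly to recompute.
+from pathlib import Path
+from typing import Annotated
+
+import pytask
+from myproject.config import EXPERIMENTS
+from pytask import Product
+from pytask import task
+
+for experiment in EXPERIMENTS:
+
+    @pytask.mark.persist
+    @task(id=experiment.name)
+    def task_fit_model(
+        path_to_data: Path = experiment.dataset.path,
+        path_to_model: Annotated[Path, Product] = experiment.path,
+    ) -> None: ...
+```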
diff --git a/docs/source/how_to_guides/bp_scaling_tasks.md b/docs/source/how_to_guides/bp_scaling_tasks.md
deleted file mode 100644
index fa7cb5e9..00000000
--- a/docs/source/how_to_guides/bp_scaling_tasks.md
+++ /dev/null
@@ -1,101 +0,0 @@
-# Scaling tasks
-
-In any bigger project you quickly come to the point where you stack multiple repetitions
-of tasks on top of each other.
-
-For example, you have one dataset, four different ways to prepare it, and three
-statistical models to analyze the data. The cartesian product of all steps combined
-comprises twelve differently fitted models.
-
-Here you find some tips on how to set up your tasks such that you can easily modify the
-cartesian product of steps.
-
-## Scalability
-
-Let us dive right into the aforementioned example. We start with one dataset `data.csv`.
-Then, we will create four different specifications of the data and, finally, fit three
-different models to each specification.
-
-This is the structure of the project.
-
-```
-my_project
-├───pyproject.toml
-│
-├───src
-│   └───my_project
-│       ├────config.py
-│       │
-│       ├───data
-│       │   └────data.csv
-│       │
-│       ├───data_preparation
-│       │   ├────__init__.py
-│       │   ├────config.py
-│       │   └────task_prepare_data.py
-│       │
-│       └───estimation
-│           ├────__init__.py
-│           ├────config.py
-│           └────task_estimate_models.py
-│
-├───.pytask
-│   └────...
-│
-└───bld
-```
-
-The folder structure, the main `config.py` which holds `SRC` and `BLD`, and the tasks
-follow the same structure advocated throughout the tutorials.
-
-New are the local configuration files in each subfolder of `my_project`, which contain
-objects shared across tasks. For example, `config.py` holds the paths to the processed
-data and the names of the data sets.
-
-```{literalinclude} ../../../docs_src/how_to_guides/bp_scaling_tasks_1.py
-```
-
-The task file `task_prepare_data.py` uses these objects to build the repetitions.
-
-```{literalinclude} ../../../docs_src/how_to_guides/bp_scaling_tasks_2.py
-```
-
-All arguments for the loop and the {func}`@task <pytask.task>` decorator are built
-within a function to keep the logic in one place and the module's namespace clean.
-
-Ids are used to make the task {ref}`ids <ids>` more descriptive and to simplify their
-selection with {ref}`expressions <expressions>`. Here is an example of the task ids with
-an explicit id.
-
-```
-# With id
-.../my_project/data_preparation/task_prepare_data.py::task_prepare_data[data_0]
-```
-
-Next, we move to the estimation to see how we can build another repetition on top.
-
-```{literalinclude} ../../../docs_src/how_to_guides/bp_scaling_tasks_3.py
-```
-
-In the local configuration, we define `ESTIMATIONS` which combines the information on
-data and model. The dictionary's key can be used as a task id whenever the estimation is
-involved. It allows triggering all tasks related to one estimation - estimation,
-figures, tables - with one command.
-
-```console
-pytask -k linear_probability_data_0
-```
-
-And here is the task file.
-
-```{literalinclude} ../../../docs_src/how_to_guides/bp_scaling_tasks_4.py
-```
-
-Replicating this pattern across a project allows a clean way to define repetitions.
-
-## Extending repetitions
-
-Some parametrized tasks are costly to run - costly in terms of computing power, memory,
-or time. Users often extend repetitions triggering all repetitions to be rerun. Thus,
-use the {func}`@pytask.mark.persist <pytask.mark.persist>` decorator, which is explained
-in more detail in this {doc}`tutorial <../tutorials/making_tasks_persist>`.
diff --git a/docs/source/how_to_guides/bp_structure_of_task_files.md b/docs/source/how_to_guides/bp_structure_of_task_files.md
index 857f6479..84e16789 100644
--- a/docs/source/how_to_guides/bp_structure_of_task_files.md
+++ b/docs/source/how_to_guides/bp_structure_of_task_files.md
@@ -14,7 +14,7 @@ are looking for orientation or inspiration, here are some tips.
   module is for.
 
   ```{seealso}
-  The only exception might be for {doc}`repetitions <bp_scaling_tasks>`.
+  The only exception might be for {doc}`repetitions <bp_complex_task_repetitions>`.
   ```
 
 - The purpose of the task function is to handle IO operations like loading and saving
diff --git a/docs/source/how_to_guides/index.md b/docs/source/how_to_guides/index.md
index 8f0e9f47..53068ee0 100644
--- a/docs/source/how_to_guides/index.md
+++ b/docs/source/how_to_guides/index.md
@@ -42,5 +42,5 @@ maxdepth: 1
 bp_structure_of_a_research_project
 bp_structure_of_task_files
 bp_templates_and_projects
-bp_scaling_tasks
+bp_complex_task_repetitions
 ```
diff --git a/docs/source/tutorials/repeating_tasks_with_different_inputs.md b/docs/source/tutorials/repeating_tasks_with_different_inputs.md
index 750435d6..136152ed 100644
--- a/docs/source/tutorials/repeating_tasks_with_different_inputs.md
+++ b/docs/source/tutorials/repeating_tasks_with_different_inputs.md
@@ -291,7 +291,8 @@ for id_, kwargs in ID_TO_KWARGS.items():
     def task_create_random_data(i, produces): ...
 
-The {doc}`best-practices guide on parametrizations <../how_to_guides/bp_scaling_tasks>`
+The
+{doc}`best-practices guide on parametrizations <../how_to_guides/bp_complex_task_repetitions>`
 goes into even more detail on how to scale parametrizations.
 
 ## A warning on globals
diff --git a/docs_src/how_to_guides/bp_complex_task_repetitions/example.py b/docs_src/how_to_guides/bp_complex_task_repetitions/example.py
new file mode 100644
index 00000000..3e3bf14e
--- /dev/null
+++ b/docs_src/how_to_guides/bp_complex_task_repetitions/example.py
@@ -0,0 +1,19 @@
+from pathlib import Path
+from typing import Annotated
+
+from pytask import Product
+from pytask import task
+
+SRC = Path(__file__).parent
+BLD = SRC / "bld"
+
+
+for data_name in ("a", "b", "c"):
+    for model_name in ("ols", "logit", "linear_prob"):
+
+        @task(id=f"{model_name}-{data_name}")
+        def task_fit_model(
+            path_to_data: Path = SRC / f"{data_name}.pkl",
+            path_to_model: Annotated[Path, Product] = BLD
+            / f"{model_name}-{data_name}.pkl",
+        ) -> None: ...
diff --git a/docs_src/how_to_guides/bp_complex_task_repetitions/example_improved.py b/docs_src/how_to_guides/bp_complex_task_repetitions/example_improved.py
new file mode 100644
index 00000000..741d2c19
--- /dev/null
+++ b/docs_src/how_to_guides/bp_complex_task_repetitions/example_improved.py
@@ -0,0 +1,14 @@
+from pathlib import Path
+from typing import Annotated
+
+from myproject.config import EXPERIMENTS
+from pytask import Product
+from pytask import task
+
+for experiment in EXPERIMENTS:
+
+    @task(id=experiment.name)
+    def task_fit_model(
+        path_to_data: Path = experiment.dataset.path,
+        path_to_model: Annotated[Path, Product] = experiment.path,
+    ) -> None: ...
diff --git a/docs_src/how_to_guides/bp_complex_task_repetitions/experiment.py b/docs_src/how_to_guides/bp_complex_task_repetitions/experiment.py
new file mode 100644
index 00000000..002c669e
--- /dev/null
+++ b/docs_src/how_to_guides/bp_complex_task_repetitions/experiment.py
@@ -0,0 +1,37 @@
+from pathlib import Path
+from typing import NamedTuple
+
+SRC = Path(__file__).parent
+BLD = SRC / "bld"
+
+
+class Dataset(NamedTuple):
+    name: str
+
+    @property
+    def path(self) -> Path:
+        return SRC / f"{self.name}.pkl"
+
+
+class Model(NamedTuple):
+    name: str
+
+
+DATASETS = [Dataset("a"), Dataset("b"), Dataset("c")]
+MODELS = [Model("ols"), Model("logit"), Model("linear_prob")]
+
+
+class Experiment(NamedTuple):
+    dataset: Dataset
+    model: Model
+
+    @property
+    def name(self) -> str:
+        return f"{self.model.name}-{self.dataset.name}"
+
+    @property
+    def path(self) -> Path:
+        return BLD / f"{self.name}.pkl"
+
+
+EXPERIMENTS = [Experiment(dataset, model) for dataset in DATASETS for model in MODELS]
diff --git a/docs_src/how_to_guides/bp_scaling_tasks_1.py b/docs_src/how_to_guides/bp_scaling_tasks_1.py
deleted file mode 100644
index 52d6ea61..00000000
--- a/docs_src/how_to_guides/bp_scaling_tasks_1.py
+++ /dev/null
@@ -1,20 +0,0 @@
-# Content of config.py
-from pathlib import Path
-
-from my_project.config import BLD
-from my_project.config import SRC
-
-DATA = {
-    "data_0": {"subset": "subset_1"},
-    "data_1": {"subset": "subset_2"},
-    "data_2": {"subset": "subset_3"},
-    "data_3": {"subset": "subset_4"},
-}
-
-
-def path_to_input_data(name: str) -> Path:
-    return SRC / "data" / "data.csv"
-
-
-def path_to_processed_data(name: str) -> Path:
-    return BLD / "data" / f"processed_{name}.pkl"
diff --git a/docs_src/how_to_guides/bp_scaling_tasks_2.py b/docs_src/how_to_guides/bp_scaling_tasks_2.py
deleted file mode 100644
index f31cfc64..00000000
--- a/docs_src/how_to_guides/bp_scaling_tasks_2.py
+++ /dev/null
@@ -1,39 +0,0 @@
-# Content of task_prepare_data.py
-from pathlib import Path
-
-from my_project.data_preparation.config import DATA
-from my_project.data_preparation.config import path_to_input_data
-from my_project.data_preparation.config import path_to_processed_data
-from pandas import pd
-from pytask import Product
-from pytask import task
-from typing_extensions import Annotated
-
-
-def _create_parametrization(data: list[str]) -> dict[str, Path]:
-    id_to_kwargs = {}
-    for data_name, kwargs in data.items():
-        id_to_kwargs[data_name] = {
-            "path_to_input_data": path_to_input_data(data_name),
-            "path_to_processed_data": path_to_processed_data(data_name),
-            **kwargs,
-        }
-
-    return id_to_kwargs
-
-
-_ID_TO_KWARGS = _create_parametrization(DATA)
-
-
-for id_, kwargs in _ID_TO_KWARGS.items():
-
-    @task(id=id_, kwargs=kwargs)
-    def task_prepare_data(
-        path_to_input_data: Path,
-        subset: str,
-        path_to_processed_data: Annotated[Path, Product],
-    ) -> None:
-        df = pd.read_csv(path_to_input_data)
-        # ... transform the data.
-        subset = df.loc[df["subset"].eq(subset)]
-        subset.to_pickle(path_to_processed_data)
diff --git a/docs_src/how_to_guides/bp_scaling_tasks_3.py b/docs_src/how_to_guides/bp_scaling_tasks_3.py
deleted file mode 100644
index 1e2103d4..00000000
--- a/docs_src/how_to_guides/bp_scaling_tasks_3.py
+++ /dev/null
@@ -1,18 +0,0 @@
-# Content of config.py
-from pathlib import Path
-
-from my_project.config import BLD
-from my_project.data_preparation.config import DATA
-
-_MODELS = ["linear_probability", "logistic_model", "decision_tree"]
-
-
-ESTIMATIONS = {
-    f"{data_name}_{model_name}": {"model": model_name, "data": data_name}
-    for model_name in _MODELS
-    for data_name in DATA
-}
-
-
-def path_to_estimation_result(name: str) -> Path:
-    return BLD / "estimation" / f"estimation_{name}.pkl"
diff --git a/docs_src/how_to_guides/bp_scaling_tasks_4.py b/docs_src/how_to_guides/bp_scaling_tasks_4.py
deleted file mode 100644
index a6c66539..00000000
--- a/docs_src/how_to_guides/bp_scaling_tasks_4.py
+++ /dev/null
@@ -1,36 +0,0 @@
-# Content of task_estimate_models.py
-from pathlib import Path
-
-from my_project.data_preparation.config import path_to_processed_data
-from my_project.estimations.config import ESTIMATIONS
-from my_project.estimations.config import path_to_estimation_result
-from pytask import Product
-from pytask import task
-from typing_extensions import Annotated
-
-
-def _create_parametrization(
-    estimations: dict[str, dict[str, str]],
-) -> dict[str, str | Path]:
-    id_to_kwargs = {}
-    for name, config in estimations.items():
-        id_to_kwargs[name] = {
-            "path_to_data": path_to_processed_data(config["data"]),
-            "model": config["model"],
-            "path_to_estimation": path_to_estimation_result(name),
-        }
-
-    return id_to_kwargs
-
-
-_ID_TO_KWARGS = _create_parametrization(ESTIMATIONS)
-
-
-for id_, kwargs in _ID_TO_KWARGS.items():
-
-    @task(id=id_, kwargs=kwargs)
-    def task_estmate_models(
-        path_to_data: Path, model: str, path_to_estimation: Annotated[Path, Product]
-    ) -> None:
-        if model == "linear_probability":
-            ...