ENH: DAG facilitating nested DataCatalog structure #648

@felixschmitz

Is your feature request related to a problem?

When using a nested DataCatalog of the kind

from pytask import DataCatalog


MODEL_NAMES = ("ols", "logistic_regression")
DATA_NAMES = ("data_1", "data_2")


nested_data_catalogs = {
    model_name: {
        data_name: DataCatalog(name=f"{model_name}-{data_name}")
        for data_name in DATA_NAMES
    }
    for model_name in MODEL_NAMES
}

and adding products to a DataCatalog e.g. via the following task:

from pathlib import Path
from typing import Any  # needed for the return annotation below
from pytask import task
from typing_extensions import Annotated

from my_project.config import DATA_NAMES
from my_project.config import MODEL_NAMES
from my_project.config import nested_data_catalogs


for model_name in MODEL_NAMES:
    for data_name in DATA_NAMES:

        @task
        def fit_model(
            path: Path = Path("...", data_name)
        ) -> Annotated[
            Any, nested_data_catalogs[model_name][data_name]["fitted_model"]
        ]:
            data = ...
            fitted_model = ...
            return fitted_model

as described in the extended DataCatalog guide, I would expect the DAG to reflect the nested structure of the DataCatalog.

Currently, only the PickleNode's name ("fitted_model" in the example) is used in the representation of the DAG. With multiple models and datasets, the label "fitted_model" alone is insufficient to tell the nodes apart, and it produces a DAG that implies the wrong structure and dependencies.

Describe the solution you'd like

I would like the DAG to reflect the nested structure of the DataCatalog instead of using only the PickleNode's name. One approach would be to display the name of the DataCatalog together with the PickleNode's name in the DAG, e.g. ols-data_1-fitted_model. Another approach would be to join the keys of nested_data_catalogs with the PickleNode's name. This produces a similar result in the example above, but guarantees a more informative name in general.
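To make the two naming approaches concrete, here is a minimal sketch in plain Python. It does not touch pytask internals: the helpers label_from_catalog_name and label_from_keys are hypothetical names invented for illustration, and the "-" separator is an assumption taken from the example label above.

```python
MODEL_NAMES = ("ols", "logistic_regression")
DATA_NAMES = ("data_1", "data_2")
NODE_NAME = "fitted_model"


def label_from_catalog_name(catalog_name: str, node_name: str) -> str:
    """Approach 1 (hypothetical): prefix the node name with the catalog's name."""
    return f"{catalog_name}-{node_name}"


def label_from_keys(keys: tuple, node_name: str) -> str:
    """Approach 2 (hypothetical): join the nesting keys with the node name."""
    return "-".join((*keys, node_name))


# Current behaviour as described in this issue: every node is labelled
# with the bare node name, so all four nodes share one label.
current_labels = [NODE_NAME for _ in MODEL_NAMES for _ in DATA_NAMES]

# Approach 1 relies on the catalog name chosen by the user,
# here f"{model_name}-{data_name}" as in the config above.
labels_by_catalog_name = [
    label_from_catalog_name(f"{model_name}-{data_name}", NODE_NAME)
    for model_name in MODEL_NAMES
    for data_name in DATA_NAMES
]

# Approach 2 relies only on the keys of the nested dictionary.
labels_by_keys = [
    label_from_keys((model_name, data_name), NODE_NAME)
    for model_name in MODEL_NAMES
    for data_name in DATA_NAMES
]

for label in labels_by_keys:
    print(label)
```

In this example both approaches yield the same four distinct labels. They would differ if a DataCatalog were given a name unrelated to its position in the nested dictionary, which is why the key-based variant seems the more robust of the two.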

Labels: enhancement (New feature or request)
