Description
Is your feature request related to a problem?
When using a nested DataCatalog of the kind
from pytask import DataCatalog

MODEL_NAMES = ("ols", "logistic_regression")
DATA_NAMES = ("data_1", "data_2")

nested_data_catalogs = {
    model_name: {
        data_name: DataCatalog(name=f"{model_name}-{data_name}")
        for data_name in DATA_NAMES
    }
    for model_name in MODEL_NAMES
}
and adding products to a DataCatalog, e.g. via the following task:
from pathlib import Path
from typing import Any

from pytask import task
from typing_extensions import Annotated

from my_project.config import DATA_NAMES
from my_project.config import MODEL_NAMES
from my_project.config import nested_data_catalogs

for model_name in MODEL_NAMES:
    for data_name in DATA_NAMES:

        @task
        def fit_model(
            path: Path = Path("...", data_name)
        ) -> Annotated[
            Any, nested_data_catalogs[model_name][data_name]["fitted_model"]
        ]:
            data = ...
            fitted_model = ...
            return fitted_model
as described in the extended DataCatalog guide, I would expect the DAG to reflect the nested structure of the DataCatalog.
For now, only the PickleNode's name ("fitted_model" in the example) is used in the representation of the DAG. With multiple models and datasets, the label "fitted_model" is insufficient on the one hand and, on the other, produces a DAG that implies the wrong structure and dependencies.
Describe the solution you'd like
I would like the DAG to reflect the nested structure of the DataCatalog rather than use only the PickleNode's name. One approach would be to display both the name of the DataCatalog and the name of the PickleNode in the DAG, e.g. ols-data_1-fitted_model. Another approach would be to join the keys of nested_data_catalogs with the PickleNode's name, which produces a similar result in the example above but guarantees a more informative name in general.
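To make the difference concrete, here is a minimal sketch in plain Python that does not touch pytask at all. It only compares the labels the DAG would show under the current behaviour with the labels the second approach would produce. The helper `qualified_label` and the `-` separator are assumptions made for illustration; they are not part of pytask's API.

```python
from itertools import product

MODEL_NAMES = ("ols", "logistic_regression")
DATA_NAMES = ("data_1", "data_2")
NODE_NAME = "fitted_model"


def qualified_label(*keys: str, node_name: str, sep: str = "-") -> str:
    """Hypothetical helper: join the nesting keys with the node's name."""
    return sep.join((*keys, node_name))


# Behaviour described in this issue: each node is labelled by its name only,
# so all four (model, dataset) combinations collapse into a single label.
current_labels = {NODE_NAME for _ in product(MODEL_NAMES, DATA_NAMES)}

# Proposed behaviour: one distinct label per (model, dataset) combination.
proposed_labels = {
    qualified_label(model_name, data_name, node_name=NODE_NAME)
    for model_name, data_name in product(MODEL_NAMES, DATA_NAMES)
}

print(len(current_labels))   # 1
print(len(proposed_labels))  # 4
```

With the keys included, `qualified_label("ols", "data_1", node_name="fitted_model")` yields the label ols-data_1-fitted_model mentioned above, and every product node stays distinguishable in the DAG.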