Add a data catalog. #419

Merged
merged 41 commits into from
Nov 2, 2023
82859e7
Draft datastore.
tobiasraabe Sep 18, 2023
47d0fd6
Merge remote-tracking branch 'origin/main' into draft-datastore
tobiasraabe Sep 18, 2023
4001f98
Finish draft of data catalog.
tobiasraabe Sep 18, 2023
93e6da0
Fix tests.
tobiasraabe Sep 18, 2023
9524c30
more data catalog.
tobiasraabe Sep 19, 2023
f45ad78
Merge branch 'main' into draft-datastore
tobiasraabe Oct 12, 2023
891f4a6
Allow to use the datastore independent of pytask.
tobiasraabe Oct 14, 2023
49e684f
Add deepdiff for tests.
tobiasraabe Oct 14, 2023
90fb3f7
Merge branch 'main' into draft-datastore
tobiasraabe Oct 15, 2023
f5948d9
Add collection to data store.
tobiasraabe Oct 15, 2023
dc204c1
Merge branch 'main' into draft-datastore
tobiasraabe Oct 16, 2023
8457c86
Add some tests.
tobiasraabe Oct 17, 2023
9a14830
better docs.
tobiasraabe Oct 17, 2023
d9f6493
Merge branch 'main' into draft-datastore
tobiasraabe Oct 17, 2023
99dd59a
more test.
tobiasraabe Oct 17, 2023
88bf7a9
Merge branch 'main' into draft-datastore
tobiasraabe Oct 18, 2023
7ed8612
Merge branch 'main' into draft-datastore
tobiasraabe Oct 18, 2023
3e14cf8
fix.
tobiasraabe Oct 20, 2023
5282bf3
Merge branch 'main' into draft-datastore
tobiasraabe Oct 21, 2023
51242b6
Merge branch 'main' into draft-datastore
tobiasraabe Oct 25, 2023
d3ccb4a
extend guide.
tobiasraabe Oct 25, 2023
e018a2c
fix.
tobiasraabe Oct 25, 2023
eb2d973
Merge branch 'main' into draft-datastore
tobiasraabe Oct 25, 2023
552c8e8
Fix version.
tobiasraabe Oct 25, 2023
caccdc4
Merge branch 'main' into draft-datastore
tobiasraabe Oct 27, 2023
9208e3f
add more tests.
tobiasraabe Oct 27, 2023
5f5a4f5
Merge branch 'draft-datastore' of https://github.com/pytask-dev/pytas…
tobiasraabe Oct 27, 2023
391fc23
fix.
tobiasraabe Oct 27, 2023
adad36a
Revert some changes.
tobiasraabe Oct 27, 2023
a7f8c78
Fix.
tobiasraabe Oct 27, 2023
f502889
Remove unnecessary files.
tobiasraabe Oct 27, 2023
3a5d01f
Better description.
tobiasraabe Oct 28, 2023
ab65f9d
Fix.
tobiasraabe Oct 28, 2023
38aac82
fix.
tobiasraabe Oct 31, 2023
3f74d3f
Merge branch 'main' into draft-datastore
tobiasraabe Oct 31, 2023
d0e626f
Add tutorial.
tobiasraabe Nov 1, 2023
5cabf22
fix.
tobiasraabe Nov 1, 2023
52da000
Fix.
tobiasraabe Nov 1, 2023
dbab505
Align how to guide.
tobiasraabe Nov 1, 2023
adbba61
Remove some things.
tobiasraabe Nov 1, 2023
6fe3ec4
Add example of custom node.
tobiasraabe Nov 2, 2023
1 change: 1 addition & 0 deletions .gitignore
@@ -15,6 +15,7 @@ _generated
.eggs

.pytask.sqlite3
.pytask

build
dist
1 change: 1 addition & 0 deletions .pre-commit-config.yaml
@@ -115,6 +115,7 @@ repos:
docs/source/tutorials/repeating_tasks_with_different_inputs.md|
docs/source/tutorials/selecting_tasks.md|
docs/source/tutorials/set_up_a_project.md|
docs/source/tutorials/using_a_data_catalog.md|
docs/source/tutorials/write_a_task.md
)$
- repo: https://github.com/nbQA-dev/nbQA
26 changes: 26 additions & 0 deletions docs/source/_static/md/defining-dependencies-products.md
@@ -0,0 +1,26 @@
<div class="termy">

```console

$ pytask
──────────────────────────── Start pytask session ────────────────────────────
Platform: win32 -- Python <span style="color: var(--termynal-blue)">3.10.0</span>, pytask <span style="color: var(--termynal-blue)">0.4.0</span>, pluggy <span style="color: var(--termynal-blue)">1.0.0</span>
Root: C:\Users\pytask-dev\git\my_project
Collected <span style="color: var(--termynal-blue)">2</span> tasks.

┏━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━┓
┃ Task ┃ Outcome ┃
┡━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━┩
│ <span class="termynal-dim">task_data_preparation.py::</span>task_create_random_data │ <span class="termynal-success">.</span> │
│ <span class="termynal-dim">task_plot_data.py::</span>task_plot_data │ <span class="termynal-success">.</span> │
└───────────────────────────────────────────────────┴─────────┘

<span class="termynal-dim">──────────────────────────────────────────────────────────────────────────────</span>
<span class="termynal-success">╭───────────</span> <span style="font-weight: bold;">Summary</span> <span class="termynal-success">────────────╮</span>
<span class="termynal-success">│</span> <span style="font-weight: bold;"> 2 Collected tasks </span> <span class="termynal-success">│</span>
<span class="termynal-success">│</span> <span class="termynal-success-textonly"> 2 Succeeded (100.0%) </span> <span class="termynal-success">│</span>
<span class="termynal-success">╰────────────────────────────────╯</span>
<span class="termynal-success">───────────────────────── Succeeded in 0.06 seconds ──────────────────────────</span>
```

</div>
1 change: 1 addition & 0 deletions docs/source/conf.py
@@ -82,6 +82,7 @@

intersphinx_mapping = {
"click": ("https://click.palletsprojects.com/en/8.0.x/", None),
"deepdiff": ("https://zepworks.com/deepdiff/current/", None),
"networkx": ("https://networkx.org/documentation/stable", None),
"pandas": ("https://pandas.pydata.org/docs", None),
"pluggy": ("https://pluggy.readthedocs.io/en/latest", None),
4 changes: 2 additions & 2 deletions docs/source/how_to_guides/hashing_inputs_of_tasks.md
@@ -62,10 +62,10 @@ from interpreter session to interpreter session for security reasons (see
```

{class}`list` and {class}`dict` are not hashable by default. Luckily, there are
libraries who provide this functionality like `deepdiff`. We can use them to pass a
libraries that provide this functionality, like {mod}`deepdiff`. We can use them to pass a
function to the {class}`~pytask.PythonNode` that generates a stable hash.
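
The idea behind a stable hash can also be sketched with the standard library alone.
The following is a minimal illustration, not the `deepdiff` approach used in this
guide: `stable_hash` is a hypothetical helper, and it assumes that the value is
JSON-serializable.

```python
import hashlib
import json
from typing import Any


def stable_hash(value: Any) -> str:
    """Return a hash of ``value`` that is identical across interpreter sessions."""
    # Sorting the keys makes the serialization independent of dict insertion order.
    serialized = json.dumps(value, sort_keys=True, separators=(",", ":"))
    # Unlike the built-in hash(), SHA-256 is not salted per interpreter session.
    return hashlib.sha256(serialized.encode("utf-8")).hexdigest()
```

Two dictionaries with the same content but a different key order produce the same
digest, whereas lists with a different element order do not.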

First, install `deepdiff`.
First, install {mod}`deepdiff`.

```console
$ pip install deepdiff
1 change: 1 addition & 0 deletions docs/source/how_to_guides/index.md
@@ -19,6 +19,7 @@ hashing_inputs_of_tasks
using_task_returns
writing_custom_nodes
how_to_write_a_plugin
the_data_catalog
```

## Best Practice Guides
73 changes: 73 additions & 0 deletions docs/source/how_to_guides/the_data_catalog.md
@@ -0,0 +1,73 @@
# The `DataCatalog` - Revisited

An introduction to the data catalog can be found in the
[tutorial](../tutorials/using_a_data_catalog.md).

This guide explains some details that were left out of the tutorial.

## Changing the default node

The data catalog uses the {class}`~pytask.PickleNode` by default to serialize any kind
of Python object. You can use any other node that follows the {protocol}`~pytask.PNode`
protocol and register it when creating the data catalog.

For example, use the {class}`~pytask.PythonNode` as the default.

```python
from pytask import DataCatalog
from pytask import PythonNode


data_catalog = DataCatalog(default_node=PythonNode)
```

Or, learn to write your own node by reading {doc}`writing_custom_nodes`.

Here is an example of a `PickleNode` that uses `cloudpickle` instead of the standard
`pickle` module.

```{literalinclude} ../../../docs_src/how_to_guides/the_data_catalog.py
```
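
The included file is not reproduced on this page. As a rough sketch of the idea, a
node of this kind could look like the class below. The class name `CloudPickleNode`,
the `name` and `path` attributes, and the modification-time-based `state` method are
assumptions made for illustration, so check {doc}`writing_custom_nodes` for the
attributes and methods a node actually needs. The sketch falls back to the standard
`pickle` module when `cloudpickle` is not installed.

```python
from pathlib import Path
from typing import Any
from typing import Optional

try:
    import cloudpickle as pickler
except ImportError:  # Fall back so that the sketch runs without cloudpickle.
    import pickle as pickler


class CloudPickleNode:
    """A hypothetical node that stores any Python object in a pickle file."""

    def __init__(self, name: str, path: Path) -> None:
        self.name = name
        self.path = path

    def state(self) -> Optional[str]:
        # Assumption: the modification time signals whether the file has changed.
        if self.path.exists():
            return str(self.path.stat().st_mtime)
        return None

    def load(self) -> Any:
        with self.path.open("rb") as f:
            return pickler.load(f)

    def save(self, value: Any) -> None:
        # Create the parent directory so that saving into a new folder works.
        self.path.parent.mkdir(parents=True, exist_ok=True)
        with self.path.open("wb") as f:
            pickler.dump(value, f)
```

A class like this would then be passed as `default_node` when the data catalog is
created, in the same way as `PythonNode` above.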

## Changing the name and the default path

By default, the data catalogs store their data in a directory `.pytask/data_catalogs`.
If you use a `pyproject.toml` with a `[tool.pytask.ini_options]` section, then the
`.pytask` folder is in the same folder as the configuration file.

The default name for a catalog is `"default"` and so you will find its data in
`.pytask/data_catalogs/default`. If you assign a different name like
`"data_management"`, you will find the data in `.pytask/data_catalogs/data_management`.

```python
data_catalog = DataCatalog(name="data_management")
```

You can also change the path where the data catalogs will be stored by changing the
`path` attribute. Here, we store the data catalog's data in a `.data` directory next to
the module where the data catalog is defined.

```python
from pathlib import Path

from pytask import DataCatalog


data_catalog = DataCatalog(path=Path(__file__).parent / ".data")
```
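
The default layout described in this section can be summarized in a few lines. This
is only an illustration of the naming scheme: `catalog_directory` is a hypothetical
helper that is not part of pytask, and `root` stands for the folder that contains the
configuration file.

```python
from pathlib import Path


def catalog_directory(root: Path, name: str = "default") -> Path:
    """Return the directory in which a catalog called ``name`` keeps its data."""
    # Every catalog gets its own subfolder below ``.pytask/data_catalogs``.
    return root / ".pytask" / "data_catalogs" / name
```

For example, `catalog_directory(Path("my_project"), "data_management")` points to
`my_project/.pytask/data_catalogs/data_management`.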

## Multiple data catalogs

You can use multiple data catalogs when you want to separate your datasets across
multiple catalogs or when you want to use the same names multiple times (although it is
not recommended!).

Make sure you assign different names to the data catalogs so that their data is stored
in different directories.

```python
from pytask import DataCatalog

# Stored in .pytask/data_catalogs/a
data_catalog_a = DataCatalog(name="a")

# Stored in .pytask/data_catalogs/b
data_catalog_b = DataCatalog(name="b")
```

Or, use different paths as explained above.
9 changes: 7 additions & 2 deletions docs/source/reference_guides/api.md
@@ -33,7 +33,7 @@ To write to the terminal, use pytask's console.
pytask uses marks to attach additional information to task functions which is processed
by the host or by plugins. The following marks are available by default.

### Marks
### Built-in marks

```{eval-rst}
.. function:: pytask.mark.depends_on(objects: Any | Iterable[Any] | dict[Any, Any])
@@ -236,7 +236,8 @@ The remaining exceptions convey specific errors.

```{eval-rst}
.. autoclass:: pytask.Session

.. autoclass:: pytask.DataCatalog
:members:
```

## Protocols
@@ -262,7 +263,11 @@ Nodes are the interface for different kinds of dependencies or products.

```{eval-rst}
.. autoclass:: pytask.PathNode
:members: load, save
.. autoclass:: pytask.PickleNode
:members: load, save
.. autoclass:: pytask.PythonNode
:members: load, save
```

To parse dependencies and products from nodes, use the following functions.
52 changes: 42 additions & 10 deletions docs/source/tutorials/defining_dependencies_products.md
@@ -3,22 +3,47 @@
To ensure pytask executes all tasks in the correct order, you need to define
dependencies and products for each task.

This tutorial offers you different interfaces. One important difference between them is
that if you are comfortable with type annotations or not afraid to try them, take a look
at the tabs named `Python 3.10+` or `Python 3.8+`.
This tutorial offers you different interfaces. If you are comfortable with type
annotations or not afraid to try them, take a look at the tabs named `Python 3.10+` or
`Python 3.8+`.

If you want to avoid type annotations for now, look at the tab named `produces`.

The deprecated approaches can be found in the tabs named `Decorators`.

```{seealso}
An overview on the different interfaces and their strength and weaknesses is given in
{doc}`../explanations/interfaces_for_dependencies_products`.
```

Let's first focus on how to define products which should already be familiar to you.
First, we focus on how to define products which should already be familiar to you. Then,
we focus on how task dependencies can be declared.

We use the same project layout as before and add a `task_plot_data.py` module.

```text
my_project
├───pyproject.toml
├───src
│ └───my_project
│ ├────config.py
│ ├────task_data_preparation.py
│ └────task_plot_data.py
├───setup.py
├───.pytask.sqlite3
└───bld
├────data.pkl
└────plot.png
```

## Products

Let's revisit the task from the {doc}`previous tutorial <write_a_task>`.
Let's revisit the task from the {doc}`previous tutorial <write_a_task>` that we defined
in `task_data_preparation.py`.

::::{tab-set}

@@ -90,7 +115,9 @@ beneficial for handling paths conveniently and across platforms.
Most tasks have dependencies and it is important to specify them. Then, pytask ensures that
the dependencies are available before executing the task.
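
How declared dependencies and products determine an execution order can be sketched
with the standard library. This is an illustration of the principle and not pytask's
implementation. The dictionary layout is hypothetical, while the task and file names
are taken from the project layout in this tutorial.

```python
from graphlib import TopologicalSorter

# Each task lists the files it needs and the files it creates.
tasks = {
    "task_plot_data": {"depends_on": {"bld/data.pkl"}, "produces": {"bld/plot.png"}},
    "task_create_random_data": {"depends_on": set(), "produces": {"bld/data.pkl"}},
}

# Map every product to the task that creates it.
producer = {
    product: name for name, task in tasks.items() for product in task["produces"]
}

# A task must run after all tasks that produce one of its dependencies.
graph = {
    name: {producer[dep] for dep in task["depends_on"] if dep in producer}
    for name, task in tasks.items()
}

order = list(TopologicalSorter(graph).static_order())
```

Here, `order` places `task_create_random_data` before `task_plot_data` because the
plot depends on `bld/data.pkl`.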

In the example you see a task that creates a plot while relying on some data set.
As an example, we want to extend our project with another task that plots the data that
we generated with `task_create_random_data`. The task is called `task_plot_data` and we
will define it in `task_plot_data.py`.

::::{tab-set}

@@ -104,7 +131,7 @@ pytask assumes that all function arguments that do not have the {class}`~pytask.
annotation are dependencies of the task.

```{literalinclude} ../../../docs_src/tutorials/defining_dependencies_products_dependencies_py310.py
:emphasize-lines: 9
:emphasize-lines: 11
```

:::
@@ -119,7 +146,7 @@ pytask assumes that all function arguments that do not have the {class}`~pytask.
annotation are dependencies of the task.

```{literalinclude} ../../../docs_src/tutorials/defining_dependencies_products_dependencies_py38.py
:emphasize-lines: 9
:emphasize-lines: 11
```

:::
@@ -134,7 +161,7 @@ pytask assumes that all function arguments that are not passed to the argument
`produces` are dependencies of the task.

```{literalinclude} ../../../docs_src/tutorials/defining_dependencies_products_dependencies_produces.py
:emphasize-lines: 7
:emphasize-lines: 9
```

:::
@@ -152,12 +179,17 @@ Equivalent to products, you can use the
access the dependency path inside the function and load the data.

```{literalinclude} ../../../docs_src/tutorials/defining_dependencies_products_dependencies_decorators.py
:emphasize-lines: 7, 9
:emphasize-lines: 9, 11
```

:::
::::

Now, let us execute the two tasks.

```{include} ../_static/md/defining-dependencies-products.md
```

## Relative paths

Dependencies and products do not have to be absolute paths. If paths are relative, they
1 change: 1 addition & 0 deletions docs/source/tutorials/index.md
@@ -11,6 +11,7 @@ installation
set_up_a_project
write_a_task
defining_dependencies_products
using_a_data_catalog
invoking_pytask
configuration
plugins