This repository was archived by the owner on Oct 24, 2024. It is now read-only.
Data structures docs #103
Merged
TomNicholas
merged 54 commits into
xarray-contrib:main
from
TomNicholas:data_structures_docs2
Jun 26, 2022
22004e4
sketching out changes needed to integrate variables into DataTree
TomNicholas 8f0518c
Merge branch 'main' into initial_integration_refactor
TomNicholas 3a4f874
fixed some other basic conflicts
TomNicholas 8c6a68a
fix mypy errors
TomNicholas b503b06
can create basic datatree node objects again
TomNicholas 1efd7f2
child-variable name collisions dectected correctly
TomNicholas 438d73a
in-progres
TomNicholas 904eae3
merge in main
TomNicholas 2ca1c1a
add _replace method
TomNicholas 547d1ac
updated tests to assert identical instead of check .ds is expected_ds
TomNicholas 6f78fcd
refactor .ds setter to use _replace
TomNicholas 715ce49
refactor init to use _replace
TomNicholas edd2f67
refactor test tree to avoid init
TomNicholas b2c51aa
attempt at copy methods
TomNicholas a20e85f
rewrote implementation of .copy method
TomNicholas 8387a1c
xfailing test for deepcopying
TomNicholas 52ef23b
pseudocode implementation of DatasetView
TomNicholas 4a5317e
Revert "pseudocode implementation of DatasetView"
TomNicholas b60a4af
removed duplicated implementation of copy
TomNicholas 3077bf7
reorganise API docs
TomNicholas 5368f8b
expose data_vars, coords etc. properties
TomNicholas cae0a4e
try except with calculate_dimensions private import
TomNicholas 72af61c
add keys/values/items methods
TomNicholas ec11072
don't use has_data when .variables would do
TomNicholas 7c2c4f8
explanation of basic properties
TomNicholas 66b7adf
add data structures page to index
TomNicholas b61e940
revert adding documentation in favour of that going in a different PR
TomNicholas 163e54d
explanation of basic properties
TomNicholas ab0dfe1
add data structures page to index
TomNicholas 5c36b18
create tree node-by-node
TomNicholas c75fb0b
create tree from dict
TomNicholas a59ff54
[pre-commit.ci] auto fixes from pre-commit.com hooks
pre-commit-ci[bot] 91c7afd
dict-like interface
TomNicholas 0e9b384
correct deepcopy tests
TomNicholas d56f89b
use .data_vars in copy tests
TomNicholas cf5051c
Merge branch 'main' into initial_integration_refactor
TomNicholas 78e7faa
Merge branch 'data_structures_docs2' of https://github.com/TomNichola…
TomNicholas 0910d79
Merge branch 'initial_integration_refactor' into data_structures_docs2
TomNicholas e191660
[pre-commit.ci] auto fixes from pre-commit.com hooks
pre-commit-ci[bot] 0c1dd29
black
TomNicholas d3ebd12
Merge branch 'data_structures_docs2' of https://github.com/TomNichola…
TomNicholas 7bd3b47
Merge branch 'main' into data_structures_docs2
TomNicholas 438a090
Merge branch 'main' into data_structures_docs2
TomNicholas 3ce120b
whatsnew
TomNicholas fcd94a6
Merge branch 'data_structures_docs2' of https://github.com/TomNichola…
TomNicholas f87ef2f
data contents
TomNicholas 44b8db5
dictionary-like access
TomNicholas 02f63a2
TODOs
TomNicholas b74b94f
test assigning int
TomNicholas 86f218b
allow assigning coercible values
TomNicholas 94fd6c7
Merge branch 'main' of https://github.com/xarray-contrib/datatree
TomNicholas c947539
Merge branch 'main' into data_structures_docs2
TomNicholas 657d5c9
simplify example using #115
TomNicholas 566ca1a
add note about fully qualified names
TomNicholas
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,212 @@ | ||
.. _data structures: | ||
|
||
Data Structures | ||
=============== | ||
|
||
.. ipython:: python | ||
:suppress: | ||
|
||
import numpy as np | ||
import pandas as pd | ||
import xarray as xr | ||
import datatree | ||
|
||
np.random.seed(123456) | ||
np.set_printoptions(threshold=10) | ||
|
||
.. note:: | ||
|
||
This page builds on the information given in xarray's main page on | ||
`data structures <https://docs.xarray.dev/en/stable/user-guide/data-structures.html>`_, so we suggest that you | ||
familiarise yourself with that page first. | ||
|
||
DataTree | ||
-------- | ||
|
||
:py:class:`DataTree` is xarray's highest-level data structure, able to organise heterogeneous data which | ||
could not be stored inside a single ``Dataset`` object. This includes representing the recursive structure of multiple | ||
`groups`_ within a netCDF file or `Zarr Store`_. | ||
|
||
.. _groups: https://www.unidata.ucar.edu/software/netcdf/workshops/2011/groups-types/GroupsIntro.html | ||
.. _Zarr Store: https://zarr.readthedocs.io/en/stable/tutorial.html#groups | ||
|
||
Each ``DataTree`` object (or "node") contains the same data that a single ``xarray.Dataset`` would (i.e. ``DataArray`` objects | ||
stored under hashable keys), and so has the same key properties: | ||
|
||
- ``dims``: a dictionary mapping dimension names to lengths, for the variables in this node, | ||
- ``data_vars``: a dict-like container of DataArrays corresponding to variables in this node, | ||
- ``coords``: another dict-like container of DataArrays, corresponding to coordinate variables in this node, | ||
- ``attrs``: a dict holding arbitrary metadata relevant to the data in this node. | ||
|
||
A single ``DataTree`` object acts much like a single ``Dataset`` object, and has a similar set of dict-like methods | ||
defined upon it. However, ``DataTree`` objects can also contain other ``DataTree`` objects, so they can be thought of as nested dict-like | ||
containers of both ``xarray.DataArray`` and ``DataTree`` objects. | ||
|
||
A single datatree object is known as a "node", and its position relative to other nodes is defined by two more key | ||
properties: | ||
|
||
- ``children``: An ordered dictionary mapping from names to other ``DataTree`` objects, known as its "child nodes". | ||
- ``parent``: The single ``DataTree`` object whose children this datatree is a member of, known as its "parent node". | ||
|
||
Each child automatically knows about its parent node, and a node without a parent is known as a "root" node | ||
(represented by the ``parent`` attribute pointing to ``None``). | ||
Nodes can have multiple children, but as each child node has at most one parent, there can only ever be one root node in a given tree. | ||
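
The parent/child bookkeeping described above can be sketched in plain Python. This is a minimal illustration only: the ``Node`` class, its ``root`` property, and the way the setter keeps ``children`` in sync are hypothetical and are not taken from the ``DataTree`` implementation.

```python
from __future__ import annotations


class Node:
    """Hypothetical stand-in for a tree node (not the DataTree class)."""

    def __init__(self, name: str, parent: Node | None = None):
        self.name = name
        self.children: dict[str, Node] = {}
        self._parent: Node | None = None
        if parent is not None:
            self.parent = parent

    @property
    def parent(self) -> Node | None:
        return self._parent

    @parent.setter
    def parent(self, new_parent: Node | None) -> None:
        # Detach from the old parent so each node has at most one parent.
        if self._parent is not None:
            del self._parent.children[self.name]
        self._parent = new_parent
        # Register with the new parent so it "automatically knows" its child.
        if new_parent is not None:
            new_parent.children[self.name] = self

    @property
    def root(self) -> Node:
        # Walk upwards until we reach the node whose parent is None.
        node = self
        while node.parent is not None:
            node = node.parent
        return node


oak = Node("Oak")
bonsai = Node("Bonsai", parent=oak)
sherman = Node("General Sherman")
oak.parent = sherman  # the previous root gains a parent, so the root changes
```

Because every node has at most one parent, walking upwards from any node always ends at the same single root.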
|
||
The overall structure is technically a `connected acyclic undirected rooted graph`, otherwise known as a | ||
`"Tree" <https://en.wikipedia.org/wiki/Tree_(graph_theory)>`_. | ||
|
||
.. note:: | ||
|
||
Technically a ``DataTree`` with more than one child node forms an `"Ordered Tree" <https://en.wikipedia.org/wiki/Tree_(graph_theory)#Ordered_tree>`_, | ||
because the children are stored in an Ordered Dictionary. However, this distinction only really matters for a few | ||
edge cases involving operations on multiple trees simultaneously, and can safely be ignored by most users. | ||
|
||
|
||
``DataTree`` objects can also optionally have a ``name`` as well as ``attrs``, just like a ``DataArray``. | ||
Again these are not normally used unless explicitly accessed by the user. | ||
|
||
|
||
Creating a DataTree | ||
~~~~~~~~~~~~~~~~~~~ | ||
|
||
There are two ways to create a ``DataTree`` from scratch. The first is to create each node individually, | ||
specifying the nodes' relationship to one another as you create each one. | ||
|
||
The ``DataTree`` constructor takes: | ||
|
||
- ``data``: The data that will be stored in this node, represented by a single ``xarray.Dataset``, or a named ``xarray.DataArray``. | ||
- ``parent``: The parent node (if there is one), given as a ``DataTree`` object. | ||
- ``children``: The various child nodes (if there are any), given as a mapping from string keys to ``DataTree`` objects. | ||
- ``name``: A string to use as the name of this node. | ||
|
||
Let's make a datatree node without anything in it: | ||
|
||
.. ipython:: python | ||
|
||
from datatree import DataTree | ||
|
||
# create root node | ||
node1 = DataTree(name="Oak") | ||
|
||
node1 | ||
|
||
At this point our node is also the root node, as every tree has a root node. | ||
|
||
We can add a second node to this tree either by referring to the first node in the constructor of the second: | ||
|
||
.. ipython:: python | ||
|
||
# add a child by referring to the parent node | ||
node2 = DataTree(name="Bonsai", parent=node1) | ||
|
||
or by dynamically updating the attributes of one node to refer to another: | ||
|
||
.. ipython:: python | ||
|
||
# add a grandparent by updating the .parent property of an existing node | ||
node0 = DataTree(name="General Sherman") | ||
node1.parent = node0 | ||
|
||
Our tree now has three nodes within it, and one of the two new nodes has become the new root: | ||
|
||
.. ipython:: python | ||
|
||
node0 | ||
|
||
It is at tree construction time that consistency checks are enforced. For instance, if we try to create a `cycle`, an error will be raised: | ||
|
||
.. ipython:: python | ||
:okexcept: | ||
|
||
node0.parent = node2 | ||
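
One way such a cycle check could work is to walk up from the proposed parent and refuse the assignment if the child is found among its ancestors. The sketch below is an assumption for illustration: ``Node``, ``ancestors`` and ``set_parent`` are hypothetical names, and the use of ``ValueError`` is a choice made here, not necessarily the exception the library raises.

```python
class Node:
    """Hypothetical minimal node holding only a name and a parent link."""

    def __init__(self, name):
        self.name = name
        self.parent = None


def ancestors(node):
    """Yield the parent, grandparent, ... of ``node`` up to the root."""
    current = node.parent
    while current is not None:
        yield current
        current = current.parent


def set_parent(child, new_parent):
    """Attach ``child`` under ``new_parent`` unless that would form a cycle."""
    # A cycle would form if new_parent is the child itself or one of its descendants,
    # i.e. if the child appears among new_parent's ancestors.
    if new_parent is child or any(a is child for a in ancestors(new_parent)):
        raise ValueError(f"{new_parent.name!r} is a descendant of {child.name!r}")
    child.parent = new_parent


node0, node1, node2 = Node("General Sherman"), Node("Oak"), Node("Bonsai")
set_parent(node1, node0)
set_parent(node2, node1)

try:
    set_parent(node0, node2)  # would make the root a child of its own grandchild
    cycle_rejected = False
except ValueError:
    cycle_rejected = True
```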
|
||
The second way is to build the tree from a dictionary of filesystem-like paths and corresponding ``xarray.Dataset`` objects. | ||
|
||
This relies on a syntax inspired by Unix-like filesystems, where the "path" to a node is specified by the keys of each intermediate node in sequence, | ||
separated by forward slashes. The root node is referred to by ``"/"``, so the path from our current root node to its grand-child would be ``"/Oak/Bonsai"``. | ||
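
The path syntax can be explored with the standard library's ``pathlib.PurePosixPath``, which follows the same Unix-like conventions. This is only an illustration of the syntax; it makes no claim about how the library parses paths internally.

```python
from pathlib import PurePosixPath

# A path given from the root starts with "/".
absolute = PurePosixPath("/Oak/Bonsai")
# A path given relative to some other node does not.
relative = PurePosixPath("Bonsai")

# The components are the root marker followed by each node key in sequence.
parts = absolute.parts

# Joining a relative path onto the path of the node it is relative to
# recovers the path from the root.
joined = PurePosixPath("/Oak") / relative
```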
|
||
A path specified from the root (as opposed to being specified relative to an arbitrary node in the tree) is sometimes also referred to as a | ||
`"fully qualified name" <https://www.unidata.ucar.edu/blogs/developer/en/entry/netcdf-zarr-data-model-specification#nczarr_fqn>`_. | ||
|
||
If we have a dictionary where each key is a valid path, and each value is either valid data or ``None``, | ||
we can construct a complex tree quickly using the alternative constructor :py:func:`DataTree.from_dict`: | ||
|
||
.. ipython:: python | ||
|
||
d = { | ||
"/": xr.Dataset({"foo": "orange"}), | ||
"/a": xr.Dataset({"bar": 0}, coords={"y": ("y", [0, 1, 2])}), | ||
"/a/b": xr.Dataset({"zed": np.nan}), | ||
"a/c/d": None, | ||
} | ||
dt = DataTree.from_dict(d) | ||
dt | ||
|
||
Notice that this method will also create any intermediate empty node necessary to reach the end of the specified path | ||
(i.e. the node labelled ``"c"`` in this case). | ||
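
The creation of intermediate nodes can be sketched with plain nested dictionaries. The ``build_tree`` helper and the ``{"data": ..., "children": ...}`` layout are hypothetical, and treating ``"a/c/d"`` the same as ``"/a/c/d"`` is an assumption of this sketch rather than a statement about ``from_dict``.

```python
def build_tree(path_to_data):
    """Nest a flat {path: data} mapping into {"data": ..., "children": {...}} dicts."""
    root = {"data": None, "children": {}}
    for path, data in path_to_data.items():
        node = root
        # Split on "/" and drop empty pieces, so "/" alone refers to the root.
        for part in (p for p in path.split("/") if p):
            # setdefault creates any missing intermediate node as an empty one.
            node = node["children"].setdefault(part, {"data": None, "children": {}})
        node["data"] = data
    return root


tree = build_tree(
    {
        "/": "root data",
        "/a": "a data",
        "/a/b": "b data",
        "a/c/d": None,
    }
)
a_children = tree["children"]["a"]["children"]
```

Here the node ``"c"`` never appears as a key on its own, yet it exists in the result as an empty node because the path to ``"d"`` passes through it.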
|
||
Finally, if you have a file containing data on disk (such as a netCDF file or a Zarr Store), you can also create a datatree by opening the | ||
file using :py:func:`~datatree.open_datatree`. | ||
|
||
|
||
DataTree Contents | ||
~~~~~~~~~~~~~~~~~ | ||
|
||
Like ``xarray.Dataset``, ``DataTree`` implements the Python mapping interface, but with values given by either ``xarray.DataArray`` objects or other ``DataTree`` objects. | ||
|
||
.. ipython:: python | ||
|
||
dt["a"] | ||
dt["foo"] | ||
|
||
Iterating over keys will iterate over both the names of variables and child nodes. | ||
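
A mapping whose keys cover both variables and child nodes can be sketched with ``collections.abc.Mapping``. The ``NodeView`` class is hypothetical; yielding variables before children is a choice made for this sketch, and it assumes that a variable and a child never share a name.

```python
from collections.abc import Mapping


class NodeView(Mapping):
    """Hypothetical read-only mapping over a node's variables and children."""

    def __init__(self, variables, children):
        self._variables = dict(variables)
        self._children = dict(children)

    def __getitem__(self, key):
        # Look among the variables first, then fall back to the child nodes.
        if key in self._variables:
            return self._variables[key]
        return self._children[key]  # raises KeyError if the key is unknown

    def __iter__(self):
        # Keys cover both kinds of content.
        yield from self._variables
        yield from self._children

    def __len__(self):
        return len(self._variables) + len(self._children)


child = NodeView(variables={"bar": 0}, children={})
node = NodeView(variables={"foo": "orange"}, children={"a": child})
keys = list(node)
```

Defining ``__getitem__``, ``__iter__`` and ``__len__`` is enough for ``Mapping`` to supply ``keys``, ``values``, ``items``, ``get`` and ``in`` automatically.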
|
||
We can also access all the data in a single node through a dataset-like view: | ||
|
||
.. ipython:: python | ||
|
||
dt["a"].ds | ||
|
||
This demonstrates the fact that the data in any one node is equivalent to the contents of a single ``xarray.Dataset`` object. | ||
The ``DataTree.ds`` property returns an immutable view, but we can instead extract the node's data contents as a new (and mutable) | ||
``xarray.Dataset`` object via ``.to_dataset()``: | ||
|
||
.. ipython:: python | ||
|
||
dt["a"].to_dataset() | ||
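
The difference between an immutable view and an extracted mutable copy can be illustrated with the standard library's ``types.MappingProxyType``. This is an analogy under stated assumptions, not how ``.ds`` or ``.to_dataset()`` are implemented; the ``node_data`` dictionary is a hypothetical stand-in for a node's contents.

```python
from types import MappingProxyType

node_data = {"bar": 0}

# A read-only view: reads pass through to the underlying data, writes are refused.
view = MappingProxyType(node_data)
try:
    view["bar"] = 1
    write_refused = False
except TypeError:
    write_refused = True

# Extracting a new object gives something mutable that is independent of the node.
extracted = dict(view)
extracted["bar"] = 1
```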
|
||
As with ``Dataset``, you can access the data and coordinate variables of a node separately via the ``data_vars`` and ``coords`` attributes: | ||
|
||
.. ipython:: python | ||
|
||
dt["a"].data_vars | ||
dt["a"].coords | ||
|
||
|
||
Dictionary-like methods | ||
~~~~~~~~~~~~~~~~~~~~~~~ | ||
|
||
We can update a datatree in-place using Python's standard dictionary syntax, similar to how we can for Dataset objects. | ||
For example, to create this example datatree from scratch, we could have written: | ||
|
||
# TODO update this example using ``.coords`` and ``.data_vars`` as setters, | ||
Review comment: did you forget to do this, or are you leaving it for another PR?
Reply: There are some undocumented bugs I noticed with trying to use |
||
|
||
.. ipython:: python | ||
|
||
dt = DataTree() | ||
dt["foo"] = "orange" | ||
dt["a"] = DataTree(data=xr.Dataset({"bar": 0}, coords={"y": ("y", [0, 1, 2])})) | ||
dt["a/b/zed"] = np.nan | ||
dt["a/c/d"] = DataTree() | ||
dt | ||
|
||
To change the variables in a node of a ``DataTree``, you can use all the standard dictionary | ||
methods, including ``values``, ``items``, ``__delitem__``, ``get`` and | ||
:py:meth:`~xarray.DataTree.update`. | ||
Note that assigning a ``DataArray`` object to a ``DataTree`` variable using ``__setitem__`` or ``update`` will | ||
:ref:`automatically align<update>` the array(s) to the original node's indexes. | ||
|
||
If you copy a ``DataTree`` using the :py:func:`copy` function or the :py:meth:`~xarray.DataTree.copy` method, it will copy the entire tree, | ||
including all parents and children. | ||
As with ``Dataset``, this copy is shallow by default, but you can copy all the data by calling ``dt.copy(deep=True)``. |
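
The shallow versus deep distinction is the same one Python's ``copy`` module makes for ordinary containers. The sketch below uses a plain dictionary as a hypothetical stand-in for a node's data, so it shows the general semantics rather than ``DataTree`` behaviour specifically.

```python
import copy

node = {"bar": [0, 1, 2]}

shallow = copy.copy(node)   # new outer container, same underlying list
deep = copy.deepcopy(node)  # new outer container and a new list

# Modify the original data in place.
node["bar"][0] = 99
```

The shallow copy sees the change because it shares the underlying data with the original, while the deep copy keeps its own independent values.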