This repository was archived by the owner on Oct 24, 2024. It is now read-only.
Name collisions between Dataset variables and child tree nodes #38
#40 fixes 2/3 of these possible name collisions via better checks, but the last one I still don't know how to fix:
@shoyer here is a short code example to demonstrate the problem; it should work with the most recent version of datatree (and xarray):
In [1]: import numpy as np
In [2]: import xarray as xr
In [3]: from datatree import DataNode
In [4]: dt = DataNode('root', data=xr.Dataset(), children=[DataNode('group')])
In [5]: print(dt)
DataNode('root')
│ Dimensions: ()
│ Data variables:
│ *empty*
└── DataNode('group')
Now we are going to do the modification that I want to prevent:
In [6]: dt.ds['group'] = np.array(0)
In [7]: print(dt)
DataNode('root')
│ Dimensions: ()
│ Data variables:
│ group int64 0
└── DataNode('group')
In [8]: dt['group']
Out[8]:
<xarray.DataArray 'group' ()>
array(0)
The problem is that at
I realised that it is currently possible to get a tree into a state which (a) cannot be represented as a netCDF file, and (b) means `__getitem__` becomes ambiguous.

See this example:

Here `print(dt)` shows that `dt` is in a form forbidden by netCDF, because we have a child node and a variable with the same name (equivalent to having a group and a variable with the same name at the same level in netCDF).

Furthermore, when choosing an item via `DataTree.__getitem__` it merrily picks out the DataArray even though this is an ambiguous situation and I might have intended to pick out the child node `'a'` instead.

The node is still accessible via `.get_node`, but only because `.get_node` is inherited from `TreeNode`, which has no concept of data variables.

Contrast this silent collision of variable and child names with what happens if you try to assign two children with the same name:
To prevent this we need better checks on assignment between variables and children. For example, `TreeNode.set_node(key, new_child)` currently checks for any existing children with name `key`, but it also needs to check for any variables in the dataset with name `key`. (That's not too hard to implement; it could be done by overloading `set_node` on `DataTree` to check against variables as well as children, for example.)

What is more difficult is if a child with name `key` exists, but the user tries to assign a variable with name `key` to the wrapped dataset. If the user does this via `node.ds.assign(key=new_da)` then that's manageable: in that case `assign()` has a return value, which they need to assign to the node via `node.ds = node.ds.assign(key=new_da)`. We could check for name conflicts with children in the `.ds` property setter method.

However, if the user adds a variable via `node.ds[key] = new_da` then I think `node.ds` will be updated in-place without its wrapping `DataTree` class ever having a chance to intervene. A similar issue with `node[key] = new_da` is preventable by improving the checking in `DataTree.__setitem__`, but I don't know how we can prevent this happening when all that is being called is `Dataset.__setitem__`.

I don't really know what to do about this, other than have a much more complicated class design which is no longer simple composition 😕 Any ideas @dcherian maybe?
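
To make the three assignment paths above concrete, here is a minimal sketch using plain Python stand-ins rather than datatree itself. The `Node` class, its dict-backed `ds`, and the `KeyError` messages are all assumptions made for illustration; only the names `set_node` and `.ds` come from the discussion. It shows that a check in `set_node` and a check in the `.ds` setter both work, while an in-place write to the wrapped object never reaches either check.

```python
class Node:
    """Hypothetical stand-in for a tree node wrapping a dict-like dataset."""

    def __init__(self, name, data=None):
        self.name = name
        self.children = {}
        self._ds = {}
        # Route the initial data through the setter so it is checked too.
        self.ds = data if data is not None else {}

    @property
    def ds(self):
        # Returns the wrapped object itself, so callers can mutate it.
        return self._ds

    @ds.setter
    def ds(self, new_ds):
        # Check 2: replacing the whole dataset is visible to the node.
        clashes = set(new_ds) & set(self.children)
        if clashes:
            raise KeyError(f"variables clash with children: {sorted(clashes)}")
        self._ds = new_ds

    def set_node(self, key, child):
        # Check 1: compare against children *and* wrapped variables.
        if key in self.children:
            raise KeyError(f"child {key!r} already exists")
        if key in self._ds:
            raise KeyError(f"variable {key!r} already exists")
        self.children[key] = child


root = Node("root")
root.set_node("group", Node("group"))

# Adding a child whose name matches an existing variable is caught.
other = Node("other", data={"temp": 1})
try:
    other.set_node("temp", Node("temp"))
    set_node_blocked = False
except KeyError:
    set_node_blocked = True

# Reassigning .ds (the `node.ds = node.ds.assign(...)` pattern) is caught.
try:
    root.ds = {**root.ds, "group": 0}
    setter_blocked = False
except KeyError:
    setter_blocked = True

# Writing into the wrapped object in place bypasses both checks.
root.ds["group"] = 0
collision = "group" in root.ds and "group" in root.children
```

After this runs, `set_node_blocked` and `setter_blocked` are both `True`, yet `collision` is also `True`: the getter hands out the mutable wrapped object, so the node has no hook on `root.ds["group"] = 0`. That is the composition problem described above in miniature.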