Yesterday, during the flexible indexes weekly meeting, @shoyer, @jhamman and I discussed what would be the best approach to implement the new data model described here. In this issue I summarize the implementation of the current data model as well as some suggestions for the new data model, along with their pros and cons (I might still be missing important ones!). Unfortunately, I don't think there's an easy or ideal solution, so @pydata/xarray, any feedback would be very welcome!
Current data model implementation
Currently, any (pandas) index is wrapped into an IndexVariable object through an intermediate adapter to preserve dtypes and handle explicit indexing. This allows directly reusing the index data as an xarray coordinate variable. For a pandas multi-index, virtual coordinates are created for each level from the IndexVariable object wrapping the index. Although relying on "virtual coordinates" has more or less worked so far, it is over-complicated. Moreover, this wouldn't work with the new data model, where an index may be built from a set of coordinates with different dimensions.
Proposed alternatives
Option 1: independent (coordinate) variables and indexes
Indexes and coordinates are loosely coupled, i.e., an xarray.Index holds a reference (mapping) to the coordinate variable(s) from which it is built, but each manages its own data independently of the other.
Pros:
separation of concerns.
we no longer need those complicated adapters for reusing the index data as xarray (virtual) variable(s), which may simplify some xarray internals.
dropping an index is simple: we just drop it, and all its related coordinate variables are left as-is.
we could theoretically build a (pandas) index from a chunked coordinate, and then when we drop the index we still have this chunked coordinate left untouched.
Cons:
data duplication
this would clearly be a regression when using pandas indexes, but maybe less so for other indexes like kd-trees, where adapting those objects for use as coordinate variables wouldn't be easy or even possible.
what if we want to build a DataArray or Dataset from one or more existing indexes (pandas or other)? Passing an index, treating it as an array, and then re-building an index from this array is not optimal.
keeping an index and its corresponding coordinate variable(s) in a consistent, in-sync state may be tricky, given that those variables may be mutable (although we could prevent this by encapsulating those variables using a very lightweight wrapper inspired by IndexVariable).
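To make the trade-offs of option 1 a bit more concrete, here is a minimal sketch in plain Python. `ToyVariable` and `ToyIndex` are hypothetical stand-ins that I made up for illustration, not xarray classes; the index refers to its coordinate by name only and builds its own lookup structure from the labels.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class ToyVariable:
    """Hypothetical stand-in for a coordinate variable (dims + its own data)."""
    dims: Tuple[str, ...]
    data: List


class ToyIndex:
    """Hypothetical index that keeps only the *names* of its coordinates."""

    def __init__(self, coord_names, variables):
        self.coord_names = tuple(coord_names)  # reference (mapping) by name
        (name,) = self.coord_names             # single-coordinate case only
        # Data duplication: the labels are copied into the index's own table.
        self._positions = {
            label: i for i, label in enumerate(variables[name].data)
        }

    def get_loc(self, label):
        return self._positions[label]


coords = {"x": ToyVariable(("x",), ["a", "b", "c"])}
indexes = {"x": ToyIndex(["x"], coords)}
loc_b = indexes["x"].get_loc("b")

# Pro: dropping the index leaves the coordinate variable as-is.
del indexes["x"]
remaining = list(coords["x"].data)

# Con: a mutable variable can silently drift away from an existing index.
stale = ToyIndex(["x"], coords)
coords["x"].data[1] = "z"
stale_loc = stale.get_loc("b")  # the index still resolves the old label
```

The last three lines show the sync problem: the index keeps answering for a label that is no longer in the coordinate, which is what a lightweight immutable wrapper around the variable would prevent.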
Option 2: indexes hold coordinate variables
This is the opposite of the current approach. Here, an xarray.Index would wrap one or more xarray.Variable objects.
Pros:
probably easier to keep an index and its corresponding coordinate variable(s) in-sync.
sharing data between an index and its coordinate variables may be easier.
Cons:
accessing / iterating through all coordinate variables in a DataArray or Dataset may be less straightforward.
when the index is dropped, we might need some logic / API to return the coordinates as new xarray.Variable objects with their own data (or should we simply always drop the corresponding coordinates too? maybe not...).
more responsibility / work for developers who want to provide 3rd party xarray indexes.
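A rough sketch of what option 2 could look like, again in plain Python with hypothetical names (`ToyVariable`, `ToyIndex`, `ToyDataset`, `to_variables` and `drop_index` are all invented here, none of this is existing xarray API). It assumes that dropping an index returns its coordinates as new variables with their own data, which is only one of the possible choices mentioned above.

```python
class ToyVariable:
    """Hypothetical stand-in for a variable (dims + data)."""

    def __init__(self, dims, data):
        self.dims = tuple(dims)
        self.data = data


class ToyIndex:
    """Hypothetical index that owns its coordinate variables."""

    def __init__(self, variables):
        self.variables = dict(variables)

    def to_variables(self):
        # Extra work for index developers: hand back standalone copies.
        return {
            name: ToyVariable(var.dims, list(var.data))
            for name, var in self.variables.items()
        }


class ToyDataset:
    """Hypothetical container with indexed and non-indexed coordinates."""

    def __init__(self, indexes, plain_coords):
        self.indexes = dict(indexes)
        self.plain_coords = dict(plain_coords)

    @property
    def coords(self):
        # Listing all coordinates means looking inside every index.
        merged = dict(self.plain_coords)
        for index in self.indexes.values():
            merged.update(index.variables)
        return merged

    def drop_index(self, name):
        # Assumption: coordinates survive the drop as independent variables.
        index = self.indexes.pop(name)
        self.plain_coords.update(index.to_variables())


x = ToyVariable(("x",), [10, 20])
y = ToyVariable(("y",), [1, 2, 3])
ds = ToyDataset({"xy": ToyIndex({"x": x, "y": y})},
                {"t": ToyVariable(("t",), [0])})
names_before = sorted(ds.coords)
ds.drop_index("xy")
names_after = sorted(ds.coords)
```

Keeping things in sync is easy here because there is only one owner of the data, but the `coords` property and `to_variables` show where the additional indirection and the additional work for index authors come from.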
Option 3: intermediate solution
When an index is set (or unset), it returns a new set of coordinate variables to replace the existing ones.
Pros:
it keeps some separation of concerns, while it allows data sharing through adapters and/or ensures that variables are immutable using lightweight wrappers.
Cons:
like option 2, more things to take care of for 3rd party xarray index developers.
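One possible reading of option 3, sketched in plain Python. Everything here is hypothetical (`set_index`, `unset_index`, `ReadOnlyVariable` and the toy classes are names I chose for the example): setting an index returns replacement coordinates that share the index data through a read-only wrapper, and unsetting it returns ordinary variables again.

```python
class ToyVariable:
    """Hypothetical ordinary (mutable) coordinate variable."""

    def __init__(self, dims, data):
        self.dims = tuple(dims)
        self.data = list(data)


class ReadOnlyVariable:
    """Hypothetical lightweight wrapper exposing index-owned labels."""

    def __init__(self, dims, labels):
        self.dims = tuple(dims)
        self._labels = labels  # shared with the index, not copied

    def __len__(self):
        return len(self._labels)

    def __getitem__(self, i):
        return self._labels[i]

    def __setitem__(self, i, value):
        raise TypeError("index-backed coordinate is read-only")


class ToyIndex:
    """Hypothetical index storing its labels in an immutable tuple."""

    def __init__(self, labels):
        self.labels = tuple(labels)
        self._positions = {label: i for i, label in enumerate(self.labels)}

    def get_loc(self, label):
        return self._positions[label]


def set_index(coords, name):
    """Return the new index and a new coords mapping that replaces coords[name]."""
    index = ToyIndex(coords[name].data)
    new_coords = dict(coords)
    new_coords[name] = ReadOnlyVariable(coords[name].dims, index.labels)
    return index, new_coords


def unset_index(coords, name):
    """Return a new coords mapping with an ordinary variable for `name`."""
    wrapped = coords[name]
    new_coords = dict(coords)
    values = [wrapped[i] for i in range(len(wrapped))]
    new_coords[name] = ToyVariable(wrapped.dims, values)
    return new_coords


coords = {"x": ToyVariable(("x",), ["a", "b"])}
index, indexed_coords = set_index(coords, "x")
restored_coords = unset_index(indexed_coords, "x")
```

The index and the coordinates remain separate objects, while the wrapper gives both data sharing and immutability. Writing such wrappers is the part that 3rd party index developers would have to take care of.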
Would we need to duplicate data with option 1? It seems like a wrapper that makes the data in a Variable immutable might resolve that? We could even compute the variable's data on the fly using Xarray's lazy indexing machinery.
I was initially leaning towards 2 due to cleaner handling of "virtual" coordinates, but I'm not so sure that that would actually help anymore.