Flexible indexes: review the implementation of alignment and merge #5647
Comments
Conceptually, This is potentially ambiguous for cases with multiple (non-MultiIndex) indexes, if the result of aligning the separate indexes does not match, e.g., if we have:
We should raise an error in these cases (and/or suggest setting a MultiIndex). It should also be OK if not every index implements alignment, in which case they should raise an error if coordinates do not match exactly. With regard to your concern:
I don't think we should try to support alignment with multi-dimensional (non-orthogonal) inside But if your indexes correspond to multi-dimensional arrays (rather than just multiple coordinates), joining indexes together is a much messier operation, one that may not be possible without thinking carefully about interpolation/regridding. In many cases it may not be possible to retain the multi-dimensional nature of the indexes in the result (e.g., the union of two partially overlapping grids). Since the desired behavior is not clear, it is better to force the user to make a choice, either by stacking the dimensions in multi-dimensional indexes into 1D (like a MultiIndex) or by calling a specialized method for interpolation/regridding. |
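To make the ambiguity with multiple (non-MultiIndex) indexes concrete, here is a minimal pure-Python sketch. It is not xarray's implementation: the toy indexes are plain lists of labels, and the helpers `inner_join_positions` and `align_dim` are hypothetical names introduced only for illustration. Each index along a dimension is aligned separately, and an error is raised when the results disagree.

```python
from typing import Dict, Hashable, List, Sequence


def inner_join_positions(left: Sequence[Hashable], right: Sequence[Hashable]) -> List[int]:
    """Positions in `left` whose labels also appear in `right` (toy inner join)."""
    right_labels = set(right)
    return [i for i, label in enumerate(left) if label in right_labels]


def align_dim(
    left: Dict[str, Sequence[Hashable]], right: Dict[str, Sequence[Hashable]]
) -> List[int]:
    """Align every index found along one dimension and require that they agree."""
    results = {name: inner_join_positions(left[name], right[name]) for name in left}
    distinct = {tuple(positions) for positions in results.values()}
    if len(distinct) > 1:
        # The separate indexes select different positions: ambiguous, so fail loudly.
        raise ValueError(f"conflicting alignment results along one dimension: {results}")
    return list(distinct.pop())


# Two hypothetical objects, each with two independent indexes on the same dimension.
a = {"x": [1, 2, 3], "y": ["a", "b", "c"]}
b = {"x": [2, 3, 4], "y": ["a", "b", "c"]}

try:
    align_dim(a, b)
except ValueError as err:
    print(err)
```

Here the `x` index would keep positions 1 and 2 while the `y` index would keep all three, so no single selection satisfies both. Stacking the two coordinates into one multi-index would remove the ambiguity, which is the alternative suggested above.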
Agreed.
I could see some examples (e.g., union or intersection of staggered grids) where it could still be useful to have a (meta-)index that implements alignment, though. Actually, while working on #5636 I was thinking more about how to perform alignment based on We could maybe add an Alternatively, we could allow I'd lean towards the 2nd option, which is consistent with the class method constructors added in #5636. Not sure the 1st option is a good idea and it doesn't solve all the issues mentioned in my comment above. |
@shoyer I'm looking more deeply into this. I think it will be impossible to avoid a heavy refactoring in I'm thinking about the following approach:
Does that sound right to you? I'd really appreciate any guidance on this before going further as I'm worried about missing something important or another more straightforward approach. |
I've now re-implemented most of the alignment logic in #5692, using a dedicated class so that all intermediate steps are implemented in a clearer and more maintainable way. One remaining task is to get the re-indexing logic right. Rather than relying on an

    from typing import Any, Dict, Hashable, Mapping

    DimIntIndexers = Dict[Hashable, Any]

    class Index:
        def reindex(self, dim_labels: Mapping[Hashable, Any]) -> DimIntIndexers:
            ...

        def reindex_like(self, other: "Index") -> DimIntIndexers:
            ...

For alignment, I think we could directly call |
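As a rough illustration of what the `reindex` / `reindex_like` signatures above could return, here is a toy sketch. `ToyIndex` is a hypothetical class backed by a plain list of unique labels, not an xarray class, and using `-1` to mark a missing label is an assumption borrowed from the convention of `pandas.Index.get_indexer`.

```python
from typing import Any, Dict, Hashable, Mapping, Sequence

DimIntIndexers = Dict[Hashable, Any]


class ToyIndex:
    """Hypothetical 1-D index backed by a list of unique labels (illustration only)."""

    def __init__(self, dim: Hashable, labels: Sequence[Hashable]) -> None:
        self.dim = dim
        self.labels = list(labels)
        self._positions = {label: i for i, label in enumerate(self.labels)}

    def reindex(self, dim_labels: Mapping[Hashable, Any]) -> DimIntIndexers:
        # Map each requested label to its integer position, or -1 if absent.
        new_labels = dim_labels[self.dim]
        return {self.dim: [self._positions.get(label, -1) for label in new_labels]}

    def reindex_like(self, other: "ToyIndex") -> DimIntIndexers:
        # Only indexes of the same type on the same dimension are comparable here.
        if type(other) is not type(self) or other.dim != self.dim:
            raise ValueError("cannot reindex against an incompatible index")
        return self.reindex({other.dim: other.labels})


index = ToyIndex("x", [10, 20, 30])
target = ToyIndex("x", [20, 40, 10])
indexers = index.reindex_like(target)
print(indexers)
```

The returned mapping goes from dimension name to integer positions, so the caller can apply it to any variable that shares the dimension without knowing how the index stores its labels.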
Ok, I'm now hitting another obstacle while working on reindex. So one general approach for both alignment and re-indexing that is "pretty straightforward" to implement with the new Xarray index data model is: (1) find matching indexes based on their corresponding coordinate/dimension names and index type, and (2) call Relaxing any of the constraints in (1) would be much more complicated to implement. We would need to do some sort of mapping from dimension labels to all involved (multi-/meta-)indexes, then check for conflicts in dimension indexers returned from multiple indexes, possibly handle/remove multi-index coordinates (or convert back to non-indexed coordinates), etc. One problem with
Cases 3 and 4 are a big obstacle for me right now. I really don't know how we can still support those special cases without deeply re-thinking the problem. If they could be considered as a bug, then the new implementation would already raise a nice error message :-). |
I agree, this doesn't make sense in the long term because the "name" of the MultiIndex is no longer necessary: it's just the same index that happens to index each of the levels. Let's preserve it (for now) for backwards compat, but in the long term the ideal is certainly either (a) using
If part of Xarray's API currently doesn't raise an error but instead returns all NaNs, and this case can be detected based on static type information (e.g., shapes, dtypes, dimension names, variable name, index types), then I agree that the best user experience is almost certainly to raise an error instead. NaNs in numerical computing are essentially a mechanism for "runtime error handling" that can arise from values in array computation. If something can be identified based on type information, that is a better user experience.
I think we can consider these edge cases bugs and fix them :) |
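The point about raising instead of returning all NaNs can be sketched with a toy reindex. This is an assumption-laden illustration, not xarray code: `reindex_values` is a hypothetical helper, and the Python type of each label stands in for the dtype information that a real index would expose.

```python
import math
from typing import Hashable, List, Sequence


def reindex_values(
    values: Sequence[float],
    labels: Sequence[Hashable],
    new_labels: Sequence[Hashable],
    strict: bool = True,
) -> List[float]:
    """Toy reindex: look up each new label and fill with NaN when it is missing."""
    old_types = {type(label) for label in labels}
    new_types = {type(label) for label in new_labels}
    # Type-level check: if the label types never overlap, no label can match,
    # so the result would be all NaN. Raise early instead.
    if strict and old_types.isdisjoint(new_types):
        raise TypeError(f"incompatible label types: {old_types} vs {new_types}")
    lookup = dict(zip(labels, values))
    return [lookup.get(label, math.nan) for label in new_labels]


# Without the check, mismatched label types silently yield only NaNs.
silent = reindex_values([1.0, 2.0], [0, 1], ["a", "b"], strict=False)
print(silent)
```

With `strict=True` the same call fails immediately with a `TypeError`, which points at the cause rather than leaving NaNs to surface later in a computation.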
Sounds good to me! If we like, we could probably even add static type information on

    T = TypeVar("T", bound="Index")
    ...
    def reindex_like(self: T, other: T) -> DimIntIndexers:
        ...

|
Yes, I did that for |
I've used the following key type to find matching indexes:

    CoordNamesAndDims = FrozenSet[Tuple[Hashable, Tuple[Hashable, ...]]]
    MatchingIndexKey = Tuple[CoordNamesAndDims, Type[Index]]

where the order of coordinates doesn't matter. For Are there potential custom indexes where the order of coordinates doesn't matter? Maybe a good example is a meta-index for staggered grids where the cell center coordinate and the cell edges coordinates might be given in any order. Possible solutions to address this:
Option 2 is more flexible but option 1 might be enough. Option 1 may also be great for clearer indexes and coordinates sections in Xarray objects |
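For reference, here is a small sketch of how the `MatchingIndexKey` type above behaves when used to group indexes. The `Index` placeholder, `ToyMetaIndex`, `make_key` and `group_matching` are hypothetical names for illustration only; the staggered-grid coordinate names are made up as well.

```python
from typing import Dict, FrozenSet, Hashable, Iterable, List, Tuple, Type


class Index:
    """Placeholder base class standing in for an xarray index."""


class ToyMetaIndex(Index):
    """Hypothetical index spanning several coordinates (e.g. a staggered grid)."""


CoordNamesAndDims = FrozenSet[Tuple[Hashable, Tuple[Hashable, ...]]]
MatchingIndexKey = Tuple[CoordNamesAndDims, Type[Index]]
Coords = Dict[Hashable, Tuple[Hashable, ...]]


def make_key(coords: Coords, index: Index) -> MatchingIndexKey:
    """Build a key from coordinate names, their dimensions and the index type."""
    # A frozenset ignores the order in which the coordinates were given.
    return frozenset(coords.items()), type(index)


def group_matching(
    objects: Iterable[Tuple[Coords, Index]],
) -> Dict[MatchingIndexKey, List[Index]]:
    """Collect the indexes that share the same key, as candidates for alignment."""
    groups: Dict[MatchingIndexKey, List[Index]] = {}
    for coords, index in objects:
        groups.setdefault(make_key(coords, index), []).append(index)
    return groups


idx_a, idx_b, idx_c = ToyMetaIndex(), ToyMetaIndex(), Index()
groups = group_matching(
    [
        ({"center": ("x",), "edges": ("x_edge",)}, idx_a),
        ({"edges": ("x_edge",), "center": ("x",)}, idx_b),  # same coords, other order
        ({"center": ("x",)}, idx_c),
    ]
)
print(len(groups))
```

Because the key is a frozenset, `idx_a` and `idx_b` land in the same group even though their coordinates are listed in a different order. An index for which coordinate order is significant would need an ordered key instead, which is the concern raised above.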
Other possible solutions (from discussion with @shoyer):
|
@shoyer I'm now reviewing the merging logic. It works mostly fine, with some minor concerns:
Let's take this example:

    >>> import xarray as xr
    >>> from xarray.tests.test_dataset import create_test_multiindex
    >>> data = create_test_multiindex()
    >>> data
    <xarray.Dataset>
    Dimensions:  (x: 4)
    Coordinates:
      * x        (x) MultiIndex
      - level_1  (x) object 'a' 'a' 'b' 'b'
      - level_2  (x) int64 1 2 1 2
    Data variables:
        *empty*

    >>> other = xr.Dataset({"z": ("level_1", [0, 1])})
    >>> merged = data.merge(other)
    ValueError: conflicting level / dimension names. level_1 already exists as a level name.

Do we raise a

Note: the following example does not raise any error in xarray:

    >>> data = xr.Dataset(coords={"x": [0, 1, 2, 3], "level_1": ("x", ["a", "a", "b", "b"])})
    >>> other = xr.Dataset({"z": ("level_1", [0, 1])})
    >>> merged = data.merge(other)
    >>> merged
    <xarray.Dataset>
    Dimensions:  (x: 4, level_1: 2)
    Coordinates:
      * x        (x) int64 0 1 2 3
        level_1  (x) <U1 'a' 'a' 'b' 'b'
    Data variables:
        z        (level_1) int64 0 1

|
The current implementation of the align function is problematic in the context of flexible indexes because:
This currently works well since a pd.Index can be directly treated as a 1-d array but this won’t always be the case anymore with custom indexes.
I'm opening this issue to gather ideas on how best to handle alignment in a more flexible way (I haven't been thinking much about this problem yet).