Skip to content

Flexible indexes: review the implementation of alignment and merge #5647

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Closed
benbovy opened this issue Jul 29, 2021 · 12 comments
Closed

Flexible indexes: review the implementation of alignment and merge #5647

benbovy opened this issue Jul 29, 2021 · 12 comments

Comments

@benbovy
Copy link
Member

benbovy commented Jul 29, 2021

The current implementation of the align function is problematic in the context of flexible indexes because:

  • the sizes of the joined indexes are reused for checking compatibility with unlabelled dimension sizes
  • the joined indexes are used as indexers to compute the aligned Dataset / DataArray.

This currently works well since a pd.Index can be directly treated as a 1-d array but this won’t be always the case anymore with custom indexes.

I'm opening this issue to gather ideas on how best to handle alignment in a more flexible way (I haven't been thinking much at this problem yet).

@shoyer
Copy link
Member

shoyer commented Jul 29, 2021

Conceptually, align should ensure that all indexes in the provided arguments match.

This is potentially ambiguous for cases with multiple (non-MultiIndex) indexes, if the result of aligning the separate indexes does not match, e.g., if we have:

  • Index for x along dimension x: [1, 2, 3] vs [1, 2]
  • Index for y along dimension x: [1, 2, 3] vs [2, 3]

We should raise an error in this cases (and/or suggest setting a MultiIndex). It should also be OK if not every index implements alignment, in which case they should raise an error if coordinates do not match exactly.

With regards to your concern:

This currently works well since a pd.Index can be directly treated as a 1-d array but this won’t be always the case anymore with custom indexes.

I don't think we should try to support alignment with multi-dimensional (non-orthogonal) inside align(). Instead we can just require the coordinates correspondign to such indexes to match exactly (or within a tolerance). I can see supporting alignment between two DataArrays that are indexed by a KDTreeIndex over multiple coordintaes along a single dimension in the same way tthat we support alignment with MultiIndex indexes currently.

But if your indexes correpsond to multi-dimensional arrays (rather than just multiple coordinates), joining indexes together is a much messier operation, one that may not be possible without thinking carefully about interpolation/regridding. In many cases it may not be possible to retain the multi-dimensional nature of the indexes in the result (e.g., the union of two partially overlapping grids). Since the desired behavior is not clear, it is better to force the user to make a choice, either by stacking the dimensions in multi-dimensional indexes into 1D (like a MultiIndex) or by calling a specialized method for interpolation/regridding.

@benbovy
Copy link
Member Author

benbovy commented Jul 29, 2021

We should raise an error in this cases (and/or suggest setting a MultiIndex). It should also be OK if not every index implements alignment, in which case they should raise an error if coordinates do not match exactly.

Agreed.

I don't think we should try to support alignment with multi-dimensional (non-orthogonal) inside align(). Instead we can just require the coordinates corresponding to such indexes to match exactly (or within a tolerance).

I could see some examples (e.g., union or intersection of staggered grids) where it could still be useful to have a (meta-)index that implements alignment, though.

Actually, while working on #5636 I was thinking more about how to perform alignment based on xarray.Index objects given that the concept of an xarray.Index should be clearly distinct from the concept of a Variable.

We could maybe add an xarray.Index.sizes property similarly to Variable.sizes.

Alternatively, we could allow xarray.Index.union() and xarray.Index.intersection() returning both the joined index and the new Variable objects created from the latter index.

I'd lean towards the 2nd option, which is consistent with the class method constructors added in #5636. Not sure the 1st option is a good idea and it doesn't solve all the issues mentioned in my comment above.

@benbovy
Copy link
Member Author

benbovy commented Oct 7, 2021

@shoyer I'm looking more deeply into this. I think it will be impossible to avoid a heavy refactoring in core/alignment.py if we want to support "meta-indexes" and handle non-dimension indexed coordinates, so I'd like to check with you about how best we can tackle this problem.

I'm thinking about the following approach:

  1. Instead of looking at the dimensions of each object to align, look at their indexes.

    • matching indexes must all correspond to the same set of coordinate names and dimension names (should we also check the index type?)
    • if coordinate names vs. indexes match only partially, then raise an error
  2. For each of the matched indexes, check for index equality and perform join if that's not the case (or just pick the matching index that has been explicitly passed to align via its indexes argument)

    • For convenience, we could expose an Index.join(self, other, how) method instead of having two separate Index.union() and Index.intersection() methods. Index.join() might also return index coordinate variables that we can assign to the aligned objects
    • if the indexes do not provide an implementation for .equals, check for coordinate equality
    • if they do not provide an implementation for .join, raise an error or ignore it? Probably better to ignore it, i.e., do not create a joined index in the aligned objects and align the coordinates just as regular variables (this might be taken care by another index)...
  3. Add an Index.dim_indexers(self) -> Dict[Hashable, Any] property that returns label indexers for each dimension involved in the index, and which will be used to conform the objects to the joined indexes.

  4. Merge dim_indexers returned by all joined indexes and raise if there's any conflict (i.e., two distinct indexes should not return indexers along a common dimension).

  5. Check the size of unlabelled dimensions (also against the sizes found in the merged dim_indexers for matching dimensions)

  6. Reindex the objects

Does that sounds right to you? I'd really appreciate any guidance on this before going further as I'm worried about missing something important or another more straightforward approach.

@benbovy
Copy link
Member Author

benbovy commented Oct 15, 2021

I've now re-implemented most of the alignment logic in #5692, using a dedicated class so that all intermediate steps are implemented in a clearer and more maintainable way.

One remaining task is to get the re-indexing logic right. Rather than relying on an Index.dim_indexers property, I feel that it'd be better to have two methods:

DimIntIndexers = Dict[Hashable, Any]

class Index:
    def reindex(self, dim_labels: Mapping[Hashable, Any]) -> DimIntIndexers:
        ...

    def reindex_like(self, other: "Index") -> DimIntIndexers:
        ...

For alignment, I think we could directly call idx.reindex_like(other) for each index in each object to align where other corresponds to the aligned index that matches idx.

@benbovy
Copy link
Member Author

benbovy commented Oct 18, 2021

Ok, I'm now hitting another obstacle while working on reindex.

So one general approach for both alignment and re-indexing that is "pretty straightforward" to implement with the new Xarray index data model is: (1) find matching indexes based on their corresponding coordinate/dimension names and index type, and (2) call idx.join(other) and/or idx.reindex_like(other) where other is another index object of the same type than idx. This is what I've done so far in #5692.

Relaxing any of the constraints in (1) would be much more complicated to implement. We would need to do some sort of mapping from dimension labels to all involved (multi-/meta-)indexes, then check for conflicts in dimension indexers returned from multiple indexes, possibly handle/remove multi-index coordinates (or convert back to non-indexed coordinates), etc.

One problem with Dataset.reindex and DataArray.reindex is that we can pass any {dim: labels} as indexers. How should we adapt the API for flexible indexes? How to ensure backwards compatibility? The current possible cases are:

  1. Both dim and labels correspond (or may be cast) to single pandas indexes: there's no issue in this case with the constraints stated in (1) above.

  2. dim has a multi-index, labels is a pd.MultiIndex and level names exactly match: no real issue either in this case, but shouldn't we discourage this implicit behavior in the mid/long term and instead encourage using reindex_like for such more advanced use case?

  3. dim has a multi-index and labels is (cast to) a single pandas index (or the other way around): this is currently possible in Xarray but it seems silly? After re-indexing, all data along dim is filled with fill_value... Would it be fine to instead raise an error now? Would it really break any user case?

  4. dim has a multi-index, labels is a pd.MultiIndex and multi-index level names don't match: same problem than for case 3.

Cases 3 and 4 are a big obstacle for me right now. I really don't know how we can still support those special cases without deeply re-thinking the problem. If they could be considered as a bug, then the new implementation would already raise an nice error message :-).

@shoyer
Copy link
Member

shoyer commented Oct 18, 2021

2. dim has a multi-index, labels is a pd.MultiIndex and level names exactly match: no real issue either in this case, but shouldn't we discourage this implicit behavior in the mid/long term and instead encourage using reindex_like for such more advanced use case?

I agree, this doesn't make sense in the long term because the "name" of the MultiIndex is no longer necessary: it's just the same index that happens to index each of the levels.

Let's preserve it (for now) for backwards compat, but in the long term the ideal is certainly either (a) using reindex_like or (b) using reindex with separate kwargs for each level

3. dim has a multi-index and labels is (cast to) a single pandas index (or the other way around): this is currently possible in Xarray but it seems silly? After re-indexing, all data along dim is filled with fill_value... Would it be fine to instead raise an error now? Would it really break any user case?

If part of Xarray's API currently doesn't raise an error but instead returns all NaNs, and this case can be detected based on static type information (e.g., shapes, dtypes, dimension names, variable name, index types), then I agree that the best user experience is almost certainly to raise an error instead.

NaNs in numerical computing are essentially a mechanism for "runtime error handling" that can arise from values in array computation. If something can be identified based on type information, that is a better user experience.

4. dim has a multi-index, labels is a pd.MultiIndex and multi-index level names don't match: same problem than for case 3.

Cases 3 and 4 are a big obstacle for me right now. I really don't know how we can still support those special cases without deeply re-thinking the problem. If they could be considered as a bug, then the new implementation would already raise an nice error message :-).

I think we can consider these edge cases bugs and fix them :)

@shoyer
Copy link
Member

shoyer commented Oct 18, 2021

For alignment, I think we could directly call idx.reindex_like(other) for each index in each object to align where other corresponds to the aligned index that matches idx.

Sounds good to me!

If we like, we could probably even add static type information on reindex_like, to require the same index type:

T = TypeVar("T", bound="Index")

...
    def reindex_like(self: T, other: T) -> DimIntIndexers:
        ...

@benbovy
Copy link
Member Author

benbovy commented Oct 18, 2021

If we like, we could probably even add static type information on reindex_like, to require the same index type

Yes, I did that for Index.join() and Index.reindex_like() in #5692.

@benbovy
Copy link
Member Author

benbovy commented Oct 19, 2021

I've used the following key type to find matching indexes:

 CoordNamesAndDims = FrozenSet[Tuple[Hashable, Tuple[Hashable, ...]]]
 MatchingIndexKey = Tuple[CoordNamesAndDims, Type[Index]]

where the order of coordinates doesn't matter. For PandasMultiIndex this is wrong, we should use a Tuple instead of a FrozenSet (assuming that the coordinate names are properly ordered in obj.xindexes), but I'm wondering if using a Tuple wouldn't be too restrictive in some other cases.

Are there potential custom indexes where the order of coordinates doesn't matter? Maybe a good example is a meta-index for staggered grids where the cell center coordinate and the cell edges coordinates might be given in any order.

Possible solutions to address this:

  1. Xarray indexes may reorder the coordinate variables, possibly via Index.create_variables(), to ensure consistent order
  2. Xarray indexes must implement the Index.matching_key abstract property in order to support re-indexing and alignment.

Option 2 is more flexible but option 1 might be enough. Option 1 may also be great for clearer indexes and coordinates sections in Xarray objects repr.

@benbovy
Copy link
Member Author

benbovy commented Oct 19, 2021

Other possible solutions (from discussion with @shoyer):

  1. Take care of coordinate order (and maybe other things) inside Index.join and Index.equals, e.g., for PandasMultiIndex maybe reorder the levels beforehand.
    • pros: more flexible
    • cons: not great to implicitly reorder levels if it's a costly operation?
  2. Find matching indexes using a two-passes approach: (1) group all indexes by dimension name and (2) check compatibility between the indexes listed in each group.

@benbovy benbovy changed the title Flexible indexes: review the implementation of alignment Flexible indexes: review the implementation of alignment and merge Oct 22, 2021
@benbovy
Copy link
Member Author

benbovy commented Oct 22, 2021

@shoyer I'm now reviewing the merging logic. I works mostly fine, with some minor concerns:

  • MergeElement (variable, index) tuples are not grouped per matching (unique) index, so index equality will be checked more than once for multi-coordinate indexes. Not a big deal I think, unless for large multi-indexes with many levels. A proper fix for this would probably require some heavier refactoring, though. Is it ok to leave this as is for now?

  • I'm not sure how we should handle the case where foo is the name of both an un-indexed dimension and a coordinate with other dimension(s).

Let's take this example:

>>> from xarray.tests.test_dataset import create_test_multiindex
>>> data = create_test_multiindex()
>>> data
<xarray.Dataset>
Dimensions:  (x: 4)
Coordinates:
  * x        (x) MultiIndex
  - level_1  (x) object 'a' 'a' 'b' 'b'
  - level_2  (x) int64 1 2 1 2
Data variables:
    *empty*
>>> other = xr.Dataset({"z": ("level_1", [0, 1])})
>>> merged = data.merge(other)
ValueError: conflicting level / dimension names. level_1 already exists as a level name.

Do we raise a ValueError here because otherwise we don't know what to return from merged["level_1"] (i.e., the multi-index virtual coordinate or a default dimension coordinate)? Can we allow this case now that multi-index level coordinates are not virtual anymore?

Note: the following example does not raise any error in xarray:

>>> data = xr.Dataset(coords={"x": [0, 1, 2, 3], "level_1": ("x", ["a", "a", "b", "b"])})
>>> other = xr.Dataset({"z": ("level_1", [0, 1])})
>>> merged = data.merge(other)
>>> merged
<xarray.Dataset>
Dimensions:  (x: 4, level_1: 2)
Coordinates:
  * x        (x) int64 0 1 2 3
    level_1  (x) <U1 'a' 'a' 'b' 'b'
Data variables:
    z        (level_1) int64 0 1

@benbovy
Copy link
Member Author

benbovy commented Sep 7, 2022

I think we can close this issue, which has mostly been addressed in #5692. I opened #7002 for the potential problem of index alignment vs. coordinate ordering.

@benbovy benbovy closed this as completed Sep 7, 2022
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

2 participants