
Cannot concatenate Datasets containing ordered categoricals with different categories. #10247


Open
ilia-kats opened this issue Apr 24, 2025 · 7 comments
Labels
bug, topic-arrays (related to flexible array support)

Comments

@ilia-kats

What happened?

I'm trying to concatenate two xarray Datasets that contain ordered categorical Pandas extension arrays. Pandas converts these to string (object) arrays during concatenation, but xarray raises a TypeError.

What did you expect to happen?

Concatenation succeeds.

Minimal Complete Verifiable Example

import xarray as xr
import pandas as pd

cat1 = pd.DataFrame({"test": pd.Categorical(["a", "b", "c"], ordered=True)})
cat2 = pd.DataFrame({"test": pd.Categorical(["a", "b", "d"], ordered=True)})
ds1 = xr.Dataset.from_dataframe(cat1)
ds2 = xr.Dataset.from_dataframe(cat2)

xr.concat([ds1, ds2], dim="index")

MVCE confirmation

  • Minimal example — the example is as focused as reasonably possible to demonstrate the underlying issue in xarray.
  • Complete example — the example is self-contained, including all data and the text of any traceback.
  • Verifiable example — the example copy & pastes into an IPython prompt or Binder notebook, returning the result.
  • New issue — a search of GitHub Issues suggests this is not a duplicate.
  • Recent environment — the issue occurs with the latest version of xarray and its dependencies.

Relevant log output

TypeError                                 Traceback (most recent call last)
Cell In[19], line 9
      6 ds1 = xr.Dataset.from_dataframe(cat1)
      7 ds2 = xr.Dataset.from_dataframe(cat2)
----> 9 xr.concat([ds1, ds2], dim="index")

File /data/ilia/envs/famo/lib/python3.11/site-packages/xarray/core/concat.py:277, in concat(objs, dim, data_vars, coords, compat, positions, fill_value, join, combine_attrs, create_index_for_new_dim)
    264     return _dataarray_concat(
    265         objs,
    266         dim=dim,
   (...)    274         create_index_for_new_dim=create_index_for_new_dim,
    275     )
    276 elif isinstance(first_obj, Dataset):
--> 277     return _dataset_concat(
    278         objs,
    279         dim=dim,
    280         data_vars=data_vars,
    281         coords=coords,
    282         compat=compat,
    283         positions=positions,
    284         fill_value=fill_value,
    285         join=join,
    286         combine_attrs=combine_attrs,
    287         create_index_for_new_dim=create_index_for_new_dim,
    288     )
    289 else:
    290     raise TypeError(
    291         "can only concatenate xarray Dataset and DataArray "
    292         f"objects, got {type(first_obj)}"
    293     )

File /data/ilia/envs/famo/lib/python3.11/site-packages/xarray/core/concat.py:669, in _dataset_concat(datasets, dim, data_vars, coords, compat, positions, fill_value, join, combine_attrs, create_index_for_new_dim)
    667         result_vars[k] = v
    668 else:
--> 669     combined_var = concat_vars(
    670         vars, dim_name, positions, combine_attrs=combine_attrs
    671     )
    672     # reindex if variable is not present in all datasets
    673     if len(variable_index) < concat_index_size:

File /data/ilia/envs/famo/lib/python3.11/site-packages/xarray/core/variable.py:3004, in concat(variables, dim, positions, shortcut, combine_attrs)
   3002     return IndexVariable.concat(variables, dim, positions, shortcut, combine_attrs)
   3003 else:
-> 3004     return Variable.concat(variables, dim, positions, shortcut, combine_attrs)

File /data/ilia/envs/famo/lib/python3.11/site-packages/xarray/core/variable.py:1752, in Variable.concat(cls, variables, dim, positions, shortcut, combine_attrs)
   1750 axis = first_var.get_axis_num(dim)
   1751 dims = first_var_dims
-> 1752 data = duck_array_ops.concatenate(arrays, axis=axis)
   1753 if positions is not None:
   1754     # TODO: deprecate this option -- we don't need it for groupby
   1755     # any more.
   1756     indices = nputils.inverse_permutation(np.concatenate(positions))

File /data/ilia/envs/famo/lib/python3.11/site-packages/xarray/core/duck_array_ops.py:378, in concatenate(arrays, axis)
    376     xp = get_array_namespace(arrays[0])
    377     return xp.concat(as_shared_dtype(arrays, xp=xp), axis=axis)
--> 378 return _concatenate(as_shared_dtype(arrays), axis=axis)

File /data/ilia/envs/famo/lib/python3.11/site-packages/xarray/core/extension_array.py:100, in PandasExtensionArray.__array_function__(self, func, types, args, kwargs)
     98 if func not in HANDLED_EXTENSION_ARRAY_FUNCTIONS:
     99     return func(*args, **kwargs)
--> 100 res = HANDLED_EXTENSION_ARRAY_FUNCTIONS[func](*args, **kwargs)
    101 if is_extension_array_dtype(res):
    102     return type(self)[type(res)](res)

File /data/ilia/envs/famo/lib/python3.11/site-packages/xarray/core/extension_array.py:48, in __extension_duck_array__concatenate(arrays, axis, out)
     44 @implements(np.concatenate)
     45 def __extension_duck_array__concatenate(
     46     arrays: Sequence[T_ExtensionArray], axis: int = 0, out=None
     47 ) -> T_ExtensionArray:
---> 48     return type(arrays[0])._concat_same_type(arrays)

File /data/ilia/envs/famo/lib/python3.11/site-packages/pandas/core/arrays/categorical.py:2527, in Categorical._concat_same_type(cls, to_concat, axis)
   2524     result = res_flat.reshape(len(first), -1, order="F")
   2525     return result
-> 2527 result = union_categoricals(to_concat)
   2528 return result

File /data/ilia/envs/famo/lib/python3.11/site-packages/pandas/core/dtypes/concat.py:341, in union_categoricals(to_union, sort_categories, ignore_order)
    339     if all(c.ordered for c in to_union):
    340         msg = "to union ordered Categoricals, all categories must be the same"
--> 341         raise TypeError(msg)
    342     raise TypeError("Categorical.ordered must be the same")
    344 if ignore_order:

TypeError: to union ordered Categoricals, all categories must be the same

Anything else we need to know?

No response

Environment

commit: None
python: 3.11.2 (main, Nov 30 2024, 21:22:50) [GCC 12.2.0]
python-bits: 64
OS: Linux
OS-release: 6.12.12+bpo-amd64
machine: x86_64
processor:
byteorder: little
LC_ALL: None
LANG: en_US.UTF-8
LOCALE: ('en_US', 'UTF-8')
libhdf5: 1.14.4
libnetcdf: None

xarray: 2024.10.0
pandas: 2.2.3
numpy: 2.2.5
scipy: 1.15.2
netCDF4: None
pydap: None
h5netcdf: None
h5py: 3.12.1
zarr: 3.0.6
cftime: None
nc_time_axis: None
iris: None
bottleneck: None
dask: 2025.3.0
distributed: 2025.2.0
matplotlib: 3.9.2
cartopy: None
seaborn: 0.13.2
numbagg: None
fsspec: 2024.10.0
cupy: None
pint: None
sparse: 0.16.0
flox: None
numpy_groupies: None
setuptools: 66.1.1
pip: 23.0.1
conda: None
pytest: 8.3.3
mypy: None
IPython: 9.1.0
sphinx: 8.1.3

ilia-kats added the bug and needs triage (Issue that has not been reviewed by xarray team member) labels on Apr 24, 2025

welcome bot commented Apr 24, 2025

Thanks for opening your first issue here at xarray! Be sure to follow the issue template!
If you have an idea for a solution, we would really welcome a Pull Request with proposed changes.
See the Contributing Guide for more.
It may take us a while to respond here, but we really value your contribution. Contributors like you help make xarray better.
Thank you!

@ilan-gold
Contributor

@ilia-kats I think it would make sense (given the ubiquity of categoricals) to special-case things here using something like what you have here: https://github.com/scverse/anndata/pull/1966/files/881a46e9281c3e1137482273ec66ea781b679ba5..efba50287628bc38527e622b721b40c223044404#diff-1aaff521ae74cab9b0cac4c29bedf4d4d05cf8a4f59fc41e6aa61d4c37498134R316-R325, unless you see something in https://pandas.pydata.org/docs/reference/api/pandas.api.extensions.ExtensionArray.html (or in the pandas categorical API) that would allow for a general-purpose solution.

I intentionally wrote this to be extremely restrictive so that it applies to all extension arrays and only does the bare minimum that can be safely promised without unexpected down/upcasting; for example, even if you "could" stack extension arrays because the types roughly go together (say, ints and floats), that is disallowed here. But as use cases arise, I think we can special-case things and be very clear about the intention of those special cases, especially when they come from pandas core.
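
To make the suggestion concrete, here is a rough sketch (not xarray's actual code; the function name concat_extension_arrays is made up for illustration) of what a Categorical special case inside the concatenate handler could look like: try union_categoricals, and fall back to object the way pd.concat does when the categories are incompatible.

import numpy as np
import pandas as pd
from pandas.api.types import union_categoricals

def concat_extension_arrays(arrays):
    """Concatenate 1-d pandas extension arrays, special-casing Categorical."""
    if all(isinstance(arr, pd.Categorical) for arr in arrays):
        try:
            # union_categoricals recomputes the codes against the union of
            # all categories (ordered is kept only if the categories match).
            return union_categoricals(list(arrays))
        except TypeError:
            # e.g. ordered categoricals with different categories:
            # fall back to object, mirroring what pd.concat does.
            return np.concatenate(
                [np.asarray(arr, dtype=object) for arr in arrays]
            )
    # Restrictive default: only concatenate arrays of the exact same type.
    return type(arrays[0])._concat_same_type(list(arrays))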

@ilia-kats
Author

I'm not aware of anything in the public API. I believe Pandas uses pd.core.dtypes.concat.concat_compat internally, but that is not part of the API.

@ilan-gold
Contributor

Looking at the code, they use _concat_same_type, which is part of the public ExtensionArray API (at least it's documented), and I used it in the implementation here.

It seems like pandas actually does not even preserve categoricals upon concatenation, which makes sense at least conceptually. Instead they map to object. I am not clear on how you would remap the underlying codes to the new categorical objects.

If pandas internally maps to object, I would do that here. Looking at the code you posted in that PR, I'm not sure I follow how you are handling the issue of codes. I see you create this data type as a union type, but then I'm not clear on how the codes are handled.
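
For reference, a small illustration of what union_categoricals (which Categorical._concat_same_type delegates to, per the traceback above) does with the codes in the unordered case: it recomputes them against the union of categories rather than preserving the originals. The outputs in the comments are what pandas 2.x produces.

import pandas as pd
from pandas.api.types import union_categoricals

c1 = pd.Categorical(["a", "b", "c"], categories=["a", "b", "c", "d"])
c2 = pd.Categorical(["a", "b", "d"], categories=["a", "d", "c", "b"])

u = union_categoricals([c1, c2])
print(c2.codes)      # [0 3 1]        -- codes relative to c2's own categories
print(u.codes)       # [0 1 2 0 1 3]  -- recomputed against the union
print(u.categories)  # Index(['a', 'b', 'c', 'd'], dtype='object')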

@ilia-kats
Author

If you're talking about the union in the case of unordered categories, that code was already there before my PR. It definitely does not preserve the codes, as Pandas itself does not:

import xarray as xr
import pandas as pd

cat1 = pd.DataFrame({"test": pd.Categorical(["a", "b", "c"], categories=["a", "b", "c", "d"])})
cat2 = pd.DataFrame({"test": pd.Categorical(["a", "b", "d"], categories=["a", "d", "c", "b"])})

concat = pd.concat((cat1, cat2), axis=0)
In [23]: cat1["test"].cat.codes
Out[23]: 
0    0
1    1
2    2
dtype: int8

In [24]: cat2["test"].cat.codes
Out[24]: 
0    0
1    3
2    1
dtype: int8

In [25]: concat = pd.concat((cat1, cat2), axis=0)

In [26]: cat1["test"].cat.codes
Out[26]: 
0    0
1    1
2    2
dtype: int8

In [27]: cat2["test"].cat.codes
Out[27]: 
0    0
1    3
2    1
dtype: int8

In [28]: concat["test"]
Out[28]: 
0    a
1    b
2    c
0    a
1    b
2    d
Name: test, dtype: category
Categories (4, object): ['a', 'b', 'c', 'd']

In [29]: concat["test"].cat.codes
Out[29]: 
0    0
1    1
2    2
0    0
1    1
2    3
dtype: int8

Similarly for the ordered case: if the categories and their order are the same, they can be concatenated, but again the codes may not be preserved.
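
A quick sketch of the same point at the array level (expected behavior with pandas 2.x): ordered categoricals with identical categories union fine, while differing categories raise the TypeError from the traceback above.

import pandas as pd
from pandas.api.types import union_categoricals

c1 = pd.Categorical(["a", "b"], categories=["a", "b", "c"], ordered=True)
c2 = pd.Categorical(["c", "a"], categories=["a", "b", "c"], ordered=True)
union_categoricals([c1, c2])  # works, result stays ordered

c3 = pd.Categorical(["a", "d"], categories=["a", "d"], ordered=True)
union_categoricals([c1, c3])
# TypeError: to union ordered Categoricals, all categories must be the same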

@ilan-gold
Contributor

ilan-gold commented Apr 28, 2025

Right, @ilia-kats, my point was that categoricals that cannot be safely concatenated while keeping their previous dtype are just concatenated to object, and this was the behavior I wanted to prevent here.

But in retrospect it might make sense to just let pd.concat do whatever it is going to do. Would relying on pd.concat remove the need for your custom code? Or does pd.concat not handle what is written here https://github.com/scverse/anndata/pull/1966/files/881a46e9281c3e1137482273ec66ea781b679ba5..efba50287628bc38527e622b721b40c223044404#diff-1aaff521ae74cab9b0cac4c29bedf4d4d05cf8a4f59fc41e6aa61d4c37498134R316-R325? I.e., does it make sense to just do a 1-d pd.concat for these objects instead of disallowing the behavior or writing something custom? I think it might, no?

@ilan-gold
Contributor

Ah, I see that pd.concat only works with Series and DataFrames, so never mind. In that case, I am not sure what the best way forward is here. Your code seems necessary to achieve the desired concatenation of pure extension arrays. I do wonder what the performance implications would be of wrapping each of the two arrays to be concatenated in a Series, concatenating those, and then extracting the underlying array from the concatenated pd.Series.
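
Something like the following is what I mean by the Series round-trip (a hypothetical, unbenchmarked sketch; concat_via_series is a made-up name). The open question is how much overhead the pd.Series wrappers and the index machinery add.

import pandas as pd

def concat_via_series(arrays):
    """Concatenate 1-d extension arrays by round-tripping through pd.Series."""
    # pd.concat only accepts Series/DataFrames, so wrap each array first;
    # ignore_index avoids building an index we immediately discard.
    combined = pd.concat([pd.Series(arr) for arr in arrays], ignore_index=True)
    # .array returns the underlying extension array (or a NumPy-backed
    # wrapper if pandas had to fall back to object dtype).
    return combined.array

For the example in this issue, concat_via_series([cat1["test"].array, cat2["test"].array]) should come back as an object-dtype array rather than raising.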

dcherian added the topic-arrays (related to flexible array support) label and removed the needs triage (Issue that has not been reviewed by xarray team member) label on May 7, 2025