Cannot concatenate Datasets containing ordered categoricals with different categories. #10247
@ilia-kats I think it would make sense (given the ubiquity of categoricals) to special-case things here using something like what you have in https://github.com/scverse/anndata/pull/1966/files/881a46e9281c3e1137482273ec66ea781b679ba5..efba50287628bc38527e622b721b40c223044404#diff-1aaff521ae74cab9b0cac4c29bedf4d4d05cf8a4f59fc41e6aa61d4c37498134R316-R325, unless you see something in https://pandas.pydata.org/docs/reference/api/pandas.api.extensions.ExtensionArray.html (or in the pandas categorical API) that would allow for a general-purpose solution.

I wrote this intentionally to be extremely restrictive so that it applies to all extension arrays and only does the bare minimum that can be safely promised without an unexpected down- or upcast. For example, even if you "could" stack extension arrays because the types kind of go together (say ints and floats), that is disallowed here. But as use cases arise, I think we can special-case things and be very clear about the intention of those special cases, especially when they come from pandas core.
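For comparison, pandas itself will upcast in that kind of mixed-extension-dtype concatenation; a small illustrative sketch (not code from the PR under discussion):

```python
import pandas as pd

ints = pd.Series([1, 2], dtype="Int64")
floats = pd.Series([1.5], dtype="Float64")

# pandas resolves a common extension dtype and upcasts to Float64 here;
# the restrictive concat logic described above deliberately refuses this case.
print(pd.concat([ints, floats]).dtype)  # Float64
```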
I'm not aware of anything in the public API. I believe Pandas uses pd.core.dtypes.concat.concat_compat internally, but that is not part of the API.
Looking in the code, it seems like pandas actually does not even preserve categoricals upon concatenation, which makes sense at least conceptually. Instead they map to object dtype.
If you're talking about the union in the case of unordered categories, that code was there already before my PR. It definitely does not preserve the codes, as Pandas itself does not:

```python
import xarray as xr
import pandas as pd

cat1 = pd.DataFrame({"test": pd.Categorical(["a", "b", "c"], categories=["a", "b", "c", "d"])})
cat2 = pd.DataFrame({"test": pd.Categorical(["a", "b", "d"], categories=["a", "d", "c", "b"])})

concat = pd.concat((cat1, cat2), axis=0)
```

```python
In [23]: cat1["test"].cat.codes
Out[23]:
0    0
1    1
2    2
dtype: int8

In [24]: cat2["test"].cat.codes
Out[24]:
0    0
1    3
2    1
dtype: int8

In [25]: concat = pd.concat((cat1, cat2), axis=0)

In [26]: cat1["test"].cat.codes
Out[26]:
0    0
1    1
2    2
dtype: int8

In [27]: cat2["test"].cat.codes
Out[27]:
0    0
1    3
2    1
dtype: int8

In [28]: concat["test"]
Out[28]:
0    a
1    b
2    c
0    a
1    b
2    d
Name: test, dtype: category
Categories (4, object): ['a', 'b', 'c', 'd']

In [29]: concat["test"].cat.codes
Out[29]:
0    0
1    1
2    2
0    0
1    1
2    3
dtype: int8
```

Similarly for the ordered case: if the categories and their order are the same, they can be concatenated, but again the codes may not be preserved.
Right @ilia-kats, my point was that categoricals that cannot be safely concatenated with the same previous dtype are just concatenated to objects, and this was the behavior I wanted to prevent here. But in retrospect it might make sense to just let that happen here as well.
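For reference, a minimal sketch of that object fallback in plain pandas (illustrative values, not code from the thread):

```python
import pandas as pd

s1 = pd.Series(pd.Categorical(["a", "b"], categories=["a", "b"], ordered=True))
s2 = pd.Series(pd.Categorical(["b", "c"], categories=["b", "c"], ordered=True))

# The ordered categorical dtypes differ, so pandas silently falls back to object.
print(pd.concat([s1, s2]).dtype)  # object
```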
Ah I see |
What happened?
I'm trying to concatenate two xarray Datasets that contain ordered categorical Pandas extension arrays. Pandas converts these to string (object) arrays during concatenation, but xarray raises a TypeError.

What did you expect to happen?
Concatenation succeeds.
Minimal Complete Verifiable Example
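A minimal sketch of the setup described above (illustrative names and values, assuming the Datasets are built via Dataset.from_dataframe):

```python
import pandas as pd
import xarray as xr

# Two Datasets whose "cat" variable is backed by ordered pandas Categoricals
# with different category sets.
df1 = pd.DataFrame({"cat": pd.Categorical(["a", "b"], categories=["a", "b"], ordered=True)})
df2 = pd.DataFrame({"cat": pd.Categorical(["b", "c"], categories=["b", "c"], ordered=True)})

ds1 = xr.Dataset.from_dataframe(df1)
ds2 = xr.Dataset.from_dataframe(df2)

# pandas would fall back to object dtype here; xarray raises a TypeError instead.
xr.concat([ds1, ds2], dim="index")
```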
MVCE confirmation
Relevant log output
Anything else we need to know?
No response
Environment
commit: None
python: 3.11.2 (main, Nov 30 2024, 21:22:50) [GCC 12.2.0]
python-bits: 64
OS: Linux
OS-release: 6.12.12+bpo-amd64
machine: x86_64
processor:
byteorder: little
LC_ALL: None
LANG: en_US.UTF-8
LOCALE: ('en_US', 'UTF-8')
libhdf5: 1.14.4
libnetcdf: None
xarray: 2024.10.0
pandas: 2.2.3
numpy: 2.2.5
scipy: 1.15.2
netCDF4: None
pydap: None
h5netcdf: None
h5py: 3.12.1
zarr: 3.0.6
cftime: None
nc_time_axis: None
iris: None
bottleneck: None
dask: 2025.3.0
distributed: 2025.2.0
matplotlib: 3.9.2
cartopy: None
seaborn: 0.13.2
numbagg: None
fsspec: 2024.10.0
cupy: None
pint: None
sparse: 0.16.0
flox: None
numpy_groupies: None
setuptools: 66.1.1
pip: 23.0.1
conda: None
pytest: 8.3.3
mypy: None
IPython: 9.1.0
sphinx: 8.1.3