Support for jagged array #1482

Maybe I am misunderstanding something, but does xarray not support jagged arrays? I would like to store images as arrays inside a multi-dimensional array, but every image is potentially of a different size. So along one dimension, every value can be a differently sized array. Is this supported by xarray?
"Supported", yes, in the sense that you can create a However, it is true that xarray shines at handling more structured data and that most examples in the docs are those of dataset variables sharing similar dimensions. What kind of "support" exactly were you thinking of? |
But I cannot combine multiple dimensions into the same variable, no? So if I have a dataset with multiple variables, it seems that each variable has to have uniform dimensions for all its values? Maybe I am misunderstanding the dimensions concept.
Maybe add some examples of how to create such a jagged dataset? For example, how to have a variable which stores 2D images of different sizes. If I understand correctly, I could batch all images of the same size into their own dimension? That might also be acceptable.
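A minimal sketch of that batching idea (the variable names, dimension names, and image counts are made up): images that share a shape are stacked along their own batch dimension, and each group becomes a separate variable.

import numpy as np
import xarray as xr

# Ten 3x4 images and seven 4x5 images (shapes chosen arbitrarily).
small = [np.random.rand(3, 4) for _ in range(10)]
large = [np.random.rand(4, 5) for _ in range(7)]

# Stack each same-shape group along its own batch dimension.
ds = xr.Dataset(
    {
        "images_3x4": (("batch_a", "y_a", "x_a"), np.stack(small)),
        "images_4x5": (("batch_b", "y_b", "x_b"), np.stack(large)),
    }
)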
I have a similar use case and I often use a MultiIndex. For example:

In [1]: import xarray as xr
...: import numpy as np
...:
...: # image 0, size [3, 4]
...: data0 = xr.DataArray(np.arange(12).reshape(3, 4), dims=['x', 'y'],
...: coords={'x': np.linspace(0, 1, 3),
...: 'y': np.linspace(0, 1, 4),
...: 'image_index': 0})
...: # image 1, size [4, 5]
...: data1 = xr.DataArray(np.arange(20).reshape(4, 5), dims=['x', 'y'],
...: coords={'x': np.linspace(0, 1, 4),
...: 'y': np.linspace(0, 1, 5),
...: 'image_index': 1})
...:
...: data = xr.concat([data0.expand_dims('image_index').stack(xy=['x', 'y', 'image_index']),
...: data1.expand_dims('image_index').stack(xy=['x', 'y', 'image_index'])],
...: dim='xy')
In [2]: data
Out[2]:
<xarray.DataArray (xy: 32)>
array([ 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 0, 1, 2, 3, 4, 5,
6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19])
Coordinates:
* xy (xy) MultiIndex
- x (xy) float64 0.0 0.0 0.0 0.0 0.5 0.5 0.5 0.5 1.0 1.0 1.0 ...
- y (xy) float64 0.0 0.3333 0.6667 1.0 0.0 0.3333 0.6667 1.0 ...
- image_index (xy) int64 0 0 0 0 0 0 0 0 0 0 0 0 1 1 1 1 1 1 1 1 1 1 1 1 ...
In [3]: data.sel(image_index=0) # gives data0
Out[3]:
<xarray.DataArray (xy: 12)>
array([ 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11])
Coordinates:
* xy (xy) MultiIndex
- x (xy) float64 0.0 0.0 0.0 0.0 0.5 0.5 0.5 0.5 1.0 1.0 1.0 1.0
- y (xy) float64 0.0 0.3333 0.6667 1.0 0.0 0.3333 0.6667 1.0 0.0 ...
In [4]: data.sel(x=0.0) # x==0.0 for both images
Out[4]:
<xarray.DataArray (xy: 9)>
array([0, 1, 2, 3, 0, 1, 2, 3, 4])
Coordinates:
* xy (xy) MultiIndex
- y (xy) float64 0.0 0.3333 0.6667 1.0 0.0 0.25 0.5 0.75 1.0
- image_index (xy) int64 0 0 0 0 1 1 1 1 1
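As a hedged follow-up to the example above: selecting a single image drops the image_index level, and the remaining (x, y) MultiIndex can be unstacked to recover the 2D image.

# recover image 0 as a 2-D (x, y) array again
img0 = data.sel(image_index=0).unstack('xy')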
The problem is that these sorts of arrays break the common data model on top of which xarray (and NetCDF) is built.
Yes, if you can pre-process all the images and align them on some common set of dimensions (maybe just xi and yi, denoting integer index in the x and y directions), and pad the unused space for each image with NaNs, then you could concatenate everything into a single DataArray.
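A minimal sketch of this padding approach (the dimension names follow the xi/yi suggestion above; everything else is illustrative):

import numpy as np
import xarray as xr

images = [np.arange(12).reshape(3, 4), np.arange(20).reshape(4, 5)]

# Pad every image with NaN up to the largest height/width, then stack
# them along a new "image" dimension indexed by integer pixel positions.
ny = max(im.shape[0] for im in images)
nx = max(im.shape[1] for im in images)
padded = [
    xr.DataArray(
        np.pad(im.astype(float),
               ((0, ny - im.shape[0]), (0, nx - im.shape[1])),
               constant_values=np.nan),
        dims=["yi", "xi"],
    )
    for im in images
]
data = xr.concat(padded, dim="image")  # dims (image: 2, yi: 4, xi: 5)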
Hm, padding might use a lot of extra space, no?
@mitar it depends on your data/application, right? But that information would also be helpful in figuring out alternative pathways. If you're always going to process the images individually or sequentially, then what advantage is there (aside from convenience) of dumping them in some giant array with forced dimensions/shape per slice?
I was mostly thinking of using xarray as a basic data format for reusable code. If I build ML pipelines out of reusable components, I have to pass data around between them. Initially the data might be in jagged arrays, and then, with various preprocessing steps before training a model, I can get it into a more suitable format where images are all the same size so that I can train more easily. I hoped I could use the same format in all of these places where I need to pass data around.
I understand why this could be useful, but I don't see how we could possibly make it work. The notion of a "fixed dimension size" is fundamental to both NumPy arrays (upon which xarray is based) and the xarray Dataset/DataArray. There are various workarounds (e.g., using padding or a MultiIndex), but first-class support for jagged arrays would break our existing data model too severely.
As I am not aware of the implementation details, I am not sure there is a useful link, but maybe the progress in #3213 on supporting sparse arrays could also solve the jagged array issue. A long time ago I asked a question there about how xarray supports sparse arrays.
I also recently came across awkward/jagged/ragged arrays, and that's exactly how I would like to operate on multi-dimensional (2D in the referenced case) sparse data: instead of allocating memory for NaNs, the empty slots are simply never materialized when using sparse dtypes. You basically create a dense-looking duck array from sparse dtypes, as the Pandas sparse user guide shows. So the shape, dtype, and ndim requirements are all satisfied, and xarray could implement this as a duck array. And you can already wrap sparse duck arrays with xarray.
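For reference, a tiny pandas-only example of the sparse-dtype idea (the values are made up): the Series behaves like a dense array but only stores the non-fill entries.

import pandas as pd

s = pd.Series([0.0, 0.0, 1.5, 0.0, 2.5],
              dtype=pd.SparseDtype("float64", fill_value=0.0))
print(s.dtype)           # Sparse[float64, 0.0]
print(s.sparse.density)  # 0.4 -- only 2 of 5 values are actually stored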
@Material-Scientist We have decent support for pydata/sparse arrays. It seems like these would work for you. We do not support the pandas extension arrays at the moment.
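A small sketch of the pydata/sparse route (shapes and values here are illustrative): the COO array is kept as a duck array inside the DataArray, so empty slots are never materialized.

import numpy as np
import sparse
import xarray as xr

# A 4x5 array with only three stored values.
coords = np.array([[0, 1, 3], [0, 2, 4]])
values = np.array([1.0, 2.0, 3.0])
sp = sparse.COO(coords, values, shape=(4, 5))

arr = xr.DataArray(sp, dims=["x", "y"])
print(type(arr.data))                      # sparse COO, not a dense numpy array
dense = arr.copy(data=arr.data.todense())  # densify only when needed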