Support for jagged array #1482

Closed · mitar opened this issue Jul 19, 2017 · 11 comments

@mitar

mitar commented Jul 19, 2017

Maybe I am misunderstanding something, but does xarray not support jagged arrays? I would like to store images as arrays inside a multi-dimensional array, but every image is potentially of a different size. So along one dimension, every value can be a differently sized array. Is this supported by xarray?

@fmaussion
Member

"Supported", yes, in the sense that you can create a DataArray for each of your differently sized arrays without any problem. If you want to store them all in a Dataset, you'll have to give a different dimension name for each new dimension, which can be clumsy.

However, it is true that xarray shines at handling more structured data, and most examples in the docs involve dataset variables sharing similar dimensions. What kind of "support" exactly were you thinking of?

@mitar
Author

mitar commented Jul 19, 2017

If you want to store them all in a Dataset, you'll have to give a different dimension name for each new dimension, which can be clumsy.

But I cannot combine multiple dimensions into the same Variable, no? So if I have a dataset with multiple variables, it seems each variable has to have uniform dimensions for all its values? Maybe I am misunderstanding the dimensions concept.

What kind of "support" exactly were you thinking of?

Maybe examples of how to create such a jagged dataset? For example, how to have a variable which stores 2D images of different sizes.

If I understand correctly, I could batch all images of the same size into their own dimension? That might also be acceptable.

@fujiisoup
Member

fujiisoup commented Jul 19, 2017

I have a similar use case and I often use a MultiIndex,
which (partly) enables handling hierarchical data structures.

For example,

In [1]: import xarray as xr
   ...: import numpy as np
   ...:
   ...: # image 0, size [3, 4]
   ...: data0 = xr.DataArray(np.arange(12).reshape(3, 4), dims=['x', 'y'],
   ...:                      coords={'x': np.linspace(0, 1, 3), 
   ...:                              'y': np.linspace(0, 1, 4),
   ...:                              'image_index': 0})
   ...: # image 1, size [4, 5]
   ...: data1 = xr.DataArray(np.arange(20).reshape(4, 5), dims=['x', 'y'],
   ...:                      coords={'x': np.linspace(0, 1, 4), 
   ...:                              'y': np.linspace(0, 1, 5),
   ...:                              'image_index': 1})
   ...: 
   ...: data = xr.concat([data0.expand_dims('image_index').stack(xy=['x', 'y', 'image_index']),
   ...:                   data1.expand_dims('image_index').stack(xy=['x', 'y', 'image_index'])],
   ...:                  dim='xy')

In [2]: data
Out[2]: 
<xarray.DataArray (xy: 32)>
array([ 0,  1,  2,  3,  4,  5,  6,  7,  8,  9, 10, 11,  0,  1,  2,  3,  4,  5,
        6,  7,  8,  9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19])
Coordinates:
  * xy           (xy) MultiIndex
  - x            (xy) float64 0.0 0.0 0.0 0.0 0.5 0.5 0.5 0.5 1.0 1.0 1.0 ...
  - y            (xy) float64 0.0 0.3333 0.6667 1.0 0.0 0.3333 0.6667 1.0 ...
  - image_index  (xy) int64 0 0 0 0 0 0 0 0 0 0 0 0 1 1 1 1 1 1 1 1 1 1 1 1 ...

In [3]:  data.sel(image_index=0)  # gives data0
Out[3]: 
<xarray.DataArray (xy: 12)>
array([ 0,  1,  2,  3,  4,  5,  6,  7,  8,  9, 10, 11])
Coordinates:
  * xy       (xy) MultiIndex
  - x        (xy) float64 0.0 0.0 0.0 0.0 0.5 0.5 0.5 0.5 1.0 1.0 1.0 1.0
  - y        (xy) float64 0.0 0.3333 0.6667 1.0 0.0 0.3333 0.6667 1.0 0.0 ...

In [4]: data.sel(x=0.0)  # x==0.0 for both images
Out[4]: 
<xarray.DataArray (xy: 9)>
array([0, 1, 2, 3, 0, 1, 2, 3, 4])
Coordinates:
  * xy           (xy) MultiIndex
  - y            (xy) float64 0.0 0.3333 0.6667 1.0 0.0 0.25 0.5 0.75 1.0
  - image_index  (xy) int64 0 0 0 0 1 1 1 1 1

I think the above solution is essentially equivalent to

all images of the same size into their own dimension

EDIT:
I didn't understand the comment correctly.
The above corresponds to all the images being flattened out and combined along one large dimension.

@darothen

The problem is that these sorts of arrays break the common data model on top of which xarray (and NetCDF) is built.

If I understand correctly, I could batch all images of the same size into their own dimension? That might also be acceptable.

Yes, if you can pre-process all the images and align them on some common set of dimensions (maybe just xi and yi, denoting integer index in the x and y directions), and pad unused space for each image with NaNs, then you could concatenate everything into a Dataset.
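A rough sketch of that padding approach, assuming two images of different sizes (the dimension names yi/xi and the dummy data are just illustrative):

import numpy as np
import xarray as xr

# Two hypothetical images of different sizes.
images = [np.arange(12.).reshape(3, 4), np.arange(20.).reshape(4, 5)]

# Pad every image with NaNs up to the largest shape, then concatenate
# along a new "image" dimension with shared integer-index dims yi and xi.
max_y = max(im.shape[0] for im in images)
max_x = max(im.shape[1] for im in images)
padded = [
    xr.DataArray(
        np.pad(im, ((0, max_y - im.shape[0]), (0, max_x - im.shape[1])),
               constant_values=np.nan),
        dims=['yi', 'xi'])
    for im in images
]
data = xr.concat(padded, dim='image')
print(data.shape)  # (2, 4, 5); unused slots are NaN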

@mitar
Author

mitar commented Jul 19, 2017

Hm, padding might use a lot of extra space, no?

@darothen

@mitar it depends on your data/application, right? But that information would also be helpful in figuring out alternative pathways. If you're always going to process the images individually or sequentially, then what advantage is there (aside from convenience) in dumping them into some giant array with forced dimensions/shape per slice?

@mitar
Author

mitar commented Aug 8, 2017

then what advantage is there (aside from convenience) in dumping them into some giant array with forced dimensions/shape per slice?

I was mostly thinking of using xarray as a basic data format for reusable code. So if I build ML pipelines out of reusable components, I have to pass data around. Initially the data might be in jagged arrays, and then, with various preprocessing steps before training a model, I can get it into a more suitable format where images are of the same size, so that it is easier to train on. I hoped I could use the same format in all of these places where I need to pass data around.

@shoyer
Member

shoyer commented Aug 8, 2017

I understand why this could be useful, but I don't see how we could possibly make it work.

The notion of "fixed dimension size" is fundamental to both NumPy arrays (upon which xarray is based) and the xarray Dataset/DataArray. There are various workarounds (e.g., using padding or a MultiIndex), but first-class support for jagged arrays would break our existing data model too severely.

@fmfreeze

As I am not aware of the implementation details, I am not sure there is a useful link, but maybe progress in #3213 on supporting sparse arrays can also solve the jagged array issue.

A long time ago I asked a question there about how xarray supports sparse arrays.
But what I actually meant was "jagged arrays"; I just was not aware of that term and only stumbled over it for the first time a few days ago.

@Material-Scientist

As I am not aware of the implementation details, I am not sure there is a useful link, but maybe progress in #3213 on supporting sparse arrays can also solve the jagged array issue.

A long time ago I asked a question there about how xarray supports sparse arrays. But what I actually meant was "jagged arrays"; I just was not aware of that term and only stumbled over it for the first time a few days ago.

I also recently came across awkward/jagged/ragged arrays, and that's exactly how I would like to operate on multi-dimensional (2D in the referenced case) sparse data:

[screenshot omitted]

Instead of allocating memory with NaNs, empty slots are simply not materialized, by using the pd.SparseDtype("float", np.nan) dtype.

You basically create a dense duck array from sparse dtypes, as the Pandas sparse user guide shows:
[screenshot from the pandas sparse user guide omitted]
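A minimal sketch of that idea, assuming a small ragged example padded with NaN and then stored with a sparse dtype so the padding is not materialized:

import numpy as np
import pandas as pd

# Two "rows" of different lengths, padded with NaN, then converted to a
# sparse dtype so the NaN fill values are not stored in memory.
dense = pd.DataFrame([[0.0, 1.0, 2.0, np.nan, np.nan],
                      [0.0, 1.0, 2.0, 3.0, 4.0]])
sparse_df = dense.astype(pd.SparseDtype("float", np.nan))
print(sparse_df.dtypes)          # Sparse[float64, nan] for each column
print(sparse_df.sparse.density)  # fraction of values actually stored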

So, all the shape, dtype, and ndim requirements are satisfied, and xarray could implement this as a duck array.

And while you can already wrap sparse duck arrays with xr.Variable, I'm not sure if the wrapper maintains the dtype:
[screenshot omitted]

@dcherian
Contributor

dcherian commented Mar 7, 2023

@Material-Scientist We have decent support for pydata/sparse arrays. It seems like these would work for you.

We do not support the pandas extension arrays at the moment.
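For reference, a minimal sketch of wrapping a pydata/sparse array in xarray (assuming the sparse package is installed; the coordinates and values are just illustrative):

import sparse
import xarray as xr

# A mostly-empty 2D array stored as a pydata/sparse COO array; only the
# three explicitly listed entries are materialized.
coo = sparse.COO(coords=[[0, 0, 1], [0, 2, 3]],
                 data=[1.0, 2.0, 3.0], shape=(4, 5))
arr = xr.DataArray(coo, dims=['y', 'x'])
print(arr.data.nnz)  # 3 stored values out of a 4 x 5 array
print(arr)           # a DataArray backed by the sparse duck array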
