"ValueError: missing object_codec for object array" #167

Closed
sharkinsspatial opened this issue May 25, 2022 · 13 comments · Fixed by #175

@sharkinsspatial

Seeing the same VLEN string issue noted in #102 when attempting to use SingleHdf5ToZarr with a single GEDI HDF5 file.

See this notebook
https://nbviewer.org/gist/sharkinsspatial/b5938e2e3e0c96a1f1cef768d1b4da7e

I attempted testing against #40 but did not see the reported segfault; instead I hit the same originally reported "ValueError: missing object_codec for object array" exception.

predict_stratum appears to be the offending variable in this case: tests with a new intermediate HDF5 file for a selected BEAM group, with this variable dropped, work as expected.

For more details on the GEDI data structure see
https://github.com/ornldaac/gedi_tutorials/blob/main/3_gedi_l4a_exploring_data.ipynb
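
A minimal sketch of the failing call, assuming kerchunk's SingleHdf5ToZarr as used in the notebook above (the granule filename is a placeholder):

    import fsspec
    from kerchunk.hdf import SingleHdf5ToZarr

    url = "GEDI04_A_example.h5"  # placeholder for a real GEDI L4A granule
    with fsspec.open(url, "rb") as f:
        # Fails with "ValueError: missing object_codec for object array"
        # when it reaches a variable containing HDF5 object (string) fields.
        refs = SingleHdf5ToZarr(f, url).translate()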

@martindurant (Member) commented May 25, 2022

The actual data type here is:

dtype([('predict_stratum', 'O'), ('model_group', 'u1'), ('model_name', 'O'), ('model_id', 'u1'), 
    ('x_transform', 'O'), ('y_transform', 'O'), ('bias_correction_name', 'O'), ('fit_stratum', 'O'), 
    ('rh_index', 'u1', (8,)), ('predictor_id', 'u1', (8,)), ('predictor_max_value', '<f4', (8,)), 
    ('vcov', '<f8', (5, 5)), ('par', '<f8', (5,)), ('rse', '<f4'), ('dof', '<u4'), 
    ('response_max_value', '<f4'), ('bias_correction_value', '<f4'), ('npar', 'u1')])

which appears to be a fixed-length record format of 355 bytes per row. I am not sure whether the "object" types are inline in the data or pointers to some internal heap. If the former, kerchunk can certainly encode it (with the "O"s replaced by "|Sx"s). If pointers, then kerchunk can only extract and inline the whole variable; in this case, that's some 12kB of raw data, and we would presumably need the "pickle" codec. The alternative, as you say, is to drop the whole variable from the output. In #46 we contemplated skipping or erroring with a decent message for such cases (except that this is a compound dtype containing objects, rather than an object dtype).
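
A rough sketch of that "O" to "|Sx" substitution (the width of 16 bytes is an assumption, anticipating the pointer size measured later in this thread; offsets and itemsize change, so this illustrates the encoding idea rather than the on-disk layout):

    import numpy as np

    def fixed_width(dt, width=16):
        # Rebuild a compound dtype, swapping each object ("O") field
        # for a fixed-width bytes field of the given width.
        fields = []
        for name in dt.names:
            sub = dt.fields[name][0]
            shape = None
            if sub.subdtype is not None:
                sub, shape = sub.subdtype   # e.g. ('rh_index', 'u1', (8,))
            if sub.kind == "O":
                sub = np.dtype(f"S{width}")
            fields.append((name, sub, shape) if shape else (name, sub))
        return np.dtype(fields)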

@martindurant (Member)

Another, even easier, thing we can do, if we choose not to handle the string pointer problem, is just to regard the (8-byte?) pointer value as the string data, which will of course be wrong but will make all the numbers/arrays come out right.

@sharkinsspatial (Author)

@martindurant in our specific use case I think it might be acceptable to expand the logic of #46 to ignore compound dtypes as well, but if you feel there is a cleaner approach that doesn't require a big time investment then 👍

@martindurant (Member)

There probably should be an option as to what to do for cases we know we can't handle. Skipping would be an obvious possibility.

Can you tell, by looking at the bytes at the relevant file offset, whether the strings of that array are embedded or pointers?

@sharkinsspatial (Author)

I'm unsure how to best determine if the strings are embedded or heap pointers.

@martindurant (Member)

Following the code, you should already have the file offset and size of the array buffer. Look at the first 355 bytes and see whether it contains the strings of the first row of the array (h5obj[0]).
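
A sketch of that check using h5py's low-level API (file and dataset paths are placeholders; note that dset.id.get_offset() only returns an offset for contiguous, uncompressed datasets):

    import h5py

    fname = "GEDI04_A_example.h5"                   # placeholder filename
    with h5py.File(fname, "r") as h5:
        dset = h5["BEAM0000/ancillary/model_data"]  # hypothetical dataset path
        offset = dset.id.get_offset()               # byte offset of the raw data
        itemsize = dset.dtype.itemsize              # 355 bytes per row in numpy
        first_row = dset[0]                         # decoded values for comparison

    with open(fname, "rb") as raw:
        raw.seek(offset)
        head = raw.read(itemsize)
    # If the strings are embedded, first_row's string values should appear
    # verbatim in `head`; if not, the fields hold pointers.
    print(first_row, head)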

@sharkinsspatial (Author)

I'm not very familiar with low-level HDF access; does this provide the comparison you want, or am I using the offset incorrectly here?
https://gist.github.com/sharkinsspatial/b97f032b3a42c7cc421eb06a014044e9

@martindurant (Member)

I can confirm that the string values are either pointers or dictionary hashes; I suppose both are equivalent from our point of view. They appear to be 16 bytes each (which is pretty wasteful, no string here is that long!). The actual size of each row on disk is 403 bytes (dsid.get_storage_size() / len(dset)), whereas 355 bytes is the python size, where each string is instead an 8-byte address pointer to a bytes object.
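
The arithmetic is consistent: the dtype above has six object fields, and 355 + 6*(16 - 8) = 403. As a sketch of the size comparison (run inside the same h5py.File context as the previous snippet):

    dsid = dset.id
    per_row_on_disk = dsid.get_storage_size() // len(dset)  # 403 bytes
    per_row_numpy = dset.dtype.itemsize                     # 355 bytes
    # 403 - 355 = 48 = 6 object fields * (16 - 8) extra bytes each:
    # 16 bytes on disk vs an 8-byte object pointer in the numpy view.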

In short, we could

  • inline this smallish dataset in the JSON output (see the sketch after this list)
  • complete the proposal in "Make dict-string codec and apply to HDF5" #40 to extract and store the dict mapping as a codec (good if many values are repeated)
  • skip problematic variables like this
  • read the data "as is", so that any string field will contain opaque, meaningless 16-byte binary (alternative: a simple codec to strip these or set them to b"")
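
For the first option, whatever we store must carry an object codec on the zarr side; a minimal illustration of the error in the issue title and one way to satisfy it (the data values here are made up):

    import numpy as np
    import numcodecs
    import zarr

    data = np.array(["stratum_a", "stratum_b"], dtype=object)  # made-up values

    # zarr.array(data) raises: ValueError: missing object_codec for object array
    z = zarr.array(data, object_codec=numcodecs.Pickle())
    # numcodecs.VLenUTF8() would be the more natural codec for plain strings.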

@joshmoore

Quick cross-link before I head out: zarr-developers/zarr-python#1035

@martindurant (Member)

@joshmoore, this is not actually related to VLEN: we have a compound dtype here that includes some object-typed fields, which are encoded using HDF5-specific pointers.

@sharkinsspatial (Author)

@martindurant I would suggest that in the short term we include a parameter that lets users choose between inlining and skipping for compound dtypes and VLEN strings. Once #40 is completed, we can change the underlying storage mechanism without altering how the option is used.

@martindurant (Member)

I agree to adding an argument to skip, ignore or inline for now, with additional options becoming available later. "Ignore" means just exposing the nonsensical 16-byte sequences.
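
A purely hypothetical sketch of how such an argument might surface (the parameter name on_unsupported and its values are invented here for illustration; see the PR that closed this issue for what was actually implemented):

    # Invented parameter name, for illustration only; f and url as in the
    # reproduction sketch at the top of the thread.
    refs = SingleHdf5ToZarr(
        f, url,
        on_unsupported="skip",  # "skip": drop the variable
                                # "ignore": keep the raw 16-byte pointer bytes
                                # "inline": embed decoded values in the JSON
    ).translate()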

@martindurant (Member)

This may now be working on the main branch; I intend to look at it again.
