"ValueError: missing object_codec for object array" #167

Closed
sharkinsspatial opened this issue May 25, 2022 · 13 comments · Fixed by #175

@sharkinsspatial

Seeing the same VLEN string issue noted in #102 when attempting to use SingleHdf5ToZarr with a single GEDI HDF5 file.

See this notebook
https://nbviewer.org/gist/sharkinsspatial/b5938e2e3e0c96a1f1cef768d1b4da7e

I attempted testing against #40 but did not see the reported segfault; instead I hit the same originally reported "ValueError: missing object_codec for object array" exception.

predict_stratum appears to be the offending variable in this case: tests with a new intermediate HDF5 file for a selected BEAM group, with this variable dropped, work as expected.

For more details on the GEDI data structure see
https://github.com/ornldaac/gedi_tutorials/blob/main/3_gedi_l4a_exploring_data.ipynb
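
A minimal sketch of the failing call, assuming kerchunk's SingleHdf5ToZarr as used in the notebook above (the granule filename is a placeholder):

    import fsspec
    from kerchunk.hdf import SingleHdf5ToZarr

    url = "GEDI04_A_example.h5"  # placeholder for a real GEDI L4A granule
    with fsspec.open(url, "rb") as f:
        # Fails with "ValueError: missing object_codec for object array"
        # when it reaches a variable containing HDF5 object (string) fields.
        refs = SingleHdf5ToZarr(f, url).translate()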

@martindurant (Member) commented May 25, 2022

The actual data type here is:

dtype([('predict_stratum', 'O'), ('model_group', 'u1'), ('model_name', 'O'), ('model_id', 'u1'), 
    ('x_transform', 'O'), ('y_transform', 'O'), ('bias_correction_name', 'O'), ('fit_stratum', 'O'), 
    ('rh_index', 'u1', (8,)), ('predictor_id', 'u1', (8,)), ('predictor_max_value', '<f4', (8,)), 
    ('vcov', '<f8', (5, 5)), ('par', '<f8', (5,)), ('rse', '<f4'), ('dof', '<u4'), 
    ('response_max_value', '<f4'), ('bias_correction_value', '<f4'), ('npar', 'u1')])

which appears to be a fixed-length record format of 355 bytes per row. I am not sure whether the "object" types are inline in the data or pointers to some internal heap. If the former, kerchunk can certainly encode it (with the "O"s replaced by "|Sx"s). If pointers, then kerchunk can only extract and inline the whole variable; in this case, that's some 12kB of raw data, and we would presumably need the "pickle" codec. The alternative, as you say, is to drop the whole variable from the output. In #46 we contemplated skipping or erroring with a decent message for such cases (except that this is a compound dtype containing objects, rather than an object dtype).
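
A rough sketch of that "O" to "|Sx" substitution (the width of 16 bytes is an assumption, anticipating the pointer size measured later in this thread; offsets and itemsize change, so this illustrates the encoding idea rather than the on-disk layout):

    import numpy as np

    def fixed_width(dt, width=16):
        # Rebuild a compound dtype, swapping each object ("O") field
        # for a fixed-width bytes field of the given width.
        fields = []
        for name in dt.names:
            sub = dt.fields[name][0]
            shape = None
            if sub.subdtype is not None:
                sub, shape = sub.subdtype   # e.g. ('rh_index', 'u1', (8,))
            if sub.kind == "O":
                sub = np.dtype(f"S{width}")
            fields.append((name, sub, shape) if shape else (name, sub))
        return np.dtype(fields)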

@martindurant (Member)

Another, even easier, thing we can do, if we choose not to handle the string pointer problem, is just to regard the (8-byte?) pointer value as the string data, which will of course be wrong but will make all the numbers/arrays come out right.

@sharkinsspatial (Author)

@martindurant in our specific use case I think it might be acceptable to expand the logic of #46 to ignore compound dtypes as well, but if you feel there is a cleaner approach that doesn't require a big time investment then 👍

@martindurant (Member)

There probably should be an option as to what to do for cases we know we can't handle. Skipping would be an obvious possibility.

Can you tell, by looking at the bytes at the relevant file offset, whether the strings of that array are embedded or pointers?

@sharkinsspatial (Author)

I'm unsure how to best determine if the strings are embedded or heap pointers.

@martindurant (Member)

Following the code, you should already have the file offset and size of the array buffer. Look at the first 355 bytes and see whether it contains the strings of the first row of the array (h5obj[0]).
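
A sketch of that check using h5py's low-level API (file and dataset paths are placeholders; note that dset.id.get_offset() only returns an offset for contiguous, uncompressed datasets):

    import h5py

    fname = "GEDI04_A_example.h5"                   # placeholder filename
    with h5py.File(fname, "r") as h5:
        dset = h5["BEAM0000/ancillary/model_data"]  # hypothetical dataset path
        offset = dset.id.get_offset()               # byte offset of the raw data
        itemsize = dset.dtype.itemsize              # 355 bytes per row in numpy
        first_row = dset[0]                         # decoded values for comparison

    with open(fname, "rb") as raw:
        raw.seek(offset)
        head = raw.read(itemsize)
    # If the strings are embedded, first_row's string values should appear
    # verbatim in `head`; if not, the fields hold pointers.
    print(first_row, head)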

@sharkinsspatial (Author)

I'm not very familiar with low-level HDF access; does this provide the comparison you want, or am I using the offset incorrectly here?
https://gist.github.com/sharkinsspatial/b97f032b3a42c7cc421eb06a014044e9

@martindurant (Member)

I can confirm that the string values are either pointers or dictionary hashes; I suppose both are equivalent from our point of view. They appear to be 16 bytes each (which is pretty wasteful, no string here is that long!). The actual size of each row on disk is 403 bytes (dsid.get_storage_size() / len(dset)), whereas 355 bytes is the python size, where each string is instead an 8-byte address pointer to a bytes object.
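
The arithmetic is consistent: the dtype above has six object fields, and 355 + 6*(16 - 8) = 403. As a sketch of the size comparison (run inside the same h5py.File context as the previous snippet):

    dsid = dset.id
    per_row_on_disk = dsid.get_storage_size() // len(dset)  # 403 bytes
    per_row_numpy = dset.dtype.itemsize                     # 355 bytes
    # 403 - 355 = 48 = 6 object fields * (16 - 8) extra bytes each:
    # 16 bytes on disk vs an 8-byte object pointer in the numpy view.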

In short, we could

  • inline this smallish dataset in the JSON output (see the sketch after this list)
  • complete the proposal in "Make dict-string codec and apply to HDF5" #40 to extract and store the dict mapping as a codec (good if many values are repeated)
  • skip problematic variables like this
  • read the data "as is", so that any string field will contain opaque, meaningless 16-byte binary (alternative: a simple codec to strip these or set them to b"")
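
For the first option, whatever we store must carry an object codec on the zarr side; a minimal illustration of the error in the issue title and one way to satisfy it (the data values here are made up):

    import numpy as np
    import numcodecs
    import zarr

    data = np.array(["stratum_a", "stratum_b"], dtype=object)  # made-up values

    # zarr.array(data) raises: ValueError: missing object_codec for object array
    z = zarr.array(data, object_codec=numcodecs.Pickle())
    # numcodecs.VLenUTF8() would be the more natural codec for plain strings.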

@joshmoore

Quick cross-link before I head out: zarr-developers/zarr-python#1035

@martindurant (Member)

@joshmoore, this is not actually related to VLEN: we have a compound dtype here that includes some object-typed fields, which are encoded using HDF5-specific pointers.

@sharkinsspatial (Author)

@martindurant I would suggest that in the short term we include a parameter that lets users choose between inlining and skipping for compound dtypes and VLEN strings. Once #40 is completed, we can change the underlying storage mechanism without altering how the option is used.

@martindurant (Member)

I agree to adding an argument to skip, ignore or inline for now, with additional options becoming available later. "Ignore" means just exposing the nonsensical 16-byte sequences.
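
A purely hypothetical sketch of how such an argument might surface (the parameter name on_unsupported and its values are invented here for illustration; see the PR that closed this issue for what was actually implemented):

    # Invented parameter name, for illustration only; f and url as in the
    # reproduction sketch at the top of the thread.
    refs = SingleHdf5ToZarr(
        f, url,
        on_unsupported="skip",  # "skip": drop the variable
                                # "ignore": keep the raw 16-byte pointer bytes
                                # "inline": embed decoded values in the JSON
    ).translate()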

@martindurant (Member)

This may now be working on the main branch; I intend to look at it again.
