"ValueError: missing object_codec for object array" #167
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
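For reference, a minimal sketch of the failing call (the file path is a placeholder; any GEDI L4A granule should reproduce it):

```python
from kerchunk.hdf import SingleHdf5ToZarr

path = "GEDI04_A_subset.h5"  # placeholder for a real GEDI granule
with open(path, "rb") as f:
    # Fails with "ValueError: missing object_codec for object array"
    # once it reaches a variable holding object (vlen string) data.
    refs = SingleHdf5ToZarr(f, url=path).translate()
```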
Comments
The actual data type here appears to be a fixed-length record format of 355 bytes per row. I am not sure whether the "object" fields are inline in the data or pointers to some internal heap. If the former, kerchunk can certainly encode it (with the "O"s replaced by "|Sx"s). If pointers, then kerchunk can only extract and inline the whole variable; in this case, that's some 12kB of raw data, and we would presumably need the "pickle" codec. The alternative, as you say, is to drop the whole variable from the output. In #46 we contemplated skipping or erroring with a decent message for such cases (except that this is a compound dtype containing objects, rather than a plain object dtype).
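A minimal sketch of the fixed-width substitution described above, assuming the strings really are inline; the 16-byte width and the example dtype are illustrative only:

```python
import numpy as np

def fixed_width_dtype(dt: np.dtype, width: int = 16) -> np.dtype:
    """Replace every object ("O") field in a compound dtype with a
    fixed-width bytes field ("|Sx"). The width is a guess and would
    have to match the on-disk layout."""
    fields = []
    for name in dt.names:
        sub = dt[name]
        fields.append((name, f"|S{width}" if sub.kind == "O" else sub))
    return np.dtype(fields)

# Example: a compound dtype mixing numbers and object (string) fields.
src = np.dtype([("id", "<i4"), ("label", "O"), ("value", "<f8")])
print(fixed_width_dtype(src))
# -> [('id', '<i4'), ('label', 'S16'), ('value', '<f8')]
```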
Another, even easier, thing we can do, if we choose not to handle the string pointer problem, is just to regard the (8-byte?) pointer values as the string data. That will of course be wrong, but it will make all the numbers/arrays come out right.
@martindurant in our specific use case I think it might be acceptable to expand the logic of #46 to ignore compound dtypes as well, but if you feel there is a cleaner approach that doesn't require a big time investment then 👍
There probably should be an option for what to do in cases we know we can't handle. Skipping would be an obvious possibility. Can you tell, by looking at the bytes at the relevant file offset, whether the strings of that array are embedded or pointers?
I'm unsure how best to determine whether the strings are embedded or heap pointers.
Following the code, you should already have the file offset and size of the array buffer. Look at the first 355 bytes and see whether they contain the strings of the first row of the array.
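Something like the following should pull out those bytes for inspection; this is a sketch using h5py's low-level API, with placeholder file and dataset names, and it assumes a contiguous (non-chunked) layout:

```python
import h5py

PATH = "file.h5"  # placeholder
with h5py.File(PATH, "r") as f:
    dset = f["BEAM0000/land_cover_data"]    # hypothetical dataset path
    offset = dset.id.get_offset()           # None for chunked layouts
    nbytes = dset.id.get_type().get_size()  # on-disk bytes per row

with open(PATH, "rb") as raw:
    raw.seek(offset)
    first_row = raw.read(nbytes)
    # Embedded strings would be readable here; pointers show up as
    # opaque 8- or 16-byte sequences instead.
    print(first_row)
```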
I'm not very familiar with low-level HDF access. Does this provide the comparison you want, or am I using the offset incorrectly here?
I can confirm that the string values are either pointers or dictionary hashes; I am not sure which, but I suppose both are equivalent from our point of view. They appear to be 16 bytes each (which is pretty wasteful, no string is that long!). The actual size of each row is 403 bytes, not the 355 implied by the numpy dtype; a sketch of checking this appears below. In short, we could skip such variables, ignore the problem (keeping the nonsensical 16-byte sequences), or inline the whole variable's data.
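A sketch of comparing the in-memory and on-disk row sizes from h5py; the file and dataset names are placeholders, and the 16-byte-per-string interpretation is an assumption:

```python
import h5py

with h5py.File("file.h5", "r") as f:          # placeholder path
    dset = f["BEAM0000/land_cover_data"]      # hypothetical dataset
    mem_row = dset.dtype.itemsize             # object fields -> 8-byte pointers
    disk_row = dset.id.get_type().get_size()  # on-disk record size
    n_obj = sum(dset.dtype[name].kind == "O" for name in dset.dtype.names)
    # If each string reference is 16 bytes on disk but 8 bytes (a pointer)
    # in memory, we expect: disk_row - mem_row == 8 * n_obj
    # (e.g. 403 - 355 = 48, i.e. six object fields).
    print(mem_row, disk_row, n_obj)
```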
Quick cross-link before I head out: zarr-developers/zarr-python#1035
@joshmoore, this is not actually related to VLEN: we have a complex dtype here that includes some object-type fields, which are encoded using HDF5-specific pointers.
@martindurant I would suggest that in the short term we include a parameter which allows users to select inlining or skipping for compound dtypes and VLEN strings. Once #40 is completed, we can change the underlying storage mechanism without altering the use of the option parameter.
I agree with adding the argument to skip, ignore, or inline for now, with additional options becoming available later. "Ignore" means just exposing the nonsensical 16-byte sequences.
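A hypothetical sketch of what such a parameter could look like; the name `vlen_encode` and its values here are illustrative, not an existing kerchunk API:

```python
from kerchunk.hdf import SingleHdf5ToZarr

# Hypothetical option -- name and values are illustrative only:
#   "skip":   drop variables containing object fields from the output
#   "ignore": keep them, exposing the nonsensical 16-byte sequences
#   "inline": extract and embed the variable's data in the references
path = "file.h5"  # placeholder
with open(path, "rb") as f:
    refs = SingleHdf5ToZarr(f, url=path, vlen_encode="skip").translate()
```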
This may now be working on the main branch; I intend to look at it again.