Regression in 0.0.8-0.0.9 release causes race condition & segfault in eccodes grib_string_length #328
Is the segfault the only output, or is there some preceding warning/exception? Do you still get this if you only run one thread per dask worker? At a guess, the following might provide the necessary safety:

```diff
--- a/kerchunk/codecs.py
+++ b/kerchunk/codecs.py
@@ -2,6 +2,7 @@ import ast
 import numcodecs
 from numcodecs.abc import Codec
 import numpy as np
+import threading
 
 
 class FillStringsCodec(Codec):
@@ -70,6 +71,7 @@ class GRIBCodec(numcodecs.abc.Codec):
     """
     Read GRIB stream of bytes as a message using eccodes
     """
 
+    eclock = threading.RLock()
     codec_id = "grib"
@@ -90,18 +92,19 @@ class GRIBCodec(numcodecs.abc.Codec):
         else:
             var = "values"
         dt = self.dtype or "float32"
-        mid = eccodes.codes_new_from_message(bytes(buf))
-        try:
-            data = eccodes.codes_get_array(mid, var)
-        finally:
-            eccodes.codes_release(mid)
-
-        if var == "values" and eccodes.codes_get_string(mid, "missingValue"):
-            data[data == float(eccodes.codes_get_string(mid, "missingValue"))] = np.nan
-        if out is not None:
-            return numcodecs.compat.ndarray_copy(data, out)
-        else:
-            return data.astype(dt)
+        with self.eclock:
+            mid = eccodes.codes_new_from_message(bytes(buf))
+            try:
+                data = eccodes.codes_get_array(mid, var)
+            finally:
+                eccodes.codes_release(mid)
+
+            if var == "values" and eccodes.codes_get_string(mid, "missingValue"):
+                data[data == float(eccodes.codes_get_string(mid, "missingValue"))] = np.nan
+            if out is not None:
+                return numcodecs.compat.ndarray_copy(data, out)
+            else:
+                return data.astype(dt)
```
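The class-level `RLock` in the diff above can be illustrated in isolation. This is a minimal sketch, with a hypothetical `FakeCodec` standing in for `GRIBCodec` and the byte-reversal standing in for the non-thread-safe eccodes calls: because the lock lives on the class, every instance and every thread shares the same lock, so the critical section is entered by at most one thread at a time.

```python
import threading

class FakeCodec:
    # One lock shared by every instance, mirroring the proposed
    # GRIBCodec.eclock: it serializes all calls into a (hypothetical)
    # non-thread-safe C library.
    eclock = threading.RLock()

    def __init__(self):
        self.active = 0       # threads currently inside the critical section
        self.max_active = 0   # peak concurrency observed

    def decode(self, buf):
        with self.eclock:
            self.active += 1
            self.max_active = max(self.max_active, self.active)
            result = buf[::-1]   # stand-in for the real eccodes work
            self.active -= 1
            return result

codec = FakeCodec()
threads = [threading.Thread(target=codec.decode, args=(b"grib",)) for _ in range(8)]
for t in threads:
    t.start()
for t in threads:
    t.join()
assert codec.max_active == 1  # the lock kept the section single-threaded
```

An `RLock` (rather than a plain `Lock`) also stays safe if a locked method ever calls another locked method on the same thread.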
Without gdb the process just dies: no warnings or errors. I will try your patch and see if I can generate some metrics.
The patch works.
For a one-month HRRR aggregation, running with …, I think the time is all in the GCS/S3 IO. I don't think the patch's lock around the grib parsing makes a bit of difference, since that is all GIL-bound anyway. Can you release 0.1.1 with this patch?
In more complex cases I am still seeing race conditions: errors that go away when run under … GDB shows:

```
[New LWP 3693756]
ecCodes assertion failed: `t' in ./src/grib_hash_keys.c:9971

Thread 15 "python3" received signal SIGSEGV, Segmentation fault.
[Switching to Thread 0xffff89dbf120 (LWP 11400)]
0x0000ffff932d3450 in grib_string_length () from /lib/aarch64-linux-gnu/libeccodes.so.0
```
^ this is after applying my suggested diff? When I suggested single-threaded, I meant the distributed scheduler with multiple workers but only one thread each. Still, it shouldn't require that.
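For reference, "multiple workers, one thread each" can be set up from the dask CLI roughly as follows. This is a launch sketch, not a definitive recipe: the scheduler address and worker count are placeholders, and flag spellings vary across dask versions (older releases use `dask-worker` and `--nprocs`).

```shell
# Scheduler plus four single-threaded worker processes: each worker
# decodes GRIB serially, so eccodes is never entered by two threads
# of the same process at once.
dask scheduler &
dask worker tcp://127.0.0.1:8786 --nworkers 4 --nthreads 1
```

From Python, `distributed.Client(n_workers=4, threads_per_worker=1)` gives an equivalent local cluster.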
Yes, after using the patch... is there some other place where a lock could be required?
I don't see anywhere else that eccodes could be getting called, and that block is supposed to release its C objects before leaving. The method of decoding did change, I suppose, around the time you noticed this: from reading temporary local files to making eccodes objects directly in memory from bytes objects. I did not expect any problem from this!
I strongly prefer the from-memory pattern as a design.
#329 resolves the issue - thank you Martin!
After upgrading from kerchunk==0.0.8 to kerchunk==0.0.9 I get an intermittent segfault reading my HRRR grib files. The problem persists in kerchunk==0.1.0.
GDB shows:
It appears to be a race condition in the dask workers when I call `to_dataframe` on a slice of the dataset. It only happens about one time in five. I tried putting it in a for loop that would run until it produces the fault, but I can't seem to reset the state of the dask workers sufficiently to make that happen. hrrr_repro.py, mzz.zarr (the MultiZarr file built from HRRR grib), and the terminal output of the repro case are in this gist, including all the library version details.
I can try rerunning scan_grib to produce the input artifacts with the new library versions. I have not done that yet, but we have several years of HRRR surface output scanned and aggregated that I hope to keep using until I have time to replace them with the new parquet format.
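One reason an in-process for loop struggles to reproduce this: a segfault kills the whole worker process, so there is no clean state to reset afterwards. Launching each attempt in a fresh subprocess sidesteps that. A sketch, assuming `hrrr_repro.py` is the script from the gist and the attempt count is arbitrary:

```python
import subprocess
import sys

def run_until_abnormal_exit(script, attempts=20):
    """Run `script` in a fresh interpreter until it exits abnormally.

    Each attempt gets a brand-new process, so there is no dask worker
    state to reset between tries. On Linux a segfault appears as a
    negative returncode (-11, the negative of SIGSEGV).
    """
    for i in range(attempts):
        proc = subprocess.run([sys.executable, script], capture_output=True)
        if proc.returncode != 0:
            print(f"attempt {i} exited with returncode {proc.returncode}")
            return proc
    return None  # never failed within the attempt budget

# run_until_abnormal_exit("hrrr_repro.py")  # the repro script from the gist
```

The captured `proc.stderr` from the failing attempt can then be attached to the issue alongside the gdb trace.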