
Regression in 0.0.8-0.0.9 release causes race condition & segfault in eccodes grib_string_length #328


Closed
emfdavid opened this issue Apr 13, 2023 · 10 comments

Comments

@emfdavid
Contributor

After upgrading from kerchunk==0.0.8 to kerchunk==0.0.9, I get an intermittent segfault reading my HRRR GRIB files. The problem persists in kerchunk==0.1.0.

GDB shows:

Thread 7 "python3" received signal SIGSEGV, Segmentation fault.
[Switching to Thread 0xffff7da0e120 (LWP 20659)]
0x0000ffff820b3450 in grib_string_length () from /lib/aarch64-linux-gnu/libeccodes.so.0

It appears to be a race condition in the dask workers when I call to_dataframe on a slice of the dataset. It only happens about one time in five. I tried wrapping the call in a for loop that runs until it produces the fault, but I can't seem to reset the state of the dask workers sufficiently between iterations to trigger it that way.

hrrr_repro.py, mzz.zarr (the MultiZarr file built from the HRRR GRIBs), and the terminal output from the repro case are in this gist, along with all the library version details.
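Schematically, the failing access pattern looks like this (a simplified sketch, not the exact repro script; the variable name, slice, and storage options are illustrative):

import dask
import fsspec
import xarray as xr

# Open the aggregated references through fsspec's reference filesystem.
fs = fsspec.filesystem(
    "reference",
    fo="mzz.zarr",
    remote_protocol="gcs",
)
ds = xr.open_dataset(
    fs.get_mapper(), engine="zarr", backend_kwargs={"consolidated": False}, chunks={}
)

# Decoding the GRIB chunks across multiple dask threads is where the
# segfault shows up, roughly one run in five.
with dask.config.set(scheduler="threading"):
    df = ds["t2m"].isel(time=slice(0, 24)).to_dataframe()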

I can try rerunning scan_grib to produce the input artifacts with the new library versions. I have not done that yet, but we have several years of HRRR surface output already scanned and aggregated that I hope to keep using until I have time to replace them with the new parquet format.

@martindurant
Member

Is the segfault the only output, or is there some preceding warning/exception? Do you still get this if you only run one thread per dask worker?

At a guess, the following might provide the necessary safety:

--- a/kerchunk/codecs.py
+++ b/kerchunk/codecs.py
@@ -2,6 +2,7 @@ import ast
 import numcodecs
 from numcodecs.abc import Codec
 import numpy as np
+import threading


 class FillStringsCodec(Codec):
@@ -70,6 +71,7 @@ class GRIBCodec(numcodecs.abc.Codec):
     """
     Read GRIB stream of bytes as a message using eccodes
     """
+    eclock = threading.RLock()

     codec_id = "grib"

@@ -90,18 +92,19 @@ class GRIBCodec(numcodecs.abc.Codec):
         else:
             var = "values"
             dt = self.dtype or "float32"
-        mid = eccodes.codes_new_from_message(bytes(buf))
-        try:
-            data = eccodes.codes_get_array(mid, var)
-        finally:
-            eccodes.codes_release(mid)
-
-        if var == "values" and eccodes.codes_get_string(mid, "missingValue"):
-            data[data == float(eccodes.codes_get_string(mid, "missingValue"))] = np.nan
-        if out is not None:
-            return numcodecs.compat.ndarray_copy(data, out)
-        else:
-            return data.astype(dt)
+        with self.eclock:
+            mid = eccodes.codes_new_from_message(bytes(buf))
+            try:
+                data = eccodes.codes_get_array(mid, var)
+            finally:
+                eccodes.codes_release(mid)
+
+            if var == "values" and eccodes.codes_get_string(mid, "missingValue"):
+                data[data == float(eccodes.codes_get_string(mid, "missingValue"))] = np.nan
+            if out is not None:
+                return numcodecs.compat.ndarray_copy(data, out)
+            else:
+                return data.astype(dt)

@emfdavid
Contributor Author

Without gdb the process just dies - no warnings or errors.

Using dask.config.set(scheduler='single-threaded') does appear to prevent the issue.
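For reference, the workaround is just the standard dask config context manager around the failing call (ds and the slice here are placeholders):

import dask

with dask.config.set(scheduler="single-threaded"):
    # All chunks decode in the main thread, so the eccodes calls never
    # run concurrently and the segfault does not appear.
    df = ds.isel(time=slice(0, 24)).to_dataframe()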

I will try your patch and see if I can generate some metrics.

@emfdavid
Contributor Author

The patch works.
Looking at metrics now; trying to isolate them from the GCS IO variability...

@emfdavid
Contributor Author

For a one-month HRRR aggregation, running with dask.config.set(scheduler='single-threaded') is definitely slower: ~170 seconds vs ~55 seconds for the same data using dask.config.set(scheduler='threading').

But I think that difference is all in the GCS/S3 IO. I don't think adding the lock around the GRIB parsing makes any measurable difference; that work is GIL-bound anyway.

Can you release 0.1.1 with this patch?

@emfdavid
Contributor Author

In more complex cases I am still seeing race conditions: errors that go away when run under scheduler='single-threaded'.

[New LWP 3693756]
ecCodes assertion failed: `t' in ./src/grib_hash_keys.c:9971
Thread 15 "python3" received signal SIGSEGV, Segmentation fault.
[Switching to Thread 0xffff89dbf120 (LWP 11400)]
0x0000ffff932d3450 in grib_string_length () from /lib/aarch64-linux-gnu/libeccodes.so.0

@martindurant
Member

^ Is this after applying my suggested diff?

When I suggested single-threaded, I meant the distributed scheduler with multiple workers but only one thread each. Still, it shouldn't require that.
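i.e. something like this (a sketch; the worker count is arbitrary):

from dask.distributed import Client, LocalCluster

# Several worker processes, each limited to one thread, so eccodes is
# never called concurrently within a single process.
cluster = LocalCluster(n_workers=4, threads_per_worker=1)
client = Client(cluster)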

@emfdavid
Contributor Author

Yes - that was after applying the patch... is there some other place where a lock could be required?
If nothing obvious stands out, I can bisect the failing test operation and boil it down to a repro case tied to a specific change again.

@martindurant
Member

I don't see anywhere else eccodes could be getting called, and that block is supposed to release its C objects before leaving.

The method of decoding did change around the time you noticed this, from reading in temporary local files to making eccodes objects directly in memory from bytes objects. I did not expect any problem from this!
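For context, the change is roughly from the first pattern below to the second (a simplified sketch, not the exact kerchunk code):

import tempfile

import eccodes
import numpy as np

def decode_via_tempfile(buf: bytes) -> np.ndarray:
    # Old path: write the GRIB message to a temporary file and let
    # eccodes read it back from disk.
    with tempfile.NamedTemporaryFile(suffix=".grib2") as tmp:
        tmp.write(bytes(buf))
        tmp.flush()
        with open(tmp.name, "rb") as f:
            mid = eccodes.codes_grib_new_from_file(f)
            try:
                return eccodes.codes_get_array(mid, "values")
            finally:
                eccodes.codes_release(mid)

def decode_in_memory(buf: bytes) -> np.ndarray:
    # New path: build the eccodes handle directly from the bytes object,
    # with no temporary file involved.
    mid = eccodes.codes_new_from_message(bytes(buf))
    try:
        return eccodes.codes_get_array(mid, "values")
    finally:
        eccodes.codes_release(mid)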

@emfdavid
Contributor Author

I strongly prefer the in-memory pattern as a design.
Happy to help track down this issue - I will get back to you with more details.

@emfdavid
Contributor Author

#329 resolves the issue - thank you Martin!
