
Regression in 0.0.8-0.0.9 release causes race condition & segfault in eccodes grib_string_length #328


Closed
emfdavid opened this issue Apr 13, 2023 · 10 comments

Comments

@emfdavid
Contributor

After upgrading from kerchunk==0.0.8 to kerchunk==0.0.9, I get an intermittent segfault reading my HRRR GRIB files. The problem persists in kerchunk==0.1.0.

GDB shows:

Thread 7 "python3" received signal SIGSEGV, Segmentation fault.
[Switching to Thread 0xffff7da0e120 (LWP 20659)]
0x0000ffff820b3450 in grib_string_length () from /lib/aarch64-linux-gnu/libeccodes.so.0

It appears to be a race condition in the dask workers when I call to_dataframe on a slice of the dataset. It only happens about one time in five. I tried wrapping the call in a for loop that runs until it produces the fault, but I can't seem to reset the state of the dask workers sufficiently between iterations to trigger it that way.

hrrr_repro.py, mzz.zarr (the MultiZarr file built from the HRRR GRIBs), and the terminal output from the repro case are in this gist, along with all the library version details.
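Schematically, the failing access pattern looks like this (a simplified sketch, not the exact repro script; the variable name, slice, and storage options are illustrative):

import dask
import fsspec
import xarray as xr

# Open the aggregated references through fsspec's reference filesystem.
fs = fsspec.filesystem(
    "reference",
    fo="mzz.zarr",
    remote_protocol="gcs",
)
ds = xr.open_dataset(
    fs.get_mapper(), engine="zarr", backend_kwargs={"consolidated": False}, chunks={}
)

# Decoding the GRIB chunks across multiple dask threads is where the
# segfault shows up, roughly one run in five.
with dask.config.set(scheduler="threading"):
    df = ds["t2m"].isel(time=slice(0, 24)).to_dataframe()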

I can try rerunning scan_grib to produce the input artifacts with the new library versions. I have not done that yet, but we have several years of HRRR surface output already scanned and aggregated that I hope to keep using until I have time to replace them with the new parquet format.

@martindurant
Member

Is the segfault the only output, or is there some preceding warning/exception? Do you still get this if you only run one thread per dask worker?

At a guess, the following might provide the necessary safety:

--- a/kerchunk/codecs.py
+++ b/kerchunk/codecs.py
@@ -2,6 +2,7 @@ import ast
 import numcodecs
 from numcodecs.abc import Codec
 import numpy as np
+import threading


 class FillStringsCodec(Codec):
@@ -70,6 +71,7 @@ class GRIBCodec(numcodecs.abc.Codec):
     """
     Read GRIB stream of bytes as a message using eccodes
     """
+    eclock = threading.RLock()

     codec_id = "grib"

@@ -90,18 +92,19 @@ class GRIBCodec(numcodecs.abc.Codec):
         else:
             var = "values"
             dt = self.dtype or "float32"
-        mid = eccodes.codes_new_from_message(bytes(buf))
-        try:
-            data = eccodes.codes_get_array(mid, var)
-        finally:
-            eccodes.codes_release(mid)
-
-        if var == "values" and eccodes.codes_get_string(mid, "missingValue"):
-            data[data == float(eccodes.codes_get_string(mid, "missingValue"))] = np.nan
-        if out is not None:
-            return numcodecs.compat.ndarray_copy(data, out)
-        else:
-            return data.astype(dt)
+        with self.eclock:
+            mid = eccodes.codes_new_from_message(bytes(buf))
+            try:
+                data = eccodes.codes_get_array(mid, var)
+            finally:
+                eccodes.codes_release(mid)
+
+            if var == "values" and eccodes.codes_get_string(mid, "missingValue"):
+                data[data == float(eccodes.codes_get_string(mid, "missingValue"))] = np.nan
+            if out is not None:
+                return numcodecs.compat.ndarray_copy(data, out)
+            else:
+                return data.astype(dt)

@emfdavid
Contributor Author

Without gdb the process just dies - no warnings or errors.

Using dask.config.set(scheduler='single-threaded') does appear to prevent the issue.
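For reference, the workaround is just the standard dask config context manager around the failing call (ds and the slice here are placeholders):

import dask

with dask.config.set(scheduler="single-threaded"):
    # All chunks decode in the main thread, so the eccodes calls never
    # run concurrently and the segfault does not appear.
    df = ds.isel(time=slice(0, 24)).to_dataframe()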

I will try your patch and see if I can generate some metrics.

@emfdavid
Contributor Author

The patch works.
Looking at metrics now; trying to isolate them from the GCS IO variability...

@emfdavid
Contributor Author

For a one-month HRRR aggregation, running with dask.config.set(scheduler='single-threaded') is definitely slower: ~170 seconds vs ~55 seconds for the same data using dask.config.set(scheduler='threading').

But I think that difference is all in the GCS/S3 IO. I don't think adding the lock around the GRIB parsing makes any measurable difference; that work is GIL-bound anyway.

Can you release 0.1.1 with this patch?

@emfdavid
Contributor Author

In more complex cases I am still seeing race conditions: errors that go away when run under scheduler='single-threaded'.

[New LWP 3693756]
ecCodes assertion failed: `t' in ./src/grib_hash_keys.c:9971
Thread 15 "python3" received signal SIGSEGV, Segmentation fault.
[Switching to Thread 0xffff89dbf120 (LWP 11400)]
0x0000ffff932d3450 in grib_string_length () from /lib/aarch64-linux-gnu/libeccodes.so.0

@martindurant
Member

^ Is this after applying my suggested diff?

When I suggested single-threaded, I meant the distributed scheduler with multiple workers but only one thread each. Still, it shouldn't require that.
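i.e. something like this (a sketch; the worker count is arbitrary):

from dask.distributed import Client, LocalCluster

# Several worker processes, each limited to one thread, so eccodes is
# never called concurrently within a single process.
cluster = LocalCluster(n_workers=4, threads_per_worker=1)
client = Client(cluster)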

@emfdavid
Contributor Author

Yes - that was after applying the patch... is there some other place where a lock could be required?
If nothing obvious stands out, I can bisect the failing test operation and boil it down to a repro case tied to a specific change again.

@martindurant
Member

I don't see anywhere else eccodes could be getting called, and that block is supposed to release its C objects before leaving.

The method of decoding did change around the time you noticed this, from reading in temporary local files to making eccodes objects directly in memory from bytes objects. I did not expect any problem from this!
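For context, the change is roughly from the first pattern below to the second (a simplified sketch, not the exact kerchunk code):

import tempfile

import eccodes
import numpy as np

def decode_via_tempfile(buf: bytes) -> np.ndarray:
    # Old path: write the GRIB message to a temporary file and let
    # eccodes read it back from disk.
    with tempfile.NamedTemporaryFile(suffix=".grib2") as tmp:
        tmp.write(bytes(buf))
        tmp.flush()
        with open(tmp.name, "rb") as f:
            mid = eccodes.codes_grib_new_from_file(f)
            try:
                return eccodes.codes_get_array(mid, "values")
            finally:
                eccodes.codes_release(mid)

def decode_in_memory(buf: bytes) -> np.ndarray:
    # New path: build the eccodes handle directly from the bytes object,
    # with no temporary file involved.
    mid = eccodes.codes_new_from_message(bytes(buf))
    try:
        return eccodes.codes_get_array(mid, "values")
    finally:
        eccodes.codes_release(mid)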

@emfdavid
Contributor Author

I strongly prefer the in-memory pattern as a design.
Happy to help track down this issue - I will get back to you with more details.

@emfdavid
Contributor Author

#329 resolves the issue - thank you Martin!
