-
Notifications
You must be signed in to change notification settings - Fork 4k
Description
What version of gRPC-Java are you using?
v1.71.0
What is your environment?
Linux
What did you expect to see?
I was not expecting onError to throw NullPointerException.
What did you see instead?
It threw NPE.
Steps to reproduce the bug
We have been seeing some rare NullPointerException or ArrayIndexOutOfBoundsException when calling onError on an incoming server call. This happens several times per day across all of our services, which translates to maybe one in a few billion calls or so. The exception looks like this:
java.lang.NullPointerException: Cannot invoke "io.grpc.Metadata$LazyValue.toBytes()" because "value" is null
at io.grpc.Metadata.valueAsBytes(Metadata.java:183)
at io.grpc.Metadata.serialize(Metadata.java:474)
at io.grpc.InternalMetadata.serialize(InternalMetadata.java:79)
at io.grpc.internal.TransportFrameUtil.toHttp2Headers(TransportFrameUtil.java:51)
at io.grpc.netty.Utils.convertTrailers(Utils.java:327)
at io.grpc.netty.NettyServerStream$Sink.writeTrailers(NettyServerStream.java:131)
at io.grpc.internal.AbstractServerStream.close(AbstractServerStream.java:133)
at io.grpc.internal.ServerCallImpl.closeInternal(ServerCallImpl.java:227)
at io.grpc.internal.ServerCallImpl.close(ServerCallImpl.java:213)
at io.grpc.stub.ServerCalls$ServerCallStreamObserverImpl.onError(ServerCalls.java:389)
In order to debug this, we added some code to use reflection to dump the internal structure of the Metadata object that throws. Here are some examples:
[grpc-status,14,grpc-message,Channel shutdown invoked,null,null,null,null]
[grpc-status,14,grpc-message,Network closed for unknown reason,null,null,null,null]
[null,null,grpc-status,14,grpc-message,Channel shutdown invoked,null,null]
[grpc-status,14,grpc-message,Channel shutdown invoked,null,null,grpc-message,Channel shutdown invoked]
[grpc-status,14,grpc-message,Channel shutdown invoked,null,null,null,null]
Some of these look perfectly valid, while others have the expected content, but in an unusual order, with null elements at the beginning or between entries. The documentation for Metadata clearly says that it is mutable and not thread-safe, and this appears to be due to multi-threaded mutation of the Metadata object.
We initially found this surprising: we're not setting or mutating metadata anywhere.
After reviewing the grpc-java source code, we discovered the following sequence of events:
We call
io.grpc.stub.ServerCalls$ServerCallStreamObserverImpl.onError
which calls
Metadata metadata = Status.trailersFromThrowable(t);
and then calls
io.grpc.internal.ServerCallImpl.close(ServerCallImpl.java:213)
which calls
io.grpc.internal.ServerCallImpl.closeInternal(ServerCallImpl.java:227)
which calls
io.grpc.internal.AbstractServerStream.close(AbstractServerStream.java:133)
which calls
addStatusToTrailers
which mutates the Metadata object.
The Metadata object mutated here is the Metadata object stored in the StatusRuntimeException that we passed to onError. Who owns that instance?
What actually happens in our code is that we make outgoing gRPC calls to other services which fail with a StatusRuntimeException, and then we use the same instance to reply to multiple incoming gRPC calls on different threads. When we see the concurrent mutation of the Metadata object, that's because the onError call unconditionally mutates the Metadata, which is now shared across multiple threads.
The code is racing with itself, and this happens even when the application code never sets or modifies Metadata objects. This is not great.
Personally, I think it's surprising and problematic to mutate the Metadata that's attached to a SRE in the onError call. I would prefer StatusRuntimeException to be effectively immutable.
At a minimum, it would be good to explicitly document that StatusRuntimeException is not thread-safe and should not be shared across threads (unfortunately, we end up storing it in Future objects, which is difficult to fix on our side). This would - at least - have pointed us in the right direction earlier.
Other options:
- Make
SREimmutable by copying the metadata whenever it's returned fromgetTrailers, but this would have performance implications. - Make
onErrornot mutate the passed-inMetadataobject:- Copy the
trailersif set (performance?). - Don't update the
Metadataif the status fields match the correspondingStatus(not a complete solution).
- Copy the
- Make outgoing client calls throw
SREwithout aMetadataobject; I believe that the data is currently duplicated via theMetadataand theStatus, even if no other metadata is set (not a complete solution).