blockingUnaryCall hangs forever if retries are enabled #10838

@turchenkoalex

Description

What version of gRPC-Java are you using?

1.61.0 with grpc-netty-shaded

What is your environment?

jdk 17, linux

What did you expect to see?

A call to blockingUnaryCall should not block forever.

What did you see instead?

The calling thread hangs even when a deadline is set.

jstack output showing the hang:

  "main" #1 [8707] prio=5 os_prio=31 cpu=207.44ms elapsed=16.32s tid=0x0000000132808200 nid=8707 waiting on condition  [0x000000016fcda000]
     java.lang.Thread.State: WAITING (parking)
  	at jdk.internal.misc.Unsafe.park(java.base/Native Method)
  	- parking to wait for  <0x000000061e400010> (a io.grpc.stub.ClientCalls$ThreadlessExecutor)
  	at java.util.concurrent.locks.LockSupport.park(java.base/LockSupport.java:221)
  	at io.grpc.stub.ClientCalls$ThreadlessExecutor.waitAndDrain(ClientCalls.java:717)
  	at io.grpc.stub.ClientCalls.blockingUnaryCall(ClientCalls.java:159)
  	at com.example.ExampleServiceGrpc$ExampleServiceBlockingStub.unaryCall(ExampleServiceGrpc.java:160)
  	at com.example.TestClient.call(TestClient.java:70)
  	at com.example.TestClient.callWithError(TestClient.java:53)
  	at com.example.Main.main(Main.java:52)

Steps to reproduce the bug

I'm using a custom service config with retries enabled. The gRPC blocking call is made with a deadline.

If the deadline expires while the call is waiting out the backoff before a retry, the call hangs.

    Duration deadline = Duration.ofSeconds(1);
    Map<String, ?> serviceConfig = Map.of("methodConfig",
            List.of(
                    Map.of(
                            "name", List.of(Map.of()),
                            "retryPolicy", Map.of(
                                    "maxAttempts", 4D,
                                    "initialBackoff", "10s",
                                    "maxBackoff", "10s",
                                    "backoffMultiplier", 1D,
                                    "retryableStatusCodes", List.of("UNAVAILABLE", "UNKNOWN")
                            )
                    )
            )
    );

    var channel = ManagedChannelBuilder.forAddress("localhost", port)
                .usePlaintext()
                .enableRetry()
                .disableServiceConfigLookUp()
                .defaultServiceConfig(serviceConfig)
                .build();
    
    var blockingStub = ExampleServiceGrpc.newBlockingStub(channel);

    // If the first call to grpc returns UNKNOWN or the server is unavailable, a hang will occur.
    blockingStub
        .withDeadlineAfter(deadline.toMillis(), TimeUnit.MILLISECONDS)
        .unaryCall(req);
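
The 10s backoff in this config is what makes the 1s deadline so likely to fire while a retry is still pending. The sketch below is a rough estimate, assuming the backoff before a retry is drawn uniformly from [0, cap) with cap = min(initialBackoff * backoffMultiplier^(n-1), maxBackoff) as described in gRFC A6, and ignoring the time the first attempt itself takes. The class and method names are hypothetical and not part of gRPC.

```java
import java.time.Duration;

public class BackoffWindow {
  /** Upper bound of the backoff before retry number `retry` (1-based), per the assumed gRFC A6 formula. */
  static Duration backoffCap(Duration initial, Duration max, double multiplier, int retry) {
    double capNanos = Math.min(initial.toNanos() * Math.pow(multiplier, retry - 1), max.toNanos());
    return Duration.ofNanos((long) capNanos);
  }

  /** Chance that a uniform draw from [0, cap) lands at or after the remaining deadline. */
  static double chanceRetryOutlivesDeadline(Duration remainingDeadline, Duration cap) {
    if (cap.isZero()) {
      return 0.0;
    }
    double ratio = (double) remainingDeadline.toNanos() / cap.toNanos();
    return Math.max(0.0, 1.0 - ratio);
  }

  public static void main(String[] args) {
    // Values taken from the service config and deadline in this report.
    Duration cap = backoffCap(Duration.ofSeconds(10), Duration.ofSeconds(10), 1.0, 1);
    double chance = chanceRetryOutlivesDeadline(Duration.ofSeconds(1), cap);
    System.out.println("backoff cap = " + cap + ", chance deadline fires first = " + chance);
  }
}
```

Under these assumptions, roughly nine out of ten failed first attempts would leave a retry scheduled past the deadline, which is the window where the hang is reported.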

Because the bug is difficult to reproduce, I've prepared a reproducing example in a separate repository:
https://github.com/turchenkoalex/grpc-hang

I tested this behavior with several versions of the gRPC library and found that the issue first appears in version 1.52.0.

As far as I understand, the problem is related to the substream counter inFlightSubStreams in io.grpc.internal.RetriableStream.

When DeadlineTimer calls the cancel method on RetriableStream, the retryFuture is also cancelled, but the inFlightSubStreams counter is not decremented. Because the counter never reaches zero, the safeCloseMasterListener method does not close the master listener, and the blocking call is never notified that it has finished.

I tried doing an explicit decrement on retryFuture.cancel and testing the new behavior. The hang is gone.

Here's the diff:

diff --git a/core/src/main/java/io/grpc/internal/RetriableStream.java b/core/src/main/java/io/grpc/internal/RetriableStream.java
index f301eee1f..07fe9e764 100644
--- a/core/src/main/java/io/grpc/internal/RetriableStream.java
+++ b/core/src/main/java/io/grpc/internal/RetriableStream.java
@@ -195,7 +195,10 @@ abstract class RetriableStream<ReqT> implements ClientStream {
             }
           }
           if (retryFuture != null) {
-            retryFuture.cancel(false);
+            boolean cancelled = retryFuture.cancel(false);
+            if (cancelled) {
+              inFlightSubStreams.decrementAndGet();
+            }
           }
           if (hedgingFuture != null) {
             hedgingFuture.cancel(false);

But maybe I'm missing something.
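
The suspected mechanism can be illustrated with a small stand-alone model. This is not the gRPC code: `Model`, `scheduleRetry`, and the `decrementOnCancel` flag are hypothetical names, and the model assumes that a scheduled retry holds one slot in `inFlightSubStreams` and that the master listener is closed only when the counter is zero. It shows only the counting logic described in this report, not the real threading behavior of RetriableStream.

```java
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.ScheduledFuture;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicBoolean;
import java.util.concurrent.atomic.AtomicInteger;

public class RetryCounterModel {
  /** Simplified, hypothetical stand-in for the bookkeeping described in the report. */
  static final class Model {
    final AtomicInteger inFlightSubStreams = new AtomicInteger();
    final AtomicBoolean masterListenerClosed = new AtomicBoolean();
    ScheduledFuture<?> retryFuture;

    // Assumption: a pending retry keeps one in-flight slot until it runs.
    void scheduleRetry(ScheduledExecutorService scheduler, long delaySeconds) {
      inFlightSubStreams.incrementAndGet();
      retryFuture = scheduler.schedule(() -> { }, delaySeconds, TimeUnit.SECONDS);
    }

    // Models the deadline timer cancelling the call while the retry is pending.
    void cancel(boolean decrementOnCancel) {
      if (retryFuture != null) {
        boolean cancelled = retryFuture.cancel(false);
        if (cancelled && decrementOnCancel) {
          inFlightSubStreams.decrementAndGet();
        }
      }
      safeCloseMasterListener();
    }

    // Closes the master listener only if nothing is counted as in flight.
    void safeCloseMasterListener() {
      if (inFlightSubStreams.compareAndSet(0, Integer.MIN_VALUE)) {
        masterListenerClosed.set(true);
      }
    }
  }

  /** Returns whether the master listener was closed after cancelling a pending retry. */
  static boolean run(boolean decrementOnCancel) {
    ScheduledExecutorService scheduler = Executors.newSingleThreadScheduledExecutor();
    try {
      Model model = new Model();
      model.scheduleRetry(scheduler, 10);
      model.cancel(decrementOnCancel);
      return model.masterListenerClosed.get();
    } finally {
      scheduler.shutdownNow();
    }
  }

  public static void main(String[] args) {
    System.out.println("without decrement: closed=" + run(false));
    System.out.println("with decrement:    closed=" + run(true));
  }
}
```

In this model the listener stays open when the cancelled retry's slot is never released, and closes once the decrement from the diff is applied. Whether the real class has other paths that release the slot is exactly the open question above.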
