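fix: when the broker cluster is down and the client refreshes its connections, if the connections throw their exceptions at different times, the URIs of connections that have not yet thrown cannot be added to unHealth… #1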

fanpipi · 2024-07-10T05:23:51Z

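Version:
1.1.5

Deployment:
A three-node cluster deployed in Gossip broadcast mode.

Symptom:
After all three broker nodes are shut down for a while and then restarted, some clients fail to automatically re-register with all three broker nodes.

Affected class:
com.alibaba.rsocket.loadbalance.LoadBalancedRSocket

Root cause:
After all nodes are shut down, the last active connection held by the client triggers onRSocketClosed, which calls refreshRsockets with the original three node URIs. Each of these three connections throws a connection-related exception during the connect or healthcheck stage, which triggers the subsequent exception-handling flow.
However, if the remaining connections have not yet thrown their own exceptions and run their own handling by the time the first connection's exception-handling flow completes, they are cancelled by that flow and their URIs are never added to unHealthyUriSet. As a result, checkUnhealthyUris cannot reconnect those failed connections.

Fix approach:
In refreshRsockets, catch exceptions raised during the connect and healthcheck stages instead of rethrowing them, and return an empty stream.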
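The idea of isolating each connection's failure can be sketched as follows. This is a minimal illustration under stated assumptions, not the code changed in this PR: it uses `CompletableFuture` from the JDK as a stand-in for the reactive types in LoadBalancedRSocket, and the class `RefreshSketch`, its fields, and the `connector` parameter are hypothetical names introduced only for this example.

```java
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ConcurrentHashMap;
import java.util.function.Function;

/** Hypothetical stand-in for the client-side refresh logic; not the real LoadBalancedRSocket. */
class RefreshSketch {
    /** URIs whose connect or health check failed; a periodic task could retry them later. */
    final Set<String> unhealthyUris = ConcurrentHashMap.newKeySet();
    /** URIs that connected successfully, mapped to a placeholder connection handle. */
    final Map<String, String> activeConnections = new ConcurrentHashMap<>();

    /**
     * Attempts every URI. A failure on one URI is recorded and swallowed, so it
     * never cancels or hides the outcome of the remaining attempts.
     */
    CompletableFuture<Void> refresh(List<String> uris,
                                    Function<String, CompletableFuture<String>> connector) {
        CompletableFuture<?>[] attempts = uris.stream()
                .map(uri -> connect(uri, connector)
                        .handle((connection, error) -> {
                            if (error != null) {
                                // Record the failed URI instead of propagating the error.
                                unhealthyUris.add(uri);
                            } else {
                                activeConnections.put(uri, connection);
                                unhealthyUris.remove(uri);
                            }
                            return null;
                        }))
                .toArray(CompletableFuture[]::new);
        // Completes normally once every attempt has finished, whether it failed or not.
        return CompletableFuture.allOf(attempts);
    }

    /** Converts a synchronous exception from the connector into a failed future. */
    private static CompletableFuture<String> connect(
            String uri, Function<String, CompletableFuture<String>> connector) {
        try {
            return connector.apply(uri);
        } catch (RuntimeException e) {
            return CompletableFuture.failedFuture(e);
        }
    }
}
```

With three URIs of which only one connects, the other two end up in `unhealthyUris` no matter which attempt fails first, which is the property the retry check depends on.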

…yUriSet, so that URI can never be reconnected.

…toring to fail.

minh-tn-hust · 2025-05-27T11:11:59Z

Hi @fanpipi, I have a similar issue with reconnection in a Kubernetes (K8S) environment. Currently, the client connects to a broker pod. When scaling down, the pod is forcefully terminated without a proper shutdown, causing the health check to return a TimeoutException. This exception is not handled anywhere, so the client remains disconnected even though another broker pod is available. Does your solution also address this issue?

fanpipi · 2025-06-10T07:44:52Z

> Hi @fanpipi, I have a similar issue with reconnection in a Kubernetes (K8S) environment. Currently, the client connects to a broker pod. When scaling down, the pod is forcefully terminated without a proper shutdown, causing the health check to return a TimeoutException. This exception is not handled anywhere, so the client remains disconnected even though another broker pod is available. Does your solution also address this issue?

Hi @minh-tn-hust, it looks like your problem involves the broker failing to reconnect after an abnormal recovery, so this PR could potentially address the issue.

minh-tn-hust · 2025-06-10T09:27:40Z

Hi @fanpipi, my issue occurs when a client tries to connect to a broker after a scale-up operation in a Kubernetes environment. Existing brokers emit consistent CloudEvents to the client, but a newly created broker emits two events:

The first is triggered when ClusterDiscovery starts and queries Kubernetes to discover all available brokers.
The second is sent once the new broker finishes its initialization and joins the cluster.

This causes the client to receive inconsistent CloudEvents. The problem is made worse because "RefuseConnection" events are not handled correctly and the disconnection process is asynchronous. As a result, by the time the new CloudEvent arrives, the client may have already missed the chance to connect properly to the new broker. In my testing, the following update to RSocketBrokerManagerDiscoveryImpl.java fixed the issue:

+private boolean isFinishStartUp = false;

public RSocketBrokerManagerDiscoveryImpl(ReactiveDiscoveryClient discoveryClient, RSocketBrokerProperties properties) {
        this.SERVICE_NAME = properties.getDiscovery().getServiceName();
        this.REFRESH_INTERVAL_SECONDS = properties.getDiscovery().getRefreshInterval();
        this.discoveryClient = discoveryClient;
        this.brokersFresher = Flux.interval(Duration.ofSeconds(REFRESH_INTERVAL_SECONDS))
                .flatMap(aLong -> this.discoveryClient.getInstances(SERVICE_NAME).collectList())
                .subscribe(serviceInstances -> {
                    boolean changed = serviceInstances.size() != currentBrokers.size();
                    for (ServiceInstance serviceInstance : serviceInstances) {
                        if (!currentBrokers.containsKey(serviceInstance.getHost())) {
                            changed = true;
                        }

+                       // startup is finished once this broker's own pod shows up in discovery
+                       if (serviceInstance.getHost().equals(NetworkUtil.LOCAL_IP)) {
+                           isFinishStartUp = true;
+                       }
+                   }

+                   // only emit the brokers-changed event after this broker has fully started
+                   if (changed && isFinishStartUp) {
                        currentBrokers = serviceInstances.stream().map(serviceInstance -> {
                            RSocketBroker broker = new RSocketBroker();
                            broker.setIp(serviceInstance.getHost());
                            broker.setSchema(properties.getSchema());
                            broker.setPort(properties.getPortHealthCheck());
                            return broker;
                        }).collect(Collectors.toMap(RSocketBroker::getIp, Function.identity()));
                        log.info(RsocketErrorCode.message("RST-300206", String.join(",", currentBrokers.keySet())));
                        brokersEmitterProcessor.tryEmitNext(currentBrokers.values());
                    }
                }, error -> log.error("Broker refresher encounter an error: ", error));
    }

fanpipi added 2 commits July 10, 2024 11:23

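fix: when the broker cluster is down and the client refreshes its connections, if the connections throw their exceptions at different times, the URIs of connections that have not yet thrown cannot be added to unHealth…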

e31832c

…yUriSet, so that URI can never be reconnected.

fix: after the client is disconnected from the broker cluster, cluster monitoring does not handle NoAvailableConnectionException, causing cluster moni…

a58fb47

…toring to fail.
