ReflectorRunnable.watchHandler seems to be stuck after about 5min of inactivity #1578
Comments
What version of the client are you using? Have a look at the discussion starting here: You should try setting the read timeout to < 5 minutes and things should work. Basically, something somewhere in the network has disconnected, but it takes a long time for the socket to time out.
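As a rough illustration of why a finite read timeout helps, here is a minimal sketch that uses only plain java.net sockets, not the Kubernetes client or its HTTP layer. The class and method names (ReadTimeoutDemo, readTimesOut) are hypothetical and exist only for this example. A local server accepts the connection and then stays silent, which stands in for a watch connection that has gone quiet or been dropped somewhere in the network.

```java
import java.io.IOException;
import java.io.InputStream;
import java.net.InetAddress;
import java.net.ServerSocket;
import java.net.Socket;
import java.net.SocketTimeoutException;

// Hypothetical demo class; not part of the Kubernetes Java client.
class ReadTimeoutDemo {

    // Connects to a local server that accepts the connection but never sends data,
    // then attempts a single read with the given timeout.
    // Returns true if the read was aborted by the timeout.
    static boolean readTimesOut(int readTimeoutMillis) throws IOException {
        try (ServerSocket silentServer = new ServerSocket(0, 1, InetAddress.getLoopbackAddress());
             Socket client = new Socket(silentServer.getInetAddress(), silentServer.getLocalPort());
             Socket accepted = silentServer.accept()) {
            // A value of 0 would mean "wait forever"; a positive value bounds the wait.
            client.setSoTimeout(readTimeoutMillis);
            InputStream in = client.getInputStream();
            try {
                in.read(); // blocks, because the peer never writes anything
                return false;
            } catch (SocketTimeoutException e) {
                // The caller now knows the connection is idle and can reconnect.
                return true;
            }
        }
    }

    public static void main(String[] args) throws IOException {
        System.out.println("read timed out: " + readTimesOut(200));
    }
}
```

With a timeout of 0 the same read would block indefinitely, which matches the "stuck" symptom described in this issue. How the timeout is actually configured on the Kubernetes client is a separate matter and is not shown here.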
Thanks for the information. It looks like a similar issue has been faced by others and in other clients (js, c#, go); I should search under closed issues too. I'm using the 10.0.0 java client. From the discussion in #1370, it seems the fix was made in #1498. However, I don't see fix #1498 in the latest release, 11.0.1. It seems releases are cherry-picks from the master branch. Is there a way to list this fix, and maybe others, for the next release? Should I open an issue to track this? Also, the ControllerExample should be fixed; I will send out a PR for this. java/examples/examples-release-10/src/main/java/io/kubernetes/client/examples/ControllerExample.java Line 49 in 8e89844
Meanwhile, while we figure out the release, I see a couple of workaround options below. Any preference for which is the better option?
We can see about cherry-picking that in. Rather than opening an issue, if you wanted to send a PR with the cherry-pick, that's the fastest way to get it done. Personally, I would set a short timeout, because I don't trust the network, but either one should work.
Setting the read timeout to < 5 minutes but > 0, i.e. solution 1, didn't help. I thought this would lead to an I/O timeout exception, but it didn't. I'll investigate this. Solution 2 does work.
Oh, I see: this commit overwrites the read timeout I set. Once I hacked the code to remove the overwrite of the read timeout to zero, it works fine. Next, I'm trying to figure out why the above commit was made.
Yeah, I don't think we should be overriding the read timeout. I think there was a belief that it was better to try to keep the connection open forever, even when there is no data, but I think that having the option to time out is better for user control... @yue9944882 wdyt?
Another data point: I also saw the exact same issue happening on GKE 1.18. It was fixed by option 2. Thanks @karunasagark. Would it be possible to fix the read timeout overwrite in the next release, or to backport the fix to the existing ones?
@yue9944882 can you take a look at Brendan's comment and provide your suggestion? I can make the changes accordingly.
Also, #1588 seems to be making things more complex. SharedInformerFactory.java overrides the read timeout to 0 (infinite) and, depending on which sharedIndexInformerFor overload is used, the watch call uses either a read timeout of 0 or the one specified by the reflectorRunnable.
Overload that would use the read timeout specified by reflectorRunnable: java/util/src/main/java/io/kubernetes/client/informer/SharedInformerFactory.java Line 211 in c580990
Overload that would NOT use the read timeout specified by reflectorRunnable: java/util/src/main/java/io/kubernetes/client/informer/SharedInformerFactory.java Line 259 in c580990
Further, it gets harder to reason about the value of the read timeout without looking through the code path carefully. I would propose removing the read timeout override originally introduced in this commit. However, this would probably expose the memory leak discussed in #1259, which needs to be addressed differently (possibly by de-duping). I'm guessing the memory leak would also be exposed by #1588, since the timeout is 5min and reflectorRunnable would re-list and re-watch. We will probably need to spend more time understanding the root cause of the memory leak. @yue9944882 and @brendanburns, what do you think? Should we spend more time addressing the memory leak in a different way, or patch all watch code paths further with a 5min timeout?
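For readers following along, the re-list and re-watch behaviour after a read timeout can be sketched roughly as below. This is only an illustration under stated assumptions: WatchStream, WatchOpener and collect are hypothetical names made up for this example and are not the client's API, and the real ReflectorRunnable logic (resource versions, error handling, the memory leak question) is not modelled.

```java
import java.net.SocketTimeoutException;
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch; the interfaces below are illustrative, not the client's API.
class RewatchLoopSketch {

    // One open watch connection: blocks for the next event, or times out when idle.
    interface WatchStream {
        String nextEvent() throws SocketTimeoutException;
    }

    // Stands in for a fresh list + watch; returns the newly opened stream.
    interface WatchOpener {
        WatchStream open();
    }

    // Collects up to maxEvents events, opening a new watch whenever a read times out.
    // Gives up after maxOpens connections so that the sketch always terminates.
    static List<String> collect(WatchOpener opener, int maxEvents, int maxOpens) {
        List<String> events = new ArrayList<>();
        for (int opens = 0; opens < maxOpens && events.size() < maxEvents; opens++) {
            WatchStream stream = opener.open();
            try {
                while (events.size() < maxEvents) {
                    events.add(stream.nextEvent());
                }
            } catch (SocketTimeoutException idle) {
                // The connection may be dead or merely quiet; either way, start over.
            }
        }
        return events;
    }

    public static void main(String[] args) {
        int[] opens = {0};
        // Fake opener: the first stream times out, later streams deliver events.
        WatchOpener opener = () -> {
            int attempt = ++opens[0];
            return () -> {
                if (attempt == 1) {
                    throw new SocketTimeoutException("idle");
                }
                return "ADDED";
            };
        };
        System.out.println(collect(opener, 2, 5) + " after " + opens[0] + " opens");
    }
}
```

The point of the sketch is that a finite timeout makes every idle period trigger a new list and watch, so any per-connection cost (such as the leak discussed in #1259) is paid again on each reconnect.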
IIUC, the de-duping approach would also solve #1634, as mentioned by @brendanburns in this comment.
Issues go stale after 90d of inactivity. If this issue is safe to close now please do so with Send feedback to sig-contributor-experience at kubernetes/community. |
The Kubernetes project currently lacks enough active contributors to adequately respond to all issues and PRs. This bot triages issues and PRs according to the following rules:
You can:
Please send feedback to sig-contributor-experience at kubernetes/community. /lifecycle rotten |
The Kubernetes project currently lacks enough active contributors to adequately respond to all issues and PRs. This bot triages issues and PRs according to the following rules:
You can:
Please send feedback to sig-contributor-experience at kubernetes/community. /close |
@k8s-triage-robot: Closing this issue. In response to this:
I'm working on a custom controller where I noticed this issue. After about 5min of inactivity, i.e. no updates in the API server, the controller stops receiving watch events from the API server. So any add/delete on the custom resource after the 5min of inactivity is not reconciled by the controller.
On further debugging, what I've seen is that ReflectorRunnable.watchHandler, which should be constantly reading from the response stream, seems to be stuck. The "Receiving resourceVersion" log from ReflectorRunnable doesn't show up. Also, kubectl get <customresource> -w sees the updates as expected.
The minimal repo is here: kubetest.zip, and I'm able to reproduce this consistently. I tried running this using both java 8 and java 11, with the same result as above. I got a jstack dump when the issue occurred, but couldn't spot any problem; the reflector thread was in the runnable state.