Skip to content

ReflectorRunnable.watchHandler seems to be stuck after about 5min of inactivity #1578

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Closed
karunasagark opened this issue Mar 6, 2021 · 14 comments
Labels
lifecycle/rotten Denotes an issue or PR that has aged beyond stale and will be auto-closed.

Comments

@karunasagark
Copy link

karunasagark commented Mar 6, 2021

I'm working on custom controller where I noticed this issue. After about 5min of inactivity i.e. no updates in the api server, the controller stops receiving watch events from the api server. So any add/delete on the custom resource after the 5min of inactivity on api server, is not reconciled by the controller .

On further debugging, what I've seen is that ReflectorRunnable.watchHandler which should be constantly reading from the response stream seems to be stuck. The Receiving resourceVersion log from ReflectorRunnable doesn't show up. Also, kubectl get <customresource> -w sees the updates as expected.

The minimal repo is here kubetest.zip and I'm able to reproduce this consistently. I tried running this using both java 8 and java 11, with the same result as above. I got a jstack dump when the issue occurred, but couldn't notice any issue, the reflector thread was in runnable state.

"controller-reflector-com.XXXXX-0" #23 prio=5 os_prio=0 cpu=39.77ms elapsed=986.61s allocated=1080K defined_classes=39 tid=0x00007f19b856c800 nid=0x1bbe9 runnable  [0x00007f19b5ff2000]
   java.lang.Thread.State: RUNNABLE
        at java.net.SocketInputStream.socketRead0([email protected]/Native Method)
        at java.net.SocketInputStream.socketRead([email protected]/SocketInputStream.java:115)
        at java.net.SocketInputStream.read([email protected]/SocketInputStream.java:168)
        at java.net.SocketInputStream.read([email protected]/SocketInputStream.java:140)
        at sun.security.ssl.SSLSocketInputRecord.read([email protected]/SSLSocketInputRecord.java:478)
        at sun.security.ssl.SSLSocketInputRecord.readHeader([email protected]/SSLSocketInputRecord.java:472)
        at sun.security.ssl.SSLSocketInputRecord.bytesInCompletePacket([email protected]/SSLSocketInputRecord.java:70)
        at sun.security.ssl.SSLSocketImpl.readApplicationRecord([email protected]/SSLSocketImpl.java:1354)
        at sun.security.ssl.SSLSocketImpl$AppInputStream.read([email protected]/SSLSocketImpl.java:963)
        at okio.Okio$2.read(Okio.java:140)
        at okio.AsyncTimeout$2.read(AsyncTimeout.java:237)
        at okio.RealBufferedSource.request(RealBufferedSource.java:72)
        at okio.RealBufferedSource.require(RealBufferedSource.java:65)
        at okio.RealBufferedSource.readHexadecimalUnsignedLong(RealBufferedSource.java:307)
        at okhttp3.internal.http1.Http1ExchangeCodec$ChunkedSource.readChunkSize(Http1ExchangeCodec.java:492)
        at okhttp3.internal.http1.Http1ExchangeCodec$ChunkedSource.read(Http1ExchangeCodec.java:471)
        at okhttp3.internal.connection.Exchange$ResponseBodySource.read(Exchange.java:286)
        at okio.RealBufferedSource.exhausted(RealBufferedSource.java:61)
        at io.kubernetes.client.util.Watch.hasNext(Watch.java:179)
        at io.kubernetes.client.informer.cache.ReflectorRunnable.watchHandler(ReflectorRunnable.java:160)
        at io.kubernetes.client.informer.cache.ReflectorRunnable.run(ReflectorRunnable.java:108)
        at io.kubernetes.client.informer.cache.Controller$$Lambda$126/0x000000080069a440.run(Unknown Source)
        at java.util.concurrent.Executors$RunnableAdapter.call([email protected]/Executors.java:515)
        at java.util.concurrent.FutureTask.runAndReset([email protected]/FutureTask.java:305)
        at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run([email protected]/ScheduledThreadPoolExecutor.java:305)
        at java.util.concurrent.ThreadPoolExecutor.runWorker([email protected]/ThreadPoolExecutor.java:1128)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run([email protected]/ThreadPoolExecutor.java:628)
        at java.lang.Thread.run([email protected]/Thread.java:834)

   Locked ownable synchronizers:
        - <0x00000007655823d8> (a java.util.concurrent.locks.ReentrantLock$NonfairSync)
        - <0x00000007656801a0> (a java.util.concurrent.ThreadPoolExecutor$Worker)
@brendandburns
Copy link
Contributor

What version of the client are you using?

Have a look at the discussion starting here:

#1370 (comment)

You should try setting the read timeout to < 5 minutes and things should work.

Basically, something in the network somewhere is disconnected, but it takes a long time for the socket to timeout.

@karunasagark
Copy link
Author

karunasagark commented Mar 7, 2021

Thanks for the information. Looks like similar issue has been faced by others and by other clients (js, c#, go); I should search under closed issues too.

I'm using 10.0.0 java client. And from the discussion in #1370, it seems the fix was made in #1498. However, don't see this fix #1498 in the latest release 11.0.1.

It seems releases are cherry picks from master branch. Is there a way to list this fix and maybe others for the next release? Should I open an issue to track this?

Also, should fix the ControllerExample. Will send out a PR for this.

apiClient.getHttpClient().newBuilder().readTimeout(0, TimeUnit.SECONDS).build();

Meanwhile, while we figure out the release, I see couple of options to workaround below. Any preference which is better option?

  1. as suggested in earlier comment, set read timeout to < 5 minutes. This should timeout the watch connection and the ReflectorRunnable should setup a new watch connection -

  2. by modifying the http client as shown here - kubernetes watch API is behaving oddly #1370 (comment). This uses infinite read timeout, but uses keepalive to keep TCP connection active.

@brendandburns
Copy link
Contributor

We can see about cherry-picking that in. Rather than opening an issue, if you wanted to send a PR with the cherry-pick that's the fastest way to get it done.

Personally, I would set a short timeout, because I don't trust the network, but either one should work.

@karunasagark
Copy link
Author

Setting the read timeout to < 5 minutes but > 0 i.e. solution 1, didn't help. I thought this will lead to a io timed out exception but it didn't. I'll investigate this.

Solution 2 does work.

@karunasagark
Copy link
Author

Oh I see this commit overwrites the read timeout I set.

Once I hacked in to remove the overwrite of read timeout to zero, it does work fine. So next, trying to figure out why the above commit was made.

@brendandburns
Copy link
Contributor

Yeah, I don't think we should be overriding the read timeout. I think there was a belief that it was better to try to keep the connection open for forever, even when there is no data, but I think that having the option to timeout is better for user control...

@yue9944882 wdyt?

@codeyq
Copy link

codeyq commented Mar 29, 2021

Another data point. I also saw the exact same issue happening on GKE 1.18. It was fixed by option 2. Thanks @karunasagark Would it be possible to fix the read timeout overwrite in the next release or be back ported to the existing ones?

@karunasagark
Copy link
Author

@yue9944882 can take a look at Brendan's comment and provide your suggestion? I can make the changes accordingly.

@karunasagark
Copy link
Author

Also, #1588 seems to be making things more complex. The SharedInformerFactory.java overrides the read timeout to 0 (infinite) and depending on the sharedIndexInformerFor overload used, the watch call either uses 0 read timeout or what is specified from the reflectorRunnable.

Overload that would use the read timeout specified by reflectorRunnable:

public Watch<ApiType> watch(CallGeneratorParams params) throws ApiException {

Overload that would NOT use the read timeout specified by reflectorRunnable:

public Watchable<ApiType> watch(CallGeneratorParams params) throws ApiException {

Further it gets harder to reason about what is the value of read timeout without looking through the code path carefully.

I would propose to remove the read timeout override originally introduced in this commit. However, this would probably expose out the memory leak as discussed in #1259 which needs to be addressed differently (possibly by de-deuping). I'm guessing the memory leak would also be exposed by #1588, since the timeout is 5min and reflectorRunnable would re-list and re-watch. Probably will need to spend more time to understand the root cause of the memory leak.

@yue9944882 and @brendanburns what do you guys think? should we spend more time addressing the memory leak in a different way OR patch all watch code paths further will 5min timeout?

@karunasagark
Copy link
Author

IIUC, the de-duping approach would also solve #1634 as mentioned by @brendanburns in this comment.

@fejta-bot
Copy link

Issues go stale after 90d of inactivity.
Mark the issue as fresh with /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.

If this issue is safe to close now please do so with /close.

Send feedback to sig-contributor-experience at kubernetes/community.
/lifecycle stale

@k8s-ci-robot k8s-ci-robot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Jul 13, 2021
@k8s-triage-robot
Copy link

The Kubernetes project currently lacks enough active contributors to adequately respond to all issues and PRs.

This bot triages issues and PRs according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue or PR as fresh with /remove-lifecycle rotten
  • Close this issue or PR with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle rotten

@k8s-ci-robot k8s-ci-robot added lifecycle/rotten Denotes an issue or PR that has aged beyond stale and will be auto-closed. and removed lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. labels Aug 12, 2021
@k8s-triage-robot
Copy link

The Kubernetes project currently lacks enough active contributors to adequately respond to all issues and PRs.

This bot triages issues and PRs according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Reopen this issue or PR with /reopen
  • Mark this issue or PR as fresh with /remove-lifecycle rotten
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/close

@k8s-ci-robot
Copy link
Contributor

@k8s-triage-robot: Closing this issue.

In response to this:

The Kubernetes project currently lacks enough active contributors to adequately respond to all issues and PRs.

This bot triages issues and PRs according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Reopen this issue or PR with /reopen
  • Mark this issue or PR as fresh with /remove-lifecycle rotten
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/close

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
lifecycle/rotten Denotes an issue or PR that has aged beyond stale and will be auto-closed.
Projects
None yet
Development

No branches or pull requests

6 participants