Kubernetes Namespace pod watch can hang sometimes #1134


Closed
gunpuz opened this issue Dec 14, 2022 · 13 comments
Labels
lifecycle/rotten Denotes an issue or PR that has aged beyond stale and will be auto-closed.

Comments

@gunpuz

gunpuz commented Dec 14, 2022

Describe the bug

It seems that watch:

var podlistResp = kubeClient.CoreV1.ListNamespacedPodWithHttpMessagesAsync(configurationSettings.DeploymentNamespace, watch: true);

await foreach (var (type, item) in podlistResp.WatchAsync<V1Pod, V1PodList>())
{
   ...
}

can hang from time to time. There doesn't seem to be a clear scenario for how and when to reproduce it, though :(

Server Kubernetes Version
v1.24.3

Dotnet Runtime Version
net6

To Reproduce
It can hang "sometimes". It's not clear why.

KubeConfig

Default configuration

KubernetesClientConfiguration k8sConfig;
if (!_configurationSettings.UseKubeConfig)
{
    // Running inside a k8s cluster
    k8sConfig = KubernetesClientConfiguration.InClusterConfig();
}

Additional info

It could be related to this issue: #884

@tg123
Member

tg123 commented Dec 14, 2022

What do you mean by "hang"?

@gunpuz
Author

gunpuz commented Dec 14, 2022

I think that new pod events are not dispatched by the library in await foreach, but I don't know how to reproduce it. I would say that this happens rarely.

@tg123
Member

tg123 commented Dec 14, 2022

There is an onError param in WatchAsync<V1Pod, V1PodList>(); maybe something happened that you did not notice.
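
A minimal sketch of wiring up that callback might look like the following (this assumes the WatchAsync<V1Pod, V1PodList> overload that accepts an onError Action<Exception>; kubeClient and the namespace are the same objects as in the snippets above):

// Sketch only — requires using k8s; using k8s.Models;
// Assumes the WatchAsync overload with an onError callback, so watch
// failures are at least logged instead of being swallowed silently.
async Task WatchPodsAsync(IKubernetes kubeClient, string podNamespace)
{
    var podlistResp = kubeClient.CoreV1.ListNamespacedPodWithHttpMessagesAsync(
        podNamespace, watch: true);

    await foreach (var (type, item) in podlistResp.WatchAsync<V1Pod, V1PodList>(
        onError: ex => Console.WriteLine($"watch error: {ex}")))
    {
        Console.WriteLine($"{type}: {item.Metadata?.Name}");
    }
}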

@gunpuz
Author

gunpuz commented Dec 14, 2022

Should I use the error param? It seems that if the error parameter is not used, then the exception is thrown. I don't use the error param.

Should this work? I have something like this:

while (true)
{
    try
    {
        var podlistResp = kubeClient.CoreV1.ListNamespacedPodWithHttpMessagesAsync(configurationSettings.DeploymentNamespace, watch: true);

        await foreach (var (type, item) in podlistResp.WatchAsync<V1Pod, V1PodList>())
        {
            ...
        }
    }
    catch
    {
    }
}

@brendandburns
Contributor

The above code will not work properly. The reason is that you need to do an explicit List of the resources when the watch breaks.

There is a race where if a resource is created after the watch ends and before the next watch begins you will miss that resource.

The proper approach is:

while (true)
{
    var list = // list all resources here
    foreach (var item in list)
    {
        ...
    }
    // watch resources here
}
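
A slightly fuller sketch of that list-then-watch pattern (assuming the generated list methods accept a resourceVersion parameter; HandlePod is a hypothetical handler, and kubeClient / podNamespace come from the earlier snippets):

// Sketch only — list first, then watch from the list's resourceVersion so
// nothing created between the list and the watch is missed.
while (true)
{
    try
    {
        // 1. Explicit list: process everything that already exists.
        var list = await kubeClient.CoreV1.ListNamespacedPodAsync(podNamespace);
        foreach (var pod in list.Items)
        {
            HandlePod(WatchEventType.Added, pod);
        }

        // 2. Watch, starting from the resourceVersion the list returned.
        var watchResp = kubeClient.CoreV1.ListNamespacedPodWithHttpMessagesAsync(
            podNamespace,
            resourceVersion: list.Metadata.ResourceVersion,
            watch: true);

        await foreach (var (type, pod) in watchResp.WatchAsync<V1Pod, V1PodList>(
            onError: ex => Console.WriteLine($"watch error: {ex}")))
        {
            HandlePod(type, pod);
        }
    }
    catch (Exception ex)
    {
        // The watch broke (timeout, connection reset, resourceVersion too old, ...):
        // log, back off, then re-list and start a fresh watch.
        Console.WriteLine($"watch loop restarting: {ex.Message}");
        await Task.Delay(TimeSpan.FromSeconds(5));
    }
}

If the saved resourceVersion is too old, the API server rejects the watch with 410 Gone; re-listing on every iteration, as above, recovers from that case as well.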

@gunpuz
Author

gunpuz commented Dec 17, 2022

@brendandburns It seems that ListNamespacedPodWithHttpMessagesAsync already lists the existing resources and then streams back any new resources that are created.
Why do I need to list resources separately?

And in this example of yours, wouldn't you miss any new resources created between the list and watch calls, given that you think ListNamespacedPodWithHttpMessagesAsync does not list existing resources? Maybe I am missing something and you could provide a more detailed example with the methods you use? It seems like this option is bad?

The function name already says it: ListNamespacedPod + WithHttpMessages

var podlistResp = kubeClient.CoreV1.ListNamespacedPodWithHttpMessagesAsync(configurationSettings.DeploymentNamespace, watch: true);

@k8s-triage-robot

The Kubernetes project currently lacks enough contributors to adequately respond to all issues.

This bot triages un-triaged issues according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue as fresh with /remove-lifecycle stale
  • Close this issue with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

@k8s-ci-robot k8s-ci-robot added the lifecycle/stale label Mar 17, 2023
@Kruti-Joshi

@gunpuz, did you figure out how to stop this from happening? I have watch code that randomly stops receiving events and is stuck until I restart my service.

@tg123
Member

tg123 commented Mar 22, 2023

@Kruti-Joshi you can check whether this is the cause:
#1099

@gunpuz
Author

gunpuz commented Mar 25, 2023

@Kruti-Joshi, I think this issue went away when you "restart" the listener and listen again in a loop...

@k8s-triage-robot

The Kubernetes project currently lacks enough active contributors to adequately respond to all issues.

This bot triages un-triaged issues according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue as fresh with /remove-lifecycle rotten
  • Close this issue with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle rotten

@k8s-ci-robot k8s-ci-robot added the lifecycle/rotten label and removed the lifecycle/stale label Apr 24, 2023
@k8s-triage-robot

The Kubernetes project currently lacks enough active contributors to adequately respond to all issues and PRs.

This bot triages issues according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Reopen this issue with /reopen
  • Mark this issue as fresh with /remove-lifecycle rotten
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/close not-planned

@k8s-ci-robot k8s-ci-robot closed this as not planned May 24, 2023
@k8s-ci-robot
Contributor

@k8s-triage-robot: Closing this issue, marking it as "Not Planned".

In response to this:

The Kubernetes project currently lacks enough active contributors to adequately respond to all issues and PRs.

This bot triages issues according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Reopen this issue with /reopen
  • Mark this issue as fresh with /remove-lifecycle rotten
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/close not-planned

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
