CI: frequency of hitting timeout/network errors has significantly increased recently #482


Closed
leofang opened this issue Mar 3, 2025 · 3 comments
Labels: CI/CD (CI/CD infrastructure), P0 (High priority - Must do!), triage (Needs the team's attention)

Comments


leofang commented Mar 3, 2025

This can happen during

leofang added the CI/CD (CI/CD infrastructure) and triage (Needs the team's attention) labels on Mar 3, 2025

leofang commented Mar 25, 2025


rwgk commented Mar 26, 2025

xref: https://github.com/NVIDIA/cuda-python/actions/runs/14087083558/job/39461464660?pr=503

It took 4 reruns before all tests passed.

The current situation is quite disruptive, especially when I need to weed out real failures: the transient network errors act as decoys.
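
One stopgap for these manual reruns would be to retry the flaky, network-bound step automatically. Below is a minimal Python sketch under that assumption; run_with_retries, its parameters, and the pytest invocation are hypothetical and not part of the cuda-python CI.

```python
import subprocess
import time

def run_with_retries(cmd, max_attempts=4, base_delay=5.0):
    """Run a command, retrying on a nonzero exit with exponential backoff."""
    for attempt in range(1, max_attempts + 1):
        result = subprocess.run(cmd)
        if result.returncode == 0:
            return result
        if attempt < max_attempts:
            delay = base_delay * 2 ** (attempt - 1)
            print(f"attempt {attempt} failed (exit {result.returncode}); retrying in {delay:.0f}s")
            time.sleep(delay)
    raise RuntimeError(f"{cmd!r} still failing after {max_attempts} attempts")

# Example: wrap the network-bound test invocation that keeps timing out.
run_with_retries(["pytest", "tests/", "-ra"])
```

Exponential backoff keeps retries cheap when the network recovers quickly while still giving longer outages time to clear.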

leofang added the P0 (High priority - Must do!) label on Mar 26, 2025

leofang commented Apr 22, 2025

We haven't observed any network issues lately! According to @ajschmidt8:

Most likely moving the V100s from RDS Lab to NVKS resolved the network issues.
The NVKS cluster is in a different networking environment that seems much more stable than RDS Lab. Hopefully it stays that way!

leofang closed this as completed on Apr 22, 2025