CI: frequency of hitting timeout/network errors has significantly increased recently #482


Closed
leofang opened this issue Mar 3, 2025 · 3 comments
Labels: CI/CD (CI/CD infrastructure), P0 (High priority - Must do!), triage (Needs the team's attention)

Comments


leofang commented Mar 3, 2025

This can happen during

leofang added the CI/CD (CI/CD infrastructure) and triage (Needs the team's attention) labels on Mar 3, 2025

leofang commented Mar 25, 2025


rwgk commented Mar 26, 2025

xref: https://github.com/NVIDIA/cuda-python/actions/runs/14087083558/job/39461464660?pr=503

It took 4 reruns before all tests passed.

The current situation is quite disruptive, especially when I need to weed out real failures: the transient network errors act as decoys.
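
One stopgap for these manual reruns would be to retry the flaky, network-bound step automatically. Below is a minimal Python sketch under that assumption; run_with_retries, its parameters, and the pytest invocation are hypothetical and not part of the cuda-python CI.

```python
import subprocess
import time

def run_with_retries(cmd, max_attempts=4, base_delay=5.0):
    """Run a command, retrying on a nonzero exit with exponential backoff."""
    for attempt in range(1, max_attempts + 1):
        result = subprocess.run(cmd)
        if result.returncode == 0:
            return result
        if attempt < max_attempts:
            delay = base_delay * 2 ** (attempt - 1)
            print(f"attempt {attempt} failed (exit {result.returncode}); retrying in {delay:.0f}s")
            time.sleep(delay)
    raise RuntimeError(f"{cmd!r} still failing after {max_attempts} attempts")

# Example: wrap the network-bound test invocation that keeps timing out.
run_with_retries(["pytest", "tests/", "-ra"])
```

Exponential backoff keeps retries cheap when the network recovers quickly while still giving longer outages time to clear.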

leofang added the P0 (High priority - Must do!) label on Mar 26, 2025

leofang commented Apr 22, 2025

We haven't observed any network issues lately! According to @ajschmidt8:

Most likely moving the V100s from RDS Lab to NVKS resolved the network issues.
The NVKS cluster is in a different networking environment that seems much more stable than RDS Lab. Hopefully it stays that way!

leofang closed this as completed on Apr 22, 2025