@olethanh olethanh commented Mar 26, 2025

This PR fixes the failures that have recently been happening in the CI on the Digital Ocean Droplet VM and that prevented PRs from passing and being merged.

This is accomplished in two ways:

  1. Simplify the way we use GitHub CI.
  2. Fix the interactions between the CI and the Digital Ocean API.

Some minor improvements and documentation are also included.

Related ClickUp, GitHub or Jira tickets : Jira ALEPH-499

Changes

United workflow

This PR introduces a big architectural change, as it merges the following workflows into one:

  • Build packages
  • Test new runtime and examples
  • Testing on DigitalOcean Droplets

This merge allows the droplet test to reuse the packages built in the previous steps, reducing the number of packages built from 13 to 3.

Advantages:

  • Fix package build failures caused by rate limiting on other services [1].
  • Reduce the chances of random errors.
  • Reduce the total CI time required.
  • Do not attempt to run the droplet test if the package building phase fails. (Previously all the package builds were launched in parallel, which meant they all failed at the same point unnecessarily.)
  • Make it less costly and faster to re-run the failed jobs/steps.

Mainly, it reduces the chances of CI failures due to external causes and makes the failing step more apparent.
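
As a rough illustration of the ordering described above (a toy model, not how GitHub Actions itself is implemented), the Python sketch below runs one build job and only runs the dependent test jobs when the build succeeded. The function name and the job names are hypothetical and do not come from the workflow files.

```python
from typing import Callable, Dict


def run_pipeline(
    build: Callable[[], bool],
    tests: Dict[str, Callable[[], bool]],
) -> Dict[str, str]:
    """Run the build job once, then run each test job only if the build passed."""
    results = {"build": "success" if build() else "failure"}
    for name, job in tests.items():
        if results["build"] != "success":
            # A failed build makes the dependent jobs pointless, so skip them.
            results[name] = "skipped"
        else:
            results[name] = "success" if job() else "failure"
    return results


# Hypothetical job names, for illustration only.
jobs = {"runtime_tests": lambda: True, "droplet_tests": lambda: True}
print(run_pipeline(lambda: False, jobs))
```

With a failing build, both test jobs are reported as skipped instead of each failing on its own package build.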

Digital Ocean Changes

It seems Digital Ocean changed the way Droplets are provisioned and the behavior of the command doctl compute get, which we use to retrieve the newly created droplet's IP.

The command sometimes returned without the droplet info or without the network information set, presumably before the setup was finished.
Solution: A workaround was added for this: we now wait and re-run the command until it returns the proper info.
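
The wait-and-retry workaround could look roughly like the Python sketch below. It is not the script from this PR: it assumes the full command is `doctl compute droplet get <id> --output json`, that the JSON output is a list of droplet objects with a `networks.v4` list, and the function names, attempt count and delay are all hypothetical.

```python
import json
import subprocess
import time
from typing import Callable, Optional


def fetch_droplet(droplet_id: str) -> Optional[dict]:
    """Ask doctl for one droplet; return None if nothing usable came back."""
    # Assumption: this is the full doctl command and it prints a JSON list.
    out = subprocess.run(
        ["doctl", "compute", "droplet", "get", droplet_id, "--output", "json"],
        capture_output=True, text=True, check=False,
    ).stdout
    try:
        droplets = json.loads(out)
    except json.JSONDecodeError:
        return None
    if isinstance(droplets, list) and droplets:
        return droplets[0]
    return None


def wait_for_network(
    fetch: Callable[[], Optional[dict]],
    attempts: int = 30,
    delay: float = 5.0,
    sleep: Callable[[float], None] = time.sleep,
) -> dict:
    """Call fetch() until the droplet reports at least one IPv4 network."""
    for _ in range(attempts):
        droplet = fetch()
        if droplet and (droplet.get("networks") or {}).get("v4"):
            return droplet
        sleep(delay)
    raise TimeoutError("droplet network information never became available")
```

A caller would pass something like `lambda: fetch_droplet(droplet_id)` as `fetch`; injecting `fetch` and `sleep` keeps the loop testable without a real droplet.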

The second issue is that we sometimes got another IP, the one on the internal network, which could not be reached from the GitHub runner.
As the IP was recalculated between some steps, this made some steps fail while others worked.
Solution 1: The IP is calculated only once and saved in a GitHub CI env var.
Solution 2: The code ensures the public one is used.
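
A minimal Python sketch of both solutions, under stated assumptions: the droplet JSON is assumed to carry a `networks.v4` list whose entries have `type` and `ip_address` fields, and the function names and the variable name `DROPLET_IPV4` are hypothetical. Appending `NAME=value` to the file referenced by `GITHUB_ENV` is the standard GitHub Actions way to expose a value to later steps of the same job.

```python
import os


def public_ipv4(droplet: dict) -> str:
    """Return the droplet's public IPv4, ignoring private-network addresses."""
    v4 = (droplet.get("networks") or {}).get("v4") or []
    for net in v4:
        # Entries can come in any order, so select by type, not by position.
        if net.get("type") == "public":
            return net["ip_address"]
    # Include what was received to make a parsing failure easier to debug.
    raise ValueError(f"no public IPv4 found in networks: {v4!r}")


def save_to_github_env(name: str, value: str) -> None:
    """Append NAME=value to the file GitHub Actions exposes as $GITHUB_ENV."""
    with open(os.environ["GITHUB_ENV"], "a", encoding="utf-8") as env_file:
        env_file.write(f"{name}={value}\n")


# Example droplet data using documentation-reserved addresses.
droplet = {
    "networks": {
        "v4": [
            {"ip_address": "10.0.0.5", "type": "private"},
            {"ip_address": "203.0.113.10", "type": "public"},
        ]
    }
}
ip = public_ipv4(droplet)
if "GITHUB_ENV" in os.environ:  # only set when running inside GitHub Actions
    save_to_github_env("DROPLET_IPV4", ip)
```

Computing the address once and reading it back from the environment in later steps avoids the case where two steps resolve two different addresses.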

Self proofreading checklist

  • The new code is clear, easy to read and well commented.
  • New code does not duplicate the functions of builtin or popular libraries.
  • An LLM was used to review the new code and look for simplifications.
  • New classes and functions contain docstrings explaining what they provide.
  • [n/a] All new code is covered by relevant tests.
  • Documentation has been updated regarding these changes.
  • [n/a] Dependency updates in the project.toml have been mirrored in the Debian package build script packaging/Makefile

How to test

It is not relevant to test manually locally.

Rebase your Pull Request on top of this branch or master once it is merged.

Print screen / video

Yay, a pretty arrow:
[image]

Notes

Do not squash this branch!

I reworked each commit to make them documented and self-explanatory. The GitHub workflow file format is quite unreadable, and the commit history helps.

[1] e.g.

curl -fsSL https://github.com/ipfs/kubo/releases/download/v0.33.2/kubo_v0.33.2_linux-amd64.tar.gz | tar -xz --directory ./target/kubo
curl: (22) The requested URL returned error: 403

That error happened randomly and was determined to be caused by rate limiting.

We also had previous failures with the Docker registry that prevented the use of the vm-connector Docker image.

Do not run export log if cancelled or not setup
Attempt to always do the proper clean up

Print more debug information in case the droplet ipv4 cannot be parsed
@olethanh olethanh force-pushed the ol-ci-build-pkg-once branch from dceeecb to 9b52c6a Compare March 26, 2025 14:14

codecov bot commented Mar 26, 2025

Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 63.33%. Comparing base (73013a2) to head (4396a91).
Report is 1 commits behind head on main.

Additional details and impacted files
@@           Coverage Diff           @@
##             main     #787   +/-   ##
=======================================
  Coverage   63.33%   63.33%           
=======================================
  Files          77       77           
  Lines        6854     6854           
  Branches      576      576           
=======================================
  Hits         4341     4341           
  Misses       2325     2325           
  Partials      188      188           

@olethanh olethanh force-pushed the ol-ci-build-pkg-once branch 3 times, most recently from b1de8a4 to f59ba26 Compare March 27, 2025 12:35
@olethanh olethanh force-pushed the ol-ci-build-pkg-once branch from 9f5701f to 37d6a3a Compare March 27, 2025 15:21
@olethanh olethanh changed the title Ol ci build pkg once CI: Fix random failures Mar 27, 2025
@olethanh olethanh force-pushed the ol-ci-build-pkg-once branch 2 times, most recently from 991a36f to a48ad50 Compare March 27, 2025 16:24
olethanh added 12 commits March 27, 2025 17:25
The deps were not always installed, depending on where the failure was
Since this is a debug step, we don't want it to fail

Do not run it if the workflow was cancelled
This allows reusing the packages built in the previous workflow
for the droplet test, reducing the number of packages built from 12 to 3 in these workflows.

This:
* Fixes the issue of packages not being able to be built because of rate
limiting on other resources.
* Reduces the chances of random errors.
* Reduces the total CI time required.
* Does not attempt to run the droplet test if the package building phase
  fails. (Previously all the package builds were launched in parallel,
which meant they all failed unnecessarily.)
* Makes it less costly and faster to re-run the failed jobs

With these changes the number of CI failures is greatly reduced,
and the cause of a failure is clearer.
It might be a change in the Digital Ocean API, but it returns
before the network is set up and with empty setup info.
It didn't seem to occur before, so it might be an API change on their
part.
Digital Ocean droplets always have a private IP in addition to the public
one.
The API returns them in random order, so the CI job occasionally tried to
use the internal one and failed.
Same operation as moving the Droplet workflow: we reuse the already
built packages.

The resilience and speed advantages are the same and add up.
Rename the main workflow field
Document more
In the run_on_droplet job we only require the .github/scripts dir
@olethanh olethanh force-pushed the ol-ci-build-pkg-once branch from a48ad50 to 4396a91 Compare March 27, 2025 16:25
@olethanh olethanh marked this pull request as ready for review March 27, 2025 16:44
@olethanh olethanh requested review from hoh and nesitor March 27, 2025 16:44
@olethanh olethanh changed the title CI: Fix random failures CI: Fix random failures. Merge workflows Mar 27, 2025
@nesitor nesitor merged commit 8a85ad6 into main Mar 31, 2025
22 checks passed
@olethanh olethanh mentioned this pull request Mar 31, 2025
7 tasks