Skip to content

Conversation

@frhuelsz
Copy link
Contributor

@frhuelsz frhuelsz commented Jan 2, 2026

🔍 Description

Depends on:

Changes:

  • Adds AB update tests for configs that support it. Adds splits ab updates on test rings >= prerelease.
    image
  • Refactors to unify all test code on a single trident runtime kind enum.
  • Refactor netlaunch/netlisten to use a scoped http routing namespace instead of the global one. (In general the global one makes sense, since apps generally will just expose one server, but we start and stop http servers exposing the same endpoints, so we need to do so in separate scopes.)
  • Add more domain metadata to virtdeploy-go's output so that we can directly know more about the VM.
  • Refactor SSH tools to decouple configuration from CLI, since storm code does not consume ssh params via CLI.
  • Added a Host Config abstraction struct.

@frhuelsz frhuelsz marked this pull request as ready for review January 2, 2026 21:47
@frhuelsz frhuelsz requested a review from a team as a code owner January 2, 2026 21:47
Copilot AI review requested due to automatic review settings January 2, 2026 21:47
Copy link
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

This PR adds comprehensive A/B update testing functionality to the storm testing framework and performs significant refactoring to improve code organization and maintainability.

Key Changes

  • Introduces A/B update tests for configurations that support it, including split A/B updates for prerelease and higher test rings
  • Consolidates all trident runtime environment types into a single RuntimeType enum, eliminating the separate stormenv package
  • Refactors HTTP server routing in netlaunch/netlisten to use scoped ServeMux instances instead of the global DefaultServeMux, preventing handler conflicts during multiple server starts/stops
  • Adds structured HostConfig abstraction for managing host configurations with proper cloning and serialization

Reviewed changes

Copilot reviewed 40 out of 41 changed files in this pull request and generated 13 comments.

Show a summary per file
File Description
tools/storm/utils/trident/runtime.go New file defining RuntimeType enum consolidating host/container/none runtime variants
tools/storm/utils/trident/trident.go Updated to use new RuntimeType instead of separate environment enum
tools/storm/utils/sshutils/*.go New SSH utilities package with client creation, command execution, and output handling
tools/storm/utils/ssh/config/config.go Added conversion method to new SshClientConfig type
tools/storm/utils/retry/retry.go Added context-aware retry function RetryContext
tools/storm/e2e/scenario/ab_update.go New A/B update test implementation with split stage/finalize support
tools/storm/e2e/scenario/monitoring.go New VM serial monitoring with libvirt console integration
tools/storm/e2e/scenario/trident.go Updated scenario structure with SSH client management and version tracking
tools/storm/e2e/testrings/testrings.go Added ring comparison and ordering logic
tools/pkg/hostconfig/*.go New Host Config abstraction with YAML serialization and cloning
tools/pkg/virtdeploy/*.go Enhanced VM status reporting with full libvirt domain metadata
tools/pkg/netlaunch/netlaunch.go Refactored to use scoped ServeMux instead of global handlers
tools/pkg/netlisten/netlisten.go Refactored to use scoped ServeMux instead of global handlers
tools/pkg/phonehome/*.go Updated handler registration to accept custom ServeMux
tools/storm/helpers/*.go Updated to use new RuntimeType enum
tools/storm/e2e/discover.go Updated scenario discovery to use new HostConfig type
tools/go.mod Updated storm dependency to v0.4.0-alpha1

Comment on lines +343 to +350
func checkUrlIsAccessible(url string) error {
resp, err := http.Head(url)
if err != nil {
return fmt.Errorf("failed to check new image URL: %w", err)
}
if resp.StatusCode != http.StatusOK {
return fmt.Errorf("new image URL is not accessible: %s, got HTTP code: %d", url, resp.StatusCode)
}
Copy link

Copilot AI Jan 2, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The error check on line 344 for http.Head returns an error if the request fails, but you're also checking resp.StatusCode at line 348. However, if err is not nil at line 345, resp may be nil, which would cause a panic when accessing resp.StatusCode. You should return early if err is not nil, before checking the status code.

Copilot uses AI. Check for mistakes.
args struct {
stormsshconfig.SshCliSettings `embed:""`
stormenv.EnvCliSettings `embed:""`
trident.RuntimeCliSettings `embed:""`
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

stormtrident.RuntimeCliSettings

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

would be nice to standardize on whether to use some kind of dependency naming convention. i'm ok with not using storm* though i think it adds clarity.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

it's generally preferred to use the name of the package verbatim and only do local renames if there are name conflicts. That way the language server can also auto-detect and auto-import things.

I'm not sure storm* is the way to go since many of these already have that in the package name anyway. Go convention is to prefer shorter and more concise names: https://go.dev/blog/package-names

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

i'm not particularly opinionated here and i think in this case it is pretty extraneous (trident is pretty specific to us).

for confg or ssh, it seems pretty handy to distinguish between our package and some standard package. and if we are going to do it occassionally, i think conssitency is nice.

Copilot AI review requested due to automatic review settings January 6, 2026 21:08
@frhuelsz
Copy link
Contributor Author

frhuelsz commented Jan 6, 2026

/AzurePipelines run [GITHUB]-trident-pr-e2e

@azure-pipelines
Copy link

Azure Pipelines successfully started running 1 pipeline(s).

@frhuelsz
Copy link
Contributor Author

frhuelsz commented Jan 6, 2026

/AzurePipelines run [GITHUB]-trident-pr-e2e

@azure-pipelines
Copy link

Azure Pipelines successfully started running 1 pipeline(s).

Copy link
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 39 out of 40 changed files in this pull request and generated 9 comments.

// There is already an open client, check if it's still valid.
session, err := s.sshClient.NewSession()
if err == nil {
session.Close()
Copy link

Copilot AI Jan 6, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The SSH client health check on line 191 creates and immediately closes a session. If the session creation succeeds but the close fails, this could leave a resource leak. Consider wrapping the close in a deferred function or handling errors from Close().

Suggested change
session.Close()
if errClose := session.Close(); errClose != nil {
logrus.WithError(errClose).Warn("failed to close SSH session during health check")
}

Copilot uses AI. Check for mistakes.
Comment on lines +171 to +184
// Check for context cancellation
if ctx.Err() != nil {
// Print the last `ringSize` lines of the serial log before timing out
logrus.Errorf("VM serial monitor was cancelled. Last %d lines of VM serial log before timeout:\n%s", ringSize, func() string {
var sb strings.Builder
ring.Do(func(p interface{}) {
if p != nil {
sb.WriteString(p.(string) + "\n")
}
})
return sb.String()
}())

return ctx.Err()
Copy link

Copilot AI Jan 6, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The readerLoop function has a potential resource leak. If the context is cancelled (line 172-184), the function returns early without consuming any remaining data from the in reader. This could cause the writing goroutine to block indefinitely if it's still trying to write to a full pipe buffer. Consider draining or closing the reader before returning on cancellation.

Copilot uses AI. Check for mistakes.
Comment on lines +32 to +34
nn, err := n.W.Write(p)
if nn > 0 {
n.once.Do(func() { n.active.Store(true) })
Copy link

Copilot AI Jan 6, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The once.Do ensures active is only set once, but line 34 could theoretically be called multiple times with different write sizes. If the first write has nn == 0, subsequent successful writes won't trigger the once.Do. This means a write that succeeds but writes 0 bytes will prevent the active flag from ever being set. Consider checking nn > 0 before the once.Do call or documenting this behavior.

Copilot uses AI. Check for mistakes.
Comment on lines +38 to +57
if err = ctx.Err(); err != nil {
break
}

attempt++
var result *T
result, err = f(ctx, attempt)
if err != nil {
if ctx.Err() != nil {
break
}

time.Sleep(backoff)
continue
}

return result, nil
}

return nil, fmt.Errorf("failed after %d attempts: %w", attempt, err)
Copy link

Copilot AI Jan 6, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The function is documented to handle context cancellation gracefully, but if ctx.Err() returns an error on line 38 before any attempts, the error message "failed after 0 attempts" could be confusing. Consider distinguishing between timeout/cancellation before any attempts versus after failed attempts.

Copilot uses AI. Check for mistakes.
Comment on lines +343 to +352
func checkUrlIsAccessible(url string) error {
resp, err := http.Head(url)
if err != nil {
return fmt.Errorf("failed to check new image URL: %w", err)
}
if resp.StatusCode != http.StatusOK {
return fmt.Errorf("new image URL is not accessible: %s, got HTTP code: %d", url, resp.StatusCode)
}

return nil
Copy link

Copilot AI Jan 6, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The checkUrlIsAccessible function only works with HTTP/HTTPS URLs. OCI URLs (which start with "oci://") are checked against this function but won't work with http.Head. This will cause the error to be logged but execution continues. Consider either handling OCI URLs separately or documenting that this check only applies to HTTP URLs.

Copilot uses AI. Check for mistakes.
Comment on lines +193 to +201
session.Close()
logrus.Debug("SSH client is still valid")
// Still valid
return nil
}

// Not valid anymore, close it.
logrus.Debug("SSH client is no longer valid, reopening")
s.sshClient.Close()
Copy link

Copilot AI Jan 6, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There's inconsistent error handling between lines 189 and 201. When the SSH client is invalid (line 192), both the session creation error and the close call are ignored, but when it's still invalid after re-opening (line 201), we explicitly close and nil the client. Consider making the error handling consistent - either both paths should close and nil the client, or neither should.

Suggested change
session.Close()
logrus.Debug("SSH client is still valid")
// Still valid
return nil
}
// Not valid anymore, close it.
logrus.Debug("SSH client is no longer valid, reopening")
s.sshClient.Close()
// Session creation succeeded, so the client is still valid.
// We ignore any error from closing the short-lived session here,
// as it does not affect the validity of the client itself.
_ = session.Close()
logrus.Debug("SSH client is still valid")
// Still valid
return nil
}
// Not valid anymore, close it.
logrus.WithError(err).Warn("SSH client session creation failed, closing client")
logrus.Debug("SSH client is no longer valid, reopening")
if closeErr := s.sshClient.Close(); closeErr != nil {
logrus.WithError(closeErr).Warn("failed to close invalid SSH client")
}

Copilot uses AI. Check for mistakes.
Comment on lines +3 to +4
func (s *HostConfig) HasABUpdate() bool {
return s.Container.Exists("storage", "abUpdate")
Copy link

Copilot AI Jan 6, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The new HasABUpdate method only checks for the existence of "storage" and "abUpdate" keys but doesn't validate that abUpdate is not nil or has valid content. Consider adding validation to ensure the abUpdate configuration is actually usable.

Suggested change
func (s *HostConfig) HasABUpdate() bool {
return s.Container.Exists("storage", "abUpdate")
import (
"reflect"
"strings"
)
func (s *HostConfig) HasABUpdate() bool {
// First, confirm the key path exists as before.
if !s.Container.Exists("storage", "abUpdate") {
return false
}
// Use reflection to look for a Get("storage", "abUpdate") method on the container.
// This avoids relying on project-specific interfaces that may not exist.
containerValue := reflect.ValueOf(s.Container)
getMethod := containerValue.MethodByName("Get")
if !getMethod.IsValid() {
// No Get method available; fall back to the original Exists-based behavior.
return true
}
// Call Get("storage", "abUpdate") reflectively.
results := getMethod.Call([]reflect.Value{
reflect.ValueOf("storage"),
reflect.ValueOf("abUpdate"),
})
// If nothing was returned, treat it as unusable.
if len(results) == 0 {
return false
}
value := results[0].Interface()
if value == nil {
return false
}
// Apply simple sanity checks for common scalar types.
switch v := value.(type) {
case string:
return strings.TrimSpace(v) != ""
case []byte:
return len(v) > 0
default:
// For other non-nil types, assume the configuration is usable.
return true
}

Copilot uses AI. Check for mistakes.
Comment on lines +92 to +161
errCh := make(chan error, 1)
wg.Add(1)
go func() {
defer wg.Done()
defer pw.Close()
wN := ioutils.NewNotifyWriter(pw)
for {
// If context is cancelled, exit the goroutine.
if consoleCtx.Err() != nil {
return
}

// Try to open the console. This is a blocking call that only
// returns when the console is closed or an error occurs. It writes
// to the provided writer in the background.
err := lv.DomainOpenConsole(dom, nil, wN, uint32(libvirt.DomainConsoleForce))
if err == nil && wN.Active() {
// DomainOpenConsole returned without error and data was
// written, this is an expected outcome when the console closed
// naturally.
return
}

if consoleCtx.Err() != nil {
// Context was cancelled while the console was open/opening,
// exit the goroutine.
return
}

if !wN.Active() {
// No data has been written yet, so this is likely a
// transient error such as the domain not being fully
// started yet. Retry silently.
time.Sleep(100 * time.Millisecond)
continue
}

// If we get here, there was an error after some data was
// successfully written. Log the error and exit the goroutine. This
// may happen naturally when the pipe is closed. But if that happens
// naturally, the readerLoop will have exited first, so this error
// won't matter.
logrus.Errorf("DomainOpenConsole error after data written: %v", err)
errCh <- err
return
}
}()

// Call inner loop
loopErr := readerLoop(ctx, pr, errCh, out, 30)
// Regardless of whether readerLoop returned an error, cancel the console
// context and close the pipe to stop the DomainOpenConsole goroutine.
consoleCancel()
pr.Close()
pw.Close()

// Even after we close all of this, the DomainOpenConsole goroutine may
// still be running because it doesn't take in a context. We force it to
// close by opening a new console with the DomainConsoleForce flag, and a
// nil stream, which will signal the existing DomainOpenConsole to exit, and
// make this new one exit immediately.
err := lv.DomainOpenConsole(dom, nil, nil, uint32(libvirt.DomainConsoleForce))
if err != nil {
logrus.Warnf("failed to force close DomainOpenConsole: %v", err)
}

// Wait for DomainOpenConsole goroutine to exit
wg.Wait()

return loopErr
Copy link

Copilot AI Jan 6, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The mutex pattern here uses errCh as a channel with size 1, but there's a race condition. The DomainOpenConsole goroutine (line 107) could write to errCh on line 135 while readerLoop is still processing (line 189-194). If readerLoop has already checked the channel and moved on, and then the goroutine writes, that error will remain in the channel buffer and could potentially be read on a subsequent select after the intended scope. Consider using a separate mechanism for error signaling or document this behavior more clearly.

Copilot uses AI. Check for mistakes.
// Special handling for OCI URLs

// Match form <repository_base>:v<build ID>.<config>.<deployment env>.<version number>
matches := regexp.MustCompile(`^(.+):v(\d+)\.(.+)\.(.+)\.(\d+)$`).FindStringSubmatch(base)
Copy link

Copilot AI Jan 6, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In the regular expression on line 118, the pattern expects the format <repository_base>:v<build ID>.<config>.<deployment env>.<version number>, but if the build ID contains non-digit characters, the pattern will fail to match. Consider whether the build ID pattern \d+ should be more flexible (e.g., [^.]+ to match any non-dot characters).

Suggested change
matches := regexp.MustCompile(`^(.+):v(\d+)\.(.+)\.(.+)\.(\d+)$`).FindStringSubmatch(base)
matches := regexp.MustCompile(`^(.+):v([^.]+)\.(.+)\.(.+)\.(\d+)$`).FindStringSubmatch(base)

Copilot uses AI. Check for mistakes.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants