test: Add storm ab update tests #437

frhuelsz · 2026-01-02T20:52:09Z

🔍 Description

Depends on:

infra: Add VM serial monitoring to storm-e2e tests #429

Changes:

Adds AB update tests for configs that support it. Adds splits ab updates on test rings >= prerelease.
Refactors to unify all test code on a single trident runtime kind enum.
Refactor netlaunch/netlisten to use a scoped http routing namespace instead of the global one. (In general the global one makes sense, since apps generally will just expose one server, but we start and stop http servers exposing the same endpoints, so we need to do so in separate scopes.)
Add more domain metadata to virtdeploy-go's output so that we can directly know more about the VM.
Refactor SSH tools to decouple configuration from CLI, since storm code does not consume ssh params via CLI.
Added a Host Config abstraction struct.

…orm/e2e-3

tools/storm/utils/sshutils/client.go

Copilot

Pull request overview

This PR adds comprehensive A/B update testing functionality to the storm testing framework and performs significant refactoring to improve code organization and maintainability.

Key Changes

Introduces A/B update tests for configurations that support it, including split A/B updates for prerelease and higher test rings
Consolidates all trident runtime environment types into a single RuntimeType enum, eliminating the separate stormenv package
Refactors HTTP server routing in netlaunch/netlisten to use scoped ServeMux instances instead of the global DefaultServeMux, preventing handler conflicts during multiple server starts/stops
Adds structured HostConfig abstraction for managing host configurations with proper cloning and serialization

Reviewed changes

Copilot reviewed 40 out of 41 changed files in this pull request and generated 13 comments.

Show a summary per file

File	Description
tools/storm/utils/trident/runtime.go	New file defining `RuntimeType` enum consolidating host/container/none runtime variants
tools/storm/utils/trident/trident.go	Updated to use new `RuntimeType` instead of separate environment enum
tools/storm/utils/sshutils/*.go	New SSH utilities package with client creation, command execution, and output handling
tools/storm/utils/ssh/config/config.go	Added conversion method to new `SshClientConfig` type
tools/storm/utils/retry/retry.go	Added context-aware retry function `RetryContext`
tools/storm/e2e/scenario/ab_update.go	New A/B update test implementation with split stage/finalize support
tools/storm/e2e/scenario/monitoring.go	New VM serial monitoring with libvirt console integration
tools/storm/e2e/scenario/trident.go	Updated scenario structure with SSH client management and version tracking
tools/storm/e2e/testrings/testrings.go	Added ring comparison and ordering logic
tools/pkg/hostconfig/*.go	New Host Config abstraction with YAML serialization and cloning
tools/pkg/virtdeploy/*.go	Enhanced VM status reporting with full libvirt domain metadata
tools/pkg/netlaunch/netlaunch.go	Refactored to use scoped ServeMux instead of global handlers
tools/pkg/netlisten/netlisten.go	Refactored to use scoped ServeMux instead of global handlers
tools/pkg/phonehome/*.go	Updated handler registration to accept custom ServeMux
tools/storm/helpers/*.go	Updated to use new `RuntimeType` enum
tools/storm/e2e/discover.go	Updated scenario discovery to use new `HostConfig` type
tools/go.mod	Updated storm dependency to v0.4.0-alpha1

tools/storm/utils/trident/runtime.go

tools/storm/utils/sshutils/client.go

tools/storm/utils/ssh/client/client.go

tools/storm/e2e/scenario/install.go

tools/storm/e2e/scenario/ab_update.go

Copilot · 2026-01-02T21:51:28Z

tools/storm/e2e/scenario/ab_update.go

+func checkUrlIsAccessible(url string) error {
+	resp, err := http.Head(url)
+	if err != nil {
+		return fmt.Errorf("failed to check new image URL: %w", err)
+	}
+	if resp.StatusCode != http.StatusOK {
+		return fmt.Errorf("new image URL is not accessible: %s, got HTTP code: %d", url, resp.StatusCode)
+	}


The error check on line 344 for http.Head returns an error if the request fails, but you're also checking resp.StatusCode at line 348. However, if err is not nil at line 345, resp may be nil, which would cause a panic when accessing resp.StatusCode. You should return early if err is not nil, before checking the status code.

tools/storm/e2e/scenario/monitoring.go

tools/storm/e2e/testrings/testrings.go

tools/pkg/hostconfig/hostconfig.go

tools/pkg/netlaunch/netlaunch.go

tools/pkg/virtdeploy/resources.go

bfjelds · 2026-01-05T18:50:23Z

tools/storm/helpers/rebuild_raid.go

 	args struct {
 		stormsshconfig.SshCliSettings `embed:""`
-		stormenv.EnvCliSettings       `embed:""`
+		trident.RuntimeCliSettings    `embed:""`


stormtrident.RuntimeCliSettings

would be nice to standardize on whether to use some kind of dependency naming convention. i'm ok with not using storm* though i think it adds clarity.

it's generally preferred to use the name of the package verbatim and only do local renames if there are name conflicts. That way the language server can also auto-detect and auto-import things.

I'm not sure storm* is the way to go since many of these already have that in the package name anyway. Go convention is to prefer shorter and more concise names: https://go.dev/blog/package-names

i'm not particularly opinionated here and i think in this case it is pretty extraneous (trident is pretty specific to us).

for confg or ssh, it seems pretty handy to distinguish between our package and some standard package. and if we are going to do it occassionally, i think conssitency is nice.

frhuelsz · 2026-01-06T21:09:02Z

/AzurePipelines run [GITHUB]-trident-pr-e2e

azure-pipelines · 2026-01-06T21:09:25Z

Azure Pipelines successfully started running 1 pipeline(s).

frhuelsz · 2026-01-06T21:11:32Z

/AzurePipelines run [GITHUB]-trident-pr-e2e

azure-pipelines · 2026-01-06T21:11:43Z

Azure Pipelines successfully started running 1 pipeline(s).

Copilot

Pull request overview

Copilot reviewed 39 out of 40 changed files in this pull request and generated 9 comments.

Copilot · 2026-01-06T21:14:31Z

tools/storm/e2e/scenario/trident.go

+		// There is already an open client, check if it's still valid.
+		session, err := s.sshClient.NewSession()
+		if err == nil {
+			session.Close()


The SSH client health check on line 191 creates and immediately closes a session. If the session creation succeeds but the close fails, this could leave a resource leak. Consider wrapping the close in a deferred function or handling errors from Close().

Suggested change

session.Close()

if errClose := session.Close(); errClose != nil {

logrus.WithError(errClose).Warn("failed to close SSH session during health check")

}

Copilot · 2026-01-06T21:14:32Z

tools/storm/e2e/scenario/monitoring.go

+		// Check for context cancellation
+		if ctx.Err() != nil {
+			// Print the last `ringSize` lines of the serial log before timing out
+			logrus.Errorf("VM serial monitor was cancelled. Last %d lines of VM serial log before timeout:\n%s", ringSize, func() string {
+				var sb strings.Builder
+				ring.Do(func(p interface{}) {
+					if p != nil {
+						sb.WriteString(p.(string) + "\n")
+					}
+				})
+				return sb.String()
+			}())
+
+			return ctx.Err()


The readerLoop function has a potential resource leak. If the context is cancelled (line 172-184), the function returns early without consuming any remaining data from the in reader. This could cause the writing goroutine to block indefinitely if it's still trying to write to a full pipe buffer. Consider draining or closing the reader before returning on cancellation.

Copilot · 2026-01-06T21:14:32Z

tools/storm/utils/io/notify_writer.go

+	nn, err := n.W.Write(p)
+	if nn > 0 {
+		n.once.Do(func() { n.active.Store(true) })


The once.Do ensures active is only set once, but line 34 could theoretically be called multiple times with different write sizes. If the first write has nn == 0, subsequent successful writes won't trigger the once.Do. This means a write that succeeds but writes 0 bytes will prevent the active flag from ever being set. Consider checking nn > 0 before the once.Do call or documenting this behavior.

Copilot · 2026-01-06T21:14:33Z

tools/storm/utils/retry/retry.go

+		if err = ctx.Err(); err != nil {
+			break
+		}
+
+		attempt++
+		var result *T
+		result, err = f(ctx, attempt)
+		if err != nil {
+			if ctx.Err() != nil {
+				break
+			}
+
+			time.Sleep(backoff)
+			continue
+		}
+
+		return result, nil
+	}
+
+	return nil, fmt.Errorf("failed after %d attempts: %w", attempt, err)


The function is documented to handle context cancellation gracefully, but if ctx.Err() returns an error on line 38 before any attempts, the error message "failed after 0 attempts" could be confusing. Consider distinguishing between timeout/cancellation before any attempts versus after failed attempts.

Copilot · 2026-01-06T21:14:33Z

tools/storm/e2e/scenario/ab_update.go

+func checkUrlIsAccessible(url string) error {
+	resp, err := http.Head(url)
+	if err != nil {
+		return fmt.Errorf("failed to check new image URL: %w", err)
+	}
+	if resp.StatusCode != http.StatusOK {
+		return fmt.Errorf("new image URL is not accessible: %s, got HTTP code: %d", url, resp.StatusCode)
+	}
+
+	return nil


The checkUrlIsAccessible function only works with HTTP/HTTPS URLs. OCI URLs (which start with "oci://") are checked against this function but won't work with http.Head. This will cause the error to be logged but execution continues. Consider either handling OCI URLs separately or documenting that this check only applies to HTTP URLs.

Copilot · 2026-01-06T21:14:33Z

tools/storm/e2e/scenario/trident.go

+			session.Close()
+			logrus.Debug("SSH client is still valid")
+			// Still valid
+			return nil
+		}
+
+		// Not valid anymore, close it.
+		logrus.Debug("SSH client is no longer valid, reopening")
+		s.sshClient.Close()


There's inconsistent error handling between lines 189 and 201. When the SSH client is invalid (line 192), both the session creation error and the close call are ignored, but when it's still invalid after re-opening (line 201), we explicitly close and nil the client. Consider making the error handling consistent - either both paths should close and nil the client, or neither should.

Suggested change

session.Close()

logrus.Debug("SSH client is still valid")

// Still valid

return nil

}

// Not valid anymore, close it.

logrus.Debug("SSH client is no longer valid, reopening")

s.sshClient.Close()

// Session creation succeeded, so the client is still valid.

// We ignore any error from closing the short-lived session here,

// as it does not affect the validity of the client itself.

_ = session.Close()

logrus.Debug("SSH client is still valid")

// Still valid

return nil

}

// Not valid anymore, close it.

logrus.WithError(err).Warn("SSH client session creation failed, closing client")

logrus.Debug("SSH client is no longer valid, reopening")

if closeErr := s.sshClient.Close(); closeErr != nil {

logrus.WithError(closeErr).Warn("failed to close invalid SSH client")

}

Copilot · 2026-01-06T21:14:33Z

tools/pkg/hostconfig/getters.go

+func (s *HostConfig) HasABUpdate() bool {
+	return s.Container.Exists("storage", "abUpdate")


The new HasABUpdate method only checks for the existence of "storage" and "abUpdate" keys but doesn't validate that abUpdate is not nil or has valid content. Consider adding validation to ensure the abUpdate configuration is actually usable.

Suggested change

func (s *HostConfig) HasABUpdate() bool {

return s.Container.Exists("storage", "abUpdate")

import (

"reflect"

"strings"

)

func (s *HostConfig) HasABUpdate() bool {

// First, confirm the key path exists as before.

if !s.Container.Exists("storage", "abUpdate") {

return false

}

// Use reflection to look for a Get("storage", "abUpdate") method on the container.

// This avoids relying on project-specific interfaces that may not exist.

containerValue := reflect.ValueOf(s.Container)

getMethod := containerValue.MethodByName("Get")

if !getMethod.IsValid() {

// No Get method available; fall back to the original Exists-based behavior.

return true

}

// Call Get("storage", "abUpdate") reflectively.

results := getMethod.Call([]reflect.Value{

reflect.ValueOf("storage"),

reflect.ValueOf("abUpdate"),

})

// If nothing was returned, treat it as unusable.

if len(results) == 0 {

return false

}

value := results[0].Interface()

if value == nil {

return false

}

// Apply simple sanity checks for common scalar types.

switch v := value.(type) {

case string:

return strings.TrimSpace(v) != ""

case []byte:

return len(v) > 0

default:

// For other non-nil types, assume the configuration is usable.

return true

}

Copilot · 2026-01-06T21:14:34Z

tools/storm/e2e/scenario/monitoring.go

+	errCh := make(chan error, 1)
+	wg.Add(1)
+	go func() {
+		defer wg.Done()
+		defer pw.Close()
+		wN := ioutils.NewNotifyWriter(pw)
+		for {
+			// If context is cancelled, exit the goroutine.
+			if consoleCtx.Err() != nil {
+				return
+			}
+
+			// Try to open the console. This is a blocking call that only
+			// returns when the console is closed or an error occurs. It writes
+			// to the provided writer in the background.
+			err := lv.DomainOpenConsole(dom, nil, wN, uint32(libvirt.DomainConsoleForce))
+			if err == nil && wN.Active() {
+				// DomainOpenConsole returned without error and data was
+				// written, this is an expected outcome when the console closed
+				// naturally.
+				return
+			}
+
+			if consoleCtx.Err() != nil {
+				// Context was cancelled while the console was open/opening,
+				// exit the goroutine.
+				return
+			}
+
+			if !wN.Active() {
+				// No data has been written yet, so this is likely a
+				// transient error such as the domain not being fully
+				// started yet. Retry silently.
+				time.Sleep(100 * time.Millisecond)
+				continue
+			}
+
+			// If we get here, there was an error after some data was
+			// successfully written. Log the error and exit the goroutine. This
+			// may happen naturally when the pipe is closed. But if that happens
+			// naturally, the readerLoop will have exited first, so this error
+			// won't matter.
+			logrus.Errorf("DomainOpenConsole error after data written: %v", err)
+			errCh <- err
+			return
+		}
+	}()
+
+	// Call inner loop
+	loopErr := readerLoop(ctx, pr, errCh, out, 30)
+	// Regardless of whether readerLoop returned an error, cancel the console
+	// context and close the pipe to stop the DomainOpenConsole goroutine.
+	consoleCancel()
+	pr.Close()
+	pw.Close()
+
+	// Even after we close all of this, the DomainOpenConsole goroutine may
+	// still be running because it doesn't take in a context. We force it to
+	// close by opening a new console with the DomainConsoleForce flag, and a
+	// nil stream, which will signal the existing DomainOpenConsole to exit, and
+	// make this new one exit immediately.
+	err := lv.DomainOpenConsole(dom, nil, nil, uint32(libvirt.DomainConsoleForce))
+	if err != nil {
+		logrus.Warnf("failed to force close DomainOpenConsole: %v", err)
+	}
+
+	// Wait for DomainOpenConsole goroutine to exit
+	wg.Wait()
+
+	return loopErr


The mutex pattern here uses errCh as a channel with size 1, but there's a race condition. The DomainOpenConsole goroutine (line 107) could write to errCh on line 135 while readerLoop is still processing (line 189-194). If readerLoop has already checked the channel and moved on, and then the goroutine writes, that error will remain in the channel buffer and could potentially be read on a subsequent select after the intended scope. Consider using a separate mechanism for error signaling or document this behavior more clearly.

Copilot · 2026-01-06T21:14:34Z

tools/storm/e2e/scenario/ab_update.go

+		// Special handling for OCI URLs
+
+		// Match form <repository_base>:v<build ID>.<config>.<deployment env>.<version number>
+		matches := regexp.MustCompile(`^(.+):v(\d+)\.(.+)\.(.+)\.(\d+)$`).FindStringSubmatch(base)


In the regular expression on line 118, the pattern expects the format <repository_base>:v<build ID>.<config>.<deployment env>.<version number>, but if the build ID contains non-digit characters, the pattern will fail to match. Consider whether the build ID pattern \d+ should be more flexible (e.g., [^.]+ to match any non-dot characters).

Suggested change

matches := regexp.MustCompile(`^(.+):v(\d+)\.(.+)\.(.+)\.(\d+)$`).FindStringSubmatch(base)

matches := regexp.MustCompile(`^(.+):v([^.]+)\.(.+)\.(.+)\.(\d+)$`).FindStringSubmatch(base)

frhuelsz added 30 commits December 15, 2025 12:03

wait for login progress

f9281ee

Merge branch 'main' into user/frhuelsz/storm/e2e-2

8c1579b

retrieve serial

75da58c

remove all ansi from log

b7e00dc

Improve artifact management

59442d1

Merge branch 'main' into user/frhuelsz/storm/e2e-2

8fa9296

serial updates

8b23b74

serial console

6402941

Fully functional console monitoring

107f0a0

undo update

8b38db0

Update pipeline invocation with output

5475842

More reliability

8e28210

Use storm v0.4.0-alpha1

e5f1ac9

cleanup

1e4be22

pr comments

01a900f

pr comments

775ac0a

Refactor to use runtime type everywhere

b61883c

Check ssh

ed996fa

refactor netlisten

bad81cd

Fix import order

f8f27c4

Update config construction

99556b5

Cleanup change from other branch

bd19b57

refactor netlisten

c98ed7e

Merge branch 'user/frhuelsz/netlisten-refactor' into user/frhuelsz/st…

077d0a0

…orm/e2e-3

Address copilot feedback

62608b2

more copilot comments

47022ed

pr comments

a81f67f

Merge branch 'user/frhuelsz/storm/e2e-2' into user/frhuelsz/storm/e2e-3

54db467

Updates

abfc0df

PR comments

81cf17e

frhuelsz added 5 commits December 31, 2025 12:12

AB-update

7f4ee67

Better force exit console

2d7996e

Merge branch 'user/frhuelsz/storm/e2e-2' into user/frhuelsz/storm/e2e-3

3da66b7

everything

3b27bee

Merge branch 'main' into user/frhuelsz/storm/e2e-3

742ba91

github-advanced-security bot found potential problems Jan 2, 2026

View reviewed changes

tools/storm/utils/sshutils/client.go Dismissed Show dismissed Hide dismissed

frhuelsz marked this pull request as ready for review January 2, 2026 21:47

frhuelsz requested a review from a team as a code owner January 2, 2026 21:47

Copilot AI review requested due to automatic review settings January 2, 2026 21:47

Copilot started reviewing on behalf of frhuelsz January 2, 2026 21:48 View session

Copilot AI reviewed Jan 2, 2026

View reviewed changes

pr comments

a0a457d

bfjelds reviewed Jan 5, 2026

View reviewed changes

tools/pkg/hostconfig/hostconfig.go Show resolved Hide resolved

cleanup

22c2ada

bfjelds reviewed Jan 5, 2026

View reviewed changes

tools/pkg/netlaunch/netlaunch.go Show resolved Hide resolved

pr comments

5b5dc46

bfjelds reviewed Jan 5, 2026

View reviewed changes

tools/pkg/virtdeploy/resources.go Outdated Show resolved Hide resolved

pr comments

66001c7

bfjelds reviewed Jan 5, 2026

View reviewed changes

frhuelsz added 4 commits January 6, 2026 12:59

pr comments

305fc8c

Merge branch 'user/frhuelsz/storm/e2e-2' into user/frhuelsz/storm/e2e-3

4202b1d

pr comments

07fa9bb

PR comments :)

e878682

Copilot AI review requested due to automatic review settings January 6, 2026 21:08

Copilot started reviewing on behalf of frhuelsz January 6, 2026 21:09 View session

Copilot AI reviewed Jan 6, 2026

View reviewed changes

-			session.Close()
+			if errClose := session.Close(); errClose != nil {
+				logrus.WithError(errClose).Warn("failed to close SSH session during health check")
+			}

		func (s *HostConfig) HasABUpdate() bool {
		return s.Container.Exists("storage", "abUpdate")

-func (s *HostConfig) HasABUpdate() bool {
-	return s.Container.Exists("storage", "abUpdate")
+import (
+	"reflect"
+	"strings"
+)
+func (s *HostConfig) HasABUpdate() bool {
+	// First, confirm the key path exists as before.
+	if !s.Container.Exists("storage", "abUpdate") {
+		return false
+	}
+	// Use reflection to look for a Get("storage", "abUpdate") method on the container.
+	// This avoids relying on project-specific interfaces that may not exist.
+	containerValue := reflect.ValueOf(s.Container)
+	getMethod := containerValue.MethodByName("Get")
+	if !getMethod.IsValid() {
+		// No Get method available; fall back to the original Exists-based behavior.
+		return true
+	}
+	// Call Get("storage", "abUpdate") reflectively.
+	results := getMethod.Call([]reflect.Value{
+		reflect.ValueOf("storage"),
+		reflect.ValueOf("abUpdate"),
+	})
+	// If nothing was returned, treat it as unusable.
+	if len(results) == 0 {
+		return false
+	}
+	value := results[0].Interface()
+	if value == nil {
+		return false
+	}
+	// Apply simple sanity checks for common scalar types.
+	switch v := value.(type) {
+	case string:
+		return strings.TrimSpace(v) != ""
+	case []byte:
+		return len(v) > 0
+	default:
+		// For other non-nil types, assume the configuration is usable.
+		return true
+	}

	matches := regexp.MustCompile(`^(.+):v(\d+)\.(.+)\.(.+)\.(\d+)$`).FindStringSubmatch(base)
	matches := regexp.MustCompile(`^(.+):v([^.]+)\.(.+)\.(.+)\.(\d+)$`).FindStringSubmatch(base)

test: Add storm ab update tests #437

Are you sure you want to change the base?

test: Add storm ab update tests #437

Uh oh!

Conversation

frhuelsz commented Jan 2, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

🔍 Description

Uh oh!

Uh oh!

Copilot AI left a comment

Choose a reason for hiding this comment

Pull request overview

Key Changes

Reviewed changes

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Copilot AI Jan 2, 2026

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

bfjelds Jan 5, 2026

Choose a reason for hiding this comment

Uh oh!

bfjelds Jan 5, 2026

Choose a reason for hiding this comment

Uh oh!

frhuelsz Jan 5, 2026

Choose a reason for hiding this comment

Uh oh!

bfjelds Jan 6, 2026

Choose a reason for hiding this comment

Uh oh!

frhuelsz commented Jan 6, 2026

Uh oh!

azure-pipelines bot commented Jan 6, 2026

Uh oh!

frhuelsz commented Jan 6, 2026

Uh oh!

azure-pipelines bot commented Jan 6, 2026

Uh oh!

Copilot AI left a comment

Choose a reason for hiding this comment

Pull request overview

Uh oh!

Copilot AI Jan 6, 2026

Choose a reason for hiding this comment

Uh oh!

Copilot AI Jan 6, 2026

Choose a reason for hiding this comment

Uh oh!

Copilot AI Jan 6, 2026

Choose a reason for hiding this comment

Uh oh!

Copilot AI Jan 6, 2026

Choose a reason for hiding this comment

Uh oh!

Copilot AI Jan 6, 2026

Choose a reason for hiding this comment

Uh oh!

Copilot AI Jan 6, 2026

Choose a reason for hiding this comment

Uh oh!

Copilot AI Jan 6, 2026

Choose a reason for hiding this comment

Uh oh!

Copilot AI Jan 6, 2026

Choose a reason for hiding this comment

Uh oh!

Copilot AI Jan 6, 2026

frhuelsz commented Jan 2, 2026 •

edited

Loading