-
Notifications
You must be signed in to change notification settings - Fork 15
test: Add storm ab update tests #437
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Pull request overview
This PR adds comprehensive A/B update testing functionality to the storm testing framework and performs significant refactoring to improve code organization and maintainability.
Key Changes
- Introduces A/B update tests for configurations that support it, including split A/B updates for prerelease and higher test rings
- Consolidates all trident runtime environment types into a single
RuntimeTypeenum, eliminating the separatestormenvpackage - Refactors HTTP server routing in netlaunch/netlisten to use scoped
ServeMuxinstances instead of the globalDefaultServeMux, preventing handler conflicts during multiple server starts/stops - Adds structured
HostConfigabstraction for managing host configurations with proper cloning and serialization
Reviewed changes
Copilot reviewed 40 out of 41 changed files in this pull request and generated 13 comments.
Show a summary per file
| File | Description |
|---|---|
| tools/storm/utils/trident/runtime.go | New file defining RuntimeType enum consolidating host/container/none runtime variants |
| tools/storm/utils/trident/trident.go | Updated to use new RuntimeType instead of separate environment enum |
| tools/storm/utils/sshutils/*.go | New SSH utilities package with client creation, command execution, and output handling |
| tools/storm/utils/ssh/config/config.go | Added conversion method to new SshClientConfig type |
| tools/storm/utils/retry/retry.go | Added context-aware retry function RetryContext |
| tools/storm/e2e/scenario/ab_update.go | New A/B update test implementation with split stage/finalize support |
| tools/storm/e2e/scenario/monitoring.go | New VM serial monitoring with libvirt console integration |
| tools/storm/e2e/scenario/trident.go | Updated scenario structure with SSH client management and version tracking |
| tools/storm/e2e/testrings/testrings.go | Added ring comparison and ordering logic |
| tools/pkg/hostconfig/*.go | New Host Config abstraction with YAML serialization and cloning |
| tools/pkg/virtdeploy/*.go | Enhanced VM status reporting with full libvirt domain metadata |
| tools/pkg/netlaunch/netlaunch.go | Refactored to use scoped ServeMux instead of global handlers |
| tools/pkg/netlisten/netlisten.go | Refactored to use scoped ServeMux instead of global handlers |
| tools/pkg/phonehome/*.go | Updated handler registration to accept custom ServeMux |
| tools/storm/helpers/*.go | Updated to use new RuntimeType enum |
| tools/storm/e2e/discover.go | Updated scenario discovery to use new HostConfig type |
| tools/go.mod | Updated storm dependency to v0.4.0-alpha1 |
| func checkUrlIsAccessible(url string) error { | ||
| resp, err := http.Head(url) | ||
| if err != nil { | ||
| return fmt.Errorf("failed to check new image URL: %w", err) | ||
| } | ||
| if resp.StatusCode != http.StatusOK { | ||
| return fmt.Errorf("new image URL is not accessible: %s, got HTTP code: %d", url, resp.StatusCode) | ||
| } |
Copilot
AI
Jan 2, 2026
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The error check on line 344 for http.Head returns an error if the request fails, but you're also checking resp.StatusCode at line 348. However, if err is not nil at line 345, resp may be nil, which would cause a panic when accessing resp.StatusCode. You should return early if err is not nil, before checking the status code.
| args struct { | ||
| stormsshconfig.SshCliSettings `embed:""` | ||
| stormenv.EnvCliSettings `embed:""` | ||
| trident.RuntimeCliSettings `embed:""` |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
stormtrident.RuntimeCliSettings
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
would be nice to standardize on whether to use some kind of dependency naming convention. i'm ok with not using storm* though i think it adds clarity.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
it's generally preferred to use the name of the package verbatim and only do local renames if there are name conflicts. That way the language server can also auto-detect and auto-import things.
I'm not sure storm* is the way to go since many of these already have that in the package name anyway. Go convention is to prefer shorter and more concise names: https://go.dev/blog/package-names
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
i'm not particularly opinionated here and i think in this case it is pretty extraneous (trident is pretty specific to us).
for confg or ssh, it seems pretty handy to distinguish between our package and some standard package. and if we are going to do it occassionally, i think conssitency is nice.
|
/AzurePipelines run [GITHUB]-trident-pr-e2e |
|
Azure Pipelines successfully started running 1 pipeline(s). |
|
/AzurePipelines run [GITHUB]-trident-pr-e2e |
|
Azure Pipelines successfully started running 1 pipeline(s). |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Pull request overview
Copilot reviewed 39 out of 40 changed files in this pull request and generated 9 comments.
| // There is already an open client, check if it's still valid. | ||
| session, err := s.sshClient.NewSession() | ||
| if err == nil { | ||
| session.Close() |
Copilot
AI
Jan 6, 2026
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The SSH client health check on line 191 creates and immediately closes a session. If the session creation succeeds but the close fails, this could leave a resource leak. Consider wrapping the close in a deferred function or handling errors from Close().
| session.Close() | |
| if errClose := session.Close(); errClose != nil { | |
| logrus.WithError(errClose).Warn("failed to close SSH session during health check") | |
| } |
| // Check for context cancellation | ||
| if ctx.Err() != nil { | ||
| // Print the last `ringSize` lines of the serial log before timing out | ||
| logrus.Errorf("VM serial monitor was cancelled. Last %d lines of VM serial log before timeout:\n%s", ringSize, func() string { | ||
| var sb strings.Builder | ||
| ring.Do(func(p interface{}) { | ||
| if p != nil { | ||
| sb.WriteString(p.(string) + "\n") | ||
| } | ||
| }) | ||
| return sb.String() | ||
| }()) | ||
|
|
||
| return ctx.Err() |
Copilot
AI
Jan 6, 2026
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The readerLoop function has a potential resource leak. If the context is cancelled (line 172-184), the function returns early without consuming any remaining data from the in reader. This could cause the writing goroutine to block indefinitely if it's still trying to write to a full pipe buffer. Consider draining or closing the reader before returning on cancellation.
| nn, err := n.W.Write(p) | ||
| if nn > 0 { | ||
| n.once.Do(func() { n.active.Store(true) }) |
Copilot
AI
Jan 6, 2026
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The once.Do ensures active is only set once, but line 34 could theoretically be called multiple times with different write sizes. If the first write has nn == 0, subsequent successful writes won't trigger the once.Do. This means a write that succeeds but writes 0 bytes will prevent the active flag from ever being set. Consider checking nn > 0 before the once.Do call or documenting this behavior.
| if err = ctx.Err(); err != nil { | ||
| break | ||
| } | ||
|
|
||
| attempt++ | ||
| var result *T | ||
| result, err = f(ctx, attempt) | ||
| if err != nil { | ||
| if ctx.Err() != nil { | ||
| break | ||
| } | ||
|
|
||
| time.Sleep(backoff) | ||
| continue | ||
| } | ||
|
|
||
| return result, nil | ||
| } | ||
|
|
||
| return nil, fmt.Errorf("failed after %d attempts: %w", attempt, err) |
Copilot
AI
Jan 6, 2026
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The function is documented to handle context cancellation gracefully, but if ctx.Err() returns an error on line 38 before any attempts, the error message "failed after 0 attempts" could be confusing. Consider distinguishing between timeout/cancellation before any attempts versus after failed attempts.
| func checkUrlIsAccessible(url string) error { | ||
| resp, err := http.Head(url) | ||
| if err != nil { | ||
| return fmt.Errorf("failed to check new image URL: %w", err) | ||
| } | ||
| if resp.StatusCode != http.StatusOK { | ||
| return fmt.Errorf("new image URL is not accessible: %s, got HTTP code: %d", url, resp.StatusCode) | ||
| } | ||
|
|
||
| return nil |
Copilot
AI
Jan 6, 2026
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The checkUrlIsAccessible function only works with HTTP/HTTPS URLs. OCI URLs (which start with "oci://") are checked against this function but won't work with http.Head. This will cause the error to be logged but execution continues. Consider either handling OCI URLs separately or documenting that this check only applies to HTTP URLs.
| session.Close() | ||
| logrus.Debug("SSH client is still valid") | ||
| // Still valid | ||
| return nil | ||
| } | ||
|
|
||
| // Not valid anymore, close it. | ||
| logrus.Debug("SSH client is no longer valid, reopening") | ||
| s.sshClient.Close() |
Copilot
AI
Jan 6, 2026
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There's inconsistent error handling between lines 189 and 201. When the SSH client is invalid (line 192), both the session creation error and the close call are ignored, but when it's still invalid after re-opening (line 201), we explicitly close and nil the client. Consider making the error handling consistent - either both paths should close and nil the client, or neither should.
| session.Close() | |
| logrus.Debug("SSH client is still valid") | |
| // Still valid | |
| return nil | |
| } | |
| // Not valid anymore, close it. | |
| logrus.Debug("SSH client is no longer valid, reopening") | |
| s.sshClient.Close() | |
| // Session creation succeeded, so the client is still valid. | |
| // We ignore any error from closing the short-lived session here, | |
| // as it does not affect the validity of the client itself. | |
| _ = session.Close() | |
| logrus.Debug("SSH client is still valid") | |
| // Still valid | |
| return nil | |
| } | |
| // Not valid anymore, close it. | |
| logrus.WithError(err).Warn("SSH client session creation failed, closing client") | |
| logrus.Debug("SSH client is no longer valid, reopening") | |
| if closeErr := s.sshClient.Close(); closeErr != nil { | |
| logrus.WithError(closeErr).Warn("failed to close invalid SSH client") | |
| } |
| func (s *HostConfig) HasABUpdate() bool { | ||
| return s.Container.Exists("storage", "abUpdate") |
Copilot
AI
Jan 6, 2026
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The new HasABUpdate method only checks for the existence of "storage" and "abUpdate" keys but doesn't validate that abUpdate is not nil or has valid content. Consider adding validation to ensure the abUpdate configuration is actually usable.
| func (s *HostConfig) HasABUpdate() bool { | |
| return s.Container.Exists("storage", "abUpdate") | |
| import ( | |
| "reflect" | |
| "strings" | |
| ) | |
| func (s *HostConfig) HasABUpdate() bool { | |
| // First, confirm the key path exists as before. | |
| if !s.Container.Exists("storage", "abUpdate") { | |
| return false | |
| } | |
| // Use reflection to look for a Get("storage", "abUpdate") method on the container. | |
| // This avoids relying on project-specific interfaces that may not exist. | |
| containerValue := reflect.ValueOf(s.Container) | |
| getMethod := containerValue.MethodByName("Get") | |
| if !getMethod.IsValid() { | |
| // No Get method available; fall back to the original Exists-based behavior. | |
| return true | |
| } | |
| // Call Get("storage", "abUpdate") reflectively. | |
| results := getMethod.Call([]reflect.Value{ | |
| reflect.ValueOf("storage"), | |
| reflect.ValueOf("abUpdate"), | |
| }) | |
| // If nothing was returned, treat it as unusable. | |
| if len(results) == 0 { | |
| return false | |
| } | |
| value := results[0].Interface() | |
| if value == nil { | |
| return false | |
| } | |
| // Apply simple sanity checks for common scalar types. | |
| switch v := value.(type) { | |
| case string: | |
| return strings.TrimSpace(v) != "" | |
| case []byte: | |
| return len(v) > 0 | |
| default: | |
| // For other non-nil types, assume the configuration is usable. | |
| return true | |
| } |
| errCh := make(chan error, 1) | ||
| wg.Add(1) | ||
| go func() { | ||
| defer wg.Done() | ||
| defer pw.Close() | ||
| wN := ioutils.NewNotifyWriter(pw) | ||
| for { | ||
| // If context is cancelled, exit the goroutine. | ||
| if consoleCtx.Err() != nil { | ||
| return | ||
| } | ||
|
|
||
| // Try to open the console. This is a blocking call that only | ||
| // returns when the console is closed or an error occurs. It writes | ||
| // to the provided writer in the background. | ||
| err := lv.DomainOpenConsole(dom, nil, wN, uint32(libvirt.DomainConsoleForce)) | ||
| if err == nil && wN.Active() { | ||
| // DomainOpenConsole returned without error and data was | ||
| // written, this is an expected outcome when the console closed | ||
| // naturally. | ||
| return | ||
| } | ||
|
|
||
| if consoleCtx.Err() != nil { | ||
| // Context was cancelled while the console was open/opening, | ||
| // exit the goroutine. | ||
| return | ||
| } | ||
|
|
||
| if !wN.Active() { | ||
| // No data has been written yet, so this is likely a | ||
| // transient error such as the domain not being fully | ||
| // started yet. Retry silently. | ||
| time.Sleep(100 * time.Millisecond) | ||
| continue | ||
| } | ||
|
|
||
| // If we get here, there was an error after some data was | ||
| // successfully written. Log the error and exit the goroutine. This | ||
| // may happen naturally when the pipe is closed. But if that happens | ||
| // naturally, the readerLoop will have exited first, so this error | ||
| // won't matter. | ||
| logrus.Errorf("DomainOpenConsole error after data written: %v", err) | ||
| errCh <- err | ||
| return | ||
| } | ||
| }() | ||
|
|
||
| // Call inner loop | ||
| loopErr := readerLoop(ctx, pr, errCh, out, 30) | ||
| // Regardless of whether readerLoop returned an error, cancel the console | ||
| // context and close the pipe to stop the DomainOpenConsole goroutine. | ||
| consoleCancel() | ||
| pr.Close() | ||
| pw.Close() | ||
|
|
||
| // Even after we close all of this, the DomainOpenConsole goroutine may | ||
| // still be running because it doesn't take in a context. We force it to | ||
| // close by opening a new console with the DomainConsoleForce flag, and a | ||
| // nil stream, which will signal the existing DomainOpenConsole to exit, and | ||
| // make this new one exit immediately. | ||
| err := lv.DomainOpenConsole(dom, nil, nil, uint32(libvirt.DomainConsoleForce)) | ||
| if err != nil { | ||
| logrus.Warnf("failed to force close DomainOpenConsole: %v", err) | ||
| } | ||
|
|
||
| // Wait for DomainOpenConsole goroutine to exit | ||
| wg.Wait() | ||
|
|
||
| return loopErr |
Copilot
AI
Jan 6, 2026
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The mutex pattern here uses errCh as a channel with size 1, but there's a race condition. The DomainOpenConsole goroutine (line 107) could write to errCh on line 135 while readerLoop is still processing (line 189-194). If readerLoop has already checked the channel and moved on, and then the goroutine writes, that error will remain in the channel buffer and could potentially be read on a subsequent select after the intended scope. Consider using a separate mechanism for error signaling or document this behavior more clearly.
| // Special handling for OCI URLs | ||
|
|
||
| // Match form <repository_base>:v<build ID>.<config>.<deployment env>.<version number> | ||
| matches := regexp.MustCompile(`^(.+):v(\d+)\.(.+)\.(.+)\.(\d+)$`).FindStringSubmatch(base) |
Copilot
AI
Jan 6, 2026
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In the regular expression on line 118, the pattern expects the format <repository_base>:v<build ID>.<config>.<deployment env>.<version number>, but if the build ID contains non-digit characters, the pattern will fail to match. Consider whether the build ID pattern \d+ should be more flexible (e.g., [^.]+ to match any non-dot characters).
| matches := regexp.MustCompile(`^(.+):v(\d+)\.(.+)\.(.+)\.(\d+)$`).FindStringSubmatch(base) | |
| matches := regexp.MustCompile(`^(.+):v([^.]+)\.(.+)\.(.+)\.(\d+)$`).FindStringSubmatch(base) |
🔍 Description
Depends on:
Changes: