Skip to content

Conversation

@evanphx
Copy link
Contributor

@evanphx evanphx commented Jan 9, 2026

Summary

  • Extract common container lifecycle management into components/base package
  • Refactor etcd, victorialogs, and victoriametrics to embed BaseComponent
  • Add auto-restart capability to victoriametrics (previously missing)
  • Eliminate ~700 lines of duplicated code across the three components

The BaseComponent provides shared functionality for:

  • Container and task lifecycle (Start/Stop)
  • Exit monitoring with auto-restart and exponential backoff
  • Readiness checking via TCP connectivity
  • Graceful shutdown with SIGTERM/SIGKILL escalation

Each component embeds BaseComponent and provides callbacks for component-specific behavior (task creation, ready port).

Test plan

  • All existing tests pass for etcd, victorialogs, and victoriametrics
  • Auto-restart tests pass for all three components
  • Linting passes

Extract common container lifecycle management into components/base package.
This eliminates ~700 lines of duplicated code across the three components
and adds auto-restart capability to victoriametrics (previously missing).

The BaseComponent provides shared functionality for:
- Container and task lifecycle (Start/Stop)
- Exit monitoring with auto-restart and exponential backoff
- Readiness checking via TCP connectivity
- Graceful shutdown with SIGTERM/SIGKILL escalation

Each component embeds BaseComponent and provides callbacks for
component-specific behavior (task creation, ready port).
@evanphx evanphx requested a review from a team as a code owner January 9, 2026 19:27
@coderabbitai
Copy link

coderabbitai bot commented Jan 9, 2026

📝 Walkthrough

Walkthrough

Adds a new base package implementing BaseComponent with restart and ready-check configs, lifecycle controls, task/container state accessors, exit monitoring, restart/backoff logic, and callbacks for task creation and ready-port retrieval. Refactors etcd, victorialogs, and victoriametrics to embed *base.BaseComponent, wire CreateTask/GetReadyPort callbacks, and replace manual mutexes, container/task management, readiness polling, and restart/cleanup logic with BaseComponent methods such as SetContainer, SetTask, WaitForReady, and StartExitMonitor.


Comment @coderabbitai help to get the list of available commands and usage tips.

Copy link

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 1

🤖 Fix all issues with AI agents
In @components/base/base.go:
- Around line 243-249: After sending SIGKILL with task.Kill, the code calls
task.Wait(killCtx) but does not block on the returned channel, so the process
may be deleted before it fully exits; change the logic in the kill/cleanup block
(references: task.Kill, task.Wait, task.Delete, b.ComponentName) to capture the
wait channel/result from task.Wait(killCtx) and block on it (e.g. receive from
the channel or select) to obtain the wait result/error, log any waitErr, and
only proceed to task.Delete after the wait completes; ensure you still handle
and log errors from task.Kill and from the wait result.
🧹 Nitpick comments (3)
components/base/base.go (2)

128-134: Consider documenting the side effect of SetTask.

SetTask implicitly sets running = true. This coupling is reasonable semantically (a task means running), but callers may not expect it. Consider adding a brief comment noting this side effect.

📝 Suggested documentation
 // SetTask sets the task reference and marks the component as running.
+// Note: This also sets running=true since having a task implies the component is running.
 func (b *BaseComponent) SetTask(task containerd.Task) {

283-286: Consider respecting context cancellation during retry sleep.

The retry loop uses time.Sleep which doesn't respect context cancellation. If the parent context is canceled, the retry will still wait the full delay.

♻️ Proposed improvement
 		if attempt < maxRetries {
-			time.Sleep(retryDelay)
+			select {
+			case <-ctx.Done():
+				b.Log.Warn(b.ComponentName+" container delete retry cancelled", "error", ctx.Err())
+				return
+			case <-time.After(retryDelay):
+			}
 		}
components/etcd/etcd.go (1)

54-56: Prefer time.Second for clarity.

Using 1e9 for nanoseconds works but is less readable than time.Second.

♻️ Suggested improvement
-	bc.ReadyConfig.DialTimeout = 1e9 // 1 second in nanoseconds
-	bc.ReadyConfig.Interval = 1e9    // 1 second in nanoseconds
+	bc.ReadyConfig.DialTimeout = time.Second
+	bc.ReadyConfig.Interval = time.Second
📜 Review details

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Disabled knowledge base sources:

  • Linear integration is disabled by default for public repositories

You can enable these sources in your CodeRabbit configuration.

📥 Commits

Reviewing files that changed from the base of the PR and between 227bb46 and a913c45.

📒 Files selected for processing (4)
  • components/base/base.go
  • components/etcd/etcd.go
  • components/victorialogs/victorialogs.go
  • components/victoriametrics/victoriametrics.go
🧰 Additional context used
📓 Path-based instructions (1)
**/*.go

📄 CodeRabbit inference engine (CLAUDE.md)

**/*.go: Always run make lint before committing - runs golangci-lint on the entire codebase
Run make lint-fix to automatically fix linting issues where possible
Only add comments when they provide valuable context or explain 'why' something is done - avoid redundant comments that restate what the code does
Function/method comments should explain the purpose and any important side effects, not just restate the name
Codebase follows standard Go formatting conventions

Files:

  • components/victoriametrics/victoriametrics.go
  • components/etcd/etcd.go
  • components/victorialogs/victorialogs.go
  • components/base/base.go
🧠 Learnings (6)
📓 Common learnings
Learnt from: evanphx
Repo: mirendev/runtime PR: 59
File: cli/commands/dev.go:80-117
Timestamp: 2025-05-22T21:54:17.159Z
Learning: In the context of etcd component integration, evanphx preferred to keep the original implementation approach for error handling in deferred cleanup functions and variable shadowing patterns.
Learnt from: phinze
Repo: mirendev/runtime PR: 173
File: cli/commands/commands_linux.go:29-31
Timestamp: 2025-08-12T20:24:39.993Z
Learning: In the MIR-309 CLI rename from runtime to miren, the containerd namespace should also be changed from "runtime" to "miren" to align with the overall rebranding effort, along with other component names like etcd and clickhouse containers.
📚 Learning: 2025-05-30T21:01:16.000Z
Learnt from: evanphx
Repo: mirendev/runtime PR: 73
File: components/clickhouse/integration_test.go:13-16
Timestamp: 2025-05-30T21:01:16.000Z
Learning: The containerd v2 client should be imported using the path "github.com/containerd/containerd/v2/client", not from the root module path. The "/client" suffix is required for the v2 client import.

Applied to files:

  • components/victoriametrics/victoriametrics.go
  • components/etcd/etcd.go
  • components/victorialogs/victorialogs.go
📚 Learning: 2025-05-30T21:01:16.000Z
Learnt from: evanphx
Repo: mirendev/runtime PR: 73
File: components/clickhouse/integration_test.go:13-16
Timestamp: 2025-05-30T21:01:16.000Z
Learning: The containerd v2 client import path is "github.com/containerd/containerd/v2/client" with the "/client" suffix required. This is the official import path for containerd 2.x client library, not the root module path.

Applied to files:

  • components/victoriametrics/victoriametrics.go
  • components/etcd/etcd.go
  • components/victorialogs/victorialogs.go
📚 Learning: 2026-01-07T23:46:12.875Z
Learnt from: CR
Repo: mirendev/runtime PR: 0
File: CLAUDE.md:0-0
Timestamp: 2026-01-07T23:46:12.875Z
Learning: Project uses iso for containerized development with all dependencies (Go 1.25, containerd, buildkit, etc.) - defined in `.iso/Dockerfile`

Applied to files:

  • components/victoriametrics/victoriametrics.go
  • components/etcd/etcd.go
  • components/victorialogs/victorialogs.go
📚 Learning: 2025-05-29T23:20:55.872Z
Learnt from: evanphx
Repo: mirendev/runtime PR: 73
File: components/clickhouse/clickhouse.go:259-299
Timestamp: 2025-05-29T23:20:55.872Z
Learning: In containerd task management, if task.Kill() fails, it usually indicates fundamental issues (task already dead, permission problems, or containerd connection issues) where subsequent operations like task.Wait() or task.Delete() would likely also fail. Early return after Kill failure is often the appropriate behavior rather than attempting further cleanup operations.

Applied to files:

  • components/victoriametrics/victoriametrics.go
  • components/etcd/etcd.go
  • components/victorialogs/victorialogs.go
📚 Learning: 2025-05-22T21:54:17.159Z
Learnt from: evanphx
Repo: mirendev/runtime PR: 59
File: cli/commands/dev.go:80-117
Timestamp: 2025-05-22T21:54:17.159Z
Learning: In the context of etcd component integration, evanphx preferred to keep the original implementation approach for error handling in deferred cleanup functions and variable shadowing patterns.

Applied to files:

  • components/etcd/etcd.go
🧬 Code graph analysis (3)
components/victoriametrics/victoriametrics.go (1)
components/base/base.go (2)
  • BaseComponent (62-89)
  • NewBaseComponent (92-102)
components/etcd/etcd.go (2)
components/base/base.go (2)
  • BaseComponent (62-89)
  • NewBaseComponent (92-102)
pkg/slogout/slogout.go (3)
  • WithLogger (296-307)
  • WithJSONParsing (45-49)
  • WithMaxLevel (59-64)
components/victorialogs/victorialogs.go (3)
components/base/base.go (2)
  • BaseComponent (62-89)
  • NewBaseComponent (92-102)
pkg/serverconfig/config.gen.go (1)
  • VictoriaLogsConfig (422-427)
discovery/network.go (1)
  • HTTPEndpoint (16-18)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (5)
  • GitHub Check: build-binaries (darwin, arm64, macos-latest)
  • GitHub Check: build-binaries (darwin, amd64, macos-latest)
  • GitHub Check: lint
  • GitHub Check: test
  • GitHub Check: test-e2e
🔇 Additional comments (28)
components/base/base.go (8)

1-18: LGTM - Package documentation and imports are well-structured.

Clear package-level documentation explaining the purpose of the base component framework.


20-52: LGTM - Well-documented configuration structs with sensible defaults.

The restart policy with exponential backoff and the readiness check configuration provide good flexibility for component-specific tuning.


54-102: LGTM - Clean separation of concerns with callback-based customization.

The dual-mutex design appropriately separates operation serialization (opMu) from state protection (stateMu).


157-220: LGTM - Careful lock management in stop sequence.

The pattern of capturing state under lock, releasing, then operating is correct. Clearing monitor channels before releasing the lock prevents races with StartExitMonitor.


301-313: LGTM - Clean container cleanup for restart scenarios.

Properly stops any existing task before deleting the container.


315-337: LGTM - Standard TCP readiness check with proper context handling.

The implementation correctly respects context cancellation and cleans up connections.


339-411: LGTM - Well-designed exit monitoring with restart loop.

The monitor correctly captures state under lock before async operations, handles restart failures gracefully, and obtains a fresh exit channel after successful restarts.


413-496: LGTM - Robust restart handling with exponential backoff.

The restart logic correctly:

  • Resets the counter after the reset window
  • Enforces max restart limits
  • Implements capped exponential backoff
  • Respects stop signals during backoff
  • Cleans up the old task before creating a new one
components/etcd/etcd.go (6)

44-50: LGTM - Clean struct embedding of BaseComponent.

The component correctly embeds the base and adds only component-specific fields.


69-76: LGTM - Appropriate logging configuration for etcd.

Using JSON parsing and clamping to Info level for etcd's structured logs is appropriate.


78-175: LGTM - Clean integration with base component lifecycle.

The Start method properly uses base component APIs for locking, state management, readiness checking, and exit monitoring.


177-189: LGTM - Endpoint methods correctly use base state checking.


191-259: LGTM - Comprehensive restart logic for existing containers.

The method handles all cases: running task, stopped task, and failed task, with proper integration into the base component lifecycle.


261-314: LGTM - Container creation logic preserved.

The container specification is unchanged; the method correctly accesses base component fields via embedding.

components/victoriametrics/victoriametrics.go (7)

36-41: LGTM - Clean struct embedding matching etcd pattern.


43-55: LGTM - Standard base component initialization.


57-63: LGTM - Appropriate logging configuration for VictoriaMetrics.


65-150: LGTM - Start method follows established pattern with proper cleanup.

The ordering of WaitForReady before SetTask ensures the component only reports as running after readiness is confirmed. Failure paths correctly clean up both task and container.


152-157: LGTM - Endpoint method correctly uses base state checking.


159-219: Verify: Different readiness failure handling vs etcd.

Unlike etcd (which logs a warning and continues if WaitForReady fails), victoriametrics returns an error. This makes the component stricter about readiness. Verify this behavioral difference is intentional.


221-260: LGTM - Container creation with appropriate VictoriaMetrics configuration.

components/victorialogs/victorialogs.go (7)

36-41: LGTM - Consistent structure with VictoriaMetrics component.


43-55: LGTM - Standard base component initialization.


57-63: LGTM - Consistent callback implementations.


65-150: LGTM - Start method consistent with VictoriaMetrics.


152-157: LGTM - Endpoint method matches VictoriaMetrics pattern.


159-219: LGTM - Restart logic consistent with VictoriaMetrics.

Same stricter readiness requirement as VictoriaMetrics (returns error on failure).


221-259: LGTM - Container creation with appropriate VictoriaLogs configuration.

The previous code called task.Wait() but didn't actually wait on the
returned channel, so it would return immediately. Now properly select
on the status channel with timeout handling.
Copy link

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 2

🤖 Fix all issues with AI agents
In @components/base/base.go:
- Around line 421-504: handleRestart may call b.CreateTask(ctx, container) with
a nil b.container; add a nil check for container (the local variable captured
from b.container at the top of handleRestart) before attempting to create the
new task and return a clear error if it's nil (e.g., "no container available for
restart"), so CreateTask is never invoked with a nil container; update any
logging to indicate the missing container and keep references to b.container,
handleRestart, and CreateTask to locate the change.
- Around line 347-379: StartExitMonitor can panic if b.task is nil before
calling task.Wait; add a nil check for b.task (or task after assignment)
immediately after capturing task and before calling task.Wait, and handle it by
logging a debug/error message referencing b.ComponentName, cleaning up/resetting
b.stopMonitor and b.monitorDone under b.stateMu (matching the existing cleanup
path), closing monitorDone, and returning without starting monitorExit; ensure
monitorExit is only invoked when task is non-nil.
📜 Review details

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Disabled knowledge base sources:

  • Linear integration is disabled by default for public repositories

You can enable these sources in your CodeRabbit configuration.

📥 Commits

Reviewing files that changed from the base of the PR and between a913c45 and 98474b5.

📒 Files selected for processing (1)
  • components/base/base.go
🧰 Additional context used
📓 Path-based instructions (1)
**/*.go

📄 CodeRabbit inference engine (CLAUDE.md)

**/*.go: Always run make lint before committing - runs golangci-lint on the entire codebase
Run make lint-fix to automatically fix linting issues where possible
Only add comments when they provide valuable context or explain 'why' something is done - avoid redundant comments that restate what the code does
Function/method comments should explain the purpose and any important side effects, not just restate the name
Codebase follows standard Go formatting conventions

Files:

  • components/base/base.go
🧠 Learnings (3)
📓 Common learnings
Learnt from: evanphx
Repo: mirendev/runtime PR: 59
File: cli/commands/dev.go:80-117
Timestamp: 2025-05-22T21:54:17.159Z
Learning: In the context of etcd component integration, evanphx preferred to keep the original implementation approach for error handling in deferred cleanup functions and variable shadowing patterns.
📚 Learning: 2025-05-29T23:20:55.872Z
Learnt from: evanphx
Repo: mirendev/runtime PR: 73
File: components/clickhouse/clickhouse.go:259-299
Timestamp: 2025-05-29T23:20:55.872Z
Learning: In containerd task management, if task.Kill() fails, it usually indicates fundamental issues (task already dead, permission problems, or containerd connection issues) where subsequent operations like task.Wait() or task.Delete() would likely also fail. Early return after Kill failure is often the appropriate behavior rather than attempting further cleanup operations.

Applied to files:

  • components/base/base.go
📚 Learning: 2025-09-11T22:31:18.478Z
Learnt from: phinze
Repo: mirendev/runtime PR: 216
File: cli/commands/upgrade_linux.go:0-0
Timestamp: 2025-09-11T22:31:18.478Z
Learning: In cli/commands/upgrade_linux.go, the timeout handler intentionally kills only the child process with cmd.Process.Kill() rather than the entire process group with syscall.Kill(-cmd.Process.Pid, syscall.SIGKILL). This is a deliberate design choice by the maintainers.

Applied to files:

  • components/base/base.go
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (5)
  • GitHub Check: build-binaries (darwin, amd64, macos-latest)
  • GitHub Check: lint
  • GitHub Check: build-binaries (darwin, arm64, macos-latest)
  • GitHub Check: test-e2e
  • GitHub Check: test
🔇 Additional comments (8)
components/base/base.go (8)

1-102: LGTM on package structure and configuration types.

The dual-mutex pattern (opMu for serializing major operations, stateMu for state access) is a sound design for managing concurrent access. Default policies are reasonable.


104-155: LGTM on state accessors.

Thread-safe state management with proper mutex usage. The coupling of SetTask with running=true is a sensible convenience for the typical use pattern.


157-220: LGTM on stop logic.

The careful sequencing of capturing monitor channels before releasing the lock, then signaling after, correctly avoids races with StartExitMonitor.


222-267: LGTM on graceful shutdown escalation.

The SIGTERM → wait → SIGKILL → wait pattern with proper timeout handling is well implemented. Per project learnings, the early return after Kill failure is the appropriate behavior since subsequent operations would likely also fail.


274-307: LGTM on container deletion retry logic.

The retry with explicit snapshot leak warning on final failure is good defensive programming for containerd cleanup scenarios.


323-345: LGTM on readiness check.

Clean TCP connectivity check with proper context cancellation handling and connection cleanup.


381-419: LGTM on exit monitoring loop.

The loop correctly handles the stop signal, exit events, and obtaining a new exit channel after restart. The nil task check after restart (line 408) properly guards against continuing when restart fails.


309-321: LGTM on cleanup helper.

CleanupExistingContainer properly handles both task and container cleanup with appropriate error handling for restart scenarios.

- StartExitMonitor: Check task is not nil before calling Wait
- handleRestart: Check container is not nil before calling CreateTask
Copy link

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 0

🧹 Nitpick comments (1)
components/base/base.go (1)

452-456: Consider guarding against exponential backoff overflow.

The calculation 1 << (restartCount-1) on line 453 could theoretically overflow if MaxRestarts is set to a very high value (e.g., 64). While the default MaxRestarts is 5 (making this safe) and BackoffMax caps the result, you could add a safeguard:

♻️ Proposed safeguard
 // Calculate backoff
-backoff := policy.BackoffBase * time.Duration(1<<(restartCount-1))
+shift := restartCount - 1
+if shift > 30 {
+	shift = 30
+}
+backoff := policy.BackoffBase * time.Duration(1<<shift)
 if backoff > policy.BackoffMax {
 	backoff = policy.BackoffMax
 }
📜 Review details

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Disabled knowledge base sources:

  • Linear integration is disabled by default for public repositories

You can enable these sources in your CodeRabbit configuration.

📥 Commits

Reviewing files that changed from the base of the PR and between 98474b5 and e7b78b2.

📒 Files selected for processing (1)
  • components/base/base.go
🧰 Additional context used
📓 Path-based instructions (1)
**/*.go

📄 CodeRabbit inference engine (CLAUDE.md)

**/*.go: Always run make lint before committing - runs golangci-lint on the entire codebase
Run make lint-fix to automatically fix linting issues where possible
Only add comments when they provide valuable context or explain 'why' something is done - avoid redundant comments that restate what the code does
Function/method comments should explain the purpose and any important side effects, not just restate the name
Codebase follows standard Go formatting conventions

Files:

  • components/base/base.go
🧠 Learnings (3)
📓 Common learnings
Learnt from: evanphx
Repo: mirendev/runtime PR: 59
File: cli/commands/dev.go:80-117
Timestamp: 2025-05-22T21:54:17.159Z
Learning: In the context of etcd component integration, evanphx preferred to keep the original implementation approach for error handling in deferred cleanup functions and variable shadowing patterns.
📚 Learning: 2025-05-29T23:20:55.872Z
Learnt from: evanphx
Repo: mirendev/runtime PR: 73
File: components/clickhouse/clickhouse.go:259-299
Timestamp: 2025-05-29T23:20:55.872Z
Learning: In containerd task management, if task.Kill() fails, it usually indicates fundamental issues (task already dead, permission problems, or containerd connection issues) where subsequent operations like task.Wait() or task.Delete() would likely also fail. Early return after Kill failure is often the appropriate behavior rather than attempting further cleanup operations.

Applied to files:

  • components/base/base.go
📚 Learning: 2025-09-11T22:31:18.478Z
Learnt from: phinze
Repo: mirendev/runtime PR: 216
File: cli/commands/upgrade_linux.go:0-0
Timestamp: 2025-09-11T22:31:18.478Z
Learning: In cli/commands/upgrade_linux.go, the timeout handler intentionally kills only the child process with cmd.Process.Kill() rather than the entire process group with syscall.Kill(-cmd.Process.Pid, syscall.SIGKILL). This is a deliberate design choice by the maintainers.

Applied to files:

  • components/base/base.go
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (4)
  • GitHub Check: build-binaries (darwin, arm64, macos-latest)
  • GitHub Check: lint
  • GitHub Check: test-e2e
  • GitHub Check: test
🔇 Additional comments (2)
components/base/base.go (2)

157-267: Well-designed shutdown and cleanup logic.

The concurrency handling is correct:

  • Clearing monitor channels under lock before closing them prevents races with new StartExitMonitor calls
  • Graceful shutdown escalation (SIGTERM → SIGKILL) with appropriate timeouts
  • Early return after SIGTERM failure aligns with best practices (subsequent operations would likely fail anyway)
  • Best-effort task deletion even after Wait errors is appropriate for cleanup

347-426: Exit monitoring logic is robust.

The nil checks for task (line 360) and after restart (line 415) prevent panics as mentioned in the PR objectives. The monitor lifecycle is correctly managed:

  • Idempotency check prevents duplicate monitors
  • Channels are captured under lock before release
  • Error paths properly clean up state
  • The restart loop correctly obtains new exit channels

- Add IfRunning() helper for thread-safe endpoint access, ensuring
  running state and port reads are atomic
- Add context cancellation check in restart backoff wait to respect
  shutdown requests during backoff delays
- Fix etcd ReadyConfig to use time.Duration constants instead of raw
  nanoseconds for consistency with the rest of the codebase
Copy link

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 0

🧹 Nitpick comments (4)
components/base/base.go (3)

128-134: SetTask has a subtle side effect that may cause confusion.

SetTask both sets the task and marks running = true. This couples two distinct operations and could lead to incorrect state if a caller sets a task without intending to mark the component as running.

Consider either:

  1. Documenting this behavior prominently, or
  2. Separating the concerns into two calls

Since this is a private-facing API within the component framework and the pattern is consistent across usage, this is acceptable but worth noting.


287-310: Consider respecting context cancellation during retry loop.

The retry loop uses time.Sleep which doesn't respect context cancellation. If the parent context is cancelled during cleanup, the function will continue retrying for up to ~6 seconds. This is likely acceptable for cleanup operations but worth noting.

Optional: Add context check
 	for attempt := 1; attempt <= maxRetries; attempt++ {
+		select {
+		case <-ctx.Done():
+			b.Log.Warn(b.ComponentName+" container deletion cancelled", "error", ctx.Err())
+			return
+		default:
+		}
+
 		deleteCtx, cancel := context.WithTimeout(ctx, 30*time.Second)

491-499: Task deletion error is silently ignored.

The task.Delete call on line 494 ignores any error. While this is often acceptable for cleanup (as noted in learnings about containerd), consider logging at debug level for troubleshooting.

Optional: Log deletion error at debug level
 	// Clean up old task
 	if task != nil {
 		deleteCtx, cancel := context.WithTimeout(ctx, 10*time.Second)
-		task.Delete(deleteCtx)
+		if _, err := task.Delete(deleteCtx); err != nil {
+			b.Log.Debug("failed to delete old task during restart", "error", err)
+		}
 		cancel()
components/victoriametrics/victoriametrics.go (1)

136-143: SetTask is called after WaitForReady, which differs from etcd.

In Start, SetTask is called at line 143 after WaitForReady (line 137). However, in etcd, SetTask is called at line 164 before WaitForReady (line 168). This inconsistency means:

  • In victoriametrics, IsRunning() returns false during readiness checks
  • The base component's GetReadyPort callback might behave unexpectedly

Consider aligning the order with etcd for consistency.

Align with etcd's ordering
+	c.SetTask(task)
+
 	// Wait for VictoriaMetrics to be ready
 	if err := c.WaitForReady(ctx, "localhost", config.HTTPPort); err != nil {
+		c.SetRunning(false)
 		task.Delete(ctx)
 		container.Delete(ctx, containerd.WithSnapshotCleanup)
 		return err
 	}
 
-	c.SetTask(task)
 	c.Log.Info("victoriametrics server started", "container_id", container.ID(), "http_port", config.HTTPPort)
📜 Review details

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Disabled knowledge base sources:

  • Linear integration is disabled by default for public repositories

You can enable these sources in your CodeRabbit configuration.

📥 Commits

Reviewing files that changed from the base of the PR and between e7b78b2 and 03ee7b1.

📒 Files selected for processing (4)
  • components/base/base.go
  • components/etcd/etcd.go
  • components/victorialogs/victorialogs.go
  • components/victoriametrics/victoriametrics.go
🧰 Additional context used
📓 Path-based instructions (1)
**/*.go

📄 CodeRabbit inference engine (CLAUDE.md)

**/*.go: Always run make lint before committing - runs golangci-lint on the entire codebase
Run make lint-fix to automatically fix linting issues where possible
Only add comments when they provide valuable context or explain 'why' something is done - avoid redundant comments that restate what the code does
Function/method comments should explain the purpose and any important side effects, not just restate the name
Codebase follows standard Go formatting conventions

Files:

  • components/victoriametrics/victoriametrics.go
  • components/etcd/etcd.go
  • components/victorialogs/victorialogs.go
  • components/base/base.go
🧠 Learnings (7)
📓 Common learnings
Learnt from: evanphx
Repo: mirendev/runtime PR: 59
File: cli/commands/dev.go:80-117
Timestamp: 2025-05-22T21:54:17.159Z
Learning: In the context of etcd component integration, evanphx preferred to keep the original implementation approach for error handling in deferred cleanup functions and variable shadowing patterns.
📚 Learning: 2025-05-30T21:01:16.000Z
Learnt from: evanphx
Repo: mirendev/runtime PR: 73
File: components/clickhouse/integration_test.go:13-16
Timestamp: 2025-05-30T21:01:16.000Z
Learning: The containerd v2 client should be imported using the path "github.com/containerd/containerd/v2/client", not from the root module path. The "/client" suffix is required for the v2 client import.

Applied to files:

  • components/victoriametrics/victoriametrics.go
  • components/etcd/etcd.go
  • components/victorialogs/victorialogs.go
📚 Learning: 2025-05-30T21:01:16.000Z
Learnt from: evanphx
Repo: mirendev/runtime PR: 73
File: components/clickhouse/integration_test.go:13-16
Timestamp: 2025-05-30T21:01:16.000Z
Learning: The containerd v2 client import path is "github.com/containerd/containerd/v2/client" with the "/client" suffix required. This is the official import path for containerd 2.x client library, not the root module path.

Applied to files:

  • components/victoriametrics/victoriametrics.go
  • components/etcd/etcd.go
  • components/victorialogs/victorialogs.go
📚 Learning: 2026-01-07T23:46:12.875Z
Learnt from: CR
Repo: mirendev/runtime PR: 0
File: CLAUDE.md:0-0
Timestamp: 2026-01-07T23:46:12.875Z
Learning: Project uses iso for containerized development with all dependencies (Go 1.25, containerd, buildkit, etc.) - defined in `.iso/Dockerfile`

Applied to files:

  • components/victoriametrics/victoriametrics.go
  • components/etcd/etcd.go
  • components/victorialogs/victorialogs.go
📚 Learning: 2025-05-29T23:20:55.872Z
Learnt from: evanphx
Repo: mirendev/runtime PR: 73
File: components/clickhouse/clickhouse.go:259-299
Timestamp: 2025-05-29T23:20:55.872Z
Learning: In containerd task management, if task.Kill() fails, it usually indicates fundamental issues (task already dead, permission problems, or containerd connection issues) where subsequent operations like task.Wait() or task.Delete() would likely also fail. Early return after Kill failure is often the appropriate behavior rather than attempting further cleanup operations.

Applied to files:

  • components/victoriametrics/victoriametrics.go
  • components/etcd/etcd.go
  • components/victorialogs/victorialogs.go
  • components/base/base.go
📚 Learning: 2025-05-22T21:54:17.159Z
Learnt from: evanphx
Repo: mirendev/runtime PR: 59
File: cli/commands/dev.go:80-117
Timestamp: 2025-05-22T21:54:17.159Z
Learning: In the context of etcd component integration, evanphx preferred to keep the original implementation approach for error handling in deferred cleanup functions and variable shadowing patterns.

Applied to files:

  • components/etcd/etcd.go
📚 Learning: 2025-09-11T22:31:18.478Z
Learnt from: phinze
Repo: mirendev/runtime PR: 216
File: cli/commands/upgrade_linux.go:0-0
Timestamp: 2025-09-11T22:31:18.478Z
Learning: In cli/commands/upgrade_linux.go, the timeout handler intentionally kills only the child process with cmd.Process.Kill() rather than the entire process group with syscall.Kill(-cmd.Process.Pid, syscall.SIGKILL). This is a deliberate design choice by the maintainers.

Applied to files:

  • components/base/base.go
🧬 Code graph analysis (2)
components/etcd/etcd.go (2)
components/base/base.go (2)
  • BaseComponent (62-89)
  • NewBaseComponent (92-102)
pkg/slogout/slogout.go (3)
  • WithLogger (296-307)
  • WithJSONParsing (45-49)
  • WithMaxLevel (59-64)
components/victorialogs/victorialogs.go (3)
components/base/base.go (2)
  • BaseComponent (62-89)
  • NewBaseComponent (92-102)
pkg/serverconfig/config.gen.go (1)
  • VictoriaLogsConfig (422-427)
discovery/network.go (1)
  • HTTPEndpoint (16-18)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (5)
  • GitHub Check: build-binaries (darwin, arm64, macos-latest)
  • GitHub Check: build-binaries (darwin, amd64, macos-latest)
  • GitHub Check: test-e2e
  • GitHub Check: test
  • GitHub Check: lint
🔇 Additional comments (22)
components/base/base.go (8)

20-52: Well-structured configuration with sensible defaults.

The RestartPolicy and ReadyCheckConfig provide good flexibility with reasonable defaults (5 max restarts, exponential backoff up to 60s, 30 readiness attempts).


61-102: Clean BaseComponent design with proper separation of concerns.

The dual-mutex approach (opMu for operations, stateMu for state) is appropriate for this use case. The constructor initializes defaults correctly.


179-233: Stop logic correctly handles synchronization with exit monitor.

The pattern of capturing channels before releasing the lock, then signaling and waiting, correctly prevents races with StartExitMonitor. The multiple lock/unlock cycles avoid holding locks during blocking operations like container deletion.


235-280: Graceful shutdown with proper escalation is well-implemented.

The SIGTERM → SIGKILL escalation with appropriate timeouts follows best practices. Using context.Background() for the SIGKILL path ensures cleanup can complete even if the parent context is cancelled. Based on learnings, the early return after Kill failure is appropriate since it indicates fundamental issues.


336-358: Readiness check implementation is solid.

Proper use of net.JoinHostPort for IPv6 compatibility, context cancellation check between attempts, and connection cleanup on success.


360-399: Exit monitor startup with proper idempotency and error handling.

The check for existing monitor prevents duplicate goroutines. The pattern of capturing channels before releasing the lock correctly handles races with stopInternal.


401-439: Exit monitoring loop correctly handles restarts and error conditions.

The loop properly gets a new exit channel after successful restart and has clean exit paths for all failure conditions.


520-526: Readiness failure after restart is logged but not treated as failure.

If WaitForReady fails after a restart, it only logs a warning and reports success. This means the restart is counted as successful even though the component may not be functioning. Consider whether this should trigger another restart attempt or be treated as a restart failure.

The current behavior may be acceptable since transient readiness failures can self-recover, but it's worth verifying this aligns with the intended behavior.

components/etcd/etcd.go (5)

45-68: Clean integration with BaseComponent.

The component properly embeds the base component and wires the callbacks. The custom ReadyConfig for etcd (1s timeouts vs default 2s) is appropriately specific to etcd's characteristics.


70-77: Callback implementations are straightforward.

The createTask uses structured JSON logging with level capping, and getReadyPort correctly returns the client port for readiness checks.


79-176: Start method properly delegates to base component.

The flow correctly:

  • Uses LockOp/UnlockOp for operation serialization
  • Checks IsRunning() for idempotency
  • Uses SetContainer/SetTask for state management
  • Delegates readiness and exit monitoring to base

178-188: Endpoint methods use IfRunning for thread-safe access.

Using IfRunning ensures the port is only accessed when the component is running, returning empty string otherwise. This is consistent with the PR objective of improving thread safety.


190-258: restartExistingContainer correctly uses base component methods.

The restart logic properly:

  • Sets container/task state via base methods
  • Uses WaitForReady for readiness checks
  • Starts exit monitoring via StartExitMonitor
  • Handles various restart scenarios (running, stopped, dead task)
components/victoriametrics/victoriametrics.go (4)

36-55: Clean integration matching etcd pattern.

The component correctly embeds BaseComponent and wires callbacks.


152-156: HTTPEndpoint uses IfRunning correctly.

Consistent with etcd's endpoint methods.


158-218: restartExistingContainer has same ordering issue as Start.

Lines 211 calls SetTask after WaitForReady (line 206), which is inconsistent with etcd. However, both victoriametrics and victorialogs follow this pattern, so this might be intentional for non-etcd components.

Verify if the ordering difference is intentional:

  • etcd: SetTask → WaitForReady
  • victoriametrics/victorialogs: WaitForReady → SetTask

57-63: VictoriaMetrics is a metrics storage system and does not output JSON logs like etcd does. JSON parsing options are specific to etcd's log format, not a general pattern—buildkit and victorialogs similarly don't use these options. No changes needed.

Likely an incorrect or invalid review comment.

components/victorialogs/victorialogs.go (5)

36-55: Clean integration matching the pattern established in other components.

Constructor properly creates BaseComponent and wires callbacks.


57-63: Consistent with victoriametrics createTask implementation.

No slogout options used, same as victoriametrics.


65-150: Start method follows victoriametrics pattern.

Same SetTask/WaitForReady ordering as victoriametrics. The refactor suggestion from victoriametrics applies here as well for consistency with etcd.


152-156: HTTPEndpoint correctly uses IfRunning.

Consistent implementation across all three components.


158-218: restartExistingContainer follows same pattern as victoriametrics.

The implementation is consistent with victoriametrics.

Copy link
Contributor

@teresamychu teresamychu left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should all 3 of these components have the same restart policy?

Right now all 3 are using DefaultRestartPolicy() but I think we would want the restart policy for etcd to be more aggressive than VictoriaLogs and VictoriaMetrics since etcd is more crucial? Happy to discuss! 😄

@evanphx
Copy link
Contributor Author

evanphx commented Jan 9, 2026

@teresamychu really good call out, let me adjust that.

- Change default restart policy to never give up (MaxRestarts=0)
- Add AggressiveRestartPolicy for critical components with shorter
  backoff times (500ms base, 10s max vs 2s base, 60s max)
- Apply aggressive restart policy to etcd since the system cannot
  run without it
Copy link

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 0

🧹 Nitpick comments (1)
components/base/base.go (1)

502-510: Log error if task deletion fails during cleanup.

Line 505 ignores the error from task.Delete(). While the old task should be in a stopped state and deletion should succeed, logging failures would help diagnose potential resource leaks.

📝 Log deletion errors
 // Clean up old task
 if task != nil {
   deleteCtx, cancel := context.WithTimeout(ctx, 10*time.Second)
-  task.Delete(deleteCtx)
+  if _, err := task.Delete(deleteCtx); err != nil {
+    b.Log.Warn("failed to delete old task during restart", "error", err)
+  }
   cancel()
   b.stateMu.Lock()
   b.task = nil
   b.stateMu.Unlock()
 }
📜 Review details

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Disabled knowledge base sources:

  • Linear integration is disabled by default for public repositories

You can enable these sources in your CodeRabbit configuration.

📥 Commits

Reviewing files that changed from the base of the PR and between 03ee7b1 and ed8d4e8.

📒 Files selected for processing (2)
  • components/base/base.go
  • components/etcd/etcd.go
🧰 Additional context used
📓 Path-based instructions (1)
**/*.go

📄 CodeRabbit inference engine (CLAUDE.md)

**/*.go: Always run make lint before committing - runs golangci-lint on the entire codebase
Run make lint-fix to automatically fix linting issues where possible
Only add comments when they provide valuable context or explain 'why' something is done - avoid redundant comments that restate what the code does
Function/method comments should explain the purpose and any important side effects, not just restate the name
Codebase follows standard Go formatting conventions

Files:

  • components/etcd/etcd.go
  • components/base/base.go
🧠 Learnings (7)
📓 Common learnings
Learnt from: evanphx
Repo: mirendev/runtime PR: 59
File: cli/commands/dev.go:80-117
Timestamp: 2025-05-22T21:54:17.159Z
Learning: In the context of etcd component integration, evanphx preferred to keep the original implementation approach for error handling in deferred cleanup functions and variable shadowing patterns.
Learnt from: phinze
Repo: mirendev/runtime PR: 173
File: cli/commands/commands_linux.go:29-31
Timestamp: 2025-08-12T20:24:39.993Z
Learning: In the MIR-309 CLI rename from runtime to miren, the containerd namespace should also be changed from "runtime" to "miren" to align with the overall rebranding effort, along with other component names like etcd and clickhouse containers.
📚 Learning: 2025-05-22T21:54:17.159Z
Learnt from: evanphx
Repo: mirendev/runtime PR: 59
File: cli/commands/dev.go:80-117
Timestamp: 2025-05-22T21:54:17.159Z
Learning: In the context of etcd component integration, evanphx preferred to keep the original implementation approach for error handling in deferred cleanup functions and variable shadowing patterns.

Applied to files:

  • components/etcd/etcd.go
📚 Learning: 2025-05-30T21:01:16.000Z
Learnt from: evanphx
Repo: mirendev/runtime PR: 73
File: components/clickhouse/integration_test.go:13-16
Timestamp: 2025-05-30T21:01:16.000Z
Learning: The containerd v2 client should be imported using the path "github.com/containerd/containerd/v2/client", not from the root module path. The "/client" suffix is required for the v2 client import.

Applied to files:

  • components/etcd/etcd.go
📚 Learning: 2025-05-30T21:01:16.000Z
Learnt from: evanphx
Repo: mirendev/runtime PR: 73
File: components/clickhouse/integration_test.go:13-16
Timestamp: 2025-05-30T21:01:16.000Z
Learning: The containerd v2 client import path is "github.com/containerd/containerd/v2/client" with the "/client" suffix required. This is the official import path for containerd 2.x client library, not the root module path.

Applied to files:

  • components/etcd/etcd.go
📚 Learning: 2026-01-07T23:46:12.875Z
Learnt from: CR
Repo: mirendev/runtime PR: 0
File: CLAUDE.md:0-0
Timestamp: 2026-01-07T23:46:12.875Z
Learning: Project uses iso for containerized development with all dependencies (Go 1.25, containerd, buildkit, etc.) - defined in `.iso/Dockerfile`

Applied to files:

  • components/etcd/etcd.go
📚 Learning: 2025-05-29T23:20:55.872Z
Learnt from: evanphx
Repo: mirendev/runtime PR: 73
File: components/clickhouse/clickhouse.go:259-299
Timestamp: 2025-05-29T23:20:55.872Z
Learning: In containerd task management, if task.Kill() fails, it usually indicates fundamental issues (task already dead, permission problems, or containerd connection issues) where subsequent operations like task.Wait() or task.Delete() would likely also fail. Early return after Kill failure is often the appropriate behavior rather than attempting further cleanup operations.

Applied to files:

  • components/etcd/etcd.go
  • components/base/base.go
📚 Learning: 2025-09-11T22:31:18.478Z
Learnt from: phinze
Repo: mirendev/runtime PR: 216
File: cli/commands/upgrade_linux.go:0-0
Timestamp: 2025-09-11T22:31:18.478Z
Learning: In cli/commands/upgrade_linux.go, the timeout handler intentionally kills only the child process with cmd.Process.Kill() rather than the entire process group with syscall.Kill(-cmd.Process.Pid, syscall.SIGKILL). This is a deliberate design choice by the maintainers.

Applied to files:

  • components/base/base.go
🧬 Code graph analysis (1)
components/etcd/etcd.go (2)
components/base/base.go (4)
  • BaseComponent (73-100)
  • NewBaseComponent (103-113)
  • RestartPolicy (21-26)
  • AggressiveRestartPolicy (40-47)
pkg/slogout/slogout.go (3)
  • WithLogger (296-307)
  • WithJSONParsing (45-49)
  • WithMaxLevel (59-64)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (5)
  • GitHub Check: test
  • GitHub Check: test-e2e
  • GitHub Check: build-binaries (darwin, arm64, macos-latest)
  • GitHub Check: lint
  • GitHub Check: build-binaries (darwin, amd64, macos-latest)
🔇 Additional comments (11)
components/base/base.go (6)

20-63: LGTM! Well-designed restart and readiness configuration.

The separation between default and aggressive restart policies is excellent. The aggressive policy's shorter backoff (500ms base vs 2s, 10s max vs 60s) is appropriate for critical infrastructure like etcd.


72-113: LGTM! Clean separation of concerns with dual mutex design.

The separation between opMu (for serializing Start/Stop operations) and stateMu (for protecting state fields) is excellent design that prevents lock contention and enables fine-grained synchronization.


181-244: LGTM! Correct shutdown sequence with proper lock ordering.

The pattern of capturing monitor channels under lock, then releasing the lock before closing them and waiting is correct—this avoids deadlock while preventing races with new StartExitMonitor calls. The cleanup order (monitor → task → container → running flag) is appropriate.


246-296: LGTM! Correct graceful shutdown with SIGTERM/SIGKILL escalation.

The early return after Kill() failure (Line 252-254) is appropriate. Based on learnings, if task.Kill() fails, it indicates fundamental issues where subsequent operations would likely also fail.


371-450: LGTM! Robust exit monitoring with proper cleanup.

The channel capture pattern (lines 392-394) before releasing the lock is crucial to prevent races with stopInternal. The defer close(monitorDone) ensures the channel is always closed, even when returning early from error paths.


298-369: LGTM! Robust container cleanup and readiness checking.

The retry logic in deleteContainerWithRetry (3 attempts with 2s delay) handles transient failures well. The TCP-based readiness check is a standard, reliable approach.

components/etcd/etcd.go (5)

45-70: LGTM! Clean refactoring with appropriate aggressive restart policy.

The aggressive restart policy (Line 56) is well-suited for etcd as a critical infrastructure component. The faster readiness checks (1s timeout and interval vs the default 2s) are appropriate for etcd's typically quick startup.


72-79: LGTM! Callbacks correctly implement component-specific behavior.

The createTask callback properly configures structured logging with JSON parsing for etcd's output, and getReadyPort correctly returns the client port for readiness checks.


81-178: LGTM! Start method correctly refactored to use BaseComponent.

The migration from direct field access and manual synchronization to BaseComponent methods (LockOp, IsRunning, SetContainer, SetTask, WaitForReady, StartExitMonitor) is clean and correct. The use of e.createTask at Line 153 properly invokes the callback.


180-190: LGTM! Endpoint methods use thread-safe IfRunning pattern.

Using IfRunning ensures these methods return empty strings when the component is stopped, preventing use of stale port values. Callers should handle the empty string case appropriately.


192-260: LGTM! Existing container restart logic correctly migrated.

All three restart paths (existing running task, existing stopped task, new task) consistently use SetTask, WaitForReady, and StartExitMonitor from the base component. The callback pattern is applied uniformly.

func DefaultRestartPolicy() RestartPolicy {
return RestartPolicy{
MaxRestarts: 5,
MaxRestarts: 0, // Never give up
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🔥

@evanphx evanphx merged commit 7f882de into main Jan 9, 2026
9 checks passed
@evanphx evanphx deleted the evan/mir-617-refactor-etcd-and-victorialogs-components branch January 9, 2026 23:48
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants