Skip to content

Conversation

@therealnb
Copy link
Contributor

Fixes #2270

Problem

When containers exit or are removed, clients (Cursor, Cline, VSCode) lose connection and tools stop working. The issue noted: "When a pod dies, we have to change the config (just touching the file doesn't work. we need to actually change the config)"

Solution

This PR implements automatic container exit handling with client config updates to force reconnection.

Key Changes

1. Automatic Container Exit Detection

  • Transport layer detects when containers exit or are removed
  • Clear logging distinguishes between crash vs intentional removal
  • Made TransparentProxy.Stop() idempotent to prevent panics

2. Smart Workload Existence Check

  • Verifies container actually exists in Docker runtime (not just status files)
  • Uses thv ls as source of truth for whether to restart or clean up

3. Automatic Restart with Exponential Backoff

  • Retries: 5s → 10s → 20s → 40s → 60s (capped)
  • Up to 10 attempts (~10 minutes total)
  • Only retries actual container exits, not other errors

4. Client Config Management (The Core Fix!)

  • Before restart: Removes server from ~/.cursor/mcp.json
  • After restart: Re-adds server to ~/.cursor/mcp.json
  • This forces clients to see the config change and reconnect with fresh sessions
  • Solves the "we have to change the config" requirement from the issue

5. Clean Removal Handling

  • If container removed (docker rm, thv delete): permanently removes from client config
  • Tool vanishes from client as expected

Behavior

Container exits (crash):

Exit detected → Remove from ~/.cursor/mcp.json → Wait with backoff → 
Restart container → Re-add to mcp.json → Client reconnects ✅

Container removed intentionally:

Removal detected → Remove from ~/.cursor/mcp.json → Exit gracefully →
Tool vanishes from client ✅

Files Changed

  • pkg/transport/http.go - Container exit handling for HTTP transport
  • pkg/transport/stdio.go - Container exit handling for stdio transport
  • pkg/transport/proxy/transparent/transparent_proxy.go - Idempotent Stop()
  • pkg/runner/runner.go - Workload existence check and client config cleanup
  • pkg/workloads/manager.go - Automatic restart with backoff + config refresh
  • docs/fixes/container-exit-handling.md - Documentation (new)

Testing

✅ Container kill → Automatic restart → Client reconnects
✅ Container remove → Clean config removal → Tool vanishes
✅ Multiple failures → Exponential backoff works
✅ No panics on double-close

Tested with Cursor on macOS.

Fixes #2270

When containers exit or are removed, clients (Cursor, Cline) lose connection
and the tool stops working. This PR implements automatic handling of container
exits with client config updates to force reconnection.

## Changes

### Transport Layer
- Updated handleContainerExit() in http.go and stdio.go to log clear warnings
- Made TransparentProxy.Stop() idempotent to prevent double-close panics
- Added logging to distinguish between container exit vs removal

### Runner Layer
- Check if workload exists in runtime (not just status files) to determine
  if container was removed or just exited
- If container removed: clean up client config and exit gracefully
- If container exited: signal restart needed to workload manager

### Workload Manager
- Automatic restart with exponential backoff (5s, 10s, 20s, 40s, 60s)
- Up to 10 retry attempts (~10 minutes)
- Remove from client config before restart, re-add after successful restart
- This forces clients to see the config change and reconnect with fresh sessions

## Behavior

**Container exits (crash)**:
- Detect exit → Remove from ~/.cursor/mcp.json → Wait with backoff →
  Restart container → Re-add to mcp.json → Client reconnects ✅

**Container removed (docker rm / thv delete)**:
- Detect removal → Remove from ~/.cursor/mcp.json → Exit gracefully →
  Tool vanishes from client ✅

## Testing
Tested with Cursor on macOS:
- Container kill: Automatic restart and reconnection works
- Container remove: Clean removal from client config
@therealnb therealnb requested a review from amirejaz November 18, 2025 11:38
@codecov
Copy link

codecov bot commented Nov 18, 2025

Codecov Report

❌ Patch coverage is 7.54717% with 98 lines in your changes missing coverage. Please review.
✅ Project coverage is 55.30%. Comparing base (083a69f) to head (ab3d5fa).

Files with missing lines Patch % Lines
pkg/runner/runner.go 0.00% 38 Missing ⚠️
pkg/workloads/manager.go 23.52% 23 Missing and 3 partials ⚠️
pkg/transport/http.go 0.00% 15 Missing ⚠️
pkg/transport/stdio.go 0.00% 15 Missing ⚠️
...g/transport/proxy/transparent/transparent_proxy.go 0.00% 4 Missing ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##             main    #2634      +/-   ##
==========================================
- Coverage   55.47%   55.30%   -0.18%     
==========================================
  Files         312      312              
  Lines       29714    29810      +96     
==========================================
+ Hits        16485    16486       +1     
- Misses      11789    11882      +93     
- Partials     1440     1442       +2     

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

🚀 New features to boost your workflow:
  • ❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.

@therealnb therealnb enabled auto-merge (squash) November 18, 2025 13:30
Adds unit tests for:
- HTTPTransport.ShouldRestart() - restart logic for container exits vs removals
- StdioTransport.ShouldRestart() - restart logic for container exits vs removals
- TransparentProxy.Stop() - idempotent stop functionality
- DefaultManager.RunWorkload() - container exit handling

These tests cover the new automatic container exit detection and restart
functionality, improving code coverage for the PR.
@therealnb
Copy link
Contributor Author

Fixes test failures caused by corrupted module cache in CI environment.
The 'go clean -modcache' step ensures a clean state before downloading
dependencies, preventing 'dial tcp: lookup proxy.golang.org' errors
during test compilation.

Related to #2634
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Clients lose their connection

2 participants